Data Scientist Intern
Cline Center for Advanced Social Research
May 2024 - May 2025
Key Achievements
- Built ETL pipelines to create datasets for downstream machine learning tasks
- Improved existing classification models and set up retraining & evaluation pipelines
- Created multiple Power BI dashboards presenting project results to users worldwide
- Built an agentic AI pipeline to automate data collection
Problem Statements
- Text data needed to be classified with a low false-positive rate for downstream business tasks
- New, scalable ETL pipelines were needed to support multiple machine learning tasks
- The data warehouse contained duplicate text data that needed to be cleaned out
Technical Implementation
- Debugged and refactored source code, and wrote unit tests for existing modules
- Updated project documentation: code comments, README files, project wikis, and runbooks
- Deployed Azure resources (Databricks, Data Factory, Linux Virtual Machines)
- Built an agentic AI pipeline using Ollama LLM agents to automate data extraction
- Engineered an HDBSCAN-based clustering solution to flag duplicate data, reducing duplicates by 88%
- Designed and implemented a two-stage cascading binary classifier to achieve a low false-positive rate
- Created interactive Power BI dashboards with scheduled refreshes used by end-users worldwide
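The duplicate-flagging step listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes cluster labels have already been produced by an HDBSCAN run over text embeddings (HDBSCAN implementations such as scikit-learn's mark noise points with the label -1), and it assumes a simple policy of keeping the first row of each cluster and flagging the rest. The function name `flag_duplicates` and the keep-first policy are hypothetical.

```python
from typing import List, Sequence

NOISE_LABEL = -1  # label HDBSCAN assigns to points that belong to no cluster


def flag_duplicates(labels: Sequence[int]) -> List[bool]:
    """Return one flag per row: True if the row is treated as a duplicate.

    Assumed policy (hypothetical): every cluster is one group of
    near-duplicates. The first row seen in a cluster is kept, later rows in
    the same cluster are flagged, and noise rows are always kept.
    """
    seen_clusters = set()
    flags = []
    for label in labels:
        if label == NOISE_LABEL:
            flags.append(False)      # unique text, keep it
        elif label in seen_clusters:
            flags.append(True)       # another member of a known cluster
        else:
            seen_clusters.add(label)
            flags.append(False)      # first representative of the cluster
    return flags


# Example labels as they might come back from a clustering run.
example_labels = [0, 0, -1, 1, 0, 1, -1]
example_flags = flag_duplicates(example_labels)
```

Keeping noise rows errs on the side of retaining data: a row is only dropped when the clustering placed it in a group with at least one other row.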
Two-Stage Cascading Binary Classifier Architecture
         +-------------------+
         |  Raw Text Input   |
         +-------------------+
                   |
                   v
+-------------------------------------+
| Stage 1: High-Recall Classifier(s)  |
+-------------------------------------+
                   |
                   |  P(positive) > threshold₁?
                   |  (e.g. 0.20 to ensure high recall)
                   v
  +-------------+     +------------------+
  | Discard/    |     | Pass to Stage 2  |
  | Flag        |     | (Positives)      |
  | (Negatives) |     | (Candidates)     |
  +-------------+     +------------------+
                               |
                               v
+----------------------------------------------+
| Stage 2: High-Precision Refiner              |
| - BERT-based classifier                      |
| (trained on Stage 1 positives + sampled negs)|
| (optimized for precision)                    |
+----------------------------------------------+
                       |
                       v
             +-------------------+
             |  Final Positives  |
             +-------------------+
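The control flow in the diagram can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the deployed implementation: both stages are represented as plain callables returning P(positive), the toy scorers stand in for the real Stage 1 model(s) and the BERT-based Stage 2 model, and the Stage 2 cutoff of 0.90 is a hypothetical value (only the 0.20 Stage 1 example comes from the diagram).

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]  # maps a text to P(positive) in [0, 1]


def cascade_predict(
    texts: Sequence[str],
    stage1: Scorer,
    stage2: Scorer,
    threshold1: float = 0.20,  # low cutoff so Stage 1 favors recall
    threshold2: float = 0.90,  # hypothetical high cutoff so Stage 2 favors precision
) -> List[bool]:
    """Return True only for texts that pass both stages."""
    results = []
    for text in texts:
        if stage1(text) <= threshold1:
            results.append(False)  # discarded by Stage 1; Stage 2 is never run
            continue
        results.append(stage2(text) > threshold2)
    return results


# Toy stand-ins for the two models (purely illustrative).
def toy_stage1(text: str) -> float:
    return 0.8 if "urgent" in text else 0.1


def toy_stage2(text: str) -> float:
    return 0.95 if "outage" in text else 0.4


sample_texts = ["urgent outage report", "urgent? maybe", "weekly newsletter"]
sample_predictions = cascade_predict(sample_texts, toy_stage1, toy_stage2)
```

Because Stage 2 only scores Stage 1 candidates, the more expensive model runs on a fraction of the input, and a text is labeled positive only when both stages agree.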
Technologies Used
Python
Azure Databricks
Machine Learning
ETL Pipelines
Power BI
HDBSCAN
Clustering
NLP