Data Scientist Intern
Cline Center for Advanced Social Research
May 2024 - May 2025
Key Achievements
- Built ETL pipelines to create datasets for downstream machine learning tasks
- Improved existing classification models and set up retraining and evaluation pipelines
- Created multiple Power BI dashboards that present project results to users worldwide
- Built an agentic AI pipeline to automate data collection
 
Problem Statements
- Text data needed to be classified with a low false positive rate for downstream business tasks
- New, scalable ETL pipelines needed to be built to support multiple machine learning tasks
- The data warehouse contained duplicate text data that needed to be cleaned
 
Technical Implementation
- Debugged and refactored source code and wrote unit tests for existing modules
- Updated project documentation: code comments, README files, project wikis, and runbooks
- Deployed Azure resources (Databricks, Data Factory, Linux Virtual Machines)
- Built an agentic AI pipeline using Ollama LLM agents to automate data extraction
- Engineered an HDBSCAN-based clustering solution to flag duplicate data, reducing duplicate data by 88%
- Designed and implemented a two-stage cascading binary classifier to achieve a low false positive rate
- Created interactive Power BI dashboards with scheduled refreshes, used by end users worldwide
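
The duplicate-flagging step can be illustrated with a minimal sketch. It assumes the clustering has already been run and produced one integer label per record, with -1 marking noise points (the convention HDBSCAN implementations use). The function name `flag_duplicates` and the rule "keep the first record of each cluster" are illustrative assumptions, not the project's actual code, and the sample labels are made up.

```python
from typing import List, Sequence


def flag_duplicates(labels: Sequence[int]) -> List[bool]:
    """Flag every record after the first one in each cluster.

    labels: one cluster id per record, where -1 means "noise"
            (no near-duplicates found), as HDBSCAN reports it.
    Assumption: the first record seen in a cluster is the one kept.
    """
    seen = set()
    flags = []
    for label in labels:
        if label == -1:
            flags.append(False)      # noise points are unique records
        elif label in seen:
            flags.append(True)       # later member of a known cluster
        else:
            seen.add(label)
            flags.append(False)      # first member is kept as representative
    return flags


# Made-up labels for eight records: two clusters (0 and 1) plus two noise points.
labels = [0, 0, -1, 1, 1, 1, -1, 0]
flags = flag_duplicates(labels)
duplicate_share = sum(flags) / len(flags)
print(flags, duplicate_share)
```

Flagging rather than deleting keeps the step reversible, so flagged records can be reviewed before they are removed from the warehouse.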
 
Two-Stage Cascading Binary Classifier Architecture
                     +-------------------+
                     |   Raw Text Input  |
                     +-------------------+
                                |
                                v
                     +-------------------------------------+
                     |  Stage 1: High-Recall Classifier(s) |
                     +-------------------------------------+
                                |
                                |  P(positive) > threshold₁?
                                |  (e.g. 0.20 to ensure high recall)
                                |
            +-------------------+-------------------+
            | No                                    | Yes
            v                                       v
     +-------------+                      +------------------+
     | Discard/    |                      |  Pass to Stage 2 |
     |  Flag       |                      |  (Candidates)    |
     | (Negatives) |                      +------------------+
     +-------------+                                |
                                                    v
                     +----------------------------------------------+
                     | Stage 2: High-Precision Refiner              |
                     | - BERT-based classifier                      |
                     | (trained on Stage 1 positives + sampled negs)|
                     | (optimized for precision)                    |
                     +----------------------------------------------+
                               |
                               v
                     +-------------------+
                     |  Final Positives  |
                     +-------------------+
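
The control flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the production implementation: the two scoring functions are toy stand-ins for the real Stage 1 and BERT-based Stage 2 models, `threshold2 = 0.90` is a hypothetical value (only the 0.20 example for Stage 1 comes from the diagram), and the sample texts are invented.

```python
from typing import Callable, List, Sequence

# A scorer maps one text to an estimated P(positive).
Scorer = Callable[[str], float]


def cascade_predict(
    texts: Sequence[str],
    stage1_score: Scorer,
    stage2_score: Scorer,
    threshold1: float = 0.20,  # low cut-off: favors recall (example value from the diagram)
    threshold2: float = 0.90,  # hypothetical high cut-off: favors precision
) -> List[bool]:
    """Return True only for texts that pass both stages."""
    results = []
    for text in texts:
        # Stage 1: cheap, high-recall filter. Anything at or below the
        # threshold is treated as a negative and never reaches Stage 2.
        if stage1_score(text) <= threshold1:
            results.append(False)
            continue
        # Stage 2: stricter model decides among the surviving candidates.
        results.append(stage2_score(text) > threshold2)
    return results


# Toy stand-ins for the two models (assumptions for illustration only).
def keyword_score(text: str) -> float:
    return 0.6 if "refund" in text.lower() else 0.05


def strict_score(text: str) -> float:
    return 0.95 if "requests a refund" in text.lower() else 0.3


docs = [
    "Customer requests a refund for order",  # passes both stages
    "Refund policy page",                    # passes Stage 1, rejected by Stage 2
    "Weather update",                        # rejected by Stage 1
]
flags = cascade_predict(docs, keyword_score, strict_score)
print(flags)
```

Because a text must clear both thresholds to be labeled positive, a false positive requires both models to be wrong on the same input, which is what pushes the overall false positive rate down.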
                                        
Technologies Used
- Python
- Azure Databricks
- Machine Learning
- ETL Pipelines
- Power BI
- HDBSCAN
- Clustering
- NLP