Rahul Reddy Chappidi Venkata

Data Scientist

Data Scientist Intern

Cline Center for Advanced Social Research

May 2024 - May 2025

Key Achievements

  • Built ETL pipelines to create datasets for downstream machine learning tasks
  • Improved existing classification models and set up retraining & evaluation pipelines
  • Created multiple Power BI dashboards presenting project results to users worldwide
  • Built an agentic AI pipeline to automate data collection


Problem Statements

  • Text data needed to be classified with a low false positive rate for downstream business tasks
  • New, scalable ETL pipelines were needed to support multiple machine learning tasks
  • The data warehouse contained duplicate text data that needed to be cleaned up
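
The false positive rate in the first problem statement is FP / (FP + TN). One common way to enforce a ceiling on it is to choose the decision threshold on a held-out validation set. The sketch below is illustrative only: the function names, the toy labels and scores, and the 25% ceiling are assumptions, not details of the project.

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN), computed over the true negatives in y_true."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not p and not t)
    return fp / (fp + tn) if (fp + tn) else 0.0


def lowest_threshold_for_fpr(y_true, scores, max_fpr):
    """Return the lowest score threshold whose FPR is <= max_fpr.

    FPR can only fall as the threshold rises, so the first threshold that
    satisfies the ceiling also keeps recall as high as possible.
    Returns None if no observed score works as a threshold.
    """
    for threshold in sorted(set(scores)):
        preds = [s >= threshold for s in scores]
        if false_positive_rate(y_true, preds) <= max_fpr:
            return threshold
    return None


# Toy validation data (hypothetical): 4 negatives, 3 positives.
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.4, 0.7, 0.35, 0.8, 0.9]

print(lowest_threshold_for_fpr(y_true, scores, max_fpr=0.25))
```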


Technical Implementation

  • Debugged and refactored source code, wrote unit tests for existing modules
  • Updated project documentation: code comments, README files, project wikis, and runbooks
  • Deployed Azure resources (Databricks, Data Factory, Linux Virtual Machines)
  • Built an agentic AI pipeline using Ollama LLM agents to automate data extraction
  • Engineered an HDBSCAN-based clustering solution to flag duplicate data, reducing duplicate data by 88%
  • Designed and implemented a two-stage cascading binary classifier to achieve a low false positive rate
  • Created interactive Power BI dashboards with scheduled refreshes used by end-users worldwide
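
The HDBSCAN-based duplicate flagging listed above could be sketched roughly as follows. This is not the project's code: it assumes the texts have already been turned into embedding vectors, uses scikit-learn's HDBSCAN (available from version 1.3), and the helper names and the keep-the-first-member rule are illustrative choices.

```python
def cluster_labels(embeddings, min_cluster_size=2):
    """Cluster row vectors with HDBSCAN; label -1 marks noise (no near-duplicate found)."""
    # Imported lazily so the label handling below also runs without scikit-learn.
    from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embeddings).tolist()


def flag_duplicates(labels):
    """Split row indices into (keep, flagged).

    The first member of each cluster is kept as its representative and the
    remaining members are flagged as likely duplicates. Noise points
    (negative labels) are always kept.
    """
    seen_clusters = set()
    keep, flagged = [], []
    for idx, label in enumerate(labels):
        if label < 0:
            keep.append(idx)
        elif label not in seen_clusters:
            seen_clusters.add(label)
            keep.append(idx)
        else:
            flagged.append(idx)
    return keep, flagged


# Hypothetical labels, as cluster_labels(embeddings) might return them.
example_labels = [0, 0, -1, 1, 0, 1, -1]
keep, flagged = flag_duplicates(example_labels)
print(keep, flagged)
```

Flagging rather than deleting leaves room for a manual review of each cluster before rows are removed from the warehouse.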

Two-Stage Cascading Binary Classifier Architecture

                     +-------------------+
                     |   Raw Text Input  |
                     +-------------------+
                                |
                                v
                     +-------------------------------------+
                     |  Stage 1: High-Recall Classifier(s) |
                     +-------------------------------------+
                                 |
                                 |  P(positive) > threshold₁?
                                 |  (e.g. 0.20 to ensure high recall)
                                 +----------------------+
                                 | No                   | Yes
                                 v                      v
                          +-------------+      +------------------+
                          | Discard/    |      |  Pass to Stage 2 |
                          |  Flag       |      |  (Positives)     |
                          | (Negatives) |      |  (Candidates)    |
                          +-------------+      +------------------+
                                                        |
                                                        v
                     +----------------------------------------------+
                     | Stage 2: High-Precision Refiner              |
                     | - BERT-based classifier                      |
                     | (trained on Stage 1 positives + sampled negs)|
                     | (optimized for precision)                    |
                     +----------------------------------------------+
                             |
                             v
                     +-------------------+
                     |  Final Positives  |
                     +-------------------+
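
The control flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the deployed system: the two scorers are keyword-based stand-ins for the real Stage 1 classifier(s) and the BERT-based Stage 2 model, and the Stage 2 threshold of 0.90 is an assumed value (only the 0.20 example for Stage 1 comes from the diagram).

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]  # returns P(positive) in [0, 1]


def cascade_predict(
    texts: Sequence[str],
    stage1: Scorer,
    stage2: Scorer,
    threshold1: float = 0.20,  # low, so Stage 1 favors recall
    threshold2: float = 0.90,  # assumed value; high, so Stage 2 favors precision
) -> List[bool]:
    """Return one final positive/negative decision per text."""
    decisions = []
    for text in texts:
        if stage1(text) <= threshold1:
            decisions.append(False)  # discarded by Stage 1, Stage 2 never runs
            continue
        decisions.append(stage2(text) > threshold2)
    return decisions


# Toy stand-in scorers (hypothetical, for illustration only).
def toy_stage1(text: str) -> float:
    return 0.60 if "outage" in text.lower() else 0.05


def toy_stage2(text: str) -> float:
    lowered = text.lower()
    return 0.95 if "outage" in lowered and "confirmed" in lowered else 0.40


docs = [
    "Outage confirmed in region A",
    "Possible outage, unverified",
    "Weekly newsletter",
]
print(cascade_predict(docs, toy_stage1, toy_stage2))
```

Because the second model only scores Stage 1 candidates, the more expensive classifier runs on a fraction of the input, and a text must clear both thresholds to count as a final positive.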
                                        


Technologies Used

Python, Azure, Databricks, Machine Learning, ETL Pipelines, Power BI, HDBSCAN Clustering, NLP