VENKATA NAGA SAI VISHNU ROHIT PULIPAKA pulipakav1

Hi, I'm Rohit 👋

🎓 MS in Data Science — Montclair State University | 📍 United States

I build end-to-end machine learning systems, real-time data pipelines, and business intelligence solutions.
My work spans predictive modeling, streaming data engineering, LLM research, and production ML deployment.

💼 Actively seeking full-time roles in Data Science · Data Engineering · Data Analytics · AI/ML Engineering

🧑‍💻 About Me

🔭 Currently researching low-resource NLP (Telugu BabyLM)
🌱 Deepening expertise in MLOps, RAG pipelines, and distributed data systems
🎯 I enjoy turning messy data into clear decisions — whether through a model, a dashboard, or a pipeline

🏷️ Tech Stack

🔬 Featured Projects

🔬 Data Science

Telco Churn Prediction

End-to-end churn prediction system with statistical validation, ML modeling, explainability, and production deployment.

XGBoost · ROC-AUC 0.82 · PR-AUC 0.63
SHAP explainability (global + local waterfall)
Threshold optimization for churn recall (76%)
FastAPI + Streamlit + Docker deployment

XGBoost · SHAP · FastAPI · Docker · Streamlit
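
The threshold-optimization step can be illustrated with a minimal sketch. This is not the project's code: the function names `recall_at` and `pick_threshold` and the toy labels and scores are hypothetical, and the selection rule (take the strictest threshold that still meets a recall target) is one plausible approach among several.

```python
from typing import Sequence


def recall_at(y_true: Sequence[int], y_score: Sequence[float], threshold: float) -> float:
    """Fraction of actual churners (label 1) scored at or above the threshold."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    if not positives:
        raise ValueError("y_true contains no positive labels")
    return sum(s >= threshold for s in positives) / len(positives)


def pick_threshold(y_true: Sequence[int], y_score: Sequence[float], target_recall: float) -> float:
    """Highest candidate threshold whose recall meets the target."""
    # Candidates are the observed scores, tried from strictest to loosest,
    # so the first hit flags the fewest customers while meeting the target.
    for t in sorted(set(y_score), reverse=True):
        if recall_at(y_true, y_score, t) >= target_recall:
            return t
    raise ValueError("target recall is unreachable")


# Toy data (hypothetical): 4 churners among 8 customers.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
chosen = pick_threshold(y_true, y_score, target_recall=0.75)
```

Lowering the threshold trades precision for recall, so in practice the chosen value would be checked against the cost of contacting customers who were never going to churn.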

🏗️ Data Engineering

Real-Time Streaming Pipeline

Production-grade e-commerce streaming pipeline ingesting, processing, and storing live events at scale.

80K+ events/hour across 3 Kafka topics
PySpark Structured Streaming → AWS S3 Parquet
Date/hour partitioned output for efficient querying
Fully containerized with Docker Compose

Kafka · PySpark · AWS S3 · Docker · Streamlit
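
To show what date/hour partitioning might look like on disk, here is a small sketch that builds a Hive-style partition prefix from an event timestamp. The layout (`date=.../hour=...`), the bucket name, the topic name, and the helper `partition_path` are all assumptions for illustration; the pipeline itself may organize its output differently.

```python
from datetime import datetime, timezone


def partition_path(prefix: str, topic: str, event_time: datetime) -> str:
    """Build a Hive-style date/hour partition prefix for one event."""
    # Convert to UTC so events from different time zones land in
    # consistent partitions. Pass timezone-aware datetimes; naive ones
    # would be interpreted as local time by astimezone().
    ts = event_time.astimezone(timezone.utc)
    return f"{prefix}/{topic}/date={ts:%Y-%m-%d}/hour={ts:%H}"


# Hypothetical bucket and topic names.
example = partition_path(
    "s3://example-bucket/events",
    "orders",
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
)
```

With a layout like this, a query restricted to one day or one hour only needs to read the matching directories rather than scanning the whole dataset.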

📊 Data Analysis

Customer Segmentation — RFM

Segmented 9,943 SaaS customers into 4 actionable groups using RFM scoring and K-Means clustering.

Champions (1,152) generate 35.5% of $11.5M revenue
At-Risk segment (4,400) = largest retention opportunity
Actionable business recommendations per segment
K-Means with elbow method · StandardScaler

Pandas · scikit-learn · K-Means · Matplotlib · Power BI
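
As a rough sketch of the feature preparation behind RFM clustering, the snippet below derives recency, frequency, and monetary values per customer and z-scores them, which is the same transform StandardScaler applies. It uses only the standard library; the helper names, the tuple layout of `orders`, and the toy data are hypothetical and not taken from the project.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def rfm_table(orders, as_of: date):
    """Map customer_id -> (recency_days, frequency, monetary)."""
    by_customer = defaultdict(list)
    for customer_id, order_date, amount in orders:
        by_customer[customer_id].append((order_date, amount))
    table = {}
    for customer_id, rows in by_customer.items():
        last_order = max(d for d, _ in rows)
        table[customer_id] = (
            (as_of - last_order).days,  # recency: days since last order
            len(rows),                  # frequency: number of orders
            sum(a for _, a in rows),    # monetary: total spend
        )
    return table


def standardize(table):
    """Z-score each RFM column using the population standard deviation."""
    columns = list(zip(*table.values()))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return {
        cid: tuple((v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats))
        for cid, row in table.items()
    }


# Toy orders (hypothetical): (customer_id, order_date, amount).
orders = [
    ("a", date(2024, 1, 1), 100.0),
    ("a", date(2024, 1, 20), 50.0),
    ("b", date(2024, 1, 10), 30.0),
]
table = rfm_table(orders, as_of=date(2024, 1, 31))
scaled = standardize(table)
```

Scaling matters here because K-Means relies on Euclidean distance, so an unscaled monetary column would dominate recency and frequency.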

🧠 Research Projects

LLM Fairness Drift Study

Large-scale evaluation of bias and fairness drift across LLM families (GPT, Claude, Gemini, LLaMA, Gemma).

50K+ model evaluation runs · 5+ LLM families benchmarked
Distributed experimentation pipelines via SLURM
Longitudinal statistical analysis for fairness benchmarking

Telugu BabyLM — Low-Resource Language Model

Transformer model trained for Telugu under the BabyLM framework.

GPT-2 style model via HuggingFace Transformers on A100 GPUs
Custom preprocessing pipelines for low-resource corpora
Investigated challenges unique to morphologically rich, low-resource languages
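
One common preprocessing step for a low-resource corpus is to normalize the text and drop lines that are mostly in another script. The sketch below shows that idea using only the standard library. It is an illustrative assumption rather than the project's actual pipeline: the helper names and the 0.8 ratio cutoff are hypothetical.

```python
import unicodedata

TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F  # Unicode Telugu block


def telugu_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Telugu block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(TELUGU_START <= ord(c) <= TELUGU_END for c in chars)
    return hits / len(chars)


def clean_corpus(lines, min_ratio: float = 0.8):
    """NFC-normalize, collapse whitespace, and keep mostly-Telugu lines."""
    kept = []
    for line in lines:
        text = " ".join(unicodedata.normalize("NFC", line).split())
        if text and telugu_ratio(text) >= min_ratio:
            kept.append(text)
    return kept


sample = ["  తెలుగు   భాష ", "hello world", ""]
cleaned = clean_corpus(sample)
```

A cutoff below 1.0 leaves room for digits and punctuation inside otherwise Telugu sentences; the right value depends on the corpus.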

🌐 Let's Connect

Open to full-time roles in Data Science · Data Engineering · Data Analytics — feel free to reach out!