pulipakav1/README.md

Hi, I'm Rohit 👋

🎓 MS in Data Science, Montclair State University | 📍 United States

I build end-to-end machine learning systems, real-time data pipelines, and business intelligence solutions.
My work spans predictive modeling, streaming data engineering, LLM research, and production ML deployment.

💼 Actively seeking full-time roles in Data Science · Data Engineering · Data Analytics · AI/ML Engineering


🧑‍💻 About Me

  • 🔭 Currently researching low-resource NLP (Telugu BabyLM)
  • 🌱 Deepening expertise in MLOps, RAG pipelines, and distributed data systems
  • 🎯 I enjoy turning messy data into clear decisions, whether through a model, a dashboard, or a pipeline

🏷️ Tech Stack


🔬 Featured Projects

🔬 Data Science

Telco Churn Prediction

End-to-end churn prediction system with statistical validation, ML modeling, explainability, and production deployment.

  • XGBoost · ROC-AUC 0.82 · PR-AUC 0.63
  • SHAP explainability (global + local waterfall)
  • Threshold optimization for churn recall (76%)
  • FastAPI + Streamlit + Docker deployment

XGBoost · SHAP · FastAPI · Docker · Streamlit
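
As a rough illustration of the threshold optimization step, the sketch below picks the highest decision threshold that still reaches a target recall on the churn class. It is a minimal standard-library example, not the project's actual code: the function name `threshold_for_recall` and the sample labels and scores are hypothetical.

```python
from typing import Sequence


def threshold_for_recall(
    y_true: Sequence[int], scores: Sequence[float], target_recall: float
) -> float:
    """Return the highest threshold whose recall on class 1 is >= target_recall.

    Hypothetical helper for illustration; a prediction is positive when
    its score is >= the threshold.
    """
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive label")

    # Try candidate thresholds from strictest to loosest; recall can only
    # grow as the threshold drops, so the first hit is the highest one.
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(
            1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold
        )
        if true_pos / positives >= target_recall:
            return threshold
    return min(scores)  # not reached: the lowest score always gives recall 1.0


# Made-up labels (1 = churned) and model scores, for demonstration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
print(threshold_for_recall(y_true, scores, target_recall=0.75))  # 0.4
```

Lowering the threshold trades precision for recall, so in practice the chosen value would be checked against precision (or a cost estimate) on a held-out set.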

🏗️ Data Engineering

Real-Time Streaming Pipeline

Production-grade e-commerce streaming pipeline ingesting, processing, and storing live events at scale.

  • 80K+ events/hour across 3 Kafka topics
  • PySpark Structured Streaming → AWS S3 Parquet
  • Date/hour partitioned output for efficient querying
  • Fully containerized with Docker Compose

Kafka · PySpark · AWS S3 · Docker · Streamlit
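
To make the date/hour partitioning concrete, here is a small sketch of how an event timestamp could map to a Hive-style partition directory. The partition column names `date` and `hour`, the `events` prefix, and the `partition_path` helper are assumptions for illustration; the actual layout used in the pipeline may differ.

```python
from datetime import datetime, timezone


def partition_path(prefix: str, event_time: datetime) -> str:
    """Build a Hive-style date/hour partition path for one event.

    Hypothetical helper: expects a timezone-aware datetime and
    normalizes it to UTC so every event lands in one consistent layout.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    ts = event_time.astimezone(timezone.utc)
    return f"{prefix}/date={ts:%Y-%m-%d}/hour={ts:%H}"


event = datetime(2024, 1, 5, 7, 30, tzinfo=timezone.utc)
print(partition_path("events", event))  # events/date=2024-01-05/hour=07
```

With a layout like this, a query restricted to a given day or hour only needs to read the matching directories rather than the full dataset.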

📊 Data Analysis

Customer Segmentation - RFM

Segmented 9,943 SaaS customers into 4 actionable groups using RFM scoring and K-Means clustering.

  • Champions (1,152) generate 35.5% of $11.5M revenue
  • At-Risk segment (4,400) = largest retention opportunity
  • Actionable business recommendations per segment
  • K-Means with elbow method · StandardScaler

Pandas · scikit-learn · K-Means · Matplotlib · PowerBI
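
For readers unfamiliar with RFM, the sketch below shows how recency, frequency, and monetary values can be derived per customer from raw orders. It is a simplified standard-library example with made-up data; the `rfm_table` function and the order tuple layout are hypothetical, and the scaling and K-Means steps listed above are not shown.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Order = Tuple[str, date, float]  # (customer_id, order_date, amount)


def rfm_table(orders: Iterable[Order], as_of: date) -> Dict[str, Tuple[int, int, float]]:
    """Return {customer_id: (recency_days, frequency, monetary)}.

    Recency is the number of days between the customer's last order and
    `as_of`, frequency is the order count, monetary is total spend.
    """
    last_order: Dict[str, date] = {}
    frequency: Dict[str, int] = defaultdict(int)
    monetary: Dict[str, float] = defaultdict(float)

    for customer_id, order_date, amount in orders:
        previous = last_order.get(customer_id, order_date)
        last_order[customer_id] = max(previous, order_date)
        frequency[customer_id] += 1
        monetary[customer_id] += amount

    return {
        cid: ((as_of - last_order[cid]).days, frequency[cid], monetary[cid])
        for cid in last_order
    }


# Made-up orders, for demonstration only.
orders = [
    ("a", date(2024, 1, 1), 100.0),
    ("a", date(2024, 3, 1), 50.0),
    ("b", date(2024, 2, 15), 20.0),
]
print(rfm_table(orders, as_of=date(2024, 3, 31)))
# {'a': (30, 2, 150.0), 'b': (45, 1, 20.0)}
```

The three values sit on very different scales (days, counts, currency), which is why standardizing them before clustering matters for a distance-based method such as K-Means.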


🧠 Research Projects

LLM Fairness Drift Study

Large-scale evaluation of bias and fairness drift across LLM families (GPT, Claude, Gemini, LLaMA, Gemma).

  • 50K+ model evaluation runs · 5+ LLM families benchmarked
  • Distributed experimentation pipelines via SLURM
  • Longitudinal statistical analysis for fairness benchmarking

Telugu BabyLM - Low-Resource Language Model

Transformer model trained for Telugu under the BabyLM framework.

  • GPT-2 style model via HuggingFace Transformers on A100 GPUs
  • Custom preprocessing pipelines for low-resource corpora
  • Investigated challenges unique to morphologically rich, low-resource scripts

📊 GitHub Stats


🌐 Let's Connect

Open to full-time roles in Data Science · Data Engineering · Data Analytics. Feel free to reach out!
