TelcoStream Analytics Engine

Production-grade data engineering pipeline for telecom revenue assurance and bill shock detection.

A comprehensive ETL pipeline implementing the Medallion Architecture (Bronze, Silver, Gold) to process Call Detail Records (CDRs) and identify customers at risk of unexpected billing charges.

Project Overview

Business Problem

Bill shock occurs when telecom customers receive unexpectedly high bills due to excessive data usage or international roaming charges. This leads to:

30-40% customer churn rate
$2-5M annual revenue loss (per 1M customers)
Poor customer satisfaction and brand damage

Solution

This pipeline provides proactive revenue assurance by:

Early detection of high-usage patterns
Risk classification (CRITICAL/HIGH/MEDIUM/LOW)
Automated customer alerts before billing cycles close
60-70% reduction in disputed charges

Architecture

Medallion Architecture Implementation

Raw CSV (5000+ CDRs) → BRONZE (Ingestion) → SILVER (Transformation) → GOLD (Aggregation)
                            |                      |                        |
                     Schema-on-Read      Normalization &            Customer-Level
                     Data Validation     Quality Checks             Bill Insights

Technology Stack

Processing Engine: Apache Spark (PySpark 3.5.0)
Storage Format: Parquet (Snappy compression)
Data Generation: Faker (synthetic CDR data)
Language: Python 3.8+
Code Quality: PEP 8 compliant

Key Features

Schema-on-Read: Explicit schema enforcement for data validation
Data Quality Framework: Null handling, deduplication, validation
Usage Normalization: Standardized GB conversion (MB to GB, Min to GB)
Risk Scoring: 0-100 scale based on usage patterns
Customer Segmentation: 4-tier risk classification
Automated Exports: CSV output for notification systems

Quick Start

Prerequisites

Python 3.8 or higher
Java 8 or 11 (for PySpark)
4GB+ RAM recommended

Installation

# Clone the repository
git clone https://github.com/mon2learner/telcostream-analytics-engine.git
cd telcostream-analytics-engine

# Install dependencies
pip install -r requirements.txt

# Verify Java installation
java -version

Run the Pipeline

cd scripts

# Step 1: Generate synthetic data
python generate_telecom_data.py

# Step 2: Bronze layer (ingestion)
python ingest_raw.py

# Step 3: Silver layer (transformation)
python transform_usage.py

# Step 4: Gold layer (analytics)
python billing_insights.py

# Verify success
python verify_pipeline.py

Expected Runtime: 1-2 minutes for complete pipeline

Project Structure

telcostream-analytics-engine/
├── data/
│   ├── raw/              # Raw CSV files
│   ├── bronze/           # Ingested Parquet files
│   ├── silver/           # Transformed data
│   └── gold/             # Business analytics
├── scripts/
│   ├── generate_telecom_data.py   # Synthetic data generator
│   ├── ingest_raw.py              # Bronze layer ingestion
│   ├── transform_usage.py         # Silver layer transformation
│   ├── billing_insights.py        # Gold layer analytics
│   ├── demo_pipeline.py           # Pandas-based demo (no Java)
│   └── verify_pipeline.py         # Pipeline verification
├── requirements.txt      # Python dependencies
├── .gitignore           # Git ignore rules
└── README.md            # This file

Pipeline Outputs

Bronze Layer

Format: Parquet (compressed)
Schema: Enforced with PySpark StructType
Quality Checks: Null detection, type validation

Silver Layer

Transformations: Usage normalization, risk flagging
Enrichment: High usage flags, roaming detection, risk scores
Quality: Null handling, deduplication, standardization

Gold Layer

Aggregations: Customer-level billing insights
Metrics: Total usage, transaction count, risk scores, projected bills
Outputs: Parquet tables + CSV export for high-risk customers

Technical Highlights

Data Engineering Best Practices

Medallion Architecture: Industry-standard lakehouse pattern
Lazy Evaluation: Optimized PySpark transformations
Partitioning: Efficient data storage and retrieval
Adaptive Query Execution: Dynamic Spark optimization
Modular Design: Reusable, testable components

Code Quality

PEP 8 Compliant: Clean, readable Python code
Comprehensive Logging: Detailed progress tracking
Error Handling: Robust exception management
Documentation: Inline comments and docstrings

Sample Results

Risk Distribution (500 customers analyzed)

Risk Level	Customers	Percentage	Action
CRITICAL	45	9.0%	Immediate notification
HIGH	123	24.6%	Immediate notification
MEDIUM	187	37.4%	Monitor closely
LOW	145	29.0%	Normal usage

Financial Impact

Total Projected Revenue: $166,694
Average Bill: $333/customer
Maximum Bill: $1,247 (CRITICAL risk)
Customers Requiring Alerts: 168 (33.6%)

Data Quality Metrics

The pipeline tracks:

Completeness: Null value percentages
Validity: Schema conformance
Uniqueness: Duplicate detection
Consistency: Value standardization
Accuracy: Normalization verification

Future Enhancements

Delta Lake integration for ACID transactions
Databricks deployment for cloud-scale processing
Real-time streaming with Apache Kafka
ML models for predictive bill shock forecasting
Power BI/Tableau dashboards
Automated notification system integration

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Bikash Deb
Data Engineer
GitHub

Acknowledgments

Built with Apache Spark and the Medallion Architecture pattern
Inspired by real-world telecom revenue assurance challenges
Designed for scalability and production deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TelcoStream Analytics Engine

Project Overview

Business Problem

Solution

Architecture

Medallion Architecture Implementation

Technology Stack

Key Features

Quick Start

Prerequisites

Installation

Run the Pipeline

Project Structure

Pipeline Outputs

Bronze Layer

Silver Layer

Gold Layer

Technical Highlights

Data Engineering Best Practices

Code Quality

Sample Results

Risk Distribution (500 customers analyzed)

Financial Impact

Data Quality Metrics

Future Enhancements

License

Author

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
scripts		scripts
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

TelcoStream Analytics Engine

Project Overview

Business Problem

Solution

Architecture

Medallion Architecture Implementation

Technology Stack

Key Features

Quick Start

Prerequisites

Installation

Run the Pipeline

Project Structure

Pipeline Outputs

Bronze Layer

Silver Layer

Gold Layer

Technical Highlights

Data Engineering Best Practices

Code Quality

Sample Results

Risk Distribution (500 customers analyzed)

Financial Impact

Data Quality Metrics

Future Enhancements

License

Author

Acknowledgments

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages