A production-ready real-time data pipeline built with Apache Spark Structured Streaming to process, analyze, and monitor API logs at scale. This project demonstrates enterprise-grade Big Data engineering practices including ETL, streaming analytics, data quality validation, and anomaly detection.
APIStream Analytics simulates and processes high-volume API traffic logs in real-time, providing actionable insights into API performance, error rates, and usage patterns. The pipeline processes millions of events, aggregates metrics, detects anomalies, and stores results for further analysis.
- Real-Time Streaming: Apache Spark Structured Streaming for continuous data processing
- Scalable ETL: Modular data transformation and validation pipeline
- Data Quality Checks: Automated validation, anomaly detection, and monitoring
- Performance Analytics: Response time analysis, error rate tracking, throughput metrics
- Production-Ready: Configurable, tested, and documented for enterprise deployment
- Cloud-Ready: Designed for deployment on GCP (Dataproc), AWS (EMR), or Azure (Databricks)
API Log Generator
↓
Bronze Layer (Raw Streaming Events)
↓
Data Quality & Validation (DLQ Handling)
↓
Silver Layer (Cleaned & Enriched Events)
↓
Stateful Aggregations (Event-Time + Watermarks)
↓
Gold Layer (Analytics & SLIs)
↓
Parquet (Analytics Lake) | SQLite (Query Layer)
↓
Visualization & Monitoring
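The sketch below illustrates what the bronze-to-silver leg of this flow can look like in PySpark Structured Streaming. The paths, event schema, and checkpoint location are illustrative assumptions, not the project's exact configuration.

```python
# Minimal bronze -> silver sketch; paths, schema fields, and checkpoint
# locations are assumptions, not the project's actual values.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, TimestampType)

spark = SparkSession.builder.appName("apistream-sketch").getOrCreate()

log_schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("endpoint", StringType()),
    StructField("status_code", IntegerType()),
    StructField("response_time_ms", DoubleType()),
])

# Bronze: raw JSON events written by the log generator
bronze = spark.readStream.schema(log_schema).json("data/bronze/api_logs")

# Silver: cleaned and lightly enriched events
silver = (bronze
          .filter(F.col("timestamp").isNotNull() & F.col("endpoint").isNotNull())
          .withColumn("is_error", F.col("status_code") >= 400))

# Continuous append of the silver layer as Parquet
query = (silver.writeStream
         .format("parquet")
         .option("path", "data/silver/api_logs")
         .option("checkpointLocation", "checkpoints/silver")
         .outputMode("append")
         .start())
```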
- Language: Python 3.8+
- Big Data: Apache Spark 3.5.0 (PySpark, Structured Streaming)
- Storage: Parquet (columnar), SQLite (analytics)
- Data Quality: Custom validation framework
- Visualization: Matplotlib, Plotly, Seaborn
- Testing: pytest
- Configuration: YAML
- Python 3.8 or higher
- Java 8 or 11 (required for Spark)
- 4GB+ RAM recommended
- Git
git clone https://github.com/yourusername/apistream-analytics.git
cd apistream-analytics
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Edit config/config.yaml to customize:
- Data generation rate
- API endpoint configurations
- Output paths
- Spark settings
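As a rough illustration, the settings above could be loaded in Python roughly as follows; the key names shown here are hypothetical, not the project's actual config schema.

```python
# Hypothetical sketch of loading config/config.yaml with PyYAML;
# every key name below is illustrative, not the real schema.
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

events_per_second = config["generator"]["events_per_second"]   # data generation rate
endpoints = config["generator"]["endpoints"]                    # API endpoint configurations
silver_path = config["storage"]["silver_path"]                  # output paths
shuffle_partitions = config["spark"]["shuffle_partitions"]      # Spark settings
```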
Terminal 1: Start Data Generator
python src/data_generator.py
Terminal 2: Start Spark Streaming Pipeline
python src/spark_streaming_pipeline.py
Terminal 3: Run Batch Analytics (optional)
python src/batch_analytics.py
# Open Jupyter notebook for visualization
jupyter notebook notebooks/analytics_visualization.ipynb
- Response Time Analytics: Min, Max, Avg, P95, P99 latencies per API
- Error Rate Tracking: 4xx, 5xx error percentages by endpoint
- Throughput Monitoring: Requests per second, minute, hour
- Anomaly Detection: Statistical outliers in response times
- Top N Analysis: Slowest APIs, highest error endpoints
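The sketch below shows the kind of queries behind these metrics, run as a batch job over the silver Parquet output; the path and column names are assumptions carried over from the earlier sketch.

```python
# Illustrative batch query for per-endpoint latency percentiles and error
# rates; path and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
logs = spark.read.parquet("data/silver/api_logs")

per_endpoint = logs.groupBy("endpoint").agg(
    F.min("response_time_ms").alias("min_ms"),
    F.max("response_time_ms").alias("max_ms"),
    F.avg("response_time_ms").alias("avg_ms"),
    F.expr("percentile_approx(response_time_ms, 0.95)").alias("p95_ms"),
    F.expr("percentile_approx(response_time_ms, 0.99)").alias("p99_ms"),
    F.avg((F.col("status_code") >= 500).cast("double")).alias("error_5xx_rate"),
    F.count("*").alias("requests"),
)

# Top 5 slowest endpoints by p95 latency
per_endpoint.orderBy(F.desc("p95_ms")).show(5)
```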
- Missing field validation
- Schema enforcement
- Anomaly detection (> 2 standard deviations)
- Duplicate detection
- Timestamp validation
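A minimal sketch of these rule-based checks with a dead-letter split is shown below, continuing from the `bronze` stream in the earlier sketch; the rule thresholds are illustrative.

```python
# Rule-based validation with a dead-letter split; thresholds are illustrative
# and `bronze` is the streaming DataFrame from the earlier sketch.
from pyspark.sql import functions as F

validated = bronze.withColumn(
    "is_valid",
    F.col("timestamp").isNotNull()
    & F.col("endpoint").isNotNull()
    & F.col("status_code").between(100, 599)
    & (F.col("response_time_ms") >= 0),
)

clean = validated.filter("is_valid").drop("is_valid")
dead_letter = validated.filter(~F.col("is_valid"))  # routed to the DLQ sink for inspection
```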
- Micro-batch processing with Spark Structured Streaming
- Watermarking for late-arriving data
- Stateful streaming aggregations
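For example, an event-time aggregation with a watermark (continuing from the `silver` stream above) might look like this; the window and watermark durations are illustrative.

```python
# Stateful windowed aggregation with a watermark; durations are illustrative.
from pyspark.sql import functions as F

windowed = (silver
            .withWatermark("timestamp", "10 minutes")            # tolerate 10 min of lateness
            .groupBy(F.window("timestamp", "1 minute"), "endpoint")
            .agg(F.count("*").alias("requests"),
                 F.avg("response_time_ms").alias("avg_latency_ms")))

# In append mode a window is emitted only after the watermark passes its end,
# so late events within the threshold are still counted.
```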
- Extract: Read from streaming source
- Transform: Clean, validate, enrich data
- Load: Write to Parquet and database
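One way to implement the load step is with foreachBatch, writing each micro-batch to both Parquet and SQLite; the sketch below assumes `windowed` is the aggregated stream from the previous sketch, and the table name and paths are placeholders.

```python
# Illustrative load step: each micro-batch goes to Parquet and SQLite.
# Table name and paths are assumptions.
import sqlite3

def write_batch(batch_df, batch_id):
    # Analytics lake: append to Parquet
    batch_df.write.mode("append").parquet("data/gold/api_metrics")
    # Query layer: SQLite via pandas (fine for small aggregated batches)
    with sqlite3.connect("data/analytics.db") as conn:
        batch_df.toPandas().to_sql("api_metrics", conn, if_exists="append", index=False)

query = (windowed.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "checkpoints/load")
         .start())
```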
- Schema validation
- Null/missing value handling
- Statistical anomaly detection
- Data reconciliation
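The statistical check can be as simple as flagging records more than two standard deviations above the per-endpoint mean, as sketched below against the `logs` DataFrame from the batch-analytics sketch; the threshold and column names are illustrative.

```python
# Per-endpoint "> 2 standard deviations" anomaly rule; illustrative only.
from pyspark.sql import functions as F

stats = logs.groupBy("endpoint").agg(
    F.avg("response_time_ms").alias("mean_ms"),
    F.stddev("response_time_ms").alias("std_ms"),
)

anomalies = (logs.join(stats, "endpoint")
             .filter(F.col("response_time_ms") >
                     F.col("mean_ms") + 2 * F.col("std_ms")))
```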
- Partitioning strategies for Parquet files
- Column pruning and predicate pushdown
- Memory management and caching
- Broadcast joins for dimension tables
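Two of these optimizations are sketched below: broadcasting a small endpoint dimension table and partitioning the Parquet output by date. The dimension table path and partition column are assumptions.

```python
# Broadcast join against a small lookup table plus date-partitioned Parquet
# output; paths and columns are assumptions.
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

endpoint_dim = spark.read.parquet("data/dim/endpoints")    # small lookup table (assumed path)

enriched = logs.join(broadcast(endpoint_dim), "endpoint")  # avoids shuffling the large side

(enriched
 .withColumn("event_date", F.to_date("timestamp"))
 .write.mode("append")
 .partitionBy("event_date")                                # enables partition pruning on date filters
 .parquet("data/gold/enriched_logs"))
```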
- Modular, reusable code
- Configuration management
- Error handling and logging
- Unit testing
- Documentation
pytest tests/ -v
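A unit test for a transformation step might look like the sketch below; the test file and the enrichment logic under test are hypothetical, not the project's actual test suite.

```python
# tests/test_transformations.py -- illustrative test for a hypothetical
# error-flagging step; names are assumptions.
import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_error_flagging(spark):
    df = spark.createDataFrame(
        [("/users", 200, 12.5), ("/orders", 503, 950.0)],
        ["endpoint", "status_code", "response_time_ms"],
    )
    flagged = df.withColumn("is_error", F.col("status_code") >= 400)
    rows = {r["endpoint"]: r["is_error"] for r in flagged.collect()}
    assert not rows["/users"]
    assert rows["/orders"]
```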
# Create Dataproc cluster
gcloud dataproc clusters create apistream-cluster \
--region=us-central1 \
--num-workers=2 \
--worker-machine-type=n1-standard-4
# Submit Spark job
gcloud dataproc jobs submit pyspark \
src/spark_streaming_pipeline.py \
--cluster=apistream-cluster \
--region=us-central1
# Create EMR cluster
aws emr create-cluster \
--name "APIStream Analytics" \
--release-label emr-6.10.0 \
--applications Name=Spark \
--instance-type m5.xlarge \
--instance-count 3
# Submit job via EMR console or AWS CLI
- Kafka Integration: Replace file-based streaming with Apache Kafka (see the sketch after this list)
- Real-Time Dashboard: Add Grafana/Kibana for live monitoring
- ML-Based Anomaly Detection: Use Spark MLlib for pattern recognition
- Multi-Source Ingestion: Add support for database CDC, message queues
- Data Lake Integration: Store raw data in S3/GCS, processed in BigQuery
- Airflow Orchestration: Schedule and monitor with Apache Airflow
- Delta Lake: Add ACID transactions and time travel
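The Kafka enhancement mentioned above could be prototyped roughly as below, swapping the file source for a Kafka source; the broker address, topic name, and reuse of `log_schema` from the earlier sketch are placeholders.

```python
# Rough sketch of a Kafka source replacing the file-based bronze input;
# broker, topic, and schema handling are placeholders.
from pyspark.sql import functions as F

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "api-logs")
       .load())

bronze = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", log_schema).alias("event"))
          .select("event.*"))
```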