This project performs a comprehensive big data analysis of the Apache Spark GitHub repository to track its evolution over time. By examining commit history, issues, and pull requests, we aim to identify key contributors, module churn patterns, and trends in community engagement, particularly around major feature releases.

aravpanwar/Spark_Codebase_Evolution

Spark Codebase Evolution

Pull data from GHArchive using download_and_filter.sh.

Filter the data using filter_spark_events_v3.py.
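The filtering step can be sketched as follows. This is a minimal illustration, not the contents of filter_spark_events_v3.py: it assumes the standard GHArchive layout (gzipped, one JSON event per line, with the repository under `repo.name`), and the names `TARGET_REPO`, `filter_repo_events`, and `filter_archive` are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator, List

TARGET_REPO = "apache/spark"  # assumed filter target; not taken from the project script


def filter_repo_events(lines: Iterable[str], repo: str = TARGET_REPO) -> Iterator[Dict]:
    """Yield parsed events whose repo.name equals `repo`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole file
        if not isinstance(event, dict):
            continue  # valid JSON, but not an event object
        repo_info = event.get("repo") or {}
        if repo_info.get("name") == repo:
            yield event


def filter_archive(path: str, repo: str = TARGET_REPO) -> List[Dict]:
    """Read one gzipped hourly archive file and keep only matching events."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return list(filter_repo_events(fh, repo))
```

Streaming line by line keeps memory use flat, since only the matching events are retained from each hourly file.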

Big Data Analytics 7th-semester, 10-mark mini-project: a 6-month analysis of Apache Spark GitHub activity using GHArchive data.

Fixed issue: 7th-month bleed-over.
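One common way to avoid this kind of bleed-over is a half-open date window, where the end bound is the first instant of the seventh month and is excluded. The sketch below assumes the issue was events from a seventh month entering the 6-month window; the actual fix in the project may differ. The helper names and the example start date are hypothetical.

```python
from datetime import datetime, timezone


def add_months(start: datetime, months: int) -> datetime:
    """Return midnight on the first day of the month `months` after `start`."""
    idx = start.month - 1 + months
    return start.replace(
        year=start.year + idx // 12, month=idx % 12 + 1,
        day=1, hour=0, minute=0, second=0, microsecond=0,
    )


def in_window(created_at: str, start: datetime, months: int = 6) -> bool:
    """True if a GHArchive `created_at` timestamp falls in [start, start + months)."""
    ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return start <= ts < add_months(start, months)


# Hypothetical window start; the project's real analysis period is not stated here.
window_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
```

Comparing against an exclusive upper bound avoids off-by-one errors that arise when the window end is written as an inclusive last day or a fixed number of days.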

Open a venv, install the requirements, then run python3 spark_6month_analysis.py.

Data Loading

image

Data Combination and final data processing

image

Contributor Analysis

image
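A contributor ranking of this kind can be sketched by counting events per actor. This is an illustrative assumption about the approach, not the code in spark_6month_analysis.py; it relies on the GHArchive `actor.login` field, and `top_contributors` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_contributors(events: Iterable[Dict], n: int = 10) -> List[Tuple[str, int]]:
    """Return the `n` most active logins with their event counts."""
    counts = Counter()
    for event in events:
        login = (event.get("actor") or {}).get("login")
        if login:  # ignore events with no actor information
            counts[login] += 1
    return counts.most_common(n)
```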

Event Type Analysis

image

Contributor Growth Analysis

image
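Contributor growth can be approximated by recording the month in which each login first appears and accumulating those counts. The sketch below is one possible reading of this analysis, under the assumption that "growth" means the cumulative number of distinct contributors; `cumulative_contributors` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable


def cumulative_contributors(events: Iterable[Dict]) -> Dict[str, int]:
    """Map each month (YYYY-MM) to the number of distinct logins seen so far."""
    first_seen: Dict[str, str] = {}
    # ISO 8601 timestamps sort chronologically as plain strings.
    for event in sorted(events, key=lambda e: e["created_at"]):
        login = event["actor"]["login"]
        first_seen.setdefault(login, event["created_at"][:7])

    new_per_month = Counter(first_seen.values())
    cumulative: Dict[str, int] = {}
    total = 0
    for month in sorted(new_per_month):
        total += new_per_month[month]
        cumulative[month] = total
    return cumulative
```

Months in which no new contributor appears are omitted from the result; a plotting step would need to fill those gaps.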

Temporal Patterns

image
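Temporal patterns are often summarized as activity per weekday and per hour of day. The following sketch assumes that is what this section measures and that timestamps use the GHArchive `created_at` format (UTC); `activity_by_weekday_and_hour` is a hypothetical helper, not a function from the project.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def activity_by_weekday_and_hour(events: Iterable[Dict]) -> Tuple[Counter, Counter]:
    """Count events per weekday name and per hour of day (UTC)."""
    by_weekday: Counter = Counter()
    by_hour: Counter = Counter()
    for event in events:
        ts = datetime.strptime(event["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        by_weekday[WEEKDAYS[ts.weekday()]] += 1  # weekday() returns 0 for Monday
        by_hour[ts.hour] += 1
    return by_weekday, by_hour
```

Indexing a fixed list of weekday names avoids the locale dependence of `strftime("%A")`.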

Executive Analysis

image

Final Visualizations

comprehensive_analysis
