# GitHub Repository Analyzer (`github_repo_analysis.py`)
This Streamlit application fetches and analyzes trending GitHub repositories based on a user-defined search query. It provides summary statistics and various visualizations to explore language popularity, creation trends, and log-transformed popularity metrics over time.
## Features

- GitHub API Integration: Fetches repository data using the GitHub Search API.
- Data Analysis: Utilizes Pandas for efficient data processing and calculates summary statistics.
- Popularity Normalization: Employs NumPy's log transformation ($\log_{10}(1+x)$) to normalize highly skewed popularity metrics (stars and forks).
- Interactive UI: Built with Streamlit for a simple and interactive user experience.
- Visualizations: Generates plots using Matplotlib to show:
  - Top languages by total stars and forks.
  - Repository creation trend over the years.
  - Log-transformed popularity (stars and forks) over the creation date.
## Requirements

- Python 3.7+
- A GitHub Personal Access Token (PAT).
## Installation

1. Clone the repository:

   ```bash
   git clone <repository_url>
   cd github-repo-analyzer
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Create an environment file: In the root directory, create a file named `.env` and add your GitHub Personal Access Token:

   ```
   GITHUB_TOKEN="YOUR_PERSONAL_ACCESS_TOKEN_HERE"
   ```

   Note: The token is used for authorization, which increases the API rate limit.
4. Run the application:

   ```bash
   streamlit run github_repo_analysis.py
   ```

   The application will open in your web browser.
## Usage

- Enter Search Query: Specify your search criteria (e.g., `language:JavaScript`, `topic:machine-learning`).
- Configure Fetch: Select the number of repos to fetch, the sorting criterion (`stars`, `forks`, `updated`), and the order (`desc`, `asc`).
- Click "Analyze Data": The app fetches the data, performs the analysis, and displays statistics and plots.
## Key Learning Points

The development of this application highlights several key concepts and best practices in data science and API interaction.

### GitHub API Interaction
- Authentication: A `GITHUB_TOKEN` stored in an environment variable (a `.env` file) and passed in the request header via the `Authorization` field is the standard, secure way to interact with the GitHub API. The token significantly increases your rate limit compared to unauthenticated requests.
- Search API: The core functionality relies on the `/search/repositories` endpoint, which allows complex querying (e.g., filtering by language, topic, or stars) directly in the URL query parameters (`q={query}&sort={sort}`).
- Handling Errors: The `search_repos` function explicitly checks for a `response.status_code` of 200 and provides helpful messages about potential issues such as an incorrect token or an exhausted API rate limit (a minimal sketch follows this list).
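For concreteness, here is a minimal sketch of what such a fetch-and-check function might look like. The function name `search_repos` comes from the description above; the exact parameters, defaults, and error handling are assumptions, not the app's actual code.

```python
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # read GITHUB_TOKEN from the .env file


def search_repos(query, sort="stars", order="desc", per_page=50):
    """Query the GitHub Search API and return the matching repository items."""
    headers = {"Authorization": f"token {os.getenv('GITHUB_TOKEN')}"}
    params = {"q": query, "sort": sort, "order": order, "per_page": per_page}
    response = requests.get(
        "https://api.github.com/search/repositories",
        headers=headers,
        params=params,
        timeout=10,
    )
    if response.status_code != 200:
        # Common causes: an invalid token or an exhausted rate limit.
        raise RuntimeError(f"GitHub API error {response.status_code}: {response.text}")
    return response.json()["items"]
```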
### NumPy: Log Transformation for Skewed Data

Popularity metrics like stars and forks on GitHub are heavily right-skewed: a few repos have millions, while most have very few.
- Normalization: The code uses `np.log10(1 + x)` to apply a base-10 log transformation (a short numeric example follows this list).
- The log: It pulls in the extremely high values, making the distribution closer to normal and more suitable for averaging and comparison.
- The $+1$: Adding 1 to the count ($1 + x$) is crucial because $\log(0)$ is undefined. Since a repository can have 0 stars or forks, adding 1 ensures the log is defined for all data points, with 0 stars/forks mapping to $\log(1) = 0$.
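To see the compression at work, compare raw and transformed counts (the star counts below are purely illustrative):

```python
import numpy as np

# Illustrative star counts, from an unknown repo to a blockbuster.
stars = np.array([0, 3, 150, 42_000, 2_100_000])

# log10(1 + x): 0 maps to 0, and the huge values are pulled in.
log_stars = np.log10(1 + stars)
print(log_stars)  # approx. [0.    0.602 2.179 4.623 6.322]
```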
### Pandas: Data Structuring and Analysis

- Data Structuring: The `analyze_repos` function efficiently converts the list of dictionaries returned by the API into a structured Pandas DataFrame, the foundation for all subsequent analysis and plotting.
- Feature Engineering: It dynamically adds derived columns such as `'log_stars'`, `'log_forks'`, and `'year'` (`df['created_at'].dt.year`) for enhanced analysis without modifying the original data.
- Grouped Analysis: Functions like `plot_language_trends` use the powerful `groupby()` and `agg()` methods to quickly calculate the total stars/forks for each programming language, keeping the visualization data generation concise (see the sketch after this list).
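A sketch of how these steps might fit together, assuming the GitHub API's standard field names (`stargazers_count`, `forks_count`, `created_at`, `language`). The helper name `language_totals` is hypothetical; in the app, the equivalent aggregation lives inside `plot_language_trends`.

```python
import numpy as np
import pandas as pd


def analyze_repos(items):
    """Turn the API's list of repository dicts into an analysis-ready DataFrame."""
    df = pd.DataFrame(items)
    df["created_at"] = pd.to_datetime(df["created_at"])
    # Feature engineering: derived columns, leaving the raw counts untouched.
    df["log_stars"] = np.log10(1 + df["stargazers_count"])
    df["log_forks"] = np.log10(1 + df["forks_count"])
    df["year"] = df["created_at"].dt.year
    return df


def language_totals(df):
    """Grouped analysis: total stars and forks per language, largest first."""
    return (
        df.groupby("language")[["stargazers_count", "forks_count"]]
        .agg("sum")
        .sort_values("stargazers_count", ascending=False)
    )
```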
### Streamlit: Interactive UI

- Simplified UI: Streamlit makes creating an interactive web application trivial. The code uses functions like `st.title()`, `st.text_input()`, `st.selectbox()`, and `st.button()` to build the UI with minimal effort.
- Displaying Metrics: The `st.metric()` function is excellent for displaying key performance indicators (KPIs) and summary statistics in an organized layout.
- Visualization Integration: Streamlit integrates seamlessly with Matplotlib, using `st.pyplot(fig)` to display the generated charts (a sketch combining these pieces follows).
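Putting it together, a minimal Streamlit skeleton in this style might look like the following. It reuses the hypothetical `search_repos` and `analyze_repos` sketches above; widget labels, options, and the example plot are assumptions, not the app's actual layout.

```python
import matplotlib.pyplot as plt
import streamlit as st

# search_repos and analyze_repos are the hypothetical sketches shown earlier.

st.title("GitHub Repository Analyzer")
query = st.text_input("Search query", value="language:Python")
count = st.selectbox("Number of repositories", [10, 25, 50, 100])
sort = st.selectbox("Sort by", ["stars", "forks", "updated"])
order = st.selectbox("Order", ["desc", "asc"])

if st.button("Analyze Data"):
    df = analyze_repos(search_repos(query, sort=sort, order=order, per_page=count))

    # Summary statistics as KPI widgets.
    col1, col2 = st.columns(2)
    col1.metric("Total stars", int(df["stargazers_count"].sum()))
    col2.metric("Total forks", int(df["forks_count"].sum()))

    # A Matplotlib figure rendered inline with st.pyplot.
    fig, ax = plt.subplots()
    df.groupby("year")["log_stars"].mean().plot(ax=ax)
    ax.set_xlabel("Creation year")
    ax.set_ylabel("Mean log10(1 + stars)")
    st.pyplot(fig)
```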