This repository is an experiment in gathering metadata about open source code repositories across entire GitHub organizations. The idea was that the data could serve as the dataset for a Houston Data Visualization meetup Saturday data jam, which it did on Saturday, February 17th, 2024.
At a high level this repository is broken into 3 parts:
- The data directory holds the collected data.
- The src directory holds the Python code used to harvest the data via the Ecosyste.ms API.
- The framework directory holds a quick experiment using the new Observable Framework library to create a static page that briefly visualizes the data.
There's also an index.html at the top level for quickly inspecting the data CSVs.
The datasets can potentially be used in a Houston Data Jam.
This data file was created by grabbing all the repositories that Ecosyste.ms has data on for the NASA organization on GitHub.com. That is approximately 270 repositories, so not every repository in the organization. The skipped ones are probably those with low engagement.
This is the same data as in data/nasa_repos.json but flattened into a CSV using the flattenJSON() function in src/main.py.
The CSV can be viewed as a formatted table on github.com at this direct link: https://github.com/JustinGOSSES/repo_data_experiment/blob/main/data/nasa_repos_flat.csv
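To illustrate what flattening nested JSON into CSV columns involves, here is a minimal sketch using only the standard library. It is not the flattenJSON() function from src/main.py; the helper names `flatten_record` and `records_to_csv`, the dotted column naming, and the choice to join lists with semicolons are all assumptions made for illustration, and the example record is made up rather than real API output.

```python
import csv
import io


def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining nested keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key, sep))
        elif isinstance(value, list):
            # Assumption: lists are joined into a single cell; other choices are possible.
            flat[full_key] = ";".join(str(item) for item in value)
        else:
            flat[full_key] = value
    return flat


def records_to_csv(records):
    """Return CSV text whose columns are the union of all flattened keys."""
    rows = [flatten_record(record) for record in records]
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up record for illustration only.
example = [
    {"full_name": "org/repo", "stars": 5, "owner": {"login": "org"}, "topics": ["a", "b"]}
]
csv_text = records_to_csv(example)
```

Using the union of keys as the header means repositories that lack a field simply get an empty cell, which keeps every row the same width.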
The all_orgs_merged_20240120.csv file has 1111 repositories from the GitHub organizations nasa, CMSgov, airbnb, houstondataviz, home-assistant, and NationalSecurityAgency.
These organizations were selected as they represented organizations with different histories or patterns of how they use GitHub for open source.
The NASA GitHub organization has a comparatively long history on GitHub for a government organization. It is also suspected to have a stronger than normal pattern of "publishing" code that then quickly sees no further development, due to the culture of "publishing" papers, reports, etc. that exists in the organization.
NationalSecurityAgency is the GitHub organization of the US government's National Security Agency or NSA. It has a narrower scope in the types of open source it releases and its reasons for doing so, and it is suspected to have less of a tendency to drop repositories without continued development.
Home-assistant is an extremely popular open source collection of home automation products and tools, and it is expected to have an extremely large and diverse contributor community. The GitHub organization ~ the product ~ the people organization.
Airbnb is a tech company founded as a digital-first company. They also have an engineering blog and a record of contributing open source code that, in some cases, is used by others.
houstondataviz is the GitHub organization used by the Houston DataViz Meetup. Most of the use is associated with brief one-time only data jams as opposed to being repositories of products, tools, packages, websites, etc.
CMSgov is the GitHub organization of the Centers for Medicare & Medicaid Services. It is suspected to have a shorter history on GitHub than NASA, with more of a focus on actual products and services run by the organization. GitHub is used in part as a way to make it possible for others to use, build upon, and contribute to the code bases.
See the combined CSV as an HTML table here: https://justingosses.github.io/repo_data_experiment/
This work is motivated by the idea that a lot of understanding of open source presence and activity is prevented by the need to manually read so many repositories.
There are times when it is useful to be able to generate high level descriptions of the types of repositories in an organization. This can be useful to compare the types of open source an organization releases. It can also be useful for the organizations as it helps to identify repositories that are highly used, build packages, are primarily samples, or any of a variety of other "repository types" that otherwise require a person to manually read the repository to figure out what is there, a task that isn't possible with hundreds or thousands of repositories.
The purpose of this repository is to test out the functionality and performance of the Ecosyste.ms API for gathering repository metrics on all the repositories in an organization.
In past efforts to do this, I have used the GitHub API to gather data on an entire organization, as seen in https://github.com/JustinGOSSES/awesome-list-visual-explorer-template/, but for large organizations with hundreds of repositories it would take dozens of minutes to gather all the data.
For the NASA organization, Ecosyste.ms has data on 270 repositories, while the number of repositories in the GitHub organization is 504.
It seems likely that the repositories captured in Ecosyste.ms are limited to those that are more active or more used, in terms of being the source for a package, stars, forks, etc. That makes sense, as Ecosyste.ms might be trying to ignore the repositories without engagement that make up the bulk of the repositories on GitHub.
Gathering basic repository metrics for the 270 repositories that Ecosyste.ms has took a couple of seconds. In previous experience, the scripts at https://github.com/JustinGOSSES/awesome-list-visual-explorer-template/ took dozens of minutes.
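As a rough sketch of what gathering an organization's repositories from the Ecosyste.ms API can look like, the code below pages through a repository listing. It is not the code in src/main.py. The base URL, the path layout (`/hosts/GitHub/owners/<org>/repositories`), and the `page` and `per_page` parameters are assumptions based on my understanding of the public repos API and should be checked against the Ecosyste.ms API documentation; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL and path layout of the Ecosyste.ms repos API.
BASE_URL = "https://repos.ecosyste.ms/api/v1"


def build_org_url(org_name, page=1, per_page=100, host="GitHub"):
    """Build the (assumed) URL for one page of an organization's repositories."""
    owner = urllib.parse.quote(org_name, safe="")
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return f"{BASE_URL}/hosts/{host}/owners/{owner}/repositories?{query}"


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def fetch_org_repos(org_name, fetch=fetch_json, per_page=100, max_pages=50):
    """Collect repository records page by page until a short page is returned."""
    repos = []
    for page in range(1, max_pages + 1):
        batch = fetch(build_org_url(org_name, page=page, per_page=per_page))
        repos.extend(batch)
        if len(batch) < per_page:
            break  # A page smaller than per_page is treated as the last one.
    return repos


# Example (performs network requests, so it is left commented out):
# repos = fetch_org_repos("nasa")
# print(len(repos))
```

Passing the fetch function as a parameter makes it possible to try the paging logic with canned data before hitting the real API.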
Repository cohorts is a concept that forms the basis of a talk that has been submitted to the Open Source Summit North America Conference.
It refers to the idea that it can be advantageous to have pre-calculated cohorts of repositories identified based on threshold boundaries across key data dimensions.
These are potential thresholds you might use to create categorical data from continuous data.
- Age: [YES, CAN CREATE WITH ECOSYSTE.MS]
  - baby: 0-30 days
  - toddler: 31-90 days
  - teen: 91-365 days
  - adult: 366-1095 days
  - senior: >1095 days
- Last update in days: [YES, CAN CREATE WITH ECOSYSTE.MS]
  - past 7 days
  - past 8-30 days
  - past 31-90 days
  - past 91-365 days
  - past 366-730 days
  - past 731+ days
- Number of commits in past 90 days: [NOT WITH FIRST SET OF API RESULTS? MAYBE REPO API ENDPOINT]
  - 0
  - 1-10
  - 11-200
  - 201+
- Size of contributor community: [NOT WITH FIRST SET OF API RESULTS? MAYBE REPO API ENDPOINT]
  - 1
  - 2-4
  - 5-10
  - 11-75
  - 76+
- External vs. internal contributors (probably not possible in this context)
- Eghbal-style community types: [YES, CAN CREATE WITH ECOSYSTE.MS]
  - toys (small size and a low share of watchers/stargazers who are contributors)
  - clubs (small size and a high share of watchers/stargazers who are contributors)
  - federation (large size and a high share of watchers/stargazers who are contributors)
  - stadium (large size and a low share of watchers/stargazers who are contributors)
- GitHub Actions: [YES, CAN CREATE WITH ECOSYSTE.MS]
  - True
  - False
- Samples: [YES, with more work]
  - True (based on seeing words like 'sample', 'demo', 'example' in the org name or repo name)
  - False
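A few of the cohorts listed above can be sketched in code as follows. This is only an illustration: the function names are hypothetical, each upper bound is treated as inclusive so that every value falls into exactly one bucket, and the dates would need to be parsed from whatever creation and update fields the collected data actually contains.

```python
from bisect import bisect_left
from datetime import date

# Inclusive upper bounds (in days) for each bucket; the last label is open-ended.
AGE_UPPER_BOUNDS = [30, 90, 365, 1095]
AGE_LABELS = ["baby", "toddler", "teen", "adult", "senior"]

UPDATE_UPPER_BOUNDS = [7, 30, 90, 365, 730]
UPDATE_LABELS = [
    "past 7 days",
    "past 8-30 days",
    "past 31-90 days",
    "past 91-365 days",
    "past 366-730 days",
    "past 731+ days",
]

SAMPLE_WORDS = ("sample", "demo", "example")


def bucket(value, upper_bounds, labels):
    """Return the label of the first bucket whose upper bound is >= value."""
    return labels[bisect_left(upper_bounds, value)]


def age_cohort(created_on, today):
    """Categorize a repository by the number of days since it was created."""
    return bucket((today - created_on).days, AGE_UPPER_BOUNDS, AGE_LABELS)


def last_update_cohort(updated_on, today):
    """Categorize a repository by the number of days since its last update."""
    return bucket((today - updated_on).days, UPDATE_UPPER_BOUNDS, UPDATE_LABELS)


def is_sample(full_name):
    """Flag names such as 'org/repo' that contain a sample-like word."""
    name = full_name.lower()
    return any(word in name for word in SAMPLE_WORDS)


today = date(2024, 2, 17)
print(age_cohort(date(2024, 1, 20), today))
print(last_update_cohort(date(2023, 6, 1), today))
print(is_sample("org/Demo-App"))
```

Keeping the thresholds in plain lists makes it easy to adjust the boundaries and recompute the cohorts without touching the bucketing logic.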
- How would you quickly summarize each of these organizations' open source presence?
- What repositories are most impactful for each organization? What metrics could you pick for 'impact'?
- What are dimensions you might use to group similar organizations across GitHub? For example, how would you find all the organizations that are apparently trying to do similar things with their open source presence as the National Security Agency?
- What organization is most similar to CMSgov and why?
- Make a visualization that summarizes the organization's open source presence for management, giving them a quick overview of the ways the GitHub organization is used and how it benefits individuals and the organization.
Or whatever you want to answer or try.
Clone repository:
- Run in terminal:
  git clone https://github.com/JustinGOSSES/repo_data_experiment.git
  cd repo_data_experiment
Only basic Python packages are used (pandas, requests, etc.), so your existing base environment might be fine. However, best practice is to use virtual environments.
- With conda:
  - Create a new conda environment:
    conda create --name myenv
  - Activate the environment:
    conda activate myenv
  - Install the required packages from the requirements.txt file:
    conda install --file requirements.txt
- Or with venv and pip:
  - Create a new virtual environment:
    python -m venv myenv
  - Activate the environment:
    - On Windows:
      myenv\Scripts\activate
    - On macOS and Linux:
      source myenv/bin/activate
  - Install the required packages from the requirements.txt file:
    pip install -r requirements.txt
Getting data from another GitHub organization on the subset of repositories that Ecosyste.ms API has data on
In a terminal, call the functions like this replacing the string after --orgName, in this case houstondatavis.
python src/main.py --orgName houstondatavis --function call_api
Flattening the JSON that is returned in the last step into a flat CSV to make it easier to work with the data
In a terminal, call the functions like this replacing the strings after --inputFilePath and after --outputFilePath.
python src/main.py --inputFilePath data/houstondatavis_repos.json --outputFilePath data/houstondatavis_repos_flat.csv --function flattenJSON
Merging the flattened CSVs of multiple organizations into a single CSV
python src/main.py --folderPathToLookForCSVsToMerge data --outputFilePath data/combined_org_data/all_orgs_merged_20240120.csv --function mergeMultipleOrgCSV
See the src/main.py file for how this all works.
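For readers who want a feel for the merge step without opening src/main.py, here is a minimal standard-library sketch of combining several per-organization CSVs into one. It is not the mergeMultipleOrgCSV implementation; the function name `merge_csv_files` and the `*_flat.csv` file pattern are assumptions for illustration.

```python
import csv
import glob
import os


def merge_csv_files(folder, output_path, pattern="*_flat.csv"):
    """Merge every CSV in `folder` matching `pattern` into one CSV.

    Columns are the union of all input columns; missing values become empty cells.
    Returns the number of data rows written.
    """
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    rows, columns = [], []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    # Note: the output folder must already exist.
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example with assumed paths:
# merge_csv_files("data", "data/combined_org_data/all_orgs_merged.csv")
```

Writing the merged file to a name or folder that does not match the input pattern avoids reading a previous merge result back in on the next run.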
There is a top-level index.html page which, when served locally and viewed in a browser or as a GitHub Pages page, makes it easy to see all the columns and the number of empty cells.
I have the Node.js program http-server installed globally, so I start up a local server by running http-server and then navigate to http://127.0.0.1:8080/ in a browser. A Python option that does the same thing is the built-in http.server module (python -m http.server).
The GitHub pages URL is https://justingosses.github.io/repo_data_experiment/
https://observablehq.com/@justingosses/analyzing-repositories-by-their-metadata-with-sql