
OpenLambdaVerse: A Dataset and Analysis of Open-Source Serverless Applications

This repository contains the code and instructions for building the dataset and performing its characterization.

See the end of this README for citation instructions.

Note: All development and testing were done on macOS; some of these instructions/commands may need to be adjusted for other operating systems.

Requirements

  • Python 3.11.3 (also tested with Python 3.13.3)
  • A GitHub account
  • jq command-line tool
    brew install jq
  • cloc
    brew install cloc

Setup

  1. Create and activate a virtual environment:
    python -m venv env
    source env/bin/activate
  2. Install the required dependencies:
    pip install -r requirements.txt
  3. Launch a local instance of Jupyter Notebook:
    jupyter notebook

About the .env File

The .env_demo file lists the environment variables required by the scripts. Copy it to an unversioned .env file and replace the placeholder values with ones that fit your setup.

Environment Variables
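
The concrete variable names and placeholder values are those provided in .env_demo. The variable referenced directly in this README is CLONED_REPOS_DIRECTORY (the target directory for cloned repositories; see "Cloning repositories" below). Since the code search step queries the GitHub API, a GitHub access token variable is likely required as well.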

Methodology for Data Extraction

The data extraction methodology is based on Wonderless; the original implementation can be found here. It also incorporates improvements introduced in OS3, whose implementation can be found here.

Code search

The get_urls.sh script performs a targeted code search on GitHub via its API, identifying repositories that contain specific serverless configuration files, such as:

  • serverless.yml (used by the Serverless Framework)

Additional configuration files can also be specified within the script.
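
For intuition, the underlying GitHub code search request looks roughly like the following Python sketch (illustrative only; the actual script is a shell implementation, and the token variable name is an assumption):

    # Illustrative sketch of a GitHub code search request (not the actual get_urls.sh logic).
    import os
    import requests

    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",  # assumed variable name
        "Accept": "application/vnd.github+json",
    }
    params = {"q": "filename:serverless.yml", "per_page": 100, "page": 1}
    resp = requests.get("https://api.github.com/search/code", headers=headers, params=params)
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["html_url"])  # URL of each matching configuration file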

To run:

  1. Make the script executable:
    chmod +x src/get_urls.sh
  2. Run the script:
    sh src/get_urls.sh

The results will be stored in the data/raw/code_search_YYYYMMDD_hhmmss directory, where:

  • YYYY: Year the search started
  • MM: Month (zero-padded)
  • DD: Day (zero-padded)
  • hh: Hour (24-hour format)
  • mm: Minute
  • ss: Second

Inside the code search directory, you will find the following subdirectories:

  1. errors: Contains JSON files that failed to parse, often due to improperly escaped special characters or malformed GitHub JSON responses.
  2. logs: Stores a single .txt file logging the entire process.
  3. results_by_filename: Includes .txt files listing the URLs found for each filename specified in the search configuration.

A combined results file containing all the URLs from the entire code search will also be available.


Filtering out configuration files in test and demo directories

To clean up the initial set of file URLs and exclude those located in example, test, or demo directories, run the following script:

python src/filter_urls.py

Output

  • Filtered URLs: The filtered results will be saved in data/processed/code_search_YYYYMMDD_hhmmss/filtered_urls/filtered_urls.txt
  • Logs: Logs detailing the filtering process will be stored in data/processed/code_search_YYYYMMDD_hhmmss/logs/url_filter_log.log

Notes

  • The script automatically identifies the latest code_search_YYYYMMDD_hhmmss directory and processes the corresponding results.
  • It also cleans up the previous output, keeping only valid URLs (header lines such as # Results for: are excluded).
  • Ensure that the data/raw directory contains the results from the code search before running this script.
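
Conceptually, the filtering boils down to something like the following sketch (file names and excluded path segments are illustrative; the actual filter_urls.py also handles directory discovery and logging):

    # Illustrative sketch of the URL filtering described above (not the actual script).
    EXCLUDED_SEGMENTS = ("example", "examples", "test", "tests", "demo", "demos")

    def keep_url(line: str) -> bool:
        url = line.strip()
        if not url.startswith("http"):  # drops header lines such as "# Results for:"
            return False
        return not any(f"/{segment}/" in url.lower() for segment in EXCLUDED_SEGMENTS)

    with open("combined_code_search_results.txt") as src, open("filtered_urls.txt", "w") as dst:
        for line in src:
            if keep_url(line):
                dst.write(line)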

Removing duplicates and generating unique repository URLs

To generate unique repository URLs by removing duplicates that arose from multiple serverless.yml files associated with the same repository, run the following script:

python src/generate_unique_repo_urls.py

Output

  • Unique Repository URLs: The unique repository URLs will be saved in data/processed/code_search_YYYYMMDD_hhmmss/unique_repo_urls/unique_repo_urls.txt
  • Logs: Logs detailing the deduplication process will be stored in data/processed/code_search_YYYYMMDD_hhmmss/logs/base_url_extractor_log.log

Notes

  • The script automatically identifies the latest filtered_urls.txt file from the data/processed/code_search_YYYYMMDD_hhmmss/filtered_urls/ directory.
  • It extracts and deduplicates repository URLs, ensuring only unique base repository URLs are retained.
  • Ensure that the data/processed directory contains the filtered URLs from the previous step before running this script.
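
The deduplication step amounts to collapsing each file URL to its base repository URL and keeping one entry per repository, roughly as follows (illustrative sketch):

    # Illustrative sketch: reduce file URLs to unique base repository URLs.
    import re

    unique_repos = set()
    with open("filtered_urls.txt") as src:
        for line in src:
            # File URLs look like https://github.com/<owner>/<repo>/blob/<branch>/<path>
            match = re.match(r"https://github\.com/[^/]+/[^/]+", line.strip())
            if match:
                unique_repos.add(match.group(0))

    with open("unique_repo_urls.txt", "w") as dst:
        for url in sorted(unique_repos):
            dst.write(url + "\n")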

Removing Serverless Framework Contributions

To filter out projects belonging to the users serverless and serverless-components, run the following script:

python src/filter_serverless_repos.py

Output

  • Filtered Repository URLs: The filtered repository URLs will be saved in data/processed/code_search_YYYYMMDD_hhmmss/filtered_repo_urls/filtered_repo_urls.txt
  • Logs: Logs detailing the filtering process will be stored in data/processed/code_search_YYYYMMDD_hhmmss/logs/filter_serverless_repos_log.log

Notes

  • The script automatically identifies the latest unique_repo_urls.txt file from the data/processed/code_search_YYYYMMDD_hhmmss/unique_repo_urls/ directory.
  • It filters out repositories owned by the users serverless and serverless-components.
  • Ensure that the data/processed directory contains the unique repository URLs from the previous step before running this script.
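
The owner-based filter is conceptually a small exclusion list applied to each repository URL (illustrative sketch):

    # Illustrative sketch: drop repositories owned by the Serverless Framework organizations.
    EXCLUDED_OWNERS = {"serverless", "serverless-components"}

    def owner_of(repo_url: str) -> str:
        # Base repository URLs look like https://github.com/<owner>/<repo>
        return repo_url.rstrip("/").split("/")[-2].lower()

    kept = []
    with open("unique_repo_urls.txt") as src:
        for line in src:
            url = line.strip()
            if url and owner_of(url) not in EXCLUDED_OWNERS:
                kept.append(url)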

Fetching repository metadata

To gather metadata for each repository, run the following command:

python src/get_repos_metadata.py

Output

  • repository_metadata.jsonl: Contains the extracted metadata for each repository in JSON Lines format.
  • extract_repo_metadata_log.log: Log file detailing the metadata extraction process.
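
In essence, the metadata extraction queries the GitHub REST API for each repository and appends the response as one JSON line (illustrative sketch; the token variable name and input file are assumptions):

    # Illustrative sketch: fetch repository metadata and write it as JSON Lines.
    import json
    import os
    import requests

    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}  # assumed variable name
    with open("filtered_repo_urls.txt") as src, open("repository_metadata.jsonl", "w") as out:
        for line in src:
            owner, repo = line.strip().rstrip("/").split("/")[-2:]
            resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}", headers=headers)
            if resp.ok:
                out.write(json.dumps(resp.json()) + "\n")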

Filtering unlicensed projects

To filter out repositories that do not have a valid license, run the following script:

python src/filter_unlicensed_repos.py

Output

  • Filtered licensed repositories: The filtered repositories with a valid license will be saved in data/processed/code_search_YYYYMMDD_hhmmss/results/licensed_repos.jsonl
  • Logs: Logs detailing the filtering process will be stored in data/processed/code_search_YYYYMMDD_hhmmss/logs/filter_unlicensed_repos_log.log
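
Given the GitHub metadata, the license filter is essentially a check on the license field of each record (illustrative sketch; the actual script may apply stricter validity criteria):

    # Illustrative sketch: keep only repositories whose metadata reports a license.
    import json

    with open("repository_metadata.jsonl") as src, open("licensed_repos.jsonl", "w") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("license"):  # null or missing means GitHub detected no license
                dst.write(line)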

Filtering forks, shallow, inactive, and toy projects

To ensure dataset quality, we apply a series of filters to remove repositories that are forks, shallow (small in size), inactive, or likely to be toy/example projects. Run the following scripts in order:

python src/filter_forked_projects.py
python src/filter_shallow_projects.py
python src/filter_inactive_projects.py
python src/filter_toy_projects.py

Inputs and Outputs

The input and output files are all located in the data/processed/code_search_YYYYMMDD_hhmmss/results/ directory.

Script                        Input File                  Output File
filter_forked_projects.py     licensed_repos.jsonl        filtered_no_forks.jsonl
filter_shallow_projects.py    filtered_no_forks.jsonl     filtered_no_shallow.jsonl
filter_inactive_projects.py   filtered_no_shallow.jsonl   filtered_no_inactive.jsonl
filter_toy_projects.py        filtered_no_inactive.jsonl  filtered_no_toy.jsonl
  • Logs:
    Each script produces a log file in data/processed/code_search_YYYYMMDD_hhmmss/logs/, named according to the script (e.g., filter_forked_projects_log.log).
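
Each of these filters follows the same read-JSONL, test-a-condition, write-JSONL pattern; for instance, the fork filter conceptually reduces to the following sketch (the shallow, inactive, and toy filters apply their own criteria, which are not spelled out here):

    # Illustrative sketch of the fork filter (the other filters follow the same pattern).
    import json

    with open("licensed_repos.jsonl") as src, open("filtered_no_forks.jsonl", "w") as dst:
        for line in src:
            if not json.loads(line).get("fork", False):  # GitHub metadata marks forks explicitly
                dst.write(line)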

Filtering AWS-only projects

To retain only repositories that use AWS as their serverless provider, run:

python src/filter_non_aws_projects.py

Input

  • data/processed/code_search_YYYYMMDD_hhmmss/results/filtered_no_toy.jsonl

Output

  • Filtered AWS Projects:
    data/processed/code_search_YYYYMMDD_hhmmss/results/aws_provider_repos.jsonl
  • Logs:
    data/processed/code_search_YYYYMMDD_hhmmss/logs/filter_by_provider_log.log
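
Deciding whether a project targets AWS conceptually comes down to inspecting the provider declared in its serverless.yml, for example (a hedged sketch using PyYAML; how the actual script obtains and parses the configuration may differ):

    # Illustrative sketch: check whether a serverless.yml declares AWS as its provider.
    import yaml  # PyYAML; assumed to be available

    def uses_aws(serverless_yml_text: str) -> bool:
        config = yaml.safe_load(serverless_yml_text) or {}
        provider = config.get("provider")
        # "provider" can be a short string ("aws") or a mapping with a "name" key.
        name = provider if isinstance(provider, str) else (provider or {}).get("name")
        return str(name).lower() == "aws"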

Cloning repositories

To clone all filtered AWS repositories, define a target directory (e.g. on an external SSD) by setting the CLONED_REPOS_DIRECTORY variable in your .env file. Then, run:

python src/clone_projects.py

Notes

  • If no CLONED_REPOS_DIRECTORY variable is present in the .env file, a default directory named cloned_aws_repos will be used.

Input

  • data/processed/code_search_YYYYMMDD_hhmmss/results/aws_provider_repos.jsonl

Output

  • Cloned Repositories:
    All repositories will be cloned into the directory specified by CLONED_REPOS_DIRECTORY in your .env file, or into cloned_aws_repos by default.
  • Logs:
    data/processed/code_search_YYYYMMDD_hhmmss/logs/clone_aws_repos.log
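
Conceptually, the cloning step reads the filtered records and runs git clone for each repository into the configured directory (illustrative sketch; the metadata field name is an assumption):

    # Illustrative sketch: clone each AWS repository into the configured target directory.
    import json
    import os
    import subprocess

    target = os.environ.get("CLONED_REPOS_DIRECTORY", "cloned_aws_repos")
    os.makedirs(target, exist_ok=True)

    with open("aws_provider_repos.jsonl") as src:
        for line in src:
            url = json.loads(line)["html_url"]  # assumed field name in the metadata records
            subprocess.run(["git", "clone", url], cwd=target, check=False)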

Dataset characterization

We provide the notebooks used for the dataset analysis and characterization.

Year of creation

  • notebooks/analysis_github_repo_creation_dates.ipynb

Event triggers

  • notebooks/analysis_event_triggers.ipynb

Code complexity

CLOC

If you haven't already, install cloc:

brew install cloc

Make sure to export your environment variables:

export $(cat .env | xargs)

And create a cloc directory to store the outputs:

mkdir -p cloc

Then run the following command:

for d in "$CLONED_REPOS_DIRECTORY"/*/; do (cd "$d" && cloc --vcs=git --exclude-dir=node_modules,.venv,venv,env,vendor,target,dist,build,out,__pycache__,.vscode,.idea --not-match-f='(package-lock\.json|yarn\.lock|pnpm-lock\.yaml|poetry\.lock|Pipfile\.lock)' --json .) | jq -c --arg repo "$(basename "$d")" '. + {repository: $repo}' >> cloc/cloc_output.jsonl; done

This loop iterates through the directory of cloned projects, runs a detailed cloc analysis on each one, and aggregates the results into a single cloc/cloc_output.jsonl file.

If the aggregated file needs to be normalized into strict JSON Lines (one compact object per line), you can re-emit it with jq (adjust the file names to match your output):

jq -c --slurp '.[]' cloc/cloc_reports.jsonl > cloc/cloc_reports_fixed.jsonl

Criteria for this analysis include:

  • Only files tracked by Git are counted (--vcs=git).
  • Excludes common dependency directories (e.g., node_modules, .venv, vendor).
  • Excludes common build artifact directories (e.g., dist, build, target).
  • Filters out auto-generated package manager lock files (e.g., package-lock.json, poetry.lock).

With this additional filtering, the output of the cloc analysis better reflects the code complexity of the serverless functions and/or the additional code supporting their implementation.
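
For a quick sanity check of the aggregated output, each JSON line contains cloc's per-language counts plus a SUM block and the repository name added by the jq step, so per-repository totals can be read like this (illustrative sketch):

    # Illustrative sketch: print total code lines per repository from the aggregated cloc output.
    import json

    with open("cloc/cloc_output.jsonl") as src:
        for line in src:
            report = json.loads(line)
            total_code = report.get("SUM", {}).get("code", 0)  # cloc's summary block
            print(report["repository"], total_code)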

Treating non-git repos

If any repositories have a missing .git directory, move them into a separate directory (referenced below as $REPO_DIR) and run:

for d in "$REPO_DIR"/*/; do (cd "$d" && cloc --exclude-dir=node_modules,.venv,venv,env,vendor,target,dist,build,out,__pycache__,.vscode,.idea --not-match-f='(package-lock\.json|yarn\.lock|pnpm-lock\.yaml|poetry\.lock|Pipfile\.lock)' --json .) | jq -c --arg repo "$(basename "$d")" '. + {repository: $repo}' >> cloc/cloc_output_non_git.jsonl; done

The output is saved in cloc/cloc_output_non_git.jsonl. Make sure to append the desired records to the main cloc/cloc_output.jsonl output file.

Then, run the following notebook to generate the code complexity analysis of the repositories:

  • notebooks/analysis_code_complexity.ipynb

Collecting configuration files

To make it easier to navigate the Serverless Framework configuration files, run:

python scripts/move_config_files.py

Preparing assets for the paper

Figures

We run the following scripts to create additional CSV files used by the notebooks.

python scripts/generate_providers_runtimes.py
python scripts/generate_repo_metadata_counts.py
python scripts/generate_repo_sizes_detail.py
python scripts/generate_repo_topics_counts.py 

Outputs

  • csvs/runtimes.csv
  • csvs/repo_metadata.csv
  • csvs/repo_sizes_detail.csv
  • csvs/repo_topics.csv

After running these, use the following notebooks (executed in order) to generate the figures:

  • notebooks/eda_functions_events_counts_loc_repos.ipynb
  • notebooks/eda_functions_runtimes.ipynb
  • notebooks/eda_github_repo_counts.ipynb
  • notebooks/eda_github_repo_sizes.ipynb
  • notebooks/eda_github_repo_topics.ipynb

Citation

Please cite as follows:

Ángel C. Chávez-Moreno and Cristina L. Abad. OpenLambdaVerse: A Dataset and Analysis of Open-Source Serverless Applications. IEEE International Conference on Cloud Engineering (IC2E), to appear. 2025.

  • Code repository: https://github.com/disel-espol/openlambdaverse
  • Dataset (latest): https://doi.org/10.5281/zenodo.16533580
  • Pre-print: https://arxiv.org/abs/2508.01492
