The Highly Stealthy Backdoor (HSB) Risk Evaluator provides security risk assessment for software repositories by analyzing individual evaluation entries across three key dimensions.
This work has been accepted by AAAI 2026. Paper: https://arxiv.org/abs/2511.13341
@misc{yan2025llmbasedquantitativeframeworkevaluating,
  title={An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains},
  author={Zihe Yan and Kai Luo and Haoyu Yang and Yang Yu and Zhuosheng Zhang and Guancheng Li},
  year={2025},
  eprint={2511.13341},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2511.13341},
}
The evaluator analyzes repositories across three critical dimensions:
- Software Supply Chain Dependency Location - Assesses the repository's position in the software supply chain
- Difficulty of Hiding Malicious Code - Evaluates how easily malicious payloads could be hidden
- Community Quality - Analyzes the health and security practices of the repository's community
The easiest way to run the HSB Risk Evaluator is using Docker, which provides a consistent environment:
# Build and run the development environment
./dev.sh
uv sync
uv venv
source .venv/bin/activate

This will use the Dockerfile to create a container with all necessary dependencies and provide an interactive shell for running the evaluator and scripts.
Note: Only Debian-based systems (Ubuntu, Debian, etc.) are currently supported.
For native installation on a Debian-based system:
uv sync
uv venv
source .venv/bin/activate

System Requirements:
- Debian-based Linux distribution (Ubuntu, Debian, Mint, etc.)
- Python 3.12 or higher
- APT package manager (for package analysis features)
- OPENAI_API_KEY - Required for LLM-based upstream repository discovery
- GITHUB_TOKEN - Primary GitHub token for API access
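A quick pre-flight check can catch a missing variable before any API call is made. The sketch below is illustrative only: the helper name `missing_env_vars` is hypothetical and not part of the package, and the only grounded details are the two variable names listed above.

```python
import os

# Variable names taken from the list above.
EXPECTED_VARS = ("OPENAI_API_KEY", "GITHUB_TOKEN")


def missing_env_vars(expected=EXPECTED_VARS, environ=None):
    """Return the names in `expected` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in expected if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Passing a plain dict as `environ` makes the check easy to exercise without touching the real environment.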
Collector configuration options are available through CollectorSettings. See src/hsbriskevaluator/collector/settings.py for detailed configuration parameters.
from hsbriskevaluator.collector.settings import CollectorSettings
settings = CollectorSettings(github_tokens=["token1", "token2"])
repo_info = await collect_all(settings=settings, ...)

Evaluator configuration parameters are defined in src/hsbriskevaluator/evaluator/settings.py. Configure risk analysis thresholds and weights as needed.
from hsbriskevaluator.evaluator.settings import EvaluatorSettings
evaluator_settings = EvaluatorSettings()
evaluator = HSBRiskEvaluator(repo_info, settings=evaluator_settings)

from hsbriskevaluator.evaluator import HSBRiskEvaluator
from hsbriskevaluator.collector.repo_info import RepoInfo
from hsbriskevaluator.collector import collect_all
from datetime import timedelta
import asyncio
# Load repository information (from collector)
# collect_all is a coroutine, so drive it with asyncio.run outside an async function
repo_info = asyncio.run(collect_all(
    pkt_type='debian',
    pkt_name='xz-utils',
    repo_name='tukaani-project/xz',
))
# Create evaluator
evaluator = HSBRiskEvaluator(repo_info)
# Run evaluation
result = asyncio.run(evaluator.evaluate())
print(result)

You can also use individual evaluators for specific assessments:
from hsbriskevaluator.evaluator import (
CommunityEvaluator,
PayloadEvaluator,
DependencyEvaluator,
CIEvaluator
)
# Community evaluation only
community_eval = CommunityEvaluator(repo_info)
community_result = asyncio.run(community_eval.evaluate())
# Payload evaluation only
payload_eval = PayloadEvaluator(repo_info)
payload_result = asyncio.run(payload_eval.evaluate())
# Dependency evaluation only
dependency_eval = DependencyEvaluator(repo_info)
dependency_result = asyncio.run(dependency_eval.evaluate())
# CI evaluation only
CI_eval = CIEvaluator(repo_info)
CI_result = asyncio.run(CI_eval.evaluate())

The scripts/ directory contains utilities for collecting repository information in batch processing workflows. These scripts work together in a specific order to gather comprehensive data about Debian packages and their upstream repositories:
scripts/get_priority_packages.py
Extracts Debian packages with "required", "important", and "standard" priorities into a text file for further processing.
scripts/get_cloud_packages.py
Generates lists of cloud-related packages from various sources for specialized analysis workflows.
scripts/generate_packages_from_file.py
Reads a package list file and generates detailed package and dependency information using APT utilities. Creates packages.yaml and dependencies.yaml files with comprehensive package metadata.
scripts/convert_packages.py
Converts packages.yaml to a format that is easier for the evaluator to read.
scripts/fetch_upstream_infos.py
Uses LLM-based analysis to discover and validate upstream Git repository URLs for packages. Updates package files with upstream repository information, which is essential for the next steps.
scripts/generate_package_metadata.py
Processes packages and dependencies to generate metadata about package relationships and sibling packages sharing the same upstream repository. Creates meta_data.yaml files for efficient package grouping.
scripts/fetch_repo_infos.py
Fetches comprehensive repository information from GitHub for packages with upstream URLs. Collects commit history, contributor data, security information, and other repository metrics for risk analysis.
scripts/delete_symlinks.py
Deletes the symlinks and merges sibling packages into a single YAML file.
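The sibling grouping described for scripts/generate_package_metadata.py can be pictured as grouping package names by their upstream URL. The sketch below is not the script's code and does not reproduce the meta_data.yaml layout; `group_by_upstream` is a hypothetical helper, and `example-pkg` is a made-up entry used only to show the no-upstream case.

```python
from collections import defaultdict


def group_by_upstream(package_to_upstream):
    """Map each upstream URL to the sorted list of packages built from it."""
    groups = defaultdict(list)
    for package, upstream in package_to_upstream.items():
        if upstream:  # skip packages whose upstream is still unknown
            groups[upstream].append(package)
    return {url: sorted(pkgs) for url, pkgs in groups.items()}


packages = {
    "xz-utils": "https://github.com/tukaani-project/xz",
    "liblzma5": "https://github.com/tukaani-project/xz",
    "example-pkg": None,  # hypothetical package with no upstream found
}
siblings = group_by_upstream(packages)
print(siblings)
```

Grouping this way means a repository only needs to be fetched and evaluated once, however many Debian packages are built from it.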
The scripts/fetch_repo_infos.py script requires GitHub API access and supports multiple tokens for improved rate limiting:
- Create a .github_tokens file in the project root with one token per line
- Generate GitHub tokens with repository read permissions
- The script automatically distributes API requests across available tokens
- Rate limiting is handled automatically to maximize throughput
Example .github_tokens file:
ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ghp_yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
ghp_zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
Benefits of multiple tokens:
- Higher API rate limits (5,000 requests/hour per token)
- Reduced processing time for large package sets
- Automatic failover if a token becomes rate limited
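One simple way to spread requests across tokens is round-robin assignment. The sketch below illustrates that idea only; it is not how scripts/fetch_repo_infos.py is implemented, and `load_tokens` and `assign_tokens` are hypothetical helper names. The file format (one token per line) follows the description above.

```python
from itertools import cycle
from pathlib import Path


def load_tokens(path=".github_tokens"):
    """Read one token per line, ignoring blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def assign_tokens(requests, tokens):
    """Pair each request with a token in round-robin order."""
    if not tokens:
        raise ValueError("at least one token is required")
    return list(zip(requests, cycle(tokens)))


tokens = ["ghp_xxx", "ghp_yyy", "ghp_zzz"]  # placeholders, not real tokens
plan = assign_tokens(["repo-a", "repo-b", "repo-c", "repo-d"], tokens)
for repo, token in plan:
    print(repo, "->", token)
```

A real implementation would also need to react to rate-limit responses, for example by skipping a token until its limit resets; that part is omitted here.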
# 1. Generate package list
python scripts/get_priority_packages.py
# 2. Generate package information
python scripts/generate_packages_from_file.py debian_priority_packages.txt -d data/debian
python scripts/convert_packages.py
# 3. Fetch upstream repository URLs
python scripts/fetch_upstream_infos.py -d data/debian
# 4. Generate package metadata
python scripts/generate_package_metadata.py -d data/debian
# 5. Collect repository information
python scripts/fetch_repo_infos.py -d data/debian
python scripts/delete_symlinks.py
# 6. Evaluate and calculate scores
python scripts/evaluate.py
python scripts/calculate_score.py
All scripts support the --directory parameter to specify custom input/output directories and include comprehensive help documentation accessible via --help.
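The workflow above can also be driven from Python. This is a minimal sketch, assuming the scripts are run from the project root with the same arguments as shown in the workflow; `build_pipeline` and `run_pipeline` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys


def build_pipeline(data_dir="data/debian", package_list="debian_priority_packages.txt"):
    """Return the workflow steps as a list of argv lists, in execution order."""
    py = sys.executable
    return [
        [py, "scripts/get_priority_packages.py"],
        [py, "scripts/generate_packages_from_file.py", package_list, "-d", data_dir],
        [py, "scripts/convert_packages.py"],
        [py, "scripts/fetch_upstream_infos.py", "-d", data_dir],
        [py, "scripts/generate_package_metadata.py", "-d", data_dir],
        [py, "scripts/fetch_repo_infos.py", "-d", data_dir],
        [py, "scripts/delete_symlinks.py"],
        [py, "scripts/evaluate.py"],
        [py, "scripts/calculate_score.py"],
    ]


def run_pipeline(steps):
    """Run each step in order and stop at the first failure."""
    for argv in steps:
        subprocess.run(argv, check=True)


steps = build_pipeline()
for argv in steps:
    print(" ".join(argv[1:]))
```

Calling `run_pipeline(steps)` would execute the scripts; with `check=True`, a failing step raises `subprocess.CalledProcessError` so later steps never run on incomplete data.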
Each repository evaluation produces detailed analysis across these core aspects:
class CommunityEvalResult(BaseModel):
    stargazers_count: int  # Number of stargazers of the repository
    watchers_count: int  # Number of watchers of the repository
    forks_count: int  # Number of forks of the repository
    community_users_count: int  # Number of users actively participating in the community
    direct_commits_ratio: float  # Ratio of direct commits in the main branch
    direct_commit_users_count: int  # Number of users with direct commit access
    maintainers_count: int  # Number of maintainers with authority to merge pull requests or directly commit code
    pr_reviewers_count: int  # Number of active PR reviewers
    required_reviewers_distribution: Dict[int, float]  # Distribution of reviewers required to approve a PR before merge
    estimated_prs_to_become_maintainer: float  # Estimated number of PRs needed to become a maintainer
    estimated_prs_to_become_reviewer: float  # Estimated number of PRs needed to become a reviewer
    prs_merged_without_discussion_ratio: float  # Ratio of PRs merged without discussion
    prs_with_inconsistent_description_ratio: float  # Ratio of PRs with mismatched descriptions
    avg_participants_per_issue: float  # Average participants in issues
    avg_participants_per_pr: float  # Average participants in PRs
    community_activity_score: float  # Overall community engagement score (0.0-1.0); ignore for now, no reliable formula yet

Key Indicators:
- Few direct committers increase centralization risk
- Low PR reviewer count suggests limited oversight
- PRs merged without discussion may bypass scrutiny
- Inconsistent PR descriptions could hide malicious changes
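The indicators above can be turned into simple threshold checks. The sketch below is an illustration, not the evaluator's scoring logic: `community_risk_flags` is a hypothetical helper, every threshold is an arbitrary default chosen for the example, and the sample values are illustrative.

```python
def community_risk_flags(metrics, *, min_committers=3, min_reviewers=3,
                         max_undiscussed_ratio=0.5, max_inconsistent_ratio=0.1):
    """Return human-readable flags for metrics that cross the given thresholds.

    All thresholds are illustrative defaults, not values used by the evaluator.
    """
    flags = []
    if metrics["direct_commit_users_count"] < min_committers:
        flags.append("few direct committers")
    if metrics["pr_reviewers_count"] < min_reviewers:
        flags.append("low PR reviewer count")
    if metrics["prs_merged_without_discussion_ratio"] > max_undiscussed_ratio:
        flags.append("many PRs merged without discussion")
    if metrics["prs_with_inconsistent_description_ratio"] > max_inconsistent_ratio:
        flags.append("inconsistent PR descriptions")
    return flags


sample = {
    "direct_commit_users_count": 10,
    "pr_reviewers_count": 18,
    "prs_merged_without_discussion_ratio": 0.5294117647058824,
    "prs_with_inconsistent_description_ratio": 0.0,
}
print(community_risk_flags(sample))  # ['many PRs merged without discussion']
```

Keeping the thresholds as keyword arguments makes it easy to tune them per ecosystem rather than hard-coding a single policy.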
class PayloadHiddenEvalResult(BaseModel):
    allows_binary_test_files: bool  # Whether binary files are allowed as tests
    allows_binary_document_files: bool  # Whether binary files are allowed as document files
    allows_binary_code_files: bool  # Whether binary files are allowed as code files
    allows_binary_asset_files: bool  # Whether binary files are allowed as assets
    allows_other_binary_files: bool  # Whether binary files are allowed as other files
    binary_files_count: int  # Total binary files detected

class DependencyEvalResult(BaseModel):
    self_priority_required_count: int  # Number of required packages corresponding to the repo
    self_priority_important_count: int  # Number of important packages corresponding to the repo
    self_priority_standard_count: int  # Number of standard packages corresponding to the repo
    self_essential_count: int  # Number of essential packages corresponding to the repo
    dependency_priority_required_count: int  # Number of required packages that have the repo as a dependency
    dependency_priority_important_count: int  # Number of important packages that have the repo as a dependency
    dependency_priority_standard_count: int  # Number of standard packages that have the repo as a dependency
    dependency_essential_count: int  # Number of essential packages that have the repo as a dependency

class CIEvalResult(BaseModel):
    has_dependabot: bool  # Whether repository has Dependabot enabled
    dangerous_token_permission_ratio: float  # Ratio of workflows with dangerous token permissions
    dangerous_action_provider_ratio: float  # Ratio of workflows with dangerous action providers
    dangerous_action_pin_ratio: float  # Ratio of workflows with dangerous action pins
    dangerous_trigger_ratio: float  # Ratio of workflows with dangerous triggers

from hsbriskevaluator.collector import GitHubRepoCollector
from hsbriskevaluator.utils.diff import Comparator
from hsbriskevaluator.utils.apt_utils import AptUtils
from hsbriskevaluator.utils.file import get_data_dir
github_collector = GitHubRepoCollector()
apt_utils = AptUtils()
comparator = Comparator(github_collector, apt_utils)
diff_result = comparator.clone_and_compare("xz-utils")

ci:
  dangerous_action_pin_ratio: 0.1111111111111111
  dangerous_action_provider_ratio: 0.7777777777777778
  dangerous_token_permission_ratio: 0.0
  dangerous_trigger_ratio: 0.0
  has_dependabot: false
community_quality:
  approvers_count: 18
  avg_participants_per_issue: 3.4875
  avg_participants_per_pr: 1.8316831683168318
  community_activity_score: 0.637
  community_users_count: 238
  direct_commit_users_count: 10
  direct_commits_ratio: 0.95
  forks_count: 164
  maintainers_count: 10
  prs_merged_without_discussion_ratio: 0.5294117647058824
  prs_needed_to_become_approver:
    0: 11
    1: 2
    2: 4
    6: 1
  prs_needed_to_become_maintainer:
    0: 2
    1: 2
    2: 4
    5: 1
    6: 1
  prs_with_inconsistent_description_ratio: 0.0
  required_approvals_distribution:
    0: 0.868421052631579
    1: 0.10526315789473684
    4: 0.02631578947368421
  stargazers_count: 799
  watchers_count: 24
dependency:
  dependency_essential_count: 8
  dependency_priority_important_count: 13
  dependency_priority_required_count: 10
  dependency_priority_standard_count: 15
  self_essential_count: 0
  self_priority_important_count: 0
  self_priority_required_count: 0
  self_priority_standard_count: 1
payload_hidden_difficulty:
  allows_binary_assets_files: false
  allows_binary_code_files: false
  allows_binary_document_files: false
  allows_binary_test_files: true
  allows_other_binary_files: false
  binary_files_count: 95
pkt_name:
- liblzma5
- xz-utils
url: https://github.com/tukaani-project/xz

- API Key Errors: Ensure your OpenRouter API key is correctly set in the .env file
- Rate Limiting: Reduce max_concurrency if hitting API rate limits
Enable verbose logging to see detailed evaluation information:
import logging
logging.basicConfig(
    level=logging.INFO,
)

To extend the evaluator:
- New Metrics: Add new evaluation metrics by extending the result models
- Custom Evaluators: Create custom evaluators by inheriting from BaseEvaluator
- Risk Algorithms: Modify risk calculation algorithms in the main evaluator
- LLM Prompts: Improve LLM prompts for better semantic analysis
See the source code for implementation details and extension points.
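As a rough picture of what a custom evaluator might look like, the sketch below defines its own stand-in `BaseEvaluator`, because the real base class interface is not shown here and may differ. `SignedCommitsEvaluator`, `SignedCommitsEvalResult`, and the dict-shaped `repo_info` are all hypothetical; only the async `evaluate()` pattern mirrors the usage examples earlier in this document.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


class BaseEvaluator(ABC):
    """Stand-in for the package's BaseEvaluator; its real interface may differ."""

    def __init__(self, repo_info):
        self.repo_info = repo_info

    @abstractmethod
    async def evaluate(self):
        ...


@dataclass
class SignedCommitsEvalResult:
    """Hypothetical result model holding a single new metric."""
    signed_commits_ratio: float


class SignedCommitsEvaluator(BaseEvaluator):
    """Hypothetical evaluator reporting the share of signed commits."""

    async def evaluate(self):
        commits = self.repo_info.get("commits", [])
        if not commits:
            return SignedCommitsEvalResult(signed_commits_ratio=0.0)
        signed = sum(1 for commit in commits if commit.get("signed"))
        return SignedCommitsEvalResult(signed_commits_ratio=signed / len(commits))


# Simplified stand-in for RepoInfo, used only for this example.
repo_info = {"commits": [{"signed": True}, {"signed": False}]}
result = asyncio.run(SignedCommitsEvaluator(repo_info).evaluate())
print(result)
```

In the real package the result type would more likely extend the existing result models, and the evaluator would read from the collected RepoInfo rather than a plain dict.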