A comprehensive system for evaluating machine learning models based on trustworthiness metrics including licensing, documentation quality, maintainer health, performance claims, and more.
- Simon Greenaway
- Luke Miller
- Ayddan Hartle
- Luke Meyer
- Overview
- Architecture
- Installation
- Usage
- Configuration
- Metrics System
- Testing
- Development
- Troubleshooting
- API Reference
This system provides automated evaluation of machine learning models across multiple trustworthiness dimensions. It analyzes models, their associated datasets, and code repositories to generate comprehensive trust scores.
- Multi-dimensional Evaluation: 8 different metrics covering licensing, documentation, maintainer health, performance claims, and more
- Automated Analysis: Processes models from URLs or batch files
- Caching System: Database-backed caching for efficient re-evaluation
- Extensible Architecture: Plugin-based metrics system for easy expansion
- Comprehensive Testing: Full test suite with coverage tracking
run (CLI entry point)
├── Orchestrator.py (Main controller)
│ ├── Installer.py (Dependency management)
│ ├── Tester.py (Test runner with coverage)
│ └── Url_Parser.py (URL processing)
└── src/app/
├── metrics/ (Evaluation system)
├── database/ (Caching layer)
├── integrations/ (External APIs)
└── dataset_tracker.py (Dataset inference)
- Input: URLs or batch files containing model/dataset/code references
- Processing: Extract metadata, analyze documentation, evaluate metrics
- Scoring: Apply weighted scoring across all metrics
- Output: NDJSON format with detailed scores and timing information
- Python 3.8+
- pip package manager
- Git (for repository analysis)
# Clone the repository
git clone <repository-url>
cd team_repo
# Install dependencies
python run install
# Verify installation
python run test

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install required packages
pip install -r requirements.txt

- `huggingface_hub[hf_xet]` - HuggingFace API integration
- `transformers` - Model metadata extraction
- `pytest` - Testing framework
- `pytest-cov` - Coverage reporting
- `pytest-json-report` - Machine-readable test reports
# Evaluate a single model
echo "https://huggingface.co/microsoft/DialoGPT-medium" > model.txt
python run model.txt

# Create a file with multiple URLs
cat > models.txt << EOF
https://huggingface.co/microsoft/DialoGPT-medium
https://huggingface.co/facebook/bart-large
https://huggingface.co/google/flan-t5-base
EOF
# Process all models
python run models.txt

# Format: code_url,dataset_url,model_url
cat > grouped_models.txt << EOF
https://github.com/microsoft/DialoGPT,https://huggingface.co/datasets/daily_dialog,https://huggingface.co/microsoft/DialoGPT-medium
,https://huggingface.co/datasets/squad,https://huggingface.co/facebook/bart-large
https://github.com/google-research/text-to-text-transfer-transformer,,https://huggingface.co/google/flan-t5-base
EOF
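Each line is plain CSV in the form code_url,dataset_url,model_url, and any field may be left empty. For illustration, here is how such a line splits apart (a minimal sketch; the project's own URL handling presumably lives in Url_Parser.py):

import csv

# Split each grouped line into its three URL fields (empty fields stay empty strings)
with open("grouped_models.txt", newline="") as f:
    for code_url, dataset_url, model_url in csv.reader(f):
        print(f"model={model_url!r}  dataset={dataset_url!r}  code={code_url!r}")

Then process the file: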
python run grouped_models.txt

# Run all tests
python run test
# Run with coverage report
python run test --cov=src --cov-report=html
# Run specific test file
python run test tests/test_metrics.py
# Run with verbose output
python run test -v

The system outputs NDJSON (newline-delimited JSON) with the following structure:
{
"URL": "https://huggingface.co/microsoft/DialoGPT-medium",
"NetScore": 0.75,
"NetScore_Latency": 2.34,
"license": 0.8,
"license_Latency": 0.12,
"ramp_up_time": 0.9,
"ramp_up_time_Latency": 0.45,
"bus_factor": 0.6,
"bus_factor_Latency": 0.23,
"performance_claims": 0.85,
"performance_claims_Latency": 0.67,
"size_score": 0.7,
"size_score_Latency": 0.15,
"dataset_and_code_score": 0.8,
"dataset_and_code_score_Latency": 0.34,
"dataset_quality": 0.75,
"dataset_quality_Latency": 0.28,
"code_quality": 0.65,
"code_quality_Latency": 0.41
}
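Because each result is a single JSON object on its own line, the output can be consumed line by line with any JSON parser. A minimal sketch (the file name here is hypothetical, assuming the tool's output was saved to it):

import json

with open("results.ndjson") as f:  # hypothetical file capturing the tool's output
    for line in f:
        record = json.loads(line)
        print(record["URL"], record["NetScore"])

# Log file path (must exist and be writable)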
export LOG_FILE="/path/to/logfile.log"
# Log verbosity level
export LOG_LEVEL=1  # 0=silent, 1=info, 2=debug

# GitHub token for enhanced repository analysis
export GITHUB_TOKEN="your_github_token_here"
# HuggingFace token for private models
export HUGGINGFACE_TOKEN="your_hf_token_here"

# Disable HuggingFace warnings
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
export TRANSFORMERS_NO_ADVISORY_WARNINGS=1
export TRANSFORMERS_VERBOSITY=error
export HF_HUB_VERBOSITY=error
export HF_HUB_DISABLE_PROGRESS_BARS=1

This project includes a configurable logging system to help track errors, warnings, and debug messages. Logging is controlled entirely through environment variables.
`LOG_FILE`: Path to the file where logs will be written
- Example: `/tmp/myapp.log` (Linux/macOS) or `C:\temp\myapp.log` (Windows)
- Default: `app.log` in the project root if not set
- Important: The file must exist and be writable
`LOG_LEVEL`: Verbosity of the logs
- `0` → Silent (no logs written)
- `1` → Informational messages (INFO, WARNING, ERROR, CRITICAL)
- `2` → Debug messages (DEBUG and above)
- Default: `0` (silent)
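In code, these variables map onto Python's standard logging module roughly as follows (a sketch of the behavior described above, not the actual contents of Logger.py):

import logging
import os

# LOG_LEVEL 0 silences all output; 1 and 2 map to INFO and DEBUG
LEVELS = {0: logging.CRITICAL + 1, 1: logging.INFO, 2: logging.DEBUG}

log_file = os.environ.get("LOG_FILE", "app.log")
log_level = int(os.environ.get("LOG_LEVEL", "0"))
logging.basicConfig(filename=log_file, level=LEVELS.get(log_level, logging.CRITICAL + 1))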
src/app/metrics/
├── base.py # Core data structures (ResourceBundle, MetricResult)
├── base_metric.py # Abstract base class for all metrics
├── registry.py # Auto-discovery system (@register decorator)
├── engine.py # Parallel orchestration + NDJSON output
├── bus_factor.py # Maintainer health evaluation
├── code_quality.py # Repository maintainability analysis
├── dataset_and_code.py # Ecosystem completeness scoring
├── dataset_quality.py # Dataset reliability metrics
├── license_metric.py # License compatibility analysis
├── llm_metadata_extractor.py # LLM-powered metadata extraction
├── performance_claims.py # Benchmark evidence validation
├── ramp_up_time.py # Documentation quality assessment
└── size.py # Device compatibility analysis
| Metric | Weight | Purpose | Scoring Focus |
|---|---|---|---|
| `license` | 15% | LGPLv2.1 compatibility analysis | License clarity, commercial use permissions |
| `ramp_up_time` | 15% | Documentation quality assessment | README completeness, examples, tutorials |
| `bus_factor` | 10% | Maintainer health evaluation | Contributor diversity, project sustainability |
| `performance_claims` | 20% | Benchmark evidence validation | Performance metrics credibility |
| `size_score` | 15% | Device compatibility analysis | Model size vs deployment targets |
| `dataset_and_code_score` | 10% | Ecosystem completeness | Linked datasets and code repositories |
| `dataset_quality` | 10% | Dataset reliability metrics | Data quality, popularity, maintenance |
| `code_quality` | 5% | Repository maintainability | Code style, testing, CI/CD practices |
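NetScore is the weighted combination of the individual metric scores. A minimal sketch of that combination, assuming the default weights above and per-metric scores in [0, 1] (the authoritative weights live in Orchestrator.py):

# Default metric weights from the table above
WEIGHTS = {
    "license": 0.15,
    "ramp_up_time": 0.15,
    "bus_factor": 0.10,
    "performance_claims": 0.20,
    "size_score": 0.15,
    "dataset_and_code_score": 0.10,
    "dataset_quality": 0.10,
    "code_quality": 0.05,
}

def net_score(scores):
    """Weighted sum of per-metric scores in [0, 1]; missing metrics count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())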
- Create a metric file in `src/app/metrics/`:

from app.metrics.base import ResourceBundle
from app.metrics.registry import register
from app.metrics.base_metric import BaseMetric

@register("my_new_metric")
class MyNewMetric(BaseMetric):
    name = "my_new_metric"

    def _compute_score(self, resource: ResourceBundle) -> float:
        # Your scoring logic here
        return 0.5  # Return a score between 0 and 1

    def _get_computation_notes(self, resource: ResourceBundle) -> str:
        return f"Analysis notes for {resource.model_url}"

- Update the metric weights in `Orchestrator.py` (lines 241-250)
- Add tests in `tests/test_metrics.py` (see the sketch below)
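A matching test can be as simple as checking the returned score range (a sketch using pytest; the metric's module name is hypothetical, and the import paths follow the debugging example in Troubleshooting):

from src.app.metrics.base import ResourceBundle
from src.app.metrics.my_new_metric import MyNewMetric  # hypothetical module for the metric above

def test_my_new_metric_score_in_range():
    metric = MyNewMetric()
    bundle = ResourceBundle(model_url="https://huggingface.co/gpt2")
    assert 0.0 <= metric._compute_score(bundle) <= 1.0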
# Run all tests with coverage
python run test
# Run specific test categories
python run test tests/test_metrics.py # Metrics system
python run test tests/test_database_basic.py # Database functionality
python run test tests/test_orchestrator.py # Core orchestration
# Generate HTML coverage report
python run test --cov=src --cov-report=html
open htmlcov/index.html  # View detailed coverage

tests/
├── conftest.py # Shared test fixtures
├── test_metrics.py # Core metrics system tests
├── test_metrics_fast.py # Fast metric validation tests
├── test_orchestrator.py # Orchestrator functionality
├── test_database_basic.py # Database operations
├── test_dataset_quality.py # Dataset metric specific tests
├── test_code_quality.py # Code quality metric tests
├── test_size.py # Size metric tests
└── test_cli_*.py # Command-line interface tests
- Minimum 70% line coverage across all modules
- Critical components (Orchestrator, metrics engine) should have 90%+ coverage
- All new metrics must include comprehensive tests
team_repo/
├── run # CLI entry point (executable)
├── Orchestrator.py # Main controller and workflow
├── Installer.py # Dependency installation
├── Tester.py # Test runner with reporting
├── Url_Parser.py # URL processing utilities
├── Logger.py # Logging configuration
├── requirements.txt # Python dependencies
├── pyproject.toml # Project metadata
├── mypy.ini # Type checking configuration
├── .flake8 # Code style configuration
├── src/app/ # Main application code
│ ├── metrics/ # Metrics evaluation system
│ ├── database/ # Caching and persistence
│ ├── integrations/ # External API clients
│ └── dataset_tracker.py # Dataset inference system
└── tests/ # Test suite
- Set up the development environment:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

- Run quality checks:
python run test # Run tests
python -m flake8 src/ # Code style
python -m mypy src/  # Type checking

- Add new features:
- Create feature branch
- Implement changes with tests
- Ensure coverage doesn't decrease
- Update documentation
- Follow PEP 8 style guide
- Use type hints for all public functions
- Maximum line length: 100 characters
- Use descriptive variable and function names
- Add docstrings for all public methods
The system uses SQLite for caching with the following structure:
CREATE TABLE metrics_cache (
id INTEGER PRIMARY KEY AUTOINCREMENT,
model_url TEXT NOT NULL,
dataset_urls TEXT,
code_urls TEXT,
result TEXT NOT NULL,
timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
UNIQUE(model_url, dataset_urls, code_urls)
);
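A cache lookup against this table could look like the following (a minimal sketch using the standard sqlite3 module; the project's real access layer lives in src/app/database/):

import sqlite3

def lookup_cached_result(db_path, model_url, dataset_urls, code_urls):
    """Return the cached NDJSON result for this URL combination, or None."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT result FROM metrics_cache "
            "WHERE model_url = ? AND dataset_urls = ? AND code_urls = ?",
            (model_url, dataset_urls, code_urls),
        ).fetchone()
        return row[0] if row else None
    finally:
        con.close()

Issue: ModuleNotFoundError when running the tool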
# Solution: Ensure dependencies are installed
python run install
# Or manually:
pip install -r requirements.txt

Issue: Permission denied on log file
# Solution: Create writable log file or unset LOG_FILE
touch app.log
chmod 666 app.log
# Or disable logging:
unset LOG_FILE

Issue: Network timeouts when fetching model data
# Solution: Check internet connection and try again
# Models are fetched from HuggingFace Hub - ensure access

Issue: "No metrics registered" error
# Solution: Verify metrics are properly imported
python -c "from src.app.metrics.registry import all_metrics; print(list(all_metrics().keys()))"Issue: Database lock errors
# Solution: Remove cache database to reset
rm -f src/app/database/metrics_cache.db

Issue: Tests fail with import errors
# Solution: Ensure Python path includes src/
export PYTHONPATH="${PYTHONPATH}:$(pwd)/src"
python run test

Issue: Coverage reports missing files
# Solution: Run tests from project root with proper coverage config
python run test --cov=src --cov-report=term-missing

- Enable caching: Use `use_cache=True` in `run_metrics()` calls
- Batch processing: Process multiple models in single runs
- Parallel evaluation: Metrics run in parallel automatically
- Network optimization: Set appropriate timeouts for API calls
- Enable debug logging:
export LOG_LEVEL=2
export LOG_FILE=debug.log
python run test

- Check metric computation:
# In Python console
from src.app.metrics.registry import all_metrics
from src.app.metrics.base import ResourceBundle
metrics = all_metrics()
bundle = ResourceBundle(model_url="https://huggingface.co/gpt2")
for name, factory in metrics.items():
    metric = factory()
    score = metric.compute_score(bundle)
    print(f"{name}: {score}")

def run_metrics(
    model_url: str,
    dataset_urls: List[str] = None,
    code_urls: List[str] = None,
    weights: Dict[str, float] = None,
    use_cache: bool = False,
    category: str = "MODEL"
) -> str:
    """
    Run metrics evaluation on a model.

    Args:
        model_url: Primary model URL (required)
        dataset_urls: Associated dataset URLs
        code_urls: Associated code repository URLs
        weights: Custom metric weights (defaults provided)
        use_cache: Enable database caching
        category: Evaluation category (MODEL, DATASET, etc.)

    Returns:
        NDJSON string with metric results
    """
"""
Process URLs from a file.
Args:
file_path: Path to file containing URLs or CSV data
Returns:
0 on success, non-zero on error
"""class BaseMetric:
    name: str  # Metric identifier

    def compute_score(self, resource: ResourceBundle) -> MetricResult:
        """Compute metric score with timing."""

    def _compute_score(self, resource: ResourceBundle) -> float:
        """Override this method in subclasses."""

    def _get_computation_notes(self, resource: ResourceBundle) -> str:
        """Override to provide computation details."""

@dataclass
class ResourceBundle:
    model_url: str
    dataset_urls: List[str] = field(default_factory=list)
    code_urls: List[str] = field(default_factory=list)
    model_id: str = ""

def get_database() -> Database:
"""Get database instance with caching support."""
class Database:
    def cache_result(self, model_url: str, result: str,
                     dataset_urls: List[str] = None,
                     code_urls: List[str] = None) -> None:
        """Cache evaluation result."""

    def get_cached_result(self, model_url: str,
                          dataset_urls: List[str] = None,
                          code_urls: List[str] = None) -> Optional[str]:
        """Retrieve cached result."""