The RUNPOD_API_KEY in .env.worker has access to ALL RunPod pods across the ENTIRE account, not just the test organization's pods. NEVER call runpod.get_pods() and terminate all returned pods — this will kill other users' workloads. Always terminate pods by specific pod_id only, obtained from the worker database record for the target organization.
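A minimal sketch of the safe pattern, assuming the standard `runpod` Python SDK and a Supabase client; the worker-table query shape and the `worker_id` value are illustrative assumptions:

```python
import os

import runpod
from supabase import create_client

runpod.api_key = os.environ["RUNPOD_API_KEY"]  # account-wide key: handle with care
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])

worker_id = "worker-123"  # hypothetical: the specific worker record to shut down

# Look up the pod_id recorded for the target organization's worker.
row = (
    supabase.table("worker")
    .select("pod_id")
    .eq("id", worker_id)
    .single()
    .execute()
    .data
)

# Terminate ONLY that specific pod; never iterate over runpod.get_pods().
runpod.terminate_pod(row["pod_id"])
```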
OpenWeights is a Python SDK for running distributed compute jobs on managed RunPod GPU infrastructure. It provides a simple, OpenAI-like API with full flexibility for custom workloads including fine-tuning, inference, evaluations, and arbitrary Python scripts.
Key Features:
- Simple Python SDK with OpenAI-compatible interfaces
- Full flexibility to define custom jobs with arbitrary Docker images and entrypoints
- Automated management of RunPod GPU infrastructure
- Multi-tenancy with organization-based isolation
- Content-addressable job and file IDs for deduplication
A job is the fundamental unit of work in OpenWeights. It consists of three components:
- Docker Image: The container environment (e.g., `nielsrolf/ow-default` or a custom image)
- Mounted Files: Files uploaded to Supabase Storage and mounted into the container
- Entrypoint: The command/script to execute (e.g., `python train.py --model=llama`)
Jobs can be:
- Built-in jobs: Pre-configured templates for common tasks (fine-tuning with Unsloth, inference with vLLM, Inspect AI evaluations)
- Custom jobs: User-defined jobs using the `@register` decorator and the `Jobs` base class
Jobs progress through the following states:
- `pending`: Job is queued, waiting for a worker
- `in_progress`: Job is currently executing on a worker
- `completed`: Job finished successfully
- `failed`: Job encountered an error
- `canceled`: Job was manually canceled or timed out
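For illustration, a client-side wait loop over these states might look like the sketch below; the `ow.jobs.retrieve(...)` accessor and the returned field names are assumptions, not confirmed SDK methods.

```python
import time

TERMINAL_STATES = {"completed", "failed", "canceled"}

def wait_for_job(ow, job_id, poll_seconds=10):
    """Poll a job until it reaches a terminal state (accessor name assumed)."""
    while True:
        job = ow.jobs.retrieve(job_id)
        status = job["status"] if isinstance(job, dict) else job.status
        if status in TERMINAL_STATES:
            return job
        time.sleep(poll_seconds)
```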
Jobs are reusable templates that define what to execute:
- Identified by a content hash of their parameters (e.g., `unsloth-abc123def456`)
- If you submit the same job twice, it reuses the existing job (deduplication)
- Contain: docker image, script/entrypoint, parameters, VRAM requirements, hardware constraints
Runs are individual executions of a job:
- Each job can have multiple runs (e.g., if restarted after failure)
- Track execution status, assigned worker, and log file
- Created when a worker picks up a job or when using the `ow.run` context
Events are structured logs/outputs during a run:
- Store arbitrary JSON data (metrics, checkpoints, errors)
- Can reference uploaded files (model checkpoints, outputs)
- Used to track progress and collect results
Relationship:
Job (1) ──< (many) Runs (1) ──< (many) Events
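Because the relationship is plain one-to-many in the database, it can be traversed with ordinary queries. The sketch below goes straight at the Supabase tables and assumes `job_id`/`run_id` foreign-key column names, which are not confirmed here.

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])
job_id = "unsloth-abc123def456"  # example job ID

# One job -> many runs; one run -> many events.
runs = supabase.table("runs").select("*").eq("job_id", job_id).execute().data
for run in runs:
    events = supabase.table("events").select("*").eq("run_id", run["id"]).execute().data
    print(run["id"], run["status"], f"{len(events)} events")
```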
OpenWeights follows a queue-based architecture with three main components:
Database Tables:
- `jobs`: Job definitions and status
- `runs`: Execution records linking jobs to workers
- `events`: Structured logs and outputs from runs
- `files`: File metadata (actual files stored in Supabase Storage)
- `worker`: Worker registration and health tracking
- `organizations`: Multi-tenant isolation
- `organization_secrets`: API keys and credentials (HF_TOKEN, RUNPOD_API_KEY, etc.)
- `service_account_tokens`: JWT tokens for API authentication
Key Features:
- Row Level Security (RLS) ensures organization isolation
- Atomic job acquisition using PostgreSQL functions (`acquire_job`, `update_job_status_if_in_progress`)
- Content-addressable IDs prevent duplicate jobs and files
Architecture:
- Supervisor (`cluster/supervisor.py`): Top-level process that spawns one manager per organization
- Organization Manager (`cluster/org_manager.py`): Manages GPU workers for a single organization
Responsibilities:
- Monitor job queue for pending jobs
- Provision RunPod workers when jobs arrive
- Scale workers based on demand (up to MAX_WORKERS per org)
- Terminate idle workers (idle > 5 minutes)
- Clean up unresponsive workers (no ping > 2 minutes)
- Match jobs to hardware based on VRAM requirements and `allowed_hardware` constraints
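Put together, the manager's control loop looks roughly like the sketch below; every helper name in it (`list_pending_jobs`, `provision_worker`, and so on) is an illustrative stand-in rather than the actual `cluster/org_manager.py` API.

```python
import time

IDLE_LIMIT_S = 5 * 60    # terminate workers idle for more than 5 minutes
PING_TIMEOUT_S = 2 * 60  # clean up workers that have not pinged for 2 minutes

def manage_org(org_id, max_workers):
    while True:
        pending = list_pending_jobs(org_id)
        workers = list_active_workers(org_id)

        # Scale up: provision hardware for queued jobs, up to the org limit.
        for job in pending[: max_workers - len(workers)]:
            provision_worker(
                org_id,
                vram_gb=job.requires_vram_gb,
                allowed_hardware=job.allowed_hardware,
            )

        # Scale down / clean up idle or unresponsive workers.
        for w in workers:
            if w.idle_seconds > IDLE_LIMIT_S or w.seconds_since_ping > PING_TIMEOUT_S:
                terminate_worker(w)  # reverts in-progress jobs and kills the pod

        time.sleep(30)
```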
Worker Provisioning:
- Determines GPU type based on the job's `requires_vram_gb` and `allowed_hardware`
- Supports multi-GPU configurations (1x, 2x, 4x, 8x GPUs)
- Creates a worker record in the database with `status='starting'`
- Launches a RunPod pod with the appropriate Docker image and environment variables
- Updates the worker record with `pod_id` when the pod is ready
Worker Lifecycle:
- Initialization (`worker/main.py`):
  - Detects GPU configuration (type, count, VRAM)
  - Runs GPU health checks
  - Registers in the database with hardware specs
  - Starts the health-check background thread
- Job Acquisition:
  - Polls the database for pending jobs matching its Docker image
  - Filters by hardware compatibility (VRAM or `allowed_hardware`)
  - Prefers jobs with cached models
  - Uses the `acquire_job()` RPC for atomic job claiming
- Job Execution:
  - Downloads mounted files from Supabase Storage
  - Creates a temporary directory for job execution
  - Runs the job script with the `OPENWEIGHTS_RUN_ID` environment variable
  - Streams logs to a local file and stdout
  - Monitors for cancellation signals
- Result Collection:
  - Uploads the log file to Supabase Storage
  - Uploads files from the `/uploads` directory as results
  - Creates events with file references
  - Updates job status atomically
- Health Monitoring:
  - Pings the database every 5 seconds
  - Checks for job cancellation or timeout
  - Listens for a shutdown signal from the cluster manager
- Shutdown:
  - Reverts in-progress jobs to pending (if the worker dies)
  - Uploads final logs
  - Terminates the RunPod pod
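Condensed into code, the lifecycle reads roughly as follows; the helper functions are illustrative placeholders, not the actual `worker/main.py` internals.

```python
import os
import tempfile
import time

def worker_loop():
    register_worker(detect_gpus())               # Initialization
    while not shutdown_requested():              # Health monitoring runs in a thread
        job = acquire_pending_job()              # Atomic claim via acquire_job()
        if job is None:
            time.sleep(5)
            continue
        with tempfile.TemporaryDirectory() as workdir:
            download_mounted_files(job, workdir)                  # Job execution
            env = {**os.environ, "OPENWEIGHTS_RUN_ID": job.run_id}
            exit_code = run_entrypoint(job.entrypoint, cwd=workdir, env=env)
            upload_logs_and_results(job, workdir)                 # Result collection
            finalize_job_status(job, exit_code)
```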
- Sign Up: Users create accounts via Supabase Auth in the dashboard
- Organization Creation: Users create organizations in the dashboard UI
- API Key Generation:
  - Users create API tokens via the CLI: `ow token create --name "my-token"`
  - API tokens are prefixed with `ow_` and stored securely in the `api_tokens` table
  - Tokens can optionally have expiration dates and can be revoked
  - Format: `ow_` followed by a randomly generated secure token
Client-Side:
```python
ow = OpenWeights(auth_token=os.getenv("OPENWEIGHTS_API_KEY"))
```

The client:
- Accepts an OpenWeights API token (starting with `ow_`)
- Automatically exchanges the API token for a short-lived JWT using the `exchange_api_token_for_jwt()` RPC
- Passes the JWT in the `Authorization` header to Supabase
- Extracts the organization ID from the JWT using the `get_organization_from_token()` RPC
- Supports backwards compatibility: if the token is already a JWT (doesn't start with `ow_`), it is used directly
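A rough sketch of the same exchange done manually with `supabase-py`; only the RPC function names come from the description above, the parameter names are assumptions.

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])

token = os.environ["OPENWEIGHTS_API_KEY"]
if token.startswith("ow_"):
    # Exchange the long-lived API token for a short-lived JWT (param name assumed).
    token = supabase.rpc("exchange_api_token_for_jwt", {"api_token": token}).execute().data

# Attach the JWT so every PostgREST query is evaluated under RLS.
supabase.postgrest.auth(token)
org_id = supabase.rpc("get_organization_from_token", {}).execute().data
```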
Database-Side:
- Supabase Row Level Security (RLS) policies automatically filter queries
- Policies check the `organization_id` column against the authenticated token's organization
- Ensures users can only access their organization's jobs, runs, events, files, and workers
Key RLS Policies:
- Jobs: Can only query/insert/update jobs where `organization_id` matches the token's organization
- Files: Can only access files stored under the `organizations/{org_id}/` path
- Workers: Can only view workers belonging to their organization
- Events/Runs: Accessible through their parent job's organization
Workers can operate in two modes:
- User-Provided Token: Uses the organization's service account token from environment
- Auto-Generated Token: Worker creates its own service account token at startup using the `create_service_account_token()` RPC
Both approaches leverage RLS to ensure workers can only access their organization's data.
OpenWeights class (__init__.py):
- Entry point for SDK
- Initializes Supabase client with auth token
- Provides accessors for jobs, runs, events, files, chat
- Supports custom job registration via the `@register` decorator
Jobs class (jobs.py):
- Base class for job definitions
- Handles file uploads and mounting
- Computes content-addressable job IDs
- Implements `get_or_create_or_reset()` for job deduplication
Run class (run.py):
- Represents a single job execution
- Created automatically when jobs execute
- Provides logging and file upload from within jobs
- Can be used standalone for script-based jobs
Files class (files.py):
- Content-addressable file storage
- Format: `{purpose}:file-{hash[:12]}`
- Validates conversation/preference datasets
- Handles organization-specific storage paths
Events class (events.py):
- Structured logging for runs
- Supports file attachments
- Provides `latest()` to extract the most recent metric values
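Inside a job script, the `Run` and `Events` pieces are typically used together; a minimal sketch, assuming a logging method along the lines of `ow.run.log(...)` and a constructor that falls back to environment credentials (neither is confirmed here):

```python
import os
from openweights import OpenWeights

# Inside a job, the SDK picks up OPENWEIGHTS_RUN_ID from the worker-provided
# environment, so events logged here attach to the correct run (assumption).
ow = OpenWeights()
assert os.environ.get("OPENWEIGHTS_RUN_ID") is not None

for step, loss in [(100, 0.42), (200, 0.31)]:
    # Each call becomes a structured event row (method name assumed).
    ow.run.log({"step": step, "loss": loss})
```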
Jobs:
- SFT (Supervised Fine-Tuning)
- DPO (Direct Preference Optimization)
- ORPO (Odds Ratio Preference Optimization)
- Weighted SFT (token-level loss weighting)
Features:
- Built on Unsloth for memory-efficient training
- Automatic model upload to Hugging Face
- Support for LoRA/QLoRA
- Checkpoint tracking via events
- Log probability tracking
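As a usage sketch: upload a conversations dataset, then submit an SFT job. The client construction follows the earlier example; the `ow.files.create(...)` / `ow.fine_tuning.create(...)` method names and parameters are assumptions in the spirit of the OpenAI-style interface, not confirmed signatures.

```python
import os
from openweights import OpenWeights

ow = OpenWeights(auth_token=os.getenv("OPENWEIGHTS_API_KEY"))

# Upload a validated conversations dataset (JSONL).
with open("train.jsonl", "rb") as f:
    train_file = ow.files.create(f, purpose="conversations")

# Submit a supervised fine-tuning job (parameters are illustrative).
job = ow.fine_tuning.create(
    model="unsloth/Meta-Llama-3.1-8B-Instruct",
    training_file=train_file["id"],
    loss="sft",            # "dpo" / "orpo" would pair with a preference dataset
    requires_vram_gb=48,
)
```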
Backend: vLLM
Features:
- Batch inference on JSONL datasets
- OpenAI-compatible API endpoints
- Support for conversation and text completion formats
- Automatic result file upload
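A similar hedged sketch for batch inference; `ow.inference.create(...)` and its parameter names are assumptions for illustration only.

```python
import os
from openweights import OpenWeights

ow = OpenWeights(auth_token=os.getenv("OPENWEIGHTS_API_KEY"))

# Upload prompts in conversation format, then run batch inference over them.
with open("prompts.jsonl", "rb") as f:
    prompts = ow.files.create(f, purpose="conversations")

job = ow.inference.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    input_file_id=prompts["id"],   # parameter name assumed
    max_tokens=512,
    temperature=0.7,
)
```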
Backend: Inspect AI framework
Features:
- Run evaluations from the Inspect AI library
- Automatic result download
- Flexible eval options pass-through
Users can define custom jobs:
```python
from openweights import OpenWeights, register, Jobs
from pydantic import BaseModel


class MyParamsModel(BaseModel):
    # Example parameter schema (defined here so the snippet is complete)
    arg: str = 'default'


@register('my_job')
class MyCustomJob(Jobs):
    mount = {'local/script.py': 'script.py'}
    params = MyParamsModel  # Pydantic model
    requires_vram_gb = 24
    base_image = 'nielsrolf/ow-default'

    def get_entrypoint(self, params):
        return f'python script.py --arg={params.arg}'
```

The `openweights/jobs/` directory contains several built-in job implementations:
- `unsloth/`: Fine-tuning jobs
- `weighted_sft/`: Token-weighted SFT
- `inference/`: vLLM inference
- `vllm/`: vLLM configuration
- `inspect_ai.py`: Inspect AI evaluations
- `mmlu_pro/`: MMLU Pro evaluation
Important: These are simply convenient job definitions included in the repository. There is nothing architecturally special about them—they could just as easily live in external repositories or be defined by users in their own codebases.
Backend (backend/main.py): FastAPI service
- REST API for job/run/worker management
- Proxies Supabase with additional business logic
- Token management endpoints
- File content serving
Frontend (frontend/src/): React + TypeScript
- Job/run/worker list and detail views
- Real-time log streaming
- Metrics visualization
- Organization management
- Token creation and management
Supabase Storage (files bucket):
- Organization-scoped paths: `organizations/{org_id}/{file_id}`
- Files are content-addressed with a purpose prefix: `{purpose}:file-{hash[:12]}`
- RLS policies enforce organization boundaries
File Types:
- `conversations`: Training datasets (validated JSONL)
- `preference`: Preference datasets for DPO/ORPO
- `result`: Job outputs (model checkpoints, predictions)
- `log`: Execution logs
- `custom_job_file`: Mounted files for custom jobs
GPU Selection:
- Jobs specify `requires_vram_gb` (default: 24)
- Optionally specify an `allowed_hardware` list (e.g., `["2x A100", "4x H100"]`)
- Cluster manager determines GPU type and count from the `HARDWARE_CONFIG` mapping
- Workers register their exact hardware type (e.g., "2x L40")
Supported GPUs:
- NVIDIA L40, A100, A100S, H100N, H100S, H200
- Multi-GPU: 1x, 2x, 4x, 8x configurations
- Configurable in `cluster/start_runpod.py`
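For illustration, the mapping and the selection step might look like the sketch below; the concrete entries and the helper are assumptions, not the actual contents of `cluster/start_runpod.py`.

```python
# Hypothetical shape of the HARDWARE_CONFIG mapping (values are illustrative).
HARDWARE_CONFIG = {
    "1x A100": {"gpu_type": "NVIDIA A100 80GB PCIe", "gpu_count": 1, "vram_gb": 80},
    "2x A100": {"gpu_type": "NVIDIA A100 80GB PCIe", "gpu_count": 2, "vram_gb": 160},
    "4x H100": {"gpu_type": "NVIDIA H100 80GB HBM3", "gpu_count": 4, "vram_gb": 320},
}

def pick_hardware(requires_vram_gb=24, allowed_hardware=None):
    """Pick the smallest configuration that satisfies the job's constraints."""
    candidates = allowed_hardware or list(HARDWARE_CONFIG)
    viable = [h for h in candidates if HARDWARE_CONFIG[h]["vram_gb"] >= requires_vram_gb]
    return min(viable, key=lambda h: HARDWARE_CONFIG[h]["vram_gb"])
```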
Worker Matching:
- Workers filter jobs by Docker image first
- Then by hardware compatibility (VRAM or `allowed_hardware` match)
- Prefer jobs with cached models
Job Atomicity:
- `acquire_job()`: Atomically transitions a job from pending → in_progress
- `update_job_status_if_in_progress()`: Only updates if the job is still assigned to the worker
- Prevents race conditions when multiple workers or managers interact
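From a worker's perspective, claiming a job through these functions could look like the sketch below; only the function names come from the schema description, the RPC parameter names are assumptions.

```python
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])
worker_id = "worker-123"  # hypothetical worker record ID

# Atomically claim one pending job for this worker (parameter names assumed).
job = supabase.rpc(
    "acquire_job", {"worker_id": worker_id, "docker_image": "nielsrolf/ow-default"}
).execute().data

if job:
    # ... run the job, then update its status only if we still own it.
    supabase.rpc(
        "update_job_status_if_in_progress",
        {"job_id": job["id"], "worker_id": worker_id, "new_status": "completed"},
    ).execute()
```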
Worker Failure Handling:
- Unresponsive Workers (no ping > 2 min):
  - Cluster manager reverts their in-progress jobs to pending
  - Terminates the RunPod pod
  - Marks the worker as terminated
- Worker Crashes:
  - An `atexit` handler attempts to revert jobs to pending
  - The cluster manager's health check catches missed cases
- Repeated Failures:
  - Workers track the last 5 job outcomes
  - Self-terminate if all 5 failed (likely a bad worker)
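The crash-recovery hook amounts to an `atexit` handler along these lines, reusing the `supabase` client and `worker_id` from the acquisition sketch above; the table/column names and the direct-update approach are assumptions about the worker code.

```python
import atexit

def _revert_my_jobs():
    # If the process dies mid-job, put any job still marked in_progress for
    # this worker back into the queue so another worker can pick it up.
    (
        supabase.table("jobs")
        .update({"status": "pending", "worker_id": None})  # column names assumed
        .eq("worker_id", worker_id)
        .eq("status", "in_progress")
        .execute()
    )

atexit.register(_revert_my_jobs)
```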
Job IDs:
- `job_id = f"{job_type}-{sha256(params + org_id).hex()[:12]}"`
- Deterministic based on parameters and organization
- Resubmitting identical job returns existing job
- Optional suffix for manual job variants
File IDs:
- `file_id = f"{purpose}:file-{sha256(content + org_id).hex()[:12]}"`
- Automatic deduplication within the organization
- Content changes = new file ID
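Both formulas can be reproduced in a few lines; the exact serialization of `params` before hashing is an assumption.

```python
import hashlib
import json

def job_id(job_type: str, params: dict, org_id: str, suffix: str = "") -> str:
    # Deterministic serialization so identical submissions hash identically.
    payload = json.dumps(params, sort_keys=True) + org_id
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return f"{job_type}-{digest}-{suffix}" if suffix else f"{job_type}-{digest}"

def file_id(purpose: str, content: bytes, org_id: str) -> str:
    digest = hashlib.sha256(content + org_id.encode()).hexdigest()[:12]
    return f"{purpose}:file-{digest}"

# Same parameters + same organization => same ID => the existing job/file is reused.
```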
Horizontal Scaling:
- One organization manager per organization
- Managers provision workers dynamically
- Workers execute jobs concurrently
Cost Optimization:
- Idle workers terminated after 5 minutes
- Content addressing prevents redundant work
- Workers prefer cached models to reduce download time
Limits:
- `MAX_WORKERS_PER_ORG`: Default 8 (configurable per org)
- Worker TTL: 24 hours (configurable, extendable from within the pod)
Worker Health:
- Ping every 5 seconds
- GPU health checks at startup
- Log aggregation via Supabase Storage
Job Progress:
- Events table for structured logging
- Real-time log streaming in dashboard
- Metrics visualization (loss curves, accuracy, etc.)
System State:
- Database tables provide complete audit trail
- Worker status: starting, active, shutdown, terminated
- Job status: pending, in_progress, completed, failed, canceled