Hardware-aware AI workload scheduler with kernel bypass networking.
GPU topology discovery, NVLink-aware placement, DPDK zero-copy transfers,
and predictive autoscaling for production GPU clusters.
Features · Architecture · Quick Start · Benchmarks · Contributing
CORTEX - GPU cluster dashboard with topology view, scheduling timeline, and memory allocation
- Features
- Architecture Overview
- GPU Topology Awareness
- Memory Management: Buddy Allocator
- Kernel Bypass Networking
- Components
- Quick Start
- Benchmarks
- Configuration
- Contributing
- License
## Features

- NVML Topology Discovery -- Reads the physical GPU interconnect hierarchy at runtime
- NVLink-Aware Bin-Packing -- Places workloads on GPUs connected via high-bandwidth NVLink
- NUMA Affinity Scoring -- Weighted affinity model across NVLink, PCIe, NUMA, and socket boundaries
- DPDK Kernel Bypass -- Eliminates kernel overhead for tensor transfers (~5us vs ~100us)
- io_uring Zero-Copy -- Asynchronous I/O with zero-copy semantics for data plane operations
- Buddy Allocator -- Efficient GPU memory management with splitting, coalescing, and defragmentation
- Prophet + LSTM Autoscaling -- Ensemble forecasting for predictive cluster scaling
- Energy & Carbon Tracking -- NVML power monitoring, dynamic frequency scaling, carbon-aware scheduling
- Kubernetes Native -- Scheduler extender, device plugin, and custom metrics server
## Architecture Overview

```
┌───────────────────────────────────────────────────────────────────┐
│                       CORTEX Control Plane                        │
│                                                                   │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │
│  │  Scheduler  │ │ Autoscaler  │ │   Energy    │ │  Dashboard  │  │
│  │   (Rust)    │ │  (Python)   │ │   (Rust)    │ │   (React)   │  │
│  │             │ │             │ │             │ │             │  │
│  │ - Topology  │ │ - Prophet   │ │ - NVML      │ │ - Topology  │  │
│  │ - Affinity  │ │ - LSTM      │ │ - Clock     │ │ - Heatmap   │  │
│  │ - Placement │ │ - K8s HPA   │ │ - Carbon    │ │ - Latency   │  │
│  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘  │
│         │               │               │               │         │
│  ┌──────┴───────────────┴───────────────┴───────────────┴──────┐  │
│  │                    Kubernetes Integration                    │  │
│  │  ┌──────────────┐  ┌───────────────┐  ┌──────────────────┐  │  │
│  │  │  Scheduler   │  │ Device Plugin │  │  Custom Metrics  │  │  │
│  │  │ Extender(Go) │  │     (Go)      │  │   Server (Go)    │  │  │
│  │  └──────────────┘  └───────────────┘  └──────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│                             Data Plane                            │
│                                                                   │
│  ┌─────────────────┐      ┌─────────────────────────────────┐     │
│  │  Network Layer  │      │          Memory Manager         │     │
│  │     (Rust)      │      │             (Rust)              │     │
│  │                 │      │                                 │     │
│  │ - DPDK Bypass   │      │ - Buddy Allocator               │     │
│  │ - io_uring      │─────▶│ - Defragmenter                  │     │
│  │ - Tensor Xfer   │      │ - Pattern Monitor               │     │
│  │ - Zero-copy     │      │ - Compactor                     │     │
│  └─────────────────┘      └─────────────────────────────────┘     │
└───────────────────────────────────────────────────────────────────┘
```
## GPU Topology Awareness

CORTEX reads the physical GPU topology via NVML and makes placement decisions that respect the hardware interconnect hierarchy:
```
            ┌─────────┐        QPI/UPI        ┌─────────┐
            │  CPU 0  │◀─────────────────────▶│  CPU 1  │
            │ (NUMA 0)│                       │ (NUMA 1)│
            └────┬────┘                       └────┬────┘
                 │                                 │
        ┌────────┴────────┐                        │
        │                 │                        │
   ┌────┴────┐       ┌────┴────┐              ┌────┴────┐
   │  PCIe   │       │  PCIe   │              │  PCIe   │
   │ Switch 0│       │ Switch 1│              │  Root   │
   └─┬─────┬─┘       └─┬─────┬─┘              │ Complex │
     │     │           │     │                └─┬─────┬─┘
     │     │           │     │                  │     │
 ┌───┴──┐┌─┴────┐  ┌───┴──┐┌─┴────┐         ┌───┴──┐┌─┴────┐
 │GPU 0 ││GPU 1 │  │GPU 2 ││GPU 3 │         │GPU 4 ││GPU 5 │
 └───┬──┘└──┬───┘  └───┬──┘└──┬───┘         └───┬──┘└──┬───┘
     └NVLink┘          └NVLink┘                 └NVLink┘
```
Affinity Score Calculation:

| Relationship | Score |
|---|---|
| Same GPU | 1.0 |
| NVLink Peer | 0.9 |
| Same PCIe Switch | 0.7 |
| Same NUMA Node | 0.5 |
| Cross NUMA (QPI) | 0.2 |
| Cross Socket | 0.1 |
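As a sketch of how these scores might drive a placement decision, here is an illustrative pure-Python model (the real scheduler is Rust; `pair_score`, `best_gpu_pair`, and the topology dict are hypothetical stand-ins for what NVML discovery would report, not CORTEX's API):

```python
# Scores from the affinity table; keys name the closest boundary
# shared by a pair of GPUs.
AFFINITY = {
    "same_gpu": 1.0,
    "nvlink": 0.9,
    "pcie_switch": 0.7,
    "numa_node": 0.5,
    "cross_numa": 0.2,
    "cross_socket": 0.1,
}

def pair_score(topology: dict, a: int, b: int) -> float:
    """Score a GPU pair by the closest interconnect they share."""
    if a == b:
        return AFFINITY["same_gpu"]
    link = topology.get((min(a, b), max(a, b)), "cross_socket")
    return AFFINITY[link]

def best_gpu_pair(topology: dict, gpus: list) -> tuple:
    """Greedy placement: pick the distinct pair with the highest affinity."""
    pairs = [(a, b) for a in gpus for b in gpus if a < b]
    return max(pairs, key=lambda p: pair_score(topology, *p))

# Toy topology: GPUs 0-1 are NVLink peers, 1-2 share a PCIe switch,
# and 0-2 only share a NUMA node.
topo = {(0, 1): "nvlink", (1, 2): "pcie_switch", (0, 2): "numa_node"}
print(best_gpu_pair(topo, [0, 1, 2]))  # (0, 1)
```

A two-GPU workload would thus land on the NVLink-connected pair before falling back to PCIe or NUMA neighbors.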
## Memory Management: Buddy Allocator

```
GPU Memory Space (e.g., 16 GB)
┌───────────────────────────────────────────────────────┐
│                    Order 10 (16 GB)                   │
├───────────────────────────┬───────────────────────────┤
│      Order 9 (8 GB)       │      Order 9 (8 GB)       │
├─────────────┬─────────────┼─────────────┬─────────────┤
│  O8 (4GB)   │  O8 (4GB)   │  O8 (4GB)   │  O8 (4GB)   │
├──────┬──────┼──────┬──────┼──────┬──────┼──────┬──────┤
│O7 2G │O7 2G │O7 2G │O7 2G │O7 2G │O7 2G │O7 2G │O7 2G │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
```

- **Allocation** -- split larger blocks until one of the right size is found
- **Deallocation** -- merge free buddy pairs back into larger blocks
- **Defragmentation** -- compact scattered allocations
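The split-on-allocate / merge-on-free cycle can be sketched with a minimal free-list buddy allocator. This is an illustrative Python model, not CORTEX's Rust allocator: sizes are abstract units of `2**order`, and the defragmentation/compaction passes are omitted:

```python
class BuddyAllocator:
    """Minimal buddy allocator over a 2**max_order sized space."""

    def __init__(self, max_order: int):
        self.max_order = max_order
        # free[k] holds offsets of free blocks of size 2**k
        self.free = {k: set() for k in range(max_order + 1)}
        self.free[max_order].add(0)  # start with one maximal block

    def alloc(self, order: int) -> int:
        # Find the smallest free block that fits, splitting on the way down.
        k = order
        while k <= self.max_order and not self.free[k]:
            k += 1
        if k > self.max_order:
            raise MemoryError("out of memory")
        offset = self.free[k].pop()
        while k > order:  # split: keep the left half, free the right buddy
            k -= 1
            self.free[k].add(offset + (1 << k))
        return offset

    def free_block(self, offset: int, order: int) -> None:
        # Merge with the buddy for as long as the buddy is also free.
        while order < self.max_order:
            buddy = offset ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            offset = min(offset, buddy)
            order += 1
        self.free[order].add(offset)

a = BuddyAllocator(max_order=4)  # 16-unit space
x = a.alloc(2)                   # 4-unit block -> offset 0 (splits 16 -> 8 -> 4)
y = a.alloc(2)                   # second 4-unit block -> offset 4
a.free_block(x, 2)
a.free_block(y, 2)               # buddies merge back...
assert 0 in a.free[4]            # ...into the single 16-unit block
```

The buddy of a block is found by XOR-ing its offset with the block size, which is what makes coalescing O(1) per level.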
## Kernel Bypass Networking

```
Traditional Path:            CORTEX Path:

┌──────────┐                 ┌──────────┐
│   App    │                 │   App    │
├──────────┤                 ├──────────┤
│   gRPC   │                 │  Tensor  │
├──────────┤                 │ Protocol │
│  HTTP/2  │                 ├──────────┤
├──────────┤       vs.       │ io_uring │
│   TCP    │                 │ zero-copy│
├──────────┤                 ├──────────┤
│  Kernel  │                 │   DPDK   │
│  Network │                 │ (bypass) │
│  Stack   │                 └────┬─────┘
├──────────┤                      │
│   NIC    │                 ┌────┴─────┐
└──────────┘                 │   NIC    │
                             └──────────┘
Latency: ~100us              Latency: ~5us
```
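DPDK and io_uring operate below the language runtime, but the core zero-copy idea (handing out views into one registered buffer instead of copying payload bytes) can be illustrated with Python's `memoryview`. `chunk_tensor` is a hypothetical helper, not part of CORTEX; the 4 MB chunk size mirrors `tensor_chunk_size` in the sample configuration:

```python
CHUNK = 4 * 1024 * 1024  # 4 MB, mirroring tensor_chunk_size in the sample config

def chunk_tensor(buf: bytearray, chunk_size: int = CHUNK):
    """Yield zero-copy views into buf; no payload bytes are duplicated."""
    view = memoryview(buf)
    for off in range(0, len(view), chunk_size):
        # Slicing a memoryview yields another view over the same memory.
        yield view[off:off + chunk_size]

tensor = bytearray(10 * 1024 * 1024)   # stand-in for a 10 MB tensor buffer
chunks = list(chunk_tensor(tensor))
print([len(c) for c in chunks])        # [4194304, 4194304, 2097152]
chunks[0][0] = 0xFF                    # writing through a view mutates the
assert tensor[0] == 0xFF               # underlying buffer: zero copies made
```

The real data plane does the analogous thing with NIC-registered DMA buffers, so a tensor crosses the wire without ever being duplicated in host memory.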
## Components

| Component | Language | Description |
|---|---|---|
| `scheduler/` | Rust | GPU topology discovery, NVLink-aware bin-packing, NUMA affinity |
| `network/` | Rust | DPDK kernel bypass, io_uring zero-copy, tensor transfer protocol |
| `memory/` | Rust | Buddy allocator for GPU memory, defragmentation, compaction |
| `autoscaler/` | Python | Prophet + LSTM ensemble forecasting, K8s HPA integration |
| `energy/` | Rust | NVML power monitoring, dynamic frequency scaling, carbon tracking |
| `kubernetes/` | Go | Scheduler extender, device plugin, custom metrics server |
| `dashboard/` | TypeScript/React | Real-time topology, memory heatmap, latency, energy dashboards |
## Quick Start

### Prerequisites

- Rust 1.75+ with nightly toolchain (for io_uring)
- Go 1.21+
- Node.js 20+ and pnpm
- Python 3.11+ with Poetry
- NVIDIA drivers with NVML support
- Linux kernel 5.19+ (for io_uring features)
- Kubernetes 1.28+ (for cluster deployment)
```bash
# Build all Rust components
make build-rust

# Build Go components
make build-go

# Build dashboard
make build-dashboard

# Install Python autoscaler
make install-autoscaler

# Build everything
make all
```

```bash
# Start the scheduler with mock GPU topology
CORTEX_MOCK_TOPOLOGY=1 cargo run --bin cortex-scheduler

# Start the dashboard
cd dashboard && pnpm dev

# Start the autoscaler
cd autoscaler && poetry run cortex-autoscaler --dev
```

```bash
# Apply CRDs and RBAC
kubectl apply -f deploy/k8s/

# Deploy CORTEX components
make deploy

# Verify
kubectl get pods -n cortex-system
```

## Benchmarks

```bash
# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench --bench scheduler_bench
cargo bench --bench memory_bench
```

## Configuration

CORTEX is configured via environment variables and a TOML config file:
```toml
[scheduler]
topology_refresh_interval_secs = 30
placement_strategy = "bin-pack" # bin-pack | spread | affinity-first
nvlink_weight = 0.4
numa_weight = 0.3
pcie_weight = 0.2
memory_weight = 0.1

[network]
transport = "io-uring" # io-uring | dpdk | tcp
zero_copy = true
ring_size = 256
tensor_chunk_size = "4MB"

[memory]
min_block_order = 12 # 4 KB minimum
max_block_order = 34 # 16 GB maximum
defrag_threshold = 0.3
compaction_interval_secs = 60

[energy]
power_cap_watts = 300
carbon_region = "us-west-2"
frequency_scaling = true

[autoscaler]
forecast_horizon_minutes = 30
lstm_sequence_length = 60
prophet_changepoint_prior = 0.05
ensemble_weight_lstm = 0.6
ensemble_weight_prophet = 0.4
```

## Contributing

We welcome contributions of all kinds. Please read our Contributing Guide to get started.
Before contributing, please also review our Code of Conduct.
## License

MIT License -- See LICENSE for details.