Version: 0.3.4 Date: 2026-02-27 Status: GPU isolation complete, inference stack near-complete (profiling done, benchmark remaining)
- Introduction & Motivation
- System Architecture Overview
- Microkernel Shard Architecture
- GPU Hardware Abstraction Layer
- Security Model
- Boot Process
- Scheduler Design
- Memory Management
- Inter-Process Communication (IPC)
- Networking
- Userland & Programming Model
- Filesystem & Storage
- Development Roadmap
- Open Questions & Risks
- Appendices
GPU isolation in modern operating systems is fundamentally bolted-on. Linux treats GPUs as monolithic devices behind a single kernel driver, with isolation enforced through ad-hoc mechanisms (cgroups, MIG, SR-IOV) that were never designed for security-critical AI inference workloads. The result:
- No real isolation. A compromised inference job can read VRAM belonging to another tenant. GPU memory is not zeroed between allocations by default. Side-channel attacks on GPU caches are practical and largely unmitigated.
- Massive attack surface. GPU kernel drivers (e.g., `amdgpu` at ~500K LoC, NVIDIA's proprietary blob at ~30M LoC) run in ring 0. A single driver bug yields full kernel compromise.
- Architectural mismatch. Monolithic kernels cannot enforce per-workload GPU access policies. There is no `pledge(2)` for GPU compute, no `unveil(2)` for VRAM regions.
Patching Linux is insufficient because the isolation boundary is in the wrong place. Linux's driver model assumes a trusted kernel with exclusive GPU access. Moving the driver to userspace (as in a microkernel) changes the trust model entirely: the supervisor never touches GPU registers directly, and each GPU partition is mediated by a shard with its own address space and capabilities.
OpenBSD demonstrates that a security-first UNIX can be practical. coconutOS takes OpenBSD's principles — small attack surface, pledge/unveil, W^X, ASLR, minimal defaults — and applies them to a GPU-native microkernel.
| # | Principle | Implication |
|---|---|---|
| 1 | Isolation by default | Every inference workload runs in its own shard with separate address space, VRAM partition, and capability set. No sharing unless explicitly granted. |
| 2 | Small trusted computing base | The supervisor (microkernel) targets <10K LoC. GPU drivers run in user-mode shards. |
| 3 | GPU as a first-class resource | The scheduler, memory manager, and IPC system are all GPU-aware from day one — not retrofitted. |
| 4 | Rust everywhere | All kernel and userland code is Rust (with unsafe blocks auditable and minimized). No C in the TCB. |
| 5 | OpenBSD philosophy | Secure defaults, minimal surface, correct over fast, audit everything. |
| 6 | Formal verifiability | The supervisor's critical paths (capability checks, IPC dispatch, shard lifecycle) are designed to be amenable to formal verification. |
coconutOS is the first operating system designed from scratch where GPU compute isolation is a kernel-level primitive rather than a userspace afterthought. The combination of microkernel shards, GPU-native scheduling, and OpenBSD-style security policies is unique.
┌─────────────────────────────────────────────────────────┐
│ Applications │
│ (inference clients, management tools) │
├─────────────────────────────────────────────────────────┤
│ Service Shards │
│ (network stack, filesystem, logging, metrics) │
├─────────────────────────────────────────────────────────┤
│ Inference Shards │
│ (model runtime, GPU compute, per-workload isolation) │
├─────────────────────────────────────────────────────────┤
│ GPU HAL Shards │
│ (user-mode GPU drivers, one per GPU/partition) │
├─────────────────────────────────────────────────────────┤
│ ┌───────────────────┐ │
│ │ Supervisor │ │
│ │ (<10K LoC) │ │
│ │ │ │
│ │ - Capability mgr │ │
│ │ - IPC dispatch │ │
│ │ - Shard lifecycle│ │
│ │ - Scheduler core │ │
│ │ - Memory regions │ │
│ └───────────────────┘ │
├─────────────────────────────────────────────────────────┤
│ Hardware │
│ CPU cores │ RAM │ GPUs │ NVMe │ NIC │ IOMMU │ TPM │
└─────────────────────────────────────────────────────────┘
| Component | Trust Level | LoC Target | Language | Runs In |
|---|---|---|---|---|
| Supervisor | TCB | <10K | Rust (no_std) | Ring 0 / EL1 |
| GPU HAL shard | Semi-trusted | ~50K per vendor | Rust | User-mode shard |
| Network shard | Untrusted | ~20K | Rust | User-mode shard |
| Filesystem shard | Untrusted | ~10K | Rust | User-mode shard |
| Inference shard | Untrusted | Varies | Rust + C FFI | User-mode shard |
| Boot loader | TCB (transient) | ~5K | Rust | Firmware/EL2 |
In scope:
- Malicious inference workloads attempting to escape their shard
- Side-channel attacks between GPU shards (timing, cache, power)
- Compromised GPU drivers attempting kernel escalation
- DMA attacks from malicious/buggy peripherals
- Network-based attacks on exposed inference endpoints
Out of scope (v1):
- Physical access attacks (cold boot, bus probing)
- Supply-chain attacks on GPU firmware/microcode
- Denial-of-service via legitimate resource exhaustion (handled by quotas, not security boundaries)
Trust boundaries:
- Supervisor ↔ any shard (capability-mediated syscall interface)
- Shard ↔ shard (IPC channels, no direct memory access)
- GPU HAL shard ↔ GPU hardware (IOMMU-enforced DMA regions)
- Network shard ↔ external network (packet filtering, TLS termination)
This is the central abstraction in coconutOS. A shard is the unit of isolation, scheduling, and resource management.
A shard is a lightweight isolated execution environment that combines:
- An independent virtual address space (CPU)
- Zero or more GPU memory partitions
- A capability set defining permitted operations
- One or more threads of execution
- A set of IPC channel endpoints
Shards are not containers (no shared kernel state), not VMs (no hardware virtualization overhead), and not processes (GPU resources are first-class, not bolted on).
┌──────────────── Shard ─────────────────┐
│ │
│ ┌─────────┐ ┌─────────┐ ┌────────┐ │
│ │ Thread 0│ │ Thread 1│ │Thread N│ │
│ └────┬────┘ └────┬────┘ └───┬────┘ │
│ │ │ │ │
│ ┌────┴────────────┴────────────┴────┐ │
│ │ Virtual Address Space │ │
│ │ ┌──────┐ ┌──────┐ ┌───────────┐ │ │
│ │ │ Code │ │ Heap │ │ GPU MMIO │ │ │
│ │ └──────┘ └──────┘ └───────────┘ │ │
│ └───────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────┐ │
│ │ GPU Partition │ │
│ │ ┌─────────┐ ┌─────────────────┐ │ │
│ │ │ VRAM │ │ Command Queues │ │ │
│ │ │ Region │ │ (compute/copy) │ │ │
│ │ └─────────┘ └─────────────────┘ │ │
│ └───────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────┐ │
│ │ Capabilities: {gpu.compute, │ │
│ │ ipc.channel:5, net.none, │ │
│ │ mem.alloc:4GiB, vram.alloc:8GiB}│ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
create()
┌────────┐
│ ▼
│ ┌─────────┐ boot() ┌─────────┐
│ │ Created │────────────▶│ Booting │
│ └─────────┘ └────┬────┘
│ │
│ ready │
│ ▼
│ ┌─────────┐ scale() ┌─────────┐
│ │ Running │◀───────────▶│ Scaling │
│ └────┬────┘ └─────────┘
│ │
│ destroy() or fault │
│ ▼
│ ┌──────────┐
└────────────────────────────│ Destroyed│
└──────────┘
States (implemented):
| State | Description |
|---|---|
| Free | Shard slot unoccupied. |
| Ready | Runnable, waiting for scheduler to select it. |
| Running | Currently executing on the CPU. |
| Blocked | Waiting on IPC channel recv. |
| Exited | Shard called SYS_EXIT, awaiting cleanup. |
| Destroyed | All resources reclaimed — frames freed, page tables torn down, capabilities cleared. |
Lifecycle operations (implemented):
- `shard::create(code, name, priority)` — Allocate page tables, map code + stack, prepare kernel context.
- `scheduler::run_loop()` — Pick Ready shards, context-switch, handle exit/block.
- `shard::destroy(id)` — Free all frames, tear down page tables, clear capabilities, zero memory.
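The implemented states and lifecycle operations above imply a small state machine. The following is an illustrative sketch only: the state names come from the table, but the exact transition set (in particular `Destroyed → Free` slot reuse) is an assumption, not the supervisor's actual code.

```rust
/// Shard states as listed in the "States (implemented)" table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShardState { Free, Ready, Running, Blocked, Exited, Destroyed }

/// Returns true when a transition is allowed by the sketched lifecycle.
pub fn can_transition(from: ShardState, to: ShardState) -> bool {
    use ShardState::*;
    matches!(
        (from, to),
        (Free, Ready)          // shard::create prepares a runnable shard
        | (Ready, Running)     // scheduler selects it
        | (Running, Ready)     // time slice expired
        | (Running, Blocked)   // blocking IPC channel recv
        | (Blocked, Ready)     // message arrived
        | (Running, Exited)    // shard called SYS_EXIT
        | (Exited, Destroyed)  // shard::destroy reclaims resources
        | (Destroyed, Free)    // assumed: slot becomes reusable
    )
}
```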
The supervisor is the only code running in ring 0 (or EL1 on ARM). It is intentionally minimal:
Responsibilities:
- Shard lifecycle management (create, boot, destroy)
- Capability creation, transfer, and revocation
- IPC message dispatch (fast-path)
- CPU scheduling (shard-level time slicing)
- Physical memory region management
- IOMMU configuration
- Interrupt routing to shards
- Timer management
Non-responsibilities (delegated to shards):
- GPU compute logic (HAL shards)
- Network stack (planned — network shard)
- Inference runtime (inference shards)
- Logging, metrics, tracing (planned — service shards)
Code budget: The supervisor targets <10,000 lines of Rust (no_std, no_alloc in critical paths). This is comparable to seL4's ~10K LoC verified kernel. The small size enables:
- Complete manual audit
- Fuzzing of all syscall paths
- Eventual formal verification of critical properties (capability safety, IPC correctness, memory isolation)
See Section 9 for full IPC details. Summary:
- Channels: Bidirectional, capability-mediated, supervisor-dispatched message passing.
- Shared memory: Opt-in shared regions between cooperating shards, created via supervisor grant.
- GPU DMA: Direct GPU-to-GPU memory transfer between shards, mediated by IOMMU rules.
| Property | Linux Container | VM (KVM) | seL4 Process | Fuchsia Job | coconutOS Shard |
|---|---|---|---|---|---|
| Isolation mechanism | Namespaces + cgroups | Hardware virtualization | Capability-based | Capability-based | Capability-based |
| GPU isolation | None (shared driver) | PCI passthrough (1 VM = 1 GPU) | Not GPU-aware | Not GPU-aware | Native GPU partitions |
| GPU memory zeroing | No | On VM destroy only | N/A | N/A | On every free |
| Overhead | Low | High (trap-and-emulate) | Low | Low | Low |
| TCB size | ~28M LoC (kernel) | ~28M LoC + QEMU | ~10K LoC | ~200K LoC (Zircon) | <10K LoC |
| GPU scheduling | Kernel driver | Hypervisor passthrough | N/A | N/A | Integrated 3-level |
| Formal verification | No | No | Yes (functional correctness) | No | Planned (critical paths) |
| Live GPU rescaling | No | No | N/A | N/A | Yes (shard_scale) |
The following properties are targets for formal verification of the supervisor:
- Capability safety: A shard cannot invoke an operation without holding the corresponding capability. Capabilities cannot be forged.
- Spatial isolation: A shard cannot read or write memory (CPU or GPU) outside its assigned regions.
- Temporal isolation: A destroyed shard's memory (CPU and GPU) is zeroed before reassignment.
- IPC integrity: Messages are delivered exactly once, to the correct destination, without modification.
- Liveness: The supervisor's IPC dispatch and scheduling loops are guaranteed to make progress (no unbounded blocking in the supervisor).
┌──────────────────────────────────────────┐
│ Layer 5: Compute Abstraction │
│ (GpuCompute trait — dispatch, sync) │
├──────────────────────────────────────────┤
│ Layer 4: Command Submission │
│ (GpuCommandQueue — ring buffers, fences)│
├──────────────────────────────────────────┤
│ Layer 3: Memory Management │
│ (GpuMemory — alloc, map, DMA) │
├──────────────────────────────────────────┤
│ Layer 2: Partitioning │
│ (GpuPartition — CU slicing, VRAM carve) │
├──────────────────────────────────────────┤
│ Layer 1: Device Abstraction │
│ (GpuDevice — discovery, reset, power) │
└──────────────────────────────────────────┘
Each layer is defined as a Rust trait. Vendor backends implement these traits in user-mode GPU HAL shards.
/// Layer 1: Physical GPU device.
pub trait GpuDevice: Send + Sync {
fn device_id(&self) -> DeviceId;
fn vendor(&self) -> GpuVendor;
fn capabilities(&self) -> DeviceCapabilities;
fn reset(&mut self) -> Result<(), GpuError>;
fn power_state(&self) -> PowerState;
fn set_power_state(&mut self, state: PowerState) -> Result<(), GpuError>;
fn thermal_info(&self) -> ThermalInfo;
}
/// Layer 2: Logical partition of a GPU.
pub trait GpuPartition: Send + Sync {
fn partition_id(&self) -> PartitionId;
fn parent_device(&self) -> DeviceId;
fn compute_units(&self) -> Range<u32>;
fn vram_region(&self) -> MemoryRegion;
fn resize(&mut self, new_cus: u32, new_vram: usize) -> Result<(), GpuError>;
fn isolate(&mut self) -> Result<(), GpuError>; // Enforce hard isolation fences
}
/// Layer 3: GPU memory operations.
pub trait GpuMemory: Send + Sync {
fn allocate(&mut self, desc: GpuAllocDesc) -> Result<GpuAllocation, GpuError>;
fn free(&mut self, alloc: GpuAllocation) -> Result<(), GpuError>;
fn map_to_cpu(&mut self, alloc: &GpuAllocation) -> Result<*mut u8, GpuError>;
fn unmap_from_cpu(&mut self, alloc: &GpuAllocation) -> Result<(), GpuError>;
fn zero(&mut self, alloc: &GpuAllocation) -> Result<(), GpuError>;
fn usage(&self) -> MemoryUsage;
}
/// Layer 4: Command queue for GPU work submission.
pub trait GpuCommandQueue: Send + Sync {
fn submit(&mut self, commands: &[GpuCommand]) -> Result<FenceId, GpuError>;
fn wait_fence(&self, fence: FenceId, timeout: Duration) -> Result<(), GpuError>;
fn poll_fence(&self, fence: FenceId) -> FenceStatus;
fn drain(&mut self) -> Result<(), GpuError>;
}
/// Layer 5: High-level compute dispatch.
pub trait GpuCompute: Send + Sync {
fn load_shader(&mut self, binary: &[u8]) -> Result<ShaderId, GpuError>;
fn unload_shader(&mut self, shader: ShaderId) -> Result<(), GpuError>;
fn dispatch(
&mut self,
shader: ShaderId,
args: &GpuDispatchArgs,
) -> Result<FenceId, GpuError>;
fn dispatch_indirect(
&mut self,
shader: ShaderId,
args_buffer: &GpuAllocation,
) -> Result<FenceId, GpuError>;
}
/// GPU-to-GPU and GPU-to-CPU DMA operations.
pub trait GpuDma: Send + Sync {
fn copy_device_to_device(
&mut self,
src: &GpuAllocation,
dst: &GpuAllocation,
size: usize,
) -> Result<FenceId, GpuError>;
fn copy_host_to_device(
&mut self,
src: *const u8,
dst: &GpuAllocation,
size: usize,
) -> Result<FenceId, GpuError>;
fn copy_device_to_host(
&mut self,
src: &GpuAllocation,
dst: *mut u8,
size: usize,
) -> Result<FenceId, GpuError>;
fn peer_copy(
&mut self,
src_partition: PartitionId,
src_alloc: &GpuAllocation,
dst_partition: PartitionId,
dst_alloc: &GpuAllocation,
size: usize,
) -> Result<FenceId, GpuError>;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuVendor {
Amd,
Nvidia,
Intel,
Apple,
}
#[derive(Debug, Clone)]
pub struct GpuAllocDesc {
pub size: usize,
pub alignment: usize,
pub usage: GpuMemoryUsage,
pub zero_on_alloc: bool, // Always true for inference shards
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuMemoryUsage {
Weights, // Read-only after load, large, contiguous
Activations, // Read-write, ephemeral per inference
KvCache, // Read-write, grows with sequence length
Scratch, // Temporary computation buffers
CommandBuffer, // Ring buffer for command queues
}
#[derive(Debug, Clone, Copy)]
pub struct MemoryRegion {
pub base: u64, // Physical or IOMMU-virtual address
pub size: usize,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceStatus {
Pending,
Signaled,
Error(GpuError),
}

| Priority | Vendor | Hardware Target | Approach |
|---|---|---|---|
| Phase 1 | AMD | RDNA 3 / CDNA 3 (MI300) | Open-source register specs + Mesa/AMDGPU reference. User-mode driver shard. |
| Phase 3 | Intel | Arc (Xe) | Open-source i915 reference. Lower priority. |
| Phase 4 | NVIDIA | Hopper / Blackwell | Reverse-engineered nouveau-style or future open-source driver. Highest complexity. |
| Phase 4 | Apple | M-series (Apple GPU) | Asahi Linux reverse-engineering reference. ARM-only. |
AMD is the initial target because:
- Open-source register documentation exists
- AMDGPU kernel driver and Mesa userspace are fully open-source references
- CDNA/MI300 is the primary non-NVIDIA AI accelerator
- AMD GPUs support hardware-level compute partitioning
coconutOS enforces a typed GPU memory model. Every VRAM allocation has a declared `GpuMemoryUsage` type that determines:
| Usage Type | Permissions | Zeroing | Lifetime |
|---|---|---|---|
| `Weights` | Read-only after load | On free | Shard lifetime |
| `Activations` | Read-write | On alloc + free | Per-inference |
| `KvCache` | Read-write | On free | Per-session |
| `Scratch` | Read-write | On free | Per-dispatch |
| `CommandBuffer` | Write (CPU) → Read (GPU) | On free | Queue lifetime |
The Weights region is made read-only after initial DMA load, preventing runtime corruption. Activations are zeroed on allocation to prevent cross-inference data leaks.
GPU drivers are inherently complex. The AMD AMDGPU kernel driver alone is ~500K LoC. coconutOS mitigates this by:
- User-mode execution. GPU driver bugs crash the HAL shard, not the supervisor. The shard can be restarted.
- IOMMU containment. A buggy driver cannot DMA outside its assigned regions.
- Minimal driver scope. coconutOS drivers only need to support compute (no display, no video decode, no OpenGL). This eliminates ~60% of driver complexity.
- Incremental bring-up. Start with the minimum set of registers needed for compute dispatch, memory allocation, and power management. No attempt to support the full GPU feature set.
Estimated reduced driver LoC per vendor: ~50K (compute-only) vs. ~500K (full driver).
Every resource in coconutOS is accessed through capabilities — unforgeable tokens that encode both the resource identity and the permitted operations.
Per-shard capability table (kernel-side, 16 entries per shard):
┌──────────┬──────────┬────────────┐
│ Type (8) │ ID (16) │ Rights (16)│
└──────────┴──────────┴────────────┘
Capability types (implemented):
| Type | Value | Resource | Example Rights |
|---|---|---|---|
| `CAP_CHANNEL` | 1 | IPC channel endpoint | send, receive, grant |
| `CAP_SHARD` | 2 | Shard management | (reserved) |
| `CAP_MEMORY` | 3 | Memory region | (reserved) |
| `CAP_GPU_DMA` | 4 | GPU DMA access | write |
Planned types (not yet implemented): CAP_VRAM, CAP_GPU, CAP_IRQ, CAP_IO, CAP_TIMER.
Capabilities can be:
- Granted: A shard can pass a capability (or a restricted version) to another shard via `SYS_CAP_GRANT` (requires `RIGHT_CHANNEL_GRANT`).
- Restricted: Rights can be removed but never added (monotonic AND via `SYS_CAP_RESTRICT`).
- Revoked: A shard can revoke its own capabilities via `SYS_CAP_REVOKE` (non-cascading).
- Inspected: `SYS_CAP_INSPECT` returns packed `(cap_type << 48 | resource_id << 16 | rights)`.
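The packed format returned by `SYS_CAP_INSPECT` and the monotonic-AND restriction rule can be sketched as follows. The helper names (`cap_pack`, `cap_unpack`, `cap_restrict`) are illustrative, not supervisor APIs; only the bit layout and the AND semantics come from the text.

```rust
pub const CAP_CHANNEL: u64 = 1;

/// Pack a capability into the SYS_CAP_INSPECT layout:
/// (cap_type << 48) | (resource_id << 16) | rights.
pub fn cap_pack(cap_type: u64, resource_id: u64, rights: u64) -> u64 {
    (cap_type << 48) | ((resource_id & 0xFFFF) << 16) | (rights & 0xFFFF)
}

/// Unpack into (type, resource_id, rights).
pub fn cap_unpack(packed: u64) -> (u64, u64, u64) {
    (packed >> 48, (packed >> 16) & 0xFFFF, packed & 0xFFFF)
}

/// Monotonic restriction: rights can only be ANDed away, never added.
pub fn cap_restrict(packed: u64, rights_mask: u64) -> u64 {
    let (ty, id, rights) = cap_unpack(packed);
    cap_pack(ty, id, rights & rights_mask)
}
```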
Inspired by OpenBSD's pledge(2) and unveil(2):
Implemented as syscalls:
- `SYS_GPU_PLEDGE` (41) — `a0` is a bitmask of allowed syscall categories. Monotonic: can only remove bits, never add. Bits: `PLEDGE_SERIAL` (1), `PLEDGE_CHANNEL` (2), `PLEDGE_GPU_DMA` (4).
- `SYS_GPU_UNVEIL` (42) — `a0` = offset, `a1` = size. One-shot: locks a VRAM range for DMA. Only the unveiled range can be used as a DMA source. Cannot be called again after the first call.
Example: locked-down GPU HAL shard
After initialization, the HAL shard restricts itself:
- `pledge_gpu(PLEDGE_SERIAL | PLEDGE_CHANNEL | PLEDGE_GPU_DMA)` — only serial, IPC, and DMA allowed
- `unveil_vram(offset, size)` — only a specific VRAM region can be used for DMA
From this point, any attempt to invoke other syscalls or DMA outside the unveiled region is rejected.
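The monotonic pledge mask can be sketched as a simple AND-only policy check. The bit constants match `SYS_GPU_PLEDGE`; the `ShardPolicy` type and the all-ones initial mask are assumptions made for illustration.

```rust
pub const PLEDGE_SERIAL: u64 = 1;
pub const PLEDGE_CHANNEL: u64 = 2;
pub const PLEDGE_GPU_DMA: u64 = 4;

pub struct ShardPolicy {
    pledge_mask: u64, // assumed: everything permitted until first pledge
}

impl ShardPolicy {
    pub fn new() -> Self {
        Self { pledge_mask: u64::MAX }
    }

    /// Monotonic: the new mask is ANDed in, so bits can only be removed.
    pub fn pledge(&mut self, mask: u64) {
        self.pledge_mask &= mask;
    }

    /// Called on syscall entry with the category bit of that syscall.
    pub fn allows(&self, category: u64) -> bool {
        self.pledge_mask & category != 0
    }
}
```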
Future expansion: The design supports richer pledge categories (compute, alloc, shader_load) and multi-region unveil, to be added as the inference stack matures.
All GPU memory regions enforce W^X (write XOR execute):
- Shader code: Loaded into a region marked executable, then made read-only. Cannot be written to after load.
- Data buffers: Marked read-write but never executable. Cannot be used as shader code.
- Command buffers: Marked write (CPU-side) and read (GPU-side). Cannot be used for shader execution.
This prevents GPU-based code injection attacks.
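A minimal sketch of the W^X rule as a permission-flag check. The flag constants and the two-phase shader-load sequence here are illustrative assumptions, not the supervisor's actual flags; the invariant (never writable and executable at once) is the one stated above.

```rust
pub const PROT_READ: u8 = 0b001;
pub const PROT_WRITE: u8 = 0b010;
pub const PROT_EXEC: u8 = 0b100;

/// A mapping satisfies W^X iff it is not simultaneously writable and executable.
pub fn wx_valid(prot: u8) -> bool {
    prot & (PROT_WRITE | PROT_EXEC) != (PROT_WRITE | PROT_EXEC)
}

/// Assumed shader-load flow: write the binary while the region is R+W,
/// then flip it to R+X. The region is never W and X at the same time.
pub fn shader_load_states() -> [u8; 2] {
    [PROT_READ | PROT_WRITE, PROT_READ | PROT_EXEC]
}
```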
GPU memory layout is randomized per shard:
- VRAM region base addresses are randomized within the partition
- Command queue ring buffer locations are randomized
- Shader code load addresses are randomized
- Entropy: minimum 20 bits for VRAM regions, 16 bits for command queues
This makes VRAM-based exploits (buffer overflows, use-after-free) significantly harder to weaponize.
The IOMMU (AMD-Vi / Intel VT-d / ARM SMMU) is the hardware root of GPU isolation:
- Each GPU HAL shard has its own IOMMU domain
- DMA regions are configured by the supervisor before the HAL shard starts
- The HAL shard cannot modify its own IOMMU mappings
- DMA is restricted to the shard's assigned physical memory regions
- Interrupt remapping is enabled to prevent MSI spoofing
DMA region lifecycle:
1. Supervisor creates physical memory region
2. Supervisor configures IOMMU mapping for HAL shard's device
3. HAL shard can now DMA to/from the mapped region
4. On shard destroy: supervisor removes IOMMU mapping, zeroes memory
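The containment property the IOMMU enforces in hardware can be sketched as a software check: a device DMA is valid only if it falls entirely inside a region the supervisor mapped into that shard's domain. Type and field names are illustrative, and `Vec` is used for brevity even though the supervisor itself is `no_std`.

```rust
pub struct IommuDomain {
    /// (base, size) pairs of DMA-able regions, configured by the
    /// supervisor before the HAL shard starts (step 2 above).
    regions: Vec<(u64, u64)>,
}

impl IommuDomain {
    pub fn new(regions: Vec<(u64, u64)>) -> Self {
        Self { regions }
    }

    /// True iff [addr, addr+len) lies wholly within one mapped region.
    pub fn dma_allowed(&self, addr: u64, len: u64) -> bool {
        let end = match addr.checked_add(len) {
            Some(e) => e,
            None => return false, // address overflow: reject
        };
        self.regions
            .iter()
            .any(|&(base, size)| addr >= base && end <= base + size)
    }
}
```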
| Attack Vector | Mitigation | Status |
|---|---|---|
| FPU/SSE register leakage | `fninit` + zero all XMM0-15 + reset MXCSR on every context switch | Implemented |
| Debug register persistence | Clear DR0-DR3, reset DR7 on every context switch | Implemented |
| Branch predictor cross-shard inference | IBPB (wrmsr 0x49) on every context switch when CPU supports it | Implemented |
| User-mode timing attacks | CR4.TSD set — `rdtsc`/`rdtscp` causes #GP in ring 3 | Implemented |
| FXSAVE/FXRSTOR state leakage | Timer ISR saves/restores per-shard SSE state; side-channel clear still runs between shards | Implemented |
| GPU cache timing | Partition-level cache flushing on context switch | Planned |
| VRAM access patterns | Constant-time memory access primitives for sensitive operations | Planned |
| Speculative execution (CPU) | IBPB implemented; additional Spectre mitigations planned | Partial |
| PCIe bus snooping | Out of scope (physical access); mitigated by IOMMU for software DMA attacks | N/A |
All security-relevant events will be logged to a tamper-evident audit log:
- Shard creation/destruction
- Capability grants, delegations, and revocations
- `pledge_gpu` and `unveil_vram` calls
- IOMMU configuration changes
- Security policy violations (attempted access beyond capabilities)
- GPU fault events (page faults, command errors)
The audit log is written to a dedicated audit shard that has no GPU access and no network access (write-only, append-only). Log integrity is protected by a hash chain.
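The hash-chain integrity scheme can be sketched as follows. This is illustrative only: the hash here is FNV-1a purely to keep the example self-contained, where a real audit log would use a cryptographic hash such as SHA-256, and the `AuditLog` type is an assumption.

```rust
/// Toy 64-bit FNV-1a hash; stands in for a cryptographic hash.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

pub struct AuditLog {
    prev: u64,
    /// (chained hash, event text) for each appended entry.
    pub entries: Vec<(u64, String)>,
}

impl AuditLog {
    pub fn new() -> Self {
        Self { prev: 0, entries: Vec::new() }
    }

    /// Each entry's hash covers the previous entry's hash, so silently
    /// altering or dropping an earlier entry breaks every later link.
    pub fn append(&mut self, event: &str) {
        let mut buf = self.prev.to_le_bytes().to_vec();
        buf.extend_from_slice(event.as_bytes());
        self.prev = fnv1a(&buf);
        self.entries.push((self.prev, event.to_string()));
    }

    /// Re-walk the chain and check every link.
    pub fn verify(&self) -> bool {
        let mut prev = 0u64;
        for (h, event) in &self.entries {
            let mut buf = prev.to_le_bytes().to_vec();
            buf.extend_from_slice(event.as_bytes());
            if fnv1a(&buf) != *h {
                return false;
            }
            prev = *h;
        }
        true
    }
}
```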
Power On
│
▼
┌─────────┐
│ UEFI │ Platform firmware
└────┬────┘
│
▼
┌──────────────┐
│ Bootloader │ coconut-boot (Rust, UEFI application)
│ │ - Load supervisor ELF from boot FS
│ │ - Parse PT_LOAD segments → 0x200000
│ │ - Build BootInfo + memory map
│ │ - Find ACPI RSDP via UEFI config table
│ │ - Exit boot services
│ │ - Jump to supervisor (RDI = BootInfo*)
└──────┬───────┘
│
▼
┌──────────────────────────────┐
│ Boot Trampoline (_start) │
│ - Set temp stack at 0x300000│
│ - Zero BSS, init serial │
│ - Build 3-region page tables│
│ - Enable NXE, switch CR3 │
│ - Jump to supervisor_main │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ supervisor_main │
│ - PMM, frame alloc, GDT, │
│ TSS, IDT, PIC, PIT │
│ - CR4.OSFXSR + CR4.TSD │
│ - Detect IBPB, init ACPI │
│ - PCI enum, IOMMU, GPU │
│ - Init filesystem (ext2) │
│ - Remove identity mapping │
│ - Create shards: │
│ GPU HAL ×2, fs-reader, │
│ hello-c, llama-inference │
│ - Enable interrupts │
│ - Enter scheduler run loop │
└──────────────────────────────┘
Currently, shards are statically embedded in the supervisor binary via `include_bytes!` and created in `supervisor_main`. A future boot configuration system will support dynamic shard manifests:
# /boot/coconut.toml (planned)
[supervisor]
binary = "/boot/supervisor.elf"
log_level = "info"
audit = true
[gpu]
# Partition strategy: "equal" | "manifest" | "manual"
partition_strategy = "manifest"
zero_vram_on_boot = true
[[gpu.device]]
pci_slot = "0000:03:00.0"
vendor = "amd"
driver = "/boot/drivers/amd-rdna3.shard"
[[gpu.device]]
pci_slot = "0000:04:00.0"
vendor = "amd"
driver = "/boot/drivers/amd-cdna3.shard"
[network]
shard = "/boot/shards/network.shard"
interfaces = ["enp5s0"]
[filesystem]
shard = "/boot/shards/filesystem.shard"
root_device = "/dev/nvme0n1p2"
[audit]
shard = "/boot/shards/audit.shard"
log_path = "/var/log/coconut-audit"
# Inference shards to boot automatically
[[shard]]
name = "llama-inference"
manifest = "/etc/shards/llama.toml"
autostart = true
[[shard]]
name = "whisper-inference"
manifest = "/etc/shards/whisper.toml"
autostart = false

Shards can be restarted without rebooting the supervisor:
- Fault detection: Supervisor detects shard crash (page fault, illegal instruction, GPU fault, watchdog timeout).
- Teardown: All shard threads halted, GPU queues drained, VRAM zeroed, capabilities revoked.
- Rebuild: Shard re-created from its original manifest, binary reloaded, GPU partition reassigned.
- State recovery: For stateless inference shards, this is a clean restart. For stateful shards (filesystem), a recovery protocol is invoked.
Hot-restart latency target: <100ms for inference shards. The cost is dominated by GPU partition setup; model weight DMA reload is avoided via copy-on-write VRAM snapshots (see Section 8).
Level 1: Supervisor Scheduler (CPU)
│
│ Assigns CPU time slices to shards
│ Round-robin with priority classes
│
├──── Level 2: Intra-Shard CPU Scheduler
│ │
│ │ Cooperative scheduling within a shard
│ │ Shard manages its own threads
│ │ Preemption only at shard boundary
│
└──── Level 3: GPU Scheduler
│
│ Per-partition GPU command queue management
│ Deadline-aware for inference latency SLOs
│ Co-scheduled with CPU to minimize idle bubbles
The supervisor uses a simple, auditable scheduling algorithm:
- Fixed-priority round-robin with 4 priority classes:
  - Critical: Supervisor-internal tasks, IOMMU management
  - High: GPU HAL shards, interrupt-driven I/O
  - Normal: Inference shards, application shards
  - Low: Background maintenance
- Time slice: ~1ms (PIT at ~1 kHz, divisor 1193)
- Preemption: PIT timer ISR (vector 32) fires ~1 kHz. User-mode path: save GP regs + FXSAVE → `timer_preempt` (tick++, EOI, mark Ready, yield) → FXRSTOR + restore → `iretq`. Kernel-mode interrupts: EOI + `iretq` only (no preemption of syscall handlers).
- Context switch: Naked asm function — push/pop callee-saved registers (RBX, RBP, R12-R15), swap RSP.
- Side-channel clearing: `clear_sensitive_cpu_state()` zeroes FPU/SSE/debug state and issues IBPB before every shard switch.
- MAX_SHARDS: 8, each with a 4 KiB kernel stack.
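The fixed-priority round-robin selection can be sketched as below. The priority classes and Ready state follow the text; the slot layout, per-class scan, and function names are illustrative assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Class { Critical, High, Normal, Low }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State { Free, Ready, Running, Blocked, Exited }

pub struct Slot {
    pub state: State,
    pub class: Class,
}

/// Pick the next shard: the highest priority class with a Ready shard
/// wins; within a class, scan round-robin starting after the slot that
/// ran last, so equal-priority shards take turns.
pub fn pick_next(slots: &[Slot], last: usize) -> Option<usize> {
    let n = slots.len();
    for class in [Class::Critical, Class::High, Class::Normal, Class::Low] {
        for off in 1..=n {
            let i = (last + off) % n;
            if slots[i].state == State::Ready && slots[i].class == class {
                return Some(i);
            }
        }
    }
    None
}
```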
Within a shard, threads are cooperatively scheduled by a shard-local scheduler. The shard runtime library provides:
- `yield_now()` — Voluntarily yield the current thread's time slice.
- `spawn(task)` — Create a new cooperative task within the shard.
- `sleep(duration)` — Suspend the current task.
The supervisor does not see individual threads within a shard — it schedules the shard as an opaque unit.
Each GPU partition has a scheduler that manages command queue submission:
- Deadline scheduling: Inference shards declare latency SLOs (e.g., "complete within 50ms"). The GPU scheduler prioritizes dispatches to meet deadlines.
- Batch coalescing: Multiple small dispatches are coalesced into larger batches to amortize launch overhead.
- Preemption: GPU command preemption is hardware-dependent. On AMD RDNA3/CDNA3, mid-wave preemption is supported. The scheduler uses preemption to enforce deadlines.
Inference workloads alternate between CPU (tokenization, sampling) and GPU (matrix multiply, attention) phases. The scheduler co-schedules CPU and GPU work to minimize idle bubbles:
CPU: [tokenize]────────────────[sample]────────────────[tokenize]
GPU: [prefill████████] [decode████████]
↑ CPU yields while ↑ CPU yields while
GPU is active GPU is active
Co-scheduling is achieved by:
- The inference runtime signals phase transitions via lightweight supervisor calls.
- The supervisor CPU scheduler deprioritizes shards that are GPU-bound (avoiding CPU waste).
- The GPU scheduler fast-tracks shards whose CPU work just completed (minimizing GPU idle time).
Large model inference can be split across multiple shards (pipeline parallelism):
Shard A (layers 0-15) → Shard B (layers 16-31) → Shard C (layers 32-47)
GPU 0 GPU 1 GPU 2
The scheduler provides pipeline coordination primitives:
- Pipeline barriers: A shard can signal "my stage is complete" to the next shard in the pipeline.
- Pipeline scheduling: The supervisor schedules pipeline shards in wave order to maximize throughput.
- Flow control: Backpressure from downstream shards throttles upstream dispatch to prevent buffer overflow.
- Real-time: Inference shards can request soft real-time guarantees via manifest configuration. The scheduler reserves CPU and GPU capacity to meet latency SLOs.
- Power-aware: The scheduler monitors GPU temperature and power draw. When thermal limits approach, it reduces scheduling frequency or migrates work to cooler partitions. GPU power states (active/idle/sleep) are managed per-partition.
The supervisor manages physical memory in large regions:
Physical Memory Map:
┌───────────────────────┐ 0x0000_0000_0000
│ Supervisor (reserved) │ Fixed mapping, <16 MiB
├───────────────────────┤
│ Shard Region Pool │ Allocated to shards on demand
│ │ 2 MiB granularity (huge pages)
├───────────────────────┤
│ IOMMU Page Tables │ Managed by supervisor
├───────────────────────┤
│ Device MMIO │ Mapped to HAL shards
└───────────────────────┘
The supervisor allocates physical memory in 2 MiB regions to minimize page table overhead and TLB pressure. Regions are typed:
| Region Type | Properties |
|---|---|
| `SupervisorPrivate` | Not accessible by any shard. Contains capability tables, scheduler state. |
| `ShardCode` | Mapped read + execute into one shard's address space. |
| `ShardData` | Mapped read + write into one shard's address space. |
| `ShardShared` | Mapped into multiple shards' address spaces (explicit grant required). |
| `DeviceDma` | Mapped into IOMMU for device DMA. Accessible by one HAL shard. |
Each shard has its own virtual address space, configured by the supervisor:
Shard Virtual Address Space (implemented):
┌───────────────────────┐ 0x3F00_0000
│ │ (unmapped)
├───────────────────────┤
│ GPU BARs (ASLR'd) │ VRAM + MMIO (HAL shards only)
├───────────────────────┤ 0x0080_0000+
│ │ (unmapped)
├───────────────────────┤
│ Stack (4 KiB) │ R+W+NX
├───────────────────────┤ 0x007F_F000
│ │ (unmapped)
├───────────────────────┤
│ Data (mmap'd heap) │ R+W+NX (via SYS_MMAP)
├───────────────────────┤ 0x0010_0000+
│ │ (unmapped)
├───────────────────────┤
│ Config page │ R (HAL shards only, VA 0x4000)
├───────────────────────┤
│ Code (multi-page) │ R+X (W^X enforced)
├───────────────────────┤ 0x0000_1000
│ (unmapped null guard) │
└───────────────────────┘ 0x0000_0000
GPU ASLR randomizes VRAM and MMIO BAR virtual addresses within [0x800000, 0x3F000000) per shard.
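Base randomization within the stated window can be sketched as below. The window bounds come from the text; the xorshift generator is a toy stand-in (the real supervisor would draw from a hardware entropy source), and page granularity is an assumption, so the achieved entropy depends on the actual alignment chosen.

```rust
const ASLR_LOW: u64 = 0x0080_0000;  // window start, per the text
const ASLR_HIGH: u64 = 0x3F00_0000; // window end (exclusive)
const PAGE: u64 = 0x1000;           // assumed alignment granularity

/// Toy xorshift64 PRNG; `state` must be nonzero.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Choose a randomized, page-aligned base so that a mapping of `size`
/// bytes fits entirely inside [ASLR_LOW, ASLR_HIGH).
pub fn randomize_base(seed: &mut u64, size: u64) -> u64 {
    let candidates = (ASLR_HIGH - ASLR_LOW - size) / PAGE;
    ASLR_LOW + (xorshift(seed) % candidates) * PAGE
}
```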
GPU memory (VRAM) is managed separately from CPU memory:
GPU VRAM (per partition):
┌────────────────────────┐
│ Weights Region │ Read-only after load
│ (contiguous, aligned) │
├────────────────────────┤
│ KV-Cache Region │ Grows with sequence length
│ (paged, 64 KiB pages) │
├────────────────────────┤
│ Activations Region │ Ephemeral per inference
│ (bump allocator) │
├────────────────────────┤
│ Scratch Region │ Temporary buffers
│ (pool allocator) │
├────────────────────────┤
│ Command Buffers │ Ring buffers for GPU queues
└────────────────────────┘
Allocation strategies by type:
| Type | Allocator | Rationale |
|---|---|---|
| `Weights` | Single contiguous allocation | Loaded once, never resized, needs contiguous VRAM for efficient access |
| `KvCache` | Paged allocator (64 KiB pages) | Grows dynamically with sequence length, needs efficient append |
| `Activations` | Bump allocator (reset per inference) | Allocated in order, freed all at once — bump is optimal |
| `Scratch` | Pool allocator (fixed-size blocks) | Reusable temporary buffers, predictable sizes |
| `CommandBuffer` | Ring buffer | Circular producer-consumer pattern |
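A minimal sketch of the `Activations` bump allocator: allocations advance a cursor and `reset()` frees everything at once between inferences. Offsets are relative to the region base; the type and method names are illustrative.

```rust
pub struct BumpRegion {
    size: u64,
    cursor: u64,
}

impl BumpRegion {
    pub fn new(size: u64) -> Self {
        Self { size, cursor: 0 }
    }

    /// Allocate `len` bytes at `align` alignment (align must be a nonzero
    /// power of two). Returns the offset within the region, or None when
    /// the region is exhausted.
    pub fn alloc(&mut self, len: u64, align: u64) -> Option<u64> {
        let start = (self.cursor + align - 1) & !(align - 1);
        let end = start.checked_add(len)?;
        if end > self.size {
            return None;
        }
        self.cursor = end;
        Some(start)
    }

    /// Reset between inferences: all activations freed in O(1).
    /// (Zero-on-alloc handles the cross-inference leak concern.)
    pub fn reset(&mut self) {
        self.cursor = 0;
    }
}
```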
DMA transfers between CPU and GPU memory are mediated by the supervisor:
- CPU → GPU (model load): Filesystem shard reads model weights → shared memory region → GPU HAL shard DMA-copies to VRAM.
- GPU → GPU (pipeline parallel): Shard A requests peer DMA to Shard B → supervisor verifies capabilities → configures IOMMU for cross-partition DMA → HAL performs transfer.
- GPU → CPU (inference output): Inference shard DMA-copies output tokens to CPU-mapped region → IPC to requesting shard.
All DMA operations require explicit capability checks. The supervisor validates source and destination regions before any transfer.
| Property | Mechanism |
|---|---|
| No cross-shard CPU memory access | Separate page tables per shard |
| No cross-shard GPU memory access | IOMMU domains + GPU hardware partitioning |
| No stale data in freed memory | Zero-on-free for both CPU and GPU memory |
| No stale data in allocated memory | Zero-on-alloc for Activations type |
| No executable data | W^X enforced on CPU and GPU memory |
| DMA containment | IOMMU restricts each device to assigned regions |
coconutOS does not have swap. When a shard exceeds its memory quota:
- Soft limit: Shard is notified via a callback. Expected to free caches (e.g., trim KV-cache).
- Hard limit: Further allocations fail with `OutOfMemory`. The shard must handle this gracefully.
- Supervisor OOM: If the supervisor itself is out of physical memory to assign, it refuses new shard creation. Existing shards are not killed — stability over throughput.
GPU OOM follows the same pattern: soft notification → hard alloc failure → no VRAM swap.
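A minimal sketch of this quota pattern, with hypothetical names (`ShardQuota`, `charge`) rather than the coconutOS API: crossing the soft limit invokes a trim callback, crossing the hard limit fails the allocation, and no swap path exists in either case.

```rust
#[derive(Debug, PartialEq)]
enum AllocError { OutOfMemory }

/// Hypothetical per-shard quota state (illustrative, not the kernel's types).
struct ShardQuota {
    used: usize,
    soft_limit: usize,
    hard_limit: usize,
}

impl ShardQuota {
    /// Charge `bytes` against the quota. `trim_caches` stands in for the
    /// soft-limit callback (e.g., trimming the KV-cache).
    fn charge(
        &mut self,
        bytes: usize,
        trim_caches: impl FnOnce(&mut Self),
    ) -> Result<(), AllocError> {
        if self.used + bytes > self.soft_limit {
            trim_caches(self); // soft limit: ask the shard to free caches
        }
        if self.used + bytes > self.hard_limit {
            return Err(AllocError::OutOfMemory); // hard limit: alloc fails
        }
        self.used += bytes; // success: no swap fallback exists
        Ok(())
    }
}
```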
- Intra-shard IPC: <500ns for synchronous message passing within a single shard
- Inter-shard IPC: <5µs for supervisor-mediated channel messages between shards
- GPU DMA IPC: Line-rate GPU-to-GPU transfer for pipeline parallelism
- Zero-copy: Large data transfers use shared memory, not message copying
- Capability passing: Capabilities can be transferred via IPC messages
Within a shard, threads communicate without supervisor involvement:
| Mechanism | Latency | Use Case |
|---|---|---|
| Synchronous message passing | <500ns | Request-response between tasks |
| Shared memory (shard-local) | Memory access time | Bulk data sharing between threads |
| Events (wakeup signals) | <200ns | Signaling between producer/consumer tasks |
These are implemented entirely in the shard runtime library — no syscalls required.
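As a sketch of the event mechanism, modeled here with standard-library atomics (the actual shard runtime is `no_std`, and the `Event` type is an illustrative assumption): the producer publishes with Release ordering, the consumer observes with Acquire, and no syscall occurs on either side.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Shard-local wakeup event (illustrative; not the runtime library's type).
pub struct Event(AtomicBool);

impl Event {
    pub const fn new() -> Self { Event(AtomicBool::new(false)) }

    /// Producer side: Release ordering publishes prior writes before the signal.
    pub fn set(&self) { self.0.store(true, Ordering::Release); }

    /// Consumer side: spin until signaled (a real runtime would bound the
    /// spin and then park the thread).
    pub fn wait(&self) {
        while !self.0.load(Ordering::Acquire) { std::hint::spin_loop(); }
    }
}
```

Because both sides touch only shard-local memory, the supervisor is never entered, which is what keeps latency in the sub-microsecond range quoted above.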
Communication between shards requires supervisor mediation:
Syscalls: SYS_CHANNEL_SEND(21), SYS_CHANNEL_RECV(22).
Implementation:
- Single-buffered per direction, 256-byte max message size
- Blocking receive: shard state set to `Blocked`, scheduler yields
- Capability-gated: sender must hold `CAP_CHANNEL` with `RIGHT_CHANNEL_SEND`, receiver must hold `RIGHT_CHANNEL_RECV`
- Supervisor copies message between kernel-side buffers (no direct shard-to-shard memory access)
- Receiver is woken (state set to `Ready`) when a message arrives
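A minimal model of these invariants, with hypothetical types (not the supervisor's code): one kernel-side buffer per direction, a 256-byte cap, and copy-in/copy-out semantics. Where this sketch returns `WouldBlock`, the real supervisor sets the shard to `Blocked` and yields.

```rust
const MAX_MSG: usize = 256; // single-buffered, 256-byte max message size

#[derive(Debug, PartialEq)]
enum ChannelError { MessageTooLarge, WouldBlock }

/// One direction of a channel: a single kernel-side buffer.
struct ChannelBuf {
    data: [u8; MAX_MSG],
    len: usize,
    full: bool,
}

impl ChannelBuf {
    fn new() -> Self {
        ChannelBuf { data: [0; MAX_MSG], len: 0, full: false }
    }

    /// Copy the message into the kernel buffer if it is empty.
    fn send(&mut self, msg: &[u8]) -> Result<(), ChannelError> {
        if msg.len() > MAX_MSG {
            return Err(ChannelError::MessageTooLarge);
        }
        if self.full {
            return Err(ChannelError::WouldBlock); // real kernel: block sender
        }
        self.data[..msg.len()].copy_from_slice(msg);
        self.len = msg.len();
        self.full = true;
        Ok(())
    }

    /// Copy the message out and mark the buffer empty.
    fn recv(&mut self, out: &mut [u8]) -> Result<usize, ChannelError> {
        if !self.full {
            return Err(ChannelError::WouldBlock); // real kernel: block receiver
        }
        let n = self.len.min(out.len());
        out[..n].copy_from_slice(&self.data[..n]);
        self.full = false;
        Ok(n)
    }
}
```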
For bulk data transfer (e.g., inference inputs/outputs), channels are too slow. Shared memory regions provide zero-copy IPC:
```rust
/// Create a shared memory region accessible by two shards.
fn shared_memory_create(
    size: usize,
    owner_rights: Permission,
    peer_rights: Permission,
) -> Result<SharedMemoryHandle, IpcError>;

/// Grant access to a shared memory region to another shard.
fn shared_memory_grant(
    handle: &SharedMemoryHandle,
    target_shard: ShardId,
) -> Result<(), IpcError>;
```

Shared memory regions are:
- Created by the supervisor on behalf of the owning shard
- Mapped into both shards' virtual address spaces
- Protected by capability-based access control (read, write, or read-write per shard)
- Unmapped and zeroed on revocation or shard destruction
For GPU-to-GPU data transfer between shards (pipeline parallelism):
```rust
/// Request a peer DMA transfer between GPU partitions of two shards.
fn gpu_peer_dma(
    src_alloc: &GpuAllocation,
    dst_shard: ShardId,
    dst_alloc: &GpuAllocation,
    size: usize,
) -> Result<FenceId, IpcError>;
```

The supervisor:
- Verifies both shards hold the `peer_copy` pledge
- Verifies the source shard holds `dma_src` rights on the source allocation
- Verifies the destination shard holds `dma_dst` rights on the destination allocation
- Configures the IOMMU for the transfer
- Instructs the source GPU HAL shard to perform the DMA
- Cleans up the IOMMU mapping after the transfer completes
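The check ordering can be sketched as follows (hypothetical `ShardCaps` representation, not the supervisor's capability table): every capability test must pass before any IOMMU state is touched, so a rejected request leaves no hardware side effects.

```rust
#[derive(Debug, PartialEq)]
enum DmaError { MissingPledge, MissingSrcRight, MissingDstRight }

/// Hypothetical flattened view of the relevant pledges and rights.
struct ShardCaps { peer_copy: bool, dma_src: bool, dma_dst: bool }

/// All checks run before any IOMMU or GPU state changes.
fn verify_peer_dma(src: &ShardCaps, dst: &ShardCaps) -> Result<(), DmaError> {
    if !(src.peer_copy && dst.peer_copy) {
        return Err(DmaError::MissingPledge);
    }
    if !src.dma_src {
        return Err(DmaError::MissingSrcRight);
    }
    if !dst.dma_dst {
        return Err(DmaError::MissingDstRight);
    }
    Ok(()) // only now: configure IOMMU, dispatch DMA, tear down the mapping
}
```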
A standard IPC protocol for chaining inference stages:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Client │────▶│ Shard A │────▶│ Shard B │────▶ Output
│ │ IPC │(layers │ DMA │(layers │
│ │ │ 0-15) │ │ 16-31) │
└─────────┘ └─────────┘ └─────────┘
Protocol messages:
| Message | Direction | Payload |
|---|---|---|
| `InferenceRequest` | Client → first shard | Input tokens, sampling params, session ID |
| `StageComplete` | Shard N → Shard N+1 | GPU DMA handle for intermediate activations |
| `InferenceResult` | Last shard → client | Output tokens, timing metadata |
| `PipelineSync` | Supervisor → all pipeline shards | Synchronization barrier for batch boundaries |
Capabilities can be sent over IPC channels, enabling dynamic delegation:
```rust
let msg = IpcMessage::new()
    .with_data(request_bytes)
    .with_capability(vram_read_cap); // Attach a VRAM read capability

channel_send(&endpoint, &msg)?;
```

The supervisor validates and transfers the capability during message dispatch. The sender can choose to:
- Copy: Both shards retain the capability.
- Move: The sender loses the capability; the receiver gains it.
- Restrict: The receiver gets a version with reduced rights.
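A sketch of the three modes over a bitmask rights representation (hypothetical types; the real capability table lives in the supervisor). Note that Restrict is a monotonic AND: the receiver can never gain a right the sender lacked.

```rust
const RIGHT_READ: u32 = 1 << 0;
const RIGHT_WRITE: u32 = 1 << 1;

/// Hypothetical capability: a resource ID plus a rights bitmask.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Capability { resource: u64, rights: u32 }

enum Transfer { Copy, Move, Restrict(u32) }

/// Returns the capability the receiver obtains, if the sender holds one.
fn transfer(sender: &mut Option<Capability>, mode: Transfer) -> Option<Capability> {
    let cap = (*sender)?;
    match mode {
        // Copy: both shards retain the capability.
        Transfer::Copy => Some(cap),
        // Move: the sender loses it; the receiver gains it.
        Transfer::Move => {
            *sender = None;
            Some(cap)
        }
        // Restrict: a copy whose rights are ANDed down, never up.
        Transfer::Restrict(mask) => Some(Capability {
            rights: cap.rights & mask,
            ..cap
        }),
    }
}
```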
Not yet implemented. This section describes the planned network architecture.
The network stack runs entirely in a dedicated network shard — no networking code in the supervisor.
┌─────────────────────────────────────────────┐
│ Network Shard │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌────────┐ │
│ │ NIC │ │ IP │ │ TCP/ │ │ TLS │ │
│ │Driver│──│Stack │──│ UDP │──│ │ │
│ └──────┘ └──────┘ └──────┘ └────────┘ │
│ ▲ │ │
│ │ MMIO + DMA │ IPC │
│ │ (via IOMMU) ▼ │
├───────┼──────────────────────────────────────┤
│ │ Supervisor (routing only) │
├───────┼──────────────────────────────────────┤
│ │ │
│ NIC Hardware │
└──────────────────────────────────────────────┘
Each shard has a declared network policy in its manifest:
| Policy | Meaning |
|---|---|
| `net.none` | No network access (default for inference shards) |
| `net.listen(port)` | Can accept incoming connections on a specific port |
| `net.connect(host, port)` | Can make outgoing connections to specific destinations |
| `net.unrestricted` | Full network access (network shard only) |
Air-gapped inference: By default, inference shards have net.none — they cannot initiate or receive network connections. Input/output flows through IPC channels to the application shard, which may have restricted network access. This prevents exfiltration of model weights or inference data.
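The policy table above can be sketched as a simple admission check in the network shard (hypothetical types and matching semantics; a real implementation would match against the manifest's parsed policy):

```rust
/// Hypothetical in-memory form of the manifest network policy.
enum NetPolicy {
    None,
    Listen(u16),
    Connect(&'static str, u16),
    Unrestricted,
}

/// Admission check for an outbound connection request.
fn may_connect(policy: &NetPolicy, host: &str, port: u16) -> bool {
    match policy {
        NetPolicy::Unrestricted => true,
        NetPolicy::Connect(h, p) => *h == host && *p == port,
        // net.none and net.listen(port) cannot initiate outbound connections.
        NetPolicy::None | NetPolicy::Listen(_) => false,
    }
}
```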
For high-performance multi-node inference:
- RDMA: The network shard can expose RDMA verbs to inference shards via IPC. RDMA buffer registration is mediated by the supervisor (IOMMU mapping).
- GPU-Direct: NIC-to-GPU DMA without CPU bounce buffers. Requires IOMMU configuration by the supervisor to allow the NIC to DMA directly to the GPU partition's VRAM.
Node A Node B
┌──────────┐ RDMA/RoCE ┌──────────┐
│ GPU VRAM │◄───────────────▶│ GPU VRAM │
│ (Shard A)│ NIC-to-GPU │ (Shard B)│
└──────────┘ DMA └──────────┘
Both RDMA and GPU-Direct are opt-in, require explicit capabilities, and are mediated by the supervisor's IOMMU configuration.
Currently, shards are statically compiled into the supervisor. A future manifest system will support dynamic deployment:
```toml
# /etc/shards/llama-70b.toml
[shard]
name = "llama-70b-inference"
binary = "/opt/shards/llama-inference.elf"
version = "1.0.0"

[resources]
cpu_cores = 4
memory_mib = 8192
gpu_device = "0000:03:00.0"
gpu_compute_units = 60    # Out of 120 total CUs
gpu_vram_mib = 40960      # 40 GiB VRAM

[security]
pledge_gpu = ["compute", "copy"]
unveil_vram = ["weights:ro", "activations:rw", "kv_cache:rw"]
network = "none"

[scheduling]
priority = "normal"
latency_slo_ms = 100      # Soft real-time target
cpu_affinity = [4, 5, 6, 7]

[model]
path = "/models/llama-70b-f16.coconut"
format = "coconut-model-v1"

[pipeline]
stage = 0                 # First stage in pipeline
next_shard = "llama-70b-stage1"
```

coconutOS provides two runtime APIs for building shards:
- Rust API (`coconut-rt`): Provides a `#![no_std]` entry point, syscall wrappers, serial I/O macros, and GPU primitives (`VramAllocator`, `CommandRing`, `matmul_4x4`). Used by the GPU HAL shard (`coconut-shard-gpu`).
- C API (`coconut.h`): Header-only syscall wrappers for freestanding C code. Used by `hello-c` and the llama-inference shard.
The following shows the planned high-level inference API (not yet implemented):
```rust
use coconut_runtime::{Shard, InferenceEngine, GpuPledge, InferenceRequest};

fn main() -> Result<(), coconut_runtime::Error> {
    // Initialize the shard runtime
    let shard = Shard::init()?;

    // Get GPU context (partition already assigned by supervisor)
    let gpu = shard.gpu_context()?;

    // Load model weights into VRAM
    let model = InferenceEngine::load_model(
        &gpu,
        "/models/llama-70b-f16.coconut",
    )?;

    // Apply security restrictions (cannot be undone)
    shard.pledge_gpu(&[GpuPledge::Compute, GpuPledge::Copy])?;
    shard.unveil_vram_for_model(&model)?;

    // Serve inference requests via IPC
    let endpoint = shard.ipc_endpoint("inference")?;
    loop {
        let request = endpoint.recv::<InferenceRequest>()?;
        let output = model.infer(&gpu, &request)?;
        endpoint.send(&request.reply_to, &output)?;
    }
}
```

coconutOS provides `include/coconut.h` — a header-only C interface with inline asm syscall wrappers:
```c
// coconut.h — header-only, no libc dependency

// Core
void     coconut_exit(uint64_t code);
uint64_t coconut_serial_write(const char *buf, uint64_t len);
uint64_t coconut_yield(void);
uint64_t coconut_mmap(uint64_t va_start, uint64_t num_pages);

// Filesystem
uint64_t coconut_fs_open(const char *path, uint64_t path_len);
uint64_t coconut_fs_read(uint64_t fd, void *buf, uint64_t max_len);
uint64_t coconut_fs_stat(uint64_t fd);
uint64_t coconut_fs_close(uint64_t fd);

// IPC
uint64_t coconut_channel_send(uint64_t ch, const void *buf, uint64_t len);
uint64_t coconut_channel_recv(uint64_t ch, void *buf, uint64_t max_len);

// Capabilities, GPU pledge/unveil, GPU DMA — also available
```

C shards are compiled with clang (freestanding x86-64), linked as flat binaries via `targets/shard.ld`, and embedded into the supervisor.
| Tool | Purpose | Status |
|---|---|---|
| `coconut-trace` | Per-shard kernel instrumentation: syscall count/cycles, context switches, wall time | Done (milestone 3.5) |
| `coconut-prof` | Host-side Python script that parses serial profiling output into a formatted report | Done (milestone 3.5) |
| `coconut-audit` | Query the audit log. Filter by shard, capability type, time range | Planned |
| `coconut-top` | Real-time dashboard showing shard CPU/GPU utilization, memory usage, IPC throughput | Planned |
| `coconut-shard` | CLI tool for shard management: create, start, stop, restart, inspect, logs | Planned |
coconut-trace adds lightweight counters to each ShardDescriptor: total syscalls dispatched, RDTSC cycles spent in syscall dispatch, context switch count, and wall-clock time (PIT ticks from first schedule to exit). The supervisor prints a summary table to serial before halt. Overhead is minimal (~20 cycles per syscall for two RDTSC reads).
coconut-prof (scripts/coconut-prof.py) reads serial output and produces a formatted report with per-shard stats, totals, and syscall distribution percentages. It also extracts shard lifecycle events (create, exit, blocked). No external dependencies — stdlib only.
```sh
# Pipe QEMU output directly
./scripts/qemu-run.sh 2>&1 | python3 scripts/coconut-prof.py

# Or parse a saved log
./scripts/qemu-run.sh 2>&1 | tee /tmp/boot.log
python3 scripts/coconut-prof.py /tmp/boot.log
```

GDB remote debugging and raw serial output remain available. See `debugging.md`.
coconutOS currently uses a minimal read-only ext2 filesystem backed by a 128 KiB ramdisk generated at compile time.
Implementation:
- ext2 revision 0, 1024-byte blocks
- Supports direct block pointers and single indirect blocks (files up to 268 KiB)
- Generated by `build.rs` — no external tools required
- Contains `hello.txt` (22 bytes) and `model.bin` (~87 KiB, deterministic transformer weights)
- Global open file table (`MAX_OPEN_FILES = 16`), per-shard fd ownership
Syscalls: SYS_FS_OPEN, SYS_FS_READ, SYS_FS_STAT, SYS_FS_CLOSE.
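The 268 KiB figure follows directly from the on-disk layout: 12 direct pointers plus one single-indirect block holding 1024 / 4 = 256 four-byte pointers, each addressing a 1024-byte block. A quick arithmetic check:

```rust
// Maximum file size with direct + single-indirect blocks, in KiB.
fn max_file_kib(block_size: u64) -> u64 {
    let direct = 12 * block_size;                 // 12 direct block pointers
    let indirect = (block_size / 4) * block_size; // 256 pointers x 1 KiB each
    (direct + indirect) / 1024
}
```

With 1024-byte blocks this yields 12 + 256 = 268 KiB, matching the limit stated above.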
A more capable filesystem is planned for production use:
Design goals:
- Crash-consistent (no fsck required after unclean shutdown)
- Read-optimized (model weights are read-heavy, write-rare)
- Large-file friendly (model files are 10-200+ GiB)
- Zero-copy model loading support
The critical performance path is loading model weights from NVMe into GPU VRAM:
┌──────────┐ mmap ┌──────────┐ DMA ┌──────────┐
│ NVMe │───────────▶│ CPU RAM │────────▶│ GPU VRAM │
│ (model │ page-fault │ (pinned │ GPU HAL │ (weights │
│ file) │ on demand │ pages) │ shard │ region) │
└──────────┘ └──────────┘ └──────────┘
- mmap: The filesystem shard maps the model file into a shared memory region (demand-paged).
- Pin: Pages are pinned as they are faulted in, preventing eviction.
- DMA: The GPU HAL shard DMAs directly from the pinned CPU pages to VRAM.
- Unpin: CPU pages are unpinned and freed after DMA completes. The model now lives entirely in VRAM.
For shard hot-restart, VRAM weight regions can be preserved across restarts (the supervisor does not zero the weights partition if the same shard is restarting with the same model).
Goal: Functional microkernel with CPU-only shards, no GPU support.
| Milestone | Deliverable | Status |
|---|---|---|
| 0.1 | Supervisor boots on x86-64 (QEMU), initializes memory, prints to serial | Done |
| 0.2 | Shard creation and destruction (single-threaded, no GPU) | Done |
| 0.3 | IPC channels between shards (synchronous message passing) | Done |
| 0.4 | Basic CPU scheduler (round-robin, preemption) | Done |
| 0.5 | Capability system (create, check, delegate, revoke) | Done |
| 0.6 | Minimal filesystem shard (read-only, ext2-compatible for bootstrapping) | Done |
Goal: GPU HAL shard for AMD RDNA3/CDNA3, basic compute dispatch.
| Milestone | Deliverable | Status |
|---|---|---|
| 1.1 | GPU PCIe enumeration and IOMMU domain setup | Done |
| 1.2 | GPU HAL shard: device init, memory alloc, command queue | Done |
| 1.3 | Basic compute dispatch (4×4 matrix multiply via command ring) | Done |
| 1.4 | GPU memory management with typed allocations | Done |
| 1.5 | VRAM zeroing on free, W^X enforcement | Done |
| 1.6 | Performance baseline: compute throughput measurement | Done |
Goal: Multiple inference shards with strong isolation on a single GPU.
| Milestone | Deliverable | Status |
|---|---|---|
| 2.1 | GPU partitioning (CU slicing, VRAM carving) | Done |
| 2.2 | Multiple GPU HAL shard instances (one per partition) | Done |
| 2.3 | Inter-shard GPU DMA (pipeline parallelism) | Done |
| 2.4 | `pledge_gpu` / `unveil_vram` enforcement | Done |
| 2.5 | GPU ASLR | Done |
| 2.6 | Side-channel isolation testing and hardening | Done |
Goal: End-to-end LLM inference on coconutOS.
| Milestone | Deliverable | Status |
|---|---|---|
| 3.1 | Inference runtime library (Rust API) | Done |
| 3.2 | C ABI / FFI layer | Done |
| 3.3 | Port llama2.c as proof-of-concept inference shard | Done |
| 3.4 | Inference pipeline protocol (multi-shard pipeline parallelism) | Done |
| 3.5 | coconut-trace, coconut-prof tooling | Done |
| 3.6 | Benchmark: Llama 70B inference latency vs. Linux/ROCm baseline | Planned |
Goal: Production hardening, additional GPU vendor support.
| Milestone | Deliverable | Status |
|---|---|---|
| 4.1 | Security audit of supervisor (external) | Planned |
| 4.2 | Fuzzing campaign (syzkaller-style for supervisor syscalls) | Planned |
| 4.3 | NVIDIA GPU HAL shard (Hopper/Blackwell) | Planned |
| 4.4 | Apple GPU HAL shard (M-series, ARM64 port) | Planned |
| 4.5 | Network shard with RDMA/GPU-Direct support | Planned |
| 4.6 | Formal verification of supervisor capability system (Verus or similar) | Planned |
Risk: Even compute-only GPU drivers are ~50K LoC with complex hardware interactions. User-mode drivers may have performance overhead due to IOMMU and context switching.
Mitigation: Start with the smallest possible driver surface. Benchmark early. Accept some performance loss for isolation. The IOMMU overhead on modern hardware (AMD-Vi) is typically <5% for large DMA transfers.
Risk: GPU side-channel attacks (cache timing, power analysis, memory access patterns) are an active research area. Hardware mitigations may be insufficient.
Mitigation: Design the architecture to support strong isolation, but acknowledge that side-channel resistance depends on hardware support. Partition-level cache flushing is a software mitigation, but hardware-enforced cache partitioning (AMD MIG-equivalent) is preferred. Track academic research and GPU vendor security roadmaps.
Risk: IOMMU granularity (typically 4 KiB pages) may be too coarse for fine-grained GPU memory isolation. Some GPU operations may bypass IOMMU (e.g., GPU-internal MMU, peer-to-peer over NVLink without IOMMU).
Mitigation: Use GPU hardware partitioning (CU slicing) as the primary isolation mechanism, with IOMMU as the backstop for DMA. Require GPU vendors to support IOMMU for all DMA paths. Refuse to support GPU interconnects that bypass IOMMU.
Risk: Supervisor-mediated IPC adds latency compared to Linux's direct function calls. For inference workloads that frequently alternate CPU and GPU phases, this could impact throughput.
Mitigation: Fast-path optimizations (register-based IPC for small messages, shared memory for bulk data). Target <5µs per inter-shard IPC, which is acceptable if shards batch their GPU work (typical GPU kernel is >100µs).
Risk: GPU command preemption is hardware-dependent and may not be reliable. A long-running GPU kernel in one shard could starve other shards.
Mitigation: Use cooperative preemption (shards yield at defined points in their compute kernels) as the primary mechanism. Rely on hardware preemption as a fallback. Set maximum GPU kernel execution time limits per shard.
Risk: Formally verifying even the small supervisor is a multi-year effort. Full functional correctness (seL4-level) may not be achievable in the initial timeline.
Mitigation: Start with property-level verification of critical invariants (capability safety, memory isolation) using tools like Verus (Rust) or Kani. Defer full functional correctness to a later phase. The small supervisor size (10K LoC) makes this more tractable than verifying a monolithic kernel.
Risk: A new OS with no ecosystem will struggle to attract users and contributors. Inference workloads depend on complex software stacks (PyTorch, CUDA, etc.) that won't be available on coconutOS.
Mitigation: Focus on the C ABI/FFI layer to enable porting of existing inference engines (llama.cpp, whisper.cpp). Don't try to replace CUDA — provide a compute-only API that inference engines can target. Position coconutOS as a deployment target (not a development environment) for security-critical inference.
| # | Name | Arguments | Description |
|---|---|---|---|
| 0 | `SYS_EXIT` | `a0`: exit code | Terminate shard |
| 1 | `SYS_SERIAL_WRITE` | `a0`: buffer ptr, `a1`: length | Write to serial console |
| 11 | `SYS_CAP_GRANT` | `a0`: handle, `a1`: target shard, `a2`: new rights | Grant capability copy |
| 12 | `SYS_CAP_REVOKE` | `a0`: handle | Revoke a capability |
| 13 | `SYS_CAP_RESTRICT` | `a0`: handle, `a1`: new rights | Restrict rights (monotonic AND) |
| 14 | `SYS_CAP_INSPECT` | `a0`: handle | Inspect capability |
| 21 | `SYS_CHANNEL_SEND` | `a0`: channel ID, `a1`: buffer ptr, `a2`: length | Send IPC message |
| 22 | `SYS_CHANNEL_RECV` | `a0`: channel ID, `a1`: buffer ptr, `a2`: max length | Receive IPC message (blocking) |
| 30 | `SYS_FS_OPEN` | `a0`: path ptr, `a1`: path length | Open file by path |
| 31 | `SYS_FS_READ` | `a0`: fd, `a1`: buffer ptr, `a2`: max length | Read from open file |
| 32 | `SYS_FS_STAT` | `a0`: fd | Get file size |
| 33 | `SYS_FS_CLOSE` | `a0`: fd | Close open file |
| 40 | `SYS_GPU_DMA` | `a0`: target partition, `a1`: src offset, `a2`: packed(dst<<32 \| len) | Inter-partition VRAM copy |
| 41 | `SYS_GPU_PLEDGE` | `a0`: bitmask of allowed categories | Monotonic syscall restriction |
| 42 | `SYS_GPU_UNVEIL` | `a0`: offset, `a1`: size | Lock VRAM range for DMA |
| 43 | `SYS_MMAP` | `a0`: va_start (page-aligned), `a1`: num_pages | Map data pages into shard |
| 62 | `SYS_YIELD` | — | Yield CPU time slice |
Entry: syscall instruction → syscall_entry (naked stub) → dispatch by RAX.
SFMASK clears IF on entry — no timer interrupts during syscall handling.
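The packed `a2` argument of `SYS_GPU_DMA` follows a simple convention: destination offset in the high 32 bits, length in the low 32 bits. A sketch of the packing helpers (names are illustrative):

```rust
/// Pack a destination offset and length into the single a2 register,
/// following the dst<<32 | len convention from the syscall table.
fn pack_dma_arg(dst_offset: u32, len: u32) -> u64 {
    ((dst_offset as u64) << 32) | len as u64
}

/// Inverse: recover (dst_offset, len) from a2 on the supervisor side.
fn unpack_dma_arg(a2: u64) -> (u32, u32) {
    ((a2 >> 32) as u32, (a2 & 0xFFFF_FFFF) as u32)
}
```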
Minimum (development/testing):
| Component | Requirement |
|---|---|
| CPU | x86-64 with IOMMU support (AMD-Vi or Intel VT-d) |
| RAM | 16 GiB |
| GPU | AMD RDNA3 (e.g., RX 7900 XTX) or CDNA3 (MI300) |
| Storage | NVMe SSD, 500 GiB |
| Firmware | UEFI with Secure Boot support |
Recommended (production inference):
| Component | Requirement |
|---|---|
| CPU | AMD EPYC (Zen 4+) with AMD-Vi |
| RAM | 128+ GiB DDR5 |
| GPU | 2-8x AMD MI300X (192 GiB HBM3 each) |
| Storage | NVMe SSD, 2+ TiB |
| Network | 100 GbE with RDMA (RoCEv2) |
| Firmware | UEFI with Secure Boot, TPM 2.0 |
| Feature | Linux | FreeBSD | OpenBSD | Fuchsia | seL4 | coconutOS |
|---|---|---|---|---|---|---|
| Microkernel | No | No | No | Yes | Yes | Yes |
| GPU-native isolation | No | No | No | No | No | Yes |
| Capability-based security | Partial | Capsicum | No | Yes | Yes | Yes |
| pledge/unveil | No | No | Yes | No | No | Yes (GPU-extended) |
| W^X (CPU) | Partial | Partial | Yes | No | N/A | Yes |
| W^X (GPU) | No | No | N/A | N/A | N/A | Yes |
| GPU ASLR | No | No | N/A | N/A | N/A | Yes |
| GPU memory zeroing | No | No | N/A | N/A | N/A | Yes |
| Rust kernel | No | No | No | Partial | No | Yes |
| Formal verification | No | No | No | No | Yes | Planned (milestone 4.6) |
| TCB size | ~28M LoC | ~10M LoC | ~1M LoC | ~200K LoC | ~10K LoC | <10K LoC |
| Term | Definition |
|---|---|
| Shard | The fundamental unit of isolation in coconutOS. Combines a CPU address space, GPU partition, capability set, and threads. |
| Supervisor | The microkernel. The only code running in ring 0. Manages shards, capabilities, IPC, and scheduling. |
| HAL | Hardware Abstraction Layer. Defines Rust traits for GPU operations. Vendor-specific implementations run in user-mode shards. |
| Capability | An unforgeable token granting specific rights to a specific resource. |
| Pledge | A monotonic restriction on permitted operations (inspired by OpenBSD pledge(2)). |
| Unveil | A monotonic restriction on visible resources (inspired by OpenBSD unveil(2)). |
| CU | Compute Unit. The basic unit of GPU compute hardware (AMD terminology). Equivalent to SM (NVIDIA). |
| VRAM | Video RAM. GPU-local memory (HBM or GDDR). |
| IOMMU | Input-Output Memory Management Unit. Hardware that restricts DMA access by devices. |
| DMA | Direct Memory Access. Hardware-level memory transfer without CPU involvement. |
| IPC | Inter-Process Communication. Message passing or shared memory between shards. |
| W^X | Write XOR Execute. A memory policy where a page can be writable or executable, but never both. |
| ASLR | Address Space Layout Randomization. Randomizing memory layout to hinder exploitation. |
| SLO | Service Level Objective. A target for latency or throughput. |
| TCB | Trusted Computing Base. The set of components that must be correct for system security to hold. |
| KV-Cache | Key-Value Cache. Cached attention states in transformer inference, growing with sequence length. |
| Pipeline parallelism | Splitting a model across multiple GPUs/shards by layer groups. |
- Klein, G., et al. "seL4: Formal Verification of an OS Kernel." SOSP 2009.
- de Raadt, T. "pledge() — a new mitigation mechanism." OpenBSD.
- de Raadt, T. "unveil() — restrict filesystem view." OpenBSD.
- Naghibijouybari, H., et al. "Rendered Insecure: GPU Side Channel Attacks are Practical." ACM CCS 2018.
- AMD. "RDNA 3 Instruction Set Architecture Reference Guide."
- AMD. "AMD Instinct MI300 Series Accelerator ISA."
- Heiser, G. "The seL4 Microkernel — An Introduction." CSIRO/Data61.
- Zhu, Y., et al. "Understanding the Security of GPU Computing." CCS 2017.
- Asahi Linux Project. "GPU Reverse Engineering Documentation."
- The Rust Programming Language. "The Rustonomicon — Unsafe Rust."