TypeTerrors/nated.io
🚀 DGX Spark GPU Development Environment

CUDA 13 · NVIDIA GB10 · PyTorch 2.9 NGC · Code-Server Web IDE

“Full GPU dev environment inside Docker with VS Code in a browser”


📌 Overview

This project sets up a GPU-accelerated development environment for the NVIDIA DGX Spark, using:

  • NVIDIA GB10 GPU (Blackwell architecture, sm_121)
  • CUDA 13.0
  • PyTorch 2.9 nightly (NGC build, GB10-compatible)
  • VS Code Web (code-server) accessible from any browser
  • Go 1.25 & Node 22 available inside the same container
  • HuggingFace model caching
  • Large shared memory (32 GB) for model runtime stability

This environment is required because standard PyTorch wheels do NOT support GB10, and many Python packages break under Python 3.12 unless patched.

What you now have is a working base environment with:

✔ GPU recognized
✔ CUDA recognized
✔ PyTorch + Torchvision functional
✔ Code-Server accessible via browser
✔ RealESRGAN partially working (missing model file fixed manually)
✔ Diffusers, Transformers installed
✔ Go + Node installed
✔ Stable development workflow preserved outside the container


🧠 Why All This Was Necessary

1. ❌ The DGX Spark GPU (GB10) is not supported by normal PyTorch

The GB10 uses the Blackwell architecture (sm_121), which is newer than any architecture the public PyTorch wheels are built for.

Public builds only support up to sm_90 / sm_120.

This creates errors like:

AttributeError: module 'torch' has no attribute 'float8_e4m3fn'
CUDA available: False

And prevents torchvision kernels from loading.
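You can compare a device's compute capability against the architectures a wheel was compiled for. The sketch below uses a hypothetical helper, `wheel_supports()`, with the arch-list format that `torch.cuda.get_arch_list()` returns; the commented lines show how you might wire it to a real device inside the container.

```python
def wheel_supports(capability, arch_list):
    """capability: (major, minor) tuple, e.g. (12, 1) for sm_121.
    arch_list: strings like 'sm_90', as torch.cuda.get_arch_list() returns them.
    Hypothetical helper for illustration only."""
    target = capability[0] * 10 + capability[1]
    supported = set()
    for arch in arch_list:
        if arch.startswith("sm_"):
            supported.add(int(arch.split("_")[1]))
    return target in supported

# Public wheels ship kernels up to sm_90 / sm_120, so the GB10's sm_121
# is not covered:
print(wheel_supports((12, 1), ["sm_80", "sm_90", "sm_120"]))  # False

# Inside the NGC container you could check the real device:
#   import torch
#   cap = torch.cuda.get_device_capability(0)
#   print(wheel_supports(cap, torch.cuda.get_arch_list()))
```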


2. ✔ NVIDIA NGC PyTorch 25.xx containers are compatible

The official NGC builds contain:

  • CUDA 13
  • Custom PyTorch build with sm_121 kernels
  • Torchvision built against the same CUDA
  • Proper cuDNN, NCCL, and Blackwell support

These containers are the ONLY way to get PyTorch running correctly on DGX Spark today.

We use:

nvcr.io/nvidia/pytorch:25.09-py3

Later versions (25.11) also work.


3. ❌ Python environment conflicts prevented pip installs

Ubuntu 24.04 uses PEP 668 externally-managed Python, so:

pip install --upgrade pip

throws:

error: externally-managed-environment

Solution: use the Python inside the NGC container, which is not distro-managed.
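PEP 668 marks a distro-managed interpreter with an `EXTERNALLY-MANAGED` file in the standard-library directory, so you can tell the two Pythons apart programmatically. A minimal sketch (the helper name is made up):

```python
import sysconfig
from pathlib import Path

def is_externally_managed():
    """True if this interpreter carries the PEP 668 marker file,
    as Ubuntu 24.04's system Python does."""
    marker = Path(sysconfig.get_path("stdlib")) / "EXTERNALLY-MANAGED"
    return marker.exists()

# On the host's Ubuntu 24.04 python3 this should report True;
# the NGC container's Python lacks the marker, so pip installs
# work there without --break-system-packages.
print(is_externally_managed())
```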


4. ✔ RealESRGAN + basicsr needed patching

basicsr imports a module that newer torchvision removed:

torchvision.transforms.functional_tensor

We created a patch shim:

import sys, types
import torchvision.transforms.functional as F

# Recreate the removed module under its old name and point the one
# symbol basicsr uses at the modern torchvision API.
shim = types.ModuleType("torchvision.transforms.functional_tensor")
shim.rgb_to_grayscale = F.rgb_to_grayscale

# Register the shim so later `import torchvision.transforms.functional_tensor`
# resolves from sys.modules instead of searching the filesystem.
sys.modules["torchvision.transforms.functional_tensor"] = shim
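The shim works because Python's import machinery consults `sys.modules` before searching the filesystem. A self-contained illustration of the same pattern, using fabricated module names (`fakepkg`, `fakepkg.legacy`) purely for demonstration:

```python
import sys, types

# Fabricated names for illustration; the real shim targets
# torchvision.transforms.functional_tensor.
parent = types.ModuleType("fakepkg")
legacy = types.ModuleType("fakepkg.legacy")
legacy.greet = lambda: "hello from the shim"

# Register both modules, and link the child onto the parent so
# attribute access (fakepkg.legacy) also resolves.
parent.legacy = legacy
sys.modules["fakepkg"] = parent
sys.modules["fakepkg.legacy"] = legacy

import fakepkg.legacy  # satisfied from sys.modules; no file on disk needed
print(fakepkg.legacy.greet())  # hello from the shim
```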

🏗️ Environment Architecture

DGX Spark Hardware
│
├── Docker (rootless / NVIDIA runtime)
│   ├── NVIDIA NGC PyTorch 25.xx (CUDA 13, sm_121)
│   ├── Code-Server (VSCode Web)
│   ├── Go 1.25
│   ├── NodeJS 22
│   ├── Python packages (diffusers, transformers, realesrgan, opencv…)
│   └── Shared memory 32GB
│
└── Host volumes
    ├── ./workspace                 → project files
    ├── ./data/code-server-config   → VS Code settings
    ├── ./data/code-server-data     → VS Code extensions
    └── ~/.cache/huggingface        → model cache

📦 docker-compose.yml (current working version)

services:
  server:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: dgx-codeserver
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - PASSWORD=${PASSWORD:-changeme123}
    working_dir: /workspace
    ports:
      - "8080:8080"
    volumes:
      - ./workspace:/workspace
      - $HOME/.cache/huggingface:/root/.cache/huggingface
      - ./data/code-server-data:/root/.local/share/code-server
      - ./data/code-server-config:/root/.config/code-server
    shm_size: "32g"
    ipc: host
    ulimits:
      memlock:
        soft: -1
        hard: -1
      stack:
        soft: 67108864
        hard: 67108864
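After bringing the stack up, you can confirm from inside the container that the 32 GB `shm_size` actually took effect, since PyTorch DataLoader workers fail in opaque ways when `/dev/shm` is too small. A small stdlib-only sketch (the helper name is made up):

```python
import shutil
from pathlib import Path

def shm_gib(path="/dev/shm"):
    """Total size of the shared-memory mount in GiB, or None if the
    path does not exist. Hypothetical helper for illustration."""
    p = Path(path)
    if not p.exists():
        return None
    return shutil.disk_usage(p).total / 2**30

# Inside this container the result should be about 32.0,
# matching shm_size in docker-compose.yml.
print(shm_gib())
```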

🐳 Dockerfile (current working version)

FROM golang:1.25-bookworm AS go
FROM node:22-bookworm AS node

FROM nvcr.io/nvidia/pytorch:25.09-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    curl git nano ca-certificates \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

COPY --from=go /usr/local/go /usr/local/go
ENV PATH="/usr/local/go/bin:${PATH}"

COPY --from=node /usr/local/bin/node /usr/local/bin/node
COPY --from=node /usr/local/lib/node_modules /usr/local/lib/node_modules

RUN ln -s /usr/local/lib/node_modules/npm/bin/npm-cli.js /usr/local/bin/npm && \
    ln -s /usr/local/lib/node_modules/npm/bin/npx-cli.js /usr/local/bin/npx

RUN curl -fsSL https://code-server.dev/install.sh | sh

RUN mkdir -p /root/.config/code-server && \
    printf "bind-addr: 0.0.0.0:8080\nauth: password\ncert: false\n" \
      > /root/.config/code-server/config.yaml

EXPOSE 8080

CMD ["bash", "-lc", "code-server /workspace"]

🖥️ Accessing VS Code Web

Open:

http://<DGX-SPARK-IP>:8080

Password is stored in:

data/code-server-config/config.yaml

or via .env:

PASSWORD=mysecret

🔥 GPU Verification

Inside the container run:

python - << 'EOF'
import torch
print("Torch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
EOF

Expected correct output:

Torch: 2.9.0a0+...nv25.xx
CUDA: 13.0
CUDA available: True
Device: NVIDIA GB10

📂 Persistent VS Code Data

Your VS Code Web environment is preserved across container rebuilds because you mounted:

✔ Extensions

./data/code-server-data:/root/.local/share/code-server

✔ Settings

./data/code-server-config:/root/.config/code-server

✔ Workspace

./workspace:/workspace

✔ HuggingFace models

~/.cache/huggingface:/root/.cache/huggingface


🛠️ RealESRGAN: Remaining Issues

Your RealESRGAN workflow is partially working.

✔ Fixed:

  • torchvision incompatibility (shim)
  • cv2 import
  • environment install
  • GPU detection

❌ Still needed:

Download ESRGAN model weights:

mkdir -p models
wget -O models/RealESRGAN_x4plus.pth \
  https://github.com/xinntao/Real-ESRGAN/releases/download/v0.3.0/RealESRGAN_x4plus.pth

Then update your Python script path:

ESRGAN_MODEL_PATH = "models/RealESRGAN_x4plus.pth"
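Before wiring the path into the pipeline, a quick sanity check avoids the confusing downstream errors a missing or truncated download causes. A minimal sketch; `checkpoint_ok()` is a made-up helper, and the size threshold is a rough guess (the x4plus checkpoint is on the order of tens of megabytes):

```python
from pathlib import Path

ESRGAN_MODEL_PATH = "models/RealESRGAN_x4plus.pth"

def checkpoint_ok(path, min_bytes=1_000_000):
    """Rough sanity check (hypothetical helper): the file exists and
    is large enough not to be an interrupted download."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes

if checkpoint_ok(ESRGAN_MODEL_PATH):
    print("weights look ok")
else:
    print(f"missing or incomplete weights at {ESRGAN_MODEL_PATH}")
```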

⚠️ Known Limitations (Partial Working State)

| Feature | Status | Notes |
| --- | --- | --- |
| PyTorch | ✔ Working | NGC build required |
| Torchvision | ✔ Working | shim needed for realesrgan |
| Code-server | ✔ Working | password via config.yaml |
| GPU compute | ✔ Working | sm_121 kernels supported |
| RealESRGAN | ⚠ Partial | missing model file |
| Diffusers | ✔ Works | CUDA 13 fully supported |
| OpenCV | ✔ Works | required libGL installed |
| Go 1.25 | ✔ Works | from donor image |
| Node 22 | ✔ Works | from donor image |

📘 Summary

You now have a fully functional GPU dev environment on DGX Spark with:

  • Correct CUDA + PyTorch support for GB10
  • Code-Server for browser-based IDE access
  • Reproducible Docker setup
  • Persistent configuration + extensions
  • Multi-language runtime (Python, Go, Node)
  • Ability to run heavy ML workloads in a controlled container

The remaining work is just model paths + pipeline polishing.