# Cudara

Lightweight CUDA inference server with an Ollama-compatible API.
Cudara is a self-hosted inference server for HuggingFace models. Run LLMs, Vision-Language Models, Embedding models, and Speech Recognition models on your GPU with an Ollama-compatible API.
- **Ollama-Compatible API** - Works with existing Ollama clients
- **Vision-Language Models** - Image understanding, OCR, visual Q&A
- **Text Generation** - Chat and completion with any HuggingFace LLM
- **Embeddings** - Vector embeddings for RAG and semantic search
- **Speech Recognition** - Transcribe audio with Whisper
- **Quantization** - Automatic 4-bit quantization via BitsAndBytes
```bash
# GPU (CUDA image)
docker run --gpus all -p 8000:8000 ghcr.io/juliog922/cudara:latest

# GPU (with persistent models)
docker run --gpus all -p 8000:8000 \
  -v cudara_models:/app/models \
  ghcr.io/juliog922/cudara:latest

# Auto-download models on startup (and keep /health unhealthy until they are ready)
docker run --gpus all -p 8000:8000 \
  -e CUDARA_DEFAULT_MODELS="Qwen/Qwen2.5-3B-Instruct,openai/whisper-small" \
  -e HF_TOKEN="..." \
  -v cudara_models:/app/models \
  ghcr.io/juliog922/cudara:latest
```

This repo publishes a single CUDA image: `ghcr.io/juliog922/cudara:latest` (alias `cuda` on the default branch).
Run it with the NVIDIA runtime (for Docker: `--gpus all`).

GitHub Actions runners typically don't have NVIDIA GPUs, but that's fine: the CUDA image can be built without a GPU. A GPU is only required to run the container.
```bash
docker build -t cudara:cuda .
docker run --gpus all -p 8000:8000 cudara:cuda
```

```bash
# Clone and install
git clone https://github.com/juliog922/cudara
cd cudara
uv sync

# Run server
uv run cudara serve
```

```bash
pip install cudara-client
```

```python
from cudara_client import CudaraClient

client = CudaraClient("http://localhost:8000")
client.pull("Qwen/Qwen2.5-3B-Instruct")
response = client.chat("Qwen/Qwen2.5-3B-Instruct", "Hello!")
print(response.content)
```

```
cudara/
├── src/cudara/
│   ├── __init__.py
│   ├── main.py              # FastAPI server
│   ├── cli.py               # CLI commands
│   ├── quantization.py      # BitsAndBytes quantization
│   └── image_processing.py  # VRAM-aware image processing
├── tests/
│   ├── test_unit.py         # Unit tests
│   └── test_integration.py  # Integration tests
├── .github/workflows/
│   ├── ci.yml               # Test on PR
│   └── docker-publish.yml   # Build & push to GHCR
├── models.json              # Model configurations
├── pyproject.toml           # Dependencies
├── Dockerfile               # Docker build
└── README.md
```
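The client calls above can be composed into small helpers of your own. As an illustrative sketch (the `chat_batch` function below is not part of `cudara-client`), several prompts can be run through one model like this:

```python
def chat_batch(client, model: str, prompts: list[str]) -> list[str]:
    """Run several prompts through one model and collect the text replies.

    `client` can be any object with a .chat(model, prompt) method whose
    return value has a .content attribute, matching the CudaraClient
    usage shown above.
    """
    return [client.chat(model, p).content for p in prompts]
```

Because the helper only assumes the `.chat()` interface, it can be unit-tested with a stub client before pointing it at a live server.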
Cudara includes a CLI similar to Ollama:
```bash
# Start server
cudara serve --host 0.0.0.0 --port 8000

# List models
cudara list

# Pull a model
cudara pull Qwen/Qwen2.5-3B-Instruct

# Run inference
cudara run Qwen/Qwen2.5-3B-Instruct "Hello!"

# Interactive chat
cudara chat Qwen/Qwen2.5-3B-Instruct

# Show server status
cudara ps

# Delete model
cudara rm Qwen/Qwen2.5-3B-Instruct
```

| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Download a model |
| `/api/delete` | DELETE | Delete a model |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completion |
| `/api/embeddings` | POST | Generate embeddings |
| Endpoint | Method | Description |
|---|---|---|
| `/api/transcribe` | POST | Transcribe audio |
| `/api/vision` | POST | Process image |
| `/health` | GET | Health check |
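Since `/health` reports unhealthy until any requested default models finish downloading, startup scripts may want to poll it before sending traffic. A generic polling sketch (the `check` callable would wrap a GET on `/health`; nothing here is Cudara API):

```python
import time

def wait_until_healthy(check, timeout=300.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll check() until it returns True or `timeout` seconds elapse.

    check: zero-argument callable, e.g. one that GETs /health and
    returns True on HTTP 200. clock and sleep are injectable so the
    loop can be tested without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the helper deterministic in tests; in production the defaults just use real time.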
```bash
# Chat
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

# Embeddings
curl -X POST http://localhost:8000/api/embeddings \
  -d '{"model": "sentence-transformers/all-MiniLM-L6-v2", "input": "Hello"}'

# Vision
curl -X POST http://localhost:8000/api/vision \
  -F "model=unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit" \
  -F "prompt=What is this?" -F "file=@image.jpg"

# Transcribe
curl -X POST http://localhost:8000/api/transcribe \
  -F "model=openai/whisper-small" -F "file=@audio.mp3"
```

Edit `models.json` to add HuggingFace models:
```json
"your-org/your-model": {
  "description": "Your model",
  "task": "text-generation",
  "architecture": "AutoModelForCausalLM",
  "dtype": "bfloat16",
  "quantization": {
    "enabled": true,
    "prequantize": true,
    "method": "bitsandbytes",
    "bits": 4,
    "category": "text_llm_medium"
  },
  "system_prompt": "You are helpful.",
  "generation_defaults": {"max_new_tokens": 512, "temperature": 0.7}
}
```

For models that ship already quantized, disable runtime quantization:

```json
"unsloth/Model-bnb-4bit": {
  "description": "Pre-quantized model",
  "task": "text-generation",
  "architecture": "AutoModelForCausalLM",
  "dtype": "bfloat16",
  "quantization": {"enabled": false, "notes": "Pre-quantized"}
}
```

```bash
# All tests
uv run pytest

# Unit tests only
uv run pytest tests/test_unit.py -v

# Integration tests
uv run pytest tests/test_integration.py -v -m integration

# With coverage
uv run pytest --cov=src/cudara --cov-report=html
```

```bash
uv run ruff check src/ tests/
uv run ruff format src/ tests/
```

```bash
docker build -t cudara:cuda .
docker run --gpus all -p 8000:8000 cudara:cuda
```

| Variable | Description | Default |
|---|---|---|
| `HF_TOKEN` | HuggingFace token for gated models | - |
| `CUDARA_DEFAULT_MODELS` | Comma-separated model IDs from `models.json` to auto-download on startup; disabled if unset or empty. `/health` stays unhealthy until all requested models are ready | (unset) |
| `CUDA_VISIBLE_DEVICES` | GPU selection | all |
- NVIDIA GPU with 8GB+ VRAM
- CUDA 12.1+
- Python 3.11+
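The 8 GB figure can be sanity-checked with a rough rule of thumb (my approximation, not a Cudara guarantee): weight memory is roughly parameter count times bits per weight, ignoring activations, KV cache, and framework overhead. A 3B-parameter model at 4-bit needs about 1.5 GB for the weights alone:

```python
def approx_weight_memory_gb(params_billion: float, bits: int) -> float:
    """Rough weight-only memory estimate in decimal GB:
    params * bits / 8 bytes. Real VRAM usage is higher because of
    activations, KV cache, and framework overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

approx_weight_memory_gb(3, 4)   # 3B model at 4-bit: 1.5 GB of weights
approx_weight_memory_gb(7, 16)  # 7B model at bf16: 14.0 GB of weights
```

This is why 4-bit quantization lets mid-size models fit comfortably on an 8 GB card while a bf16 7B model does not.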
Install the Python client:

```bash
pip install cudara-client
```

See `cudara-client` for documentation.
MIT