# Cudara

Lightweight CUDA inference server with an Ollama-compatible API.
Cudara is a self-hosted inference server for HuggingFace models. Run LLMs, Vision-Language Models, Embedding models, and Speech Recognition models on your GPU with an Ollama-compatible API.
- **Ollama-Compatible API** - Works with existing Ollama clients
- **Vision-Language Models** - Image understanding, OCR, visual Q&A
- **Text Generation** - Chat and completion with any HuggingFace LLM
- **Embeddings** - Vector embeddings for RAG and semantic search
- **Speech Recognition** - Transcribe audio with Whisper
- **Quantization** - Automatic 4-bit quantization via BitsAndBytes
```bash
# GPU (CUDA image)
docker run --gpus all -p 8000:8000 ghcr.io/juliog922/cudara:latest

# GPU (with persistent models)
docker run --gpus all -p 8000:8000 \
  -v cudara_models:/app/models \
  ghcr.io/juliog922/cudara:latest

# Auto-download models on startup (and keep /health unhealthy until they are ready)
docker run --gpus all -p 8000:8000 \
  -e CUDARA_DEFAULT_MODELS="Qwen/Qwen2.5-3B-Instruct,openai/whisper-small" \
  -e HF_TOKEN="..." \
  -v cudara_models:/app/models \
  ghcr.io/juliog922/cudara:latest
```

This repo publishes a single CUDA image: `ghcr.io/juliog922/cudara:latest` (alias `cuda` on the default branch).
Run it with the NVIDIA runtime (for Docker: `--gpus all`).

GitHub Actions runners typically don't have NVIDIA GPUs, but that's fine: the CUDA image can be built without a GPU. A GPU is only required to run the container.
```bash
docker build -t cudara:cuda .
docker run --gpus all -p 8000:8000 cudara:cuda
```

```bash
# Clone and install
git clone https://github.com/juliog922/cudara
cd cudara
uv sync

# Run server
uv run cudara serve
```

```bash
pip install cudara-client
```

```python
from cudara_client import CudaraClient

client = CudaraClient("http://localhost:8000")
client.pull("Qwen/Qwen2.5-3B-Instruct")
response = client.chat("Qwen/Qwen2.5-3B-Instruct", "Hello!")
print(response.content)
```

```
cudara/
├── src/cudara/
│   ├── __init__.py
│   ├── main.py              # FastAPI server
│   ├── cli.py               # CLI commands
│   ├── quantization.py      # BitsAndBytes quantization
│   └── image_processing.py  # VRAM-aware image processing
├── tests/
│   ├── test_unit.py         # Unit tests
│   └── test_integration.py  # Integration tests
├── .github/workflows/
│   ├── ci.yml               # Test on PR
│   └── docker-publish.yml   # Build & push to GHCR
├── models.json              # Model configurations
├── pyproject.toml           # Dependencies
├── Dockerfile               # Docker build
└── README.md
```
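The client calls above can be composed into small helpers of your own. As an illustrative sketch (the `chat_batch` function below is not part of `cudara-client`), several prompts can be run through one model like this:

```python
def chat_batch(client, model: str, prompts: list[str]) -> list[str]:
    """Run several prompts through one model and collect the text replies.

    `client` can be any object with a .chat(model, prompt) method whose
    return value has a .content attribute, matching the CudaraClient
    usage shown above.
    """
    return [client.chat(model, p).content for p in prompts]
```

Because the helper only assumes the `.chat()` interface, it can be unit-tested with a stub client before pointing it at a live server.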
Cudara includes a CLI similar to Ollama:
```bash
# Start server
cudara serve --host 0.0.0.0 --port 8000

# List models
cudara list

# Pull a model
cudara pull Qwen/Qwen2.5-3B-Instruct

# Run inference
cudara run Qwen/Qwen2.5-3B-Instruct "Hello!"

# Interactive chat
cudara chat Qwen/Qwen2.5-3B-Instruct

# Show server status
cudara ps

# Delete model
cudara rm Qwen/Qwen2.5-3B-Instruct
```

| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | List available models |
| `/api/show` | POST | Show model details |
| `/api/pull` | POST | Download a model |
| `/api/delete` | DELETE | Delete a model |
| `/api/generate` | POST | Generate text |
| `/api/chat` | POST | Chat completion |
| `/api/embeddings` | POST | Generate embeddings |
| Endpoint | Method | Description |
|---|---|---|
| `/api/transcribe` | POST | Transcribe audio |
| `/api/vision` | POST | Process image |
| `/health` | GET | Health check |
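Since `/health` reports unhealthy until any requested default models finish downloading, startup scripts may want to poll it before sending traffic. A generic polling sketch (the `check` callable would wrap a GET on `/health`; nothing here is Cudara API):

```python
import time

def wait_until_healthy(check, timeout=300.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll check() until it returns True or `timeout` seconds elapse.

    check: zero-argument callable, e.g. one that GETs /health and
    returns True on HTTP 200. clock and sleep are injectable so the
    loop can be tested without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the helper deterministic in tests; in production the defaults just use real time.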
```bash
# Chat
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

# Embeddings
curl -X POST http://localhost:8000/api/embeddings \
  -d '{"model": "sentence-transformers/all-MiniLM-L6-v2", "input": "Hello"}'

# Vision
curl -X POST http://localhost:8000/api/vision \
  -F "model=unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit" \
  -F "prompt=What is this?" -F "file=@image.jpg"

# Transcribe
curl -X POST http://localhost:8000/api/transcribe \
  -F "model=openai/whisper-small" -F "file=@audio.mp3"
```

Edit `models.json` to add HuggingFace models:
```json
"your-org/your-model": {
  "description": "Your model",
  "task": "text-generation",
  "architecture": "AutoModelForCausalLM",
  "dtype": "bfloat16",
  "quantization": {
    "enabled": true,
    "prequantize": true,
    "method": "bitsandbytes",
    "bits": 4,
    "category": "text_llm_medium"
  },
  "system_prompt": "You are helpful.",
  "generation_defaults": {"max_new_tokens": 512, "temperature": 0.7}
}
```

For models that ship already quantized, disable runtime quantization:

```json
"unsloth/Model-bnb-4bit": {
  "description": "Pre-quantized model",
  "task": "text-generation",
  "architecture": "AutoModelForCausalLM",
  "dtype": "bfloat16",
  "quantization": {"enabled": false, "notes": "Pre-quantized"}
}
```

```bash
# All tests
uv run pytest

# Unit tests only
uv run pytest tests/test_unit.py -v

# Integration tests
uv run pytest tests/test_integration.py -v -m integration

# With coverage
uv run pytest --cov=src/cudara --cov-report=html
```

```bash
uv run ruff check src/ tests/
uv run ruff format src/ tests/
```

```bash
docker build -t cudara:cuda .
docker run --gpus all -p 8000:8000 cudara:cuda
```

| Variable | Description | Default |
|---|---|---|
| `HF_TOKEN` | HuggingFace token for gated models | - |
| `CUDARA_DEFAULT_MODELS` | Comma-separated model IDs from `models.json` to auto-download on startup; disabled if unset or empty. `/health` stays unhealthy until all requested models are ready | (unset) |
| `CUDA_VISIBLE_DEVICES` | GPU selection | all |
- NVIDIA GPU with 8GB+ VRAM
- CUDA 12.1+
- Python 3.11+
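The 8 GB figure can be sanity-checked with a rough rule of thumb (my approximation, not a Cudara guarantee): weight memory is roughly parameter count times bits per weight, ignoring activations, KV cache, and framework overhead. A 3B-parameter model at 4-bit needs about 1.5 GB for the weights alone:

```python
def approx_weight_memory_gb(params_billion: float, bits: int) -> float:
    """Rough weight-only memory estimate in decimal GB:
    params * bits / 8 bytes. Real VRAM usage is higher because of
    activations, KV cache, and framework overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

approx_weight_memory_gb(3, 4)   # 3B model at 4-bit: 1.5 GB of weights
approx_weight_memory_gb(7, 16)  # 7B model at bf16: 14.0 GB of weights
```

This is why 4-bit quantization lets mid-size models fit comfortably on an 8 GB card while a bf16 7B model does not.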
Install the Python client:

```bash
pip install cudara-client
```

See `cudara-client` for documentation.
MIT