Cudara

Lightweight CUDA Inference Server with Ollama-Compatible API


Cudara is a self-hosted inference server for HuggingFace models. Run LLMs, Vision-Language Models, Embedding models, and Speech Recognition models on your GPU with an Ollama-compatible API.

Features

  • πŸ¦™ Ollama-Compatible API - Works with existing Ollama clients
  • πŸ–ΌοΈ Vision-Language Models - Image understanding, OCR, visual Q&A
  • πŸ’¬ Text Generation - Chat and completion with any HuggingFace LLM
  • πŸ“Š Embeddings - Vector embeddings for RAG and semantic search
  • 🎀 Speech Recognition - Transcribe audio with Whisper
  • ⚑ Quantization - Automatic 4-bit quantization via BitsAndBytes

Quick Start

Using Docker (Recommended)

# GPU (CUDA image)
docker run --gpus all -p 8000:8000 ghcr.io/juliog922/cudara:latest

# GPU (with persistent models)
docker run --gpus all -p 8000:8000 \
  -v cudara_models:/app/models \
  ghcr.io/juliog922/cudara:latest

# Auto-download models on startup (and keep /health unhealthy until they are ready)
docker run --gpus all -p 8000:8000 \
  -e CUDARA_DEFAULT_MODELS="Qwen/Qwen2.5-3B-Instruct,openai/whisper-small" \
  -e HF_TOKEN="..." \
  -v cudara_models:/app/models \
  ghcr.io/juliog922/cudara:latest

Docker images and tags

This repo publishes a single CUDA image:

  • ghcr.io/juliog922/cudara:latest (alias :cuda on the default branch)

Run it with the NVIDIA runtime (for Docker: --gpus all).

Building the Docker image (CUDA)

GitHub Actions runners typically don’t have NVIDIA GPUs, but that’s fine: the CUDA image can be built without a GPU. A GPU is only required to run the container.

docker build -t cudara:cuda .
docker run --gpus all -p 8000:8000 cudara:cuda

Using uv (Development)

# Clone and install
git clone https://github.com/juliog922/cudara
cd cudara
uv sync

# Run server
uv run cudara serve

Using the Client Library

pip install cudara-client

from cudara_client import CudaraClient

client = CudaraClient("http://localhost:8000")
client.pull("Qwen/Qwen2.5-3B-Instruct")
response = client.chat("Qwen/Qwen2.5-3B-Instruct", "Hello!")
print(response.content)

Project Structure

cudara/
β”œβ”€β”€ src/cudara/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main.py              # FastAPI server
β”‚   β”œβ”€β”€ cli.py               # CLI commands
β”‚   β”œβ”€β”€ quantization.py      # BitsAndBytes quantization
β”‚   └── image_processing.py  # VRAM-aware image processing
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_unit.py         # Unit tests
β”‚   └── test_integration.py  # Integration tests
β”œβ”€β”€ .github/workflows/
β”‚   β”œβ”€β”€ ci.yml               # Test on PR
β”‚   └── docker-publish.yml   # Build & push to GHCR
β”œβ”€β”€ models.json              # Model configurations
β”œβ”€β”€ pyproject.toml           # Dependencies
β”œβ”€β”€ Dockerfile               # Docker build
└── README.md

CLI Usage

Cudara includes a CLI similar to Ollama's:

# Start server
cudara serve --host 0.0.0.0 --port 8000

# List models
cudara list

# Pull a model
cudara pull Qwen/Qwen2.5-3B-Instruct

# Run inference
cudara run Qwen/Qwen2.5-3B-Instruct "Hello!"

# Interactive chat
cudara chat Qwen/Qwen2.5-3B-Instruct

# Show server status
cudara ps

# Delete model
cudara rm Qwen/Qwen2.5-3B-Instruct

API Reference

Ollama-Compatible Endpoints

Endpoint         Method  Description
/api/tags        GET     List available models
/api/show        POST    Show model details
/api/pull        POST    Download a model
/api/delete      DELETE  Delete a model
/api/generate    POST    Generate text
/api/chat        POST    Chat completion
/api/embeddings  POST    Generate embeddings
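Because the API mirrors Ollama's, /api/tags should return a JSON body with a "models" array. A minimal sketch of listing model names (the response shape here is assumed from Ollama's API, not verified against Cudara):

```python
import json
from urllib.request import urlopen

def list_model_names(base_url: str = "http://localhost:8000") -> list[str]:
    """Fetch /api/tags and return the model names.

    Assumes the Ollama-style shape: {"models": [{"name": ...}, ...]}.
    """
    with urlopen(f"{base_url}/api/tags") as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

# The extraction itself, on a sample payload (no server needed):
sample = {"models": [{"name": "Qwen/Qwen2.5-3B-Instruct"},
                     {"name": "openai/whisper-small"}]}
names = [m["name"] for m in sample.get("models", [])]
```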

Extended Endpoints

Endpoint         Method  Description
/api/transcribe  POST    Transcribe audio
/api/vision      POST    Process an image
/health          GET     Health check
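When CUDARA_DEFAULT_MODELS is set, /health stays unhealthy until the requested models are ready, so startup scripts typically poll it. A hypothetical polling helper (the probe is injected so the sketch runs without a server; a real probe would GET /health):

```python
import time
from typing import Callable

def wait_until_healthy(probe: Callable[[], bool],
                       timeout: float = 600.0,
                       interval: float = 2.0) -> bool:
    """Poll `probe` (e.g. a GET on /health) until it returns True or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# A real probe might be:
#     lambda: urlopen("http://localhost:8000/health").status == 200
# Here, a fake probe that succeeds on the third call:
calls = {"n": 0}
def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

ready = wait_until_healthy(fake_probe, timeout=5.0, interval=0.0)
```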

Examples

# Chat
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'

# Embeddings
curl -X POST http://localhost:8000/api/embeddings \
  -d '{"model": "sentence-transformers/all-MiniLM-L6-v2", "input": "Hello"}'

# Vision
curl -X POST http://localhost:8000/api/vision \
  -F "model=unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit" \
  -F "prompt=What is this?" -F "file=@image.jpg"

# Transcribe
curl -X POST http://localhost:8000/api/transcribe \
  -F "model=openai/whisper-small" -F "file=@audio.mp3"
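The same chat call from Python, using only the standard library. The request shape mirrors the curl example above; building the request is separated from sending it so the payload can be inspected without a running server:

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(model: str, content: str,
                       base_url: str = "http://localhost:8000") -> Request:
    """Build the same POST /api/chat request as the curl example above."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": content}]}
    return Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Qwen/Qwen2.5-3B-Instruct", "Hello!")
# To actually send it (server must be running):
#     with urlopen(req) as resp:
#         print(resp.read().decode())
```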

Adding Models

Edit models.json to add HuggingFace models:

Text Generation

"your-org/your-model": {
  "description": "Your model",
  "task": "text-generation",
  "architecture": "AutoModelForCausalLM",
  "dtype": "bfloat16",
  "quantization": {
    "enabled": true,
    "prequantize": true,
    "method": "bitsandbytes",
    "bits": 4,
    "category": "text_llm_medium"
  },
  "system_prompt": "You are helpful.",
  "generation_defaults": {"max_new_tokens": 512, "temperature": 0.7}
}
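A quick sanity check before restarting the server can catch typos in a new entry. This is a hypothetical helper, not part of Cudara; the required keys are taken from the example entry above, not from any published schema:

```python
# Hypothetical sanity check for a models.json entry; required keys are
# inferred from the "Text Generation" example, not from Cudara's schema.
REQUIRED_KEYS = {"description", "task", "architecture", "dtype"}

def check_entry(model_id: str, entry: dict) -> list[str]:
    """Return a list of problems found in one models.json entry (empty if OK)."""
    problems = [f"{model_id}: missing key '{k}'"
                for k in sorted(REQUIRED_KEYS - entry.keys())]
    quant = entry.get("quantization", {})
    if quant.get("enabled") and quant.get("bits") not in (4, 8):
        problems.append(f"{model_id}: quantization.bits should be 4 or 8")
    return problems

entry = {
    "description": "Your model",
    "task": "text-generation",
    "architecture": "AutoModelForCausalLM",
    "dtype": "bfloat16",
    "quantization": {"enabled": True, "bits": 4},
}
issues = check_entry("your-org/your-model", entry)
```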

Pre-quantized (Unsloth)

"unsloth/Model-bnb-4bit": {
  "description": "Pre-quantized model",
  "task": "text-generation",
  "architecture": "AutoModelForCausalLM",
  "dtype": "bfloat16",
  "quantization": {"enabled": false, "notes": "Pre-quantized"}
}
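The difference between the two entry styles is that a pre-quantized checkpoint needs no quantization config at load time, while an "enabled" block must be translated into loader options. A sketch of that mapping, assuming a BitsAndBytes-style load_in_4bit/load_in_8bit flag as in transformers; Cudara's actual logic lives in quantization.py and may differ:

```python
def quant_kwargs(entry: dict) -> dict:
    """Sketch: map a models.json 'quantization' block to loader kwargs.

    Pre-quantized checkpoints (enabled: false) need no extra config; enabled
    blocks map to a load_in_{bits}bit flag, in the style of transformers'
    BitsAndBytesConfig. Illustrative only.
    """
    quant = entry.get("quantization", {})
    if not quant.get("enabled"):
        return {}  # pre-quantized (e.g. unsloth *-bnb-4bit) or full precision
    bits = quant.get("bits", 4)
    return {f"load_in_{bits}bit": True}

plain = quant_kwargs({"quantization": {"enabled": False,
                                       "notes": "Pre-quantized"}})
quantized = quant_kwargs({"quantization": {"enabled": True,
                                           "method": "bitsandbytes",
                                           "bits": 4}})
```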

Development

Run Tests

# All tests
uv run pytest

# Unit tests only
uv run pytest tests/test_unit.py -v

# Integration tests
uv run pytest tests/test_integration.py -v -m integration

# With coverage
uv run pytest --cov=src/cudara --cov-report=html

Lint

uv run ruff check src/ tests/
uv run ruff format src/ tests/

Build Docker

docker build -t cudara:cuda .
docker run --gpus all -p 8000:8000 cudara:cuda

Environment Variables

HF_TOKEN
  HuggingFace token for gated models. Default: unset.

CUDARA_DEFAULT_MODELS
  Comma-separated model IDs from models.json to auto-download on startup.
  Disabled when unset, empty, or "None". While enabled, /health reports
  unhealthy until every requested model is READY. Default: unset.

CUDA_VISIBLE_DEVICES
  GPU selection. Default: all.
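The CUDARA_DEFAULT_MODELS behavior described above can be sketched as a small parser: a comma-separated list that is disabled when the variable is unset, empty, or the literal "None". This mirrors the documented behavior, not Cudara's actual parsing code:

```python
import os

def default_models(env=os.environ) -> list[str]:
    """Parse CUDARA_DEFAULT_MODELS: a comma-separated model ID list,
    disabled (empty result) when unset, empty, or "None"."""
    raw = env.get("CUDARA_DEFAULT_MODELS", "")
    if not raw or raw.strip() == "None":
        return []
    return [m.strip() for m in raw.split(",") if m.strip()]

models = default_models(
    {"CUDARA_DEFAULT_MODELS": "Qwen/Qwen2.5-3B-Instruct, openai/whisper-small"}
)
```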

Requirements

  • NVIDIA GPU with 8GB+ VRAM
  • CUDA 12.1+
  • Python 3.11+

Cudara Client

Install the Python client:

pip install cudara-client

See cudara-client for documentation.


License

MIT
