VoxAI
A Hybrid Local/Cloud Architecture for Large Language Models.
VoxAI is a Python-based chat interface that bridges local hardware (APUs and consumer GPUs) and cloud GPU infrastructure (RunPod). It runs efficient models locally and "bursts" to the cloud for 70B+ parameter models, all within a single chat interface.
- Local Mode: Optimized for consumer hardware (e.g., AMD RX 6600, NVIDIA RTX 3060). Uses llama-cpp-python with Vulkan/CUDA acceleration for standard models like Qwen 7B and Llama 3.
- Cloud Mode: Instant uplink to RunPod serverless GPUs (A40, A6000, A100) for running heavyweights like Midnight-Miqu 70B and Qwen2.5 72B.
- In-Pod Swapping: Hot-swaps models inside the running container to reuse the GPU, reducing switch time from ~5 minutes to seconds.
- Zombie Process Protection: If an in-pod swap fails (e.g., library mismatch), the system automatically detects the failure, kills the "zombie" pod, and spins up a fresh compatible instance.
- Tiered GPU Selection: Automatically selects the cheapest viable GPU.
  - Small Models (<30B): Rents an NVIDIA A40 or RTX A6000 (~$0.30/hr).
  - Ultra Models (70B+): Rents an A100 80GB (~$1.79/hr).
- Auto-Shutdown: Prevents billing accidents by terminating cloud resources on exit.
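The tiered selection described above can be sketched roughly as follows. This is a minimal illustration, not VoxAI's actual code: the `GpuTier` class, the `TIERS` table, and `pick_gpu` are hypothetical names, the prices are the approximate figures quoted in the list, and routing everything at or above 30B to the A100 is an assumption (the list does not say what happens between 30B and 70B).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuTier:
    name: str
    max_params_b: float  # models strictly below this size (in billions) fit the tier
    usd_per_hour: float  # approximate rental price


# Hypothetical tier table built from the figures quoted in the feature list.
TIERS = [
    GpuTier("NVIDIA A40", 30.0, 0.30),
    GpuTier("NVIDIA A100 80GB", float("inf"), 1.79),
]


def pick_gpu(params_b: float) -> GpuTier:
    """Return the cheapest tier assumed to be able to host the model."""
    viable = [tier for tier in TIERS if params_b < tier.max_params_b]
    return min(viable, key=lambda tier: tier.usd_per_hour)
```

For example, `pick_gpu(14)` would select the A40 tier, while `pick_gpu(72)` would select the A100 tier.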
Clone the repository and install dependencies:
git clone https://github.com/yourusername/VoxAI_Chat_API.git
cd VoxAI_Chat_API
# Requires Python 3.10, installed separately (pip cannot install the interpreter itself)
pip install -r requirements.txt

Rename the template and add your API keys:
mv config.example.py config.py

Edit config.py:
API_KEY = "YOUR_RUNPOD_API_KEY"
POD_ID = "YOUR_POD_ID" # Optional: Initial Pod ID

Launch the unified start script:
start.bat

You will be prompted to select your environment:
[1] LOCAL (RX 6600) | [2] CLOUD (RunPod)
To add a new model to the menu, simply edit the MODEL_MAP in your config.py file.
Format: "Menu Display Name": "Local Filename"
MODEL_MAP = {
# ... existing models ...
"My New Cool Model": "my-model-v1.gguf",
}

Note: Ensure the .gguf file is placed inside the models/ directory.
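As an illustration of how a menu entry might be turned into a file path, here is a minimal sketch. The `resolve_model_path` helper and the `MODELS_DIR` constant are hypothetical and not part of VoxAI; the sketch only assumes the `MODEL_MAP` format shown above and the `models/` directory mentioned in the note.

```python
from pathlib import Path

# Same format as in config.py: "Menu Display Name": "Local Filename"
MODEL_MAP = {
    "My New Cool Model": "my-model-v1.gguf",
}

MODELS_DIR = Path("models")  # assumed location, per the note above


def resolve_model_path(display_name, model_map=MODEL_MAP, models_dir=MODELS_DIR):
    """Map a menu name to an existing .gguf file, failing early with a clear error."""
    if display_name not in model_map:
        raise KeyError(f"Unknown model {display_name!r}; add it to MODEL_MAP")
    path = Path(models_dir) / model_map[display_name]
    if path.suffix != ".gguf":
        raise ValueError(f"{path.name} is not a .gguf file")
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; place the file inside {models_dir}/")
    return path
```

Checking the file up front turns a typo in `MODEL_MAP` into an immediate, readable error rather than a failure deep inside model loading.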
VoxAI performs a hardware handshake on boot to determine optimal thread counts and GPU layers.

Performance: Local 14B models typically respond in <15s on mid-range hardware.
[LOCAL] 🛡️ Loading GGUF: models/Qwen3-VL-8B-Instruct.gguf...
[HANDSHAKE] Detected 4 Physical Cores. Mode: APU (Hybrid/Vulkan)
[LOCAL] 🟢 Backend drivers loaded manually.
[LOCAL] ✅ Engine Online.
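To make the handshake idea concrete, the sketch below shows one way a detected core count and hardware mode could be turned into engine settings. It is an assumption-laden illustration, not the logic in machine_engine_handshake.py: `plan_engine`, the mode names, and the specific heuristics (leave one core free, offload 20 layers on an APU) are all hypothetical. The returned keys match the `n_threads` and `n_gpu_layers` parameters of llama-cpp-python's `Llama` constructor, where `-1` offloads all layers and `0` keeps everything on the CPU.

```python
def plan_engine(physical_cores: int, mode: str) -> dict:
    """Pick thread and GPU-layer settings from detected hardware (illustrative heuristics)."""
    # Hypothetical policy: CPU-only offloads nothing, an APU offloads part of
    # the model, and a discrete GPU offloads every layer (-1).
    gpu_layers_by_mode = {"cpu": 0, "apu": 20, "gpu": -1}
    if mode not in gpu_layers_by_mode:
        raise ValueError(f"Unknown mode: {mode!r}")
    # Leave one core for the OS and UI, but never drop below one thread.
    n_threads = max(1, physical_cores - 1)
    return {"n_threads": n_threads, "n_gpu_layers": gpu_layers_by_mode[mode]}
```

With the hardware from the log above (4 physical cores, APU mode), `plan_engine(4, "apu")` yields 3 threads and 20 offloaded layers under these assumed heuristics; the result could then be passed as keyword arguments when constructing the model.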
During testing, swapping from a 70B model (A100) to a 14B model (A40) caused a library mismatch. VoxAI self-corrected in <45s.
[PHOENIX] 🔥 Initiating Swap...
[PHOENIX] ♻️ Optimizing: Reusing active GPU...
[DEBUG] Valid Endpoint, but ID mismatch (1/5): 'openai/gpt-oss-20b' != 'Qwen/Qwen3-14B'
[PHOENIX] ❌ Swap Verification Failed: Persistent Old Model Detected.
[PHOENIX] ⚠️ In-Pod Swap failed. Retrying with fresh pod...
[PHOENIX] ☠️ Terminating old pod...
[PHOENIX] 🐣 Renting NVIDIA A40...
[PHOENIX] ✅ Online! Serving: Qwen/Qwen3-14B
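The recovery flow in this log (attempt an in-pod swap, verify which model is actually being served, then fall back to a fresh pod) can be sketched as below. This is a simplified illustration rather than the code in runpod_interface.py: `swap_model` and its four callables are hypothetical stand-ins for the real RunPod operations, and the default of five verification attempts is only inferred from the "(1/5)" counter in the log.

```python
import time


def swap_model(target, request_swap, get_served_model, terminate_pod,
               rent_fresh_pod, attempts=5, delay_s=0.0):
    """Try an in-pod swap; if the old model persists, replace the pod entirely."""
    request_swap(target)
    for _ in range(attempts):
        # The endpoint may answer while still serving the previous model,
        # so compare the reported model ID against the requested one.
        if get_served_model() == target:
            return "in-pod"
        time.sleep(delay_s)
    # Verification never succeeded: treat the pod as a zombie and start over.
    terminate_pod()
    rent_fresh_pod(target)
    return "fresh-pod"
```

Passing the pod operations in as callables keeps the retry logic testable without touching a real cloud account.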
User: "Can u write me a song Like AJR kinda song?"

VoxAI (Qwen 14B):
Title: "Echoes in the Static"
Verse 1: Driving through the city, dashboard lights are glowing...
Chorus: I hear echoes in the static, whispers in the noise...
- start.bat: Main entry point.
- vox_core_chat.py: The brain. Handles input, local inference, and cloud orchestration.
- runpod_interface.py: The driver. Manages RunPod API, renting, and swapping.
- machine_engine_handshake.py: Hardware detection logic for local optimization.
- config.py: User settings (GitIgnored).
This project is open-source. Feel free to fork and modify!