0x0funky/audioghost-ai

AudioGhost AI 🎵👻

AI-Powered Object-Oriented Audio Separation

Describe the sound you want to extract or remove using natural language. Powered by Meta's SAM-Audio model.

🎬 Demo

Audio Separation

audioghost.mp4

Video Upload

audioghost_video.mp4

Features

  • 🎯 Text-Guided Separation - Describe what you want to extract: "vocals", "drums", "a dog barking"
  • 🎬 Video Upload Support - Upload videos and extract/remove audio sources (audio extraction only, not vision-based)
  • 🚀 Memory Optimized - Lite mode reduces VRAM from ~11GB to ~4GB
  • 🎨 Modern UI - Glassmorphism design with waveform visualization
  • ⚡ Real-time Progress - Track separation progress in real-time
  • 🎛️ Stem Mixer - Preview and compare original, extracted, and residual audio

🗺️ Roadmap

  • 🖱️ Visual Prompting - Click on video to select sound sources visually (Integration with SAM 2)

Architecture

┌──────────────────────────────────────────────────┐
│                   Frontend                       │
│             (Next.js + Tailwind v4)              │
└──────────────────────┬───────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────┐
│               Backend API                        │
│            (FastAPI + Python)                    │
└──────────────────────┬───────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────┐
│              Task Queue                          │
│          (Celery + Redis)                        │
└──────────────────────┬───────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────┐
│           SAM Audio Lite                         │
│    (Memory-optimized Meta SAM-Audio)             │
└──────────────────────────────────────────────────┘
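
As a rough illustration of how a request moves through these layers, the sketch below imitates the flow with Python's standard library only. `queue.Queue` and a thread stand in for Redis and the Celery worker, and `submit_task`, `worker_loop`, and `fake_separate` are hypothetical names chosen for this example, not the project's actual code.

```python
import queue
import threading
import uuid

tasks = queue.Queue()   # stand-in for the Redis-backed Celery queue
status = {}             # task_id -> "pending" | "completed"
results = {}            # task_id -> worker output


def fake_separate(description, mode):
    # Placeholder for the SAM Audio Lite call; returns a label instead of audio.
    return f"{mode}:{description}"


def submit_task(description, mode):
    """What the API layer does: enqueue the job and return immediately."""
    task_id = str(uuid.uuid4())
    status[task_id] = "pending"
    tasks.put((task_id, description, mode))
    return task_id


def worker_loop():
    """What the worker does: pull jobs until it receives the None sentinel."""
    while True:
        job = tasks.get()
        if job is None:
            break
        task_id, description, mode = job
        results[task_id] = fake_separate(description, mode)
        status[task_id] = "completed"


worker = threading.Thread(target=worker_loop)
worker.start()
task_id = submit_task("vocals", "extract")
tasks.put(None)   # tell the worker to stop after the queued job
worker.join()
print(status[task_id], results[task_id])
```

The point of the queue is that the API call returns a task ID right away while the GPU-bound work happens elsewhere, which is why the frontend can show progress instead of blocking on a long request.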

Requirements

  • Python 3.11+
  • CUDA-compatible GPU (4GB+ VRAM for lite mode, 12GB+ for full mode)
  • CUDA 12.6 (recommended)
  • Node.js 18+ (for frontend)

💡 FFmpeg and Redis are automatically installed by the installer.

🚀 One-Click Installation (Recommended)

First Time Setup

# Run installer (creates Conda env, downloads Redis, installs all dependencies)
install.bat

Daily Usage

# Start all services with one click
start.bat

# Stop all services
stop.bat

Manual Setup (Advanced)

1. Start Redis

Redis is automatically downloaded to redis/ folder by install.bat. If you prefer Docker:

docker-compose up -d

2. Create Anaconda Environment

# Create new environment (Python 3.11+ required)
conda create -n audioghost python=3.11 -y

# Activate environment
conda activate audioghost

3. Install PyTorch (CUDA 12.6)

pip install torch==2.9.0+cu126 torchvision==0.24.0+cu126 torchaudio==2.9.0+cu126 --index-url https://download.pytorch.org/whl/cu126 --extra-index-url https://pypi.org/simple

4. Install FFmpeg (required by TorchCodec)

conda install -c conda-forge ffmpeg -y

5. Install SAM Audio

pip install git+https://github.com/facebookresearch/sam-audio.git

6. Install Backend Dependencies

cd backend
pip install -r requirements.txt

7. Install Frontend Dependencies

# Step 6 leaves the shell in backend/, so go up one level first
cd ../frontend
npm install

8. Start Services

Terminal 1 - Backend API:

conda activate audioghost
cd backend
uvicorn main:app --reload --port 8000

Terminal 2 - Celery Worker:

conda activate audioghost
cd backend
celery -A workers.celery_app worker --loglevel=info --pool=solo

Terminal 3 - Frontend:

cd frontend
npm run dev

9. Open the App

Navigate to http://localhost:3000

10. Connect HuggingFace

  1. Click "Connect HuggingFace" button
  2. Request access at https://huggingface.co/facebook/sam-audio-large
  3. Create Access Token: https://huggingface.co/settings/tokens
  4. Paste the token and connect

Usage

  1. Upload an audio file (MP3, WAV, FLAC)
  2. Describe what you want to extract or remove:
    • "vocals" / "singing voice"
    • "drums" / "percussion"
    • "background music"
    • "a dog barking"
    • "crowd noise"
  3. Click Extract or Remove
  4. Wait for processing
  5. Preview and download the results

Performance Benchmarks

Tested on RTX 4090 with 4:26 audio (11 chunks @ 25s each)

VRAM Usage (Lite Mode)

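| Model | bfloat16 (Default) | float32 (High Quality) | Recommended GPU                    |
|-------|--------------------|------------------------|------------------------------------|
| Small | ~6 GB              | ~10 GB                 | RTX 3060 6GB / RTX 3070 8GB        |
| Base  | ~7 GB              | ~13 GB                 | RTX 3070/4060 8GB / RTX 4070 12GB  |
| Large | ~10 GB             | ~20 GB                 | RTX 3080/4070 12GB / RTX 4080 16GB |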

💡 High Quality Mode (float32): Better separation quality at the cost of more VRAM, roughly +4 GB (Small) to +10 GB (Large) according to the table above. Enable via the "High Quality Mode" toggle in the UI.
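
To make the table easier to apply, here is a small sketch that picks the largest model whose listed VRAM figure fits a given budget. `largest_model_that_fits` is a hypothetical helper, not part of AudioGhost, and the numbers are the approximate values from the table, so leave some headroom in practice.

```python
# Approximate lite-mode VRAM needs in GB, copied from the table above.
VRAM_GB = {
    "small": {"bfloat16": 6, "float32": 10},
    "base": {"bfloat16": 7, "float32": 13},
    "large": {"bfloat16": 10, "float32": 20},
}


def largest_model_that_fits(available_gb, high_quality=False):
    """Return the largest model whose table value fits, or None if none does."""
    precision = "float32" if high_quality else "bfloat16"
    fitting = [name for name in ("small", "base", "large")
               if VRAM_GB[name][precision] <= available_gb]
    return fitting[-1] if fitting else None


print(largest_model_that_fits(8))          # base
print(largest_model_that_fits(12, True))   # small
print(largest_model_that_fits(4))          # None
```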

Processing Time

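| Model | First Run (incl. model load) | Subsequent Runs | Speed          |
|-------|------------------------------|-----------------|----------------|
| Small | ~78s                         | ~25s            | ~10x realtime  |
| Base  | ~100s                        | ~29s            | ~9x realtime   |
| Large | ~130s                        | ~41s            | ~6.5x realtime |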

💡 First run includes model download and loading. Subsequent runs use cached models.
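
The Speed column appears to be the clip length divided by the subsequent-run time. A quick check of that arithmetic, assuming the 4:26 test clip (266 seconds):

```python
AUDIO_SECONDS = 4 * 60 + 26   # the 4:26 test clip


def realtime_factor(processing_seconds):
    """How many seconds of audio are processed per second of wall-clock time."""
    return AUDIO_SECONDS / processing_seconds


for model, seconds in {"small": 25, "base": 29, "large": 41}.items():
    print(f"{model}: {realtime_factor(seconds):.1f}x realtime")
# small: 10.6x realtime
# base: 9.2x realtime
# large: 6.5x realtime
```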

Memory Optimization Details

AudioGhost uses a "Lite Mode" that removes unused model components:

| Component Removed | VRAM Saved |
|-------------------|------------|
| Vision Encoder    | ~2GB       |
| Visual Ranker     | ~2GB       |
| Text Ranker       | ~2GB       |
| Span Predictor    | ~1-2GB     |

Total Reduction: Up to 40% less VRAM compared to original SAM-Audio

This is achieved by:

  • Disabling video-related features (not needed for audio-only)
  • Using predict_spans=False and reranking_candidates=1
  • Using bfloat16 precision by default (optional float32 for quality)
  • 25-second chunking for long audio files
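
The chunking step can be sketched as follows. This assumes plain non-overlapping 25-second windows; the project's worker may handle chunk boundaries differently (for example with overlap or crossfades), and `chunk_bounds` is a hypothetical name.

```python
import math


def chunk_bounds(total_seconds, chunk_seconds=25.0):
    """Return (start, end) pairs covering the audio; the last chunk may be shorter."""
    total = float(total_seconds)
    count = math.ceil(total / chunk_seconds)
    return [(i * chunk_seconds, min((i + 1) * chunk_seconds, total))
            for i in range(count)]


bounds = chunk_bounds(4 * 60 + 26)   # the 4:26 benchmark clip
print(len(bounds))    # 11
print(bounds[-1])     # (250.0, 266.0)
```

Processing one fixed-size chunk at a time keeps peak memory tied to the chunk length rather than to the length of the whole file.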

Project Structure

audioghost-ai/
├── backend/
│   ├── main.py           # FastAPI app
│   ├── api/              # API routes
│   │   ├── auth.py       # HuggingFace auth
│   │   └── separate.py   # Separation endpoints
│   └── workers/
│       ├── celery_app.py # Celery config
│       └── tasks.py      # SAM Audio Lite worker
├── frontend/
│   ├── src/
│   │   ├── app/          # Next.js app
│   │   └── components/   # React components
│   └── package.json
├── sam_audio_lite.py     # Standalone lite version
├── QUICKSTART.md         # Quick setup guide
└── README.md

API Reference

POST /api/separate/

Create a separation task.

Form Data:

  • file - Audio file
  • description - Text prompt (e.g., "vocals")
  • mode - "extract" or "remove"
  • model_size - "small", "base", or "large" (default: "base")

Response:

{
  "task_id": "uuid",
  "status": "pending",
  "message": "Task submitted successfully"
}
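
A minimal client sketch for this endpoint, using only the standard library. It builds the multipart request without sending it, so it can be inspected offline. `build_separation_request` is a hypothetical helper, and the base URL is an assumption taken from the `uvicorn ... --port 8000` command in Manual Setup.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"   # assumed from the uvicorn command in Manual Setup


def build_separation_request(filename, audio_bytes, description,
                             mode="extract", model_size="base"):
    """Build (but do not send) a multipart POST for /api/separate/."""
    if mode not in ("extract", "remove"):
        raise ValueError("mode must be 'extract' or 'remove'")
    if model_size not in ("small", "base", "large"):
        raise ValueError("model_size must be 'small', 'base', or 'large'")

    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields first, then the file part.
    for name, value in (("description", description), ("mode", mode),
                        ("model_size", model_size)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        f"\r\n\r\n".encode() + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        BASE_URL + "/api/separate/",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


request = build_separation_request("song.wav", b"placeholder bytes", "vocals")
print(request.get_method(), request.full_url)
# To send it against a running backend:
#   import json
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["task_id"])
```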

GET /api/separate/{task_id}/status

Get task status and progress.
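
A polling loop for this endpoint might look like the sketch below. The response schema is not documented here, so the `"status"` key and the terminal values `"completed"` and `"failed"` are assumptions; adjust them to what the backend actually returns. The fetcher is passed in as a function so the loop can be tried without a running server.

```python
import time


def wait_for_task(fetch_status, task_id, interval=2.0, timeout=600.0,
                  terminal=("completed", "failed")):
    """Poll fetch_status(task_id) until its "status" is terminal or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in terminal:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {payload.get('status')!r}")
        time.sleep(interval)


# Offline demonstration with a fake fetcher; a real one would GET
# /api/separate/{task_id}/status and decode the JSON body.
replies = iter([{"status": "pending"}, {"status": "completed"}])
print(wait_for_task(lambda _id: next(replies), "demo", interval=0.0))
```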

GET /api/separate/{task_id}/download/{stem}

Download result audio (ghost, clean, or original).
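
For completeness, a small sketch that builds the download URL and rejects unknown stem names. `download_url` is a hypothetical helper, and the default base URL is assumed from the local uvicorn command.

```python
from urllib.parse import quote

STEMS = ("ghost", "clean", "original")   # stem names listed above


def download_url(task_id, stem, base_url="http://localhost:8000"):
    """Return the download URL for one stem of a finished task."""
    if stem not in STEMS:
        raise ValueError(f"stem must be one of {STEMS}")
    return f"{base_url}/api/separate/{quote(task_id, safe='')}/download/{stem}"


print(download_url("1234-abcd", "ghost"))
# http://localhost:8000/api/separate/1234-abcd/download/ghost
```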

Troubleshooting

CUDA Out of Memory

  • Use model_size: "small" instead of "base" or "large"
  • Ensure lite mode is enabled (check for "Optimizing model for low VRAM" in logs)
  • Close other GPU applications

TorchCodec DLL Error

  • Downgrade to FFmpeg 7.x
  • Ensure FFmpeg bin directory is in PATH
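
Both checks can be scripted. The sketch below looks for `ffmpeg` on PATH and reads the major version from the first line of `ffmpeg -version`. The helper names are hypothetical, and the regular expression only covers release-style version strings such as `7.1.1` or `n7.1`; development builds with other version formats return `None`.

```python
import re
import shutil
import subprocess


def ffmpeg_major_version(version_text):
    """Extract the major version from the first line of `ffmpeg -version` output."""
    match = re.search(r"ffmpeg version n?(\d+)\.", version_text)
    return int(match.group(1)) if match else None


def check_ffmpeg():
    """Return (path, major_version); both are None when FFmpeg is not on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None, None
    output = subprocess.run([path, "-version"], capture_output=True,
                            text=True).stdout
    return path, ffmpeg_major_version(output)


# Sample input string for illustration, not captured from a real install.
print(ffmpeg_major_version("ffmpeg version 7.1.1 Copyright (c) the FFmpeg developers"))  # 7
print(check_ffmpeg())
```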

HuggingFace 401 Error

  • Re-authenticate via the UI
  • Check that .hf_token exists in backend/
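
The second check can be done from the repository root with a few lines of Python. This sketch assumes `.hf_token` is a plain-text file containing the token, which is not stated above; `token_file_problem` is a hypothetical helper.

```python
from pathlib import Path


def token_file_problem(path="backend/.hf_token"):
    """Return a description of the problem, or None if the file looks usable."""
    token_path = Path(path)
    if not token_path.is_file():
        return f"{token_path} does not exist"
    if not token_path.read_text(encoding="utf-8").strip():
        return f"{token_path} is empty"
    return None


print(token_file_problem() or "token file found")
```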

License

This project is licensed under the MIT License. SAM-Audio is licensed by Meta under a research license.

Credits

  • SAM-Audio by Meta AI Research
  • Core Optimization Logic: Special thanks to NilanEkanayake for providing the initial code modifications in Issue #24 that made VRAM inference reduction possible.
  • Built with ❤️ using Next.js, FastAPI, and Celery
