A self-hosted AI video generation pipeline that creates viral-style content featuring animated babies, animals, and other characters. The system replicates trending formats such as "baby podcasters" and "animal CEO meetings" using a fully local AI pipeline.
- Complete Local Pipeline: Generate videos offline using self-hosted AI models
- Multiple Character Types: Babies, animals, celebrities, and cartoon characters
- Advanced Voice Synthesis: Baby voices, celebrity impersonations, and emotional variations
- Professional Lip-Sync: Natural facial animation synchronized with audio
- Template System: Pre-built viral content templates
- Real-Time Progress: WebSocket-based generation monitoring (see the client sketch after this list)
- Modern UI: React frontend with drag-and-drop and real-time previews
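For example, a client could follow generation progress like this. This is a minimal sketch: the WebSocket route and message shape are assumptions for illustration, not the project's documented API.

```python
# Minimal sketch of consuming generation progress over WebSocket.
# The /ws/progress/{task_id} path and the message schema are assumed.
import asyncio
import json

import websockets  # pip install websockets


async def watch_progress(task_id: str) -> None:
    uri = f"ws://localhost:8000/ws/progress/{task_id}"  # hypothetical endpoint
    async with websockets.connect(uri) as ws:
        async for raw in ws:
            event = json.loads(raw)
            print(f"{event.get('stage')}: {event.get('percent')}%")
            if event.get("stage") == "complete":
                break


asyncio.run(watch_progress("example-task-id"))
```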
- FastAPI server with async support
- Pydantic models for type safety (sketched after this list)
- Single responsibility services
- ComfyUI integration for Stable Diffusion
- Ollama for local LLM inference
- Multiple TTS engines (MeloTTS, FishSpeech, F5-TTS)
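As a sketch of the FastAPI + Pydantic pattern above; every name here (model fields, route, response shape) is illustrative, not the project's actual schema:

```python
# Hypothetical request model and endpoint illustrating the FastAPI +
# Pydantic pattern; names are illustrative only.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class VideoRequest(BaseModel):
    character_type: str = Field(..., description="e.g. 'baby', 'animal'")
    script: str
    voice_style: str = "default"
    duration_seconds: int = Field(30, ge=1, le=300)


@app.post("/api/videos")
async def create_video(request: VideoRequest) -> dict:
    # A real handler would enqueue the generation task and return its ID.
    return {"task_id": "example", "status": "queued"}
```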
- Vite for fast development
- Tailwind CSS for styling
- React Query for API state management
- Zustand for client state
- Framer Motion for animations
- Python 3.9+ with pip/uv
- Node.js 18+ with npm
- NVIDIA GPU (8GB+ VRAM recommended)
- CUDA 11.8+ for GPU acceleration (see the check below)
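A quick way to confirm the GPU is visible before downloading models (assumes PyTorch is already installed):

```python
# Quick check that PyTorch can see the GPU before generating anything.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {name} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA device found; generation will fall back to CPU (slow).")
```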
```bash
# Clone repository
git clone <repository-url>
cd ai-video-generation-workflow

# Install Python dependencies
cd backend
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e ".[dev,gpu,models]"

# Set up environment
cp .env.example .env
# Edit .env with your configuration

# Run development server
python src/main.py
```

```bash
# Install Node dependencies
cd frontend
npm install

# Start development server
npm run dev
```
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000 (example request below)
- API Docs: http://localhost:8000/api/docs
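Once the backend is up, a job can also be submitted programmatically. The `/api/videos` route and payload below are assumptions carried over from the earlier sketch, not the documented API:

```python
# Hedged example of submitting a job to the local API; the route and
# payload shape are assumptions consistent with the sketches above.
import requests

payload = {
    "character_type": "baby",
    "script": "Welcome back to the tiny podcast!",
    "voice_style": "baby",
}
resp = requests.post("http://localhost:8000/api/videos", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # e.g. {"task_id": "...", "status": "queued"}
```

The returned `task_id` could then be fed to the WebSocket progress sketch shown earlier.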
| Component | Minimum | Recommended | Optimal |
|---|---|---|---|
| GPU VRAM | 8GB | 12GB | 24GB+ |
| System RAM | 16GB | 32GB | 64GB+ |
| Storage | 100GB | 200GB | 500GB+ |
| CPU Cores | 6 | 8 | 12+ |
The system automatically downloads the required models (a sketch of such a step follows the list):
- LLM: Llama 3.1-8B (4.6GB)
- TTS: MeloTTS (500MB)
- Image: Stable Diffusion v1.5 (4GB)
- Animation: LatentSync (3.2GB)
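A minimal sketch of what such a bootstrap step might look like, assuming models come from the Hugging Face Hub; the repo ID and directory layout are illustrative, not the project's actual download code:

```python
# Illustrative model bootstrap; the repo IDs and use of huggingface_hub
# are assumptions, not the project's actual mechanism.
from pathlib import Path

from huggingface_hub import snapshot_download  # pip install huggingface_hub

MODELS_DIR = Path("./models")

REQUIRED_MODELS = {
    "image": "runwayml/stable-diffusion-v1-5",  # illustrative repo ID
}

for name, repo_id in REQUIRED_MODELS.items():
    target = MODELS_DIR / name
    if not target.exists():
        print(f"Downloading {repo_id} -> {target}")
        snapshot_download(repo_id=repo_id, local_dir=str(target))
```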
The repository includes complete VS Code configuration:
- Launch configurations for debugging backend and frontend
- Workspace settings with Python and TypeScript formatting
- Extension recommendations for optimal development experience
- Python: Black, Ruff, MyPy for formatting and linting
- TypeScript: ESLint, Prettier for code quality
- Testing: Pytest (backend), Vitest (frontend) (sample test below)
- Pre-commit hooks for consistent code style
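A backend test in the Pytest setup mentioned above might look like the following; the `src.main` import path and the endpoint mirror the earlier hypothetical sketches, not the real module layout:

```python
# Hypothetical API test; src.main and /api/videos are assumptions
# matching the sketches above.
from fastapi.testclient import TestClient

from src.main import app  # assumed module layout

client = TestClient(app)


def test_create_video_queues_task():
    payload = {"character_type": "baby", "script": "hi", "voice_style": "baby"}
    response = client.post("/api/videos", json=payload)
    assert response.status_code == 200
    assert "task_id" in response.json()
```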
```bash
# Backend development
cd backend
python src/main.py  # Start dev server with hot reload

# Frontend development
cd frontend
npm run dev  # Start with hot reload

# Run tests
cd backend && python -m pytest
cd frontend && npm test

# Code formatting
cd backend && black . && ruff check .
cd frontend && npm run format
```
- Choose Character Type: Select from baby humans, animals, celebrities, or cartoon characters
- Write Script: Create an engaging prompt or use a trending template
- Configure Voice: Select voice style matching your character
- Generate Video: Watch real-time progress as AI creates your video
- Download & Share: Get your viral-ready MP4 file
- Custom Characters: Upload reference images for personalized avatars
- Voice Cloning: Use sample audio for custom voice generation
- Batch Processing: Generate multiple variations simultaneously (sketched after this list)
- Template Creation: Save successful configurations for reuse
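Batch processing can also be driven from the client side. This sketch reuses the assumed `/api/videos` route from the earlier examples and submits variations concurrently; the server's `MAX_CONCURRENT_TASKS` setting still caps actual parallelism:

```python
# Hedged client-side batch sketch; the route and payload are
# assumptions carried over from the earlier examples.
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/api/videos"  # assumed route
SCRIPTS = ["Take one!", "Take two!", "Take three!"]


def submit(script: str) -> str:
    """Submit one variation and return its task ID."""
    resp = requests.post(API_URL, json={"character_type": "baby", "script": script})
    resp.raise_for_status()
    return resp.json()["task_id"]


with ThreadPoolExecutor(max_workers=3) as pool:
    task_ids = list(pool.map(submit, SCRIPTS))

print(task_ids)
```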
```bash
# Server settings
HOST=0.0.0.0
PORT=8000
DEBUG=false

# Model settings
MODELS_DIR=./models
GPU_ENABLED=true
MAX_CONCURRENT_TASKS=2

# Model selection
LLM_MODEL=llama3.1:8b
TTS_MODEL=melotts
IMAGE_MODEL=stable-diffusion-v1-5
```

```bash
# GPU memory management
TORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Performance tuning
MAX_VIDEO_DURATION=300
API_RATE_LIMIT=100
```
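These variables map naturally onto a typed settings object; a minimal sketch assuming pydantic-settings (the project's actual settings module may differ):

```python
# Minimal settings sketch; field names mirror the .env keys above.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    host: str = "0.0.0.0"
    port: int = 8000
    debug: bool = False
    models_dir: str = "./models"
    gpu_enabled: bool = True
    max_concurrent_tasks: int = 2
    llm_model: str = "llama3.1:8b"
    tts_model: str = "melotts"
    image_model: str = "stable-diffusion-v1-5"
    max_video_duration: int = 300
    api_rate_limit: int = 100


settings = Settings()
print(settings.model_dump())
```

Lower-case field names still pick up the upper-cased `.env` keys because pydantic-settings matches environment variables case-insensitively.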
```bash
# Build and run with Docker Compose
docker-compose up -d

# Or build manually
docker build -t ai-video-gen .
docker run -p 8000:8000 --gpus all ai-video-gen
```

```bash
# Backend production
pip install -e ".[production]"
gunicorn src.main:app -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000

# Frontend production build
npm run build
# Serve dist/ with nginx or similar
```
- Script Generation: ~10 seconds (see the Ollama sketch after this list)
- Audio Synthesis: ~15 seconds
- Image Generation: ~30 seconds
- Lip-Sync Animation: ~60 seconds
- Video Rendering: ~20 seconds
- Total: ~2-3 minutes for a 30-second video
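For reference, the script-generation step is a single local LLM call. A hedged sketch using Ollama's documented `/api/generate` HTTP route with the model from the config above (the prompt wording is illustrative):

```python
# Sketch of the script-generation step via Ollama's local HTTP API.
# /api/generate is Ollama's documented route; the prompt is illustrative.
import requests

prompt = "Write a 30-second podcast monologue for a baby host about naptime."
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```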
- Enable GPU acceleration for a 5-10x speedup
- Use quantized models to reduce VRAM usage (see the sketch after this list)
- Batch process multiple videos for efficiency
- SSD storage improves model loading times
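As an illustration of the VRAM tips above, here is a hedged sketch of memory-conscious image-model loading with diffusers; whether the pipeline loads models this way (rather than through ComfyUI) is an assumption:

```python
# Hedged sketch of VRAM-conscious Stable Diffusion loading; using
# diffusers directly here is an assumption about the pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # roughly halves VRAM vs. float32
)
pipe.enable_attention_slicing()  # further reduces peak VRAM
pipe = pipe.to("cuda")
```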
- Fork the repository
- Create feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- ComfyUI for Stable Diffusion integration
- Ollama for local LLM inference
- MeloTTS for high-quality text-to-speech
- HunyuanVideo for lip-sync animation
- FastAPI and React communities
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for the AI content creation community