A premium, local-first EPUB reader with high-fidelity "Direct Neural" text-to-speech. Built with Next.js, FastAPI, and ONNX.
eBookBot converts your EPUB books into immersive audio experiences. It uses a flow-matching-based TTS engine (ReaderAudioEngine) to generate natural speech with precise word-level synchronization.
TTS model in use: Supertone/supertonic-2.
- Local-first pipeline with fast, responsive playback.
- Word-sync highlighting aligned with neural audio.
- Fine-grained reading controls for layout and tempo.
- Modular architecture: Next.js UI + FastAPI API + ONNX TTS engine.
| Layer | Technology |
|---|---|
| Frontend | Next.js (App Router) |
| Backend | FastAPI |
| TTS Engine | ONNX Runtime + ReaderAudioEngine |
| TTS Model | Supertone/supertonic-2 |
- Python 3.10+
- Node.js 18+
- ONNX Runtime (CUDA recommended for GPU acceleration, works on CPU too)
The easiest way to run the backend and frontend together is the `run.py` script:

    python run.py

This will:
- Start the FastAPI backend.
- Start the Next.js frontend.
- Handle clean shutdown of both services.
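
As a rough illustration of what such a launcher involves, here is a minimal sketch. It is not the actual `run.py`; the `SERVICES` table, `stop_all`, and `main` are hypothetical names, and the commands are simply the manual ones from this README.

```python
import subprocess
import sys
import time

# Assumption: the same commands and directories as the manual setup steps.
SERVICES = [
    ([sys.executable, "-m", "uvicorn", "app.main:app", "--reload"], "ReaderAudioAPI"),
    (["npm", "run", "dev"], "reader-frontend"),
]


def stop_all(procs, timeout=10.0):
    """Ask every running child to terminate, then kill any that do not exit in time."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()
    for proc in procs:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def main():
    """Start both services and shut both down when one exits or on Ctrl+C."""
    procs = []
    try:
        for cmd, cwd in SERVICES:
            procs.append(subprocess.Popen(cmd, cwd=cwd))
        # Poll until any service exits on its own.
        while all(proc.poll() is None for proc in procs):
            time.sleep(0.5)
    except KeyboardInterrupt:
        pass
    finally:
        stop_all(procs)
```

Calling `main()` from the repository root would start both processes. The `finally` block is what keeps a stray backend or frontend from surviving after the other one stops.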
If you prefer to run services separately:

Backend:

    cd ReaderAudioAPI
    pip install -r requirements.txt
    python -m uvicorn app.main:app --reload

Frontend:

    cd reader-frontend
    npm install
    npm run dev

- Open http://localhost:3000.
- Click the + (Plus) icon in the sidebar.
- Upload an EPUB file and wait for processing.
  - Tip: You can purchase high-quality EPUBs from official bookstores or find catalogs on community sites like Free Media Collection.
- Select the book and click Play.
An average book requires about 400 MB of local storage (audio + cache). We will optimize this in the future; see TODO below.
By default, runtime data is stored in ReaderAudioAPI/oas_assets/ (uploads, audio, metadata).
You can override this location by setting EBOOKBOT_DATA_DIR before starting the backend.
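
A minimal sketch of how this kind of override is typically resolved. Only `EBOOKBOT_DATA_DIR` and the default path come from this README; `resolve_data_dir` and `DEFAULT_DATA_DIR` are hypothetical names, and the backend's real logic may differ.

```python
import os
from pathlib import Path

# Default location named in this README, relative to the repository root.
DEFAULT_DATA_DIR = Path("ReaderAudioAPI") / "oas_assets"


def resolve_data_dir(env=None):
    """Return the override from EBOOKBOT_DATA_DIR if set, else the default path."""
    env = os.environ if env is None else env
    override = env.get("EBOOKBOT_DATA_DIR", "").strip()
    if override:
        # Expand a leading "~" so values like "~/books" work as expected.
        return Path(override).expanduser()
    return DEFAULT_DATA_DIR
```

Passing a plain dictionary as `env` makes the function easy to check without touching the real environment, for example `resolve_data_dir({"EBOOKBOT_DATA_DIR": "/mnt/books"})`.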
- TTS worker pool: defaults to 3 GPU workers.
  - `EBOOKBOT_TTS_WORKERS` (default: 3)
  - `EBOOKBOT_TTS_MAX_INFLIGHT` (default: workers * 2)
  - `EBOOKBOT_TTS_TASK_TIMEOUT_SECONDS` (default: 600)
- Idle GPU cleanup: when you pause generation and no work remains, the TTS worker processes shut down to free VRAM. Workers will auto-resume on the next queued task.
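
The defaults above could be read along these lines. This is a sketch only: the variable names and default values come from this README, while `TTSWorkerConfig`, `load_tts_config`, and the validation rules are assumptions rather than the backend's actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSWorkerConfig:
    workers: int
    max_inflight: int
    task_timeout_seconds: int


def _positive_int(env, name, default):
    """Read a positive integer from env, falling back to default when unset or blank."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


def load_tts_config(env=None):
    """Build the worker-pool settings; max in-flight defaults to workers * 2."""
    env = os.environ if env is None else env
    workers = _positive_int(env, "EBOOKBOT_TTS_WORKERS", 3)
    max_inflight = _positive_int(env, "EBOOKBOT_TTS_MAX_INFLIGHT", workers * 2)
    timeout = _positive_int(env, "EBOOKBOT_TTS_TASK_TIMEOUT_SECONDS", 600)
    return TTSWorkerConfig(workers, max_inflight, timeout)
```

Note that the in-flight default is derived from the resolved worker count, so setting only `EBOOKBOT_TTS_WORKERS=2` would give a limit of 4 in this sketch.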
- Dynamic Controls
  - Precise sliders for Reading Size, Line Height, Word Spacing, and Chunk Gaps
  - Tempo control (0.5x to 3.0x)
- Instant Playback: Iterative chunking lets you start instantly while the rest builds in the background.
- Word-Sync: Visual highlighting tracks the neural audio in real time.
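
To give a sense of how word-sync highlighting can work, the sketch below looks up which word is active at a given playback position. It assumes a sorted list of per-word start times is available; `active_word_index` is a hypothetical helper, and the real frontend is written for Next.js, so Python is used here for illustration only.

```python
from bisect import bisect_right


def active_word_index(word_starts, position_seconds):
    """Return the index of the word being spoken at position_seconds.

    word_starts must be sorted ascending. Returns -1 before the first word.
    """
    # bisect_right counts how many words have already started.
    return bisect_right(word_starts, position_seconds) - 1


# Hypothetical start times (seconds) for a four-word chunk.
starts = [0.0, 0.42, 0.80, 1.35]
print(active_word_index(starts, 0.5))  # prints 1
```

A binary search keeps each lookup cheap even when it runs on every animation frame over a long chunk.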
The core engine is included as a submodule. It is responsible for:
- Auto-downloading models from HuggingFace.
- Low-latency ONNX inference.
- Estimating precise word timestamps for highlighting.
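
For intuition about timestamp estimation, here is a deliberately naive heuristic that splits a chunk's audio duration across its words in proportion to their character counts. This is an assumption made for illustration, not the method ReaderAudioEngine uses; `estimate_word_timestamps` is a hypothetical name.

```python
def estimate_word_timestamps(text, duration_seconds):
    """Return (word, start, end) tuples, allotting time by character count."""
    words = text.split()
    total_chars = sum(len(word) for word in words)
    if total_chars == 0:
        return []
    timestamps = []
    elapsed_chars = 0
    for word in words:
        start = duration_seconds * elapsed_chars / total_chars
        elapsed_chars += len(word)
        end = duration_seconds * elapsed_chars / total_chars
        timestamps.append((word, start, end))
    return timestamps
```

A heuristic like this ignores pauses, punctuation, and speaking rate, which is why an engine that derives timing from the model itself can align highlights more closely.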
To contribute or find more details about the engine, visit the ReaderAudioEngine/ directory.
- Optimize per-book storage size (target below ~100 MB).
  - It is 200 MB right now.
- Add audio compression or streaming for long books.
- Provide a cleanup tool for cached audio.
MIT (see LICENSE).
Note: the TTS model and the ReaderAudioEngine submodule may be governed by their own separate licenses/terms.
| Type | Details |
|---|---|
| Author | Izzet Sezer |
| Email | sezer@imsezer.com |
