Native Speech Generation for NVDA

Author: Muhammad Gagah <muha.aku@gmail.com>

Native Speech Generation is an NVDA add-on that integrates Google Gemini AI to generate high-quality, natural-sounding speech directly within NVDA. It provides a clean, fully accessible interface for converting text into audio, supporting both single-speaker narration and dynamic multi-speaker dialogues.

This add-on is designed for smooth workflows, accessibility-first interaction, and flexible voice control suitable for narration, dialogue, and audio content production.


Features

High-Quality Speech Generation

  • Choose between three models:

    • Gemini Flash 3.1 Preview: Powerful, low-latency speech generation, well suited to short audio.
    • Gemini Flash 2.5: Standard quality with fast, low-latency generation.
    • Gemini Pro 2.5: Premium, more realistic voices (paid model).

Single & Multi-Speaker Modes

  • Single-speaker narration for standard text-to-speech.
  • Multi-speaker (2 speakers) mode for dialogues with distinct voices.

Advanced Voice Control

  • Speaker Naming: Assign custom names (e.g., John, Mary) in multi-speaker mode. The AI automatically maps voices based on speaker names in the script.

  • Style Instructions: Provide prompts such as “Speak in a cheerful tone” or “Narrate calmly” to guide delivery.

  • Temperature Control: Adjust output variation and creativity.

    • Lower values → more stable and predictable speech.
    • Higher values → more expressive and varied speech.

Accessible & Clean Interface

  • Fully accessible with screen readers.
  • Advanced options are placed in a collapsible panel to keep the main dialog simple and focused.

Seamless Workflow

  • Audio plays automatically after generation.
  • Generated audio can be replayed or saved as a high-quality .wav file.
  • Designed for minimal friction during repeated generation and playback.
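
As a rough illustration of the save step, the sketch below wraps raw audio samples in a WAV container using Python's standard `wave` module. It assumes the audio arrives as 16-bit mono PCM at 24 kHz; those defaults and the function name `save_pcm_as_wav` are assumptions for illustration, not the add-on's confirmed implementation.

```python
import wave


def save_pcm_as_wav(path, pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are assumptions; adjust them to
    match the audio format you actually receive.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
```

The `wave` module fills in the header sizes when the file is closed, so no manual header handling is needed.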

Smart Voice Loading & Caching

  • Available voices are fetched dynamically from the Gemini API.
  • Voice data is cached for 24 hours to reduce API calls and speed up startup.
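
A 24-hour cache like the one described above could be sketched as follows. The JSON file layout, the `saved_at` key, and both function names are hypothetical; the add-on's actual cache format may differ.

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as described above


def save_voices(cache_path, voices, now=None):
    """Write the voice list together with the time it was saved."""
    now = time.time() if now is None else now
    payload = {"saved_at": now, "voices": voices}
    Path(cache_path).write_text(json.dumps(payload), encoding="utf-8")


def load_cached_voices(cache_path, now=None):
    """Return cached voices, or None if the cache is missing, stale, or unreadable."""
    now = time.time() if now is None else now
    try:
        payload = json.loads(Path(cache_path).read_text(encoding="utf-8"))
        if now - payload["saved_at"] < CACHE_TTL_SECONDS:
            return payload["voices"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # treat any unreadable cache as a cache miss
    return None
```

When `load_cached_voices` returns `None`, the caller would fetch the list from the API again and call `save_voices` with the fresh result.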

Talk With AI (Live Conversation)

  • Real-time Voice Chat: Have a natural, low-latency spoken conversation with Gemini.
  • Grounding with Google Search: Enable the AI to access real-time information from the web during your chat.
  • Interruptible: You can interrupt the AI at any time by speaking or pressing "Stop Conversation".
  • Customizable: Uses your selected voice and style instructions.
  • Thinking Level Control: Choose No Thinking, Low, Medium, or High depending on the reasoning depth you want.
  • Reconnect Continuity: Recent conversation context is restored automatically after a reconnect, without a separate memory toggle.
  • More Stable Streaming: Improved reconnection behavior (backoff + retry) and adaptive audio buffering for better resilience on unstable networks.
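
The "backoff + retry" behavior mentioned above is commonly implemented as exponential backoff with a little random jitter. The sketch below shows that general pattern only; the delay values, the attempt count, and the names `backoff_delays` and `connect_with_retry` are assumptions, not the add-on's actual settings.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=30.0, attempts=5):
    """Yield exponentially growing delays in seconds, capped at max_delay."""
    for attempt in range(attempts):
        yield min(max_delay, base * factor ** attempt)


def connect_with_retry(connect, attempts=5, sleep=time.sleep):
    """Call connect() until it succeeds; re-raise after the final failed attempt."""
    delays = list(backoff_delays(attempts=attempts - 1))
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Jitter spreads out reconnects so clients do not retry in lockstep.
            sleep(delays[attempt] + random.uniform(0, 0.5))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.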

Requirements

  • NVDA 2024.1 or newer, tested through NVDA 2026.1.
  • Active internet connection.
  • A valid Google Gemini API Key.

Installation

  1. Download the latest add-on package from the Releases page: https://github.com/MuhammadGagah/native-speech-generation/releases
  2. Install it like any standard NVDA add-on.
  3. Restart NVDA when prompted.

API Key Setup (Required)

  1. Create an API key from Google AI Studio: https://aistudio.google.com/apikey
  2. Open NVDA and go to: NVDA Menu → Tools → Native Speech Generation
  3. Click “API Key Settings”. This opens NVDA Settings directly in the Native Speech Generation category.
  4. Paste your Gemini API Key into the GEMINI API Key field.
  5. Click OK to save.

Saved keys are stored securely using Windows DPAPI, so the encrypted value cannot be decrypted on a different Windows machine or user account.

For advanced deployments, you can also provide the key through the GEMINI_API_KEY environment variable. The add-on will use it automatically when no stored key is available.
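
The lookup order described above (stored key first, then the environment variable) can be sketched like this. The function name `resolve_api_key` is hypothetical, and the sketch leaves out DPAPI decryption: it assumes the stored key has already been decrypted.

```python
import os


def resolve_api_key(stored_key, environ=os.environ):
    """Prefer the stored key; fall back to GEMINI_API_KEY; return None if neither is set."""
    if stored_key and stored_key.strip():
        return stored_key.strip()
    env_key = environ.get("GEMINI_API_KEY", "").strip()
    return env_key or None
```

A caller would treat a `None` result as "no key configured" and prompt the user to open API Key Settings.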


How to Use

Open the dialog using:

  • NVDA+Control+Shift+G, or
  • NVDA Menu → Tools → Native Speech Generation

Main Interface Elements

  • Text to convert: Enter or paste the text you want to convert to speech.

  • Style instructions (optional): Provide guidance for tone, emotion, or delivery.

  • Select Model

    • Flash 3.1 Preview
    • Flash 2.5
    • Pro 2.5 (High Quality)

  • Speaker Mode

    • Single-speaker
    • Multi-speaker (2)

Generating Speech

Single-Speaker Mode

  1. Select Single-speaker.
  2. Choose a voice from the Select Voice dropdown.
  3. Enter your text.
  4. Optionally add style instructions.
  5. Click Generate Speech.
  6. The audio will play automatically after generation.

Multi-Speaker Mode

  1. Select Multi-speaker (2).

  2. For each speaker:

    • Enter a unique Speaker Name.
    • Select a distinct Voice.

  3. Format the text so each line starts with the speaker name followed by a colon.

     Example:

     Alice: Hi Bob, how are you today?
     Bob: I'm doing great, Alice! The weather is fantastic.

  4. Click Generate Speech. Voices will be assigned automatically based on the speaker names.
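
If you prepare scripts outside NVDA, a small pre-check can catch lines that do not follow the "Name: text" format before you paste them in. The helper below is a hypothetical sketch for that purpose; it is not part of the add-on, which leaves the voice mapping to the model.

```python
def parse_dialogue(script, speakers):
    """Split 'Name: text' lines into (speaker, text) pairs.

    Raises ValueError for a non-empty line that does not start with one
    of the given speaker names followed by a colon.
    """
    known = {name.strip().lower(): name for name in speakers}
    turns = []
    for line_number, line in enumerate(script.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are allowed between turns
        name, separator, text = line.partition(":")
        speaker = known.get(name.strip().lower())
        if not separator or speaker is None:
            raise ValueError(
                f"line {line_number}: expected one of {speakers} followed by ':'"
            )
        turns.append((speaker, text.strip()))
    return turns
```

For the example script above, `parse_dialogue(script, ["Alice", "Bob"])` returns one pair per line, with Alice first and Bob second.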

Talk With AI (Live Mode)

Experience a natural, two-way voice conversation with Gemini.

  1. Configure your desired Voice and Style Instructions in the main dialog. Note: Talk With AI currently supports Single-speaker mode only.
  2. Click Talk With AI.
  3. In the new window:
    • Start Conversation: Begins the session. Speak into your microphone.
    • Stop Conversation: Ends the session.
    • Grounding with Google Search: Check this box to allow Gemini to search the web for answers (e.g., current news, weather).
      • Note: This checkbox is hidden while a conversation is active. Stop the conversation to change it.
    • Thinking level: Choose No Thinking, Low, Medium, or High.
    • Microphone Toggle: Mute/Unmute your microphone.
    • Volume: Adjust the AI's playback volume.

Advanced Settings

  • Enable Advanced Settings (Temperature) to show the slider.

  • Temperature Range:

    • 0.0 → Most deterministic and stable.
    • 1.0 → Default balance.
    • 2.0 → Most creative and varied.
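
When a temperature value comes from a saved setting or another script, it helps to keep it inside the 0.0 to 2.0 range listed above. The following is a minimal sketch; the function name `clamp_temperature` and the fallback to 1.0 for invalid input are assumptions.

```python
def clamp_temperature(value, low=0.0, high=2.0, default=1.0):
    """Limit value to [low, high]; return default if it is not a usable number."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    if number != number:  # NaN is the only float that is not equal to itself
        return default
    return max(low, min(high, number))
```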

Buttons Overview

  • Generate Speech - Start speech generation.
  • Play - Replay the last generated audio.
  • Talk With AI - Open the real-time voice conversation interface.
  • Save Audio - Save the last audio as a .wav file.
  • API Key Settings - Open the add-on configuration in NVDA Settings.
  • View voices in AI Studio - Opens Google AI Studio in a browser.
  • Close - Close the dialog (or press Escape).

Input Gestures

Customizable via: NVDA Menu → Preferences → Input Gestures → Native Speech Generation

Default gesture:

  • NVDA+Control+Shift+G – Open Native Speech Generation dialog.

Development & Contribution Guide

If you want to develop or modify this add-on, follow the steps below.

Environment Setup

  • Python matching your target NVDA runtime

    • Use Python 3.13 64-bit when testing or packaging dependencies for NVDA 2026.1 and newer.
    • Use Python 3.11 32-bit only when packaging dependencies for older supported NVDA builds.
  • uv for the pinned build and lint toolchain.

    uv sync
    uv run pre-commit run --all-files
    uv run scons
    uv run scons pot
    

    SCons 4.10.1, Markdown 3.10, Ruff 0.14.10, Pyright 1.1.407, and the other build tools are installed from uv.lock.

  • GNU Gettext Tools (optional, recommended for localization)

Additional Dependencies

For local development only, install the audio-only Talk With AI dependencies directly into the add-on library path using the Python version and architecture that match the NVDA runtime you are testing:

python.exe -m pip install google-genai pyaudio --target "D:/myAdd-on/Native-Speech-Generation/addon/globalPlugins/NativeSpeechGeneration/lib"

Adjust the path according to your local add-on source directory.

For the current audio-only Talk With AI implementation, you do not need opencv-python, pillow, or mss.

For release packages, the add-on downloads the latest verified dependency archive based on the running NVDA version:

  • lib.zip for NVDA 2025.3.3 and older supported builds.
  • lib64.zip for NVDA 2026.1 and newer.
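
The version-based choice between the two archives can be expressed as a simple tuple comparison. This is a sketch under the assumption that the running NVDA version is available as a `(year, major, minor)` tuple; the function name `dependency_archive` is hypothetical.

```python
def dependency_archive(nvda_version):
    """Map an NVDA (year, major, ...) version tuple to the archive name."""
    # NVDA 2026.1 and newer use the 64-bit archive; older builds use lib.zip.
    return "lib64.zip" if tuple(nvda_version[:2]) >= (2026, 1) else "lib.zip"
```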

The add-on reads SHA-256 data from the latest GitHub dependency release, using the release asset digest or checksum files. Bundled approved checksums are kept only as a fallback for first-time installs when the latest-release lookup fails. Manual library reinstalls require the latest verified release. The extracted folder is always installed as addon/globalPlugins/NativeSpeechGeneration/lib.
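
A SHA-256 check of a downloaded archive generally looks like the sketch below. The function names are hypothetical, and the handling of an optional `sha256:` prefix on the expected value is an assumption about the digest format, not a description of the add-on's actual code.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(path, expected):
    """Compare the file's SHA-256 with the expected hex digest."""
    # Assumption: the expected value may carry a "sha256:" prefix.
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return hmac.compare_digest(sha256_of_file(path), expected_hex)
```

An archive that fails the check should be deleted rather than extracted.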

After installing the dependencies locally, also copy the following from your Python installation into addon/globalPlugins/NativeSpeechGeneration/lib:

  • zoneinfo folder
  • secrets.py file

Contributing

Contributions, suggestions, and bug reports are very welcome.

  • Open an Issue for bugs or feature requests.
  • Submit a Pull Request for code contributions.

Contact
