Trump Speeches NLP Chatbot

A full-stack NLP platform built from scratch — retrieval-augmented generation with hybrid search, multi-model sentiment analysis, and AI-powered topic extraction, all wrapped in a FastAPI service with pluggable LLM providers (Gemini, OpenAI, Claude), semantic document chunking, and automated deployment pipelines.

The code is the resume. If you want to see how I think about software, architecture, and ML engineering — you're in the right place.

What's Inside

RAG Pipeline

This isn't a LangChain tutorial copy-paste. It's a modular pipeline where each component (search, confidence scoring, entity analysis, document loading, guardrails, query rewriting) has its own responsibilities, tests, and clean interfaces.

- Hybrid Search: dense MPNet embeddings (768d) + BM25 keyword matching (70/30 weighting) + cross-encoder reranking for precision
- Three-Layer Guardrails: pre-retrieval query validation → post-retrieval relevance filtering (sigmoid-normalised cross-encoder scores) → post-generation grounding verification. If no chunks pass the relevance gate, the system says "I don't know" instead of hallucinating
- Query Rewriting: LLM-powered query cleaning (typos, abbreviations) before search. Conservative by design: no synonym expansion, no scope broadening. Deterministic rewrites at temperature=0.0
- Semantic Chunking: custom sentence-level embedding-similarity chunker (not LangChain's off-the-shelf splitter). NLTK tokenisation + cosine similarity with configurable percentile-based breakpoints and tail-merging. Produces ~2,354 coherent chunks from 35 speeches
- Smart Confidence Scoring: multi-factor calculation from retrieval quality (40%), consistency (25%), coverage (20%), and entity presence (15%)
- Entity Analytics: extraction with sentiment analysis, co-occurrence tracking, and speech coverage mapping
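
As a rough illustration of the numbers in the list above, the sketch below shows one way the 70/30 hybrid score, the sigmoid relevance gate, and the weighted confidence score could be computed. It is not the project's code: the function names, the min-max normalisation, the 0.5 gate threshold, and the reading of 70/30 as 70% dense and 30% BM25 are all assumptions made for the example.

```python
import math

# Assumption: "70/30 weighting" means 70% dense similarity, 30% BM25.
DENSE_WEIGHT, BM25_WEIGHT = 0.7, 0.3

# Weights taken from the confidence-scoring bullet above.
CONFIDENCE_WEIGHTS = {
    "retrieval": 0.40,
    "consistency": 0.25,
    "coverage": 0.20,
    "entity": 0.15,
}


def min_max(scores):
    """Rescale scores to [0, 1] so dense and BM25 values are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(dense, bm25):
    """Weighted sum of normalised dense and BM25 scores, one per candidate."""
    d, b = min_max(dense), min_max(bm25)
    return [DENSE_WEIGHT * x + BM25_WEIGHT * y for x, y in zip(d, b)]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relevance_gate(logits, threshold=0.5):
    """Indices of chunks whose sigmoid-normalised reranker score passes."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]


def confidence(factors):
    """Weighted sum of per-factor scores, each expected in [0, 1]."""
    return sum(w * factors[name] for name, w in CONFIDENCE_WEIGHTS.items())
```

Under these assumptions, a chunk with a negative cross-encoder logit maps below 0.5 and is dropped. If `relevance_gate` returns an empty list, that is the point where the pipeline would answer "I don't know".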

Sentiment Analysis

Three models working together: FinBERT for polarity, RoBERTa for emotion detection, and an LLM for contextual interpretation. Not just "positive/negative" — actual nuanced analysis with explanations.

Topic Extraction

DBSCAN semantic clustering with sentence-transformers, LLM-generated topic labels and summaries, contextual snippets with keyword highlighting.

The Engineering

- Pluggable LLM Providers: factory pattern with lazy imports. Swap Gemini/OpenAI/Anthropic by changing one env var
- FastAPI Backend: 12+ endpoints, async handling, Pydantic validation, dependency injection
- CI/CD: 9 GitHub Actions workflows covering tests, lint, type-check, security, Docker, docs, and Azure/Render deployment
- Testing: 191 tests, 66%+ coverage, parametrised test cases
- Modern Python: uv, Ruff, mypy, structured logging (JSON for prod, pretty for dev), Docker multi-stage builds
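
To make "factory pattern with lazy imports" concrete, here is a minimal sketch of the idea. The module paths mirror the project layout shown later in this README, but the class names (`GeminiLLM`, `OpenAILLM`, `AnthropicLLM`), the `create_llm` function, and the `registry` parameter are hypothetical and only serve the illustration.

```python
import importlib
import os

# Provider name -> (module path, class name).
# Module paths follow the project tree; class names are hypothetical.
PROVIDERS = {
    "gemini": ("speech_nlp.services.llm.gemini", "GeminiLLM"),
    "openai": ("speech_nlp.services.llm.openai", "OpenAILLM"),
    "anthropic": ("speech_nlp.services.llm.anthropic", "AnthropicLLM"),
}


def create_llm(provider=None, registry=PROVIDERS, **kwargs):
    """Instantiate the selected provider, importing its module on demand."""
    name = (provider or os.environ.get("LLM_PROVIDER", "gemini")).lower()
    if name not in registry:
        raise ValueError(
            f"Unknown LLM provider {name!r}; expected one of {sorted(registry)}"
        )
    module_path, class_name = registry[name]
    # The import happens only here, so an optional SDK that is not installed
    # causes an error only when its provider is actually requested.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)(**kwargs)
```

The benefit of this shape is that a Gemini-only install never touches the OpenAI or Anthropic SDKs, which is why those can live in optional dependency groups.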

Try It Live

The API is deployed on Azure and ready to explore:

- Interactive Web App: try the RAG system, sentiment analysis, and topic extraction
- API Docs (Swagger): interactive API playground
- API Docs (ReDoc): clean, readable documentation
- Full Documentation: complete guides, architecture diagrams, and API reference

⚠️ Important - Cold Start Notice:

The app runs on Azure Free Tier hosting. Because of the large ML models (~2 GB) and the containerized deployment, expect the following:

- Initial load (cold start): 1-5 minutes after the app has been idle. You may need to refresh the page several times.
- AI-generated responses: 30 seconds to 2 minutes for complex queries (LLM processing + embeddings).
- Subsequent requests: fast once warmed up (~2-5 seconds).

Recommended workflow: open the link and refresh every 30 seconds for a few minutes until the page loads. Once loaded, the app is responsive.
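
If you would rather script the wait than refresh by hand, the sketch below polls the `/health` endpoint every 30 seconds for up to five minutes. The helper names (`is_up`, `wait_until_ready`) are made up for this example, and it assumes that `/health` answers with HTTP 200 once the app is warm.

```python
import time
import urllib.request

BASE_URL = "https://trump-speeches-nlp-chatbot.azurewebsites.net"


def is_up(url, timeout=10.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError/HTTPError, timeouts, and dropped connections.
        return False


def wait_until_ready(url, probe=is_up, interval=30.0, max_wait=300.0,
                     sleep=time.sleep):
    """Poll `url` every `interval` seconds until it responds or time runs out."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


if __name__ == "__main__":
    ready = wait_until_ready(f"{BASE_URL}/health")
    print("App is ready" if ready else "Still cold after 5 minutes")
```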

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/rag/ask` | POST | Ask questions; returns AI-generated answers with confidence scores and source attribution |
| `/rag/search` | POST | Semantic search over indexed documents |
| `/rag/stats` | GET | Vector database statistics |
| `/rag/index` | POST | Index/re-index documents |
| `/analyze/sentiment` | POST | Multi-model sentiment analysis (FinBERT + RoBERTa + LLM) |
| `/analyze/topics` | POST | AI-powered topic extraction with semantic clustering |
| `/analyze/words` | POST | Word frequency analysis |
| `/analyze/ngrams` | POST | N-gram analysis |
| `/health` | GET | System health and service status |
| `/config` | GET | Public runtime configuration |
| `/diagnostics` | GET | Detailed diagnostics for troubleshooting |

Interactive docs at /docs (Swagger) and /redoc (ReDoc). Web UI at /.

The Dataset

35 rally speech transcripts (2019–2020), 300,000+ words, indexed as ~2,354 semantic chunks in ChromaDB. Real-world political text with nuanced language — a good stress test for NLP.
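
For readers curious how a transcript might turn into semantic chunks, here is a minimal pure-Python sketch of percentile-based breakpoint chunking. It takes already-split sentences and their embeddings as input (the project uses NLTK and sentence embeddings for those steps). The names `semantic_chunks`, `breakpoint_pct`, and `min_tail`, the 90th-percentile default, and the reading of "tail-merging" as folding a too-short final chunk into the previous one are assumptions for illustration, not the project's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def percentile(values, pct):
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences, embeddings, breakpoint_pct=90.0, min_tail=2):
    """Split where the distance between neighbouring sentences is unusually high."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []

    # Cosine distance between each sentence and the one that follows it.
    distances = [
        1.0 - cosine(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    threshold = percentile(distances, breakpoint_pct)

    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)

    # Tail-merge: fold a too-short final chunk into the previous one.
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return chunks
```

Using a percentile rather than a fixed distance threshold lets the split points adapt to each speech: a speech with uniformly similar sentences still gets broken at its relatively largest shifts.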

Get Started

What You Need

- Python 3.11 or newer
- uv (modern Python package manager)
- An LLM API key (grab a free one from Google Gemini, the default provider)
  - Or use OpenAI / Anthropic if you prefer

Setup

Install dependencies
```
uv sync
```

Configure LLM Provider

The project supports multiple LLM providers with a model-agnostic configuration approach.

Option A: Google Gemini (Default)

Create a .env file in the project root:

# LLM Provider Configuration
LLM_PROVIDER=gemini
LLM_API_KEY=your_gemini_api_key_here
LLM_MODEL_NAME=gemini-2.0-flash-exp

# Optional: Adjust LLM parameters
LLM_TEMPERATURE=0.7
LLM_MAX_OUTPUT_TOKENS=2048

Option B: OpenAI

# Install OpenAI support
uv sync --group llm-openai

Update .env:

LLM_PROVIDER=openai
LLM_API_KEY=sk-your_openai_api_key_here
LLM_MODEL_NAME=gpt-4o-mini
LLM_TEMPERATURE=0.7
LLM_MAX_OUTPUT_TOKENS=2048

Option C: Anthropic (Claude)

# Install Anthropic support
uv sync --group llm-anthropic

Update .env:

LLM_PROVIDER=anthropic
LLM_API_KEY=sk-ant-your_anthropic_api_key_here
LLM_MODEL_NAME=claude-3-5-sonnet-20241022
LLM_TEMPERATURE=0.7
LLM_MAX_OUTPUT_TOKENS=2048

Install All Providers:

uv sync --group llm-all
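
To show how the variables above fit together, here is a small standard-library sketch that parses and validates them. The project itself uses Pydantic settings, so treat this only as an illustration: the `LLMSettings` class is hypothetical, the fallback values simply reuse the examples above, and the code assumes the `.env` values have already been exported into the process environment.

```python
import os
from dataclasses import dataclass

SUPPORTED_PROVIDERS = ("gemini", "openai", "anthropic")


@dataclass(frozen=True)
class LLMSettings:
    provider: str
    api_key: str
    model_name: str
    temperature: float = 0.7
    max_output_tokens: int = 2048

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping of environment variables."""
        env = os.environ if env is None else env
        provider = env.get("LLM_PROVIDER", "gemini").lower()
        if provider not in SUPPORTED_PROVIDERS:
            raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
        api_key = env.get("LLM_API_KEY", "")
        if not api_key:
            raise ValueError("LLM_API_KEY is not set")
        model_name = env.get("LLM_MODEL_NAME", "")
        if not model_name:
            raise ValueError("LLM_MODEL_NAME is not set")
        return cls(
            provider=provider,
            api_key=api_key,
            model_name=model_name,
            temperature=float(env.get("LLM_TEMPERATURE", "0.7")),
            max_output_tokens=int(env.get("LLM_MAX_OUTPUT_TOKENS", "2048")),
        )
```

Failing fast on a missing key or an unknown provider at startup is usually friendlier than discovering the problem on the first request.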

Start the FastAPI server

uv run uvicorn speech_nlp.app:app --reload

Access the application
- Local: http://localhost:8000 (instant, recommended for testing)
- Azure (deployed): https://trump-speeches-nlp-chatbot.azurewebsites.net (cold start: 1-5 minutes; refresh periodically)
- API Docs: https://trump-speeches-nlp-chatbot.azurewebsites.net/docs
- ReDoc: https://trump-speeches-nlp-chatbot.azurewebsites.net/redoc

Try the RAG System

Web Interface: Navigate to the RAG tab and ask a question

API Example:

curl -X POST http://localhost:8000/rag/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What was said about the economy?", "top_k": 5}'

Python Example:

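import requests

# LLM-backed answers can take a while, so set an explicit timeout
# (requests has none by default) and fail loudly on HTTP errors.
response = requests.post(
    "http://localhost:8000/rag/ask",
    json={"question": "What economic policies were discussed?", "top_k": 5},
    timeout=180,
)
response.raise_for_status()
print(response.json()["answer"])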

Alternative: Docker

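Note: The container needs your LLM API key at run time. Pass it as an environment variable (for example with --env-file .env, as in the run command in this section) rather than writing it into the Dockerfile, where it would be baked into the image.

Run with Docker

Build the Docker image

docker build -t trump-speeches-nlp-chatbot .

Run the container, loading the API key and other settings from the .env file created during setup

docker run -it --rm -p 8000:8000 --env-file .env trump-speeches-nlp-chatbot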

Or use Docker Compose
```
docker-compose up -d
```

View Documentation Site (Optional)

The project includes comprehensive documentation built with MkDocs:

# Install documentation dependencies
uv sync --group docs

# Serve documentation site locally (with live reload)
uv run mkdocs serve --dev-addr localhost:8001

Then open http://localhost:8001 to browse the documentation with search and navigation.

Build static site:

uv run mkdocs build

This generates a site/ folder with the complete static documentation website.

Explore Analysis Notebooks (Optional)

# Install notebook dependencies (includes matplotlib, seaborn, plotly, etc.)
uv sync --group notebooks
uv run jupyter lab

Navigate to notebooks/ to explore statistical NLP analysis and visualizations.

Testing & Code Quality

uv sync --group dev        # Install dev dependencies
uv run pytest              # Run all tests with coverage
uv run ruff check src/     # Lint
uv run ruff format src/    # Format
uv run mypy src/           # Type check
uv run bandit -r src/ -c pyproject.toml  # Security scan

CI/CD runs 9 GitHub Actions workflows on every push: tests (Python 3.11 + 3.12), lint, type-check, security scanning, Docker build, docs deployment, and Azure/Render deployment.

CI/CD Pipeline

The project uses modular GitHub Actions workflows for continuous integration:

✅ Automated Testing on Python 3.11, 3.12 (python-tests.yml)
✅ Code Quality — Ruff linting and formatting (python-lint.yml)
✅ Type Checking — Mypy static analysis (python-typecheck.yml)
✅ Security Scanning — Bandit and pip-audit (security-audit.yml)
✅ Documentation Linting — Markdownlint (markdown-lint.yml)
✅ Documentation Deployment — Auto-deploy to GitHub Pages (deploy-docs.yml)
✅ Docker Build and Push — Automated image build and push to DockerHub (build-push-docker.yml)
✅ Azure Deployment — Deploy to Azure on push to main (deploy-azure.yml)

For detailed testing documentation, see the Testing Guide.

Project Structure

src/speech_nlp/
├── app.py                       # Application entry point
├── constants.py                 # Application-wide constants
├── exceptions.py                # Custom exception classes
├── security.py                  # API key + input sanitization
├── api/
│   ├── chatbot.py               # RAG endpoints
│   ├── analysis.py              # NLP analysis endpoints
│   ├── health.py                # Health, config, diagnostics
│   └── dependencies.py          # Dependency injection
├── config/
│   ├── settings.py              # Pydantic settings
│   └── logging.py               # Structured logging (JSON + color)
├── schemas/
│   ├── requests.py              # API request models
│   └── responses.py             # API response models
├── services/
│   ├── llm/                     # Pluggable LLM providers
│   │   ├── base.py              #   Abstract interface
│   │   ├── factory.py           #   Factory + lazy imports
│   │   ├── gemini.py            #   Google Gemini
│   │   ├── openai.py            #   OpenAI GPT (optional)
│   │   └── anthropic.py         #   Anthropic Claude (optional)
│   ├── rag/                     # RAG pipeline components
│   │   ├── service.py           #   RAG orchestrator
│   │   ├── search.py            #   Hybrid search
│   │   ├── guardrails.py        #   Three-layer guardrails
│   │   ├── rewriter.py          #   LLM query cleaning
│   │   ├── confidence.py        #   Multi-factor scoring
│   │   ├── entities.py          #   Entity extraction + analytics
│   │   ├── chunking.py          #   Semantic chunking + metadata
│   │   └── models.py            #   Internal domain models
│   └── analysis/
│       ├── sentiment.py         # Multi-model sentiment
│       ├── topics.py            # Semantic topic extraction
│       └── text.py              # Word frequency, n-grams
├── utils/
│   ├── embeddings.py            # Embedding utilities
│   ├── formatting.py            # Response formatting
│   ├── io.py                    # Data loading
│   └── text.py                  # Text cleaning
├── templates/index.html         # Web UI
└── static/                      # CSS + images

tests/                           # 191 tests, 66%+ coverage
data/                            # Speech transcripts + ChromaDB
configs/                         # YAML env configs (dev/staging/prod)
docs/                            # MkDocs documentation site

Documentation

Full Documentation Site — Guides, architecture diagrams, and reference docs.

To run locally: uv sync --group docs && uv run mkdocs serve --dev-addr localhost:8001

License

MIT License — see LICENSE. Speech transcripts are publicly available data.

Kristiyan Bonev — GitHub · LinkedIn · k.s.bonev@gmail.com
