(Qissa Go — "The Storyteller") is a full-stack generative AI system that produces short Urdu children's stories using classical probabilistic NLP techniques. Built without any pre-trained large language models, this system demonstrates how corpus engineering, subword tokenization, and n-gram language modeling can be combined into a production-grade microservice with a modern, reactive frontend.
The system scrapes real Urdu stories from the web, trains a custom Byte Pair Encoding (BPE) tokenizer from scratch, fits a trigram language model with linear interpolation smoothing, and serves generation through a containerized FastAPI backend. A Next.js frontend deployed on Vercel provides a ChatGPT-style typewriter interface fully in Urdu.
This project was developed as Assignment 1 for CS 4063 — Natural Language Processing, bridging classical NLP foundations with modern software engineering practices.
- Scrape and preprocess a diverse corpus of at least 200 real-world Urdu stories with custom special tokens
- Train a custom BPE tokenizer with a vocabulary size of 250 — without using any pre-built tokenizer libraries
- Implement a trigram language model with Maximum Likelihood Estimation (MLE) and linear interpolation smoothing
- Serve the model via a containerized FastAPI microservice exposing a `POST /generate` endpoint
- Build and deploy a reactive Next.js frontend on Vercel with a typewriter-style story display
- Establish a full DevOps pipeline with Docker and GitHub Actions CI/CD
The system follows a decoupled microservices architecture. The frontend communicates with the backend exclusively through a REST API, and all NLP processing is isolated within the backend container.
graph TD
A[👤 User - Browser] -->|HTTPS| B[🌐 Next.js Frontend - Vercel]
B -->|POST /generate - REST| C[🐳 FastAPI Backend - Docker Container]
C --> D[🔤 BPE Tokenizer\nbpe_tokenizer.json]
C --> E[📊 Trigram LM\nTrained at Startup]
E --> F[📂 Tokenized_Data.txt\nPre-encoded Corpus]
D --> F
C -->|Generated Story JSON| B
B -->|Typewriter Animation| A
subgraph Data Pipeline [Offline Data Pipeline]
G[🕷️ Selenium Scraper] -->|Raw .txt files| H[urdu_stories_dataset/]
H --> I[BPE Training Script]
I --> D
I --> F
end
| Technology | Purpose |
|---|---|
| Python 3.11 | Core backend language |
| Selenium + undetected-chromedriver | JavaScript-rendered web scraping |
| BeautifulSoup4 | HTML parsing and content extraction |
| FastAPI | REST API microservice framework |
| Uvicorn | ASGI server for FastAPI |
| Pydantic | Request/response schema validation |
| Next.js 16 | React-based frontend framework |
| React 19 | Component-based UI library |
| Tailwind CSS 4 | Utility-first styling |
| Docker | Backend containerization |
| GitHub Actions | CI/CD automation pipeline |
| Vercel | Frontend cloud deployment |
| Render / Railway | Backend container hosting |
| Conda | Scraping environment management |
NLP_Assignment-1-main/
│
├── Scraping/
│ └── scarper.py # Selenium + BS4 scraper for UrduPoint
│
├── urdu_stories_dataset/ # 597 raw scraped Urdu story .txt files
│ ├── Aik_Chirya_Ki_Kahani_*.txt
│ └── ...
│
├── BSETokenizer/
│ ├── TokenizerCode.py # Full BPE implementation (train/encode/decode/save/load)
│ ├── bpe_tokenizer.json # Trained tokenizer artifact (vocab + merge list)
│ └── __init__.py
│
├── Tokenized_Dataset/
│ └── Tokenized_Data.txt # Full corpus encoded with BPE (one token per line)
│
├── Trigram_Language_Model/
│ ├── trigram_model.py # Trigram LM: train, interpolation, generation, postprocess
│ ├── run.sh # Shell script to run the generation pipeline
│ └── __init__.py
│
├── api/
│ ├── main.py # FastAPI app: /generate and /health endpoints
│ └── requirements.txt # Python dependencies for the API
│
├── frontend/
│ ├── app/
│ │ ├── page.jsx # Main page: orchestrates all UI state
│ │ ├── layout.jsx # Root HTML layout and font imports
│ │ └── globals.css # Global Urdu font + animation styles
│ ├── components/
│ │ ├── StoryDisplay.jsx # Typewriter story output with copy/new story actions
│ │ ├── InputPanel.jsx # Urdu text input + submit button
│ │ ├── SettingsPanel.jsx # Sliders for max_length, temperature, top_k
│ │ └── TypingIndicator.jsx # Animated loading dots
│ ├── lib/
│ │ └── api.js # fetch wrappers: generateStory(), checkHealth()
│ ├── package.json
│ └── next.config.mjs
│
├── Requirements/
│ └── ScrapingConda.sh # Conda environment setup script
│
├── Prompts/
│ ├── ScrapingPrompts.md # AI prompts used during scraper development
│ └── DeploymentPrompts.md # AI prompts used during deployment setup
│
├── all_story_links.txt # Collected UrduPoint story URLs
├── Dockerfile # Backend container definition
└── __init__.py

Directory explanations:

- `Scraping/` — The offline data acquisition pipeline. Runs once to build the corpus.
- `urdu_stories_dataset/` — 597 cleaned Urdu story files, each containing preprocessed text with special tokens inserted in place of punctuation.
- `BSETokenizer/` — Self-contained BPE tokenizer module. The trained artifact (`bpe_tokenizer.json`) is checked into the repository so the API can load it without retraining.
- `Tokenized_Dataset/` — The entire corpus encoded as BPE tokens, stored one token per line. This file is loaded at API startup to train the trigram model in memory.
- `Trigram_Language_Model/` — Pure-Python n-gram LM with no runtime dependencies beyond the standard library and the tokenizer module.
- `api/` — The FastAPI microservice. Loads all models at startup and serves generation requests.
- `frontend/` — The Next.js application deployed on Vercel.
Objective: Scrape a minimum of 200 Urdu children's stories, strip non-Urdu content, and annotate text with structural special tokens.
Source: UrduPoint Kids — Moral Stories
Result: 597 story files collected across 50 list pages.
The scraper uses undetected-chromedriver to bypass bot detection on UrduPoint, which renders story content via JavaScript. BeautifulSoup then parses the returned page source.
def get_driver():
    options = uc.ChromeOptions()
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.page_load_strategy = "eager"
    driver = uc.Chrome(options=options)
    return driver

The scraper targets the `.txt_detail` container, surgically removes ad nodes (`.hide_desk`, `<script>`, `<ins>`, `<iframe>`), then splits content on `<br>` and `</p>` boundaries using a marker-replacement strategy to reconstruct natural paragraph breaks.
Each paragraph is passed through a two-stage filter: contains_urdu() (Unicode range check) and is_boilerplate() (phrase blocklist), then into process_urdu_text():
def process_urdu_text(text):
    # 1. Strip all quotation marks (straight and curly)
    text = text.translate(str.maketrans('', '', '"\'“”‘’'))
    # 2. Collapse runs of periods (ellipses) to a space
    text = re.sub(r'\.+', ' ', text)
    # 3. Replace Urdu sentence-ending punctuation with <EOS>
    for punct in ['۔', '؟', '!']:
        text = text.replace(punct, ' <EOS> ')
    # 4. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # 5. Guarantee terminal <EOS>
    if not text.endswith('<EOS>'):
        text += ' <EOS>'
    return text

| Token | Meaning | Inserted At |
|---|---|---|
| `<EOS>` | End of Sentence | Each ۔, ؟, ! |
| `<EOP>` | End of Paragraph | After each paragraph block |
| `<EOD>` | End of Document (Story) | After the final paragraph |
These markers are multi-character ASCII strings delimited by `<` and `>`; since angle brackets never appear in the cleaned Urdu text, the tokens cannot collide with any Urdu character.
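The `contains_urdu()` Unicode range check mentioned earlier can be implemented as a one-liner. The exact range the project uses is an assumption; U+0600–U+06FF is the core Arabic block, which contains the Urdu alphabet.

```python
def contains_urdu(text: str) -> bool:
    # True if any character falls in the Arabic Unicode block (U+0600–U+06FF),
    # which covers the Urdu alphabet.
    return any('\u0600' <= ch <= '\u06ff' for ch in text)

contains_urdu("ایک دن")     # True: Urdu characters present
contains_urdu("Read more")  # False: Latin-only boilerplate
```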
flowchart TD
A[List Page URL] --> B[Selenium GET + Wait]
B --> C[Collect story hrefs]
C --> D{For each story URL}
D --> E[GET story page]
E --> F[BS4 parse page source]
F --> G[Remove ads/scripts]
G --> H[Split on br / /p markers]
H --> I{For each paragraph}
I --> J{Contains Urdu?}
J -- No --> K[Discard]
J -- Yes --> L{Boilerplate?}
L -- Yes --> K
L -- No --> M[process_urdu_text]
M --> N[Append EOP]
N --> O[Write .txt file + EOD]
Objective: Train a custom Byte Pair Encoding tokenizer to vocabulary size 250 without using any third-party tokenizer library.
BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent symbol pair into a new composite token, growing the vocabulary until the target size is reached.
Initial: [ا, ی, ک, ' ', <EOS>, ...]
Iter 1: Count pairs → most frequent ('ک', 'ہ') → merge → 'کہ'
Iter 2: Count pairs → most frequent ('ا', 'ن') → merge → 'ان'
...
Until: |vocab| == 250
def bytePairEncoding(training_sequence, initial_extra_tokens=None, max_vocab_size=250):
    vocab = set(training_sequence)
    if initial_extra_tokens:
        vocab.update(initial_extra_tokens)
    merge_list = []
    while len(vocab) < max_vocab_size:
        # Count all adjacent pairs
        pair_counts = {}
        for i in range(len(training_sequence) - 1):
            pair = (training_sequence[i], training_sequence[i + 1])
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
        # Find most frequent pair
        most_frequent_pair = max(pair_counts, key=pair_counts.get)
        if pair_counts[most_frequent_pair] < 2:
            break
        # Merge: create new token and replace all occurrences
        new_token = most_frequent_pair[0] + most_frequent_pair[1]
        training_sequence = merge_pair(training_sequence, most_frequent_pair, new_token)
        vocab.add(new_token)
        merge_list.append((new_token, most_frequent_pair))
    return training_sequence, vocab, merge_list

The special tokens `<EOS>`, `<EOP>`, `<EOD>` are pre-seeded into the initial vocabulary via `initial_extra_tokens` and are treated as atomic units by `tokenize_text()`, which scans for them before falling back to single-character tokenization.
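The priority scan that keeps special tokens atomic might look like the following. This is a hypothetical reconstruction of `tokenize_text()`, not the project's verbatim code.

```python
SPECIAL_TOKENS = ["<EOS>", "<EOP>", "<EOD>"]

def tokenize_text(text, special_tokens=SPECIAL_TOKENS):
    tokens, i = [], 0
    while i < len(text):
        # Try to match a special token at the current position first...
        for tok in special_tokens:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # ...otherwise fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

tokenize_text("اب <EOS>")  # ['ا', 'ب', ' ', '<EOS>']
```

Matching the multi-character markers before single characters guarantees `<EOS>` is never split into `<`, `E`, `O`, `S`, `>`.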
Encoding replays the merge list in reverse order (most recently learned merge applied first), greedily merging character pairs in the input to maximize compression:
def encode(sequence, vocab, merge_list, special_tokens=SPECIAL_TOKENS):
    if isinstance(sequence, str):
        sequence = tokenize_text(sequence, special_tokens)
    for new_token, pair in reversed(merge_list):
        sequence = apply_merge(sequence, pair, new_token)
    return sequence

Decoding replays the merge list in forward order, expanding each composite token back to its constituent parts:
def decode(encoded_sequence, vocab, merge_list):
    sequence = list(encoded_sequence)  # work on a copy of the input
    for new_token, pair in merge_list:
        sequence = expand_token(sequence, new_token, pair)
    return sequence

The trained tokenizer is persisted as `BSETokenizer/bpe_tokenizer.json`:
{
  "vocab": ["ا", "ب", "کہ", "ان", "<EOS>", "..."],
  "merge_list": [["کہ", ["ک", "ہ"]], ["ان", ["ا", "ن"]], "..."],
  "special_tokens": ["<EOS>", "<EOP>", "<EOD>"]
}

flowchart TD
A[Load all .txt files] --> B[Concatenate corpus]
B --> C[tokenize_text: char + special tokens]
C --> D{vocab < 250?}
D -- Yes --> E[Count all adjacent pairs]
E --> F[Find most frequent pair]
F --> G{count >= 2?}
G -- No --> H[Stop]
G -- Yes --> I[Create new_token = left + right]
I --> J[Replace all occurrences in sequence]
J --> K[Add to vocab, append to merge_list]
K --> D
D -- No --> L[Save bpe_tokenizer.json]
L --> M[Save Tokenized_Data.txt]
Objective: Train a 3-gram language model with linear interpolation smoothing and support variable-length generation until the <EOD> token.
The model counts unigrams, bigrams, and trigrams over the BPE-encoded corpus loaded from Tokenized_Dataset/Tokenized_Data.txt:
def train_trigram_lm(encoded_corpus):
    unigram_counts = Counter(encoded_corpus)
    bigram_counts = defaultdict(int)
    trigram_counts = defaultdict(int)
    for i in range(len(encoded_corpus) - 1):
        bigram_counts[(encoded_corpus[i], encoded_corpus[i+1])] += 1
    for i in range(len(encoded_corpus) - 2):
        trigram_counts[(encoded_corpus[i], encoded_corpus[i+1], encoded_corpus[i+2])] += 1
    return unigram_counts, bigram_counts, trigram_counts, len(encoded_corpus)

Raw trigram MLE probabilities are zero for unseen contexts. Linear interpolation combines all three n-gram orders to guarantee non-zero probability for any token in the vocabulary:
$$P(w_3 \mid w_1, w_2) = \lambda_1\, P_{\text{MLE}}(w_3 \mid w_1, w_2) + \lambda_2\, P_{\text{MLE}}(w_3 \mid w_2) + \lambda_3\, P_{\text{MLE}}(w_3)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$. With the project defaults $\lambda_1 = 0.6$, $\lambda_2 = 0.3$, $\lambda_3 = 0.1$:
def interpolated_prob(w1, w2, w3, unigram_counts, bigram_counts,
                      trigram_counts, total_tokens, l1=0.6, l2=0.3, l3=0.1):
    p_tri = trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)] \
        if bigram_counts[(w1, w2)] > 0 else 0
    p_bi = bigram_counts[(w2, w3)] / unigram_counts[w2] \
        if unigram_counts[w2] > 0 else 0
    p_uni = unigram_counts[w3] / total_tokens
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

Generation is an autoregressive loop: at each step, the last two tokens form the trigram context, interpolated probabilities are computed for every vocabulary token, and the next token is sampled:
def generate_text(prefix_tokens, ..., max_length=1000, temperature=1.0, top_k=None, stop_token="<EOD>"):
    generated = prefix_tokens[:]
    while len(generated) < max_length:
        w1, w2 = generated[-2], generated[-1]
        # Compute probabilities for all vocab tokens
        probs = [interpolated_prob(w1, w2, w3, ...) for w3 in vocab]
        # Temperature scaling
        probs = [p ** (1.0 / temperature) for p in probs]
        # Top-K filtering
        # Repetition penalty: 0.7x for tokens seen in last 10
        next_token = random.choices(vocab, weights=probs)[0]
        generated.append(next_token)
        if next_token == stop_token:
            break
    return generated

Sampling controls:
| Parameter | Effect |
|---|---|
| `temperature` | Lower (<1.0) makes the distribution sharper (more predictable); higher (>1.0) increases randomness |
| `top_k` | Restricts sampling to the K most probable tokens at each step |
| Repetition penalty | Tokens appearing in the last 10 positions have their probability multiplied by 0.7 |
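The effect of temperature and top-k can be demonstrated on a toy distribution. This standalone sketch mirrors the scaling and filtering steps above; the helper name `adjust` and the example values are illustrative.

```python
import random

def adjust(probs, temperature=1.0, top_k=None):
    # Temperature: raise each weight to 1/T (T<1 sharpens, T>1 flattens).
    probs = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    # Top-K: keep only the K most probable tokens.
    if top_k is not None:
        kept = sorted(probs, key=probs.get, reverse=True)[:top_k]
        probs = {t: probs[t] for t in kept}
    # Renormalize so the weights form a distribution again.
    z = sum(probs.values())
    return {t: p / z for t, p in probs.items()}

dist = {"کہ": 0.5, "ان": 0.3, "سے": 0.15, "پر": 0.05}
sharp = adjust(dist, temperature=0.5, top_k=2)  # only کہ and ان survive
next_token = random.choices(list(sharp), weights=list(sharp.values()))[0]
```

At `temperature=0.5` each probability is squared before renormalization, so the gap between the top tokens widens; `top_k=2` then discards the tail entirely.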
After generation and BPE decoding, the raw token stream is converted back into readable Urdu:
def postprocess_story(tokens, special_tokens):
    text = "".join(tokens)
    text = text.replace(" <EOS>", "۔").replace("<EOS>", "۔")
    text = text.replace(" <EOP>", "\n").replace("<EOP>", "\n")
    text = text.replace(" <EOD>", "").replace("<EOD>", "")
    return text

flowchart TD
A[Load bpe_tokenizer.json] --> B[Load Tokenized_Data.txt]
B --> C[train_trigram_lm]
C --> D[Encode Urdu prefix with BPE]
D --> E{len prefix >= 2?}
E -- No --> F[Duplicate tokens to reach 2]
E -- Yes --> G[Generate loop]
F --> G
G --> H[Take last 2 tokens as context w1 w2]
H --> I[Compute interpolated_prob for all vocab]
I --> J[Temperature scaling]
J --> K[Top-K filter]
K --> L[Repetition penalty]
L --> M[Sample next_token]
M --> N{next_token == EOD or max_length?}
N -- No --> G
N -- Yes --> O[Decode BPE tokens]
O --> P[postprocess_story]
P --> Q[Return Urdu story string]
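The counting and interpolation steps can be exercised end-to-end on a toy token sequence. This is a standalone sketch mirroring `train_trigram_lm` and `interpolated_prob` with the default weights; the toy corpus is invented for illustration.

```python
from collections import Counter, defaultdict

corpus = ["ایک", "دن", "ایک", "بچہ", "ایک", "دن", "<EOS>"]

# Count n-grams exactly as train_trigram_lm does.
unigrams = Counter(corpus)
bigrams, trigrams = defaultdict(int), defaultdict(int)
for i in range(len(corpus) - 1):
    bigrams[(corpus[i], corpus[i + 1])] += 1
for i in range(len(corpus) - 2):
    trigrams[(corpus[i], corpus[i + 1], corpus[i + 2])] += 1

def prob(w1, w2, w3, l1=0.6, l2=0.3, l3=0.1):
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0
    p_uni = unigrams[w3] / len(corpus)
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Seen context: the trigram term dominates.
p_seen = prob("ایک", "دن", "ایک")
# Unseen context: the unigram floor still gives non-zero mass.
p_unseen = prob("بچہ", "<EOS>", "ایک")
assert p_unseen > 0
```

Even for a context never observed in the corpus, the unigram term keeps the probability strictly positive, which is exactly what prevents generation from stalling.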
Objective: Expose the model pipeline as a containerized REST API.
The application uses FastAPI's lifespan context manager to load all models exactly once at startup, storing them in app.state for reuse across requests:
@asynccontextmanager
async def lifespan(app: FastAPI):
    vocab, merge_list, special_tokens = load_tokenizer_json(...)
    encoded_corpus = load_tokenized_corpus_for_trigram(...)
    unigram_counts, bigram_counts, trigram_counts, total_tokens = train_trigram_lm(encoded_corpus)
    app.state.vocab = vocab
    app.state.merge_list = merge_list
    # ... store all model components
    yield

Endpoints:
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Returns API status and vocab size |
| `POST` | `/generate` | Accepts prefix + params, returns generated story |
Request schema (POST /generate):
{
  "prefix": "ایک بار کی بات ہے",
  "max_length": 500,
  "temperature": 0.8,
  "top_k": 7
}

Response schema:
{
  "story": "ایک بار کی بات ہے کہ ایک چھوٹا سا بچہ...",
  "tokens_generated": 312
}

CORS is configured to allow all origins so the Vercel-hosted frontend can reach the backend without proxy setup.
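The permissive CORS setup likely resembles FastAPI's standard middleware configuration. A sketch, not the project's exact code (requires `fastapi` installed):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow any origin so the Vercel frontend can call the API directly.
# In production, narrowing allow_origins to the frontend domain is safer.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)
```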
FROM python:3.11-slim
WORKDIR /app
COPY api/requirements.txt /app/api/requirements.txt
RUN pip install --no-cache-dir -r api/requirements.txt
COPY api/ /app/api/
COPY BSETokenizer/ /app/BSETokenizer/
COPY Trigram_Language_Model/ /app/Trigram_Language_Model/
COPY Tokenized_Dataset/ /app/Tokenized_Dataset/
EXPOSE 8000
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]

The `python:3.11-slim` base keeps the image lean. Model artifacts (`bpe_tokenizer.json`, `Tokenized_Data.txt`) are baked into the image, eliminating cold-start downloads.
flowchart LR
A[GitHub Push] --> B[GitHub Actions CI]
B --> C[docker build]
C --> D[docker push → Registry]
D --> E[Render / Railway Pull & Deploy]
E --> F[Container Running\nPort 8000]
F --> G[Vercel Frontend]
Objective: Build a reactive, Urdu-first chat interface with typewriter-style story rendering.
app/page.jsx ← Root state manager
├── StoryDisplay.jsx ← Scrollable story area + copy/new story controls
├── SettingsPanel.jsx ← max_length, temperature, top_k sliders
├── InputPanel.jsx ← RTL Urdu textarea + submit button
└── TypingIndicator.jsx ← Animated three-dot loader
All state is managed in page.jsx using React hooks. Story generation is driven by handleSubmit:
const handleSubmit = useCallback(async () => {
  setIsLoading(true);
  setStory("");
  const result = await generateStory({
    prefix: prefix.trim(),
    maxLength: settings.maxLength,
    temperature: settings.temperature,
    topK: settings.topK,
  });
  setStory(result.story); // triggers typewriter effect
  setTokensGenerated(result.tokens_generated);
}, [prefix, settings]);

When a new story arrives, a `useEffect` watching `story` starts an interval that reveals 4 words every 80 ms, simulating ChatGPT-style streaming:
useEffect(() => {
  if (!story) return;
  setDisplayedStory(""); // reset before re-typing a new story
  const words = story.split(/(\s+)/);
  let idx = 0;
  setIsTypewriting(true);
  typewriterRef.current = setInterval(() => {
    const chunk = words.slice(idx, idx + 4).join("");
    idx += 4;
    setDisplayedStory(prev => prev + chunk);
    if (idx >= words.length) {
      clearInterval(typewriterRef.current);
      setIsTypewriting(false);
    }
  }, 80);
  return () => clearInterval(typewriterRef.current); // avoid stray intervals on re-run/unmount
}, [story]);

`lib/api.js` wraps fetch calls to the backend. The base URL is injected at build time via the `NEXT_PUBLIC_API_URL` environment variable:
export async function generateStory({ prefix, maxLength, temperature, topK }) {
  const response = await fetch(`${BASE_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prefix, max_length: maxLength, temperature, top_k: topK }),
  });
  if (!response.ok) throw new Error(`Server error: ${response.status}`);
  return response.json();
}

export async function checkHealth() {
  try {
    const response = await fetch(`${BASE_URL}/health`);
    return response.ok;
  } catch {
    return false;
  }
}

The `checkHealth()` call runs on mount and displays a live green/red indicator dot showing whether the backend is reachable.
Objective: Deploy the frontend to Vercel and the backend to a public container platform.
The deployment is split across two platforms to match each component's runtime needs:
- Frontend: Vercel (static + edge, zero-config Next.js deployment)
- Backend: Render or Railway (Docker container hosting with persistent memory for the trigram model)
BPE is a data compression algorithm adapted for NLP tokenization. Starting from a character vocabulary, it repeatedly merges the most frequent adjacent symbol pair. After $k$ merge iterations, the vocabulary contains the initial symbols plus $k$ composite tokens; training stops once the target size (here 250) is reached or no pair occurs at least twice.
Complexity: Each iteration scans the full sequence ($O(|C|)$, where $|C|$ is the corpus length) and updates pair counts, for a total of $O(k \cdot |C|)$ over $k$ merge iterations.
For a trigram $(w_1, w_2, w_3)$, the maximum likelihood estimate is

$$P_{\text{MLE}}(w_3 \mid w_1, w_2) = \frac{\operatorname{count}(w_1, w_2, w_3)}{\operatorname{count}(w_1, w_2)}$$
This is undefined for unseen bigram contexts. Linear interpolation resolves this by backing off to lower-order models, guaranteeing coverage for all vocabulary tokens.
The weights ($\lambda_1 = 0.6$, $\lambda_2 = 0.3$, $\lambda_3 = 0.1$) sum to 1 and favor the trigram estimate, while the bigram and unigram terms provide smoothing mass for unseen contexts.
After computing the probability distribution over the vocabulary, temperature $T$ rescales each probability as $p_i' \propto p_i^{1/T}$ before sampling. Lower $T$ ($<1$) sharpens the distribution toward the most probable tokens; higher $T$ ($>1$) flattens it toward uniform, increasing randomness.
- Python 3.11+
- Conda (for scraping environment)
- Node.js 18+ and npm
- Docker (for containerized backend)
- Google Chrome (for scraping)
# Create and activate the conda environment
bash Requirements/ScrapingConda.sh
conda activate urdu-scraper
# Run the scraper (pages 1–50)
python Scraping/scarper.py
# Output: urdu_stories_dataset/ directory with .txt files

pip install -r api/requirements.txt
# Train the tokenizer and save artifacts
python BSETokenizer/TokenizerCode.py
# Output: BSETokenizer/bpe_tokenizer.json
#         Tokenized_Dataset/Tokenized_Data.txt

bash Trigram_Language_Model/run.sh
# Or directly:
conda activate urdu-scraper
python -m Trigram_Language_Model.trigram_model
# Output: Generated Urdu story printed to stdout

pip install -r api/requirements.txt
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# API available at http://localhost:8000
# Docs at http://localhost:8000/docs

# Build the image
docker build -t urdu-story-api .
# Run the container
docker run -p 8000:8000 urdu-story-api
# Test the API
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prefix": "ایک بار کی بات ہے", "max_length": 300, "temperature": 0.8, "top_k": 7}'

cd frontend
# Set the backend URL
echo 'NEXT_PUBLIC_API_URL=http://localhost:8000' > .env.local
npm install
npm run dev
# Frontend available at http://localhost:3000

- Push the repository to GitHub
- Create a new Web Service on render.com
- Select Docker as the environment
- Set the root directory to the repo root (Dockerfile is at root)
- Render auto-detects `EXPOSE 8000` and routes traffic accordingly
- Copy the deployed service URL (e.g., `https://urdu-api.onrender.com`)
cd frontend
npx vercel
# Follow prompts; set NEXT_PUBLIC_API_URL to your Render backend URL

Or via the Vercel dashboard:
- Import the GitHub repository
- Set the Root Directory to `frontend`
- Add environment variable: `NEXT_PUBLIC_API_URL = https://your-backend.onrender.com`
- Deploy; Vercel auto-detects Next.js
Create .github/workflows/docker.yml in the repository root:
name: Build and Push Docker Image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/urdu-story-api:latest

Add `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN` as repository secrets in GitHub Settings.
Urdu Prefix: ایک بار کی بات ہے
max_length: 500
temperature: 0.8
top_k: 7
curl -X POST https://your-api.onrender.com/generate \
-H "Content-Type: application/json" \
-d '{
"prefix": "ایک بار کی بات ہے",
"max_length": 500,
"temperature": 0.8,
"top_k": 7
}'

{
"story": "ایک بار کی بات ہے کہ ایک چھوٹا سا بچہ جنگل میں گیا۔\nاس نے دیکھا کہ ایک پرندہ درخت پر بیٹھا ہے۔\nبچے نے پرندے سے پوچھا کہ تم کہاں رہتے ہو۔\nپرندے نے کہا میں اللہ کی مخلوق ہوں۔",
"tokens_generated": 248
}

curl https://your-api.onrender.com/health
# {"status":"ok","model":"trigram","vocab_size":250}

JavaScript-rendered content: UrduPoint renders story bodies via JavaScript, making standard `requests` + BeautifulSoup insufficient. The solution was `undetected-chromedriver` to execute the page fully before parsing, with randomized delays to avoid rate limiting.
Bot detection: Standard Selenium fingerprints were blocked. undetected-chromedriver patches Chrome's automation signatures to appear as a regular browser session.
Special token collisions: Naïve string replacement of <EOS> could corrupt Urdu text containing < or >. The tokenize_text() function performs a priority scan for special token strings before falling back to single characters, preventing any ambiguity.
BPE on right-to-left script: Urdu is written right-to-left but stored left-to-right in UTF-8. The BPE algorithm operates on the byte/character sequence as stored, which is correct — no special RTL handling was required at the algorithm level.
Trigram sparsity: With a vocabulary of 250 tokens and a real corpus, many trigram contexts are unseen. Linear interpolation with a unigram floor ensures every possible next token receives a non-zero probability, preventing generation from stalling.
Model load time: Training the trigram model over the full tokenized corpus takes several seconds. Loading it at API startup (rather than per-request) via FastAPI's lifespan handler ensures zero per-request overhead.
Vocabulary size of 250: The assignment specifies this hyperparameter. With Urdu's relatively large character inventory (~60 core characters plus diacritics), 250 tokens produces approximately 190 learned merges, capturing common morphemes (prefixes, suffixes, common word stems) while keeping the vocabulary small enough for a compact trigram model.
One token per line in Tokenized_Data.txt: Storing the encoded corpus one token per line makes it trivially loadable with readlines() and avoids any delimiter ambiguity with Urdu text or special tokens.
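Under this layout, loading the corpus is a few lines. A minimal sketch: the helper name and the temporary-file round trip are illustrative, and the real loader presumably reads `Tokenized_Dataset/Tokenized_Data.txt` directly.

```python
import os
import tempfile

def load_tokenized_corpus(path):
    # One BPE token per line; strip only the trailing newline so that
    # whitespace tokens (if any) survive intact.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Round-trip demo with a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("کہ\n<EOS>\nان\n")
    tmp = f.name
tokens = load_tokenized_corpus(tmp)  # ['کہ', '<EOS>', 'ان']
os.unlink(tmp)
```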
Baking model artifacts into the Docker image: The tokenizer JSON and tokenized corpus are copied into the image at build time. This eliminates runtime file downloads, makes the container fully self-contained, and ensures reproducibility at the cost of a larger image size.
app.state for model storage: FastAPI's app.state dictionary is the idiomatic location for objects shared across requests. The alternative — module-level globals — would complicate testing and reload scenarios.
Typewriter at 80 ms / 4 words: This pace was tuned to feel natural for Urdu readers while completing a 300-word story in under 8 seconds, avoiding both the impatience of slow display and the disorientation of instant rendering.
RTL-first UI: All text containers use the urdu-text class which sets direction: rtl and font-family: 'Noto Nastaliq Urdu'. This ensures correct ligature rendering and reading order for Urdu script without requiring per-element overrides.
- Higher-order n-grams: A 4-gram or 5-gram model would capture longer dependencies at the cost of increased memory and sparsity; add-k smoothing or Kneser-Ney could be explored.
- Larger vocabulary: Increasing vocab size beyond 250 would reduce out-of-vocabulary fragmentation for rare words, improving generation fluency.
- Streaming API: Replace the single-shot `POST /generate` with a Server-Sent Events (SSE) or WebSocket endpoint to stream tokens one-by-one, enabling true word-by-word generation rather than frontend simulation.
- Perplexity evaluation: Add a held-out test set and compute perplexity to objectively measure model quality across hyperparameter settings.
- Lambda tuning: Implement EM-based or grid-search optimization of the interpolation weights $\lambda_1, \lambda_2, \lambda_3$ on a validation set.
$\lambda_1, \lambda_2, \lambda_3$ on a validation set. - Corpus expansion: With 50 pages scraped, approximately 200+ additional pages remain on UrduPoint. Doubling or tripling the corpus would significantly improve trigram coverage.
- Story coherence: Post-processing heuristics (filtering repetitive sentences, enforcing minimum story length) could improve output quality without changing the underlying model.
This system demonstrates that meaningful generative NLP does not require large language models. By combining rigorous corpus engineering, a from-scratch BPE tokenizer, and a smoothed trigram language model, the system produces coherent Urdu children's stories from arbitrary prefixes. The architecture is fully containerized, continuously deployed, and presented through a polished Urdu-first web interface connecting classical probabilistic NLP directly to modern production software engineering practices.





