Wordstats Update - October 2025

Summary

Updated python-wordstats to use the full hermitdave frequency lists, raising coverage from 10,000 words per language to roughly 100k-250k (10x or more).

Changes

Before

Used the hermitdave 2016 50k word lists
Loaded only the top 10,000 words per language
Many medium-frequency words fell back to rank=100000 (the value reported for unknown words)

After

Uses hermitdave 2018 full word lists
Loads ~100k-250k words per language (only words with at least MIN_OCCURRENCE_COUNT=10 occurrences)
Much better coverage for medium and low-frequency words
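
The filtering and ranking described above can be illustrated with a minimal sketch. It is not the python-wordstats implementation: the function names are hypothetical, and it assumes each line of a frequency list holds a word and its occurrence count, sorted by descending count. The first three sample counts come from the Danish example in this document; the fourth entry is invented to show the cutoff, and the resulting ranks are positions in this tiny sample, not real ranks.

```python
from io import StringIO

MIN_OCCURRENCE_COUNT = 10   # threshold named in this document
UNKNOWN_RANK = 100000       # rank reported for words missing from the list


def load_ranks(lines, min_count=MIN_OCCURRENCE_COUNT):
    """Map each word to its 1-based rank, skipping words below min_count.

    Assumes one "word count" pair per line, sorted by descending count.
    """
    ranks = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts[0], int(parts[1])
        if count < min_count:
            continue  # too rare to keep
        ranks[word] = len(ranks) + 1
    return ranks


def rank_of(word, ranks):
    """Return the rank of a word, or UNKNOWN_RANK if it was not loaded."""
    return ranks.get(word, UNKNOWN_RANK)


# Tiny made-up sample; the last word falls below the threshold.
sample = StringIO("ikke 1928834\nomfatte 45\nskorter 11\nsjælden 3\n")
ranks = load_ranks(sample)
print(rank_of("skorter", ranks))  # kept: 11 occurrences
print(rank_of("sjælden", ranks))  # filtered out, so reported as unknown
```

With the old 10,000-word cutoff, a word such as "omfatte" would take the unknown branch of `rank_of`; with the full list it gets a real rank.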

Example (Danish)

Word	Occurrences	Before	After
ikke	1,928,834	rank=5	rank=5 ✓
omfatte	45	rank=100000 ❌	rank=33596 ✓
skorter	11	rank=100000 ❌	rank=88912 ✓

Configuration

Development (default)

# default_api.cfg
PRELOAD_WORDSTATS=False  # Lazy loading - fast startup

Languages load on first use (~1 second per language)
Good for development: fast restarts, no waiting

Production (recommended)

# production_api.cfg
PRELOAD_WORDSTATS=True  # Preload at startup

All languages load at startup (~10-30 seconds total)
No first-request delays
Recommended for production
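
The difference between the two modes can be sketched with a small cache that either loads a language on first use or loads everything up front. This is an illustration only: `LazyRankStore` and `fake_loader` are hypothetical names, not part of wordstats or zeeguu.

```python
class LazyRankStore:
    """Caches per-language rank tables and loads each one at most once."""

    def __init__(self, loader, preload=False, language_codes=()):
        self._loader = loader   # callable: language code -> rank table
        self._tables = {}
        if preload:
            # Production-style: pay the loading cost once, at startup.
            for code in language_codes:
                self.get(code)

    def get(self, code):
        # Development-style: the first request for a language triggers the load.
        if code not in self._tables:
            self._tables[code] = self._loader(code)
        return self._tables[code]

    def loaded_languages(self):
        return sorted(self._tables)


load_calls = []


def fake_loader(code):
    """Stand-in for reading a frequency list from disk."""
    load_calls.append(code)
    return {"example": 1}


lazy = LazyRankStore(fake_loader)
eager = LazyRankStore(fake_loader, preload=True, language_codes=["da", "en"])

lazy.get("da")   # first use: triggers a load
lazy.get("da")   # second use: served from the cache
print(lazy.loaded_languages(), eager.loaded_languages())
```

The lazy store starts empty and only ever loads what is requested, which is why restarts are fast in development; the eager store has no first-request delay.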

Implementation

The preloading logic is in zeeguu/api/app.py:

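# Preload only when PRELOAD_WORDSTATS is enabled; otherwise languages load lazily.
if app.config.get("PRELOAD_WORDSTATS", False):
    from wordstats import LanguageInfo
    from zeeguu.core.model import Language

    # Collect the code of every language known to the application
    all_languages = Language.all_languages()
    language_codes = [lang.code for lang in all_languages]
    # Load the frequency data for all of them into memory up front
    LanguageInfo.load_in_memory_for(language_codes)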

Memory Impact

Approximate memory per language (with MIN_OCCURRENCE_COUNT=10):

Danish: ~10 MB (99k words)
English: ~25 MB (241k words)
Total for all languages: ~150-200 MB

This is acceptable for modern production servers.
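
As a quick sanity check on these figures, the per-word cost can be worked out from the numbers above. The sketch assumes 1 MB = 10^6 bytes and takes the language count (15) from the sample startup log in the Migration section; both are assumptions for the arithmetic, not measurements.

```python
# (approximate MB, word count) as stated in this document
figures = {"Danish": (10, 99_000), "English": (25, 241_000)}


def bytes_per_word(megabytes, words):
    """Average memory per loaded word, assuming 1 MB = 10**6 bytes."""
    return megabytes * 1_000_000 / words


per_word = {lang: round(bytes_per_word(mb, n)) for lang, (mb, n) in figures.items()}
print(per_word)

# Average per language if the 150-200 MB total is spread over 15 languages
avg_low, avg_high = 150 / 15, 200 / 15
print(avg_low, round(avg_high, 1))
```

Both languages come out at roughly 100 bytes per word, so memory use appears to scale about linearly with the number of words loaded.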

Migration

After deploying the updated wordstats:

Recalculate phrase ranks (already done):

source ~/.venvs/z_env/bin/activate
python tools/migrations/25-10-22--recalculate_all_multiword_phrase_ranks.py

Update production config:
- Set PRELOAD_WORDSTATS=True in production config
- Keep PRELOAD_WORDSTATS=False in development

Restart the API and check the startup logs for a line like:

*** Wordstats preloaded 15 languages in 18.45s
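
A line of that shape could be produced by timing the preload call. The sketch below is an assumption about how such a message might be built, not the code in zeeguu/api/app.py; `preload_with_timing` and its `clock` parameter are hypothetical, and a fake clock is used so the example is deterministic.

```python
import time


def preload_with_timing(load_all, language_codes, clock=time.perf_counter):
    """Run load_all(language_codes) and return a summary line with the elapsed time."""
    start = clock()
    load_all(language_codes)
    elapsed = clock() - start
    return f"*** Wordstats preloaded {len(language_codes)} languages in {elapsed:.2f}s"


def no_op_loader(codes):
    """Stand-in for the real preload call; does nothing."""


# Fake clock returning fixed readings, so the output does not depend on real timing.
ticks = iter([0.0, 18.45])
codes = [f"lang{i}" for i in range(15)]  # 15 placeholder language codes
message = preload_with_timing(no_op_loader, codes, clock=lambda: next(ticks))
print(message)
```

If the line is missing from the logs after a restart, the most likely cause is that PRELOAD_WORDSTATS is still False in the active config.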

Benefits

Better word prioritization: Words like "omfatte" and "skorter" now have accurate ranks
Improved learning experience: Better difficulty estimates for spaced repetition
Configurable preloading: Fast development, optimized production
Language-independent: MIN_OCCURRENCE_COUNT=10 works well across all languages (~14-15% of corpus)
