ClawLearn is a Python CLI that turns real-world text (podcast transcripts, articles, etc.)
into Anki cloze decks (.apkg) for language learning.
It:
- ingests content from local .txt/.md/.epub files
- cleans and chunks the text into context blocks
- uses an OpenAI-compatible LLM to generate contextual cloze sentences
- uses a separate (usually cheaper) LLM for translations
- uses edge_tts to generate audio for each card
- exports a complete Anki deck via genanki
- applies taxonomy-aware candidate ranking in advanced mode
- combines model-proposed labels with programmatic re-ranking corrections
- adds expression_transfer hints to capture cross-context reuse intent
- writes taxonomy/validation/transfer metrics into run_summary.json for tuning
This README describes the current V2-oriented CLI. For an overview in Chinese, see
README_zh.md.
python -m venv .venv
# Windows (PowerShell)
. .venv/Scripts/activate
# macOS / Linux
# source .venv/bin/activate
pip install -r requirements.txt
# Copy env example and check prompts/templates
python -m clawlearn.cli init
init will:
- create .env from .env.example (if missing)
- verify that the default prompts and template exist:
  - ./prompts/cloze_contextual.json
  - ./prompts/cloze_prose_beginner.json
  - ./prompts/cloze_prose_intermediate.json
  - ./prompts/cloze_prose_advanced.json
  - ./prompts/cloze_transcript_beginner.json
  - ./prompts/cloze_transcript_intermediate.json
  - ./prompts/cloze_transcript_advanced.json
  - ./prompts/cloze_textbook_examples.json
  - ./prompts/translate_rewrite.json
  - ./prompts/template_extraction.json
  - ./prompts/template_explanation.json
  - ./templates/anki_cloze_default.json
Edit .env (or copy ENV_EXAMPLE.md content into your own env file) to point at
your LLM endpoints and TTS voices.
All configuration is done via environment variables. The repository includes
ENV_EXAMPLE.md describing defaults. Key groups:
CLAWLEARN_LLM_PROVIDER=openai_compatible
CLAWLEARN_LLM_BASE_URL=http://127.0.0.1:8000/v1
CLAWLEARN_LLM_API_KEY=YOUR_API_KEY
CLAWLEARN_LLM_MODEL=qwen3-30b
CLAWLEARN_LLM_TIMEOUT_SECONDS=120
CLAWLEARN_LLM_MAX_RETRIES=3
CLAWLEARN_LLM_RETRY_BACKOFF_SECONDS=2.0
# Base sleep between successful LLM calls (seconds);
# actual sleep is random in [N, 3N], 0 means no sleep.
CLAWLEARN_LLM_REQUEST_SLEEP_SECONDS=0
CLAWLEARN_LLM_TEMPERATURE=0.2
The cloze LLM is responsible for generating contextual cloze sentences.
It expects an OpenAI-compatible /chat/completions endpoint.
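The pause controlled by CLAWLEARN_LLM_REQUEST_SLEEP_SECONDS can be pictured with the minimal sketch below. The function name request_sleep_seconds and the choice of a uniform distribution are assumptions made for illustration; the source only states that the actual sleep is random in [N, 3N] and that 0 disables it.

```python
import random
from typing import Optional


def request_sleep_seconds(base: float, rng: Optional[random.Random] = None) -> float:
    """Return a pause in [base, 3 * base] seconds; a base of 0 means no pause.

    Hypothetical helper: the uniform draw is an assumption, not ClawLearn's code.
    """
    if base <= 0:
        return 0.0
    rng = rng or random.Random()
    return rng.uniform(base, 3 * base)
```

A caller would pass the result to time.sleep() after each successful LLM call.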
CLAWLEARN_INGEST_SHORT_LINE_MAX_WORDS=3
This controls pre-LLM line filtering:
- lines with very few words (for example one-word interjections) are dropped;
- set CLAWLEARN_INGEST_SHORT_LINE_MAX_WORDS=0 to disable this filter.
- .md input is converted to plain text before filtering.
- .epub input is unpacked and chapter HTML is converted to plain text before filtering.
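As a rough illustration of the short-line filter, the sketch below drops lines whose word count is at or below the threshold. The helper name drop_short_lines, whitespace-based word counting, the "at or below" comparison, and keeping blank lines as paragraph separators are all assumptions; the real filter may differ.

```python
from typing import Iterable, List


def drop_short_lines(lines: Iterable[str], max_words: int) -> List[str]:
    """Drop lines with at most max_words words; max_words <= 0 disables the filter."""
    if max_words <= 0:
        return list(lines)
    kept = []
    for line in lines:
        # Assumption: blank lines are kept so paragraph boundaries survive.
        if not line.strip() or len(line.split()) > max_words:
            kept.append(line)
    return kept
```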
Chunking controls how input text is split into context blocks:
# Character-based chunking, with soft sentence boundaries.
CLAWLEARN_CHUNK_MAX_CHARS=1800
CLAWLEARN_CHUNK_MIN_CHARS=120
CLAWLEARN_CHUNK_OVERLAP_SENTENCES=1
Behaviour:
- text is split into paragraphs, then short paragraphs are merged until CHUNK_MIN_CHARS is reached
- each merged paragraph is split into chunks based on CHUNK_MAX_CHARS:
  - extend by whole sentences until just below the char limit
  - never split inside a sentence
  - adjacent chunks may share CHUNK_OVERLAP_SENTENCES of overlap
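The chunking behaviour described above could look roughly like the following sketch. It is not ClawLearn's implementation: the function names, the naive punctuation-based sentence splitter, and the rule that drops the overlap when it would push a chunk over the limit are all assumptions.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_short_paragraphs(paragraphs: List[str], min_chars: int) -> List[str]:
    """Merge consecutive paragraphs until each merged block has min_chars."""
    merged, buffer = [], ""
    for paragraph in paragraphs:
        buffer = f"{buffer} {paragraph}".strip()
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)
    return merged


def chunk_paragraph(paragraph: str, max_chars: int, overlap: int) -> List[str]:
    """Grow chunks by whole sentences; never split inside a sentence."""
    chunks, current = [], []
    for sentence in split_sentences(paragraph):
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
            # Drop the overlap if it would already exceed the limit.
            if len(" ".join(current + [sentence])) > max_chars:
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than max_chars is kept whole in this sketch, which matches the "never split inside a sentence" rule at the cost of an oversized chunk.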
These control per-card behaviour:
# Max sentences per cloze text (validator + prompt docs)
CLAWLEARN_CLOZE_MAX_SENTENCES=3
# Min characters per cloze text. Too-short candidates are discarded (0 = no limit).
CLAWLEARN_CLOZE_MIN_CHARS=200
# Difficulty: beginner | intermediate | advanced
# Can be overridden by CLI --difficulty
CLAWLEARN_CLOZE_DIFFICULTY=intermediate
# Max number of cards per chunk (after dedupe); empty/0 = no per-chunk cap
CLAWLEARN_CLOZE_MAX_PER_CHUNK=4
# LLM chunk batch size: how many chunks to process per LLM call; 1 = per-chunk
CLAWLEARN_LLM_CHUNK_BATCH_SIZE=1
# Retry format-only validation failures (recover candidates rejected for format issues).
CLAWLEARN_VALIDATE_FORMAT_RETRY_ENABLE=true
# Retry attempts after initial validation failure (0-3).
CLAWLEARN_VALIDATE_FORMAT_RETRY_MAX=3
# Allow attempts >=2 to call LLM repair/regenerate.
CLAWLEARN_VALIDATE_FORMAT_RETRY_LLM_ENABLE=true
- The LLM decides how many candidates to return per chunk (0-N).
- CLOZE_MAX_PER_CHUNK is a safety cap applied after validation and dedupe; set to 0 or empty to disable.
- Difficulty is now a first-class strategy selector (not only a prompt hint): it affects prompt family variant, validation, and ranking.
- CLAWLEARN_LLM_CHUNK_BATCH_SIZE is always user-respected (no profile/difficulty hard override).
For material_profile=textbook_examples, if CLOZE_MIN_CHARS is above 120
and you do not override with --cloze-min-chars, the run is rejected.
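The textbook_examples guard can be read as the small decision sketched below. The function name effective_cloze_min_chars and the use of ValueError are hypothetical; only the threshold of 120 and the --cloze-min-chars escape hatch come from the text above.

```python
from typing import Optional


def effective_cloze_min_chars(
    material_profile: str,
    env_min_chars: int,
    cli_min_chars: Optional[int] = None,
) -> int:
    """Return the min-chars value to use, or raise if the run should be rejected."""
    if cli_min_chars is not None:
        # An explicit --cloze-min-chars always wins.
        return cli_min_chars
    if material_profile == "textbook_examples" and env_min_chars > 120:
        raise ValueError(
            "CLOZE_MIN_CHARS is above 120 for textbook_examples; "
            "pass --cloze-min-chars explicitly"
        )
    return env_min_chars
```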
# Optional small LLM for translation; falls back to main LLM if empty.
CLAWLEARN_TRANSLATE_LLM_BASE_URL=
CLAWLEARN_TRANSLATE_LLM_API_KEY=
CLAWLEARN_TRANSLATE_LLM_MODEL=
CLAWLEARN_TRANSLATE_LLM_TEMPERATURE=
# Number of originals translated in one request (recommended: 4-8).
CLAWLEARN_TRANSLATE_BATCH_SIZE=4
- If TRANSLATE_LLM_BASE_URL/MODEL are set, translations are generated by this small LLM (cheaper backend).
- Otherwise, translations use the main cloze LLM.
- CLAWLEARN_TRANSLATE_BATCH_SIZE controls batch size per request (recommended range: 4-8; start from 4).
- Translation batches use request-level retries (max 3 attempts). For partial responses, successful items are consumed first and only remaining items are retried.
ClawLearn can optionally run a secondary extraction pass using a different LLM configuration. This is useful when your primary model is conservative (high precision) and a secondary model can provide recall, or when you want to compare/merge candidates from two models.
- Enable via CLAWLEARN_SECONDARY_EXTRACT_ENABLE=true.
- Secondary pass never aborts the whole run; errors are recorded and the build falls back to primary.
- Candidates from both passes are deduped and merged.
- run_summary.json reports the unique gain.
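One way to picture the dedupe-and-merge step and the "unique gain" figure is the sketch below. The function name, the "text" key, and deduplication by case-insensitive, whitespace-normalized text are assumptions; the actual merge key used by ClawLearn is not documented here.

```python
from typing import Dict, List, Tuple


def merge_candidates(
    primary: List[Dict[str, str]],
    secondary: List[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], int]:
    """Merge two candidate lists and count secondary-only additions."""

    def key(candidate: Dict[str, str]) -> str:
        # Assumed dedupe key: lowercased text with collapsed whitespace.
        return " ".join(candidate["text"].lower().split())

    seen, merged = set(), []
    for candidate in primary:
        if key(candidate) not in seen:
            seen.add(key(candidate))
            merged.append(candidate)
    unique_gain = 0
    for candidate in secondary:
        if key(candidate) not in seen:
            seen.add(key(candidate))
            merged.append(candidate)
            unique_gain += 1
    return merged, unique_gain
```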
Relevant env vars:
CLAWLEARN_SECONDARY_EXTRACT_ENABLE=false
CLAWLEARN_SECONDARY_EXTRACT_PARALLEL=false
CLAWLEARN_SECONDARY_EXTRACT_LLM_BASE_URL=
CLAWLEARN_SECONDARY_EXTRACT_LLM_API_KEY=
CLAWLEARN_SECONDARY_EXTRACT_LLM_MODEL=
CLAWLEARN_SECONDARY_EXTRACT_LLM_TIMEOUT_SECONDS=
CLAWLEARN_SECONDARY_EXTRACT_LLM_TEMPERATURE=
CLAWLEARN_SECONDARY_EXTRACT_LLM_MAX_RETRIES=
CLAWLEARN_SECONDARY_EXTRACT_LLM_RETRY_BACKOFF_SECONDS=
CLAWLEARN_SECONDARY_EXTRACT_LLM_CHUNK_BATCH_SIZE=
CLAWLEARN_CONTENT_PROFILE=prose_article
CLAWLEARN_MATERIAL_PROFILE=prose_article
CLAWLEARN_LEARNING_MODE=lingua_expression
CLAWLEARN_PROMPT_CLOZE=./prompts/cloze_contextual.json
CLAWLEARN_PROMPT_CLOZE_TEXTBOOK=./prompts/cloze_textbook_examples.json
CLAWLEARN_PROMPT_CLOZE_PROSE_BEGINNER=./prompts/cloze_prose_beginner.json
CLAWLEARN_PROMPT_CLOZE_PROSE_INTERMEDIATE=./prompts/cloze_prose_intermediate.json
CLAWLEARN_PROMPT_CLOZE_PROSE_ADVANCED=./prompts/cloze_prose_advanced.json
CLAWLEARN_PROMPT_CLOZE_TRANSCRIPT_BEGINNER=./prompts/cloze_transcript_beginner.json
CLAWLEARN_PROMPT_CLOZE_TRANSCRIPT_INTERMEDIATE=./prompts/cloze_transcript_intermediate.json
CLAWLEARN_PROMPT_CLOZE_TRANSCRIPT_ADVANCED=./prompts/cloze_transcript_advanced.json
CLAWLEARN_PROMPT_TRANSLATE=./prompts/translate_rewrite.json
# Preferred default prompt files by role (used when no CLI override)
CLAWLEARN_EXTRACT_PROMPT=
CLAWLEARN_EXPLAIN_PROMPT=
CLAWLEARN_PROMPT_LANG=zh
CLAWLEARN_ANKI_TEMPLATE=./templates/anki_cloze_default.json
# Intermediate run data (JSONL, media snapshots)
CLAWLEARN_OUTPUT_DIR=./runs
# Final exported decks (when --output is not provided)
CLAWLEARN_EXPORT_DIR=./outputs
CLAWLEARN_LOG_DIR=./logs
CLAWLEARN_LOG_LEVEL=INFO
CLAWLEARN_SAVE_INTERMEDIATE=true
CLAWLEARN_ALLOW_EMPTY_DECK=true
CLAWLEARN_DEFAULT_DECK_NAME=ClawLearn Default Deck
- CLAWLEARN_MATERIAL_PROFILE chooses material strategy: prose_article, transcript_dialogue, textbook_examples.
- CLAWLEARN_LEARNING_MODE defaults to lingua_expression (see supported modes in src/clawlearn/constants.py).
- Prompt selection is now profile + difficulty driven:
  - prose: cloze_prose_{beginner|intermediate|advanced}.json
  - transcript: cloze_transcript_{beginner|intermediate|advanced}.json
  - textbook_examples: cloze_textbook_examples.json
- CLAWLEARN_CONTENT_PROFILE is kept as a backward-compatible alias.
- CLAWLEARN_EXTRACT_PROMPT / CLAWLEARN_EXPLAIN_PROMPT can pin default prompt files directly (higher priority than profile-chain defaults, lower priority than CLI --extract-prompt / --explain-prompt).
- CLAWLEARN_PROMPT_LANG controls which language variant is used for multi-lingual prompts (en or zh), and can be overridden by --prompt-lang.
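The priority chain for the extraction prompt (CLI flag, then pinned env value, then profile + difficulty default) can be sketched as follows. The function names and the PROFILE_FAMILY mapping are illustrative assumptions, and the sketch hard-codes the default file names instead of reading the CLAWLEARN_PROMPT_CLOZE_* variables as the real application presumably does.

```python
from typing import Optional

# Assumed mapping from material profile to prompt family name.
PROFILE_FAMILY = {"prose_article": "prose", "transcript_dialogue": "transcript"}


def default_cloze_prompt(material_profile: str, difficulty: str) -> str:
    """Profile + difficulty default; textbook_examples has a single prompt."""
    if material_profile == "textbook_examples":
        return "./prompts/cloze_textbook_examples.json"
    family = PROFILE_FAMILY[material_profile]
    return f"./prompts/cloze_{family}_{difficulty}.json"


def resolve_extract_prompt(
    material_profile: str,
    difficulty: str,
    env_pin: str = "",
    cli_override: Optional[str] = None,
) -> str:
    """CLI --extract-prompt > CLAWLEARN_EXTRACT_PROMPT > profile default."""
    return cli_override or env_pin or default_cloze_prompt(material_profile, difficulty)
```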
CLAWLEARN_TTS_PROVIDER=edge_tts
CLAWLEARN_TTS_OUTPUT_FORMAT=mp3
CLAWLEARN_TTS_RATE=+0%
CLAWLEARN_TTS_VOLUME=+0%
CLAWLEARN_TTS_RANDOM_SEED=
# Voice lists per language; at least 3 voices per source language is recommended.
CLAWLEARN_TTS_EDGE_EN_VOICES=en-US-AnaNeural,en-US-AndrewNeural,en-GB-SoniaNeural
CLAWLEARN_TTS_EDGE_ZH_VOICES=zh-CN-XiaoxiaoNeural,zh-CN-YunxiNeural,zh-CN-liaoning-XiaobeiNeural
CLAWLEARN_TTS_EDGE_JA_VOICES=ja-JP-NanamiNeural,ja-JP-KeitaNeural,ja-JP-AoiNeural
ClawLearn uses edge_tts to synthesize audio for the Original sentence of
each card, selecting a voice from the configured list based on source_lang.
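A minimal sketch of voice selection follows, assuming a random pick per card from the list for the source language, made reproducible by a seeded generator (in the spirit of CLAWLEARN_TTS_RANDOM_SEED). The helper names and the random-choice strategy are assumptions; the source only says a voice is selected from the configured list based on source_lang.

```python
import random
from typing import Dict, List


def parse_voice_list(raw: str) -> List[str]:
    """Split a comma-separated env value into clean voice names."""
    return [voice.strip() for voice in raw.split(",") if voice.strip()]


def pick_voice(voices_by_lang: Dict[str, List[str]], source_lang: str, rng: random.Random) -> str:
    """Pick one configured voice for source_lang using the given generator."""
    voices = voices_by_lang.get(source_lang)
    if not voices:
        raise ValueError(f"no TTS voices configured for {source_lang!r}")
    return rng.choice(voices)
```

Sharing one random.Random(seed) instance across all cards of a run would give varied but repeatable voices.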
Cloze generation now uses prompt family + difficulty variant:
- prose_article: cloze_prose_beginner.json, cloze_prose_intermediate.json, cloze_prose_advanced.json
- transcript_dialogue: cloze_transcript_beginner.json, cloze_transcript_intermediate.json, cloze_transcript_advanced.json
- textbook_examples: cloze_textbook_examples.json
Legacy cloze_contextual.json is still supported for backward compatibility.
- Prompt fields support both legacy and multi-lingual formats:
  - Legacy string format: "system_prompt": "...".
  - Multi-lingual map: "system_prompt": { "en": "...", "zh": "..." }.
- At runtime, the language variant is selected based on CLAWLEARN_PROMPT_LANG (or the --prompt-lang CLI override).
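Resolving a prompt field that may be either a legacy string or a language map could be sketched like this. The function name resolve_prompt_field is hypothetical, and raising KeyError for a missing language is an assumption; the real loader may fall back to another language instead.

```python
from typing import Dict, Union


def resolve_prompt_field(value: Union[str, Dict[str, str]], prompt_lang: str) -> str:
    """Return the prompt text for prompt_lang from a string or {lang: text} map."""
    if isinstance(value, str):
        # Legacy format: one string used for every language.
        return value
    if isinstance(value, dict):
        if prompt_lang not in value:
            raise KeyError(f"prompt has no variant for language {prompt_lang!r}")
        return value[prompt_lang]
    raise TypeError("prompt field must be a string or a language map")
```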
The cloze prompt uses source_lang, target_lang, learning_mode, difficulty, cloze_max_sentences, and
a merged chunk_text (possibly containing multiple chunk blocks) as
placeholders.
The expected output depends on the selected extraction prompt schema:
- Phrase extraction pipeline (current default prompts, schema phrase_candidates_*):
{
  "chunk_id": "chunk_0001",
  "context_sentences": [
    "Sentence 1 copied verbatim from chunk_text.",
    "Sentence 2 copied verbatim from chunk_text."
  ],
  "phrases": [
    { "text": "short phrase copied verbatim" }
  ]
}
In this pipeline, the LLM does not output any cloze markers. Cloze markup is generated later by code by injecting selected phrase spans into the original sentence text.
- Legacy cloze-cards pipeline (schema cloze_cards_*, kept for backward compatibility):
{
  "chunk_id": "chunk_0001_abcd12",
  "text": "The more {{c1::<b>whimsical explanation</b>}}(target-lang hint) is that maybe RL training makes the models a little too {{c2::<b>single-minded</b>}}(target-lang hint) and narrowly focused.",
  "original": "The more whimsical explanation is that maybe RL training makes the models a little too single-minded and narrowly focused.",
  "target_phrases": ["whimsical explanation", "single-minded"],
  "note_hint": "optional short hint"
}
Key rules (enforced via prompt + validator):
- text must contain at least one cloze:
  - cloze syntax: {{cN::...}} where N = 1, 2, 3...
  - cloze inside uses <b>...</b> to emphasize the phrase.
  - cloze is immediately followed by a short explanation in parentheses: {{c1::<b>whimsical explanation</b>}}(target-lang hint).
- original must not contain any cloze markers or HTML.
- Each chunk may produce 0-4 high-quality cloze candidates.
- Each candidate may contain multiple clozes (e.g. c1 and c2 in the same sentence).
The validator further:
- normalizes single-brace clozes ({c1::...} -> {{c1::...}});
- auto-injects a {{c1::...}} cloze from target_phrases when text has no clozes but phrases are present (fallback only);
- rejects candidates that are too short (len(text) < CLOZE_MIN_CHARS) or exceed CLOZE_MAX_SENTENCES.
NOTE: future work may include renumbering multiple c1 occurrences to c1, c2, c3 in order of appearance.
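The first two validator steps can be approximated with regular expressions, as in the sketch below. The function names are hypothetical, and wrapping the injected phrase in <b> and using only the first matching phrase are assumptions; the actual validator may behave differently.

```python
import re
from typing import List

# A single-brace cloze: "{cN::...}" not already wrapped in double braces.
SINGLE_BRACE_CLOZE = re.compile(r"(?<!\{)\{(c\d+::[^{}]*)\}(?!\})")
ANY_CLOZE = re.compile(r"\{\{c\d+::")


def normalize_single_brace(text: str) -> str:
    """Rewrite {c1::...} as {{c1::...}} and leave valid clozes untouched."""
    return SINGLE_BRACE_CLOZE.sub(r"{{\1}}", text)


def inject_fallback_cloze(text: str, target_phrases: List[str]) -> str:
    """If text has no cloze, wrap the first target phrase found in a c1 cloze."""
    if ANY_CLOZE.search(text):
        return text
    for phrase in target_phrases:
        if phrase and phrase in text:
            return text.replace(phrase, "{{c1::<b>" + phrase + "</b>}}", 1)
    return text
```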
Use the cloze_textbook_examples.json prompt with --material-profile textbook_examples (or its deprecated alias --content-profile) for textbook-style entries
that mix headwords, definitions, and example sentences. The prompt is tuned to:
- ignore standalone headword/title lines;
- ignore dictionary-style definition lines;
- extract cloze candidates only from natural example sentences.
The translation prompt follows the same multi-lingual structure support as the
cloze prompts: system_prompt and user_prompt_template may be plain strings
or { "en": "...", "zh": "..." } maps, selected via
CLAWLEARN_PROMPT_LANG / --prompt-lang.
The translation prompt runs in batch mode. Input is an array of originals, and the LLM should return a JSON array in the same order:
[
{ "translation": "..." },
{ "translation": "..." }
]
Validator ensures:
- non-empty translation;
- no translation-prefix artifacts like "Translation:";
- no Markdown ** (HTML <b> is allowed).
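The three validator checks listed above could be expressed roughly as follows. The function name translation_errors, the error strings, and the case-insensitive prefix match are illustrative assumptions, not the project's validator.

```python
import re
from typing import Any, List

# Assumed shape of a prefix artifact: "Translation:" at the start of the text.
PREFIX_ARTIFACT = re.compile(r"^\s*translation\s*:", re.IGNORECASE)


def translation_errors(item: Any) -> List[str]:
    """Return a list of problems found in one {"translation": ...} item."""
    text = item.get("translation") if isinstance(item, dict) else None
    if not isinstance(text, str) or not text.strip():
        return ["empty translation"]
    errors = []
    if PREFIX_ARTIFACT.match(text):
        errors.append("prefix artifact")
    if "**" in text:
        # Markdown bold is rejected; HTML <b> is allowed and not checked here.
        errors.append("markdown bold")
    return errors
```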
Error/retry semantics:
- Network/request-level failures (timeout, HTTP, full JSON parse failure) are retried for the current remaining batch, up to 3 attempts in total.
- When a batch response is incomplete, successful entries are accepted and only the remaining entries are retried (up to 3 attempts total).
- Content/validation failures are not retried; they follow --continue-on-error semantics.
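The retry semantics above can be sketched as a loop over the items that are still missing. Everything named here is hypothetical: RequestError stands in for timeouts, HTTP errors, and unparseable JSON, call_batch stands in for the LLM request, and the sketch assumes a partial response is an in-order prefix of the requested batch.

```python
from typing import Callable, Dict, List, Tuple


class RequestError(Exception):
    """Hypothetical stand-in for a request-level failure."""


def translate_batch_with_retries(
    originals: List[str],
    call_batch: Callable[[List[str]], List[dict]],
    max_attempts: int = 3,
) -> Tuple[Dict[int, dict], List[int]]:
    """Return (results by original index, indexes still missing)."""
    results: Dict[int, dict] = {}
    remaining = list(range(len(originals)))
    for _attempt in range(max_attempts):
        if not remaining:
            break
        try:
            response = call_batch([originals[i] for i in remaining])
        except RequestError:
            continue  # whole request failed: retry the same remaining items
        # Accept what came back; only unanswered items stay in the queue.
        for index, item in zip(remaining, response):
            results[index] = item
        remaining = remaining[len(response):]
    return results, remaining
```

Items that were answered but fail content validation are not re-queued in this sketch, in line with the rule that validation failures are not retried.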
The entrypoint module is src/clawlearn/cli.py, exposing a Typer-based CLI.
You can run it either as a module or via an installed entrypoint (if configured
in your environment).
python -m clawlearn.cli init
- Creates .env from .env.example if needed.
- Verifies required prompt/template files.
- Optionally prepares an output directory.
python -m clawlearn.cli doctor --env-file .env
Performs a series of checks:
- Python dependencies (edge_tts, genanki, httpx, typer)
- base config (paths, prompt/template files)
- runtime config (LLM, TTS voices)
- cloze/translate prompt schema
- primary LLM connectivity (CLAWLEARN_LLM_*)
- translation LLM config & connectivity (CLAWLEARN_TRANSLATE_LLM_*)
- output directory writability
- cloze control summary (max_sentences / min_chars / difficulty / max_per_chunk / material_profile / learning_mode)
- TTS voices for default_source_lang
Core command:
python -m clawlearn.cli lingua build deck INPUT \
--source-lang en \
--target-lang zh \
--material-profile prose_article|transcript_dialogue|textbook_examples \
--learning-mode lingua_expression|lingua_reading \
--lingua-annotate \
--lingua-annotate-batch-size 50 \
--lingua-annotate-max-items 200 \
--extract-prompt ./prompts/cloze_transcript_advanced.json \
--explain-prompt ./prompts/translate_rewrite.json \
--verbose \
--input-char-limit 4000 \
--env-file .env \
--output deck.apkg \
--deck-name "My Cloze Deck" \
--max-chars 1500 \
--cloze-min-chars 60 \
--max-notes 200 \
--temperature 0.2 \
--difficulty beginner|intermediate|advanced \
--prompt-lang en|zh \
--save-intermediate \
--continue-on-error \
--debug
Where:
- INPUT: path to .txt/.md/.epub file.
- --source-lang/--target-lang override defaults from env.
- --material-profile selects material strategy and cloze prompt family.
- --learning-mode selects the pipeline behavior. For lingua pipelines: lingua_expression|lingua_reading. For textbook pipelines: textbook_focus|textbook_review.
- --content-profile is kept as a deprecated alias of --material-profile.
- --input-char-limit lets you process only the first N characters for quick tests.
- --difficulty overrides CLAWLEARN_CLOZE_DIFFICULTY.
- --prompt-lang overrides CLAWLEARN_PROMPT_LANG for multi-lingual prompts.
- --extract-prompt overrides extraction prompt file for this run.
- --explain-prompt overrides explanation prompt file for this run.
- --max-chars overrides CLAWLEARN_CHUNK_MAX_CHARS for this run.
- --cloze-min-chars overrides CLAWLEARN_CLOZE_MIN_CHARS for this run.
- In textbook_examples profile, runs are rejected when env CLOZE_MIN_CHARS > 120 unless you explicitly provide --cloze-min-chars.
- --max-notes imposes a global cap on number of notes.
- --save-intermediate dumps intermediates under CLAWLEARN_OUTPUT_DIR/<run_id>.
- When --output is not provided, the final deck is written to CLAWLEARN_EXPORT_DIR/<run_id>/output.apkg.
- --continue-on-error logs and skips individual failures instead of aborting.
- --debug makes _run_guard re-raise exceptions with tracebacks.
- By default, deck name uses the input file name (without extension); --deck-name overrides it.
python -m clawlearn.cli prompt validate ./prompts/cloze_prose_intermediate.json
python -m clawlearn.cli prompt validate ./prompts/cloze_textbook_examples.json
Validates a prompt file against the expected JSON schema (see
src/clawlearn/llm/prompt_loader.py).
python -m clawlearn.cli config show --env-file .env
python -m clawlearn.cli config validate --env-file .env
- config show prints resolved AppConfig as JSON.
- config validate runs config validation without building a deck.
Using the default Anki template (templates/anki_cloze_default.json), each card
has at least the following fields:
- Text: the cloze sentence with {{cN::...}} markers, HTML <b> for emphasis, and optional translations in parentheses.
- Original: the original sentence(s) without cloze markers or HTML.
- Translation: target-language translation of Original.
- Note: metadata (source title, chunk id, target phrases).
- Audio: edge_tts-generated audio for Original (via [sound:xxx.mp3]).
The exact Anki field mapping is defined in the JSON template; you can customize it if you want different field names or card faces.
If you have older scripts/configs, here are the key changes:
- Subcommands:
  - New preferred command: python -m clawlearn.cli lingua build deck ...
  - python -m clawlearn.cli build deck ... is kept as a deprecated alias.
- Profiles:
  - Prefer --material-profile / CLAWLEARN_MATERIAL_PROFILE.
  - --content-profile / CLAWLEARN_CONTENT_PROFILE is a deprecated alias.
  - Legacy content_profile=general maps to material_profile=prose_article.
- Learning mode:
  - CLAWLEARN_LEARNING_MODE defaults to lingua_expression.
  - Supported modes are listed in src/clawlearn/constants.py.
- Extraction schema:
  - Current default prompts output phrase_candidates_* JSON (no cloze markup).
  - Cloze markup is generated by code by injecting phrase spans.
  - Legacy prompts may output cloze_cards_* JSON (cloze markup produced by the LLM).
- Set up .env
  - Configure LLM endpoints (primary + optional translation LLM).
  - Configure chunking & cloze controls.
  - Configure TTS voices for your source language.
- Run doctor
  python -m clawlearn.cli doctor --env-file .env
- Build a deck from a podcast transcript
  python -m clawlearn.cli lingua build deck ./podcast_transcript.md \
    --source-lang en --target-lang zh --env-file .env \
    --material-profile transcript_dialogue --learning-mode lingua_expression \
    --difficulty intermediate --max-chars 1500 \
    --save-intermediate --continue-on-error --verbose
- Import the generated .apkg into Anki and review cards.
- Inspect intermediates (optional) under ./runs/<run_id>:
  - chunks.jsonl: chunked text
  - text_candidates.raw.jsonl: raw cloze candidates from LLM
  - text_candidates.validated.jsonl: candidates that passed validation
  - translations.jsonl: translations per card
  - cards.final.jsonl: final card data before export
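To get a quick view of how many records survive each stage, the intermediates can be counted with a few lines of Python. This is a convenience sketch, not part of ClawLearn: the helper names are hypothetical, and it only assumes the file names listed above and one JSON object per line.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional

INTERMEDIATE_FILES = [
    "chunks.jsonl",
    "text_candidates.raw.jsonl",
    "text_candidates.validated.jsonl",
    "translations.jsonl",
    "cards.final.jsonl",
]


def read_jsonl(path: Path) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def count_intermediate_records(run_dir: str) -> Dict[str, Optional[int]]:
    """Map each intermediate file name to its record count (None if absent)."""
    counts: Dict[str, Optional[int]] = {}
    for name in INTERMEDIATE_FILES:
        path = Path(run_dir) / name
        counts[name] = len(read_jsonl(path)) if path.exists() else None
    return counts
```

Comparing the raw and validated candidate counts gives a rough sense of how many candidates the validator rejects in a given run.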
For users who prefer a browser-based interface, ClawLearn ships an optional local-only web UI built with Gradio. This does not change the core CLI behaviour and is only started when explicitly invoked.
Install the web extra (in addition to the core dependencies):
pip install .[web]
From the project root:
clawlearn-web
# or
python -m clawlearn_web.app
This starts a Gradio app bound to 127.0.0.1:7860. Open
http://127.0.0.1:7860 in your browser.
The web UI has three tabs:
- Run - upload a .txt/.md/.epub file, select source/target language, content profile, difficulty, and per-run overrides (max notes, input char limit, cloze min chars, chunk max chars, temperature). The backend calls the same run_build_deck pipeline and writes intermediate data to CLAWLEARN_OUTPUT_DIR/<run_id> and the final deck to CLAWLEARN_EXPORT_DIR/<run_id>/output.apkg.
- Config - a .env editor for common CLAWLEARN_* settings (LLM endpoints, chunk/cloze defaults, prompt language, output/log directories, default deck name, TTS, etc.). Saving changes writes a new .env, validates it via clawlearn.config.validate_base_config + validate_runtime_config, and rolls back on failure. The "Load defaults" button loads values from ENV_EXAMPLE.md into the form without writing to disk. The Config tab also provides "List models" / "Test connectivity" helpers for both the Extraction LLM and the Explanation LLM using their /models endpoints, and includes CLAWLEARN_EXTRACT_PROMPT / CLAWLEARN_EXPLAIN_PROMPT dropdowns.
- Prompt - manage prompt files in ./prompts with New / Save / Rename / Delete. New prompt creation loads role templates (template_extraction.json or template_explanation.json) based on selected Prompt type (Extraction/Explanation). Save/Delete require explicit confirmation in the UI. Delete is guarded: the app refuses to remove the last Extraction prompt or the last Explanation prompt.
The web UI is optional; OpenClaw skills and automated usage should continue calling the CLI directly.
- Cloze numbering: when multiple clozes appear in a single text, we may need to renumber them deterministically (c1, c2, c3) in order of appearance.
- Cloze formatting: current prompt encourages the {{cN::<b>phrase</b>}}(translation) style, but behaviour still depends on the chosen LLM and may require further prompt tuning.
- Tests: the original tests/ directory has been removed from main; if you extend the project, consider reintroducing a focused test suite.
For a Chinese overview and usage guide, see README_zh.md.