Problem
Monolithic Script Structure
migrate_subjects.py is currently structured as a monolithic script: it loads multiple mapping files, applies several classification paths, and emits a single combined tag result in one pass. That design makes it difficult to work on tag types, or a small subset of common tags, independently. The current situation illustrates the problem: contributors may each own a different tag-type module, but all of that work still converges on the same script entry point, so parallel development and incremental rollout are harder than they should be.
We need to split migrate_subjects.py into smaller modules that can be enabled independently. Each module should own one tag type, or one bounded subset within a tag type, along with its mapping files and processing rules. That would let us develop, review, and ship validated migration logic gradually instead of treating the migration system as an all-or-nothing unit.
In the future, we could rely on a single openlibrary-bot and extend it by introducing new tags through configuration. Of course, we could still run multiple openlibrary-bot instances in parallel if needed. In this way, the core migration logic would be encapsulated in a shared component, and each bot would stay lightweight, defining only its own tag mappings and processing rules in configuration.
Coupling Between Classification and Execution
The core classification logic is tightly coupled to its loader and exporter in migrate_subjects.py, but we will sometimes need to change the input or export format. For example, we may need to run dry runs on local dump data for validation, and in later stages we may also need to read from the official API or write to different downstream artifacts.
To support that flexibility, execution concerns such as loading, orchestration, and exporting should be separated from the core classification module. That separation would let us switch input sources and output modes by changing the runner or executor without rewriting the tag-classification logic itself.
Context
The migration pipeline is split into four layers:
- rule-pack core
- runner
- analyzer
- apply
Each layer has a single responsibility. This prevents logic drift and keeps review boundaries clear.
1. Rule-Pack Core
The classification core operates on one work at a time.
- Input: a normalized OL work object plus one or more enabled rule packs
- Output: a structured proposal object for that work
The core does not handle:
- batch orchestration
- CLI input parsing
- result file layout
- dry-run reporting
- Open Library writes
The core must remain deterministic for a given work object, enabled rule packs, and mapping versions.
2. Runner
The runner is responsible for execution and result-file production.
- Reads input from API, local JSON, batch files, or filtered dumps
- Normalizes each record into a standard work object
- Calls the classification core
- Writes proposal files
- Records run metadata such as run id, timestamp, input source, enabled rule packs, and mapping version
The runner is the upstream producer for both analysis and apply workflows.
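A minimal sketch of that flow, assuming a hypothetical normalize_work helper (see Standard Work Input below) and an assumed classify entry point on the core; none of these names are final:

import json
from datetime import datetime, timezone
from pathlib import Path

def run(records, core, enabled_packs, out_dir="proposals"):
    # One run record per invocation; the run id style follows the proposal
    # schema example (date plus enabled pack names).
    run_meta = {
        "run_id": datetime.now(timezone.utc).strftime("%Y-%m-%d")
                  + "-" + "-".join(p.name for p in enabled_packs),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source_mode": "local_json",   # or "api", "batch", "dump"
        "enabled_rule_packs": [
            {"name": p.name, "version": p.version} for p in enabled_packs
        ],
    }
    out = Path(out_dir) / run_meta["run_id"]
    out.mkdir(parents=True, exist_ok=True)

    for raw in records:
        work = normalize_work(raw)        # hypothetical normalizer; see Standard Work Input
        proposal = core.classify(work)    # assumed core entry point
        proposal["run"] = run_meta
        work_id = work["key"].rsplit("/", 1)[-1]
        (out / f"{work_id}.json").write_text(json.dumps(proposal, indent=2))

    # Run metadata is also written on its own so analysis can find it later.
    (out / "run.json").write_text(json.dumps(run_meta, indent=2))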
3. Analyzer
The analyzer reads proposal files and produces review artifacts.
- dry-run details
- aggregate stats
- hit rates by rule pack
- unmapped summaries
- per-rule sampling for human QA
The analyzer never performs classification and never writes to Open Library.
4. Apply
The apply step reads reviewed proposal files and performs controlled writes.
- additive-only in early phases
- batch save with checkpointing
- audit logging
- retry handling
In the initial rollout, apply should only add missing typed tags. It should not remove existing tags or overwrite human-curated values.
The apply layer also needs an explicit write target. Even if the exact OL write payload is still to be finalized, the architecture should assume that proposals are translated into a stable, typed-tag write shape before any save is attempted. That contract should be defined early, because apply cannot be reviewed meaningfully without knowing what fields or API shape it updates.
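To make the additive rule concrete, here is a sketch of the per-work diff step that apply would perform; the field names are placeholders, since the typed-tag write shape is not yet final:

def additive_update(current_tags: dict, proposed_tags: dict) -> dict:
    """Return only the values that would be added, never removals or overwrites.

    current_tags / proposed_tags map an output type (e.g. "literary_form")
    to a list of tag values. The exact field names in the OL write payload
    are still to be finalized; this only demonstrates the additive rule.
    """
    additions = {}
    for output_type, proposed in proposed_tags.items():
        existing = set(current_tags.get(output_type, []))
        new_values = [v for v in proposed if v not in existing]
        if new_values:
            additions[output_type] = new_values
    return additions

# Example: nothing already present is touched, only missing values are added.
current = {"literary_form": ["Fiction"]}
proposed = {"literary_form": ["Fiction"], "content_formats": ["Novel"]}
assert additive_update(current, proposed) == {"content_formats": ["Novel"]}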
The current SubjectClassifier should be treated as the bot core, not as the entire bot.
The full migration bot should follow the general shape used in openlibrary-bots:
- a dedicated bot directory
- clear script entry points
- a README.md
- requirements.txt or runtime configuration
- operational features such as dry-run, batching, logging, and checkpointing
Conceptually, this is a two-layer design:
- inner layer: SubjectClassifier plus rule packs
- outer layer: migration bot scripts and execution flow
This keeps classification logic reusable while making the delivered bot feel native to the openlibrary-bots repo.
Work Breakdown
- Evolve SubjectClassifier into a pluggable rule-pack core
- Split execution into run_migration.py, analyze_proposals.py, and apply_proposals.py
- Start with direct-match content_formats
Implementation Details
Rule Packs
The current SubjectClassifier should evolve into a pluggable rule-pack core.
Instead of one class implicitly classifying every supported type, the core should run only the explicitly enabled packs.
This split does not need to force every current output into the same implementation shape on day one. In the current script, some outputs are mapping-backed subject classifications, while others are field-level normalization paths:
- genres, subgenres, content_formats, literary_themes, literary_tropes, main_topics, and audience are currently driven by mapping files
- people and places currently use override tables on subject_people and subject_places
- times currently passes through cleaned subject_times values without a controlled mapping step
Phase 1 can treat these as different pack categories, or defer some of them entirely, but the architecture should name that distinction explicitly rather than implying every output already behaves like a symmetric mapping-backed rule pack.
Examples:
core = SubjectClassifier(rule_packs=[LiteraryFormPack()])
core = SubjectClassifier(rule_packs=[ContentFormatsPack()])
core = SubjectClassifier(
    rule_packs=[LiteraryFormPack(), ContentFormatsPack()]
)
Each rule pack owns:
- the target output type or types
- its mapping files
- any heuristics beyond direct string mapping
- evidence emitted into the proposal
Each rule pack should expose a stable interface conceptually equivalent to:
class RulePack:
    name: str
    version: str
    output_types: list[str]

    def apply(self, work: dict) -> dict:
        ...
Returned data should include:
- proposed tags for owned output types
- evidence for each match
- diagnostics such as dropped or unmapped inputs
This allows the system to start with direct-match packs and later add heuristic or model-assisted packs without changing the runner or apply interfaces.
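As an illustration of the pack categories named above, here is a sketch of one mapping-backed pack and one normalization-style pack against that interface; the class names, mapping entries, and cleanup rules are illustrative only, not the real vocabularies:

class ContentFormatsPack:
    """Mapping-backed pack: direct lookup from normalized subject strings."""
    name = "content_formats"
    version = "example"
    output_types = ["content_formats"]

    # In the real pack this table is loaded from the pack's mapping files.
    MAPPING = {"graphic novels": "Graphic Novel", "short stories": "Short Story"}

    def apply(self, work: dict) -> dict:
        proposed, evidence, unmapped = [], [], []
        for raw in work.get("subjects", []):
            normalized = raw.strip().lower()
            value = self.MAPPING.get(normalized)
            if value:
                proposed.append(value)
                evidence.append({"raw": raw, "normalized": normalized,
                                 "rule_pack": self.name,
                                 "output_type": "content_formats",
                                 "value": value, "reason": "direct mapping"})
            else:
                unmapped.append(raw)
        return {"proposals": {"content_formats": sorted(set(proposed))},
                "evidence": {"matched_subjects": evidence},
                "diagnostics": {"unmapped_subjects": unmapped}}


class TimesPack:
    """Normalization-style pack: cleans subject_times without a controlled mapping."""
    name = "times"
    version = "example"
    output_types = ["times"]

    def apply(self, work: dict) -> dict:
        cleaned = [t.strip().rstrip(".") for t in work.get("subject_times", []) if t.strip()]
        return {"proposals": {"times": cleaned},
                "evidence": {"matched_subjects": []},
                "diagnostics": {"unmapped_subjects": []}}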
Overall Structure
Proposed target structure:
tags-bot/
  README.md
  requirements.txt
  run_migration.py
  analyze_proposals.py
  apply_proposals.py
  core/
    classifier.py
    rule_packs/
    mappings/
  config/
Standard Work Input
All runners should normalize source data into a standard work object before invoking the core.
Minimum shape:
{
  "key": "/works/OL82563W",
  "title": "Wuthering Heights",
  "subjects": ["Love stories", "Gothic fiction"],
  "subject_people": ["Heathcliff"],
  "subject_places": ["Yorkshire"],
  "subject_times": ["19th century"]
}
The classification core should not need to know whether the work came from:
- the OL API
- a local JSON file
- a filtered dump row
- a batch fetch pipeline
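A sketch of that normalization step, under the assumption that source records already use OL-style field names; the normalize_work name itself is hypothetical (it is the same helper referenced in the runner sketch above):

def normalize_work(raw: dict) -> dict:
    """Reduce any source record (API response, dump row, local JSON) to the
    minimum work shape the core expects. Field fallbacks here are illustrative."""
    def as_list(value):
        if value is None:
            return []
        if isinstance(value, str):
            value = [value]
        return [str(v).strip() for v in value if str(v).strip()]

    return {
        "key": raw.get("key") or raw.get("work_key", ""),
        "title": raw.get("title", ""),
        "subjects": as_list(raw.get("subjects")),
        "subject_people": as_list(raw.get("subject_people")),
        "subject_places": as_list(raw.get("subject_places")),
        "subject_times": as_list(raw.get("subject_times")),
    }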
Proposal Schema
Proposal files are the stable intermediate artifact shared by runner, analyzer, and apply.
Each proposal should be self-describing and reviewable without rerunning classification. If proposals are expected to be replayable on their own, they must also preserve enough normalized source input to reconstruct the classification decision.
Recommended schema:
{
  "schema_version": "1.0",
  "run": {
    "run_id": "2026-04-17-literary-form-v1",
    "created_at": "2026-04-17T04:10:00Z",
    "source_mode": "api",
    "input_ref": "OL82563W",
    "enabled_rule_packs": [
      {"name": "literary_form", "version": "2026-04-17"}
    ]
  },
  "work": {
    "work_id": "OL82563W",
    "work_key": "/works/OL82563W",
    "title": "Wuthering Heights",
    "source_snapshot": {
      "subjects": ["Love stories", "Gothic fiction"],
      "subject_people": ["Heathcliff"],
      "subject_places": ["Yorkshire"],
      "subject_times": ["19th century"]
    }
  },
  "proposals": {
    "literary_form": ["Fiction"],
    "content_formats": []
  },
  "evidence": {
    "matched_subjects": [
      {
        "raw": "fiction",
        "normalized": "fiction",
        "rule_pack": "literary_form",
        "output_type": "literary_form",
        "value": "Fiction",
        "reason": "direct mapping"
      }
    ]
  },
  "diagnostics": {
    "unmapped_subjects": [],
    "dropped_subjects": [],
    "reading_levels": [],
    "classification_codes": []
  }
}
Schema Requirements
Every proposal file must include:
- the proposed tags
- the work identifier
- the normalized source fields used to produce the proposal, or a durable reference to an immutable source snapshot
- the rule-pack and mapping versions that produced it
- evidence explaining why tags were proposed
- diagnostics for strings that were ignored or not understood
This is what makes audit and sampling practical. It also makes replay practical if we decide proposals should be rerunnable without refetching source data.
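A small validation pass can enforce these requirements before analysis or apply runs; the required-key list below simply mirrors the bullets above and would evolve with the schema:

import json
from pathlib import Path

REQUIRED_TOP_LEVEL = ["schema_version", "run", "work", "proposals", "evidence", "diagnostics"]

def validate_proposal(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file is reviewable."""
    doc = json.loads(path.read_text())
    problems = [f"missing {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    if "work" in doc and not doc["work"].get("work_id"):
        problems.append("missing work identifier")
    if "work" in doc and "source_snapshot" not in doc["work"]:
        problems.append("missing normalized source snapshot")
    if "run" in doc and not doc["run"].get("enabled_rule_packs"):
        problems.append("missing rule-pack versions")
    return problems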
Script Split
The current all-in-one migration script should be split into three independent scripts around the shared core.
run_migration.py
Responsibilities:
- parse CLI/config input
- load work records
- select enabled rule packs
- invoke the core
- write proposal files and run metadata
Typical usage:
python scripts/run_migration.py --work OL82563W --type literary_form
python scripts/run_migration.py --batch work_ids.txt --type literary_form --type content_formats
python scripts/run_migration.py --dump filtered_works.txt.gz --config run.toml
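The flags shown above map onto a small argument parser; this is only a sketch of the intended surface, not a final CLI:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Produce tag proposals for OL works.")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--work", help="single work id, e.g. OL82563W")
    source.add_argument("--batch", help="file of work ids, one per line")
    source.add_argument("--dump", help="filtered works dump (gzipped)")
    parser.add_argument("--type", action="append", default=[],
                        help="rule pack to enable; repeat for multiple packs")
    parser.add_argument("--config", help="optional run config file, e.g. run.toml")
    return parser.parse_args()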
analyze_proposals.py
Responsibilities:
- read proposal files
- generate summaries and dry-run views
- report hit counts, unmapped counts, and rule-pack precision samples
Typical usage:
python scripts/analyze_proposals.py proposals/2026-04-17-literary-form-v1/
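Internally the analyzer only folds proposal files into counts; a sketch of the hit-rate and unmapped summaries, assuming the recommended proposal layout and the per-work file naming used in the runner sketch:

import json
from collections import Counter
from pathlib import Path

def summarize(proposal_dir: str) -> dict:
    hits, unmapped, works_seen = Counter(), Counter(), 0
    for path in Path(proposal_dir).glob("*.json"):
        if path.name == "run.json":            # run metadata, not a proposal
            continue
        doc = json.loads(path.read_text())
        works_seen += 1
        for match in doc.get("evidence", {}).get("matched_subjects", []):
            hits[match["rule_pack"]] += 1
        for raw in doc.get("diagnostics", {}).get("unmapped_subjects", []):
            unmapped[raw] += 1
    return {
        "works_seen": works_seen,
        "hits_by_rule_pack": dict(hits),
        "most_common_unmapped": unmapped.most_common(20),
    }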
apply_proposals.py
Responsibilities:
- read reviewed proposals
- compare against current work state
- translate approved proposals into the stable typed-tag write shape
- perform additive-only updates in batches
- checkpoint progress and log failures
Typical usage:
python scripts/apply_proposals.py proposals/approved/literary_form/
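A sketch of the batch-and-checkpoint loop; the actual Open Library save call is left as a placeholder (save_fn) because the write payload is still to be finalized, and the checkpoint format is an assumption:

import json
from pathlib import Path

def apply_batch(proposal_paths, save_fn, checkpoint_file="apply_checkpoint.json", batch_size=50):
    """Apply approved proposals in batches, skipping anything already done.

    save_fn(work_id, proposals) stands in for the eventual OL write, including
    the additive typed-tag translation; it should raise on failure so the
    work id is not checkpointed. Proposal files are assumed to be named
    <work_id>.json, as in the runner sketch.
    """
    done = set()
    cp = Path(checkpoint_file)
    if cp.exists():
        done = set(json.loads(cp.read_text()))

    pending = [p for p in proposal_paths if Path(p).stem not in done]
    for i in range(0, len(pending), batch_size):
        for path in pending[i:i + batch_size]:
            doc = json.loads(Path(path).read_text())
            work_id = doc["work"]["work_id"]
            try:
                save_fn(work_id, doc["proposals"])
                done.add(work_id)
            except Exception as exc:              # audit log + retry queue in the real bot
                print(f"failed {work_id}: {exc}")
        cp.write_text(json.dumps(sorted(done)))   # checkpoint after each batch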
Phase Plan
Phase 1
- Extract a reusable core from migrate_subjects.py
- Support rule-pack selection
- Define and emit the proposal schema
- Build runner, analyzer, and apply as separate scripts
- Start with direct-match content_formats
- Decide whether people, places, and times ship in Phase 1 as dedicated normalization packs or remain outside the first rule-pack rollout
Phase 2
- Support multiple rule packs in one run
- Add config-file driven runs
- Add richer heuristics and context-sensitive rules
- Add larger-scale dump execution
- Add stronger quality reporting