Modularize the migration pipeline for reusable tag-type bots #2

@Kaftow

Description

Problem

Monolithic Script Structure

migrate_subjects.py is currently structured as a monolithic script: it loads multiple mapping files, applies several classification paths, and emits a single combined tag result in one pass. That design makes it difficult to work on tag types, or a small subset of common tags, independently. The current situation illustrates the problem: contributors may each own a different tag-type module, but all of that work still converges on the same script entry point, so parallel development and incremental rollout are harder than they should be.

We need to split migrate_subjects.py into smaller modules that can be enabled independently. Each module should own one tag type, or one bounded subset within a tag type, along with its mapping files and processing rules. That would let us develop, review, and ship validated migration logic gradually instead of treating the migration system as an all-or-nothing unit.

In the future, we could rely on a single openlibrary-bot and extend it by introducing new tags through configuration. We could still run multiple openlibrary-bot instances in parallel if needed. The core migration logic would then be encapsulated in a shared component, and each bot would stay lightweight, defining only its own tag mappings and processing rules in configuration.

Coupling Between Classification and Execution

The core classification logic is tightly coupled to its loader and exporter in migrate_subjects.py, but in some cases we will want to change the input or export format. For example, we may need to run dry runs on local dump data for validation, and in later stages we may need to read from the official API or write to different downstream artifacts.

To support that flexibility, execution concerns such as loading, orchestration, and exporting should be separated from the core classification module. That separation would let us switch input sources and output modes by changing the runner or executor without rewriting the tag-classification logic itself.

Context

The migration pipeline is split into four layers:

  1. rule-pack core
  2. runner
  3. analyzer
  4. apply

Each layer has a single responsibility. This prevents logic drift and keeps review boundaries clear.

1. Rule-Pack Core

The classification core operates on one work at a time.

  • Input: a normalized OL work object plus one or more enabled rule packs
  • Output: a structured proposal object for that work

The core does not handle:

  • batch orchestration
  • CLI input parsing
  • result file layout
  • dry-run reporting
  • Open Library writes

The core must remain deterministic for a given work object, enabled rule packs, and mapping versions.

2. Runner

The runner is responsible for execution and result-file production.

  • Reads input from API, local JSON, batch files, or filtered dumps
  • Normalizes each record into a standard work object
  • Calls the classification core
  • Writes proposal files
  • Records run metadata such as run id, timestamp, input source, enabled rule packs, and mapping version

The runner is the upstream producer for both analysis and apply workflows.
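Under those responsibilities, the runner's main loop can be sketched as below. The names `run_batch`, the `load_records`/`normalize` callables, and the `core.classify` method are hypothetical, shown only to make the source-agnostic flow concrete:

```python
import json
import time
import uuid


def run_batch(load_records, normalize, core, out_path, enabled_packs):
    """Hypothetical runner loop: source-agnostic, one proposal line per work."""
    run_meta = {
        "run_id": uuid.uuid4().hex,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "enabled_rule_packs": [p.name for p in enabled_packs],
    }
    with open(out_path, "w") as out:
        for record in load_records():       # API, local JSON, batch file, or dump
            work = normalize(record)        # standard work object
            proposal = core.classify(work)  # rule-pack core, one work at a time
            proposal["run"] = run_meta      # attach run metadata to each proposal
            out.write(json.dumps(proposal) + "\n")
```

Because the source only enters through `load_records` and `normalize`, swapping API input for a dump file never touches the core.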

3. Analyzer

The analyzer reads proposal files and produces review artifacts.

  • dry-run details
  • aggregate stats
  • hit rates by rule pack
  • unmapped summaries
  • per-rule sampling for human QA

The analyzer never performs classification and never writes to Open Library.

4. Apply

The apply step reads reviewed proposal files and performs controlled writes.

  • additive-only in early phases
  • batch save with checkpointing
  • audit logging
  • retry handling

In the initial rollout, apply should only add missing typed tags. It should not remove existing tags or overwrite human-curated values.

The apply layer also needs an explicit write target. Even if the exact OL write payload is still to be finalized, the architecture should assume that proposals are translated into a stable, typed-tag write shape before any save is attempted. That contract should be defined early, because apply cannot be reviewed meaningfully without knowing what fields or API shape it updates.
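Pending that contract, the additive-only translation step can be sketched as follows. The function name and the `proposals`/tag-type field layout are assumptions drawn from the proposal schema below; the real write shape is still to be finalized:

```python
def to_write_payload(proposal, current_tags):
    """Translate an approved proposal into an additive-only typed-tag update.

    `current_tags` maps tag type -> values already on the work. Only values
    missing from the work are added; nothing is removed or overwritten.
    Field names are placeholders until the OL write shape is final.
    """
    additions = {}
    for tag_type, values in proposal.get("proposals", {}).items():
        existing = set(current_tags.get(tag_type, []))
        new_values = [v for v in values if v not in existing]
        if new_values:
            additions[tag_type] = new_values
    return additions  # an empty dict means there is nothing to write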

The current SubjectClassifier should be treated as the bot core, not as the entire bot.

The full migration bot should follow the general shape used in openlibrary-bots:

  • a dedicated bot directory
  • clear script entry points
  • README.md
  • requirements.txt or runtime configuration
  • operational features such as dry-run, batching, logging, and checkpointing

Conceptually, this is a two-layer design:

  1. inner layer: SubjectClassifier plus rule packs
  2. outer layer: migration bot scripts and execution flow

This keeps classification logic reusable while making the delivered bot feel native to the openlibrary-bots repo.

Work Breakdown

  •   Evolve the current SubjectClassifier into a pluggable rule-pack core
  •   Run only the explicitly enabled packs instead of implicitly classifying every supported type
  •   Normalize all runners to a standard work object before invoking the core
  •   Define proposal files as the stable intermediate artifact shared by runner, analyzer, and apply
  •   Split the current all-in-one migration script into run_migration.py, analyze_proposals.py, and apply_proposals.py
  •   Proceed by type, subset, and batch rather than all-at-once
  •   Start Phase 1 with direct-match content_formats
  •   Preserve the same proposal schema as the system grows to multiple rule packs and more complex inference

Implementation Details

Rule Packs

The current SubjectClassifier should evolve into a pluggable rule-pack core.

Instead of one class implicitly classifying every supported type, the core should run only the explicitly enabled packs.

This split does not need to force every current output into the same implementation shape on day one. In the current script, some outputs are mapping-backed subject classifications, while others are field-level normalization paths:

  • genres, subgenres, content_formats, literary_themes, literary_tropes, main_topics, and audience are currently driven by mapping files
  • people and places currently use override tables on subject_people and subject_places
  • times currently passes through cleaned subject_times values without a controlled mapping step

Phase 1 can treat these as different pack categories, or defer some of them entirely, but the architecture should name that distinction explicitly rather than implying every output already behaves like a symmetric mapping-backed rule pack.

Examples:

core = SubjectClassifier(rule_packs=[LiteraryFormPack()])
core = SubjectClassifier(rule_packs=[ContentFormatsPack()])
core = SubjectClassifier(
    rule_packs=[LiteraryFormPack(), ContentFormatsPack()]
)

Each rule pack owns:

  • the target output type or types
  • its mapping files
  • any heuristics beyond direct string mapping
  • evidence emitted into the proposal

Each rule pack should expose a stable interface conceptually equivalent to:

class RulePack:
    name: str
    version: str
    output_types: list[str]

    def apply(self, work: dict) -> dict:
        ...

Returned data should include:

  • proposed tags for owned output types
  • evidence for each match
  • diagnostics such as dropped or unmapped inputs

This allows the system to start with direct-match packs and later add heuristic or model-assisted packs without changing the runner or apply interfaces.
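To make that interface concrete, here is a minimal direct-match pack sketched against the RulePack shape above. The class name echoes the ContentFormatsPack example, but the mapping contents and the exact return layout are illustrative assumptions:

```python
class ContentFormatsPack:
    """Minimal direct-match pack sketch; mapping contents are illustrative."""

    name = "content_formats"
    version = "2026-04-17"
    output_types = ["content_formats"]

    def __init__(self, mapping=None):
        # The real pack would load this from its own mapping files.
        self.mapping = mapping or {"graphic novels": "Graphic Novel"}

    def apply(self, work: dict) -> dict:
        proposed, evidence, unmapped = [], [], []
        for raw in work.get("subjects", []):
            normalized = raw.strip().lower()
            value = self.mapping.get(normalized)
            if value is None:
                unmapped.append(raw)          # diagnostics, not an error
            elif value not in proposed:
                proposed.append(value)
                evidence.append({"raw": raw, "normalized": normalized,
                                 "output_type": "content_formats",
                                 "value": value, "reason": "direct mapping"})
        return {"proposals": {"content_formats": proposed},
                "evidence": evidence,
                "diagnostics": {"unmapped_subjects": unmapped}}
```

A heuristic or model-assisted pack would return the same shape, just with different `reason` values in its evidence.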

Overall Structure

Proposed target structure:

tags-bot/
  README.md
  requirements.txt
  run_migration.py
  analyze_proposals.py
  apply_proposals.py
  core/
    classifier.py
    rule_packs/
  mappings/
  config/

Standard Work Input

All runners should normalize source data into a standard work object before invoking the core.

Minimum shape:

{
  "key": "/works/OL82563W",
  "title": "Wuthering Heights",
  "subjects": ["Love stories", "Gothic fiction"],
  "subject_people": ["Heathcliff"],
  "subject_places": ["Yorkshire"],
  "subject_times": ["19th century"]
}

The classification core should not need to know whether the work came from:

  • the OL API
  • a local JSON file
  • a filtered dump row
  • a batch fetch pipeline
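A normalization helper with that guarantee might look like the sketch below. The fallback behavior for scalar vs. list fields is an assumption about how raw API and dump rows can differ:

```python
def normalize_work(record: dict) -> dict:
    """Normalize any source record into the standard work object.

    Field names follow the minimum shape above; the scalar-to-list
    coercion is an assumed convenience for messy source rows.
    """
    def as_list(value):
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    return {
        "key": record.get("key", ""),
        "title": record.get("title", ""),
        "subjects": as_list(record.get("subjects")),
        "subject_people": as_list(record.get("subject_people")),
        "subject_places": as_list(record.get("subject_places")),
        "subject_times": as_list(record.get("subject_times")),
    }
```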

Proposal Schema

Proposal files are the stable intermediate artifact shared by runner, analyzer, and apply.

Each proposal should be self-describing and reviewable without rerunning classification. If proposals are expected to be replayable on their own, they must also preserve enough normalized source input to reconstruct the classification decision.

Recommended schema:

{
  "schema_version": "1.0",
  "run": {
    "run_id": "2026-04-17-literary-form-v1",
    "created_at": "2026-04-17T04:10:00Z",
    "source_mode": "api",
    "input_ref": "OL82563W",
    "enabled_rule_packs": [
      {"name": "literary_form", "version": "2026-04-17"}
    ]
  },
  "work": {
    "work_id": "OL82563W",
    "work_key": "/works/OL82563W",
    "title": "Wuthering Heights",
    "source_snapshot": {
      "subjects": ["Love stories", "Gothic fiction"],
      "subject_people": ["Heathcliff"],
      "subject_places": ["Yorkshire"],
      "subject_times": ["19th century"]
    }
  },
  "proposals": {
    "literary_form": ["Fiction"],
    "content_formats": []
  },
  "evidence": {
    "matched_subjects": [
      {
        "raw": "fiction",
        "normalized": "fiction",
        "rule_pack": "literary_form",
        "output_type": "literary_form",
        "value": "Fiction",
        "reason": "direct mapping"
      }
    ]
  },
  "diagnostics": {
    "unmapped_subjects": [],
    "dropped_subjects": [],
    "reading_levels": [],
    "classification_codes": []
  }
}

Schema Requirements

Every proposal file must include:

  • the proposed tags
  • the work identifier
  • the normalized source fields used to produce the proposal, or a durable reference to an immutable source snapshot
  • the rule-pack and mapping versions that produced it
  • evidence explaining why tags were proposed
  • diagnostics for strings that were ignored or not understood

This is what makes audit and sampling practical. It also makes replay practical if we decide proposals should be rerunnable without refetching source data.
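Those requirements are easy to enforce mechanically. A minimal validation sketch, with the exact checks being assumptions layered on the recommended schema above:

```python
REQUIRED_TOP_LEVEL = ("schema_version", "run", "work", "proposals",
                      "evidence", "diagnostics")


def validate_proposal(proposal: dict) -> list:
    """Return a list of problems; an empty list means the proposal is reviewable."""
    problems = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL
                if f not in proposal]
    run = proposal.get("run", {})
    if not run.get("enabled_rule_packs"):
        problems.append("run.enabled_rule_packs missing or empty")
    work = proposal.get("work", {})
    if not work.get("work_key") and not work.get("source_snapshot"):
        problems.append("work must carry a key or a source snapshot")
    return problems
```

Running this in the analyzer (or as a pre-apply gate) keeps malformed proposals from ever reaching the write path.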

Script Split

The current all-in-one migration script should be split into three independent scripts around the shared core.

run_migration.py

Responsibilities:

  • parse CLI/config input
  • load work records
  • select enabled rule packs
  • invoke the core
  • write proposal files and run metadata

Typical usage:

python scripts/run_migration.py --work OL82563W --type literary_form
python scripts/run_migration.py --batch work_ids.txt --type literary_form --type content_formats
python scripts/run_migration.py --dump filtered_works.txt.gz --config run.toml

analyze_proposals.py

Responsibilities:

  • read proposal files
  • generate summaries and dry-run views
  • report hit counts, unmapped counts, and rule-pack precision samples

Typical usage:

python scripts/analyze_proposals.py proposals/2026-04-17-literary-form-v1/
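The aggregation inside the analyzer can be sketched as below. One-JSON-file-per-proposal input and the summary keys are assumptions; only fields from the proposal schema above are read:

```python
import json
from collections import Counter


def summarize(proposal_paths):
    """Aggregate hit and unmapped counts across proposal files."""
    hits, unmapped = Counter(), Counter()
    total = 0
    for path in proposal_paths:
        with open(path) as f:
            proposal = json.load(f)
        total += 1
        for tag_type, values in proposal.get("proposals", {}).items():
            hits[tag_type] += len(values)
        for raw in proposal.get("diagnostics", {}).get("unmapped_subjects", []):
            unmapped[raw] += 1
    return {"works": total,
            "hits_by_type": dict(hits),
            "top_unmapped": unmapped.most_common(20)}
```

The `top_unmapped` list is what feeds mapping-file improvements between runs.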

apply_proposals.py

Responsibilities:

  • read reviewed proposals
  • compare against current work state
  • translate approved proposals into the stable typed-tag write shape
  • perform additive-only updates in batches
  • checkpoint progress and log failures

Typical usage:

python scripts/apply_proposals.py proposals/approved/literary_form/
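The batching-and-checkpointing loop can be sketched as follows. `save_batch` stands in for the eventual OL batch-save call, and the one-key-per-line checkpoint format is an assumption; both would be replaced once the write contract is defined:

```python
import os


def apply_in_batches(proposals, save_batch, checkpoint_path, batch_size=50):
    """Apply proposals in batches with a resumable checkpoint.

    The checkpoint records every applied work key, so a rerun after a
    crash skips work that was already saved.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = {line.strip() for line in f if line.strip()}
    batch = []
    for proposal in proposals:
        key = proposal["work"]["work_key"]
        if key in done:
            continue  # already applied in a previous run
        batch.append(proposal)
        if len(batch) >= batch_size:
            save_batch(batch)
            _checkpoint(checkpoint_path, batch)
            batch = []
    if batch:
        save_batch(batch)
        _checkpoint(checkpoint_path, batch)


def _checkpoint(path, batch):
    with open(path, "a") as f:
        for proposal in batch:
            f.write(proposal["work"]["work_key"] + "\n")
```

Checkpointing after each batch (not each work) keeps the restart granularity aligned with the save granularity.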

Phase Plan

Phase 1

  • Extract a reusable core from migrate_subjects.py
  • Support rule-pack selection
  • Define and emit the proposal schema
  • Build runner, analyzer, and apply as separate scripts
  • Start with direct-match content_formats
  • Decide whether people, places, and times ship in Phase 1 as dedicated normalization packs or remain outside the first rule-pack rollout

Phase 2

  • Support multiple rule packs in one run
  • Add config-file driven runs
  • Add richer heuristics and context-sensitive rules
  • Add larger-scale dump execution
  • Add stronger quality reporting
