
Refactor/seperate common module#3

Open
Kaftow wants to merge 9 commits into Open-Book-Genome-Project:main from Kaftow:refactor/seperate-common-module

Conversation


@Kaftow Kaftow commented Apr 21, 2026

Close #2

Description

Summary

This PR extracts the legacy scripts/migrate_subjects.py logic into a reusable classification core with explicit rule packs and sequential shared state.

The main goal is to stop treating subject migration as one monolithic script and instead make it possible to:

  • run only selected tag-type packs
  • develop packs independently in separate modules
  • preserve the legacy “ordered pass over shrinking subjects” behavior
  • keep the operational script thin, with classification logic living in reusable core code

This branch also moves mapping data out of scripts/ into resources/mappings/ and adds an explicit shell entry point for the legacy full run.

Architecture

The new layout is split into a few narrow layers:

  • core/

    • orchestration and assembly
    • SubjectClassifier is now just the work-level runner
    • pack-name resolution and construction live in classifier_assembler.py and pack_registry.py
    • RunState holds shared sequential state during a classification run
  • rule_engine/

    • minimal shared interfaces and normalization helpers
    • RulePack defines the pack contract
  • rules/

    • reusable low-level matching primitives
    • prefix matching, mapping-based lookup, override lookup, passthrough cleanup
  • rule_packs/

    • one module per tag type
    • each pack composes the low-level rules it needs instead of relying on a single monolithic implementation
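As a rough sketch of the pack contract, the layering above could look like this (RulePack and RunState are the names used in this PR, but the method signature, fields, and the example pack are illustrative assumptions, not the actual interface):

```python
from abc import ABC, abstractmethod

class RunState:
    """Shared sequential state for one classification run (sketch)."""
    def __init__(self, work, subjects):
        self.work = work                          # normalized input work
        self.remaining_subjects = list(subjects)  # shrinking working copy
        self.result = {}                          # accumulated typed tags

class RulePack(ABC):
    """Hypothetical pack contract: consume subjects, accumulate results."""
    name: str = "base"

    @abstractmethod
    def apply(self, state: RunState) -> None: ...

class PassthroughCleanupPack(RulePack):
    """Trivial example pack: drop empty/whitespace-only subjects."""
    name = "passthrough_cleanup"

    def apply(self, state: RunState) -> None:
        state.remaining_subjects = [s for s in state.remaining_subjects if s.strip()]

state = RunState(work={}, subjects=["Novel", "  ", ""])
PassthroughCleanupPack().apply(state)
# state.remaining_subjects == ["Novel"]
```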

State design

Classification is now explicitly sequential.

SubjectClassifier.classify_work() creates a RunState containing:

  • the normalized input work
  • the accumulated typed-tag result
  • remaining_subjects, a working copy of the legacy subjects list

Packs run in order and can consume matched subject strings from remaining_subjects. That preserves the legacy behavior where earlier matches reduce the input seen by later passes, while still keeping the core generic enough for future pack composition.

The shared mutable state is intentionally small. The only cross-pack coordination is:

  • accumulated output tags
  • the shrinking remaining_subjects list
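In sketch form, the sequential flow looks roughly like the following (class names match the PR description; the callable-pack shape, the tiny mapping, and all bodies are assumptions for illustration):

```python
class RunState:
    """Shared sequential state: packs mutate this in order (sketch)."""
    def __init__(self, work):
        self.work = work
        self.result = {}                                  # accumulated typed tags
        self.remaining_subjects = list(work.get("subjects", []))

class SubjectClassifier:
    """Work-level runner: applies an ordered pack list to one work."""
    def __init__(self, packs):
        self.packs = packs

    def classify_work(self, work):
        state = RunState(work)
        for pack in self.packs:                           # strictly ordered
            pack(state)                                   # each pack mutates state
        return state.result

def content_formats_pack(state):
    """Illustrative pack: map subjects and consume the matched ones."""
    mapping = {"memoirs": "Memoir", "anthology": "Anthology"}
    matched = []
    for raw in state.remaining_subjects:
        value = mapping.get(raw.strip().lower())
        if value:
            state.result.setdefault("content_formats", []).append(value)
            matched.append(raw)
    for raw in matched:                                   # shrink the list so
        state.remaining_subjects.remove(raw)              # later packs see less

result = SubjectClassifier([content_formats_pack]).classify_work(
    {"subjects": ["Memoirs", "Novel"]}
)
# result == {"content_formats": ["Memoir"]}
```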

What changed

  • Extracted classification logic out of scripts/migrate_subjects.py
  • Added core/subject_classifier.py as the orchestration-only classifier
  • Added core/run_state.py for shared sequential execution state
  • Added core/classifier_assembler.py and core/pack_registry.py for pack resolution / construction
  • Added rule_engine/ for the minimal pack interface and normalization helpers
  • Added rules/ for reusable rule primitives
  • Split subject migration into explicit rule_packs/ modules:
    • literary_form
    • audience
    • genres
    • subgenres
    • content_formats
    • moods
    • literary_themes
    • literary_tropes
    • main_topics
    • people
    • places
    • times
    • subject_diagnostics
  • Moved mapping JSON files from scripts/mappings/ to resources/mappings/
  • Kept core/migrate_subject_classifier.py as a compatibility shim for older imports
  • Updated scripts/README.md to document the new structure and execution model

CLI / execution changes

migrate_subjects.py no longer silently enables a default full preset when --pack is omitted.

Instead:

  • use --pack explicitly for partial / targeted runs
  • use the shell wrapper for the legacy full ordered run

Legacy-compatible full execution is now:

./scripts/run_legacy_subjects.sh --file work.json --dry-run

That wrapper expands to a fixed ordered pack list:

literary_form
audience
genres
subgenres
content_formats
moods
literary_themes
literary_tropes
main_topics
subject_diagnostics
people
places
times

This keeps the old end-to-end flow available, but makes the execution order explicit and inspectable.
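A wrapper of this shape might look roughly like the following (an illustrative sketch, not the actual contents of scripts/run_legacy_subjects.sh; it echoes the assembled command here instead of executing it):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fixed ordered pack list for the legacy full run (order copied from above).
LEGACY_PACKS=(
  literary_form audience genres subgenres content_formats
  moods literary_themes literary_tropes main_topics
  subject_diagnostics people places times
)

# Expand into repeated --pack flags for migrate_subjects.py.
PACK_ARGS=()
for pack in "${LEGACY_PACKS[@]}"; do
  PACK_ARGS+=(--pack "$pack")
done

# Forward user flags (e.g. --file work.json --dry-run) unchanged; the real
# wrapper would exec python here rather than echo the command.
echo python scripts/migrate_subjects.py "${PACK_ARGS[@]}" "$@"
```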

Why this structure

The old script mixed together:

  • input loading
  • mapping loading
  • classification logic
  • sequencing
  • output generation

That made it difficult to work on one tag type at a time or to reuse the classifier in a different runner.

This refactor moves toward a narrower core:

work + ordered packs -> result

That gives us a base we can later reuse in:

  • dry-run analyzers
  • alternate loaders
  • proposal generation
  • apply workflows

without putting those concerns back into the classifier itself.

Notes

This PR is mainly a structural refactor. The intent is to preserve the legacy migration flow while making the system modular enough for future pack-by-pack development and rollout.

Kaftow (Author) commented Apr 21, 2026

I've completed simple content_formats migration logic and also enabled the dry-run path to support inspecting the post-migration work.

The dry-run output now makes it possible to review:

  • the proposed content_formats tags
  • which legacy subjects would be removed
  • which legacy subjects would remain
  • the per-subject migration action (move vs extract_only)

The output format currently looks like this:

=== OLDEMO1W ===
  proposed_tags:
    content_formats:
      - Memoir
      - Anthology
      - Letters
      - Dictionary
      - Biography
      - Autobiography
      - Manga
      - Encyclopedia
      - Novel
      - Diary
    reading_level:
      - Grade 4
    unmapped:
      - abc
  subject_proposal:
    removed:
      - Memoirs
      - Anthology
      - Letters
      - Dictionary
    remaining:
      - Biography
      - Autobiography
      - Manga
      - Encyclopedia
      - Novel
      - format:Diary
      - abc
      - Grade 4
  subject_matches:
    - Memoirs -> content_formats:Memoir (move)
    - Anthology -> content_formats:Anthology (move)
    - Letters -> content_formats:Letters (move)
    - Dictionary -> content_formats:Dictionary (move)
    - Biography -> content_formats:Biography (extract_only)
    - Autobiography -> content_formats:Autobiography (extract_only)
    - Manga -> content_formats:Manga (extract_only)
    - Encyclopedia -> content_formats:Encyclopedia (extract_only)
    - Novel -> content_formats:Novel (extract_only)
    - format:Diary -> content_formats:Diary (extract_only)
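The move vs extract_only split above corresponds to the RuleMatch (value + action) introduced in rules/match_result.py; a hedged sketch of how a runner might apply it (everything except the RuleMatch name is an assumption):

```python
from dataclasses import dataclass

@dataclass
class RuleMatch:
    value: str    # proposed typed-tag value, e.g. "Memoir"
    action: str   # "move" removes the legacy subject; "extract_only" keeps it

def apply_matches(subjects, matches):
    """Split legacy subjects into removed/remaining based on match actions."""
    removed = [s for s, m in matches.items() if m.action == "move"]
    remaining = [s for s in subjects if s not in removed]
    return removed, remaining

matches = {
    "Memoirs": RuleMatch("Memoir", "move"),
    "Biography": RuleMatch("Biography", "extract_only"),
}
removed, remaining = apply_matches(["Memoirs", "Biography", "abc"], matches)
# removed == ["Memoirs"], remaining == ["Biography", "abc"]
```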

Command for testing:


python scripts/migrate_subjects.py \
  --file demo_content_formats.json \
  --pack content_formats \
  --pack subject_diagnostics \
  --dry-run
  

@mekarpeles mekarpeles requested a review from Copilot April 21, 2026 17:47

Copilot AI left a comment


Pull request overview

This PR refactors the legacy subject-migration script into a small reusable “classification core” (core/ + rule_engine/ + rules/) driven by explicit rule packs (rule_packs/), and relocates mapping data from scripts/ into resources/mappings/ to support modular, pack-by-pack dry runs.

Changes:

  • Replaced the monolithic scripts/migrate_subjects.py classifier with pack-driven execution using core.subject_classifier.SubjectClassifier.
  • Introduced a minimal rule-pack framework (rule_engine/, rules/, rule_packs/) including shared sequential RunState and subject-migration helpers.
  • Moved mapping JSON from scripts/mappings/ to resources/mappings/ and narrowed documented scope to content_formats + subject_diagnostics.

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scripts/migrate_subjects.py Runner now builds a classifier from selected pack(s) and emits a proposal-style report.
scripts/README.md Updated docs to describe pack-based dry runs and new report format.
core/subject_classifier.py New orchestration core: runs packs over RunState and returns a report.
core/run_state.py Shared sequential state (results + shrinking subjects + match audit).
core/json_loader.py Loads mappings/droppable sets from resources/mappings/.
rule_engine/base.py Defines the RulePack interface.
rule_engine/normalization.py Shared normalization + reading-level / classification detection helpers.
rule_engine/__init__.py Exposes rule-engine primitives.
rules/match_result.py Introduces RuleMatch (value + action).
rules/mapping_rule.py Mapping-based matcher using normalized lookups.
rules/prefix_rule.py Prefix matcher for values like format:Diary.
rules/__init__.py Exports composable rule primitives.
rule_packs/subject_migration.py Shared helper + base pack for subject-list migration with move/extract_only.
rule_packs/content_formats.py content_formats pack using mapping + prefix rules with move/extract policy split.
rule_packs/subject_diagnostics.py Diagnostics pack for droppable/reading-level/classification/unmapped handling.
rule_packs/__init__.py Exports pack classes and (currently partial) pack-class tuples.
resources/mappings/content_formats.json New location for content format mapping data.
resources/mappings/droppable.json New location for droppable legacy-subject strings.
demo_content_formats.json Adds a demo work JSON for local dry-run experimentation.
scripts/mappings/audience.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/genres.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/subgenres.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/literary_themes.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/literary_tropes.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/main_topics.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/people_overrides.json Removed legacy override file (moved/retired under new architecture).
scripts/mappings/places_overrides.json Removed legacy override file (moved/retired under new architecture).


Comment on lines +22 to +28
REPO_ROOT = Path(__file__).resolve().parent.parent
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from core.subject_classifier import SubjectClassifier
from rule_packs.content_formats import ContentFormatsPack
from rule_packs.subject_diagnostics import SubjectDiagnosticsPack
Comment on lines +23 to +27
for raw in state.remaining_subjects:
    key = normalize(raw)
    if key in self.droppable:
        continue
    if key in state.retained_matched_subjects:
Comment thread scripts/README.md
Comment on lines 35 to 45
# Single work by OL ID
python scripts/migrate_subjects.py --work OL82563W

# From a local JSON file
python scripts/migrate_subjects.py --file work.json

# Batch from a newline-delimited list of OL IDs
python scripts/migrate_subjects.py --batch ol_ids.txt --output output/

# Dry run (print proposed mappings without writing)
python scripts/migrate_subjects.py --work OL82563W --dry-run
Comment thread scripts/README.md
rule_packs/
    content_formats.py      # current migration logic under active development
    subject_diagnostics.py  # minimal QA/support pack
    utils.py                # shared subject-pack execution helper
Comment on lines +50 to +54
for name in selected:
    if name in PACK_PRESETS:
        expanded.extend(PACK_PRESETS[name])
        continue
    expanded.append(name)
Comment on lines +154 to 163
parser.add_argument(
    "--pack",
    action="append",
    choices=AVAILABLE_PACK_NAMES,
    help="Enable only the named rule pack. Repeat to combine multiple packs.",
)

args = parser.parse_args()
classifier = SubjectClassifier()
classifier = build_subject_classifier(args.pack)

Comment thread core/json_loader.py
if not path.exists():
    return {}
with open(path) as handle:
    return json.load(handle)
Comment thread rule_packs/__init__.py
Comment on lines +3 to +10
from .content_formats import ContentFormatsPack
from .subject_diagnostics import SubjectDiagnosticsPack

SUBJECT_PACK_CLASSES = (ContentFormatsPack,)

FIELD_PACK_CLASSES = ()

ALL_PACK_CLASSES = SUBJECT_PACK_CLASSES + FIELD_PACK_CLASSES


Development

Successfully merging this pull request may close these issues.

Modularize the migration pipeline for reusable tag-type bots
