
Refactor/seperate common module#3

Open
Kaftow wants to merge 9 commits into Open-Book-Genome-Project:main from Kaftow:refactor/seperate-common-module

Conversation


@Kaftow Kaftow commented Apr 21, 2026

Close #2

Description

Summary

This PR extracts the legacy scripts/migrate_subjects.py logic into a reusable classification core with explicit rule packs and sequential shared state.

The main goal is to stop treating subject migration as one monolithic script and instead make it possible to:

  • run only selected tag-type packs
  • develop packs independently in separate modules
  • preserve the legacy “ordered pass over shrinking subjects” behavior
  • keep the operational script thin, with classification logic living in reusable core code

This branch also moves mapping data out of scripts/ into resources/mappings/ and adds an explicit shell entry point for the legacy full run.

Architecture

The new layout is split into a few narrow layers:

  • core/

    • orchestration and assembly
    • SubjectClassifier is now just the work-level runner
    • pack-name resolution and construction live in classifier_assembler.py and pack_registry.py
    • RunState holds shared sequential state during a classification run
  • rule_engine/

    • minimal shared interfaces and normalization helpers
    • RulePack defines the pack contract
  • rules/

    • reusable low-level matching primitives
    • prefix matching, mapping-based lookup, override lookup, passthrough cleanup
  • rule_packs/

    • one module per tag type
    • each pack composes the low-level rules it needs instead of relying on a single monolithic implementation
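As a rough sketch of the pack contract, the layering above could look like this (RulePack and RunState are the names used in this PR, but the method signature, fields, and the example pack are illustrative assumptions, not the actual interface):

```python
from abc import ABC, abstractmethod

class RunState:
    """Shared sequential state for one classification run (sketch)."""
    def __init__(self, work, subjects):
        self.work = work                          # normalized input work
        self.remaining_subjects = list(subjects)  # shrinking working copy
        self.result = {}                          # accumulated typed tags

class RulePack(ABC):
    """Hypothetical pack contract: consume subjects, accumulate results."""
    name: str = "base"

    @abstractmethod
    def apply(self, state: RunState) -> None: ...

class PassthroughCleanupPack(RulePack):
    """Trivial example pack: drop empty/whitespace-only subjects."""
    name = "passthrough_cleanup"

    def apply(self, state: RunState) -> None:
        state.remaining_subjects = [s for s in state.remaining_subjects if s.strip()]

state = RunState(work={}, subjects=["Novel", "  ", ""])
PassthroughCleanupPack().apply(state)
# state.remaining_subjects == ["Novel"]
```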

State design

Classification is now explicitly sequential.

SubjectClassifier.classify_work() creates a RunState containing:

  • the normalized input work
  • the accumulated typed-tag result
  • remaining_subjects, a working copy of the legacy subjects list

Packs run in order and can consume matched subject strings from remaining_subjects. That preserves the legacy behavior where earlier matches reduce the input seen by later passes, while still keeping the core generic enough for future pack composition.

The shared mutable state is intentionally small. The only cross-pack coordination is:

  • accumulated output tags
  • the shrinking remaining_subjects list
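In sketch form, the sequential flow looks roughly like the following (class names match the PR description; the callable-pack shape, the tiny mapping, and all bodies are assumptions for illustration):

```python
class RunState:
    """Shared sequential state: packs mutate this in order (sketch)."""
    def __init__(self, work):
        self.work = work
        self.result = {}                                  # accumulated typed tags
        self.remaining_subjects = list(work.get("subjects", []))

class SubjectClassifier:
    """Work-level runner: applies an ordered pack list to one work."""
    def __init__(self, packs):
        self.packs = packs

    def classify_work(self, work):
        state = RunState(work)
        for pack in self.packs:                           # strictly ordered
            pack(state)                                   # each pack mutates state
        return state.result

def content_formats_pack(state):
    """Illustrative pack: map subjects and consume the matched ones."""
    mapping = {"memoirs": "Memoir", "anthology": "Anthology"}
    matched = []
    for raw in state.remaining_subjects:
        value = mapping.get(raw.strip().lower())
        if value:
            state.result.setdefault("content_formats", []).append(value)
            matched.append(raw)
    for raw in matched:                                   # shrink the list so
        state.remaining_subjects.remove(raw)              # later packs see less

result = SubjectClassifier([content_formats_pack]).classify_work(
    {"subjects": ["Memoirs", "Novel"]}
)
# result == {"content_formats": ["Memoir"]}
```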

What changed

  • Extracted classification logic out of scripts/migrate_subjects.py
  • Added core/subject_classifier.py as the orchestration-only classifier
  • Added core/run_state.py for shared sequential execution state
  • Added core/classifier_assembler.py and core/pack_registry.py for pack resolution / construction
  • Added rule_engine/ for the minimal pack interface and normalization helpers
  • Added rules/ for reusable rule primitives
  • Split subject migration into explicit rule_packs/ modules:
    • literary_form
    • audience
    • genres
    • subgenres
    • content_formats
    • moods
    • literary_themes
    • literary_tropes
    • main_topics
    • people
    • places
    • times
    • subject_diagnostics
  • Moved mapping JSON files from scripts/mappings/ to resources/mappings/
  • Kept core/migrate_subject_classifier.py as a compatibility shim for older imports
  • Updated scripts/README.md to document the new structure and execution model

CLI / execution changes

migrate_subjects.py no longer silently enables a default full preset when --pack is omitted.

Instead:

  • use --pack explicitly for partial / targeted runs
  • use the shell wrapper for the legacy full ordered run

Legacy-compatible full execution is now:

./scripts/run_legacy_subjects.sh --file work.json --dry-run

That wrapper expands to a fixed ordered pack list:

literary_form
audience
genres
subgenres
content_formats
moods
literary_themes
literary_tropes
main_topics
subject_diagnostics
people
places
times

This keeps the old end-to-end flow available, but makes the execution order explicit and inspectable.
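A wrapper of this shape might look roughly like the following (an illustrative sketch, not the actual contents of scripts/run_legacy_subjects.sh; it echoes the assembled command here instead of executing it):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fixed ordered pack list for the legacy full run (order copied from above).
LEGACY_PACKS=(
  literary_form audience genres subgenres content_formats
  moods literary_themes literary_tropes main_topics
  subject_diagnostics people places times
)

# Expand into repeated --pack flags for migrate_subjects.py.
PACK_ARGS=()
for pack in "${LEGACY_PACKS[@]}"; do
  PACK_ARGS+=(--pack "$pack")
done

# Forward user flags (e.g. --file work.json --dry-run) unchanged; the real
# wrapper would exec python here rather than echo the command.
echo python scripts/migrate_subjects.py "${PACK_ARGS[@]}" "$@"
```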

Why this structure

The old script mixed together:

  • input loading
  • mapping loading
  • classification logic
  • sequencing
  • output generation

That made it difficult to work on one tag type at a time or to reuse the classifier in a different runner.

This refactor moves toward a narrower core:

work + ordered packs -> result

That gives us a base we can later reuse in:

  • dry-run analyzers
  • alternate loaders
  • proposal generation
  • apply workflows

without putting those concerns back into the classifier itself.

Notes

This PR is mainly a structural refactor. The intent is to preserve the legacy migration flow while making the system modular enough for future pack-by-pack development and rollout.

Kaftow (Author) commented Apr 21, 2026

I've completed simple content_formats migration logic and also enabled the dry-run path to support inspecting the post-migration work.

The dry-run output now makes it possible to review:

  • the proposed content_formats tags
  • which legacy subjects would be removed
  • which legacy subjects would remain
  • the per-subject migration action (move vs extract_only)

The output format currently looks like this:

=== OLDEMO1W ===
  proposed_tags:
    content_formats:
      - Memoir
      - Anthology
      - Letters
      - Dictionary
      - Biography
      - Autobiography
      - Manga
      - Encyclopedia
      - Novel
      - Diary
    reading_level:
      - Grade 4
    unmapped:
      - abc
  subject_proposal:
    removed:
      - Memoirs
      - Anthology
      - Letters
      - Dictionary
    remaining:
      - Biography
      - Autobiography
      - Manga
      - Encyclopedia
      - Novel
      - format:Diary
      - abc
      - Grade 4
  subject_matches:
    - Memoirs -> content_formats:Memoir (move)
    - Anthology -> content_formats:Anthology (move)
    - Letters -> content_formats:Letters (move)
    - Dictionary -> content_formats:Dictionary (move)
    - Biography -> content_formats:Biography (extract_only)
    - Autobiography -> content_formats:Autobiography (extract_only)
    - Manga -> content_formats:Manga (extract_only)
    - Encyclopedia -> content_formats:Encyclopedia (extract_only)
    - Novel -> content_formats:Novel (extract_only)
    - format:Diary -> content_formats:Diary (extract_only)
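The move vs extract_only split above corresponds to the RuleMatch (value + action) introduced in rules/match_result.py; a hedged sketch of how a runner might apply it (everything except the RuleMatch name is an assumption):

```python
from dataclasses import dataclass

@dataclass
class RuleMatch:
    value: str    # proposed typed-tag value, e.g. "Memoir"
    action: str   # "move" removes the legacy subject; "extract_only" keeps it

def apply_matches(subjects, matches):
    """Split legacy subjects into removed/remaining based on match actions."""
    removed = [s for s, m in matches.items() if m.action == "move"]
    remaining = [s for s in subjects if s not in removed]
    return removed, remaining

matches = {
    "Memoirs": RuleMatch("Memoir", "move"),
    "Biography": RuleMatch("Biography", "extract_only"),
}
removed, remaining = apply_matches(["Memoirs", "Biography", "abc"], matches)
# removed == ["Memoirs"], remaining == ["Biography", "abc"]
```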

Command for testing:


python scripts/migrate_subjects.py \
  --file demo_content_formats.json \
  --pack content_formats \
  --pack subject_diagnostics \
  --dry-run
  

@mekarpeles mekarpeles requested a review from Copilot April 21, 2026 17:47

Copilot AI left a comment


Pull request overview

This PR refactors the legacy subject-migration script into a small reusable “classification core” (core/ + rule_engine/ + rules/) driven by explicit rule packs (rule_packs/), and relocates mapping data from scripts/ into resources/mappings/ to support modular, pack-by-pack dry runs.

Changes:

  • Replaced the monolithic scripts/migrate_subjects.py classifier with pack-driven execution using core.subject_classifier.SubjectClassifier.
  • Introduced a minimal rule-pack framework (rule_engine/, rules/, rule_packs/) including shared sequential RunState and subject-migration helpers.
  • Moved mapping JSON from scripts/mappings/ to resources/mappings/ and narrowed documented scope to content_formats + subject_diagnostics.

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scripts/migrate_subjects.py Runner now builds a classifier from selected pack(s) and emits a proposal-style report.
scripts/README.md Updated docs to describe pack-based dry runs and new report format.
core/subject_classifier.py New orchestration core: runs packs over RunState and returns a report.
core/run_state.py Shared sequential state (results + shrinking subjects + match audit).
core/json_loader.py Loads mappings/droppable sets from resources/mappings/.
rule_engine/base.py Defines the RulePack interface.
rule_engine/normalization.py Shared normalization + reading-level / classification detection helpers.
rule_engine/__init__.py Exposes rule-engine primitives.
rules/match_result.py Introduces RuleMatch (value + action).
rules/mapping_rule.py Mapping-based matcher using normalized lookups.
rules/prefix_rule.py Prefix matcher for values like format:Diary.
rules/__init__.py Exports composable rule primitives.
rule_packs/subject_migration.py Shared helper + base pack for subject-list migration with move/extract_only.
rule_packs/content_formats.py content_formats pack using mapping + prefix rules with move/extract policy split.
rule_packs/subject_diagnostics.py Diagnostics pack for droppable/reading-level/classification/unmapped handling.
rule_packs/__init__.py Exports pack classes and (currently partial) pack-class tuples.
resources/mappings/content_formats.json New location for content format mapping data.
resources/mappings/droppable.json New location for droppable legacy-subject strings.
demo_content_formats.json Adds a demo work JSON for local dry-run experimentation.
scripts/mappings/audience.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/genres.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/subgenres.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/literary_themes.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/literary_tropes.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/main_topics.json Removed legacy mapping file (moved/retired under new architecture).
scripts/mappings/people_overrides.json Removed legacy override file (moved/retired under new architecture).
scripts/mappings/places_overrides.json Removed legacy override file (moved/retired under new architecture).


Comment on lines +22 to +28
REPO_ROOT = Path(__file__).resolve().parent.parent
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from core.subject_classifier import SubjectClassifier
from rule_packs.content_formats import ContentFormatsPack
from rule_packs.subject_diagnostics import SubjectDiagnosticsPack
Comment on lines +23 to +27
for raw in state.remaining_subjects:
    key = normalize(raw)
    if key in self.droppable:
        continue
    if key in state.retained_matched_subjects:
Comment thread scripts/README.md
Comment on lines 35 to 45
# Single work by OL ID
python scripts/migrate_subjects.py --work OL82563W

# From a local JSON file
python scripts/migrate_subjects.py --file work.json

# Batch from a newline-delimited list of OL IDs
python scripts/migrate_subjects.py --batch ol_ids.txt --output output/

# Dry run (print proposed mappings without writing)
python scripts/migrate_subjects.py --work OL82563W --dry-run
Comment thread scripts/README.md
rule_packs/
    content_formats.py      # current migration logic under active development
    subject_diagnostics.py  # minimal QA/support pack
    utils.py                # shared subject-pack execution helper
Comment on lines +50 to +54
for name in selected:
    if name in PACK_PRESETS:
        expanded.extend(PACK_PRESETS[name])
        continue
    expanded.append(name)
Comment on lines +154 to 163
parser.add_argument(
    "--pack",
    action="append",
    choices=AVAILABLE_PACK_NAMES,
    help="Enable only the named rule pack. Repeat to combine multiple packs.",
)

args = parser.parse_args()
classifier = SubjectClassifier()
classifier = build_subject_classifier(args.pack)

Comment thread core/json_loader.py
if not path.exists():
    return {}
with open(path) as handle:
    return json.load(handle)
Comment thread rule_packs/__init__.py
Comment on lines +3 to +10
from .content_formats import ContentFormatsPack
from .subject_diagnostics import SubjectDiagnosticsPack

SUBJECT_PACK_CLASSES = (ContentFormatsPack,)

FIELD_PACK_CLASSES = ()

ALL_PACK_CLASSES = SUBJECT_PACK_CLASSES + FIELD_PACK_CLASSES


Development

Successfully merging this pull request may close these issues.

Modularize the migration pipeline for reusable tag-type bots
