Refactor/seperate common module#3
Conversation
|
I’ve completed a simple The dry-run output now makes it possible to review: the proposed Command for test: |
There was a problem hiding this comment.
Pull request overview
This PR refactors the legacy subject-migration script into a small reusable “classification core” (core/ + rule_engine/ + rules/) driven by explicit rule packs (rule_packs/), and relocates mapping data from scripts/ into resources/mappings/ to support modular, pack-by-pack dry runs.
Changes:
- Replaced the monolithic
scripts/migrate_subjects.pyclassifier with pack-driven execution usingcore.subject_classifier.SubjectClassifier. - Introduced a minimal rule-pack framework (
rule_engine/,rules/,rule_packs/) including shared sequentialRunStateand subject-migration helpers. - Moved mapping JSON from
scripts/mappings/toresources/mappings/and narrowed documented scope tocontent_formats+subject_diagnostics.
Reviewed changes
Copilot reviewed 25 out of 27 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/migrate_subjects.py | Runner now builds a classifier from selected pack(s) and emits a proposal-style report. |
| scripts/README.md | Updated docs to describe pack-based dry runs and new report format. |
| core/subject_classifier.py | New orchestration core: runs packs over RunState and returns a report. |
| core/run_state.py | Shared sequential state (results + shrinking subjects + match audit). |
| core/json_loader.py | Loads mappings/droppable sets from resources/mappings/. |
| rule_engine/base.py | Defines the RulePack interface. |
| rule_engine/normalization.py | Shared normalization + reading-level / classification detection helpers. |
| rule_engine/init.py | Exposes rule-engine primitives. |
| rules/match_result.py | Introduces RuleMatch (value + action). |
| rules/mapping_rule.py | Mapping-based matcher using normalized lookups. |
| rules/prefix_rule.py | Prefix matcher for values like format:Diary. |
| rules/init.py | Exports composable rule primitives. |
| rule_packs/subject_migration.py | Shared helper + base pack for subject-list migration with move/extract_only. |
| rule_packs/content_formats.py | content_formats pack using mapping + prefix rules with move/extract policy split. |
| rule_packs/subject_diagnostics.py | Diagnostics pack for droppable/reading-level/classification/unmapped handling. |
| rule_packs/init.py | Exports pack classes and (currently partial) pack-class tuples. |
| resources/mappings/content_formats.json | New location for content format mapping data. |
| resources/mappings/droppable.json | New location for droppable legacy-subject strings. |
| demo_content_formats.json | Adds a demo work JSON for local dry-run experimentation. |
| scripts/mappings/audience.json | Removed legacy mapping file (moved/retired under new architecture). |
| scripts/mappings/genres.json | Removed legacy mapping file (moved/retired under new architecture). |
| scripts/mappings/subgenres.json | Removed legacy mapping file (moved/retired under new architecture). |
| scripts/mappings/literary_themes.json | Removed legacy mapping file (moved/retired under new architecture). |
| scripts/mappings/literary_tropes.json | Removed legacy mapping file (moved/retired under new architecture). |
| scripts/mappings/main_topics.json | Removed legacy mapping file (moved/retired under new architecture). |
| scripts/mappings/people_overrides.json | Removed legacy override file (moved/retired under new architecture). |
| scripts/mappings/places_overrides.json | Removed legacy override file (moved/retired under new architecture). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| REPO_ROOT = Path(__file__).resolve().parent.parent | ||
| if str(REPO_ROOT) not in sys.path: | ||
| sys.path.insert(0, str(REPO_ROOT)) | ||
|
|
||
| from core.subject_classifier import SubjectClassifier | ||
| from rule_packs.content_formats import ContentFormatsPack | ||
| from rule_packs.subject_diagnostics import SubjectDiagnosticsPack |
| for raw in state.remaining_subjects: | ||
| key = normalize(raw) | ||
| if key in self.droppable: | ||
| continue | ||
| if key in state.retained_matched_subjects: |
| # Single work by OL ID | ||
| python scripts/migrate_subjects.py --work OL82563W | ||
|
|
||
| # From a local JSON file | ||
| python scripts/migrate_subjects.py --file work.json | ||
|
|
||
| # Batch from a newline-delimited list of OL IDs | ||
| python scripts/migrate_subjects.py --batch ol_ids.txt --output output/ | ||
|
|
||
| # Dry run (print proposed mappings without writing) | ||
| python scripts/migrate_subjects.py --work OL82563W --dry-run |
| rule_packs/ | ||
| content_formats.py # current migration logic under active development | ||
| subject_diagnostics.py # minimal QA/support pack | ||
| utils.py # shared subject-pack execution helper |
| for name in selected: | ||
| if name in PACK_PRESETS: | ||
| expanded.extend(PACK_PRESETS[name]) | ||
| continue | ||
| expanded.append(name) |
| parser.add_argument( | ||
| "--pack", | ||
| action="append", | ||
| choices=AVAILABLE_PACK_NAMES, | ||
| help="Enable only the named rule pack. Repeat to combine multiple packs.", | ||
| ) | ||
|
|
||
| args = parser.parse_args() | ||
| classifier = SubjectClassifier() | ||
| classifier = build_subject_classifier(args.pack) | ||
|
|
| if not path.exists(): | ||
| return {} | ||
| with open(path) as handle: | ||
| return json.load(handle) |
| from .content_formats import ContentFormatsPack | ||
| from .subject_diagnostics import SubjectDiagnosticsPack | ||
|
|
||
| SUBJECT_PACK_CLASSES = (ContentFormatsPack,) | ||
|
|
||
| FIELD_PACK_CLASSES = () | ||
|
|
||
| ALL_PACK_CLASSES = SUBJECT_PACK_CLASSES + FIELD_PACK_CLASSES |
Close #2
Description
Summary
This PR extracts the legacy
scripts/migrate_subjects.pylogic into a reusable classification core with explicit rule packs and sequential shared state.The main goal is to stop treating subject migration as one monolithic script and instead make it possible to:
This branch also moves mapping data out of
scripts/intoresources/mappings/and adds an explicit shell entry point for the legacy full run.Architecture
The new layout is split into a few narrow layers:
core/SubjectClassifieris now just the work-level runnerclassifier_assembler.pyandpack_registry.pyRunStateholds shared sequential state during a classification runrule_engine/RulePackdefines the pack contractrules/rule_packs/State design
Classification is now explicitly sequential.
SubjectClassifier.classify_work()creates aRunStatecontaining:workresultremaining_subjects, a working copy of the legacysubjectslistPacks run in order and can consume matched subject strings from
remaining_subjects. That preserves the legacy behavior where earlier matches reduce the input seen by later passes, while still keeping the core generic enough for future pack composition.The shared mutable state is intentionally small. The only cross-pack coordination is:
remaining_subjectslistWhat changed
scripts/migrate_subjects.pycore/subject_classifier.pyas the orchestration-only classifiercore/run_state.pyfor shared sequential execution statecore/classifier_assembler.pyandcore/pack_registry.pyfor pack resolution / constructionrule_engine/for the minimal pack interface and normalization helpersrules/for reusable rule primitivesrule_packs/modules:literary_formaudiencegenressubgenrescontent_formatsmoodsliterary_themesliterary_tropesmain_topicspeopleplacestimessubject_diagnosticsscripts/mappings/toresources/mappings/core/migrate_subject_classifier.pyas a compatibility shim for older importsscripts/README.mdto document the new structure and execution modelCLI / execution changes
migrate_subjects.pyno longer silently enables a default full preset when--packis omitted.Instead:
--packexplicitly for partial / targeted runsLegacy-compatible full execution is now:
That wrapper expands to a fixed ordered pack list:
This keeps the old end-to-end flow available, but makes the execution order explicit and inspectable.
Why this structure
The old script mixed together:
That made it difficult to work on one tag type at a time or to reuse the classifier in a different runner.
This refactor moves toward a narrower core:
work + ordered packs -> resultThat gives us a base we can later reuse in:
without putting those concerns back into the classifier itself.
Notes
This PR is mainly a structural refactor. The intent is to preserve the legacy migration flow while making the system modular enough for future pack-by-pack development and rollout.