feat(importer): JSON-based laddr import producing legacy-import branch snapshots #57

Merged
themightychris merged 8 commits into main from feat/laddr-import-via-json on May 18, 2026
@themightychris

Summary

  • Replaces the previous mysqldump-based laddr importer with one that fetches the public laddr dataset from codeforphilly.org's ?format=json endpoints and writes a full-tree snapshot commit on the legacy-import branch in the public data repo.
  • Each run is re-runnable; consecutive commits diff cleanly to show what changed upstream. UUIDs are read forward from the previous snapshot's tree so identical source data produces identical trees.
  • Adapts the cutover-dry-run orchestrator + operator docs (docs/operations/cutover.md, docs/operations/cutover-rollback.md) and specs/architecture.md to the new model.

Spec changes

  • specs/behaviors/legacy-id-mapping.md re-framed for the snapshot/merge model (drops "single big commit" + MySQL framing). First commit on this branch.
  • specs/architecture.md Data migration section rewritten.

Implementation

  • apps/api/scripts/import-laddr/json-fetcher.ts (new) — HTTP + pagination + per-endpoint Zod schemas. Records arrive in laddr's standard { success, total, limit, offset, data: [...] } envelope; we page until total is reached.
  • apps/api/scripts/import-laddr/translators.ts (adapted) — JSON-shape inputs replace the prior mysqldump-Row shape. Improvements drawn from live data: bio truncation, ChatChannel coercion, tag-handle-from-Title fallback (laddr's JSON renderer strips the dot from topic.parking, yielding topicparking; Title carries topic.Parking, so the handle can be recovered from it).
  • apps/api/scripts/import-laddr/importer.ts (new orchestrator) — wipes importer-owned directories, fetches in FK order, writes one TOML per record keyed by legacyId, commits as Code for Philly API <api@users.noreply.codeforphilly.org>. Uses bare-git operations (not gitsheets transact) because the legacy-import branch's filenames are keyed differently from runtime path templates. git cat-file --batch is used for the existing-IDs pre-pass to handle 44k-file snapshots efficiently.
  • apps/api/scripts/import-laddr.ts rewritten as a thin CLI shell.
  • Deleted: apps/api/scripts/import-laddr/mysqldump-parser.ts, apps/api/scripts/fixtures/laddr-fixture.sql.
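
The paging behavior described for json-fetcher.ts could look roughly like the sketch below. The envelope fields come from the description above; `fetchAll`, `PageFetcher`, the default page size, and the empty-page guard are illustrative assumptions, and the per-endpoint Zod validation is left out.

```typescript
// Envelope shape as described in the PR: { success, total, limit, offset, data: [...] }.
interface Envelope<T> {
  success: boolean;
  total: number;
  limit: number;
  offset: number;
  data: T[];
}

// Hypothetical abstraction over the HTTP call so the loop can run without a network.
type PageFetcher<T> = (offset: number, limit: number) => Promise<Envelope<T>>;

// Page through a list endpoint until `total` records have been collected.
async function fetchAll<T>(fetchPage: PageFetcher<T>, pageSize = 100): Promise<T[]> {
  const records: T[] = [];
  let offset = 0;
  for (;;) {
    const page = await fetchPage(offset, pageSize);
    if (!page.success) {
      throw new Error(`laddr reported failure at offset ${offset}`);
    }
    records.push(...page.data);
    offset += page.data.length;
    // Stop once the server's total is reached, or on an empty page
    // (assumed guard against a total that shrinks mid-run).
    if (records.length >= page.total || page.data.length === 0) {
      return records;
    }
  }
}
```

Injecting the page fetcher keeps the loop testable with an in-memory mock, in the same spirit as the mocked fetch used by the adapted cutover-dry-run tests.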

Endpoint findings

/project-memberships and /tag-assignments 404 on the live site — those join records come via the project list's ?include=Tags,Memberships and the people list's ?include=Tags instead. Synthesized as TagAssignment + ProjectMembership records at translation time.
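
A minimal sketch of how join records might be synthesized from the included arrays at translation time. Every field name here (`ID`, `MemberID`, `Role`, `legacyProjectId`, and so on) is a hypothetical stand-in; the real laddr payload and the importer's Zod schemas may differ.

```typescript
// Hypothetical, simplified input shapes for a project fetched with ?include=Tags,Memberships.
interface RawTag { ID: number }
interface RawMembership { MemberID: number; Role: string | null }
interface RawProject { ID: number; Tags?: RawTag[]; Memberships?: RawMembership[] }

// Hypothetical output shapes for the synthesized join records.
interface ProjectMembership { legacyProjectId: number; legacyPersonId: number; role: string | null }
interface TagAssignment { legacyTagId: number; legacyProjectId: number }

// Flatten the nested include arrays into standalone join records.
function synthesizeJoins(projects: RawProject[]): {
  memberships: ProjectMembership[];
  tagAssignments: TagAssignment[];
} {
  const memberships: ProjectMembership[] = [];
  const tagAssignments: TagAssignment[] = [];
  for (const project of projects) {
    for (const m of project.Memberships ?? []) {
      memberships.push({ legacyProjectId: project.ID, legacyPersonId: m.MemberID, role: m.Role });
    }
    for (const t of project.Tags ?? []) {
      tagAssignments.push({ legacyTagId: t.ID, legacyProjectId: project.ID });
    }
  }
  return { memberships, tagAssignments };
}
```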

Real run snapshot (2026-05-18)

Per-resource record counts from a live run against codeforphilly.org:

  • people: 31,396 imported (6 dropped on slug normalization, 0 zod errors)
  • projects: 268 imported (0 errors)
  • project-memberships: 822 imported
  • project-updates: 504 imported (13 dropped on unresolved FKs)
  • project-buzz: 32 imported (81 dropped; see #56, importer: 81 of 113 project-buzz records skip on http:// URLs)
  • tags: 897 imported (120 dropped — handles with no resolvable namespace and no Title fallback)
  • tag-assignments: 10,105 imported

Test plan

  • npm run -w apps/api type-check clean
  • npm run -w apps/api test -- tests/import-laddr.test.ts — 22 tests pass (unit + orchestrator + idempotence + single-record-change)
  • npm run -w apps/api test -- tests/cutover-dry-run.test.ts — 5 tests pass (adapted to mock fetch)
  • Full npm run -w apps/api test — running at PR creation time; ping if any other suite breaks
  • End-to-end run against codeforphilly.org produced commit c81b1849c on legacy-import in a scratch clone of the data repo
  • Idempotent re-run: second commit's diff is exactly what changed on laddr between the two fetches (3 records: 1 modified, 2 new)
  • Clean fast-forward merge of legacy-import → fresh main
  • Conflicting edit on main surfaces as a normal git merge conflict
  • help-wanted-roles/ files added on main survive a merge from legacy-import
  • All filenames in importer-owned directories match <legacyId>.toml (or composite form)
  • Person.slackSamlNameId === Person.slug (spot-checked across legacy IDs 1, 100, 1000, 37000)
  • All stage values lowercase (commenting | drifting | hibernating | maintaining | prototyping | testing — no Prototyping-style casing)
  • Tags split into namespace/slug; underscores coerced to hyphens; missing-dot handles recovered from Title

🤖 Generated with Claude Code

themightychris added a commit that referenced this pull request May 18, 2026
All 14 validation criteria verified end-to-end. Notes cover the
endpoint-coverage reality (5 list endpoints + 2 includes, not 7
endpoints), the tag-handle JSON-renderer quirk, the idempotence
mechanism (UUID carry-forward via `git cat-file --batch`), and the
PII-grep nuance (literal pattern was too broad for laddr's freeform
markdown; structured PII fields are absent).

Follow-ups:
  - #56 — project-buzz http-only URL drops
  - #58 — laddr tags with no resolvable namespace
  - #59 — operator runbook for push + merge to data repo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@themightychris force-pushed the feat/laddr-import-via-json branch from 679e429 to e964e40 on May 18, 2026 23:23
themightychris and others added 8 commits May 18, 2026 19:26
Replaces the mysqldump-based laddr-import implementation with a JSON-fetching
importer that produces full-snapshot commits on a `legacy-import` branch, then
merges into main. Targets codeforphilly.org's `?format=json` endpoints.

Plan body covers: branching model, stable legacyId filenames, CLI shape,
interactive dev loop, file/module changes (mysqldump path deleted), and the
spec amendments to legacy-id-mapping.md that drop MySQL / single-big-commit
framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop "single big commit" / MySQL framing. The importer is now a re-runnable
JSON fetcher that produces full-tree snapshot commits on a `legacy-import`
branch, which the operator merges into `main` to integrate updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mporter

Each run fetches the public laddr dataset from `codeforphilly.org`'s
`?format=json` endpoints (tags, people, projects, project-updates,
project-buzz) and writes a full-tree snapshot commit on the
`legacy-import` branch in the public data repo. Consecutive runs diff
cleanly to show what changed upstream.

Differences from the prior mysqldump implementation:

  - Reads JSON from the live site, not a SQL dump file. No fixture SQL
    or mysqldump parser needed.
  - Memberships and tag-assignments arrive via `?include=Tags,Memberships`
    on the projects list (and `?include=Tags` on people) — no separate
    `/project-memberships` or `/tag-assignments` list endpoints exist.
  - Files on `legacy-import` are keyed by laddr's auto-increment ID
    (`<sheet>/<legacyId>.toml`, composite for memberships and
    tag-assignments) so re-runs overwrite stable paths.
  - Full-tree replace per run, not per-entity upserts. The wipe + write
    pattern is bare-git, not gitsheets transact, because the path
    templates we want for diff-ability differ from the runtime spec's
    slug-based paths. The legacy-import branch is parallel history:
    runtime data lives on `main`, into which the operator merges
    `legacy-import` separately.
  - UUIDs are read-forward from the previous snapshot when a path
    already exists, so idempotence holds without depending on `now`.
  - Pseudonymous author identity on every commit
    (Code for Philly API <api@users.noreply.codeforphilly.org>).
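
The UUID read-forward rule in the list above can be sketched as follows. The `<sheet>/<legacyId>.toml` path form is taken from this commit message; `resolveId`, `snapshotPath`, and the `ExistingIds` map are hypothetical names, and the map is assumed to have been filled by the existing-IDs pre-pass.

```typescript
import { randomUUID } from "node:crypto";

// Map from snapshot path (e.g. "people/42.toml") to the id stored in the previous snapshot.
type ExistingIds = ReadonlyMap<string, string>;

// Stable path for a record, keyed by laddr's auto-increment ID.
function snapshotPath(sheet: string, legacyId: number): string {
  return `${sheet}/${legacyId}.toml`;
}

// Reuse the previous snapshot's UUID when the path already exists; mint a new one otherwise.
function resolveId(
  existing: ExistingIds,
  sheet: string,
  legacyId: number,
  mint: () => string = randomUUID,
): string {
  return existing.get(snapshotPath(sheet, legacyId)) ?? mint();
}
```

Because a record's ID only changes when its path is new, importing identical source data twice yields byte-identical files, which is what keeps consecutive snapshots diffable.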

Translator robustness improvements drawn from the live data:

  - Tag handles with the dot stripped by laddr's JSON renderer
    (`topicparking`) are recovered from the Title field
    (`topic.Parking`) when present.
  - Tag slug components with underscores are coerced to hyphens.
  - Bios over 10k chars (spam accounts) are truncated with a warning.
  - Full names over 120 chars are truncated.
  - ChatChannel is coerced through the v1 regex (lowercase, strip
    leading `#`, replace non-allowed chars with `-`).
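
A rough sketch of the tag-handle and ChatChannel rules listed above. The function names are hypothetical, lowercasing the recovered Title is an assumption, and the allowed character set `[a-z0-9_-]` stands in for the v1 regex, which is not shown in this PR.

```typescript
interface TagParts { namespace: string; slug: string }

// Split a handle such as "topic.parking" into namespace/slug, falling back to
// Title (e.g. "topic.Parking") when the JSON renderer stripped the dot.
function splitTagHandle(handle: string, title?: string | null): TagParts | null {
  let source: string | null = null;
  if (handle.includes(".")) {
    source = handle;
  } else if (title != null && title.includes(".")) {
    source = title;
  }
  if (source === null) return null; // no resolvable namespace: the tag is dropped

  const dot = source.indexOf(".");
  const namespace = source.slice(0, dot).toLowerCase();
  // Underscores in the slug component are coerced to hyphens.
  const slug = source.slice(dot + 1).toLowerCase().replace(/_/g, "-");
  if (namespace === "" || slug === "") return null;
  return { namespace, slug };
}

// Lowercase, strip leading '#', replace characters outside the assumed allowed set with '-'.
function coerceChatChannel(raw: string): string {
  return raw.toLowerCase().replace(/^#+/, "").replace(/[^a-z0-9_-]/g, "-");
}
```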

CLI surface:

  npm run -w apps/api script:import-laddr -- \
    --source-host=codeforphilly.org \
    --data-repo=$CFP_DATA_REPO_PATH \
    --branch=legacy-import \
    [--dry-run] [--no-commit] [--limit=N] [--verbose] \
    [--page-size=N] [--delay-ms=N]

Private-store import (emails, password hashes, newsletter prefs) is out
of scope — the JSON endpoints expose public fields only. That will be
covered by a separate plan (per laddr-import-via-json.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cutover-dry-run orchestrator was wired to the mysqldump-based importer
with `--sql` + `--private-store` arguments. With the JSON importer in place,
adapt:

  - cutover-dry-run.ts now wraps importLaddrFromJson in dry-run mode and
    compares per-sheet imported counts against the laddr server's reported
    `total` for each list endpoint. Tolerable-diff thresholds carve out
    known data-quality drops (tags with no resolvable namespace, http-only
    buzz URLs).
  - cutover-dry-run.test.ts uses an in-memory fetch mock instead of the SQL
    fixture (which was deleted with the mysqldump-parser removal).
  - docs/operations/cutover.md drops `--sql` from every command and rewords
    the T-3, T-1, and T-0 steps to describe pulling from the live laddr
    site and committing snapshots on the `legacy-import` branch.
  - docs/operations/cutover-rollback.md updates the read-only-source line.
  - specs/architecture.md rewrites the "Data migration" section to
    describe the snapshot/merge model rather than "one big commit."
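
The count comparison with tolerable-diff thresholds might be sketched like this. `checkCounts` and the `tolerableDrops` field are hypothetical names, and the real orchestrator may express its thresholds differently.

```typescript
interface SheetCheck {
  sheet: string;
  serverTotal: number;    // `total` reported by the laddr list endpoint
  imported: number;       // records the dry-run import produced
  tolerableDrops: number; // hypothetical per-sheet allowance for known data-quality drops
}

interface SheetResult { sheet: string; dropped: number; ok: boolean }

// Flag any sheet whose drop count is negative or exceeds its allowance.
function checkCounts(checks: SheetCheck[]): SheetResult[] {
  return checks.map(({ sheet, serverTotal, imported, tolerableDrops }) => {
    const dropped = serverTotal - imported;
    return { sheet, dropped, ok: dropped >= 0 && dropped <= tolerableDrops };
  });
}
```

With the figures from the PR description, project-buzz (113 on the server, 32 imported) yields 81 drops, so it passes only if its allowance is at least 81.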

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first cut of the existing-IDs pre-pass called `git show HEAD:<file>`
once per importer-owned TOML file. For a typical snapshot (~44k files),
that is 44k fork+exec round trips, which took 7+ minutes to complete on
the second run.

Replace with a single `git cat-file --batch` subprocess that streams blob
contents in one stdin/stdout exchange. Verified against the full 44k-file
snapshot — pre-pass now finishes in seconds.
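
As an illustration of the batch approach, the sketch below parses the stdout of a single `git cat-file --batch` process. It assumes the caller wrote one `HEAD:<path>` line per file to the process's stdin and collected all of stdout into a buffer; `parseCatFileBatch` is a hypothetical helper, not the importer's actual code.

```typescript
import { Buffer } from "node:buffer";

interface BatchEntry { oid: string; type: string; content: Buffer }

// Parse `git cat-file --batch` output. Each found object is emitted as
// "<oid> <type> <size>\n<content>\n"; a missing one as "<name> missing\n".
// Entries come back in request order, so a caller that needs to map them to
// paths must account for the skipped "missing" lines itself.
function parseCatFileBatch(output: Buffer): BatchEntry[] {
  const entries: BatchEntry[] = [];
  let pos = 0;
  while (pos < output.length) {
    const eol = output.indexOf(0x0a, pos);
    if (eol === -1) throw new Error("truncated header");
    const header = output.subarray(pos, eol).toString("utf8");
    pos = eol + 1;
    if (header.endsWith(" missing")) continue;

    const [oid, type, sizeText] = header.split(" ");
    const size = Number(sizeText);
    if (!oid || !type || !Number.isInteger(size)) {
      throw new Error(`unexpected header: ${header}`);
    }
    // Slice by byte count: blob contents may themselves contain newlines.
    entries.push({ oid, type, content: output.subarray(pos, pos + size) });
    pos += size + 1; // skip the content plus its trailing newline
  }
  return entries;
}
```

Reading by the declared size instead of splitting on newlines is what lets one stdin/stdout exchange replace the per-file `git show` calls.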

Also add a test verifying the "single-record-change" criterion from the
plan: importing the same dataset twice with one project's Title flipped
produces a commit whose diff is exactly that file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 14 validation criteria verified end-to-end. Notes cover the
endpoint-coverage reality (5 list endpoints + 2 includes, not 7
endpoints), the tag-handle JSON-renderer quirk, the idempotence
mechanism (UUID carry-forward via `git cat-file --batch`), and the
PII-grep nuance (literal pattern was too broad for laddr's freeform
markdown; structured PII fields are absent).

Follow-ups:
  - #56 — project-buzz http-only URL drops
  - #58 — laddr tags with no resolvable namespace
  - #59 — operator runbook for push + merge to data repo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  - cutover-dry-run.ts: drop the redundant `= 0` assignment that the
    try/catch immediately overrides (no-useless-assignment)
  - importer.ts: convert `(_msg: string) => {}` to `(): void => {}`
    (no-unused-vars); add `cause` to the error rethrown from
    `ensureGitRepo` (preserve-caught-error)
  - tests/import-laddr.test.ts: drop unused RawPersonSchema /
    RawProjectSchema imports; rename `_` loop variable to `_row` with an
    explicit eslint-disable so the no-unused-vars rule is silenced

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@themightychris force-pushed the feat/laddr-import-via-json branch from e964e40 to b410de3 on May 18, 2026 23:27
@themightychris merged commit 9bfaab6 into main on May 18, 2026
1 check passed
@themightychris deleted the feat/laddr-import-via-json branch on May 18, 2026 23:29