feat(importer): JSON-based laddr import producing legacy-import branch snapshots #57
Merged
themightychris added a commit that referenced this pull request on May 18, 2026.
Replaces the mysqldump-based laddr-import implementation with a JSON-fetching importer that produces full-snapshot commits on a `legacy-import` branch, then merges into main. Targets codeforphilly.org's `?format=json` endpoints.
Plan body covers: branching model, stable legacyId filenames, CLI shape, interactive dev loop, file/module changes (mysqldump path deleted), and the spec amendments to legacy-id-mapping.md that drop MySQL / single-big-commit framing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop "single big commit" / MySQL framing. The importer is now a re-runnable JSON fetcher that produces full-tree snapshot commits on a `legacy-import` branch, which the operator merges into `main` to integrate updates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mporter
Each run fetches the public laddr dataset from `codeforphilly.org`'s
`?format=json` endpoints (tags, people, projects, project-updates,
project-buzz) and writes a full-tree snapshot commit on the
`legacy-import` branch in the public data repo. Consecutive runs diff
cleanly to show what changed upstream.
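The fetch step described above can be sketched as a paging loop over laddr's `{ success, total, limit, offset, data }` envelope. This is a minimal sketch, not the importer's actual code: `Envelope`, `PageLoader`, `nextOffset`, and `fetchAll` are hypothetical names, the field types are assumed, and the loader is injected so the loop does not depend on any particular query-parameter names.

```typescript
// Envelope shape as named in the PR summary; the field types are assumptions.
interface Envelope<T> {
  success: boolean;
  total: number;
  limit: number;
  offset: number;
  data: T[];
}

// Hypothetical page loader. In real use it would wrap an HTTP request against
// a `?format=json` list endpoint; it is injected here so the loop runs offline.
type PageLoader<T> = (offset: number, limit: number) => Promise<Envelope<T>>;

// Returns the offset of the next page, or null when paging should stop.
function nextOffset(page: Envelope<unknown>, fetched: number): number | null {
  if (page.data.length === 0) return null; // an empty page must not loop forever
  return fetched >= page.total ? null : page.offset + page.data.length;
}

// Pages until the server-reported `total` has been collected.
async function fetchAll<T>(load: PageLoader<T>, pageSize: number): Promise<T[]> {
  const rows: T[] = [];
  let offset: number | null = 0;
  while (offset !== null) {
    const page: Envelope<T> = await load(offset, pageSize);
    if (!page.success) throw new Error(`request failed at offset ${offset}`);
    rows.push(...page.data);
    offset = nextOffset(page, rows.length);
  }
  return rows;
}
```

Keeping the stop condition in a small pure function makes the "page until `total` is reached" rule easy to test without touching the network.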
Differences from the prior mysqldump implementation:
- Reads JSON from the live site, not a SQL dump file. No fixture SQL
or mysqldump parser needed.
- Memberships and tag-assignments arrive via `?include=Tags,Memberships`
on the projects list (and `?include=Tags` on people) — no separate
`/project-memberships` or `/tag-assignments` list endpoints exist.
- Files on `legacy-import` are keyed by laddr's auto-increment ID
(`<sheet>/<legacyId>.toml`, composite for memberships and
tag-assignments) so re-runs overwrite stable paths.
- Full-tree replace per run, not per-entity upserts. The wipe + write
pattern is bare-git, not gitsheets transact, because the path
templates we want for diff-ability differ from the runtime spec's
slug-based paths. The legacy-import branch is parallel history:
runtime data lives on `main`, and the operator merges `legacy-import`
into it separately.
- UUIDs are carried forward from the previous snapshot when a path
already exists, so idempotence holds without depending on `now`.
- Pseudonymous author identity on every commit
(Code for Philly API <api@users.noreply.codeforphilly.org>).
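The stable-path and UUID carry-forward points in the list above might look roughly like this. It is an illustration under stated assumptions, not the importer's implementation: `snapshotPath`, `carryForwardIds`, the `-` separator for composite keys, and the injected `mint` function are all hypothetical.

```typescript
// Hypothetical helper: builds `<sheet>/<legacyId>.toml`. The `-` separator
// used for composite keys is an assumption; the real layout may differ.
function snapshotPath(sheet: string, ...legacyIds: number[]): string {
  return `${sheet}/${legacyIds.join("-")}.toml`;
}

// Reuses the UUID recorded at the same path in the previous snapshot and only
// mints a new one for paths that did not exist before, so a re-run over
// unchanged data yields identical IDs.
function carryForwardIds(
  paths: string[],
  previous: ReadonlyMap<string, string>,
  mint: () => string, // e.g. a UUID generator, injected to keep this testable
): Map<string, string> {
  const ids = new Map<string, string>();
  for (const path of paths) {
    ids.set(path, previous.get(path) ?? mint());
  }
  return ids;
}
```

Because an existing path never receives a freshly minted ID, two runs over the same upstream data produce byte-identical files, which is what lets consecutive snapshots diff cleanly.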
Translator robustness improvements drawn from the live data:
- Tag handles with the dot stripped by laddr's JSON renderer
(`topicparking`) are recovered from the Title field
(`topic.Parking`) when present.
- Tag slug components with underscores are coerced to hyphens.
- Bios over 10k chars (spam accounts) are truncated with a warning.
- Full names over 120 chars are truncated.
- ChatChannel is coerced through the v1 regex (lowercase, strip
leading `#`, replace non-allowed chars with `-`).
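A rough sketch of the coercions listed above follows. The function names are hypothetical, the allowed ChatChannel character set (`a-z`, `0-9`, `_`, `-`) is an assumption standing in for the v1 regex, and the `namespace/slug` output form follows the PR's test plan rather than any confirmed translator signature.

```typescript
const BIO_MAX = 10_000; // limits taken from the commit message
const NAME_MAX = 120;

function truncate(value: string, max: number): { value: string; truncated: boolean } {
  return value.length > max
    ? { value: value.slice(0, max), truncated: true }
    : { value, truncated: false };
}

// Lowercase, strip one leading `#`, replace anything outside the assumed
// allowed set with `-`.
function coerceChatChannel(raw: string): string {
  return raw.toLowerCase().replace(/^#/, "").replace(/[^a-z0-9_-]/g, "-");
}

// Returns `namespace/slug`, or null when no namespace can be resolved.
function recoverTagHandle(handle: string, title?: string): string | null {
  let h = handle.toLowerCase();
  if (!h.includes(".") && title) {
    const t = title.toLowerCase();
    // Only trust the Title when it is the handle with its dot restored.
    if (t.includes(".") && t.replace(".", "") === h) h = t;
  }
  const dot = h.indexOf(".");
  if (dot <= 0 || dot === h.length - 1) return null;
  const namespace = h.slice(0, dot);
  const slug = h.slice(dot + 1).replace(/_/g, "-"); // underscores become hyphens
  return `${namespace}/${slug}`;
}
```

For example, `recoverTagHandle("topicparking", "topic.Parking")` yields `topic/parking`, while a handle with no dot and no usable Title yields `null` and would be dropped.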
CLI surface:
    npm run -w apps/api script:import-laddr -- \
      --source-host=codeforphilly.org \
      --data-repo=$CFP_DATA_REPO_PATH \
      --branch=legacy-import \
      [--dry-run] [--no-commit] [--limit=N] [--verbose] \
      [--page-size=N] [--delay-ms=N]
Private-store import (emails, password hashes, newsletter prefs) is out
of scope — the JSON endpoints expose public fields only. That will be
covered by a separate plan (per laddr-import-via-json.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cutover-dry-run orchestrator was wired to the mysqldump-based importer
with `--sql` + `--private-store` arguments. With the JSON importer in place,
adapt:
- cutover-dry-run.ts now wraps importLaddrFromJson in dry-run mode and
compares per-sheet imported counts against the laddr server's reported
`total` for each list endpoint. Tolerable-diff thresholds carve out
known data-quality drops (tags with no resolvable namespace, http-only
buzz URLs).
- cutover-dry-run.test.ts uses an in-memory fetch mock instead of the SQL
fixture (which was deleted with the mysqldump-parser removal).
- docs/operations/cutover.md drops `--sql` from every command and rewords
the T-3, T-1, and T-0 steps to describe pulling from the live laddr
site and committing snapshots on the `legacy-import` branch.
- docs/operations/cutover-rollback.md updates the read-only-source line.
- specs/architecture.md rewrites the "Data migration" section to
describe the snapshot/merge model rather than "one big commit."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
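The count comparison described in this commit could be approximated as below. This is only a sketch: `compareCounts` and `SheetCheck` are hypothetical names, and treating each tolerance as an absolute number of records a sheet may lose is an assumption; the real orchestrator may define its thresholds differently.

```typescript
interface SheetCheck {
  sheet: string;
  imported: number;
  serverTotal: number;
  dropped: number;
  ok: boolean;
}

// `tolerances` maps a sheet name to the number of records it may lose
// relative to the server's reported `total` (assumed absolute thresholds).
function compareCounts(
  imported: Record<string, number>,
  serverTotals: Record<string, number>,
  tolerances: Record<string, number> = {},
): SheetCheck[] {
  return Object.keys(serverTotals).map((sheet) => {
    const serverTotal = serverTotals[sheet] ?? 0;
    const got = imported[sheet] ?? 0;
    const dropped = serverTotal - got;
    const allowed = tolerances[sheet] ?? 0;
    // Importing more records than the server reports also counts as a failure.
    return { sheet, imported: got, serverTotal, dropped, ok: dropped >= 0 && dropped <= allowed };
  });
}
```

A dry run would then fail if any returned entry has `ok === false`, with the known data-quality drops expressed as non-zero tolerances for the affected sheets.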
The first cut of the existing-IDs pre-pass called `git show HEAD:<file>` once per importer-owned TOML file. For a typical snapshot (~44k files), that is 44k fork+exec round trips, which took 7+ minutes to complete on the second run.
Replace with a single `git cat-file --batch` subprocess that streams blob contents in one stdin/stdout exchange. Verified against the full 44k-file snapshot: the pre-pass now finishes in seconds.
Also add a test verifying the "single-record-change" criterion from the plan: importing the same dataset twice with one project's Title flipped produces a commit whose diff is exactly that file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
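To illustrate the batch approach, here is a small parser for the stdout of `git cat-file --batch`, which answers each requested name with either `<oid> <type> <size>` followed by the contents and a newline, or `<name> missing`. The function name `parseBatchOutput` and the return shape are hypothetical; how the importer actually consumes the stream is not shown in this PR text.

```typescript
// Parses `git cat-file --batch` output for a known list of requested object
// names (for example `HEAD:people/1.toml`, written one per line to stdin).
// Sizes in the header are byte counts, so the buffer is sliced as bytes and
// decoded afterwards.
function parseBatchOutput(out: Uint8Array, requested: string[]): Map<string, string | null> {
  const decoder = new TextDecoder();
  const blobs = new Map<string, string | null>();
  let pos = 0;
  for (const name of requested) {
    const eol = out.indexOf(0x0a, pos);
    if (eol === -1) throw new Error(`truncated output before ${name}`);
    const header = decoder.decode(out.subarray(pos, eol));
    pos = eol + 1;
    if (header.endsWith(" missing")) {
      blobs.set(name, null); // path did not exist in the previous snapshot
      continue;
    }
    const size = Number(header.split(" ")[2]);
    if (!Number.isInteger(size)) throw new Error(`unexpected header: ${header}`);
    blobs.set(name, decoder.decode(out.subarray(pos, pos + size)));
    pos += size + 1; // skip the contents plus the trailing newline
  }
  return blobs;
}
```

In Node, the bytes could come from `spawnSync("git", ["cat-file", "--batch"], { input })` in `node:child_process`, though for a snapshot of this size a streaming `spawn` is likely the better fit because `spawnSync` buffers all output in memory.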
All 14 validation criteria verified end-to-end. Notes cover the endpoint-coverage reality (5 list endpoints + 2 includes, not 7 endpoints), the tag-handle JSON-renderer quirk, the idempotence mechanism (UUID carry-forward via `git cat-file --batch`), and the PII-grep nuance (literal pattern was too broad for laddr's freeform markdown; structured PII fields are absent).
Follow-ups:
- #56: project-buzz http-only URL drops
- #58: laddr tags with no resolvable namespace
- #59: operator runbook for push + merge to data repo
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- cutover-dry-run.ts: drop the redundant `= 0` assignment that the
try/catch immediately overrides (no-useless-assignment)
- importer.ts: convert `(_msg: string) => {}` to `(): void => {}`
(no-unused-vars); add `cause` to the error rethrown from
`ensureGitRepo` (preserve-caught-error)
- tests/import-laddr.test.ts: drop unused RawPersonSchema /
RawProjectSchema imports; rename `_` loop variable to `_row` with an
explicit eslint-disable so the no-unused-vars rule is silenced
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- … `codeforphilly.org`'s `?format=json` endpoints and writes a full-tree snapshot commit on the `legacy-import` branch in the public data repo.
- … `docs/operations/cutover.md`, `docs/operations/cutover-rollback.md`) and `specs/architecture.md` to the new model.
Spec changes
- `specs/behaviors/legacy-id-mapping.md` re-framed for the snapshot/merge model (drops "single big commit" + MySQL framing). First commit on this branch.
- `specs/architecture.md` Data migration section rewritten.
Implementation
- `apps/api/scripts/import-laddr/json-fetcher.ts` (new): HTTP + pagination + per-endpoint Zod schemas. Records arrive in laddr's standard `{ success, total, limit, offset, data: [...] }` envelope; we page until `total` is reached.
- `apps/api/scripts/import-laddr/translators.ts` (adapted): JSON-shape inputs replace the prior mysqldump-Row shape. Improvements drawn from live data: bio truncation, ChatChannel coercion, tag-handle-from-Title fallback (laddr's JSON renderer strips the dot in `topic.parking` → `topicparking`; Title carries `topic.Parking` so we recover).
- `apps/api/scripts/import-laddr/importer.ts` (new orchestrator): wipes importer-owned directories, fetches in FK order, writes one TOML per record keyed by `legacyId`, commits as `Code for Philly API <api@users.noreply.codeforphilly.org>`. Uses bare-git operations (not gitsheets transact) because the legacy-import branch's filenames are keyed differently from runtime path templates. `git cat-file --batch` is used for the existing-IDs pre-pass to handle 44k-file snapshots efficiently.
- `apps/api/scripts/import-laddr.ts` rewritten as a thin CLI shell.
- … `apps/api/scripts/import-laddr/mysqldump-parser.ts`, `apps/api/scripts/fixtures/laddr-fixture.sql`.
Endpoint findings
- `/project-memberships` and `/tag-assignments` 404 on the live site; those join records come via the project list's `?include=Tags,Memberships` and the people list's `?include=Tags` instead. Synthesized as TagAssignment + ProjectMembership records at translation time.
Real run snapshot (2026-05-18)
Per-resource record counts from a live run against `codeforphilly.org`: [counts table not preserved]
Test plan
- `npm run -w apps/api type-check` clean
- `npm run -w apps/api test -- tests/import-laddr.test.ts`: 22 tests pass (unit + orchestrator + idempotence + single-record-change)
- `npm run -w apps/api test -- tests/cutover-dry-run.test.ts`: 5 tests pass (adapted to mock fetch)
- `npm run -w apps/api test`: running at PR creation time; ping if any other suite breaks
- … `codeforphilly.org` produced commit `c81b1849c` on `legacy-import` in a scratch clone of the data repo
- … `legacy-import` → fresh `main`
- … `main` surfaces as a normal git merge conflict
- `help-wanted-roles/` files added on `main` survive a merge from `legacy-import`
- … `<legacyId>.toml` (or composite form)
- `Person.slackSamlNameId === Person.slug` (spot-checked across legacy IDs 1, 100, 1000, 37000)
- … `commenting | drifting | hibernating | maintaining | prototyping | testing`; no `Prototyping`-style casing)
- … `namespace/slug`; underscores coerced to hyphens; missing-dot handles recovered from Title
🤖 Generated with Claude Code