fix: restore sponsor stats after GOV.UK CSV format change (May 2026) #47
Merged
GOV.UK dropped Town/City from the register CSV in May 2026. Zod validation required licenceStatus/licenceType/rating, which the new format cannot supply, so 100% of rows were rejected and the fingerprinted CSV came out empty. csvdiff then reported the entire register as deleted, and the state machine mass-removed all 143k sponsors, leaving every stats counter at 0.

Changes:
- sponsorRowSchema: make licenceStatus/licenceType/rating optional so rows are never rejected solely because enum columns are absent
- sponsorCsvColumns: new shared module that maps both old and new GOV.UK CSV header layouts to column indexes; replaces diverging logic in csvArchiver and csvFingerprintBuilder
- csvArchiver/csvFingerprintBuilder: consume the shared resolver; abort with an error (instead of writing empty output) when >20% of rows are rejected, which prevents a repeat of the silent mass-removal incident
- sponsorStateMachine: Phase C2 self-heal sweep resurrects sponsors that were previously removed but whose fingerprint reappears in today's register; mass-removal circuit breaker blocks the run if >20% of live records would be deleted in a single pass
- migrations/0021: rewrites stored fingerprints name|town|route → name||route across 9 tables (with dedup) to match what the updated generator now produces for the town-less new format
- tests: sponsorCsvNewFormat.test.ts confirms both CSV layouts parse and fingerprint correctly (3 new-format rows accepted, 0 rejected)
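The rejection-rate abort described for csvArchiver/csvFingerprintBuilder could look roughly like the sketch below. It is illustrative only and not the PR's code: `assertRejectionRateOk`, `ParseResult` and `MAX_REJECTION_FRACTION` are hypothetical names, and treating a file with zero data rows as an error is an assumption. Only the 20% threshold comes from the description.

```typescript
// Hypothetical constant; the 20% figure is taken from the PR description.
const MAX_REJECTION_FRACTION = 0.2;

// Hypothetical shape for the outcome of validating every CSV row.
interface ParseResult<T> {
  accepted: T[];
  rejectedCount: number;
}

// Throws instead of letting an empty or near-empty dataset flow downstream.
function assertRejectionRateOk<T>(result: ParseResult<T>): T[] {
  const total = result.accepted.length + result.rejectedCount;
  if (total === 0) {
    // Assumption: an input with no data rows is treated as a failure too.
    throw new Error("Register CSV contained no data rows; refusing to write empty output.");
  }
  const rejectedFraction = result.rejectedCount / total;
  if (rejectedFraction > MAX_REJECTION_FRACTION) {
    throw new Error(
      `Rejected ${result.rejectedCount} of ${total} rows ` +
        `(${(rejectedFraction * 100).toFixed(1)}%), above the ` +
        `${MAX_REJECTION_FRACTION * 100}% limit; aborting.`,
    );
  }
  return result.accepted;
}
```

Failing before any output is written means csvdiff never sees a truncated register, which is the step that turned the validation failure into a mass removal.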
Pull request overview
This PR hardens the sponsor monitor ETL against GOV.UK register CSV layout drift (May 2026) to prevent silent full-register deletion cascades, and restores correct sponsor stats by making parsing/fingerprinting tolerant to missing enum columns while adding safety rails in the state machine and a fingerprint rewrite migration.
Changes:
- Make `licenceStatus`/`licenceType`/`rating` optional in Zod validation to prevent 100% row rejection when enum columns are missing.
- Introduce a shared CSV header→column-index resolver and use it in both the CSV archiver and fingerprint builder, aborting loudly when validation rejection rate is >20%.
- Add state-machine self-heal (presence-based resurrection), a mass-removal circuit breaker, and a DB migration to rewrite stored fingerprints to the new town-less format, plus regression tests for the new CSV layout.
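A shared header→column-index resolver of the kind summarised above might be sketched as follows. This is not the contents of sponsorCsvColumns.ts: `resolveColumns`, `LogicalColumn` and `HEADER_ALIASES` are hypothetical names, the `Organisation Name` header is an assumption, and the pairing of legacy headers with May 2026 headers is a guess made for illustration.

```typescript
// Hypothetical logical column names used by downstream code.
type LogicalColumn = "name" | "town" | "typeRating" | "route";

// Assumed alias table. Most header strings are quoted in this PR, but which
// new header replaces which legacy header is a guess, as is "Organisation Name".
const HEADER_ALIASES: Record<LogicalColumn, string[]> = {
  name: ["Organisation Name"],
  town: ["Town/City"],
  typeRating: ["Type & Rating", "TierRating"],
  route: ["Route", "Migrant Classification"],
};

// Town is deliberately not required: the May 2026 layout no longer has it.
const REQUIRED_COLUMNS: LogicalColumn[] = ["name", "typeRating", "route"];

function resolveColumns(headerRow: string[]): Partial<Record<LogicalColumn, number>> {
  const normalised = headerRow.map((h) => h.trim().toLowerCase());
  const resolved: Partial<Record<LogicalColumn, number>> = {};
  for (const column of Object.keys(HEADER_ALIASES) as LogicalColumn[]) {
    const index = normalised.findIndex((h) =>
      HEADER_ALIASES[column].some((alias) => alias.toLowerCase() === h),
    );
    if (index !== -1) resolved[column] = index;
  }
  const missing = REQUIRED_COLUMNS.filter((c) => resolved[c] === undefined);
  if (missing.length > 0) {
    // Fail loudly on an unknown layout rather than silently misreading columns.
    throw new Error(`Unrecognised CSV layout; missing columns: ${missing.join(", ")}`);
  }
  return resolved;
}
```

Keeping one resolver for both consumers avoids the archiver and the fingerprint builder drifting apart the next time the header row changes.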
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| server/utils/sponsorStateMachine.ts | Adds Phase C2 self-heal sweep, mass-removal circuit breaker, and guards Phase D2 on trustworthy fingerprint set size. |
| server/utils/sponsorRowSchema.ts | Makes enum metadata optional to avoid rejecting rows when new CSV layouts omit those columns. |
| server/utils/sponsorCsvColumns.ts | New shared module to resolve column indexes across legacy + May 2026 header layouts. |
| server/utils/csvFingerprintBuilder.ts | Switches to shared column resolver and aborts/removes output on >20% validation rejection. |
| server/utils/csvArchiver.ts | Switches to shared column resolver and aborts on >20% validation rejection instead of returning a near-empty dataset. |
| server/utils/tests/sponsorCsvNewFormat.test.ts | Adds regression coverage ensuring both old and new layouts parse and fingerprint correctly. |
| migrations/0021_fingerprint_town_removal.sql | Rewrites stored fingerprints from `name\|town\|route` to `name\|\|route` to match the town-less format. |
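The fingerprint rewrite performed by migration 0021 is written in SQL and is not shown on this page. As a rough TypeScript illustration of the same transformation (drop the town segment, then deduplicate), under the assumption that fingerprints have exactly three `|`-separated segments and that no segment itself contains `|`; `toTownlessFingerprint` and `rewriteAndDedup` are hypothetical helper names:

```typescript
// "name|town|route" -> "name||route". Already town-less values pass through unchanged.
function toTownlessFingerprint(fingerprint: string): string {
  const parts = fingerprint.split("|");
  // Assumption: anything that is not three segments is left untouched.
  if (parts.length !== 3) return fingerprint;
  return `${parts[0]}||${parts[2]}`;
}

// Rewriting can make two formerly distinct fingerprints collide (same name and
// route, different town), so duplicates are collapsed, keeping first-seen order.
function rewriteAndDedup(fingerprints: string[]): string[] {
  return Array.from(new Set(fingerprints.map(toTownlessFingerprint)));
}
```

The dedup step matters because a sponsor name that previously appeared in two towns on the same route collapses to a single fingerprint once the town is removed.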
Comments suppressed due to low confidence (1)
server/utils/sponsorStateMachine.ts:761
- Phase D2 removals (confirming GRACE_PERIOD sponsors absent in both D-1 and D) are not included in the mass-removal circuit breaker. This means a large GRACE_PERIOD pool could still be bulk-removed in Phase D2 without tripping the >20% safeguard described in the PR.
```typescript
const toRemoveD2 = allGracePeriod.filter(
  (r) => !processedInPhaseD.has(r.fingerprint) && !todayFingerprintSet.has(r.fingerprint),
);
if (toRemoveD2.length > 0) {
  for (let i = 0; i < toRemoveD2.length; i += BATCH_SIZE) {
```
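One possible way to address the suppressed comment is to compute the Phase D2 candidates before any write and add them to the figure the circuit breaker checks. The sketch below is an assumption about how that could be done, not what the PR or its follow-up commit does. The filter mirrors the excerpt above; `SponsorRecord`, `selectPhaseD2Removals` and `totalDeletionImpact` are hypothetical names.

```typescript
// Hypothetical minimal record shape; the real rows carry more fields.
interface SponsorRecord {
  fingerprint: string;
}

// Same predicate as the excerpt: grace-period sponsors not handled in Phase D
// and absent from today's register.
function selectPhaseD2Removals(
  allGracePeriod: SponsorRecord[],
  processedInPhaseD: Set<string>,
  todayFingerprintSet: Set<string>,
): SponsorRecord[] {
  return allGracePeriod.filter(
    (r) => !processedInPhaseD.has(r.fingerprint) && !todayFingerprintSet.has(r.fingerprint),
  );
}

// Combines both removal sources so the breaker sees the whole run's impact.
function totalDeletionImpact(phaseDRemovalCount: number, toRemoveD2: SponsorRecord[]): number {
  return phaseDRemovalCount + toRemoveD2.length;
}
```

Because the selection is a pure function of the three inputs, it can run ahead of the batched writes without changing which records Phase D2 later removes.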
Comment on lines +694 to +698

```typescript
const msg =
  `Mass-removal circuit breaker tripped: run wants to remove/grace ${deletionImpact.toLocaleString()} ` +
  `of ${liveCount.toLocaleString()} live sponsors (> ${MASS_REMOVAL_FRACTION * 100}%). ` +
  `Aborting before any status changes are applied. ` +
  `Set SPONSOR_ALLOW_MASS_REMOVAL=1 to override deliberately.`;
```
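For context, the check that produces a message like this could be sketched as below. `MASS_REMOVAL_FRACTION`, `deletionImpact`, `liveCount` and the `SPONSOR_ALLOW_MASS_REMOVAL` variable appear in the diff; the function name `assertMassRemovalAllowed`, the `allowOverride` parameter (standing in for however the environment variable is read) and the handling of an empty live set are assumptions.

```typescript
// 20% threshold, as stated in the PR description.
const MASS_REMOVAL_FRACTION = 0.2;

// Throws before any status change if too large a share of live sponsors
// would be removed or moved to grace in a single run.
function assertMassRemovalAllowed(
  deletionImpact: number,
  liveCount: number,
  allowOverride: boolean, // assumed to be derived from SPONSOR_ALLOW_MASS_REMOVAL=1
): void {
  if (allowOverride) return;
  // Assumption: with no live sponsors there is nothing to protect.
  if (liveCount === 0) return;
  if (deletionImpact / liveCount > MASS_REMOVAL_FRACTION) {
    throw new Error(
      `Mass-removal circuit breaker tripped: ${deletionImpact} of ${liveCount} ` +
        `live sponsors (> ${MASS_REMOVAL_FRACTION * 100}%).`,
    );
  }
}
```

Passing the override in as a boolean keeps the check easy to unit test without touching process environment state.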
Comment on lines +82 to +89

```typescript
// Enum metadata is optional: the GOV.UK register CSV has shipped at least
// two column layouts (legacy "Type & Rating"/"Route" and the May 2026
// "TierRating"/"Migrant Classification"/"Sponsor Status" format), and not
// every layout carries every field. Requiring these caused 100% row
// rejection → empty fingerprinted CSV → the 2026-05-20 mass-removal
// incident. A row is identified by name + typeRating + route; enum
// metadata enriches it but must never reject it.
licenceStatus: SponsorLicenceStatusSchema.optional(),
```
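The rule stated in that code comment (identity fields are required, enum metadata only enriches) can be illustrated without Zod. The sketch below is a dependency-free stand-in, not the PR's schema: `parseSponsorRow`, `SponsorRow` and the `allowedStatuses` parameter are hypothetical. It drops an unrecognised status instead of rejecting the row, which is one reading of "must never reject it" and differs from a plain Zod `.optional()`, where a present but invalid value still fails validation.

```typescript
// Hypothetical parsed row: identity fields required, enum metadata optional.
interface SponsorRow {
  name: string;
  typeRating: string;
  route: string;
  licenceStatus?: string;
}

function parseSponsorRow(
  raw: Record<string, string | undefined>,
  allowedStatuses: ReadonlySet<string>, // real enum values are not shown in this PR
): SponsorRow | null {
  const name = raw.name?.trim();
  const typeRating = raw.typeRating?.trim();
  const route = raw.route?.trim();
  // Only the identity fields can cause a rejection.
  if (!name || !typeRating || !route) return null;

  const row: SponsorRow = { name, typeRating, route };
  const status = raw.licenceStatus?.trim();
  // A missing or unrecognised status is ignored rather than failing the row.
  if (status && allowedStatuses.has(status)) row.licenceStatus = status;
  return row;
}
```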
3 tasks
Sam-Aitech added a commit that referenced this pull request on Jun 12, 2026:
fix: address PR #47 review findings — column resolver, guard alerts, test coverage


