fix: restore sponsor stats after GOV.UK CSV format change (May 2026) #47

Merged
Sam-Aitech merged 1 commit into main from feat/readme-visuals on Jun 11, 2026
@Sam-Aitech
Owner

GOV.UK dropped Town/City from the register CSV in May 2026. Zod validation required licenceStatus/licenceType/rating, which the new format cannot supply — 100% row rejection produced an empty fingerprinted CSV. csvdiff then reported the entire register as deleted, and the state machine mass-removed all 143k sponsors, leaving every stats counter at 0.

Changes:

  • sponsorRowSchema: make licenceStatus/licenceType/rating optional so rows are never rejected solely because enum columns are absent
  • sponsorCsvColumns: new shared module that maps both old and new GOV.UK CSV header layouts to column indexes; replaces diverging logic in csvArchiver and csvFingerprintBuilder
  • csvArchiver/csvFingerprintBuilder: consume shared resolver; abort with error (instead of writing empty output) when >20% of rows are rejected — prevents a repeat of the silent mass-removal incident
  • sponsorStateMachine: Phase C2 self-heal sweep resurrects sponsors whose fingerprint reappears in today's register but were previously removed; mass-removal circuit breaker blocks the run if >20% of live records would be deleted in a single pass
  • migrations/0021: rewrites stored fingerprints name|town|route → name||route across 9 tables (with dedup) to match what the updated generator now produces for the town-less new format
  • tests: sponsorCsvNewFormat.test.ts confirms both CSV layouts parse and fingerprint correctly (3 new-format rows accepted, 0 rejected)
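
The abort-on-rejection rule described above could be sketched roughly as follows. This is a minimal illustration, not the code in csvArchiver or csvFingerprintBuilder: `ValidationTally`, `assertRejectionRateAcceptable`, and `MAX_REJECTION_FRACTION` are hypothetical names, and treating an empty input as a failure is an assumption.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the repository's code.
const MAX_REJECTION_FRACTION = 0.2; // the ">20%" threshold named in the PR description

interface ValidationTally {
  accepted: number; // rows that passed schema validation
  rejected: number; // rows that failed schema validation
}

/** Throws when too many rows failed validation, so no empty or near-empty output is written. */
function assertRejectionRateAcceptable(tally: ValidationTally): void {
  const total = tally.accepted + tally.rejected;
  if (total === 0) {
    // Assumption: an empty input is also an error, since it would produce an empty output.
    throw new Error("No rows were read from the register CSV");
  }
  const fraction = tally.rejected / total;
  if (fraction > MAX_REJECTION_FRACTION) {
    throw new Error(
      `Rejected ${tally.rejected} of ${total} rows ` +
        `(${(fraction * 100).toFixed(1)}%); aborting before writing output`,
    );
  }
}
```

With a guard of this shape, a 100% rejection run fails loudly before csvdiff ever sees an empty file, instead of being interpreted downstream as "every sponsor was deleted".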

Copilot AI review requested due to automatic review settings June 11, 2026 03:57
@sonarqubecloud

@Sam-Aitech Sam-Aitech merged commit ba39341 into main Jun 11, 2026
14 checks passed
@Sam-Aitech Sam-Aitech deleted the feat/readme-visuals branch June 11, 2026 04:01

Copilot AI left a comment
Contributor

Pull request overview

This PR hardens the sponsor monitor ETL against GOV.UK register CSV layout drift (May 2026) to prevent silent full-register deletion cascades. It restores correct sponsor stats by making parsing and fingerprinting tolerant of missing enum columns, and it adds safety rails in the state machine plus a fingerprint rewrite migration.

Changes:

  • Make licenceStatus / licenceType / rating optional in Zod validation to prevent 100% row rejection when enum columns are missing.
  • Introduce a shared CSV header→column-index resolver and use it in both the CSV archiver and fingerprint builder, aborting loudly when validation rejection rate is >20%.
  • Add state-machine self-heal (presence-based resurrection), a mass-removal circuit breaker, and a DB migration to rewrite stored fingerprints to the new town-less format, plus regression tests for the new CSV layout.
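
The presence-based resurrection mentioned in the last bullet might look something like the sketch below. It assumes a simple in-memory record shape; `SponsorRecord`, `selectForResurrection`, and the `ACTIVE` and `REMOVED` status names are hypothetical (only `GRACE_PERIOD` appears in this PR), and the real Phase C2 sweep presumably works against the database in batches.

```typescript
// Hypothetical sketch of a presence-based self-heal selection.
// Only GRACE_PERIOD is named in the PR; ACTIVE and REMOVED are assumed status names.
type SponsorStatus = "ACTIVE" | "GRACE_PERIOD" | "REMOVED";

interface SponsorRecord {
  fingerprint: string;
  status: SponsorStatus;
}

/**
 * Returns the previously removed sponsors whose fingerprint is present in
 * today's register, i.e. the candidates for resurrection.
 */
function selectForResurrection(
  records: readonly SponsorRecord[],
  todayFingerprints: ReadonlySet<string>,
): SponsorRecord[] {
  return records.filter(
    (r) => r.status === "REMOVED" && todayFingerprints.has(r.fingerprint),
  );
}
```

The selection depends only on what is present in today's register, not on a diff against yesterday, which is what lets it undo a mass removal caused by one bad diff.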

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| server/utils/sponsorStateMachine.ts | Adds Phase C2 self-heal sweep, mass-removal circuit breaker, and guards Phase D2 on trustworthy fingerprint set size. |
| server/utils/sponsorRowSchema.ts | Makes enum metadata optional to avoid rejecting rows when new CSV layouts omit those columns. |
| server/utils/sponsorCsvColumns.ts | New shared module to resolve column indexes across legacy + May 2026 header layouts. |
| server/utils/csvFingerprintBuilder.ts | Switches to shared column resolver and aborts/removes output on >20% validation rejection. |
| server/utils/csvArchiver.ts | Switches to shared column resolver and aborts on >20% validation rejection instead of returning a near-empty dataset. |
| server/utils/tests/sponsorCsvNewFormat.test.ts | Adds regression coverage ensuring both old and new layouts parse and fingerprint correctly. |
| migrations/0021_fingerprint_town_removal.sql | Rewrites stored fingerprints from `name |
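
For readers who want the fingerprint rewrite spelled out, here is a small TypeScript illustration of the name|town|route to name||route transformation with deduplication, as described in the PR body. The actual migration is SQL and is not reproduced here; `stripTown` and `rewriteAndDedup` are hypothetical helpers, and the sketch assumes that sponsor names and routes never contain a `|` character.

```typescript
// Hypothetical illustration of the fingerprint rewrite; the real migration is SQL.
// Assumption: neither the name nor the route contains a "|" character.

/** Rewrites "name|town|route" to "name||route"; leaves unexpected shapes untouched. */
function stripTown(fingerprint: string): string {
  const parts = fingerprint.split("|");
  if (parts.length !== 3) return fingerprint;
  return `${parts[0]}||${parts[2]}`;
}

/** Rewrites every fingerprint and drops the duplicates the rewrite creates. */
function rewriteAndDedup(fingerprints: readonly string[]): string[] {
  return Array.from(new Set(fingerprints.map(stripTown)));
}
```

Deduplication matters because two rows that differed only by town collapse to the same fingerprint once the town segment is emptied. The rewrite is also idempotent, since an already rewritten value still splits into three parts with an empty middle segment.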
Comments suppressed due to low confidence (1)

server/utils/sponsorStateMachine.ts:761

  • Phase D2 removals (confirming GRACE_PERIOD sponsors absent in both D-1 and D) are not included in the mass-removal circuit breaker. This means a large GRACE_PERIOD pool could still be bulk-removed in Phase D2 without tripping the >20% safeguard described in the PR.
  const toRemoveD2 = allGracePeriod.filter(
    (r) => !processedInPhaseD.has(r.fingerprint) && !todayFingerprintSet.has(r.fingerprint),
  );

  if (toRemoveD2.length > 0) {
    for (let i = 0; i < toRemoveD2.length; i += BATCH_SIZE) {

Comment on lines +694 to +698
const msg =
`Mass-removal circuit breaker tripped: run wants to remove/grace ${deletionImpact.toLocaleString()} ` +
`of ${liveCount.toLocaleString()} live sponsors (> ${MASS_REMOVAL_FRACTION * 100}%). ` +
`Aborting before any status changes are applied. ` +
`Set SPONSOR_ALLOW_MASS_REMOVAL=1 to override deliberately.`;
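
A rough sketch of the check that would precede a message like the one above, under stated assumptions: `MASS_REMOVAL_FRACTION`, `deletionImpact`, and `liveCount` are taken from the excerpt, while `checkMassRemoval` and its `allowOverride` parameter are hypothetical. In the real code the override is presumably derived from the `SPONSOR_ALLOW_MASS_REMOVAL` environment variable; here it is passed in to keep the example self-contained.

```typescript
// Hypothetical sketch of a mass-removal circuit breaker.
const MASS_REMOVAL_FRACTION = 0.2; // ">20% of live records" per the PR description

/**
 * Throws before any status change is applied when a single run would
 * remove more than MASS_REMOVAL_FRACTION of the live sponsors.
 */
function checkMassRemoval(
  deletionImpact: number,
  liveCount: number,
  allowOverride = false, // assumed to mirror SPONSOR_ALLOW_MASS_REMOVAL=1
): void {
  // Nothing to protect when there are no live records, or when deliberately overridden.
  if (allowOverride || liveCount === 0) return;
  if (deletionImpact / liveCount > MASS_REMOVAL_FRACTION) {
    throw new Error(
      `Mass-removal circuit breaker tripped: ${deletionImpact} of ${liveCount} ` +
        `live sponsors (> ${MASS_REMOVAL_FRACTION * 100}%)`,
    );
  }
}
```

Running the check before any writes means a tripped breaker leaves the database in its previous state, and the explicit override keeps a genuine large removal possible as a deliberate operator action.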
Comment on lines +82 to +89
// Enum metadata is optional: the GOV.UK register CSV has shipped at least
// two column layouts (legacy "Type & Rating"/"Route" and the May 2026
// "TierRating"/"Migrant Classification"/"Sponsor Status" format), and not
// every layout carries every field. Requiring these caused 100% row
// rejection → empty fingerprinted CSV → the 2026-05-20 mass-removal
// incident. A row is identified by name + typeRating + route; enum
// metadata enriches it but must never reject it.
licenceStatus: SponsorLicenceStatusSchema.optional(),
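
To illustrate how a shared resolver could map both header layouts to column indexes, here is a hedged sketch. It is not the contents of sponsorCsvColumns.ts. The header strings "Type & Rating", "Route", "TierRating", and "Migrant Classification" come from the comment above, but which new header corresponds to which legacy one is an assumption, as is the "Organisation Name" header; `LogicalColumn`, `HEADER_ALIASES`, and `resolveColumns` are hypothetical names.

```typescript
// Hypothetical sketch of a header-to-index resolver covering two CSV layouts.
type LogicalColumn = "name" | "typeRating" | "route";

// Assumption: the legacy/new pairings below are guesses made for illustration only.
const HEADER_ALIASES: Record<LogicalColumn, readonly string[]> = {
  name: ["Organisation Name"],
  typeRating: ["Type & Rating", "TierRating"],
  route: ["Route", "Migrant Classification"],
};

/** Resolves each logical column to its index in the header row, or throws if one is missing. */
function resolveColumns(headerRow: readonly string[]): Record<LogicalColumn, number> {
  const normalised = headerRow.map((h) => h.trim().toLowerCase());
  const resolved = {} as Record<LogicalColumn, number>;
  for (const column of Object.keys(HEADER_ALIASES) as LogicalColumn[]) {
    const index = normalised.findIndex((h) =>
      HEADER_ALIASES[column].some((alias) => alias.toLowerCase() === h),
    );
    if (index === -1) {
      throw new Error(`Required column "${column}" not found in header row`);
    }
    resolved[column] = index;
  }
  return resolved;
}
```

Keeping the alias table in one module means the archiver and the fingerprint builder cannot drift apart on how a header is interpreted, which is the divergence this PR set out to remove.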
Sam-Aitech added a commit that referenced this pull request Jun 12, 2026
fix: address PR #47 review findings — column resolver, guard alerts, test coverage

2 participants