LOTC-1502: validator error on non-object sample_data #187

Open
kevinborkman-hub wants to merge 1 commit into main from LOTC-1502-validator-array-sample-data

Conversation

kevinborkman-hub (Collaborator) commented Apr 23, 2026

Summary

  • New Rust validator src/validate/sample_data_shape.rs hard-fails when a transform's embedded settings.sample_data is a JSON array, scalar, or empty object. Non-empty object passes; missing passes (legitimate no-op).
  • Runs on every CI invocation (Track 1 and Track 2). This closes the silent-skip gap in src/deploy/default.rs:536 (get_sample_data_as_json), where filtering through .as_object() produced null for arrays and scalars, so insert_sample_data_if_present and verify_rows_ingested became no-ops while deploy still reported success.
  • Wired into src/main.rs after sample_data_exists and before the freshness warnings.
  • Normalizes 9 pre-existing offenders so the validator lands green (details below).
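
The silent-skip gap described above can be illustrated with a minimal sketch. It is not the code in src/deploy/default.rs: the `Json` enum is a hypothetical stand-in for whatever JSON value type the repository uses, and the body of `insert_sample_data_if_present` here is an assumed simplification that only returns a row count.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a JSON value type; not the repository's actual type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

impl Json {
    // An `as_object`-style accessor: Some only when the value is an object.
    fn as_object(&self) -> Option<&BTreeMap<String, Json>> {
        match self {
            Json::Object(map) => Some(map),
            _ => None,
        }
    }
}

// Assumed simplification of the deploy step: returns the number of rows inserted.
fn insert_sample_data_if_present(sample: Option<&Json>) -> usize {
    match sample.and_then(Json::as_object) {
        Some(_row) => 1, // an object is treated as one row
        None => 0,       // arrays and scalars land here: nothing inserted, no error raised
    }
}
```

In this sketch an array-form sample is indistinguishable from a missing one, which is why the problem has to be caught by a validator before the deploy path runs.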

Jira

LOTC-1502

Changes

New validator (src/validate/sample_data_shape.rs)

  • Contract: settings.sample_data must be a non-empty JSON object when present.
  • Error messages name the offending transform path; array-form message points to scripts/configure_bundle.py for normalization.
  • Covered by 5 unit tests: non-empty object / missing pass; array / scalar / empty object fail.
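
For reviewers who want the contract in code form, here is a rough sketch of the check. It is not the contents of src/validate/sample_data_shape.rs: the `Json` enum, the function name `check_sample_data_shape`, and the exact message wording are assumptions made for illustration. The PR does not say how an explicit null is handled; this sketch treats it as a scalar.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a JSON value type; not the repository's actual type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

// `sample` is the transform's settings.sample_data, or None when the key is absent.
fn check_sample_data_shape(transform_path: &str, sample: Option<&Json>) -> Result<(), String> {
    match sample {
        // Missing sample_data is a legitimate no-op.
        None => Ok(()),
        // The only accepted shape: a non-empty object.
        Some(Json::Object(map)) if !map.is_empty() => Ok(()),
        Some(Json::Object(_)) => Err(format!(
            "{}: settings.sample_data is an empty object",
            transform_path
        )),
        // The array-form message points at the normalization script.
        Some(Json::Array(_)) => Err(format!(
            "{}: settings.sample_data is an array; normalize it with scripts/configure_bundle.py",
            transform_path
        )),
        // Everything else (string, number, bool, and null in this sketch) is a scalar.
        Some(_) => Err(format!(
            "{}: settings.sample_data is a scalar; expected a non-empty object",
            transform_path
        )),
    }
}
```

The five outcomes map one-to-one onto the five unit tests listed above: two passing shapes and three failing ones.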

Bundle normalizations (13 files)

| Bundle | Change |
| --- | --- |
| aws/bot-insights (4 transforms: akamai_ds2, cloudflare, cloudfront_firehose, fastly) | embedded settings.sample_data reduced from array to first element; separate sample_data.json files likewise normalized |
| aws/bot-insights/default | rich 23-field separate sample_data.json promoted into embedded copy (was a {timestamp}-only stub) |
| aws/cdn-insights (3 transforms: akamai_ds2, cloudflare, cloudfront_firehose) | embedded array → first element; separate files were already single objects |
| aws/cloudfront-to-kinesis | embedded raw TSV string → wrapper object {data_type, tsv} from separate sample_data_template.json |
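
The "array → first element" normalization applied to most of these bundles amounts to the following. This is an illustrative sketch only: the `Json` enum and `normalize_sample_data` are hypothetical names, the real normalization was done on the bundle files (the validator message points to scripts/configure_bundle.py), and the TSV-to-wrapper change for aws/cloudfront-to-kinesis is not modeled here.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for a JSON value type; not the repository's actual type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

// Collapse an array-form sample to its first element; leave other shapes untouched.
// An empty array has no first element, so it normalizes to None (no sample).
fn normalize_sample_data(sample: Json) -> Option<Json> {
    match sample {
        Json::Array(items) => items.into_iter().next(),
        other => Some(other),
    }
}
```

Only the first element survives, so any additional rows in the original array are dropped from the embedded copy.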

Overlap with PR #184

The bot-insights changes overlap with commit 23e3e44 on the fix-bot-insights-ai-category branch (PR #184). If #184 merges first, this PR rebases cleanly on bot-insights. The AI-category transform fix (554569c) from that branch is not included here and remains for #184.

Test plan

  • cargo test — all 62 tests in main binary pass (57 pre-existing + 5 new)
  • cargo run — every bundle returns SUCCESS
  • Filesystem scan for remaining array/scalar/empty-object sample_data across aws/ and trafficpeak/ returns 0 offenders
  • CI (Track 1 and Track 2) confirm green
  • Manual deploy check on a Track 2 bundle to confirm validator fires in the validate-only route

Risk

  • The validator halts on first error (matches existing validator style). Any future bundle regressing to array/scalar form will block all CI until fixed — by design per the ticket.
  • sample_data_exists (runs first) permits empty objects; sample_data_shape catches them. The division of responsibility is intentional but worth documenting in a follow-up comment.
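
The halt-on-first-error behavior can be pictured roughly as below. This is a sketch under assumptions, not the wiring in src/main.rs: the `Validator` type, `run_validators`, and the two stand-in checks `passes` and `fails` are hypothetical, and real validators inspect bundle files rather than a bundle name.

```rust
// Hypothetical validator signature: takes a bundle name, returns Ok or an error message.
type Validator = fn(&str) -> Result<(), String>;

// Run validators in order and stop at the first failure, prefixing the validator's name.
fn run_validators(validators: &[(&str, Validator)], bundle: &str) -> Result<(), String> {
    for (name, validate) in validators {
        validate(bundle).map_err(|err| format!("{}: {}", name, err))?;
    }
    Ok(())
}

// Stand-in check that always passes.
fn passes(_bundle: &str) -> Result<(), String> {
    Ok(())
}

// Stand-in check that always fails, as if the bundle had regressed to array form.
fn fails(bundle: &str) -> Result<(), String> {
    Err(format!("{} has array-form sample_data", bundle))
}
```

With validators listed in the order sample_data_exists, sample_data_shape, sample_data_freshness, a shape failure is reported and the freshness check never runs, which is the "blocks all CI until fixed" behavior noted above.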

🤖 Generated with Claude Code

Add src/validate/sample_data_shape.rs — a new hard-failing validator
that rejects settings.sample_data when it is a JSON array, scalar, or
empty object. Previously get_sample_data_as_json in src/deploy/default.rs
filtered through .as_object() and silently returned null for arrays/
scalars, causing insert_sample_data_if_present and verify_rows_ingested
to no-op; deploy reported success with no data inserted. The new
validator runs on every CI invocation (Track 1 and Track 2) so arrays
and scalars are caught uniformly before the deploy path runs.

Contract: settings.sample_data must be a non-empty JSON object when
present. Missing sample_data is allowed (legitimate no-op for
transforms with no sample). Wired in src/main.rs after
sample_data_exists and before sample_data_freshness.

Normalize 9 existing offenders so the validator lands green:
- aws/bot-insights: 5 transforms reduced from embedded array to the
  first element; default's rich 23-field separate sample_data.json
  promoted into the embedded copy so test ingests land a realistic
  row rather than a {timestamp} stub. Separate sample_data.json files
  normalized to match embedded for 4 of these transforms.
- aws/cdn-insights: 3 transforms reduced from embedded array to the
  first element (separate files were already single objects).
- aws/cloudfront-to-kinesis: embedded raw tab-separated log string
  replaced with the wrapper object {data_type, tsv} from the separate
  sample_data_template.json.

Covered by 5 unit tests: non-empty object passes, missing passes,
array fails with configure_bundle.py hint, scalar fails, empty object
fails. All 62 tests in the main bin pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kcorbett-hdx (Collaborator) left a comment

Can you remove the bot-insights bundle from this? Or are these changes necessary?

kevinborkman-hub force-pushed the LOTC-1502-validator-array-sample-data branch from 9143e4b to 18b6aa3 on April 30, 2026 14:25
kevinborkman-hub force-pushed the LOTC-1502-validator-array-sample-data branch from 18b6aa3 to 9143e4b on April 30, 2026 14:32

2 participants