Skip to content

feat: Add validation infrastructure and review skills#180

Closed
DougManuel wants to merge 16 commits into
mainfrom
skills/review-validation
Closed

feat: Add validation infrastructure and review skills#180
DougManuel wants to merge 16 commits into
mainfrom
skills/review-validation

Conversation

@DougManuel
Copy link
Copy Markdown
Contributor

Summary

  • Add 4 Claude Code skills for cchsflow development: review, validation, worksheets, and derive
  • Add worksheet validation infrastructure: check_worksheet(), fix_worksheet(), scope_worksheets(), YAML schemas, and CLI scripts (exec/check-worksheets.R, exec/fix-worksheets.R)
  • Add CEP-015 (variable discovery and project tools)

Test plan

  • Rscript exec/check-worksheets.R runs without errors
  • Rscript exec/fix-worksheets.R runs without errors
  • devtools::load_all() loads successfully
  • Scoped validation works: Rscript exec/check-worksheets.R --variables "SDCGCGT"

DougManuel and others added 16 commits February 24, 2026 12:27
Scope and requirements for consumer-side tooling: dependency
resolver, coverage checker, project worksheet generator. Includes
evaluation against 7 criteria with prioritisation — R1 (dependencies)
and R3 (project worksheets) as Phase 1 core, R5 (recommended sets)
deferred, Phase 3 conditional on downstream adoption.

Documents existing implementations in DemPoRT-V2-dev (recursive
dependency tree walker, transformation chain resolution, role-based
selection) and MockData (modular parsers, derived var identification)
as consolidation targets.
…lability guidance

cchsflow-review:
- Add pre-2007 explicit mapping check (Check 7)
- Add DerivedVar mixed _p/_m detection (Check 8)
- Update era boundary section: concept-first with CCHS naming eras table
- Add DerivedVar feeder check under L6

cchsflow-validation:
- New skill with checks 1-8 including severity ratings

cchsflow-worksheets:
- Add PUMF availability by cycle table (2001-2023)
- Document cchs2021_p as invalid database name
- Add DerivedVar row splitting guidance and age feeder split table
Four docs covering: harmonization workflow (L0-L6), PUMF vs Master
splitting (including PUMF availability by cycle table), derived variable
functions, and variableStart/databaseStart authoring patterns.
… new checks

Add recode block terminology definition to Check 2b. Clarify that the
collision check is at the (database, recStart) level, not databaseStart
overlap alone. Reference check_recode_blocks() and check_invalid_databases()
automated checks.

Add cchs2021_p/2022_p/2023_p to Check 5 invalid database patterns with
context on why they don't exist as standalone PUMF files.

Add multi-block databaseStart fix rule to Step 10: narrow each block to only
the databases where its source variable exists; never replace the full
databaseStart (risks dropping shorthand-covered databases). Include the
Beyond Compare verification step.
Split the 68KB monolithic SKILL.md (1,066 lines) into a 372-line
orchestrator that delegates to focused docs:

- docs/worksheet-reference.md (moved from docs/)
- docs/l0-l2-documentation-review.md (L0-L2 checks)
- docs/l3-l5-worksheet-checks.md (L3-L5 checks)
- docs/l6-implementation-validation.md (rec_with_table testing)
- docs/csv-validation-and-fixes.md (validation tools + fix workflow)
- docs/review/ (Gem system prompt, notebook manifest/coverage)

Added prerequisite section requiring worksheet-reference.md be read
before any review. Added .gitignore exception for skill docs/ folders.
The sub-item numbered 3b is renumbered to 4, with subsequent items
shifted to 5-7 for consistent sequential numbering.
…dit, and dev mode

- Add PUMF-Master variable family pattern documentation to worksheet-reference.md
  explaining the systematic relationship between Master continuous, PUMF categorical,
  and _cont bridging variables (with DHH_AGE example and DHHGAGE_B footnote)
- Add Check 8: Completeness audit (8a missing-code rows, 8b cycle coverage,
  8c variable family completeness) to l3-l5-worksheet-checks.md
- Add --dev mode to SKILL.md for authoring/development use where omissions are P1
- Cross-reference variable family pattern from Check 3 (PUMF vs Master naming)
…done criteria

SKILL.md orchestrates existing docs (foundations, patterns, testing) and adds
a 5-step done criteria checklist that includes R CMD check — filling a gap
where package-level validation was missing from the DV function workflow.
Triage step now detects when PRs touch R/ or tests/ files, flags that
Step 7b package health check will run, and cross-references the
cchsflow-derive done criteria for new or modified functions. Also
strengthens GHA failure handling — treat failing CI as blocking.
- Replace character-by-character .parse_csv_text() with vectorised
  readr::read_lines() + grep approach (variable_details.csv: infinite
  hang → 1.7s)
- Add R/scope-worksheets.R with scope_worksheets() and parse_scope_args()
  for filtering by --subject or --variables
- Update exec scripts to accept --subject/--variables CLI args
- Scoped fix merges corrected rows back into full worksheets
- Update 3 skills (review, validation, worksheets) with scoped usage docs

Scoped check: ~0.2s for typical 15-20 variable subset.
Full check: ~2s for entire variable_details.csv (3,700+ rows).
- Step 7: Broaden re-confirmation to all findings (not just P0/P1)
  to catch stale-data false positives on informational items
- Step 8: Add branch verification before CEP commit to prevent
  committing artifacts on the wrong branch
- Step 8/9: Add fix-then-report ordering guidance — apply fixes
  before posting PR comment when issues are being corrected
- L6: Add explicit DV feeder resolution callout with multi-era
  example (rec_with_table does not auto-resolve feeders)
- L6: Add QMD skip-if-clean guidance for reviews with no anomalies
- Reference: Add Gem template fallback path via git show
- Add load_schema() function and YAML schema files (variables.yaml,
  variable_details.yaml) so check_worksheet() can actually run
- Add glue, purrr, readr, yaml to DESCRIPTION Imports
- Fix assignment operator bug in .create_line_ending_crlf_error()
  (message <- should be message =)
- Fix stray comma in fix_worksheet() write_csv() call
- Use devtools::load_all() in exec scripts when in source tree
- Regenerate NAMESPACE exports and man pages
…erences

- Renumber validation checks sequentially (1-8) in cchsflow-validation
- Add skill chooser table and L-stage mapping to cchsflow-validation
- Add "Save Gem verification findings" step to cchsflow-review Step 8
- Fix stale metadata_registry.yaml references across 3 docs
- Replace broken MCP troubleshooting cross-reference in cchsflow-worksheets
  with inline 4-step troubleshooting guide
- Fix _s suffix description (deprecated, replace with _m)
- Update expected column counts to match current schemas (10/16)
Copilot AI review requested due to automatic review settings April 7, 2026 00:45
@DougManuel DougManuel closed this Apr 7, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an R-based validation + auto-fix workflow for the project’s CSV “worksheets” (variables.csv, variable_details.csv), adds scoping helpers for faster targeted checks, and documents the workflow via CEP-015 and Claude skill docs.

Changes:

  • Add worksheet validators/fixers (check_worksheet(), fix_worksheet(), check_recode_blocks()) driven by YAML schema definitions.
  • Add worksheet scoping utilities (scope_worksheets(), parse_scope_args()) plus CLI scripts to check/fix the worksheets.
  • Add/refresh documentation (Rd docs, CEP-015, and .claude/skills/* guidance).

Reviewed changes

Copilot reviewed 62 out of 63 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
R/scope-worksheets.R Adds scoped filtering and CLI arg parsing for targeted worksheet operations.
R/load-schema.R Loads YAML schema metadata for worksheet validation.
R/fix-worksheet.R Implements auto-fixes for detected worksheet formatting errors.
NAMESPACE Exposes new public functions.
DESCRIPTION Adds new imported packages used by validation/fix infrastructure.
exec/check-worksheets.R CLI entrypoint to run worksheet checks (with optional scoping).
exec/fix-worksheets.R CLI entrypoint to apply worksheet fixes (with optional scoping + merge-back).
inst/metadata/schemas/core/variables.yaml Defines expected variables.csv column order + ID column.
inst/metadata/schemas/core/variable_details.yaml Defines expected variable_details.csv column order.
.gitignore Updates ignore rules (incl. skill docs exception).
ceps/cep-015-variable-tools/cep-015-variable-tools.qmd Adds CEP-015 design/plan document.
man/scope_worksheets.Rd Roxygen output for scope_worksheets().
man/parse_scope_args.Rd Roxygen output for parse_scope_args().
man/load_schema.Rd Roxygen output for load_schema().
man/fix_worksheet.Rd Roxygen output for fix_worksheet().
man/check_worksheet.Rd Roxygen output for check_worksheet().
man/check_recode_blocks.Rd Roxygen output for check_recode_blocks().
man/dot-split_csv_line.Rd Documents internal helper for raw CSV field splitting.
man/dot-check_column_order.Rd Documents internal column order check.
man/dot-check_excessive_quoting.Rd Documents internal excessive quoting check.
man/dot-check_line_endings.Rd Documents internal line ending check.
man/dot-check_row_sorting.Rd Documents internal row sorting check.
man/dot-check_trailing_empty_columns.Rd Documents internal trailing empty columns check.
man/dot-create_column_order_error.Rd Documents internal column order error constructor.
man/dot-create_excessive_quoting_error.Rd Documents internal excessive quoting error constructor.
man/dot-create_file_not_found_error.Rd Documents internal file-not-found error constructor.
man/dot-create_invalid_csv_error.Rd Documents internal invalid-CSV error constructor.
man/dot-create_line_ending_crlf_error.Rd Documents internal CRLF error constructor.
man/dot-create_missing_id_column_error.Rd Documents internal missing-ID-column error constructor.
man/dot-create_recode_block_collision_error.Rd Documents internal recode collision error constructor.
man/dot-create_trailing_empty_columns_error.Rd Documents internal trailing empty columns error constructor.
man/dot-create_unsorted_rows_error.Rd Documents internal unsorted rows error constructor.
man/dot-fix_column_order_errors.Rd Documents internal column reordering fix helper.
man/dot-fix_empty_column_errors.Rd Documents internal empty column fix helper.
man/dot-fix_unsorted_rows_error.Rd Documents internal row sorting fix helper.
.claude/skills/cchsflow-worksheets/SKILL.md Adds worksheet authoring guidance + validation/fix usage.
.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md Adds detailed authoring rules and pitfalls.
.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md Adds L0–L6 workflow documentation.
.claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md Adds derived-function authoring guidance.
.claude/skills/cchsflow-validation/SKILL.md Adds validation skill guidance (incl. scoped workflows).
.claude/skills/cchsflow-review/SKILL.md Adds PR review workflow guidance for worksheets.
.claude/skills/cchsflow-review/docs/variable-naming-conventions.md Adds naming conventions reference.
.claude/skills/cchsflow-review/docs/review/notebook-coverage.md Adds NotebookLM coverage summary for review workflow.
.claude/skills/cchsflow-review/docs/review/gemini-gem-system-prompt.md Adds Gem system prompt reference for reviews.
.claude/skills/cchsflow-review/docs/l6-implementation-validation.md Adds L6 validation guidance for runtime testing.
.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md Adds structured worksheet checks guidance.
.claude/skills/cchsflow-review/docs/l0-l2-documentation-review.md Adds doc review guidance (MCP/CLI fallbacks).
.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md Adds CSV validation + fix workflow guidance.
.claude/skills/cchsflow-derive/SKILL.md Adds derived variable development guidance.
.claude/skills/cchsflow-derive/docs/testing.md Adds DV testing guidance (unit + golden fixtures).
.claude/skills/cchsflow-derive/docs/patterns/pathway-branching.md Adds pattern guidance for complex DVs.
.claude/skills/cchsflow-derive/docs/patterns/pass-through.md Adds pattern guidance for pass-through DVs.
.claude/skills/cchsflow-derive/docs/patterns/multi-source-routing.md Adds pattern guidance for multi-source routing.
.claude/skills/cchsflow-derive/docs/patterns/formula-calculation.md Adds pattern guidance for formula-based DVs.
.claude/skills/cchsflow-derive/docs/patterns/category-grouping.md Adds pattern guidance for categorical grouping.
.claude/skills/cchsflow-derive/docs/patterns/cat-to-continuous.md Adds pattern guidance for cat-to-continuous.
.claude/skills/cchsflow-derive/docs/function-inventory.md Adds inventory of DV functions by pattern/level/tier.
.claude/skills/cchsflow-derive/docs/foundations.md Adds v3 DV foundations (3-step architecture, anti-patterns).
.claude/skills/cchsflow-derive/docs/7-levels.md Adds L1–L7 complexity taxonomy reference.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/fix-worksheet.R
Comment on lines +98 to +101
empty_col_positions <- purrr::map_int(empty_column_errors, ~ .x$col_num)
cols_to_keep <- colnames(csv_data)[-empty_col_positions]
return(csv_data[, cols_to_keep])
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexing a data.frame with csv_data[, cols_to_keep] will drop to a vector when only one column remains (drop=TRUE default), which can break subsequent processing and readr::write_csv(). Use drop = FALSE to always return a data.frame.

Copilot uses AI. Check for mistakes.
Comment thread R/fix-worksheet.R
Comment on lines +140 to +143
)

return(csv_data_with_missing_columns[, corrected_column_order])
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subsetting with csv_data_with_missing_columns[, corrected_column_order] will drop to a vector if the result has a single column. Use drop = FALSE to ensure the return value is always a data.frame.

Copilot uses AI. Check for mistakes.
Comment thread R/fix-worksheet.R
purrr::keep(errors, ~ .x$error_type == "unsorted_rows")
)

# Write CSV content and also fix excessive quotting and line ending errors
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "quotting" should be "quoting".

Copilot uses AI. Check for mistakes.
Comment thread R/scope-worksheets.R
Comment on lines +3 to +6
#' Creates temporary copies of variables.csv and variable_details.csv
#' containing only the rows matching the specified scope. Scope can be
#' defined by variable names, subject values, or auto-detected from git diff.
#'
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says scope can be auto-detected from git diff, but the implementation only supports explicit variables and subjects. Either implement git-diff scoping or remove that claim from the roxygen docs to avoid misleading users.

Copilot uses AI. Check for mistakes.
Comment thread DESCRIPTION
Comment on lines 37 to +48
Depends:
R (>= 3.5.0),
haven (>= 1.1.2),
dplyr (>= 0.8.2),
sjlabelled (>= 1.0.17),
stringr (>= 1.2.0),
magrittr
Imports:
glue,
purrr,
readr,
yaml
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new exec scripts require cli (and devtools when run from the repo), but these packages are not listed in Depends/Imports/Suggests. Since the PR test plan calls these scripts directly, add the needed packages to Suggests (or switch to requireNamespace() with a clear error) so runs don't fail on a clean environment.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants