feat: Add validation infrastructure and review skills#180
Conversation
Scope and requirements for consumer-side tooling: dependency resolver, coverage checker, project worksheet generator. Includes evaluation against 7 criteria with prioritisation — R1 (dependencies) and R3 (project worksheets) as Phase 1 core, R5 (recommended sets) deferred, Phase 3 conditional on downstream adoption. Documents existing implementations in DemPoRT-V2-dev (recursive dependency tree walker, transformation chain resolution, role-based selection) and MockData (modular parsers, derived var identification) as consolidation targets.
…lability guidance cchsflow-review: - Add pre-2007 explicit mapping check (Check 7) - Add DerivedVar mixed _p/_m detection (Check 8) - Update era boundary section: concept-first with CCHS naming eras table - Add DerivedVar feeder check under L6 cchsflow-validation: - New skill with checks 1-8 including severity ratings cchsflow-worksheets: - Add PUMF availability by cycle table (2001-2023) - Document cchs2021_p as invalid database name - Add DerivedVar row splitting guidance and age feeder split table
Four docs covering: harmonization workflow (L0-L6), PUMF vs Master splitting (including PUMF availability by cycle table), derived variable functions, and variableStart/databaseStart authoring patterns.
… new checks Add recode block terminology definition to Check 2b. Clarify that the collision check is at the (database, recStart) level, not databaseStart overlap alone. Reference check_recode_blocks() and check_invalid_databases() automated checks. Add cchs2021_p/2022_p/2023_p to Check 5 invalid database patterns with context on why they don't exist as standalone PUMF files. Add multi-block databaseStart fix rule to Step 10: narrow each block to only the databases where its source variable exists; never replace the full databaseStart (risks dropping shorthand-covered databases). Include the Beyond Compare verification step.
Split the 68KB monolithic SKILL.md (1,066 lines) into a 372-line orchestrator that delegates to focused docs: - docs/worksheet-reference.md (moved from docs/) - docs/l0-l2-documentation-review.md (L0-L2 checks) - docs/l3-l5-worksheet-checks.md (L3-L5 checks) - docs/l6-implementation-validation.md (rec_with_table testing) - docs/csv-validation-and-fixes.md (validation tools + fix workflow) - docs/review/ (Gem system prompt, notebook manifest/coverage) Added prerequisite section requiring worksheet-reference.md be read before any review. Added .gitignore exception for skill docs/ folders.
The sub-item numbered 3b is renumbered to 4, with subsequent items shifted to 5-7 for consistent sequential numbering.
…dit, and dev mode - Add PUMF-Master variable family pattern documentation to worksheet-reference.md explaining the systematic relationship between Master continuous, PUMF categorical, and _cont bridging variables (with DHH_AGE example and DHHGAGE_B footnote) - Add Check 8: Completeness audit (8a missing-code rows, 8b cycle coverage, 8c variable family completeness) to l3-l5-worksheet-checks.md - Add --dev mode to SKILL.md for authoring/development use where omissions are P1 - Cross-reference variable family pattern from Check 3 (PUMF vs Master naming)
…done criteria SKILL.md orchestrates existing docs (foundations, patterns, testing) and adds a 5-step done criteria checklist that includes R CMD check — filling a gap where package-level validation was missing from the DV function workflow.
Triage step now detects when PRs touch R/ or tests/ files, flags that Step 7b package health check will run, and cross-references the cchsflow-derive done criteria for new or modified functions. Also strengthens GHA failure handling — treat failing CI as blocking.
…e-Lab/cchsflow into skills/review-validation
- Replace character-by-character .parse_csv_text() with vectorised readr::read_lines() + grep approach (variable_details.csv: infinite hang → 1.7s) - Add R/scope-worksheets.R with scope_worksheets() and parse_scope_args() for filtering by --subject or --variables - Update exec scripts to accept --subject/--variables CLI args - Scoped fix merges corrected rows back into full worksheets - Update 3 skills (review, validation, worksheets) with scoped usage docs Scoped check: ~0.2s for typical 15-20 variable subset. Full check: ~2s for entire variable_details.csv (3,700+ rows).
- Step 7: Broaden re-confirmation to all findings (not just P0/P1) to catch stale-data false positives on informational items - Step 8: Add branch verification before CEP commit to prevent committing artifacts on the wrong branch - Step 8/9: Add fix-then-report ordering guidance — apply fixes before posting PR comment when issues are being corrected - L6: Add explicit DV feeder resolution callout with multi-era example (rec_with_table does not auto-resolve feeders) - L6: Add QMD skip-if-clean guidance for reviews with no anomalies - Reference: Add Gem template fallback path via git show
- Add load_schema() function and YAML schema files (variables.yaml, variable_details.yaml) so check_worksheet() can actually run - Add glue, purrr, readr, yaml to DESCRIPTION Imports - Fix assignment operator bug in .create_line_ending_crlf_error() (message <- should be message =) - Fix stray comma in fix_worksheet() write_csv() call - Use devtools::load_all() in exec scripts when in source tree - Regenerate NAMESPACE exports and man pages
…erences - Renumber validation checks sequentially (1-8) in cchsflow-validation - Add skill chooser table and L-stage mapping to cchsflow-validation - Add "Save Gem verification findings" step to cchsflow-review Step 8 - Fix stale metadata_registry.yaml references across 3 docs - Replace broken MCP troubleshooting cross-reference in cchsflow-worksheets with inline 4-step troubleshooting guide - Fix _s suffix description (deprecated, replace with _m) - Update expected column counts to match current schemas (10/16)
There was a problem hiding this comment.
Pull request overview
This PR introduces an R-based validation + auto-fix workflow for the project’s CSV “worksheets” (variables.csv, variable_details.csv), adds scoping helpers for faster targeted checks, and documents the workflow via CEP-015 and Claude skill docs.
Changes:
- Add worksheet validators/fixers (
check_worksheet(),fix_worksheet(),check_recode_blocks()) driven by YAML schema definitions. - Add worksheet scoping utilities (
scope_worksheets(),parse_scope_args()) plus CLI scripts to check/fix the worksheets. - Add/refresh documentation (Rd docs, CEP-015, and
.claude/skills/*guidance).
Reviewed changes
Copilot reviewed 62 out of 63 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| R/scope-worksheets.R | Adds scoped filtering and CLI arg parsing for targeted worksheet operations. |
| R/load-schema.R | Loads YAML schema metadata for worksheet validation. |
| R/fix-worksheet.R | Implements auto-fixes for detected worksheet formatting errors. |
| NAMESPACE | Exposes new public functions. |
| DESCRIPTION | Adds new imported packages used by validation/fix infrastructure. |
| exec/check-worksheets.R | CLI entrypoint to run worksheet checks (with optional scoping). |
| exec/fix-worksheets.R | CLI entrypoint to apply worksheet fixes (with optional scoping + merge-back). |
| inst/metadata/schemas/core/variables.yaml | Defines expected variables.csv column order + ID column. |
| inst/metadata/schemas/core/variable_details.yaml | Defines expected variable_details.csv column order. |
| .gitignore | Updates ignore rules (incl. skill docs exception). |
| ceps/cep-015-variable-tools/cep-015-variable-tools.qmd | Adds CEP-015 design/plan document. |
| man/scope_worksheets.Rd | Roxygen output for scope_worksheets(). |
| man/parse_scope_args.Rd | Roxygen output for parse_scope_args(). |
| man/load_schema.Rd | Roxygen output for load_schema(). |
| man/fix_worksheet.Rd | Roxygen output for fix_worksheet(). |
| man/check_worksheet.Rd | Roxygen output for check_worksheet(). |
| man/check_recode_blocks.Rd | Roxygen output for check_recode_blocks(). |
| man/dot-split_csv_line.Rd | Documents internal helper for raw CSV field splitting. |
| man/dot-check_column_order.Rd | Documents internal column order check. |
| man/dot-check_excessive_quoting.Rd | Documents internal excessive quoting check. |
| man/dot-check_line_endings.Rd | Documents internal line ending check. |
| man/dot-check_row_sorting.Rd | Documents internal row sorting check. |
| man/dot-check_trailing_empty_columns.Rd | Documents internal trailing empty columns check. |
| man/dot-create_column_order_error.Rd | Documents internal column order error constructor. |
| man/dot-create_excessive_quoting_error.Rd | Documents internal excessive quoting error constructor. |
| man/dot-create_file_not_found_error.Rd | Documents internal file-not-found error constructor. |
| man/dot-create_invalid_csv_error.Rd | Documents internal invalid-CSV error constructor. |
| man/dot-create_line_ending_crlf_error.Rd | Documents internal CRLF error constructor. |
| man/dot-create_missing_id_column_error.Rd | Documents internal missing-ID-column error constructor. |
| man/dot-create_recode_block_collision_error.Rd | Documents internal recode collision error constructor. |
| man/dot-create_trailing_empty_columns_error.Rd | Documents internal trailing empty columns error constructor. |
| man/dot-create_unsorted_rows_error.Rd | Documents internal unsorted rows error constructor. |
| man/dot-fix_column_order_errors.Rd | Documents internal column reordering fix helper. |
| man/dot-fix_empty_column_errors.Rd | Documents internal empty column fix helper. |
| man/dot-fix_unsorted_rows_error.Rd | Documents internal row sorting fix helper. |
| .claude/skills/cchsflow-worksheets/SKILL.md | Adds worksheet authoring guidance + validation/fix usage. |
| .claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md | Adds detailed authoring rules and pitfalls. |
| .claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md | Adds L0–L6 workflow documentation. |
| .claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md | Adds derived-function authoring guidance. |
| .claude/skills/cchsflow-validation/SKILL.md | Adds validation skill guidance (incl. scoped workflows). |
| .claude/skills/cchsflow-review/SKILL.md | Adds PR review workflow guidance for worksheets. |
| .claude/skills/cchsflow-review/docs/variable-naming-conventions.md | Adds naming conventions reference. |
| .claude/skills/cchsflow-review/docs/review/notebook-coverage.md | Adds NotebookLM coverage summary for review workflow. |
| .claude/skills/cchsflow-review/docs/review/gemini-gem-system-prompt.md | Adds Gem system prompt reference for reviews. |
| .claude/skills/cchsflow-review/docs/l6-implementation-validation.md | Adds L6 validation guidance for runtime testing. |
| .claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md | Adds structured worksheet checks guidance. |
| .claude/skills/cchsflow-review/docs/l0-l2-documentation-review.md | Adds doc review guidance (MCP/CLI fallbacks). |
| .claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md | Adds CSV validation + fix workflow guidance. |
| .claude/skills/cchsflow-derive/SKILL.md | Adds derived variable development guidance. |
| .claude/skills/cchsflow-derive/docs/testing.md | Adds DV testing guidance (unit + golden fixtures). |
| .claude/skills/cchsflow-derive/docs/patterns/pathway-branching.md | Adds pattern guidance for complex DVs. |
| .claude/skills/cchsflow-derive/docs/patterns/pass-through.md | Adds pattern guidance for pass-through DVs. |
| .claude/skills/cchsflow-derive/docs/patterns/multi-source-routing.md | Adds pattern guidance for multi-source routing. |
| .claude/skills/cchsflow-derive/docs/patterns/formula-calculation.md | Adds pattern guidance for formula-based DVs. |
| .claude/skills/cchsflow-derive/docs/patterns/category-grouping.md | Adds pattern guidance for categorical grouping. |
| .claude/skills/cchsflow-derive/docs/patterns/cat-to-continuous.md | Adds pattern guidance for cat-to-continuous. |
| .claude/skills/cchsflow-derive/docs/function-inventory.md | Adds inventory of DV functions by pattern/level/tier. |
| .claude/skills/cchsflow-derive/docs/foundations.md | Adds v3 DV foundations (3-step architecture, anti-patterns). |
| .claude/skills/cchsflow-derive/docs/7-levels.md | Adds L1–L7 complexity taxonomy reference. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| empty_col_positions <- purrr::map_int(empty_column_errors, ~ .x$col_num) | ||
| cols_to_keep <- colnames(csv_data)[-empty_col_positions] | ||
| return(csv_data[, cols_to_keep]) | ||
| } |
There was a problem hiding this comment.
Indexing a data.frame with csv_data[, cols_to_keep] will drop to a vector when only one column remains (drop=TRUE default), which can break subsequent processing and readr::write_csv(). Use drop = FALSE to always return a data.frame.
| ) | ||
|
|
||
| return(csv_data_with_missing_columns[, corrected_column_order]) | ||
| } |
There was a problem hiding this comment.
Subsetting with csv_data_with_missing_columns[, corrected_column_order] will drop to a vector if the result has a single column. Use drop = FALSE to ensure the return value is always a data.frame.
| purrr::keep(errors, ~ .x$error_type == "unsorted_rows") | ||
| ) | ||
|
|
||
| # Write CSV content and also fix excessive quotting and line ending errors |
There was a problem hiding this comment.
Typo in comment: "quotting" should be "quoting".
| #' Creates temporary copies of variables.csv and variable_details.csv | ||
| #' containing only the rows matching the specified scope. Scope can be | ||
| #' defined by variable names, subject values, or auto-detected from git diff. | ||
| #' |
There was a problem hiding this comment.
The documentation says scope can be auto-detected from git diff, but the implementation only supports explicit variables and subjects. Either implement git-diff scoping or remove that claim from the roxygen docs to avoid misleading users.
| Depends: | ||
| R (>= 3.5.0), | ||
| haven (>= 1.1.2), | ||
| dplyr (>= 0.8.2), | ||
| sjlabelled (>= 1.0.17), | ||
| stringr (>= 1.2.0), | ||
| magrittr | ||
| Imports: | ||
| glue, | ||
| purrr, | ||
| readr, | ||
| yaml |
There was a problem hiding this comment.
The new exec scripts require cli (and devtools when run from the repo), but these packages are not listed in Depends/Imports/Suggests. Since the PR test plan calls these scripts directly, add the needed packages to Suggests (or switch to requireNamespace() with a clear error) so runs don't fail on a clean environment.
Summary
check_worksheet(),fix_worksheet(),scope_worksheets(), YAML schemas, and CLI scripts (exec/check-worksheets.R,exec/fix-worksheets.R)Test plan
Rscript exec/check-worksheets.Rruns without errorsRscript exec/fix-worksheets.Rruns without errorsdevtools::load_all()loads successfullyRscript exec/check-worksheets.R --variables "SDCGCGT"