From 4845592434b859da769011de8b8e6a8e1973c5af Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Tue, 24 Feb 2026 08:37:06 -0500 Subject: [PATCH 01/15] Add CEP-015: Variable discovery and project tools MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Scope and requirements for consumer-side tooling: dependency resolver, coverage checker, project worksheet generator. Includes evaluation against 7 criteria with prioritisation — R1 (dependencies) and R3 (project worksheets) as Phase 1 core, R5 (recommended sets) deferred, Phase 3 conditional on downstream adoption. Documents existing implementations in DemPoRT-V2-dev (recursive dependency tree walker, transformation chain resolution, role-based selection) and MockData (modular parsers, derived var identification) as consolidation targets. --- .../cep-015-variable-tools.qmd | 374 ++++++++++++++++++ 1 file changed, 374 insertions(+) create mode 100644 ceps/cep-015-variable-tools/cep-015-variable-tools.qmd diff --git a/ceps/cep-015-variable-tools/cep-015-variable-tools.qmd b/ceps/cep-015-variable-tools/cep-015-variable-tools.qmd new file mode 100644 index 00000000..959127c2 --- /dev/null +++ b/ceps/cep-015-variable-tools/cep-015-variable-tools.qmd @@ -0,0 +1,374 @@ +--- +title: "CEP-015: Variable discovery and project tools" +author: "Doug Manuel for the cchsflow development team" +date: "2026-02-24" +format: + html: + toc: true + toc-depth: 3 + toc-title: "Contents" + number-sections: true +--- + +::: callout-note +## CEP metadata + +| Field | Value | +|---------|--------------------------------------| +| CEP | 15 | +| Title | Variable discovery and project tools | +| Authors | Doug Manuel | +| Created | 2026-02-24 | +| Status | Draft | +| Branch | skills/review-validation | +::: + +## Abstract + +This CEP specifies tools that help researchers and AI agents find, select, and use harmonised CCHS variables from cchsflow. 
The current cchsflow workflow assumes users know which variables they need and how to call `rec_with_table()`. In practice, users need to: + +1. Discover which harmonised variables are available for their research question +2. Understand dependency chains — derived variables require feeder variables +3. Check cycle and database coverage for their target population +4. Generate project-specific worksheets with only the variables they need + +These tools complement the existing *producer-side* infrastructure (authoring, review, validation skills) with *consumer-side* capabilities. + +## Motivation + +### The dependency problem + +A researcher requesting `pack_years_der` implicitly needs 9+ feeder variables, some of which are themselves derived. Today, this dependency chain is encoded in `variableStart` fields using `DerivedVar::[feeder1, feeder2, ...]` notation, but there is no programmatic way to resolve the full transitive closure. + +If any feeder is missing from the project worksheet, `rec_with_table()` fails silently or produces unexpected results. This is the single most common source of user error. + +### The coverage problem + +Not all variables are available for all cycles or database types. A researcher planning a 2001-2023 trend analysis needs to know that `age_start_smoking` has no PUMF source for 2022-2023, or that PUMF pack-years have \~15-20% relative error versus Master. This information exists in worksheet metadata but requires manual inspection. + +### The project worksheet problem + +`rec_with_table()` processes all variables in the worksheets. For large projects, this is slow and produces many irrelevant variables. Users need a way to generate minimal project-specific worksheets containing only their target variables and transitive dependencies. 
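
Once the set of required variable names is known, producing a minimal worksheet reduces to filtering rows. The following is a rough Python sketch of that filtering step only. It assumes the worksheets key their rows on a column named `variable`; that column name and the helper name `subset_worksheet` are assumptions made for illustration. The CEP proposes R functions, so this shows the idea, not the implementation.

```python
import csv
import io


def subset_worksheet(csv_text, keep, name_column="variable"):
    """Return CSV text containing only rows whose name column is in `keep`.

    `name_column` is an assumed column name; adjust it to the real worksheet.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep the row only if its variable is a target or a dependency.
        if row[name_column] in keep:
            writer.writerow(row)
    return out.getvalue()
```

Parsing with a CSV reader rather than matching lines keeps quoted fields (for example labels containing commas) intact.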
+ +## Existing work + +### Within cchsflow + +- **`R/variable-discovery.R`** — metadata queries (`get_harmonized_variables()`, `get_source_mappings()`, `find_variable_in_data()`), recommended variable tags, subject/section filtering. Currently v1, smoking-focused. +- **`recommended` metadata tag** — `{recommended:primary}` / `{recommended:secondary}` in the `notes` field of `variables.csv`. Started for smoking but not applied broadly. +- **`R/table-generators.R`** — generates summary tables (cycle coverage, variable counts). Primarily for documentation, not user-facing. + +### Within cchsflow (runtime) + +- **`recode_derived_variables()`** in `R/recode-with-table.R` (lines 820-966) — the existing runtime dependency resolver. Parses `DerivedVar::[...]` from `variableStart`, extracts feeders, detects circular dependencies via a `var_stack`, and recurses to resolve missing feeders. Tightly coupled to the recoding pipeline — cannot be called standalone for metadata queries. + +### In downstream projects + +Three big-life-lab repositories have independently implemented variable selection and dependency resolution. These are the primary consumers that CEP-015 aims to consolidate. + +**DemPoRT-V2-dev** — the most mature implementation (6+ R files): + +- `R/get-dependency-tree.R` — recursive tree walker: `get_dependency_tree(variable_name, variables_sheet, database_name)` returns a nested list. Handles `DerivedVar::[...]` and database-specific overrides (`db_name::[VAR]`). +- `R/variable-start-utils.R` — `is_derived_var()` and `get_derived_vars()` helpers using `DerivedVar::\[(.+?)\]` regex. +- `R/get-start-var.R` — higher-level resolver that walks transformation chains (`rcs[...]`, `center[...]`, `interact[...]`, `dummy[...]`). Includes `parse_variable_start()` with bracket-aware comma splitting and `generate_sas_code_from_csv()` for ICES extraction. 
+- Role-based variable selection via `recodeflow:::select_vars_by_role()` and a `roles.csv` vocabulary (15 roles including `predictor`, `intermediate`, `imputation-predictor`). +- Project worksheets maintained separately: `demportdev-variables.csv` (292 rows), `cchsflow-variable-details.csv` (930 rows, copied from cchsflow), `demportdev-variable-details.csv` (35 project-specific rows), joined via `rbind()` at pipeline start. + +**MockData** (`v030-refactor` branch) — cleaner, modular implementation: + +- `R/mockdata-parsers.R` — `parse_variable_start(variable_start, cycle)` handles 4 formats (db-prefixed, bracket, default, plain; returns NULL for DerivedVar). `parse_range_notation()` for recStart/recEnd parsing. +- `R/mockdata-helpers.R` — `get_cycle_variables()`, `get_raw_variables()`, `get_variable_details_for_raw()` for per-cycle variable resolution. +- `R/identify_derived_vars.R` — `identify_derived_vars()` pattern detection + `get_raw_var_dependencies()` non-recursive feeder extraction. + +**huipain** — simplest pattern: calls `recodeflow::rec_with_table()` per cycle with full cchsflow worksheets copied as-is. No custom dependency resolution. + +### External + +- **cchs-metadata MCP** — provides StatCan source metadata (16,000+ variables, 251 datasets) but does not know about cchsflow harmonisation logic (dependency chains, recoding rules). + +## Evaluation + +Requirements were scored against seven criteria. The current project context: 381 variables, 3,698 variable_details rows, 225 DerivedVar entries, 17 recommended tags (smoking only), \~2,900 CRAN downloads/year, 7+ downstream repositories within big-life-lab (bllflow, cvd-trends-Canada, phiat-yll, raiflow, huipain, chmsflow, calibrationTutorial). 
+ +| Req | Value | Success | Complexity | Maintenance | User proximity | Incremental | Alternative today | +|---------|---------|---------|---------|---------|---------|---------|---------| +| R1: Dependencies | **High** — 225 DerivedVars; silent failures are #1 user error; 5+ downstream projects affected | **High** — deterministic; `parse_variable_start()` exists | Medium — recursive resolution, cycle detection, PUMF/Master splits; \~200-300 lines | **Zero** — reads existing worksheets | **High** — directly prevents most common failure | Yes | Read CEP docs; trial and error | +| R2: Coverage | Medium — useful for trend analyses, but `databaseStart` is human-readable | **High** — straightforward CSV cross-reference | Low — \~100-150 lines | **Zero** — reads existing worksheets | Medium — researchers can read `databaseStart` directly | Yes | Manual inspection of `databaseStart` | +| R3: Project worksheets | **High** — 5+ downstream projects independently maintain subset scripts | **High** — CSV subsetting is well-defined; depends on R1 | Low-Medium — \~150 lines | Low — output format matches existing worksheets | **High** — produces an artifact they directly use | Needs R1 | Copy full worksheets, delete rows manually | +| R4: Discovery | Low-Medium — incremental over existing `get_harmonized_variables()` | **High** — extending existing functions | Low — \~50-100 additional lines | Low — reads worksheets | Medium — helps new users; experienced users know their variables | Yes | `grep` on variables.csv | +| R5: Recommended sets | **Low** — requires domain expertise for curation; 17 tags across 381 vars; bundles are project-specific | Medium — code is easy; curation is hard; who maintains definitions? 
| Low (code), **High** (curation) | **High** — bundles go stale as variables are added per CEP | Low-Medium — researchers define their own lists | Yes, but low value without broad tag coverage | Project-specific variable lists | +| R6: MCP/CLI | Medium — serves AI agents, not humans directly | Medium — cross-repo coordination; scope boundary unclear | Medium-High — Python wrappers or reimplementation | Medium — two repos must stay in sync | Low (humans), High (AI agents) | Needs R1-R3 | AI agents read worksheets directly | +| R7: Consumer skill | Medium — orchestrates existing tools | **High** — skills are lightweight | Low — a SKILL.md file | Low — thin wrapper | Medium — Claude Code users only | Needs R1-R3 | Ad hoc prompting | + +### Prioritisation + +**Core (Phase 1):** R1 (dependencies) and R3 (project worksheets) are the clear winners — high value, zero/low maintenance, directly address the most common failure mode across 5+ downstream projects. R2 (coverage) is low effort and complements R1. R4 (discovery) is incremental and can be included opportunistically. + +**Defer:** R5 (recommended sets) has the worst value-to-maintenance ratio. The curation problem is unsolved, only 17 of 381 variables are tagged, and bundles are inherently project-specific. Defer until R1-R3 reveal whether standardised bundles are actually useful. + +**Phase 3 only:** R6 (MCP/CLI) and R7 (consumer skill) depend entirely on Phase 1-2 stability. Don't plan details until the R functions are working and the API surface is validated by use. + +### Overall assessment: 7/10 — proceed with tight scope + +The dependency resolver + project worksheet generator combination addresses a documented, recurring pain point across 5+ downstream projects, with zero ongoing maintenance cost. Phase 1-2 is ~500 lines of R with clear inputs, deterministic outputs, and built-in test cases (225 DerivedVar entries, real downstream projects). The risk is not building the wrong thing — it is over-building. 
Keep Phase 1-2 tight, ship it, let Phase 3 demand materialise on its own. The ~2,900 annual CRAN downloads indicate a niche but real user base; the primary beneficiaries are big-life-lab's own research pipelines. + +## Scope + +### In scope + +1. **Dependency resolver** — given target variable(s), resolve the full dependency graph +2. **Coverage checker** — report variable availability by cycle and database type +3. **Project worksheet generator** — produce minimal `variables.csv` and `variable_details.csv` for a set of target variables +4. **Variable discovery** — search and filter harmonised variables by domain, type, coverage +5. **Recommended variable sets** — deferred; curated bundles for common research use cases +6. **MCP and CLI integration** — Phase 3; expose key functions through the cchs-metadata MCP server and CLI +7. **Consumer skill** — Phase 3; Claude Code skill for researchers using cchsflow in analysis projects + +### Out of scope + +- Changes to `rec_with_table()` core processing logic +- New harmonisation of CCHS variables (covered by existing CEPs) +- Data access or download tools (PUMF/Master files) +- Statistical analysis functions + +## Requirements + +### R1: Dependency resolution + +Given one or more target variables, return the complete set of variables needed, in dependency order. + +**Inputs:** + +- `target_variables`: character vector of cchsflow variable names +- `databases`: optional character vector of target databases (e.g., `"cchs2017_2018_p"`) + +**Outputs:** + +- Ordered list of variables from leaves (no dependencies) to roots (target variables) +- For each variable: its direct feeders, whether it's a DerivedVar, and which `Func::` function implements it +- Warnings for circular dependencies or missing feeders + +**Data source:** `variable_details.csv` — parse `variableStart` for `DerivedVar::[...]` patterns, resolve recursively.
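
The resolution described in R1 can be sketched as a depth-first walk. This is a minimal Python illustration of the algorithm, not the proposed R implementation. It assumes a hypothetical dict `variable_start_by_name` holding one `variableStart` string per variable, which sidesteps the multiple-row and PUMF/Master cases in the real `variable_details.csv`. The function names are illustrative; the regex follows the `DerivedVar::[...]` notation quoted earlier.

```python
import re

# Matches the DerivedVar::[a, b, c] notation and captures the feeder list.
DERIVED_RE = re.compile(r"DerivedVar::\[(.+?)\]")


def parse_feeders(variable_start):
    """Return feeder names from a variableStart string, or [] if not derived."""
    match = DERIVED_RE.search(variable_start or "")
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def resolve_dependencies(targets, variable_start_by_name):
    """Return (ordered, missing, cycles) for the given target variables.

    ordered: variables from leaves to roots, each listed once
    missing: names referenced but absent from the (hypothetical) lookup dict
    cycles:  each detected cycle as a list of names, first name repeated last
    """
    ordered, missing, cycles = [], [], []
    done = set()

    def visit(name, stack):
        if name in done:
            return
        if name in stack:
            # Already on the current path: record the cycle and stop descending.
            cycles.append(stack[stack.index(name):] + [name])
            return
        if name not in variable_start_by_name:
            if name not in missing:
                missing.append(name)
            return
        for feeder in parse_feeders(variable_start_by_name[name]):
            visit(feeder, stack + [name])
        done.add(name)
        ordered.append(name)  # appended only after all of its feeders

    for target in targets:
        visit(target, [])
    return ordered, missing, cycles
```

Because a variable is appended only after its feeders, the result is already in leaves-to-roots order. A shared feeder is visited once. Variables that take part in a cycle still appear in `ordered`, so a caller would need to check `cycles` before trusting the order.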
+ +### R2: Coverage checking + +Given target variables and target databases, report availability and caveats. + +**Inputs:** + +- `variables`: character vector +- `databases`: character vector (e.g., `c("cchs2001_p", "cchs2017_2018_p")`) + +**Outputs:** + +- Matrix: variable x database → available / not available / available with caveats +- Caveats: PUMF precision warnings, optional module flags, Master-only indicators +- Maximum common coverage: largest set of databases where all target variables are available + +**Data source:** `variables.csv` `databaseStart` field, `variable_details.csv` for caveat metadata. + +### R3: Project worksheet generation + +Given target variables, generate minimal project-specific worksheets. + +**Inputs:** + +- `target_variables`: character vector +- `databases`: optional — restrict to specific databases +- `output_dir`: path for generated files + +**Outputs:** + +- `variables.csv` containing target variables + all transitive dependencies +- `variable_details.csv` containing all rows for included variables, filtered to requested databases +- Summary report: variable count, cycle coverage, known caveats + +**Depends on:** R1 (dependency resolution), R2 (coverage checking). + +### R4: Variable discovery enhancements + +Extend `variable-discovery.R` with: + +- Search by label text (not just exact subject match) +- Filter by coverage (e.g., "available for 2001-2018 PUMF") +- Filter by variable type (categorical, continuous, derived) +- List dependencies for a variable (wrapper around R1) +- `recommended` tag applied across all variable domains (not just smoking) + +### R5: Recommended variable sets + +Two distinct concepts: + +**Domain importance** (per-variable): which variable is the primary choice within a subject. Already encoded as `{recommended:primary}` in `notes`. Stays inline until Phase 3 migration. 
+ +- Example: `SMKDSTY_A` is the primary smoking status variable; `SMKDSTY_B` is secondary (2015+ categories only) + +**Use-case bundles** (cross-domain): curated sets for common research purposes, stored in `variable_sets.csv`: + +- **Smoking dose-response**: pack_years_der, SMKDSTY_cat3, DHHGAGE_cont, sex +- **Chronic disease risk factors**: smoking status, BMI, alcohol, physical activity +- **Demographics baseline**: age, sex, education, income, province + +Each bundle specifies target variables; the dependency resolver (R1) handles transitive expansion. + +### R6: MCP and CLI integration + +Expose R1-R4 through the cchs-metadata MCP server and CLI: + +- `resolve_dependencies(variables)` — returns dependency graph +- `check_coverage(variables, databases)` — returns coverage matrix +- `generate_project_worksheets(variables, databases, output_dir)` — generates files + +These extend the existing MCP (which handles StatCan source metadata) with cchsflow harmonisation logic. + +### R7: Consumer skill + +Claude Code skill (`cchsflow-analysis` or `cchsflow-user`) that: + +- Helps researchers find variables for their research question +- Generates the correct `rec_with_table()` call +- Warns about coverage gaps and precision caveats +- Produces project-specific worksheets +- Explains what harmonisation transforms are applied + +## Architecture considerations + +### Where should the logic live? 
+ +| Component | Location | Rationale | +|-------------------------|-----------------------|-------------------------| +| Dependency resolver | R function in cchsflow | Reads cchsflow worksheets directly | +| Coverage checker | R function in cchsflow | Reads cchsflow worksheets directly | +| Project worksheet generator | R function in cchsflow | Writes cchsflow-format CSVs | +| MCP tools | cchsflow-docs MCP server | Extends existing metadata server | +| CLI commands | cchsflow-docs CLI | Extends existing CLI | +| Consumer skill | `.claude/skills/` | Orchestrates R functions + MCP | + +The core logic (R1-R3) should be R functions in the cchsflow package itself — they operate on cchsflow data structures and should be testable with `devtools::test()`. The MCP/CLI layer calls these functions. + +### Dependency graph data structure + +The dependency graph is implicit in `variable_details.csv`. Each `DerivedVar::[a, b, c]` in `variableStart` declares edges. The resolver builds an explicit DAG: + +``` +pack_years_der +├── SMKDSTY_A (non-derived — leaf) +├── DHHGAGE_cont (non-derived — leaf) +├── age_start_smoking (DerivedVar) +│ ├── SMKG040_cont (PUMF) or SMK_040 (Master) +│ └── ... (resolved by database type) +├── cigs_per_day (DerivedVar) +│ ├── SMKDSTY_A (already in graph — shared dependency) +│ ├── SMK_204 (leaf) +│ └── SMK_208 (leaf) +├── time_quit_smoking (DerivedVar) +│ └── ... +└── ... +``` + +### Handling PUMF/Master splits + +When a variable has separate PUMF and Master rows with different feeders (e.g., `age_start_smoking` routes to `SMKG040_cont` on PUMF, `SMK_040` on Master), the resolver needs to know which database type is targeted. If both, it includes the union of feeders. + +## Metadata architecture + +### Current state + +Metadata is currently embedded in worksheet fields: + +- **Dependencies** — implicit in `variableStart` via `DerivedVar::[feeder1, feeder2, ...]` notation. No pre-computed graph; must be resolved by parsing at runtime. 
+- **Recommended tags** — `{recommended:primary}` / `{recommended:secondary}` in the free-text `notes` field of `variables.csv`. Applied only to smoking variables so far. +- **Coverage** — implicit in `databaseStart` (comma-separated list of databases). No matrix view; must be cross-referenced with `variable_details.csv` for caveats. + +### Design principle: functions first, then extract + +The dependency graph already exists — it's encoded in `variableStart`. Building a separate "dependency database" either materialises a cache (useful but not a new source of truth) or creates a second canonical source that must stay in sync with worksheets. + +The implementation therefore follows **functions first, extraction second**: + +1. Build R functions that read directly from existing worksheets and resolve metadata at runtime +2. Once the functions work and the API surface is validated by real use, extract metadata into standalone form (structured CSV or database tables) + +This avoids designing a schema in the abstract and ensures the metadata structure reflects what consumers actually need. + +### Recommended sets: definition matters + +Before structuring recommended metadata, distinguish two concerns: + +- **Domain importance** — primary/secondary within a subject (e.g., `SMKDSTY_A` is the primary smoking status variable). This is the current `{recommended:primary}` tag. +- **Use-case bundles** — curated sets for specific research purposes (e.g., "chronic disease risk factors" combines variables across domains). This is a different concept. + +Phase 1 uses simple `variable_sets.csv` for use-case bundles. Domain importance tags stay in `notes` until there are multiple consumers that benefit from structured form. 
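
While the domain importance tag stays inline, a consumer has to extract it from the free-text `notes` field. A minimal Python sketch, assuming the tag appears literally as `{recommended:primary}` or `{recommended:secondary}` as described above; the function name is hypothetical and the real helpers in `variable-discovery.R` may work differently.

```python
import re

# The two tag values documented for the notes field.
TAG_RE = re.compile(r"\{recommended:(primary|secondary)\}")


def recommended_level(notes):
    """Return 'primary', 'secondary', or None when no tag is present."""
    match = TAG_RE.search(notes or "")
    return match.group(1) if match else None
```

Returning `None` for untagged variables keeps the untagged case distinct from `secondary`. That distinction matters while tag coverage is limited to one domain.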
+ +### Future: standalone metadata + +Eventually, metadata should move out of inline `notes` tags into proper fields or tables: + +- **Within cchsflow** — `variable_sets.csv` for recommended bundles, structured columns in `variables.csv` for domain importance +- **Within cchsflow-docs** — dependency graphs, coverage matrices, and recommended sets as DuckDB tables alongside StatCan metadata, exposed via MCP + +This migration should happen only after Phase 1 functions are working and the schema is validated by use. The cchsflow-docs database currently serves StatCan metadata ("what does StatCan publish?"); adding cchsflow metadata ("what does cchsflow do with it?") changes its scope and should be a deliberate decision. + +## Implementation phases + +### Phase 1: Core R functions (R1 + R2 + R4) + +Build the core logic as R functions that parse worksheet CSVs directly. Zero new data files — all metadata is read from existing worksheets at runtime. + +- **Dependency resolver** — parse `DerivedVar::[...]` from `variable_details.csv`, resolve recursively, return topologically ordered variable list with feeder metadata +- **Coverage checker** — cross-reference `databaseStart` and `variable_details.csv` to produce variable x database availability matrix +- **Discovery enhancements** — extend `variable-discovery.R` with label search, coverage filter, dependency listing wrapper +- **Tests** for all functions, including edge cases (circular dependencies, missing feeders, PUMF/Master splits) + +This validates the logic, API surface, and metadata needs with real use before any schema extraction or new data files. 
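
The coverage cross-reference can be illustrated briefly. This Python sketch assumes `databaseStart` is a comma-separated list of database names, as described under the current metadata state, and takes it from a hypothetical dict keyed by variable name. It covers availability only; caveats such as PUMF precision warnings are left out, and the planned implementation is in R.

```python
def coverage_matrix(variables, databases, database_start_by_name):
    """Map each variable to {database: True/False} from databaseStart strings."""
    matrix = {}
    for name in variables:
        raw = database_start_by_name.get(name, "")
        listed = {db.strip() for db in raw.split(",") if db.strip()}
        matrix[name] = {db: db in listed for db in databases}
    return matrix


def common_databases(matrix, databases):
    """Databases, in the requested order, where every variable is available."""
    return [db for db in databases if all(row[db] for row in matrix.values())]
```

`common_databases` corresponds to the "maximum common coverage" output in R2: the largest set of requested databases in which all target variables are present.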
+ +### Phase 2: Project worksheet generation (R3) + +Build on Phase 1 functions: + +- **Project worksheet generator** — given target variables, produce minimal `variables.csv` and `variable_details.csv` with all transitive dependencies, filtered to requested databases +- **Summary reports** — variable count, coverage gaps, caveats, feeder counts +- **Validation** — test with real downstream use cases (e.g., generate worksheets for a DemPoRT-like smoking + demographics subset) + +### Phase 3: Integration layer (R6 + R7), conditional + +Proceed only after Phases 1-2 are stable and used by at least one downstream project: + +- **MCP/CLI tools** in cchsflow-docs — dependency resolution, coverage queries, project worksheet generation. Requires deciding the scope boundary (open question #1). +- **Consumer skill** — Claude Code skill orchestrating R functions and MCP tools for researchers +- **Metadata extraction** — migrate inline `{recommended:X}` tags to structured fields if multiple consumers benefit; materialise dependency graph and coverage matrix as database tables if the MCP integration warrants it +- **Recommended variable sets** (R5) — revisit once Phase 1-2 experience reveals whether standardised bundles are useful or whether project-specific lists remain the norm + +## Open questions + +1. **Scope boundary for cchsflow-docs** — the MCP currently serves StatCan metadata. Adding cchsflow harmonisation metadata (dependencies, recommended sets) changes its purpose. Should this be a separate MCP, an extension of the existing one, or a shared database with distinct tables? (Phase 3 decision — not needed for Phases 1-2.) +2. **Should project worksheets be self-contained** or reference the main cchsflow installation? Self-contained is simpler for users but duplicates data; references are lighter but require cchsflow to be installed. +3. 
**How to handle version differences** — if a project uses an older cchsflow version with different variables, should the resolver warn or adapt? + +### Resolved + +- **Cross-domain dependencies**: Yes, the resolver must handle them. `pack_years_der` depends on `DHHGAGE_cont` (Demographics). The dependency graph crosses subject boundaries by design. +- **Recommended tag migration timing**: Defer until Phase 3. Only migrate when multiple consumers exist and the schema is validated by Phase 1-2 use. +- **Recommended variable sets priority**: Deferred. The curation burden is high, only 17 of 381 variables are tagged, and downstream projects maintain their own lists. Revisit after Phases 1-2. + +## References + +### Within cchsflow + +- `R/variable-discovery.R` — existing variable discovery functions (v3-smoking branch) +- `R/clean-variables.R` — worksheet infrastructure (clean_variables, derive_passthrough, parse_range_notation) +- `R/recode-with-table.R` lines 820-966 — existing runtime dependency resolver (`recode_derived_variables()`) +- `.claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md` — 3-step architecture and feeder alignment + +### Downstream implementations to consolidate + +- `DemPoRT-V2-dev/R/get-dependency-tree.R` — recursive dependency tree walker +- `DemPoRT-V2-dev/R/get-start-var.R` — `parse_variable_start()` with transformation chain resolution +- `DemPoRT-V2-dev/R/variable-start-utils.R` — `is_derived_var()`, `get_derived_vars()` +- `MockData/R/mockdata-parsers.R` (`v030-refactor`) — `parse_variable_start()`, `parse_range_notation()` +- `MockData/R/mockdata-helpers.R` — `get_cycle_variables()`, `get_raw_variables()` +- `MockData/R/identify_derived_vars.R` (`v030-refactor`) — `identify_derived_vars()`, `get_raw_var_dependencies()` + +### External + +- cchs-metadata MCP server — `../cchsflow-docs/mcp-server/server.py` \ No newline at end of file From 0a430b5bc55061dad10984eeea9d4d7adb312be2 Mon Sep 17 00:00:00 2001 From: Doug Manuel 
Date: Wed, 11 Mar 2026 15:16:26 -0400 Subject: [PATCH 02/15] skill(cchsflow-review): Add scope-confirmation prompt when domain not explicit --- .claude/skills/cchsflow-review/SKILL.md | 957 ++++++++++++++++++++++++ 1 file changed, 957 insertions(+) create mode 100644 .claude/skills/cchsflow-review/SKILL.md diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md new file mode 100644 index 00000000..99c3f9d2 --- /dev/null +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -0,0 +1,957 @@ +--- +name: cchsflow-review +description: Review cchsflow worksheet changes for correctness using the CEP/L0-L6 process. Use when reviewing PRs that modify variables.csv or variable_details.csv, or when a user wants to validate their own harmonization work. Generates or updates a CEP as a review artifact, runs worksheet checks, and performs L6 implementation validation with rec_with_table(). Invoke with a PR number or a list of variables. +allowed-tools: Bash(gh:*), Bash(git:*), Bash(Rscript:*), Bash(R:*), Read, Glob, Grep +--- + +# cchsflow worksheet review + +CEP-driven review for cchsflow worksheet changes. Reviews follow the L0-L6 harmonization workflow, generating a CEP as a review artifact that documents findings and links to the PR. + +## Usage + +``` +/cchsflow-review +/cchsflow-review # review unstaged changes +``` + +## Workflow + +### Step 1: Scope and triage + +Before any checks, establish what is being reviewed and assess the shape of the diff. + +#### Confirm scope with the user + +cchsflow PRs typically cover one domain at a time (smoking, alcohol, physical activity, etc.). If the scope is not explicit in the invocation, **ask before proceeding**: + +> "Which variables or domain should I focus on? (e.g., smoking variables, SMK_*/SMKG_*, or a specific list)" + +This prevents accidentally reviewing or modifying variables from other domains that happen to share the same worksheets. 
Do not infer scope from the branch name alone — confirm with the user. + +#### Review contexts + +- **PR review**: Reviewing another contributor's PR +- **Self-review**: User is checking their own in-progress harmonization work + +#### Triage the diff + +For PR reviews, run triage first: + +1. **Get the diff** and identify which variables were modified in `variable_details.csv` and `variables.csv` +2. **Check `variables.csv` diff size** — if the entire file was rewritten (line count matches total rows), flag as potential formatting/schema change vs targeted edits +3. **Check GHA status** — have CI checks run? Are they passing? +4. **Count modified variables** and group by domain + +**Important:** `gh pr diff --stat` does not exist and `gh pr diff` does not support path filtering. Instead, check out the PR branch and use git directly: + +```bash +gh pr checkout --repo Big-Life-Lab/cchsflow +git fetch origin +git diff origin/...HEAD --numstat # file-level change stats +git diff origin/...HEAD --name-only # file list +gh pr checks --repo Big-Life-Lab/cchsflow 2>&1 || echo "No checks configured" +``` + +Note: `gh pr checks` returns exit code 1 when no checks exist — this is not an error. + +**Full-file formatting changes:** If `variables.csv` shows a line count close to its total row count (e.g., 379+/379-), the diff may be dominated by formatting changes (quoting, whitespace) rather than content changes. Use Python's csv module to compare content between branches, ignoring formatting: + +```python +python3 -c " +import csv +old = list(csv.DictReader(open('/tmp/variables_target.csv'))) +new = list(csv.DictReader(open('inst/extdata/variables.csv'))) +# Compare by variable name, find content differences +" +``` + +Never use bash text tools (sed, awk, grep) to parse CSV files — use Python csv or R `read.csv()` for reliable structured data parsing. + +#### Propose a scope + +Extract the list of modified variables from the diff, then propose: + +1. 
**Variables**: List all variables found in the diff, grouped by domain if possible. Flag any variables that appear in the diff but are not mentioned in the PR title/description. + - Example: "Proposing to review 8 variables: FVCDFRU, FVCDSAL, FVCDPOT, FVCDCAR, FVCDVEG, FVCDJUI, diet_score, diet_score_cat3. The PR also modifies ADL_01 and 293 other variables in variables.csv — these are outside the stated scope." + +2. **Database types**: Default to **both PUMF (`_p`) and Master (`_m`)**. cchsflow currently supports `_p` and `_m` suffixes. The `_s` (share file) suffix is deprecated and must be converted to `_m` whenever encountered in reviewed variables — this is a required fix, not just a note. The `_i` (ICES) suffix is similarly deprecated — replace with `_m`. Before converting `_s` → `_m`, verify that a corresponding `_m` entry does not already exist for that database (if it does, delete the `_s` row instead of renaming it). + +3. **Cycles**: Default to **all cycles present in the diff** (typically 2001 through 2017-2018, expanding as new cycles are added). + +#### Print and proceed + +Print the proposed scope and triage summary clearly to the console, then proceed. The user can interrupt at any time to narrow or expand the scope. + +``` +Triage: + Files changed: variables.csv (379+/379-), variable_details.csv (186+/186-) + Variables modified: 302 total (8 in-scope, 294 out-of-scope) + GHA checks: not run + Full-file rewrite detected in variables.csv (likely formatting change) + +Proposed review scope: + Variables: FVCDFRU, FVCDSAL, FVCDPOT, FVCDCAR, FVCDVEG, FVCDJUI, diet_score, diet_score_cat3 + Database types: PUMF (_p) and Master (_m) + Cycles: 2001 through 2017-2018 + Out-of-scope: 294 other variables, column reordering + +Proceeding with review. Interrupt to adjust scope. +``` + +If the user has already specified a scope (e.g., "just review the FVC variables"), skip the proposal and use their scope directly. 
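
The in-scope and out-of-scope counts in a triage summary can be derived by comparing parsed CSV content rather than diff lines. A minimal sketch, assuming `variables.csv` has one row per variable keyed by a column named `variable` (an assumption, as are the function names). It is not suitable as written for `variable_details.csv`, which has several rows per variable.

```python
import csv
import io


def changed_variables(old_text, new_text, key="variable"):
    """Names whose parsed field values differ between two CSV texts.

    Quoting and column order are ignored because rows are compared as dicts.
    """
    def load(text):
        return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

    old, new = load(old_text), load(new_text)
    names = old.keys() | new.keys()
    return sorted(name for name in names if old.get(name) != new.get(name))


def split_by_scope(changed, in_scope):
    """Partition changed names into (in-scope, out-of-scope) lists."""
    inside = [name for name in changed if name in in_scope]
    outside = [name for name in changed if name not in in_scope]
    return inside, outside
```

If a full-file rewrite yields an empty `changed_variables` result, the diff is formatting only.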
+ +#### Scope boundaries + +If the diff contains changes beyond the agreed variables (e.g., column reordering, unrelated variable modifications), note this in the triage output but do not review those changes unless the user requests it. + +### Step 2: Eligibility check (PR reviews only) + +For PR reviews, check the PR is reviewable: + +- State is OPEN and not a draft +- Not an automated PR +- If a prior approval exists, check whether new commits were pushed after the approval date — if so, the PR still needs review + +```bash +gh pr view --repo Big-Life-Lab/cchsflow --json state,isDraft,author,reviews,commits +``` + +### Step 3: Set up working tree and locate/create CEP + +#### Ensure worksheets are from the PR branch + +For PR reviews, the working tree must have the PR's worksheets so that `rec_with_table()` tests the PR's changes, not `main`. Check out the PR branch: + +```bash +gh pr checkout --repo Big-Life-Lab/cchsflow +``` + +If the PR modifies R functions (e.g., new derived variable functions in `R/`), use `devtools::load_all()` instead of `library(cchsflow)` in integration tests so the PR's code is loaded rather than the installed package version. + +**Expected warnings:** `devtools::load_all()` on feature branches commonly produces NAMESPACE conflict warnings (e.g., `has_cchs_missing_codes`, `if_else2`) and "no such file" warnings for files that exist on other branches. These are expected and do not prevent tests from running. Do not flag these as issues. 
+ +#### Regression baseline + +To distinguish PR-introduced issues from pre-existing ones, fetch the target branch and use it as a baseline: + +```bash +git fetch origin +``` + +For every issue found in steps 5-6, check whether it also exists on the target branch: +- **Worksheet typos**: Compare the specific variable's rows between branches using Python csv module +- **L6 failures**: If `rec_with_table()` fails for a cycle, check whether the same cycle works on the target branch +- **Low prevalence**: Check whether the same pattern exists on the target branch — if so, it's pre-existing + +An issue that exists on the target branch is pre-existing (score 0) unless the PR makes it worse. An issue that exists on the target branch for *other* variables but was copied into the PR's variables is PR-introduced (score normally). + +#### Locate or create CEP + +Check if a CEP already exists for this domain/variable group. CEPs live in `ceps/cep-NNN-/`. + +**If a CEP exists:** +- Read its current state (`_workflow_state.yaml` if present) +- Note which L-stages are complete +- Focus the review on stages that are incomplete or need re-validation + +**If no CEP exists**, default to creating a **minimal review CEP** for PR reviews: + +``` +ceps/cep-NNN-/ + PR--review-summary.md # Findings and recommendations + integration-test-.R # rec_with_table() test script + -pumf-integration-test.csv # Test results + variable-availability.csv # Variable availability matrix +``` + +The user can interrupt to request a full CEP (with L0-L6 documents, QMDs, subgroup specs — see CEP-002 for the pattern) or to skip CEP generation entirely. + +**CEP numbering:** To avoid collisions with CEPs on other branches, scan for existing CEP numbers across all branches: + +```bash +git log --all --oneline -- 'ceps/' | head -20 +ls ceps/ 2>/dev/null +``` + +Use the next available number. Include the domain name (e.g., `cep-007-diet`) to disambiguate. 
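Picking the next CEP number from the scan above is easy to get wrong by eye when several branches carry CEPs. A minimal Python sketch, assuming the paths or directory names from the scan have been collected into a list of strings; `next_cep_number` and `cep_dirname` are hypothetical names, and "next available" is interpreted here as one more than the highest number seen.

```python
import re

# Matches a CEP directory component such as "cep-007-diet", whether it
# appears on its own or inside a longer path like "ceps/cep-007-diet/x.R".
CEP_DIR = re.compile(r"(?:^|/)cep-(\d+)-")


def next_cep_number(paths):
    """Return one more than the highest CEP number found in `paths`."""
    numbers = set()
    for path in paths:
        match = CEP_DIR.search(path)
        if match:
            numbers.add(int(match.group(1)))
    return max(numbers, default=0) + 1


def cep_dirname(number, domain):
    """Build a zero-padded directory name, e.g. cep-007-diet."""
    return f"cep-{number:03d}-{domain}"


seen = [
    "ceps/cep-002-smoking/cep-002-smoking.qmd",
    "cep-015-variable-tools",
    "ceps/cep-006-fvc/integration-test.R",
]
print(cep_dirname(next_cep_number(seen), "diet"))
```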
+ +### Step 4: L0-L2 documentation review + +For each in-scope variable, verify the documentation foundations. Read `.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md` for detailed L0-L2 templates. + +#### L0: Documentation assessment + +Verify source variables against CCHS documentation using the **cchsflow-docs** repository (`Big-Life-Lab/cchsflow-docs` on GitHub, cloned alongside cchsflow). This step confirms that variables claimed in `variableStart` and `databaseStart` actually exist in the CCHS data for those cycles. + +##### Primary source: cchs-metadata MCP server + +**Always use the cchs-metadata MCP as the primary tool for L0-L1 verification.** It provides the most complete and queryable metadata — 16,000+ variables across 251 datasets, enriched from PUMF RData, DDI XML, and ICES sources with full provenance tracking. + +**Key tools:** +- `mcp__cchs-metadata__search_variables(query)` — find variables by name or label substring +- `mcp__cchs-metadata__get_variable_detail(variable_name)` — full metadata including labels, question text, value codes, dataset history +- `mcp__cchs-metadata__get_variable_history(variable_name)` — which cycles/datasets contain the variable (essential for era boundary verification) +- `mcp__cchs-metadata__get_value_codes(variable_name)` — response categories with frequencies +- `mcp__cchs-metadata__compare_master_pumf(variable_name, cycle)` — compare PUMF vs Master metadata for a specific cycle (essential for PUMF/Master split decisions) +- `mcp__cchs-metadata__suggest_cchsflow_row(variable_name)` — draft a cchsflow harmonisation row +- `mcp__cchs-metadata__get_dataset_variables(dataset_id)` — list all variables in a specific dataset +- `mcp__cchs-metadata__get_source_conflicts(variable_name, dataset_id)` — find cross-source label disagreements (useful for catching metadata inconsistencies) +- `mcp__cchs-metadata__get_database_summary()` — database overview and statistics + +**Using MCP results:** +- The 
`cchsflow_name` field maps StatCan source variables to their cchsflow harmonized names — use this to verify that `variableStart` entries point to the correct source variable for each cycle +- Use `get_variable_history` to confirm a variable exists across claimed cycles and to identify era renames (e.g., SMK_09C → SMK_090 at the 2015 boundary) +- Use `compare_master_pumf` to verify whether PUMF and Master share the same source variable or need split rows + +**Caution:** The MCP `label_short`/`label_long` fields may be contaminated by cchsflow labels (see MCP error report from alcohol review). Always cross-check against `label_statcan` which comes from DDI primary sources. + +##### If the MCP is not available + +Check whether the MCP is loaded: +```bash +claude mcp list +``` + +If `cchs-metadata` is missing or shows "Failed to connect", the server needs to be configured. The MCP server (v0.3.0+) lives in the **cchsflow-docs** repository and is also available as a [GitHub release](https://github.com/Big-Life-Lab/cchsflow-docs/releases). + +**Quick setup** (if cchsflow-docs is cloned at `../cchsflow-docs/`): +```bash +cd ../cchsflow-docs/mcp-server && bash ../scripts/setup.sh +claude mcp add cchs-metadata -- python3 /Users/dmanuel/github/cchsflow-docs/mcp-server/server.py +``` + +**Manual setup:** +1. Ensure `cchsflow-docs` is cloned alongside cchsflow (typically `../cchsflow-docs/`) +2. Ensure `mcp-server/server.py` exists in cchsflow-docs +3. Ensure the database exists: `../cchsflow-docs/database/cchs_metadata.duckdb` (download from the [v0.3.0 release](https://github.com/Big-Life-Lab/cchsflow-docs/releases) or rebuild: `Rscript --vanilla ../cchsflow-docs/database/build_db.R`) +4. 
Add the MCP to Claude Code: + ```bash + claude mcp add cchs-metadata -- python3 /Users/dmanuel/github/cchsflow-docs/mcp-server/server.py + ``` + Or add to `~/.claude.json` (see `.mcp.json.example` in cchsflow-docs for a template): + ```json + "cchs-metadata": { + "type": "stdio", + "command": "python3", + "args": ["/Users/dmanuel/github/cchsflow-docs/mcp-server/server.py"], + "env": {"CCHS_DB_PATH": "/Users/dmanuel/github/cchsflow-docs/database/cchs_metadata.duckdb"} + } + ``` +5. Restart Claude Code for the MCP tools to appear in the tool list + +##### CLI fallback + +If the MCP server cannot be started but the database exists, use the standalone CLI (no FastMCP dependency — only `duckdb` required): +```bash +python3 ../cchsflow-docs/mcp-server/cli.py search smoking +python3 ../cchsflow-docs/mcp-server/cli.py detail SMKDSTY +python3 ../cchsflow-docs/mcp-server/cli.py history SMK_204 +python3 ../cchsflow-docs/mcp-server/cli.py conflicts --variable SMKDSTY +python3 ../cchsflow-docs/mcp-server/cli.py codes SMK_204 +``` + +All commands support `--json` for machine-readable output and `--db PATH` for custom database path. + +See the cchsflow-docs `CLAUDE.md` and `.claude/skills/cchs-database/SKILL.md` for database build workflow and schema details. + +##### Fallback: file-based lookups + +If the MCP is unavailable and cannot be restored, use these file-based sources in the cchsflow-docs repo (typically `../cchsflow-docs/`): + +1. **Extracted YAML data dictionaries** — structured variable definitions by cycle: + ``` + ../cchsflow-docs/cchs-extracted/data-dictionary/{year}/ + ``` + Coverage: 2000-2001 through 2023. + +2. **DDI XML files** — authoritative StatsCan PUMF documentation: + ``` + ../cchsflow-docs/cchs-pumf-docs/CCHS_DDI/ + ``` + +3. **CCHS variable dictionary CSV** — flat file for quick lookups: + ``` + ../cchsflow-docs/data/cchs_variable_dictionary.csv + ``` + +These are the raw sources that feed the MCP database. 
The MCP is strongly preferred because it cross-references all sources, deduplicates, and provides structured query tools rather than requiring manual grep/search across hundreds of files. + +##### What to verify + +For each in-scope variable: +1. **Existence**: Does the source variable name appear in the documentation for each claimed cycle? +2. **Category codes**: Do `recStart` values match the documented category definitions? +3. **Era renames**: For 2015+ cycles, confirm the renamed variable exists +4. **Cycle coverage up to latest available**: Check whether the variable exists in cycles beyond the PR's `databaseStart` (documentation covers up to 2023) — these may be candidates for expansion + +##### What to flag + +- Variable listed in `variableStart` but not found in documentation for that cycle → **P0** (wrong variable name) +- Variable not checked (no documentation available for that cycle) → note as untested +- Variable exists in additional cycles not included in `databaseStart` → informational (expansion opportunity) + +#### L1: Variable concordance + +Use the cchsflow-docs extracted data dictionaries to verify source variable names across eras: + +- Pre-2007: cycle letter in 4th position (A=2001, C=2003, E=2005) +- 2007-2014: standard naming +- Post-2014: check for 3-digit renames — search the 2015+ YAML files to confirm actual names +- 2022+: check for modular renames (e.g., CSS/SPU prefixes for smoking) + +For each era boundary, compare the variable name in `variableStart` against the corresponding cycle's YAML data dictionary in cchsflow-docs. PUMF and Master data dictionaries may differ — check both `_p` and `_m` YAML files where available. + +#### L2: Semantic mapping + +- Are category codes consistent across cycles? +- Are semantic breaks identified and documented? +- Do recoding rules handle all source categories? + +### Step 5: L3-L5 worksheet and testing checks + +Run these checks in parallel for the in-scope variables. 
Read `.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md` for detailed reference. + +#### Check 1: Era boundary defaults + +The most dangerous class of bug. For each variable: + +1. Parse the `databaseStart` field — does it span both 2007-2014 and 2015+ cycles? +2. Parse the `variableStart` field — do 2015+ databases have explicit `db::VAR` mappings? +3. If a `[VAR]` default exists and 2015+ databases lack explicit mappings, the default will apply the wrong variable name at runtime + +**Key 2015 renames to check:** +- Smoking categorical: SMK_06A → SMK_060, SMK_09A → SMK_080, SMK_10A → SMK_100 +- Smoking continuous: SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 +- Smoking derived: SMKDSTY → SMKDVSTY, SMKDSTP → SMKDVSTP +- PUMF grouped: SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 +- FVC: FVCDFRU → FVCDVFRU, FVCDSAL → FVCDVGRN, FVCDCAR → FVCDVORA, FVCDPOT → FVCDVPOT, FVCDVEG → FVCDVVEG, FVCDJUI → FVCDVJUI +- ADL: ADL_01-06 → ADL_005-030 (3-digit, 2015-2021), then → ADL_05-30 (2-digit, 2023+) + +**Key 2023 renames to check:** +- ADL: ADL_005 → ADL_05, ADL_010 → ADL_10, ADL_015 → ADL_15, ADL_020 → ADL_20, ADL_025 → ADL_25, ADL_030 → ADL_30. This is a new era boundary — `[ADL_005]` defaults will not work for 2023 databases. + +#### Check 2: databaseStart consistency + +For each variable: +1. Extract `databaseStart` from variables.csv +2. Extract all `databaseStart` entries from variable_details.csv for that variable +3. The variables.csv list must equal the union of all variable_details.csv lists +4. 
Flag any databases present in one file but not the other + +For each mismatch found, classify it: +- **PR-introduced**: The mismatch is new (not on target branch) — report as P1 +- **Pre-existing**: The mismatch exists on the target branch — document in pre-existing issues +- **`_p` in vd only**: PUMF databases in variable_details but not variables.csv is a known pattern for variables that span both pre-2015 and 2015+ eras (the pre-2015 block includes `_p` databases that the 2015+ block in variables.csv doesn't list). Note but do not flag as a bug. + +All mismatches must be explicitly listed in the review summary, even pre-existing ones. Do not silently omit consistency results. + +#### Check 3: PUMF vs Master naming + +For `_m` (master) databases: +- Pre-2007: cycle letter in source variable name (A=2001, C=2003, E=2005) +- 2007-2014: standard naming (no prefix letter) +- 2015+: check for renamed variables + +For `_p` (PUMF) databases: +- May use grouped/derived variable names (e.g., SMKG prefix, FVCD prefix) + +Verify that `_m` databases don't reference PUMF-only grouped variables, and vice versa. + +For variables where PUMF and Master use fundamentally different source types (categorical vs continuous), see `cchsflow-worksheets/docs/pumf-master-harmonization.md` for the required row-splitting pattern and common errors. + +#### Check 4: Pre-2007 cycle letters + +For variables with pre-2007 master cycles, verify the cycle letter: +- 2001 (`_m` or `_p`): letter A in the variable name (e.g., SMKA_203, FVCADFRU) +- 2003: letter C (e.g., SMKC_203, FVCCDFRU) +- 2005: letter E (e.g., SMKE_203, FVCEDFRU) + +The letter position varies by variable domain but follows a consistent pattern within each domain. + +#### Check 5: Known error patterns + +Scan for: +- `cchs20013_` — extra zero typo (should be `cchs2013_`) +- `chs20` without leading `c` — missing `c` typo (should be `cchs20`). 
This pattern has been found in ADL and FVC variables (e.g., `chs2011_2012_m` instead of `cchs2011_2012_m`). Check all database names match the `cchs` prefix. +- `_i` suffix databases — deprecated, should be `_m` +- `_s` suffix databases — deprecated, **always convert to `_m`** when found in reviewed variables. Check that a corresponding `_m` entry doesn't already exist (if it does, delete the `_s` row; if not, rename `_s` → `_m`). This applies even if the `_s` is pre-existing on the target branch — if the PR touches these rows, fix the suffix. **Naming convention**: `_s` share files are single-year extracts, so map to the single-year master form: `cchs2009_s` → `cchs2009_m` (not `cchs2009_2010_m`), `cchs2010_s` → `cchs2010_m`, `cchs2012_s` → `cchs2012_m`. Check `variables.csv` to confirm which `_m` form is expected. +- `[[VAR]]` — double brackets (invalid notation) +- `[VAR1, VAR2]` without `DerivedVar::` prefix — ambiguous multi-variable input + +**Pre-existing typo propagation:** Typo patterns often exist in the target branch for other variables and get copied into new variables through copy-paste. For each typo found, check whether the same pattern exists on the target branch for the same variables — if not, it was introduced by this PR even if the pattern exists elsewhere. + +#### Check 5b: dummyVariable naming conventions + +Verify that `dummyVariable` values follow the naming convention defined in `metadata_registry.yaml`. + +**Categorical variables** — regex: `^[a-zA-Z0-9_]+_cat[0-9]+(_[0-9]+|_NA[a-z])$` + +| Row type | Pattern | Example | +|----------|---------|---------| +| Valid category | `{variable}_cat{N}_{recEnd}` | `SMK_204_cat4_1`, `FVC_1A_cat5_3` | +| Missing (not applicable) | `{variable}_cat{N}_NAa` | `SMK_204_cat4_NAa` | +| Missing (don't know/refusal) | `{variable}_cat{N}_NAb` | `SMK_204_cat4_NAb` | + +**Continuous variables and Func rows** use `N/A` (no naming convention). + +**Key rules:** +1. 
**No colons in dummy names** — use `_NAa` and `_NAb`, not `_NA::a` or `_NA::b`. Colons are invalid in identifiers. +2. **Suffix must match recEnd** — the number after the last underscore should equal the `recEnd` value for that row. A mismatch (e.g., `_cat5_2` with `recEnd=1`) indicates a copy-paste error. +3. **N must match numValidCat** — the number after `_cat` should equal the `numValidCat` value for valid categories of that variable. +4. **Func rows use `N/A`** — derived variable rows (where `recEnd` starts with `Func::`) use `dummyVariable=N/A`. + +**What to flag:** +- `_NA::a` or `_NA::b` patterns (should be `_NAa` / `_NAb`) +- Suffix-recEnd mismatches (e.g., `_cat5_2` on a row with `recEnd=1`) +- Func rows with constructed dummy names instead of `N/A` +- Continuous rows with anything other than `N/A` + +#### Check 5c: Swapped recEnd values + +Check for rows where `recEnd` values appear to be swapped between adjacent rows. This is a **P0 data bug** — it produces incorrect values at runtime with no warning. + +**Detection pattern:** +1. For each variable, examine rows where `recStart` is a valid data range (e.g., `[1,120]`) and adjacent rows where `recStart` is a not-applicable code (e.g., `996`) +2. The valid data range should map to `recEnd=copy` (or the appropriate output value), not to `NA::a` +3. A not-applicable code should map to `NA::a` or `NA::b`, not to `copy` + +**Example (FVC_6D bug found in PR #148):** +``` +# WRONG — recEnd values swapped +recStart=[1,120] recEnd=NA::a ← valid data being set to missing! +recStart=996 recEnd=copy ← not-applicable code being copied as data! + +# CORRECT +recStart=[1,120] recEnd=copy ← valid data copied through +recStart=996 recEnd=NA::a ← not-applicable code set to missing +``` + +**When to check:** Always check continuous variables with `copy` and `NA::a`/`NA::b` recEnd values. Swapped values are especially likely for variables added via copy-paste from similar variables. 
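The detection pattern for swapped `recEnd` values can be sketched as a row-level scan. This is a minimal Python illustration under stated assumptions: rows are dictionaries with `variable`, `recStart`, and `recEnd` keys (as `csv.DictReader` would produce), and a "missing code" is assumed to be a run of 9s followed by a digit from 6 to 9 (96, 996, 997, 9999, and so on). That convention and the function names are illustrative and should be checked against the variable's documented codes.

```python
import re

# "[low,high]" interval notation used in recStart.
INTERVAL = re.compile(r"^\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]$")
# Assumed convention for not-applicable / don't-know / refusal codes.
MISSING_CODE = re.compile(r"^9+[6-9]$")


def is_missing_code(value):
    return bool(MISSING_CODE.match(value.strip()))


def find_swapped_rec_end(rows):
    """Flag rows where a data range maps to NA or a missing code maps to copy."""
    findings = []
    for index, row in enumerate(rows):
        start = row["recStart"].strip()
        end = row["recEnd"].strip()
        interval = INTERVAL.match(start)
        if interval:
            low, high = interval.groups()
            # An interval made only of missing codes, e.g. [997,999],
            # may legitimately map to NA::b, so it is not flagged.
            is_data_range = not (is_missing_code(low) and is_missing_code(high))
            if is_data_range and end.startswith("NA::"):
                findings.append((index, row["variable"], "data range mapped to " + end))
        elif is_missing_code(start) and end == "copy":
            findings.append((index, row["variable"], "missing code copied as data"))
    return findings


rows = [
    {"variable": "FVC_6D", "recStart": "[1,120]", "recEnd": "NA::a"},
    {"variable": "FVC_6D", "recStart": "996", "recEnd": "copy"},
]
for finding in find_swapped_rec_end(rows):
    print(finding)
```

Each row is checked independently rather than in adjacent pairs, so a single mis-mapped row is still reported even when its counterpart is correct.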
+ +#### Check 5d: Label and metadata consistency + +Scan for common metadata quality issues in modified variables: + +1. **Double spaces** — check `label`, `labelLong`, `catLabel`, `catLabelLong`, `variableStartShortLabel`, and `variableStartLabel` for consecutive spaces +2. **Spelling errors in labels** — common typos: "consumptoin" (consumption), "freqeuncy" (frequency), "repondent" (respondent) +3. **Trailing punctuation in labelLong** — trailing dashes or incomplete labels (e.g., `"Daily consumption - fruit - (D)"` should be `"Daily consumption - fruit (D)"`) +4. **Missing descriptions** — derived daily frequency variables (FVCD*) and other derived variables should have `description` fields +5. **catLabel propagation** — when a label is fixed in `catLabel`, check that the same fix applies to `catLabelLong`, `variableStartShortLabel`, and `variableStartLabel` where those fields share the same text + +These are P2 issues (metadata quality) but are cheap to fix during review and prevent accumulation of inconsistencies. + +#### DV function naming convention (v3) + +New or refactored DV functions should use tidyverse-style verb-first names. The `_fun` suffix is legacy and being phased out as functions are refactored. + +| Verb | Purpose | Example | +|------|---------|---------| +| `calculate_*()` | Mathematical computation | `calculate_pct_time()`, `calculate_bmi()` | +| `categorize_*()` | Classification into groups | `categorize_pct_time()`, `categorize_bmi()` | +| `assess_*()` | Health risk evaluation | `assess_drinking_risk()` | +| `score_*()` | Scoring systems | `score_adl()` | +| `adjust_*()` | Data correction | `adjust_bmi()` | + +Legacy functions (e.g., `bmi_fun()`, `pack_years_fun()`) retain old names until refactored. Worksheets reference functions via `Func::` prefix (e.g., `Func::calculate_pct_time`). + +#### Check 6: L4 — derived variable specification review + +If the in-scope variables include derived variables (functions in `R/`): + +1. 
**Input consistency**: Read the DV function (e.g., `calculate_pct_time()` in `R/percent-time-canada.R`) and verify that the input variable names it expects match those listed in `variable_details.csv` for the derived variable +2. **Category coverage**: Verify the function handles all category values that the worksheet's `recStart` maps to — no unhandled cases that would silently produce NA +3. **Output consistency**: Verify the function's return values match the `recEnd` values in the worksheet +4. **Output bounds validation**: For continuous DVs, check whether the function validates output range. Values outside the valid domain (e.g., percentage >100 or <0) indicate inconsistent inputs and should return `tagged_na("b")`. The valid range should be documented in the `notes` field of the Func row in variable_details (documentation only for now, ready for future validation framework). If the DV lacks bounds checking, flag as P1. +5. **Documentation**: Check roxygen docs match the actual function signature + +#### Check 7: Unit tests (L5) + +If the PR includes or modifies test files in `tests/testthat/`: +- Verify category coverage (all output categories have test cases) +- Check edge cases (missing data, boundary values) +- Verify cross-cycle consistency + +If the PR lacks tests for new derived variables, flag this. + +### Step 6: L6 implementation validation + +**This is the highest-priority check.** Run `rec_with_table()` against actual PUMF data. This is not just a pass/fail test — the output is an analytical tool. By examining prevalence and distributions across cycles and categories, reviewers can identify harmonization problems that worksheet checks alone cannot catch, such as a sudden step change in prevalence at an era boundary (e.g., 2014 → 2015) that signals a naming mismatch or category recode error. + +#### Scope and limitations + +**PUMF data only.** L6 can currently test only `_p` databases.
The `data/` directory contains PUMF RData files (`cchs2001_p.RData` through `cchs2017_2018_p.RData`). Master (`_m`) data is in a secure environment where LLMs cannot run. + +For master-only changes (e.g., a PR that only adds `_m` cycles), L6 cannot validate at runtime. In this case: +- Rely on L3-L5 worksheet checks (especially era boundary and naming checks) +- Generate the integration test R script anyway and save it to the CEP — the user or a colleague can run it in the secure environment +- Note the limitation explicitly in the review output + +**Future:** Mock data from the `mockdata` repo will enable L6 testing for all database types. + +#### Data locations + +PUMF RData files are in `data/`: +- `cchs2001_p.RData` through `cchs2017_2018_p.RData` + +Each file loads a data frame named after the cycle (e.g., `cchs2001_p`). + +#### Integration test script + +Generate and run a fully executable R script for the in-scope variables — no placeholders. Extract the actual variable names and cycle list from the worksheets. Save the script to the CEP directory so reviewers can re-run it. + +The script should: +1. Read `variable_details.csv` to extract the `_p` databases from `databaseStart` for each in-scope variable +2. Load cchsflow from the PR branch (use `devtools::load_all()` if R functions were modified, otherwise `library(cchsflow)`) +3. For each cycle, run `rec_with_table()` and collect results +4. Print cross-cycle prevalence summary +5. 
Save results CSV + +Pattern based on CEP-006: + +```r +# devtools::load_all() # Use if PR modifies R/ functions +library(cchsflow) +library(dplyr) + +# Load worksheet from the branch under review +variable_details <- read.csv("inst/extdata/variable_details.csv", + stringsAsFactors = FALSE) + +# Extract PUMF cycles from databaseStart for the in-scope variables +# (agent: replace with actual variable names and cycles from the worksheet) +variables_to_test <- c("FVCDFRU", "FVCDSAL", "FVCDPOT") +cycles <- c("cchs2001_p", "cchs2003_p", "cchs2005_p", + "cchs2007_2008_p", "cchs2009_2010_p", "cchs2011_2012_p", + "cchs2013_2014_p", "cchs2015_2016_p", "cchs2017_2018_p") + +results <- data.frame() + +for (cycle in cycles) { + rdata_file <- file.path("data", paste0(cycle, ".RData")) + if (!file.exists(rdata_file)) { + cat("SKIP", cycle, "- file not found\n") + next + } + + load(rdata_file) + df <- get(cycle) + + result <- tryCatch({ + rec_with_table( + data = df, + variables = variables_to_test, + database_name = cycle, + variable_details = variable_details, + log = FALSE + ) + }, error = function(e) { + cat("ERROR in", cycle, ":", e$message, "\n") + NULL + }) + + if (!is.null(result)) { + n <- nrow(result) + for (v in setdiff(names(result), "ADM_RNO")) { + valid <- sum(!is.na(result[[v]])) + cat(cycle, v, ": valid =", valid, "/", n, + "(", round(100 * valid / n, 1), "%)\n") + + # Category distribution (for categorical variables) + freq <- table(result[[v]], useNA = "ifany") + print(freq) + + results <- rbind(results, data.frame( + cycle = cycle, variable = v, + n = n, valid = valid, + valid_pct = round(100 * valid / n, 1), + stringsAsFactors = FALSE + )) + } + } + + rm(list = cycle) # free memory +} + +# Cross-cycle prevalence summary +cat("\n=== CROSS-CYCLE SUMMARY ===\n") +for (v in unique(results$variable)) { + cat("\n", v, ":\n") + sub <- results[results$variable == v, ] + print(sub[, c("cycle", "n", "valid", "valid_pct")], row.names = FALSE) +} + +# Save results 
+write.csv(results, "ceps/cep-NNN-domain/vars-pumf-integration-test.csv", + row.names = FALSE) +``` + +#### Cross-cycle prevalence QMD + +After generating the integration test CSV, create a Quarto document (`.qmd`) that visualises the cross-cycle results. This is a standard CEP artifact — visual inspection of prevalence trends is the most effective way to detect era boundary problems. + +The QMD should include: +1. **Cross-cycle valid % line plot** for each key variable (or a representative subset), with cycles on the x-axis and valid % on the y-axis. Add vertical reference lines at era boundaries (2007, 2015). +2. **Category distribution plot** for categorical derived variables (e.g., stacked bar chart of diet_score_cat3 across cycles). +3. **Annotations** for known data patterns — e.g., optional content cycles where low prevalence is expected, documented in the R function's roxygen or CCHS documentation. +4. **Brief narrative** interpreting the plots: are transitions clean? Any unexpected step changes? + +Use base R graphics (`plot()`, `barplot()`) to avoid extra dependencies. The QMD should be self-contained — load the results CSV, not rerun the integration test. 
+ +Pattern: + +```yaml +--- +title: "CEP-NNN: Cross-cycle prevalence" +format: + html: + toc: true + code-fold: true +--- +``` + +```r +results <- read.csv("domain-pumf-integration-test.csv") + +# Extract year from cycle name for x-axis +results$year <- as.numeric(gsub("cchs(\\d{4}).*", "\\1", results$cycle)) + +# Plot valid % by cycle for a key variable +var_data <- results[results$variable == "KEY_VAR", ] +plot(var_data$year, var_data$valid_pct, type = "b", pch = 19, + xlab = "CCHS cycle", ylab = "Valid %", + main = "KEY_VAR: cross-cycle prevalence") +abline(v = c(2007, 2015), lty = 2, col = "grey50") +``` + +Save the QMD to the CEP directory alongside the other artifacts: + +``` +ceps/cep-NNN-/ + cep-NNN-.qmd # Cross-cycle prevalence plots + PR--review-summary.md + integration-test-.R + -pumf-integration-test.csv +``` + +#### Cross-cycle prevalence analysis + +The cross-cycle summary is the most important output. Review the `valid_pct` column for each variable across cycles and look for: + +1. **Step changes at era boundaries** — a sudden jump or drop in prevalence between 2005 → 2007 (pre-2007 to standard era) or 2014 → 2015 (standard to post-2014 era) suggests a naming mismatch or incorrect `[VAR]` default +2. **Unexpected zeros** — a cycle showing 0% valid when the variable should be available indicates a wrong source variable name or missing `db::VAR` mapping +3. **Exposure distribution shifts** — the key harmonization question is whether typical exposures remain stable across cycles. For continuous variables (e.g., daily fruit/veg consumption), check whether the proportion at clinically meaningful thresholds (e.g., 0 servings, >5 servings/day) shifts at era boundaries. For categorical variables, compare `table()` output across cycles. A sudden distribution change at 2015 that doesn't track the gradual secular trend suggests a mapping or recoding error, not a real population change. +4. 
**Derived variable completeness** — if a derived variable has lower valid % than its inputs, the DV function may be dropping valid cases + +**Optional content cycles:** Some CCHS modules are optional content in certain cycles — provinces opt in, so prevalence drops sharply. Before flagging low prevalence as an issue, check the R function's roxygen documentation and CCHS documentation for known optional content cycles. For example, FVC (fruit and vegetable consumption) was optional in 2005 and 2017-2018, producing ~56% and ~1% valid respectively — these are expected, not errors. + +Cross-cycle trends require human judgement. The skill should produce a clear summary table and flag any obvious discontinuities, but the reviewer interprets the results using their domain knowledge. In future, threshold-based alerts may be added. + +Example of a step change indicating a problem: +``` + cycle valid_pct + cchs2009_2010_p 34.1 <- normal + cchs2011_2012_p 14.7 <- lower (optional content) + cchs2013_2014_p 28.9 <- normal + cchs2015_2016_p 0.0 <- PROBLEM: variable renamed but mapping missing + cchs2017_2018_p 0.0 <- same problem +``` + +#### Derived variable testing + +If the in-scope variables include derived variables (functions in `R/`): + +1. Identify the DV function (e.g., `diet_score_fun()` in `R/diet.R`) +2. Check that all input variables are available in the test cycles +3. Run `rec_with_table()` with the derived variable to verify the full pipeline +4. Compare the derived variable's valid % against its input variables — the DV should not have materially higher valid % than its least-available input +5. For categorical derived variables and key continuous inputs, examine the **exposure distribution** across cycles — not just valid counts. The central harmonization question is whether typical exposures (e.g., proportion with 0 fruit/veg, or >5 servings/day) remain stable across cycles. 
A sudden shift in the distribution at an era boundary signals a recoding or mapping error even when valid % is unchanged. Include these distributions in both the integration test output and the QMD visualisation + +#### What to report from L6 + +For each cycle tested: +- **N**: Total respondents +- **Valid count and %**: Non-NA values for each variable +- **Category distribution**: `table()` output for categorical variables +- **Errors**: Any `rec_with_table()` failures with error messages + +Flag: +- **Step changes at era boundaries** (most important — signals naming/mapping errors) +- Cycles where valid % is 0 (variable may not exist despite being listed) +- Cycles where category distributions shift unexpectedly +- Derived variable failures or unexplained completeness gaps + +### Step 7: Confidence scoring + +#### Re-confirm findings before scoring + +Before finalising the review summary, **re-confirm each P0/P1 finding** by reading the specific cell directly from the current branch's `inst/extdata/` file using Python csv. Do not rely on earlier script output or cached copies (e.g., `/tmp/vd_pr.csv`). A finding that cannot be reproduced on a fresh read of the branch should be downgraded to 0. This step catches false positives caused by stale data in intermediate files. 
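The fresh-read re-confirmation described above can be sketched with the standard `csv` module. This is a minimal illustration: `read_cells` and `confirm_finding` are hypothetical helper names, and the sketch assumes the worksheet has a `variable` column plus whatever columns are passed as filters.

```python
import csv


def read_cells(csv_path, variable, column, **match):
    """Read `column` for every row of `variable` that matches the filters.

    The file is opened fresh on each call, so nothing is cached.
    utf-8-sig tolerates a leading BOM if one is present.
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return [
            row[column]
            for row in csv.DictReader(handle)
            if row["variable"] == variable
            and all(row.get(key) == value for key, value in match.items())
        ]


def confirm_finding(csv_path, variable, column, reported_value, **match):
    """True only if the reported cell value is still present on disk."""
    return reported_value in read_cells(csv_path, variable, column, **match)
```

A call such as `confirm_finding("inst/extdata/variable_details.csv", "FVC_6D", "recEnd", "NA::a", recStart="[1,120]")` would return `False` once the row has been corrected, which is the signal to downgrade the finding to 0.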
+ +#### Scoring scale + +For each issue found, score confidence 0-100: + +- **0**: False positive — doesn't stand up to scrutiny, or pre-existing issue (also present on target branch) +- **25**: Might be real but could be false positive; stylistic issue not in project docs +- **50**: Verified real issue but minor/nitpick; not very important relative to the PR +- **75**: Verified real issue that will impact functionality or is called out in project docs +- **100**: Definitely a real issue confirmed by evidence + +**L6-specific scoring guidance:** +- `rec_with_table()` error (function fails) → **100** (confirmed breakage) +- 0% valid for a cycle that should have data → **100** (confirmed by PUMF data) +- Step change at era boundary → **90-100** depending on magnitude (confirmed by cross-cycle trend) +- Category distribution shift → **75** (requires domain interpretation, but flagged by data) +- L6 limitation (master-only, no runtime test available) → do not score; note as untestable + +Filter out issues scoring below 80. + +### Step 8: Report results + +#### Save CEP artifacts + +Save the integration test script, results, and QMD to the CEP directory: + +``` +ceps/cep-NNN-/ + cep-NNN-.qmd # Cross-cycle prevalence plots and narrative + PR--review-summary.md + integration-test-.R + -pumf-integration-test.csv +``` + +#### Commit and push CEP artifacts + +After saving artifacts, **commit and push them to the PR branch** so other reviewers can access them. CEP artifacts referenced in PR comments must exist on the branch — local-only files create dead references. + +```bash +git add ceps/cep-NNN-/ +# Exclude rendered output (.html, *_files/, .quarto/) — only commit source files +git commit -m "Add CEP-NNN review artifacts for PR #XXX" +git push origin +``` + +If working on a different branch than the PR, push to the PR branch or note in the PR comment where the artifacts live. 
+ +#### Post PR comment (PR reviews) + +Post a comment on the PR using `gh pr comment`: + +```markdown +### Code review + +Reviewed [N variables] for [PUMF/Master/both] across [cycle range]. + +#### L6 integration test: cross-cycle prevalence + +Ran `rec_with_table()` against PUMF data for each cycle: + +| Cycle | N | VAR1 valid % | VAR2 valid % | ... | +|-------|---|-------------|-------------|-----| +| cchs2001_p | 130,880 | 35.7% | ... | ... | +| cchs2003_p | 134,072 | 58.6% | ... | ... | +| ... | ... | ... | ... | ... | + +[Note any step changes, zeros, or unexpected patterns here] + +[If master-only changes were not testable, note: "Master (_m) mappings validated by worksheet checks only — no runtime data available for L6 testing."] + +#### Issues found + +[N issues or "No issues found"] + +1. (, ) + + +CEP: `ceps/cep-NNN-/` + +Generated with [Claude Code](https://claude.ai/code) +``` + +If no issues survive filtering: + +```markdown +### Code review + +Reviewed [N variables] for [PUMF/Master/both] across [cycle range]. No issues found. + +L6 integration test: `rec_with_table()` ran successfully for all PUMF cycles. + +Checked: era boundary defaults, databaseStart consistency, naming conventions, DV specifications, known error patterns, and PUMF integration. + +CEP: `ceps/cep-NNN-/` + +Generated with [Claude Code](https://claude.ai/code) +``` + +#### Self-review reporting + +For self-review, report findings directly to the user without posting a PR comment. Still save CEP artifacts if CEP generation was not skipped. + +### Step 9: Run CSV validation tools + +Before proposing fixes, run the automated CSV validation tools to catch formatting and schema issues that the manual checks may have missed. 
+ +#### Available tools + +**`check_worksheet()` / `fix_worksheet()`** (on `v3-smoking` and later branches): + +```bash +# Check for formatting violations (column order, line endings, row sorting, quoting) +Rscript exec/check-worksheets.R + +# Auto-fix formatting violations +Rscript exec/fix-worksheets.R +``` + +These are enforced by the `check-csv.yml` GitHub Action on PRs that modify `inst/extdata/variables.csv` or `variable_details.csv`. The GHA runs `check-worksheets.R` and fails if violations are found. + +**`standardise_csv()`** (on `feature/csv-standardisation-updates` branch): + +```r +# Basic mode — fix git conflicts (BOM, line endings, column order) +standardise_csv("inst/extdata/variables.csv") + +# Collaboration mode — enhanced schema validation +standardise_csv("inst/extdata/variable_details.csv", collaboration = TRUE, validate_only = TRUE) +``` + +Collaboration mode validates fields against `metadata_registry.yaml` regex patterns including `dummyVariable`, `variableStart`, `recStart`, and `recEnd`. It also checks for missing categorical dummy variables and cross-field rules. 
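
To illustrate what a regex-based field check looks like, here is a small sketch for `variableStart`. The patterns are assumptions: they cover only the three shapes named in this document (`db::VAR`, `[VAR]`, and `DerivedVar::[feeder1, feeder2]`) and are not the patterns stored in `metadata_registry.yaml`, which may accept more forms. The function name `malformed_variable_start` is hypothetical.

```python
import re

# Illustrative patterns only; the registry's real patterns may differ.
DERIVED = re.compile(r"^DerivedVar::\[\s*\w+(?:\s*,\s*\w+)*\s*\]$")
EXPLICIT = re.compile(r"^cchs\w+::\w+$")   # e.g. cchs2001_p::SMKA_204
DEFAULT = re.compile(r"^\[\w+\]$")         # e.g. [SMK_204]


def malformed_variable_start(value):
    """Return the entries of a variableStart field that match no known shape."""
    value = value.strip()
    # A DerivedVar field contains commas inside its brackets, so it is
    # validated as a whole instead of being split.
    if value.startswith("DerivedVar::"):
        return [] if DERIVED.match(value) else [value]
    entries = [entry.strip() for entry in value.split(",")]
    return [e for e in entries if not (EXPLICIT.match(e) or DEFAULT.match(e))]
```

An empty list means every entry has a recognised shape. It does not mean the referenced variable exists in that cycle, which is what the source reference checks are for.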
+ +#### When to run + +- **Always** run `check-worksheets.R` (or `standardise_csv()` if available) before proposing fixes, to ensure proposed changes don't introduce new formatting violations +- **After applying fixes**, run validation again to confirm the fix didn't break formatting +- If the PR's branch has `check-csv.yml` GHA, check whether CI passed — if not, the formatting issues may need to be fixed before the review's substantive issues + +#### Branch availability + +| Tool | Branches | +|------|----------| +| `check_worksheet()` / `fix_worksheet()` | `v3-smoking`, `feature/v3.0.0-validation-infrastructure`, and later | +| `standardise_csv()` with collaboration mode | `feature/csv-standardisation-updates` and later | +| `check-csv.yml` GHA | `v3-smoking` and later | + +If the PR's branch doesn't have these tools, run validation from a branch that does by checking out only the worksheet files: + +```bash +# Validate worksheets from a branch that has the tools +git stash +git checkout v3-smoking -- exec/check-worksheets.R R/check-worksheet.R R/fix-worksheet.R +Rscript exec/check-worksheets.R +git checkout -- exec/ R/check-worksheet.R R/fix-worksheet.R +git stash pop +``` + +### Step 10: Propose worksheet fixes (if issues found) + +If the review identified worksheet errors (typos, missing mappings, incorrect database names), propose fixes to the user rather than silently modifying the worksheets. + +#### Workflow + +1. **Summarize the proposed changes** — list each fix with the affected variable(s), the current (incorrect) value, and the corrected value. For example: + + ``` + Proposed worksheet fixes: + + 1. FVC_1A through FVC_6E (30 variables): Replace `chs2011_2012_m` with + `cchs2011_2012_m` and `chs2013_2014_m` with `cchs2013_2014_m` in both + variables.csv and variable_details.csv + + 2. FVCDPOT: Replace `cchs20013_2014_m` with `cchs2013_2014_m` in + variable_details.csv (extra zero) + ``` + +2. 
**Wait for user approval** — the user decides whether to apply the fixes now, defer them, or handle them differently (e.g., as a follow-up PR, or let the PR author fix them). + +3. **Apply fixes using R or Python** — never use bash text tools on CSV files. Use R's `read.csv()`/`write.csv()` or Python's csv module to make targeted edits while preserving the file's existing formatting and quoting conventions. + + **CRITICAL: Scope fixes to in-scope variables only.** When applying replacements (e.g., `_s` → `_m`, typo corrections), filter to only the rows belonging to the PR's in-scope variables. Never apply global `gsub()` or `str_replace_all()` across the entire dataframe — this will modify hundreds of unrelated variables. Always subset first: + ```r + alc_idx <- which(vd$variable %in% in_scope_vars) + for (i in alc_idx) { + vd$databaseStart[i] <- gsub("cchs2009_s", "cchs2009_m", vd$databaseStart[i]) + } + ``` + +4. **Save fixes to a temporary file** — per project conventions (CLAUDE.local.md), write proposed changes to `/tmp/` for user review before editing the main worksheet files directly. The user or PR author integrates the changes. + +5. **Verify idempotency** — always read from `inst/extdata/` (the clean source), never from previously modified `/tmp/` files. After running a modification script, re-run it to confirm the output is identical. If the script detects its own changes on the second run (e.g., skips "already has 2021"), the idempotency check passed. + +6. **Offer visual diff review** — before applying changes to `inst/extdata/`, pause and ask the user whether they want to review the diff in a visual diff tool (e.g., Beyond Compare, Kaleidoscope, VS Code diff). This is especially valuable for large worksheet changes where the programmatic summary may miss formatting issues (e.g., Python csv re-quoting all fields, creating a noisy diff that obscures the real changes). 
+ + **For PR reviews**: Use the **merge base** as the comparison baseline, not the target branch tip. This ensures the diff shows only what the PR branch changed, excluding divergence on the target branch since the PR was created. This is especially important for full-file rewrites where comparing against the target tip shows noise from unrelated target-side changes. + + ```bash + # Find the merge base between the PR branch and target + MERGE_BASE=$(git merge-base origin/ ) + + # Extract the file at the merge base + git show ${MERGE_BASE}:inst/extdata/variable_details.csv > /tmp/vd_mergebase.csv + git show ${MERGE_BASE}:inst/extdata/variables.csv > /tmp/vars_mergebase.csv + + # Compare merge base vs current PR branch (shows only PR changes) + bcompare /tmp/vd_mergebase.csv inst/extdata/variable_details.csv + bcompare /tmp/vars_mergebase.csv inst/extdata/variables.csv + ``` + + **For self-review / proposed fixes**: Compare the current working copy against the proposed modifications in `/tmp/`: + + ```bash + bcompare inst/extdata/variable_details.csv /tmp/variable_details_updated.csv + bcompare inst/extdata/variables.csv /tmp/variables_updated.csv + ``` + + If the user doesn't have a visual diff tool configured, offer to help set one up. Common options: + - **Beyond Compare**: `brew install --cask beyond-compare` — configure as git difftool with `git config --global diff.tool bc` and `git config --global difftool.bc.path /usr/local/bin/bcompare` + - **VS Code**: `code --diff ` + - **Kaleidoscope**: `ksdiff ` + - **FileMerge** (macOS built-in): `opendiff ` + + **Why merge-base matters:** In the GEN_10 PR (#169) review, comparing against the target tip showed 23 extra DHHGAGE_E rows and SDCDCGT changes that were on the target branch, not the PR. This noise obscured the actual PR changes. Using merge-base revealed only the GEN_07 and GEN_10 rows — the true scope. 
Similarly, in the diet PR (#148) review, Python's csv writer re-quoted every field, producing a noisy git diff. A visual diff tool with merge-base comparison would have caught both issues immediately. + +#### When not to fix + +- Pre-existing issues on the target branch that are outside the PR's scope — note them in the review but do not propose fixes as part of this PR +- **Exception: `_s` suffix databases** — always fix `_s` → `_m` when encountered in reviewed variables, even if pre-existing. Deprecated suffixes should not persist in the worksheets. +- Issues that require domain judgement (e.g., whether a variable should use a different source name) — flag for human review +- Changes to R functions — these require separate code review and testing + +### Scope expansion during review + +If the review identifies expansion opportunities (e.g., additional cycles available in cchsflow-docs that are not yet in the worksheets) and the user requests adding them, the review transitions into authoring: + +1. **Enter plan mode** to design the worksheet changes. The plan should cover which variables, databases, and variableStart mappings need updating. +2. **Write a modification script** (Python csv module) that reads from `inst/extdata/`, applies all changes, and writes to `/tmp/` for user review. The script should handle both the expansion and any typo fixes from the review. +3. **Run verification** — check databaseStart consistency, era boundary correctness, and variableStart mappings in the `/tmp/` output files. +4. **Present changes to the user** with a clear summary of what was modified before applying to `inst/extdata/`. +5. **Update the CEP** to document the expansion (new cycles, era boundaries, naming changes). +6. **Re-run CSV validation** (Step 9) on the expanded worksheets. + +The key constraint: all changes go through `/tmp/` for review before touching `inst/extdata/`. 
The review skill delegates to the worksheets skill for authoring decisions (era naming conventions, variableStart patterns). + +### Step 11: Retrospective — review the skill + +After the PR comment is posted (or findings reported for self-review), take a moment to reflect on the review process while the work is still in context. This step is easy to skip but valuable for continuous improvement. + +1. **What worked well?** Which checks caught real issues? Which were most efficient? +2. **What was slow or failed?** R script execution problems, false positives that wasted time, checks that didn't apply? +3. **What patterns emerged?** New typo patterns, domain-specific naming conventions, recurring copy-paste errors? +4. **Should the skill be updated?** New known error patterns, improved check logic, better operational practices (e.g., "always write R scripts to files, not inline")? +5. **What carries forward?** Pre-existing issues noted but not fixed, refactoring opportunities flagged, expansion opportunities identified? + +Summarise the retrospective to the user. If skill updates are warranted, propose specific edits. If operational lessons were learned, consider updating project memory. 
+ +## Reference + +- L0-L6 workflow: `.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md` +- Era mapping tables: `.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md` +- Schema definitions: `inst/metadata/schemas/core/variables.yaml`, `inst/metadata/schemas/core/variable_details.yaml` +- Regex patterns and naming conventions: `inst/metadata/documentation/metadata_registry.yaml` +- CSV formatting check/fix: `exec/check-worksheets.R`, `exec/fix-worksheets.R` (uses `R/check-worksheet.R`, `R/fix-worksheet.R`) +- CSV standardisation with schema validation: `R/csv-utils.R` (`standardise_csv()`), `R/schema-validation.R` (`validate_csv_against_schema()`) +- Validation constants: `R/validation-constants.R` +- GHA workflow for CSV checks: `.github/workflows/check-csv.yml` +- Example CEP (full): `ceps/cep-002-smoking/` (smoking harmonization) +- Example CEP (review): `ceps/cep-006-oral-health/` (DEN_132 PR review with integration tests) +- PUMF data: `data/cchs*_p.RData` From a47681032976fe86c1139db032494de497f9b2ee Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Thu, 12 Mar 2026 08:08:52 -0400 Subject: [PATCH 03/15] feat(skills): Add validation checks, era boundary docs, and PUMF availability guidance cchsflow-review: - Add pre-2007 explicit mapping check (Check 7) - Add DerivedVar mixed _p/_m detection (Check 8) - Update era boundary section: concept-first with CCHS naming eras table - Add DerivedVar feeder check under L6 cchsflow-validation: - New skill with checks 1-8 including severity ratings cchsflow-worksheets: - Add PUMF availability by cycle table (2001-2023) - Document cchs2021_p as invalid database name - Add DerivedVar row splitting guidance and age feeder split table --- .claude/skills/cchsflow-review/SKILL.md | 55 +++- .claude/skills/cchsflow-validation/SKILL.md | 286 ++++++++++++++++++++ .claude/skills/cchsflow-worksheets/SKILL.md | 107 ++++++++ 3 files changed, 438 insertions(+), 10 deletions(-) create mode 100644 
.claude/skills/cchsflow-validation/SKILL.md create mode 100644 .claude/skills/cchsflow-worksheets/SKILL.md diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index 99c3f9d2..e58e6904 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -290,6 +290,7 @@ For each in-scope variable: - Variable listed in `variableStart` but not found in documentation for that cycle → **P0** (wrong variable name) - Variable not checked (no documentation available for that cycle) → note as untested - Variable exists in additional cycles not included in `databaseStart` → informational (expansion opportunity) +- **Pre-2007 databases in `databaseStart` without explicit `db::VAR` mappings in `variableStart`** → **P1** (wrong source variable at runtime via `[VAR]` default). Pre-2007 variable names require a cycle letter in position 4 (A=2001, C=2003, E=2005) — the `[VAR]` default will look up the 2007-2014 name, which does not exist in pre-2007 datasets. Always verify every pre-2007 database has an explicit mapping. #### L1: Variable concordance @@ -314,23 +315,35 @@ Run these checks in parallel for the in-scope variables. Read `.claude/skills/cc #### Check 1: Era boundary defaults -The most dangerous class of bug. For each variable: +The most dangerous class of bug. The `[VAR]` default in `variableStart` resolves to the base variable name at runtime for any database not explicitly mapped. This is only correct for databases in the **same naming era** as the base name. Whenever `databaseStart` spans an era boundary, all databases in the **other** era must have explicit `db::VAR` mappings. -1. Parse the `databaseStart` field — does it span both 2007-2014 and 2015+ cycles? -2. Parse the `variableStart` field — do 2015+ databases have explicit `db::VAR` mappings? -3. 
If a `[VAR]` default exists and 2015+ databases lack explicit mappings, the default will apply the wrong variable name at runtime +The general rule: **for every era boundary crossed by `databaseStart`, verify that all databases on the far side have explicit mappings.** -**Key 2015 renames to check:** +For each variable: +1. Parse the `databaseStart` field and identify which era boundaries it crosses +2. Parse the `variableStart` field — do databases on the far side of each boundary have explicit `db::VAR` mappings? +3. If a `[VAR]` default exists and cross-era databases lack explicit mappings, the default will silently apply the wrong variable name at runtime + +**CCHS naming era boundaries:** + +| Boundary | Direction | Pattern | Example | +|----------|-----------|---------|---------| +| Pre-2007 → 2007 | Pre-2007 needs cycle letter | `SMK_204` → `SMKA_204` (2001), `SMKC_204` (2003), `SMKE_204` (2005) | Any variable with 2001/2003/2005 in databaseStart | +| 2007–2014 → 2015 | 3-digit rename | `SMK_06A` → `SMK_060`, `SMK_06C` → `SMK_070` | Smoking, FVC, ADL | +| 2021 → 2022 | CSS/SPU module restructure | Smoking cessation/history split into new modules | SPU_25A/B replace SMK_09A/C for cessation timing | +| 2021 → 2023 | ADL digit reduction | `ADL_005` → `ADL_05` | ADL variables only | + +The pre-2007 and 2015+ boundaries are the most common sources of bugs — they affect almost all smoking, FVC, and ADL variables. Always use `get_variable_history()` from the cchs-metadata MCP to confirm the exact boundary for the variable under review. 
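
The three steps above can be sketched as a small check. This is an illustration under stated assumptions, not the skill's implementation: `era_of` and `default_era_conflicts` are hypothetical names, the era is inferred from the first year in the database name, and only the two common boundaries (pre-2007 and 2015) are modelled. The 2022 and 2023 boundaries in the table affect specific modules and would need variable-specific handling.

```python
import re


def era_of(database):
    """Map a database name such as 'cchs2015_2016_p' to a coarse naming era."""
    match = re.match(r"cchs(\d{4})", database)
    if match is None:
        raise ValueError(f"unrecognised database name: {database!r}")
    year = int(match.group(1))
    if year < 2007:
        return "pre-2007"
    if year < 2015:
        return "2007-2014"
    return "2015+"


def split_list(field):
    """Split a comma-separated worksheet field into trimmed, non-empty items."""
    return [item.strip() for item in field.split(",") if item.strip()]


def default_era_conflicts(database_start, variable_start):
    """Group the databases that fall back to the [VAR] default by era.

    A single base name can be correct for one era only, so a non-empty
    result (two or more eras) means some database resolves to a wrong name.
    """
    if variable_start.strip().startswith("DerivedVar::"):
        return {}  # derived rows list feeders, not source-name mappings
    entries = split_list(variable_start)
    if not any(e.startswith("[") and e.endswith("]") for e in entries):
        return {}  # no default present, every database is mapped explicitly
    explicit = {e.split("::", 1)[0] for e in entries if "::" in e}
    by_era = {}
    for database in split_list(database_start):
        if database not in explicit:
            by_era.setdefault(era_of(database), []).append(database)
    return by_era if len(by_era) > 1 else {}
```

For example, `databaseStart = "cchs2001_p, cchs2007_2008_p, cchs2015_2016_p"` with `variableStart = "cchs2001_p::SMKA_204, [SMK_204]"` leaves the 2007-2008 and 2015-2016 databases on the same default, so the function reports both eras and the 2015+ database needs an explicit mapping.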
+ +**Key 2015+ renames (most common):** - Smoking categorical: SMK_06A → SMK_060, SMK_09A → SMK_080, SMK_10A → SMK_100 -- Smoking continuous: SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 +- Smoking continuous (Master): SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 +- Smoking continuous (PUMF): SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 - Smoking derived: SMKDSTY → SMKDVSTY, SMKDSTP → SMKDVSTP -- PUMF grouped: SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 +- Smoking intensity (daily smoker): SMK_204 / SMK_208 → SMK_045 (PUMF) / SMK_040 (Master daily), SMK_075 (both, former daily) - FVC: FVCDFRU → FVCDVFRU, FVCDSAL → FVCDVGRN, FVCDCAR → FVCDVORA, FVCDPOT → FVCDVPOT, FVCDVEG → FVCDVVEG, FVCDJUI → FVCDVJUI - ADL: ADL_01-06 → ADL_005-030 (3-digit, 2015-2021), then → ADL_05-30 (2-digit, 2023+) -**Key 2023 renames to check:** -- ADL: ADL_005 → ADL_05, ADL_010 → ADL_10, ADL_015 → ADL_15, ADL_020 → ADL_20, ADL_025 → ADL_25, ADL_030 → ADL_30. This is a new era boundary — `[ADL_005]` defaults will not work for 2023 databases. - #### Check 2: databaseStart consistency For each variable: @@ -667,6 +680,28 @@ If the in-scope variables include derived variables (functions in `R/`): 4. Compare the derived variable's valid % against its input variables — the DV should not have materially higher valid % than its least-available input 5. For categorical derived variables and key continuous inputs, examine the **exposure distribution** across cycles — not just valid counts. The central harmonization question is whether typical exposures (e.g., proportion with 0 fruit/veg, or >5 servings/day) remain stable across cycles. A sudden shift in the distribution at an era boundary signals a recoding or mapping error even when valid % is unchanged. 
Include these distributions in both the integration test output and the QMD visualisation +#### DerivedVar feeder check (PUMF/Master split) + +For any derived variable that uses age, sex, or any other variable that differs between PUMF and Master, run a cross-database feeder check: + +```r +devtools::load_all() +# Compare feeders with _p vs _m filter +resolve_dependencies("pack_years_der", databases = "cchs2015_2016_p") +resolve_dependencies("pack_years_der", databases = "cchs2015_2016_m") +``` + +If both calls return the same combined feeder list, DerivedVar rows are mixing `_p` and `_m` databases and need splitting. The correct state: the `_p` call should return PUMF-specific feeders (e.g., `DHHGAGE_cont`), the `_m` call should return Master-specific feeders (e.g., `DHH_AGE`). + +**Key age feeder split:** + +| Feeder | Database type | Note | +|--------|---------------|-------| +| `DHHGAGE_cont` | PUMF (`_p`) only | Midpoint-imputed from grouped PUMF age bands | +| `DHH_AGE` | Master (`_m`) all cycles including 2001 | True continuous age | + +A DerivedVar row that lists both `_p` and `_m` databases when feeders differ is a **P1** error — `rec_with_table()` will silently use the wrong age variable for at least one database type. See `pumf-master-harmonization.md` for the correct row-splitting pattern. + #### What to report from L6 For each cycle tested: diff --git a/.claude/skills/cchsflow-validation/SKILL.md b/.claude/skills/cchsflow-validation/SKILL.md new file mode 100644 index 00000000..50ecfd63 --- /dev/null +++ b/.claude/skills/cchsflow-validation/SKILL.md @@ -0,0 +1,286 @@ +--- +name: cchsflow-validation +description: Validate cchsflow worksheets for CSV formatting, source references, and cross-file consistency. Use before merging PRs that modify variables.csv or variable_details.csv, after authoring worksheet rows (L5 stage), or when GHA checks fail and you need local diagnostics. 
+allowed-tools: Bash(Rscript:*), Bash(R:*), Bash(git:*), Read, Glob, Grep +--- + +# cchsflow worksheet validation + +Run programmatic validation checks on cchsflow worksheets. This skill runs the same checks as GHA but locally, with additional cross-file consistency checks. + +## Usage + +``` +/cchsflow-validation +/cchsflow-validation path/to/variables.csv path/to/variable_details.csv +``` + +When invoked without arguments, validates the production worksheets at `inst/extdata/`. + +## Validation checks + +### Check 1: CSV formatting + +Run the fix-worksheets script to check (and optionally fix) formatting: + +```r +Rscript exec/fix-worksheets.R +``` + +This checks for: +- Excessive quoting (all fields quoted when not needed) +- Wrong column order (compared against YAML schemas) +- Empty trailing columns +- CRLF line endings (should be LF only) +- Unsorted rows (variables.csv sorted by `variable` column) + +**Schema files:** +- `inst/metadata/schemas/core/variables.yaml` +- `inst/metadata/schemas/core/variable_details.yaml` + +If `fix-worksheets.R` fails due to package load errors, use the fallback: + +```r +Rscript -e " +library(readr) +vars <- read.csv('inst/extdata/variables.csv', stringsAsFactors = FALSE, check.names = FALSE) +write_csv(vars, 'inst/extdata/variables.csv', na = '', quote = 'needed', escape = 'double', eol = '\n') +details <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE, check.names = FALSE) +write_csv(details, 'inst/extdata/variable_details.csv', na = '', quote = 'needed', escape = 'double', eol = '\n') +" +``` + +### Check 2: Source reference validation + +If `R/validate-all-source-references.R` exists, validate that all variableStart references point to real variables in the DDI: + +```r +Rscript -e " +source('R/validate-all-source-references.R') +result <- validate_all_source_references('inst/extdata/variable_details.csv') +print_all_validation_result(result) +" +``` + +This catches: +- `[VAR]` defaults that don't 
exist in 2015+ cycles +- Typos in variable names +- PUMF variables used for master databases (or vice versa) +- Missing explicit mappings for renamed variables + +### Check 3: Cross-file consistency + +Use R to check that variables.csv and variable_details.csv are internally consistent: + +```r +Rscript -e " +vars <- read.csv('inst/extdata/variables.csv', stringsAsFactors = FALSE, check.names = FALSE) +details <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE, check.names = FALSE) + +# Variables in details but not in vars +detail_vars <- unique(details\$variable) +var_vars <- unique(vars\$variable) +missing_in_vars <- setdiff(detail_vars, var_vars) +missing_in_details <- setdiff(var_vars, detail_vars) + +if (length(missing_in_vars) > 0) { + cat('ERROR: Variables in variable_details.csv but not in variables.csv:\n') + cat(paste(' -', missing_in_vars), sep = '\n') +} +if (length(missing_in_details) > 0) { + cat('WARNING: Variables in variables.csv but not in variable_details.csv:\n') + cat(paste(' -', missing_in_details), sep = '\n') +} +if (length(missing_in_vars) == 0 && length(missing_in_details) == 0) { + cat('OK: All variables present in both files.\n') +} +" +``` + +### Check 4: databaseStart coverage + +For each variable, verify that the `databaseStart` in variables.csv matches the union of all `databaseStart` entries in variable_details.csv: + +```r +Rscript -e " +vars <- read.csv('inst/extdata/variables.csv', stringsAsFactors = FALSE, check.names = FALSE) +details <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE, check.names = FALSE) + +parse_dbs <- function(x) { + trimws(unlist(strsplit(x, ','))) +} + +errors <- character() +for (v in unique(vars\$variable)) { + vars_dbs <- sort(parse_dbs(vars\$databaseStart[vars\$variable == v][1])) + details_rows <- details[details\$variable == v, ] + details_dbs <- sort(unique(unlist(lapply(details_rows\$databaseStart, parse_dbs)))) + + in_vars_not_details <- 
setdiff(vars_dbs, details_dbs) + in_details_not_vars <- setdiff(details_dbs, vars_dbs) + + if (length(in_vars_not_details) > 0) { + errors <- c(errors, paste0(v, ': in variables.csv but not variable_details.csv: ', + paste(in_vars_not_details, collapse = ', '))) + } + if (length(in_details_not_vars) > 0) { + errors <- c(errors, paste0(v, ': in variable_details.csv but not variables.csv: ', + paste(in_details_not_vars, collapse = ', '))) + } +} + +if (length(errors) > 0) { + cat('databaseStart mismatches:\n') + cat(paste(' -', errors), sep = '\n') +} else { + cat('OK: All databaseStart fields are consistent.\n') +} +" +``` + +### Check 5: R CMD check (package integrity) + +Run a lightweight R CMD check to catch package-level issues such as undeclared dependencies, invalid `library()` calls in R/ files, missing NAMESPACE exports, and broken function references: + +```bash +Rscript -e "devtools::check(document = FALSE, args = '--no-tests --no-examples --no-vignettes --no-manual')" 2>&1 | tail -30 +``` + +This catches: +- `library()` calls in R/ files (must use DESCRIPTION Depends/Imports instead) +- Missing package dependencies (e.g., `here` used but not in DESCRIPTION) +- Undefined exports in NAMESPACE +- `source()` calls that fail in package context +- Syntax errors in R files + +**Quick alternative** — if full R CMD check is too slow, test that the package loads: + +```r +Rscript -e "devtools::load_all('.'); cat('Package loads OK\n')" +``` + +If `devtools::load_all()` fails, the GHA will also fail when it tries to install the package. + +### Check 7: Pre-2007 explicit mapping coverage + +For any variable where `databaseStart` includes pre-2007 databases (`cchs2001_m`, `cchs2001_p`, `cchs2003_m`, `cchs2003_p`, `cchs2005_m`, `cchs2005_p`), verify that `variableStart` contains explicit `db::VAR` entries for those cycles rather than relying on `[VAR]` defaults. + +The `[VAR]` default applies the base variable name to all unlisted databases. 
For pre-2007 cycles, the correct name requires a cycle letter in position 4 (A=2001, C=2003, E=2005). A `[VAR]` default for these cycles will silently look up the wrong variable name. + +```r +Rscript -e " +vd <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE) +pre2007 <- c('cchs2001_m', 'cchs2001_p', 'cchs2003_m', 'cchs2003_p', + 'cchs2005_m', 'cchs2005_p') + +issues <- character() +for (v in unique(vd\$variable)) { + rows <- vd[vd\$variable == v, ] + for (i in seq_len(nrow(rows))) { + dbs <- trimws(strsplit(rows\$databaseStart[i], ',')[[1]]) + vs <- rows\$variableStart[i] + pre <- dbs[dbs %in% pre2007] + if (length(pre) == 0) next + # Check each pre-2007 db has an explicit db::VAR mapping + for (db in pre) { + if (!grepl(paste0(db, '::'), vs, fixed = TRUE)) { + issues <- c(issues, paste0(v, ': ', db, ' has no explicit mapping in variableStart')) + } + } + } +} +if (length(issues) > 0) { + cat('Pre-2007 mapping gaps (will use [VAR] default — likely WRONG name):\n') + cat(paste(' -', issues), sep = '\n') +} else { + cat('OK: All pre-2007 databases have explicit variableStart mappings.\n') +} +" +``` + +Pre-2007 mapping gaps are **P1** errors — the variable exists in those cycles but the wrong source variable is read at runtime. + +### Check 8: DerivedVar mixed _p/_m row detection + +DerivedVar rows must not mix `_p` (PUMF) and `_m` (Master) databases in a single row when those database types use different feeder variables. If a single DerivedVar row's `databaseStart` contains both `_p` and `_m` entries, `rec_with_table()` will apply the same feeder variable set to all databases in that row — silently producing wrong results when PUMF and Master use different age, sex, or other input variables. 
+ +```r +Rscript -e " +vd <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE) + +mixed <- data.frame(variable = character(), row = integer(), + n_p = integer(), n_m = integer(), stringsAsFactors = FALSE) +derived_rows <- vd[grepl('DerivedVar::', vd\$variableStart), ] +for (i in seq_len(nrow(derived_rows))) { + dbs <- trimws(strsplit(derived_rows\$databaseStart[i], ',')[[1]]) + has_p <- any(grepl('_p$', dbs)) + has_m <- any(grepl('_m$', dbs)) + if (has_p && has_m) { + mixed <- rbind(mixed, data.frame( + variable = derived_rows\$variable[i], + row = which(vd\$variable == derived_rows\$variable[i] & + vd\$variableStart == derived_rows\$variableStart[i])[1], + n_p = sum(grepl('_p$', dbs)), + n_m = sum(grepl('_m$', dbs)), + stringsAsFactors = FALSE + )) + } +} +if (nrow(mixed) > 0) { + cat('DerivedVar rows mixing _p and _m databases:\n') + for (i in seq_len(nrow(mixed))) { + cat(sprintf(' %-30s (row ~%d): %d _p, %d _m — inspect feeder sets\n', + mixed\$variable[i], mixed\$row[i], mixed\$n_p[i], mixed\$n_m[i])) + } + cat('\nFor each flagged variable: compare resolve_dependencies(variable, databases=\"cchs2015_2016_p\")\n') + cat('vs resolve_dependencies(variable, databases=\"cchs2015_2016_m\") — if feeders differ, split the rows.\n') +} else { + cat('OK: No DerivedVar rows mix _p and _m databases.\n') +} +" +``` + +A mixed row is **always suspect**. It is a **P1** error if the `_p` and `_m` feeder sets differ (use `resolve_dependencies()` with a `databases` filter to confirm). It may be acceptable if feeders are identical across both database types, but this should be verified explicitly. 
+ +### Check 6: Trailing empty columns + +Check for trailing empty columns added by Excel editing (a recurring issue across v3 PRs): + +```r +Rscript -e " +vd <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE, check.names = FALSE) +cat('variable_details.csv columns:', ncol(vd), '\n') +empty <- which(names(vd) == '' | is.na(names(vd))) +# Braces keep if/else in one expression; a bare else on its own line is a parse error at top level +if (length(empty) > 0) { + cat('WARNING: Empty column names at positions:', empty, '\n') +} else { + cat('OK: No trailing empty columns\n') +} + +vars <- read.csv('inst/extdata/variables.csv', stringsAsFactors = FALSE, check.names = FALSE) +cat('variables.csv columns:', ncol(vars), '\n') +empty2 <- which(names(vars) == '' | is.na(names(vars))) +if (length(empty2) > 0) { + cat('WARNING: Empty column names at positions:', empty2, '\n') +} else { + cat('OK: No trailing empty columns\n') +} +" +``` + +Expected column counts: variables.csv = 20, variable_details.csv = 22. + +## Interpreting results + +| Check | Pass | Severity | Fail action | +|-------|------|----------|------------| +| CSV formatting | No output / clean exit | P2 | Run `Rscript exec/fix-worksheets.R` to auto-fix, then commit | +| Source references | No invalid refs | P0 | Fix variableStart mappings per era rules | +| Cross-file consistency | All variables in both files | P1 | Add missing entries to the appropriate file | +| databaseStart coverage | No mismatches | P1 | Align databaseStart between files | +| R CMD check | 0 errors, 0 warnings | P0 | Fix R/ files: remove `library()` calls, declare deps in DESCRIPTION | +| Trailing empty columns | Expected column counts | P2 | Trim to real columns using R `write.csv()` | +| Pre-2007 explicit mappings | No gaps | P1 | Add explicit `db::VAR` entries for pre-2007 cycles | +| DerivedVar mixed _p/_m | No mixed rows | P1 | Split rows by database type; verify feeders with `resolve_dependencies()` | + +## When to run + +- **Before committing** worksheet changes (L5 stage) +- **Before merging** PRs that modify worksheets +- **When GHA
"Check CSV Formatting" fails** — run locally for detailed diagnostics +- **After bulk edits** (adding master cycles to many variables) +- **When R/ files are modified** — run R CMD check to catch package-level issues diff --git a/.claude/skills/cchsflow-worksheets/SKILL.md b/.claude/skills/cchsflow-worksheets/SKILL.md new file mode 100644 index 00000000..5c05ae45 --- /dev/null +++ b/.claude/skills/cchsflow-worksheets/SKILL.md @@ -0,0 +1,107 @@ +--- +name: cchsflow-worksheets +description: Author and edit CCHS harmonization worksheets (variables.csv, variable_details.csv). Use when adding variables, mapping source variables across cycles, following the L0-L6 harmonization workflow, or consulting era-specific naming conventions. +allowed-tools: Bash(Rscript:*), Bash(R:*), Read, Glob, Grep, mcp__cchs-metadata__* +--- + +# cchsflow worksheet authoring + +This skill provides guidance for authoring and editing cchsflow harmonization worksheets. The two primary worksheets are: + +- `inst/extdata/variables.csv` — variable registry (metadata, database coverage) +- `inst/extdata/variable_details.csv` — recoding/transformation rules + +## Variable lookup: cchs-metadata MCP + +**Always use the cchs-metadata MCP server as the primary tool for looking up CCHS variable metadata** during worksheet authoring. It provides the most complete, cross-referenced metadata (16,000+ variables, 251 datasets) and is faster and more reliable than searching raw files. 
+ +Key tools for authoring: +- `mcp__cchs-metadata__get_variable_history(variable_name)` — check which cycles/datasets contain a variable (essential for `databaseStart` authoring) +- `mcp__cchs-metadata__search_variables(query)` — find variables by name or label (essential for identifying era renames) +- `mcp__cchs-metadata__compare_master_pumf(variable_name, cycle)` — check whether PUMF and Master differ (essential for deciding row-splitting) +- `mcp__cchs-metadata__get_value_codes(variable_name)` — get response categories (essential for `recStart`/`recEnd` authoring) +- `mcp__cchs-metadata__suggest_cchsflow_row(variable_name)` — draft a harmonisation row +- `mcp__cchs-metadata__get_source_conflicts(variable_name, dataset_id)` — find cross-source label disagreements (useful for catching metadata inconsistencies before authoring) + +If the MCP is not available, see the troubleshooting section in `.claude/skills/cchsflow-review/SKILL.md` under "If the MCP is not available" for setup instructions (including the standalone CLI fallback). The MCP server (v0.3.0+) lives in `../cchsflow-docs/mcp-server/` and is also available as a [GitHub release](https://github.com/Big-Life-Lab/cchsflow-docs/releases). 
+ +## Key references + +Detailed documentation is in the `docs/` subdirectory: + +- [harmonization-workflow.md](docs/harmonization-workflow.md) — the L0-L6 staged workflow for harmonizing CCHS variables, from documentation assessment through integration testing +- [variableStart-databaseStart-authoring.md](docs/variableStart-databaseStart-authoring.md) — technical rules for coordinating `variableStart` and `databaseStart` fields, including era-specific mappings and the dangerous `[VAR]` default pattern +- [pumf-master-harmonization.md](docs/pumf-master-harmonization.md) — patterns for splitting worksheet rows when PUMF and Master databases require different recoding logic (midpoint imputation vs continuous pass-through) +- [derived-variable-functions.md](docs/derived-variable-functions.md) — how to write R functions for `Func::` rows: 3-step architecture, semantic parameter naming, `derive_passthrough()`, feeder alignment, and `clean_variables()` worksheet-name mapping + +## Quick reference + +### CCHS variable naming eras + +| Era | Years | Pattern | Example | +|-----|-------|---------|---------| +| Pre-2007 | 2001-2005 | Cycle letter in 4th position | `SMKA_203` (2001), `SMKC_203` (2003), `SMKE_203` (2005) | +| 2007-2014 | 2007-2014 | Standard naming | `SMK_203` | +| Post-2014 | 2015+ | 3-digit increments | `SMK_040` | + +### Database suffixes + +| Suffix | Meaning | Notes | +|--------|---------|-------| +| `_p` | PUMF (Public Use Microdata File) | Grouped/derived variables | +| `_m` | Master survey file | Ungrouped source variables | +| `_s` | Share file | Synthetic datasets | +| `_i` | ICES-linked (deprecated) | Replace with `_m` | + +### PUMF vs Master row splitting + +When PUMF has grouped categorical and Master has true continuous source variables, rows must be split by database type. See [pumf-master-harmonization.md](docs/pumf-master-harmonization.md) for the full pattern. 
+ +**Quick test**: If `variableStart` references both a categorical variable (e.g., SMK_06A) and a continuous companion (e.g., SMK_06C) for the same harmonized variable, you likely need the split pattern. + +### The dangerous default pattern + +If `databaseStart` spans both 2007-2014 and 2015+ cycles, a `[VAR]` default will apply the 2007-2014 name to 2015+ databases where the variable may have been renamed. Always add explicit `db::VAR` mappings for 2015+ cycles. + +### Writing CSVs — quoting rules + +**CRITICAL**: Never use `write.csv()` for worksheets — it quotes all fields, which fails the worksheet checker. Use one of: + +```r +# Option 1: readr (preferred for scripts) +readr::write_csv(df, path, na = "", quote = "needed", escape = "double", eol = "\n") + +# Option 2: fix_worksheet() after any write +devtools::load_all(quiet = TRUE) +fix_worksheet(path, "variable_details") # strips unnecessary quotes +``` + +### Rebuilding RData after worksheet changes + +Whenever CSVs change, rebuild the RData files that `rec_with_table()` uses at runtime: + +```r +vd <- read.csv("inst/extdata/variable_details.csv", stringsAsFactors = FALSE) +variable_details <- vd[, c("variable", "dummyVariable", "typeEnd", "databaseStart", + "variableStart", "typeStart", "recEnd", "numValidCat", + "catLabel", "catLabelLong", "units", "recStart", + "catStartLabel", "variableStartShortLabel", + "variableStartLabel", "notes")] +save(variable_details, file = "data/variable_details.RData") + +v <- read.csv("inst/extdata/variables.csv", stringsAsFactors = FALSE) +variables <- v[, c("variable", "label", "labelLong", "section", "subject", + "variableType", "units", "databaseStart", "variableStart", + "description")] +save(variables, file = "data/variables.RData") +``` + +The RData files have fewer columns than the CSVs (16 vs 23, 10 vs 18). Extra metadata columns are CSV-only. 
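
The column subsetting above stops with R's generic "undefined columns selected" error if a worksheet column has been renamed or dropped. As a rough sketch only, the check below names the missing columns before subsetting. `select_required_columns()` is a hypothetical helper, not part of cchsflow, and the toy data frame (including its `reviewer` column) is invented for illustration.

```r
# Hypothetical helper (not part of cchsflow): subset a worksheet to the
# columns the RData file needs, and name any required column that is absent.
select_required_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Worksheet is missing required column(s): ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  df[, required, drop = FALSE]
}

# Toy worksheet: three required columns plus one invented CSV-only column
toy <- data.frame(
  variable = "SMK_204",
  recEnd   = "copy",
  notes    = "",
  reviewer = "example",  # stands in for extra CSV-only metadata
  stringsAsFactors = FALSE
)

kept <- select_required_columns(toy, c("variable", "recEnd", "notes"))
print(names(kept))
```

The same call could wrap either of the `vd[, c(...)]` and `v[, c(...)]` subsets above, assuming the required column vectors stay as listed there.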
+ +### CSV validation before committing + +```r +Rscript exec/fix-worksheets.R +``` + +This checks and fixes: excessive quoting, column order, empty trailing columns, CRLF line endings, unsorted rows. From 8645104ed09597e5691e717e46dd1bd068c0b74d Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Thu, 12 Mar 2026 08:20:55 -0400 Subject: [PATCH 04/15] feat(skills): Add cchsflow-worksheets reference documentation Four docs covering: harmonization workflow (L0-L6), PUMF vs Master splitting (including PUMF availability by cycle table), derived variable functions, and variableStart/databaseStart authoring patterns. --- .../docs/derived-variable-functions.md | 224 ++++++++ .../docs/harmonization-workflow.md | 346 +++++++++++++ .../docs/pumf-master-harmonization.md | 489 ++++++++++++++++++ .../variableStart-databaseStart-authoring.md | 365 +++++++++++++ 4 files changed, 1424 insertions(+) create mode 100644 .claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md create mode 100644 .claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md create mode 100644 .claude/skills/cchsflow-worksheets/docs/pumf-master-harmonization.md create mode 100644 .claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md diff --git a/.claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md b/.claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md new file mode 100644 index 00000000..f5832fce --- /dev/null +++ b/.claude/skills/cchsflow-worksheets/docs/derived-variable-functions.md @@ -0,0 +1,224 @@ +# Derived variable functions (Func:: pattern) + +This document describes how to write R functions referenced by `Func::` rows in `variable_details.csv`. These functions implement the calculation logic for derived variables that cannot be expressed as simple recoding rules. 
+ +## The 3-step architecture + +Every `Func::` function follows three steps: + +```r +calculate_example <- function(input_a, input_b, output_format = "tagged_na") { + + # === STEP 1: DATA CLEANING (input metadata) === + cleaned <- clean_variables(vars = list( + INPUT_VAR_A = input_a, # list names = worksheet variable names + INPUT_VAR_B = input_b + ), output_format = output_format) + + # === STEP 2: DOMAIN LOGIC === + result <- dplyr::case_when( + # ... calculation using cleaned$INPUT_VAR_A, cleaned$INPUT_VAR_B + ) + + # === STEP 3: OUTPUT CLEANING (output metadata) === + output_cleaned <- clean_variables(vars = list( + example_der = result # list name = output variable name + ), output_format = output_format) + + return(output_cleaned$example_der) +} +``` + +### Why both Step 1 and Step 3 call `clean_variables()` + +This is not redundant. The two calls use **different variable names** and therefore look up **different metadata**: + +- **Step 1** looks up *input* variable patterns — e.g., `SMKDSTY_A` has valid range 1-6 with codes 7/8/9 as missing +- **Step 3** looks up *output* variable patterns — e.g., `pack_years_der` has valid range 0-165 + +A function that skips Step 3 is a bug, not a simplification. The only exception is `derive_passthrough()`, where input and output are the same variable. + +## Function categories + +### Pass-through functions + +When the worksheet handles PUMF/Master source routing and the function simply cleans and returns, use `derive_passthrough()`: + +```r +#' @export +calculate_age_start_smoking <- function(age_start_smoking = NULL, + output_format = "tagged_na") { + derive_passthrough(age_start_smoking, "age_start_smoking", output_format) +} +``` + +The helper (`R/utility-functions.R`) handles NULL, empty input, and `clean_variables()` in one call. 
Use this when: + +- The function has a **single input parameter** (plus `output_format`) +- Step 2 is a pure pass-through — no transformation logic +- The worksheet splits handle all PUMF/Master differences + +Current examples: `age_start_smoking`, `age_first_cigarette`, `smoked_100_lifetime`. + +### Domain logic functions + +When the function combines multiple inputs using business rules: + +```r +calculate_cigs_per_day <- function(SMKDSTY_A, SMK_204, SMK_208, + output_format = "tagged_na") { + # Step 1: clean inputs + # Step 2: case_when routing by smoking status + # Step 3: clean output +} +``` + +These keep **all three steps explicit**. Use this when: + +- Multiple inputs are combined via `case_when()` or arithmetic +- The function contains genuinely different logic per input status +- Examples: `cigs_per_day`, `time_quit_smoking`, `pack_years`, `SMKDSTY_cat6` + +### Documentation-only stubs + +For variables harmonised entirely via worksheet recoding rules (no R logic needed), provide a stub so the function is discoverable: + +```r +#' @export +calculate_SMK_204 <- function(data, output_format = "tagged_na") { + stop("DOCUMENTATION ONLY: Use rec_with_table(data, 'SMK_204') for implementation") +} +``` + +**Important**: Do not create a doc stub if a real implementation exists — this causes a name collision where R silently uses whichever definition is sourced last. + +## Parameter naming + +### Semantic names for derived variables + +Function parameters should describe **what the data means**, not **where it comes from**: + +| Preferred (semantic) | Avoid (source-specific) | +|---------------------|------------------------| +| `smoking_status` | `SMKDSTY_A` | +| `age` | `DHHGAGE_cont` | +| `age_start_smoking` | `SMK_040`, `SMKG040_cont` | + +The worksheet handles routing the correct source variable to each parameter. The function is source-agnostic. 
+ +### When source-specific names are acceptable + +Source-level functions that harmonise raw StatCan variables across eras can keep source-specific parameter names: + +```r +# Source-level: combines two era-specific variables +calculate_SMKG040_cont <- function(SMKG203_cont, SMKG207_cont, ...) + +# Domain logic with well-known source names +calculate_cigs_per_day <- function(SMKDSTY_A, SMK_204, SMK_208, ...) +``` + +The test: if the parameter name IS a StatCan variable that exists in both PUMF and Master, it's fine to keep it. If the same concept has different names on PUMF vs Master, use a semantic name and let the worksheet route. + +### The `clean_variables()` + worksheet name mapping + +When parameters are semantic but `clean_variables()` needs worksheet variable names for pattern lookup, pass worksheet names in the list and map afterwards: + +```r +# Step 1: use worksheet names for clean_variables() lookup +cleaned_raw <- clean_variables(vars = list( + SMKDSTY_A = smoking_status, # worksheet name = variable name in CSV + DHHGAGE_cont = age +), output_format = output_format) + +# Map to semantic names for Step 2 +cleaned <- cleaned_raw +cleaned$smoking_status <- cleaned_raw$SMKDSTY_A +cleaned$age <- cleaned_raw$DHHGAGE_cont +``` + +This is necessary because `clean_variables()` uses the list names to look up valid ranges and missing code patterns from `variable_details.csv`. An unknown name falls back to auto-detection, which can misclassify valid values as missing codes (e.g., smoking status 6 interpreted as NA::a). + +## Worksheet feeder alignment + +### Positional matching + +`rec_with_table()` passes `DerivedVar::[a, b, c]` feeders to the `Func::` function **by position**. The feeder count and order must match the function signature: + +``` +# Worksheet +variableStart: DerivedVar::[SMKDSTY_A, DHHGAGE_cont, age_start_smoking, cigs_per_day, ...] 
+recEnd: Func::calculate_pack_years + +# Function signature (must match positionally) +calculate_pack_years <- function(smoking_status, age, age_start_smoking, cigs_per_day, ...) +# ^pos 1 ^pos 2 ^pos 3 ^pos 4 +``` + +### Feeder names are resolved, not matched + +The feeder names in the worksheet (e.g., `SMKDSTY_A`) are resolved by `rec_with_table()` to actual data columns before being passed to the function. The function parameter names don't need to match the feeder names — only the **position** matters. + +### Nested DerivedVar chains + +When a feeder is itself a `DerivedVar` (e.g., `age_start_smoking`), `rec_with_table()` resolves it first. This means the function receives the already-computed derived value, not the raw source variable. + +## PUMF/Master row splitting for Func:: rows + +### When to split + +Split `Func::` rows when PUMF and Master route **different source variables** to the same function parameter. The function itself stays source-agnostic — only the worksheet feeder list changes: + +``` +# PUMF row +databaseStart: cchs2001_p, cchs2003_p, ... +variableStart: DerivedVar::[SMKDSTY_A, DHHGAGE_cont, age_start_smoking, ...] +recEnd: Func::calculate_pack_years + +# Master row +databaseStart: cchs2001_m, cchs2003_m, ... +variableStart: DerivedVar::[SMKDSTY_A, DHH_AGE, age_start_smoking, ...] +recEnd: Func::calculate_pack_years +``` + +The only difference is position 2: `DHHGAGE_cont` (PUMF grouped midpoint) vs `DHH_AGE` (Master true continuous). The function receives both as `age`. + +### When NOT to split + +Don't split when the feeder variable is the **same on both PUMF and Master**: + +- `SMK_204` exists identically on both → no split needed +- `age_start_smoking` is itself a DerivedVar that handles PUMF/Master internally → no split needed for downstream consumers + +### Domain routing is not a PUMF/Master split + +Functions like `cigs_per_day` take `SMK_204` (current daily) and `SMK_208` (former daily) and route by smoking status. 
Both variables exist on both PUMF and Master. This is **domain logic**, not a data-file split. The worksheet rows stay combined. + +## NULL handling convention + +v3 functions use `= NULL` defaults for optional parameters to support standalone use outside `rec_with_table()`: + +```r +calculate_pack_years <- function(smoking_status, age, ..., + cigs_occasional = NULL, # optional + days_per_month = NULL, # optional + output_format = "tagged_na") +``` + +NULL inputs are converted to NA vectors at function entry: +```r +if (is.null(cigs_occasional)) cigs_occasional <- rep(NA_real_, n) +``` + +This is tracked for standardisation in issue #173. + +## Reference implementations + +| Pattern | Example function | File | +|---------|-----------------|------| +| Pass-through | `calculate_age_start_smoking()` | `R/smoke-start.R` | +| Domain routing | `calculate_cigs_per_day()` | `R/smoke-intensity.R` | +| Multi-input calculation | `calculate_pack_years()` | `R/smoke-pack-years.R` | +| Categorical binning | `calculate_pack_years_categorical()` | `R/smoke-pack-years.R` | +| Source combining | `calculate_time_quit_smoking()` | `R/smoking-cessation.R` | +| Doc stub | `calculate_SMK_204()` | `R/smoke-intensity.R` | diff --git a/.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md b/.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md new file mode 100644 index 00000000..76a0a31e --- /dev/null +++ b/.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md @@ -0,0 +1,346 @@ +# Harmonization workflow (L0-L6) + +This document describes the staged workflow for harmonizing CCHS variables in cchsflow. 
+ +## Overview + +The L0-L6 workflow ensures systematic, validated harmonization of CCHS variables: + +| Stage | Name | Purpose | Output | +|-------|------|---------|--------| +| L0 | Documentation assessment | Review all data sources | L0_documentation_assessment.md | +| L1 | Variable concordance | Map source variables across cycles | L1_variable_concordance.md | +| L2 | Semantic mapping | Define harmonization rules | L2_semantic_mapping.md | +| L3 | Worksheet authoring | Create CSV worksheets | variables.csv, variable_details.csv | +| L4 | DV specifications | Specify derived variable functions | L4_dv_specifications.md | +| L5 | Testing | Unit tests and validation | test-*.R | +| L6 | Integration | Merge to production, integration testing | QMD reports | + +## L0: Documentation assessment + +### Purpose + +Identify and review all available documentation sources before writing worksheets. This prevents missed variables and ensures accurate era mappings. + +### Required data sources + +**Primary source (always use first):** + +1. **cchs-metadata MCP server** — the unified metadata database with 16,000+ variables across 251 datasets + - `get_variable_history(variable_name)` — confirm which cycles/datasets contain a variable + - `search_variables(query)` — find variables by name or label, identify era renames + - `get_value_codes(variable_name)` — get category codes and labels per cycle + - `compare_master_pumf(variable_name, cycle)` — check PUMF vs Master differences + - Cross-references PUMF RData, DDI XML, and ICES sources with full provenance + - If not available, see troubleshooting in `.claude/skills/cchsflow-review/SKILL.md` + +**Supplementary sources (use to fill gaps or cross-check):** + +2. **DDI YAML files** (`cchsflow-docs/cchs-extracted/data-dictionary/`) + - Raw extracted data dictionaries — coverage 2000-2001 through 2023 + - Use when MCP lacks coverage for a specific cycle (e.g., 2022-2023 if not yet ingested) + +3. 
**cchs_available_variables_list.csv** (`development/`) + - Quick reference for variable availability across cycles + - Shows source variable names by era + +4. **Existing PR worksheets** (if applicable) + - Check branch-specific variables.csv and variable_details.csv + - Note any existing errors or gaps + +5. **cchsflow-docs variable listings** (when available) + - Cross-reference for comprehensive coverage + - Includes Ontario Linked file availability + +### Multi-source reconciliation process + +**CRITICAL**: Before authoring worksheets, reconcile all sources: + +``` +Step 1: List all variables from existing PR (if any) +Step 2: Cross-check against cchs_available_variables_list.csv +Step 3: Verify each variable exists in DDI for claimed cycles +Step 4: Check cchsflow-docs for Ontario Linked file availability +Step 5: Document any discrepancies in L0 assessment +``` + +### Ontario Linked file tracking + +For Ontario-specific research (e.g., dementia studies), document: + +- Which variables are available in Ontario Linked files +- Which cycles have Ontario-specific restrictions (e.g., 2003 HUI Ontario exclusion) +- Age restrictions for target populations (50+, 55+, 60+) + +### L0 document template + +```markdown +# L0: Documentation assessment - [Domain] + +## Topic overview + +**Domain**: [e.g., Hearing/Vision] +**Sub-topic**: [e.g., HUI hearing items] +**Scope**: [Brief description of variables in scope] + +## Documentation sources reviewed + +| Source | Location | Status | +|--------|----------|--------| +| DDI YAMLs | cchsflow-docs/cchs-extracted/data-dictionary/ | [Reviewed/Pending] | +| cchs_available_variables_list.csv | development/ | [Reviewed/Pending] | +| Existing PR worksheets | [branch]/inst/extdata/ | [Reviewed/N/A] | +| cchsflow-docs listings | [URL] | [Reviewed/Pending/N/A] | + +## Multi-source reconciliation + +### Variables from existing PR +[List variables found in PR worksheets] + +### Variables from cchs_available_variables_list.csv +[List 
variables for this domain] + +### Discrepancies identified +[Document any gaps or conflicts between sources] + +## Provincial availability + +### Ontario-specific restrictions + +| Variable | Cycle | Issue | +|----------|-------|-------| +| [e.g., HUICGHER] | 2003 | [No Ontario data in PUMF] | + +### Ontario Linked file availability + +| Variable | Cycles available | Notes | +|----------|-----------------|-------| +| [var] | [cycles] | [any restrictions] | + +## Variables in scope + +[Comprehensive list with cycle coverage] + +## Key decisions + +[Document any decisions made during assessment] +``` + +## L1: Variable concordance + +### Purpose + +Map source variable names across all eras and identify naming patterns. + +### How to build concordance + +**Use the cchs-metadata MCP as the primary tool:** +- `get_variable_history(variable_name)` shows all datasets containing a variable — this directly reveals era renames (e.g., SMK_09C appearing in 2007-2014 datasets, SMK_090 in 2015+ datasets) +- `search_variables(query)` with partial name patterns finds related variables across naming eras +- `compare_master_pumf(variable_name, cycle)` reveals PUMF vs Master naming differences for each cycle + +### Era patterns + +| Era | Years | Naming pattern | Example | +|-----|-------|----------------|---------| +| Pre-2007 | 2001-2005 | Cycle letter in 4th position | HUIA_06 (2001), HUIC_06 (2003) | +| 2007-2014 | 2007-2014 | Standard naming | HUI_06 | +| Post-2014 | 2015+ | 3-digit increments | HUI_060 or module redesign | + +### Concordance table template + +| Harmonized | 2001 | 2003 | 2005 | 2007-2008 | 2009-2010 | 2011-2012 | 2013-2014 | 2015-2016 | 2017-2018 | +|------------|------|------|------|-----------|-----------|-----------|-----------|-----------|-----------| +| [target] | [src] | [src] | [src] | [src] | [src] | [src] | [src] | [src] | [src] | + +### PUMF vs Master + +Document differences between PUMF (grouped) and Master (derived) variables: + +| Concept | PUMF 
variable | Master variable | Difference | +|---------|---------------|-----------------|------------| +| [e.g., Hearing] | HUICGHER | HUICDHER | Grouped vs derived | + +## L2: Semantic mapping + +### Purpose + +Define category mappings and identify semantic breaks across cycles. + +### Semantic break documentation + +For each identified break: + +1. **Year of change**: When the break occurred +2. **Nature of change**: What changed (categories, question wording, etc.) +3. **Impact**: How it affects harmonization +4. **Resolution**: How cchsflow addresses it + +### PUMF vs Master source type differences + +During semantic mapping, identify whether PUMF and Master databases provide the same variable type: +- **Same type**: categorical on both → standard harmonization +- **Different type**: PUMF categorical, Master continuous → requires row splitting (see [pumf-master-harmonization.md](pumf-master-harmonization.md)) + +This determination affects L3 worksheet authoring — rows must be split by database type when recoding logic differs. + +### Category mapping table + +| Harmonized value | Meaning | 2001-2014 source | 2015+ source | +|------------------|---------|------------------|--------------| +| 1 | [meaning] | [code] | [code] | +| 2 | [meaning] | [code] | [code] | + +## L3: Worksheet authoring + +### Purpose + +Create the actual CSV worksheets following cchsflow schema. + +### Pre-authoring checklist + +- [ ] L0-L2 documents complete +- [ ] All source variables verified against DDI +- [ ] Era mappings documented +- [ ] Semantic breaks identified + +### Validation requirements + +**Before merging to inst/extdata:** + +```r +source("R/validate-all-source-references.R") +result <- validate_all_source_references("path/to/variable_details.csv") +print_all_validation_result(result) +``` + +### Common errors to avoid + +1. **Wrong era variable name via `[VAR]` default** - See variableStart-databaseStart-authoring.md +2. 
**Database name typos** - `cchs_2009_2010_m` vs `cchs2009_2010_m` +3. **Wrong source variable mapping** - Double-check each db::VAR pair + +## L4: DV specifications + +### Purpose + +Specify derived variable functions when variables cannot be passed through. + +### When needed + +- Collapsing categories across semantic breaks +- Deriving continuous from categorical +- Complex multi-variable derivations + +## L5: Testing + +### Purpose + +Validate harmonization logic with unit tests and package checks. + +### Required tests + +1. **Category coverage** - All output categories have test cases +2. **Edge cases** - Missing data, boundary values +3. **Cross-cycle consistency** - Same inputs produce same outputs + +### CSV worksheet validation + +Before committing worksheet changes, validate CSV formatting: + +```r +# Fix excessive quoting and formatting issues +Rscript exec/fix-worksheets.R +``` + +This runs `check_worksheet()` and `fix_worksheet()` from the cchsflow package against both `inst/extdata/variables.csv` and `inst/extdata/variable_details.csv`. 
The GHA "Check CSV Formatting" workflow will fail if CSVs have: + +- Excessive quoting (all fields quoted when not needed) +- Wrong column order +- Empty trailing columns +- CRLF line endings +- Unsorted rows + +If `exec/fix-worksheets.R` fails due to package load errors (untracked R files or missing dependencies), use this workaround: + +```r +Rscript -e " +library(readr) +vars <- read.csv('inst/extdata/variables.csv', stringsAsFactors = FALSE, check.names = FALSE) +write_csv(vars, 'inst/extdata/variables.csv', na = '', quote = 'needed', escape = 'double', eol = '\n') +details <- read.csv('inst/extdata/variable_details.csv', stringsAsFactors = FALSE, check.names = FALSE) +write_csv(details, 'inst/extdata/variable_details.csv', na = '', quote = 'needed', escape = 'double', eol = '\n') +" +``` + +### R CMD check + +Run R CMD check to catch package-level issues before pushing: + +```r +Rscript -e "devtools::check()" +``` + +Common failures to watch for: + +- **Undefined exports in NAMESPACE** - Functions listed in NAMESPACE that don't exist in any R file. This happens when functions are renamed or removed but NAMESPACE isn't updated. Fix by removing the stale `export()` lines from NAMESPACE or regenerating with `roxygen2::roxygenise()`. +- **Missing documentation** - New exported functions need roxygen docs. +- **Unresolved `source()` calls** - Untracked R files in `R/` that `source()` files not in the repo will break `devtools::load_all()` and roxygen. + +## L6: Integration + +### Purpose + +Merge to production and validate with real PUMF/Master data. + +### Pre-merge checklist + +Before merging a PR: + +1. **CSV formatting passes** - `Rscript exec/fix-worksheets.R` exits cleanly +2. **R CMD check passes** - No errors or warnings +3. **NAMESPACE is correct** - All exported functions exist; no stale exports +4. **GHA checks are green** - Both "R-CMD-check" and "Check CSV Formatting" workflows pass + +### Integration test QMD template + +Each CEP should include: + +1. 
**availability-matrix.qmd** - Respondent counts by cycle, age group, province +2. **integration-test.qmd** - `rec_with_table()` validation + +### Age cutoffs for dementia research + +Standard cutoffs: 50+, 55+, 60+ + +### Two-tier testing + +1. **Canada-wide** - Full sample availability +2. **Ontario-specific** - Filtered to Ontario (province == 35) + +## Workflow state tracking + +Each CEP subgroup should have a `_workflow_state.yaml`: + +```yaml +domain: [domain] +subgroup: [subgroup] +status: [L0_pending | L1_complete | ... | L6_complete] + +stages: + L0_documentation: + status: [pending | in_progress | complete] + date_completed: "YYYY-MM-DD" + output: L0_documentation_assessment.md + + L1_concordance: + status: [pending | in_progress | complete] + # ... + +# etc. +``` + +## Related documentation + +- [variableStart-databaseStart-authoring.md](variableStart-databaseStart-authoring.md) - Technical authoring rules +- [field-reference.md](field-reference.md) - Field definitions diff --git a/.claude/skills/cchsflow-worksheets/docs/pumf-master-harmonization.md b/.claude/skills/cchsflow-worksheets/docs/pumf-master-harmonization.md new file mode 100644 index 00000000..e822ce38 --- /dev/null +++ b/.claude/skills/cchsflow-worksheets/docs/pumf-master-harmonization.md @@ -0,0 +1,489 @@ +# PUMF vs Master harmonization + +This document describes how to author worksheet rows when PUMF and Master databases require different recoding logic for the same harmonized variable. + +## CCHS PUMF availability by cycle + +Before authoring `_p` database entries, verify which PUMF files actually exist. StatsCan PUMF releases have been irregular, particularly around the COVID-19 period. 
+ +| Cycle | PUMF status | cchsflow database name | +|-------|-------------|------------------------| +| 2001–2017/18 | Released | `cchs2001_p` … `cchs2017_2018_p` | +| 2019-2020 | Released | `cchs2019_2020_p` | +| 2021 | **Not released standalone** — combined with 2022 into a 2021–2022 PUMF | No `cchs2021_p`; no `cchs2021_2022_p` yet | +| 2022 | Combined with 2021 (see above) | No standalone `cchs2022_p` | +| 2023 | Status uncertain | Check before authoring | + +**Key rule**: `cchs2021_p` is an **invalid database name** — do not use it. The 2021 PUMF data is only available as part of the combined 2021-2022 file, which cchsflow has not yet added as a database. + +**For variables with `cchs2021_p` in databaseStart**: This is a branch-wide error introduced during v3-smoking development. Remove `cchs2021_p` from `databaseStart` and `variableStart` for any variable you are working on. The broader fix is tracked separately. + +**Future PUMF uncertainty**: StatsCan has signalled that Master data collection will continue but PUMF scope and release frequency may change. When adding new `_p` databases, confirm availability with the cchs-metadata MCP or GN before authoring. + +## When to split rows by database type + +Most cchsflow variables use the same source variable on both PUMF and Master, so `_p` and `_m` databases share the same worksheet rows. A **split** is needed when: + +1. **PUMF has a grouped/categorical variable** while **Master has the ungrouped continuous version** of the same concept +2. **The recoding logic differs** between the two — typically midpoint imputation from categorical (PUMF) vs continuous pass-through (Master) +3. 
**DerivedVar rows with different feeder sets** — when a derived variable uses different feeder variables for PUMF vs Master (e.g., `DHHGAGE_cont` on PUMF vs `DHH_AGE` on Master), the DerivedVar rows must be split accordingly (see [DerivedVar row splitting](#derivedvar-row-splitting) below) + +### Common triggers + +| PUMF source | Master source | Concept | +|-------------|---------------|---------| +| SMKG06C (grouped categorical) | SMK_06C (continuous years) | Years since stopped occasional smoking | +| SMKG09C (grouped categorical) | SMK_09C (continuous years) | Years since stopped daily smoking | +| SMKDGSTP (grouped 5-category) | SMKDVSTP (continuous 0-88) | Years since quit completely | +| SMK_06A (4-category) | SMK_06A + SMK_06C | Quit timing (categorical + continuous companion) | + +### Continuous companion variable names by era + +The continuous companion variables follow the standard CCHS era naming conventions: + +| Concept | 2001 | 2003 | 2005 | 2007-2014 | 2015-2021 | 2022-2023 | +|---------|------|------|------|-----------|-----------|-----------| +| Time since stopped occasional smoking | — | SMKC_06C | SMKE_06C | SMK_06C | SMK_070 | N/A | +| Time since quit daily | — | SMKC_09C | SMKE_09C | SMK_09C | SMK_090 | SPU_25A/B (restructured) | +| Time since last smoked | — | SMKC_10C | SMKE_10C | SMK_10C | SMK_110 | N/A | + +**Key notes:** +- 2001 has no continuous companions (PUMF rows apply to both `_p` and `_m`) +- 2022-2023 restructured smoking into CSS/SPU modules — timing variables changed from a single continuous value (years) to a month+year pair (SPU_25A/SPU_25B). These are Master-only. +- Use the MCP `get_variable_history()` to verify existence and era names for any variable + +### Naming convention for harmonized Master continuous variables + +The cchsflow convention for harmonized variable names uses the **2007-2014 StatCan name** as the canonical form (matching the existing pattern: SMK_09A not SMK_080).
Therefore: + +| Harmonized name | StatCan source (2007-2014) | StatCan source (2015+) | +|-----------------|---------------------------|------------------------| +| **SMK_06C** | SMK_06C | SMK_070 | +| **SMK_09C** | SMK_09C | SMK_090 | +| **SMK_10C** | SMK_10C | SMK_110 | + +These are newly introduced in cchsflow (not in v2.1.0) because Master data was not previously harmonized. + +### When NOT to split + +- **Same variable on both files**: When PUMF and Master use the same variable name and coding (e.g., SMK_10 gate question) — share rows. +- **Same categories, different names**: When PUMF and Master have different variable names but identical category codes — use separate `variableStart` mappings in the same row group (strategy 2 below). +- **No continuous companion**: When only the categorical variable exists on Master (e.g., cchs2001 has no `_C` continuous variables) — use shared PUMF/Master rows for **recoding rows**. Note: this exception does not apply to DerivedVar rows where the feeder variable differs by database type (e.g., `cchs2001_m` still uses `DHH_AGE`, not `DHHGAGE_cont`). + +## Harmonization strategies + +### Identifying which strategy to use + +Work through these questions in order for each variable: + +``` +Q1: Do PUMF and Master use the same source variable name and coding? + YES → Strategy 1 (shared rows) + NO → Continue to Q2 + +Q2: Do the source variables have the same type (both categorical, or both continuous)? + YES → Strategy 2 (different names, same recoding) + NO → Continue to Q3 + +Q3: Does Master have a continuous source where PUMF only has categorical? + YES → Strategy 3 (different recoding logic — PUMF/Master split) + NO → Investigate further (may need a derived function) +``` + +**Important**: Strategies compound. A single variable often requires strategy 2 (era naming) AND strategy 3 (PUMF/Master split) simultaneously. See "Compounding strategies" below. 
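The Q1 to Q3 questions above can be sketched as a small R function. This is an illustration only: `choose_strategy()` and its three logical arguments are hypothetical names, not part of cchsflow, and the function answers for a single comparison (one era boundary or one PUMF/Master pair), so a variable that compounds strategies would need one call per comparison.

```r
# Hypothetical helper (not part of cchsflow): encodes the Q1-Q3 questions.
# Each argument is the TRUE/FALSE answer to the corresponding question.
choose_strategy <- function(same_name_and_coding, same_type, master_cont_pumf_cat) {
  # Q1: same source variable name and coding on PUMF and Master?
  if (same_name_and_coding) {
    return("Strategy 1: shared rows")
  }
  # Q2: same source type (both categorical, or both continuous)?
  if (same_type) {
    return("Strategy 2: different names, same recoding")
  }
  # Q3: Master continuous where PUMF only has categorical?
  if (master_cont_pumf_cat) {
    return("Strategy 3: PUMF/Master split")
  }
  "Investigate further (may need a derived function)"
}

choose_strategy(TRUE, TRUE, FALSE)    # same name and coding
choose_strategy(FALSE, TRUE, FALSE)   # renamed, same categories
choose_strategy(FALSE, FALSE, TRUE)   # categorical PUMF, continuous Master
```

The early returns keep the questions in the documented order, so a "YES" at Q1 never reaches Q2 or Q3.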
+ +### Strategy 1: Same source, same recoding + +PUMF and Master use the same variable name and coding. They share rows — no split needed. + +``` +databaseStart: cchs2007_2008_p, cchs2007_2008_m +variableStart: [SMK_10] +recStart: 1 recEnd: 1 +``` + +This is the most common case. The `_p` and `_m` databases appear together in `databaseStart`. + +### Strategy 2: Different names, same recoding + +The source variable name differs (across eras or between PUMF and Master), but the category codes and recoding logic are identical. Use explicit `db::VAR` mappings within the same row group — **no row duplication needed**. + +**Example — era naming differences within one row group:** + +SMK_09A has different names across eras (SMKA_09A in 2001, SMKC_09A in 2003, etc.) but the same 4 categories with the same codes. One set of rows handles all eras: + +``` +databaseStart: cchs2001_m, cchs2003_m, cchs2005_m, cchs2007_2008_m, ..., cchs2013_2014_m +variableStart: cchs2001_m::SMKA_09A, cchs2003_m::SMKC_09A, cchs2005_m::SMKE_09A, [SMK_09A] +recStart: 1 recEnd: 0.5 +recStart: 2 recEnd: 1.5 +recStart: 3 recEnd: 2.5 +... +``` + +The `[SMK_09A]` default applies to 2007-2014 databases where the name is stable. Each database resolves to its correct source variable via the explicit `db::VAR` mappings. + +**When row groups ARE needed (era boundary with rename):** + +If `databaseStart` spans the 2015 rename boundary, you must split into separate row groups because the `[VAR]` default cannot safely span both eras. The recoding logic is still identical — only the `variableStart` mappings change: + +``` +# Row group 1: 2007-2014 +databaseStart: cchs2007_2008_m, ..., cchs2013_2014_m +variableStart: [SMK_09A] +recStart: 1 recEnd: 0.5 + +# Row group 2: 2015+ +databaseStart: cchs2015_2016_m, ..., cchs2021_m +variableStart: cchs2015_2016_m::SMK_080, cchs2017_2018_m::SMK_080, ... +recStart: 1 recEnd: 0.5 +``` + +Same `recStart`/`recEnd` in both groups. 
The split is driven by naming safety, not by different recoding rules. + +### Strategy 3: Different recoding logic (PUMF/Master split) + +PUMF and Master need genuinely different recoding — typically midpoint imputation from categorical (PUMF) vs `copy` pass-through from continuous (Master). Must split into separate row groups with different `typeStart`, `recStart`, `recEnd`. + +This strategy is required when: +- The harmonized variable has `typeEnd=cont` (continuous output) +- PUMF only has categorical source variables +- Master has both categorical AND continuous companion variables +- You want to preserve the continuous precision available on Master + +### Compounding strategies: era naming + PUMF/Master split + +Real variables frequently require both strategy 2 and strategy 3 at once. SMK_06A_cont is a good example — it needs: +- **Strategy 2** because the source variable name changes across eras (SMKC_06A → SMKE_06A → SMK_06A → SMK_060) +- **Strategy 3** because Master has a continuous companion (SMK_06C) that PUMF lacks + +The result is multiple row groups, each addressing a combination of era and database type: + +``` +Group 1: PUMF all eras (midpoint for all categories including cat 4) + databaseStart: cchs2001_p, cchs2001_m, cchs2003_p, ..., cchs2017_2018_p + variableStart: cchs2001_p::SMKA_06A, cchs2001_m::SMKA_06A, cchs2003_p::SMKC_06A, ... 
+ recStart: 1→0.5, 2→1.5, 3→2.5, 4→4, 6→NA::a, [7,9]→NA::b, else→NA::b + +Group 2: Master 2003-2014 (midpoint cats 1-3, copy cat 4 from continuous) + databaseStart: cchs2003_m, cchs2005_m, ..., cchs2013_2014_m + variableStart (cat rows): cchs2003_m::SMKC_06A, cchs2005_m::SMKE_06A, [SMK_06A] + variableStart (copy row): cchs2003_m::SMKC_06C, cchs2005_m::SMKE_06C, [SMK_06C] + recStart: 1→0.5, 2→1.5, 3→2.5, 6→NA::a, [7,9]→NA::b, else→NA::b, copy→copy + +Group 3: Master 2015+ (same logic as group 2, but different variable names) + databaseStart: cchs2015_2016_m, ..., cchs2021_m + variableStart (cat rows): cchs2015_2016_m::SMK_060, ... + variableStart (copy row): cchs2015_2016_m::SMK_070, ... + recStart: 1→0.5, 2→1.5, 3→2.5, 6→NA::a, [7,9]→NA::b, else→NA::b, copy→copy +``` + +Note that: +- Group 1 includes `cchs2001_m` because 2001 has no continuous companion — it stays with PUMF +- Groups 2 and 3 are split from each other (strategy 2) because the variable was renamed in 2015 +- Groups 2 and 3 are split from group 1 (strategy 3) because Master needs different recoding for cat 4 + +## Working across eras + +### The era-walking workflow + +Harmonization typically proceeds chronologically — start at 2001 and work forward through each era boundary. At each boundary, reassess whether the existing strategy still applies: + +``` +2001: Establish baseline. What source variables exist? PUMF and Master same or different? + → Choose initial strategy. + +2003-2005: Pre-2007 era. Variable names change by cycle letter but categories are usually stable. + → Likely extends with strategy 2 (explicit db::VAR mappings). + +2007: Standard naming era begins. Check whether new variables were introduced + (e.g., continuous companions like SMK_06C appear on Master). + → May need to add strategy 3 and revisit 2003-2005 row groups. + +2015: Major redesign. Variable names renumbered, PUMF may switch to grouped versions, + question wording may change. 
+ → Re-evaluate: can you extend, or is this a semantic break? + +2022+: Module restructuring (e.g., smoking → substance use). Variable names may change domain prefix. + → Same evaluation as 2015. +``` + +At each boundary, the key question is: **can I extend the existing harmonized variable, or does this boundary require a new one?** + +### One variable vs multiple variables + +**Prefer one variable with multiple row groups** when the output means the same thing across eras, even if the source variables, precision, or strategies differ. Internal row groups handle the complexity — the researcher sees a single consistent variable. + +**Create separate harmonized variables** when the output semantics genuinely change: +- Categories were redefined (not just renamed or renumbered) +- The question wording changed to ask something different +- The population filter changed (e.g., "all respondents" → "daily smokers only") +- Combining eras would mislead researchers about comparability + +### Grey areas + +Some boundaries are judgement calls: + +- **Category collapse/expansion**: If 2015+ added a 5th category, you could collapse to 4 across all eras (one variable) or provide both versions. Consider what researchers need — a single long time series, or era-specific detail? + +- **Precision differences**: PUMF midpoint (4 values) vs Master continuous (actual years) is a precision difference, not a semantic break. One variable with internal PUMF/Master splits is appropriate. Document the precision difference in `notes`. + +- **Partial availability**: If a variable exists on PUMF in some cycles but not others, it can still be one variable — the `databaseStart` simply doesn't include the missing cycles. But if PUMF availability is sparse enough to be misleading, consider separate variables with clear coverage. 
+ +### Naming harmonized variables + +When separate harmonized variables are needed, the name should communicate **why the split exists**: + +| Pattern | Use when | Example | +|---------|----------|---------| +| `_cont` / `_cat` / `_catN` | Output type differs | `SMK_06A_cont`, `SMK_06A_cat`, `SMKDSTY_cat5` | +| `_pre2015` / `_post2015` | Semantic break at era boundary | Where question meaning changed | +| Descriptive qualifier | Concept-specific | `SMKDVSTP` (Master continuous) vs `SMKDGSTP` (PUMF grouped) | + +**Avoid** the legacy `_A` / `_B` convention — the letter suffix doesn't communicate what it distinguishes. When you encounter existing `_A`/`_B` variables, check whether a more descriptive name would reduce cognitive load. + +**Don't encode year ranges in variable names** (e.g., `_2003_2014`) — these become stale when new cycles are added. Instead, document the cycle coverage in `databaseStart` and `notes`. + +## The categorical + continuous hybrid pattern + +This is the core pattern for cessation timing variables (SMK_06A_cont, SMK_09A_cont, SMK_10A_cont). + +### The source data structure + +The CCHS asks former smokers "when did you stop?" with a 4-category response: + +| Category | Label | Midpoint | +|----------|-------|----------| +| 1 | Less than 1 year ago | 0.5 years | +| 2 | 1 to less than 2 years | 1.5 years | +| 3 | 2 to less than 3 years | 2.5 years | +| 4 | 3 or more years ago | **Open-ended** | + +On **PUMF**, only the categorical variable exists (e.g., SMK_06A). Category 4 gets a conservative fixed estimate (4 years). + +On **Master**, StatsCan also provides a **continuous companion** (e.g., SMK_06C) with the actual number of years. Category 4 respondents can get their true value. + +### Why categories 1-3 use the same recoding + +Categories 1-3 are bounded intervals where both PUMF and Master have the same information (the categorical response). 
Midpoint imputation is appropriate for both: +- Cat 1 → 0.5 (midpoint of 0-1) +- Cat 2 → 1.5 (midpoint of 1-2) +- Cat 3 → 2.5 (midpoint of 2-3) + +Although Master has the continuous value, `rec_with_table()` processes rows sequentially — the categorical source is used for categories 1-3 (matching on `recStart`), and the continuous source is only invoked for the separate `copy` row. + +### Why category 4 needs the split + +Category 4 is open-ended ("3+ years"). On PUMF, the best we can do is assign a fixed value (4 years — conservative). On Master, the continuous variable gives the actual years (3, 5, 12, 27...). The split preserves this precision. + +### Row structure for Master database groups + +Each Master database group produces 7 rows: + +``` +Row 1: typeStart=cat recStart=1 recEnd=0.5 (from categorical source, e.g., SMK_06A) +Row 2: typeStart=cat recStart=2 recEnd=1.5 (from categorical source) +Row 3: typeStart=cat recStart=3 recEnd=2.5 (from categorical source) +Row 4: typeStart=cat recStart=6 recEnd=NA::a (not applicable) +Row 5: typeStart=cat recStart=[7,9] recEnd=NA::b (missing) +Row 6: typeStart=cat recStart=else recEnd=NA::b (catch-all) +Row 7: typeStart=cont recStart=copy recEnd=copy (from continuous source, e.g., SMK_06C) +``` + +**Key**: Row 7 uses a **different `variableStart`** — the continuous companion variable — and `typeStart=cont` instead of `typeStart=cat`. 
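As a rough sketch, the seven-row Master layout above could be generated rather than typed by hand. `build_master_rows()` is a hypothetical helper, and the data frame holds only the fields discussed in this section; the real `variable_details.csv` has more columns, which are assumed to be filled in separately.

```r
# Hypothetical helper (not part of cchsflow): builds the 7-row Master block.
# Only the worksheet fields discussed in this document are included.
build_master_rows <- function(variable, database_start, cat_start, cont_start) {
  # Rows 1-6: categorical source, midpoints plus missing-value handling
  cat_rows <- data.frame(
    variable = variable,
    databaseStart = database_start,
    variableStart = cat_start,
    typeStart = "cat",
    recStart = c("1", "2", "3", "6", "[7,9]", "else"),
    recEnd = c("0.5", "1.5", "2.5", "NA::a", "NA::b", "NA::b"),
    stringsAsFactors = FALSE
  )
  # Row 7: continuous companion, passed through with copy
  copy_row <- data.frame(
    variable = variable,
    databaseStart = database_start,
    variableStart = cont_start,
    typeStart = "cont",
    recStart = "copy",
    recEnd = "copy",
    stringsAsFactors = FALSE
  )
  rbind(cat_rows, copy_row)
}

rows <- build_master_rows(
  variable = "SMK_06A_cont",
  database_start = "cchs2007_2008_m, cchs2009_2010_m",
  cat_start = "[SMK_06A]",
  cont_start = "[SMK_06C]"
)
print(rows[, c("variableStart", "typeStart", "recStart", "recEnd")])
```

Keeping the categorical and continuous sources as separate arguments makes it harder to forget that the `copy` row needs its own `variableStart` and `typeStart=cont`.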
+ +### Row structure for PUMF database groups + +PUMF rows are identical except category 4 uses a fixed midpoint: + +``` +Row 1: typeStart=cat recStart=1 recEnd=0.5 +Row 2: typeStart=cat recStart=2 recEnd=1.5 +Row 3: typeStart=cat recStart=3 recEnd=2.5 +Row 4: typeStart=cat recStart=4 recEnd=4 (fixed estimate — no continuous source) +Row 5: typeStart=cat recStart=6 recEnd=NA::a +Row 6: typeStart=cat recStart=[7,9] recEnd=NA::b +Row 7: typeStart=cat recStart=else recEnd=NA::b +``` + +## Reference implementation: SMKDGSTP_cont + +This variable unifies years since quit across PUMF and Master with three distinct pathways: + +| Database type | Source | Strategy | +|---------------|--------|----------| +| Master all cycles | SMKCDSTP/SMKEDSTP/SMKDSTP/SMKDVSTP | `recStart=[0,79]`, `recEnd=copy` — true continuous pass-through | +| PUMF 2007-2008 | SMKDSTP | `recStart=[0,82]`, `recEnd=copy` — continuous available on early PUMF | +| PUMF 2015+ | SMKDGSTP | Categorical midpoint: 0→0.5, 1→1.5, 2→4.0, 3→8.0, 4→15.0 | + +Note that PUMF 2007-2008 uses `copy` because the continuous variable happened to be available on that PUMF file. This is a reminder to check each cycle individually — PUMF availability varies. + +**Location**: `inst/extdata/variable_details.csv`, SMKDGSTP_cont rows. + +## Reference implementation: SMK_06A_cont (cessation fix) + +See `ceps/cep-002-smoking/03-cessation/smk_quit_fix_variable_details.csv` for the 49 Master-only rows and `generate_smk_quit_fix.R` for the generation script. + +**Before the split** (existing state): SMK_06A_cont has rows with both `_p` and `_m` databases mixed together, all using midpoint imputation including a fixed value of 4 for category 4. 
+ +**After the split**: +- Existing rows → PUMF-only (`_p` databases only, plus cchs2001_m) +- New rows → Master-only (`_m` databases from 2003+), with `copy` pass-through for category 4 + +**Exception**: cchs2001_m stays with the PUMF rows because the 2001 cycle has no continuous companion variable (SMKA_06C doesn't exist). + +## How `rec_with_table()` handles `copy` + +The `copy` keyword is recognised in two code paths in `R/recode-with-table.R`: + +**Path 1** (lines 521-526): When `recStart=else` and `recEnd=copy`, all unmatched values are copied from the source column: +```r +if (is_equal(else_value, "copy")) { + recoded_data[variable_being_checked] <- data[data_variable_being_checked] +} +``` + +**Path 2** (lines 628-631): When a specific `recStart` range matches and `recEnd=copy`, the matching source values are copied directly: +```r +if (is_equal(value_recorded, "copy")) { + value_recorded <- data[valid_row_index, data_variable_being_checked] +} +``` + +For the cessation pattern, Path 2 is used: `recStart=copy` means "match all values" and `recEnd=copy` means "pass through as-is". + +## Step-by-step workflow for applying a split + +### Step 1: Identify databases with continuous sources + +Check DDI or use `R/source-lookups.R` to find which cycles have the continuous companion variable: + +```r +source("R/source-lookups.R") +# Does SMK_06C exist on Master? +variable_exists_in_database("SMK_06C", "cchs2007_2008_m") # TRUE +# Does it exist on PUMF? +variable_exists_in_database("SMK_06C", "cchs2007_2008_p") # FALSE (PUMF has SMKG06C instead) +``` + +### Step 2: Group databases by type + +| Group | Databases | Strategy | +|-------|-----------|----------| +| PUMF all cycles | `_p` databases | Midpoint for all categories | +| Master with continuous | `_m` databases (2003+) | Midpoint cats 1-3, `copy` cat 4 | +| Master without continuous | `cchs2001_m` | Stays with PUMF rows | + +### Step 3: Create PUMF-only rows + +Remove `_m` databases from the existing mixed rows.
The `databaseStart` and `variableStart` should reference only `_p` databases (plus any `_m` databases that lack continuous sources). + +### Step 4: Create Master-only rows + +For each Master database group: +1. Create 6 categorical rows using the **categorical source** (e.g., SMK_06A) +2. Create 1 continuous row using the **continuous source** (e.g., SMK_06C) +3. Set `typeStart=cont`, `recStart=copy`, `recEnd=copy` on the continuous row + +Mind the era naming: the continuous variable was renamed in 2015 (e.g., SMK_06C → SMK_070). Use explicit `db::VAR` mappings. + +### Step 5: Verify consistency + +```r +# Check that variables.csv databaseStart = union of all detail rows +# Use /cchsflow-validation to run all checks +``` + +### Step 6: Handle exceptions + +Document any databases that don't follow the pattern in the `reviewNotes` field: +``` +reviewNotes: "cchs2001_m stays with PUMF rows (no continuous variable in 2001)" +``` + +## DerivedVar row splitting + +The PUMF/Master split obligation applies equally to `DerivedVar::` rows, not just recoding rows. When a derived variable uses **different feeder variables** for PUMF vs Master, the DerivedVar rows must be split by database type. + +### The rule + +> A DerivedVar row must not mix `_p` and `_m` databases if those databases use different feeder variables. + +This is invisible to `rec_with_table()` — it will silently process both row groups for every database, using whichever feeder happens to be present. No error is raised. The bug only becomes visible through output comparison or dependency resolution tools. + +### Common case: age variable + +The most common DerivedVar split is the age feeder: + +| Database type | Age feeder | Notes | +|---------------|------------|-------| +| PUMF (`_p`) | `DHHGAGE_cont` | PUMF-only midpoint-imputed continuous age | +| Master (`_m`) | `DHH_AGE` | Master true continuous age, exists in all cycles including 2001 | + +**`DHHGAGE_cont` is PUMF-only.** It does not exist on Master. 
If a DerivedVar row lists both `_p` and `_m` databases with `DHHGAGE_cont` as a feeder, `rec_with_table()` will fail silently for Master databases. + +**`cchs2001_m` uses `DHH_AGE`**, not `DHHGAGE_cont`. The "2001 Master stays with PUMF rows" exception applies only to recoding rows where the continuous companion variable doesn't exist in 2001 (e.g., `SMK_06C`). For age, the Master variable `DHH_AGE` exists in all cycles including 2001 — so `cchs2001_m` belongs with the Master `DHH_AGE` rows. + +### Example: pack_years_der + +**Before (incorrect):** All 6 DerivedVar rows list all `_p` and `_m` databases together: + +``` +# Rows 1-3 (wrong — _m databases should not be here) +databaseStart: cchs2001_p, ..., cchs2023_p, cchs2001_m, ..., cchs2023_m +variableStart: DerivedVar::[SMKDSTY_A, DHHGAGE_cont, age_start_smoking, ...] + +# Rows 4-6 (wrong — _p databases should not be here) +databaseStart: cchs2001_p, ..., cchs2023_p, cchs2001_m, ..., cchs2023_m +variableStart: DerivedVar::[SMKDSTY_A, DHH_AGE, age_start_smoking, ...] +``` + +**After (correct):** Rows split cleanly by database type: + +``` +# Rows 1-3: PUMF only +databaseStart: cchs2001_p, cchs2003_p, ..., cchs2023_p +variableStart: DerivedVar::[SMKDSTY_A, DHHGAGE_cont, age_start_smoking, ...] + +# Rows 4-6: Master only (including cchs2001_m) +databaseStart: cchs2001_m, cchs2003_m, ..., cchs2023_m +variableStart: DerivedVar::[SMKDSTY_A, DHH_AGE, age_start_smoking, ...] 
+``` + +### How to check + +Use `resolve_dependencies()` from `variable-tools.R` with a `databases` filter and verify that the feeder list matches expectations for that database type: + +```r +devtools::load_all() +vd <- read.csv("inst/extdata/variable_details.csv", stringsAsFactors = FALSE) + +# Should show DHHGAGE_cont, not DHH_AGE +deps_p <- resolve_dependencies("pack_years_der", variable_details = vd, + databases = "cchs2001_p") +deps_p$graph[["pack_years_der"]]$feeders + +# Should show DHH_AGE, not DHHGAGE_cont +deps_m <- resolve_dependencies("pack_years_der", variable_details = vd, + databases = "cchs2001_m") +deps_m$graph[["pack_years_der"]]$feeders +``` + +If both return the same (combined) feeder list, the rows need splitting. + +## Common errors + +| Error | Consequence | Prevention | +|-------|-------------|------------| +| Forgetting to remove `_m` from original rows after adding Master rows | Duplicate processing — variable gets recoded twice for Master databases | Always update the PUMF rows' `databaseStart` when adding Master rows | +| Using `[SMK_09C]` default for 2015+ | Variable not found — `SMK_09C` was renamed to `SMK_090` in 2015 | Use explicit `db::VAR` mappings for all eras (see [variableStart-databaseStart-authoring.md](variableStart-databaseStart-authoring.md)) | +| Splitting cchs2001_m when no continuous variable exists | `copy` row references a non-existent source variable | Check DDI for each cycle before assuming continuous exists | +| Using `typeStart=cat` on the `copy` row | `rec_with_table()` treats values as category codes instead of continuous | The `copy` row must have `typeStart=cont` | +| Inconsistent midpoint values between PUMF and Master rows | Different output for same category depending on database | Use identical midpoints for categories 1-3 on both PUMF and Master rows | +| Mixing `_p` and `_m` databases in DerivedVar rows with different feeder sets | Silent wrong output — `rec_with_table()` processes both row groups for 
every database | Split DerivedVar rows by database type whenever feeder sets differ; verify with `resolve_dependencies()` | +| Using `DHHGAGE_cont` as age feeder in a row that includes `_m` databases | Master databases use PUMF age variable — wrong age values or silent failure | `DHHGAGE_cont` is PUMF-only; use `DHH_AGE` for all `_m` databases including `cchs2001_m` | + +## Related documentation + +- [variableStart-databaseStart-authoring.md](variableStart-databaseStart-authoring.md) — era-specific naming and the dangerous default pattern +- [harmonization-workflow.md](harmonization-workflow.md) — L0-L6 staged workflow (identify PUMF vs Master differences at L2) diff --git a/.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md b/.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md new file mode 100644 index 00000000..6ab5f8ba --- /dev/null +++ b/.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md @@ -0,0 +1,365 @@ +# variableStart and databaseStart authoring + +This document covers the complex coordination between `variableStart` and `databaseStart` fields in cchsflow worksheets, including era-specific mappings and derived variable patterns. + +## Core principle + +**databaseStart should be derived from variableStart**, not authored independently. The variableStart field is the source of truth for which databases a harmonized variable supports. 
+ +## How rec_with_table() processes variableStart + +The `get_data_variable_name()` function in `R/recode-with-table.R` (lines 380-424) determines which source variable to use: + +```r +# Priority 1: Explicit mapping for this database +if (grepl(data_name, var_start_names)) { + for (var_name in var_start_names_list) { + if (grepl(data_name, var_name)) { + data_variable_being_checked <- strsplit(var_name, "::")[[1]][[2]] + } + } +# Priority 2: Default [VAR] notation +} else if (grepl("\\[", var_start_names)) { + data_variable_being_checked <- str_match(var_start_names, "\\[(.*?)\\]")[, 2] +} +``` + +**Key insight**: The `[VAR]` default applies to ALL databases in that row's databaseStart that don't have explicit `db::VAR` mappings. + +## variableStart notation patterns + +| Pattern | Meaning | Example | +|---------|---------|---------| +| `db::VAR` | Explicit mapping for one database | `cchs2001_m::SMKA_203` | +| `[VAR]` | Default for unmapped databases | `[SMK_203]` | +| `db::[VAR1, VAR2]` | Multi-variable input for one database | `cchs2015_p::[SMKG005, SMKG040]` | +| `DerivedVar::[VAR1, VAR2]` | Inputs for derived function | Triggers `Func::` processing | + +### Invalid patterns + +| Pattern | Problem | +|---------|---------| +| `[[VAR1, VAR2]]` | Double brackets not supported - typo | +| `[VAR1, VAR2]` without `db::` | Ambiguous - use explicit mappings | + +## CCHS variable naming eras + +Variables have different names across eras. 
**You must use explicit mappings when spanning eras.** + +| Era | Years | Pattern | Example | +|-----|-------|---------|---------| +| Pre-2007 | 2001-2005 | Cycle letter in 4th position | `SMKA_203` (2001), `SMKC_203` (2003), `SMKE_203` (2005) | +| 2007-2014 | 2007-2014 | Standard naming | `SMK_203`, `SMKDSTY` | +| Post-2014 | 2015-2021 | 3-digit increments of 5 | `SMK_040`, `SMKDVSTY`, `ADL_005` | +| 2023+ | 2023+ | 2-digit (some domains) | `ADL_05` (was `ADL_005`), `ADL_10` (was `ADL_010`) | + +**Note on 2023 renames:** Some domains (e.g., ADL) were renamed again in 2023 from 3-digit to 2-digit numbering (`ADL_005` → `ADL_05`). When a variable's databaseStart spans 2015-2021 and 2023+, the `[VAR]` default cannot be used — explicit mappings are required for both eras. Check cchsflow-docs data dictionaries for each domain. + +### Era mapping reference for common variables + +| Concept | Pre-2007 | 2007-2014 | Post-2014 | 2022+ | +|---------|----------|-----------|-----------|-------| +| Age started daily (current) | SMKA/C/E_203 | SMK_203 | SMK_040 (filtered) | CSS_25 | +| Age started daily (former) | SMKA/C/E_207 | SMK_207 | SMK_040 (filtered) | CSS_25 | +| Type of smoker (derived) | SMKA/C/EDSTY | SMKDSTY | SMKDVSTY | SMKDVSTY | +| When stopped daily (cat) | SMKA/C/E_09A | SMK_09A | SMK_080 | SPU_25 | +| Years since quit (derived) | SMKA/C/EDSTP | SMKDSTP | SMKDVSTP | SMKDVSTP | + +### Cessation timing variables - 2015 redesign + +The cessation timing variables underwent a complete renumbering in 2015. 
**These are commonly missed:** + +| Series | Component | Pre-2007 | 2007-2014 | 2015-2021 | 2022+ | +|--------|-----------|----------|-----------|-----------|-------| +| **SMK_06** (Former occasional) | Categorical (A) | SMKA/C/E_06A | SMK_06A | SMK_060 | SPU_10 | +| | Month (B) | SMKA/C/E_06B | SMK_06B | SMK_065 | SPU_10A | +| | Years (C) | SMKA/C/E_06C | SMK_06C | **SMK_070** | SPU_10B | +| **SMK_09** (Stopped daily) | Categorical (A) | SMKA/C/E_09A | SMK_09A | SMK_080 | SPU_25 | +| | Month (B) | SMKA/C/E_09B | SMK_09B | SMK_085 | SPU_25A | +| | Years (C) | SMKA/C/E_09C | SMK_09C | **SMK_090** | SPU_25B | +| **SMK_10** (Quit completely) | Gate | SMKA/C/E_10 | SMK_10 | SMK_095 | SPU_30 | +| | Categorical (A) | SMKA/C/E_10A | SMK_10A | SMK_100 | SPU_35 | +| | Month (B) | SMKA/C/E_10B | SMK_10B | SMK_105 | SPU_35A | +| | Years (C) | SMKA/C/E_10C | SMK_10C | **SMK_110** | SPU_35B | + +**Common error**: Using `[SMK_09C]` for 2015+ cycles. The variable is named `SMK_090` in 2015+. + +### PUMF grouped variables - 2015 redesign + +PUMF files use grouped versions (SMKG prefix). These also changed in 2015: + +| Variable | 2007-2014 | 2015+ | +|----------|-----------|-------| +| Years since stopped (occasional) | SMKG06C | SMKG070 | +| Years since stopped daily | SMKG09C | SMKG090 | +| Years since quit completely | SMKG10C | SMKG110 | + +## The dangerous default pattern + +**WRONG** - Using `[VAR]` across naming eras: + +``` +databaseStart: cchs2007_2008_m, cchs2015_2016_m, cchs2022_m +variableStart: [SMK_09A] +``` + +This applies `SMK_09A` to ALL three databases, but the variable was renamed `SMK_080` in 2015-2021 and `SPU_25` in 2022+ (see the cessation timing table above). + +**CORRECT** - Explicit era mappings: + +``` +databaseStart: cchs2007_2008_m, cchs2015_2016_m, cchs2022_m +variableStart: cchs2015_2016_m::SMK_080, cchs2022_m::SPU_25, [SMK_09A] +``` + +Now `[SMK_09A]` only applies to `cchs2007_2008_m`.
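To make the failure mode concrete, here is a minimal sketch of how a `variableStart` string resolves for one database, following the priority order described in this document (explicit `db::VAR` first, then the `[VAR]` default). `resolve_source_var()` is a hypothetical helper, not the cchsflow implementation, and it assumes single-variable mappings only; multi-variable forms such as `db::[VAR1, VAR2]` are out of scope because the string is split on commas.

```r
# Hypothetical helper (not the cchsflow implementation): which source variable
# would one database read from a variableStart string?
# Assumes single-variable entries only (no db::[VAR1, VAR2] forms).
resolve_source_var <- function(variable_start, database) {
  entries <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])

  # Priority 1: explicit db::VAR mapping for this database
  prefix <- paste0(database, "::")
  explicit <- entries[startsWith(entries, prefix)]
  if (length(explicit) > 0) {
    return(substring(explicit[1], nchar(prefix) + 1))
  }

  # Priority 2: [VAR] default, applied to every unmapped database
  is_default <- startsWith(entries, "[") & endsWith(entries, "]")
  if (any(is_default)) {
    default_entry <- entries[is_default][1]
    return(substring(default_entry, 2, nchar(default_entry) - 1))
  }

  NA_character_
}

wrong <- "[SMK_09A]"
right <- "cchs2015_2016_m::SMK_080, [SMK_09A]"

resolve_source_var(wrong, "cchs2015_2016_m")  # "SMK_09A": default leaks into 2015+
resolve_source_var(right, "cchs2015_2016_m")  # "SMK_080": explicit mapping wins
resolve_source_var(right, "cchs2007_2008_m")  # "SMK_09A": default applies
```

The first call shows why the default is dangerous: nothing in the string itself stops it from being applied to a database where the name no longer exists.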
+ +## Mixed direct-recoding and derived-function variables + +A harmonized variable can have **multiple blocks of rows** in variable_details.csv with different processing: + +### Block types + +1. **Direct recoding rows** - Use `recStart`/`recEnd` with midpoints or category mappings +2. **Derived function rows** - Use `Func::function_name` in `recEnd` + +### Example: SMKG203_cont + +This variable has direct recoding for 2001-2014 and derived function for 2015+: + +**Block 1: Direct recoding (2001-2014)** +```csv +variable,databaseStart,variableStart,recStart,recEnd +SMKG203_cont,"cchs2001_p, cchs2003_p","cchs2001_p::SMKAG203, cchs2003_p::SMKCG203",2,13 +SMKG203_cont,"cchs2001_p, cchs2003_p","cchs2001_p::SMKAG203, cchs2003_p::SMKCG203",3,17 +... +SMKG203_cont,"cchs2005_p, cchs2009_2010_p, ...","cchs2005_p::SMKEG203, [SMKG203]",2,13 +SMKG203_cont,"cchs2005_p, cchs2009_2010_p, ...","cchs2005_p::SMKEG203, [SMKG203]",3,16 +... +``` + +**Block 2: Derived function (2015+)** +```csv +variable,databaseStart,variableStart,recStart,recEnd +SMKG203_cont,"cchs2015_2016_p, cchs2017_2018_p, cchs2021_p","DerivedVar::[SMKG005, SMKG040]","[1,55]","Func::calculate_SMKG203_continuous" +SMKG203_cont,"cchs2015_2016_p, cchs2017_2018_p, cchs2021_p","DerivedVar::[SMKG005, SMKG040]",else,NA::b +``` + +### Key rules for mixed variables + +1. **Each block has its own databaseStart** - Direct recoding rows list 2001-2014 databases; derived rows list 2015+ databases + +2. **The `[VAR]` default is scoped to the row's databaseStart** - It doesn't apply across blocks + +3. **Derived function inputs come from DerivedVar::[], not from databaseStart** - The function receives already-recoded variables + +4. **variables.csv must list ALL databases** - Union of all blocks' databaseStart values + +5. 
**variables.csv variableStart is a summary** - Lists all explicit mappings and multi-variable inputs + +## Coordination between variables.csv and variable_details.csv + +### variables.csv entry (summary) + +```csv +variable,databaseStart,variableStart +SMKG203_cont,"cchs2001_p, cchs2003_p, cchs2005_p, ..., cchs2015_2016_p, cchs2017_2018_p, cchs2021_p","cchs2001_p::SMKAG203, cchs2003_p::SMKCG203, cchs2005_p::SMKEG203, cchs2007_2008_p::SMKG203, ..., cchs2015_2016_p::[SMKG005, SMKG040], cchs2017_2018_p::[SMKG005, SMKG040], cchs2021_p::[SMKG005, SMKG040]" +``` + +### Consistency requirements + +| Check | Requirement | +|-------|-------------| +| Database coverage | variables.csv databaseStart = union of all variable_details.csv databaseStart for that variable | +| Explicit mappings | All `db::VAR` in variable_details must appear in variables.csv variableStart | +| Multi-variable inputs | All `db::[VAR1, VAR2]` patterns must be consistent | + +## Validation infrastructure + +### Validate source references against DDI + +```r +source("R/validate-all-source-references.R") +result <- validate_all_source_references("path/to/variable_details.csv") +print_all_validation_result(result) +``` + +This parses variableStart (including applying `[VAR]` defaults to all unmapped databases) and checks each `db::VAR` pair against DDI. 
+ +### Check variable existence + +```r +source("R/source-lookups.R") + +# Single check +variable_exists_in_database("SMK_09A", "cchs2015_2016_m") # FALSE + +# Find correct name +vars <- get_variables_for_database("cchs2015_2016_m") +grep("SMK_08", vars, value = TRUE) # "SMK_080" +``` + +### Build validated variableStart + +```r +source("R/constrained-authoring.R") + +# This will ERROR if any mapping is invalid +build_variableStart(list( + cchs2007_2008_m = "SMK_09A", + cchs2015_2016_m = "SMK_080" +)) +# Returns: "cchs2007_2008_m::SMK_09A, cchs2015_2016_m::SMK_080" +``` + +## Common error patterns and fixes + +### Error: Variable not found in DDI + +**Cause**: Wrong era variable name applied via `[VAR]` default + +**Fix**: Add explicit mappings for each era: +``` +# Before (wrong) +variableStart: [SMKDVSTY] +databaseStart: cchs2009_2010_m, cchs2015_2016_m + +# After (correct) +variableStart: cchs2009_2010_m::SMKDSTY, [SMKDVSTY] +databaseStart: cchs2009_2010_m, cchs2015_2016_m +``` + +### Error: PUMF variable used for Master database + +**Cause**: Grouped variable (SMKG...) 
referenced for Master file + +**Fix**: Use ungrouped variable for Master: +``` +# Before (wrong) +variableStart: [SMKG203] +databaseStart: cchs2007_2008_m, cchs2007_2008_p + +# After (correct) +variableStart: cchs2007_2008_m::SMK_203, [SMKG203] +databaseStart: cchs2007_2008_m, cchs2007_2008_p +``` + +### Error: Double brackets in variableStart + +**Cause**: Typo - `[[VAR1, VAR2]]` instead of `DerivedVar::[VAR1, VAR2]` + +**Fix**: Use correct derived variable notation: +``` +# Before (wrong) +variableStart: [[SMKG005, SMKG040]] + +# After (correct) +variableStart: DerivedVar::[SMKG005, SMKG040] +``` + +## Authoring workflow + +### Step 1: Identify all databases and eras + +List target databases and group by naming era: +- Pre-2007: cchs2001, cchs2003, cchs2005 +- 2007-2014: cchs2007_2008 through cchs2013_2014 +- Post-2014: cchs2015_2016 through cchs2023 + +### Step 2: Look up correct variable names per era + +Use DDI or source-lookups.R: +```r +source("R/source-lookups.R") +get_variables_for_database("cchs2015_2016_m") |> grep("SMK", x = _, value = TRUE) +``` + +### Step 3: Build explicit mappings + +For each era, create explicit `db::VAR` mappings: +```r +mappings <- list( + cchs2001_m = "SMKA_203", + cchs2003_m = "SMKC_203", + cchs2005_m = "SMKE_203", + cchs2007_2008_m = "SMK_203", + # ... 2007-2014 all use SMK_203 + cchs2015_2016_m = "SMK_040", + # ... 2015+ all use SMK_040 +) +``` + +### Step 4: Determine if `[VAR]` default is safe + +A `[VAR]` default is safe ONLY when: +- All remaining unmapped databases use the same variable name +- The variable exists in all those databases +- **CRITICAL**: The variable name didn't change across the 2015 redesign boundary + +**Rule of thumb**: If your databaseStart spans both 2007-2014 AND 2015+ cycles, you almost certainly need explicit mappings for the 2015+ cycles. The `[VAR]` fallback will fail silently at runtime. + +### Step 5: Create variable_details rows + +Group rows by: +1. 
Common databaseStart + variableStart combinations +2. Processing type (direct recoding vs derived function) + +### Step 6: Create variables.csv entry + +Aggregate: +- databaseStart: Union of all variable_details databaseStart +- variableStart: All unique explicit mappings + multi-variable inputs + +### Step 7: Validate (MANDATORY) + +**This step is not optional.** Before merging worksheets to inst/extdata, you MUST validate: + +```r +source("R/validate-all-source-references.R") +result <- validate_all_source_references("path/to/variable_details.csv") +if (length(result$invalid_refs) > 0) { + print(result$invalid_refs) + stop("Cannot proceed with invalid source references") +} +``` + +**What this catches:** +- `[VAR]` defaults that don't exist in 2015+ cycles (e.g., `[SMK_09C]` when 2015+ uses `SMK_090`) +- Typos in variable names +- PUMF variables used for Master databases (or vice versa) +- Missing explicit mappings for renamed variables + +### Step 8: Cross-check variables.csv against variable_details.csv + +Ensure the summary in variables.csv matches the detail rows: + +```r +# Check that all explicit mappings in variable_details appear in variables.csv +# Check that databaseStart in variables.csv covers all databases in variable_details +source("R/csv-workflow.R") +csv_validate("path/to/variables.csv", "path/to/variable_details.csv") +``` + +## Preventing the 2015 rename error + +The most common error is using `[VAR]` for a variable that was renamed in 2015. To prevent this: + +1. **Check the era mapping tables above** before using `[VAR]` notation +2. **Run validation** against DDI before merging to inst/extdata +3. 
**If databaseStart includes both 2007-2014 AND 2015+ cycles**, add explicit mappings for 2015+ + +Example fix for SMK_09C: +``` +# WRONG - [SMK_09C] doesn't exist in 2015+ +variableStart: cchs2003_m::SMKC_09C, cchs2005_m::SMKE_09C, [SMK_09C] + +# CORRECT - explicit mappings for 2015+ where it's called SMK_090 +variableStart: cchs2003_m::SMKC_09C, cchs2005_m::SMKE_09C, cchs2015_2016_m::SMK_090, cchs2017_2018_m::SMK_090, cchs2019_2020_m::SMK_090, cchs2021_m::SMK_090, [SMK_09C] +``` + +## Related documentation + +- [field-reference.md](field-reference.md) - Field definitions and naming conventions +- [csv-templates.md](csv-templates.md) - Column templates for v3.0.0 schema +- [harmonization-workflow.md](harmonization-workflow.md) - L0-L6 staged workflow +- [derived-variables.md](derived-variables.md) - Derived function patterns From 358458820c6305513fc58f6f82aa85bc1823e16a Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Fri, 13 Mar 2026 19:41:04 -0400 Subject: [PATCH 05/15] docs(skill): Update cchsflow-review with recode block terminology and new checks Add recode block terminology definition to Check 2b. Clarify that the collision check is at the (database, recStart) level, not databaseStart overlap alone. Reference check_recode_blocks() and check_invalid_databases() automated checks. Add cchs2021_p/2022_p/2023_p to Check 5 invalid database patterns with context on why they don't exist as standalone PUMF files. Add multi-block databaseStart fix rule to Step 10: narrow each block to only the databases where its source variable exists; never replace the full databaseStart (risks dropping shorthand-covered databases). Include the Beyond Compare verification step. 
--- .claude/skills/cchsflow-review/SKILL.md | 113 ++++++++++++++---------- 1 file changed, 68 insertions(+), 45 deletions(-) diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index e58e6904..9c9350a7 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -290,7 +290,6 @@ For each in-scope variable: - Variable listed in `variableStart` but not found in documentation for that cycle → **P0** (wrong variable name) - Variable not checked (no documentation available for that cycle) → note as untested - Variable exists in additional cycles not included in `databaseStart` → informational (expansion opportunity) -- **Pre-2007 databases in `databaseStart` without explicit `db::VAR` mappings in `variableStart`** → **P1** (wrong source variable at runtime via `[VAR]` default). Pre-2007 variable names require a cycle letter in position 4 (A=2001, C=2003, E=2005) — the `[VAR]` default will look up the 2007-2014 name, which does not exist in pre-2007 datasets. Always verify every pre-2007 database has an explicit mapping. #### L1: Variable concordance @@ -315,35 +314,23 @@ Run these checks in parallel for the in-scope variables. Read `.claude/skills/cc #### Check 1: Era boundary defaults -The most dangerous class of bug. The `[VAR]` default in `variableStart` resolves to the base variable name at runtime for any database not explicitly mapped. This is only correct for databases in the **same naming era** as the base name. Whenever `databaseStart` spans an era boundary, all databases in the **other** era must have explicit `db::VAR` mappings. +The most dangerous class of bug. For each variable: -The general rule: **for every era boundary crossed by `databaseStart`, verify that all databases on the far side have explicit mappings.** +1. Parse the `databaseStart` field — does it span both 2007-2014 and 2015+ cycles? +2. 
Parse the `variableStart` field — do 2015+ databases have explicit `db::VAR` mappings? +3. If a `[VAR]` default exists and 2015+ databases lack explicit mappings, the default will apply the wrong variable name at runtime -For each variable: -1. Parse the `databaseStart` field and identify which era boundaries it crosses -2. Parse the `variableStart` field — do databases on the far side of each boundary have explicit `db::VAR` mappings? -3. If a `[VAR]` default exists and cross-era databases lack explicit mappings, the default will silently apply the wrong variable name at runtime - -**CCHS naming era boundaries:** - -| Boundary | Direction | Pattern | Example | -|----------|-----------|---------|---------| -| Pre-2007 → 2007 | Pre-2007 needs cycle letter | `SMK_204` → `SMKA_204` (2001), `SMKC_204` (2003), `SMKE_204` (2005) | Any variable with 2001/2003/2005 in databaseStart | -| 2007–2014 → 2015 | 3-digit rename | `SMK_06A` → `SMK_060`, `SMK_06C` → `SMK_070` | Smoking, FVC, ADL | -| 2021 → 2022 | CSS/SPU module restructure | Smoking cessation/history split into new modules | SPU_25A/B replace SMK_09A/C for cessation timing | -| 2021 → 2023 | ADL digit reduction | `ADL_005` → `ADL_05` | ADL variables only | - -The pre-2007 and 2015+ boundaries are the most common sources of bugs — they affect almost all smoking, FVC, and ADL variables. Always use `get_variable_history()` from the cchs-metadata MCP to confirm the exact boundary for the variable under review. 
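
The era-boundary rule can be approximated with a short sketch. Everything here is an assumption made for illustration: `naming_era()` and `unmapped_cross_era()` are hypothetical helpers, the era is inferred from the first four-digit year in the database name, and only the pre-2007 and 2015 boundaries are modelled (the 2022 and 2023 boundaries in the table are not).

```r
# Hypothetical helper: infer the naming era from the first year in a database name
naming_era <- function(db) {
  year <- as.integer(sub("^cchs([0-9]{4}).*$", "\\1", db))
  ifelse(year < 2007, "pre-2007", ifelse(year < 2015, "2007-2014", "2015+"))
}

# Hypothetical helper: list databases outside the default's era that have
# no explicit db::VAR mapping and would therefore fall back to [VAR]
unmapped_cross_era <- function(variable_start, database_start, default_era) {
  dbs <- trimws(strsplit(database_start, ",")[[1]])

  # Database names that appear as an explicit "db::" prefix
  explicit <- regmatches(
    variable_start,
    gregexpr("cchs[0-9_]+_[a-z](?=::)", variable_start, perl = TRUE)
  )[[1]]

  far_side <- dbs[naming_era(dbs) != default_era]
  setdiff(far_side, explicit)
}

at_risk <- unmapped_cross_era(
  variable_start = "cchs2003_m::SMKC_09C, cchs2005_m::SMKE_09C, [SMK_09C]",
  database_start = paste(
    "cchs2003_m, cchs2005_m, cchs2007_2008_m,",
    "cchs2015_2016_m, cchs2017_2018_m"
  ),
  default_era = "2007-2014"
)
print(at_risk)
```

Any database returned here would resolve to the `[VAR]` default at runtime, so it is a candidate for an explicit mapping; the exact boundary for a given variable still needs confirming against its variable history.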
- -**Key 2015+ renames (most common):** +**Key 2015 renames to check:** - Smoking categorical: SMK_06A → SMK_060, SMK_09A → SMK_080, SMK_10A → SMK_100 -- Smoking continuous (Master): SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 -- Smoking continuous (PUMF): SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 +- Smoking continuous: SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 - Smoking derived: SMKDSTY → SMKDVSTY, SMKDSTP → SMKDVSTP -- Smoking intensity (daily smoker): SMK_204 / SMK_208 → SMK_045 (PUMF) / SMK_040 (Master daily), SMK_075 (both, former daily) +- PUMF grouped: SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 - FVC: FVCDFRU → FVCDVFRU, FVCDSAL → FVCDVGRN, FVCDCAR → FVCDVORA, FVCDPOT → FVCDVPOT, FVCDVEG → FVCDVVEG, FVCDJUI → FVCDVJUI - ADL: ADL_01-06 → ADL_005-030 (3-digit, 2015-2021), then → ADL_05-30 (2-digit, 2023+) +**Key 2023 renames to check:** +- ADL: ADL_005 → ADL_05, ADL_010 → ADL_10, ADL_015 → ADL_15, ADL_020 → ADL_20, ADL_025 → ADL_25, ADL_030 → ADL_30. This is a new era boundary — `[ADL_005]` defaults will not work for 2023 databases. + #### Check 2: databaseStart consistency For each variable: @@ -359,6 +346,35 @@ For each mismatch found, classify it: All mismatches must be explicitly listed in the review summary, even pre-existing ones. Do not silently omit consistency results. +#### Check 2b: Multi-block recStart collisions + +**Terminology:** A **recode block** is a set of rows in variable_details.csv sharing the same `variableStart` value. A recode block defines how one source variable maps to the harmonized output. Variables that changed source variable names or response category definitions across CCHS cycles require multiple blocks — one per distinct source structure. A single block can span multiple eras when the source variable name and category boundaries were stable across them. 
+ +Variables with multiple recode blocks must not have the same `recStart` value appearing in more than one block for the same database. If a `(database, recStart)` pair matches rows from two blocks, `rec_with_table()` will find duplicate rows and produce incorrect output. + +Note: `databaseStart` overlap alone (a database appearing in two blocks' lists) is not sufficient to flag an error — cchsflow legitimately uses parallel PUMF and Master blocks that share databases but have non-overlapping `recStart` ranges. The collision must be at the `(database, recStart)` level. + +**Automated check:** `exec/check-worksheets.R` runs `check_recode_blocks()` automatically. For manual inspection of a specific variable: + +```r +vd_var <- variable_details[variable_details$variable == "VAR", ] +blocks <- split(vd_var, vd_var$variableStart) +db_sets <- lapply(blocks, function(b) { + trimws(unlist(strsplit(b$databaseStart[1], ","))) +}) +# Check all pairwise intersections (overlap is a necessary but not sufficient condition) +pairs <- combn(length(db_sets), 2) +for (i in seq_len(ncol(pairs))) { + overlap <- intersect(db_sets[[pairs[1,i]]], db_sets[[pairs[2,i]]]) + if (length(overlap) > 0) + cat("OVERLAP (check recStart too):", paste(overlap, collapse=", "), "\n") +} +``` + +Flag any confirmed `(database, recStart)` collision as **P0**. + +This check is especially important for continuous variables with era-specific midpoint recodes (e.g., SMK_09A_cont, SMK_06A_cont) where different cycles have different category boundaries and require separate recode blocks. + #### Check 3: PUMF vs Master naming For `_m` (master) databases: @@ -384,11 +400,14 @@ The letter position varies by variable domain but follows a consistent pattern w #### Check 5: Known error patterns +**Automated check:** `exec/check-worksheets.R` runs `check_invalid_databases()` on both worksheets. Review its output before manual scanning — it catches the first four patterns below automatically. 
+ Scan for: - `cchs20013_` — extra zero typo (should be `cchs2013_`) - `chs20` without leading `c` — missing `c` typo (should be `cchs20`). This pattern has been found in ADL and FVC variables (e.g., `chs2011_2012_m` instead of `cchs2011_2012_m`). Check all database names match the `cchs` prefix. - `_i` suffix databases — deprecated, should be `_m` - `_s` suffix databases — deprecated, **always convert to `_m`** when found in reviewed variables. Check that a corresponding `_m` entry doesn't already exist (if it does, delete the `_s` row; if not, rename `_s` → `_m`). This applies even if the `_s` is pre-existing on the target branch — if the PR touches these rows, fix the suffix. **Naming convention**: `_s` share files are single-year extracts, so map to the single-year master form: `cchs2009_s` → `cchs2009_m` (not `cchs2009_2010_m`), `cchs2010_s` → `cchs2010_m`, `cchs2012_s` → `cchs2012_m`. Check `variables.csv` to confirm which `_m` form is expected. +- `cchs2021_p`, `cchs2022_p`, `cchs2023_p` — **invalid PUMF databases**. The 2021 CCHS was not released as a standalone PUMF — it was combined with 2022 data into a 2021-2022 PUMF (not yet in cchsflow). The 2022+ smoking variables were restructured into CSS/SPU modules; no standalone PUMF equivalent exists for variables like SMK_09A in those cycles. Remove these from `databaseStart` for PUMF-only or mixed blocks when encountered in reviewed variables. - `[[VAR]]` — double brackets (invalid notation) - `[VAR1, VAR2]` without `DerivedVar::` prefix — ambiguous multi-variable input @@ -491,6 +510,18 @@ If the PR lacks tests for new derived variables, flag this. **This is the highest-priority check.** Run `rec_with_table()` against actual PUMF data. This is not just a pass/fail test — the output is an analytical tool. 
By examining prevalence and distributions across cycles and categories, reviewers can identify harmonization problems that worksheet checks alone cannot catch, such as a sudden step change in prevalence at an era boundary (e.g., 2014 → 2015) that signals a naming mismatch or category recode error. +#### Multi-era recode validation + +For variables with multiple recode blocks (identified in Check 2b), standard L6 prevalence checks are insufficient — `rec_with_table()` may silently apply the wrong block or blend blocks without error. For these variables, perform era-specific output validation: + +1. **Identify one representative PUMF cycle per block** — e.g., for SMK_09A_cont: `cchs2001_p` (Block 1 era), `cchs2007_2008_p` (Block 3 era) +2. **Run `rec_with_table()` for each representative cycle** +3. **Verify the recEnd values match the expected midpoints for that era** — not just that they are non-missing + +For continuous variables, check a known respondent's output value against the expected midpoint for their source category. If the era boundary is at 2003 (different category boundaries in 2001 vs 2003+), a respondent with source code 3 should produce recEnd=4 in 2001 but recEnd=2.5 in 2003+. If both cycles produce the same value, the wrong block is being applied to one of them. + +Flag any era boundary where observed output values do not match expected midpoints as **P0**. + #### Scope and limitations **PUMF data only.** L6 can currently test only `_p` databases. The `data/` directory contains PUMF RData files (`cchs2001_p.RData` through `cchs2017_2018_p.RData`). Master (`_m`) data is in a secure environment where LLMs cannot run. @@ -680,28 +711,6 @@ If the in-scope variables include derived variables (functions in `R/`): 4. Compare the derived variable's valid % against its input variables — the DV should not have materially higher valid % than its least-available input 5. 
For categorical derived variables and key continuous inputs, examine the **exposure distribution** across cycles — not just valid counts. The central harmonization question is whether typical exposures (e.g., proportion with 0 fruit/veg, or >5 servings/day) remain stable across cycles. A sudden shift in the distribution at an era boundary signals a recoding or mapping error even when valid % is unchanged. Include these distributions in both the integration test output and the QMD visualisation -#### DerivedVar feeder check (PUMF/Master split) - -For any derived variable that uses age, sex, or any other variable that differs between PUMF and Master, run a cross-database feeder check: - -```r -devtools::load_all() -# Compare feeders with _p vs _m filter -resolve_dependencies("pack_years_der", databases = "cchs2015_2016_p") -resolve_dependencies("pack_years_der", databases = "cchs2015_2016_m") -``` - -If both calls return the same combined feeder list, DerivedVar rows are mixing `_p` and `_m` databases and need splitting. The correct state: the `_p` call should return PUMF-specific feeders (e.g., `DHHGAGE_cont`), the `_m` call should return Master-specific feeders (e.g., `DHH_AGE`). - -**Key age feeder split:** - -| Feeder | Database type | Note | -|--------|---------------|-------| -| `DHHGAGE_cont` | PUMF (`_p`) only | Midpoint-imputed from grouped PUMF age bands | -| `DHH_AGE` | Master (`_m`) all cycles including 2001 | True continuous age | - -A DerivedVar row that lists both `_p` and `_m` databases when feeders differ is a **P1** error — `rec_with_table()` will silently use the wrong age variable for at least one database type. See `pumf-master-harmonization.md` for the correct row-splitting pattern. 
- #### What to report from L6 For each cycle tested: @@ -909,6 +918,20 @@ If the review identified worksheet errors (typos, missing mappings, incorrect da } ``` + **Multi-block databaseStart fix rule:** When `check_recode_blocks()` flags a recStart collision, the fix is to narrow each block's `databaseStart` to only the databases where that block's source variable actually exists. The key mental model: + + > Each block's `databaseStart` should contain only the databases where that block's `variableStart` is the correct source variable. + + **Critical anti-pattern — do not replace the entire databaseStart.** A block's `databaseStart` may include databases covered by a `[SHORTHAND]` entry in `variableStart` (e.g., `[SMK_09A]` covers cchs2007_2008_p through cchs2013_2014_p implicitly). If you replace the full `databaseStart` with only the databases visible in the explicit `db::VAR` prefixes, you will drop the shorthand-covered databases and create new gaps. Instead: + + 1. Identify which database(s) are causing the collision (appear in two blocks) + 2. Determine which block those databases actually belong to (based on which era's source variable they use) + 3. Remove those databases from the block they do *not* belong to — leave everything else intact + + **Example:** If `cchs2001_p` appears in both Block 1 (2001 source variable) and Block 2 (2003+ source variable), remove `cchs2001_p` from Block 2's `databaseStart` only. Do not rewrite Block 2's full `databaseStart`. + + Always open Beyond Compare to verify the proposed fix before applying it to `inst/extdata/`. + 4. **Save fixes to a temporary file** — per project conventions (CLAUDE.local.md), write proposed changes to `/tmp/` for user review before editing the main worksheet files directly. The user or PR author integrates the changes. 5. **Verify idempotency** — always read from `inst/extdata/` (the clean source), never from previously modified `/tmp/` files. 
After running a modification script, re-run it to confirm the output is identical. If the script detects its own changes on the second run (e.g., skips "already has 2021"), the idempotency check passed. From 2b14aa2ee2f0fbd413ea158be566501fbfbe9077 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Sun, 29 Mar 2026 14:07:28 -0400 Subject: [PATCH 06/15] refactor(skill): Split cchsflow-review SKILL.md into orchestrator + docs Split the 68KB monolithic SKILL.md (1,066 lines) into a 372-line orchestrator that delegates to focused docs: - docs/worksheet-reference.md (moved from docs/) - docs/l0-l2-documentation-review.md (L0-L2 checks) - docs/l3-l5-worksheet-checks.md (L3-L5 checks) - docs/l6-implementation-validation.md (rec_with_table testing) - docs/csv-validation-and-fixes.md (validation tools + fix workflow) - docs/review/ (Gem system prompt, notebook manifest/coverage) Added prerequisite section requiring worksheet-reference.md be read before any review. Added .gitignore exception for skill docs/ folders. 
--- .claude/skills/cchsflow-review/SKILL.md | 751 ++---------------- .../docs/csv-validation-and-fixes.md | 157 ++++ .../docs/l0-l2-documentation-review.md | 131 +++ .../docs/l3-l5-worksheet-checks.md | 238 ++++++ .../docs/l6-implementation-validation.md | 218 +++++ .../docs/review/gemini-gem-system-prompt.md | 104 +++ .../docs/review/notebook-coverage.md | 61 ++ .../docs/review/notebook-manifest.csv | 240 ++++++ .../docs/variable-naming-conventions.md | 97 +++ .../docs/worksheet-reference.md | 684 ++++++++++++++++ .gitignore | 3 +- 11 files changed, 1986 insertions(+), 698 deletions(-) create mode 100644 .claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md create mode 100644 .claude/skills/cchsflow-review/docs/l0-l2-documentation-review.md create mode 100644 .claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md create mode 100644 .claude/skills/cchsflow-review/docs/l6-implementation-validation.md create mode 100644 .claude/skills/cchsflow-review/docs/review/gemini-gem-system-prompt.md create mode 100644 .claude/skills/cchsflow-review/docs/review/notebook-coverage.md create mode 100644 .claude/skills/cchsflow-review/docs/review/notebook-manifest.csv create mode 100644 .claude/skills/cchsflow-review/docs/variable-naming-conventions.md create mode 100644 .claude/skills/cchsflow-review/docs/worksheet-reference.md diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index 9c9350a7..fe7a9407 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -17,6 +17,14 @@ CEP-driven review for cchsflow worksheet changes. Reviews follow the L0-L6 harmo ## Workflow +### Prerequisite: Read the worksheet reference + +**Before performing any review**, read `docs/worksheet-reference.md` (located in this skill's `docs/` folder). 
This is the canonical reference for how cchsflow worksheets work — variable types, database naming, recStart/recEnd semantics, DerivedVar/Func:: mechanism, PUMF-Master bridging patterns, era splits, midpoint imputation, and v3 naming conventions. Without understanding these conventions, review findings will be unreliable. + +Also available for cross-checking worksheet accuracy: +- **Gem verification workflow**: `docs/review/` (in this skill's folder) contains the Google NotebookLM Gem system prompt, notebook manifest, and coverage summary. The Gem cross-checks worksheet entries against ~239 StatCan PDFs. See the memory file `reference_gem_verification_workflow.md` for the full three-way triangulation process (Gem + MCP + Claude Code). +- **MCP cchs-metadata server**: Primary tool for L0-L1 verification (described in Step 4). + ### Step 1: Scope and triage Before any checks, establish what is being reviewed and assess the shape of the diff. @@ -156,7 +164,6 @@ Check if a CEP already exists for this domain/variable group. CEPs live in `ceps - Focus the review on stages that are incomplete or need re-validation **If no CEP exists**, default to creating a **minimal review CEP** for PR reviews: - ``` ceps/cep-NNN-/ PR--review-summary.md # Findings and recommendations @@ -178,552 +185,36 @@ Use the next available number. Include the domain name (e.g., `cep-007-diet`) to ### Step 4: L0-L2 documentation review -For each in-scope variable, verify the documentation foundations. Read `.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md` for detailed L0-L2 templates. - -#### L0: Documentation assessment - -Verify source variables against CCHS documentation using the **cchsflow-docs** repository (`Big-Life-Lab/cchsflow-docs` on GitHub, cloned alongside cchsflow). This step confirms that variables claimed in `variableStart` and `databaseStart` actually exist in the CCHS data for those cycles. 
- -##### Primary source: cchs-metadata MCP server - -**Always use the cchs-metadata MCP as the primary tool for L0-L1 verification.** It provides the most complete and queryable metadata — 16,000+ variables across 251 datasets, enriched from PUMF RData, DDI XML, and ICES sources with full provenance tracking. - -**Key tools:** -- `mcp__cchs-metadata__search_variables(query)` — find variables by name or label substring -- `mcp__cchs-metadata__get_variable_detail(variable_name)` — full metadata including labels, question text, value codes, dataset history -- `mcp__cchs-metadata__get_variable_history(variable_name)` — which cycles/datasets contain the variable (essential for era boundary verification) -- `mcp__cchs-metadata__get_value_codes(variable_name)` — response categories with frequencies -- `mcp__cchs-metadata__compare_master_pumf(variable_name, cycle)` — compare PUMF vs Master metadata for a specific cycle (essential for PUMF/Master split decisions) -- `mcp__cchs-metadata__suggest_cchsflow_row(variable_name)` — draft a cchsflow harmonisation row -- `mcp__cchs-metadata__get_dataset_variables(dataset_id)` — list all variables in a specific dataset -- `mcp__cchs-metadata__get_source_conflicts(variable_name, dataset_id)` — find cross-source label disagreements (useful for catching metadata inconsistencies) -- `mcp__cchs-metadata__get_database_summary()` — database overview and statistics - -**Using MCP results:** -- The `cchsflow_name` field maps StatCan source variables to their cchsflow harmonized names — use this to verify that `variableStart` entries point to the correct source variable for each cycle -- Use `get_variable_history` to confirm a variable exists across claimed cycles and to identify era renames (e.g., SMK_09C → SMK_090 at the 2015 boundary) -- Use `compare_master_pumf` to verify whether PUMF and Master share the same source variable or need split rows - -**Caution:** The MCP `label_short`/`label_long` fields may be contaminated by cchsflow labels 
(see MCP error report from alcohol review). Always cross-check against `label_statcan` which comes from DDI primary sources. - -##### If the MCP is not available - -Check whether the MCP is loaded: -```bash -claude mcp list -``` - -If `cchs-metadata` is missing or shows "Failed to connect", the server needs to be configured. The MCP server (v0.3.0+) lives in the **cchsflow-docs** repository and is also available as a [GitHub release](https://github.com/Big-Life-Lab/cchsflow-docs/releases). - -**Quick setup** (if cchsflow-docs is cloned at `../cchsflow-docs/`): -```bash -cd ../cchsflow-docs/mcp-server && bash ../scripts/setup.sh -claude mcp add cchs-metadata -- python3 /Users/dmanuel/github/cchsflow-docs/mcp-server/server.py -``` - -**Manual setup:** -1. Ensure `cchsflow-docs` is cloned alongside cchsflow (typically `../cchsflow-docs/`) -2. Ensure `mcp-server/server.py` exists in cchsflow-docs -3. Ensure the database exists: `../cchsflow-docs/database/cchs_metadata.duckdb` (download from the [v0.3.0 release](https://github.com/Big-Life-Lab/cchsflow-docs/releases) or rebuild: `Rscript --vanilla ../cchsflow-docs/database/build_db.R`) -4. Add the MCP to Claude Code: - ```bash - claude mcp add cchs-metadata -- python3 /Users/dmanuel/github/cchsflow-docs/mcp-server/server.py - ``` - Or add to `~/.claude.json` (see `.mcp.json.example` in cchsflow-docs for a template): - ```json - "cchs-metadata": { - "type": "stdio", - "command": "python3", - "args": ["/Users/dmanuel/github/cchsflow-docs/mcp-server/server.py"], - "env": {"CCHS_DB_PATH": "/Users/dmanuel/github/cchsflow-docs/database/cchs_metadata.duckdb"} - } - ``` -5. 
Restart Claude Code for the MCP tools to appear in the tool list - -##### CLI fallback - -If the MCP server cannot be started but the database exists, use the standalone CLI (no FastMCP dependency — only `duckdb` required): -```bash -python3 ../cchsflow-docs/mcp-server/cli.py search smoking -python3 ../cchsflow-docs/mcp-server/cli.py detail SMKDSTY -python3 ../cchsflow-docs/mcp-server/cli.py history SMK_204 -python3 ../cchsflow-docs/mcp-server/cli.py conflicts --variable SMKDSTY -python3 ../cchsflow-docs/mcp-server/cli.py codes SMK_204 -``` - -All commands support `--json` for machine-readable output and `--db PATH` for custom database path. - -See the cchsflow-docs `CLAUDE.md` and `.claude/skills/cchs-database/SKILL.md` for database build workflow and schema details. - -##### Fallback: file-based lookups - -If the MCP is unavailable and cannot be restored, use these file-based sources in the cchsflow-docs repo (typically `../cchsflow-docs/`): - -1. **Extracted YAML data dictionaries** — structured variable definitions by cycle: - ``` - ../cchsflow-docs/cchs-extracted/data-dictionary/{year}/ - ``` - Coverage: 2000-2001 through 2023. - -2. **DDI XML files** — authoritative StatsCan PUMF documentation: - ``` - ../cchsflow-docs/cchs-pumf-docs/CCHS_DDI/ - ``` - -3. **CCHS variable dictionary CSV** — flat file for quick lookups: - ``` - ../cchsflow-docs/data/cchs_variable_dictionary.csv - ``` - -These are the raw sources that feed the MCP database. The MCP is strongly preferred because it cross-references all sources, deduplicates, and provides structured query tools rather than requiring manual grep/search across hundreds of files. - -##### What to verify - -For each in-scope variable: -1. **Existence**: Does the source variable name appear in the documentation for each claimed cycle? -2. **Category codes**: Do `recStart` values match the documented category definitions? -3. **Era renames**: For 2015+ cycles, confirm the renamed variable exists -4. 
**Cycle coverage up to latest available**: Check whether the variable exists in cycles beyond the PR's `databaseStart` (documentation covers up to 2023) — these may be candidates for expansion - -##### What to flag - -- Variable listed in `variableStart` but not found in documentation for that cycle → **P0** (wrong variable name) -- Variable not checked (no documentation available for that cycle) → note as untested -- Variable exists in additional cycles not included in `databaseStart` → informational (expansion opportunity) - -#### L1: Variable concordance - -Use the cchsflow-docs extracted data dictionaries to verify source variable names across eras: - -- Pre-2007: cycle letter in 4th position (A=2001, C=2003, E=2005) -- 2007-2014: standard naming -- Post-2014: check for 3-digit renames — search the 2015+ YAML files to confirm actual names -- 2022+: check for modular renames (e.g., CSS/SPU prefixes for smoking) - -For each era boundary, compare the variable name in `variableStart` against the corresponding cycle's YAML data dictionary in cchsflow-docs. PUMF and Master data dictionaries may differ — check both `_p` and `_m` YAML files where available. - -#### L2: Semantic mapping - -- Are category codes consistent across cycles? -- Are semantic breaks identified and documented? -- Do recoding rules handle all source categories? +Read and follow `docs/l0-l2-documentation-review.md` for the full procedure. This covers: +- L0: Documentation assessment (MCP cchs-metadata as primary tool, CLI fallback, file-based fallback) +- L1: Variable concordance (era rename chains, pre-2007 cycle letters) +- L2: Semantic mapping (category consistency, recoding rule coverage) ### Step 5: L3-L5 worksheet and testing checks -Run these checks in parallel for the in-scope variables. Read `.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md` for detailed reference. - -#### Check 1: Era boundary defaults - -The most dangerous class of bug. 
For each variable: - -1. Parse the `databaseStart` field — does it span both 2007-2014 and 2015+ cycles? -2. Parse the `variableStart` field — do 2015+ databases have explicit `db::VAR` mappings? -3. If a `[VAR]` default exists and 2015+ databases lack explicit mappings, the default will apply the wrong variable name at runtime - -**Key 2015 renames to check:** -- Smoking categorical: SMK_06A → SMK_060, SMK_09A → SMK_080, SMK_10A → SMK_100 -- Smoking continuous: SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 -- Smoking derived: SMKDSTY → SMKDVSTY, SMKDSTP → SMKDVSTP -- PUMF grouped: SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 -- FVC: FVCDFRU → FVCDVFRU, FVCDSAL → FVCDVGRN, FVCDCAR → FVCDVORA, FVCDPOT → FVCDVPOT, FVCDVEG → FVCDVVEG, FVCDJUI → FVCDVJUI -- ADL: ADL_01-06 → ADL_005-030 (3-digit, 2015-2021), then → ADL_05-30 (2-digit, 2023+) - -**Key 2023 renames to check:** -- ADL: ADL_005 → ADL_05, ADL_010 → ADL_10, ADL_015 → ADL_15, ADL_020 → ADL_20, ADL_025 → ADL_25, ADL_030 → ADL_30. This is a new era boundary — `[ADL_005]` defaults will not work for 2023 databases. - -#### Check 2: databaseStart consistency - -For each variable: -1. Extract `databaseStart` from variables.csv -2. Extract all `databaseStart` entries from variable_details.csv for that variable -3. The variables.csv list must equal the union of all variable_details.csv lists -4. Flag any databases present in one file but not the other - -For each mismatch found, classify it: -- **PR-introduced**: The mismatch is new (not on target branch) — report as P1 -- **Pre-existing**: The mismatch exists on the target branch — document in pre-existing issues -- **`_p` in vd only**: PUMF databases in variable_details but not variables.csv is a known pattern for variables that span both pre-2015 and 2015+ eras (the pre-2015 block includes `_p` databases that the 2015+ block in variables.csv doesn't list). Note but do not flag as a bug. 
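The comparison in Check 2 can be sketched in Python as a per-variable set difference. This is an illustrative sketch only: the function names are hypothetical, rows are assumed to be dictionaries with `variable` and `databaseStart` keys (as `csv.DictReader` would yield), and it does not classify a mismatch as PR-introduced or pre-existing.

```python
def split_databases(value):
    """Split a comma-separated databaseStart string into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def collect_databases(rows):
    """Union of databaseStart entries per variable."""
    collected = {}
    for row in rows:
        collected.setdefault(row["variable"], set()).update(
            split_databases(row["databaseStart"]))
    return collected


def database_start_mismatches(variables_rows, details_rows):
    """Report databases present in one worksheet but not the other."""
    in_variables = collect_databases(variables_rows)
    in_details = collect_databases(details_rows)
    report = {}
    for variable in sorted(set(in_variables) | set(in_details)):
        vars_set = in_variables.get(variable, set())
        details_set = in_details.get(variable, set())
        if vars_set != details_set:
            report[variable] = {
                "only_in_variables": sorted(vars_set - details_set),
                "only_in_details": sorted(details_set - vars_set),
            }
    return report


# Hypothetical rows for illustration; real input comes from the two worksheets.
variables_rows = [
    {"variable": "VAR_X", "databaseStart": "cchs2013_2014_p, cchs2015_2016_p"},
]
details_rows = [
    {"variable": "VAR_X", "databaseStart": "cchs2013_2014_p"},
    {"variable": "VAR_X", "databaseStart": "cchs2015_2016_p, cchs2017_2018_p"},
]
mismatches = database_start_mismatches(variables_rows, details_rows)
print(mismatches)
```

Each reported entry still needs the manual classification described above, for example by running the same comparison against the target branch.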
- -All mismatches must be explicitly listed in the review summary, even pre-existing ones. Do not silently omit consistency results. - -#### Check 2b: Multi-block recStart collisions - -**Terminology:** A **recode block** is a set of rows in variable_details.csv sharing the same `variableStart` value. A recode block defines how one source variable maps to the harmonized output. Variables that changed source variable names or response category definitions across CCHS cycles require multiple blocks — one per distinct source structure. A single block can span multiple eras when the source variable name and category boundaries were stable across them. - -Variables with multiple recode blocks must not have the same `recStart` value appearing in more than one block for the same database. If a `(database, recStart)` pair matches rows from two blocks, `rec_with_table()` will find duplicate rows and produce incorrect output. - -Note: `databaseStart` overlap alone (a database appearing in two blocks' lists) is not sufficient to flag an error — cchsflow legitimately uses parallel PUMF and Master blocks that share databases but have non-overlapping `recStart` ranges. The collision must be at the `(database, recStart)` level. - -**Automated check:** `exec/check-worksheets.R` runs `check_recode_blocks()` automatically. For manual inspection of a specific variable: - -```r -vd_var <- variable_details[variable_details$variable == "VAR", ] -blocks <- split(vd_var, vd_var$variableStart) -db_sets <- lapply(blocks, function(b) { - trimws(unlist(strsplit(b$databaseStart[1], ","))) -}) -# Check all pairwise intersections (overlap is a necessary but not sufficient condition) -pairs <- combn(length(db_sets), 2) -for (i in seq_len(ncol(pairs))) { - overlap <- intersect(db_sets[[pairs[1,i]]], db_sets[[pairs[2,i]]]) - if (length(overlap) > 0) - cat("OVERLAP (check recStart too):", paste(overlap, collapse=", "), "\n") -} -``` - -Flag any confirmed `(database, recStart)` collision as **P0**. 
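The R snippet above only detects `databaseStart` overlap, which is a necessary condition. A minimal Python sketch of the pair-level test follows. It assumes rows are dictionaries with `variableStart`, `databaseStart`, and `recStart` keys, compares `recStart` as an exact string (so overlapping intervals such as `[1,5]` and `[3,8]` are not detected), and uses a hypothetical function name. It is not the logic of `check_recode_blocks()`.

```python
from collections import defaultdict


def find_recstart_collisions(rows):
    """Map each (database, recStart) pair to the blocks that claim it.

    Only pairs claimed by two or more recode blocks are returned.
    A block is identified by its variableStart value.
    """
    seen = defaultdict(set)
    for row in rows:
        rec_start = row["recStart"].strip()
        for database in row["databaseStart"].split(","):
            database = database.strip()
            if database:
                seen[(database, rec_start)].add(row["variableStart"])
    return {pair: sorted(blocks) for pair, blocks in seen.items() if len(blocks) > 1}


# Hypothetical rows: VAR_OLD and VAR_NEW are placeholder source names.
rows = [
    {"variableStart": "cchs2001_p::VAR_OLD", "databaseStart": "cchs2001_p", "recStart": "3"},
    {"variableStart": "[VAR_NEW]", "databaseStart": "cchs2001_p, cchs2003_p", "recStart": "3"},
    {"variableStart": "[VAR_NEW]", "databaseStart": "cchs2001_p, cchs2003_p", "recStart": "4"},
]
collisions = find_recstart_collisions(rows)
print(collisions)
```

In this example only `("cchs2001_p", "3")` is reported. `cchs2001_p` also appears in both blocks for `recStart` 4, but only one block defines that value, so it is overlap without a collision.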
- -This check is especially important for continuous variables with era-specific midpoint recodes (e.g., SMK_09A_cont, SMK_06A_cont) where different cycles have different category boundaries and require separate recode blocks. - -#### Check 3: PUMF vs Master naming - -For `_m` (master) databases: -- Pre-2007: cycle letter in source variable name (A=2001, C=2003, E=2005) -- 2007-2014: standard naming (no prefix letter) -- 2015+: check for renamed variables - -For `_p` (PUMF) databases: -- May use grouped/derived variable names (e.g., SMKG prefix, FVCD prefix) - -Verify that `_m` databases don't reference PUMF-only grouped variables, and vice versa. - -For variables where PUMF and Master use fundamentally different source types (categorical vs continuous), see `cchsflow-worksheets/docs/pumf-master-harmonization.md` for the required row-splitting pattern and common errors. - -#### Check 4: Pre-2007 cycle letters - -For variables with pre-2007 master cycles, verify the cycle letter: -- 2001 (`_m` or `_p`): letter A in the variable name (e.g., SMKA_203, FVCADFRU) -- 2003: letter C (e.g., SMKC_203, FVCCDFRU) -- 2005: letter E (e.g., SMKE_203, FVCEDFRU) - -The letter position varies by variable domain but follows a consistent pattern within each domain. - -#### Check 5: Known error patterns - -**Automated check:** `exec/check-worksheets.R` runs `check_invalid_databases()` on both worksheets. Review its output before manual scanning — it catches the first four patterns below automatically. - -Scan for: -- `cchs20013_` — extra zero typo (should be `cchs2013_`) -- `chs20` without leading `c` — missing `c` typo (should be `cchs20`). This pattern has been found in ADL and FVC variables (e.g., `chs2011_2012_m` instead of `cchs2011_2012_m`). Check all database names match the `cchs` prefix. -- `_i` suffix databases — deprecated, should be `_m` -- `_s` suffix databases — deprecated, **always convert to `_m`** when found in reviewed variables. 
Check that a corresponding `_m` entry doesn't already exist (if it does, delete the `_s` row; if not, rename `_s` → `_m`). This applies even if the `_s` is pre-existing on the target branch — if the PR touches these rows, fix the suffix. **Naming convention**: `_s` share files are single-year extracts, so map to the single-year master form: `cchs2009_s` → `cchs2009_m` (not `cchs2009_2010_m`), `cchs2010_s` → `cchs2010_m`, `cchs2012_s` → `cchs2012_m`. Check `variables.csv` to confirm which `_m` form is expected. -- `cchs2021_p`, `cchs2022_p`, `cchs2023_p` — **invalid PUMF databases**. The 2021 CCHS was not released as a standalone PUMF — it was combined with 2022 data into a 2021-2022 PUMF (not yet in cchsflow). The 2022+ smoking variables were restructured into CSS/SPU modules; no standalone PUMF equivalent exists for variables like SMK_09A in those cycles. Remove these from `databaseStart` for PUMF-only or mixed blocks when encountered in reviewed variables. -- `[[VAR]]` — double brackets (invalid notation) -- `[VAR1, VAR2]` without `DerivedVar::` prefix — ambiguous multi-variable input - -**Pre-existing typo propagation:** Typo patterns often exist in the target branch for other variables and get copied into new variables through copy-paste. For each typo found, check whether the same pattern exists on the target branch for the same variables — if not, it was introduced by this PR even if the pattern exists elsewhere. - -#### Check 5b: dummyVariable naming conventions - -Verify that `dummyVariable` values follow the naming convention defined in `metadata_registry.yaml`. 
- -**Categorical variables** — regex: `^[a-zA-Z0-9_]+_cat[0-9]+(_[0-9]+|_NA[a-z])$` - -| Row type | Pattern | Example | -|----------|---------|---------| -| Valid category | `{variable}_cat{N}_{recEnd}` | `SMK_204_cat4_1`, `FVC_1A_cat5_3` | -| Missing (not applicable) | `{variable}_cat{N}_NAa` | `SMK_204_cat4_NAa` | -| Missing (don't know/refusal) | `{variable}_cat{N}_NAb` | `SMK_204_cat4_NAb` | - -**Continuous variables and Func rows** use `N/A` (no naming convention). - -**Key rules:** -1. **No colons in dummy names** — use `_NAa` and `_NAb`, not `_NA::a` or `_NA::b`. Colons are invalid in identifiers. -2. **Suffix must match recEnd** — the number after the last underscore should equal the `recEnd` value for that row. A mismatch (e.g., `_cat5_2` with `recEnd=1`) indicates a copy-paste error. -3. **N must match numValidCat** — the number after `_cat` should equal the `numValidCat` value for valid categories of that variable. -4. **Func rows use `N/A`** — derived variable rows (where `recEnd` starts with `Func::`) use `dummyVariable=N/A`. - -**What to flag:** -- `_NA::a` or `_NA::b` patterns (should be `_NAa` / `_NAb`) -- Suffix-recEnd mismatches (e.g., `_cat5_2` on a row with `recEnd=1`) -- Func rows with constructed dummy names instead of `N/A` -- Continuous rows with anything other than `N/A` - -#### Check 5c: Swapped recEnd values - -Check for rows where `recEnd` values appear to be swapped between adjacent rows. This is a **P0 data bug** — it produces incorrect values at runtime with no warning. - -**Detection pattern:** -1. For each variable, examine rows where `recStart` is a valid data range (e.g., `[1,120]`) and adjacent rows where `recStart` is a not-applicable code (e.g., `996`) -2. The valid data range should map to `recEnd=copy` (or the appropriate output value), not to `NA::a` -3. 
A not-applicable code should map to `NA::a` or `NA::b`, not to `copy` - -**Example (FVC_6D bug found in PR #148):** -``` -# WRONG — recEnd values swapped -recStart=[1,120] recEnd=NA::a ← valid data being set to missing! -recStart=996 recEnd=copy ← not-applicable code being copied as data! - -# CORRECT -recStart=[1,120] recEnd=copy ← valid data copied through -recStart=996 recEnd=NA::a ← not-applicable code set to missing -``` - -**When to check:** Always check continuous variables with `copy` and `NA::a`/`NA::b` recEnd values. Swapped values are especially likely for variables added via copy-paste from similar variables. - -#### Check 5d: Label and metadata consistency - -Scan for common metadata quality issues in modified variables: - -1. **Double spaces** — check `label`, `labelLong`, `catLabel`, `catLabelLong`, `variableStartShortLabel`, and `variableStartLabel` for consecutive spaces -2. **Spelling errors in labels** — common typos: "consumptoin" (consumption), "freqeuncy" (frequency), "repondent" (respondent) -3. **Trailing punctuation in labelLong** — trailing dashes or incomplete labels (e.g., `"Daily consumption - fruit - (D)"` should be `"Daily consumption - fruit (D)"`) -4. **Missing descriptions** — derived daily frequency variables (FVCD*) and other derived variables should have `description` fields -5. **catLabel propagation** — when a label is fixed in `catLabel`, check that the same fix applies to `catLabelLong`, `variableStartShortLabel`, and `variableStartLabel` where those fields share the same text - -These are P2 issues (metadata quality) but are cheap to fix during review and prevent accumulation of inconsistencies. - -#### DV function naming convention (v3) - -New or refactored DV functions should use tidyverse-style verb-first names. The `_fun` suffix is legacy and being phased out as functions are refactored. 
- -| Verb | Purpose | Example | -|------|---------|---------| -| `calculate_*()` | Mathematical computation | `calculate_pct_time()`, `calculate_bmi()` | -| `categorize_*()` | Classification into groups | `categorize_pct_time()`, `categorize_bmi()` | -| `assess_*()` | Health risk evaluation | `assess_drinking_risk()` | -| `score_*()` | Scoring systems | `score_adl()` | -| `adjust_*()` | Data correction | `adjust_bmi()` | - -Legacy functions (e.g., `bmi_fun()`, `pack_years_fun()`) retain old names until refactored. Worksheets reference functions via `Func::` prefix (e.g., `Func::calculate_pct_time`). - -#### Check 6: L4 — derived variable specification review - -If the in-scope variables include derived variables (functions in `R/`): - -1. **Input consistency**: Read the DV function (e.g., `calculate_pct_time()` in `R/percent-time-canada.R`) and verify that the input variable names it expects match those listed in `variable_details.csv` for the derived variable -2. **Category coverage**: Verify the function handles all category values that the worksheet's `recFrom` maps to — no unhandled cases that would silently produce NA -3. **Output consistency**: Verify the function's return values match the `recTo` values in the worksheet -4. **Output bounds validation**: For continuous DVs, check whether the function validates output range. Values outside the valid domain (e.g., percentage >100 or <0) indicate inconsistent inputs and should return `tagged_na("b")`. The valid range should be documented in the `notes` field of the Func row in variable_details (documentation only for now, ready for future validation framework). If the DV lacks bounds checking, flag as P1. -5. 
**Documentation**: Check roxygen docs match the actual function signature - -#### Check 7: Unit tests (L5) - -If the PR includes or modifies test files in `tests/testthat/`: -- Verify category coverage (all output categories have test cases) -- Check edge cases (missing data, boundary values) -- Verify cross-cycle consistency - -If the PR lacks tests for new derived variables, flag this. +Read and follow `docs/l3-l5-worksheet-checks.md` for the full procedure. This covers: +- Check 1: Era boundary defaults (most dangerous bug class) +- Check 2: databaseStart consistency +- Check 2b: Multi-block recStart collisions +- Check 3: PUMF vs Master naming +- Check 4: Pre-2007 cycle letters +- Check 5: Known error patterns (typos, deprecated suffixes, invalid databases) +- Check 5b-5e: dummyVariable naming, swapped recEnd, label consistency, opaque suffixes +- DV function naming convention (v3) +- Worksheet-first principle +- Check 6: L4 derived variable specification review +- Check 7: Unit tests (L5) ### Step 6: L6 implementation validation -**This is the highest-priority check.** Run `rec_with_table()` against actual PUMF data. This is not just a pass/fail test — the output is an analytical tool. By examining prevalence and distributions across cycles and categories, reviewers can identify harmonization problems that worksheet checks alone cannot catch, such as a sudden step change in prevalence at an era boundary (e.g., 2014 → 2015) that signals a naming mismatch or category recode error. - -#### Multi-era recode validation - -For variables with multiple recode blocks (identified in Check 2b), standard L6 prevalence checks are insufficient — `rec_with_table()` may silently apply the wrong block or blend blocks without error. For these variables, perform era-specific output validation: - -1. **Identify one representative PUMF cycle per block** — e.g., for SMK_09A_cont: `cchs2001_p` (Block 1 era), `cchs2007_2008_p` (Block 3 era) -2. 
**Run `rec_with_table()` for each representative cycle** -3. **Verify the recEnd values match the expected midpoints for that era** — not just that they are non-missing - -For continuous variables, check a known respondent's output value against the expected midpoint for their source category. If the era boundary is at 2003 (different category boundaries in 2001 vs 2003+), a respondent with source code 3 should produce recEnd=4 in 2001 but recEnd=2.5 in 2003+. If both cycles produce the same value, the wrong block is being applied to one of them. - -Flag any era boundary where observed output values do not match expected midpoints as **P0**. - -#### Scope and limitations - -**PUMF data only.** L6 can currently test only `_p` databases. The `data/` directory contains PUMF RData files (`cchs2001_p.RData` through `cchs2017_2018_p.RData`). Master (`_m`) data is in a secure environment where LLMs cannot run. - -For master-only changes (e.g., a PR that only adds `_m` cycles), L6 cannot validate at runtime. In this case: -- Rely on L3-L5 worksheet checks (especially era boundary and naming checks) -- Generate the integration test R script anyway and save it to the CEP — the user or a colleague can run it in the secure environment -- Note the limitation explicitly in the review output - -**Future:** Mock data from the `mockdata` repo will enable L6 testing for all database types. - -#### Data locations - -PUMF RData files are in `data/`: -- `cchs2001_p.RData` through `cchs2017_2018_p.RData` - -Each file loads a data frame named after the cycle (e.g., `cchs2001_p`). - -#### Integration test script - -Generate and run a fully executable R script for the in-scope variables — no placeholders. Extract the actual variable names and cycle list from the worksheets. Save the script to the CEP directory so reviewers can re-run it. - -The script should: -1. Read `variable_details.csv` to extract the `_p` databases from `databaseStart` for each in-scope variable -2. 
Load cchsflow from the PR branch (use `devtools::load_all()` if R functions were modified, otherwise `library(cchsflow)`) -3. For each cycle, run `rec_with_table()` and collect results -4. Print cross-cycle prevalence summary -5. Save results CSV - -Pattern based on CEP-006: - -```r -# devtools::load_all() # Use if PR modifies R/ functions -library(cchsflow) -library(dplyr) - -# Load worksheet from the branch under review -variable_details <- read.csv("inst/extdata/variable_details.csv", - stringsAsFactors = FALSE) - -# Extract PUMF cycles from databaseStart for the in-scope variables -# (agent: replace with actual variable names and cycles from the worksheet) -variables_to_test <- c("FVCDFRU", "FVCDSAL", "FVCDPOT") -cycles <- c("cchs2001_p", "cchs2003_p", "cchs2005_p", - "cchs2007_2008_p", "cchs2009_2010_p", "cchs2011_2012_p", - "cchs2013_2014_p", "cchs2015_2016_p", "cchs2017_2018_p") - -results <- data.frame() - -for (cycle in cycles) { - rdata_file <- file.path("data", paste0(cycle, ".RData")) - if (!file.exists(rdata_file)) { - cat("SKIP", cycle, "- file not found\n") - next - } - - load(rdata_file) - df <- get(cycle) - - result <- tryCatch({ - rec_with_table( - data = df, - variables = variables_to_test, - database_name = cycle, - variable_details = variable_details, - log = FALSE - ) - }, error = function(e) { - cat("ERROR in", cycle, ":", e$message, "\n") - NULL - }) - - if (!is.null(result)) { - n <- nrow(result) - for (v in setdiff(names(result), "ADM_RNO")) { - valid <- sum(!is.na(result[[v]])) - cat(cycle, v, ": valid =", valid, "/", n, - "(", round(100 * valid / n, 1), "%)\n") - - # Category distribution (for categorical variables) - freq <- table(result[[v]], useNA = "ifany") - print(freq) - - results <- rbind(results, data.frame( - cycle = cycle, variable = v, - n = n, valid = valid, - valid_pct = round(100 * valid / n, 1), - stringsAsFactors = FALSE - )) - } - } - - rm(list = cycle) # free memory -} - -# Cross-cycle prevalence summary -cat("\n=== 
CROSS-CYCLE SUMMARY ===\n") -for (v in unique(results$variable)) { - cat("\n", v, ":\n") - sub <- results[results$variable == v, ] - print(sub[, c("cycle", "n", "valid", "valid_pct")], row.names = FALSE) -} - -# Save results -write.csv(results, "ceps/cep-NNN-domain/vars-pumf-integration-test.csv", - row.names = FALSE) -``` - -#### Cross-cycle prevalence QMD - -After generating the integration test CSV, create a Quarto document (`.qmd`) that visualises the cross-cycle results. This is a standard CEP artifact — visual inspection of prevalence trends is the most effective way to detect era boundary problems. - -The QMD should include: -1. **Cross-cycle valid % line plot** for each key variable (or a representative subset), with cycles on the x-axis and valid % on the y-axis. Add vertical reference lines at era boundaries (2007, 2015). -2. **Category distribution plot** for categorical derived variables (e.g., stacked bar chart of diet_score_cat3 across cycles). -3. **Annotations** for known data patterns — e.g., optional content cycles where low prevalence is expected, documented in the R function's roxygen or CCHS documentation. -4. **Brief narrative** interpreting the plots: are transitions clean? Any unexpected step changes? - -Use base R graphics (`plot()`, `barplot()`) to avoid extra dependencies. The QMD should be self-contained — load the results CSV, not rerun the integration test. 
- -Pattern: - -```yaml ---- -title: "CEP-NNN: Cross-cycle prevalence" -format: - html: - toc: true - code-fold: true ---- -``` - -```r -results <- read.csv("domain-pumf-integration-test.csv") - -# Extract year from cycle name for x-axis -results$year <- as.numeric(gsub("cchs(\\d{4}).*", "\\1", results$cycle)) - -# Plot valid % by cycle for a key variable -var_data <- results[results$variable == "KEY_VAR", ] -plot(var_data$year, var_data$valid_pct, type = "b", pch = 19, - xlab = "CCHS cycle", ylab = "Valid %", - main = "KEY_VAR: cross-cycle prevalence") -abline(v = c(2007, 2015), lty = 2, col = "grey50") -``` - -Save the QMD to the CEP directory alongside the other artifacts: - -``` -ceps/cep-NNN-/ - cep-NNN-.qmd # Cross-cycle prevalence plots - PR--review-summary.md - integration-test-.R - -pumf-integration-test.csv -``` - -#### Cross-cycle prevalence analysis - -The cross-cycle summary is the most important output. Review the `valid_pct` column for each variable across cycles and look for: - -1. **Step changes at era boundaries** — a sudden jump or drop in prevalence between 2005 → 2007 (pre-2007 to standard era) or 2014 → 2015 (standard to post-2014 era) suggests a naming mismatch or incorrect `[VAR]` default -2. **Unexpected zeros** — a cycle showing 0% valid when the variable should be available indicates a wrong source variable name or missing `db::VAR` mapping -3. **Exposure distribution shifts** — the key harmonization question is whether typical exposures remain stable across cycles. For continuous variables (e.g., daily fruit/veg consumption), check whether the proportion at clinically meaningful thresholds (e.g., 0 servings, >5 servings/day) shifts at era boundaries. For categorical variables, compare `table()` output across cycles. A sudden distribution change at 2015 that doesn't track the gradual secular trend suggests a mapping or recoding error, not a real population change. -4. 
**Derived variable completeness** — if a derived variable has lower valid % than its inputs, the DV function may be dropping valid cases - -**Optional content cycles:** Some CCHS modules are optional content in certain cycles — provinces opt in, so prevalence drops sharply. Before flagging low prevalence as an issue, check the R function's roxygen documentation and CCHS documentation for known optional content cycles. For example, FVC (fruit and vegetable consumption) was optional in 2005 and 2017-2018, producing ~56% and ~1% valid respectively — these are expected, not errors. - -Cross-cycle trends require human judgement. The skill should produce a clear summary table and flag any obvious discontinuities, but the reviewer interprets the results using their domain knowledge. In future, threshold-based alerts may be added. - -Example of a step change indicating a problem: -``` - cycle valid_pct - cchs2009_2010_p 34.1 <- normal - cchs2011_2012_p 14.7 <- lower (optional content) - cchs2013_2014_p 28.9 <- normal - cchs2015_2016_p 0.0 <- PROBLEM: variable renamed but mapping missing - cchs2017_2018_p 0.0 <- same problem -``` - -#### Derived variable testing - -If the in-scope variables include derived variables (functions in `R/`): - -1. Identify the DV function (e.g., `diet_score_fun()` in `R/diet.R`) -2. Check that all input variables are available in the test cycles -3. Run `rec_with_table()` with the derived variable to verify the full pipeline -4. Compare the derived variable's valid % against its input variables — the DV should not have materially higher valid % than its least-available input -5. For categorical derived variables and key continuous inputs, examine the **exposure distribution** across cycles — not just valid counts. The central harmonization question is whether typical exposures (e.g., proportion with 0 fruit/veg, or >5 servings/day) remain stable across cycles. 
A sudden shift in the distribution at an era boundary signals a recoding or mapping error even when valid % is unchanged. Include these distributions in both the integration test output and the QMD visualisation - -#### What to report from L6 - -For each cycle tested: -- **N**: Total respondents -- **Valid count and %**: Non-NA values for each variable -- **Category distribution**: `table()` output for categorical variables -- **Errors**: Any `rec_with_table()` failures with error messages - -Flag: -- **Step changes at era boundaries** (most important — signals naming/mapping errors) -- Cycles where valid % is 0 (variable may not exist despite being listed) -- Cycles where category distributions shift unexpectedly -- Derived variable failures or unexplained completeness gaps +Read and follow `docs/l6-implementation-validation.md` for the full procedure. This covers: +- Multi-era recode validation +- Scope and limitations (PUMF data only) +- Integration test script template +- Cross-cycle prevalence QMD +- Cross-cycle prevalence analysis (step changes, unexpected zeros, distribution shifts) +- Derived variable testing +- What to report from L6 ### Step 7: Confidence scoring @@ -808,8 +299,6 @@ Ran `rec_with_table()` against PUMF data for each cycle: CEP: `ceps/cep-NNN-/` - -Generated with [Claude Code](https://claude.ai/code) ``` If no issues survive filtering: @@ -824,169 +313,25 @@ L6 integration test: `rec_with_table()` ran successfully for all PUMF cycles. Checked: era boundary defaults, databaseStart consistency, naming conventions, DV specifications, known error patterns, and PUMF integration. CEP: `ceps/cep-NNN-/` - -Generated with [Claude Code](https://claude.ai/code) ``` #### Self-review reporting For self-review, report findings directly to the user without posting a PR comment. Still save CEP artifacts if CEP generation was not skipped. 
-### Step 9: Run CSV validation tools +### Step 9: Run CSV validation and propose fixes -Before proposing fixes, run the automated CSV validation tools to catch formatting and schema issues that the manual checks may have missed. +Read and follow `docs/csv-validation-and-fixes.md` for the full procedure. This covers: +- Running `check-worksheets.R` and `standardise_csv()` +- Branch availability for validation tools +- Proposing worksheet fixes (scoped to in-scope variables only) +- Multi-block databaseStart fix rules +- Visual diff review with Beyond Compare +- Scope expansion during review -#### Available tools +### Step 10: Scope expansion during review -**`check_worksheet()` / `fix_worksheet()`** (on `v3-smoking` and later branches): - -```bash -# Check for formatting violations (column order, line endings, row sorting, quoting) -Rscript exec/check-worksheets.R - -# Auto-fix formatting violations -Rscript exec/fix-worksheets.R -``` - -These are enforced by the `check-csv.yml` GitHub Action on PRs that modify `inst/extdata/variables.csv` or `variable_details.csv`. The GHA runs `check-worksheets.R` and fails if violations are found. - -**`standardise_csv()`** (on `feature/csv-standardisation-updates` branch): - -```r -# Basic mode — fix git conflicts (BOM, line endings, column order) -standardise_csv("inst/extdata/variables.csv") - -# Collaboration mode — enhanced schema validation -standardise_csv("inst/extdata/variable_details.csv", collaboration = TRUE, validate_only = TRUE) -``` - -Collaboration mode validates fields against `metadata_registry.yaml` regex patterns including `dummyVariable`, `variableStart`, `recStart`, and `recEnd`. It also checks for missing categorical dummy variables and cross-field rules. 
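To illustrate the kind of field-level rule that collaboration mode enforces, here is a small Python sketch of the `dummyVariable` checks from Check 5b. It reuses the categorical regex quoted there and the rule that the suffix must match `recEnd`. The function name and the mapping of `NA::a` to the `NAa` suffix are assumptions made for this sketch, and it does not reproduce how `standardise_csv()` works internally.

```python
import re

# Categorical dummyVariable pattern as quoted in Check 5b.
CATEGORICAL_DUMMY = re.compile(r"^[a-zA-Z0-9_]+_cat[0-9]+(_[0-9]+|_NA[a-z])$")


def check_dummy_variable(dummy, rec_end):
    """Return a list of problems for one categorical row (empty if none)."""
    if not CATEGORICAL_DUMMY.match(dummy):
        return ["does not match categorical pattern"]
    suffix = dummy.rsplit("_", 1)[1]
    # Assumed convention: recEnd "NA::a" corresponds to the suffix "NAa".
    if rec_end.startswith("NA::"):
        expected = "NA" + rec_end[len("NA::"):]
    else:
        expected = rec_end
    if suffix != expected:
        return [f"suffix {suffix!r} does not match recEnd {rec_end!r}"]
    return []


print(check_dummy_variable("SMK_204_cat4_1", "1"))
print(check_dummy_variable("SMK_204_cat4_NA::a", "NA::a"))
print(check_dummy_variable("FVC_1A_cat5_2", "1"))
```

Continuous and `Func::` rows, which use `N/A`, would need to be filtered out before calling a check like this.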
- -#### When to run - -- **Always** run `check-worksheets.R` (or `standardise_csv()` if available) before proposing fixes, to ensure proposed changes don't introduce new formatting violations -- **After applying fixes**, run validation again to confirm the fix didn't break formatting -- If the PR's branch has `check-csv.yml` GHA, check whether CI passed — if not, the formatting issues may need to be fixed before the review's substantive issues - -#### Branch availability - -| Tool | Branches | -|------|----------| -| `check_worksheet()` / `fix_worksheet()` | `v3-smoking`, `feature/v3.0.0-validation-infrastructure`, and later | -| `standardise_csv()` with collaboration mode | `feature/csv-standardisation-updates` and later | -| `check-csv.yml` GHA | `v3-smoking` and later | - -If the PR's branch doesn't have these tools, run validation from a branch that does by checking out only the worksheet files: - -```bash -# Validate worksheets from a branch that has the tools -git stash -git checkout v3-smoking -- exec/check-worksheets.R R/check-worksheet.R R/fix-worksheet.R -Rscript exec/check-worksheets.R -git checkout -- exec/ R/check-worksheet.R R/fix-worksheet.R -git stash pop -``` - -### Step 10: Propose worksheet fixes (if issues found) - -If the review identified worksheet errors (typos, missing mappings, incorrect database names), propose fixes to the user rather than silently modifying the worksheets. - -#### Workflow - -1. **Summarize the proposed changes** — list each fix with the affected variable(s), the current (incorrect) value, and the corrected value. For example: - - ``` - Proposed worksheet fixes: - - 1. FVC_1A through FVC_6E (30 variables): Replace `chs2011_2012_m` with - `cchs2011_2012_m` and `chs2013_2014_m` with `cchs2013_2014_m` in both - variables.csv and variable_details.csv - - 2. FVCDPOT: Replace `cchs20013_2014_m` with `cchs2013_2014_m` in - variable_details.csv (extra zero) - ``` - -2. 
**Wait for user approval** — the user decides whether to apply the fixes now, defer them, or handle them differently (e.g., as a follow-up PR, or let the PR author fix them). - -3. **Apply fixes using R or Python** — never use bash text tools on CSV files. Use R's `read.csv()`/`write.csv()` or Python's csv module to make targeted edits while preserving the file's existing formatting and quoting conventions. - - **CRITICAL: Scope fixes to in-scope variables only.** When applying replacements (e.g., `_s` → `_m`, typo corrections), filter to only the rows belonging to the PR's in-scope variables. Never apply global `gsub()` or `str_replace_all()` across the entire dataframe — this will modify hundreds of unrelated variables. Always subset first: - ```r - alc_idx <- which(vd$variable %in% in_scope_vars) - for (i in alc_idx) { - vd$databaseStart[i] <- gsub("cchs2009_s", "cchs2009_m", vd$databaseStart[i]) - } - ``` - - **Multi-block databaseStart fix rule:** When `check_recode_blocks()` flags a recStart collision, the fix is to narrow each block's `databaseStart` to only the databases where that block's source variable actually exists. The key mental model: - - > Each block's `databaseStart` should contain only the databases where that block's `variableStart` is the correct source variable. - - **Critical anti-pattern — do not replace the entire databaseStart.** A block's `databaseStart` may include databases covered by a `[SHORTHAND]` entry in `variableStart` (e.g., `[SMK_09A]` covers cchs2007_2008_p through cchs2013_2014_p implicitly). If you replace the full `databaseStart` with only the databases visible in the explicit `db::VAR` prefixes, you will drop the shorthand-covered databases and create new gaps. Instead: - - 1. Identify which database(s) are causing the collision (appear in two blocks) - 2. Determine which block those databases actually belong to (based on which era's source variable they use) - 3. 
Remove those databases from the block they do *not* belong to — leave everything else intact - - **Example:** If `cchs2001_p` appears in both Block 1 (2001 source variable) and Block 2 (2003+ source variable), remove `cchs2001_p` from Block 2's `databaseStart` only. Do not rewrite Block 2's full `databaseStart`. - - Always open Beyond Compare to verify the proposed fix before applying it to `inst/extdata/`. - -4. **Save fixes to a temporary file** — per project conventions (CLAUDE.local.md), write proposed changes to `/tmp/` for user review before editing the main worksheet files directly. The user or PR author integrates the changes. - -5. **Verify idempotency** — always read from `inst/extdata/` (the clean source), never from previously modified `/tmp/` files. After running a modification script, re-run it to confirm the output is identical. If the script detects its own changes on the second run (e.g., skips "already has 2021"), the idempotency check passed. - -6. **Offer visual diff review** — before applying changes to `inst/extdata/`, pause and ask the user whether they want to review the diff in a visual diff tool (e.g., Beyond Compare, Kaleidoscope, VS Code diff). This is especially valuable for large worksheet changes where the programmatic summary may miss formatting issues (e.g., Python csv re-quoting all fields, creating a noisy diff that obscures the real changes). - - **For PR reviews**: Use the **merge base** as the comparison baseline, not the target branch tip. This ensures the diff shows only what the PR branch changed, excluding divergence on the target branch since the PR was created. This is especially important for full-file rewrites where comparing against the target tip shows noise from unrelated target-side changes. 
- - ```bash - # Find the merge base between the PR branch and target - MERGE_BASE=$(git merge-base origin/ ) - - # Extract the file at the merge base - git show ${MERGE_BASE}:inst/extdata/variable_details.csv > /tmp/vd_mergebase.csv - git show ${MERGE_BASE}:inst/extdata/variables.csv > /tmp/vars_mergebase.csv - - # Compare merge base vs current PR branch (shows only PR changes) - bcompare /tmp/vd_mergebase.csv inst/extdata/variable_details.csv - bcompare /tmp/vars_mergebase.csv inst/extdata/variables.csv - ``` - - **For self-review / proposed fixes**: Compare the current working copy against the proposed modifications in `/tmp/`: - - ```bash - bcompare inst/extdata/variable_details.csv /tmp/variable_details_updated.csv - bcompare inst/extdata/variables.csv /tmp/variables_updated.csv - ``` - - If the user doesn't have a visual diff tool configured, offer to help set one up. Common options: - - **Beyond Compare**: `brew install --cask beyond-compare` — configure as git difftool with `git config --global diff.tool bc` and `git config --global difftool.bc.path /usr/local/bin/bcompare` - - **VS Code**: `code --diff ` - - **Kaleidoscope**: `ksdiff ` - - **FileMerge** (macOS built-in): `opendiff ` - - **Why merge-base matters:** In the GEN_10 PR (#169) review, comparing against the target tip showed 23 extra DHHGAGE_E rows and SDCDCGT changes that were on the target branch, not the PR. This noise obscured the actual PR changes. Using merge-base revealed only the GEN_07 and GEN_10 rows — the true scope. Similarly, in the diet PR (#148) review, Python's csv writer re-quoted every field, producing a noisy git diff. A visual diff tool with merge-base comparison would have caught both issues immediately. 
- -#### When not to fix - -- Pre-existing issues on the target branch that are outside the PR's scope — note them in the review but do not propose fixes as part of this PR -- **Exception: `_s` suffix databases** — always fix `_s` → `_m` when encountered in reviewed variables, even if pre-existing. Deprecated suffixes should not persist in the worksheets. -- Issues that require domain judgement (e.g., whether a variable should use a different source name) — flag for human review -- Changes to R functions — these require separate code review and testing - -### Scope expansion during review - -If the review identifies expansion opportunities (e.g., additional cycles available in cchsflow-docs that are not yet in the worksheets) and the user requests adding them, the review transitions into authoring: - -1. **Enter plan mode** to design the worksheet changes. The plan should cover which variables, databases, and variableStart mappings need updating. -2. **Write a modification script** (Python csv module) that reads from `inst/extdata/`, applies all changes, and writes to `/tmp/` for user review. The script should handle both the expansion and any typo fixes from the review. -3. **Run verification** — check databaseStart consistency, era boundary correctness, and variableStart mappings in the `/tmp/` output files. -4. **Present changes to the user** with a clear summary of what was modified before applying to `inst/extdata/`. -5. **Update the CEP** to document the expansion (new cycles, era boundaries, naming changes). -6. **Re-run CSV validation** (Step 9) on the expanded worksheets. - -The key constraint: all changes go through `/tmp/` for review before touching `inst/extdata/`. The review skill delegates to the worksheets skill for authoring decisions (era naming conventions, variableStart patterns). +If the review identifies expansion opportunities and the user requests adding them, follow the scope expansion procedure in `docs/csv-validation-and-fixes.md`. 
### Step 11: Retrospective — review the skill @@ -1002,6 +347,18 @@ Summarise the retrospective to the user. If skill updates are warranted, propose ## Reference +### Skill docs (in this folder) + +- **Worksheet reference (MUST READ)**: `docs/worksheet-reference.md` — canonical guide to cchsflow worksheet conventions +- **L0-L2 documentation review**: `docs/l0-l2-documentation-review.md` — MCP setup, variable verification, concordance +- **L3-L5 worksheet checks**: `docs/l3-l5-worksheet-checks.md` — era boundaries, naming, error patterns +- **L6 implementation validation**: `docs/l6-implementation-validation.md` — rec_with_table() testing, prevalence analysis +- **CSV validation and fixes**: `docs/csv-validation-and-fixes.md` — check/fix tools, fix workflow, visual diff +- **Variable naming conventions**: `docs/variable-naming-conventions.md` — harmonized variable naming rules +- **Gem verification workflow**: `docs/review/` — NotebookLM Gem system prompt, notebook manifest, coverage summary + +### External references + - L0-L6 workflow: `.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md` - Era mapping tables: `.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md` - Schema definitions: `inst/metadata/schemas/core/variables.yaml`, `inst/metadata/schemas/core/variable_details.yaml` diff --git a/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md b/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md new file mode 100644 index 00000000..9489062d --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md @@ -0,0 +1,157 @@ +# CSV validation and worksheet fixes + +## CSV validation tools + +Before proposing fixes, run the automated CSV validation tools to catch formatting and schema issues that the manual checks may have missed. 
+ +### Available tools + +**`check_worksheet()` / `fix_worksheet()`** (on `v3-smoking` and later branches): + +```bash +# Check for formatting violations (column order, line endings, row sorting, quoting) +Rscript exec/check-worksheets.R + +# Auto-fix formatting violations +Rscript exec/fix-worksheets.R +``` + +These are enforced by the `check-csv.yml` GitHub Action on PRs that modify `inst/extdata/variables.csv` or `variable_details.csv`. The GHA runs `check-worksheets.R` and fails if violations are found. + +**`standardise_csv()`** (on `feature/csv-standardisation-updates` branch): + +```r +# Basic mode — fix git conflicts (BOM, line endings, column order) +standardise_csv("inst/extdata/variables.csv") + +# Collaboration mode — enhanced schema validation +standardise_csv("inst/extdata/variable_details.csv", collaboration = TRUE, validate_only = TRUE) +``` + +Collaboration mode validates fields against `metadata_registry.yaml` regex patterns including `dummyVariable`, `variableStart`, `recStart`, and `recEnd`. It also checks for missing categorical dummy variables and cross-field rules. 
+ +### When to run + +- **Always** run `check-worksheets.R` (or `standardise_csv()` if available) before proposing fixes, to ensure proposed changes don't introduce new formatting violations +- **After applying fixes**, run validation again to confirm the fix didn't break formatting +- If the PR's branch has `check-csv.yml` GHA, check whether CI passed — if not, the formatting issues may need to be fixed before the review's substantive issues + +### Branch availability + +| Tool | Branches | +|------|----------| +| `check_worksheet()` / `fix_worksheet()` | `v3-smoking`, `feature/v3.0.0-validation-infrastructure`, and later | +| `standardise_csv()` with collaboration mode | `feature/csv-standardisation-updates` and later | +| `check-csv.yml` GHA | `v3-smoking` and later | + +If the PR's branch doesn't have these tools, run validation by temporarily checking out only the validation scripts from a branch that does: + +```bash +# Validate the PR's worksheets with check scripts borrowed from a branch that has the tools +git stash +git checkout v3-smoking -- exec/check-worksheets.R R/check-worksheet.R R/fix-worksheet.R +Rscript exec/check-worksheets.R +git restore --source=HEAD --staged --worktree -- exec/check-worksheets.R R/check-worksheet.R R/fix-worksheet.R +git stash pop +``` + +## Proposing worksheet fixes + +If the review identified worksheet errors (typos, missing mappings, incorrect database names), propose fixes to the user rather than silently modifying the worksheets. + +### Workflow + +1. **Summarize the proposed changes** — list each fix with the affected variable(s), the current (incorrect) value, and the corrected value. For example: + + ``` + Proposed worksheet fixes: + + 1. FVC_1A through FVC_6E (30 variables): Replace `chs2011_2012_m` with + `cchs2011_2012_m` and `chs2013_2014_m` with `cchs2013_2014_m` in both + variables.csv and variable_details.csv + + 2. FVCDPOT: Replace `cchs20013_2014_m` with `cchs2013_2014_m` in + variable_details.csv (extra zero) + ``` + +2.
**Wait for user approval** — the user decides whether to apply the fixes now, defer them, or handle them differently (e.g., as a follow-up PR, or let the PR author fix them). + +3. **Apply fixes using R or Python** — never use bash text tools on CSV files. Use R's `read.csv()`/`write.csv()` or Python's csv module to make targeted edits while preserving the file's existing formatting and quoting conventions. + + **CRITICAL: Scope fixes to in-scope variables only.** When applying replacements (e.g., `_s` → `_m`, typo corrections), filter to only the rows belonging to the PR's in-scope variables. Never apply global `gsub()` or `str_replace_all()` across the entire dataframe — this will modify hundreds of unrelated variables. Always subset first: + ```r + alc_idx <- which(vd$variable %in% in_scope_vars) + for (i in alc_idx) { + vd$databaseStart[i] <- gsub("cchs2009_s", "cchs2009_m", vd$databaseStart[i]) + } + ``` + + **Multi-block databaseStart fix rule:** When `check_recode_blocks()` flags a recStart collision, the fix is to narrow each block's `databaseStart` to only the databases where that block's source variable actually exists. The key mental model: + + > Each block's `databaseStart` should contain only the databases where that block's `variableStart` is the correct source variable. + + **Critical anti-pattern — do not replace the entire databaseStart.** A block's `databaseStart` may include databases covered by a `[SHORTHAND]` entry in `variableStart` (e.g., `[SMK_09A]` covers cchs2007_2008_p through cchs2013_2014_p implicitly). If you replace the full `databaseStart` with only the databases visible in the explicit `db::VAR` prefixes, you will drop the shorthand-covered databases and create new gaps. Instead: + + 1. Identify which database(s) are causing the collision (appear in two blocks) + 2. Determine which block those databases actually belong to (based on which era's source variable they use) + 3. 
Remove those databases from the block they do *not* belong to — leave everything else intact + + **Example:** If `cchs2001_p` appears in both Block 1 (2001 source variable) and Block 2 (2003+ source variable), remove `cchs2001_p` from Block 2's `databaseStart` only. Do not rewrite Block 2's full `databaseStart`. + + Always open Beyond Compare to verify the proposed fix before applying it to `inst/extdata/`. + +4. **Save fixes to a temporary file** — per project conventions (CLAUDE.local.md), write proposed changes to `/tmp/` for user review before editing the main worksheet files directly. The user or PR author integrates the changes. + +5. **Verify idempotency** — always read from `inst/extdata/` (the clean source), never from previously modified `/tmp/` files. After running a modification script, re-run it to confirm the output is identical. If the script detects its own changes on the second run (e.g., skips "already has 2021"), the idempotency check passed. + +6. **Offer visual diff review** — before applying changes to `inst/extdata/`, pause and ask the user whether they want to review the diff in a visual diff tool (e.g., Beyond Compare, Kaleidoscope, VS Code diff). This is especially valuable for large worksheet changes where the programmatic summary may miss formatting issues (e.g., Python csv re-quoting all fields, creating a noisy diff that obscures the real changes). + + **For PR reviews**: Use the **merge base** as the comparison baseline, not the target branch tip. This ensures the diff shows only what the PR branch changed, excluding divergence on the target branch since the PR was created. This is especially important for full-file rewrites where comparing against the target tip shows noise from unrelated target-side changes. 
+ + ```bash + # Find the merge base between the PR branch and target + MERGE_BASE=$(git merge-base origin/ ) + + # Extract the file at the merge base + git show ${MERGE_BASE}:inst/extdata/variable_details.csv > /tmp/vd_mergebase.csv + git show ${MERGE_BASE}:inst/extdata/variables.csv > /tmp/vars_mergebase.csv + + # Compare merge base vs current PR branch (shows only PR changes) + bcompare /tmp/vd_mergebase.csv inst/extdata/variable_details.csv + bcompare /tmp/vars_mergebase.csv inst/extdata/variables.csv + ``` + + **For self-review / proposed fixes**: Compare the current working copy against the proposed modifications in `/tmp/`: + + ```bash + bcompare inst/extdata/variable_details.csv /tmp/variable_details_updated.csv + bcompare inst/extdata/variables.csv /tmp/variables_updated.csv + ``` + + If the user doesn't have a visual diff tool configured, offer to help set one up. Common options: + - **Beyond Compare**: `brew install --cask beyond-compare` — configure as git difftool with `git config --global diff.tool bc` and `git config --global difftool.bc.path /usr/local/bin/bcompare` + - **VS Code**: `code --diff ` + - **Kaleidoscope**: `ksdiff ` + - **FileMerge** (macOS built-in): `opendiff ` + + **Why merge-base matters:** In the GEN_10 PR (#169) review, comparing against the target tip showed 23 extra DHHGAGE_E rows and SDCDCGT changes that were on the target branch, not the PR. This noise obscured the actual PR changes. Using merge-base revealed only the GEN_07 and GEN_10 rows — the true scope. Similarly, in the diet PR (#148) review, Python's csv writer re-quoted every field, producing a noisy git diff. A visual diff tool with merge-base comparison would have caught both issues immediately. 
+ +### When not to fix + +- Pre-existing issues on the target branch that are outside the PR's scope — note them in the review but do not propose fixes as part of this PR +- **Exception: `_s` suffix databases** — always fix `_s` → `_m` when encountered in reviewed variables, even if pre-existing. Deprecated suffixes should not persist in the worksheets. +- Issues that require domain judgement (e.g., whether a variable should use a different source name) — flag for human review +- Changes to R functions — these require separate code review and testing + +## Scope expansion during review + +If the review identifies expansion opportunities (e.g., additional cycles available in cchsflow-docs that are not yet in the worksheets) and the user requests adding them, the review transitions into authoring: + +1. **Enter plan mode** to design the worksheet changes. The plan should cover which variables, databases, and variableStart mappings need updating. +2. **Write a modification script** (Python csv module) that reads from `inst/extdata/`, applies all changes, and writes to `/tmp/` for user review. The script should handle both the expansion and any typo fixes from the review. +3. **Run verification** — check databaseStart consistency, era boundary correctness, and variableStart mappings in the `/tmp/` output files. +4. **Present changes to the user** with a clear summary of what was modified before applying to `inst/extdata/`. +5. **Update the CEP** to document the expansion (new cycles, era boundaries, naming changes). +6. **Re-run CSV validation** on the expanded worksheets. + +The key constraint: all changes go through `/tmp/` for review before touching `inst/extdata/`. The review skill delegates to the worksheets skill for authoring decisions (era naming conventions, variableStart patterns). 
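The scoped-fix and idempotency rules in this document can be sketched with Python's csv module. This is a minimal sketch under stated assumptions, not the project's modification script: `apply_scoped_fix` and `write_proposal` are hypothetical names, the worksheet is assumed to have a `variable` column, and LF line endings are assumed. Note that `csv.DictWriter` applies its own minimal quoting, so the output may still need the visual diff review described above.

```python
import csv
import io


def apply_scoped_fix(csv_text, in_scope, column, old, new):
    """Replace `old` with `new` in `column`, only on rows whose variable is in scope.

    Hypothetical helper: rows for out-of-scope variables are written back unchanged.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    # Assumption: LF line endings; adjust lineterminator to match the worksheet.
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["variable"] in in_scope:
            row[column] = row[column].replace(old, new)
        writer.writerow(row)
    return out.getvalue()


def write_proposal(src_path, dst_path, in_scope, column, old, new):
    """Read the clean source, apply the scoped fix, and write the proposal file."""
    with open(src_path, newline="", encoding="utf-8") as src:
        fixed = apply_scoped_fix(src.read(), in_scope, column, old, new)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        dst.write(fixed)
    return fixed
```

A call such as `write_proposal("inst/extdata/variable_details.csv", "/tmp/variable_details_updated.csv", in_scope_vars, "databaseStart", "cchs2009_s", "cchs2009_m")` always reads from the clean source, so running it twice should yield byte-identical output, and applying the same replacement to its own output should change nothing.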
diff --git a/.claude/skills/cchsflow-review/docs/l0-l2-documentation-review.md b/.claude/skills/cchsflow-review/docs/l0-l2-documentation-review.md new file mode 100644 index 00000000..79717062 --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/l0-l2-documentation-review.md @@ -0,0 +1,131 @@ +# L0-L2 documentation review + +For each in-scope variable, verify the documentation foundations. Read `.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md` for detailed L0-L2 templates. + +## L0: Documentation assessment + +Verify source variables against CCHS documentation using the **cchsflow-docs** repository (`Big-Life-Lab/cchsflow-docs` on GitHub, cloned alongside cchsflow). This step confirms that variables claimed in `variableStart` and `databaseStart` actually exist in the CCHS data for those cycles. + +### Primary source: cchs-metadata MCP server + +**Always use the cchs-metadata MCP as the primary tool for L0-L1 verification.** It provides the most complete and queryable metadata — 16,000+ variables across 251 datasets, enriched from PUMF RData, DDI XML, and ICES sources with full provenance tracking. 
+ +**Key tools:** +- `mcp__cchs-metadata__search_variables(query)` — find variables by name or label substring +- `mcp__cchs-metadata__get_variable_detail(variable_name)` — full metadata including labels, question text, value codes, dataset history +- `mcp__cchs-metadata__get_variable_history(variable_name)` — which cycles/datasets contain the variable (essential for era boundary verification) +- `mcp__cchs-metadata__get_value_codes(variable_name)` — response categories with frequencies +- `mcp__cchs-metadata__compare_master_pumf(variable_name, cycle)` — compare PUMF vs Master metadata for a specific cycle (essential for PUMF/Master split decisions) +- `mcp__cchs-metadata__suggest_cchsflow_row(variable_name)` — draft a cchsflow harmonisation row +- `mcp__cchs-metadata__get_dataset_variables(dataset_id)` — list all variables in a specific dataset +- `mcp__cchs-metadata__get_source_conflicts(variable_name, dataset_id)` — find cross-source label disagreements (useful for catching metadata inconsistencies) +- `mcp__cchs-metadata__get_database_summary()` — database overview and statistics + +**Using MCP results:** +- The `cchsflow_name` field maps StatCan source variables to their cchsflow harmonized names — use this to verify that `variableStart` entries point to the correct source variable for each cycle +- Use `get_variable_history` to confirm a variable exists across claimed cycles and to identify era renames (e.g., SMK_09C → SMK_090 at the 2015 boundary) +- Use `compare_master_pumf` to verify whether PUMF and Master share the same source variable or need split rows + +**Caution:** The MCP `label_short`/`label_long` fields may be contaminated by cchsflow labels (see MCP error report from alcohol review). Always cross-check against `label_statcan` which comes from DDI primary sources. 
+ +### If the MCP is not available + +Check whether the MCP is loaded: +```bash +claude mcp list +``` + +If `cchs-metadata` is missing or shows "Failed to connect", the server needs to be configured. The MCP server (v0.3.0+) lives in the **cchsflow-docs** repository and is also available as a [GitHub release](https://github.com/Big-Life-Lab/cchsflow-docs/releases). + +**Quick setup** (if cchsflow-docs is cloned at `../cchsflow-docs/`): +```bash +cd ../cchsflow-docs/mcp-server && bash ../scripts/setup.sh +claude mcp add cchs-metadata -- python3 /Users/dmanuel/github/cchsflow-docs/mcp-server/server.py +``` + +**Manual setup:** +1. Ensure `cchsflow-docs` is cloned alongside cchsflow (typically `../cchsflow-docs/`) +2. Ensure `mcp-server/server.py` exists in cchsflow-docs +3. Ensure the database exists: `../cchsflow-docs/database/cchs_metadata.duckdb` (download from the [v0.3.0 release](https://github.com/Big-Life-Lab/cchsflow-docs/releases) or rebuild: `Rscript --vanilla ../cchsflow-docs/database/build_db.R`) +4. Add the MCP to Claude Code: + ```bash + claude mcp add cchs-metadata -- python3 /Users/dmanuel/github/cchsflow-docs/mcp-server/server.py + ``` + Or add to `~/.claude.json` (see `.mcp.json.example` in cchsflow-docs for a template): + ```json + "cchs-metadata": { + "type": "stdio", + "command": "python3", + "args": ["/Users/dmanuel/github/cchsflow-docs/mcp-server/server.py"], + "env": {"CCHS_DB_PATH": "/Users/dmanuel/github/cchsflow-docs/database/cchs_metadata.duckdb"} + } + ``` +5. 
Restart Claude Code for the MCP tools to appear in the tool list + +### CLI fallback + +If the MCP server cannot be started but the database exists, use the standalone CLI (no FastMCP dependency — only `duckdb` required): +```bash +python3 ../cchsflow-docs/mcp-server/cli.py search smoking +python3 ../cchsflow-docs/mcp-server/cli.py detail SMKDSTY +python3 ../cchsflow-docs/mcp-server/cli.py history SMK_204 +python3 ../cchsflow-docs/mcp-server/cli.py conflicts --variable SMKDSTY +python3 ../cchsflow-docs/mcp-server/cli.py codes SMK_204 +``` + +All commands support `--json` for machine-readable output and `--db PATH` for custom database path. + +See the cchsflow-docs `CLAUDE.md` and `.claude/skills/cchs-database/SKILL.md` for database build workflow and schema details. + +### Fallback: file-based lookups + +If the MCP is unavailable and cannot be restored, use these file-based sources in the cchsflow-docs repo (typically `../cchsflow-docs/`): + +1. **Extracted YAML data dictionaries** — structured variable definitions by cycle: + ``` + ../cchsflow-docs/cchs-extracted/data-dictionary/{year}/ + ``` + Coverage: 2000-2001 through 2023. + +2. **DDI XML files** — authoritative StatsCan PUMF documentation: + ``` + ../cchsflow-docs/cchs-pumf-docs/CCHS_DDI/ + ``` + +3. **CCHS variable dictionary CSV** — flat file for quick lookups: + ``` + ../cchsflow-docs/data/cchs_variable_dictionary.csv + ``` + +These are the raw sources that feed the MCP database. The MCP is strongly preferred because it cross-references all sources, deduplicates, and provides structured query tools rather than requiring manual grep/search across hundreds of files. + +### What to verify + +For each in-scope variable: +1. **Existence**: Does the source variable name appear in the documentation for each claimed cycle? +2. **Category codes**: Do `recStart` values match the documented category definitions? +3. **Era renames**: For 2015+ cycles, confirm the renamed variable exists +4. 
**Cycle coverage up to latest available**: Check whether the variable exists in cycles beyond the PR's `databaseStart` (documentation covers up to 2023) — these may be candidates for expansion + +### What to flag + +- Variable listed in `variableStart` but not found in documentation for that cycle → **P0** (wrong variable name) +- Variable not checked (no documentation available for that cycle) → note as untested +- Variable exists in additional cycles not included in `databaseStart` → informational (expansion opportunity) + +## L1: Variable concordance + +Use the cchsflow-docs extracted data dictionaries to verify source variable names across eras: + +- Pre-2007: cycle letter in 4th position (A=2001, C=2003, E=2005) +- 2007-2014: standard naming +- Post-2014: check for 3-digit renames — search the 2015+ YAML files to confirm actual names +- 2022+: check for modular renames (e.g., CSS/SPU prefixes for smoking) + +For each era boundary, compare the variable name in `variableStart` against the corresponding cycle's YAML data dictionary in cchsflow-docs. PUMF and Master data dictionaries may differ — check both `_p` and `_m` YAML files where available. + +## L2: Semantic mapping + +- Are category codes consistent across cycles? +- Are semantic breaks identified and documented? +- Do recoding rules handle all source categories? diff --git a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md new file mode 100644 index 00000000..c45cef83 --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md @@ -0,0 +1,238 @@ +# L3-L5 worksheet and testing checks + +Run these checks in parallel for the in-scope variables. + +## Check 1: Era boundary defaults + +The most dangerous class of bug. For each variable: + +1. Parse the `databaseStart` field — does it span both 2007-2014 and 2015+ cycles? +2. Parse the `variableStart` field — do 2015+ databases have explicit `db::VAR` mappings? 
+3. If a `[VAR]` default exists and 2015+ databases lack explicit mappings, the default will apply the wrong variable name at runtime + +**Key 2015 renames to check:** +- Smoking categorical: SMK_06A → SMK_060, SMK_09A → SMK_080, SMK_10A → SMK_100 +- Smoking continuous: SMK_06C → SMK_070, SMK_09C → SMK_090, SMK_10C → SMK_110 +- Smoking derived: SMKDSTY → SMKDVSTY, SMKDSTP → SMKDVSTP +- PUMF grouped: SMKG06C → SMKG070, SMKG09C → SMKG090, SMKG10C → SMKG110 +- FVC: FVCDFRU → FVCDVFRU, FVCDSAL → FVCDVGRN, FVCDCAR → FVCDVORA, FVCDPOT → FVCDVPOT, FVCDVEG → FVCDVVEG, FVCDJUI → FVCDVJUI +- ADL: ADL_01-06 → ADL_005-030 (3-digit, 2015-2021), then → ADL_05-30 (2-digit, 2023+) + +**Key 2023 renames to check:** +- ADL: ADL_005 → ADL_05, ADL_010 → ADL_10, ADL_015 → ADL_15, ADL_020 → ADL_20, ADL_025 → ADL_25, ADL_030 → ADL_30. This is a new era boundary — `[ADL_005]` defaults will not work for 2023 databases. + +## Check 2: databaseStart consistency + +For each variable: +1. Extract `databaseStart` from variables.csv +2. Extract all `databaseStart` entries from variable_details.csv for that variable +3. The variables.csv list must equal the union of all variable_details.csv lists +4. Flag any databases present in one file but not the other + +For each mismatch found, classify it: +- **PR-introduced**: The mismatch is new (not on target branch) — report as P1 +- **Pre-existing**: The mismatch exists on the target branch — document in pre-existing issues +- **`_p` in vd only**: PUMF databases in variable_details but not variables.csv is a known pattern for variables that span both pre-2015 and 2015+ eras (the pre-2015 block includes `_p` databases that the 2015+ block in variables.csv doesn't list). Note but do not flag as a bug. + +All mismatches must be explicitly listed in the review summary, even pre-existing ones. Do not silently omit consistency results. 
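The Check 2 comparison can be sketched in Python as follows. This is an illustrative sketch rather than an existing cchsflow tool: `split_dbs`, `database_mismatches`, and `read_rows` are hypothetical names, and it assumes both worksheets carry `variable` and `databaseStart` columns with comma-separated database lists.

```python
import csv
from collections import defaultdict


def split_dbs(cell):
    """Split a comma-separated databaseStart cell into a set of database names."""
    return {db.strip() for db in cell.split(",") if db.strip()}


def database_mismatches(variables_rows, details_rows, in_scope):
    """Compare variables.csv databaseStart with the union over variable_details.csv rows.

    Returns {variable: (only_in_variables, only_in_details)} for mismatched variables.
    """
    vars_dbs = defaultdict(set)
    for row in variables_rows:
        if row["variable"] in in_scope:
            vars_dbs[row["variable"]] |= split_dbs(row["databaseStart"])

    details_dbs = defaultdict(set)
    for row in details_rows:
        if row["variable"] in in_scope:
            details_dbs[row["variable"]] |= split_dbs(row["databaseStart"])

    mismatches = {}
    for var in sorted(in_scope):
        only_vars = vars_dbs[var] - details_dbs[var]
        only_details = details_dbs[var] - vars_dbs[var]
        if only_vars or only_details:
            mismatches[var] = (sorted(only_vars), sorted(only_details))
    return mismatches


def read_rows(path):
    """Read a worksheet CSV into a list of dicts (utf-8-sig tolerates a BOM)."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

The sketch only reports differences. Each reported mismatch still has to be classified by hand as PR-introduced, pre-existing, or the known `_p`-in-vd-only pattern, which requires comparing against the target branch.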
+ +## Check 2b: Multi-block recStart collisions + +**Terminology:** A **recode block** is a set of rows in variable_details.csv sharing the same `variableStart` value. A recode block defines how one source variable maps to the harmonized output. Variables that changed source variable names or response category definitions across CCHS cycles require multiple blocks — one per distinct source structure. A single block can span multiple eras when the source variable name and category boundaries were stable across them. + +Variables with multiple recode blocks must not have the same `recStart` value appearing in more than one block for the same database. If a `(database, recStart)` pair matches rows from two blocks, `rec_with_table()` will find duplicate rows and produce incorrect output. + +Note: `databaseStart` overlap alone (a database appearing in two blocks' lists) is not sufficient to flag an error — cchsflow legitimately uses parallel PUMF and Master blocks that share databases but have non-overlapping `recStart` ranges. The collision must be at the `(database, recStart)` level. + +**Automated check:** `exec/check-worksheets.R` runs `check_recode_blocks()` automatically. For manual inspection of a specific variable: + +```r +vd_var <- variable_details[variable_details$variable == "VAR", ] +blocks <- split(vd_var, vd_var$variableStart) +db_sets <- lapply(blocks, function(b) { + trimws(unlist(strsplit(b$databaseStart[1], ","))) +}) +# Check all pairwise intersections (overlap is a necessary but not sufficient condition) +pairs <- if (length(db_sets) >= 2) combn(length(db_sets), 2) else matrix(integer(0), nrow = 2, ncol = 0) +for (i in seq_len(ncol(pairs))) { + overlap <- intersect(db_sets[[pairs[1,i]]], db_sets[[pairs[2,i]]]) + if (length(overlap) > 0) + cat("OVERLAP (check recStart too):", paste(overlap, collapse=", "), "\n") +} +``` + +Flag any confirmed `(database, recStart)` collision as **P0**.
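Because database overlap alone is not conclusive, the `(database, recStart)` level of the check can be sketched in Python. This is a hedged illustration, not the implementation of `check_recode_blocks()`: `find_recstart_collisions` is a hypothetical name, blocks are identified by their `variableStart` value as in the terminology above, and `recStart` values are compared as exact strings, so overlapping intervals such as `[1,5]` and `3` are not detected.

```python
from collections import defaultdict


def find_recstart_collisions(details_rows, variable):
    """Return {(database, recStart): [variableStart, ...]} for pairs claimed by 2+ blocks.

    Assumption: rows are dicts with variable, variableStart, databaseStart,
    and recStart keys, as read by csv.DictReader from variable_details.csv.
    """
    claims = defaultdict(set)
    for row in details_rows:
        if row["variable"] != variable:
            continue
        for db in row["databaseStart"].split(","):
            db = db.strip()
            if db:
                # Record which block (variableStart) claims this database/recStart pair.
                claims[(db, row["recStart"].strip())].add(row["variableStart"])
    return {key: sorted(blocks) for key, blocks in claims.items() if len(blocks) > 1}
```

An empty result means no exact-match collision was found for that variable. Interval-style `recStart` values still need manual inspection.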
+ +This check is especially important for continuous variables with era-specific midpoint recodes (e.g., SMK_09A_cont, SMK_06A_cont) where different cycles have different category boundaries and require separate recode blocks. + +## Check 3: PUMF vs Master naming + +For `_m` (master) databases: +- Pre-2007: cycle letter in source variable name (A=2001, C=2003, E=2005) +- 2007-2014: standard naming (no prefix letter) +- 2015+: check for renamed variables + +For `_p` (PUMF) databases: +- May use grouped/derived variable names (e.g., SMKG prefix, FVCD prefix) + +Verify that `_m` databases don't reference PUMF-only grouped variables, and vice versa. + +For variables where PUMF and Master use fundamentally different source types (categorical vs continuous), the required pattern is to split into separate recode blocks — one for PUMF, one for Master — each with its own `databaseStart` and `variableStart`. + +For harmonized variable **naming** decisions (when to use `_cont`, `_catN`, era suffixes, etc.), see `docs/variable-naming-conventions.md`. + +## Check 4: Pre-2007 cycle letters + +For variables with pre-2007 master cycles, verify the cycle letter: +- 2001 (`_m` or `_p`): letter A in the variable name (e.g., SMKA_203, FVCADFRU) +- 2003: letter C (e.g., SMKC_203, FVCCDFRU) +- 2005: letter E (e.g., SMKE_203, FVCEDFRU) + +The letter position varies by variable domain but follows a consistent pattern within each domain. + +## Check 5: Known error patterns + +**Automated check:** `exec/check-worksheets.R` runs `check_invalid_databases()` on both worksheets. Review its output before manual scanning — it catches the first four patterns below automatically. + +Scan for: +- `cchs20013_` — extra zero typo (should be `cchs2013_`) +- `chs20` without leading `c` — missing `c` typo (should be `cchs20`). This pattern has been found in ADL and FVC variables (e.g., `chs2011_2012_m` instead of `cchs2011_2012_m`). Check all database names match the `cchs` prefix. 
+- `_i` suffix databases — deprecated, should be `_m` +- `_s` suffix databases — deprecated, **always convert to `_m`** when found in reviewed variables. Check that a corresponding `_m` entry doesn't already exist (if it does, delete the `_s` row; if not, rename `_s` → `_m`). This applies even if the `_s` is pre-existing on the target branch — if the PR touches these rows, fix the suffix. **Naming convention**: `_s` share files are single-year extracts, so map to the single-year master form: `cchs2009_s` → `cchs2009_m` (not `cchs2009_2010_m`), `cchs2010_s` → `cchs2010_m`, `cchs2012_s` → `cchs2012_m`. Check `variables.csv` to confirm which `_m` form is expected. +- `cchs2021_p`, `cchs2022_p`, `cchs2023_p` — **invalid PUMF databases**. The 2021 CCHS was not released as a standalone PUMF — it was combined with 2022 data into a 2021-2022 PUMF (not yet in cchsflow). The 2022+ smoking variables were restructured into CSS/SPU modules; no standalone PUMF equivalent exists for variables like SMK_09A in those cycles. Remove these from `databaseStart` for PUMF-only or mixed blocks when encountered in reviewed variables. +- `[[VAR]]` — double brackets (invalid notation) +- `[VAR1, VAR2]` without `DerivedVar::` prefix — ambiguous multi-variable input + +**Pre-existing typo propagation:** Typo patterns often exist in the target branch for other variables and get copied into new variables through copy-paste. For each typo found, check whether the same pattern exists on the target branch for the same variables — if not, it was introduced by this PR even if the pattern exists elsewhere. + +## Check 5b: dummyVariable naming conventions + +Verify that `dummyVariable` values follow the naming convention below. (Note: `inst/metadata/documentation/metadata_registry.yaml` is referenced as the authoritative source for these patterns but does not yet exist — this skill section is the current reference.) 
+ +**Categorical variables** — regex: `^[a-zA-Z0-9_]+_cat[0-9]+(_[0-9]+|_NA[a-z])$` + +| Row type | Pattern | Example | +|----------|---------|---------| +| Valid category | `{variable}_cat{N}_{recEnd}` | `SMK_204_cat4_1`, `FVC_1A_cat5_3` | +| Missing (not applicable) | `{variable}_cat{N}_NAa` | `SMK_204_cat4_NAa` | +| Missing (don't know/refusal) | `{variable}_cat{N}_NAb` | `SMK_204_cat4_NAb` | + +**Continuous variables and Func rows** use `N/A` (no naming convention). + +**Key rules:** +1. **No colons in dummy names** — use `_NAa` and `_NAb`, not `_NA::a` or `_NA::b`. Colons are invalid in identifiers. +2. **Suffix must match recEnd** — the number after the last underscore should equal the `recEnd` value for that row. A mismatch (e.g., `_cat5_2` with `recEnd=1`) indicates a copy-paste error. +3. **N must match numValidCat** — the number after `_cat` should equal the `numValidCat` value for valid categories of that variable. +4. **Func rows use `N/A`** — derived variable rows (where `recEnd` starts with `Func::`) use `dummyVariable=N/A`. + +**What to flag:** +- `_NA::a` or `_NA::b` patterns (should be `_NAa` / `_NAb`) +- Suffix-recEnd mismatches (e.g., `_cat5_2` on a row with `recEnd=1`) +- Func rows with constructed dummy names instead of `N/A` +- Continuous rows with anything other than `N/A` + +**DerivedVar block recEnd values:** In `DerivedVar` blocks, `recEnd` documents the *output category codes* produced by the R function, not recode targets. For categorical DVs these will be integers (1, 2, 3, …); for continuous DVs they will be midpoints or numeric outputs. Do **not** flag integer `recEnd` values in a `DerivedVar` block as inconsistent with midpoint values in a sibling direct recode block — the two block types serve different purposes. See `docs/variable-naming-conventions.md` for full explanation. + +## Check 5c: Swapped recEnd values + +Check for rows where `recEnd` values appear to be swapped between adjacent rows. 
This is a **P0 data bug** — it produces incorrect values at runtime with no warning. + +**Detection pattern:** +1. For each variable, examine rows where `recStart` is a valid data range (e.g., `[1,120]`) and adjacent rows where `recStart` is a not-applicable code (e.g., `996`) +2. The valid data range should map to `recEnd=copy` (or the appropriate output value), not to `NA::a` +3. A not-applicable code should map to `NA::a` or `NA::b`, not to `copy` + +**Example (FVC_6D bug found in PR #148):** +``` +# WRONG — recEnd values swapped +recStart=[1,120] recEnd=NA::a ← valid data being set to missing! +recStart=996 recEnd=copy ← not-applicable code being copied as data! + +# CORRECT +recStart=[1,120] recEnd=copy ← valid data copied through +recStart=996 recEnd=NA::a ← not-applicable code set to missing +``` + +**When to check:** Always check continuous variables with `copy` and `NA::a`/`NA::b` recEnd values. Swapped values are especially likely for variables added via copy-paste from similar variables. + +## Check 5d: Label and metadata consistency + +Scan for common metadata quality issues in modified variables: + +1. **Double spaces** — check `label`, `labelLong`, `catLabel`, `catLabelLong`, `variableStartShortLabel`, and `variableStartLabel` for consecutive spaces +2. **Spelling errors in labels** — common typos: "consumptoin" (consumption), "freqeuncy" (frequency), "repondent" (respondent) +3. **Trailing punctuation in labelLong** — trailing dashes or incomplete labels (e.g., `"Daily consumption - fruit - (D)"` should be `"Daily consumption - fruit (D)"`) +4. **Missing descriptions** — derived daily frequency variables (FVCD*) and other derived variables should have `description` fields +5. 
**catLabel propagation** — when a label is fixed in `catLabel`, check that the same fix applies to `catLabelLong`, `variableStartShortLabel`, and `variableStartLabel` where those fields share the same text + +These are P2 issues (metadata quality) but are cheap to fix during review and prevent accumulation of inconsistencies. + +## Check 5e: Opaque `_A`/`_B` variable name suffixes + +When reviewing variables that use `_A` or `_B` suffixes, flag the name as potentially opaque and prompt the reviewer to consider a more descriptive suffix — but **only when the variable is being actively modified** in the current PR or review. Do not propose drive-by renames of untouched variables. + +**Smoking variables with `_A`/`_B` suffixes:** + +| Current name | Meaning of `_A` | Meaning of `_B` | +|---|---|---| +| SMKDSTY_A / SMKDSTY_B | Pre-2015 6-category structure | 2015+ 6-category structure | +| SMKG01C_A / SMKG01C_B | Pre-2015 grouped categories | 2015+ grouped categories | +| SMKG203_A / SMKG203_B | Pre-2015 grouped categories | 2015+ grouped categories | +| SMKG207_A / SMKG207_B | Pre-2015 grouped categories | 2015+ grouped categories | + +The `_A`/`_B` convention consistently encodes era-based category structure splits, but is opaque to users who don't know the convention. Compare with self-documenting suffixes already in use: `_cat3`, `_cat5`, `_cont`. + +**When to flag:** If the PR modifies any `_A`/`_B` variable (adds cycles, changes recodes, updates metadata), include a note in the review: + +> "SMKDSTY_A uses an opaque `_A` suffix. Consider whether a more descriptive name (e.g., `SMKDSTY_cat6_pre2015`) is warranted as part of this change. A rename requires backward compatibility support (deprecated alias)." + +**Backward compatibility:** Renaming a harmonised variable name breaks existing user code that references the old name. 
Any rename must include a deprecation mechanism — either a wrapper that calls the new name with `.Deprecated()`, or dual entries in `variables.csv` during a transition period. See `docs/variable-naming-conventions.md` for the naming convention and deprecation approach. + +**Scoring:** P2 (naming quality). Do not block a PR over this — it is an improvement opportunity, not a correctness issue. + +## DV function naming convention (v3) + +New or refactored DV functions should use tidyverse-style verb-first names. The `_fun` suffix is legacy and being phased out as functions are refactored. + +| Verb | Purpose | Example | +|------|---------|---------| +| `calculate_*()` | Mathematical computation | `calculate_pct_time()`, `calculate_bmi()` | +| `categorize_*()` | Classification into groups | `categorize_pct_time()`, `categorize_bmi()` | +| `assess_*()` | Health risk evaluation | `assess_drinking_risk()` | +| `score_*()` | Scoring systems | `score_adl()` | +| `adjust_*()` | Data correction | `adjust_bmi()` | + +Legacy functions (e.g., `bmi_fun()`, `pack_years_fun()`) retain old names until refactored. Worksheets reference functions via `Func::` prefix (e.g., `Func::calculate_pct_time`). + +## Worksheet-first principle + +`variable_details.csv` `recEnd` is the **source of truth** for value mappings. A DV function (`Func::`) is only warranted when the mapping requires logic that `recStart → recEnd` rows cannot express — for example: + +- **Multi-variable computation** (e.g., `pack_years_fun()` combining smoking intensity and duration) +- **Conditional branching** across multiple input variables +- **Date arithmetic** or other transformations not expressible as row-level recodes + +Simple categorical-to-midpoint conversions belong in the worksheet as direct recode rows, **not** in R code. The reference pattern is `DHHGAGE_cont`: a continuous variable with era-specific midpoint blocks, entirely worksheet-driven with no R function. 
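
To make the distinction concrete, here is a minimal sketch of a categorical-to-midpoint conversion expressed purely as `recStart` → `recEnd` rows and applied with a generic lookup. The row values, the variable name `EXAMPLE_cont`, and the helper `apply_recode_rows()` are invented for illustration; they are not the actual `DHHGAGE_cont` rows, and this is not a description of how `rec_with_table()` is implemented.

```r
# Hypothetical worksheet rows: the midpoints live in recEnd, not in R code.
recode_rows <- data.frame(
  variable = "EXAMPLE_cont",
  recStart = c("1", "2", "3"),
  recEnd   = c("0.5", "1.5", "4"),
  stringsAsFactors = FALSE
)

# Generic lookup: maps source codes to recEnd using only the rows above.
# Source codes without a matching row become NA.
apply_recode_rows <- function(source_codes, rows) {
  idx <- match(as.character(source_codes), rows$recStart)
  as.numeric(rows$recEnd[idx])
}

source_codes <- c(1, 3, 2, 9)
recoded <- apply_recode_rows(source_codes, recode_rows)
print(recoded)
```

With this layout, changing a midpoint means editing one worksheet row; there is no function body that has to be kept in step.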
+ +**Why this matters:** When an R function hard-codes midpoints that duplicate (or should duplicate) `recEnd` values, it creates two sources of truth. If the worksheet is updated but the function is not (or vice versa), the pipeline silently produces wrong values. Eliminating redundant functions removes this class of bugs entirely. + +## Check 6: L4 — derived variable specification review + +If the in-scope variables include derived variables (functions in `R/`): + +1. **Input consistency**: Read the DV function (e.g., `calculate_pct_time()` in `R/percent-time-canada.R`) and verify that the input variable names it expects match those listed in `variable_details.csv` for the derived variable +2. **Category coverage**: Verify the function handles all category values listed in the worksheet's `recStart` column — no unhandled cases that would silently produce NA +3. **Output consistency**: Verify the function's return values match the `recEnd` values in the worksheet +3b. **No hard-coded worksheet values**: Check that the DV function does not contain literal midpoints or category values (e.g., `~ 0.5`, `~ 1.5`, `~ 4`) that duplicate or should duplicate `recEnd` values in `variable_details.csv`. If a function hard-codes values that the worksheet already expresses (or could express) as `recStart → recEnd` rows, flag as **P1** — the function should be refactored to read from the worksheet or eliminated entirely. Reference: the `DHHGAGE_cont` pattern. +4. **Output bounds validation**: For continuous DVs, check whether the function validates output range. Values outside the valid domain (e.g., percentage >100 or <0) indicate inconsistent inputs and should return `tagged_na("b")`. The valid range should be documented in the `notes` field of the Func row in variable_details (documentation only for now, ready for future validation framework). If the DV lacks bounds checking, flag as P1. +5. **Documentation**: Check roxygen docs match the actual function signature +6.
**Necessity check** (worksheet-first): Before reviewing function logic, verify that the `Func::` DerivedVar block is actually needed. Check whether the same mapping could be expressed as direct recode rows (`recStart → recEnd`). If the DerivedVar input uses the same categorical scale as an existing direct recode block and the function only maps categories to output values, the function is redundant — flag as **P1** and recommend converting to direct recode rows. See "Worksheet-first principle" above. + +## Check 7: Unit tests (L5) + +If the PR includes or modifies test files in `tests/testthat/`: +- Verify category coverage (all output categories have test cases) +- Check edge cases (missing data, boundary values) +- Verify cross-cycle consistency + +If the PR lacks tests for new derived variables, flag this. diff --git a/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md b/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md new file mode 100644 index 00000000..ecc2fad0 --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md @@ -0,0 +1,218 @@ +# L6 implementation validation + +**This is the highest-priority check.** Run `rec_with_table()` against actual PUMF data. This is not just a pass/fail test — the output is an analytical tool. By examining prevalence and distributions across cycles and categories, reviewers can identify harmonization problems that worksheet checks alone cannot catch, such as a sudden step change in prevalence at an era boundary (e.g., 2014 → 2015) that signals a naming mismatch or category recode error. + +## Multi-era recode validation + +For variables with multiple recode blocks (identified in Check 2b), standard L6 prevalence checks are insufficient — `rec_with_table()` may silently apply the wrong block or blend blocks without error. For these variables, perform era-specific output validation: + +1. 
**Identify one representative PUMF cycle per block** — e.g., for SMK_09A_cont: `cchs2001_p` (Block 1 era), `cchs2007_2008_p` (Block 3 era) +2. **Run `rec_with_table()` for each representative cycle** +3. **Verify the recEnd values match the expected midpoints for that era** — not just that they are non-missing + +For continuous variables, check a known respondent's output value against the expected midpoint for their source category. If the era boundary is at 2003 (different category boundaries in 2001 vs 2003+), a respondent with source code 3 should produce recEnd=4 in 2001 but recEnd=2.5 in 2003+. If both cycles produce the same value, the wrong block is being applied to one of them. + +Flag any era boundary where observed output values do not match expected midpoints as **P0**. + +## Scope and limitations + +**PUMF data only.** L6 can currently test only `_p` databases. The `data/` directory contains PUMF RData files (`cchs2001_p.RData` through `cchs2017_2018_p.RData`). Master (`_m`) data is in a secure environment where LLMs cannot run. + +For master-only changes (e.g., a PR that only adds `_m` cycles), L6 cannot validate at runtime. In this case: +- Rely on L3-L5 worksheet checks (especially era boundary and naming checks) +- Generate the integration test R script anyway and save it to the CEP — the user or a colleague can run it in the secure environment +- Note the limitation explicitly in the review output + +**Future:** Mock data from the `mockdata` repo will enable L6 testing for all database types. + +## Data locations + +PUMF RData files are in `data/`: +- `cchs2001_p.RData` through `cchs2017_2018_p.RData` + +Each file loads a data frame named after the cycle (e.g., `cchs2001_p`). + +## Integration test script + +Generate and run a fully executable R script for the in-scope variables — no placeholders. Extract the actual variable names and cycle list from the worksheets. Save the script to the CEP directory so reviewers can re-run it. 
+ +The script should: +1. Read `variable_details.csv` to extract the `_p` databases from `databaseStart` for each in-scope variable +2. Load cchsflow from the PR branch (use `devtools::load_all()` if R functions were modified, otherwise `library(cchsflow)`) +3. For each cycle, run `rec_with_table()` and collect results +4. Print cross-cycle prevalence summary +5. Save results CSV + +Pattern based on CEP-006: + +```r +# devtools::load_all() # Use if PR modifies R/ functions +library(cchsflow) +library(dplyr) + +# Load worksheet from the branch under review +variable_details <- read.csv("inst/extdata/variable_details.csv", + stringsAsFactors = FALSE) + +# Extract PUMF cycles from databaseStart for the in-scope variables +# (agent: replace with actual variable names and cycles from the worksheet) +variables_to_test <- c("FVCDFRU", "FVCDSAL", "FVCDPOT") +cycles <- c("cchs2001_p", "cchs2003_p", "cchs2005_p", + "cchs2007_2008_p", "cchs2009_2010_p", "cchs2011_2012_p", + "cchs2013_2014_p", "cchs2015_2016_p", "cchs2017_2018_p") + +results <- data.frame() + +for (cycle in cycles) { + rdata_file <- file.path("data", paste0(cycle, ".RData")) + if (!file.exists(rdata_file)) { + cat("SKIP", cycle, "- file not found\n") + next + } + + load(rdata_file) + df <- get(cycle) + + result <- tryCatch({ + rec_with_table( + data = df, + variables = variables_to_test, + database_name = cycle, + variable_details = variable_details, + log = FALSE + ) + }, error = function(e) { + cat("ERROR in", cycle, ":", e$message, "\n") + NULL + }) + + if (!is.null(result)) { + n <- nrow(result) + for (v in setdiff(names(result), "ADM_RNO")) { + valid <- sum(!is.na(result[[v]])) + cat(cycle, v, ": valid =", valid, "/", n, + "(", round(100 * valid / n, 1), "%)\n") + + # Category distribution (for categorical variables) + freq <- table(result[[v]], useNA = "ifany") + print(freq) + + results <- rbind(results, data.frame( + cycle = cycle, variable = v, + n = n, valid = valid, + valid_pct = round(100 * valid / 
n, 1), + stringsAsFactors = FALSE + )) + } + } + + rm(list = cycle) # free memory +} + +# Cross-cycle prevalence summary +cat("\n=== CROSS-CYCLE SUMMARY ===\n") +for (v in unique(results$variable)) { + cat("\n", v, ":\n") + sub <- results[results$variable == v, ] + print(sub[, c("cycle", "n", "valid", "valid_pct")], row.names = FALSE) +} + +# Save results +write.csv(results, "ceps/cep-NNN-domain/vars-pumf-integration-test.csv", + row.names = FALSE) +``` + +## Cross-cycle prevalence QMD + +After generating the integration test CSV, create a Quarto document (`.qmd`) that visualises the cross-cycle results. This is a standard CEP artifact — visual inspection of prevalence trends is the most effective way to detect era boundary problems. + +The QMD should include: +1. **Cross-cycle valid % line plot** for each key variable (or a representative subset), with cycles on the x-axis and valid % on the y-axis. Add vertical reference lines at era boundaries (2007, 2015). +2. **Category distribution plot** for categorical derived variables (e.g., stacked bar chart of diet_score_cat3 across cycles). +3. **Annotations** for known data patterns — e.g., optional content cycles where low prevalence is expected, documented in the R function's roxygen or CCHS documentation. +4. **Brief narrative** interpreting the plots: are transitions clean? Any unexpected step changes? + +Use base R graphics (`plot()`, `barplot()`) to avoid extra dependencies. The QMD should be self-contained — load the results CSV, not rerun the integration test. 
+ +Pattern: + +```yaml +--- +title: "CEP-NNN: Cross-cycle prevalence" +format: + html: + toc: true + code-fold: true +--- +``` + +```r +results <- read.csv("domain-pumf-integration-test.csv") + +# Extract year from cycle name for x-axis +results$year <- as.numeric(gsub("cchs(\\d{4}).*", "\\1", results$cycle)) + +# Plot valid % by cycle for a key variable +var_data <- results[results$variable == "KEY_VAR", ] +plot(var_data$year, var_data$valid_pct, type = "b", pch = 19, + xlab = "CCHS cycle", ylab = "Valid %", + main = "KEY_VAR: cross-cycle prevalence") +abline(v = c(2007, 2015), lty = 2, col = "grey50") +``` + +Save the QMD to the CEP directory alongside the other artifacts: + +``` +ceps/cep-NNN-/ + cep-NNN-.qmd # Cross-cycle prevalence plots + PR--review-summary.md + integration-test-.R + -pumf-integration-test.csv +``` + +## Cross-cycle prevalence analysis + +The cross-cycle summary is the most important output. Review the `valid_pct` column for each variable across cycles and look for: + +1. **Step changes at era boundaries** — a sudden jump or drop in prevalence between 2005 → 2007 (pre-2007 to standard era) or 2014 → 2015 (standard to post-2014 era) suggests a naming mismatch or incorrect `[VAR]` default +2. **Unexpected zeros** — a cycle showing 0% valid when the variable should be available indicates a wrong source variable name or missing `db::VAR` mapping +3. **Exposure distribution shifts** — the key harmonization question is whether typical exposures remain stable across cycles. For continuous variables (e.g., daily fruit/veg consumption), check whether the proportion at clinically meaningful thresholds (e.g., 0 servings, >5 servings/day) shifts at era boundaries. For categorical variables, compare `table()` output across cycles. A sudden distribution change at 2015 that doesn't track the gradual secular trend suggests a mapping or recoding error, not a real population change. +4. 
**Derived variable completeness** — if a derived variable has lower valid % than its inputs, the DV function may be dropping valid cases + +**Optional content cycles:** Some CCHS modules are optional content in certain cycles — provinces opt in, so prevalence drops sharply. Before flagging low prevalence as an issue, check the R function's roxygen documentation and CCHS documentation for known optional content cycles. For example, FVC (fruit and vegetable consumption) was optional in 2005 and 2017-2018, producing ~56% and ~1% valid respectively — these are expected, not errors. + +Cross-cycle trends require human judgement. The skill should produce a clear summary table and flag any obvious discontinuities, but the reviewer interprets the results using their domain knowledge. In future, threshold-based alerts may be added. + +Example of a step change indicating a problem: +``` + cycle valid_pct + cchs2009_2010_p 34.1 <- normal + cchs2011_2012_p 14.7 <- lower (optional content) + cchs2013_2014_p 28.9 <- normal + cchs2015_2016_p 0.0 <- PROBLEM: variable renamed but mapping missing + cchs2017_2018_p 0.0 <- same problem +``` + +## Derived variable testing + +If the in-scope variables include derived variables (functions in `R/`): + +1. Identify the DV function (e.g., `diet_score_fun()` in `R/diet.R`) +2. Check that all input variables are available in the test cycles +3. Run `rec_with_table()` with the derived variable to verify the full pipeline +4. Compare the derived variable's valid % against its input variables — the DV should not have materially higher valid % than its least-available input +5. For categorical derived variables and key continuous inputs, examine the **exposure distribution** across cycles — not just valid counts. The central harmonization question is whether typical exposures (e.g., proportion with 0 fruit/veg, or >5 servings/day) remain stable across cycles. 
A sudden shift in the distribution at an era boundary signals a recoding or mapping error even when valid % is unchanged. Include these distributions in both the integration test output and the QMD visualisation + +## What to report from L6 + +For each cycle tested: +- **N**: Total respondents +- **Valid count and %**: Non-NA values for each variable +- **Category distribution**: `table()` output for categorical variables +- **Errors**: Any `rec_with_table()` failures with error messages + +Flag: +- **Step changes at era boundaries** (most important — signals naming/mapping errors) +- Cycles where valid % is 0 (variable may not exist despite being listed) +- Cycles where category distributions shift unexpectedly +- Derived variable failures or unexplained completeness gaps diff --git a/.claude/skills/cchsflow-review/docs/review/gemini-gem-system-prompt.md b/.claude/skills/cchsflow-review/docs/review/gemini-gem-system-prompt.md new file mode 100644 index 00000000..fb4bf00d --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/review/gemini-gem-system-prompt.md @@ -0,0 +1,104 @@ +# cchsflow worksheet reviewer — Gem system prompt + +## Persona + +You are a specialist reviewer for the cchsflow R package, which harmonises Canadian Community Health Survey (CCHS) variables across survey cycles (2001-2023). You have deep expertise in StatCan survey documentation and the cchsflow worksheet format. + +## Task + +Your job is to verify that worksheet mappings (variables.csv and variable_details.csv) correctly encode how StatCan source variables map to harmonised cchsflow variables. For each variable in a review extract, check: + +1. **Source variable existence**: Does the StatCan variable named in variableStart actually exist in the databases listed in databaseStart? Search the data dictionaries for that cycle. +2. **Era-specific name accuracy**: Do the `cchs{year}::{VAR}` mappings match the correct era-specific name? 
(e.g., 2001 uses SMKA prefix, 2003 uses SMKC, 2005 uses SMKE, 2007+ uses SMK_) +3. **Response category completeness**: Do the recStart values cover all response categories from the data dictionary? Are any categories missing or extra? +4. **Recode correctness**: Do recStart-to-recEnd mappings make sense? (e.g., midpoint of "5-11 years" should be 8, not 6) +5. **Database coverage**: Are there cycles where the source variable exists (per data dictionaries) but are not listed in databaseStart? Or databases listed where the variable does not exist? +6. **DerivedVar feeders**: For DerivedVar blocks, do the listed feeder variables exist in variables.csv and cover the same databases? Cross-check against derived variable specifications. +7. **Missing value handling**: Are NA::a and NA::b correctly sourced from the data dictionary's valid skip and not stated codes? Cross-check universe definitions in the questionnaire. + +## Context + +### Your knowledge base + +Your knowledge comes from two sources: + +#### Gem attachment: cchsflow worksheet reference + +The file `worksheet-reference.md` is attached to this Gem. It explains how the cchsflow project works — the worksheet format, naming conventions, recode mechanics, harmonisation patterns, and common anti-patterns. It covers both the CCHS survey context (file types, cycle naming, variable naming eras) and the cchsflow worksheet mechanics (variables.csv schema, variable_details.csv schema, block types, missing value conventions, database identifiers). + +This is your primary reference for judging whether a mapping is **well-formed** — does it follow cchsflow conventions? + +#### NotebookLM: StatCan source documents (~250 PDFs) + +The NotebookLM notebook contains StatCan's own documentation. These are the ground truth for judging whether a mapping is **correct** — does it match what StatCan published? The documents include: + +1. 
**Data dictionaries** — List every variable in a survey file with its name, label, response categories (coded values and labels), universe (who was asked), and valid skip conditions. Available for both PUMF and Master files. Use these to verify variable names, response categories, and missing value codes. + +2. **Questionnaires** — The actual survey questions, with skip patterns and flow logic. Use these to verify universe definitions (who should get NA::a valid skip) and to understand the intent behind response categories. + +3. **Derived variable specifications** — StatCan's documentation of how they compute derived variables (e.g., SMKDVSTY, SMKDGSTP) from raw survey items. These describe the input variables, decision logic, and output categories. Use these to verify DerivedVar feeder variables and recode logic. + +Documents are organised by survey cycle (e.g., "CCHS 2015-2016") and file type (PUMF or Master). When verifying a worksheet row, search for the variable name in documents matching the cycle and file type indicated by the databaseStart field. + +**Important**: Not all cycles may be represented in the notebook. If you cannot find a document covering a specific cycle, say so explicitly rather than guessing. 
+ +### Worksheet schema + +**variables.csv** — one row per harmonised variable: +- `variable`: harmonised name (e.g., SMK_005) +- `databaseStart`: comma-separated list of databases (e.g., cchs2015_2016_p, cchs2017_2018_m) +- `variableStart`: source variable mapping or DerivedVar specification + +**variable_details.csv** — multiple rows per variable (one per recode rule per database group): +- `variable`: harmonised name (must match variables.csv) +- `databaseStart`: which databases this row applies to +- `variableStart`: source variable reference (e.g., `cchs2001_p::SMKAG203`, `[SMK_005]`, or `DerivedVar::[var1, var2]`) +- `recEnd`: output value (harmonised) +- `recStart`: input value (source) +- `typeStart`/`typeEnd`: cat (categorical) or cont (continuous) + +### Database naming + +- `cchs{year}_{type}` where type is `p` (PUMF), `m` (Master), or `s` (deprecated Share) +- Dual-year: `cchs2007_2008_p` (combined cycle) +- Single-year: `cchs2021_m` (single collection year) +- `cchs2021_p` does NOT exist — 2021 was combined into a 2021-2022 PUMF +- `cchs2022_p` and `cchs2023_p` are valid standalone PUMFs + +### Block types + +Rows for the same variable with the same variableStart form a "block." Block types: +1. **Direct recode**: variableStart references source variables. recStart-to-recEnd maps source values to harmonised values. +2. **DerivedVar**: variableStart = `DerivedVar::[feeder1, feeder2]`. Uses an R function (in `recEnd` as `Func::function_name`) to compute values from other harmonised variables. +3. **Copy**: recEnd = `copy`. Pass-through of continuous values. 
+ +### Missing values + +- `NA::a` = not applicable / valid skip (respondent not in universe) +- `NA::b` = missing / don't know / refused / not stated +- Every block MUST have at least an NA::b catch-all row + +### 2022-2023 CSS/SPU restructure + +Smoking variables were restructured in 2022: +- SMK_005 (smoker type presently) was dropped; handled by SMKDVSTY derivation +- SMK_030 (ever smoked daily) was renamed to SPU_05 +- SMK_040 (age began daily) was renamed to SPU_15 +- SMK_045 (current daily cigs) was renamed to CSS_25 +- SMK_075 (former daily cigs) was renamed to SPU_20 + +## Format + +For each variable, report: +- **OK** if no issues found, with a one-line summary of what you verified +- **Issue** with: specific row reference, what is wrong, and what the source document says the correct value should be + +Group findings by variable. Use a summary table at the top listing each variable and its status (OK / N issues found). + +## Constraints + +- Do not speculate. If your notebook does not contain documentation for a specific cycle or file type, say "I do not have documentation for {database} — cannot verify." +- Do not invent response categories or variable names. Only report what you find in the loaded documents. +- Do not suggest code changes. Your role is to identify issues, not fix them. +- If you are uncertain about a finding, flag it as "Possible issue (low confidence)" rather than asserting it as fact. +- Cite the specific document when flagging an issue (e.g., "Per the 2015-2016 PUMF data dictionary, SMK_005 has categories 1, 2, 3" or "Per the 2017-2018 derived variable specifications, SMKDVSTY uses SMK_005 and SMK_030 as inputs"). 
diff --git a/.claude/skills/cchsflow-review/docs/review/notebook-coverage.md b/.claude/skills/cchsflow-review/docs/review/notebook-coverage.md new file mode 100644 index 00000000..1f586c3e --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/review/notebook-coverage.md @@ -0,0 +1,61 @@ +# NotebookLM coverage summary + +**Notebook:** CCHS cchsflow review notebook +**Manifest:** `notebook-manifest.csv` (in this directory) +**Last updated:** 2026-03-28 + +## Coverage + +| Collection | Files | Cycles | +|---|---|---| +| Master | 113 | 2001-2023 (complete) | +| PUMF | 126 | 2001-2022 (2023 missing) | + +### Master files (all cycles 2001-2023) + +Complete coverage: data dictionaries, derived variable specs, questionnaires, and user guides for every cycle. + +### PUMF files by cycle + +| Cycle | DD | DV | QU | UG | Files | +|---|---|---|---|---|---| +| 2001 | yes | - | - | - | 8 | +| 2003 | yes | yes | yes | - | 8 | +| 2005 | yes | yes | yes | yes | 10 | +| 2007-2008 | - | yes | yes | yes | 8 | +| 2009-2010 | yes | yes | - | yes | 9 | +| 2010 | yes | yes | yes | yes | 9 | +| 2011-2012 | yes | yes | yes | yes | 11 | +| 2012 | yes | yes | yes | yes | 11 | +| 2013-2014 | yes | yes | yes | yes | 11 | +| 2014 | yes | yes | yes | yes | 11 | +| 2015-2016 | yes | yes | yes | yes | 10 | +| 2017-2018 | yes | yes | yes | yes | 8 | +| 2019-2020 | yes | yes | yes | yes | 6 | +| 2022 | yes | yes | - | yes | 6 | +| 2023 | - | - | - | - | 0 | + +**DD** = data dictionary, **DV** = derived/grouped variable specs, **QU** = questionnaire, **UG** = user guide + +### Known gaps + +- **2001 PUMF**: Only data dictionary; no derived variables, questionnaire, or user guide +- **2003 PUMF**: No user guide +- **2007-2008 PUMF**: No data dictionary +- **2009-2010 PUMF**: No questionnaire +- **2022 PUMF**: No questionnaire +- **2023 PUMF**: Not in notebook (may not be released yet) + +### Impact on Gem reviews + +- **PUMF response category verification** is strong for 2005-2020 (data dictionaries 
present, except for 2007-2008) +- **PUMF grouped/derived variable verification** is strong for 2003-2022 +- **Master verification** is comprehensive across all cycles +- **2023 PUMF** cannot be verified at all — flag as "cannot verify" in reviews + +## Adding documents + +1. Obtain PDFs from the Statistics Canada CCHS documentation releases +2. Upload to the NotebookLM notebook +3. Update `notebook-manifest.csv` with filename, collection, cycle, file_size_bytes, and sha256 +4. Re-run coverage analysis to confirm the gap is filled diff --git a/.claude/skills/cchsflow-review/docs/review/notebook-manifest.csv b/.claude/skills/cchsflow-review/docs/review/notebook-manifest.csv new file mode 100644 index 00000000..4a3b32c3 --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/review/notebook-manifest.csv @@ -0,0 +1,240 @@ +filename,collection,cycle,file_size_bytes,sha256 +"cchs_2001s_dd_m_en_3_v1.pdf","master","",1403999,"eac610b85c15aa1e5278b1540b94ee926255d4d7d06c86994b7b912c0e447b93" +"cchs_2001s_ot_m_en_1_v1.pdf","master","",194588,"756c8d8c6a792214dfed265762d9f6b53e396fbfe9b2501a566e55780e85349c" +"cchs_2001s_ot_m_en_2_v1.pdf","master","",41868,"ffba113d2c2cfb135dc2a8e44175a2768512bf7d5845d2302e06aa935386e09f" +"cchs_2001s_qu_m_en_1_v1.pdf","master","",19472,"e9bb37219b5bdd4eafcc24ca85b66379aed2a528a984a50c0a37e01cee3bc8a8" +"cchs_2001s_qu_m_en_2_v1.pdf","master","",372116,"99e40288c73a298a29d560e19eaf0956fc2230ee6829c12c92d8522f784d24c5" +"cchs_2001s_ug_m_en_1_v1.pdf","master","",576234,"3e4a13116c148890fefd0c1bc6bde61612d4d40d3f71f48332a7ad11f8395178" +"cchs_2001s_ug_m_en_2_v1.pdf","master","",680672,"ec67811ffa131a6e209a3209ed52263a8e3248e497c72f1cbf4f4f38c87711c8" +"cchs_2003s_dd_m_en_3_v1.pdf","master","",2365514,"9e2fe1e26eb50c375637e61a0705e430ade5b3ee49bb823144da40f515b4b27f" +"cchs_2003s_dd_m_en_4_v1.pdf","master","",166871,"2a7687df5cf772d38503f79e04d93b3eadaf7be2053de8d374ed76e3246e817b"
+"cchs_2003s_dd_m_en_5_v1.pdf","master","",2565758,"9783484f5e93288b93324bb49e04730f5816ebb4c0de144a34270c39850ce6e0" +"cchs_2003s_dd_m_en_6_v1.pdf","master","",1237708,"2186f4f9af93f0dadf0705e8dd3052c3c432fe842b2d412e719ad446bf25f76a" +"cchs_2003s_dd_m_en_7_v1.pdf","master","",1155663,"10725359aeff818757f55530199943c6668434bdc96c412623111d21eb6ef9e9" +"cchs_2003s_dd_m_en_8_v1.pdf","master","",1316849,"86672ea5c59a1e1e50b8169a312c234bb7f2add46e14798954afd9f1bd5ff9f2" +"cchs_2003s_ot_m_en_1_v1.pdf","master","",193435,"465ddf2a3f32dc4f6b07a35323c12b60d0c48b5b7d7cf4c04c8fe326998d4cd3" +"cchs_2003s_ot_m_en_2_v1.pdf","master","",41868,"ffba113d2c2cfb135dc2a8e44175a2768512bf7d5845d2302e06aa935386e09f" +"cchs_2003s_qu_m_en_1_v1.pdf","master","",1453431,"19dc786809b3e057ce8767ab5441643b6effafdfb2450cc97dfde352518cc208" +"cchs_2003s_qu_m_en_2_v1.pdf","master","",19541,"ad2fb125e89cca9f97c737e38e0825fa22d0ac79bd8b6a6de2ce9c17b1ad69a1" +"cchs_2003s_qu_m_en_3_v1.pdf","master","",19541,"ad2fb125e89cca9f97c737e38e0825fa22d0ac79bd8b6a6de2ce9c17b1ad69a1" +"cchs_2003s_qu_m_en_4_v1.pdf","master","",1453431,"19dc786809b3e057ce8767ab5441643b6effafdfb2450cc97dfde352518cc208" +"cchs_2003s_qu_m_en_5_v1.pdf","master","",19541,"ad2fb125e89cca9f97c737e38e0825fa22d0ac79bd8b6a6de2ce9c17b1ad69a1" +"cchs_2003s_ug_m_en_1_v1.pdf","master","",642148,"8730bf6ac188ba7ed594e299a6daa9c2516337f7d8cde08be6da641525376f19" +"cchs_2003s_ug_m_en_2_v1.pdf","master","",672367,"81970885a8fd9f644f2475e54db5ff73da06f95f8364b0318fd1c790302d3db5" +"cchs_2005s_dd_m_en_1_v1.pdf","master","",2726172,"1eebd6e7e75b1e4860f46e410b20eb496441a253a30c75d890e2c1d76d89bc96" +"cchs_2005s_dd_m_en_2_v1.pdf","master","",1367564,"43e7625a1f75dbb746aaea08f69dc7a4b39b0475af359658ba7d9762392262d5" +"cchs_2005s_dd_m_en_3_v1.pdf","master","",1038019,"4014695eea73aa2cdcb601b3a95aa0693ff1340baa91ef30172819bb4027a8f4" +"cchs_2005s_dd_m_en_4_v1.pdf","master","",1591262,"3c33b946b4ef33d0843886196c638a85bd2532d7fb57ece79a3d77c8da42a2b0" 
+"cchs_2005s_qu_m_en_1_v1.pdf","master","",695683,"ae04bf8698778a53ca2a1ec80e7c075d54d114d8c8a9b706b10c190101a16e9f" +"cchs_2005s_qu_s_en_1_v1.pdf","master","",695683,"ae04bf8698778a53ca2a1ec80e7c075d54d114d8c8a9b706b10c190101a16e9f" +"cchs_2005s_ug_m_en_1_v1.pdf","master","",2053483,"22f3bf767ac46d3982a26ebe5095bf0099978bd384c58155fdfc834fab89e863" +"cchs_2005s_ug_s_en_1_v1.pdf","master","",2053483,"22f3bf767ac46d3982a26ebe5095bf0099978bd384c58155fdfc834fab89e863" +"cchs_2007d_dd_m_en_1_v1.pdf","master","",3064240,"acfc697e75f372b7361caa142570e7ad79c75aa5cc3b577207697a1c641d8694" +"cchs_2007d_dv_m_en_1_v1.pdf","master","",895734,"94ff67d33ccb9bbc8adf31d449184e8570e1f90a2e357c5180dc1753d665b1c0" +"cchs_2007d_qu_m_en_1_v1.pdf","master","",5293511,"b1d3233579d6efc64cff13329f922c288202759fa3c9bcd4a4bd29a88ff62519" +"cchs_2007d_ug_m_en_1_v1.pdf","master","",561315,"51212cfec49d9b4c9292896653bf9126f26a665aba561d865080c62bdc16eb6b" +"cchs_2009d_dd_m_en_1_v1.pdf","master","",2678562,"3f0dc666fa96ed7fc6668517bb4d7bd655b587642f1c9d9b2d3aa8c45195f3ea" +"cchs_2009d_dd_m_en_2_v1.pdf","master","",1339474,"16ba55380e5d41dc17df9e137236cd1a36ad199d4823ce15b5ec991033acabe9" +"cchs_2009d_dv_m_en_1_v1.pdf","master","",620706,"b773899e79d97918bc4f3ac9683d783625e04c5394df2bbf368d2534bcb0bffd" +"cchs_2009d_dv_m_en_2_v1.pdf","master","",897280,"7c4d48083a8a9390428befb687ee3dd707470de02302547ed25298a61339bea6" +"cchs_2009d_qu_m_en_1_v1.pdf","master","",1483040,"77d898be759b829e32e2bf78caca65d53456816382f84d82dcedc9ed074e096d" +"cchs_2009d_ug_m_en_1_v1.pdf","master","",822144,"75f9eecc232a6d67833b2007b7b09025b3576c6adbbe3d148a23ad901ac26b58" +"cchs_2010d_dd_m_en_1_v1.pdf","master","",3552708,"480896e925753e95cdc7cba14e66eadcc138a10d5ab1011fba50473747460a6d" +"cchs_2010d_dv_m_en_1_v1.pdf","master","",1334091,"587fe078be440fc96089f3af6e7e0aac8a255dc0ebe38f5664ba5e607c2c96ce" +"cchs_2010d_qu_m_en_1_v1.pdf","master","",1851816,"15352a6551d6f306a354e58f974ba3b2c06725ae401147f3278625784f002861" 
+"cchs_2010d_ug_m_en_1_v1.pdf","master","",2503012,"981c4d7a49d16a9c0f2a3638622221e8a963e1aafcf4fcc5334b1dad6c352395" +"cchs_2011s_dd_m_en_1_v1.pdf","master","",3106168,"f101c452036758a64bd782169dd822109405b368ef0074ff32d03e7a609ddd0d" +"cchs_2011s_dd_m_en_2_v1.pdf","master","",1567695,"fb1e94db47a1726138a0cf66278e9b96282f9505c9a7d13de46c1e07a3903f7c" +"cchs_2011s_dv_m_en_1_v1.pdf","master","",1253494,"10c3f48c9f6a1eac24d9164eb3a87d0158a88680319bf45888bd059d2838820e" +"cchs_2011s_qu_m_en_1_v1.pdf","master","",2156200,"2098f11ee2d1f5533463736f41cdf8ef2b4a46c221fa083bffd3276e7778a46e" +"cchs_2011s_ug_m_en_1_v1.pdf","master","",2031363,"4e412352c8c99ca94901cb6f219ff933499c6ea6c6ce54b2458bca49402768ad" +"cchs_2012s_dd_m_en_1_v1.pdf","master","",3259620,"4bb83ffc8c69fbb143952cda6713d9e681ede6b231cdd0e9ce9242a61e2c1c0c" +"cchs_2012s_dv_m_en_1_v1.pdf","master","",1094759,"16aa5ab9989a4f1b33223327ecec7f46f4a77cd67213e51b46f300b6f24193d6" +"cchs_2012s_qu_m_en_1_v1.pdf","master","",8526154,"bab6adb274d91397eed85f5a6adf82237175b5f0604a326ead23b835b907a9a4" +"cchs_2012s_ug_m_en_1_v1.pdf","master","",2558957,"8b60babed8079819fc90b4fb1e5cfa3a152416c1cf909692d6c443416b488c57" +"cchs_2013d_dd_m_en_1_v1.pdf","master","",3016966,"a946e45aa0eecb7215a17293da870deec872d834ccd03e9e52c10c7ca7a3ab58" +"cchs_2013d_dd_m_en_2_v1.pdf","master","",1591060,"27187a7ac6d92890c5eb225d94b3e051c2513fd511d363219a3dc2ae00ed0521" +"cchs_2013d_dv_m_en_1_v1.pdf","master","",1087508,"c4822e5d7cdafbc49d586bf9b1ff0003dc9db39249bbb001fbb4f10922da5c06" +"cchs_2013d_dv_m_en_2_v1.pdf","master","",774413,"007d1edface33264557cf73f491653673a8480122516cb0f4ddc233e70c9ad93" +"cchs_2013d_qu_m_en_1_v1.pdf","master","",3161198,"755e3de2290f987140e346537dfc2eb81fc08fa9976c9fcafe929d07ed03c53d" +"cchs_2013d_ug_m_en_1_v1.pdf","master","",1885903,"23f66eeda7f5cd4049e917fdaf76d1008e6cd8de3c0573cda8623a31d6216cfd" 
+"cchs_2014d_dd_m_en_1_v1.pdf","master","",3074937,"25acb9898b40690b40bdd4f4177768f6749de44ad1ac3eb0921d32f50dd8c75f" +"cchs_2014d_dv_m_en_1_v1.pdf","master","",1112667,"bbebc6a9248970e7ac351c99476fe5725911a3e951633f9d3c278a13e67ee666" +"cchs_2014d_qu_m_en_1_v1.pdf","master","",2039916,"fe47e18559cafa81e4bb2893c78f4ab27b70fec45907d95cc8b774ae359babfc" +"cchs_2014d_ug_m_en_1_v1.pdf","master","",1767439,"f5a92ea1525bafc5227b021a5bbc463367bcffbb3248c9f813164f2359fea02f" +"cchs_2015s_dd_m_en_1_v1.pdf","master","",4475675,"680e4beb8e2d22a0e38e61cfab553db96856b8dfc49b29c14799bd46347af849" +"cchs_2015s_dd_s_en_1_v1.pdf","master","",4474457,"be58a03f57f1f502ee81ae07617064105358caee757dcc016ad98e370d03ddec" +"cchs_2015s_dv_m_en_1_v1.pdf","master","",1323378,"cc4472a167950070be43a87609455b1a6dfb49464a1b2f0c9eb31a9d91da6b01" +"cchs_2015s_dv_s_en_1_v1.pdf","master","",1305791,"836eb75b1c7ad3b8fe7e76326046c21fd1b10ae0ad818f1edf4e553f38727f12" +"cchs_2015s_qu_m_en_1_v1.pdf","master","",2845787,"d8cc30ebd783db26160088ad6bb103a5f54a7716edd9bf8f1671f24449f08f18" +"cchs_2015s_qu_s_en_1_v1.pdf","master","",2845787,"d8cc30ebd783db26160088ad6bb103a5f54a7716edd9bf8f1671f24449f08f18" +"cchs_2015s_ug_m_en_1_v1.pdf","master","",1088878,"02d750d687ea89cafbf3e20e70a477c711764afd02a062448e814dd03edb5f06" +"cchs_2015s_ug_s_en_1_v1.pdf","master","",1088878,"02d750d687ea89cafbf3e20e70a477c711764afd02a062448e814dd03edb5f06" +"cchs_2016s_dd_m_en_1_v1.pdf","master","",4383548,"6d2a525a30fb6ef08895fa823d65f61c9d6634cc1ee539eeb52b03203eb5960b" +"cchs_2016s_dd_s_en_1_v1.pdf","master","",4302395,"e20d94354b32543872e6842f0b7be4ccfe830c705dd6382b4a0edb5c34d18bdf" +"cchs_2016s_dv_m_en_1_v1.pdf","master","",1158857,"6c48396d4abe6c519e9818211ee6e18c8a56c3750d56bc9e1c0688705d71533c" +"cchs_2016s_dv_s_en_1_v1.pdf","master","",1130753,"7db7b56384cb4dbadb152dfbbcda15fa5c9fcf04c5309588b795442c27bf00cd" 
+"cchs_2016s_ot_m_en_1_v1.pdf","master","",1404162,"75c6e6e7ea006611b4f6d6fb6144dc044f7776d0b61cba8ce62d0e91c9c99e25" +"cchs_2016s_ot_m_en_2_v1.pdf","master","",1287280,"a21f952a158838d9a849dca4ff66ff9b666b02697a5d758825b90e2c1ce0bc78" +"cchs_2016s_qu_m_en_3_v1.pdf","master","",4204957,"f8792239b8da16d6d5fee892cab9757126c811768114c80e299a5dd8b1a721bb" +"cchs_2016s_qu_s_en_1_v1.pdf","master","",4204957,"f8792239b8da16d6d5fee892cab9757126c811768114c80e299a5dd8b1a721bb" +"cchs_2016s_ug_m_en_1_v1.pdf","master","",2082462,"3f1ff244d316e50b18b9beaba3d09eea31963d3126fd349e5e15a982919390d0" +"cchs_2016s_ug_s_en_1_v1.pdf","master","",2082462,"3f1ff244d316e50b18b9beaba3d09eea31963d3126fd349e5e15a982919390d0" +"cchs_2017s_dd_m_en_1_v1.pdf","master","",3798168,"b3da51a59a076317b1b212b9a32de12d3e145c39efe9c8d53e933a41db585da5" +"cchs_2017s_dv_m_en_1_v1.pdf","master","",1091219,"375338f193f2328beb990bffd5371d0c53a479d773e12d0e5a88f07f32220365" +"cchs_2017s_qu_m_en_1_v1.pdf","master","",1275607,"fcbde4486b960a5bdb0544e2ba863c1c78794f0a38fca5cfb91ae9d40f83ba30" +"cchs_2017s_ug_m_en_1_v1.pdf","master","",1833902,"a5d65b5176715f2fbeafed168a05f9e497ae55368c0949ced43121623a715897" +"cchs_2018s_dd_m_en_1_v1.pdf","master","",3876881,"be907e7b83ea9111698f9557a847df28f74bc3a66d3e039aef866c7de8dcd40c" +"cchs_2018s_dv_m_en_1_v1.pdf","master","",1075657,"c0da71cf78e9af68f0bb656023c27c66735ba0077605d2a0a20481af6c696e74" +"cchs_2018s_qu_m_en_1_v1.pdf","master","",1577431,"d12ed9a860911d6d151a596ec748e61eba8cbf9cab8139de2e5bdc9682305fd3" +"cchs_2018s_ug_m_en_1_v1.pdf","master","",2272942,"51dc33c5b6ab93755288df38831b11dbf75d70e99952bdea86066c49161a5fac" +"cchs_2019s_dd_m_en_1_v1.pdf","master","",3735429,"5ec9263dcdfa95956cda9f73a9119cde731e3355e3d9813da6a6c048c907442e" +"cchs_2019s_dv_m_en_1_v1.pdf","master","",1112398,"5da0e180bf826e204da790842bf19af59418482bbd982c49fc9821ff6c159a3c" 
+"cchs_2019s_qu_m_en_1_v1.pdf","master","",4451790,"e6af4c7da8adc9db7ff8cc753f677ec6af4984eb777a0728be4bd76d4dd288be" +"cchs_2019s_ug_m_en_1_v1.pdf","master","",2393204,"6522ab794796ff9b0cb993208a7f852392ad6c27b7d2a8ac8d877f01854c6173" +"cchs_2020s_dd_m_en_1_v1.pdf","master","",3802143,"7dbfecbefbc301b690ea2d218645654fb32c329166fa8eb286ae2fe0cf10c791" +"cchs_2020s_dv_m_en_1_v1.pdf","master","",1112421,"977d13fe9ba017b1412fbfa3f7d956936baee546c01ef2f67bcf6623602b9c56" +"cchs_2020s_qu_m_en_1_v1.pdf","master","",9924770,"87823201025f1bcd853f1c3cd14ab3a02a8df0124e46468b53af84ecb3cbe317" +"cchs_2020s_ug_m_en_1_v1.pdf","master","",1788936,"fcb8e1a06bf66faa2ce45e14cbfba14034c880942f0906dd96f0c022bfc56ac5" +"cchs_2021s_dd_m_en_1_v1.pdf","master","",2778428,"767be1d350b7d1e289951b9a46436a07cf9f38ac7c07cb38ffe1db6f84bfaf44" +"cchs_2021s_dv_m_en_1_v1.pdf","master","",1163780,"d5532f989d276d81f68059f0fcc4d027547e58503ef7ceb9ea070880d085c3aa" +"cchs_2021s_dv_s_en_1_v1.pdf","master","",1163780,"d5532f989d276d81f68059f0fcc4d027547e58503ef7ceb9ea070880d085c3aa" +"cchs_2021s_qu_m_en_1_v1.pdf","master","",6447368,"25acfa39e4a510789b55654b66b1e17b126719fcc97821abe2e0c155abda8722" +"cchs_2021s_qu_s_en_1_v1.pdf","master","",6447368,"25acfa39e4a510789b55654b66b1e17b126719fcc97821abe2e0c155abda8722" +"cchs_2021s_ug_m_en_1_v1.pdf","master","",1010430,"d59bc8af692fd586afd6f20b36e3d4a34a380b4de7198d308079fc7bb9ee91c1" +"cchs_2021s_ug_s_en_1_v1.pdf","master","",1010430,"d59bc8af692fd586afd6f20b36e3d4a34a380b4de7198d308079fc7bb9ee91c1" +"cchs_2022s_dd_m_en_1_v1.pdf","master","",2509219,"5441e42959cfb5850e9548238314fca870bd44ca00e163bb240219875f38c65d" +"cchs_2022s_dv_m_en_1_v1.pdf","master","",605683,"92784fb731f735cfd3b4e187dff39fd2e5a1d5f6681be2e30931f4f498ba07a8" +"cchs_2022s_qu_m_en_1_v1.pdf","master","",2638895,"56304c9ee67552aaea237558f8ecfc173f431953c39b786cad265d1c2550978c" 
+"cchs_2022s_ug_m_en_1_v1.pdf","master","",1100557,"4c82864bd060e31dcb6ef163871e85c70a3fa4e30abe4f8bde8f24f6b3e60848" +"cchs_2023s_dd_m_en_1_v1.pdf","master","",2650628,"bc4718b41ce2f96f456bf7dd5c418e0bb794c691582c99e0143e1de1daaeed98" +"cchs_2023s_dd_m_en_2_v1.pdf","master","",2656946,"cff50b892d85625dae7257e95e10068db673cb46df9fc574486cede38b8254b0" +"cchs_2023s_dv_m_en_1_v1.pdf","master","",1708761,"3a442a84064a8f9ae3dbaf8e125168f56c3269b2179ea61ebc11a241b2c9b9a5" +"cchs_2023s_qu_m_en_1_v1.pdf","master","",1472820,"7c6f27087016fbee45850b59cc5757ebe5ce474e06eb2d42ba6b655df0289668" +"cchs_2023s_ug_m_en_1_v1.pdf","master","",2399863,"db3596d97fab8b22ca8c8c3c5471b1de08310e9fbc8de1f0b27170bfe6fb6701" +"ALPHA.pdf","pumf","2001",109309,"e8f313cd15ad7dc2cff13fc3e39e861ec858411c2e9d555ea335a2208e62d708" +"LAYOUT.pdf","pumf","2001",114559,"2a37beef76bbeb6409ab527928dffb05e557ec0fbf931e16ace02e1b8d29efed" +"QUALITY.pdf","pumf","2001",46297,"6be040c5c9dfb5442bd2593a85d7f69a061d6abe9c6cf61b2ca0da9ff6ccf64c" +"QUESTE.pdf","pumf","2001",372117,"9f344ba5728d4d07d82c37c653319d0632f63553156728e3d9e4c9c7bbac537d" +"SUMMARYE.pdf","pumf","2001",1385822,"43e92f2287684cf886a9aeb7423b0d447bac77b1155d9889759f5951a53406c1" +"TABLES_E.pdf","pumf","2001",383292,"8be2aed5e7a560130c3a38081357108a4fc22a2da049a4591e30a5143b008c1e" +"TOPIC.pdf","pumf","2001",121345,"78dda0e9274710aabd85e929474cbbe1f1a609f8d4076f2957ff709e7cd07b56" +"cchs_cycle1-1cbkv2.pdf","pumf","2001",781862,"212581177024cc045d1e8851145bb8d7da08b594ed2d3a3df26e887e5219ed2c" +"cvtab_e.pdf","pumf","2003",489074,"31e55aae6a27bdaabf973959a17596ae08f81d39a75fe7bb07870ca12fff5904" +"derive_e.pdf","pumf","2003",1105927,"02887f1d2f9aa7e808ea8d6329c93573289122a85dd4b93d4a300b41ed697e9c" +"dict_e.pdf","pumf","2003",1385750,"463b42e29d34dee93942a2ccb412866200d5ee4d99d34d7b94138f685e76be85" +"guide_e.pdf","pumf","2003",695216,"f05357973fac33420f7522c9e76e5dd38681e085b74bc1ce2b72a98499145a4f" 
+"index-a_e.pdf","pumf","2003",171171,"3fa6fb391e44792423d7508119d6625c89a94e043a9afdf94d358b640cdcb4c9" +"layout_e.pdf","pumf","2003",167741,"bad931ffad3354c3cb60ee810041d489edd14f4bea92e22bdc3a4be99f28dc2b" +"qualit_e.pdf","pumf","2003",11198,"21a22f073b79a8b9ea30b9f6f280570e12a2cbff8059130ff9e36f564b666616" +"quest_e.pdf","pumf","2003",528453,"aaf9be1b5897ae86d57fad2d034f45e7ee409a70bb0cacc48da871e0d1c58af8" +"Alphabetic Index.pdf","pumf","2005",145248,"1dfcb8d248bca95d7edea87f72dd05a37064d332a465724acd4734b443091b01" +"Approximate Sampling Variability Tables.pdf","pumf","2005",437056,"a28dd82c8ba09eec3268188279730a0bddadda27c269bcb16a3613eb41d6268e" +"Cross Sectional Samples.pdf","pumf","2005",128299,"44d70d5e16966e83e283e0b6e3bac7bf9300b8c5dd35fbfcf4d4610bee456077" +"Data Dictionary.pdf","pumf","2005",1566427,"1c38f118033a0f19274f0f2e01ed463fd1426047d2683ba8bf02eb8cafd68610" +"Derived Variable and Grouped Variable Specufucations.pdf","pumf","2005",972041,"839dd8ce1f1fdc6a16a33e7856595b5de0a8799b2142b24cd5f58c81b6980b91" +"Quality Assurance Guidelines.pdf","pumf","2005",26818,"5deebbb74efd65b2382dbc03ed3f21cb0c0be202c1bd4f99e347233ff453628c" +"Questionnaire.pdf","pumf","2005",1411933,"2ff50f2f8d0ef98fc9729bcc86014ce78f5fd3c53201b70312ee8ff80a16654a" +"Record_Layout.pdf","pumf","2005",126883,"1165a53dbb75b6e6a1cb6bd7a3c301e83de777a00f5db607bc3590f4fcd9193f" +"Topical Index.pdf","pumf","2005",132098,"c2c81010874428c65cc1ea41b8e17c438068919097c375e69ac4ff95c3757d90" +"User Guide.pdf","pumf","2005",2050514,"eb375e86b2d7edbc14539f461e6877590d069622d8ccb0bcb5897a5407ef2240" +"Approximate Sampling Variability Tables.pdf","pumf","2007-2008",501191,"c361a2ce64cb404af340fc54df7c7f43c97289fa38570e15712dec2fae2d4381" +"Availability of Optional Content.pdf","pumf","2007-2008",170301,"eba3bc6a6a55d29cd88cb26a4811c9747f9034df9843a26c441ca9fdee64359c" +"Derived Variable 
Specifications.pdf","pumf","2007-2008",895734,"94ff67d33ccb9bbc8adf31d449184e8570e1f90a2e357c5180dc1753d665b1c0" +"Derived Variables.pdf","pumf","2007-2008",3222273,"ae6bf552d73f906544a46090c1ac9edfc70ac84d8cf1ddf0c8cd59cc306305f5" +"Questionnaire.pdf","pumf","2007-2008",10461610,"2ab0bafe6f251350d63ae522a3f00c5f679dcb476a73ce38312c2fb4a552e7fc" +"Record Layout.pdf","pumf","2007-2008",192085,"7375c8c9dd3b9daee49030c79e28bacb8cb01ae28f5280f98dd4d90c91e73539" +"User Guide.pdf","pumf","2007-2008",561315,"51212cfec49d9b4c9292896653bf9126f26a665aba561d865080c62bdc16eb6b" +"Weights Methodology.pdf","pumf","2007-2008",136116,"4cd99cf385d2e8a8a16867ded7b55b3b083aa2d6f51ee2603b21e86f268ff98b" +"cchs-escc2009-2010_alpha_index-eng.pdf","pumf","2009-2010",188114,"c7f5e023fc598feef4373e6988df59c660eb02b336c4bcfa736e7b7c9dbbe133" +"cchs-escc2009-2010cbk-eng.pdf","pumf","2009-2010",2315814,"3b0b08ea08b018d6ae3f7cf4f65b3e1ca2b982eaf8371a4abde8a6c7c11a2ebe" +"cchs-escc2009-2010derived-var-eng.pdf","pumf","2009-2010",1047088,"aeaa087420ec3d61e45c826b05491a54d1b5f2e42953249c0cb7dca0c4fa39eb" +"cchs-escc2009-2010rcl-eng.pdf","pumf","2009-2010",188952,"61dad16d402399febc0444afdbe6a993ff2289f622bacfcc36efb3de78c36609" +"cchs-escc2009-2010topical_index-eng.pdf","pumf","2009-2010",185958,"d474851d4dec2dee77fa30a53d104741a62dea3f5ded3d7495ee750059868042" +"cchs-escc2009-2010vt-eng.pdf","pumf","2009-2010",822690,"e4d6a6cc44c2b9658f8bbfcf12c7ebdba6e86d9a5862459fdd95e69108f8daa3" +"cchs-escc2010_2009-2010gid-complement-eng.pdf","pumf","2009-2010",232545,"54f21e8a5a7083e4ad00068faf2512c4db3b61ff62c7df6ba0e6dd272a4b8e03" +"cchs-escc2010_2009-2010gid-eng.pdf","pumf","2009-2010",2503012,"981c4d7a49d16a9c0f2a3638622221e8a963e1aafcf4fcc5334b1dad6c352395" +"cchs-escc2010_2009-2010qualit-eng.pdf","pumf","2009-2010",21393,"af507852790d5cef9c45b4c749a69f7da12ae2314ff0aee43b25c64e9c630031" 
+"cchs-escc2010_2009-2010gid-complement-eng.pdf","pumf","2010",232545,"54f21e8a5a7083e4ad00068faf2512c4db3b61ff62c7df6ba0e6dd272a4b8e03" +"cchs-escc2010_2009-2010gid-eng.pdf","pumf","2010",2503012,"981c4d7a49d16a9c0f2a3638622221e8a963e1aafcf4fcc5334b1dad6c352395" +"cchs-escc2010_alpha_index-eng.pdf","pumf","2010",196674,"7c80ccaecc050eb2c046f7473d880a6b91304b5220265e3566c5191996ba2afd" +"cchs-escc2010cbk-eng.pdf","pumf","2010",2414753,"7e7a57423ea5e91fdd00d4869a13d00e74f64432c5bd242caee4a4e361c4b5b8" +"cchs-escc2010derived-var-eng.pdf","pumf","2010",1091300,"ab59a15f0745ee70cbcdb1b76ae06c323a74df94a22be935561fb90099d5fe49" +"cchs-escc2010que-eng.pdf","pumf","2010",1851816,"15352a6551d6f306a354e58f974ba3b2c06725ae401147f3278625784f002861" +"cchs-escc2010rcl-eng.pdf","pumf","2010",197440,"8c5744e69068c634869463cf68bea63af3a855061bfad77962957431ebe55f1c" +"cchs-escc2010topical_index-eng.pdf","pumf","2010",193687,"95c0192f194b3efaf856508d88aa8b05cfe727604cf1168a56012e0298a44ebe" +"cchs-escc2010vt-eng.pdf","pumf","2010",834308,"36c5d1af81bc2df891cd2bd73781fe0e42038d1c163a3fc0f477799148aecb34" +"CCHS_2011-2012_Alpha_Index.pdf","pumf","2011-2012",183863,"c4cf3628309f1d57f12272a3baea9471bddcd1ed72a5c07475a10cee564fcbd9" +"CCHS_2011-2012_CV_Tables.pdf","pumf","2011-2012",1239429,"86cf008c09bed8136115755b8ba4d6999a7600107594ab8cbf515866c5c76351" +"CCHS_2011-2012_DataDictionary_Freqs.pdf","pumf","2011-2012",2149959,"98dd6d54e050e7b6766b6b827bbc5e6739114da74e3b327e2c4d39c15d869f06" +"CCHS_2011-2012_Derived_Variables.pdf","pumf","2011-2012",976649,"b3186784efff42253ffc282b7524fb84bcd572f05873933c435647e9e4c2f6ba" +"CCHS_2011-2012_Record_Layout.pdf","pumf","2011-2012",185517,"22a8ceafa094051043274a00622cc27d3595ff2b32fd9f9b061f748bb77b22fa" +"CCHS_2011-2012_Topical_Index.pdf","pumf","2011-2012",181737,"43e4a8a1aa189f865aab4101ee6a923eb445b9f0a874194bc8f71b37f16b9352" 
+"Microdata_license_agreement.pdf","pumf","2011-2012",111642,"daf3df534bb21243fed5dda4387268ac9e8f27890345557fabbb0e98d4719619" +"QUALIT_E.pdf","pumf","2011-2012",95471,"b9707488dac432d30aaf5d13240005468baa4a8360c5b3dc74529aa67d981a3f" +"cchs-escc2012_2011-2012gid-complement-eng.pdf","pumf","2011-2012",320368,"c923f08a9d499865f2e5af1835997d26509f3e8c0ef9347de59781354674b846" +"cchs-escc2012_2011-2012gid-eng.pdf","pumf","2011-2012",2560416,"9b3a62d21d9c64fb3bf1b3aa637f2d5ad77ebe9fcb1528422483bfe3dfd2a319" +"cchs-escc2012que-eng.pdf","pumf","2011-2012",8526154,"bab6adb274d91397eed85f5a6adf82237175b5f0604a326ead23b835b907a9a4" +"CCHS_2012_Alpha_Index.pdf","pumf","2012",202254,"8045247937f89a90178f11ddeafe06f9ac41c0ea3b4915949db7a00499aa526b" +"CCHS_2012_CV_Tables.pdf","pumf","2012",1239962,"b73e1e9442c2af077eb73bea463ba8267c6a8f767cf819c3bd71bd5f0838a4fe" +"CCHS_2012_DataDictionary_Freqs.pdf","pumf","2012",2410712,"7f663df9c7d13cd2f91235121e2338e69d4ff35fd7e1b2ec62271b991ad1a6e7" +"CCHS_2012_Derived_Variables.pdf","pumf","2012",1055882,"5c42a1b1fa398b1e955d3bc12226cfa894fad5f1f33bb761af1ed38d9969b14c" +"CCHS_2012_Record_Layout.pdf","pumf","2012",204458,"6b20442eaa2d926b0e97512df3a547a43e2546e4dc802aa1ec5aa3335bade6c6" +"CCHS_2012_Topical_Index.pdf","pumf","2012",198289,"3a320336eb12e675e3062eda1135d9f5fd8b8a98c5219caf446636f2a02d2c2d" +"Microdata_license_agreement.pdf","pumf","2012",111642,"daf3df534bb21243fed5dda4387268ac9e8f27890345557fabbb0e98d4719619" +"QUALIT_E.pdf","pumf","2012",95471,"b9707488dac432d30aaf5d13240005468baa4a8360c5b3dc74529aa67d981a3f" +"cchs-escc2012_2011-2012gid-complement-eng.pdf","pumf","2012",320368,"c923f08a9d499865f2e5af1835997d26509f3e8c0ef9347de59781354674b846" +"cchs-escc2012_2011-2012gid-eng.pdf","pumf","2012",2560416,"9b3a62d21d9c64fb3bf1b3aa637f2d5ad77ebe9fcb1528422483bfe3dfd2a319" +"cchs-escc2012que-eng.pdf","pumf","2012",8526154,"bab6adb274d91397eed85f5a6adf82237175b5f0604a326ead23b835b907a9a4" 
+"CCHS_2013-2014_Alpha_Index.pdf","pumf","2013-2014",183833,"9e192c9eea11039bf68a201b14db27d9f08de55ece3240a9081969fac3d22194" +"CCHS_2013-2014_CV_Tables.pdf","pumf","2013-2014",1291339,"539c4a31acf2fefccc73fb9ede0ea51856cfe7bbe9d397952ddd26c311b6649d" +"CCHS_2013-2014_DataDictionary_Freqs.pdf","pumf","2013-2014",1813987,"b5f620875c5753a952dd0915f404f2024ddea02a7eb2ab902797bb7e3a76e1ba" +"CCHS_2013-2014_Derived_Variables.pdf","pumf","2013-2014",1107126,"c13d100c9e559b7bbea8a51835e680af3300ee0e7b21358f9386a27face97e19" +"CCHS_2013-2014_Record_Layout.pdf","pumf","2013-2014",182190,"f5ff36eadc93cf109e73d376edd6489a89dacdf29b6113e165ac3cf3384660a7" +"CCHS_2013-2014_Topical_Index.pdf","pumf","2013-2014",181004,"7e146b9014c70af77779c9445baa903092b918f6c91d6f3fb9099dfae0590015" +"CCHS_2014_2013-2014_Complement_User_Guide.pdf","pumf","2013-2014",746261,"7d6c6f2cd80644ef2127513757f227cb2a808922c6e7b1673ebc4557254968d3" +"CCHS_2014_2013-2014_User_Guide.pdf","pumf","2013-2014",1767439,"f5a92ea1525bafc5227b021a5bbc463367bcffbb3248c9f813164f2359fea02f" +"CCHS_2014_Questionnaire.pdf","pumf","2013-2014",2078309,"36072c5d9371fbc7eee8d5894f3e3e106fcbea0e4a645831b52541fdef7f156f" +"Microdata_license_agreement.pdf","pumf","2013-2014",111642,"daf3df534bb21243fed5dda4387268ac9e8f27890345557fabbb0e98d4719619" +"QUALIT_E.pdf","pumf","2013-2014",21923,"a1128617570e9649f21c56f4a663779efc38edd91d163863ba82fdd8b9861c6c" +"CCHS_2014_2013-2014_Complement_User_Guide.pdf","pumf","2014",746261,"7d6c6f2cd80644ef2127513757f227cb2a808922c6e7b1673ebc4557254968d3" +"CCHS_2014_2013-2014_User_Guide.pdf","pumf","2014",1767439,"f5a92ea1525bafc5227b021a5bbc463367bcffbb3248c9f813164f2359fea02f" +"CCHS_2014_Alpha_Index.pdf","pumf","2014",197436,"8e405cce6e19cec8260e8f5535efaae06a96a20a1e83af43251b3ee84bf7e85c" +"CCHS_2014_CV_Tables.pdf","pumf","2014",1293845,"085dfe5c18b0409d29336f774a3c28e9b6e1627ee6e12a13efa21dcb1c8d9f6b" 
+"CCHS_2014_DataDictionary_Freqs.pdf","pumf","2014",1987367,"4035dc015395bb2a15534baa1671118ba5e377ad00ad5cdd5ab05a3414747d75" +"CCHS_2014_Derived_Variables.pdf","pumf","2014",1189889,"8b227324c240e43c0ad6ace87126c1c51fccecec5febaa31052141a04196291e" +"CCHS_2014_Questionnaire.pdf","pumf","2014",2078309,"36072c5d9371fbc7eee8d5894f3e3e106fcbea0e4a645831b52541fdef7f156f" +"CCHS_2014_Record_Layout.pdf","pumf","2014",196418,"25536ced9af959d2754394bbe8bc03f9ee2bab704cd36a91af0daf020bdeb703" +"CCHS_2014_Topical_Index.pdf","pumf","2014",194158,"b3b97ce46603471bfcd3fe402ac48a6db1a3e4a29e617e4b2a5bcb09eba43618" +"Microdata_license_agreement.pdf","pumf","2014",111642,"daf3df534bb21243fed5dda4387268ac9e8f27890345557fabbb0e98d4719619" +"QUALIT_E.pdf","pumf","2014",21923,"a1128617570e9649f21c56f4a663779efc38edd91d163863ba82fdd8b9861c6c" +"Bootstrap_E.pdf","pumf","2015-2016",441822,"9a90ea3f3ae109aea2f9413da8385f9a871718225794bfc1a285b26640533b35" +"CCHS_2015-2016_PUMF_CV_Tables.pdf","pumf","2015-2016",2812499,"3564974a87cf12d5537b252865de4e6797431ddcbeb15d0146b03f66662b6f86" +"CCHS_2015-2016_PUMF_Complement_User_Guide.pdf","pumf","2015-2016",389699,"cf0f866cedb8b46fba244b6818519a8716f58f52c4f5b5536ef1534e61f51047" +"CCHS_2015-2016_PUMF_DataDictionary_Freqs.pdf","pumf","2015-2016",2900302,"92f7cd0df65f476107bc938ef5d39f53de79ee1a22b010463e6dc56dd1a279a1" +"CCHS_2015-2016_PUMF_Derived_Variables.pdf","pumf","2015-2016",1143250,"03b72388e21460b36d465fa3d75bbbd8bbcba4d32407bf0482ade071dfdd64ff" +"CCHS_2015-2016_PUMF_Errata.pdf","pumf","2015-2016",626330,"2af8e0f2e0bb758a7adbe6c63971d824422b7c9cba9c8f9b54c669e2a985f954" +"CCHS_2015-2016_PUMF_Quality_assurance_guidelines.pdf","pumf","2015-2016",78708,"1629b5dc79c7577d47f2343afe68d80022601927d62ceb84a17d9b78b6f79b8e" +"CCHS_2015-2016_PUMF_Record_Layout.pdf","pumf","2015-2016",430844,"a790bd5dc8e204b8acf658c264ee678bf4f10910b17d7d09d82c224466cfe688" 
+"CCHS_2015-2016_Questionnaire.pdf","pumf","2015-2016",1356563,"e934ed3541515d65507117a2e4f985b9e12a572c0658e3e537330945dc36ae7d" +"CCHS_2015-2016_User_Guide.pdf","pumf","2015-2016",2076776,"a6a153da80aa509fbd8bef3e20714f737db151dfbf59610e67d63a239e5c5f39" +"Bootstrap_E.pdf","pumf","2017-2018",441822,"9a90ea3f3ae109aea2f9413da8385f9a871718225794bfc1a285b26640533b35" +"CCHS 2017-2018 PUMF Derived Variables.pdf","pumf","2017-2018",1201847,"8193d4e2cab69e4c48c5b32fc221d0168c67872d4f4650432c4debe9a488c3ff" +"CCHS2017-2018PUMFComplementUserGuide.pdf","pumf","2017-2018",687796,"f071ad137d2fd4bf6af80c69d13db1570b5f206215a1a8ad5e12b0b8d8231acf" +"CCHS2017-2018PUMFDataDictionary.pdf","pumf","2017-2018",3299822,"5013d607bcfb505aa953234ba78c96927269a5d5ac49fe7a5191189a6969c58c" +"CCHS2017-2018PUMFQualityassuranceguidelines.pdf","pumf","2017-2018",110405,"e356861b6d7548f7345a828b073b24ebdd73179736f13c661a8f5b48c84ecb1d" +"CCHS2017-2018Questionnaire.pdf","pumf","2017-2018",1467597,"e79fe74494b999b71b4648255024e4f8c02870e5c2ac571bf11702bab9a01bf0" +"CCHS2017-2018UserGuide.pdf","pumf","2017-2018",2272440,"317354ca11f3c05ba450fb0d415804e76a61ce5a0ef68b700d921f1bba54a48d" +"CCHS_2017_2018_CV_Tables_PUMF.pdf","pumf","2017-2018",2902357,"c07b8fee64261b1843bd20cd070fc8ecab73e1b3570a4a9cefa6ed9acbb943aa" +"CCHS 2019 TEU_HLU Data Dictionary (rounded frequencies).pdf","pumf","2019-2020",2123888,"d26abc02734817c297bc96f39cb2f15d39b9a8a3d8a2d62103ced944ffea5b97" +"CCHS_2019-2020_DataDictionary_Freqs.pdf","pumf","2019-2020",2563050,"4e883e0c0303a146af11004992e37dd80527681b62fcadc10af84e8112d04df4" +"CCHS_2019-2020_Derived_and_Grouped_Variables.pdf","pumf","2019-2020",632590,"d3de41bfed0488ac210f55d86f6c6fb4b8ecc22ea37cb46308cbdf0031eddeba" +"CCHS_2019-2020_PUMF_Complement_User_Guide.pdf","pumf","2019-2020",251656,"fe3d6ea22359b4132903439f0ba2c9f0841ee90c692e325a4f4b23153fbf5949" 
+"CCHS_2019-2020_Questionnaire.pdf","pumf","2019-2020",4061517,"7446f702f2b3e65520a9ee4ba40eadf6ce378fadc24fc7e6bdef1b0542fbf33e" +"CCHS_2019-2020_User_Guide.pdf","pumf","2019-2020",1263572,"ea4957e2d869b8ce435a629a43d3e3481e67b9dd16fb0b42577bdff924b62711" +"CCHS 2022 PUMF Complement User Guide.pdf","pumf","2022",240317,"343fbf912ba24b65aa3832ebd8893c76aad81e25dedb377c138b07be352c6e69" +"CCHS_2022_CV_Tables_PUMF.pdf","pumf","2022",2526281,"bd231646ddad267ca1ab9dda1dc6c69f760e10bce2ec63d3725c06849a3cf080" +"CCHS_2022_DataDictionary_Freqs.pdf","pumf","2022",1868384,"93a384e232da357bd94594d9a9aa00f2255f26a87da9c0706809379438a54078" +"CCHS_2022_Income_Master File.pdf","pumf","2022",389291,"69f889abce5b9bb22b702a47706e57f80bb2f1c72a8bb4f74e5a9aaa2e549ef4" +"CCHS_2022_PUMF_Grouped_Variables.pdf","pumf","2022",349879,"db4ed037cbd84f5ed147d02fc6e1ba109ff7fc878aa954bd976ed6bb41520506" +"CCHS_2022_User_Guide.pdf","pumf","2022",1062206,"233efb1141a05a6bbfe8ccb90556559e5860f4ce51138ba385a36a1c236635fc" diff --git a/.claude/skills/cchsflow-review/docs/variable-naming-conventions.md b/.claude/skills/cchsflow-review/docs/variable-naming-conventions.md new file mode 100644 index 00000000..acced01e --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/variable-naming-conventions.md @@ -0,0 +1,97 @@ +# Variable naming conventions + +This document governs how harmonized variable names are chosen in cchsflow. It +applies both when authoring new variables and when reviewing PRs that introduce +new names. + +## Core principle: preserve StatCan names + +Use the common harmonized StatCan variable name (e.g., `SMK_09A`, `DHHGAGE`) +unless a suffix is specifically required to distinguish a transformation or +structural variant. Avoid decorative suffixes — only add what is necessary to +disambiguate. 
+ +## Suffix rules + +### `_cont` — categorical-to-continuous recode only + +Add `_cont` when the source variable is **categorical** and the harmonized +variable applies midpoint imputation to produce a continuous output. + +``` +SMK_09A → categorical (codes 1–4) +SMK_09A_cont → midpoint-imputed continuous (0.5, 1.5, 2.5, 4) +``` + +**Do not** add `_cont` if the source variable is already continuous. In that +case, keep the StatCan name unchanged. + +### `_catN` — category-count change or derived clarification + +Add `_catN` (where N = number of output categories) only when: + +1. The number of categories **changes** from the source (collapsing or + expanding), or +2. The variable is **derived** and the suffix clarifies the output structure + for users. + +**Do not** add `_catN` if the harmonized categories are identical to the source +variable's categories. A variable that recodes unchanged 4-category responses +is still just the source name — it does not become `_cat4`. + +### Era/cycle suffixes — use descriptive names, not letters + +When a variable has a genuine structural break across cycles (different question +wording, different category boundaries, or a different source variable with +incompatible categories), use an **era-based suffix** rather than an abstract +letter. + +| Avoid | Use instead | When | +|-------|-------------|------| +| `_A` | `_2001`, `_pre2003` | 2001-only variant (cycle 1.1) | +| `_B` | `_2003plus` | 2003+ variant | +| `_A`, `_B` | `_pre2007`, `_2007plus` | Pre/post 2007 restructuring | + +Existing `_A`/`_B` suffixes are deprecated. Replace them with era-based names +when a variable is refactored or reviewed, unless the refactor is out of scope. + +### Other clarifying suffixes + +Add clarifying suffixes as needed when disambiguation is genuinely required and +the above rules do not apply. Keep suffixes short and self-explanatory. A +reviewer should be able to infer the meaning of a suffix without consulting the +worksheet. 
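
The suffix rules above can be checked mechanically. The following is a minimal sketch, not part of cchsflow: `split_name` and `NameParts` are hypothetical names, and the regular expressions are assumptions inferred from the examples in this document (`_cont`, `_catN`, era suffixes such as `_2001`, `_pre2007`, `_2003plus`, and the deprecated `_A`/`_B`). A reviewer could use something like it to flag legacy letter suffixes or to recover the StatCan base name.

```python
import re
from typing import NamedTuple, Optional


class NameParts(NamedTuple):
    base: str                         # name with recognised suffixes removed
    is_cont: bool                     # True when a `_cont` suffix is present
    n_categories: Optional[int]       # N from `_catN`, if present
    era: Optional[str]                # e.g. "2001", "pre2007", "2003plus"
    deprecated_letter: Optional[str]  # "A" or "B" for a legacy letter suffix


# Assumed patterns, inferred from the examples in the suffix rules.
_ERA = re.compile(r"_(pre\d{4}|\d{4}plus|\d{4})$")
_LETTER = re.compile(r"_([AB])$")
_CAT = re.compile(r"_cat(\d+)$")


def split_name(name: str) -> NameParts:
    """Peel suffixes off the right: era (or legacy letter), then _catN or _cont."""
    rest = name
    era = None
    letter = None
    n_cat = None
    is_cont = False

    era_match = _ERA.search(rest)
    if era_match:
        era = era_match.group(1)
        rest = rest[:era_match.start()]
    else:
        letter_match = _LETTER.search(rest)
        if letter_match:
            letter = letter_match.group(1)
            rest = rest[:letter_match.start()]

    cat_match = _CAT.search(rest)
    if cat_match:
        n_cat = int(cat_match.group(1))
        rest = rest[:cat_match.start()]
    elif rest.endswith("_cont"):
        is_cont = True
        rest = rest[: -len("_cont")]

    return NameParts(rest, is_cont, n_cat, era, letter)


if __name__ == "__main__":
    for example in ("SMK_09A", "SMK_09A_cont", "SMK_09A_cat4_2003plus", "DHHGAGE"):
        print(example, split_name(example))
```

The patterns are heuristics: a StatCan name that happens to end in an underscore followed by four digits would be misread as carrying an era suffix, so the result is a prompt for review rather than a verdict.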
+ +## DerivedVar block `recEnd` semantics + +In `variable_details.csv`, `DerivedVar` blocks document the output of a custom +R function. The `recEnd` values in these blocks are the **output values** +returned by the function, not recode targets. + +- Categorical DV output: `recEnd` values are integers (1, 2, 3, …) matching + the function's return values +- Continuous DV output: `recEnd` values are midpoints or numeric outputs + matching the function's return values + +This differs from direct recode blocks, where `recEnd` is the target value that +`rec_with_table()` writes into the output. Do not flag integer `recEnd` values +in a `DerivedVar` block as inconsistent with midpoint values in a sibling direct +recode block; they serve different purposes. + +## Examples + +| Variable | Name chosen | Rationale | +|----------|-------------|-----------| +| `SMK_09A` (categorical, unchanged) | `SMK_09A` | No transformation; keep StatCan name | +| `SMK_09A` → midpoint imputed | `SMK_09A_cont` | Categorical → continuous recode | +| `SMK_09A` collapsed to 4 cats | `SMK_09A_cat4` | Category count change from source | +| `SMK_09A` 2001 variant (different categories) | `SMK_09A_cat4_2001` | Era suffix for structural break | +| `SMK_09A` 2003+ variant | `SMK_09A_cat4_2003plus` | Era suffix, more readable than `_B` | +| Continuous source variable (no transform) | `DHHGAGE` | Already continuous; no `_cont` | + +## Relationship to `dummyVariable` naming + +`dummyVariable` values follow a separate convention (see Check 5b in the +review skill) and are derived from the harmonized variable name. The suffix +rules above govern the harmonized name itself, which then propagates into +`dummyVariable` values.
diff --git a/.claude/skills/cchsflow-review/docs/worksheet-reference.md b/.claude/skills/cchsflow-review/docs/worksheet-reference.md new file mode 100644 index 00000000..3173e0a6 --- /dev/null +++ b/.claude/skills/cchsflow-review/docs/worksheet-reference.md @@ -0,0 +1,684 @@ +# Understanding cchsflow worksheets + +A reference for reading, writing, and validating the cchsflow harmonisation worksheets. This document covers both the CCHS survey context needed to judge whether a mapping is *correct* and the cchsflow worksheet mechanics needed to judge whether a mapping is *well-formed*. + +**Audience:** Human contributors, LLM reviewers (Claude Code, Gemini/NotebookLM), and anyone who needs to understand how cchsflow encodes variable harmonisation. + +**Design spec:** `docs/superpowers/specs/2026-03-26-worksheet-reference-design.md` + +------------------------------------------------------------------------ + +## Part 1: CCHS foundations + +This section provides just enough survey context for a reviewer to understand what cchsflow is harmonising. For comprehensive CCHS terminology, see the [cchsflow-docs glossary](https://github.com/Big-Life-Lab/cchsflow-docs/blob/main/docs/glossary.md). + +### What is the CCHS? + +The **Canadian Community Health Survey** (CCHS) is a national cross-sectional health survey conducted by Statistics Canada. It collects data on health status, healthcare utilisation, and health determinants from approximately 65,000 respondents per year under its annual design. Collection began in 2001 and ran every two years until 2005; since 2007 the survey has run annually, so more than 20 years of data are now available. + +cchsflow harmonises variables across these cycles so that researchers can pool cycles or compare them over time despite changes in variable names, response categories, and questionnaire structure.
+ +### File types + +Statistics Canada releases CCHS data in several file types: + +| File type | Access | Content | cchsflow suffix | +|----------------|----------------|----------------|-------------------------| +| **PUMF** (Public Use Microdata File) | Public download | Grouped/suppressed values for privacy | `_p` | +| **Master** | Restricted (Research Data Centres) | Exact values, full variable set | `_m` | +| **Share** | Deprecated | Legacy public-use subset | `_s` (convert to `_m`) | + +**Key difference for harmonisation:** PUMF files often group continuous variables into categories (e.g., exact age → age groups) and may suppress rare values. Master files retain exact values. This means the same conceptual variable may need different recode rules for PUMF and Master — which is why cchsflow worksheets use separate blocks for `_p` and `_m` databases. + +### Cycle naming + +| Era | Naming | Examples | +|------------------|--------------------------|-----------------------------| +| Early cycles (2001-2005) | Single-year, labelled as "Cycle N.1" | 2001 (Cycle 1.1), 2003 (Cycle 2.1), 2005 (Cycle 3.1) | +| Transition (2007) | First dual-year collection | 2007-2008 | +| Annual period (2008-2023) | Annual or dual-year | 2009-2010, 2011-2012, ..., 2022, 2023 | + +**Exception:** The 2021 CCHS was not released as a standalone PUMF. It was combined with 2022 data into a 2021-2022 PUMF. Any `cchs2021_p` database reference is invalid in the current release structure. + +### Database identifiers + +cchsflow uses a consistent naming convention for databases: + +``` +cchs{year}_{type} +``` + +- `cchs2001_p` — 2001 PUMF +- `cchs2007_2008_m` — 2007-2008 Master +- `cchs2022_p` — 2022 PUMF + +The `_s` (share file) suffix is **deprecated**. Share files map to single-year Master databases: `cchs2009_s` → `cchs2009_m`, `cchs2010_s` → `cchs2010_m`, `cchs2012_s` → `cchs2012_m`. + +**Single-year vs dual-year databases** are separate databases, not aliases. 
`cchs2009_m` and `cchs2009_2010_m` are distinct — some variables only appear in the single-year file, and Statistics Canada may drop those variables from the combined dual-year file. cchsflow supports both where available; when both exist, single-year databases are generally the primary focus. + +### StatCan variable naming system + +Statistics Canada uses systematic naming conventions to distinguish how a variable was measured and processed. Understanding these is essential for interpreting `variableStart` entries in the worksheets. + +**Module prefixes** identify the survey section (e.g., `SMK` = Smoking, `ALC` = Alcohol, `GEN` = General Health, `DHH` = Demographics). + +**Naming patterns within a module:** + +| Pattern | Meaning | Example | +|------------------------|------------------------|------------------------| +| `MOD_NNN` | Base survey question — direct questionnaire response | `SMK_005` (Have you smoked 100 cigarettes?) | +| `MOD_NNNA` | Lettered sub-question | `SMK_09A` (When did you stop daily?) | +| `MODG_NNN` or `MODGNNN` | Grouped/categorical version | `SMKG005` (grouped smoking frequency) | +| `MODDXXX` | StatCan-derived variable (computed from others) | `SMKDSTY` (smoking status, derived) | +| `MOD_NNNC` | Continuous companion (exact values alongside categorical) | `SMK_09C` (exact years since quit) | + +**Variables rename across cycles.** The same concept may have different names in different survey years due to questionnaire redesign: + +- `SMK_09C` (2003-2014) → `SMK_090` (2015-2021) → `SPU_25` (2022-2023) +- `SMK_045` (pre-2022) → `CSS_25` (2022-2023) + +This is why `variableStart` in the worksheets supports era-specific aliases — the harmonised variable needs to map to the correct source name for each cycle. 
+ +### PUMF vs Master: implications for harmonisation + +| Aspect | PUMF | Master | +|--------------------------|--------------------|--------------------------| +| Continuous variables | Often grouped into categories | Exact values retained | +| Rare values | Suppressed or top-coded | Present | +| Variable availability | Subset of Master variables | Full variable set | +| Missing codes | Same encoding (6, 7, 8, 9) | Same encoding | +| Derived variables | May differ from Master | StatCan-computed | + +**Harmonisation consequence:** A single cchsflow variable often requires separate worksheet blocks for PUMF and Master databases. For example, `SMKG09C` uses a direct recode block for PUMF (mapping categorical codes) and a range-based `[SMK_09C]` block for Master (grouping continuous years into categories). + +------------------------------------------------------------------------ + +## Part 2: Worksheet schema — `variables.csv` + +`variables.csv` is the **variable registry**. Each row defines one harmonised variable: its name, label, which databases contain it, and the source variable names used in each database. + +**Current dimensions:** \~384 rows, 18 columns. + +### Column reference + +| Column | Type | Description | +|---------------------|------------------|----------------------------------| +| `variable` | text | **Primary key.** Harmonised variable name (e.g., `SMK_01A`, `DHHGAGE_cont`). Must be unique. | +| `label` | text | Short label (≤40 characters). Used in output datasets. | +| `labelLong` | text | Descriptive label. Human-readable explanation of the variable. | +| `variableType` | text | Output data type: `Categorical` or `Continuous`. | +| `databaseStart` | text | Comma-separated list of databases where this variable is available (e.g., `cchs2001_p, cchs2003_p, cchs2007_2008_m`). 
| +| `variableStart` | text | Comma-separated source variable names with optional `db::name` aliases for era-specific names (e.g., `cchs2003_p::SMKCG09C, cchs2005_p::SMKEG09C, SMKG09C`). Plain names apply to all unlisted databases. | +| `subject` | text | Domain classification (e.g., `Smoking`, `Physical Activity`, `Demographics`). | +| `section` | text | Sub-domain or module section. | +| `units` | text | Measurement units (e.g., `years`, `cigarettes/day`, `score`). | +| `notes` | text | Free-text notes about the variable. | +| `description` | text | Extended description of the variable's purpose and derivation. | +| `version` | text | Version when the variable was added or last modified (e.g., `3.0.0-alpha`). | +| `lastUpdated` | date | Date of last modification (YYYY-MM-DD). | +| `reviewNotes` | text | Notes from review process. | +| `ICES.confirmation` | text | ICES review status or confirmation notes. Temporary column for development. This will be deprecated in the final version 3.0. | +| `Observation..MD.` | text | Observations from MD (medical doctor) review. Temporary column for development. | +| `status` | text | Lifecycle state: `active`, `deprecated`, or `draft`. | +| `versionNotes` | text | Notes about version changes. | + +### Key relationships + +- `variable` in `variables.csv` is the primary key; `variable` in `variable_details.csv` is a foreign key that references it +- `databaseStart` must be a superset of all databases appearing in `variable_details.csv` for that variable +- `variableStart` names must match the source variable names used in `variable_details.csv` blocks + +------------------------------------------------------------------------ + +## Part 3: Worksheet schema — `variable_details.csv` + +`variable_details.csv` is the **recode specification**. Each row defines one mapping rule: for a given variable, in a given set of databases, map a source value to a target value. Rows group into blocks that collectively define how a variable is recoded for a set of databases.
+ +**Current dimensions:** \~3,664 rows, 23 columns. + +### Column reference + +| Column | Type | Description | +|---------------------|------------------|----------------------------------| +| `variable` | text | **Foreign key** to `variables.csv`. The harmonised variable this row belongs to. | +| `dummyVariable` | text | Row identifier. Convention: `{variable}_{typeEnd}{numValidCat}_{sequence}` (e.g., `SMK_01A_cat2_1`). Values repeat across blocks for the same variable when different blocks produce the same output categories. See [naming conventions](#part-7-naming-conventions). | +| `typeEnd` | text | Output data type for this row: `cat` (categorical) or `cont` (continuous). | +| `databaseStart` | text | Comma-separated databases this row applies to. A database must appear in exactly one block for each variable. | +| `variableStart` | text | Source variable specification. Supports four notations (see below). | +| `ICES.confirmation` | text | ICES review confirmation. | +| `typeStart` | text | Source variable data type: `cat` or `cont`. Determines recode behaviour: `cat` sources use value mapping (including midpoint imputation); `cont` sources may use range-based binning or copy. A single variable can have blocks with different `typeStart` values for different databases (e.g., PUMF categorical vs Master continuous). | +| `recEnd` | text | **Target value** — the output of the recode. See special values below. | +| `numValidCat` | text | Number of valid (non-missing) categories. For `_cont` variables derived from categorical sources, this reflects the source category count, not the number of distinct continuous output values. | +| `catLabel` | text | Short category label for the `recEnd` value. | +| `catLabelLong` | text | Long category label. | +| `units` | text | Measurement units. | +| `recStart` | text | **Source value** — the input to be recoded. See special values below. | +| `catStartLabel` | text | Label for the source value (`recStart`). 
| +| `variableStartShortLabel` | text | Short label of the source variable. | +| `variableStartLabel` | text | Full label of the source variable. | +| `notes` | text | Free-text notes. | +| `version` | text | Version when added or modified. | +| `lastUpdated` | date | Date of last modification. | +| `status` | text | Lifecycle state: `active`, `deprecated`, `draft`. | +| `reviewNotes` | text | Review notes. | +| `versionNotes` | text | Notes about version changes. | +| `review` | text | Review status or reviewer. | + +### `variableStart` notations + +The `variableStart` column supports four distinct notations, each serving a different purpose: + +**1. Plain name** — the source variable has the same name across all listed databases: + +``` +SMKG09C +``` + +**2. Database-qualified alias** — era-specific source names when the variable was renamed across cycles: + +``` +cchs2003_p::SMKCG09C, cchs2005_p::SMKEG09C, [SMKG09C] +``` + +This means: use `SMKCG09C` in `cchs2003_p`, `SMKEG09C` in `cchs2005_p`, and `SMKG09C` (resolved via bracket notation) in all other listed databases. + +**3. Bracket notation** — `[VARIABLE_NAME]` resolves to whatever source name that variable uses in each database, as defined in `variables.csv`: + +``` +[SMK_09C] +``` + +This is useful when the source variable itself has era-specific aliases defined in `variables.csv`. The bracket notation delegates name resolution to the variable registry rather than hard-coding names. + +**4. DerivedVar** — inputs to an R function that computes the output: + +``` +DerivedVar::[SMK_09A_cont, SMK_06A_cont] +``` + +The variables listed in brackets are passed as arguments to the function specified in the block's `recEnd=Func::function_name` row. 
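
The four notations above can be told apart mechanically. The following is a minimal base R sketch, not cchsflow's actual parser: `parse_variable_start()` is a hypothetical helper introduced only for illustration, and it assumes entries are comma-separated exactly as shown in the examples above.

```r
# Hypothetical helper (not part of cchsflow): classify a variableStart
# string and return the source name(s) that apply to one database.
parse_variable_start <- function(variable_start, database) {
  # Notation 4: DerivedVar lists the feeder variables for an R function.
  if (grepl("^DerivedVar::", variable_start)) {
    inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", variable_start)
    feeders <- trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
    return(list(type = "derived", names = feeders))
  }

  entries <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])
  is_alias <- grepl("::", entries, fixed = TRUE)

  # Notation 2: a database-qualified alias wins for the database it names.
  for (entry in entries[is_alias]) {
    parts <- strsplit(entry, "::", fixed = TRUE)[[1]]
    if (parts[1] == database) {
      return(list(type = "alias", names = parts[2]))
    }
  }

  # Otherwise fall back to the unqualified default entry, if there is one.
  default <- entries[!is_alias]
  if (length(default) == 0) {
    return(list(type = "none", names = character(0)))
  }
  if (grepl("^\\[.*\\]$", default[1])) {
    # Notation 3: the bracketed name still has to be resolved through
    # variables.csv; this sketch only strips the brackets.
    return(list(type = "bracket", names = gsub("^\\[|\\]$", "", default[1])))
  }
  # Notation 1: plain name used as-is.
  list(type = "plain", names = default[1])
}

vs <- "cchs2003_p::SMKCG09C, cchs2005_p::SMKEG09C, [SMKG09C]"
parse_variable_start(vs, "cchs2003_p")       # alias: SMKCG09C
parse_variable_start(vs, "cchs2007_2008_p")  # bracket: SMKG09C
```

The sketch deliberately stops at classification. Resolving a bracketed name to a per-database source name requires a lookup in `variables.csv`, which is omitted here.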
+ +### `recStart` special values + +| Value | Meaning | Example | +|---------------------|--------------------------|--------------------------| +| Integer or decimal | Literal source value to match | `1`, `2`, `6`, `3.5` | +| `[min,max)` | Half-open interval (includes min, excludes max) | `[3,6)` matches 3, 4, 5 | +| `[min,max]` | Closed interval (includes both endpoints) | `[11,82]` matches 11 through 82 | +| `else` | Catch-all: matches any value not matched by other rows | Maps unmatched values (typically to `NA::b`) | +| `N/A` | Not applicable — used in DerivedVar blocks where the R function handles all recoding | Always paired with DerivedVar `variableStart` | +| R-like expression | Conditional expression referencing input variables — used in DerivedVar output rows | `SMKDSTY_A in (3,5,6)`, `is.na(SMK_204)` | + +### `recEnd` special values + +| Value | Meaning | Example | +|---------------------|--------------------------|--------------------------| +| Integer or decimal | Target output value | `1`, `2`, `0.5`, `8` | +| `copy` | Pass-through: output equals input value unchanged | Used for continuous variables that need no transformation | +| `Func::function_name` | DerivedVar header: delegates recoding to the named R function | `Func::calculate_SMK_06A_cont` | +| `NA::a` | Not applicable — legitimate skip (e.g., non-smoker asked about quitting) | Maps source missing code `6` or equivalent | +| `NA::b` | Missing — refusal, don't know, or not stated | Maps source codes `7`, `8`, `9` (or `97`, `98`, `99`) | + +------------------------------------------------------------------------ + +## Part 4: Block structure + +Block structure is the most important concept for understanding how cchsflow worksheets work. A **block** is a group of contiguous rows in `variable_details.csv` that share the same `variable`, `variableStart`, and `databaseStart` values. Together, the rows in a block define the complete recode specification for that variable in those databases. 
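
The block definition above (contiguous rows sharing `variable`, `variableStart`, and `databaseStart`) can be sketched in base R with run-length encoding. `assign_block_ids()` is a hypothetical helper, and the four rows below are a simplified illustration rather than actual worksheet content.

```r
# Hypothetical helper (not part of cchsflow): give each run of contiguous
# rows with identical variable, variableStart, and databaseStart a block id.
assign_block_ids <- function(details) {
  key <- paste(details$variable, details$variableStart,
               details$databaseStart, sep = "\t")
  runs <- rle(key)
  rep(seq_along(runs$lengths), times = runs$lengths)
}

# Simplified illustrative rows: one PUMF block followed by one Master block.
details <- data.frame(
  variable      = c("SMKG09C", "SMKG09C", "SMKG09C", "SMKG09C"),
  variableStart = c("[SMKG09C]", "[SMKG09C]", "[SMK_09C]", "[SMK_09C]"),
  databaseStart = c("cchs2007_2008_p", "cchs2007_2008_p",
                    "cchs2003_m", "cchs2003_m"),
  recStart      = c("1", "2", "[3,6)", "[6,11)"),
  recEnd        = c("1", "2", "1", "2"),
  stringsAsFactors = FALSE
)

details$block <- assign_block_ids(details)
details[, c("variable", "variableStart", "recStart", "recEnd", "block")]
```

Because the grouping depends on row order, a row whose key matches an earlier block but is separated from it by other rows would start a new block id. That is a quick way to spot non-contiguous blocks when reviewing a worksheet.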
+ +### Block types + +#### Direct recode blocks + +The most common type. Each row maps a `recStart` value to a `recEnd` value. `rec_with_table()` applies these mappings directly. + +``` +variable variableStart databaseStart recStart recEnd +SMK_01A [SMK_01A] cchs2001_p, cchs2003_p 1 1 +SMK_01A [SMK_01A] cchs2001_p, cchs2003_p 2 2 +SMK_01A [SMK_01A] cchs2001_p, cchs2003_p 6 NA::a +SMK_01A [SMK_01A] cchs2001_p, cchs2003_p [7,9] NA::b +SMK_01A [SMK_01A] cchs2001_p, cchs2003_p else NA::b +``` + +The block above says: for `SMK_01A` in `cchs2001_p` and `cchs2003_p`, take the harmonised source `SMK_01A` (which maps to era-specific StatCan names via the full `variableStart` aliases), pass values 1 and 2 through, map 6 to not-applicable, and map everything else to missing. + +#### DerivedVar blocks + +Used when the recode logic requires computation that worksheets cannot express (multi-variable input, conditional branching, date arithmetic). The first row is a header with `Func::function_name` in `recEnd`; subsequent rows document the function's possible output values. + +``` +variable variableStart recStart recEnd +cigs_per_day DerivedVar::[SMK_204, SMK_208, SMKDSTY_A] N/A Func::calculate_cigs_per_day +cigs_per_day DerivedVar::[SMK_204, SMK_208, SMKDSTY_A] [1,99] copy +cigs_per_day DerivedVar::[SMK_204, SMK_208, SMKDSTY_A] SMKDSTY_A in (3,5,6) NA::a +cigs_per_day DerivedVar::[SMK_204, SMK_208, SMKDSTY_A] is.na(SMK_204) & is.na(SMK_208) NA::b +cigs_per_day DerivedVar::[SMK_204, SMK_208, SMKDSTY_A] else NA::b +``` + +The `recEnd` values after the `Func::` row are **output documentation** — they describe the values the function can produce and conditions for missing values. This differs from direct recode blocks where `recEnd` *is* the target. Note that DerivedVar blocks may use conditional `recStart` expressions that reference input variables (e.g., `SMKDSTY_A in (3,5,6)`). 
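
As described under "How DerivedVar blocks invoke R functions", the feeder variables are passed to the function by position, once per row. The sketch below imitates that call pattern in base R; it is an assumption about the general shape of the dispatch, not the `rec_with_table()` implementation. `calculate_demo_ratio`, `demo_a`, and `demo_b` are hypothetical names invented for the example.

```r
# Hypothetical stand-in for a package function; not a cchsflow function.
calculate_demo_ratio <- function(numerator, denominator) {
  numerator / denominator
}

# Header row of an illustrative DerivedVar block.
header_rec_end <- "Func::calculate_demo_ratio"
variable_start <- "DerivedVar::[demo_a, demo_b]"

# Extract the function name and the ordered feeder list.
func_name <- sub("^Func::", "", header_rec_end)
feeders <- trimws(strsplit(
  sub("^DerivedVar::\\[(.*)\\]$", "\\1", variable_start),
  ",", fixed = TRUE
)[[1]])

# Already-harmonised feeder values (illustrative data).
harmonised <- data.frame(demo_a = c(10, 20), demo_b = c(2, 5))

# unname() drops the column names so arguments are matched by position,
# in the order given in DerivedVar::[...]; mapply() calls the function
# once per row.
args <- unname(as.list(harmonised[feeders]))
result <- do.call(mapply, c(list(FUN = match.fun(func_name)), args))
result
```

Swapping the list to `DerivedVar::[demo_b, demo_a]` would silently swap numerator and denominator, which is why the order of inputs in the worksheet must match the function's parameter order.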
+ +#### Range-based blocks + +Map continuous source values to categorical or continuous targets using interval notation. Common for Master file variables where exact values are available. + +``` +variable variableStart databaseStart recStart recEnd +SMKG09C [SMK_09C] cchs2003_m, cchs2005_m, ... [3,6) 1 +SMKG09C [SMK_09C] cchs2003_m, cchs2005_m, ... [6,11) 2 +SMKG09C [SMK_09C] cchs2003_m, cchs2005_m, ... [11,82] 3 +SMKG09C [SMK_09C] cchs2003_m, cchs2005_m, ... 996 NA::a +SMKG09C [SMK_09C] cchs2003_m, cchs2005_m, ... [997,999] NA::b +SMKG09C [SMK_09C] cchs2003_m, cchs2005_m, ... else NA::b +``` + +This groups exact years-since-quit from Master's `SMK_09C` into three categories: 1 (3-5 years), 2 (6-10 years), 3 (11+ years). + +#### Copy blocks + +Pass source values through unchanged. Used for continuous variables that already have the correct scale. + +``` +variable variableStart databaseStart recStart recEnd +SMKG09C_cont [SPU_25] cchs2019_2020_p, cchs2022_p, ... [0,121] copy +SMKG09C_cont [SPU_25] cchs2019_2020_p, cchs2022_p, ... 996 NA::a +SMKG09C_cont [SPU_25] cchs2019_2020_p, cchs2022_p, ... [997,999] NA::b +``` + +### Multi-block variables + +Most variables have more than one block because they need different recode logic for different databases. Common reasons: + +1. **PUMF vs Master split** — PUMF has categorical source, Master has continuous source. Each needs its own block with different `recStart` patterns. + +2. **Era-specific sources** — the source variable was renamed or restructured between cycles. Each era gets a block with the appropriate `variableStart`. + +3. **File type availability** — some variables exist only on Master or only on PUMF for certain cycles. 
+ +**Example:** `SMKG09C` has three blocks: + +- A direct recode block for older PUMF databases (categorical → categorical) +- A range-based `[SMK_09C]` block for Master databases (continuous → categorical) +- A direct recode block for recent PUMF databases (categorical → categorical, different source names) + +### Block precedence + +When `rec_with_table()` processes a variable for a specific database, it selects the block whose `databaseStart` includes that database. **Each database should appear in exactly one block per variable.** If a database appears in multiple blocks, the behaviour is undefined and likely indicates a worksheet error. + +### The worksheet-first principle + +Worksheet `recEnd` values are the **source of truth** for value mappings. R functions (`Func::` in DerivedVar blocks) should only be used when the recode logic genuinely cannot be expressed in worksheet rows: + +- Multi-variable computation (combining inputs from several source variables) +- Conditional branching that depends on runtime values +- Date arithmetic or other calculations + +Simple categorical-to-midpoint conversions belong in worksheet rows, not R code. The reference implementation for worksheet-only continuous variables is `DHHGAGE_cont`, which converts age groups to midpoints entirely through `recStart → recEnd` mappings with no R function. + +**Anti-pattern:** The deleted `calculate_SMK_09A_cont()` function hard-coded midpoint values (0.5, 1.5, 2.5, 4.0) that duplicated the worksheet's own `recEnd` values. This redundancy created maintenance risk — changes to the worksheet would not propagate to the function, or vice versa. + +### How DerivedVar blocks invoke R functions + +When `rec_with_table()` encounters a DerivedVar block, it: + +1. **Processes feeder variables first.** The variables listed in `DerivedVar::[var1, var2, ...]` are recursively processed through their own worksheet blocks before the function is called.
The function receives already-harmonised values, not raw source data. +2. **Calls the function by position.** Arguments are passed in the order listed in `DerivedVar::[...]`, not by parameter name. This means the **count and order** of inputs must match the function's parameter count exactly. +3. **Operates row-wise.** The function is called once per row in the dataset. + +**Practical constraints for writing DerivedVar blocks and functions:** + +- A function with 2 parameters needs exactly 2 inputs in `DerivedVar::[a, b]` +- Reordering inputs in the DerivedVar list changes which value goes to which parameter +- The function can assume its inputs are clean harmonised values (not raw StatCan codes) + +For detailed technical documentation of the `rec_with_table()` engine, see the cchsflow-review skill. + +------------------------------------------------------------------------ + +## Part 5: Recode patterns + +This section shows real examples from the current worksheets. Each illustrates a distinct pattern with the actual rows used by `rec_with_table()`. + +### Pattern 1: Simple categorical passthrough + +**Variable:** `SMK_01A` (ever smoked a whole cigarette) + +The simplest pattern — source values map directly to the same output values. + +| variable | variableStart | databaseStart (abbreviated) | recStart | recEnd | +|-------------|-------------|-----------------------|-------------|-------------| +| SMK_01A | cchs2001_p::SMKA_01A, ..., \[SMK_01A\] | cchs2001_m, cchs2001_p, ... | 1 | 1 | +| SMK_01A | cchs2001_p::SMKA_01A, ..., \[SMK_01A\] | cchs2001_m, cchs2001_p, ... | 2 | 2 | +| SMK_01A | cchs2001_p::SMKA_01A, ..., \[SMK_01A\] | cchs2001_m, cchs2001_p, ... | 6 | NA::a | +| SMK_01A | cchs2001_p::SMKA_01A, ..., \[SMK_01A\] | cchs2001_m, cchs2001_p, ... | \[7,9\] | NA::b | +| SMK_01A | cchs2001_p::SMKA_01A, ..., \[SMK_01A\] | cchs2001_m, cchs2001_p, ... | else | NA::b | + +**What makes it distinctive:** Values 1 and 2 pass through unchanged. 
The `variableStart` uses era-qualified aliases (e.g., `cchs2001_p::SMKA_01A`) for cycles where StatCan renamed the source, with `[SMK_01A]` as the default for unlisted databases. Source missing codes (6 = not applicable, 7-9 = refusal/don't know/not stated) are mapped to `NA::a` and `NA::b`. The `else` row catches any unexpected values. + +### Pattern 2: Era-specific source names + +**Variable:** `SMKG09C` (years since quit, grouped — former daily smoker) + +Different CCHS cycles use different variable names for the same concept. + +| variable | variableStart | databaseStart (abbreviated) | recStart | recEnd | +|-------------|-------------|-----------------------|-------------|-------------| +| SMKG09C | cchs2003_p::SMKCG09C, cchs2005_p::SMKEG09C, cchs2015_2016_p::SMKG090, cchs2017_2018_p::SMKG090, \[SMKG09C\] | cchs2003_p, cchs2005_p, cchs2007_2008_p, ... | 1 | 1 | +| SMKG09C | (same) | (same) | 2 | 2 | +| SMKG09C | (same) | (same) | 3 | 3 | +| SMKG09C | (same) | (same) | 6 | NA::a | +| SMKG09C | (same) | (same) | \[7,9\] | NA::b | +| SMKG09C | (same) | (same) | else | NA::b | + +**What makes it distinctive:** The `variableStart` field maps specific databases to their era-specific source names: `SMKCG09C` in 2003, `SMKEG09C` in 2005, `SMKG090` in 2015-2018, and `SMKG09C` (via bracket resolution) for all other listed databases. + +### Pattern 3: Categorical to continuous midpoint (worksheet-only) + +**Variable:** `SMK_09A_cont` (years since stopped daily, midpoint-imputed) + +Categorical source values are mapped to continuous midpoint values entirely through worksheet `recEnd` values — no R function needed. + +| variable | variableStart | databaseStart (abbreviated) | recStart | recEnd | +|-------------|-------------|-----------------------|-------------|-------------| +| SMK_09A_cont | cchs2003_p::SMKC_09A, ..., \[SMK_09A\] | cchs2003_p, cchs2005_p, ... | 1 | 0.5 | +| SMK_09A_cont | cchs2003_p::SMKC_09A, ..., \[SMK_09A\] | cchs2003_p, cchs2005_p, ... 
| 2 | 1.5 | +| SMK_09A_cont | cchs2003_p::SMKC_09A, ..., \[SMK_09A\] | cchs2003_p, cchs2005_p, ... | 3 | 2.5 | +| SMK_09A_cont | cchs2003_p::SMKC_09A, ..., \[SMK_09A\] | cchs2003_p, cchs2005_p, ... | 4 | 4 | +| SMK_09A_cont | cchs2003_p::SMKC_09A, ..., \[SMK_09A\] | cchs2003_p, cchs2005_p, ... | 6 | NA::a | +| SMK_09A_cont | cchs2003_p::SMKC_09A, ..., \[SMK_09A\] | cchs2003_p, cchs2005_p, ... | \[7,9\] | NA::b | + +**What makes it distinctive:** This is the **worksheet-first principle** in action. The midpoint values (0.5, 1.5, 2.5, 4.0) are encoded directly in `recEnd`. No DerivedVar block or R function is needed. This follows the `DHHGAGE_cont` pattern — the reference implementation for worksheet-only continuous variables. + +### Pattern 4: DerivedVar with R function + +**Variable:** `cigs_per_day` (cigarettes smoked per day) + +When multiple source variables must be combined with conditional logic, a DerivedVar block delegates to an R function. + +| variable | variableStart | databaseStart (abbreviated) | recStart | recEnd | +|-------------|-------------|-----------------------|-------------|-------------| +| cigs_per_day | DerivedVar::\[SMK_204, SMK_208, SMKDSTY_A\] | cchs2001_p, cchs2003_p, ... | N/A | Func::calculate_cigs_per_day | +| cigs_per_day | DerivedVar::\[SMK_204, SMK_208, SMKDSTY_A\] | cchs2001_p, cchs2003_p, ... | \[1,99\] | copy | +| cigs_per_day | DerivedVar::\[SMK_204, SMK_208, SMKDSTY_A\] | cchs2001_p, cchs2003_p, ... | SMKDSTY_A in (3,5,6) | NA::a | +| cigs_per_day | DerivedVar::\[SMK_204, SMK_208, SMKDSTY_A\] | cchs2001_p, cchs2003_p, ... | is.na(SMK_204) & is.na(SMK_208) | NA::b | +| cigs_per_day | DerivedVar::\[SMK_204, SMK_208, SMKDSTY_A\] | cchs2001_p, cchs2003_p, ... | else | NA::b | + +**What makes it distinctive:** The first row's `recEnd=Func::calculate_cigs_per_day` tells `rec_with_table()` to call the R function `calculate_cigs_per_day()`, passing `SMK_204`, `SMK_208`, and `SMKDSTY_A` as inputs. 
Subsequent rows document output value ranges and missing value conditions. This pattern uses conditional `recStart` expressions (e.g., `SMKDSTY_A in (3,5,6)`) that evaluate against the input variables. + +### Pattern 5: Range-based Master recode + +**Variable:** `SMKG09C_cont` (years since quit, continuous — former daily smoker, Master) + +Exact continuous values from Master files are grouped into broader categories using interval notation. + +| variable | variableStart | databaseStart (abbreviated) | recStart | recEnd | +|-------------|-------------|-----------------------|-------------|-------------| +| SMKG09C_cont | \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | \[3,6) | 4 | +| SMKG09C_cont | \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | \[6,11) | 8 | +| SMKG09C_cont | \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | \[11,82\] | 12 | +| SMKG09C_cont | \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | 996 | NA::a | +| SMKG09C_cont | \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | \[997,999\] | NA::b | +| SMKG09C_cont | \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | else | NA::b | + +**What makes it distinctive:** The `recStart` column uses interval notation (`[3,6)` means 3 ≤ value \< 6). The `recEnd` values are midpoints of the ranges (4, 8, 12), converting exact years to grouped midpoints. This is the Master counterpart to a PUMF direct recode block. + +### Pattern 6: PUMF/Master split + +**Variable:** `SMKG09C` — shows how one variable uses completely different blocks for PUMF and Master. + +**PUMF block** (direct recode — categorical source): + +| variableStart | databaseStart | recStart | recEnd | +|-------------------|---------------------|----------------|----------------| +| cchs2003_p::SMKCG09C, ... \[SMKG09C\] | cchs2003_p, cchs2005_p, ... 
| 1 | 1 | +| (same) | (same) | 2 | 2 | +| (same) | (same) | 3 | 3 | + +**Master block** (range-based — continuous source): + +| variableStart | databaseStart | recStart | recEnd | +|---------------|-----------------------------|-----------|--------| +| \[SMK_09C\] | cchs2003_m, cchs2005_m, ... | \[3,6) | 1 | +| (same) | (same) | \[6,11) | 2 | +| (same) | (same) | \[11,82\] | 3 | + +**What makes it distinctive:** Same output categories (1, 2, 3) but completely different source variables and recode logic. PUMF has pre-grouped categorical input; Master has continuous years that must be range-mapped. Both blocks produce the same harmonised output. + +### Pattern 7: Copy pass-through + +**Variable:** `SMKG09C_cont` — the `[SPU_25]` block for recent PUMF databases. + +| variable | variableStart | databaseStart | recStart | recEnd | +|---------------|---------------|---------------|---------------|---------------| +| SMKG09C_cont | \[SPU_25\] | cchs2019_2020_p, cchs2021_p, cchs2022_p, cchs2023_p | \[0,121\] | copy | +| SMKG09C_cont | \[SPU_25\] | cchs2019_2020_p, cchs2021_p, cchs2022_p, cchs2023_p | 996 | NA::a | +| SMKG09C_cont | \[SPU_25\] | cchs2019_2020_p, cchs2021_p, cchs2022_p, cchs2023_p | \[997,999\] | NA::b | + +**What makes it distinctive:** `recEnd=copy` means the source value passes through unchanged. Valid values (0-121 months) are copied as-is; only missing codes are recoded. This is used when the source variable is already in the desired format. 
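
The patterns in this part share one reading rule: take the first row whose `recStart` matches the source value and emit that row's `recEnd`. The base R sketch below applies a single block to a numeric vector under that assumption. It is illustrative only and is not how `rec_with_table()` is implemented: `matches_rec_start()` and `apply_recode_block()` are hypothetical helpers, conditional DerivedVar expressions are not handled, and the output is kept as character so that `NA::a` and `NA::b` stay visible.

```r
# Hypothetical helper: does x match one recStart specification?
# Handles literals, closed intervals [a,b], and half-open intervals [a,b).
matches_rec_start <- function(x, rec_start) {
  interval <- regmatches(
    rec_start,
    regexec("^\\[([^,]+),([^,]+)(\\]|\\))$", rec_start)
  )[[1]]
  if (length(interval) == 4) {
    lower <- as.numeric(interval[2])
    upper <- as.numeric(interval[3])
    # "]" includes the upper bound; ")" excludes it.
    below_upper <- if (interval[4] == "]") x <= upper else x < upper
    return(x >= lower & below_upper)
  }
  x == as.numeric(rec_start)
}

# Hypothetical helper: apply the rows of one block in order.
apply_recode_block <- function(x, block) {
  out <- rep(NA_character_, length(x))
  done <- rep(FALSE, length(x))
  for (i in seq_len(nrow(block))) {
    rec_start <- block$recStart[i]
    rec_end <- block$recEnd[i]
    hit <- if (rec_start == "else") {
      !done                       # catch-all: everything not yet matched
    } else {
      !done & matches_rec_start(x, rec_start)
    }
    hit[is.na(hit)] <- FALSE
    out[hit] <- if (rec_end == "copy") as.character(x[hit]) else rec_end
    done <- done | hit
  }
  out
}

# Range-based Master block for SMKG09C, as shown in Pattern 6.
master_block <- data.frame(
  recStart = c("[3,6)", "[6,11)", "[11,82]", "996", "[997,999]", "else"),
  recEnd   = c("1", "2", "3", "NA::a", "NA::b", "NA::b"),
  stringsAsFactors = FALSE
)

apply_recode_block(c(3, 5.5, 6, 11, 82, 996, 998, 1), master_block)
# "1" "1" "2" "3" "3" "NA::a" "NA::b" "NA::b"
```

Processing rows in order and skipping values that are already matched is a design choice of this sketch. In a well-formed block the `recStart` values do not overlap, so the order only matters for the `else` row.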
+ +------------------------------------------------------------------------ + +## Part 6: Missing values + +### cchsflow missing value encoding + +cchsflow uses two missing value codes throughout the worksheets: + +| Code | Meaning | Typical source codes | +|-----------------|-----------------|---------------------------------------| +| `NA::a` | **Not applicable** — the question does not apply to this respondent (legitimate skip) | `6`, `96`, `996` (StatCan "not applicable") | +| `NA::b` | **Missing** — refusal, don't know, or not stated | `7`/`8`/`9`, `97`/`98`/`99`, `997`/`998`/`999` | + +This two-category system simplifies StatCan's more detailed missing codes while preserving the critical distinction between "does not apply" and "data is missing." + +### StatCan source missing codes + +Statistics Canada uses consistent missing code conventions, but the specific values depend on the variable's range: + +| Variable range | Not applicable | Refusal | Don't know | Not stated | +|----------------|----------------|---------|------------|------------| +| 1-5 | 6 | 7 | 8 | 9 | +| 1-95 | 96 | 97 | 98 | 99 | +| 1-995 | 996 | 997 | 998 | 999 | + +### Tagged NAs in R + +In R, cchsflow uses the `haven` package's `tagged_na()` function to preserve missing value types. Tagged NAs look like regular `NA` to standard R functions but carry a hidden tag (`"a"` or `"b"`) that can be inspected with `haven::is_tagged_na()` and `haven::na_tag()`. + +This allows downstream analysis to distinguish between "not applicable" and "truly missing" when needed, while still behaving as `NA` for standard operations like `mean(x, na.rm = TRUE)`. + +### The `clean_variables()` function + +cchsflow's `clean_variables()` function auto-detects missing codes when database context is unavailable. It uses single-digit pattern matching (values 6-9 for narrow-range variables, 96-99 for wider ranges) to identify and convert source missing codes to tagged NAs. 
+ +**Caution:** Auto-detection can misclassify legitimate values as missing codes when the variable's valid range overlaps with missing code ranges. This is more likely with exact continuous values (e.g., a value of 8 could be either "8 years" or "don't know"). The worksheet's explicit `recStart → recEnd` mappings take precedence over auto-detection. + +------------------------------------------------------------------------ + +## Part 7: Naming conventions {#part-7-naming-conventions} + +cchsflow uses specific naming conventions for harmonised variables, functions, and row identifiers. The authoritative reference is `.claude/skills/cchsflow-review/docs/variable-naming-conventions.md`. + +> **v3 status:** The conventions below reflect v3 decisions. Some are universally agreed upon (new variables use tidyverse verbs); others are being adopted incrementally during refactoring (renaming legacy `_A`/`_B` suffixes, renaming `_fun` functions). Legacy code may not yet follow all conventions. + +### Harmonised variable names + +**Selecting the base name:** cchsflow generally uses the CCHS 2007-2014 StatCan variable name as the harmonised name, harmonising other years to that form. For example, `SMK_09A` uses the 2003-2014 name even though 2001 used `SMKC_09A` and 2022-2023 uses `SPU_25A`. + +**When to create a new name:** When three or more StatCan names exist for the same concept and none is clearly dominant, cchsflow creates a new descriptive name rather than arbitrarily picking one era's name. The new name should be more meaningful than any of the StatCan names. Example: `cigs_per_day` rather than picking among `SMK_204`, `SMK_045`, `CSS_25`. 
+ +### Variable suffixes + +| Suffix | When to use | Example | +|--------------------|-------------------------------|----------------------| +| `_cont` | Categorical source → continuous midpoint output (PUMF) | `SMK_09A_cont` | +| `_C` (no underscore) | Master continuous companion using StatCan naming | `SMK_09C`, `SMK_06C` | +| `_catN` | Number of output categories changes from source | `SMK_09A_cat4` | +| `_2001`, `_2003plus` | Structural break across cycles (different category boundaries) | `SMK_09A_2001`, `SMK_09A_2003plus` | +| `_A`, `_B` (deprecated) | Legacy era-split suffixes — replace with descriptive names when refactoring | `SMKG01C_A` (2001-2003), `SMKG01C_B` (2005+) | + +**`_cont` vs `_C` distinction:** `_cont` means midpoint imputation from a categorical PUMF source (e.g., `SMK_09A_cont` maps codes 1-4 to midpoints 0.5-4.0). `_C` reuses the StatCan `MOD_NNNC` naming for continuous companions that exist on Master files (e.g., `SMK_09C` = exact years from Master's `SMK_09C` variable). The `_C` variables are new in v3 (Master was not in v2). + +**Do not** add `_cont` if the source is already continuous. **Do not** add `_catN` if the categories are unchanged from the source. + +### Function naming conventions + +DerivedVar blocks reference R functions via `Func::function_name`. v3 uses tidyverse-style verb prefixes: + +| Prefix | Meaning | Example | +|--------------|----------------------------------------------|-------------------------------| +| `calculate_` | Numeric computation from inputs | `calculate_pack_years` | +| `score_` | Index or scale scoring | `score_depression_scale` | + +**Legacy functions** use a `variable_fun` convention (e.g., `SMKDSTY_fun`). These are left in place unless the variable family is being refactored — when doing a major block rewrite (e.g., smoking v3), rename the function to the new convention and add a re-exported alias if needed for backward compatibility. + +**New variables** must use the tidyverse verb convention. 
This is a firm team decision. + +### `dummyVariable` convention + +`dummyVariable` values follow the pattern `{variable}_{typeEnd}{numValidCat}_{sequence}`: + +- `SMK_01A_cat2_1` — first row of a 2-category categorical variable +- `DHHGAGE_cont_1` — first row of a continuous variable +- `SMK_01A_cat2_2` — second row + +Values may repeat across blocks for the same variable. Some v3 variables use `N/A` as a placeholder when the convention has not yet been applied. + +------------------------------------------------------------------------ + +## Part 8: Validation rules + +A valid set of cchsflow worksheets satisfies all of the following rules. These are expressed declaratively — they describe what must be true, not how to check it. The current worksheets have known violations of some rules (particularly rules 3-6 and 12) due to legacy data and ongoing migration. New variables should satisfy all rules; existing violations are tracked for resolution. + +### Structural rules + +1. **Unique primary key.** Every `variable` value in `variables.csv` must be unique. + +2. **Consistent row identifiers.** Every non-`N/A` `dummyVariable` value in `variable_details.csv` must follow the naming convention `{variable}_{typeEnd}{numValidCat}_{sequence}`. Values may repeat across blocks for the same variable when different blocks produce the same output categories (e.g., two era blocks both mapping to categories 1 and 2). + +3. **Foreign key integrity.** Every `variable` in `variable_details.csv` must have a corresponding row in `variables.csv`. + +4. **Database coverage agreement.** The set of databases in a variable's `databaseStart` in `variables.csv` must be a superset of the databases listed across all of that variable's blocks in `variable_details.csv`. + +### Block rules + +5. **Exclusive database assignment.** Each database in a variable's coverage must appear in exactly one block in `variable_details.csv`. 
If a database appears in multiple blocks for the same variable, the behaviour is undefined. + +6. **Complete missing value rows.** Every direct recode and range-based block must include rows for `NA::a` and `NA::b` in `recEnd`. DerivedVar blocks should include NA rows to document possible missing outputs, but the R function handles missing values internally. Copy-only blocks may omit NA rows if the source variable's missing values pass through unchanged. + +7. **Catch-all row.** Every direct categorical recode block should include an `else` row in `recStart` to handle unexpected source values. Copy blocks and range-based blocks with exhaustive ranges are exempt — the range notation already constrains valid values. + +### DerivedVar rules + +8. **Function existence.** A DerivedVar block's `Func::function_name` must correspond to an exported function in the package's `NAMESPACE`. + +9. **Input availability.** The variables listed in `DerivedVar::[var1, var2, ...]` must themselves be defined in `variables.csv` and available in the databases listed in the block's `databaseStart`. + +10. **Source variable existence.** For DerivedVar blocks, the source variable (`SPU_25`, `SMK_09A`, etc.) must actually exist in the databases listed in `databaseStart`. A DerivedVar block should not list databases where the source variable does not exist. + +### Value rules + +11. **Type consistency.** In direct recode blocks, `recEnd` values must be consistent with `typeEnd`: integers for `cat`, numeric (including decimals) for `cont`. + +12. **No deprecated databases.** The `_s` database suffix must not appear. Share files should be mapped to their Master equivalents (`_m`). + +13. **Mutually exclusive recStart.** Within a direct recode block, `recStart` values must not overlap. Each source value should match at most one row (before the `else` catch-all). + +### Cross-worksheet rules + +14. 
**Source name consistency.** Source variable names in `variableStart` in `variable_details.csv` must be consistent with `variableStart` in `variables.csv`. Era-specific aliases must match. + +15. **Database list consistency.** Databases listed in `variable_details.csv` blocks must use the standard `cchs{year}_{type}` naming convention and correspond to known CCHS releases. + +------------------------------------------------------------------------ + +## Part 9: Common patterns and anti-patterns + +### Pattern: Worksheet-first principle + +**When to use worksheet recodes:** Simple value mappings — categorical passthrough, midpoint imputation, range-based grouping, copy. These should always be expressed as `recStart → recEnd` rows in the worksheet. + +**When to use DV functions:** Multi-variable computation (e.g., `cigs_per_day` combines `SMK_204`, `SMK_208`, and `SMKDSTY_A` with conditional logic), conditional branching that depends on runtime values, or date arithmetic. + +**Reference implementation:** `DHHGAGE_cont` converts age group categories to midpoint values entirely through worksheet `recEnd` mappings. No R function exists or is needed. + +### Pattern: PUMF-Master bridging variables + +Some cchsflow variables exist specifically to provide a unified interface across PUMF and Master data, allowing researchers to move seamlessly between file types. These bridging variables combine: + +- **PUMF blocks:** Midpoint imputation from categorical sources (e.g., `SMK_09A_cont` maps categories 1-4 to midpoints 0.5, 1.5, 2.5, 4.0) +- **Master blocks:** Direct continuous values or copy from exact-value sources (e.g., `SMK_09C` on Master provides exact years since quit) + +The result is a single harmonised variable that produces comparable continuous values regardless of whether the researcher uses PUMF or Master data. The PUMF values carry inherent imprecision from midpoint estimation (~15-20% relative error), while Master values are exact. 
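
As a rough illustration of the bridging idea (not cchsflow code), the sketch below maps the four PUMF categories quoted above to their midpoints and passes Master values through unchanged. The names `pumf_midpoints` and `bridge_value()` are hypothetical; in cchsflow itself this mapping is expressed as worksheet `recStart → recEnd` rows rather than as an R function.

```r
# Illustrative only: midpoints quoted in the text for SMK_09A_cont (codes 1-4).
# `pumf_midpoints` and `bridge_value()` are hypothetical names, not cchsflow API.
pumf_midpoints <- c(`1` = 0.5, `2` = 1.5, `3` = 2.5, `4` = 4.0)

bridge_value <- function(x, file_type = c("pumf", "master")) {
  file_type <- match.arg(file_type)
  if (file_type == "master") {
    # Master files carry exact years, so the value passes through unchanged.
    return(as.numeric(x))
  }
  # PUMF files carry category codes; look up the midpoint for each code.
  # Codes without a midpoint (e.g., missing codes) come back as NA here.
  unname(pumf_midpoints[as.character(x)])
}

bridge_value(c(1, 4), "pumf")        # returns c(0.5, 4.0)
bridge_value(c(0.8, 3.2), "master")  # returns c(0.8, 3.2)
```

Both calls return values on the same continuous scale, which is the point of a bridging variable; only the PUMF result carries the midpoint imprecision.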
+ +**Combining functions** like `time_quit_smoking` and `cigs_per_day` take these pre-computed continuous values as DerivedVar inputs and apply priority logic (e.g., prefer former-daily over former-occasional timing) to produce a single output. This two-layer architecture — worksheet midpoint recodes feeding DerivedVar combining functions — keeps the simple mappings in worksheets while delegating multi-variable routing to R functions. + +**Key bridging variables:** `SMK_09A_cont`, `SMK_06A_cont`, `SMK_10A_cont`, `SMKG203_cont`, `SMKG207_cont`, `SMKG040_cont`, `DHHGAGE_cont`. + +### Pattern: Variables.csv stubs (planned variables) + +A `variables.csv` entry can exist without corresponding `variable_details.csv` rows. These stubs serve as architectural markers for planned work — the variable's metadata (label, subject, description) is defined, but the actual recode mappings have not yet been created. + +**Example:** `time_quit_smoking_complete` and `time_quit_smoking_daily` exist in `variables.csv` as pathway-specific variants of `time_quit_smoking`, but have zero `variable_details` rows. They are planned for v3 but not yet implemented. + +**Validation:** A worksheet check should flag variables.csv entries with zero variable_details rows as **warnings** (not errors). These are intentional stubs, not accidental omissions. + +### Anti-pattern: Orphaned DerivedVar blocks + +When an R function is deleted but its `Func::` reference remains in `variable_details.csv`, `rec_with_table()` will fail at runtime for any database that matches the orphaned block. + +**Example:** `calculate_SMK_09A_cont()` was deleted because it was redundant (worksheet-first principle). The `SMKG09C` and `SMKG09C_cont` variables had DerivedVar blocks still referencing it. These blocks were also redundant — every database they listed was already covered by other blocks — and were deleted. 
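
One way to surface such orphans before they fail at runtime is to compare the `Func::` references in the worksheet against the functions the package exports. The sketch below is a minimal base-R illustration on toy data. `find_orphaned_funcs()` is a hypothetical helper, and it assumes the `Func::` reference is stored in the `recEnd` column.

```r
# Hypothetical helper: list Func:: references whose function is not exported.
# Assumes `details` has `variable` and `recEnd` columns (toy layout below).
find_orphaned_funcs <- function(details, exported) {
  is_func <- grepl("^Func::", details$recEnd)
  func_name <- sub("^Func::", "", details$recEnd[is_func])
  orphan <- !(func_name %in% exported)
  unique(data.frame(
    variable = details$variable[is_func][orphan],
    func = func_name[orphan],
    stringsAsFactors = FALSE
  ))
}

# Toy worksheet rows, for illustration only.
details <- data.frame(
  variable = c("SMKG09C_cont", "pack_years_der", "SMK_01A"),
  recEnd = c("Func::calculate_SMK_09A_cont", "Func::calculate_pack_years", "1"),
  stringsAsFactors = FALSE
)

# Only calculate_pack_years is "exported" in this toy setup, so the
# SMKG09C_cont row is reported as an orphan.
find_orphaned_funcs(details, exported = "calculate_pack_years")
```

In a real check, the `exported` vector could plausibly come from `getNamespaceExports()` on the loaded package; that wiring is left out here.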
+ +**Prevention:** Before deleting an R function, search `variable_details.csv` for `Func::function_name` and either delete or convert all matching DerivedVar blocks. + +### Anti-pattern: Database over-specification + +A DerivedVar block lists databases where the source variable does not exist. + +**Example:** A `DerivedVar::[SPU_25]` block listed `cchs2003_p` through `cchs2023_p`, but `SPU_25` only exists in 2022-2023 databases. The block would fail for all earlier databases (and was redundant because other blocks already covered them). + +**Prevention:** Verify that the source variables in a DerivedVar block actually exist in every database listed in `databaseStart`. + +### Anti-pattern: Mismatched feeder variable names + +A DerivedVar block references feeder variables by names that don't match their `variables.csv` entries. This causes silent resolution failures because `rec_with_table()` looks up feeders by their harmonised name. + +**Example:** `DerivedVar::[SMKG005, SMKG040]` referenced `SMKG005`, but the harmonised variable is actually `SMK_005` (with underscore). The `G` prefix in `SMKG005` was the StatCan PUMF name, not the cchsflow harmonised name. The block should read `DerivedVar::[SMK_005, SMKG040]`. + +**Prevention:** DerivedVar feeder names must exactly match the `variable` column in `variables.csv`. When copying variableStart patterns from source variable names (e.g., StatCan DDI names), verify the harmonised equivalent. + +### Anti-pattern: Out-of-scope modifications + +When writing `variable_details.csv` programmatically (e.g., with R), operations like `gsub()` on the full data frame can modify rows for variables outside the intended scope. + +**Example:** A global `gsub("_s", "_m", ...)` intended to fix deprecated share file suffixes for smoking variables instead modified hundreds of unrelated variables. + +**Prevention:** Always subset to in-scope variables before applying transformations. 
Write to a temporary file for review before overwriting the main worksheet. + +------------------------------------------------------------------------ + +## Appendix A: Glossary cross-reference + +| cchsflow term | cchsflow-docs glossary term | Ontology concept (future) | +|------------------|-----------------------------|--------------------------| +| Database (`cchs2001_p`) | Dataset | Instance context | +| Variable (`SMK_01A`) | Variable | Represented Variable | +| Source variable (`SMK_005`, `SMKCG09C`) | — | Instance Variable | +| Block | — | — (worksheet-specific) | +| `recStart → recEnd` | — | Relationship (recoded) | +| `Func::` function | — | Derivation rule | +| `NA::a` / `NA::b` | — | Missing data classification | +| PUMF / Master | Master Files / Share Files | File type context | +| Era suffix (`_2001`, `_2003plus`) | Cycle | Temporal boundary | + +The ontology prototype in `cchsflow-docs/development/ontology/examples/smoking_variables.yaml` implements the DDI Variable Cascade model (Conceptual → Represented → Instance), which provides a formal framework for the relationships that cchsflow worksheets encode procedurally. See `cchsflow-docs/development/ontology/REQUIREMENTS.md` for the full specification. 
+ +------------------------------------------------------------------------ + +## Appendix B: Relationship to other documentation + +| Resource | Location | Relationship | +|----------------------|----------------------|----------------------------| +| YAML column schemas | `inst/metadata/schemas/core/` | This document adds semantics; schemas define column order | +| Naming conventions | `.claude/skills/cchsflow-review/docs/variable-naming-conventions.md` | Authoritative source for naming rules; summarised in Part 7 | +| cchsflow-review skill | `.claude/skills/cchsflow-review/SKILL.md` | Procedural review process; this document provides the declarative knowledge it references | +| cchsflow-docs glossary | `../cchsflow-docs/docs/glossary.md` | CCHS terminology; Part 1 draws from it | +| cchsflow-docs architecture | `../cchsflow-docs/docs/architecture.md` | Database schema and metadata infrastructure | +| Ontology prototype | `../cchsflow-docs/development/ontology/` | Formal variable relationship model; Appendix A bridges to it | +| CEP documents | `ceps/` | Variable-specific harmonisation specs; this document is general | +| Package documentation | `man/`, pkgdown site | R function API reference | \ No newline at end of file diff --git a/.gitignore b/.gitignore index 3d80b556..9481e306 100644 --- a/.gitignore +++ b/.gitignore @@ -4,4 +4,5 @@ .Ruserdata *.DS_Store docs/ -..Rcheck/ \ No newline at end of file +!.claude/skills/**/docs/ +..Rcheck/ From 335b856b55d5128d40f77ba117b8877d06b5e3b2 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Sun, 29 Mar 2026 19:47:00 -0400 Subject: [PATCH 07/15] fix(skill): Renumber Check 6 DV specification items sequentially The sub-item numbered 3b is renumbered to 4, with subsequent items shifted to 5-7 for consistent sequential numbering. 
--- .../skills/cchsflow-review/docs/l3-l5-worksheet-checks.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md index c45cef83..00638692 100644 --- a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md +++ b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md @@ -223,10 +223,10 @@ If the in-scope variables include derived variables (functions in `R/`): 1. **Input consistency**: Read the DV function (e.g., `calculate_pct_time()` in `R/percent-time-canada.R`) and verify that the input variable names it expects match those listed in `variable_details.csv` for the derived variable 2. **Category coverage**: Verify the function handles all category values that the worksheet's `recFrom` maps to — no unhandled cases that would silently produce NA 3. **Output consistency**: Verify the function's return values match the `recTo` values in the worksheet -3b. **No hard-coded worksheet values**: Check that the DV function does not contain literal midpoints or category values (e.g., `~ 0.5`, `~ 1.5`, `~ 4`) that duplicate or should duplicate `recEnd` values in `variable_details.csv`. If a function hard-codes values that the worksheet already expresses (or could express) as `recStart → recEnd` rows, flag as **P1** — the function should be refactored to read from the worksheet or eliminated entirely. Reference: the `DHHGAGE_cont` pattern. -4. **Output bounds validation**: For continuous DVs, check whether the function validates output range. Values outside the valid domain (e.g., percentage >100 or <0) indicate inconsistent inputs and should return `tagged_na("b")`. The valid range should be documented in the `notes` field of the Func row in variable_details (documentation only for now, ready for future validation framework). If the DV lacks bounds checking, flag as P1. -5. 
**Documentation**: Check roxygen docs match the actual function signature -6. **Necessity check** (worksheet-first): Before reviewing function logic, verify that the `Func::` DerivedVar block is actually needed. Check whether the same mapping could be expressed as direct recode rows (`recStart → recEnd`). If the DerivedVar input uses the same categorical scale as an existing direct recode block and the function only maps categories to output values, the function is redundant — flag as **P1** and recommend converting to direct recode rows. See "Worksheet-first principle" above. +4. **No hard-coded worksheet values**: Check that the DV function does not contain literal midpoints or category values (e.g., `~ 0.5`, `~ 1.5`, `~ 4`) that duplicate or should duplicate `recEnd` values in `variable_details.csv`. If a function hard-codes values that the worksheet already expresses (or could express) as `recStart → recEnd` rows, flag as **P1** — the function should be refactored to read from the worksheet or eliminated entirely. Reference: the `DHHGAGE_cont` pattern. +5. **Output bounds validation**: For continuous DVs, check whether the function validates output range. Values outside the valid domain (e.g., percentage >100 or <0) indicate inconsistent inputs and should return `tagged_na("b")`. The valid range should be documented in the `notes` field of the Func row in variable_details (documentation only for now, ready for future validation framework). If the DV lacks bounds checking, flag as P1. +6. **Documentation**: Check roxygen docs match the actual function signature +7. **Necessity check** (worksheet-first): Before reviewing function logic, verify that the `Func::` DerivedVar block is actually needed. Check whether the same mapping could be expressed as direct recode rows (`recStart → recEnd`). 
If the DerivedVar input uses the same categorical scale as an existing direct recode block and the function only maps categories to output values, the function is redundant — flag as **P1** and recommend converting to direct recode rows. See "Worksheet-first principle" above. ## Check 7: Unit tests (L5) From 540b1c164b3ce43bf842e5c620c1610eee9c7838 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 30 Mar 2026 11:55:04 -0400 Subject: [PATCH 08/15] feat(skill): Add PUMF-Master variable family pattern, completeness audit, and dev mode - Add PUMF-Master variable family pattern documentation to worksheet-reference.md explaining the systematic relationship between Master continuous, PUMF categorical, and _cont bridging variables (with DHH_AGE example and DHHGAGE_B footnote) - Add Check 8: Completeness audit (8a missing-code rows, 8b cycle coverage, 8c variable family completeness) to l3-l5-worksheet-checks.md - Add --dev mode to SKILL.md for authoring/development use where omissions are P1 - Cross-reference variable family pattern from Check 3 (PUMF vs Master naming) --- .claude/skills/cchsflow-review/SKILL.md | 9 +++- .../docs/l3-l5-worksheet-checks.md | 51 ++++++++++++++++++- .../docs/worksheet-reference.md | 25 +++++++++ 3 files changed, 82 insertions(+), 3 deletions(-) diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index fe7a9407..a97fc786 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -11,10 +11,15 @@ CEP-driven review for cchsflow worksheet changes. Reviews follow the L0-L6 harmo ## Usage ``` -/cchsflow-review -/cchsflow-review # review unstaged changes +/cchsflow-review # PR review mode +/cchsflow-review # self-review (unstaged changes) +/cchsflow-review --dev # development/authoring mode ``` + +**Review mode** (default): Validates existing worksheet entries. Checks 1-7 focus on correctness of what's present. 
Check 8 (completeness) runs but flags omissions as informational rather than blocking. + +**Development mode** (`--dev`): Runs all review checks plus full completeness audit with MCP verification. Omissions are flagged as P1. Useful when authoring new variables or expanding existing ones to additional cycles. The completeness audit actively searches for missing cycle coverage, missing variable family members (`_cont` bridges, categorical companions), and missing-code row gaps. + ## Workflow ### Prerequisite: Read the worksheet reference diff --git a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md index 00638692..bb81da91 100644 --- a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md +++ b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md @@ -77,7 +77,7 @@ For `_p` (PUMF) databases: Verify that `_m` databases don't reference PUMF-only grouped variables, and vice versa. -For variables where PUMF and Master use fundamentally different source types (categorical vs continuous), the required pattern is to split into separate recode blocks — one for PUMF, one for Master — each with its own `databaseStart` and `variableStart`. +For variables where PUMF and Master use fundamentally different source types (categorical vs continuous), see "The PUMF-Master variable family pattern" in `docs/worksheet-reference.md`. Continuous measures on Master are systematically grouped on PUMF, requiring separate worksheet rows. The required pattern is to split into separate recode blocks — one for PUMF, one for Master — each with its own `databaseStart` and `variableStart`. For harmonized variable **naming** decisions (when to use `_cont`, `_catN`, era suffixes, etc.), see `docs/variable-naming-conventions.md`. @@ -236,3 +236,52 @@ If the PR includes or modifies test files in `tests/testthat/`: - Verify cross-cycle consistency If the PR lacks tests for new derived variables, flag this. 
+ +## Check 8: Completeness audit (omissions) + +Checks 1-7 validate what's present. This check detects what's absent — missing rows, missing cycles, and missing variable family members. In **review mode**, flag omissions as informational. In **development mode** (`--dev`), flag as P1. + +Use MCP `get_variable_history` and `get_value_codes` as the authoritative reference for what should exist. + +### 8a: Missing-code row completeness + +Every variable that recodes source values must handle missing codes. The standard CCHS missing code pattern is: + +| recStart | recEnd | Meaning | +|----------|--------|---------| +| `96` (or `996`, `9996`) | `NA::a` | Not applicable | +| `[97,99]` (or `[997,999]`) | `NA::b` | Don't know / Refusal / Not stated | +| `else` | `NA::b` | Catch-all for unmapped values | + +For each in-scope variable: + +1. Check that `NA::a` and `NA::b` rows exist +2. Verify the missing code values match the source variable's scale (2-digit variables use 96/97-99; 3-digit use 996/997-999; 4-digit use 9996/9997-9999) +3. Check that an `else→NA::b` catch-all exists +4. Use `mcp__cchs-metadata__get_value_codes` to confirm the actual not-applicable code for the source variable — some variables use non-standard codes + +**Flag:** Missing `NA::a` row → P1 (data that should be NA will fall through to `else→NA::b` — functionally correct but loses the a/b distinction). Missing `else` row → P0 (unmapped values will cause runtime errors or silent data loss). + +### 8b: Cycle coverage completeness + +For each in-scope variable: + +1. Run `mcp__cchs-metadata__get_variable_history` for the source variable +2. Compare the cycles where the source exists against the worksheet's `databaseStart` +3. Flag cycles where the source variable exists but the worksheet has no coverage + +This catches the pattern where a contributor added a variable for 2007-2014 but didn't realise it also exists on 2001-2005 (with cycle-letter prefix) or 2015+ (with renamed form). 
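
The comparison in steps 2 and 3 amounts to a set difference. The following is a minimal sketch, assuming `databaseStart` is a comma-separated string and that the list of databases where the source variable exists has already been obtained (for example, from the MCP history lookup). `missing_coverage()` and the database lists are illustrative only.

```r
# Hypothetical helper: databases where the source exists but the worksheet
# block has no coverage. Assumes databaseStart is comma-separated.
missing_coverage <- function(source_dbs, database_start) {
  covered <- trimws(strsplit(database_start, ",", fixed = TRUE)[[1]])
  setdiff(source_dbs, covered)
}

# Toy inputs: the source exists in three PUMF cycles, the worksheet lists two.
source_dbs <- c("cchs2001_p", "cchs2003_p", "cchs2005_p")
missing_coverage(source_dbs, "cchs2003_p, cchs2005_p")  # returns "cchs2001_p"
```

An empty result means the worksheet covers every database in which the source variable was found.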
+ +**Flag:** Missing cycle coverage → informational in review mode, P1 in development mode. + +### 8c: Variable family completeness + +When a continuous measure is added for Master, verify the PUMF side exists (see "The PUMF-Master variable family pattern" in `docs/worksheet-reference.md`): + +1. Does a categorical version exist for PUMF? (e.g., `SMKG09C` for `SMK_09C`) +2. Does a `_cont` midpoint bridging variable exist? +3. For derived variables: do all feeder variables have coverage on all databases listed in the DV's `databaseStart`? + +A continuous Master variable without a PUMF categorical companion means PUMF users have no access to that measure. A missing `_cont` bridge means no pseudo-continuous approximation for PUMF. + +**Flag:** Missing `_cont` bridge → P1 (PUMF users get no continuous approximation). Feeder coverage gap → P0 (DV will fail at runtime for those cycles). diff --git a/.claude/skills/cchsflow-review/docs/worksheet-reference.md b/.claude/skills/cchsflow-review/docs/worksheet-reference.md index 3173e0a6..7b27eff9 100644 --- a/.claude/skills/cchsflow-review/docs/worksheet-reference.md +++ b/.claude/skills/cchsflow-review/docs/worksheet-reference.md @@ -30,6 +30,31 @@ Statistics Canada releases CCHS data in several file types: **Key difference for harmonisation:** PUMF files often group continuous variables into categories (e.g., exact age → age groups) and may suppress rare values. Master files retain exact values. This means the same conceptual variable may need different recode rules for PUMF and Master — which is why cchsflow worksheets use separate blocks for `_p` and `_m` databases. +#### The PUMF-Master variable family pattern + +**Key rule: If a continuous measure exists on Master, always expect only a categorical (grouped) version on PUMF.** This is not occasional — it is systematic across CCHS. Every continuous demographic, health behaviour, and health outcome variable on PUMF is grouped into categories for privacy protection. 
+ +A single health concept (e.g., "respondent age") therefore typically requires a **family** of harmonized variables in cchsflow: + +| Variable | Type | File type | Worksheet pattern | +|----------|------|-----------|-------------------| +| `DHH_AGE` | Continuous passthrough | Master only | `[12,102]→copy` | +| `DHHGAGE_B`\* | Categorical (grouped bins) | PUMF only | `1→1, 2→2, ...16→16` | +| `DHHGAGE_cont` | Midpoint imputation | PUMF (+ Master passthrough) | `1→13, 2→16, ...` | + +\* `DHHGAGE_B` is a **StatCan-assigned name** — the `_B` denotes the 2005+ category structure (16 age groups), not the cchsflow era-split convention. `DHHGAGE_A` (15 age groups, 2001-2003) is similarly a StatCan name. These are candidates for year-based renaming (e.g., `DHHGAGE_pre2005`, `DHHGAGE_2005plus`) but are out of scope unless being refactored. + +The `_cont` suffix convention bridges PUMF categorical data to pseudo-continuous values via midpoint imputation. For Master data, the `_cont` variable typically passes through the true continuous value unchanged. This pattern applies broadly — smoking duration, consumption frequency, BMI, income, and most other continuous measures follow it. + +**Common errors from not understanding this pattern:** + +- Adding PUMF databases to a Master-only continuous variable (e.g., putting `_p` databases on `DHH_AGE`'s copy row — PUMF has no single-year age) +- Missing the `_cont` bridging variable when adding a new continuous measure +- Assuming a DerivedVar feeder (e.g., `DHH_AGE` in `pack_years_der`) works on PUMF when it's Master-only — the PUMF pipeline needs to use `DHHGAGE_cont` instead +- Creating a continuous Master variable without a corresponding PUMF categorical variable, leaving PUMF users with no access to the measure + +**When reviewing or authoring:** For any continuous variable, check that the worksheet has the full family: Master continuous, PUMF categorical, and `_cont` bridge. 
Missing any piece means incomplete coverage for that file type. + ### Cycle naming | Era | Naming | Examples | From d30e9eb0d1c5d8d5d0edce5e4a5a409460c4c575 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 30 Mar 2026 15:53:15 -0400 Subject: [PATCH 09/15] feat(skill): Add cchsflow-derive skill with development workflow and done criteria MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SKILL.md orchestrates existing docs (foundations, patterns, testing) and adds a 5-step done criteria checklist that includes R CMD check — filling a gap where package-level validation was missing from the DV function workflow. --- .claude/skills/cchsflow-derive/SKILL.md | 182 +++++++++ .../skills/cchsflow-derive/docs/7-levels.md | 115 ++++++ .../cchsflow-derive/docs/foundations.md | 385 ++++++++++++++++++ .../docs/function-inventory.md | 82 ++++ .../docs/patterns/cat-to-continuous.md | 124 ++++++ .../docs/patterns/category-grouping.md | 113 +++++ .../docs/patterns/formula-calculation.md | 152 +++++++ .../docs/patterns/multi-source-routing.md | 155 +++++++ .../docs/patterns/pass-through.md | 87 ++++ .../docs/patterns/pathway-branching.md | 167 ++++++++ .../skills/cchsflow-derive/docs/testing.md | 240 +++++++++++ 11 files changed, 1802 insertions(+) create mode 100644 .claude/skills/cchsflow-derive/SKILL.md create mode 100644 .claude/skills/cchsflow-derive/docs/7-levels.md create mode 100644 .claude/skills/cchsflow-derive/docs/foundations.md create mode 100644 .claude/skills/cchsflow-derive/docs/function-inventory.md create mode 100644 .claude/skills/cchsflow-derive/docs/patterns/cat-to-continuous.md create mode 100644 .claude/skills/cchsflow-derive/docs/patterns/category-grouping.md create mode 100644 .claude/skills/cchsflow-derive/docs/patterns/formula-calculation.md create mode 100644 .claude/skills/cchsflow-derive/docs/patterns/multi-source-routing.md create mode 100644 .claude/skills/cchsflow-derive/docs/patterns/pass-through.md create 
mode 100644 .claude/skills/cchsflow-derive/docs/patterns/pathway-branching.md create mode 100644 .claude/skills/cchsflow-derive/docs/testing.md diff --git a/.claude/skills/cchsflow-derive/SKILL.md b/.claude/skills/cchsflow-derive/SKILL.md new file mode 100644 index 00000000..5732af2e --- /dev/null +++ b/.claude/skills/cchsflow-derive/SKILL.md @@ -0,0 +1,182 @@ +--- +name: cchsflow-derive +description: Write and review derived variable functions for cchsflow. Use when implementing new DV functions (calculate_*, assess_*, categorize_*), upgrading existing functions to v3 architecture, reviewing DV code for correctness, or preparing DV changes for commit. Covers the 3-step architecture, source-agnostic design, quality tiers, patterns, testing, and package-level validation. +allowed-tools: Bash(Rscript:*), Bash(R:*), Bash(git:*), Read, Glob, Grep +--- + +# cchsflow derived variable development + +Write, review, and validate derived variable functions using the v3 3-step architecture. + +## Usage + +``` +/cchsflow-derive # general guidance (reads foundations) +/cchsflow-derive calculate_bmi # review/write a specific function +/cchsflow-derive --check # run done criteria checks +``` + +## Before you start + +### Required reading + +Before writing or reviewing a DV function, read these docs (in this skill's `docs/` folder): + +1. **[foundations.md](docs/foundations.md)** — 3-step architecture, missing data handling, quality tiers, coding standards, anti-patterns. Read this first. +2. 
**The pattern doc** that matches your function (see "Choose a pattern" below) + +### Choose a pattern + +Identify which pattern your function follows, then read the corresponding doc: + +| Pattern | When to use | Doc | +|---------|-------------|-----| +| **Formula calculation** | Compute a value from inputs (BMI, pack-years) | [formula-calculation.md](docs/patterns/formula-calculation.md) | +| **Category grouping** | Map values to categories (BMI categories, smoking status) | [category-grouping.md](docs/patterns/category-grouping.md) | +| **Pass-through** | Clean and forward a single variable | [pass-through.md](docs/patterns/pass-through.md) | +| **Cat-to-continuous** | Midpoint imputation from categorical ranges | [cat-to-continuous.md](docs/patterns/cat-to-continuous.md) | +| **Multi-source routing** | Choose best source with priority chain | [multi-source-routing.md](docs/patterns/multi-source-routing.md) | +| **Pathway branching** | Complex decision tree with gate variables | [pathway-branching.md](docs/patterns/pathway-branching.md) | + +### Reference material + +- **[7-levels.md](docs/7-levels.md)** — function complexity taxonomy (L1-L7) +- **[function-inventory.md](docs/function-inventory.md)** — all existing DV functions with pattern, level, and tier +- **[testing.md](docs/testing.md)** — unit test and golden fixture patterns, common failure diagnostics + +## Development workflow + +### 1. Write tests + +Follow the test tier matching your function's quality tier (see [testing.md](docs/testing.md)): + +- **Bronze**: Happy path + one missing input +- **Silver**: + out-of-range, vectors, dataframe via `mutate()` +- **Gold**: + every `case_when()` branch, tagged NA type verification, `output_format` parameter + +### 2. Write the function + +Follow the pattern template from the appropriate pattern doc. Key principles: + +- **Source-agnostic**: Semantic parameter names (`height_m`, `weight_kg`), not CCHS variable names. 
ONE function for both PUMF and Master; the worksheet routes different source variables to the same parameters. +- **3-step**: `clean_variables(output_format = "tagged_na")` → `case_when()` logic → `clean_variables(output_format = output_format)` +- **Step 1 always uses `"tagged_na"`**: Never pass the user's `output_format` to Step 1 — `any_missing()` in Step 2 won't detect numeric missing codes. +- **Namespace-qualify**: `dplyr::case_when()`, `haven::tagged_na()` — functions must work standalone. + +### 3. Write roxygen documentation + +Silver and gold tier require the full template (see foundations.md § Documentation): + +```r +#' @title [verb phrase] +#' @description [1-2 sentences] +#' @details [implementation notes, PUMF vs Master table if source-agnostic] +#' @param var1 [description] +#' @param output_format Output missing data format: "tagged_na" (default) or "original". +#' @param ... Arguments passed from deprecated aliases. +#' @return [type and range] +#' @examples +#' # Scalar +#' # Vector +#' # Dataframe +#' # Standalone with rec_with_table (in \dontrun{}) +#' @references +#' @seealso +#' @export +``` + +**`@param ...` rule**: If deprecated aliases use `@rdname` pointing to your function and their signature is `function(...)`, you MUST add `@param ... Arguments passed from deprecated aliases.` to your roxygen. Otherwise R CMD check will report "Undocumented arguments in Rd file: '...'". + +### 4. Write deprecated aliases (if renaming) + +If the function replaces an older function name, add aliases in `R/deprecated-aliases.R`: + +```r +#' @rdname new_function_name +#' @export +old_function_name <- function(...) { + .Deprecated("new_function_name", + msg = "old_function_name() is deprecated. Use new_function_name() instead.") + new_function_name(...) +} +``` + +### 5. 
Update worksheets (if needed) + +If the function is referenced from `variable_details.csv` via `Func::`: + +- Update `recEnd` to point to the new function name +- Update `dummyVariable` if function name changed +- Run `Rscript exec/fix-worksheets.R` after any CSV modification +- Rebuild RData if worksheet structure changed (see cchsflow-worksheets skill) + +## Done criteria + +**Before committing DV function changes, ALL of these must pass.** Run them in order — earlier checks are faster and catch different issues. + +### Check 1: Unit tests pass + +```r +# From the project root (or worktree root) +Rscript -e 'devtools::load_all(); testthat::test_file("tests/testthat/test-.R")' +``` + +Verify: 0 failures for in-scope tests. Pre-existing failures in other test files are acceptable (note them but don't block on them). + +### Check 2: R CMD check passes + +```r +# Quick check — catches NAMESPACE, roxygen, imports (skips tests/examples) +Rscript -e 'devtools::check(document = FALSE, args = "--no-tests --no-examples --no-vignettes --no-manual")' + +# Full check — recommended before PR +Rscript -e 'devtools::check()' +``` + +Verify: 0 **new** errors/warnings/notes compared to the branch baseline. 
Common issues caught only here: + +- Undocumented `...` from `@rdname` aliases +- Missing NAMESPACE exports +- Broken `@examples` +- Undeclared imports in DESCRIPTION + +### Check 3: Worksheet validation (if worksheets changed) + +Invoke the `cchsflow-validation` skill, or run manually: + +```r +Rscript exec/fix-worksheets.R +``` + +### Check 4: Roxygen checklist + +Verify manually against the template in Step 3 above: + +- [ ] `@title`, `@description`, `@details` present +- [ ] All `@param` documented (including `...` if aliases exist) +- [ ] `@examples` includes scalar, vector, dataframe, and `rec_with_table()` +- [ ] `@return` describes type and range +- [ ] `@export` present +- [ ] `@seealso` links related functions + +### Check 5: Test coverage checklist + +- [ ] Every `case_when()` branch has a test +- [ ] Scalar, vector, and dataframe inputs tested +- [ ] Missing inputs tested (NA, tagged_na("a"), tagged_na("b")) +- [ ] Boundary values tested (for categorization functions) +- [ ] Deprecated aliases tested (expect deprecation warning + correct delegation) + +## Cross-references + +### Related cchsflow skills + +- **cchsflow-review** — PR review of worksheet changes (L0-L6 process). Lives on `skills/review-validation` branch. +- **cchsflow-validation** — programmatic worksheet validation. Lives on `skills/review-validation` branch. +- **cchsflow-worksheets** — worksheet authoring guidance. Lives on `skills/review-validation` branch. 
+ +### External references + +- R CMD check guidance: `~/github/ai-infrastructure/context/domains/r_packages.md` § "Local verification before committing" +- V3 coding standards: project memory `project_derive_function_standards.md` +- Reference implementations: `calculate_bmi()` in `R/bmi.R` (formula), `calculate_pack_years()` in `R/smoke-pack-years.R` (complex) diff --git a/.claude/skills/cchsflow-derive/docs/7-levels.md b/.claude/skills/cchsflow-derive/docs/7-levels.md new file mode 100644 index 00000000..48ead65a --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/7-levels.md @@ -0,0 +1,115 @@ +# Function levels (L1-L7) + +A taxonomy of reusable function complexity. Higher levels compose lower +levels. Understanding the level helps you write the right amount of code +and reuse existing infrastructure. + +## Level definitions + +| Level | Name | Purpose | Example | +|-------|------|---------|---------| +| L1 | Foundational utility | Low-level missing data, cleaning, pattern detection | `any_missing()`, `clean_variables()`, `assign_missing()` | +| L2 | Midpoint mapping | Convert categorical ranges to continuous values via lookup table | `smkg_age_midpoint()` | +| L3 | Single-source pass-through | Wrap and clean a single input, worksheet handles routing | `calculate_age_start_smoking()` | +| L4 | Categorical-to-continuous conversion | Apply midpoint imputation with domain logic | `calculate_SMK_06A_cont()` | +| L5 | Filter/route by status | Extract subset of input based on status filtering | `calculate_SMKG203_cont()`, `assess_quit_pathway()` | +| L6 | Multi-source combining | Route multiple sources with priority hierarchy | `calculate_time_quit_smoking_complete()` | +| L7 | Complex multi-source unification | Full decision tree combining multiple inputs | `calculate_SMKDSTY_cat6()`, `calculate_pack_years()` | + +## Decision tree + +Use this to classify your function: + +``` +Does your function just pass through a single source? 
+ → YES → L3 (pass-through) + → NO ↓ + +Does it convert categories to continuous values? + → YES, using a lookup table only → L2 (midpoint mapping) + → YES, with domain logic → L4 (cat-to-continuous) + → NO ↓ + +Does it filter/extract based on a status variable? + → YES, single source filtered by status → L5 (filter/route) + → NO ↓ + +Does it combine multiple sources with priority? + → YES, with pathway-aware routing → L6 (combining) + → NO ↓ + +Does it have a complex decision tree with multiple inputs? + → YES → L7 (complex unification) +``` + +## How levels compose + +Pack-years demonstrates the full stack: + +``` +calculate_pack_years (L7) +├── clean_variables() (L1) +├── any_missing() + get_priority_missing() (L1) +├── SMKDSTY_A (L7: calculate_SMKDSTY_cat6) +├── age_start_smoking (L3: calculate_age_start_smoking) +│ └── derive_passthrough() (L1) +├── time_quit_smoking (L6: calculate_time_quit_smoking_complete) +│ ├── calculate_SMK_06A_cont() (L4) +│ │ └── smkg_age_midpoint() (L2) +│ └── pathway logic with SMK_10_gate (L5: assess_quit_pathway) +├── cigs_per_day (L7: calculate_cigs_per_day) +│ └── status-based routing (L5 pattern) +└── age (L3: via worksheet routing) +``` + +## Level-by-level guidance + +### L1: Foundational utilities + +These are shared infrastructure. You rarely write new L1 functions — you +use them. Key functions to know: + +- `clean_variables(vars, variable_details, output_format)` — step 1 and 3 +- `any_missing(var1, var2, ...)` — vectorised missing detection +- `get_priority_missing(var1, var2, ...)` — NA::b wins over NA::a +- `assign_missing(type, var_name, variable_details)` — create typed missing +- `derive_passthrough(value, variable_name, variable_details, output_format)` — L3 helper + +### L2: Midpoint mapping + +A lookup table that converts categorical codes to continuous values. +Typically a simple named vector or small helper function. 
+ +```r +smkg_age_midpoint <- function(category) { + midpoints <- c(8, 13, 16, 18.5, 22, 27, 32, 37, 42, 47, 55) + midpoints[category] +} +``` + +### L3: Single-source pass-through + +Minimal wrapper around `derive_passthrough()`. The worksheet handles +which source variable to feed in. + +```r +calculate_age_start_smoking <- function( + age_start_smoking, variable_details = NULL, output_format = "tagged_na") { + derive_passthrough(age_start_smoking, "age_start_smoking", + variable_details, output_format) +} +``` + +### L4-L7: See pattern docs + +These levels correspond to specific patterns: + +- L4 → `patterns/cat-to-continuous.md` +- L5 → `patterns/multi-source-routing.md` (filter variant) +- L6 → `patterns/multi-source-routing.md` or `patterns/pathway-branching.md` +- L7 → `patterns/formula-calculation.md` or `patterns/category-grouping.md` + +## Existing function inventory + +See `function-inventory.md` for a complete mapping of all current DV +functions to their levels and patterns. diff --git a/.claude/skills/cchsflow-derive/docs/foundations.md b/.claude/skills/cchsflow-derive/docs/foundations.md new file mode 100644 index 00000000..f156880b --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/foundations.md @@ -0,0 +1,385 @@ +# Foundations + +Core concepts that apply to all derived variable functions in cchsflow. + +## Why the 3-step architecture? + +CCHS data arrives with raw numeric missing codes (6, 7, 8, 9 for +single-digit variables; 996, 997, 998, 999 for triple-digit). These codes +are embedded in the same numeric column as valid data — a `9` might be a +real value or "not stated" depending on the variable. + +Before v3, functions had to know which codes were valid and which were +missing for each variable, leading to hardcoded values scattered throughout +the codebase. The 3-step architecture solves this by extracting missing +code definitions from `variable_details.csv` and handling them uniformly. 
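
To make the ambiguity concrete, here is a minimal sketch in base R. The `missing_codes` list and the `is_missing_code()` helper are hypothetical stand-ins for the metadata lookup, not functions from cchsflow, and the codes shown are illustrative only.

```r
# Hypothetical metadata standing in for variable_details.csv: the raw codes
# that mean "missing" for each variable. Names and codes are illustrative.
missing_codes <- list(
  SMK_005      = c(6, 7, 8, 9),
  cigs_per_day = c(996, 997, 998, 999)
)

# Flag values whose raw code is a missing code for the named variable.
is_missing_code <- function(x, var_name, codes = missing_codes) {
  x %in% codes[[var_name]]
}

smoking_status <- c(1, 9, 3)     # here 9 means "not stated"
cigs_per_day   <- c(9, 20, 999)  # here 9 is a real count

status_missing <- is_missing_code(smoking_status, "SMK_005")
cigs_missing   <- is_missing_code(cigs_per_day, "cigs_per_day")
```

The same raw `9` is flagged for one variable and kept for the other, and neither function body contains a hardcoded code. That is the property the metadata-driven approach is after.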
+ +v3 uses the `haven` package's `tagged_na()` to represent missing data. +Tagged NAs behave like regular `NA` in most R operations but carry a tag +("a" for not applicable, "b" for not stated) that preserves the reason +the data is missing. This is important for downstream analysis — a +researcher needs to know whether a value is missing because the question +didn't apply (never-smoker asked about quit date) or because the +respondent declined to answer. + +## When to use each tier + +Problems identified in legacy code, grouped by which tier addresses them. + +### All tiers (even bronze must avoid these) + +- **String vs object NA** — comparing `NA(a)` as a string instead of using `haven::tagged_na()` objects +- **Missing codes treated as data** — `9` passes numeric comparisons silently when it means "not stated" +- **Mixed return types** — functions returning character in some branches, numeric in others +- **No output validation** — domain logic can produce out-of-range values with no check +- **Pathway confusion** — unclear which respondents should get NA::a vs a calculated value + +### Silver adds (no hardcoding, standalone, documented) + +- **Hardcoded missing codes** — `if (x == 9)` breaks when a variable uses `99` or `996` +- **Duplicated lookup tables** — same midpoint map copy-pasted across functions +- **Deep if-else nesting** — 4+ levels of `if_else2()` obscure the logic +- **Not standalone** — functions that can't be copy-pasted into a researcher's script + +### Gold adds (full 3-step, priority missing, clean_variables) + +- **Separate PUMF/Master functions** — duplicated code for the same formula with different variable names. 
Use ONE source-agnostic function with semantic params; let the worksheet route +- **Vectorisation ambiguity** — unclear whether a function handles vectors or scalars only +- **No input validation** — out-of-range inputs silently produce wrong results +- **Sophisticated pathway awareness** — complex decision trees (e.g., quit timing) need explicit pathway routing with gate variables + +## The 3-step architecture in detail + +### Step 1: Clean inputs + +```r +cleaned <- clean_variables( + vars = list( + SMK_005 = SMK_005, + SMK_030 = SMK_030 + ), + output_format = "tagged_na" +) +``` + +**What this does:** + +1. Looks up each variable's missing codes in `variable_details.csv` via + `get_complete_pattern()`. For `SMK_005`, this might return + `na_a_codes = c(6)` and `na_b_codes = c(7, 8, 9)`. +2. Converts those raw codes to tagged NAs: `6` → `tagged_na("a")`, + `7/8/9` → `tagged_na("b")`. +3. Validates that all input vectors have the same length. +4. Returns a named list of cleaned vectors. + +**CRITICAL: Step 1 must always use `output_format = "tagged_na"`.** The +user's requested format should only be passed to step 3. If step 1 uses +`"original"`, `clean_variables()` converts missing codes back to numeric +values (e.g., 999), and `any_missing()` in step 2 will not detect them — +it sees 999 as valid data. This is a confirmed bug in several existing +functions that pass `output_format` through to step 1. + +**The `output_format` parameter** controls whether missing codes are +represented as `haven::tagged_na()` values ("tagged_na") or kept as their +original numeric codes ("original"). Use "tagged_na" for step 2 logic, +then pass the user's requested format in step 3. + +**Why `cleaned$SMK_005` instead of just `SMK_005`?** After step 1, the +raw value `9` (which looked like valid data) has been converted to +`tagged_na("b")`. The `cleaned$` prefix accesses the cleaned version +where missing codes are now detectable by `any_missing()`. 
Using the +uncleaned input in step 2 would miss these hidden missing values. + +**No hardcoded missing codes.** The pattern (which codes mean "missing") +comes from `variable_details.csv`, not from the function. If a variable +uses different codes in different cycles, the metadata handles it. + +### Step 2: Domain logic + +```r +result <- dplyr::case_when( + # Check for missing data first — always the first arm + any_missing(cleaned$SMK_005) ~ + get_priority_missing(cleaned$SMK_005, cleaned$SMK_030), + + # Domain logic with cleaned values + cleaned$SMK_005 == 1 ~ 1L, # Daily smoker + cleaned$SMK_005 == 2 ~ 2L, # Occasional smoker + cleaned$SMK_005 == 3 ~ 3L, # Former smoker + + # Catch-all: anything that didn't match + .default = assign_missing("not_applicable", "SMKDSTY_cat3", output_format) +) +``` + +**Why check missing first?** If `SMK_005` is `tagged_na("b")` (not stated), +comparing it to `1` returns `NA` (not `FALSE`), and `case_when()` would +skip that arm. By checking `any_missing()` first, we catch all missing +values before they fall through the logic. + +**Why `get_priority_missing()` and not just `NA`?** When multiple inputs +are missing, we want the most informative missing code. If one input is +NA::a (not applicable) but another is NA::b (not stated), the output +should be NA::b — because the data was collected but missing, which is +different from the question not applying. + +### Step 3: Clean outputs + +```r +output_clean <- clean_variables( + vars = list(SMKDSTY_cat3 = result), + output_format = output_format # Pass through user's requested format +) +output_clean$SMKDSTY_cat3 +``` + +**What this does:** + +1. Validates that output values fall within the expected range defined in + `variable_details.csv` for the output variable. +2. Converts the output to the user's requested format. If the user asked + for "original" codes (numeric 6/7/8/9), tagged NAs are converted back. 
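
The round trip between raw codes and tagged NAs can be sketched with `haven` alone. This is not how `clean_variables()` is implemented; in particular, mapping NA::b back to `9` is an assumption made for the sketch, since 7, 8 and 9 all collapse to the same tag on the way in.

```r
library(haven)

raw <- c(1, 2, 6, 9)

# Forward: raw codes to tagged NAs (6 = not applicable, 7/8/9 = not stated).
tagged <- raw
tagged[raw == 6] <- tagged_na("a")
tagged[raw %in% c(7, 8, 9)] <- tagged_na("b")

is_na   <- is.na(tagged)   # tagged NAs behave like ordinary NA
na_tags <- na_tag(tagged)  # but the reason for missingness is recoverable

# Back: tagged NAs to numeric codes. Choosing 9 for NA::b is an assumption
# of this sketch; the package may pick the code differently.
original <- tagged
original[is_tagged_na(tagged, "a")] <- 6
original[is_tagged_na(tagged, "b")] <- 9
```

Working on tagged values in the middle and converting only at the end is what lets the domain logic use a single missing-data check regardless of the format the user asked for.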
+ +**Why clean outputs?** It catches bugs where domain logic produces a value +outside the expected range, and it respects the user's choice of missing +data representation. + +## Missing data handling + +### Why tagged NAs? + +Base R has only one `NA` type. The CCHS has four missing categories: + +| CCHS code | Meaning | cchsflow tagged_na | +|-----------|---------|-------------------| +| 6 | Not applicable | `tagged_na("a")` | +| 7 | Don't know | `tagged_na("b")` | +| 8 | Refusal | `tagged_na("b")` | +| 9 | Not stated | `tagged_na("b")` | + +cchsflow collapses these to two: NA::a (not applicable) and NA::b +(missing/not stated). The `haven` package makes this work — tagged NAs +behave like regular NAs in arithmetic and comparisons, but the tag is +preserved for downstream analysis. + +See `vignette("tagged_na_usage")` for more background. + +### Key functions + +| Function | Purpose | Returns | +|----------|---------|---------| +| `any_missing(var1, var2, ...)` | Detect if any input has a missing value | Logical vector | +| `get_priority_missing(var1, var2, ...)` | Return highest-priority missing code | NA::b wins over NA::a | +| `assign_missing(type, var_name, output_format)` | Create a missing value of the right type | Tagged NA or numeric code | + +### Missing data priority + +NA::b (not stated) has higher priority than NA::a (not applicable). If any +input is NA::b, the output should be NA::b. This reflects that "data was +collected but missing" is more informative than "question didn't apply." + +### Pattern in code + +```r +# Always check missing first in case_when() +dplyr::case_when( + any_missing(cleaned$var1, cleaned$var2) ~ + get_priority_missing(cleaned$var1, cleaned$var2), + # ... domain logic ... 
+ .default = assign_missing("not_applicable", "output_var", output_format) +) +``` + +## Quality tiers + +### Bronze — ship it + +Minimum for working code: + +- Correct output for valid inputs +- Basic missing data handling (at minimum, NA passthrough) +- May use `if`/`else` or `if_else2()` +- May hardcode midpoint values or thresholds +- Roxygen with `@title`, `@param`, `@return`, basic `@examples` + +### Silver — solid + +Everything in bronze, plus: + +- No hardcoded values (use constants, lookup tables, or recEnd) +- Comprehensive roxygen with executable examples for: + - Scalar input + - Vector input + - Dataframe input (via `mutate()`) +- Standalone `rec_with_table()` examples in documentation +- `case_when()` instead of if-else chains + +### Gold — reference + +Everything in silver, plus: + +- Full 3-step architecture (clean_variables → logic → clean_variables) +- `any_missing()` / `get_priority_missing()` for missing data +- `assign_missing()` for explicit not-applicable returns +- Missing codes extracted from `variable_details.csv` — no hardcoded codes +- `haven::tagged_na()` for missing value representation +- Tidyverse naming conventions (verb-based function names) +- Tidyverse dependencies (`dplyr::case_when()`, etc.) +- Function works standalone (copy-paste without full cchsflow install) + +## Coding standards + +### Source-agnostic functions + +Functions use semantic parameter names (`height_m`, `weight_kg`, `age`) +not CCHS variable names (`HWTGHTM`, `HWTDHTM`). One function serves +both PUMF and Master — the worksheet routes different source variables +to the same parameters. This makes functions portable (copy-paste into +other systems) and eliminates code duplication. 
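
As a rough illustration of the source-agnostic idea, the sketch below uses a hypothetical `bmi_sketch()` (not the package's `calculate_bmi()`) and made-up rows. Only the column names differ between the two data frames.

```r
# Hypothetical source-agnostic function: parameters name the concept,
# not the CCHS variable. This is not the package's calculate_bmi().
bmi_sketch <- function(height_m, weight_kg) {
  weight_kg / height_m^2
}

# Illustrative rows; PUMF and Master use different column names.
pumf   <- data.frame(HWTGHTM = c(1.70, 1.80), HWTGWTK = c(65, 90))
master <- data.frame(HWTDHTM = c(1.70, 1.80), HWTDWTK = c(65, 90))

# The caller (in cchsflow, the worksheet) decides which columns feed
# which parameter; the function body never changes.
bmi_pumf   <- bmi_sketch(height_m = pumf$HWTGHTM,   weight_kg = pumf$HWTGWTK)
bmi_master <- bmi_sketch(height_m = master$HWTDHTM, weight_kg = master$HWTDWTK)
```

Because the function never mentions a CCHS variable name, the same body can be reused with any data that supplies height in metres and weight in kilograms.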
+ +Inside `clean_variables()` Step 1, map semantic names to a representative +CCHS variable for pattern lookup: + +```r +cleaned <- clean_variables( + vars = list(HWTGHTM = height_m, HWTGWTK = weight_kg), + output_format = "tagged_na" +) +``` + +### Naming + +- Function names use verbs: `calculate_`, `assess_`, `derive_` +- Follow pattern: `calculate_()` +- Use snake_case for all function and parameter names +- Parameter names are semantic (descriptive of the concept, not the CCHS variable) + +### Tidyverse + +- `dplyr::case_when()` replaces nested if-else chains +- `haven::tagged_na()` for missing value coding +- Do NOT use `if_else2()` — it is deprecated. Use `dplyr::if_else()` if + needed, but prefer `case_when()` for multi-branch logic. + +### Standalone functions + +Functions should work without the full cchsflow package installed. A +researcher should be able to copy-paste a function and its dependencies +into their own script. This means: + +- Use `dplyr::case_when()` not `case_when()` (namespace-qualify) +- Document which packages are needed +- Include self-contained examples + +### Input types + +Every function must work on: + +- **Scalar**: `calculate_foo(var1 = 1, var2 = 2)` +- **Vector**: `calculate_foo(var1 = c(1, 2, 3), var2 = c(4, 5, 6))` +- **Dataframe**: via `mutate()` — `df %>% mutate(result = calculate_foo(col1, col2))` + +This is achieved naturally by using `case_when()` which is vectorised. + +### Documentation (roxygen) + +Silver and gold tier require the template below. See also anti-patterns. 
+ +```r +#' @title Calculate [variable description] +#' +#' @description +#' [What the function does, in 1-2 sentences] +#' +#' @details +#' [Implementation approach, source variables, coverage notes] +#' +#' @param var1 [description] +#' @param var2 [description] +#' +#' @return [description of output type and range] +#' +#' @examples +#' # Scalar +#' calculate_foo(var1 = 1, var2 = 2) +#' +#' # Vector +#' calculate_foo(var1 = c(1, 2, 3), var2 = c(4, 5, 6)) +#' +#' # Dataframe +#' library(dplyr) +#' df <- data.frame(var1 = c(1, 2), var2 = c(3, 4)) +#' df %>% mutate(result = calculate_foo(var1, var2)) +#' +#' # Standalone with rec_with_table +#' result <- rec_with_table( +#' cchs2015_2016_p, +#' variables = variables, +#' variable_details = variable_details, +#' log = TRUE +#' ) +#' +#' @export +``` + +## Anti-patterns + +Common bugs found during review. Check for these when reviewing or +writing gold-tier functions. + +### Step 1 output_format pass-through + +**Wrong:** +```r +cleaned <- clean_variables(vars = list(...), output_format = output_format) +``` + +**Right:** +```r +# Step 1: always tagged_na so any_missing() works in Step 2 +cleaned <- clean_variables(vars = list(...), output_format = "tagged_na") +# Step 3: user's format +output_clean <- clean_variables(vars = list(...), output_format = output_format) +``` + +### Joint missing check on gate + source variables + +When a function filters by status then passes through a source variable, +checking both together in the first `any_missing()` arm short-circuits +the domain logic for respondents where the source is expected to be NA::a. + +**Wrong:** +```r +any_missing(cleaned$SMK_005, cleaned$SMKG040_cont) ~ + get_priority_missing(cleaned$SMK_005, cleaned$SMKG040_cont, ...) +``` + +A non-daily smoker (SMK_005=2) has SMKG040_cont=NA::a by design. The +joint check catches this before the status routing, returning NA::a via +`get_priority_missing()` instead of the `.default` arm. 
Result is the +same but the logic path is wrong and fragile. + +**Right:** Check the gate variable first, then check source within its +applicable arm: +```r +any_missing(cleaned$SMK_005) ~ + get_priority_missing(cleaned$SMK_005, ...), +cleaned$SMK_005 == 1 & !any_missing(cleaned$SMKG040_cont) ~ + cleaned$SMKG040_cont, +cleaned$SMK_005 == 1 & any_missing(cleaned$SMKG040_cont) ~ + get_priority_missing(cleaned$SMKG040_cont, ...), +.default = assign_missing("not_applicable", ...) +``` + +### Dead code after any_missing() catches all NAs + +`any_missing(status)` catches all tagged NAs including NA::a. A later +arm like `haven::is_tagged_na(status, "a")` is unreachable — the first +arm already matched. Remove redundant arms to avoid confusion. + +### source() in package files + +Do not use `tryCatch { source("R/helper.R") }` in R package files. +Package functions are loaded via NAMESPACE. Conditional `source()` blocks +are an anti-pattern that can double-load functions or mask package +versions. For standalone use, document dependencies in `@details`. diff --git a/.claude/skills/cchsflow-derive/docs/function-inventory.md b/.claude/skills/cchsflow-derive/docs/function-inventory.md new file mode 100644 index 00000000..94059c58 --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/function-inventory.md @@ -0,0 +1,82 @@ +# Function inventory + +All existing derived variable functions, mapped to patterns, levels, and +quality tiers. 
+ +## Modern functions (v3) + +| Function | File | Pattern | Level | Tier | Notes | +|----------|------|---------|-------|------|-------| +| `calculate_SMKDSTY_cat6` | smoking-status.R | Category grouping | L7 | Gold | 3 inputs → 6 categories | +| `calculate_SMKDSTY_A` | smoking-status.R | — | — | — | Deprecated alias for cat6 | +| `calculate_smoke_simple` | smoking-status.R | Category grouping | L7 | Gold | Uses nested helpers | +| `calculate_SMK_06A_cont` | smoking-cessation.R | Cat-to-continuous | L4 | Gold | Quit timing midpoints | +| `calculate_time_quit_smoking_complete` | smoking-cessation.R | Pathway branching | L6 | Gold | 5 pathways | +| `calculate_time_quit_smoking_daily` | smoking-cessation.R | Multi-source routing | L6 | Gold | Master > PUMF priority | +| `assess_quit_pathway` | smoking-cessation.R | Multi-source routing | L5 | Gold | Pathway classifier | +| `calculate_SMKG203_cont` | smoke-start.R | Multi-source routing | L5 | Gold | Filter: daily only | +| `calculate_SMKG207_cont` | smoke-start.R | Multi-source routing | L5 | Gold | Filter: former daily only | +| `calculate_SMKG040_cont` | smoke-start.R | Multi-source routing | L7 | Gold | Combines 203 + 207 | +| `calculate_age_start_smoking` | smoke-start.R | Pass-through | L3 | Gold | Via derive_passthrough | +| `calculate_age_first_cigarette` | smoke-start.R | Pass-through | L3 | Gold | Via derive_passthrough | +| `calculate_smoked_100_lifetime` | smoke-start.R | Pass-through | L3 | Gold | Via derive_passthrough | +| `calculate_cigs_per_day` | smoke-intensity.R | Multi-source routing | L7 | Gold | Status-based source routing | +| `calculate_pack_years` | smoke-pack-years.R | Formula calculation | L7 | Gold | Full decision tree | +| `calculate_pack_years_categorical` | smoke-pack-years.R | Category grouping | L7 | Gold | Uses PACK_YEARS_CONSTANTS | + +## Doc stub functions (worksheet-only, no R logic) + +These functions exist only to document that the variable is harmonised via +`rec_with_table()` 
without custom R code. They call `stop()` with a message. + +| Function | File | Variable | +|----------|------|----------| +| `calculate_SMKDSTY_cat5` | smoking-status.R | SMKDSTY_cat5 | +| `calculate_SMKDSTY_cat3` | smoking-status.R | SMKDSTY_cat3 | +| `calculate_SMK_005` | smoking-status.R | SMK_005 | +| `calculate_SMK_030` | smoking-status.R | SMK_030 | +| `calculate_SMK_01A` | smoking-status.R | SMK_01A | +| `calculate_SMKG040_cat` | smoke-start.R | SMKG040_cat | +| `calculate_SMKG203_cat` | smoke-start.R | SMKG203_cat | +| `calculate_SMKG207_cat` | smoke-start.R | SMKG207_cat | +| `calculate_SMK_207` | smoke-start.R | SMK_207 | +| `calculate_SMK_203` | smoke-start.R | SMK_203 | +| `calculate_SMK_204` | smoke-intensity.R | SMK_204 | +| `calculate_SMK_208` | smoke-intensity.R | SMK_208 | +| `calculate_SMK_05B` | smoke-intensity.R | SMK_05B | +| `calculate_SMK_05C` | smoke-intensity.R | SMK_05C | + +## Legacy functions (v2, to be deprecated) + +| Function | File | Pattern | Level | Tier | Modern replacement | +|----------|------|---------|-------|------|--------------------| +| `time_quit_smoking_fun` | smoking.R | Cat-to-continuous | L4 | Bronze | `calculate_time_quit_smoking_complete/daily` | +| `smoke_simple_fun` | smoking.R | Category grouping | L7 | Bronze | `calculate_smoke_simple` | +| `pack_years_fun` | smoking.R | Formula calculation | L7 | Bronze | `calculate_pack_years` | +| `SMKG040_fun` | smoking.R | Multi-source routing | L5 | Bronze | `calculate_SMKG040_cont` | +| `pack_years_fun_cat` | smoking.R | Category grouping | L2 | Bronze | `calculate_pack_years_categorical` | +| `SMKDSTY_fun` | smoking.R | Category grouping | L7 | Bronze | `calculate_SMKDSTY_cat6` | +| `SMKG203_fun` | smoking.R | Multi-source routing | L5 | Bronze | `calculate_SMKG203_cont` | +| `SMKG207_fun` | smoking.R | Multi-source routing | L5 | Bronze | `calculate_SMKG207_cont` | + +## Infrastructure functions (L1) + +| Function | File | Purpose | +|----------|------|---------| +| 
`clean_variables` | clean-variables.R | Input/output cleaning (steps 1 and 3) | +| `parse_range_notation` | clean-variables.R | Parse variable_details.csv range notation | +| `derive_passthrough` | clean-variables.R | L3 helper for pass-through functions | +| `any_missing` | missing-data-functions.R | Vectorised missing detection | +| `get_priority_missing` | missing-data-functions.R | Priority-based missing processor | +| `assign_missing` | missing-data-functions.R | Create typed missing values | + +## Helpers (not exported) + +| Function | File | Level | Purpose | +|----------|------|-------|---------| +| `smkg_age_midpoint` | smoking.R | L2 | Age-started category → midpoint lookup | +| `.calculate_pack_years_core` | smoke-pack-years.R | L7 | Pure arithmetic for pack-years | +| `process_missing_codes` | clean-variables.R | L1 | Internal missing code conversion | +| `convert_input_to_tagged_na` | clean-variables.R | L1 | Raw → tagged_na conversion | +| `detect_missing_vectorized` | missing-data-functions.R | L1 | Element-wise missing detection | +| `apply_priority_hierarchy` | missing-data-functions.R | L1 | Priority processing | diff --git a/.claude/skills/cchsflow-derive/docs/patterns/cat-to-continuous.md b/.claude/skills/cchsflow-derive/docs/patterns/cat-to-continuous.md new file mode 100644 index 00000000..5881d063 --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/patterns/cat-to-continuous.md @@ -0,0 +1,124 @@ +# Pattern: Categorical to continuous + +## What it is + +A function that converts categorical ranges into continuous values using +midpoint imputation. The input is a categorical variable (e.g., "1-2 years"), +the output is a continuous value (e.g., 1.5). 
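
The arithmetic behind midpoint imputation is small enough to sketch directly. The `ranges` table and `midpoint_for()` helper below are hypothetical; the bounds are typed by hand here only to keep the example self-contained.

```r
# Illustrative range bounds; in cchsflow these would come from the
# recEnd ranges in variable_details.csv, not from a hand-typed table.
ranges <- data.frame(
  category = c(1, 2, 3),
  lower    = c(0, 1, 2),
  upper    = c(1, 2, 3)
)
ranges$midpoint <- (ranges$lower + ranges$upper) / 2

# Hypothetical helper: look up the midpoint for each category code.
# Codes with no matching range come back as NA.
midpoint_for <- function(category) {
  ranges$midpoint[match(category, ranges$category)]
}

imputed <- midpoint_for(c(2, 1, 3, 2))
```

Category 2 covers "1-2 years", so its midpoint is (1 + 2) / 2 = 1.5, which matches the example in the paragraph above. Deriving midpoints from stored bounds keeps the values traceable to their source ranges.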
+ +## When to use + +- Input is categorical with ordered ranges +- Output is continuous (midpoint of each range) +- Examples: quit timing categories → years, age-started categories → age + +## How to recognise from worksheet + +``` +variable, variableStart, recEnd +SMK_06A_cont, DerivedVar::[SMK_06A_2003plus,SMKG06C], Func::calculate_SMK_06A_cont +``` + +The `_cont` suffix is a strong signal. DerivedVar with a categorical source +and optional continuous companion for the open-ended top category. + +## Bronze template + +```r +calculate_my_var_cont <- function(cat_var) { + ifelse(cat_var == 1, 0.5, + ifelse(cat_var == 2, 1.5, + ifelse(cat_var == 3, 2.5, + ifelse(cat_var == 4, 5.0, NA)))) +} +``` + +## Silver template + +```r +#' @title Calculate [continuous version of categorical variable] +#' +#' @description Converts categorical [variable] to continuous values using +#' midpoint imputation. +#' +#' @param cat_var Categorical input values. +#' @param continuous_companion Optional continuous value for the open-ended +#' top category (e.g., exact years from Master file). +#' +#' @return Numeric vector of midpoint-imputed values. 
+#' +#' @examples +#' # Scalar +#' calculate_my_var_cont(cat_var = 2) +#' +#' # Vector +#' calculate_my_var_cont(cat_var = c(1, 2, 3, 4)) +#' +#' # With continuous companion for top category +#' calculate_my_var_cont(cat_var = 4, continuous_companion = 7.3) +#' +#' @export +calculate_my_var_cont <- function(cat_var, continuous_companion = NA_real_) { + # Midpoints derived from variable_details.csv recEnd ranges + # Category 1: [0, 1) → 0.5 + # Category 2: [1, 2) → 1.5 + # Category 3: [2, 3) → 2.5 + # Category 4: [3, inf) → use companion if available, else 5.0 + dplyr::case_when( + cat_var == 1 ~ 0.5, + cat_var == 2 ~ 1.5, + cat_var == 3 ~ 2.5, + cat_var == 4 & !is.na(continuous_companion) ~ continuous_companion, + cat_var == 4 ~ 5.0, + .default = NA_real_ + ) +} +``` + +## Gold template + +```r +calculate_my_var_cont <- function(cat_var, continuous_companion = NULL, + output_format = "tagged_na") { + # Step 1: Clean inputs — always tagged_na for Step 2 + cleaned <- clean_variables( + vars = list(cat_var = cat_var, companion = continuous_companion), + output_format = "tagged_na" + ) + + # Step 2: Domain logic — midpoints from recEnd ranges + result <- dplyr::case_when( + any_missing(cleaned$cat_var) ~ + get_priority_missing(cleaned$cat_var), + cleaned$cat_var == 1 ~ 0.5, + cleaned$cat_var == 2 ~ 1.5, + cleaned$cat_var == 3 ~ 2.5, + cleaned$cat_var == 4 & !any_missing(cleaned$companion) ~ + cleaned$companion, + cleaned$cat_var == 4 ~ 5.0, + .default = assign_missing("not_applicable", "my_var_cont", + output_format) + ) + + # Step 3: Clean outputs — user's requested format + output_clean <- clean_variables( + vars = list(my_var_cont = result), + output_format = output_format + ) + output_clean$my_var_cont +} +``` + +## Reference implementations + +- `calculate_SMK_06A_cont()` — R/smoking-cessation.R (quit timing midpoints) +- `smkg_age_midpoint()` — R/smoking.R (L2 helper for age-started categories) +- `calculate_SMKG203_continuous()` — R/smoking.R (PUMF age-started) +
+## Common mistakes + +- Hardcoding midpoint values that should come from recEnd ranges in + `variable_details.csv` (acceptable at bronze, not at silver/gold) +- Forgetting the open-ended top category needs special handling (often has + a continuous companion from Master files) +- Not documenting where midpoint values come from diff --git a/.claude/skills/cchsflow-derive/docs/patterns/category-grouping.md b/.claude/skills/cchsflow-derive/docs/patterns/category-grouping.md new file mode 100644 index 00000000..f8c5b4a9 --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/patterns/category-grouping.md @@ -0,0 +1,113 @@ +# Pattern: Category grouping + +## What it is + +A function that collapses or remaps multiple input categories into fewer +output categories. The core logic is a mapping table implemented as +`case_when()` arms. + +## When to use + +- Input is categorical, output is categorical with fewer levels +- The function maps N input categories to M output categories (M < N) +- Examples: 6-category smoking status, ADL difficulty grouping, alcohol + frequency banding + +## How to recognise from worksheet + +``` +variable, variableStart, recEnd +SMKDSTY_cat6, DerivedVar::[SMK_005,SMK_030,SMK_01A], Func::calculate_SMKDSTY_cat6 +``` + +DerivedVar with multiple categorical feeders, output is also categorical. + +## Bronze template + +```r +calculate_my_grouped_var <- function(input_var) { + ifelse(input_var %in% c(1, 2), 1L, + ifelse(input_var %in% c(3, 4), 2L, + ifelse(input_var %in% c(5, 6), 3L, NA))) +} +``` + +## Silver template + +```r +#' @title Calculate [grouped variable] +#' +#' @description Collapses [source] categories into [N] groups. +#' +#' @param input_var [Source variable] values. +#' +#' @return Integer vector: 1 = [group1], 2 = [group2], 3 = [group3]. 
+#' +#' @examples +#' # Scalar +#' calculate_my_grouped_var(input_var = 3) +#' +#' # Vector +#' calculate_my_grouped_var(input_var = c(1, 3, 5, NA)) +#' +#' # Dataframe +#' library(dplyr) +#' df <- data.frame(source = c(1, 2, 3, 4, 5, 6)) +#' df %>% mutate(grouped = calculate_my_grouped_var(source)) +#' +#' @export +calculate_my_grouped_var <- function(input_var) { + dplyr::case_when( + input_var %in% c(1, 2) ~ 1L, + input_var %in% c(3, 4) ~ 2L, + input_var %in% c(5, 6) ~ 3L, + .default = NA_integer_ + ) +} +``` + +## Gold template + +```r +calculate_my_grouped_var <- function(input_var, + output_format = "tagged_na") { + # Step 1: Clean inputs — always tagged_na for Step 2 + cleaned <- clean_variables( + vars = list(input_var = input_var), + output_format = "tagged_na" + ) + + # Step 2: Domain logic + result <- dplyr::case_when( + any_missing(cleaned$input_var) ~ + get_priority_missing(cleaned$input_var), + cleaned$input_var %in% c(1, 2) ~ 1L, + cleaned$input_var %in% c(3, 4) ~ 2L, + cleaned$input_var %in% c(5, 6) ~ 3L, + .default = assign_missing("not_applicable", "my_grouped_var", + output_format) + ) + + # Step 3: Clean outputs — user's requested format + output_clean <- clean_variables( + vars = list(my_grouped_var = result), + output_format = output_format + ) + output_clean$my_grouped_var +} +``` + +## Reference implementations + +- `calculate_SMKDSTY_cat6()` — R/smoking-status.R (3 inputs → 6 categories) +- `calculate_smoke_simple()` — R/smoking-status.R (2 inputs → 4 categories, + uses nested helper variables) + +## Common mistakes + +- Forgetting the `.default` arm in `case_when()` — leaves unmatched values as NA + without tracking whether it's NA::a or NA::b +- Not checking `any_missing()` first — missing inputs should propagate, not + fall through to a category +- Hardcoding category labels instead of using integer codes that match + `output_format.csv` diff --git a/.claude/skills/cchsflow-derive/docs/patterns/formula-calculation.md 
b/.claude/skills/cchsflow-derive/docs/patterns/formula-calculation.md new file mode 100644 index 00000000..ae2ffc70 --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/patterns/formula-calculation.md @@ -0,0 +1,152 @@ +# Pattern: Formula calculation + +## What it is + +A function that computes a derived value from multiple inputs using +arithmetic formulas. The logic is mathematical rather than categorical +mapping. Functions use **semantic parameter names** and are +**source-agnostic** — the same function serves both PUMF and Master +data via worksheet routing. + +## When to use + +- Output is computed from multiple inputs via arithmetic +- Examples: pack-years (age, start age, quit time, cigarettes/day), BMI + (height, weight), alcohol grams/day + +## Source-agnostic design + +Functions take semantic parameters (`height_m`, `weight_kg`, `age`) +not CCHS variable names (`HWTGHTM`, `HWTDHTM`). The worksheet routes +different source variables to the same function depending on database +type: + +``` +# Both rows call the SAME function with different feeder variables +variable, databaseStart, variableStart, recEnd +HWTGBMI_der, cchs2001_p, DerivedVar::[HWTGHTM, HWTGWTK], Func::calculate_bmi +HWTDBMI_der, cchs2001_m, DerivedVar::[HWTDHTM, HWTDWTK], Func::calculate_bmi +``` + +The function is portable — it can be copy-pasted into other systems +that have height in metres and weight in kilograms. + +## How to recognise from worksheet + +``` +variable, variableStart, recEnd +pack_years_der, DerivedVar::[SMKDSTY_A,DHHGAGE_cont,...], Func::calculate_pack_years +``` + +DerivedVar with many feeders and a `Func::` pointing to a calculation +function. The feeder list is typically longer than for other patterns. 
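The recognition rule above can be checked programmatically. The following is a minimal sketch, assuming only the `DerivedVar::[...]` and `Func::` notations shown in the worksheet excerpts; `parse_derived_var()` and `parse_func()` are hypothetical helper names, not existing cchsflow functions.

```r
# Hypothetical helper: extract feeder names from a variableStart entry.
# Returns character(0) when the entry is not in DerivedVar::[...] form.
parse_derived_var <- function(variable_start) {
  variable_start <- trimws(variable_start)
  pattern <- "^DerivedVar::\\[(.*)\\]$"
  if (!grepl(pattern, variable_start)) {
    return(character(0))
  }
  inner <- sub(pattern, "\\1", variable_start)
  trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
}

# Hypothetical helper: extract the function name from a recEnd entry.
# Returns NA when the entry is not in Func:: form (e.g. "copy").
parse_func <- function(rec_end) {
  rec_end <- trimws(rec_end)
  if (!startsWith(rec_end, "Func::")) {
    return(NA_character_)
  }
  sub("^Func::", "", rec_end)
}

feeders <- parse_derived_var("DerivedVar::[HWTGHTM, HWTGWTK]")
func_name <- parse_func("Func::calculate_bmi")
```

A row whose parsed feeder list has several entries and whose `recEnd` resolves to a function name is a candidate for this pattern; a row that returns `character(0)` is more likely a pass-through.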
+
+## Bronze template
+
+```r
+calculate_bmi <- function(height_m, weight_kg) {
+  result <- weight_kg / (height_m ^ 2)
+  ifelse(result < 10 | result > 60, NA, result)
+}
+```
+
+## Silver template
+
+```r
+#' @title Calculate BMI
+#'
+#' @description Calculates body mass index from height (metres) and
+#'   weight (kilograms).
+#'
+#' @param height_m Height in metres.
+#' @param weight_kg Weight in kilograms.
+#'
+#' @return Numeric vector of BMI values (kg/m^2). Returns NA when either
+#'   input is missing or height is not positive.
+#'
+#' @examples
+#' # Scalar
+#' calculate_bmi(height_m = 1.75, weight_kg = 70)
+#'
+#' # Vector
+#' calculate_bmi(height_m = c(1.60, 1.75, 1.80),
+#'               weight_kg = c(55, 70, 90))
+#'
+#' # Dataframe
+#' library(dplyr)
+#' df <- data.frame(ht = c(1.60, 1.75), wt = c(55, 70))
+#' df %>% mutate(bmi = calculate_bmi(ht, wt))
+#'
+#' @export
+calculate_bmi <- function(height_m, weight_kg) {
+  dplyr::case_when(
+    is.na(height_m) | is.na(weight_kg) ~ NA_real_,
+    height_m <= 0 ~ NA_real_,
+    TRUE ~ weight_kg / (height_m ^ 2)
+  )
+}
+```
+
+## Gold template
+
+```r
+calculate_bmi <- function(height_m, weight_kg,
+                          output_format = "tagged_na") {
+  # Step 1: Clean inputs — use a representative CCHS variable name for
+  # pattern lookup. When called via rec_with_table(), inputs are already
+  # pre-cleaned; Step 1 is a safety net for direct callers.
+ cleaned <- clean_variables( + vars = list(HWTGHTM = height_m, HWTGWTK = weight_kg), + output_format = "tagged_na" + ) + + ht <- cleaned$HWTGHTM + wt <- cleaned$HWTGWTK + + # Step 2: Domain logic + result <- dplyr::case_when( + any_missing(ht, wt) ~ + get_priority_missing(ht, wt, output_format = output_format), + ht <= 0 ~ + assign_missing("not_stated", "HWTGBMI_der", output_format), + .default = wt / (ht ^ 2) + ) + + # Step 3: Clean outputs — user's requested format + output_clean <- clean_variables( + vars = list(HWTGBMI_der = result), + output_format = output_format + ) + output_clean$HWTGBMI_der +} +``` + +**Step 1 mapping note:** The `vars` list keys are CCHS variable names +used for missing code pattern lookup in `variable_details.csv`. The +function's semantic parameter names (`height_m`, `weight_kg`) are the +external API; internally, `clean_variables()` needs a known variable +name to find the right missing codes. Pick a representative CCHS +variable that covers both PUMF and Master patterns — like +`calculate_pack_years()` uses `DHHGAGE_cont` for age regardless of +whether the input comes from PUMF or Master. 
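To make the mapping note concrete, here is a toy sketch of a name-keyed lookup. It is not the real `clean_variables()`: `toy_clean()` and the `missing_codes` table are hypothetical, and the code values are placeholders invented for illustration rather than taken from `variable_details.csv`.

```r
# Placeholder missing codes keyed by CCHS variable name (invented values).
missing_codes <- list(
  HWTGHTM = c(9.996, 9.999),
  HWTGWTK = c(999.96, 999.99)
)

# Toy stand-in for the lookup step: the list names select which codes apply.
toy_clean <- function(vars) {
  Map(function(values, name) {
    codes <- missing_codes[[name]]
    if (is.null(codes)) {
      stop("No missing-code pattern registered for: ", name)
    }
    values[values %in% codes] <- NA
    values
  }, vars, names(vars))
}

# Keys are CCHS names, so the lookup succeeds.
cleaned <- toy_clean(list(HWTGHTM = c(1.75, 9.999), HWTGWTK = c(70, 999.96)))

# Keys are semantic parameter names, so the lookup has nothing to match.
semantic_attempt <- try(toy_clean(list(height_m = 1.75)), silent = TRUE)
```

In this sketch the semantic name `height_m` fails because no pattern is registered under it, which is the reason the template keys its `vars` list by a representative CCHS variable instead.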
+ +## Reference implementations + +- `calculate_bmi()` — R/bmi.R (source-agnostic, semantic params, + worksheet routes PUMF/Master to same function) +- `calculate_pack_years()` — R/smoke-pack-years.R (complex, 6 status + pathways with different formulas per pathway) +- `.calculate_pack_years_core()` — R/smoke-pack-years.R (internal, pure + arithmetic with `pmax()` floor values) +- `calculate_pack_years_categorical()` — R/smoke-pack-years.R (formula + output → categorical binning) + +## Common mistakes + +- Not handling division by zero (height == 0 for BMI, duration == 0 for rates) +- Forgetting that formula inputs may themselves be derived variables that + carry missing data — check `any_missing()` on all inputs +- Hardcoding floor/ceiling values instead of using named constants + (see `PACK_YEARS_CONSTANTS` for the right approach) +- For complex formulas with status-based branching (like pack-years), + consider extracting the core arithmetic into an internal helper function diff --git a/.claude/skills/cchsflow-derive/docs/patterns/multi-source-routing.md b/.claude/skills/cchsflow-derive/docs/patterns/multi-source-routing.md new file mode 100644 index 00000000..1615f1ff --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/patterns/multi-source-routing.md @@ -0,0 +1,155 @@ +# Pattern: Multi-source routing + +## What it is + +A function that selects the best available value from multiple sources, +typically prioritising Master (exact continuous) over PUMF (midpoint +imputed). May also filter by smoking status to determine applicability. + +**Note:** Do not confuse multi-source routing with PUMF/Master +variants. When PUMF and Master use the same formula with different +input variables, use ONE source-agnostic function with semantic params +(see formula-calculation pattern). Multi-source routing is for when a +single function needs to choose between multiple available sources at +runtime (e.g., prefer continuous over midpoint when both are present). 
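A minimal base-R sketch of runtime routing, assuming two numeric sources for the same concept. `route_with_provenance()` is a hypothetical name; it also records which source supplied each value, which is an addition for illustration and not something the templates in this document do.

```r
# Hypothetical sketch: prefer the Master value when present, otherwise fall
# back to the PUMF midpoint, and record where each value came from.
route_with_provenance <- function(source_pumf,
                                  source_master = rep(NA_real_,
                                                      length(source_pumf))) {
  use_master <- !is.na(source_master)
  use_pumf <- !use_master & !is.na(source_pumf)

  value <- ifelse(use_master, source_master, source_pumf)
  origin <- ifelse(use_master, "master",
                   ifelse(use_pumf, "pumf", NA_character_))

  data.frame(value = value, origin = origin, stringsAsFactors = FALSE)
}

routed <- route_with_provenance(
  source_pumf = c(1.5, 2.5, NA),
  source_master = c(1.3, NA, NA)
)
```

The choice is made per respondent at call time, which is what separates this pattern from worksheet routing, where the source is fixed per database row before the function is called.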
+
+## When to use
+
+- Multiple sources provide the same information at different precision levels
+- A priority chain determines which source to use
+- May include status-based filtering (L5 variant)
+- Examples: quit timing (Master exact years vs PUMF midpoint), age started
+  daily (daily smokers only vs former daily smokers only)
+
+## How to recognise from worksheet
+
+```
+variable, variableStart, recEnd
+time_quit_smoking_daily, DerivedVar::[SMKDSTY_cat5,SMK_09A_cont,SMK_09C], Func::calculate_time_quit_smoking_daily
+```
+
+DerivedVar with multiple sources that represent the same concept at
+different precision or from different file types. The function implements
+the priority logic.
+
+## Bronze template
+
+```r
+calculate_my_routed_var <- function(source_pumf, source_master = NULL) {
+  # NULL (the default) means no Master source was supplied; treat it as
+  # all-missing so ifelse() returns one value per respondent
+  if (is.null(source_master)) source_master <- rep(NA, length(source_pumf))
+  ifelse(!is.na(source_master), source_master,
+         ifelse(!is.na(source_pumf), source_pumf, NA))
+}
+```
+
+## Silver template
+
+```r
+#' @title Calculate [routed variable]
+#'
+#' @description Selects the best available value from Master (exact) or
+#'   PUMF (midpoint) sources.
+#'
+#' @param source_pumf PUMF midpoint-imputed value.
+#' @param source_master Master exact continuous value (may be NULL/NA if
+#'   working with PUMF data).
+#'
+#' @return Numeric vector using Master when available, PUMF otherwise.
+#'
+#' @examples
+#' # Scalar — PUMF only
+#' calculate_my_routed_var(source_pumf = 1.5)
+#'
+#' # Scalar — Master available
+#' calculate_my_routed_var(source_pumf = 1.5, source_master = 1.3)
+#'
+#' # Vector — mixed availability
+#' calculate_my_routed_var(
+#'   source_pumf = c(1.5, 2.5, 0.5),
+#'   source_master = c(1.3, NA, 0.4)
+#' )
+#'
+#' @export
+calculate_my_routed_var <- function(source_pumf, source_master = NULL) {
+  # NULL (the default) means no Master source was supplied; treat it as
+  # missing so the PUMF-only example above returns a value
+  if (is.null(source_master)) source_master <- NA_real_
+
+  dplyr::case_when(
+    !is.na(source_master) ~ source_master,
+    !is.na(source_pumf) ~ source_pumf,
+    .default = NA_real_
+  )
+}
+```
+
+## Gold template
+
+```r
+calculate_my_routed_var <- function(status, source_pumf,
+                                    source_master = NULL,
+                                    output_format = "tagged_na") {
+  # Step 1: Clean inputs — always tagged_na for Step 2
+  cleaned <- clean_variables(
+    vars = list(
+      status = status,
+      source_pumf = source_pumf,
+      source_master = source_master
+    ),
+    output_format = "tagged_na"
+  )
+
+  # Step 2: Domain logic — priority chain with universe check
+  result <- dplyr::case_when(
+    # Universe check: only applicable to certain statuses
+    any_missing(cleaned$status) ~
+      get_priority_missing(cleaned$status),
+    cleaned$status %in% c(1, 2, 3, 6) ~
+      assign_missing("not_applicable", "my_routed_var", output_format),
+
+    # Priority: Master exact > PUMF midpoint
+    !any_missing(cleaned$source_master) ~ cleaned$source_master,
+    !any_missing(cleaned$source_pumf) ~ cleaned$source_pumf,
+
+    # All sources missing
+    .default = get_priority_missing(cleaned$source_pumf,
+                                    cleaned$source_master)
+  )
+
+  # Step 3: Clean outputs — user's requested format
+  output_clean <- clean_variables(
+    vars = list(my_routed_var = result),
+    output_format = output_format
+  )
+  output_clean$my_routed_var
+}
+```
+
+## L5 variant: Status-based filtering
+
+When the function extracts a subset based on status (not priority routing):
+
+```r
+# Only daily smokers get age-started-daily; others get NA::a
+result <- dplyr::case_when(
+  any_missing(cleaned$status) ~
get_priority_missing(cleaned$status), + cleaned$status == 1 ~ cleaned$age_started, + .default = assign_missing("not_applicable", "var_name", output_format) +) +``` + +## Reference implementations + +- `calculate_time_quit_smoking_daily()` — R/smoking-cessation.R (Master + exact years > PUMF midpoint, with universe check) +- `calculate_SMKG040_cont()` — R/smoke-start.R (combines daily + former + daily age-started with priority) +- `calculate_SMKG203_cont()` — R/smoke-start.R (L5 variant: filters for + current daily smokers only) +- `calculate_SMKG207_cont()` — R/smoke-start.R (L5 variant: filters for + former daily smokers only) + +## Common mistakes + +- Forgetting the universe check — not every respondent should get a value. + Former-only variables should return NA::a for current/never smokers. +- Using `is.na()` instead of `any_missing()` — `is.na()` doesn't detect + tagged NAs properly in some contexts +- Not documenting which source takes priority and why +- Mixing up `!any_missing()` (value IS available) with `any_missing()` + (value IS missing) — easy to invert the logic diff --git a/.claude/skills/cchsflow-derive/docs/patterns/pass-through.md b/.claude/skills/cchsflow-derive/docs/patterns/pass-through.md new file mode 100644 index 00000000..10183916 --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/patterns/pass-through.md @@ -0,0 +1,87 @@ +# Pattern: Pass-through + +## What it is + +A function that passes a single source variable through with cleaning and +validation. No domain logic — the worksheet handles routing the correct +source variable for each database/cycle. 
+ +## When to use + +- `variableStart` is a single source name (no `DerivedVar::`) +- `recEnd` is `copy` or a simple value remap +- The function just needs to clean and validate the input + +## How to recognise from worksheet + +``` +variable, variableStart, recEnd +age_start_smoking, SMKG040_cont, copy +age_start_smoking, SMK_040, copy +``` + +Multiple rows for the same variable with different `databaseStart` ranges — +the worksheet routes the right source, the function just passes through. + +## Bronze template + +```r +calculate_my_var <- function(input_var) { + input_var +} +``` + +## Silver template + +```r +#' @title Calculate [variable description] +#' +#' @description Pass-through variable. The worksheet routes the appropriate +#' source variable; this function cleans and validates the input. +#' +#' @param input_var Source variable value(s). +#' +#' @return Cleaned value(s). +#' +#' @examples +#' # Scalar +#' calculate_my_var(input_var = 25) +#' +#' # Vector +#' calculate_my_var(input_var = c(15, 20, 25, NA)) +#' +#' # Dataframe +#' library(dplyr) +#' df <- data.frame(source = c(15, 20, 25)) +#' df %>% mutate(result = calculate_my_var(source)) +#' +#' @export +calculate_my_var <- function(input_var, + output_format = "tagged_na") { + derive_passthrough(input_var, "my_var", output_format) +} +``` + +## Gold template + +Same as silver — pass-through functions are inherently simple. Gold adds +namespace-qualified calls and explicit dependency documentation. 
+ +```r +calculate_my_var <- function(input_var, + output_format = "tagged_na") { + derive_passthrough(input_var, "my_var", output_format) +} +``` + +## Reference implementations + +- `calculate_age_start_smoking()` — R/smoke-start.R +- `calculate_age_first_cigarette()` — R/smoke-start.R +- `calculate_smoked_100_lifetime()` — R/smoke-start.R + +## Common mistakes + +- Writing domain logic that belongs in the worksheet `recEnd` column +- Forgetting the `output_format` parameter +- Not using `derive_passthrough()` (reimplementing the cleaning logic) diff --git a/.claude/skills/cchsflow-derive/docs/patterns/pathway-branching.md b/.claude/skills/cchsflow-derive/docs/patterns/pathway-branching.md new file mode 100644 index 00000000..037ac70a --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/patterns/pathway-branching.md @@ -0,0 +1,167 @@ +# Pattern: Pathway branching + +## What it is + +A function that routes to different sources or calculations based on a +respondent's pathway through a complex decision tree. The branching is +driven by status variables that determine which data is applicable and +how to interpret it. + +This is the most complex pattern. It combines elements of multi-source +routing and category grouping, but with pathway-aware logic that makes +the branching non-trivial. + +## When to use + +- Multiple pathways exist for the same concept (e.g., quit timing depends + on whether the person quit directly or reduced gradually) +- A gate variable determines which pathway applies +- Different formulas or sources apply to each pathway +- Examples: time since quitting (direct quit vs gradual reducer vs + occasional-only), cessation pathway assessment + +## How to recognise from worksheet + +``` +variable, variableStart, recEnd +time_quit_smoking_complete, DerivedVar::[SMKDSTY_cat5,SMK_10_gate,...,SMKDVSTP], Func::calculate_time_quit_smoking_complete +``` + +DerivedVar with many feeders including a gate/pathway variable AND a +Master priority source. 
The feeder list is long because each pathway
+needs its own source.
+
+## Bronze template
+
+Not recommended for this pattern. The branching logic is complex enough
+that skipping missing data handling creates subtle bugs. Start at silver.
+
+## Silver template
+
+```r
+#' @title Calculate [pathway-branched variable]
+#'
+#' @description Routes to the appropriate source based on [pathway variable].
+#'
+#' @param status Smoking status category.
+#' @param gate Pathway gate variable (1 = path A, 2 = path B).
+#' @param source_a Source for pathway A.
+#' @param source_b Source for pathway B.
+#' @param source_master Master exact value (priority when available).
+#'
+#' @return Numeric vector of routed values.
+#'
+#' @examples
+#' # Scalar — pathway A
+#' calculate_my_branched_var(status = 4, gate = 1, source_a = 3.5,
+#'                           source_b = NA)
+#'
+#' # Scalar — pathway B
+#' calculate_my_branched_var(status = 4, gate = 2, source_a = NA,
+#'                           source_b = 5.2)
+#'
+#' # Vector — mixed pathways
+#' calculate_my_branched_var(
+#'   status = c(4, 4, 5),
+#'   gate = c(1, 2, NA),
+#'   source_a = c(3.5, NA, 2.0),
+#'   source_b = c(NA, 5.2, NA)
+#' )
+#'
+#' @export
+calculate_my_branched_var <- function(status, gate, source_a, source_b,
+                                      source_master = NULL) {
+  # NULL (the default) means no Master source was supplied; treat it as
+  # missing so the examples above, which omit it, return values
+  if (is.null(source_master)) source_master <- NA_real_
+
+  dplyr::case_when(
+    # Master priority
+    !is.na(source_master) ~ source_master,
+    # Universe: only former smokers
+    status %in% c(1, 2, 3, 6) ~ NA_real_,
+    # Pathway A
+    gate == 1 ~ source_a,
+    # Pathway B
+    gate == 2 ~ source_b,
+    .default = NA_real_
+  )
+}
+```
+
+## Gold template
+
+```r
+calculate_my_branched_var <- function(status, gate, source_a, source_b,
+                                      source_master = NULL,
+                                      output_format = "tagged_na") {
+  # Step 1: Clean inputs — always tagged_na for Step 2
+  cleaned <- clean_variables(
+    vars = list(
+      status = status,
+      gate = gate,
+      source_a = source_a,
+      source_b = source_b,
+      source_master = source_master
+    ),
+    output_format = "tagged_na"
+  )
+
+  # Step 2: Domain logic — pathway-aware routing
+  result <-
dplyr::case_when( + # Missing status → propagate + any_missing(cleaned$status) ~ + get_priority_missing(cleaned$status), + + # Universe: not applicable to current/never smokers + cleaned$status %in% c(1, 2, 3, 6) ~ + assign_missing("not_applicable", "my_branched_var", output_format), + + # Master priority: exact value available → use it + !any_missing(cleaned$source_master) ~ cleaned$source_master, + + # Former occasional (no gate needed) → source_a + cleaned$status == 5 & !any_missing(cleaned$source_a) ~ + cleaned$source_a, + + # Former daily, pathway A (direct quit) + cleaned$status == 4 & cleaned$gate == 1 & + !any_missing(cleaned$source_a) ~ cleaned$source_a, + + # Former daily, pathway B (gradual reducer) + cleaned$status == 4 & cleaned$gate == 2 & + !any_missing(cleaned$source_b) ~ cleaned$source_b, + + # Former daily, no gate (early cycles) → fallback to source_a + cleaned$status == 4 & any_missing(cleaned$gate) & + !any_missing(cleaned$source_a) ~ cleaned$source_a, + + # All pathways exhausted + .default = get_priority_missing(cleaned$source_a, cleaned$source_b, + cleaned$source_master) + ) + + # Step 3: Clean outputs — user's requested format + output_clean <- clean_variables( + vars = list(my_branched_var = result), + output_format = output_format + ) + output_clean$my_branched_var +} +``` + +## Reference implementations + +- `calculate_time_quit_smoking_complete()` — R/smoking-cessation.R + (the canonical example: Master priority → occasional pathway → + daily/direct quit → daily/gradual reducer → 2001 fallback) +- `assess_quit_pathway()` — R/smoking-cessation.R (L5: classifies which + pathway a respondent follows) + +## Common mistakes + +- Not handling the "no gate" case for early cycles (2001-2005 don't have + the gate variable — need a fallback pathway) +- Forgetting that pathway variables may themselves be missing — + `any_missing(cleaned$gate)` is a valid condition, not an error +- Making the `.default` arm too aggressive — if all pathways failed, 
use + `get_priority_missing()` to propagate the best missing code, not just + NA::a +- Not documenting the pathway decision tree in `@details` — this pattern + is complex enough that future maintainers need a prose explanation diff --git a/.claude/skills/cchsflow-derive/docs/testing.md b/.claude/skills/cchsflow-derive/docs/testing.md new file mode 100644 index 00000000..fa9584de --- /dev/null +++ b/.claude/skills/cchsflow-derive/docs/testing.md @@ -0,0 +1,240 @@ +# Testing derived variable functions + +How to write, maintain, and diagnose tests for cchsflow derived variables. + +## Test types + +cchsflow uses two kinds of tests for derived variables: + +| Type | Location | What it checks | +|------|----------|----------------| +| **Unit tests** | `tests/testthat/test-.R` | Single function, scalar inputs, expected outputs | +| **Golden fixture tests** | `tests/testthat/test-recode-with-table.R` | Full `rec_with_table()` pipeline against saved RData snapshots | + +### Unit tests + +Each DV function should have unit tests covering: + +1. **Valid inputs** — representative values for each branch of the `case_when()` +2. **Out-of-range inputs** — values outside valid range, verify correct missing type +3. **Missing inputs** — `NA`, `tagged_na("a")`, `tagged_na("b")` as appropriate +4. 
**Edge cases** — boundary values, zero-length vectors + +Example (gold tier): + +```r +test_that("calculate_my_var returns correct value for status 1", { + result <- calculate_my_var(smoking_status = 1, age = 45, value = 20) + expect_equal(result, 25.0) +}) + +test_that("calculate_my_var returns NA::a for never smokers", { + result <- calculate_my_var(smoking_status = 6, age = 50, value = NA) + expect_true(haven::is_tagged_na(result, "a")) +}) + +test_that("calculate_my_var returns NA::b when input is missing", { + result <- calculate_my_var(smoking_status = 1, age = 45, + value = tagged_na("b")) + expect_true(haven::is_tagged_na(result, "b")) +}) +``` + +### Golden fixture tests + +`test-recode-with-table.R` runs `rec_with_table()` on sample PUMF data +(200-row `_p` datasets in `data/`) and compares every column against saved +"standard" RData files in `tests/testdata/rec_with_table_test_data.RData`. + +These are **regression tests** — they catch unintended changes but must be +regenerated when functions intentionally change. + +## Common failure patterns + +### Pattern 1: v2 string vs v3 tagged_na + +**Symptom**: `actual is a character vector ('NA(b)')`, `expected is a double vector (NA)` + +**Cause**: The function uses the v2 pattern (returns string `"NA(b)"`) but the +test expects v3 behaviour (`tagged_na("b")`). + +**Which is wrong?** Depends on the function's tier: +- **v2 (bronze/silver)** functions: the test expectation is aspirational. + Either downgrade the test to match v2 output, or upgrade the function to + gold tier. +- **v3 (gold)** functions: the function should return `tagged_na("b")` via + `assign_missing()`. If it returns string `"NA(b)"`, the function has a bug. + +**Example**: `low_drink_score_fun(-1, 1)` returns `"NA(b)"` (string) but +test expects `tagged_na("b")`. The function is v2; the test was written for +a v3 future that hasn't been implemented. 
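When diagnosing this failure it can help to confirm which representation a function actually returned. A small sketch, assuming the haven package is installed; `missing_style()` is a hypothetical diagnostic helper for scalar results, not part of cchsflow.

```r
library(haven)

# Hypothetical diagnostic: classify a scalar result by how it encodes
# missingness.
missing_style <- function(x) {
  if (is.character(x) && grepl("^NA\\([a-z]\\)$", x)) {
    return("v2 string")      # text that only looks like a missing value
  }
  if (is.double(x) && is_tagged_na(x)) {
    return("v3 tagged_na")   # a real NA carrying a tag
  }
  "other"
}

v2_result <- "NA(b)"
v3_result <- tagged_na("b")

style_v2 <- missing_style(v2_result)
style_v3 <- missing_style(v3_result)
tag_v3 <- na_tag(v3_result)
```

The v2 value is an ordinary string, so `is.na()` is `FALSE` for it, while the v3 value is a double `NA` whose tag is recovered with `na_tag()`. That difference is why `expect_true(haven::is_tagged_na(result, "b"))` fails against a v2 function.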
+ +### Pattern 2: Missing type distinction (NA::a vs NA::b) + +**Symptom**: `actual: "NA(a)"`, `expected: "NA(b)"` in golden fixtures. + +**Cause**: v3 infrastructure correctly distinguishes "not applicable" (the +question doesn't apply to this respondent) from "not stated" (the respondent +didn't answer). Old golden fixtures used `NA(b)` for both. + +**Which is wrong?** Usually the fixture is wrong. A never-smoker should get +`NA(a)` for pack-years, not `NA(b)`. Regenerate the fixtures after verifying +the function logic is correct. + +**Common variables affected**: `pack_years_cat`, `diet_score_cat3`, and any +derived variable whose input universe excludes certain respondent groups. + +### Pattern 3: Scoring range shift + +**Symptom**: Values are systematically offset (e.g., actual=0 where expected=5, +actual=2 where expected=3). + +**Cause**: The function's scoring logic was changed (e.g., from independence +score to needs-help count) but the unit test expectations and golden fixtures +weren't updated. + +**How to diagnose**: Check `git log --oneline -10 -- R/.R` to see if +the function was recently modified. Compare the current logic against the +test expectation to determine which is semantically correct. + +**Fix**: Update both the unit test AND the golden fixtures. These must stay +in sync. + +### Pattern 4: Calculation formula changes + +**Symptom**: Continuous values differ (e.g., actual=0.95, expected=1.90). + +**Cause**: The calculation formula was intentionally changed (e.g., fixing a +factor-of-2 error, changing midpoint imputation). Golden fixtures reflect the +old formula. + +**How to diagnose**: Check the git history for the calculation function. +Verify the new formula is correct against the CEP or design documentation. + +**Fix**: Regenerate golden fixtures after confirming the new formula is +correct. + +### Pattern 5: Golden fixture level padding + +**Symptom**: Factor levels differ in whitespace (e.g., `"1 "` vs `"1 "`). 
+
+**Cause**: Factor level widths are determined by the longest level string.
+When a longer level is introduced (for example, `"NA(b)"` where the column
+previously held only `"NA"`), padding changes for all levels in the column.
+
+**This is cosmetic** — it usually accompanies a real change (patterns 1-4).
+
+## When to regenerate golden fixtures
+
+Regenerate `rec_with_table_test_data.RData` when:
+
+1. A DV function's logic intentionally changed (new formula, scoring range)
+2. Missing data handling upgraded from v2 to v3 (string to tagged_na)
+3. New variables added to `variable_details.csv`
+4. Worksheet changes alter how `rec_with_table()` routes variables
+
+**Do NOT regenerate** to paper over unexpected failures. First confirm the
+function change is correct.
+
+### How to regenerate
+
+```r
+# Load current worksheets and sample data
+variables <- read.csv("inst/extdata/variables.csv")
+variable_details <- read.csv("inst/extdata/variable_details.csv")
+
+# Generate new standards for each cycle
+cchs2001Standard <- suppressWarnings(
+  rec_with_table(cchs2001_p,
+    variables = variables$variable,
+    variable_details = variable_details,
+    note = FALSE)
+)
+# Repeat for cchs2003_p, cchs2005_p, cchs2015_2016_p
+
+# Save
+save(variables, variable_details,
+     cchs2001Standard, cchs2003Standard,
+     cchs2005Standard, cchs2015Standard,
+     file = "tests/testdata/rec_with_table_test_data.RData")
+```
+
+**Always review the diff** between old and new fixtures. Every changed value
+should be explainable by a known function change.
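One way to review that diff is to compare the old and new standard data frames column by column. The sketch below uses base R only; `changed_columns()` is a hypothetical helper, and the two small data frames stand in for an old and a regenerated fixture.

```r
# Hypothetical helper: report which columns changed, appeared, or vanished
# between an old and a regenerated fixture.
changed_columns <- function(old, new) {
  shared <- intersect(names(old), names(new))
  differs <- vapply(
    shared,
    function(col) !identical(old[[col]], new[[col]]),
    logical(1)
  )
  list(
    changed = shared[differs],
    added = setdiff(names(new), names(old)),
    removed = setdiff(names(old), names(new))
  )
}

# Stand-in fixtures for illustration only.
old_fixture <- data.frame(a = c(1, 2), b = c("x", "y"),
                          stringsAsFactors = FALSE)
new_fixture <- data.frame(a = c(1, 3), b = c("x", "y"), c = c(TRUE, FALSE),
                          stringsAsFactors = FALSE)

report <- changed_columns(old_fixture, new_fixture)
```

Because `identical()` is strict about type and attributes, a column that moved from character `"NA(b)"` to a tagged double will be reported as changed, which is the intended behaviour when auditing a v2 to v3 upgrade.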
+ +## Writing tests for new DV functions + +### Bronze tier + +At minimum, test the happy path and one missing input: + +```r +test_that("calculate_my_var returns expected value", { + expect_equal(calculate_my_var(input = 25), 25) +}) +``` + +### Silver tier + +Add out-of-range, missing, and vector tests: + +```r +test_that("calculate_my_var handles vector input", { + result <- calculate_my_var(input = c(10, 20, NA)) + expect_equal(result[1:2], c(10, 20)) + expect_true(is.na(result[3])) +}) +``` + +### Gold tier + +Test every branch of the `case_when()`, verify missing type tags, and test +`output_format` parameter: + +```r +test_that("calculate_my_var returns NA::a for not-applicable status", { + result <- calculate_my_var(smoking_status = 6, value = NA) + expect_true(haven::is_tagged_na(result, "a")) +}) + +test_that("calculate_my_var returns NA::b when required input missing", { + result <- calculate_my_var(smoking_status = 1, + value = tagged_na("b")) + expect_true(haven::is_tagged_na(result, "b")) +}) + +test_that("calculate_my_var respects output_format = 'numeric'", { + result <- calculate_my_var(smoking_status = 6, value = NA, + output_format = "numeric") + # Should return the numeric missing code, not tagged_na + expect_false(haven::is_tagged_na(result)) +}) +``` + +### Testing the gate vs source pattern + +For functions with gate variables (e.g., smoking status) and source +variables (e.g., cigarettes per day), test these combinations: + +| Gate | Source | Expected | +|------|--------|----------| +| Valid, applies | Valid | Calculated value | +| Valid, applies | NA::b | NA::b (missing source) | +| Valid, does not apply | NA::a | NA::a (not applicable) | +| NA::b | Any | NA::b (missing gate) | +| NA::a | Any | NA::a (not applicable gate) | + +This catches the **joint missing check bug** where checking +`any_missing(gate, source)` together short-circuits for respondents where +the source is legitimately NA::a. 
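The table can be turned into a single vectorised check. This is a sketch under stated assumptions: it requires haven, `derive_gate_first()` is a hypothetical toy and not a cchsflow function, status code 6 is used as the "does not apply" gate value as in the test examples above, and passing the source through stands in for the real calculation.

```r
library(haven)

# Hypothetical toy: evaluate the gate before looking at the source, so a
# source that is legitimately NA::a never decides the outcome on its own.
derive_gate_first <- function(gate, source) {
  out <- source                              # valid value, or the source's own tag

  not_applicable <- !is.na(gate) & gate == 6 # gate is valid but does not apply
  out[not_applicable] <- tagged_na("a")

  gate_missing <- is.na(gate)                # gate itself is missing
  out[gate_missing] <- gate[gate_missing]    # propagate the gate's tag

  out
}

# One element per row of the table, in the same order.
gate   <- c(1, 1, 6, tagged_na("b"), tagged_na("a"))
source <- c(20, tagged_na("b"), tagged_na("a"), 15, 15)

result <- derive_gate_first(gate, source)
result_tags <- na_tag(result)
```

Writing the five rows as one vector keeps the expected tags side by side, which makes it easier to see when a refactor changes only one combination.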
+ +## Current test debt (as of 2026-03) + +The following pre-existing failures reflect the v2→v3 transition in +progress. They are tracked here so developers don't waste time +investigating known issues: + +| Tests | Category | Status | +|-------|----------|--------| +| `test-adl.R:46` (1 failure) | ADL scoring range changed (0-based vs 1-based) | Needs: decide correct scoring, update test | +| `test-alcohol.R:164-194` (6 failures) | Functions return v2 string `"NA(b)"`, tests expect v3 `tagged_na("b")` | Needs: upgrade functions to gold OR downgrade test expectations | From c267a380787f61e393d66605a46104af53365116 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 30 Mar 2026 20:23:23 -0400 Subject: [PATCH 10/15] feat(skill): Add R code/test triage to cchsflow-review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Triage step now detects when PRs touch R/ or tests/ files, flags that Step 7b package health check will run, and cross-references the cchsflow-derive done criteria for new or modified functions. Also strengthens GHA failure handling — treat failing CI as blocking. --- .claude/skills/cchsflow-review/SKILL.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index a97fc786..2e2fb22c 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -53,8 +53,9 @@ For PR reviews, run triage first: 1. **Get the diff** and identify which variables were modified in `variable_details.csv` and `variables.csv` 2. **Check `variables.csv` diff size** — if the entire file was rewritten (line count matches total rows), flag as potential formatting/schema change vs targeted edits -3. **Check GHA status** — have CI checks run? Are they passing? +3. **Check GHA status** — have CI checks run? Are they passing? 
If GHA ran and **failed**, treat this as blocking — diagnose the failure before proceeding with worksheet review. Common GHA failures (CSV formatting, R CMD check) indicate package-level issues that should be resolved first. 4. **Count modified variables** and group by domain +5. **Check for R/ and tests/ changes** — if `git diff origin/...HEAD --name-only` shows files under `R/` or `tests/testthat/`, the PR touches code, not just worksheets. Flag this in the triage output and note that **Step 7b (package health check)** will run. If R functions are new or substantially modified, the **cchsflow-derive** skill's done criteria (unit tests, R CMD check, roxygen, test coverage) also apply — see `.claude/skills/cchsflow-derive/SKILL.md` § "Done criteria". **Important:** `gh pr diff --stat` does not exist and `gh pr diff` does not support path filtering. Instead, check out the PR branch and use git directly: @@ -99,6 +100,8 @@ Print the proposed scope and triage summary clearly to the console, then proceed ``` Triage: Files changed: variables.csv (379+/379-), variable_details.csv (186+/186-) + R/ files changed: R/immigration.R (whitespace only) + Tests changed: tests/testthat/test-immigration.R (whitespace only) Variables modified: 302 total (8 in-scope, 294 out-of-scope) GHA checks: not run Full-file rewrite detected in variables.csv (likely formatting change) @@ -108,6 +111,7 @@ Proposed review scope: Database types: PUMF (_p) and Master (_m) Cycles: 2001 through 2017-2018 Out-of-scope: 294 other variables, column reordering + Package health: Step 7b will run (R/ files in diff) Proceeding with review. Interrupt to adjust scope. 
``` From a12784f0ef6d63a9fbf7d44a5de249ada5e3a4d5 Mon Sep 17 00:00:00 2001 From: Rafidul <134554829+rafdoodle@users.noreply.github.com> Date: Wed, 1 Apr 2026 19:39:16 -0400 Subject: [PATCH 11/15] Added information on variable naming conventions to cchsflow worksheet skill --- .claude/skills/cchsflow-worksheets/SKILL.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/.claude/skills/cchsflow-worksheets/SKILL.md b/.claude/skills/cchsflow-worksheets/SKILL.md index 5c05ae45..affd5507 100644 --- a/.claude/skills/cchsflow-worksheets/SKILL.md +++ b/.claude/skills/cchsflow-worksheets/SKILL.md @@ -53,6 +53,25 @@ Detailed documentation is in the `docs/` subdirectory: | `_s` | Share file | Synthetic datasets | | `_i` | ICES-linked (deprecated) | Replace with `_m` | +### Harmonized variable naming conventions + +Suffixes applied to the **harmonized** variable name in variables.csv (not source variable names): + +| Suffix | Meaning | Examples | +|--------|---------|---------| +| `_catN` | Grouped categorical with N categories | `DHHGAGE_cat5`, `DHHGAGE_cat8`, `ADL_01_cat4` | +| `_pre{year}` | Era-specific: cycles before year | `DHHGAGE_pre2005`, `SMK_10A_pre2015` | +| `_{year}plus` | Era-specific: cycles from year onward | `DHHGAGE_2005plus`, `SMK_10A_2015plus` | +| `_cont` | Continuous (midpoint-imputed or pass-through) | `DHHGAGE_cont`, `SMK_10A_cont` | +| `_der` | Derived via `Func::` function across all cycles | `DHHGAGE_der`, `ADL_der` | +| `_{scheme}` | Project/cohort-specific categorization | `DHH_MS_DemPoRT` | + +**Rules:** +- When two variables share the same base name and category count, use a scheme suffix to distinguish (e.g. `DHH_MS` vs `DHH_MS_DemPoRT`, both cat4). +- Era suffixes take precedence over `_catN` when the variable is cycle-restricted (e.g. `DHHGAGE_pre2005` not `DHHGAGE_cat15_pre2005`). +- `_der` is used without `_catN` even when the derived variable has a fixed number of categories. 
+- `dummyVariable` follows `{variable}_cat{N}_{value}` — the `_catN` in dummyVariable is independent of whether the variable name itself carries `_catN`. + ### PUMF vs Master row splitting When PUMF has grouped categorical and Master has true continuous source variables, rows must be split by database type. See [pumf-master-harmonization.md](docs/pumf-master-harmonization.md) for the full pattern. From 5b74bf8e57d4fffba7c2f086aad92d1fd81d7015 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 6 Apr 2026 11:28:37 -0400 Subject: [PATCH 12/15] feat(validation): Add scoped worksheet validation with performance fix MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Replace character-by-character .parse_csv_text() with vectorised readr::read_lines() + grep approach (variable_details.csv: infinite hang → 1.7s) - Add R/scope-worksheets.R with scope_worksheets() and parse_scope_args() for filtering by --subject or --variables - Update exec scripts to accept --subject/--variables CLI args - Scoped fix merges corrected rows back into full worksheets - Update 3 skills (review, validation, worksheets) with scoped usage docs Scoped check: ~0.2s for typical 15-20 variable subset. Full check: ~2s for entire variable_details.csv (3,700+ rows). 
--- .claude/skills/cchsflow-review/SKILL.md | 10 +- .claude/skills/cchsflow-validation/SKILL.md | 24 + .claude/skills/cchsflow-worksheets/SKILL.md | 28 +- R/check-worksheet.R | 604 ++++++++++++++++++++ R/fix-worksheet.R | 160 ++++++ R/scope-worksheets.R | 107 ++++ exec/check-worksheets.R | 96 ++++ exec/fix-worksheets.R | 152 +++++ 8 files changed, 1159 insertions(+), 22 deletions(-) create mode 100644 R/check-worksheet.R create mode 100644 R/fix-worksheet.R create mode 100644 R/scope-worksheets.R create mode 100644 exec/check-worksheets.R create mode 100644 exec/fix-worksheets.R diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index 2e2fb22c..9313b2ed 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -338,6 +338,13 @@ Read and follow `docs/csv-validation-and-fixes.md` for the full procedure. This - Visual diff review with Beyond Compare - Scope expansion during review +**Scoped validation (recommended):** Use `--subject` or `--variables` to limit checks to in-scope rows: +```bash +Rscript exec/check-worksheets.R --subject "Ethnicity,Language,Migration" +Rscript exec/fix-worksheets.R --variables "SDCGCGT,SDCFIMM" +``` +Scoped mode is faster (~0.2s vs ~2s) and filters out pre-existing issues in unrelated variables. Use full-file mode for final pre-merge checks. + ### Step 10: Scope expansion during review If the review identifies expansion opportunities and the user requests adding them, follow the scope expansion procedure in `docs/csv-validation-and-fixes.md`. @@ -372,7 +379,8 @@ Summarise the retrospective to the user. 
If skill updates are warranted, propose - Era mapping tables: `.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md` - Schema definitions: `inst/metadata/schemas/core/variables.yaml`, `inst/metadata/schemas/core/variable_details.yaml` - Regex patterns and naming conventions: `inst/metadata/documentation/metadata_registry.yaml` -- CSV formatting check/fix: `exec/check-worksheets.R`, `exec/fix-worksheets.R` (uses `R/check-worksheet.R`, `R/fix-worksheet.R`) +- CSV formatting check/fix: `exec/check-worksheets.R`, `exec/fix-worksheets.R` (uses `R/check-worksheet.R`, `R/fix-worksheet.R`). Supports `--subject` and `--variables` for scoped validation. +- Scope filtering: `R/scope-worksheets.R` (`scope_worksheets()`, `parse_scope_args()`) - CSV standardisation with schema validation: `R/csv-utils.R` (`standardise_csv()`), `R/schema-validation.R` (`validate_csv_against_schema()`) - Validation constants: `R/validation-constants.R` - GHA workflow for CSV checks: `.github/workflows/check-csv.yml` diff --git a/.claude/skills/cchsflow-validation/SKILL.md b/.claude/skills/cchsflow-validation/SKILL.md index 50ecfd63..cfe1068d 100644 --- a/.claude/skills/cchsflow-validation/SKILL.md +++ b/.claude/skills/cchsflow-validation/SKILL.md @@ -17,6 +17,30 @@ Run programmatic validation checks on cchsflow worksheets. This skill runs the s When invoked without arguments, validates the production worksheets at `inst/extdata/`. 
+### Scoped validation + +For development workflow, scope validation to in-scope variables instead of the full file: + +```bash +# By subject (matches the subject column in variables.csv) +Rscript exec/check-worksheets.R --subject "Ethnicity,Language,Migration" +Rscript exec/fix-worksheets.R --subject "Smoking" + +# By variable name +Rscript exec/check-worksheets.R --variables "SDCGCGT,SDCFIMM,SDCGLNG" + +# Combined (union of both filters) +Rscript exec/fix-worksheets.R --subject "Ethnicity" --variables "COPD_Emph_der" +``` + +Scoped mode extracts matching rows to temp files, runs checks/fixes on those, then (for fix) merges corrected rows back into the full worksheets. This reduces check time from ~2s (full file) to ~0.2s (scoped). + +**When to use scoped vs full:** +- **Scoped**: During development, PR review, iterative worksheet authoring +- **Full**: CI/GHA, pre-merge final check, after bulk edits + +The R functions `scope_worksheets()` and `parse_scope_args()` in `R/scope-worksheets.R` can also be called programmatically. 
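The extraction step described above (select rows by subject or by variable name, taking the union of both filters) can be sketched in base R. This is a minimal illustration only, not the implementation in `R/scope-worksheets.R`: the helper name `scope_rows_sketch()` is hypothetical, the toy rows (including `SMK_005`) are made up, and it assumes `variables.csv` carries `variable` and `subject` columns while `variable_details.csv` carries a `variable` column.

```r
# Hypothetical sketch of scoped row extraction; NOT the package's
# scope_worksheets(). Column names `variable` and `subject` are assumptions.
scope_rows_sketch <- function(variables, variable_details,
                              subjects = character(),
                              var_names = character()) {
  # Variables selected through the subject filter
  by_subject <- variables$variable[variables$subject %in% subjects]
  # Union of the subject filter and the explicit variable-name filter
  in_scope <- union(by_subject, var_names)
  list(
    variables = variables[variables$variable %in% in_scope, , drop = FALSE],
    variable_details = variable_details[
      variable_details$variable %in% in_scope, , drop = FALSE
    ]
  )
}

# Toy worksheets for illustration only
variables <- data.frame(
  variable = c("SDCFIMM", "SDCGCGT", "SMK_005"),
  subject = c("Migration", "Ethnicity", "Smoking"),
  stringsAsFactors = FALSE
)
variable_details <- data.frame(
  variable = c("SDCFIMM", "SDCFIMM", "SDCGCGT", "SMK_005"),
  recStart = c("1", "2", "1", "1"),
  stringsAsFactors = FALSE
)

scoped <- scope_rows_sketch(
  variables, variable_details,
  subjects = "Ethnicity", var_names = "SDCFIMM"
)
```

Filtering both sheets on the same set of variable names keeps the scoped `variables` and `variable_details` subsets consistent with each other, which is what a later merge-back step would rely on.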
+ ## Validation checks ### Check 1: CSV formatting diff --git a/.claude/skills/cchsflow-worksheets/SKILL.md b/.claude/skills/cchsflow-worksheets/SKILL.md index affd5507..02491784 100644 --- a/.claude/skills/cchsflow-worksheets/SKILL.md +++ b/.claude/skills/cchsflow-worksheets/SKILL.md @@ -53,25 +53,6 @@ Detailed documentation is in the `docs/` subdirectory: | `_s` | Share file | Synthetic datasets | | `_i` | ICES-linked (deprecated) | Replace with `_m` | -### Harmonized variable naming conventions - -Suffixes applied to the **harmonized** variable name in variables.csv (not source variable names): - -| Suffix | Meaning | Examples | -|--------|---------|---------| -| `_catN` | Grouped categorical with N categories | `DHHGAGE_cat5`, `DHHGAGE_cat8`, `ADL_01_cat4` | -| `_pre{year}` | Era-specific: cycles before year | `DHHGAGE_pre2005`, `SMK_10A_pre2015` | -| `_{year}plus` | Era-specific: cycles from year onward | `DHHGAGE_2005plus`, `SMK_10A_2015plus` | -| `_cont` | Continuous (midpoint-imputed or pass-through) | `DHHGAGE_cont`, `SMK_10A_cont` | -| `_der` | Derived via `Func::` function across all cycles | `DHHGAGE_der`, `ADL_der` | -| `_{scheme}` | Project/cohort-specific categorization | `DHH_MS_DemPoRT` | - -**Rules:** -- When two variables share the same base name and category count, use a scheme suffix to distinguish (e.g. `DHH_MS` vs `DHH_MS_DemPoRT`, both cat4). -- Era suffixes take precedence over `_catN` when the variable is cycle-restricted (e.g. `DHHGAGE_pre2005` not `DHHGAGE_cat15_pre2005`). -- `_der` is used without `_catN` even when the derived variable has a fixed number of categories. -- `dummyVariable` follows `{variable}_cat{N}_{value}` — the `_catN` in dummyVariable is independent of whether the variable name itself carries `_catN`. - ### PUMF vs Master row splitting When PUMF has grouped categorical and Master has true continuous source variables, rows must be split by database type. 
See [pumf-master-harmonization.md](docs/pumf-master-harmonization.md) for the full pattern. @@ -119,8 +100,13 @@ The RData files have fewer columns than the CSVs (16 vs 23, 10 vs 18). Extra met ### CSV validation before committing -```r +```bash +# Full file (CI/pre-merge) Rscript exec/fix-worksheets.R + +# Scoped to your working variables (faster, recommended during development) +Rscript exec/fix-worksheets.R --subject "Ethnicity,Language,Migration" +Rscript exec/check-worksheets.R --variables "SDCGCGT,SDCFIMM" ``` -This checks and fixes: excessive quoting, column order, empty trailing columns, CRLF line endings, unsorted rows. +This checks and fixes: excessive quoting, column order, empty trailing columns, CRLF line endings, unsorted rows. Scoped mode runs only on matching rows (~0.2s vs ~2s for the full file). diff --git a/R/check-worksheet.R b/R/check-worksheet.R new file mode 100644 index 00000000..c0763ba5 --- /dev/null +++ b/R/check-worksheet.R @@ -0,0 +1,604 @@ +#' Check a CSV worksheet for formatting errors +#' +#' @param file_path Path to the CSV file to check +#' @param file_type Type of file being checked. Either "variables" or +#' "variable_details". +#' +#' @return A list of errors found. Each error is a named list containing +#' information about the error. 
+#' +#' @export +#' +#' @examples +#' \dontrun{ +#' check_worksheet("inst/extdata/variables.csv", "variables") +#' } +check_worksheet <- function( + file_path, file_type = c("variables", "variable_details")) { + file_type <- match.arg(file_type) + + schema <- load_schema(file_type) + expected_columns <- schema$expected_column_order + + if (!file.exists(file_path)) { + return(list(.create_file_not_found_error(file_type, file_path))) + } + + csv_result <- tryCatch( + { + list( + data = read.csv( + file_path, stringsAsFactors = FALSE, check.names = FALSE), + error = NULL + ) + }, + error = function(e) { + list( + data = NULL, + error = .create_invalid_csv_error(file_type, file_path, e$message) + ) + } + ) + if (!is.null(csv_result$error)) { + return(list(csv_result$error)) + } + + column_order_errors <- .check_column_order( + csv_result$data, + expected_columns, + list(file_path = file_path, file_type = file_type) + ) + + # Check for missing ID column (always) and row sorting (only if multiple rows) + row_sorting_errors <- if (!is.null(schema$id_column_name)) { + id_column_name <- schema$id_column_name + if (!(id_column_name %in% colnames(csv_result$data))) { + list(.create_missing_id_column_error( + file_type, file_path, id_column_name + )) + } else if (nrow(csv_result$data) > 1) { + .check_row_sorting( + csv_result$data, + id_column_name, + list(file_path = file_path, file_type = file_type) + ) + } else { + list() + } + } else { + list() + } + + empty_column_errors <- .check_trailing_empty_columns( + csv_result$data, list(file_path = file_path, file_type = file_type) + ) + + raw_lines <- readr::read_lines(file_path) + line_ending_errors <- .check_line_endings( + raw_lines, list(file_path = file_path, file_type = file_type)) + + excessive_quote_errors <- .check_excessive_quoting( + raw_lines, list(file_path = file_path, file_type = file_type)) + + all_errors <- purrr::flatten(list( + line_ending_errors, + excessive_quote_errors, + column_order_errors, + 
row_sorting_errors, + empty_column_errors + )) + + return(all_errors) +} + +#' Check whether a worksheet has the correct line endings +#' +#' Uses vectorised grep on raw lines for performance. The raw file is read +#' with readr::read_lines() which strips LF but preserves trailing CR if +#' present, so CRLF lines end with \\r. +#' +#' @param raw_lines Character vector of lines from readr::read_lines() +#' @param error_ctx Information used when creating the error object. A named +#' list with the following fields: +#' * file_type: The type of worksheet the CSV contains. Can be "variables" or +#' "variable_details". +#' * file_path: The file path to the worksheet +#' +#' @return The list of line ending errors found in the worksheet +.check_line_endings <- function(raw_lines, error_ctx) { + crlf_rows <- grep("\r$", raw_lines) + purrr::map(crlf_rows, function(row_index) { + .create_line_ending_crlf_error( + error_ctx$file_type, error_ctx$file_path, row_index) + }) +} + +#' Check the column order in a worksheet +#' +#' @param csv_data A data.frame containing the worksheet rows +#' @param expected_columns The worksheet columns in their expected order +#' @param error_ctx Information used when creating the error object. A named +#' list with the following fields: +#' * file_type: The type of worksheet the CSV contains. Can be "variables" or +#' "variable_details". 
+#' * file_path: The file path to the worksheet +#' +#' @return The list of column order errors found in the worksheet +.check_column_order <- function(csv_data, expected_columns, error_ctx) { + actual_columns <- colnames(csv_data) + + column_order_errors <- seq_along(expected_columns) %>% + purrr::keep(function(expected_column_index) { + expected_column <- expected_columns[expected_column_index] + actual_column <- actual_columns[expected_column_index] + # Handle case where actual has fewer columns than expected + if (is.na(actual_column)) { + return(TRUE) + } + return(expected_column != actual_column) + }) %>% + purrr::map(function(missing_expected_column_index) { + expected_column <- expected_columns[missing_expected_column_index] + actual_column <- actual_columns[missing_expected_column_index] + # Use NA string if column doesn't exist + if (is.na(actual_column)) { + actual_column <- NA_character_ + } + return( + .create_column_order_error( + error_ctx$file_type, + error_ctx$file_path, + expected_column, + missing_expected_column_index, + actual_column + ) + ) + }) + return(column_order_errors) +} + +#' Check the row order in a worksheet +#' +#' @param csv_data A data.frame containing the worksheet rows +#' @param id_column_name Name of the column to check for sorting +#' @param error_ctx Information used when creating the error object. A named +#' list with the following fields: +#' * file_type: The type of worksheet the CSV contains. Can be "variables" or +#' "variable_details". 
+#' * file_path: The file path to the worksheet +#' +#' @return The list of row order errors found in the worksheet +.check_row_sorting <- function(csv_data, id_column_name, error_ctx) { + actual_sorting <- csv_data[[id_column_name]] + expected_sorting <- sort(actual_sorting) + if (identical(actual_sorting, expected_sorting)) { + return(list()) + } else { + return(list( + .create_unsorted_rows_error( + error_ctx$file_type, error_ctx$file_path, id_column_name + ) + )) + } +} + +#' Check a worksheet for trailing empty columns +#' +#' @param csv_data Data.frame containing the worksheet rows +#' @param error_ctx Information used when creating the error object. A named +#' list with the following fields: +#' * file_type: The type of worksheet the CSV contains. Can be "variables" or +#' "variable_details". +#' * file_path: The file path to the worksheet +#' +#' @return List of errors +.check_trailing_empty_columns <- function(csv_data, error_ctx) { + col_names <- colnames(csv_data) + + # Count consecutive empty strings from the end + reversed_names <- rev(col_names) + trailing_empty_count <- sum(cumsum(reversed_names != "") == 0) + + if (trailing_empty_count == 0) { + return(list()) + } + + num_cols <- length(col_names) + purrr::map(1:trailing_empty_count, function(i) { + col_position <- num_cols - trailing_empty_count + i + .create_trailing_empty_columns_error( + error_ctx$file_type, error_ctx$file_path, col_position) + }) +} + +#' Check a worksheet for excessive quoting +#' +#' Scans raw CSV lines for quoted fields that don't require quoting. A field +#' needs quoting only if it contains a comma, double-quote, newline, or +#' carriage return. Uses vectorised pre-filter: only lines containing a +#' double-quote are inspected field-by-field. +#' +#' @param raw_lines Character vector of lines from readr::read_lines() +#' @param error_ctx Information used when creating the error object. 
A named +#' list with the following fields: +#' * file_type: The type of worksheet the CSV contains. Can be "variables" or +#' "variable_details". +#' * file_path: The file path to the worksheet +#' +#' @return list of errors +.check_excessive_quoting <- function(raw_lines, error_ctx) { + # Pre-filter: only inspect lines that contain a quote character + has_quote <- grep('"', raw_lines, fixed = TRUE) + if (length(has_quote) == 0) return(list()) + + errors <- list() + for (row_index in has_quote) { + # Parse the single line into fields using R's CSV reader + fields <- tryCatch( + scan( + text = raw_lines[row_index], what = "", sep = ",", + quote = '"', quiet = TRUE, strip.white = FALSE + ), + error = function(e) NULL + ) + if (is.null(fields)) next + + # Now check which fields in the raw line are unnecessarily quoted. + # Re-split the raw line respecting CSV quoting to get raw field text. + raw_fields <- .split_csv_line(raw_lines[row_index]) + + for (col_index in seq_along(raw_fields)) { + field <- raw_fields[col_index] + if (nchar(field) >= 2 && + substr(field, 1, 1) == '"' && + substr(field, nchar(field), nchar(field)) == '"') { + content <- substr(field, 2, nchar(field) - 1) + content <- gsub('""', '"', content, fixed = TRUE) + if (!grepl('[,"\n\r]', content)) { + errors <- c(errors, list(.create_excessive_quoting_error( + error_ctx$file_type, error_ctx$file_path, + row_index, col_index, field + ))) + } + } + } + } + errors +} + +#' Create the error object for when the worksheet could not be found +#' +#' @param file_type: The type of worksheet. Can be "variables" or +#' "variable_details". 
+#' @param file_path The invalid path +#' +#' @return A named list +.create_file_not_found_error <- function(file_type, file_path) { + return(list( + error_type = "file_not_found", + file_type = file_type, + file_path = file_path, + message = glue::glue("{.pretty_print_file_type(file_type)} not found at {file_path}.") + )) +} + +#' Create the error for when the worksheet is not valid CSV +#' +#' @param file_type The type of worksheet. Can be "variables" or +#' "variable_details". +#' @param file_path Path to the worksheet +#' @param error_message Reason(s) for why the worksheet is invalid CSV +#' +#' @return A named list +.create_invalid_csv_error <- function(file_type, file_path, error_message) { + return(list( + error_type = "invalid_csv", + file_type = file_type, + file_path = file_path, + message = glue::glue("Invalid {.pretty_print_file_type(file_type)} at path {file_path}: {error_message}") + )) +} + +#' Create an error for when the worksheet has invalid line endings +#' +#' @param file_type The type of worksheet. Can be "variables" or +#' "variable_details". +#' @param file_path Path to the worksheet +#' @param row_num Index of the row with the invalid line ending +#' +#' @return A named list +.create_line_ending_crlf_error <- function(file_type, file_path, row_num) { + expected_line_ending <- "LF" + actual_line_ending <- "CRLF" + return(list( + error_type = "line_ending_crlf", + file_type = file_type, + file_path = file_path, + row_num = row_num, + expected_line_ending = expected_line_ending, + actual_line_ending = actual_line_ending, + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Row {row_num} has an invalid line ending. Expected {expected_line_ending} but found {actual_line_ending}.") + )) +} + +#' Create an error for when the worksheet has excessive quoting +#' +#' @param file_type The type of worksheet. Can be "variables" or +#' "variable_details". 
+#' @param file_path Path to the worksheet +#' @param row_num Row number with excessive quotes +#' @param col_num Column number with excessive quotes +#' @param cell_value Value of the cell with excessive quotes +#' +#' @return A named list +.create_excessive_quoting_error <- function( + file_type, file_path, row_num, col_num, cell_value) { + return(list( + error_type = "excessive_quoting", + file_type = file_type, + file_path = file_path, + row_num = row_num, + col_num = col_num, + cell_value = cell_value, + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Cell at row {row_num} and column {col_num} with value {cell_value} has excessive quoting.") + )) +} + +#' Create an error for when the worksheet columns are in the wrong order +#' +#' @param file_type: The type of worksheet. Can be "variables" or +#' "variable_details". +#' @param file_path Path to the worksheet +#' @param expected_column The column expected at the offending position +#' @param col_num The position of the column with the wrong header +#' @param actual_column The actual column value +#' +#' @return A named list +.create_column_order_error <- function( + file_type, file_path, expected_column, col_num, actual_column) { + return(list( + error_type = "column_order", + file_type = file_type, + file_path = file_path, + expected_column = expected_column, + col_num = col_num, + actual_column = actual_column, + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Incorrect column order. Expected column \"{expected_column}\" at column position {col_num} but found \"{actual_column}\".") + )) +} + +#' Create an error for when the worksheet is missing the ID column +#' +#' @param file_type: The type of worksheet. Can be "variables" or +#' "variable_details". 
+#' @param file_path Path to the worksheet +#' @param id_column_name Name of the expected ID column +#' +#' @return A named list +.create_missing_id_column_error <- function( + file_type, file_path, id_column_name) { + return(list( + error_type = "missing_id_column", + file_type = file_type, + file_path = file_path, + id_column_name = id_column_name, + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Missing required ID column \"{id_column_name}\".") + )) +} + +#' Create an error for when the worksheet rows are unsorted +#' +#' @param file_type: The type of worksheet. Can be "variables" or +#' "variable_details". +#' @param file_path Path to the worksheet +#' @param id_column_name Name of the column that should be sorted +#' +#' @return A named list +.create_unsorted_rows_error <- function(file_type, file_path, id_column_name) { + return(list( + error_type = "unsorted_rows", + file_type = file_type, + file_path = file_path, + id_column_name = id_column_name, + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Rows are not ordered by the {id_column_name} column.") + )) +} + +#' Create an error for when the worksheet has trailing empty columns +#' +#' @param file_type: The type of worksheet. Can be "variables" or +#' "variable_details". +#' @param file_path Path to the worksheet +#' @param col_num Position of the trailing empty column +#' +#' @return A named list +.create_trailing_empty_columns_error <- function( + file_type, file_path, col_num) { + return(list( + error_type = "empty_columns", + file_type = file_type, + file_path = file_path, + col_num = col_num, + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. 
Trailing empty column found at position {col_num}.") + )) +} + +#' Check variable_details.csv for recode block recStart collisions +#' +#' For variables with multiple recode blocks (distinct variableStart values), +#' checks whether the same recStart value appears in rows from more than one +#' block for the same database. This directly detects the condition that causes +#' rec_with_table() to match duplicate rows and produce incorrect output. +#' +#' Note: databaseStart overlap alone is not sufficient to flag an error because +#' cchsflow legitimately uses parallel PUMF and Master blocks with shared +#' databases but non-overlapping recStart ranges. +#' +#' @param file_path Path to variable_details.csv +#' +#' @return A list of errors found. Each error is a named list containing +#' information about the error. +#' +#' @export +#' +#' @examples +#' \dontrun{ +#' check_recode_blocks("inst/extdata/variable_details.csv") +#' } +check_recode_blocks <- function(file_path) { + if (!file.exists(file_path)) { + return(list(.create_file_not_found_error("variable_details", file_path))) + } + + vd <- tryCatch( + read.csv(file_path, stringsAsFactors = FALSE, check.names = FALSE), + error = function(e) NULL + ) + if (is.null(vd)) { + return(list()) + } + + required_cols <- c("variable", "variableStart", "databaseStart", "recStart") + if (!all(required_cols %in% names(vd))) { + return(list()) + } + + errors <- list() + all_vars <- unique(vd$variable) + + for (var in all_vars) { + rows <- vd[vd$variable == var, ] + + # Exclude Func:: rows — derived variable routers that legitimately span all + # databases. Only check actual recode rows. 
+ recode_rows <- rows[!grepl("^Func::", rows$recEnd), ] + blocks <- unique(recode_rows$variableStart) + + if (length(blocks) < 2) next + + # For each recode row, expand databaseStart into individual databases and + # build a lookup: (database, recStart) -> character vector of blocks + db_recstart_blocks <- list() + + for (vs in blocks) { + block_rows <- recode_rows[recode_rows$variableStart == vs, ] + dbs <- trimws(unlist(strsplit(block_rows$databaseStart[1], ","))) + + for (rec in block_rows$recStart) { + for (db in dbs) { + key <- paste0(db, "|||", rec) + db_recstart_blocks[[key]] <- unique(c(db_recstart_blocks[[key]], vs)) + } + } + } + + # Flag any (database, recStart) key present in more than one block + collision_keys <- names(db_recstart_blocks)[ + vapply(db_recstart_blocks, length, integer(1)) > 1 + ] + + if (length(collision_keys) > 0) { + errors <- c(errors, list(.create_recode_block_collision_error( + file_path, var, collision_keys, db_recstart_blocks + ))) + } + } + + return(errors) +} + +#' Create an error for recStart collisions across recode blocks +#' +#' @param file_path Path to the worksheet +#' @param variable_name Name of the variable with the collision +#' @param collision_keys Character vector of "database|||recStart" keys with collisions +#' @param db_recstart_blocks Named list mapping keys to block vectors +#' +#' @return A named list +.create_recode_block_collision_error <- function( + file_path, variable_name, collision_keys, db_recstart_blocks) { + # Summarize by block pair: collect distinct recStart values per pair + pair_recs <- list() + for (k in collision_keys) { + blks <- sort(db_recstart_blocks[[k]]) + pair_key <- paste(blks, collapse = " vs ") + rec <- strsplit(k, "|||", fixed = TRUE)[[1]][2] + pair_recs[[pair_key]] <- unique(c(pair_recs[[pair_key]], rec)) + } + + pair_summaries <- vapply(names(pair_recs), function(pk) { + recs <- pair_recs[[pk]] + n_recs <- length(recs) + rec_str <- if (n_recs <= 4) { + paste(recs, collapse = ", 
") + } else { + paste0(paste(head(recs, 4), collapse = ", "), " ... (", n_recs, " total)") + } + paste0(pk, " share recStart: ", rec_str) + }, character(1)) + + detail_str <- paste(pair_summaries, collapse = "; ") + n_pairs <- length(pair_recs) + n_collisions <- length(collision_keys) + + return(list( + error_type = "recode_block_collision", + file_type = "variable_details", + file_path = file_path, + variable = variable_name, + collision_keys = collision_keys, + message = glue::glue( + "Error in Variable details sheet at {file_path}. ", + "Variable \"{variable_name}\" has {n_collisions} recStart collision(s) ", + "across {n_pairs} block pair(s): {detail_str}." + ) + )) +} + +#' Split a single CSV line into raw field strings preserving quoting +#' +#' Unlike scan() which strips quotes, this returns each field exactly as it +#' appears in the raw line — quoted fields retain their surrounding quotes. +#' This is needed for detecting excessive quoting. +#' +#' @param line A single CSV line as a character string +#' +#' @return Character vector of raw field strings +.split_csv_line <- function(line) { + fields <- character() + n <- nchar(line) + if (n == 0) return("") + start <- 1 + in_quotes <- FALSE + i <- 1 + while (i <= n) { + ch <- substr(line, i, i) + if (in_quotes) { + if (ch == '"') { + if (i < n && substr(line, i + 1, i + 1) == '"') { + i <- i + 1 # skip escaped quote + } else { + in_quotes <- FALSE + } + } + } else { + if (ch == '"') { + in_quotes <- TRUE + } else if (ch == ",") { + fields <- c(fields, substr(line, start, i - 1)) + start <- i + 1 + } + } + i <- i + 1 + } + fields <- c(fields, substr(line, start, n)) + fields +} + +.pretty_print_file_type <- function(file_type) { + if(file_type == "variables") { + return("Variables sheet") + } else { + return("Variable details sheet") + } +} diff --git a/R/fix-worksheet.R b/R/fix-worksheet.R new file mode 100644 index 00000000..b40ab5b3 --- /dev/null +++ b/R/fix-worksheet.R @@ -0,0 +1,160 @@ +#' Fix formatting 
issues in a CSV worksheet +#' +#' @description Automatically corrects formatting violations in a CSV worksheet +#' based on the errors detected by \code{check_worksheet}. +#' +#' @param file_path Path to the CSV file to fix +#' @param file_type Type of file being fixed. Either "variables" or +#' "variable_details". +#' +#' @return A list of error objects, each with an added \code{fixed} field: +#' \itemize{ +#' \item fixed: Boolean indicating if the error was fixed +#' } +#' Returns an empty list if no errors were found. +#' +#' @export +#' +#' @examples +#' \dontrun{ +#' errors <- fix_worksheet("inst/extdata/variables.csv", "variables") +#' fixed_count <- sum(sapply(errors, function(e) e$fixed)) +#' print(sprintf("Fixed %d errors", fixed_count)) +#' } +fix_worksheet <- function( + file_path, file_type = c("variables", "variable_details") +) { + file_type <- match.arg(file_type) + + errors <- check_worksheet(file_path, file_type) + + if (length(errors) == 0) { + return(list()) + } + + error_types <- purrr::map_chr(errors, ~ .x$error_type) + + # The following errors can't be fixed + unfixable_errors <- c("file_not_found", "invalid_csv", "missing_id_column") + if (any(unfixable_errors %in% error_types)) { + return(purrr::map(errors, ~ c(.x, fixed = FALSE))) + } + + initial_csv_data <- read.csv( + file_path, + stringsAsFactors = FALSE, + check.names = FALSE + ) + + empty_columns_fixed_data <- .fix_empty_column_errors( + initial_csv_data, + purrr::keep(errors, ~ .x$error_type == "empty_columns") + ) + + column_order_fixed_data <- .fix_column_order_errors( + empty_columns_fixed_data, + purrr::keep(errors, ~ .x$error_type == "column_order") + ) + + row_order_fixed_data <- .fix_unsorted_rows_error( + column_order_fixed_data, + purrr::keep(errors, ~ .x$error_type == "unsorted_rows") + ) + + # Write CSV content; this also fixes excessive quoting and line ending errors + readr::write_csv( + row_order_fixed_data, + file = file_path, + na = "", + quote = "needed", + escape = 
"double", + eol = "\n" + ) + + fixed_types <- c( + if ("empty_columns" %in% error_types) "empty_columns", + if ("column_order" %in% error_types) "column_order", + if ("unsorted_rows" %in% error_types) "unsorted_rows", + if ("line_ending_crlf" %in% error_types) "line_ending_crlf", + if ("excessive_quoting" %in% error_types) "excessive_quoting" + ) + return(purrr::map(errors, ~ c(.x, fixed = .x$error_type %in% fixed_types))) +} + +#' Remove empty trailing columns from CSV data +#' +#' @param csv_data A data frame containing the CSV data. +#' @param empty_column_errors A list of empty column error objects from +#' \code{check_worksheet}. +#' +#' @return The data frame with empty columns removed. +#' +#' @keywords internal +.fix_empty_column_errors <- function(csv_data, empty_column_errors) { + if (length(empty_column_errors) == 0) { + return(csv_data) + } + + empty_col_positions <- purrr::map_int(empty_column_errors, ~ .x$col_num) + cols_to_keep <- colnames(csv_data)[-empty_col_positions] + return(csv_data[, cols_to_keep]) +} + +#' Reorder columns to match expected order in a CSV worksheet +#' +#' @param csv_data A data frame containing the CSV data. +#' @param column_order_errors A list of column order error objects from +#' \code{check_worksheet}. 
+#' +#' @return The reordered data frame +#' +#' @keywords internal +.fix_column_order_errors <- function(csv_data, column_order_errors) { + if (length(column_order_errors) == 0) { + return(csv_data) + } + + expected_positions <- purrr::map_int(column_order_errors, ~ .x$col_num) + expected_names <- purrr::map_chr(column_order_errors, ~ .x$expected_column) + + # Add columns that are missing into the sheet + missing_column_names <- expected_names[ + !expected_names %in% colnames(csv_data)] + csv_data_with_missing_columns <- purrr::reduce( + missing_column_names, + function(new_csv_data, missing_column_name) { + new_csv_data[[missing_column_name]] <- rep("", nrow(csv_data)) + return(new_csv_data) + }, + .init = csv_data + ) + + corrected_column_order <- purrr::reduce2( + expected_positions, + expected_names, + function(cols, pos, name) { + cols[pos] <- name + return(cols) + }, + .init = colnames(csv_data_with_missing_columns) + ) + + return(csv_data_with_missing_columns[, corrected_column_order]) +} + +#' Sort rows alphabetically by the ID column +#' +#' @param csv_data A data frame containing the CSV data. +#' @param unsorted_row_errors A list of unsorted row error objects from +#' \code{check_worksheet}. Each error contains an \code{id_column_name} field. +#' +#' @return The sorted data frame +#' +#' @keywords internal +.fix_unsorted_rows_error <- function(csv_data, unsorted_row_errors) { + if (length(unsorted_row_errors) == 0) { + return(csv_data) + } + id_column_name <- unsorted_row_errors[[1]]$id_column_name + return(csv_data[order(csv_data[[id_column_name]]), ]) +} diff --git a/R/scope-worksheets.R b/R/scope-worksheets.R new file mode 100644 index 00000000..05acde35 --- /dev/null +++ b/R/scope-worksheets.R @@ -0,0 +1,107 @@ +#' Filter worksheets to a subset of variables +#' +#' Creates temporary copies of variables.csv and variable_details.csv +#' containing only the rows matching the specified scope. 
Scope can be +#' defined by variable names, subject values, or auto-detected from git diff. +#' +#' @param variables_path Path to variables.csv +#' @param variable_details_path Path to variable_details.csv +#' @param variables Character vector of variable names to include, or NULL +#' @param subjects Character vector of subject values to include, or NULL +#' +#' @return A named list with `variables_path` and `variable_details_path` +#' pointing to (possibly temp) files, plus `scope_desc` describing what was +#' filtered, and `scoped` (logical) indicating whether filtering was applied. +#' +#' @export +scope_worksheets <- function( + variables_path, + variable_details_path, + variables = NULL, + subjects = NULL +) { + if (is.null(variables) && is.null(subjects)) { + return(list( + variables_path = variables_path, + variable_details_path = variable_details_path, + scope_desc = "all variables", + scoped = FALSE + )) + } + + vars_df <- read.csv(variables_path, stringsAsFactors = FALSE, + check.names = FALSE) + details_df <- read.csv(variable_details_path, stringsAsFactors = FALSE, + check.names = FALSE) + + # Build the set of in-scope variable names + in_scope <- character() + + if (!is.null(variables)) { + in_scope <- union(in_scope, variables) + } + + if (!is.null(subjects)) { + subject_vars <- vars_df$variable[ + trimws(vars_df$subject) %in% subjects + ] + in_scope <- union(in_scope, subject_vars) + } + + # Filter both data frames + vars_filtered <- vars_df[vars_df$variable %in% in_scope, ] + details_filtered <- details_df[details_df$variable %in% in_scope, ] + + scope_desc <- if (!is.null(subjects) && !is.null(variables)) { + paste0(length(in_scope), " variables (subjects: ", + paste(subjects, collapse = ", "), + " + explicit: ", paste(variables, collapse = ", "), ")") + } else if (!is.null(subjects)) { + paste0(length(in_scope), " variables in subjects: ", + paste(subjects, collapse = ", ")) + } else { + paste0(length(in_scope), " variables: ", + paste(in_scope, 
collapse = ", ")) + } + + # Write to temp files preserving header structure + tmp_vars <- tempfile(pattern = "variables_scoped_", fileext = ".csv") + tmp_details <- tempfile(pattern = "variable_details_scoped_", fileext = ".csv") + + readr::write_csv(vars_filtered, tmp_vars, na = "", quote = "needed", + escape = "double", eol = "\n") + readr::write_csv(details_filtered, tmp_details, na = "", quote = "needed", + escape = "double", eol = "\n") + + list( + variables_path = tmp_vars, + variable_details_path = tmp_details, + scope_desc = scope_desc, + scoped = TRUE + ) +} + +#' Parse --variables and --subject CLI arguments +#' +#' @param args Character vector from commandArgs(trailingOnly = TRUE) +#' +#' @return Named list with `variables` (character vector or NULL) and +#' `subjects` (character vector or NULL) +#' +#' @export +parse_scope_args <- function(args) { + variables <- NULL + subjects <- NULL + + var_idx <- which(args == "--variables") + if (length(var_idx) > 0 && var_idx[1] < length(args)) { + variables <- trimws(unlist(strsplit(args[var_idx[1] + 1], ","))) + } + + subj_idx <- which(args == "--subject") + if (length(subj_idx) > 0 && subj_idx[1] < length(args)) { + subjects <- trimws(unlist(strsplit(args[subj_idx[1] + 1], ","))) + } + + list(variables = variables, subjects = subjects) +} diff --git a/exec/check-worksheets.R b/exec/check-worksheets.R new file mode 100644 index 00000000..0e347a22 --- /dev/null +++ b/exec/check-worksheets.R @@ -0,0 +1,96 @@ +#!/usr/bin/env Rscript + +# Check CSV worksheets for formatting compliance +# This script validates variables.csv and variable_details.csv against +# formatting standards and reports any violations found. 
+# +# Usage: +# Rscript exec/check-worksheets.R # all variables +# Rscript exec/check-worksheets.R --subject "Ethnicity,Language,Migration" +# Rscript exec/check-worksheets.R --variables "SDCGCGT,SDCFIMM" +# Rscript exec/check-worksheets.R --subject Smoking --variables "COPD_Emph_der" +# +# Exit codes: +# 0 - No formatting violations found +# 1 - Formatting violations detected + +suppressPackageStartupMessages({ + library(cchsflow) + library(cli) +}) + +variables_path <- "inst/extdata/variables.csv" +variable_details_path <- "inst/extdata/variable_details.csv" + +# Parse scope arguments +scope_args <- parse_scope_args(commandArgs(trailingOnly = TRUE)) +scope <- scope_worksheets( + variables_path, variable_details_path, + variables = scope_args$variables, + subjects = scope_args$subjects +) + +cli_h1("Checking CSV worksheet formatting") +if (scope$scoped) { + cli_alert_info("Scope: {scope$scope_desc}") +} + +# Check variables.csv +cli_alert_info("Checking variables.csv...") +variables_sheet_errors <- check_worksheet(scope$variables_path, "variables") +n_variables_sheet_errors <- length(variables_sheet_errors) +if (n_variables_sheet_errors > 0) { + cli_alert_danger("Found {cli::no(n_variables_sheet_errors)} error{?s}") +} else { + cli_alert_success("Found {cli::no(n_variables_sheet_errors)} error{?s}") +} +cli_text("") + +# Check variable_details.csv +cli_alert_info("Checking variable_details.csv...") +variable_details_errors <- check_worksheet( + scope$variable_details_path, "variable_details") +n_variable_details_errors <- length(variable_details_errors) +if (n_variable_details_errors > 0) { + cli_alert_danger("Found {cli::no(n_variable_details_errors)} error{?s}") +} else { + cli_alert_success("Found {cli::no(n_variable_details_errors)} error{?s}") +} + +# Check recode block overlap in variable_details.csv +cli_alert_info("Checking variable_details.csv recode block consistency...") +recode_block_errors <- check_recode_blocks(scope$variable_details_path) 
+n_recode_block_errors <- length(recode_block_errors) +if (n_recode_block_errors > 0) { + cli_alert_danger("Found {cli::no(n_recode_block_errors)} error{?s}") +} else { + cli_alert_success("Found {cli::no(n_recode_block_errors)} error{?s}") +} +cli_text("") + +all_errors <- purrr::flatten( + list(variables_sheet_errors, variable_details_errors, recode_block_errors)) + +# Report results +n_all_errors <- length(all_errors) +if (n_all_errors == 0) { + cli_rule() + cli_alert_success("All worksheets are properly formatted!") + cli_rule() + quit(status = 0) +} else { + cli_rule() + cli_alert_danger("Found {cli::no(n_all_errors)} formatting violation{?s}") + cli_rule() + + # Display each error + for (i in seq_along(all_errors)) { + error <- all_errors[[i]] + cli_alert_danger("{i}. {error$message}") + } + + cli_text("") + + cli_alert_info("To fix these issues, run: {.run Rscript exec/fix-worksheets.R}") + quit(status = 1) +} diff --git a/exec/fix-worksheets.R b/exec/fix-worksheets.R new file mode 100644 index 00000000..c717d9f2 --- /dev/null +++ b/exec/fix-worksheets.R @@ -0,0 +1,152 @@ +#!/usr/bin/env Rscript + +# Fix CSV worksheet formatting issues +# This script automatically corrects formatting violations in the repo's +# variables and variable details sheet +# +# Usage: +# Rscript exec/fix-worksheets.R # all variables +# Rscript exec/fix-worksheets.R --subject "Ethnicity,Language,Migration" +# Rscript exec/fix-worksheets.R --variables "SDCGCGT,SDCFIMM" +# Rscript exec/fix-worksheets.R --subject Smoking --variables "COPD_Emph_der" +# +# Exit codes: +# 0 - Fixes applied successfully or no fixes needed +# 1 - Unable to fix (e.g., file not found, invalid CSV) + +suppressPackageStartupMessages({ + library(cchsflow) + library(cli) +}) + +# Constants +variables_path <- "inst/extdata/variables.csv" +variable_details_path <- "inst/extdata/variable_details.csv" + +# Parse scope arguments +scope_args <- parse_scope_args(commandArgs(trailingOnly = TRUE)) +scope <- 
scope_worksheets( + variables_path, variable_details_path, + variables = scope_args$variables, + subjects = scope_args$subjects +) + +#' Get list of fixed errors +#' +#' @param errors The result of a \code{fix_worksheet} call +#' +#' @return the list of fixed errors +.keep_fixed_errors <- function(errors) { + return(purrr::keep(errors, ~ .x$fixed == TRUE)) +} + +#' Get list of unfixed errors +#' +#' @param errors The result of a \code{fix_worksheet} call +#' +#' @return the list of unfixed errors +.keep_unfixed_errors <- function(errors) { + return(purrr::keep(errors, ~.x$fixed == FALSE)) +} + +#' Give the user an update regarding fixing a worksheet +#' +#' @param fix_results The result of a \code{fix_worksheet} call +.log_fix_results <- function(fix_results) { + num_fixed <- length(.keep_fixed_errors(fix_results)) + + num_unfixed <- length(.keep_unfixed_errors(fix_results)) + + if (length(fix_results) == 0) { + cli_alert_success("No issues found - file already compliant") + } else if (num_unfixed > 0) { + cli_alert_danger("Unable to fix {num_unfixed} error{?s}:") + unfixed_errors <- purrr::keep(fix_results, ~ .x$fixed == FALSE) + for (err in unfixed_errors) { + cli_text(" {cli::symbol$cross} {err$error_type}: {err$message}") + } + } else { + cli_alert_success("Fixed {num_fixed} error{?s}") + } +} + +cli_h1("Fixing CSV worksheet formatting") +if (scope$scoped) { + cli_alert_info("Scope: {scope$scope_desc}") +} + +cli_text("") + +cli_alert_info("Processing variables.csv...") + +result_vars <- fix_worksheet(scope$variables_path, "variables") + +.log_fix_results(result_vars) + +cli_text("") + +cli_alert_info("Processing variable_details.csv...") + +result_details <- fix_worksheet(scope$variable_details_path, "variable_details") + +.log_fix_results(result_details) + +cli_text("") + +# If scoped, copy fixed temp files back to originals +if (scope$scoped) { + # Read originals and scoped-fixed data + orig_vars <- read.csv(variables_path, stringsAsFactors = FALSE, + 
check.names = FALSE) + orig_details <- read.csv(variable_details_path, stringsAsFactors = FALSE, + check.names = FALSE) + fixed_vars <- read.csv(scope$variables_path, stringsAsFactors = FALSE, + check.names = FALSE) + fixed_details <- read.csv(scope$variable_details_path, + stringsAsFactors = FALSE, check.names = FALSE) + + # Replace in-scope rows in originals with fixed versions + in_scope_var_names <- unique(fixed_vars$variable) + orig_vars <- orig_vars[!orig_vars$variable %in% in_scope_var_names, ] + orig_vars <- rbind(orig_vars, fixed_vars) + orig_vars <- orig_vars[order(orig_vars$variable), ] + + orig_details <- orig_details[ + !orig_details$variable %in% in_scope_var_names, ] + orig_details <- rbind(orig_details, fixed_details) + orig_details <- orig_details[order(orig_details$variable), ] + + readr::write_csv(orig_vars, variables_path, na = "", quote = "needed", + escape = "double", eol = "\n") + readr::write_csv(orig_details, variable_details_path, na = "", + quote = "needed", escape = "double", eol = "\n") + cli_alert_info("Merged scoped fixes back into full worksheets") +} + +# Final report +num_fixed <- length(.keep_fixed_errors(result_vars)) + + length(.keep_fixed_errors(result_details)) +num_unfixed <- length(.keep_unfixed_errors(result_vars)) + + length(.keep_unfixed_errors(result_details)) +success <- num_unfixed == 0 +if (success) { + cli_rule() + if (num_fixed > 0) { + cli_alert_success("Applied {num_fixed} total fix{?es}") + cli_rule() + cli_text("") + cli_alert_info( + "The worksheets have been updated. Please review the changes and commit them." 
+ ) + } else { + cli_alert_success("All worksheets are already properly formatted!") + cli_rule() + } + quit(status = 0) +} else { + cli_rule() + cli_alert_danger("Unable to fix all issues") + cli_rule() + cli_alert_info("Please review the errors above and fix them manually.") + quit(status = 1) +} From 61356d81c515f2311485ad1cd92cd65dddf73532 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 6 Apr 2026 15:59:03 -0400 Subject: [PATCH 13/15] refactor(skill): Improve cchsflow-review from PR #176 retrospective MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Step 7: Broaden re-confirmation to all findings (not just P0/P1) to catch stale-data false positives on informational items - Step 8: Add branch verification before CEP commit to prevent committing artifacts on the wrong branch - Step 8/9: Add fix-then-report ordering guidance — apply fixes before posting PR comment when issues are being corrected - L6: Add explicit DV feeder resolution callout with multi-era example (rec_with_table does not auto-resolve feeders) - L6: Add QMD skip-if-clean guidance for reviews with no anomalies - Reference: Add Gem template fallback path via git show --- .claude/skills/cchsflow-review/SKILL.md | 20 ++++++++++++++++--- .../docs/l6-implementation-validation.md | 16 +++++++++++++-- 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index 9313b2ed..c3cf820d 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -229,7 +229,7 @@ Read and follow `docs/l6-implementation-validation.md` for the full procedure. T #### Re-confirm findings before scoring -Before finalising the review summary, **re-confirm each P0/P1 finding** by reading the specific cell directly from the current branch's `inst/extdata/` file using Python csv. Do not rely on earlier script output or cached copies (e.g., `/tmp/vd_pr.csv`). 
A finding that cannot be reproduced on a fresh read of the branch should be downgraded to 0. This step catches false positives caused by stale data in intermediate files. +Before finalising the review summary, **re-confirm every finding** (P0/P1 and informational) by reading the specific cell directly from the current branch's `inst/extdata/` file using Python csv. Do not rely on earlier script output or cached copies (e.g., `/tmp/vd_pr.csv`). A finding that cannot be reproduced on a fresh read of the branch should be downgraded to 0 or removed. This step catches false positives caused by stale data in intermediate files — for example, `_s` databases that appear in cached data but have already been cleaned up on the branch. #### Scoring scale @@ -268,6 +268,14 @@ ceps/cep-NNN-/ After saving artifacts, **commit and push them to the PR branch** so other reviewers can access them. CEP artifacts referenced in PR comments must exist on the branch — local-only files create dead references. +**Branch verification**: Before committing, verify you are on the PR branch: + +```bash +git branch --show-current # Must match the PR branch, not skills/review-validation +``` + +If you switched branches during the review (e.g., to access skill docs or extract templates), switch back to the PR branch before committing. Use `git stash` / `git stash pop` to carry uncommitted CEP files across branch switches. + ```bash git add ceps/cep-NNN-/ # Exclude rendered output (.html, *_files/, .quarto/) — only commit source files @@ -275,7 +283,13 @@ git commit -m "Add CEP-NNN review artifacts for PR #XXX" git push origin ``` -If working on a different branch than the PR, push to the PR branch or note in the PR comment where the artifacts live. +#### Fix-then-report ordering + +If the review identifies fixable issues (e.g., label typos, `_s` → `_m` conversions) and the user requests applying them, **apply fixes and commit before posting the PR comment**. 
This avoids posting a comment that says "no issues" and then immediately pushing a fix commit, or having to edit the comment after the fact. + +The workflow becomes: Step 7 (score) → Step 9 (fix) → Step 8 (report). The PR comment should reference the fix commit SHA and describe what was found and fixed. + +If no fixes are needed, post the PR comment immediately after scoring. #### Post PR comment (PR reviews) @@ -371,7 +385,7 @@ Summarise the retrospective to the user. If skill updates are warranted, propose - **L6 implementation validation**: `docs/l6-implementation-validation.md` — rec_with_table() testing, prevalence analysis - **CSV validation and fixes**: `docs/csv-validation-and-fixes.md` — check/fix tools, fix workflow, visual diff - **Variable naming conventions**: `docs/variable-naming-conventions.md` — harmonized variable naming rules -- **Gem verification workflow**: `docs/review/` — NotebookLM Gem system prompt, notebook manifest, coverage summary +- **Gem verification workflow**: `docs/review/` — NotebookLM Gem system prompt, notebook manifest, coverage summary. The Gem prompt template lives in `ceps/cep-002-smoking/gn-all-smoking-variables-prompt.md` on the `skills/review-validation` branch. If not available on the current branch, extract with `git show skills/review-validation:ceps/cep-002-smoking/gn-all-smoking-variables-prompt.md`. ### External references diff --git a/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md b/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md index ecc2fad0..967f376a 100644 --- a/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md +++ b/.claude/skills/cchsflow-review/docs/l6-implementation-validation.md @@ -126,6 +126,8 @@ write.csv(results, "ceps/cep-NNN-domain/vars-pumf-integration-test.csv", After generating the integration test CSV, create a Quarto document (`.qmd`) that visualises the cross-cycle results. 
This is a standard CEP artifact — visual inspection of prevalence trends is the most effective way to detect era boundary problems. +**When to skip**: If the cross-cycle summary shows no anomalies (no step changes, no unexpected zeros, distributions consistent within eras), the QMD is optional. Note in the CEP review summary that no QMD was generated because no anomalies were detected. The QMD is most valuable when there are ambiguous patterns that benefit from visual inspection. + The QMD should include: 1. **Cross-cycle valid % line plot** for each key variable (or a representative subset), with cycles on the x-axis and valid % on the y-axis. Add vertical reference lines at era boundaries (2007, 2015). 2. **Category distribution plot** for categorical derived variables (e.g., stacked bar chart of diet_score_cat3 across cycles). @@ -195,11 +197,21 @@ Example of a step change indicating a problem: ## Derived variable testing +**DV feeder resolution**: `rec_with_table()` does **not** auto-resolve DerivedVar feeders. You must include all feeder variables in the `variables` argument. For multi-era DVs (e.g., `active_transport` with 3 era-specific functions), each era needs different feeders — build the variables list per era from the `DerivedVar::[]` field in variable_details.csv. If feeders are omitted, the DV will silently return NA for all respondents in that era. + +Example for a 3-era DV: +```r +# Era 1 (2001-2005): needs PAC_4A_cont, PAC_4B_cont +# Era 2 (2007-2014): needs PAC_7, PAC_7A, PAC_7B_cont, PAC_8, PAC_8A, PAC_8B_cont +# Era 3 (2015+): needs PAYDVTTR, PAADVTRV +# Test each era with its own feeder set + the DV name +``` + If the in-scope variables include derived variables (functions in `R/`): 1. Identify the DV function (e.g., `diet_score_fun()` in `R/diet.R`) -2. Check that all input variables are available in the test cycles -3. Run `rec_with_table()` with the derived variable to verify the full pipeline +2. 
Check that all input variables are available in the test cycles — and include them in the `variables` argument +3. Run `rec_with_table()` with the derived variable and its feeders to verify the full pipeline 4. Compare the derived variable's valid % against its input variables — the DV should not have materially higher valid % than its least-available input 5. For categorical derived variables and key continuous inputs, examine the **exposure distribution** across cycles — not just valid counts. The central harmonization question is whether typical exposures (e.g., proportion with 0 fruit/veg, or >5 servings/day) remain stable across cycles. A sudden shift in the distribution at an era boundary signals a recoding or mapping error even when valid % is unchanged. Include these distributions in both the integration test output and the QMD visualisation From 9e99e1b6f5dbf7d8c530b855c550ee1e25f7a1d0 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 6 Apr 2026 20:26:27 -0400 Subject: [PATCH 14/15] fix(validation): Add missing schema infrastructure and fix bugs - Add load_schema() function and YAML schema files (variables.yaml, variable_details.yaml) so check_worksheet() can actually run - Add glue, purrr, readr, yaml to DESCRIPTION Imports - Fix assignment operator bug in .create_line_ending_crlf_error() (message <- should be message =) - Fix stray comma in fix_worksheet() write_csv() call - Use devtools::load_all() in exec scripts when in source tree - Regenerate NAMESPACE exports and man pages --- DESCRIPTION | 5 +++ NAMESPACE | 6 ++++ R/check-worksheet.R | 2 +- R/fix-worksheet.R | 2 +- R/load-schema.R | 34 +++++++++++++++++++ exec/check-worksheets.R | 6 +++- exec/fix-worksheets.R | 6 +++- .../schemas/core/variable_details.yaml | 17 ++++++++++ inst/metadata/schemas/core/variables.yaml | 13 +++++++ man/check_recode_blocks.Rd | 31 +++++++++++++++++ man/check_worksheet.Rd | 26 ++++++++++++++ man/dot-check_column_order.Rd | 25 ++++++++++++++ 
man/dot-check_excessive_quoting.Rd | 26 ++++++++++++++ man/dot-check_line_endings.Rd | 25 ++++++++++++++ man/dot-check_row_sorting.Rd | 25 ++++++++++++++ man/dot-check_trailing_empty_columns.Rd | 23 +++++++++++++ man/dot-create_column_order_error.Rd | 32 +++++++++++++++++ man/dot-create_excessive_quoting_error.Rd | 32 +++++++++++++++++ man/dot-create_file_not_found_error.Rd | 20 +++++++++++ man/dot-create_invalid_csv_error.Rd | 22 ++++++++++++ man/dot-create_line_ending_crlf_error.Rd | 22 ++++++++++++ man/dot-create_missing_id_column_error.Rd | 22 ++++++++++++ ...dot-create_recode_block_collision_error.Rd | 28 +++++++++++++++ ...dot-create_trailing_empty_columns_error.Rd | 22 ++++++++++++ man/dot-create_unsorted_rows_error.Rd | 22 ++++++++++++ man/dot-fix_column_order_errors.Rd | 21 ++++++++++++ man/dot-fix_empty_column_errors.Rd | 21 ++++++++++++ man/dot-fix_unsorted_rows_error.Rd | 21 ++++++++++++ man/dot-split_csv_line.Rd | 19 +++++++++++ man/fix_worksheet.Rd | 32 +++++++++++++++++ man/load_schema.Rd | 30 ++++++++++++++++ man/parse_scope_args.Rd | 18 ++++++++++ man/scope_worksheets.Rd | 32 +++++++++++++++++ 33 files changed, 684 insertions(+), 4 deletions(-) create mode 100644 R/load-schema.R create mode 100644 inst/metadata/schemas/core/variable_details.yaml create mode 100644 inst/metadata/schemas/core/variables.yaml create mode 100644 man/check_recode_blocks.Rd create mode 100644 man/check_worksheet.Rd create mode 100644 man/dot-check_column_order.Rd create mode 100644 man/dot-check_excessive_quoting.Rd create mode 100644 man/dot-check_line_endings.Rd create mode 100644 man/dot-check_row_sorting.Rd create mode 100644 man/dot-check_trailing_empty_columns.Rd create mode 100644 man/dot-create_column_order_error.Rd create mode 100644 man/dot-create_excessive_quoting_error.Rd create mode 100644 man/dot-create_file_not_found_error.Rd create mode 100644 man/dot-create_invalid_csv_error.Rd create mode 100644 man/dot-create_line_ending_crlf_error.Rd create mode 100644 
man/dot-create_missing_id_column_error.Rd create mode 100644 man/dot-create_recode_block_collision_error.Rd create mode 100644 man/dot-create_trailing_empty_columns_error.Rd create mode 100644 man/dot-create_unsorted_rows_error.Rd create mode 100644 man/dot-fix_column_order_errors.Rd create mode 100644 man/dot-fix_empty_column_errors.Rd create mode 100644 man/dot-fix_unsorted_rows_error.Rd create mode 100644 man/dot-split_csv_line.Rd create mode 100644 man/fix_worksheet.Rd create mode 100644 man/load_schema.Rd create mode 100644 man/parse_scope_args.Rd create mode 100644 man/scope_worksheets.Rd diff --git a/DESCRIPTION b/DESCRIPTION index 4aa44369..17c0ee1c 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -41,6 +41,11 @@ Depends: sjlabelled (>= 1.0.17), stringr (>= 1.2.0), magrittr +Imports: + glue, + purrr, + readr, + yaml Description: Supporting the use of the Canadian Community Health Survey (CCHS) by transforming variables from each cycle into harmonized, consistent versions that span survey cycles (currently, 2001 to diff --git a/NAMESPACE b/NAMESPACE index 5bedc946..07ee2136 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -36,13 +36,17 @@ export(age_cat_fun) export(binge_drinker_fun) export(bmi_fun) export(bmi_fun_cat) +export(check_recode_blocks) +export(check_worksheet) export(diet_score_fun) export(diet_score_fun_cat) export(energy_exp_fun) +export(fix_worksheet) export(food_insecurity_der) export(if_else2) export(immigration_fun) export(is_equal) +export(load_schema) export(low_drink_long_fun) export(low_drink_score_fun) export(low_drink_score_fun1) @@ -52,12 +56,14 @@ export(multiple_conditions_fun1) export(multiple_conditions_fun2) export(pack_years_fun) export(pack_years_fun_cat) +export(parse_scope_args) export(pct_time_fun) export(pct_time_fun_cat) export(rec_with_table) export(resp_condition_fun1) export(resp_condition_fun2) export(resp_condition_fun3) +export(scope_worksheets) export(set_data_labels) export(smoke_simple_fun) export(time_quit_smoking_fun) 
diff --git a/R/check-worksheet.R b/R/check-worksheet.R index c0763ba5..5009a840 100644 --- a/R/check-worksheet.R +++ b/R/check-worksheet.R @@ -317,7 +317,7 @@ check_worksheet <- function( row_num = row_num, expected_line_ending = expected_line_ending, actual_line_ending = actual_line_ending, - message <- glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Row {row_num} has an invalid line ending. Expected {expected_line_ending} but found {actual_line_ending}.") + message = glue::glue("Error in {.pretty_print_file_type(file_type)} at {file_path}. Row {row_num} has an invalid line ending. Expected {expected_line_ending} but found {actual_line_ending}.") )) } diff --git a/R/fix-worksheet.R b/R/fix-worksheet.R index b40ab5b3..fecc51c0 100644 --- a/R/fix-worksheet.R +++ b/R/fix-worksheet.R @@ -63,7 +63,7 @@ fix_worksheet <- function( # Write CSV content and also fix excessive quotting and line ending errors readr::write_csv( - row_order_fixed_data, , + row_order_fixed_data, file = file_path, na = "", quote = "needed", diff --git a/R/load-schema.R b/R/load-schema.R new file mode 100644 index 00000000..a0c41d7a --- /dev/null +++ b/R/load-schema.R @@ -0,0 +1,34 @@ +#' Load schema configuration from YAML +#' +#' @description Loads the YAML schema configuration for a given file type. +#' The schema contains the expected column order and other metadata used +#' for validating CSV worksheets. 
+#' +#' @param file_type Either "variables" or "variable_details" +#' +#' @return List containing schema configuration: +#' \itemize{ +#' \item expected_column_order: Character vector of column names in expected order +#' \item id_column_name: Name of the ID column used for row sorting (variables only) +#' } +#' +#' @export +#' +#' @examples +#' \dontrun{ +#' schema <- load_schema("variables") +#' schema$expected_column_order +#' schema$id_column_name +#' } +load_schema <- function(file_type) { + file_type <- match.arg(file_type, c("variables", "variable_details")) + + schema_file <- paste0(file_type, ".yaml") + schema_path <- system.file( + "metadata", "schemas", "core", schema_file, + package = "cchsflow", + mustWork = TRUE + ) + + return(yaml::read_yaml(schema_path)) +} diff --git a/exec/check-worksheets.R b/exec/check-worksheets.R index 0e347a22..66fd17b8 100644 --- a/exec/check-worksheets.R +++ b/exec/check-worksheets.R @@ -15,7 +15,11 @@ # 1 - Formatting violations detected suppressPackageStartupMessages({ - library(cchsflow) + if (file.exists("DESCRIPTION")) { + devtools::load_all(quiet = TRUE) + } else { + library(cchsflow) + } library(cli) }) diff --git a/exec/fix-worksheets.R b/exec/fix-worksheets.R index c717d9f2..011c625b 100644 --- a/exec/fix-worksheets.R +++ b/exec/fix-worksheets.R @@ -15,7 +15,11 @@ # 1 - Unable to fix (e.g., file not found, invalid CSV) suppressPackageStartupMessages({ - library(cchsflow) + if (file.exists("DESCRIPTION")) { + devtools::load_all(quiet = TRUE) + } else { + library(cchsflow) + } library(cli) }) diff --git a/inst/metadata/schemas/core/variable_details.yaml b/inst/metadata/schemas/core/variable_details.yaml new file mode 100644 index 00000000..90bc65ec --- /dev/null +++ b/inst/metadata/schemas/core/variable_details.yaml @@ -0,0 +1,17 @@ +expected_column_order: + - "variable" + - "dummyVariable" + - "typeEnd" + - "databaseStart" + - "variableStart" + - "typeStart" + - "recEnd" + - "numValidCat" + - "catLabel" + - 
"catLabelLong" + - "units" + - "recStart" + - "catStartLabel" + - "variableStartShortLabel" + - "variableStartLabel" + - "notes" diff --git a/inst/metadata/schemas/core/variables.yaml b/inst/metadata/schemas/core/variables.yaml new file mode 100644 index 00000000..7feaf009 --- /dev/null +++ b/inst/metadata/schemas/core/variables.yaml @@ -0,0 +1,13 @@ +expected_column_order: + - "variable" + - "label" + - "labelLong" + - "section" + - "subject" + - "variableType" + - "units" + - "databaseStart" + - "variableStart" + - "description" + +id_column_name: "variable" diff --git a/man/check_recode_blocks.Rd b/man/check_recode_blocks.Rd new file mode 100644 index 00000000..6a6207b7 --- /dev/null +++ b/man/check_recode_blocks.Rd @@ -0,0 +1,31 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{check_recode_blocks} +\alias{check_recode_blocks} +\title{Check variable_details.csv for recode block recStart collisions} +\usage{ +check_recode_blocks(file_path) +} +\arguments{ +\item{file_path}{Path to variable_details.csv} +} +\value{ +A list of errors found. Each error is a named list containing +information about the error. +} +\description{ +For variables with multiple recode blocks (distinct variableStart values), +checks whether the same recStart value appears in rows from more than one +block for the same database. This directly detects the condition that causes +rec_with_table() to match duplicate rows and produce incorrect output. +} +\details{ +Note: databaseStart overlap alone is not sufficient to flag an error because +cchsflow legitimately uses parallel PUMF and Master blocks with shared +databases but non-overlapping recStart ranges. 
+} +\examples{ +\dontrun{ +check_recode_blocks("inst/extdata/variable_details.csv") +} +} diff --git a/man/check_worksheet.Rd b/man/check_worksheet.Rd new file mode 100644 index 00000000..47245b6b --- /dev/null +++ b/man/check_worksheet.Rd @@ -0,0 +1,26 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{check_worksheet} +\alias{check_worksheet} +\title{Check a CSV worksheet for formatting errors} +\usage{ +check_worksheet(file_path, file_type = c("variables", "variable_details")) +} +\arguments{ +\item{file_path}{Path to the CSV file to check} + +\item{file_type}{Type of file being checked. Either "variables" or +"variable_details".} +} +\value{ +A list of errors found. Each error is a named list containing +information about the error. +} +\description{ +Check a CSV worksheet for formatting errors +} +\examples{ +\dontrun{ +check_worksheet("inst/extdata/variables.csv", "variables") +} +} diff --git a/man/dot-check_column_order.Rd b/man/dot-check_column_order.Rd new file mode 100644 index 00000000..aab6f466 --- /dev/null +++ b/man/dot-check_column_order.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.check_column_order} +\alias{.check_column_order} +\title{Check the column order in a worksheet} +\usage{ +.check_column_order(csv_data, expected_columns, error_ctx) +} +\arguments{ +\item{csv_data}{A data.frame containing the worksheet rows} + +\item{expected_columns}{The worksheet columns in their expected order} + +\item{error_ctx}{Information used when creating the error object. A named +list with the following fields: +* file_type: The type of worksheet the CSV contains. Can be "variables" or + "variable_details". 
+* file_path: The file path to the worksheet} +} +\value{ +The list of column order errors found in the worksheet +} +\description{ +Check the column order in a worksheet +} diff --git a/man/dot-check_excessive_quoting.Rd b/man/dot-check_excessive_quoting.Rd new file mode 100644 index 00000000..bf0b3903 --- /dev/null +++ b/man/dot-check_excessive_quoting.Rd @@ -0,0 +1,26 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.check_excessive_quoting} +\alias{.check_excessive_quoting} +\title{Check a worksheet for excessive quoting} +\usage{ +.check_excessive_quoting(raw_lines, error_ctx) +} +\arguments{ +\item{raw_lines}{Character vector of lines from readr::read_lines()} + +\item{error_ctx}{Information used when creating the error object. A named +list with the following fields: +* file_type: The type of worksheet the CSV contains. Can be "variables" or + "variable_details". +* file_path: The file path to the worksheet} +} +\value{ +List of errors +} +\description{ +Scans raw CSV lines for quoted fields that don't require quoting. A field +needs quoting only if it contains a comma, double-quote, newline, or +carriage return. Uses a vectorised pre-filter: only lines containing a +double-quote are inspected field-by-field. +} diff --git a/man/dot-check_line_endings.Rd b/man/dot-check_line_endings.Rd new file mode 100644 index 00000000..29547489 --- /dev/null +++ b/man/dot-check_line_endings.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.check_line_endings} +\alias{.check_line_endings} +\title{Check whether a worksheet has the correct line endings} +\usage{ +.check_line_endings(raw_lines, error_ctx) +} +\arguments{ +\item{raw_lines}{Character vector of lines from readr::read_lines()} + +\item{error_ctx}{Information used when creating the error object. 
A named +list with the following fields: +* file_type: The type of worksheet the CSV contains. Can be "variables" or + "variable_details". +* file_path: The file path to the worksheet} +} +\value{ +The list of line ending errors found in the worksheet +} +\description{ +Uses vectorised grep on raw lines for performance. The raw file is read +with readr::read_lines() which strips LF but preserves trailing CR if +present, so CRLF lines end with \\r. +} diff --git a/man/dot-check_row_sorting.Rd b/man/dot-check_row_sorting.Rd new file mode 100644 index 00000000..5d11fa7f --- /dev/null +++ b/man/dot-check_row_sorting.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.check_row_sorting} +\alias{.check_row_sorting} +\title{Check the row order in a worksheet} +\usage{ +.check_row_sorting(csv_data, id_column_name, error_ctx) +} +\arguments{ +\item{csv_data}{A data.frame containing the worksheet rows} + +\item{id_column_name}{Name of the column to check for sorting} + +\item{error_ctx}{Information used when creating the error object. A named +list with the following fields: +* file_type: The type of worksheet the CSV contains. Can be "variables" or + "variable_details". 
+* file_path: The file path to the worksheet} +} +\value{ +The list of row order errors found in the worksheet +} +\description{ +Check the row order in a worksheet +} diff --git a/man/dot-check_trailing_empty_columns.Rd b/man/dot-check_trailing_empty_columns.Rd new file mode 100644 index 00000000..c789b99b --- /dev/null +++ b/man/dot-check_trailing_empty_columns.Rd @@ -0,0 +1,23 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.check_trailing_empty_columns} +\alias{.check_trailing_empty_columns} +\title{Check a worksheet for trailing empty columns} +\usage{ +.check_trailing_empty_columns(csv_data, error_ctx) +} +\arguments{ +\item{csv_data}{Data.frame containing the worksheet rows} + +\item{error_ctx}{Information used when creating the error object. A named +list with the following fields: +* file_type: The type of worksheet the CSV contains. Can be "variables" or + "variable_details". +* file_path: The file path to the worksheet} +} +\value{ +List of errors +} +\description{ +Check a worksheet for trailing empty columns +} diff --git a/man/dot-create_column_order_error.Rd b/man/dot-create_column_order_error.Rd new file mode 100644 index 00000000..eec87cc6 --- /dev/null +++ b/man/dot-create_column_order_error.Rd @@ -0,0 +1,32 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_column_order_error} +\alias{.create_column_order_error} +\title{Create an error for when the worksheet columns are in the wrong order} +\usage{ +.create_column_order_error( + file_type, + file_path, + expected_column, + col_num, + actual_column +) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{expected_column}{The column expected at the offending position} + +\item{col_num}{The position of the column with the wrong header} + +\item{actual_column}{The actual column value} + +\item{file_type}{The type of worksheet. 
Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create an error for when the worksheet columns are in the wrong order +} diff --git a/man/dot-create_excessive_quoting_error.Rd b/man/dot-create_excessive_quoting_error.Rd new file mode 100644 index 00000000..6fc4d0a4 --- /dev/null +++ b/man/dot-create_excessive_quoting_error.Rd @@ -0,0 +1,32 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_excessive_quoting_error} +\alias{.create_excessive_quoting_error} +\title{Create an error for when the worksheet has excessive quoting} +\usage{ +.create_excessive_quoting_error( + file_type, + file_path, + row_num, + col_num, + cell_value +) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{row_num}{Row number with excessive quotes} + +\item{col_num}{Column number with excessive quotes} + +\item{cell_value}{Value of the cell with excessive quotes} + +\item{file_type}{The type of worksheet. Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create an error for when the worksheet has excessive quoting +} diff --git a/man/dot-create_file_not_found_error.Rd b/man/dot-create_file_not_found_error.Rd new file mode 100644 index 00000000..cf166510 --- /dev/null +++ b/man/dot-create_file_not_found_error.Rd @@ -0,0 +1,20 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_file_not_found_error} +\alias{.create_file_not_found_error} +\title{Create the error object for when the worksheet could not be found} +\usage{ +.create_file_not_found_error(file_type, file_path) +} +\arguments{ +\item{file_path}{The invalid path} + +\item{file_type}{The type of worksheet. 
Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create the error object for when the worksheet could not be found +} diff --git a/man/dot-create_invalid_csv_error.Rd b/man/dot-create_invalid_csv_error.Rd new file mode 100644 index 00000000..9cf71f9a --- /dev/null +++ b/man/dot-create_invalid_csv_error.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_invalid_csv_error} +\alias{.create_invalid_csv_error} +\title{Create the error for when the worksheet is not valid CSV} +\usage{ +.create_invalid_csv_error(file_type, file_path, error_message) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{error_message}{Reason(s) for why the worksheet is invalid CSV} + +\item{file_type}{The type of worksheet. Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create the error for when the worksheet is not valid CSV +} diff --git a/man/dot-create_line_ending_crlf_error.Rd b/man/dot-create_line_ending_crlf_error.Rd new file mode 100644 index 00000000..4a4236ea --- /dev/null +++ b/man/dot-create_line_ending_crlf_error.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_line_ending_crlf_error} +\alias{.create_line_ending_crlf_error} +\title{Create an error for when the worksheet has invalid line endings} +\usage{ +.create_line_ending_crlf_error(file_type, file_path, row_num) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{row_num}{Index of the row with the invalid line ending} + +\item{file_type}{The type of worksheet. 
Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create an error for when the worksheet has invalid line endings +} diff --git a/man/dot-create_missing_id_column_error.Rd b/man/dot-create_missing_id_column_error.Rd new file mode 100644 index 00000000..7ff70354 --- /dev/null +++ b/man/dot-create_missing_id_column_error.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_missing_id_column_error} +\alias{.create_missing_id_column_error} +\title{Create an error for when the worksheet is missing the ID column} +\usage{ +.create_missing_id_column_error(file_type, file_path, id_column_name) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{id_column_name}{Name of the expected ID column} + +\item{file_type}{The type of worksheet. Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create an error for when the worksheet is missing the ID column +} diff --git a/man/dot-create_recode_block_collision_error.Rd b/man/dot-create_recode_block_collision_error.Rd new file mode 100644 index 00000000..638aeeae --- /dev/null +++ b/man/dot-create_recode_block_collision_error.Rd @@ -0,0 +1,28 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_recode_block_collision_error} +\alias{.create_recode_block_collision_error} +\title{Create an error for recStart collisions across recode blocks} +\usage{ +.create_recode_block_collision_error( + file_path, + variable_name, + collision_keys, + db_recstart_blocks +) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{variable_name}{Name of the variable with the collision} + +\item{collision_keys}{Character vector of "database|||recStart" keys with collisions} + +\item{db_recstart_blocks}{Named list mapping keys to block vectors} +} +\value{ +A named list +} +\description{ +Create an error for recStart 
collisions across recode blocks +} diff --git a/man/dot-create_trailing_empty_columns_error.Rd b/man/dot-create_trailing_empty_columns_error.Rd new file mode 100644 index 00000000..6f4c40e6 --- /dev/null +++ b/man/dot-create_trailing_empty_columns_error.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_trailing_empty_columns_error} +\alias{.create_trailing_empty_columns_error} +\title{Create an error for when the worksheet has trailing empty columns} +\usage{ +.create_trailing_empty_columns_error(file_type, file_path, col_num) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{col_num}{Position of the trailing empty column} + +\item{file_type}{The type of worksheet. Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create an error for when the worksheet has trailing empty columns +} diff --git a/man/dot-create_unsorted_rows_error.Rd b/man/dot-create_unsorted_rows_error.Rd new file mode 100644 index 00000000..94575d7d --- /dev/null +++ b/man/dot-create_unsorted_rows_error.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.create_unsorted_rows_error} +\alias{.create_unsorted_rows_error} +\title{Create an error for when the worksheet rows are unsorted} +\usage{ +.create_unsorted_rows_error(file_type, file_path, id_column_name) +} +\arguments{ +\item{file_path}{Path to the worksheet} + +\item{id_column_name}{Name of the column that should be sorted} + +\item{file_type}{The type of worksheet. 
Can be "variables" or +"variable_details".} +} +\value{ +A named list +} +\description{ +Create an error for when the worksheet rows are unsorted +} diff --git a/man/dot-fix_column_order_errors.Rd b/man/dot-fix_column_order_errors.Rd new file mode 100644 index 00000000..657886d0 --- /dev/null +++ b/man/dot-fix_column_order_errors.Rd @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/fix-worksheet.R +\name{.fix_column_order_errors} +\alias{.fix_column_order_errors} +\title{Reorder columns to match expected order in a CSV worksheet} +\usage{ +.fix_column_order_errors(csv_data, column_order_errors) +} +\arguments{ +\item{csv_data}{A data frame containing the CSV data.} + +\item{column_order_errors}{A list of column order error objects from +\code{check_worksheet}.} +} +\value{ +The reordered data frame +} +\description{ +Reorder columns to match expected order in a CSV worksheet +} +\keyword{internal} diff --git a/man/dot-fix_empty_column_errors.Rd b/man/dot-fix_empty_column_errors.Rd new file mode 100644 index 00000000..fdc837ba --- /dev/null +++ b/man/dot-fix_empty_column_errors.Rd @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/fix-worksheet.R +\name{.fix_empty_column_errors} +\alias{.fix_empty_column_errors} +\title{Remove empty trailing columns from CSV data} +\usage{ +.fix_empty_column_errors(csv_data, empty_column_errors) +} +\arguments{ +\item{csv_data}{A data frame containing the CSV data.} + +\item{empty_column_errors}{A list of empty column error objects from +\code{check_worksheet}.} +} +\value{ +The data frame with empty columns removed. 
+} +\description{ +Remove empty trailing columns from CSV data +} +\keyword{internal} diff --git a/man/dot-fix_unsorted_rows_error.Rd b/man/dot-fix_unsorted_rows_error.Rd new file mode 100644 index 00000000..c1fdecd4 --- /dev/null +++ b/man/dot-fix_unsorted_rows_error.Rd @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/fix-worksheet.R +\name{.fix_unsorted_rows_error} +\alias{.fix_unsorted_rows_error} +\title{Sort rows alphabetically by the ID column} +\usage{ +.fix_unsorted_rows_error(csv_data, unsorted_row_errors) +} +\arguments{ +\item{csv_data}{A data frame containing the CSV data.} + +\item{unsorted_row_errors}{A list of unsorted row error objects from +\code{check_worksheet}. Each error contains an \code{id_column_name} field.} +} +\value{ +The sorted data frame +} +\description{ +Sort rows alphabetically by the ID column +} +\keyword{internal} diff --git a/man/dot-split_csv_line.Rd b/man/dot-split_csv_line.Rd new file mode 100644 index 00000000..4e2c58c1 --- /dev/null +++ b/man/dot-split_csv_line.Rd @@ -0,0 +1,19 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/check-worksheet.R +\name{.split_csv_line} +\alias{.split_csv_line} +\title{Split a single CSV line into raw field strings preserving quoting} +\usage{ +.split_csv_line(line) +} +\arguments{ +\item{line}{A single CSV line as a character string} +} +\value{ +Character vector of raw field strings +} +\description{ +Unlike scan() which strips quotes, this returns each field exactly as it +appears in the raw line — quoted fields retain their surrounding quotes. +This is needed for detecting excessive quoting. 
+} diff --git a/man/fix_worksheet.Rd b/man/fix_worksheet.Rd new file mode 100644 index 00000000..aef9c2d6 --- /dev/null +++ b/man/fix_worksheet.Rd @@ -0,0 +1,32 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/fix-worksheet.R +\name{fix_worksheet} +\alias{fix_worksheet} +\title{Fix formatting issues in a CSV worksheet} +\usage{ +fix_worksheet(file_path, file_type = c("variables", "variable_details")) +} +\arguments{ +\item{file_path}{Path to the CSV file to fix} + +\item{file_type}{Type of file being fixed. Either "variables" or +"variable_details".} +} +\value{ +A list of error objects, each with an added \code{fixed} field: + \itemize{ + \item fixed: Boolean indicating if the error was fixed + } + Returns an empty list if no errors were found. +} +\description{ +Automatically corrects formatting violations in a CSV worksheet + based on the errors detected by \code{check_worksheet}. +} +\examples{ +\dontrun{ +errors <- fix_worksheet("inst/extdata/variables.csv", "variables") +fixed_count <- sum(vapply(errors, function(e) isTRUE(e$fixed), logical(1))) +print(sprintf("Fixed \%d errors", fixed_count)) +} +} diff --git a/man/load_schema.Rd b/man/load_schema.Rd new file mode 100644 index 00000000..e6b14213 --- /dev/null +++ b/man/load_schema.Rd @@ -0,0 +1,30 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/load-schema.R +\name{load_schema} +\alias{load_schema} +\title{Load schema configuration from YAML} +\usage{ +load_schema(file_type) +} +\arguments{ +\item{file_type}{Either "variables" or "variable_details"} +} +\value{ +List containing schema configuration: + \itemize{ + \item expected_column_order: Character vector of column names in expected order + \item id_column_name: Name of the ID column used for row sorting (variables only) + } +} +\description{ +Loads the YAML schema configuration for a given file type. + The schema contains the expected column order and other metadata used + for validating CSV worksheets. 
+} +\examples{ +\dontrun{ +schema <- load_schema("variables") +schema$expected_column_order +schema$id_column_name +} +} diff --git a/man/parse_scope_args.Rd b/man/parse_scope_args.Rd new file mode 100644 index 00000000..ed94b6e8 --- /dev/null +++ b/man/parse_scope_args.Rd @@ -0,0 +1,18 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/scope-worksheets.R +\name{parse_scope_args} +\alias{parse_scope_args} +\title{Parse --variables and --subject CLI arguments} +\usage{ +parse_scope_args(args) +} +\arguments{ +\item{args}{Character vector from commandArgs(trailingOnly = TRUE)} +} +\value{ +Named list with `variables` (character vector or NULL) and +`subjects` (character vector or NULL) +} +\description{ +Parse --variables and --subject CLI arguments +} diff --git a/man/scope_worksheets.Rd b/man/scope_worksheets.Rd new file mode 100644 index 00000000..82449b45 --- /dev/null +++ b/man/scope_worksheets.Rd @@ -0,0 +1,32 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/scope-worksheets.R +\name{scope_worksheets} +\alias{scope_worksheets} +\title{Filter worksheets to a subset of variables} +\usage{ +scope_worksheets( + variables_path, + variable_details_path, + variables = NULL, + subjects = NULL +) +} +\arguments{ +\item{variables_path}{Path to variables.csv} + +\item{variable_details_path}{Path to variable_details.csv} + +\item{variables}{Character vector of variable names to include, or NULL} + +\item{subjects}{Character vector of subject values to include, or NULL} +} +\value{ +A named list with `variables_path` and `variable_details_path` +pointing to (possibly temp) files, plus `scope_desc` describing what was +filtered, and `scoped` (logical) indicating whether filtering was applied. +} +\description{ +Creates temporary copies of variables.csv and variable_details.csv +containing only the rows matching the specified scope. 
Scope can be +defined by variable names, subject values, or auto-detected from git diff. +} From 103d6445bbbe1b7f1442fae61f46355271238d92 Mon Sep 17 00:00:00 2001 From: Doug Manuel Date: Mon, 6 Apr 2026 20:42:52 -0400 Subject: [PATCH 15/15] refactor(skills): Renumber checks, add L-stage mapping, fix stale references - Renumber validation checks sequentially (1-8) in cchsflow-validation - Add skill chooser table and L-stage mapping to cchsflow-validation - Add "Save Gem verification findings" step to cchsflow-review Step 8 - Fix stale metadata_registry.yaml references across 3 docs - Replace broken MCP troubleshooting cross-reference in cchsflow-worksheets with inline 4-step troubleshooting guide - Fix _s suffix description (deprecated, replace with _m) - Update expected column counts to match current schemas (10/16) --- .claude/skills/cchsflow-review/SKILL.md | 12 ++++- .../docs/csv-validation-and-fixes.md | 2 +- .../docs/l3-l5-worksheet-checks.md | 2 +- .claude/skills/cchsflow-validation/SKILL.md | 48 ++++++++++++++----- .claude/skills/cchsflow-worksheets/SKILL.md | 10 +++- 5 files changed, 57 insertions(+), 17 deletions(-) diff --git a/.claude/skills/cchsflow-review/SKILL.md b/.claude/skills/cchsflow-review/SKILL.md index c3cf820d..fdc60e33 100644 --- a/.claude/skills/cchsflow-review/SKILL.md +++ b/.claude/skills/cchsflow-review/SKILL.md @@ -264,6 +264,16 @@ ceps/cep-NNN-/ -pumf-integration-test.csv ``` +#### Save Gem verification findings + +If a Gem verification was performed (via the NotebookLM Gem in `docs/review/`), save the findings before they are lost to context compaction: + +1. Save the Gem response as `gn-{domain}-gem-findings.md` in the CEP directory +2. Include: summary table, detailed findings with classification (by-design / pre-existing / not-actionable / blocking), and action taken +3. 
Reference the prompt file (`gn-{domain}-variables-prompt.md`) that generated the findings + +This must be done **before** posting the PR comment, so the comment can reference the committed findings file. + #### Commit and push CEP artifacts After saving artifacts, **commit and push them to the PR branch** so other reviewers can access them. CEP artifacts referenced in PR comments must exist on the branch — local-only files create dead references. @@ -392,7 +402,7 @@ Summarise the retrospective to the user. If skill updates are warranted, propose - L0-L6 workflow: `.claude/skills/cchsflow-worksheets/docs/harmonization-workflow.md` - Era mapping tables: `.claude/skills/cchsflow-worksheets/docs/variableStart-databaseStart-authoring.md` - Schema definitions: `inst/metadata/schemas/core/variables.yaml`, `inst/metadata/schemas/core/variable_details.yaml` -- Regex patterns and naming conventions: `inst/metadata/documentation/metadata_registry.yaml` +- Naming conventions: `docs/variable-naming-conventions.md` (in this skill's `docs/` folder) - CSV formatting check/fix: `exec/check-worksheets.R`, `exec/fix-worksheets.R` (uses `R/check-worksheet.R`, `R/fix-worksheet.R`). Supports `--subject` and `--variables` for scoped validation. 
- Scope filtering: `R/scope-worksheets.R` (`scope_worksheets()`, `parse_scope_args()`) - CSV standardisation with schema validation: `R/csv-utils.R` (`standardise_csv()`), `R/schema-validation.R` (`validate_csv_against_schema()`) diff --git a/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md b/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md index 9489062d..73a5f0cf 100644 --- a/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md +++ b/.claude/skills/cchsflow-review/docs/csv-validation-and-fixes.md @@ -28,7 +28,7 @@ standardise_csv("inst/extdata/variables.csv") standardise_csv("inst/extdata/variable_details.csv", collaboration = TRUE, validate_only = TRUE) ``` -Collaboration mode validates fields against `metadata_registry.yaml` regex patterns including `dummyVariable`, `variableStart`, `recStart`, and `recEnd`. It also checks for missing categorical dummy variables and cross-field rules. +Collaboration mode validates fields against naming convention regex patterns (see `docs/variable-naming-conventions.md`) including `dummyVariable`, `variableStart`, `recStart`, and `recEnd`. It also checks for missing categorical dummy variables and cross-field rules. ### When to run diff --git a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md index bb81da91..c64e48a5 100644 --- a/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md +++ b/.claude/skills/cchsflow-review/docs/l3-l5-worksheet-checks.md @@ -107,7 +107,7 @@ Scan for: ## Check 5b: dummyVariable naming conventions -Verify that `dummyVariable` values follow the naming convention below. (Note: `inst/metadata/documentation/metadata_registry.yaml` is referenced as the authoritative source for these patterns but does not yet exist — this skill section is the current reference.) +Verify that `dummyVariable` values follow the naming convention below. 
(See also `docs/variable-naming-conventions.md` in this skill's folder for the full naming rules.) **Categorical variables** — regex: `^[a-zA-Z0-9_]+_cat[0-9]+(_[0-9]+|_NA[a-z])$` diff --git a/.claude/skills/cchsflow-validation/SKILL.md b/.claude/skills/cchsflow-validation/SKILL.md index cfe1068d..6a66f76e 100644 --- a/.claude/skills/cchsflow-validation/SKILL.md +++ b/.claude/skills/cchsflow-validation/SKILL.md @@ -41,6 +41,30 @@ Scoped mode extracts matching rows to temp files, runs checks/fixes on those, th The R functions `scope_worksheets()` and `parse_scope_args()` in `R/scope-worksheets.R` can also be called programmatically. +## Which skill to use + +| Task | Skill | +|------|-------| +| **Authoring** new variables or editing worksheets | `cchsflow-worksheets` | +| **Validating** worksheets for formatting and consistency | `cchsflow-validation` (this skill) | +| **Reviewing** a PR or self-reviewing harmonisation work | `cchsflow-review` | +| **Writing** derived variable R functions | `cchsflow-derive` | + +Typical flow: worksheets → validation → review (for PRs) or worksheets → validation (for self-review). 
+ +## L-stage mapping + +| Check | L-stage | When to run | +|-------|---------|-------------| +| 1: CSV formatting | L5 | After authoring, before committing | +| 2: Source references | L3 | After variableStart authoring | +| 3: Cross-file consistency | L5 | After adding variables to either file | +| 4: databaseStart coverage | L5 | After modifying databaseStart fields | +| 5: R CMD check | L6 | Before merge, after R/ file changes | +| 6: Pre-2007 explicit mappings | L3 | After adding pre-2007 databases | +| 7: DerivedVar mixed _p/_m | L5 | After writing DerivedVar rows | +| 8: Trailing empty columns | L5 | After any Excel-based editing | + ## Validation checks ### Check 1: CSV formatting @@ -185,7 +209,7 @@ Rscript -e "devtools::load_all('.'); cat('Package loads OK\n')" If `devtools::load_all()` fails, the GHA will also fail when it tries to install the package. -### Check 7: Pre-2007 explicit mapping coverage +### Check 6: Pre-2007 explicit mapping coverage For any variable where `databaseStart` includes pre-2007 databases (`cchs2001_m`, `cchs2001_p`, `cchs2003_m`, `cchs2003_p`, `cchs2005_m`, `cchs2005_p`), verify that `variableStart` contains explicit `db::VAR` entries for those cycles rather than relying on `[VAR]` defaults. @@ -224,7 +248,7 @@ if (length(issues) > 0) { Pre-2007 mapping gaps are **P1** errors — the variable exists in those cycles but the wrong source variable is read at runtime. -### Check 8: DerivedVar mixed _p/_m row detection +### Check 7: DerivedVar mixed _p/_m row detection DerivedVar rows must not mix `_p` (PUMF) and `_m` (Master) databases in a single row when those database types use different feeder variables. If a single DerivedVar row's `databaseStart` contains both `_p` and `_m` entries, `rec_with_table()` will apply the same feeder variable set to all databases in that row — silently producing wrong results when PUMF and Master use different age, sex, or other input variables. 
@@ -266,7 +290,7 @@ if (nrow(mixed) > 0) { A mixed row is **always suspect**. It is a **P1** error if the `_p` and `_m` feeder sets differ (use `resolve_dependencies()` with a `databases` filter to confirm). It may be acceptable if feeders are identical across both database types, but this should be verified explicitly. -### Check 6: Trailing empty columns +### Check 8: Trailing empty columns Check for trailing empty columns added by Excel editing (a recurring issue across v3 PRs): @@ -286,20 +310,20 @@ else cat('OK: No trailing empty columns\n') " ``` -Expected column counts: variables.csv = 20, variable_details.csv = 22. +Expected column counts: variables.csv = 10, variable_details.csv = 16. (Defined in YAML schemas at `inst/metadata/schemas/core/`.) ## Interpreting results | Check | Pass | Severity | Fail action | |-------|------|----------|------------| -| CSV formatting | No output / clean exit | P2 | Run `Rscript exec/fix-worksheets.R` to auto-fix, then commit | -| Source references | No invalid refs | P0 | Fix variableStart mappings per era rules | -| Cross-file consistency | All variables in both files | P1 | Add missing entries to the appropriate file | -| databaseStart coverage | No mismatches | P1 | Align databaseStart between files | -| R CMD check | 0 errors, 0 warnings | P0 | Fix R/ files: remove `library()` calls, declare deps in DESCRIPTION | -| Trailing empty columns | Expected column counts | P2 | Trim to real columns using R `write.csv()` | -| Pre-2007 explicit mappings | No gaps | P1 | Add explicit `db::VAR` entries for pre-2007 cycles | -| DerivedVar mixed _p/_m | No mixed rows | P1 | Split rows by database type; verify feeders with `resolve_dependencies()` | +| 1: CSV formatting | No output / clean exit | P2 | Run `Rscript exec/fix-worksheets.R` to auto-fix, then commit | +| 2: Source references | No invalid refs | P0 | Fix variableStart mappings per era rules | +| 3: Cross-file consistency | All variables in both files | P1 | Add missing 
entries to the appropriate file | +| 4: databaseStart coverage | No mismatches | P1 | Align databaseStart between files | +| 5: R CMD check | 0 errors, 0 warnings | P0 | Fix R/ files: remove `library()` calls, declare deps in DESCRIPTION | +| 6: Pre-2007 explicit mappings | No gaps | P1 | Add explicit `db::VAR` entries for pre-2007 cycles | +| 7: DerivedVar mixed _p/_m | No mixed rows | P1 | Split rows by database type; verify feeders with `resolve_dependencies()` | +| 8: Trailing empty columns | Expected column counts | P2 | Trim to real columns using R `write.csv()` | ## When to run diff --git a/.claude/skills/cchsflow-worksheets/SKILL.md b/.claude/skills/cchsflow-worksheets/SKILL.md index 02491784..9456c5d4 100644 --- a/.claude/skills/cchsflow-worksheets/SKILL.md +++ b/.claude/skills/cchsflow-worksheets/SKILL.md @@ -23,7 +23,13 @@ Key tools for authoring: - `mcp__cchs-metadata__suggest_cchsflow_row(variable_name)` — draft a harmonisation row - `mcp__cchs-metadata__get_source_conflicts(variable_name, dataset_id)` — find cross-source label disagreements (useful for catching metadata inconsistencies before authoring) -If the MCP is not available, see the troubleshooting section in `.claude/skills/cchsflow-review/SKILL.md` under "If the MCP is not available" for setup instructions (including the standalone CLI fallback). The MCP server (v0.3.0+) lives in `../cchsflow-docs/mcp-server/` and is also available as a [GitHub release](https://github.com/Big-Life-Lab/cchsflow-docs/releases). +If the MCP is not available: +1. Check that `../cchsflow-docs/mcp-server/server.py` exists (may need to restore from that repo's main branch) +2. Verify MCP configuration in `~/.claude.json` includes the `cchs-metadata` server +3. **Standalone CLI fallback**: `python3 ../cchsflow-docs/mcp-server/server.py --cli search "SMK_005"` (runs without Claude Code) +4. 
**File-based fallback**: Read DDI YAML files directly from `../cchsflow-docs/ddi/` or CSVs from `../cchsflow-docs/data/` + +The MCP server (v0.3.0+) lives in `../cchsflow-docs/mcp-server/` and is also available as a [GitHub release](https://github.com/Big-Life-Lab/cchsflow-docs/releases). ## Key references @@ -50,7 +56,7 @@ Detailed documentation is in the `docs/` subdirectory: |--------|---------|-------| | `_p` | PUMF (Public Use Microdata File) | Grouped/derived variables | | `_m` | Master survey file | Ungrouped source variables | -| `_s` | Share file | Synthetic datasets | +| `_s` | Share file (deprecated) | Replace with `_m` | | `_i` | ICES-linked (deprecated) | Replace with `_m` | ### PUMF vs Master row splitting