Closed
16 commits
4845592
Add CEP-015: Variable discovery and project tools
DougManuel Feb 24, 2026
0a430b5
skill(cchsflow-review): Add scope-confirmation prompt when domain not…
DougManuel Mar 11, 2026
a476810
feat(skills): Add validation checks, era boundary docs, and PUMF avai…
DougManuel Mar 12, 2026
8645104
feat(skills): Add cchsflow-worksheets reference documentation
DougManuel Mar 12, 2026
3584588
docs(skill): Update cchsflow-review with recode block terminology and…
DougManuel Mar 13, 2026
2b14aa2
refactor(skill): Split cchsflow-review SKILL.md into orchestrator + docs
DougManuel Mar 29, 2026
335b856
fix(skill): Renumber Check 6 DV specification items sequentially
DougManuel Mar 29, 2026
540b1c1
feat(skill): Add PUMF-Master variable family pattern, completeness au…
DougManuel Mar 30, 2026
d30e9eb
feat(skill): Add cchsflow-derive skill with development workflow and …
DougManuel Mar 30, 2026
c267a38
feat(skill): Add R code/test triage to cchsflow-review
DougManuel Mar 31, 2026
a12784f
Added information on variable naming conventions to cchsflow workshee…
rafdoodle Apr 1, 2026
0ae2130
Merge branch 'skills/review-validation' of https://github.com/Big-Lif…
rafdoodle Apr 1, 2026
5b74bf8
feat(validation): Add scoped worksheet validation with performance fix
DougManuel Apr 6, 2026
61356d8
refactor(skill): Improve cchsflow-review from PR #176 retrospective
DougManuel Apr 6, 2026
9e99e1b
fix(validation): Add missing schema infrastructure and fix bugs
DougManuel Apr 7, 2026
103d644
refactor(skills): Renumber checks, add L-stage mapping, fix stale ref…
DougManuel Apr 7, 2026
182 changes: 182 additions & 0 deletions .claude/skills/cchsflow-derive/SKILL.md
@@ -0,0 +1,182 @@
---
name: cchsflow-derive
description: Write and review derived variable functions for cchsflow. Use when implementing new DV functions (calculate_*, assess_*, categorize_*), upgrading existing functions to v3 architecture, reviewing DV code for correctness, or preparing DV changes for commit. Covers the 3-step architecture, source-agnostic design, quality tiers, patterns, testing, and package-level validation.
allowed-tools: Bash(Rscript:*), Bash(R:*), Bash(git:*), Read, Glob, Grep
---

# cchsflow derived variable development

Write, review, and validate derived variable functions using the v3 3-step architecture.

## Usage

```
/cchsflow-derive # general guidance (reads foundations)
/cchsflow-derive calculate_bmi # review/write a specific function
/cchsflow-derive --check # run done criteria checks
```

## Before you start

### Required reading

Before writing or reviewing a DV function, read these docs (in this skill's `docs/` folder):

1. **[foundations.md](docs/foundations.md)** — 3-step architecture, missing data handling, quality tiers, coding standards, anti-patterns. Read this first.
2. **The pattern doc** that matches your function (see "Choose a pattern" below)

### Choose a pattern

Identify which pattern your function follows, then read the corresponding doc:

| Pattern | When to use | Doc |
|---------|-------------|-----|
| **Formula calculation** | Compute a value from inputs (BMI, pack-years) | [formula-calculation.md](docs/patterns/formula-calculation.md) |
| **Category grouping** | Map values to categories (BMI categories, smoking status) | [category-grouping.md](docs/patterns/category-grouping.md) |
| **Pass-through** | Clean and forward a single variable | [pass-through.md](docs/patterns/pass-through.md) |
| **Cat-to-continuous** | Midpoint imputation from categorical ranges | [cat-to-continuous.md](docs/patterns/cat-to-continuous.md) |
| **Multi-source routing** | Choose best source with priority chain | [multi-source-routing.md](docs/patterns/multi-source-routing.md) |
| **Pathway branching** | Complex decision tree with gate variables | [pathway-branching.md](docs/patterns/pathway-branching.md) |

### Reference material

- **[7-levels.md](docs/7-levels.md)** — function complexity taxonomy (L1-L7)
- **[function-inventory.md](docs/function-inventory.md)** — all existing DV functions with pattern, level, and tier
- **[testing.md](docs/testing.md)** — unit test and golden fixture patterns, common failure diagnostics

## Development workflow

### 1. Write tests

Follow the test tier matching your function's quality tier (see [testing.md](docs/testing.md)):

- **Bronze**: Happy path + one missing input
- **Silver**: + out-of-range, vectors, dataframe via `mutate()`
- **Gold**: + every `case_when()` branch, tagged NA type verification, `output_format` parameter

### 2. Write the function

Follow the pattern template from the appropriate pattern doc. Key principles:

- **Source-agnostic**: Semantic parameter names (`height_m`, `weight_kg`), not CCHS variable names. ONE function for both PUMF and Master; the worksheet routes different source variables to the same parameters.
- **3-step**: `clean_variables(output_format = "tagged_na")` → `case_when()` logic → `clean_variables(output_format = output_format)`
- **Step 1 always uses `"tagged_na"`**: Never pass the user's `output_format` to Step 1. If Step 1 returns the original numeric missing codes, `any_missing()` in Step 2 won't detect them.
- **Namespace-qualify**: `dplyr::case_when()`, `haven::tagged_na()` — functions must work standalone.
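
The reason Step 1 must convert missing codes before any logic runs can be shown with a minimal base R sketch. Everything here is an assumption made for illustration: the missing codes `999.96` and `999.99`, the helper `codes_to_na()`, and the function `bmi_sketch()` are hypothetical stand-ins, not the cchsflow `clean_variables()` API. `ifelse()` stands in for `dplyr::case_when()` so the sketch has no dependencies, and Step 3 (output formatting) is omitted.

```r
# Assumed numeric missing codes for this sketch only (not from a real worksheet)
missing_codes <- c(999.96, 999.99)

# Hypothetical stand-in for Step 1 cleaning; NOT the cchsflow clean_variables() API
codes_to_na <- function(x) {
  x[x %in% missing_codes] <- NA_real_
  x
}

bmi_sketch <- function(height_m, weight_kg) {
  # Step 1: convert numeric missing codes to NA before any logic runs
  h <- codes_to_na(height_m)
  w <- codes_to_na(weight_kg)
  # Step 2: NA-aware logic (ifelse() stands in for dplyr::case_when())
  ifelse(is.na(h) | is.na(w), NA_real_, w / h^2)
}

height <- c(1.75, 999.99)
weight <- c(70, 80)

with_step1 <- bmi_sketch(height, weight)  # second element is NA
without_step1 <- weight / height^2        # second element is a tiny, silently wrong BMI
```

Without the Step 1 conversion, the missing code `999.99` is treated as a real height and produces a plausible-looking number instead of a missing value, which is the failure mode the rule above guards against.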

### 3. Write roxygen documentation

Silver and gold tier require the full template (see foundations.md § Documentation):

```r
#' @title [verb phrase]
#' @description [1-2 sentences]
#' @details [implementation notes, PUMF vs Master table if source-agnostic]
#' @param var1 [description]
#' @param output_format Output missing data format: "tagged_na" (default) or "original".
#' @param ... Arguments passed from deprecated aliases.
#' @return [type and range]
#' @examples
#' # Scalar
#' # Vector
#' # Dataframe
#' # Standalone with rec_with_table (in \dontrun{})
#' @references
#' @seealso
#' @export
```

**`@param ...` rule**: If deprecated aliases use `@rdname` pointing to your function and their signature is `function(...)`, you MUST add `@param ... Arguments passed from deprecated aliases.` to your roxygen. Otherwise R CMD check will report "Undocumented arguments in Rd file: '...'".

### 4. Write deprecated aliases (if renaming)

If the function replaces an older function name, add aliases in `R/deprecated-aliases.R`:

```r
#' @rdname new_function_name
#' @export
old_function_name <- function(...) {
  .Deprecated("new_function_name",
    msg = "old_function_name() is deprecated. Use new_function_name() instead.")
  new_function_name(...)
}
```
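
A check that an alias both warns and delegates could look like the sketch below. It uses only base R so it runs standalone; `new_function_name()` is given a hypothetical body (`x * 2`) purely to make the example executable. In a testthat file, `testthat::expect_warning()` would normally take the place of the manual handler.

```r
# Hypothetical functions, defined only to make the sketch runnable
new_function_name <- function(x) x * 2

old_function_name <- function(...) {
  .Deprecated("new_function_name",
    msg = "old_function_name() is deprecated. Use new_function_name() instead.")
  new_function_name(...)
}

warning_message <- NULL
result <- withCallingHandlers(
  old_function_name(21),
  warning = function(w) {
    warning_message <<- conditionMessage(w)  # record the deprecation warning
    invokeRestart("muffleWarning")           # silence it and let the call finish
  }
)

stopifnot(
  grepl("deprecated", warning_message),  # the alias warned
  identical(result, 42)                  # and delegated to the new function
)
```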

### 5. Update worksheets (if needed)

If the function is referenced from `variable_details.csv` via `Func::`:

- Update `recEnd` to point to the new function name
- Update `dummyVariable` if function name changed
- Run `Rscript exec/fix-worksheets.R` after any CSV modification
- Rebuild RData if worksheet structure changed (see cchsflow-worksheets skill)

## Done criteria

**Before committing DV function changes, ALL of these must pass.** Run them in order — earlier checks are faster and catch different issues.

### Check 1: Unit tests pass

```bash
# From the project root (or worktree root)
Rscript -e 'devtools::load_all(); testthat::test_file("tests/testthat/test-<domain>.R")'
```

Verify: 0 failures for in-scope tests. Pre-existing failures in other test files are acceptable (note them but don't block on them).

### Check 2: R CMD check passes

```bash
# Quick check — catches NAMESPACE, roxygen, imports (skips tests/examples)
Rscript -e 'devtools::check(document = FALSE, args = "--no-tests --no-examples --no-vignettes --no-manual")'

# Full check — recommended before PR
Rscript -e 'devtools::check()'
```

Verify: 0 **new** errors/warnings/notes compared to the branch baseline. Common issues caught only here:

- Undocumented `...` from `@rdname` aliases
- Missing NAMESPACE exports
- Broken `@examples`
- Undeclared imports in DESCRIPTION

### Check 3: Worksheet validation (if worksheets changed)

Invoke the `cchsflow-validation` skill, or run manually:

```bash
Rscript exec/fix-worksheets.R
```

### Check 4: Roxygen checklist

Verify manually against the template in Step 3 above:

- [ ] `@title`, `@description`, `@details` present
- [ ] All `@param` documented (including `...` if aliases exist)
- [ ] `@examples` includes scalar, vector, dataframe, and `rec_with_table()`
- [ ] `@return` describes type and range
- [ ] `@export` present
- [ ] `@seealso` links related functions

### Check 5: Test coverage checklist

- [ ] Every `case_when()` branch has a test
- [ ] Scalar, vector, and dataframe inputs tested
- [ ] Missing inputs tested (`NA`, `tagged_na("a")`, `tagged_na("b")`)
- [ ] Boundary values tested (for categorization functions)
- [ ] Deprecated aliases tested (expect deprecation warning + correct delegation)

## Cross-references

### Related cchsflow skills

- **cchsflow-review** — PR review of worksheet changes (L0-L6 process). Lives on `skills/review-validation` branch.
- **cchsflow-validation** — programmatic worksheet validation. Lives on `skills/review-validation` branch.
- **cchsflow-worksheets** — worksheet authoring guidance. Lives on `skills/review-validation` branch.

### External references

- R CMD check guidance: `~/github/ai-infrastructure/context/domains/r_packages.md` § "Local verification before committing"
- V3 coding standards: project memory `project_derive_function_standards.md`
- Reference implementations: `calculate_bmi()` in `R/bmi.R` (formula), `calculate_pack_years()` in `R/smoke-pack-years.R` (complex)
115 changes: 115 additions & 0 deletions .claude/skills/cchsflow-derive/docs/7-levels.md
@@ -0,0 +1,115 @@
# Function levels (L1-L7)

A taxonomy of reusable function complexity. Higher levels compose lower
levels. Understanding the level helps you write the right amount of code
and reuse existing infrastructure.

## Level definitions

| Level | Name | Purpose | Example |
|-------|------|---------|---------|
| L1 | Foundational utility | Low-level missing data, cleaning, pattern detection | `any_missing()`, `clean_variables()`, `assign_missing()` |
| L2 | Midpoint mapping | Convert categorical ranges to continuous values via lookup table | `smkg_age_midpoint()` |
| L3 | Single-source pass-through | Wrap and clean a single input, worksheet handles routing | `calculate_age_start_smoking()` |
| L4 | Categorical-to-continuous conversion | Apply midpoint imputation with domain logic | `calculate_SMK_06A_cont()` |
| L5 | Filter/route by status | Extract subset of input based on status filtering | `calculate_SMKG203_cont()`, `assess_quit_pathway()` |
| L6 | Multi-source combining | Route multiple sources with priority hierarchy | `calculate_time_quit_smoking_complete()` |
| L7 | Complex multi-source unification | Full decision tree combining multiple inputs | `calculate_SMKDSTY_cat6()`, `calculate_pack_years()` |

## Decision tree

Use this to classify your function:

```
Does your function just pass through a single source?
  → YES → L3 (pass-through)
  → NO ↓

Does it convert categories to continuous values?
  → YES, using a lookup table only → L2 (midpoint mapping)
  → YES, with domain logic → L4 (cat-to-continuous)
  → NO ↓

Does it filter/extract based on a status variable?
  → YES, single source filtered by status → L5 (filter/route)
  → NO ↓

Does it combine multiple sources with priority?
  → YES, with pathway-aware routing → L6 (combining)
  → NO ↓

Does it have a complex decision tree with multiple inputs?
  → YES → L7 (complex unification)
```
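
The same questions, asked in the same order, can be written as a small helper. This is only an illustrative sketch: `classify_level()` and its argument names are hypothetical and are not part of cchsflow.

```r
# Hypothetical helper mirroring the decision tree; not a cchsflow function.
# Each argument answers one question from the tree with TRUE or FALSE.
classify_level <- function(single_source_passthrough = FALSE,
                           cat_to_continuous = FALSE,
                           lookup_table_only = FALSE,
                           status_filter = FALSE,
                           multi_source_priority = FALSE,
                           complex_decision_tree = FALSE) {
  if (single_source_passthrough) return("L3")
  if (cat_to_continuous) return(if (lookup_table_only) "L2" else "L4")
  if (status_filter) return("L5")
  if (multi_source_priority) return("L6")
  if (complex_decision_tree) return("L7")
  NA_character_  # none of the questions matched
}

classify_level(cat_to_continuous = TRUE, lookup_table_only = TRUE)
classify_level(multi_source_priority = TRUE)
```

Because the questions are evaluated top to bottom, a function that answers YES to an earlier question is classified there even if later questions would also apply.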

## How levels compose

Pack-years demonstrates the full stack:

```
calculate_pack_years (L7)
├── clean_variables() (L1)
├── any_missing() + get_priority_missing() (L1)
├── SMKDSTY_A (L7: calculate_SMKDSTY_cat6)
├── age_start_smoking (L3: calculate_age_start_smoking)
│   └── derive_passthrough() (L1)
├── time_quit_smoking (L6: calculate_time_quit_smoking_complete)
│   ├── calculate_SMK_06A_cont() (L4)
│   │   └── smkg_age_midpoint() (L2)
│   └── pathway logic with SMK_10_gate (L5: assess_quit_pathway)
├── cigs_per_day (L7: calculate_cigs_per_day)
│   └── status-based routing (L5 pattern)
└── age (L3: via worksheet routing)
```

## Level-by-level guidance

### L1: Foundational utilities

These are shared infrastructure. You rarely write new L1 functions — you
use them. Key functions to know:

- `clean_variables(vars, variable_details, output_format)` — step 1 and 3
- `any_missing(var1, var2, ...)` — vectorised missing detection
- `get_priority_missing(var1, var2, ...)` — NA::b wins over NA::a
- `assign_missing(type, var_name, variable_details)` — create typed missing
- `derive_passthrough(value, variable_name, variable_details, output_format)` — L3 helper
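
The priority rule ("NA::b wins over NA::a") can be illustrated with haven's tagged missing values. The function below is a hypothetical sketch of the idea, not the cchsflow implementation of `get_priority_missing()`; it assumes all inputs are numeric vectors of equal length and that the `haven` package is installed.

```r
library(haven)

# Hypothetical sketch of row-wise missing-type priority; NOT the cchsflow function.
priority_missing_sketch <- function(...) {
  # One column of NA tags per input; NA_character_ where a value is not tagged
  tags <- do.call(cbind, lapply(list(...), haven::na_tag))
  has_a <- rowSums(tags == "a", na.rm = TRUE) > 0
  has_b <- rowSums(tags == "b", na.rm = TRUE) > 0

  out <- rep(NA_real_, nrow(tags))
  out[has_a] <- haven::tagged_na("a")
  out[has_b] <- haven::tagged_na("b")  # assigned last, so "b" overrides "a"
  out
}

x <- c(1, tagged_na("a"), tagged_na("a"))
y <- c(2, 3, tagged_na("b"))

result_tags <- na_tag(priority_missing_sketch(x, y))
```

Rows with no tagged input come back as a plain `NA` in this sketch; in the 3-step architecture such a helper would presumably be reached only for rows where `any_missing()` is already `TRUE`.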

### L2: Midpoint mapping

A lookup table that converts categorical codes to continuous values.
Typically a simple named vector or small helper function.

```r
smkg_age_midpoint <- function(category) {
  # Position i holds the midpoint for category code i
  midpoints <- c(8, 13, 16, 18.5, 22, 27, 32, 37, 42, 47, 55)
  midpoints[category]
}
```
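
One caveat of plain positional indexing in R: an index of `0` is silently dropped, so the output can be shorter than the input, while `NA` or out-of-range codes return `NA`. If category codes are not guaranteed to be clean at this point, a defensive variant might look like the following sketch. The name `smkg_age_midpoint_safe()` is hypothetical; the midpoints are copied from the example above.

```r
# Hypothetical defensive variant; always returns one value per input element
smkg_age_midpoint_safe <- function(category) {
  midpoints <- c(8, 13, 16, 18.5, 22, 27, 32, 37, 42, 47, 55)
  out <- rep(NA_real_, length(category))
  # Valid codes are whole numbers between 1 and the number of midpoints
  valid <- !is.na(category) &
    category >= 1 &
    category <= length(midpoints) &
    category == floor(category)
  out[valid] <- midpoints[category[valid]]
  out
}

plain_lookup <- c(8, 13, 16, 18.5)[c(1, 0, 4)]              # length 2: the 0 is dropped
safe_lookup <- smkg_age_midpoint_safe(c(1, 0, 4, 12, NA))   # length 5, NA for invalid codes
```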

### L3: Single-source pass-through

Minimal wrapper around `derive_passthrough()`. The worksheet handles
which source variable to feed in.

```r
calculate_age_start_smoking <- function(
    age_start_smoking, variable_details = NULL, output_format = "tagged_na") {
  derive_passthrough(age_start_smoking, "age_start_smoking",
    variable_details, output_format)
}
```

### L4-L7: See pattern docs

These levels correspond to specific patterns:

- L4 → `patterns/cat-to-continuous.md`
- L5 → `patterns/multi-source-routing.md` (filter variant)
- L6 → `patterns/multi-source-routing.md` or `patterns/pathway-branching.md`
- L7 → `patterns/formula-calculation.md` or `patterns/category-grouping.md`

## Existing function inventory

See `function-inventory.md` for a complete mapping of all current DV
functions to their levels and patterns.