
Refine cpm_spec() with helper-based screening, weighting, and modeling #54

Draft
psychelzh wants to merge 10 commits into main from codex/enhance-cpm-spec-api

Conversation


@psychelzh psychelzh commented Mar 17, 2026

Summary

This draft PR refines the native cpm_spec() API around a clearer staged mental model while keeping the current CPM prediction structure intact.

The key clarification in this version is that the Stage 1 predictors currently exposed by the package are CPM-derived network-strength predictors:

  • screened positive edges define a positive_strength
  • screened negative edges define a negative_strength
  • feature_space = "separate" exposes those strengths directly
  • feature_space = "net" exposes one net_strength = positive_strength - negative_strength
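As a hedged, base-R sketch of those Stage 1 predictors (the fixed 0.3 cutoff and the mask construction are illustrative only; the package screens edges fold-locally via cpm_screen()), the three strengths could be built like this:

```r
# Base-R sketch of the CPM network-strength predictors. The screening
# rule here (hard correlation cutoff of 0.3) is an assumption, not the
# package's fold-local screening implementation.
set.seed(1)
conmat <- matrix(rnorm(20 * 12), nrow = 20)  # subjects x edges
behav <- rnorm(20)

r <- apply(conmat, 2, cor, y = behav)        # edge-behavior correlations
pos_mask <- r > 0.3                          # screened positive edges
neg_mask <- r < -0.3                         # screened negative edges

positive_strength <- rowSums(conmat[, pos_mask, drop = FALSE])
negative_strength <- rowSums(conmat[, neg_mask, drop = FALSE])
net_strength <- positive_strength - negative_strength  # "net" feature space
```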

Stage 2 is now explicit via model = cpm_model_lm(), but remains intentionally minimal in this PR.

The public surface is now:

cpm_spec(
  screen = ...,
  feature_space = ...,
  weighting = ...,
  model = ...,
  bias_correct = ...
)

Helper constructors now cover the parts that naturally belong together:

  • screen = cpm_screen(...) with thresholds built by cpm_threshold()
  • weighting = cpm_weighting(...)
  • model = cpm_model_lm()

This PR also keeps weighted CPM summaries via sigmoid edge weights centered on the fold-local threshold.
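The shape of such a weighting can be pictured with a small standalone sketch; the exact curve and the scale default used by cpm_weighting("sigmoid", ...) are assumptions here, not copied from the package:

```r
# Illustrative sigmoid edge weight centered on a fold-local threshold:
# weights approach 0 well below the threshold, equal 0.5 exactly at the
# threshold, and approach 1 well above it. Functional form is assumed.
sigmoid_weight <- function(r, threshold, scale = 0.03) {
  1 / (1 + exp(-(r - threshold) / scale))
}

sigmoid_weight(c(0.20, 0.30, 0.40), threshold = 0.30)
# roughly 0.034, 0.500, 0.966
```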

Implementation work previously tracked in #50 and #51 is now covered here. The older prediction_head extension thread from #52 has been superseded by #55.

Refs #44
Refs #50
Refs #51
Refs #55

What Changed

  • replaced the public prediction_head argument with an explicit model = cpm_model_lm() helper
  • renamed network_summary to feature_space so the API matches what this stage really controls
  • clarified the feature-space documentation around CPM-derived network strengths:
    • positive / negative refer to screened positive- and negative-association edge sets
    • separate exposes positive_strength and negative_strength
    • joint is the classic model stream that uses both strengths together
    • net is the single-stream view over net_strength
  • renamed the classic joint stream from combined to joint
  • renamed the single signed stream from difference to net, while keeping the internal single-feature column explicit as net_strength
  • kept the richer CPM feature-selection surface introduced on this branch:
    • pearson / spearman screening
    • alpha / sparsity / effect_size thresholding
    • binary / sigmoid edge weighting
    • separate / net feature construction
  • fixed the aliased joint-stream regression path when one strength column is degenerate
  • updated print output, tidy output, Rd files, README, vignettes, pkgdown reference config, and tests to match the naming cleanup

Notes For Review

  • this is intentionally a breaking API cleanup; backward compatibility is not preserved because the package is still pre-stable
  • return_edges still reports the hard-thresholded positive / negative masks
  • weighted summaries use continuous weights stored in fit_obj$model$edge_weights
  • model = cpm_model_lm() is deliberately minimal in this PR: the goal here is to establish the Stage 2 API seam without yet expanding model families
  • fitted Stage 2 models now live under fit_obj$model$outcome_models, with the active stream names recorded in fit_obj$model$prediction_streams
  • the current implementation is intentionally network-strength-first; some CPM papers and future extensions may want other subject-level predictors, and that broader design space now lives in #55 (Follow-up: complete the two-stage CPM feature/model API)
  • threshold tuning and fold-level reuse remain in #44 (Implement CPM-aware threshold tuning with fold-level statistic reuse)

Reprexes

The examples below were generated on this branch.

1. Default CPM with the staged helper surface

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(1)
conmat <- matrix(rnorm(20 * 12), nrow = 20)
behav <- rowMeans(conmat[, 1:4, drop = FALSE]) + rnorm(20, sd = 0.3)

spec <- cpm_spec()
fit_obj <- fit(spec, conmat = conmat, behav = behav)

fit_obj$predictions |>
  head(3)
#>   row      real       joint    positive    negative
#> 1   1 0.8445976  0.66360265  0.65017850  0.46429658
#> 2   2 0.4785268 -0.09850425 -0.06938406 -0.26333511
#> 3   3 0.2234539  0.45702649  0.48282936 -0.01239408

2. Rich screening with an explicit model helper

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(2)
conmat <- matrix(rnorm(24 * 10), nrow = 24)
behav <- rank(rowMeans(conmat[, 1:5, drop = FALSE])) + rnorm(24, sd = 0.2)

screened_spec <- cpm_spec(
  screen = cpm_screen(
    association = "spearman",
    threshold = cpm_threshold("effect_size", level = 0.15)
  ),
  feature_space = "net",
  model = cpm_model_lm()
)

screened_fit <- fit(screened_spec, conmat = conmat, behav = behav)

names(screened_fit$predictions)
#> [1] "row"  "real" "net"
head(screened_fit$predictions, 3)
#>   row      real        net
#> 1   1 12.983644 -3.0590394
#> 2   2  3.577684 -5.6445905
#> 3   3 11.234565 -2.8263688

3. Weighted CPM with cpm_weighting()

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(1)
conmat <- matrix(rnorm(20 * 12), nrow = 20)
behav <- rowMeans(conmat[, 1:4, drop = FALSE]) + rnorm(20, sd = 0.3)

weighted_spec <- cpm_spec(
  screen = cpm_screen(
    threshold = cpm_threshold("alpha", level = 0.05)
  ),
  weighting = cpm_weighting("sigmoid", scale = 0.03),
  feature_space = "separate",
  model = cpm_model_lm()
)

weighted_fit <- fit(weighted_spec, conmat = conmat, behav = behav)

head(weighted_fit$predictions, 3)
#>   row      real       joint   positive     negative
#> 1   1 0.8445976  0.71610009  0.7064192  0.477684518
#> 2   2 0.4785268 -0.07018704 -0.0484309 -0.271327133
#> 3   3 0.2234539  0.46870275  0.4859598 -0.006188583
round(weighted_fit$model$edge_weights[1:6, ], 3)
#>      positive negative
#> [1,]    0.048        0
#> [2,]    0.099        0
#> [3,]    0.963        0
#> [4,]    0.989        0
#> [5,]    0.000        0
#> [6,]    0.000        0

4. Resampling with the same staged spec object

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(3)
conmat <- matrix(rnorm(30 * 14), nrow = 30)
behav <- rowMeans(conmat[, 1:6, drop = FALSE]) + rnorm(30, sd = 0.4)

res <- fit_resamples(
  cpm_spec(
    screen = cpm_screen(
      threshold = cpm_threshold("sparsity", level = 0.1)
    ),
    weighting = cpm_weighting("sigmoid", scale = 0.05),
    feature_space = "separate",
    model = cpm_model_lm()
  ),
  conmat = conmat,
  behav = behav,
  kfolds = 5,
  return_edges = "sum"
)

head(res$predictions, 3)
#>   row fold       real       joint   positive    negative
#> 1   1    3 -0.2672746 -0.36871199 -0.3918115  0.04313175
#> 2   2    4  0.5916610  0.04039963  0.1493113 -0.09720055
#> 3   3    5  0.3570704  0.35222763  0.2361003  0.23735100
round(res$edges[1:6, ], 2)
#>      positive negative
#> [1,]        0        0
#> [2,]        0        0
#> [3,]        0        0
#> [4,]        0        0
#> [5,]        4        0
#> [6,]        1        0

Validation

  • ran air format --check R tests
  • ran devtools::document()
  • ran devtools::test() locally
  • measured coverage with covr::package_coverage(type = "tests")
  • current local result: 347 passing tests and 100% test coverage
  • devtools::check() is currently blocked in this environment by missing Pandoc for vignette rebuilding; fallback local R CMD check on a tarball built with --no-build-vignettes returned 0 errors, 2 warnings, and 0 notes, where both warnings reflect skipped vignette outputs


codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (a0b6dd1) to head (cd6c352).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##              main       #54    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           12        12            
  Lines          970      1327   +357     
==========================================
+ Hits           970      1327   +357     


@psychelzh psychelzh force-pushed the codex/enhance-cpm-spec-api branch from d3b69c1 to 0c0b9f0 on March 17, 2026, 14:07
@psychelzh
Owner Author

I think the next implementation step is less about adding more options to
prediction_head, and more about making the pipeline boundary explicit.

What seems most promising to me is a two-stage CPM design:

  1. Stage 1 builds CPM-derived subject-level features from edges
  2. Stage 2 fits a statistical model on those features

Concretely, that suggests a future shape more like:

cpm_spec(
  screen = cpm_screen(...),
  feature_space = "separate",
  weighting = cpm_weighting(...),
  model = cpm_model_lm()
)

The key implementation idea is that Stage 1 remains the CPM-specific core:
fold-local screening, thresholding, weighting, and feature construction. Stage
2 should be as small as possible and ideally reuse existing R model engines
instead of growing a custom mini-framework.

That means:

  • lm() should be the default first model
  • polynomial terms should likely come from the model formula rather than a
    separate head option
  • if we later support binary outcomes, glm(..., family = binomial()) is a
    much more meaningful extension than linear_no_intercept
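Sticking with existing engines, that Stage 2 could stay as small as the following sketch (the feature names mirror the Stage 1 strengths; the simulated data are illustrative):

```r
# Minimal Stage 2 on CPM summary features, reusing base R model engines
# rather than a custom mini-framework. Data are simulated for the sketch.
set.seed(4)
features <- data.frame(
  positive_strength = rnorm(30),
  negative_strength = rnorm(30)
)
behav <- 0.5 * features$positive_strength -
  0.3 * features$negative_strength + rnorm(30, sd = 0.2)

# Default first model: plain lm() on the summary features
fit_lm <- lm(behav ~ positive_strength + negative_strength, data = features)

# A later binary-outcome extension would swap in glm(..., binomial())
behav_bin <- as.integer(behav > median(behav))
fit_glm <- glm(behav_bin ~ positive_strength + negative_strength,
               data = features, family = binomial())
```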

One important naming point from the current API discussion: combined is not
really a feature column. It is a joint model over positive and negative.
So if we move toward a model-based second stage, it would be cleaner if the
feature layer exposed actual predictors:

  • positive, negative for the classic separate-feature case
  • possibly something like net_strength for the one-feature case, instead of
    the more ambiguous difference

I also think this would help preserve the meaning of return_edges:

  • return_edges should keep meaning “which edges were selected in Stage 1”
  • it should not be overloaded to mean model importance

That is another reason I would keep the first model expansion limited to
summary-feature models rather than edge-level learners. Once Stage 2 starts
using edge-level predictors directly, return_edges becomes much harder to
interpret.

So if we want an incremental path, my suggestion would be:

  1. replace prediction_head with model
  2. keep Stage 2 restricted to CPM summary features
  3. start with cpm_model_lm()
  4. then consider glm(binomial) and formula-based polynomial extensions

That would keep the current PR focused, while giving the next round of API work
a cleaner conceptual foundation.
