
Refine cpm_spec() with helper-based screening, weighting, and modeling #54

Draft
psychelzh wants to merge 10 commits into main from codex/enhance-cpm-spec-api

Conversation


@psychelzh psychelzh commented Mar 17, 2026

Summary

This draft PR refines the native cpm_spec() API around a clearer staged mental model while keeping the current CPM prediction structure intact.

The key clarification in this version is that the Stage 1 predictors currently exposed by the package are CPM-derived network-strength predictors:

  • screened positive edges define a positive_strength
  • screened negative edges define a negative_strength
  • feature_space = "separate" exposes those strengths directly
  • feature_space = "net" exposes one net_strength = positive_strength - negative_strength
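As a hedged, base-R sketch of those Stage 1 predictors (the fixed 0.3 cutoff and the mask construction are illustrative only; the package screens edges fold-locally via cpm_screen()), the three strengths could be built like this:

```r
# Base-R sketch of the CPM network-strength predictors. The screening
# rule here (hard correlation cutoff of 0.3) is an assumption, not the
# package's fold-local screening implementation.
set.seed(1)
conmat <- matrix(rnorm(20 * 12), nrow = 20)  # subjects x edges
behav <- rnorm(20)

r <- apply(conmat, 2, cor, y = behav)        # edge-behavior correlations
pos_mask <- r > 0.3                          # screened positive edges
neg_mask <- r < -0.3                         # screened negative edges

positive_strength <- rowSums(conmat[, pos_mask, drop = FALSE])
negative_strength <- rowSums(conmat[, neg_mask, drop = FALSE])
net_strength <- positive_strength - negative_strength  # "net" feature space
```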

Stage 2 is now explicit via model = cpm_model_lm(), but remains intentionally minimal in this PR.

The public surface is now:

cpm_spec(
  screen = ...,
  feature_space = ...,
  weighting = ...,
  model = ...,
  bias_correct = ...
)

Helper constructors now cover the parts that naturally belong together:

  • screen = cpm_screen(...) with thresholds built by cpm_threshold()
  • weighting = cpm_weighting(...)
  • model = cpm_model_lm()

This PR also keeps weighted CPM summaries via sigmoid edge weights centered on the fold-local threshold.
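The shape of such a weighting can be pictured with a small standalone sketch; the exact curve and the scale default used by cpm_weighting("sigmoid", ...) are assumptions here, not copied from the package:

```r
# Illustrative sigmoid edge weight centered on a fold-local threshold:
# weights approach 0 well below the threshold, equal 0.5 exactly at the
# threshold, and approach 1 well above it. Functional form is assumed.
sigmoid_weight <- function(r, threshold, scale = 0.03) {
  1 / (1 + exp(-(r - threshold) / scale))
}

sigmoid_weight(c(0.20, 0.30, 0.40), threshold = 0.30)
# roughly 0.034, 0.500, 0.966
```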

Implementation work previously tracked in #50 and #51 is now covered here. The older prediction_head extension thread from #52 has been superseded by #55.

Refs #44
Refs #50
Refs #51
Refs #55

What Changed

  • replaced the public prediction_head argument with an explicit model = cpm_model_lm() helper
  • renamed network_summary to feature_space so the API matches what this stage really controls
  • clarified the feature-space documentation around CPM-derived network strengths:
    • positive / negative refer to screened positive- and negative-association edge sets
    • separate exposes positive_strength and negative_strength
    • joint is the classic model stream that uses both strengths together
    • net is the single-stream view over net_strength
  • renamed the classic joint stream from combined to joint
  • renamed the single signed stream from difference to net, while keeping the internal single-feature column explicit as net_strength
  • kept the richer CPM feature-selection surface introduced on this branch:
    • pearson / spearman screening
    • alpha / sparsity / effect_size thresholding
    • binary / sigmoid edge weighting
    • separate / net feature construction
  • fixed the aliased joint-stream regression path when one strength column is degenerate
  • updated print output, tidy output, Rd files, README, vignettes, pkgdown reference config, and tests to match the naming cleanup

Notes For Review

  • this is intentionally a breaking API cleanup; backward compatibility is not preserved because the package is still pre-stable
  • return_edges still reports the hard-thresholded positive / negative masks
  • weighted summaries use continuous weights stored in fit_obj$model$edge_weights
  • model = cpm_model_lm() is deliberately minimal in this PR: the goal here is to establish the Stage 2 API seam without yet expanding model families
  • fitted Stage 2 models now live under fit_obj$model$outcome_models, with the active stream names recorded in fit_obj$model$prediction_streams
  • the current implementation is intentionally network-strength-first; some CPM papers and future extensions may want other subject-level predictors, and that broader design space now lives in #55 (Follow-up: complete the two-stage CPM feature/model API)
  • threshold tuning and fold-level reuse remain in #44 (Implement CPM-aware threshold tuning with fold-level statistic reuse)

Reprexes

The examples below were generated on this branch.

1. Default CPM with the staged helper surface

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(1)
conmat <- matrix(rnorm(20 * 12), nrow = 20)
behav <- rowMeans(conmat[, 1:4, drop = FALSE]) + rnorm(20, sd = 0.3)

spec <- cpm_spec()
fit_obj <- fit(spec, conmat = conmat, behav = behav)

fit_obj$predictions |>
  head(3)
#>   row      real       joint    positive    negative
#> 1   1 0.8445976  0.66360265  0.65017850  0.46429658
#> 2   2 0.4785268 -0.09850425 -0.06938406 -0.26333511
#> 3   3 0.2234539  0.45702649  0.48282936 -0.01239408

2. Rich screening with an explicit model helper

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(2)
conmat <- matrix(rnorm(24 * 10), nrow = 24)
behav <- rank(rowMeans(conmat[, 1:5, drop = FALSE])) + rnorm(24, sd = 0.2)

screened_spec <- cpm_spec(
  screen = cpm_screen(
    association = "spearman",
    threshold = cpm_threshold("effect_size", level = 0.15)
  ),
  feature_space = "net",
  model = cpm_model_lm()
)

screened_fit <- fit(screened_spec, conmat = conmat, behav = behav)

names(screened_fit$predictions)
#> [1] "row"  "real" "net"
head(screened_fit$predictions, 3)
#>   row      real        net
#> 1   1 12.983644 -3.0590394
#> 2   2  3.577684 -5.6445905
#> 3   3 11.234565 -2.8263688

3. Weighted CPM with cpm_weighting()

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(1)
conmat <- matrix(rnorm(20 * 12), nrow = 20)
behav <- rowMeans(conmat[, 1:4, drop = FALSE]) + rnorm(20, sd = 0.3)

weighted_spec <- cpm_spec(
  screen = cpm_screen(
    threshold = cpm_threshold("alpha", level = 0.05)
  ),
  weighting = cpm_weighting("sigmoid", scale = 0.03),
  feature_space = "separate",
  model = cpm_model_lm()
)

weighted_fit <- fit(weighted_spec, conmat = conmat, behav = behav)

head(weighted_fit$predictions, 3)
#>   row      real       joint   positive     negative
#> 1   1 0.8445976  0.71610009  0.7064192  0.477684518
#> 2   2 0.4785268 -0.07018704 -0.0484309 -0.271327133
#> 3   3 0.2234539  0.46870275  0.4859598 -0.006188583
round(weighted_fit$model$edge_weights[1:6, ], 3)
#>      positive negative
#> [1,]    0.048        0
#> [2,]    0.099        0
#> [3,]    0.963        0
#> [4,]    0.989        0
#> [5,]    0.000        0
#> [6,]    0.000        0

4. Resampling with the same staged spec object

pkgload::load_all(".", quiet = TRUE, export_all = FALSE, helpers = FALSE)
set.seed(3)
conmat <- matrix(rnorm(30 * 14), nrow = 30)
behav <- rowMeans(conmat[, 1:6, drop = FALSE]) + rnorm(30, sd = 0.4)

res <- fit_resamples(
  cpm_spec(
    screen = cpm_screen(
      threshold = cpm_threshold("sparsity", level = 0.1)
    ),
    weighting = cpm_weighting("sigmoid", scale = 0.05),
    feature_space = "separate",
    model = cpm_model_lm()
  ),
  conmat = conmat,
  behav = behav,
  kfolds = 5,
  return_edges = "sum"
)

head(res$predictions, 3)
#>   row fold       real       joint   positive    negative
#> 1   1    3 -0.2672746 -0.36871199 -0.3918115  0.04313175
#> 2   2    4  0.5916610  0.04039963  0.1493113 -0.09720055
#> 3   3    5  0.3570704  0.35222763  0.2361003  0.23735100
round(res$edges[1:6, ], 2)
#>      positive negative
#> [1,]        0        0
#> [2,]        0        0
#> [3,]        0        0
#> [4,]        0        0
#> [5,]        4        0
#> [6,]        1        0

Validation

  • ran air format --check R tests
  • ran devtools::document()
  • ran devtools::test() locally
  • measured coverage with covr::package_coverage(type = "tests")
  • current local result: 347 passing tests and 100% test coverage
  • devtools::check() is currently blocked in this environment by missing Pandoc for vignette rebuilding; fallback local R CMD check on a tarball built with --no-build-vignettes returned 0 errors, 2 warnings, and 0 notes, where both warnings reflect skipped vignette outputs


codecov bot commented Mar 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (a0b6dd1) to head (cd6c352).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##              main       #54    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           12        12            
  Lines          970      1327   +357     
==========================================
+ Hits           970      1327   +357     


@psychelzh psychelzh force-pushed the codex/enhance-cpm-spec-api branch from d3b69c1 to 0c0b9f0 on March 17, 2026, 14:07
@psychelzh
Owner Author

I think the next implementation step is less about adding more options to
prediction_head, and more about making the pipeline boundary explicit.

What seems most promising to me is a two-stage CPM design:

  1. Stage 1 builds CPM-derived subject-level features from edges
  2. Stage 2 fits a statistical model on those features

Concretely, that suggests a future shape more like:

cpm_spec(
  screen = cpm_screen(...),
  feature_space = "separate",
  weighting = cpm_weighting(...),
  model = cpm_model_lm()
)

The key implementation idea is that Stage 1 remains the CPM-specific core:
fold-local screening, thresholding, weighting, and feature construction. Stage
2 should be as small as possible and ideally reuse existing R model engines
instead of growing a custom mini-framework.

That means:

  • lm() should be the default first model
  • polynomial terms should likely come from the model formula rather than a
    separate head option
  • if we later support binary outcomes, glm(..., family = binomial()) is a
    much more meaningful extension than linear_no_intercept
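Sticking with existing engines, that Stage 2 could stay as small as the following sketch (the feature names mirror the Stage 1 strengths; the simulated data are illustrative):

```r
# Minimal Stage 2 on CPM summary features, reusing base R model engines
# rather than a custom mini-framework. Data are simulated for the sketch.
set.seed(4)
features <- data.frame(
  positive_strength = rnorm(30),
  negative_strength = rnorm(30)
)
behav <- 0.5 * features$positive_strength -
  0.3 * features$negative_strength + rnorm(30, sd = 0.2)

# Default first model: plain lm() on the summary features
fit_lm <- lm(behav ~ positive_strength + negative_strength, data = features)

# A later binary-outcome extension would swap in glm(..., binomial())
behav_bin <- as.integer(behav > median(behav))
fit_glm <- glm(behav_bin ~ positive_strength + negative_strength,
               data = features, family = binomial())
```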

One important naming point from the current API discussion: combined is not
really a feature column. It is a joint model over positive and negative.
So if we move toward a model-based second stage, it would be cleaner if the
feature layer exposed actual predictors:

  • positive, negative for the classic separate-feature case
  • possibly something like net_strength for the one-feature case, instead of
    the more ambiguous difference

I also think this would help preserve the meaning of return_edges:

  • return_edges should keep meaning “which edges were selected in Stage 1”
  • it should not be overloaded to mean model importance

That is another reason I would keep the first model expansion limited to
summary-feature models rather than edge-level learners. Once Stage 2 starts
using edge-level predictors directly, return_edges becomes much harder to
interpret.

So if we want an incremental path, my suggestion would be:

  1. replace prediction_head with model
  2. keep Stage 2 restricted to CPM summary features
  3. start with cpm_model_lm()
  4. then consider glm(binomial) and formula-based polynomial extensions

That would keep the current PR focused, while giving the next round of API work
a cleaner conceptual foundation.
