autumn: Fast, Modern, and Tidy Raking

License: MIT

autumn calibrates survey data via iterative proportional fitting (raking), producing weights that align a sample’s marginal distributions with known population targets. This fork extends the original aaronrudkin/autumn with dual Newton calibration (Deville & Särndal 1992), SQUAREM acceleration, automatic sparse-cell collapse, bounded weight calibration, and an outcome-aware design-effect estimator.

Installation

# Install from GitHub:
remotes::install_github("davdittrich/autumn")

Quick Start

harvest() takes a data frame and a target and returns the data frame with a weights column. By default, weights are capped at 5 and scaled to a mean of 1.

weighted_data <- harvest(respondent_data, ns_target)

Two target formats are supported.

Named-vector list:

list(
  gender = c(Male = 0.4829, Female = 0.5171),
  region = c(Midwest = 0.2086, Northeast = 0.1764,
             South   = 0.3775, West     = 0.2374)
)

Three-column data frame (columns variable, level, proportion):

  variable     level proportion
1   gender      Male     0.4829
2   gender    Female     0.5171
3   region   Midwest     0.2086
4   regionDispatch     0.1764
5   region     South     0.3775
6   region      West     0.2374
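The two formats carry the same information; a list target can be flattened into the three-column data frame with a few lines of base R. This is an illustrative sketch (target_list and target_df are local names, not package objects), using the ns_target proportions shown above:

```r
# Rebuild the named-vector list target shown above
target_list <- list(
  gender = c(Male = 0.4829, Female = 0.5171),
  region = c(Midwest = 0.2086, Northeast = 0.1764,
             South   = 0.3775, West      = 0.2374)
)

# Flatten it into the equivalent three-column data-frame format
target_df <- do.call(rbind, lapply(names(target_list), function(v) {
  data.frame(variable   = v,
             level      = names(target_list[[v]]),
             proportion = unname(target_list[[v]]))
}))
```

Either object can then be passed as the target argument to harvest().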

Key Capabilities

Auto-collapse of sparse or missing cells

Calibration fails when a target level has zero respondents, and produces extreme weights when a cell is too small to satisfy its target under the weight cap. auto_collapse = TRUE detects both conditions and merges each sparse level into its most similar surviving level before calibration begins.

The merge threshold for level $k$ is $\max\!\left(3,\ \lfloor n \cdot 0.002 \rfloor,\ \lceil n p_k / w_{\max} \rceil\right)$. The third term is the minimum count at which the cell’s expected weight stays at or below max_weight; the second ensures that very sparse cells are still merged when a loose cap makes the third term small.
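As a concrete illustration of the rule above, a minimal sketch (the function name and signature are illustrative, not part of autumn’s API):

```r
# Sketch of the merge-threshold rule described above; illustrative only,
# not autumn's internal code.
collapse_threshold <- function(n, p_k, max_weight = 5) {
  max(3,
      floor(n * 0.002),               # keeps very sparse cells merging under loose caps
      ceiling(n * p_k / max_weight))  # min count keeping expected weight <= cap
}

# Midwest target (p = 0.2086) in a sample of n = 6691 with the default cap:
collapse_threshold(6691, 0.2086, 5)
#> [1] 280
```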

# One level of 'region' absent from the sample
sparse_data <- respondent_data[respondent_data$region != "Midwest", ]

# Without auto_collapse: stop() with "missing levels" error
# With auto_collapse: Midwest absorbed into nearest neighbour, weights produced
suppressWarnings(
  weighted_sparse <- harvest(sparse_data, ns_target, auto_collapse = TRUE)
)
attr(weighted_sparse, "collapsed_levels")
#>   variable    from  into n_from p_from distance
#>      <chr>   <chr> <chr>  <int>  <dbl>    <dbl>
#> 1   region Midwest  West      0 0.2086        0

For ordered factors, merging is restricted to immediately adjacent surviving levels. Custom auxiliary columns for similarity scoring are supplied via collapse_vars.

Dual Newton calibration

method = "calibrate" solves the calibration system via K-dimensional dual Newton (Deville & Särndal 1992) rather than Gauss-Seidel IPF. The distance function depends on max_weight:

  • Unbounded (max_weight = Inf): raking (exponential) distance. A $K \times K$ linear solve per Newton step converges in 10–20 iterations versus 100–500 for IPF. Use this path when data are severely imbalanced and no weight cap is required.
  • Bounded (max_weight < Inf, including the default of 5): logit distance, which enforces the cap by construction at every Newton iteration. Weights satisfy the cap and calibration targets simultaneously.

# Unbounded — faster on severely imbalanced problems
harvest(respondent_data, ns_target, method = "calibrate", max_weight = Inf)

# Bounded — logit distance enforces cap by construction (default max_weight = 5)
harvest(respondent_data, ns_target, method = "calibrate")
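For intuition, the unbounded (raking-distance) dual Newton update can be sketched in a few lines. This is an illustrative re-derivation, not autumn’s internal implementation; X, targets, and dual_newton_rake are names introduced here for the sketch:

```r
# Illustrative dual Newton solver for the raking (exponential) distance.
# X: n x K dummy matrix of calibration auxiliaries;
# targets: length-K vector of population totals on the weight scale.
dual_newton_rake <- function(X, targets, base = rep(1, nrow(X)),
                             tol = 1e-10, max_iter = 50) {
  lambda <- rep(0, ncol(X))                    # dual variables
  for (iter in seq_len(max_iter)) {
    w     <- base * exp(drop(X %*% lambda))    # current weights
    resid <- targets - drop(crossprod(X, w))   # calibration residuals
    if (max(abs(resid)) < tol) break
    H      <- crossprod(X, X * w)              # K x K Jacobian X' diag(w) X
    lambda <- lambda + solve(H, resid)         # one Newton step
  }
  w
}
```

Each iteration costs a single K × K linear solve, which is why the Newton path needs far fewer passes than coordinate-wise IPF on badly imbalanced data.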

SQUAREM-accelerated IPF

accelerate = TRUE applies the Varadhan-Roland (2008) SqS3 acceleration scheme to standard iterative proportional fitting. On problems where IPF converges slowly, SQUAREM reduces iteration count by an order of magnitude.

harvest(respondent_data, ns_target, accelerate = TRUE)
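A single SqS3 super-step is easy to state. The sketch below is a generic illustration of the Varadhan-Roland scheme for any fixed-point map fp (such as one full IPF pass over all variables), not autumn’s internal code, and it omits the safeguards (steplength damping, fallback to a plain step) a production implementation needs:

```r
# One SqS3 super-step: two plain evaluations of the fixed-point map fp,
# an extrapolation, and a stabilizing evaluation at the extrapolated point.
squarem_step <- function(fp, x0) {
  x1 <- fp(x0)
  x2 <- fp(x1)
  r  <- x1 - x0                               # first difference
  v  <- (x2 - x1) - r                         # second difference
  alpha <- -sqrt(sum(r^2)) / sqrt(sum(v^2))   # SqS3 steplength (assumes v != 0)
  x_acc <- x0 - 2 * alpha * r + alpha^2 * v   # extrapolated iterate
  fp(x_acc)                                   # stabilizing map application
}
```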

Outcome-aware design effects

design_effect() now supports the Henry and Valliant (2015) estimator when outcome, data, and target are all supplied:

$$\hat{d}_{HV} = (1 - \hat{R}^2_w)\,\hat{d}_K$$

where $\hat{R}^2_w$ is the weighted $R^2$ from regressing the outcome on dummy-coded calibration auxiliaries and $\hat{d}_K$ is the Kish (1992) design effect. When the calibration auxiliaries strongly predict the outcome, the design effect approaches zero — correctly reflecting the variance reduction from calibration.

weighted_data <- harvest(respondent_data, ns_target)
# Kish estimator (weights only)
design_effect(weighted_data$weights)

# Henry-Valliant estimator (outcome-aware).
# respondent_data has no continuous outcome; use a binary indicator.
male_indicator <- as.integer(respondent_data$gender == "Male")
design_effect(weighted_data$weights,
              outcome = male_indicator,
              data    = respondent_data,
              target  = ns_target)

design_effect() with only weights now emits a warning and falls back to Kish when the Spencer (2000) estimator would otherwise be invoked; Spencer assumes design weights (inverse inclusion probabilities), which calibration weights are not.
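The formula above can be cross-checked by hand. The helper below is an illustrative sketch (hv_deff is not part of the package), using lm.wfit for the weighted regression:

```r
# Hand computation of the Henry-Valliant design effect from the formula above.
# Illustrative only: w = calibration weights, y = outcome,
# X = model matrix of dummy-coded calibration auxiliaries.
hv_deff <- function(w, y, X) {
  d_kish <- length(w) * sum(w^2) / sum(w)^2     # Kish design effect
  fit    <- lm.wfit(cbind(1, X), y, w)          # weighted least squares
  ybar   <- weighted.mean(y, w)
  r2     <- 1 - sum(w * fit$residuals^2) /
                sum(w * (y - ybar)^2)           # weighted R-squared
  (1 - r2) * d_kish                             # Henry-Valliant estimate
}
```

When X predicts y perfectly, the weighted R² is 1 and the estimate collapses to zero, matching the limiting behaviour described above.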

Additional controls

  • add_na_proportion: inject an ___NA category at the observed missingness rate, preventing listwise deletion from distorting calibration
  • adaptive_order: reorder calibration variables each iteration by current error magnitude
  • convergence: named vector controlling "pct", "absolute", "single_weight", and "time" stopping rules
  • diagnose_weights(): per-variable calibration summary with marginal deviations

Performance

Comparison with anesrake

On the included 6,691-respondent dataset raked on 10 variables, autumn is 61× faster than anesrake (unbounded) and allocates 99% less memory. With a cap of 5 — where both algorithms partially converge since the unconstrained max weight is ~26 — autumn is 1.4× faster:

#> # A data frame: 4 × 3
#>   expression             median mem_alloc
#>   <chr>                <bch:tm> <bch:byt>
#> 1 autumn (unbounded)    190.7ms  157.81MB
#> 2 autumn (cap = 5)        8.15s    7.48GB
#> 3 anesrake (unbounded)   11.62s   13.01GB
#> 4 anesrake (cap = 5)     11.48s   13.01GB

On a harder problem — the same dataset raked on 17 variables — autumn is 60.1× faster (unbounded) and 1.5× faster (cap = 5):

#> # A data frame: 4 × 3
#>   expression             median mem_alloc
#>   <chr>                <bch:tm> <bch:byt>
#> 1 autumn (unbounded)   188.39ms  160.71MB
#> 2 autumn (cap = 5)        7.29s    7.48GB
#> 3 anesrake (unbounded)   11.32s   13.03GB
#> 4 anesrake (cap = 5)     10.94s   13.03GB

survey::rake does not complete the 17-variable rake under default parameters and is excluded from that benchmark.

Calibration method comparison

The six parameter combinations span two axes: bounding (max_weight finite vs. Inf) and algorithm (method = "rake" vs. "calibrate", with and without accelerate):

#> # A data frame: 6 × 4
#>   expression               median mem_alloc n_itr
#>   <chr>                  <bch:tm> <bch:byt> <int>
#> 1 rake_bounded              7.14s    7.48GB    20
#> 2 rake_bounded_squarem   661.06ms  737.89MB    20
#> 3 rake_unbounded          169.8ms  157.09MB    20
#> 4 rake_unbounded_squarem 838.34ms   773.7MB    20
#> 5 calibrate_bounded      593.76ms    1.15GB    20
#> 6 calibrate_unbounded    107.48ms  210.81MB    20

Key guidance from the results above:

  • Unbounded rake (max_weight = Inf) is 42.1× faster than bounded rake on this dataset. The gap is large because bounded rake runs to max iterations without converging (the unconstrained max weight is ~26, exceeding the cap of 5); unbounded rake converges fully. Use it when no cap is needed.
  • SQUAREM (accelerate = TRUE) is 10× faster than plain bounded rake on this dataset. On this infeasible bounded problem, SQUAREM reaches a fixed point near the cap boundary far more quickly than plain IPF. SQUAREM’s advantage scales with problem difficulty; on well-conditioned problems where plain IPF converges in 3–5 passes, per-super-step overhead can offset the iteration savings.
  • Unbounded calibrate (method = "calibrate", max_weight = Inf) is 1.7× faster than unbounded rake on this 10-variable problem. calibrate’s advantage emerges on severely imbalanced data where IPF requires hundreds of iterations; a single K×K Newton step costs more but converges in 10–20 steps versus 500+.
  • Bounded calibrate (method = "calibrate", max_weight = 5, auto_collapse = TRUE) is 10× faster than bounded rake on this dataset. It uses logit distance (Deville & Särndal 1992), which enforces the cap by construction at every Newton step. The unconstrained raking max weight is ~26; auto_collapse merges three sparse cells before calibration, after which mean = 1 and max = 4.51. Note that bounded rake (rake_bounded row) partially converges on this dataset at max_weight = 5 without collapse — the cap binds before full convergence — so its timing reflects an early exit, not a fully calibrated solution.

Tightening the convergence criteria (convergence["pct"] and convergence["absolute"]) yields more precise calibration at the cost of additional iterations.

Why “autumn”?

Authorship

autumn is maintained by Dennis A. V. Dittrich. This package is a fork of the original autumn by Aaron Rudkin; the core IPF implementation and the ns_target dataset originate from that work. Target proportions in ns_target were developed by Alex Rossell-Hayes.

Bug reports and contributions are welcome at the GitHub issue tracker.

Package hex logo adapted from art by Freepik from flaticon.com.
