Golden testing for performance benchmarks. Save timing baselines on first run, compare against them on subsequent runs.
The library was originally inspired by (and dependent on) benchpress, but now has independent measurement routines.
Key Features:
- Architecture-specific baselines (different hardware = different golden files)
- Hybrid tolerance (handles both fast <1ms and slow operations)
- Robust statistics mode (outlier detection, trimmed mean)
- Lens-based custom expectations (assert "must be faster", compare by median, etc.)
```haskell
import Test.Hspec
import Test.Hspec.BenchGolden
import Data.List (sort)

main :: IO ()
main = hspec $ do
  describe "Performance" $ do
    -- Pure function with normal form evaluation (full evaluation of the result)
    benchGolden "list append" $
      nf (\xs -> xs ++ xs) [1..1000]

    -- Custom configuration
    benchGoldenWith defaultBenchConfig
      { iterations = 500
      , tolerancePercent = 10.0
      }
      "sorting" $
      nf sort ([1000, 999 .. 1] :: [Int])
```

Evaluation strategies (required - specify how values are forced):

- `nf f x` - Force the result of `f x` to normal form (deep, full evaluation)
- `nfIO action` - Execute an IO action and force its result to normal form
- `nfAppIO f x` - Apply the function, execute the resulting IO action, and force its result to normal form
- `io action` - Plain IO action without additional forcing
Why evaluation strategies matter: Without forcing, GHC may optimize away computations or share results across iterations, making benchmarks meaningless.
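The difference between weak head normal form and normal form can be seen with the standard `deepseq` package (independent of this library): `seq` only forces the outermost constructor, while `deepseq` forces the entire structure, the way `nf` must for a meaningful benchmark. The names below are illustrative.

```haskell
import Control.DeepSeq (deepseq)
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  -- The second component is an error hidden behind a thunk.
  let pair = (1 :: Int, error "lazy thunk" :: Int)
  -- `seq` stops at the pair constructor: the thunk is never touched.
  whnf <- try (evaluate (pair `seq` "whnf: ok"))
            :: IO (Either SomeException String)
  -- `deepseq` forces the full structure and hits the error.
  full <- try (evaluate (pair `deepseq` "nf: ok"))
            :: IO (Either SomeException String)
  putStrLn (either (const "whnf: failed") id whnf)      -- whnf: ok
  putStrLn (either (const "nf: hit the thunk") id full) -- nf: hit the thunk
```

A benchmark that only reaches weak head normal form may measure nothing but constructor allocation, which is why each `benchGolden` call requires an explicit forcing strategy.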
First run creates `.golden/<arch>/<benchmark_name>.golden` with baseline stats.
Subsequent runs compare against the baseline. The test fails if the mean time changes beyond tolerance (default: ±15% OR ±0.01ms).
Output format:

```
Metric    Baseline    Actual      Diff
------    --------    ------      ----
Mean      0.150 ms    0.170 ms    +13.3%
```
Update baselines after intentional changes:

```shell
GOLDS_GYM_ACCEPT=1 stack test
```

Golden files store timing statistics per architecture (e.g., `.golden/aarch64-darwin-Apple_M1/`):
```json
{
  "mean": 1.234,
  "stddev": 0.056,
  "median": 1.201,
  "architecture": "aarch64-darwin-Apple_M1",
  "timestamp": "2026-01-30T12:00:00Z"
}
```

Hybrid tolerance (default) prevents false failures: benchmarks pass if within ±15% OR ±0.01ms. This handles measurement noise for fast operations (<1ms) while catching real regressions in slower code.
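The hybrid rule can be sketched as a small pure predicate. This is an illustrative model of the behavior described above, not the library's actual implementation; the function and argument names are assumptions.

```haskell
-- A run passes when the mean is within the percentage tolerance
-- OR within the absolute tolerance (hypothetical sketch).
hybridPass :: Double  -- baseline mean (ms)
           -> Double  -- measured mean (ms)
           -> Double  -- percent tolerance, e.g. 15.0
           -> Double  -- absolute tolerance (ms), e.g. 0.01
           -> Bool
hybridPass baseline actual pct absMs =
  let diff    = abs (actual - baseline)
      pctDiff = diff / baseline * 100
  in pctDiff <= pct || diff <= absMs
```

Under this model, a 0.005 ms baseline that drifts to 0.012 ms is a 140% change but still passes on the absolute term, while a 1.0 ms → 1.3 ms regression (30%) fails both terms.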
Key `BenchConfig` options:

| Field | Default | Description |
|---|---|---|
| `iterations` | `100` | Number of benchmark iterations |
| `tolerancePercent` | `15.0` | Allowed mean time deviation (%) |
| `absoluteToleranceMs` | `Just 0.01` | Absolute tolerance (ms); enables hybrid mode |
| `useRobustStatistics` | `False` | Use trimmed mean/MAD instead of mean/stddev |
| `warmupIterations` | `5` | Warm-up runs before measurement |
See the `BenchConfig` type for all options.
Environment variables:
- `GOLDS_GYM_ACCEPT=1` - Regenerate all golden files
- `GOLDS_GYM_SKIP=1` - Skip benchmarks entirely (useful in CI)
Standard mean/stddev are sensitive to outliers (GC pauses, OS scheduling). Robust statistics provide outlier-resistant comparisons:
```haskell
benchGoldenWith defaultBenchConfig
  { useRobustStatistics = True  -- Use trimmed mean + MAD
  , trimPercent = 10.0          -- Remove top/bottom 10%
  , outlierThreshold = 3.0      -- Flag outliers >3 MADs from median
  }
  "noisy benchmark" $
  nf computation input
```

When to use:
- Benchmarking in noisy environments (shared CI, development machines)
- Operations with occasional GC pauses or system interruptions
- Fast operations (<1ms) with high variance
- You see outliers in test output warnings
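The robust statistics this mode enables are standard definitions, sketched below in plain Haskell (the library's internals may differ, but the math is the same):

```haskell
import Data.List (sort)

-- Median of a non-empty sample.
median :: [Double] -> Double
median xs =
  let s = sort xs
      n = length s
  in if odd n
       then s !! (n `div` 2)
       else (s !! (n `div` 2 - 1) + s !! (n `div` 2)) / 2

-- Trimmed mean: drop the top and bottom pct% of samples before
-- averaging, so occasional GC pauses cannot skew the result.
trimmedMean :: Double -> [Double] -> Double
trimmedMean pct xs =
  let s = sort xs
      n = length s
      k = floor (fromIntegral n * pct / 100)
      kept = take (n - 2 * k) (drop k s)
  in sum kept / fromIntegral (length kept)

-- Median absolute deviation: a robust analogue of stddev.
mad :: [Double] -> Double
mad xs = let m = median xs in median (map (\x -> abs (x - m)) xs)
```

For a sample `[1..9] ++ [100]`, the plain mean is 14.5 but the 10% trimmed mean is 5.5 - the single outlier is discarded rather than averaged in.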
For fine-grained control, use lens-based expectations to assert custom performance requirements:
```haskell
import Test.Hspec.BenchGolden.Lenses

-- Compare by median instead of mean (more robust)
benchGoldenWithExpectation "median comparison" defaultBenchConfig
  [expect _statsMedian (Percent 10.0)]
  (nf myAlgorithm input)

-- Compose multiple requirements (both must pass)
benchGoldenWithExpectation "strict requirements" defaultBenchConfig
  [ expect _statsMean (Percent 15.0) &&~
    expect _statsIQR (Absolute 0.1)  -- Low variance required
  ]
  (nf criticalFunction criticalData)
```

Available lenses: `_statsMean`, `_statsMedian`, `_statsTrimmedMean`, `_statsStddev`, `_statsMAD`, `_statsIQR`, `_statsMin`, `_statsMax`
Tolerance types:
- `Percent 15.0` - Within ±15%
- `Absolute 0.01` - Within ±0.01ms
- `Hybrid 15.0 0.01` - Within ±15% OR ±0.01ms
- `MustImprove 10.0` - Must be ≥10% faster (for testing optimizations)
- `MustRegress 5.0` - Must be ≥5% slower (for accepting controlled regressions)
Composition: `(&&~)` for AND, `(||~)` for OR
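The tolerance variants listed above behave roughly like the following pure model. The constructor names follow the text; the checking logic and the `satisfies` name are assumptions for illustration, not the library's API.

```haskell
-- Hypothetical model of the tolerance variants.
data Tolerance
  = Percent Double        -- within ± pct %
  | Absolute Double       -- within ± ms
  | Hybrid Double Double  -- within ± pct % OR ± ms
  | MustImprove Double    -- at least pct % faster
  | MustRegress Double    -- at least pct % slower

-- Does a measured mean satisfy the tolerance against the baseline?
satisfies :: Tolerance -> Double -> Double -> Bool
satisfies tol baseline actual =
  let diff    = actual - baseline      -- negative means faster
      pctDiff = diff / baseline * 100
  in case tol of
       Percent p     -> abs pctDiff <= p
       Absolute ms   -> abs diff <= ms
       Hybrid p ms   -> abs pctDiff <= p || abs diff <= ms
       MustImprove p -> pctDiff <= negate p
       MustRegress p -> pctDiff >= p
```

Note the asymmetry: `Percent`, `Absolute`, and `Hybrid` bound the change in either direction, while `MustImprove` and `MustRegress` are one-sided and fail when the change does not go far enough.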
- API documentation - Full Haddock docs
- Example benchmarks - Comprehensive usage examples
- CHANGELOG - Version history and migration guides
License: MIT