splita

A/B test analysis that is correct by default, informative by design, and composable by construction.

Quick Start

from splita import Experiment

result = Experiment([0, 1, 0, 1, 0, 1, 0, 0], [1, 0, 1, 1, 0, 1, 1, 1]).run()
print(result)

ExperimentResult
────────────────────────────────────
  metric          conversion
  method          ztest
  control_n       8
  treatment_n     8
────────────────────────────────────
  control_mean    0.3750
  treatment_mean  0.6250
  lift            0.2500
  relative_lift   66.67%
────────────────────────────────────
  statistic       1.4639
  pvalue          0.1432
  ci              [-0.0849, 0.5849]
  significant     False
  alpha           0.0500
────────────────────────────────────
  effect_size     0.5236 (Cohen's h)
  power           0.2825 (post-hoc*)

  * Post-hoc power is a function of the p-value
    and does not provide additional information.
    Use SampleSize for prospective power analysis.

Installation

pip install splita

For ML-powered variance reduction (CUPAC), install with the ml extra:

pip install splita[ml]  # adds scikit-learn

What Can You Do?

Run an A/B test

Pass your data and splita auto-detects the metric type, selects the right test, and returns everything you need.

from splita import Experiment
import numpy as np

rng = np.random.default_rng(42)
ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)

result = Experiment(ctrl, trt).run()
print(result.significant)   # True
print(result.relative_lift)  # ~6%
print(result.to_dict())      # JSON-serialisable dict

Override if you want: metric='continuous', method='bootstrap', alternative='greater', or alpha=0.01. All configuration is keyword-only, so the defaults are always explicit.

Plan your experiment

Figure out how many users you need before you start.

from splita import SampleSize

# Conversion rate test: detect a 2pp lift from a 10% baseline
plan = SampleSize.for_proportion(baseline=0.10, mde=0.02)
print(plan.n_per_variant)  # 3843

# Revenue test: detect a $2 lift with $40 std dev
plan = SampleSize.for_mean(baseline_mean=25.0, baseline_std=40.0, mde=2.0)
print(plan.n_per_variant)  # 6280

# How long will it take?
plan = SampleSize.for_proportion(0.10, 0.02).duration(daily_users=1000)
print(plan.days_needed)  # 8

# Already have a fixed audience? Find the smallest effect you can detect.
mde = SampleSize.mde_for_proportion(baseline=0.10, n=5000)
print(f"{mde:.4f}")  # 0.0173

Check data quality

Sample Ratio Mismatch invalidates everything. Check it first.

from splita import SRMCheck

result = SRMCheck([4900, 5100]).run()
print(result.passed)   # True
print(result.message)  # "No sample ratio mismatch detected (p=0.0736)."

# Unequal splits and multi-variant
result = SRMCheck([8000, 2000], expected_fractions=[0.80, 0.20]).run()

Test multiple metrics

When you test conversion, revenue, and retention in the same experiment, raw p-values lie. Correct them.

from splita import MultipleCorrection

pvalues = [0.01, 0.04, 0.20]
labels = ["conversion", "revenue", "retention"]

result = MultipleCorrection(pvalues, labels=labels).run()
print(result.n_rejected)       # 2
print(result.adjusted_pvalues) # [0.03, 0.06, 0.20]
print(result.rejected)         # [True, False, False]

Methods: 'bh' (Benjamini-Hochberg, default), 'bonferroni', 'holm', 'by' (Benjamini-Yekutieli).

Reduce variance

Lower variance means smaller required sample sizes. CUPED uses pre-experiment data to subtract noise.

from splita.variance import CUPED, OutlierHandler
from splita import Experiment
import numpy as np

rng = np.random.default_rng(42)
pre = rng.normal(10, 2, size=200)
ctrl = pre[:100] + rng.normal(0, 1, 100)
trt = pre[100:] + 0.5 + rng.normal(0, 1, 100)

# Step 1: Cap outliers (thresholds from pooled data to avoid bias)
handler = OutlierHandler(method='winsorize')
ctrl, trt = handler.fit_transform(ctrl, trt)

# Step 2: CUPED adjustment
cuped = CUPED()
ctrl_adj, trt_adj = cuped.fit_transform(ctrl, trt, pre[:100], pre[100:])
print(f"Variance reduction: {cuped.variance_reduction_:.0%}")  # ~75%

# Step 3: Run the test on adjusted data
result = Experiment(ctrl_adj, trt_adj).run()

Monitor in real-time

The peeking problem: checking results daily inflates your false positive rate. mSPRT gives you always-valid p-values that stay correct no matter when you look.

from splita import mSPRT
import numpy as np

test = mSPRT(metric='continuous', alpha=0.05)

# Day 1
ctrl_day1 = np.random.default_rng(1).normal(10, 2, size=100)
trt_day1 = np.random.default_rng(2).normal(10.5, 2, size=100)
state = test.update(ctrl_day1, trt_day1)
print(state.should_stop)          # False
print(state.always_valid_pvalue)  # still high

# Day 2 (incremental update)
ctrl_day2 = np.random.default_rng(3).normal(10, 2, size=200)
trt_day2 = np.random.default_rng(4).normal(10.5, 2, size=200)
state = test.update(ctrl_day2, trt_day2)
print(state.should_stop)          # may now be True

Optimize with bandits

When you want to minimize regret instead of just measuring a difference, use Thompson Sampling to shift traffic toward the winner as data arrives.

from splita import ThompsonSampler
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.05, 0.07, 0.06]  # arm 1 is best

ts = ThompsonSampler(n_arms=3, random_state=42)
for _ in range(1000):
    arm = ts.recommend()
    reward = rng.binomial(1, true_rates[arm])
    ts.update(arm, reward)

result = ts.result()
print(result.current_best_arm)  # 1
print(result.prob_best)         # [~0.01, ~0.95, ~0.04]
print(result.should_stop)       # True (expected loss below threshold)

Full Example: E-commerce A/B Test

A complete workflow from planning through analysis.

import numpy as np
from splita import Experiment, SampleSize, SRMCheck, MultipleCorrection

# 1. Plan: 10% baseline conversion, detect a 1.5pp lift
plan = SampleSize.for_proportion(baseline=0.10, mde=0.015, power=0.80)
print(f"Need {plan.n_per_variant} users per variant")
print(f"At 5,000 users/day: {plan.duration(5000).days_needed} days")

# 2. Simulate experiment data
rng = np.random.default_rng(42)
n = plan.n_per_variant
ctrl = rng.binomial(1, 0.10, size=n)
trt = rng.binomial(1, 0.115, size=n)  # true 1.5pp lift

# 3. SRM check first — always
srm = SRMCheck([len(ctrl), len(trt)]).run()
assert srm.passed, srm.message

# 4. Analyze primary metric
primary = Experiment(ctrl, trt).run()

# 5. Analyze secondary metrics (revenue, engagement)
rev_ctrl = rng.exponential(25, size=n)
rev_trt = rng.exponential(26, size=n)
revenue = Experiment(rev_ctrl, rev_trt).run()

# 6. Correct for multiple testing
corrected = MultipleCorrection(
    [primary.pvalue, revenue.pvalue],
    labels=["conversion", "revenue"],
).run()
print(corrected)

Design Philosophy

Correct by default. Every formula is validated against scipy.stats. Auto-detection picks the right test for your data. Welch's t-test (not Student's), unpooled standard errors for confidence intervals, Cohen's h for proportions. You get the right answer without configuring anything.

Informative errors. Every ValueError follows a 3-part structure: what went wrong, the actual bad value, and a hint for fixing it.

`alpha` must be in (0, 1), got 1.5.
  Detail: alpha=1.5 means a 150% false positive rate.
  Hint: typical values are 0.05, 0.01, or 0.10.

Composable. All results are frozen dataclasses with .to_dict() for serialisation. Pipe OutlierHandler into CUPED into Experiment — each step is a clean function of its inputs. No global state, no side effects.

Zero opinions on your data stack. splita takes arrays and returns dataclasses. It does not care whether your data comes from pandas, Spark, BigQuery, or a CSV. No DataFrame dependencies, no plotting libraries, no ORM integrations.

API Reference

88 classes across 8 modules. Every class returns frozen dataclasses with .to_dict().

Core Analysis (24 classes)

Class	Type	Description	Reference
`Experiment`	Hybrid	Frequentist A/B test (z-test, t-test, Mann-Whitney, chi-square, delta method, bootstrap)	Welch 1947, Deng et al. 2018
`BayesianExperiment`	Original	Bayesian A/B test with P(B>A), expected loss, ROPE	Berry 2006
`ObjectiveBayesianExperiment`	Original	Empirical Bayes A/B test with prior learned from historical experiments	Microsoft/Bing 2019
`QuantileExperiment`	Original	Bootstrap inference at arbitrary quantiles (median, p90, p99)	Efron 1979
`SampleSize`	Hybrid	Power analysis for proportions, means, and ratios	Farrington & Manning 1990, Cohen 1988
`SRMCheck`	Wrapper	Sample Ratio Mismatch detector (chi-square goodness-of-fit)	Fabijan et al. 2019
`MultipleCorrection`	Original	p-value correction (BH, Bonferroni, Holm, BY)	Benjamini & Hochberg 1995
`PermutationTest`	Original	Exact distribution-free hypothesis testing via label permutation	—
`PowerSimulation`	Wrapper	Monte Carlo power for complex designs	—
`HTEEstimator`	Wrapper	Heterogeneous treatment effects (T-learner, S-learner)	Kunzel et al. 2019
`CausalForest`	Hybrid	Honest T-learner with jackknife CIs	Athey & Wager 2018
`TriggeredExperiment`	Wrapper	Intent-to-treat vs per-protocol analysis	Hernan & Robins 2020
`DilutionAnalysis`	Hybrid	Dilute triggered effect back to full ITT population	Deng et al. 2015
`InteractionTest`	Hybrid	Segment-level treatment effect heterogeneity (Cochran's Q)	Cochran 1954
`MultiObjectiveExperiment`	Wrapper	Pareto analysis across multiple metrics	—
`StratifiedExperiment`	Original	Neyman-style stratified inference	Neyman 1923, Miratrix et al. 2013
`FunnelExperiment`	Hybrid	Per-step conversion funnel analysis	Kohavi et al. 2020
`InterleavingExperiment`	Hybrid	Team Draft / Balanced interleaving for ranking comparison	Chapelle et al. 2012
`MetricDecomposition`	Wrapper	Decompose metric into additive components and test each	Deng et al. KDD 2024
`MixedEffectsExperiment`	Original	Random-intercept model for repeated-measures experiments	—
`OECBuilder`	Hybrid	Combine multiple metrics into an Overall Evaluation Criterion	Deng et al. WSDM 2013
`OptimalProxyMetrics`	Original	Learn optimal weighted proxy for a north star metric	Jeunen 2024
`RiskAwareDecision`	Hybrid	Multi-metric constrained decision framework	Spotify 2024
`SurvivalExperiment`	Hybrid	Kaplan-Meier survival estimation and log-rank test	Kaplan & Meier 1958, Cox 1972

Variance Reduction (14 classes)

Class	Type	Description	Reference
`CUPED`	Original	Pre-experiment covariate adjustment	Deng et al. WSDM 2013
`CUPAC`	Hybrid	ML-predicted covariate adjustment (cross-validated)	Tang et al. 2020 (DoorDash)
`OutlierHandler`	Hybrid	Winsorize, trim, IQR, DBSCAN clustering	Tukey 1977, Ester et al. 1996
`MultivariateCUPED`	Original	Multi-covariate CUPED extension	Deng & Shi 2016
`RegressionAdjustment`	Original	Lin's OLS with HC2 robust SEs	Lin 2013
`AdaptiveWinsorizer`	Original	Grid-search optimal capping thresholds	Gupta et al. 2019 (Microsoft ExP)
`DoubleML`	Hybrid	Double/debiased ML for treatment effects	Chernozhukov et al. 2018
`ClusterBootstrap`	Original	Cluster-level bootstrap for ratio metrics with within-cluster correlation	Cameron et al. 2008
`InExperimentVR`	Original	Variance reduction using in-experiment control-group covariates	Deng et al. KDD 2023
`NonstationaryAdjustment`	Hybrid	Bias-corrected ATE via time-series decomposition	Chen et al. 2024
`PostStratification`	Original	Post-experiment stratification with inverse-variance weighting	Miratrix et al. 2013
`PredictionPoweredInference`	Original	Augment small labeled data with ML predictions for valid inference	Angelopoulos et al. 2023
`RobustMeanEstimator`	Original	Huber M-estimator, Median of Means, Catoni estimator	Huber 1964
`TrimmedMeanEstimator`	Original	Robust ATE via symmetric tail trimming	—

Sequential Testing (7 classes)

Class	Type	Description	Reference
`mSPRT`	Original	Always-valid p-values via mixture likelihood ratio	Johari et al. 2015/2022
`GroupSequential`	Original	Alpha-spending boundaries (OBF, Pocock, Kim-DeMets)	O'Brien & Fleming 1979, Lan & DeMets 1983
`EValue`	Original	E-value sequential testing	Grunwald et al. 2020
`ConfidenceSequence`	Original	Time-uniform confidence sequences	Howard et al. 2021
`EProcess`	Original	Safe testing with e-processes (GRAPA, universal)	Grunwald et al. 2020, Ramdas et al. 2023
`SampleSizeReestimation`	Original	Mid-experiment sample size adjustment via conditional power	Mehta & Pocock 2011
`YEASTSequentialTest`	Original	Tuning-free sequential test via Levy's inequality	Kurennoy (Meta) 2024

Bandits (5 classes)

Class	Type	Description	Reference
`ThompsonSampler`	Original	Multi-armed bandit (Bernoulli, Gaussian, Poisson)	Russo et al. 2018
`LinTS`	Original	Linear Thompson Sampling contextual bandit	Agrawal & Goyal 2013
`LinUCB`	Original	Upper confidence bound contextual bandit	Li et al. 2010
`BayesianStopping`	Original	Stopping rule evaluator for bandits	—
`OfflineEvaluator`	Hybrid	Offline policy evaluation via IPS and doubly robust estimator	Li et al. 2011

Causal Inference (19 classes)

Class	Type	Description	Reference
`DifferenceInDifferences`	Hybrid	Classic two-period DiD with parallel trends check	Card & Krueger 1994
`SyntheticControl`	Hybrid	Weighted donor combination via constrained optimization	Abadie et al. 2010
`ClusterExperiment`	Hybrid	Cluster-robust inference with ICC	Cameron & Miller 2015
`SwitchbackExperiment`	Hybrid	Time-based switchback design analysis	Bojinov & Shephard 2019
`SurrogateEstimator`	Wrapper	Short-term to long-term effect prediction	Athey et al. 2019
`SurrogateIndex`	Hybrid	Multi-surrogate index with cross-fitting	Athey et al. 2019
`InterferenceExperiment`	Original	Network interference with Horvitz-Thompson estimator	Basse & Feller 2018
`BipartiteExperiment`	Original	Cross-side exposure mapping for two-sided experiments	Harshaw et al. 2023
`ContinuousTreatmentEffect`	Original	Dose-response curve via kernel-weighted local linear regression	Hirano & Imbens 2004
`DoublyRobustEstimator`	Hybrid	Augmented IPW (AIPW) with cross-fitting	Robins 1994, Bang & Robins 2005
`DynamicCausalEffect`	Hybrid	Time-varying treatment effects with doubly robust estimation	Shi, Deng et al. JASA 2022
`EffectTransport`	Original	Transport treatment effects across populations via inverse odds weighting	Rosenman et al. 2025
`GeoExperiment`	Hybrid	Bayesian synthetic control for geo-level marketing experiments	—
`InstrumentalVariables`	Original	Two-Stage Least Squares (2SLS) for LATE estimation	—
`MarketplaceExperiment`	Hybrid	Buyer/seller randomization bias analysis for two-sided marketplaces	Bajari et al. 2023
`MediationAnalysis`	Original	Baron-Kenny mediation with Sobel test for indirect effects (ACME)	Imai et al. 2010
`PropensityScoreMatching`	Original	Logistic propensity scores with nearest-neighbor matching	Rosenbaum & Rubin 1983
`RegressionDiscontinuity`	Original	Sharp RDD with local linear regression and IK bandwidth	Imbens & Kalyanaraman 2012
`TMLE`	Hybrid	Targeted Maximum Likelihood Estimation for ATE	van der Laan & Rubin 2006

Diagnostics (10 classes)

Class	Type	Description	Reference
`NoveltyCurve`	Original	Rolling-window novelty/primacy effect detection	—
`AATest`	Wrapper	Validate randomization via simulation	Kohavi et al. 2020
`EffectTimeSeries`	Hybrid	Cumulative treatment effect over time	Kohavi et al. 2020
`MetricSensitivity`	Hybrid	Pre-experiment power estimation from historical data	—
`VarianceEstimator`	Original	Distributional analysis with A/B-specific recommendations	—
`NonStationaryDetector`	Original	CUSUM change-point detection on effect series	Page 1954
`CarryoverDetector`	Wrapper	Detect treatment effect leakage into control via pre/post comparison	—
`FlickerDetector`	Original	Detect users who switched variants mid-experiment	—
`PHackingDetector`	Hybrid	p-curve analysis for selective reporting detection	Simonsohn et al. 2014
`RandomizationValidator`	Hybrid	Covariate balance check via SMD and chi-squared omnibus test	Austin 2009

Design (6 classes)

Class	Type	Description	Reference
`PairwiseDesign`	Original	Mahalanobis distance matched-pair assignment	Greevy et al. 2004
`AdaptiveEnrichment`	Original	Mid-experiment subgroup selection based on treatment response	Simon & Simon 2013
`BayesianExperimentOptimizer`	Hybrid	Surrogate-model-based experiment parameter optimization	Meta 2025
`BudgetSplitDesign`	Hybrid	Budget-split design to eliminate cannibalization bias	Liu et al. KDD 2021
`FractionalFactorialDesign`	Hybrid	Resolution III+ fractional factorial designs for factor screening	Box & Hunter 1961
`ResponseAdaptiveRandomization`	Original	Dynamic allocation probabilities favouring better-performing arms	—

Governance (3 classes)

Class	Type	Description	Reference
`ExperimentRegistry`	Original	In-memory experiment tracking with date filtering	—
`ConflictDetector`	Original	Overlapping experiment detection (traffic, metric, segment)	Kohavi et al. 2020
`GuardrailMonitor`	Wrapper	Safety metric monitoring with Bonferroni-corrected stopping rules	—

Type legend: Original = algorithm from paper equations. Wrapper = delegates to scipy/sklearn. Hybrid = original algorithm + scipy/sklearn numerical primitives.

For full citations, see References.

All result types are frozen dataclasses with .to_dict() and pretty __repr__.

Why splita?

	splita	GrowthBook	Eppo	statsmodels
Type	Python library	SaaS + SDK	SaaS + SDK	Python library
Pricing	Free (MIT)	Free tier / paid	Paid	Free (BSD)
Self-hosted	Yes (it's a library)	Yes (complex)	No	Yes (it's a library)
Bayesian A/B	Built-in	Built-in	Built-in	No
Sequential testing	mSPRT, YEAST, e-values, group sequential	CUPED + sequential	Sequential	No
Variance reduction	CUPED, CUPAC, Double ML, 14 methods	CUPED	CUPED	Manual
Causal inference	DiD, Synthetic Control, TMLE, 19 classes	No	No	DiD (basic)
Bandits	Thompson, LinUCB, LinTS	Multi-armed bandits	Bandits	No
Vendor lock-in	None	Moderate	High	None
Data stays local	Always	Self-hosted option	No	Always
Jupyter-native	Yes (HTML repr, widgets)	No	No	Partial
Result format	Frozen dataclasses, `.to_dict()`	JSON API	JSON API	Mixed
Dependencies	numpy + scipy only	Node.js / Docker	SaaS	numpy + scipy + pandas
Lines for a z-test	3	~20 (SDK + config)	~20 (SDK + config)	8
`explain()` in 4 languages	Yes	No	No	No
LaTeX export	Yes	No	No	Yes

When to use splita: You want a Python-native experimentation toolkit with correct statistical defaults, no vendor lock-in, and your data stays in your infrastructure. You need more than basic t-tests: causal inference, sequential testing, bandits, variance reduction.

When to use GrowthBook/Eppo: You need a full-stack feature flagging + experimentation platform with a web UI, team collaboration, and you're okay with SaaS or self-hosted infrastructure.

When to use statsmodels: You need general-purpose statistics (regression, time series, GLMs) beyond A/B testing. For pure experimentation, splita provides more features with fewer lines of code.

See Benchmarks for detailed performance comparisons.

Requirements

Python 3.10+
numpy >= 1.24
scipy >= 1.10
scikit-learn >= 1.2 (optional, only for CUPAC -- install with pip install splita[ml])

Contributing

Contributions are welcome. Please open an issue first to discuss what you would like to change.

pip install -e ".[dev]"
pytest --cov=splita
ruff check src/ tests/
mypy src/splita/

Roadmap

Experimentation Accelerator (planned): AI-powered experiment prioritization using content embeddings and historical A/B results to suggest which variants to test next and explain why winners win. Based on recent research in content-aware ranking for experimentation (arxiv:2602.13852, 2026). Requires LLM/embedding infrastructure — not yet in scope for a pure statistics library.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
.github		.github
docs		docs
examples		examples
src/splita		src/splita
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

splita

Quick Start

Installation

What Can You Do?

Run an A/B test

Plan your experiment

Check data quality

Test multiple metrics

Reduce variance

Monitor in real-time

Optimize with bandits

Full Example: E-commerce A/B Test

Design Philosophy

API Reference

Core Analysis (24 classes)

Variance Reduction (14 classes)

Sequential Testing (7 classes)

Bandits (5 classes)

Causal Inference (19 classes)

Diagnostics (10 classes)

Design (6 classes)

Governance (3 classes)

Why splita?

Requirements

Contributing

Roadmap

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

splita

Quick Start

Installation

What Can You Do?

Run an A/B test

Plan your experiment

Check data quality

Test multiple metrics

Reduce variance

Monitor in real-time

Optimize with bandits

Full Example: E-commerce A/B Test

Design Philosophy

API Reference

Core Analysis (24 classes)

Variance Reduction (14 classes)

Sequential Testing (7 classes)

Bandits (5 classes)

Causal Inference (19 classes)

Diagnostics (10 classes)

Design (6 classes)

Governance (3 classes)

Why splita?

Requirements

Contributing

Roadmap

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages