JohnKossa · Apr 27, 2026 · Apr 23, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -22,4 +22,5 @@ jobs:
         ruff check src/ tests/
         python -m build
         pip install twine
-        twine check dist/*
+        twine check dist/*
+        sphinx-build -b html docs docs/_build/html
diff --git a/.gitignore b/.gitignore
@@ -132,3 +132,4 @@ dmypy.json
 
 # uv
 uv.lock
+/.pypirc
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,10 +1,17 @@
 # Changelog
 
+## [0.2.0] - 2026-04-27
+### Added
+- `LayeredCompBaggingModel`: A bagging ensemble version of the primary algorithm that reduces variance and automatically optimizes the `weight_falloff` for each tree in the ensemble.
+- `src/layeredcompmodel/bagging_model.py`: New module for the bagging model.
+- Optimization of `weight_falloff`: Uses SciPy's bounded scalar minimizer (`minimize_scalar` with `method='bounded'`) to find the optimal `weight_falloff` in the range 0 to 15 for each tree, based on an internal validation set.
+- Reproducibility support: Added `random_state` to `LayeredCompBaggingModel` for consistent ensemble results.
+
 ## [0.1.0] - 2026-04-22
 ### Added
 - Initial release: Hierarchical tree-based regressor using path-weighted Wilson means (95% trimmed) for robust predictions (e.g., parcel sale prices).
 - NaN handling: Categorical NaNs as distinct "NaN" category (`fillna("NaN").unique()`); numeric NaNs excluded from splits via `notna()` masks (per SPEC.md); target `y` must be finite (raises `ValueError`).
 - Scikit-learn compliance: `BaseEstimator`/`RegressorMixin`; works with `Pipeline`, `GridSearchCV`, `cross_val_score`, pickling; partial `check_estimator` pass (intentional NaN trade-off).
-- Development: Full type hints (`py.typed`, mypy-ready), 16+ unittest/pytest tests (splits/NaN/explain/pickle/sklearn), `examples/quickstart.py` (MAE ~127k), `src/` layout, Hatchling build, dev deps (ruff/black/mypy).
+- Development: Full type hints (`py.typed`, mypy-ready), 20+ unittest/pytest tests (splits/NaN/explain/pickle/sklearn/bagging), `examples/quickstart.py` (MAE ~127k), `src/` layout, Hatchling build, dev deps (ruff/black/mypy).
 
 Future releases will include Sphinx docs, benchmarks (vs XGBoost/LinearR), CI/CD.
diff --git a/MODEL_SPEC.md b/MODEL_SPEC.md
@@ -1,16 +1,16 @@
 # Layered Comp Model
 
-# Overview:
+## Overview:
 
 The idea is to build "hierarchical" predictions that start with a general predicted price, then refine the prediction to be more specific by adding information and narrowing the comparison group.
 
 You take the "Wilson mean" of all your parcels, then you find a filter that splits them into the best submarkets you can find and produce a "child node" for each variant of that filter (using a one-vs-rest approach for categorical data). Then you repeat until you've filtered down to a single data point. The value is a weighted average that prioritizes comparing well against closer matches and comparing slightly less well against further matches.
 
 To get the predictions back out, you find the most specific bucket your subject matches, then you trace its path up the tree, taking and weighting the Wilson means as you go.
 
-# Method:
+## Method:
 
-## Training
+### Training
 
 1. Build a tree.
 2. Plot sale prices.
@@ -19,7 +19,7 @@ To get the predictions back out, you find the most specific bucket your subject
 5. Make child nodes (one-vs-rest for categorical, or binary split for numeric using binary search for the breakpoint). We choose the split that results in the lowest ratio of weighted child MAE to parent MAE.
 6. Repeat from step 3 until we've filtered down to a single parcel (leaf node) or cannot split further (minimum node size = 2).
 
-## Predicting
+### Predicting
 
 1. Find the node furthest down in the hierarchy that matches your parcel.
 2. Note its Wilson mean and the Wilson means of all nodes above it in the hierarchy.
@@ -31,11 +31,12 @@ To get the predictions back out, you find the most specific bucket your subject
 4. Take the weighted average of the Wilson means.
 5. There's your prediction.
 
-# Hyperparameters
+### Hyperparameters
 
 weight_falloff: 0 to 1. Will be used in w(x)=(1−x)^weight_falloff where x is normalized from 0 to 1
 
-# Nuances:
+
+## Nuances:
 
 The Wilson Mean keeps the prediction from going too crazy on the large sets, and it also penalizes the small sets so when the test set gets specific, the value won't swing wildly.
 
@@ -46,3 +47,36 @@ Every parcel should compare well because this model is fundamentally doing a hie
 If a predicted parcel has a feature that wasn't in the training set, that particular level of nuance will be missed, but the parcel will still slot into a node slightly higher up the tree, so the model should still perform reasonably well even for things we don't have representative sales for.
 
 The function that the weighted medians follows will determine a lot of how this model handles accuracy vs equity. A fast falloff will give good accuracy but may miss broader market trends. A slow falloff will promote "normativity" in predictions, but may miss market nuance and not assign correct values to particularly rare but valuable features.
+
+
+# Layered Comp Bagging Model
+
+## Overview:
+
+A bagging ensemble version of the primary algorithm that reduces variance and automatically optimizes the `weight_falloff` for each tree in the ensemble.
+
+## Method:
+
+### Training
+
+For each tree from 1 to `tree_count`:
+
+1. **Subsampling**: Randomly sample a subset of the training data equal to `sample_pct` of the total records.
+2. **Internal Split**: Divide the sample into a **training portion** and a **test portion**. The `sample_pct` also serves as the split ratio (e.g., if `sample_pct` is 0.8, 80% of the sample is for training, 20% for testing). If the test portion calculation results in 0, a minimum of 1 row is used.
+3. **Tree Construction**: Build a standard `LayeredCompModel` tree using only the training portion.
+4. **Weight Falloff Optimization**: Find the optimal `weight_falloff` (between 0 and 15) that minimizes the error (MAE or MSE) on the test portion. Since the error function typically has a single local minimum, use a bounded scalar minimizer such as **Brent's method** or a golden-section search.
+5. **Storage**: Save the tree structure and its specific optimized `weight_falloff`.
+
+### Predicting
+
+1. Generate predictions for the input from all `tree_count` individual trees.
+2. Each tree uses its own optimized `weight_falloff` discovered during training.
+3. The final prediction is the **arithmetic mean** of all individual tree predictions.
+
+## Hyperparameters
+
+*   **tree_count**: Integer (min 1, default 10). Number of trees to build.
+*   **sample_pct**: Float (0 < x < 1, default 0.8). Fraction of data sampled for each tree and used as the internal split ratio.
+*   **random_state**: Integer or RandomState instance for reproducibility.
+*   **split_metric**: {'mae', 'mse'}. Metric used for both tree splitting and `weight_falloff` optimization.
+
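The path weighting `w(x)=(1−x)^weight_falloff` and the per-tree falloff search described in MODEL_SPEC.md can be sketched without any dependencies. This is an illustration under stated assumptions, not the package's implementation: `path_weighted_mean` and `golden_section_minimize` are hypothetical names, the normalization of `x` (distance from the deepest node divided by path length) is assumed because the diff does not show it, and a golden-section search stands in for SciPy's bounded minimizer.

```python
import math
from typing import Callable, Sequence


def path_weighted_mean(wilson_means: Sequence[float], weight_falloff: float) -> float:
    """Blend Wilson means ordered from the deepest matching node up to the root.

    ASSUMPTION: x is the distance from the deepest node divided by the path
    length, so the deepest node has x = 0 and the root stays below x = 1.
    """
    n = len(wilson_means)
    weights = [(1.0 - i / n) ** weight_falloff for i in range(n)]
    return sum(w * m for w, m in zip(weights, wilson_means)) / sum(weights)


def golden_section_minimize(
    f: Callable[[float], float], lo: float, hi: float, tol: float = 1e-6
) -> float:
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


# Toy held-out rows: (Wilson means from deepest node to root, observed price).
validation = [([100.0, 200.0], 120.0), ([300.0, 200.0], 280.0)]


def validation_mae(weight_falloff: float) -> float:
    errors = [abs(path_weighted_mean(m, weight_falloff) - price) for m, price in validation]
    return sum(errors) / len(errors)


best_falloff = golden_section_minimize(validation_mae, 0.0, 15.0)
```

With a falloff of 0 every node on the path counts equally; larger values concentrate the weight on the deepest node, which is why the search is bounded rather than open-ended.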
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,48 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+import os
+import sys
+
+sys.path.insert(0, os.path.abspath('../src'))
+
+project = 'LayeredCompModel'
+copyright = '2026, John Kossa'
+author = 'John Kossa'
+release = '0.2.0'
+
+# -- General configuration ---------------------------------------------------
+extensions = [
+    'autoapi.extension',
+    'myst_parser',
+    'sphinx.ext.napoleon',
+    'sphinx.ext.viewcode',
+    'sphinx.ext.autodoc',
+]
+
+templates_path = ['_templates']
+exclude_patterns = []
+
+autoapi_type = 'python'
+autoapi_dirs = ['../src']
+autoapi_options = ['members', 'show-inheritance', 'special-members', 'undoc-members']
+
+myst_enable_extensions = [
+    "dollarmath",
+    "amsmath",
+    "deflist",
+    "html_admonition",
+    "html_image",
+    "colon_fence",
+    "smartquotes",
+    "replacements",
+    "strikethrough",
+    "substitution",
+]
+
+# -- Options for HTML output -------------------------------------------------
+html_theme = 'sphinx_rtd_theme'
+html_static_path = ['_static']
diff --git a/docs/index.rst b/docs/index.rst
@@ -0,0 +1,28 @@
+LayeredCompModel
+================
+
+Hierarchical tree-based regressor for robust predictions (e.g., parcel sale prices) using path-weighted Wilson means.
+
+**Scikit-learn compatible.**
+
+Installation
+------------
+
+.. code-block:: bash
+
+   pip install layeredcompmodel
+
+Quickstart
+----------
+
+See ``examples/quickstart.py`` in the repository.
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   autoapi/index
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/pyproject.toml b/pyproject.toml
@@ -23,6 +23,7 @@ classifiers = [
 ]
 requires-python = ">=3.10"
 dependencies = [
+    "ipykernel>=7.2.0",
     "numpy>=1.24.0",
     "pandas>=2.0.0",
     "scikit-learn>=1.3.0",
@@ -43,7 +44,11 @@ dev = [
     "black",
     "mypy",
     "ruff",
-    "build"
+    "build",
+    "sphinx>=7.0",
+    "sphinx-rtd-theme",
+    "sphinx-autoapi",
+    "myst-parser",
 ]
 
 [tool.pytest.ini_options]
@@ -63,10 +68,11 @@ cov = "pytest --cov=layeredcompmodel --cov-report=term-missing --cov-fail-under=
 mypy-check = "mypy src/layeredcompmodel"
 ruff-check = "ruff check src/ tests/"
 build = "hatch build"
+docs = "sphinx-build -b html docs docs/_build/html"
 
 [tool.mypy]
 ignore_missing_imports = true
 disallow_untyped_defs = false
 warn_return_any = false
 warn_unreachable = false
-check_untyped_defs = false
+check_untyped_defs = false
diff --git a/src/layeredcompmodel/__init__.py b/src/layeredcompmodel/__init__.py
@@ -1,4 +1,5 @@
 from .model import LayeredCompModel, calculate_wilson_mean
+from .bagging_model import LayeredCompBaggingModel
 
-__all__ = ["LayeredCompModel", "calculate_wilson_mean"]
-__version__ = "0.1.0"
+__all__ = ["LayeredCompModel", "LayeredCompBaggingModel", "calculate_wilson_mean"]
+__version__ = "0.2.0"
diff --git a/src/layeredcompmodel/bagging_model.py b/src/layeredcompmodel/bagging_model.py
@@ -0,0 +1,142 @@
+import numpy as np
+import pandas as pd
+from sklearn.base import BaseEstimator, RegressorMixin
+from sklearn.utils.validation import check_is_fitted, check_random_state
+from sklearn.metrics import mean_absolute_error, mean_squared_error
+from sklearn.model_selection import train_test_split
+from scipy.optimize import minimize_scalar
+from typing import Any, List, Optional, Union
+
+from layeredcompmodel.model import LayeredCompModel
+
+
+class LayeredCompBaggingModel(BaseEstimator, RegressorMixin):
+    """
+    Layered Comp Bagging Model.
+
+    A variance-reducing bagging ensemble that optimizes weight_falloff separately for each tree.
+
+    Parameters
+    ----------
+    tree_count : int, default=10
+        Number of trees to build. Must be >= 1.
+    sample_pct : float, default=0.8
+        Sampling fraction and internal split ratio; must be strictly between 0 and 1.
+    random_state : int, RandomState instance or None, default=None
+        Determines random number generation for subsampling.
+    split_metric : {'mae', 'mse'}, default='mae'
+        Metric used for both tree splitting and weight_falloff optimization.
+    n_jobs : int, default=1
+        Passed through to each underlying LayeredCompModel.
+    """
+
+    def __init__(
+            self,
+            tree_count: int = 10,
+            sample_pct: float = 0.8,
+            random_state: Optional[Union[int, np.random.RandomState]] = None,
+            split_metric: str = 'mae',
+            n_jobs: int = 1
+    ) -> None:
+        self.tree_count = tree_count
+        self.sample_pct = sample_pct
+        self.random_state = random_state
+        self.split_metric = split_metric
+        self.n_jobs = n_jobs
+
+    def fit(self, X: Any, y: Any) -> "LayeredCompBaggingModel":
+        """
+        Build a bagging ensemble of LayeredCompModel trees.
+
+        Parameters
+        ----------
+        X : array-like of shape (n_samples, n_features)
+            The training input samples.
+        y : array-like of shape (n_samples,)
+            The target values.
+
+        Returns
+        -------
+        self : object
+            Fitted estimator.
+        """
+        # Validate hyperparameters
+        if self.tree_count < 1:
+            raise ValueError(f"tree_count must be >= 1, got {self.tree_count}")
+        if not (0 < self.sample_pct < 1):
+            raise ValueError(f"sample_pct must be between 0 and 1 (exclusive), got {self.sample_pct}")
+        if self.split_metric not in ('mae', 'mse'):
+            raise ValueError(f"split_metric must be 'mae' or 'mse', got {self.split_metric}")
+
+        if X.shape[1] == 0:
+            raise ValueError(f"0 feature(s) (shape={X.shape}) while a minimum of 1 is required.")
+        if len(X) == 0:
+            raise ValueError(f"Found array with 0 sample(s) (shape={X.shape}) while a minimum of 1 is required.")
+        if len(y) == 0:
+            raise ValueError(f"Found array with 0 sample(s) (shape={y.shape}) while a minimum of 1 is required.")
+
+        # Convert y to a common format or handle both types
+        y_array = y.values if hasattr(y, 'values') else y
+
+        if pd.isna(y_array).any():
+            raise ValueError("Input y contains NaN.")
+        if pd.api.types.is_numeric_dtype(y_array) and np.isinf(y_array).any():
+            raise ValueError("Input y contains infinity.")
+
+        self.n_features_in_ = X.shape[1]
+        self.feature_names_in_ = getattr(X, "columns", np.array([str(i) for i in range(X.shape[1])])).tolist()
+
+        self.estimators_: List[LayeredCompModel] = []
+
+        metric_fn = mean_absolute_error if self.split_metric == 'mae' else mean_squared_error
+
+        rng = check_random_state(self.random_state)  # accepts None, an int seed, or a RandomState
+        for i in range(self.tree_count):
+            X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=(1 - self.sample_pct), random_state=rng)
+
+            tree = LayeredCompModel(split_metric=self.split_metric, n_jobs=self.n_jobs)
+            tree.fit(X_tr, y_tr)
+
+            def objective(w: float) -> float:
+                tree.weight_falloff = w
+                preds = tree.predict(X_ts)
+                return float(metric_fn(y_ts, preds))
+
+            if len(y_ts) > 0:
+                res = minimize_scalar(objective, bounds=(0.0, 15.0), method='bounded')
+                opt_w = res.x
+                best = res.fun
+            else:
+                # Fallback if no test data: use a fixed falloff and mark the score as unavailable
+                opt_w = 3.0
+                best = float("nan")
+
+            tree.weight_falloff = opt_w
+            self.estimators_.append(tree)
+            print(f"Trained tree {i + 1} of {self.tree_count} with weight {tree.weight_falloff} @ {best}")
+
+        return self
+
+    def predict(self, X: Any) -> np.ndarray:
+        """
+        Predict regression target for X.
+
+        The final prediction is the arithmetic mean of all individual tree predictions.
+
+        Parameters
+        ----------
+        X : array-like of shape (n_samples, n_features)
+            The input samples.
+
+        Returns
+        -------
+        y : ndarray of shape (n_samples,)
+            The predicted values.
+        """
+        check_is_fitted(self)
+
+        all_preds = []
+        for tree in self.estimators_:
+            all_preds.append(tree.predict(X))
+
+        return np.mean(all_preds, axis=0)
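Per-tree randomness in `fit` has to behave sensibly whether `random_state` is an integer or `None`, since `None` is the default. A minimal, dependency-free sketch of one way to derive per-tree integer seeds follows; `per_tree_seeds` is a hypothetical helper, not part of the package, and an estimator could equally share a single scikit-learn `RandomState` across trees.

```python
import random
from typing import List, Optional


def per_tree_seeds(random_state: Optional[int], tree_count: int) -> List[int]:
    """Derive one integer seed per tree from a single optional master seed.

    An integer master seed gives the same list on every call; None seeds the
    generator from system entropy, so each call gives a fresh list.
    """
    rng = random.Random(random_state)
    # Stay within [0, 2**32 - 1], the range scikit-learn accepts for integer seeds.
    return [rng.randrange(2**32) for _ in range(tree_count)]


seeds = per_tree_seeds(42, 3)
```

Each seed can then be passed as the integer `random_state` of that tree's split, which keeps the ensemble reproducible for a fixed master seed without doing arithmetic on a value that may be `None`.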