Merged
3 changes: 2 additions & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -22,4 +22,5 @@ jobs:
ruff check src/ tests/
python -m build
pip install twine
-twine check dist/*
+twine check dist/*
sphinx-build -b html docs docs/_build/html
1 change: 1 addition & 0 deletions .gitignore
@@ -132,3 +132,4 @@ dmypy.json

# uv
uv.lock
/.pypirc
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,17 @@
# Changelog

## [0.2.0] - 2026-04-27
### Added
- `LayeredCompBaggingModel`: A bagging ensemble version of the primary algorithm that reduces variance and automatically optimizes the `weight_falloff` for each tree in the ensemble.
- `src/layeredcompmodel/bagging_model.py`: New module for the bagging model.
- Optimization of `weight_falloff`: Uses SciPy's bounded scalar minimizer (`minimize_scalar` with `method='bounded'`) to find the optimal `weight_falloff` (0-15) for each tree based on an internal validation set.
- Reproducibility support: Added `random_state` to `LayeredCompBaggingModel` for consistent ensemble results.

## [0.1.0] - 2026-04-22
### Added
- Initial release: Hierarchical tree-based regressor using path-weighted Wilson means (95% trimmed) for robust predictions (e.g., parcel sale prices).
- NaN handling: Categorical NaNs as distinct "NaN" category (`fillna("NaN").unique()`); numeric NaNs excluded from splits via `notna()` masks (per SPEC.md); target `y` must be finite (raises `ValueError`).
- Scikit-learn compliance: `BaseEstimator`/`RegressorMixin`; works with `Pipeline`, `GridSearchCV`, `cross_val_score`, pickling; partial `check_estimator` pass (intentional NaN trade-off).
-- Development: Full type hints (`py.typed`, mypy-ready), 16+ unittest/pytest tests (splits/NaN/explain/pickle/sklearn), `examples/quickstart.py` (MAE ~127k), `src/` layout, Hatchling build, dev deps (ruff/black/mypy).
+- Development: Full type hints (`py.typed`, mypy-ready), 20+ unittest/pytest tests (splits/NaN/explain/pickle/sklearn/bagging), `examples/quickstart.py` (MAE ~127k), `src/` layout, Hatchling build, dev deps (ruff/black/mypy).

Future releases will include Sphinx docs, benchmarks (vs XGBoost/LinearR), CI/CD.
46 changes: 40 additions & 6 deletions MODEL_SPEC.md
@@ -1,16 +1,16 @@
# Layered Comp Model

-# Overview:
+## Overview:

The idea is to build "hierarchical" predictions that start with a general predicted price, then refine the prediction to be more specific by adding information and narrowing the comparison group.

You take the "Wilson mean" of all your parcels, then you find a filter that splits them into the best submarkets you can find and produce a "child node" for each variant of that filter (using a one-vs-rest approach for categorical data). Then you repeat until you've filtered down to a single data point. The value is a weighted average that prioritizes comparing well against closer matches and comparing slightly less well against further matches.

To get the predictions back out, you find the most specific bucket your subject matches, then you trace its path up the tree, taking and weighting the Wilson means as you go.

-# Method:
+## Method:

-## Training
+### Training

1. Build a tree.
2. Plot sale prices.
@@ -19,7 +19,7 @@ To get the predictions back out, you find the most specific bucket your subject
5. Make child nodes (one-vs-rest for categorical, or binary split for numeric using binary search for the breakpoint). We choose the split that results in the lowest ratio of weighted child MAE to parent MAE.
6. Repeat from step 3 until we've filtered down to a single parcel (leaf node) or cannot split further (minimum node size = 2).
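
The split score in step 5 can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes MAE is measured around each node's plain arithmetic mean (the spec uses the Wilson mean, which is not defined in this section), that child MAEs are weighted by row count, and that the minimum node size applies to each child. The names `mae_about_mean`, `split_ratio`, and `best_numeric_split` are hypothetical, and an exhaustive scan stands in for the binary search the spec mentions.

```python
from typing import List, Optional, Tuple


def mae_about_mean(values: List[float]) -> float:
    """Mean absolute error of values around their own arithmetic mean."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)


def split_ratio(left: List[float], right: List[float]) -> float:
    """Row-weighted child MAE divided by parent MAE (lower is better)."""
    parent = left + right
    parent_mae = mae_about_mean(parent)
    if parent_mae == 0.0:
        return 1.0  # identical prices: splitting cannot improve anything
    child_mae = (
        len(left) * mae_about_mean(left) + len(right) * mae_about_mean(right)
    ) / len(parent)
    return child_mae / parent_mae


def best_numeric_split(
    feature: List[float], prices: List[float], min_node_size: int = 2
) -> Optional[Tuple[float, float]]:
    """Scan every candidate breakpoint; return (breakpoint, ratio) or None."""
    best: Optional[Tuple[float, float]] = None
    for threshold in sorted(set(feature)):
        left = [p for f, p in zip(feature, prices) if f <= threshold]
        right = [p for f, p in zip(feature, prices) if f > threshold]
        if len(left) < min_node_size or len(right) < min_node_size:
            continue  # assumed: each child must keep at least min_node_size rows
        ratio = split_ratio(left, right)
        if best is None or ratio < best[1]:
            best = (threshold, ratio)
    return best


# Toy data: two obvious submarkets separated by square footage.
sqft = [800.0, 900.0, 1000.0, 2000.0, 2100.0, 2200.0]
price = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]
result = best_numeric_split(sqft, price)
```

On this toy data the scan picks the breakpoint between the two clusters, because that split leaves the smallest child error relative to the parent.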

-## Predicting
+### Predicting

1. Find the node furthest down in the hierarchy that matches your parcel.
2. Note its Wilson mean and the Wilson means of all nodes above it in the hierarchy.
@@ -31,11 +31,12 @@ To get the predictions back out, you find the most specific bucket your subject
4. Take the weighted average of the Wilson means.
5. There's your prediction.

-# Hyperparameters
+### Hyperparameters

weight_falloff: 0 to 1. Will be used in w(x)=(1−x)^weight_falloff where x is normalized from 0 to 1
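
A small sketch of how `w(x) = (1 - x)^weight_falloff` could turn a path of node means into one prediction. How `x` is normalized is an assumption here: the node `i` steps above the deepest match gets `x = i / path_length`, so the deepest node has `x = 0` and the root stays below 1. `path_weights` and `weighted_prediction` are hypothetical names, and the node means are made-up numbers.

```python
from typing import List


def path_weights(path_length: int, weight_falloff: float) -> List[float]:
    """Weights w(x) = (1 - x) ** weight_falloff, deepest node first.

    Assumption: x = i / path_length for the node i steps above the deepest
    match, so x = 0 at the deepest node and x < 1 at the root.
    """
    return [
        (1.0 - i / path_length) ** weight_falloff for i in range(path_length)
    ]


def weighted_prediction(node_means: List[float], weight_falloff: float) -> float:
    """Weighted average of node means, ordered deepest node first."""
    weights = path_weights(len(node_means), weight_falloff)
    return sum(w * m for w, m in zip(weights, node_means)) / sum(weights)


# Hypothetical Wilson means along one path: deepest node, its parent, the root.
means = [250.0, 220.0, 200.0]
flat = weighted_prediction(means, weight_falloff=0.0)   # every node weighted equally
steep = weighted_prediction(means, weight_falloff=1.0)  # deepest node weighted most
```

With `weight_falloff=0.0` every node on the path counts equally; raising it pulls the prediction toward the most specific node.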

-# Nuances:

+## Nuances:

The Wilson Mean keeps the prediction from going too crazy on the large sets, and it also penalizes the small sets so when the test set gets specific, the value won't swing wildly.

@@ -46,3 +47,36 @@ Every parcel should compare well because this model is fundamentally doing a hie
If a predicted parcel has a feature that wasn't in the training set, that particular level of nuance will be missed, but the parcel will still slot into a node slightly higher up the tree, so the model should still perform reasonably well even for things we don't have representative sales for.

The falloff function used for the weights will largely determine how this model trades off accuracy against equity. A fast falloff will give good accuracy but may miss broader market trends. A slow falloff will promote "normativity" in predictions, but may miss market nuance and fail to assign correct values to particularly rare but valuable features.


# Layered Comp Bagging Model

## Overview:

A bagging ensemble version of the primary algorithm that reduces variance and automatically optimizes the `weight_falloff` for each tree in the ensemble.

## Method:

### Training

For each tree from 1 to `tree_count`:

1. **Subsampling**: Randomly sample a subset of the training data equal to `sample_pct` of the total records.
2. **Internal Split**: Divide the sample into a **training portion** and a **test portion**. The `sample_pct` also serves as the split ratio (e.g., if `sample_pct` is 0.8, 80% of the sample is for training, 20% for testing). If the test portion calculation results in 0, a minimum of 1 row is used.
3. **Tree Construction**: Build a standard `LayeredCompModel` tree using only the training portion.
4. **Weight Falloff Optimization**: Find the optimal `weight_falloff` (between 0 and 15) that minimizes the error (MAE or MSE) on the test portion. Since the error function typically has a single local minimum, use a bounded scalar minimizer such as **Brent's method**, or a binary search, for optimization.
5. **Storage**: Save the tree structure and its specific optimized `weight_falloff`.
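
Steps 1, 2, and 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: row counts are rounded to the nearest integer (the spec does not say how fractions are handled), a hand-written golden-section search stands in for SciPy's bounded minimizer, and `validation_error` is a made-up stand-in for the held-out MAE of one tree as a function of `weight_falloff`. All function names are hypothetical.

```python
import math
from typing import Callable, Tuple


def split_sizes(n_rows: int, sample_pct: float) -> Tuple[int, int]:
    """Return (train_rows, test_rows) for one tree, following steps 1 and 2."""
    n_sample = round(n_rows * sample_pct)          # step 1: subsample
    n_train = round(n_sample * sample_pct)         # step 2: internal split
    n_test = max(1, n_sample - n_train)            # keep at least one test row
    return n_sample - n_test, n_test


def golden_section_min(
    objective: Callable[[float], float],
    low: float,
    high: float,
    tol: float = 1e-6,
) -> float:
    """Minimize a unimodal function on [low, high] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = low, high
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return (a + b) / 2.0


def validation_error(weight_falloff: float) -> float:
    """Made-up unimodal error curve with its minimum at 4.2."""
    return (weight_falloff - 4.2) ** 2 + 10.0


sizes = split_sizes(1000, 0.8)
opt_w = golden_section_min(validation_error, 0.0, 15.0)
```

The search only needs the error curve to have a single minimum inside the bounds, which is the property step 4 relies on.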

### Predicting

1. Generate predictions for the input from all `tree_count` individual trees.
2. Each tree uses its own optimized `weight_falloff` discovered during training.
3. The final prediction is the **arithmetic mean** of all individual tree predictions.
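
The averaging step can be sketched in a few lines. The per-tree predictions below are invented numbers, and `average_predictions` is a hypothetical helper rather than part of the package.

```python
from typing import List, Sequence


def average_predictions(per_tree: Sequence[Sequence[float]]) -> List[float]:
    """Arithmetic mean across trees for each input row."""
    n_trees = len(per_tree)
    return [sum(row_preds) / n_trees for row_preds in zip(*per_tree)]


# Hypothetical predictions from three trees for two parcels.
tree_predictions = [
    [100.0, 200.0],
    [110.0, 190.0],
    [120.0, 210.0],
]
ensemble = average_predictions(tree_predictions)
```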

## Hyperparameters

* **tree_count**: Integer (min 1, default 10). Number of trees to build.
* **sample_pct**: Float (0 < x < 1, default 0.8). Fraction of data sampled for each tree and used as the internal split ratio.
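* **random_state**: Integer, RandomState instance, or None (default None), for reproducibility.
* **n_jobs**: Integer (default 1). Passed to each underlying `LayeredCompModel`.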
* **split_metric**: {'mae', 'mse'}. Metric used for both tree splitting and `weight_falloff` optimization.
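
Because `random_state` may be an integer or left unset, one way to give every tree its own reproducible seed is to draw the seeds from a single parent generator. The sketch below uses the standard library's `random.Random` as a stand-in for a NumPy `RandomState`; `per_tree_seeds` is a hypothetical helper, not part of the package.

```python
import random
from typing import List, Optional


def per_tree_seeds(random_state: Optional[int], tree_count: int) -> List[int]:
    """Draw one integer seed per tree from a single parent generator.

    random.Random(None) seeds itself from the OS, so None gives a fresh
    ensemble on each run, while an int gives the same seeds every time.
    """
    parent = random.Random(random_state)
    return [parent.randrange(2**31 - 1) for _ in range(tree_count)]


seeds_a = per_tree_seeds(42, tree_count=3)
seeds_b = per_tree_seeds(42, tree_count=3)
seeds_none = per_tree_seeds(None, tree_count=3)
```

Drawing seeds this way avoids arithmetic on `random_state` itself, which would fail when it is `None`.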

48 changes: 48 additions & 0 deletions docs/conf.py
@@ -0,0 +1,48 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
import os
import sys

sys.path.insert(0, os.path.abspath('../src'))

project = 'LayeredCompModel'
copyright = '2026, John Kossa'
author = 'John Kossa'
release = '0.2.0'

# -- General configuration ---------------------------------------------------
extensions = [
'autoapi.extension',
'myst_parser',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.autodoc',
]

templates_path = ['_templates']
exclude_patterns = []

autoapi_type = 'python'
autoapi_dirs = ['../src']
autoapi_options = ['members', 'show-inheritance', 'special-members', 'undoc-members']

myst_enable_extensions = [
"dollarmath",
"amsmath",
"deflist",
"html_admonition",
"html_image",
"colon_fence",
"smartquotes",
"replacements",
"strikethrough",
"substitution",
]

# -- Options for HTML output -------------------------------------------------
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
28 changes: 28 additions & 0 deletions docs/index.rst
@@ -0,0 +1,28 @@
LayeredCompModel
================

Hierarchical tree-based regressor for robust predictions (e.g., parcel sale prices) using weighted Wilson score intervals.

**Scikit-learn compatible.**

Installation
------------

.. code-block:: bash

   pip install layeredcompmodel

Quickstart
----------

See `examples/quickstart.py` in the repository.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   api/modules

:ref:`genindex`
:ref:`modindex`
:ref:`search`
10 changes: 8 additions & 2 deletions pyproject.toml
@@ -23,6 +23,7 @@ classifiers = [
]
requires-python = ">=3.10"
dependencies = [
"ipykernel>=7.2.0",
"numpy>=1.24.0",
"pandas>=2.0.0",
"scikit-learn>=1.3.0",
@@ -43,7 +44,11 @@ dev = [
"black",
"mypy",
"ruff",
-"build"
+"build",
"sphinx>=7.0",
"sphinx-rtd-theme",
"sphinx-autoapi",
"myst-parser",
]

[tool.pytest.ini_options]
@@ -63,10 +68,11 @@ cov = "pytest --cov=layeredcompmodel --cov-report=term-missing --cov-fail-under=
mypy-check = "mypy src/layeredcompmodel"
ruff-check = "ruff check src/ tests/"
build = "hatch build"
docs = "sphinx-build -b html docs docs/_build/html"

[tool.mypy]
ignore_missing_imports = true
disallow_untyped_defs = false
warn_return_any = false
warn_unreachable = false
-check_untyped_defs = false
+check_untyped_defs = false
5 changes: 3 additions & 2 deletions src/layeredcompmodel/__init__.py
@@ -1,4 +1,5 @@
from .model import LayeredCompModel, calculate_wilson_mean
from .bagging_model import LayeredCompBaggingModel

-__all__ = ["LayeredCompModel", "calculate_wilson_mean"]
-__version__ = "0.1.0"
+__all__ = ["LayeredCompModel", "LayeredCompBaggingModel", "calculate_wilson_mean"]
+__version__ = "0.2.0"
142 changes: 142 additions & 0 deletions src/layeredcompmodel/bagging_model.py
@@ -0,0 +1,142 @@
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted, check_random_state
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from scipy.optimize import minimize_scalar
from typing import Any, Dict, List, Optional, Union

from layeredcompmodel.model import LayeredCompModel


class LayeredCompBaggingModel(BaseEstimator, RegressorMixin):
    """
    Layered Comp Bagging Model.

    A bagging ensemble version of the primary algorithm that reduces variance
    and automatically optimizes the weight_falloff for each tree in the ensemble.

    Parameters
    ----------
    tree_count : int, default=10
        Number of trees to build. Must be >= 1.
    sample_pct : float, default=0.8
        Fraction of data sampled for each tree and used as the internal split ratio.
        Must be between 0 and 1 (exclusive).
    random_state : int, RandomState instance or None, default=None
        Determines random number generation for subsampling.
    split_metric : {'mae', 'mse'}, default='mae'
        Metric used for both tree splitting and weight_falloff optimization.
    n_jobs : int, default=1
        Passed through to each underlying LayeredCompModel.
    """

    def __init__(
        self,
        tree_count: int = 10,
        sample_pct: float = 0.8,
        random_state: Optional[Union[int, np.random.RandomState]] = None,
        split_metric: str = 'mae',
        n_jobs: int = 1
    ) -> None:
        self.tree_count = tree_count
        self.sample_pct = sample_pct
        self.random_state = random_state
        self.split_metric = split_metric
        self.n_jobs = n_jobs

    def fit(self, X: Any, y: Any) -> "LayeredCompBaggingModel":
        """
        Build a bagging ensemble of LayeredCompModel trees.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The training input samples.
        y : array-like of shape (n_samples,)
            The target values.

        Returns
        -------
        self : object
            Fitted estimator.
        """
        # Validate hyperparameters
        if self.tree_count < 1:
            raise ValueError(f"tree_count must be >= 1, got {self.tree_count}")
        if not (0 < self.sample_pct < 1):
            raise ValueError(f"sample_pct must be between 0 and 1 (exclusive), got {self.sample_pct}")
        if self.split_metric not in ('mae', 'mse'):
            raise ValueError(f"split_metric must be 'mae' or 'mse', got {self.split_metric}")

        if X.shape[1] == 0:
            raise ValueError(f"0 feature(s) (shape={X.shape}) while a minimum of 1 is required.")
        if len(X) == 0:
            raise ValueError(f"Found array with 0 sample(s) (shape={X.shape}) while a minimum of 1 is required.")
        if len(y) == 0:
            raise ValueError(f"Found array with 0 sample(s) (shape={y.shape}) while a minimum of 1 is required.")

        # Convert y to a common format or handle both types
        y_array = y.values if hasattr(y, 'values') else y

        if pd.isna(y_array).any():
            raise ValueError("Input y contains NaN.")
        if pd.api.types.is_numeric_dtype(y_array) and np.isinf(y_array).any():
            raise ValueError("Input y contains infinity.")

        self.n_features_in_ = X.shape[1]
        self.feature_names_in_ = getattr(X, "columns", np.array([str(i) for i in range(X.shape[1])])).tolist()

        self.estimators_: List[LayeredCompModel] = []

        metric_fn = mean_absolute_error if self.split_metric == 'mae' else mean_squared_error

        # Normalize random_state so None, an int, or a RandomState instance all work.
        # (The previous `self.random_state + i` raised TypeError for the default None.)
        rng = check_random_state(self.random_state)

        for i in range(self.tree_count):
            # Draw a fresh, reproducible seed for this tree's train/test split.
            seed = int(rng.randint(np.iinfo(np.int32).max))
            X_tr, X_ts, y_tr, y_ts = train_test_split(
                X, y, test_size=(1 - self.sample_pct), random_state=seed
            )

            tree = LayeredCompModel(split_metric=self.split_metric, n_jobs=self.n_jobs)
            tree.fit(X_tr, y_tr)

            def objective(w: float) -> float:
                # Error on the held-out portion for a candidate weight_falloff.
                tree.weight_falloff = w
                preds = tree.predict(X_ts)
                return float(metric_fn(y_ts, preds))

            if len(y_ts) > 0:
                res = minimize_scalar(objective, bounds=(0.0, 15.0), method='bounded')
                opt_w = res.x
                best = res.fun
            else:
                # Fallback if no test data
                opt_w = 3
                best = -1

            tree.weight_falloff = opt_w
            self.estimators_.append(tree)
            print(f"Trained tree {i + 1} of {self.tree_count} with weight {tree.weight_falloff} @ {best}")

        return self

    def predict(self, X: Any) -> np.ndarray:
        """
        Predict regression target for X.

        The final prediction is the arithmetic mean of all individual tree predictions.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        y : ndarray of shape (n_samples,)
            The predicted values.
        """
        check_is_fitted(self)

        all_preds = []
        for tree in self.estimators_:
            all_preds.append(tree.predict(X))

        return np.mean(all_preds, axis=0)