diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 219ff66..24a15b0 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -22,4 +22,5 @@ jobs: ruff check src/ tests/ python -m build pip install twine - twine check dist/* \ No newline at end of file + twine check dist/* + sphinx-build -b html docs docs/_build/html \ No newline at end of file diff --git a/.gitignore b/.gitignore index 85ccec0..d0450a1 100644 --- a/.gitignore +++ b/.gitignore @@ -132,3 +132,4 @@ dmypy.json # uv uv.lock +/.pypirc diff --git a/CHANGELOG.md b/CHANGELOG.md index 31f2898..10877ad 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,10 +1,17 @@ # Changelog +## [0.2.0] - 2026-04-27 +### Added +- `LayeredCompBaggingModel`: A bagging ensemble version of the primary algorithm that reduces variance and automatically optimizes the `weight_falloff` for each tree in the ensemble. +- `src/layeredcompmodel/bagging_model.py`: New module for the bagging model. +- Optimization of `weight_falloff`: Uses SciPy's bounded scalar minimizer (`minimize_scalar` with `method='bounded'`) to find the optimal `weight_falloff` (0-15) for each tree based on an internal validation set. +- Reproducibility support: Added `random_state` to `LayeredCompBaggingModel` for consistent ensemble results. + ## [0.1.0] - 2026-04-22 ### Added - Initial release: Hierarchical tree-based regressor using path-weighted Wilson means (95% trimmed) for robust predictions (e.g., parcel sale prices). - NaN handling: Categorical NaNs as distinct "NaN" category (`fillna("NaN").unique()`); numeric NaNs excluded from splits via `notna()` masks (per SPEC.md); target `y` must be finite (raises `ValueError`). - Scikit-learn compliance: `BaseEstimator`/`RegressorMixin`; works with `Pipeline`, `GridSearchCV`, `cross_val_score`, pickling; partial `check_estimator` pass (intentional NaN trade-off). 
-- Development: Full type hints (`py.typed`, mypy-ready), 16+ unittest/pytest tests (splits/NaN/explain/pickle/sklearn), `examples/quickstart.py` (MAE ~127k), `src/` layout, Hatchling build, dev deps (ruff/black/mypy). +- Development: Full type hints (`py.typed`, mypy-ready), 20+ unittest/pytest tests (splits/NaN/explain/pickle/sklearn/bagging), `examples/quickstart.py` (MAE ~127k), `src/` layout, Hatchling build, dev deps (ruff/black/mypy). Future releases will include Sphinx docs, benchmarks (vs XGBoost/LinearR), CI/CD. diff --git a/MODEL_SPEC.md b/MODEL_SPEC.md index f67429f..6515688 100644 --- a/MODEL_SPEC.md +++ b/MODEL_SPEC.md @@ -1,6 +1,6 @@ # Layered Comp Model -# Overview: +## Overview: The idea is to build "hierarchical" predictions that start with a general predicted price, then refine the prediction to be more specific by adding information and narrowing the comparison group. @@ -8,9 +8,9 @@ You take the "Wilson mean" of all your parcels, then you find a filter that spli To get the predictions back out, you find the most specific bucket your subject matches, then you trace its path up the tree, taking and weighting the Wilson means as you go. -# Method: +## Method: -## Training +### Training 1. Build a tree. 2. Plot sale prices. @@ -19,7 +19,7 @@ To get the predictions back out, you find the most specific bucket your subject 5. Make child nodes (one-vs-rest for categorical, or binary split for numeric using binary search for the breakpoint). We choose the split that results in the lowest ratio of weighted child MAE to parent MAE. 6. Repeat from step 3 until we've filtered down to a single parcel (leaf node) or cannot split further (minimum node size = 2). -## Predicting +### Predicting 1. Find the node furthest down in the hierarchy that matches your parcel. 2. Note its Wilson mean and the Wilson means of all nodes above it in the hierarchy. @@ -31,11 +31,12 @@ To get the predictions back out, you find the most specific bucket your subject 4. 
Take the weighted average of the Wilson means. 5. There's your prediction. -# Hyperparameters +## Hyperparameters weight_falloff: 0 to 1. Will be used in w(x)=(1−x)^weight_falloff where x is normalized from 0 to 1 -# Nuances: + +## Nuances: The Wilson Mean keeps the prediction from going too crazy on the large sets, and it also penalizes the small sets so when the test set gets specific, the value won't swing wildly. @@ -46,3 +47,36 @@ Every parcel should compare well because this model is fundamentally doing a hie If a predicted parcel has a feature that wasn't in the training set, that particular level of nuance will be missed, but the parcel will still slot into a node slightly higher up the tree, so the model should still perform reasonably well even for things we don't have representative sales for. The function that the weighted medians follows will determine a lot of how this model handles accuracy vs equity. A fast falloff will give good accuracy but may miss broader market trends. A slow falloff will promote "normativity" in predictions, but may miss market nuance and not assign correct values to particularly rare but valuable features. + + +# Layered Comp Bagging Model + +## Overview: + +A bagging ensemble version of the primary algorithm that reduces variance and automatically optimizes the `weight_falloff` for each tree in the ensemble. + +## Method: + +### Training + +For each tree from 1 to `tree_count`: + +1. **Subsampling**: Randomly sample a subset of the training data equal to `sample_pct` of the total records. +2. **Internal Split**: Divide the sample into a **training portion** and a **test portion**. The `sample_pct` also serves as the split ratio (e.g., if `sample_pct` is 0.8, 80% of the sample is for training, 20% for testing). If the test portion calculation results in 0, a minimum of 1 row is used. +3. **Tree Construction**: Build a standard `LayeredCompModel` tree using only the training portion. +4. 
**Weight Falloff Optimization**: Find the optimal `weight_falloff` (between 0 and 15) that minimizes the error (MAE or MSE) on the test portion. Since the error function typically has a single local minimum, use **Brent's method** or a golden-section search for optimization. +5. **Storage**: Save the tree structure and its specific optimized `weight_falloff`. + +### Predicting + +1. Generate predictions for the input from all `tree_count` individual trees. +2. Each tree uses its own optimized `weight_falloff` discovered during training. +3. The final prediction is the **arithmetic mean** of all individual tree predictions. + +## Hyperparameters + +* **tree_count**: Integer (min 1, default 10). Number of trees to build. +* **sample_pct**: Float (0 < x < 1, default 0.8). Fraction of data sampled for each tree and used as the internal split ratio. +* **random_state**: Integer or RandomState instance for reproducibility. +* **split_metric**: {'mae', 'mse'}. Metric used for both tree splitting and `weight_falloff` optimization. + diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 0000000..29922c3 --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,48 @@ +# Configuration file for the Sphinx documentation builder. 
+# +# For the full list of built-in configuration values, see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Project information ----------------------------------------------------- +import os +import sys + +sys.path.insert(0, os.path.abspath('../src')) + +project = 'LayeredCompModel' +copyright = '2026, John Kossa' +author = 'John Kossa' +release = '0.2.0' + +# -- General configuration --------------------------------------------------- +extensions = [ + 'autoapi.extension', + 'myst_parser', + 'sphinx.ext.napoleon', + 'sphinx.ext.viewcode', + 'sphinx.ext.autodoc', +] + +templates_path = ['_templates'] +exclude_patterns = [] + +autoapi_type = 'python' +autoapi_dirs = ['../src'] +autoapi_options = ['members', 'show-inheritance', 'special-members', 'undoc-members'] + +myst_enable_extensions = [ + "dollarmath", + "amsmath", + "deflist", + "html_admonition", + "html_image", + "colon_fence", + "smartquotes", + "replacements", + "strikethrough", + "substitution", +] + +# -- Options for HTML output ------------------------------------------------- +html_theme = 'sphinx_rtd_theme' +html_static_path = ['_static'] \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst new file mode 100644 index 0000000..c4dceb6 --- /dev/null +++ b/docs/index.rst @@ -0,0 +1,28 @@ +LayeredCompModel +================ + +Hierarchical tree-based regressor for robust predictions (e.g., parcel sale prices) using path-weighted Wilson means (95% trimmed). + +**Scikit-learn compatible.** + +Installation +------------ + +.. code-block:: bash + + pip install layeredcompmodel + +Quickstart +---------- + +See ``examples/quickstart.py`` in the repository. + +.. 
toctree:: + :maxdepth: 2 + :caption: Contents: + + api/modules + +:ref:`genindex` +:ref:`modindex` +:ref:`search` \ No newline at end of file diff --git a/pyproject.toml b/pyproject.toml index 4334a0e..b5bfe9b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -23,6 +23,7 @@ classifiers = [ ] requires-python = ">=3.10" dependencies = [ + "ipykernel>=7.2.0", "numpy>=1.24.0", "pandas>=2.0.0", "scikit-learn>=1.3.0", @@ -43,7 +44,11 @@ dev = [ "black", "mypy", "ruff", - "build" + "build", + "sphinx>=7.0", + "sphinx-rtd-theme", + "sphinx-autoapi", + "myst-parser", ] [tool.pytest.ini_options] @@ -63,10 +68,11 @@ cov = "pytest --cov=layeredcompmodel --cov-report=term-missing --cov-fail-under= mypy-check = "mypy src/layeredcompmodel" ruff-check = "ruff check src/ tests/" build = "hatch build" +docs = "sphinx-build -b html docs docs/_build/html" [tool.mypy] ignore_missing_imports = true disallow_untyped_defs = false warn_return_any = false warn_unreachable = false -check_untyped_defs = false \ No newline at end of file +check_untyped_defs = false diff --git a/src/layeredcompmodel/__init__.py b/src/layeredcompmodel/__init__.py index 541b43c..ac55807 100644 --- a/src/layeredcompmodel/__init__.py +++ b/src/layeredcompmodel/__init__.py @@ -1,4 +1,5 @@ from .model import LayeredCompModel, calculate_wilson_mean +from .bagging_model import LayeredCompBaggingModel -__all__ = ["LayeredCompModel", "calculate_wilson_mean"] -__version__ = "0.1.0" \ No newline at end of file +__all__ = ["LayeredCompModel", "LayeredCompBaggingModel", "calculate_wilson_mean"] +__version__ = "0.2.0" \ No newline at end of file diff --git a/src/layeredcompmodel/bagging_model.py b/src/layeredcompmodel/bagging_model.py new file mode 100644 index 0000000..433d2ee --- /dev/null +++ b/src/layeredcompmodel/bagging_model.py @@ -0,0 +1,142 @@ +import numpy as np +import pandas as pd +from sklearn.base import BaseEstimator, RegressorMixin +from sklearn.utils.validation import check_X_y, check_array, 
check_is_fitted, check_random_state +from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error +from sklearn.model_selection import train_test_split +from scipy.optimize import minimize_scalar +from typing import Any, Dict, List, Optional, Union + +from layeredcompmodel.model import LayeredCompModel + + +class LayeredCompBaggingModel(BaseEstimator, RegressorMixin): + """ + Layered Comp Bagging Model. + + A bagging ensemble version of the primary algorithm that reduces variance + and automatically optimizes the weight_falloff for each tree in the ensemble. + + Parameters + ---------- + tree_count : int, default=10 + Number of trees to build. Must be >= 1. + sample_pct : float, default=0.8 + Fraction of data sampled for each tree and used as the internal split ratio. + Must be between 0 and 1 (exclusive). + random_state : int, RandomState instance or None, default=None + Determines random number generation for subsampling. + split_metric : {'mae', 'mse'}, default='mae' + Metric used for both tree splitting and weight_falloff optimization. + """ + + def __init__( + self, + tree_count: int = 10, + sample_pct: float = 0.8, + random_state: Optional[Union[int, np.random.RandomState]] = None, + split_metric: str = 'mae', + n_jobs: int = 1 + ) -> None: + self.tree_count = tree_count + self.sample_pct = sample_pct + self.random_state = random_state + self.split_metric = split_metric + self.n_jobs = n_jobs + + def fit(self, X: Any, y: Any) -> "LayeredCompBaggingModel": + """ + Build a bagging ensemble of LayeredCompModel trees. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + The training input samples. + y : array-like of shape (n_samples,) + The target values. + + Returns + ------- + self : object + Fitted estimator. 
+ """ + # Validate hyperparameters + if self.tree_count < 1: + raise ValueError(f"tree_count must be >= 1, got {self.tree_count}") + if not (0 < self.sample_pct < 1): + raise ValueError(f"sample_pct must be between 0 and 1 (exclusive), got {self.sample_pct}") + if self.split_metric not in ('mae', 'mse'): + raise ValueError(f"split_metric must be 'mae' or 'mse', got {self.split_metric}") + + if X.shape[1] == 0: + raise ValueError(f"0 feature(s) (shape={X.shape}) while a minimum of 1 is required.") + if len(X) == 0: + raise ValueError(f"Found array with 0 sample(s) (shape={X.shape}) while a minimum of 1 is required.") + if len(y) == 0: + raise ValueError(f"Found array with 0 sample(s) (shape={y.shape}) while a minimum of 1 is required.") + + # Convert y to a common format or handle both types + y_array = y.values if hasattr(y, 'values') else y + + if pd.isna(y_array).any(): + raise ValueError("Input y contains NaN.") + if pd.api.types.is_numeric_dtype(y_array) and np.isinf(y_array).any(): + raise ValueError("Input y contains infinity.") + + self.n_features_in_ = X.shape[1] + self.feature_names_in_ = getattr(X, "columns", np.array([str(i) for i in range(X.shape[1])])).tolist() + + self.estimators_: List[LayeredCompModel] = [] + + metric_fn = mean_absolute_error if self.split_metric == 'mae' else mean_squared_error + + for i in range(self.tree_count): + X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=(1 - self.sample_pct), + random_state=None if self.random_state is None else self.random_state + i) + + tree = LayeredCompModel(split_metric=self.split_metric, n_jobs=self.n_jobs) + tree.fit(X_tr, y_tr) + + def objective(w: float) -> float: + tree.weight_falloff = w + preds = tree.predict(X_ts) + return float(metric_fn(y_ts, preds)) + + if len(y_ts) > 0: + res = minimize_scalar(objective, bounds=(0.0, 15.0), method='bounded') + opt_w = res.x + best = res.fun + else: + # Fallback if no test data + opt_w = 3 + best = -1 + + tree.weight_falloff = opt_w + self.estimators_.append(tree) + print(f"Trained 
tree {i + 1} of {self.tree_count} with weight {tree.weight_falloff} @ {best}") + + return self + + def predict(self, X: Any) -> np.ndarray: + """ + Predict regression target for X. + + The final prediction is the arithmetic mean of all individual tree predictions. + + Parameters + ---------- + X : array-like of shape (n_samples, n_features) + The input samples. + + Returns + ------- + y : ndarray of shape (n_samples,) + The predicted values. + """ + check_is_fitted(self) + + all_preds = [] + for tree in self.estimators_: + all_preds.append(tree.predict(X)) + + return np.mean(all_preds, axis=0) diff --git a/src/layeredcompmodel/model.py b/src/layeredcompmodel/model.py index be85f0a..887537c 100644 --- a/src/layeredcompmodel/model.py +++ b/src/layeredcompmodel/model.py @@ -14,8 +14,26 @@ def calculate_wilson_mean(y: Sequence[float]) -> float: """ - Calculates the Wilson mean: trim the top 2.5% and the bottom 2.5% (the middle 95%) - and return the mean of the remaining data. + Calculate robust "Wilson" mean: trim top/bottom 2.5% outliers (keep middle 95%), return mean of remainder. + + Parameters + ---------- + y : Sequence[float] + Input data values. + + Returns + ------- + float + Trimmed mean. Falls back to full mean if trim yields empty. + + Notes + ----- + Handles empty input as NaN. + + Examples + -------- + >>> calculate_wilson_mean([1.0, 2.0, 3.0, 100.0]) + 2.0 """ y_array = np.asarray(y, dtype=float) if y_array.size == 0: @@ -40,7 +58,59 @@ def __init__(self, depth: int, wilson_mean: float, count: int, filter_col: Optio class LayeredCompModel(RegressorMixin, BaseEstimator): + """ + Hierarchical tree-based regressor. Uses Wilson-trimmed means at nodes; predicts via weighted + path average from leaf to root (power-law decay controlled by weight_falloff). + + Splits recursively to minimize (weighted child_metric / parent_metric) ratio using MAE or MSE. 
+ + Parameters + ---------- + weight_falloff : float, default=0.5 + Weight decay exponent: higher favors leaf nodes. + split_metric : {'mae', 'mse'}, default='mae' + Split quality metric. + n_jobs : int, default=1 + Parallel split search jobs. + + Attributes + ---------- + tree_ : CompNode + Fitted tree root. + columns_ : list[str] + Input feature names. + n_features_in_ : int + Number of features seen during fit. + pre_sorted_indices_ : dict + Cached sorted indices for numeric features. + + See Also + -------- + calculate_wilson_mean : Robust node mean computation. + + Examples + -------- + >>> import numpy as np + >>> from sklearn.datasets import make_regression + >>> X, y = make_regression(n_samples=100, n_features=4, random_state=0) + >>> model = LayeredCompModel(split_metric='mae') + >>> model.fit(pd.DataFrame(X), pd.Series(y)) + LayeredCompModel(split_metric='mae', weight_falloff=0.5) + >>> model.predict(pd.DataFrame(X[:2])) + array([...]) + """ def __init__(self, weight_falloff: float = 0.5, split_metric: str = 'mae', n_jobs: int = 1) -> None: + """ + Parameters + ---------- + weight_falloff : float, default=0.5 + Controls weighting in prediction: w = (1 - x)**weight_falloff along path (x=0 at leaf). + Higher values prioritize leaf node. + split_metric : {'mae', 'mse'}, default='mae' + Metric minimized for splits: mean absolute or squared error. + n_jobs : int, default=1 + Number of jobs for parallel best-split search. + """ self.weight_falloff: float = weight_falloff self._split_metric_name: str = split_metric self.n_jobs: int = n_jobs @@ -270,6 +340,28 @@ def _find_best_split(self, X_full: DataFrame, y_full: Series, indices: np.ndarra return best_split def fit(self, X: DataFrame, y: Series, verbose: bool = False) -> "LayeredCompModel": + """ + Build tree from training data. + + Parameters + ---------- + X : DataFrame + Features; numeric/categorical cols supported (NaNs stop traversal). + y : Series + Target values (numeric, no NaN/inf). 
+ verbose : bool, default=False + Print split info. + + Returns + ------- + self : LayeredCompModel + Fitted estimator. + + Raises + ------ + ValueError + Invalid input shapes, NaN/inf in y, etc. + """ # Convert to pandas for easier manipulation if isinstance(X, np.ndarray): X = pd.DataFrame(X) @@ -420,6 +512,26 @@ def _build_tree(self, X_full: DataFrame, y_full: Series, indices: np.ndarray, de return root_node def predict(self, X: DataFrame) -> np.ndarray: + """ + Predict using weighted path averages. + + Parameters + ---------- + X : DataFrame, shape (n_samples, n_features) + Test features (columns must match training). + + Returns + ------- + y_pred : ndarray, shape (n_samples,) + Predicted targets. + + Raises + ------ + NotFittedError + If model not fitted. + ValueError + Feature count mismatch. + """ check_is_fitted(self) if hasattr(self, 'n_features_in_') and X.shape[1] != self.n_features_in_: raise ValueError(f"X has {X.shape[1]} features, but {type(self).__name__} is expecting {self.n_features_in_} features as input.") @@ -500,7 +612,18 @@ def _predict_row(self, row: pd.Series) -> float: def to_dict(self) -> Dict[str, Any]: """ - Exports the trained tree structure as a dictionary. + Serialize fitted tree to nested dict (JSON-compatible). + + Returns + ------- + dict[str, Any] + Tree structure: each node {"wilson_mean": float, "count": int, "depth": int, + "filter_col": str/None, "filter_val": str/float/None, "is_numeric": bool, + "variant": str/None, "children": list} + + Notes + ----- + Converts numpy scalars to Python; str() non-primitive filter_val. """ check_is_fitted(self) assert self.tree_ is not None @@ -534,13 +657,40 @@ def _node_to_dict(node: Optional[CompNode]) -> Optional[Dict[str, Any]]: def to_json(self, indent: int = 4) -> str: """ - Exports the trained tree structure as a JSON string. + Serialize fitted tree to JSON string. + + Parameters + ---------- + indent : int, default=4 + JSON indentation level. 
+ + Returns + ------- + str + JSON tree dump. """ return json.dumps(self.to_dict(), indent=indent) def explain_value(self, row: Union[DataFrame, pd.Series, Dict[str, Any]]) -> Dict[str, Any]: """ - Audits and traces the path that a row takes through the tree. + Trace prediction path for a single row, return nodes/weights/calculation. + + Parameters + ---------- + row : DataFrame or Series or dict + Single row data (DataFrame iloc[0] if multi-row). + + Returns + ------- + dict + - final_prediction : float + - weight_falloff : float + - path : list[dict] each {"depth", "wilson_mean", "count", "filter_col", "filter_val", "is_numeric", "actual_value", "weight"} + - calculation : str e.g. "(75*0.125 + 80*0.875) / 1.0000 = 79.375" + + Examples + -------- + >>> model.explain_value(X.iloc[0]) # X: DataFrame with the training columns """ check_is_fitted(self) if isinstance(row, pd.Series): diff --git a/tests/test_bagging_model.py b/tests/test_bagging_model.py new file mode 100644 index 0000000..3f1c549 --- /dev/null +++ b/tests/test_bagging_model.py @@ -0,0 +1,74 @@ +import pytest +import numpy as np +import pandas as pd +from sklearn.datasets import make_regression +from layeredcompmodel import LayeredCompBaggingModel + +def test_bagging_model_basic(): + X, y = make_regression(n_samples=50, n_features=4, random_state=42) + X = pd.DataFrame(X, columns=['a', 'b', 'c', 'd']) + y = pd.Series(y) + + model = LayeredCompBaggingModel(tree_count=5, sample_pct=0.8, random_state=42) + # print(X) + # print(y) + model.fit(X, y) + + assert len(model.estimators_) == 5 + + preds = model.predict(X) + assert len(preds) == len(y) + assert np.all(np.isfinite(preds)) + +def test_bagging_model_hyperparameters(): + X, y = make_regression(n_samples=20, n_features=2, random_state=42) + + # Test invalid tree_count + with pytest.raises(ValueError, match="tree_count must be >= 1"): + LayeredCompBaggingModel(tree_count=0).fit(X, y) + + # Test invalid sample_pct + with pytest.raises(ValueError, match="sample_pct must be 
between 0 and 1"): + LayeredCompBaggingModel(sample_pct=1.0).fit(X, y) + + with pytest.raises(ValueError, match="sample_pct must be between 0 and 1"): + LayeredCompBaggingModel(sample_pct=0.0).fit(X, y) + +def test_bagging_model_split_metric(): + X, y = make_regression(n_samples=20, n_features=2, random_state=42) + + model_mse = LayeredCompBaggingModel(tree_count=2, split_metric='mse', random_state=42) + model_mse.fit(X, y) + assert model_mse.split_metric == 'mse' + + model_mae = LayeredCompBaggingModel(tree_count=2, split_metric='mae', random_state=42) + model_mae.fit(X, y) + assert model_mae.split_metric == 'mae' + +def test_bagging_model_small_data(): + # Test with very small dataset + X = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}) + y = pd.Series([10, 20, 30]) + + model = LayeredCompBaggingModel(tree_count=2, sample_pct=0.5, random_state=42) + model.fit(X, y) + + preds = model.predict(X) + assert len(preds) == 3 + +def test_bagging_model_random_state(): + X, y = make_regression(n_samples=50, n_features=4, random_state=42) + + model1 = LayeredCompBaggingModel(tree_count=3, random_state=42) + model1.fit(X, y) + + model2 = LayeredCompBaggingModel(tree_count=3, random_state=42) + model2.fit(X, y) + + np.testing.assert_array_almost_equal(model1.predict(X), model2.predict(X)) + + model3 = LayeredCompBaggingModel(tree_count=3, random_state=43) + model3.fit(X, y) + + with pytest.raises(AssertionError): + np.testing.assert_array_almost_equal(model1.predict(X), model3.predict(X))
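
Note on the `weight_falloff` search in `fit`: the spec relies on the held-out error having a single minimum over the bounded interval, which is what makes a bracketing search applicable. The sketch below illustrates that idea with a plain golden-section search on a toy objective. It is not the PR's implementation, which calls SciPy's `minimize_scalar(..., method='bounded')`. The names `golden_section_minimize` and `toy_validation_error` are hypothetical, and the toy objective merely stands in for `metric_fn(y_ts, tree.predict(X_ts))`.

```python
import math


def golden_section_minimize(objective, lower, upper, tol=1e-6):
    """Minimize a unimodal objective on [lower, upper] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while (b - a) > tol:
        if fc < fd:
            # The minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            # The minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return (a + b) / 2.0


def toy_validation_error(w):
    # Hypothetical stand-in for the held-out MAE as a function of weight_falloff.
    return abs(w - 4.2) + 1.0


# Search the same interval the PR uses for weight_falloff.
best_w = golden_section_minimize(toy_validation_error, 0.0, 15.0)
print(f"best weight_falloff ~ {best_w:.4f}")
```

Each iteration reuses one of the two interior evaluations, so only one new tree prediction pass would be needed per step. If the real error curve has several local minima, any bracketing method of this kind may settle on one that is not the global minimum.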