The goal of the Layered Comp Model is to create a hierarchical prediction system that starts with a general predicted price and refines it by adding information and narrowing the comparison group. This is achieved by building a tree of nodes, each representing a filtered subset of the data, and calculating a weighted average of the Wilson means to produce a final prediction. The model aims to balance accuracy and equity by penalizing large sets and promoting normativity in predictions. The implementation should be a scikit-learn compatible estimator.
More details can be found in MODEL_SPEC.md
- pandas
- python
- numpy
- scipy
- scikit-learn
A pandas dataframe
A target prediction field
A list of columns to use
A trained scikit-learn compatible model object (e.g., LayeredCompModel) with fit and predict methods.
A pandas dataframe
A weight_falloff hyperparameter.
The input dataframe, returned with a "prediction" field added to it.
We will need to be able to determine whether a column is numeric or categorical so the correct segmentation test can be applied.
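One way to make this decision is with the dtype checks that pandas already ships. The sketch below is only an illustration: `column_kind` is a hypothetical helper name, and treating boolean columns as categorical is an assumption, not something the spec states.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def column_kind(series: pd.Series) -> str:
    """Return "numeric" or "categorical" for a column (hypothetical helper)."""
    # Assumption: booleans count as categories. is_numeric_dtype() alone
    # would report them as numeric, so they are checked first.
    if is_bool_dtype(series):
        return "categorical"
    return "numeric" if is_numeric_dtype(series) else "categorical"


example = pd.DataFrame(
    {
        "sqft": [900.0, 1200.0, None],          # float column with a missing value
        "style": ["ranch", None, "colonial"],   # string column with a missing value
        "has_garage": [True, False, True],      # boolean column
    }
)
kinds = {col: column_kind(example[col]) for col in example.columns}
print(kinds)
```

Integer-coded categories (for example a numeric zoning code) would be classified as numeric by this check, so a real implementation may also want an explicit override list.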
Uses a simple linear regression under the hood to fit sale price vs mean and evaluates the split quality by calculating the reduction in Mean Absolute Error (MAE).
- Calculate the base MAE for the current set of data by fitting a linear regression (sale price vs mean) and calculating the MAE ($MAE_{total}$).
- For each variant in the categorical (one-vs-rest), treating missing values as a distinct category:
  - Filter to only that variant, calculate its MAE vs its mean ($MAE_v$), and get its count ($N_v$).
  - Filter to the inverse of that variant, calculate its MAE vs its mean ($MAE_{inv}$), and get its count ($N_{inv}$).
  - Calculate the weighted MAE for the split: $MAE_{weighted} = (MAE_v \times N_v + MAE_{inv} \times N_{inv}) / N_{total}$.
  - The segmentation score is the ratio: $Score = MAE_{weighted} / MAE_{total}$.
- Find the lowest segmentation score among all variants. In case of ties, choose the variant that splits the count most evenly. If still tied, choose the first one.
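The categorical steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reference implementation: `mae_vs_mean` and `categorical_split` are hypothetical names, the `"__missing__"` sentinel is an arbitrary choice, and "MAE vs its mean" is read as the mean absolute deviation from the subset's mean sale price (which is what a regression against a constant mean would predict).

```python
import numpy as np
import pandas as pd


def mae_vs_mean(y: np.ndarray) -> float:
    """Mean absolute error of the values against their own mean (assumed reading)."""
    return float(np.mean(np.abs(y - y.mean()))) if y.size else 0.0


def categorical_split(df, column, target, missing_label="__missing__"):
    """Return (score, variant) for the best one-vs-rest split, or None."""
    y = df[target].to_numpy(dtype=float)
    mae_total = mae_vs_mean(y)
    if mae_total == 0.0:
        return None  # nothing left to reduce
    # Missing values become their own category (the label is an assumption).
    labels = df[column].astype(object).where(df[column].notna(), missing_label)
    n_total = len(df)
    best_key, best_variant = None, None
    for variant in pd.unique(labels):
        mask = (labels == variant).to_numpy()
        n_v, n_inv = int(mask.sum()), int((~mask).sum())
        if n_inv == 0:
            continue  # only one variant present, so there is no split
        weighted = (mae_vs_mean(y[mask]) * n_v + mae_vs_mean(y[~mask]) * n_inv) / n_total
        # Sort key: lowest score first, then the most even split.
        key = (weighted / mae_total, abs(n_v - n_inv))
        # Strict "<" keeps the first variant when both parts of the key tie.
        if best_key is None or key < best_key:
            best_key, best_variant = key, variant
    return None if best_key is None else (best_key[0], best_variant)


example = pd.DataFrame(
    {
        "style": ["A", "A", "B", "B", None],
        "sale_price": [100.0, 110.0, 200.0, 210.0, 150.0],
    }
)
score, variant = categorical_split(example, "style", "sale_price")
print(variant, round(score, 4))
```

In this toy data the base MAE is 40.8 and isolating variant `B` gives a weighted MAE of 14, so `B` wins with a score of about 0.343.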
- Exclude NaNs from numeric features during this process.
- Set num_iterations to the minimum of 10 and log2(current population size).
- Perform a binary search for an optimal breakpoint:
  - Set the initial midpoint to the median of the feature values.
  - Filter to the entries below the midpoint, calculate its MAE ($MAE_{low}$) and count ($N_{low}$).
  - Filter to the entries above the midpoint, calculate its MAE ($MAE_{high}$) and count ($N_{high}$).
  - Calculate the weighted MAE for the split: $MAE_{weighted} = (MAE_{low} \times N_{low} + MAE_{high} \times N_{high}) / N_{total}$.
  - The segmentation score is the ratio: $Score = MAE_{weighted} / MAE_{total}$.
  - Move the midpoint to the side that resulted in a better scoring split (lower ratio) and reform the "above" and "below" subsets for each step of the search.
- Return the lowest segmentation score and corresponding midpoint found. In case of ties, choose the split that splits the count most evenly. If still tied, choose the first one.
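The numeric search leaves some details open, so the sketch below is one possible interpretation rather than the definitive procedure. Assumptions made here: values equal to the midpoint go to the "above" side, log2 of the population is truncated to an integer, and "moving to the better side" is done by probing the point halfway between the current midpoint and each bound and stepping toward the lower-scoring probe. `numeric_split` and `mae_vs_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd


def mae_vs_mean(y: np.ndarray) -> float:
    """Mean absolute error of the values against their own mean (assumed reading)."""
    return float(np.mean(np.abs(y - y.mean()))) if y.size else 0.0


def numeric_split(df, column, target):
    """Return (score, breakpoint) for the best split found, or None."""
    data = df[[column, target]].dropna(subset=[column])  # exclude NaN features
    x = data[column].to_numpy(dtype=float)
    y = data[target].to_numpy(dtype=float)
    n_total = len(x)
    mae_total = mae_vs_mean(y)
    if n_total < 2 or mae_total == 0.0:
        return None
    num_iterations = int(min(10, np.log2(n_total)))  # truncation is an assumption

    def evaluate(mid):
        low = x < mid  # assumption: ties with the midpoint go "above"
        n_low, n_high = int(low.sum()), int((~low).sum())
        if n_low == 0 or n_high == 0:
            return (float("inf"), n_total)  # not a real split
        weighted = (mae_vs_mean(y[low]) * n_low + mae_vs_mean(y[~low]) * n_high) / n_total
        return (weighted / mae_total, abs(n_low - n_high))  # score, then evenness

    lo, hi = float(x.min()), float(x.max())
    mid = float(np.median(x))
    best_key, best_mid = evaluate(mid), mid
    for _ in range(num_iterations):
        left, right = (lo + mid) / 2, (mid + hi) / 2
        k_left, k_right = evaluate(left), evaluate(right)
        if k_left <= k_right:  # assumption: exact ties prefer the lower probe
            hi, mid, key = mid, left, k_left
        else:
            lo, mid, key = mid, right, k_right
        if key < best_key:  # strict "<" keeps the first split on ties
            best_key, best_mid = key, mid
    return None if best_key[0] == float("inf") else (best_key[0], best_mid)


example = pd.DataFrame(
    {
        "sqft": [1, 2, 3, 4, 10, 11, 12, 13, np.nan],
        "sale_price": [100, 102, 98, 100, 200, 202, 198, 200, 500],
    }
)
score, breakpoint_value = numeric_split(example, "sqft", "sale_price")
print(breakpoint_value, round(score, 4))
```

Here the row with a missing `sqft` is dropped, the median of the remaining eight values is 7.0, and that first midpoint already separates the two price clusters, so later probes only tie with it and the first split is kept.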
Minimum node size is 2. Do not attempt to split a node if its size is below this threshold.