
RAMPART: Top-k Feature Importance Ranking

TMLR 2025

RAMPART (Ranked Attributions with MiniPatches And Recursive Trimming) is an efficient method for identifying the most important features in high-dimensional datasets.

This repository contains the official implementation from:

Top-k Feature Importance Ranking
Eric Chen, Tiffany Tang, Genevera I. Allen
Transactions on Machine Learning Research (TMLR), 2025
OpenReview: https://openreview.net/forum?id=2OSHpccsaV

Overview

Traditional feature importance methods rank all features, which is computationally expensive and often unnecessary when you only need the top few. RAMPART uses:

  1. Minipatch Ensembling (RAMP): Aggregates feature rankings across random subsamples of observations and features
  2. Recursive Trimming (RAMPART): Iteratively eliminates bottom-ranked features using sequential halving

This approach is:

  • Efficient: Focuses computation on promising features
  • Scalable: Works with thousands of features
  • Flexible: Compatible with any feature importance model
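
To make the minipatch idea concrete, here is a self-contained, numpy-only sketch of rank aggregation over random minipatches, using least-squares coefficient magnitudes as a stand-in importance score (in the spirit of the package's LinearModel). This is an illustration of the idea, not the package's implementation:

```python
import numpy as np

def minipatch_rankings(X, y, n_minipatches=200, n_obs=50, n_features=10, seed=0):
    """Average each feature's within-minipatch importance rank.

    Importance here is the magnitude of a least-squares coefficient;
    a lower average rank means a more important feature.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    rank_sums = np.zeros(M)
    counts = np.zeros(M)
    for _ in range(n_minipatches):
        # random minipatch: subsample of observations and of features
        obs = rng.choice(N, size=n_obs, replace=False)
        feats = rng.choice(M, size=n_features, replace=False)
        coef, *_ = np.linalg.lstsq(X[np.ix_(obs, feats)], y[obs], rcond=None)
        order = np.argsort(-np.abs(coef))     # order[0] = best feature in patch
        ranks = np.empty(n_features)
        ranks[order] = np.arange(n_features)  # rank 0 = most important
        rank_sums[feats] += ranks
        counts[feats] += 1
    # guard against division by zero for features never sampled
    return rank_sums / np.maximum(counts, 1)
```

With enough minipatches, signal features accumulate low average ranks while noise features hover near the middle of the within-patch ranking.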

Quick Start

from rampart import ramp, rampart
from rampart.models import RandomForestModel
import numpy as np

# Generate example data
X = np.random.randn(200, 100)
beta = np.zeros(100)
beta[:5] = [5, 4, 3, 2, 1]  # 5 signal features
y = X @ beta + np.random.randn(200)

# Find top-5 features using RAMPART
rankings = rampart(
    X, y,
    k=5,
    model_cls=RandomForestModel,
    n_minipatches=1000,
    n_obs=50,
    n_features=10
)

# Get indices of top-5 features
top_5 = np.argsort(rankings)[:5]
print(f"Top-5 features: {top_5}")

Installation

# Clone the repository
git clone https://github.com/DataSlingers/TopK.git
cd TopK

# Install dependencies
pip install -r requirements.txt

Repository Structure

TopK/
├── rampart/                  # Main package
│   ├── algorithms.py         # RAMP and RAMPART implementations
│   ├── models.py             # Regression models
│   └── classifiers.py        # Classification models
├── examples/                 # Example notebooks
│   ├── quickstart.ipynb      # Basic usage
│   └── custom_data_example.ipynb
├── simulations/              # Paper simulations
│   ├── config.py             # Simulation parameters
│   ├── data_generation.py    # Synthetic data generation
│   ├── metrics.py            # Evaluation metrics (RBO, top-k accuracy)
│   ├── run_simulations.py    # Single experiment runner
│   └── run_batch.py          # Batch runner for all experiments
├── case_studies/             # Real data applications
│   ├── drug_response/        # CCLE drug response (Section 4.2.1)
│   └── breast_cancer/        # TCGA cancer subtyping (Section 4.2.2)
└── figures/                  # Paper figures
    ├── Figure3.ipynb         # Theory validation
    └── plot_results.py       # Generate result plots
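
metrics.py evaluates rankings with rank-biased overlap (RBO; Webber et al., 2010), among other metrics. As a reference point, a minimal truncated-RBO function can be sketched as follows (not the repo's implementation; the truncation slightly underestimates the full infinite-sum RBO):

```python
def rbo(list1, list2, p=0.9):
    """Truncated rank-biased overlap between two ranked lists.

    At each depth d, the agreement is the size of the overlap of the
    two depth-d prefixes divided by d; depths are weighted by p**(d-1).
    """
    depth = min(len(list1), len(list2))
    seen1, seen2 = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen1.add(list1[d - 1])
        seen2.add(list2[d - 1])
        agreement = len(seen1 & seen2) / d
        score += p ** (d - 1) * agreement
    return (1 - p) * score
```

Smaller p puts more weight on agreement at the very top of the two rankings, which matches the top-k focus of the paper.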

Methods

RAMP (Algorithm 1)

Ranks all features by averaging importance rankings across random minipatches:

from rampart import ramp

rankings = ramp(
    X, y,
    model_cls=RandomForestModel,
    n_minipatches=10000,   # Number of minipatches (T)
    n_obs=100,             # Observations per minipatch (n)
    n_features=10          # Features per minipatch (m)
)

RAMPART (Algorithm 2)

Efficiently identifies top-k features using sequential halving:

from rampart import rampart

rankings = rampart(
    X, y,
    k=10,                  # Number of top features to find
    model_cls=RandomForestModel,
    n_minipatches=2000,    # Minipatches per iteration (B)
    n_obs=100,
    n_features=10
)
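
To make the recursive-trimming idea concrete, here is a hypothetical, numpy-only sketch of a sequential-halving loop: score the surviving features on a batch of minipatches, keep the better-ranked half, and repeat until only k remain. As above, least-squares coefficient magnitudes stand in for the importance score; the actual rampart() implementation differs:

```python
import numpy as np

def sequential_halving_topk(X, y, k, n_minipatches=150, n_obs=50, m=10, seed=0):
    """Illustrative sequential halving: trim the bottom-ranked half of the
    surviving features each round until k features remain."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    survivors = np.arange(M)
    while len(survivors) > k:
        rank_sums = np.zeros(len(survivors))
        counts = np.zeros(len(survivors))
        for _ in range(n_minipatches):
            obs = rng.choice(N, size=n_obs, replace=False)
            idx = rng.choice(len(survivors), size=min(m, len(survivors)),
                             replace=False)
            feats = survivors[idx]
            coef, *_ = np.linalg.lstsq(X[np.ix_(obs, feats)], y[obs], rcond=None)
            order = np.argsort(-np.abs(coef))
            ranks = np.empty(len(idx))
            ranks[order] = np.arange(len(idx))  # rank 0 = most important
            rank_sums[idx] += ranks
            counts[idx] += 1
        avg = rank_sums / np.maximum(counts, 1)
        # keep the better-ranked half, but never fewer than k features
        keep = max(k, len(survivors) // 2)
        survivors = survivors[np.argsort(avg)[:keep]]
    return survivors
```

Because each round discards half of the candidates, later rounds concentrate the minipatch budget on the features that still matter, which is the source of RAMPART's efficiency gain over ranking everything.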

Available Models

Regression

  • LinearModel: Linear regression (coefficient-based importance)
  • DecisionTreeModel: Decision tree (impurity-based importance)
  • RandomForestModel: Random forest (mean impurity decrease)
  • KernelRidgePermutation: Kernel ridge regression (permutation importance)

Classification

  • LogisticModel: Logistic regression
  • DecisionTreeClassifier: Decision tree classifier
  • RandomForestClassifier: Random forest classifier
  • KernelSVMPermutation: Kernel SVM (permutation importance)
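
Since RAMPART is compatible with any feature importance model, you can also supply your own. The exact interface model_cls must satisfy is defined in rampart/models.py; the class below is only a hypothetical scikit-learn-style sketch of what such a model might look like, with ridge-regularized least squares as the importance source:

```python
import numpy as np

class RidgeModel:
    """Hypothetical custom model; the real interface expected by
    model_cls lives in rampart/models.py and may differ."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha

    def fit(self, X, y):
        # closed-form ridge regression: (X'X + alpha*I)^-1 X'y
        M = X.shape[1]
        self.coef_ = np.linalg.solve(X.T @ X + self.alpha * np.eye(M), X.T @ y)
        return self

    def feature_importances(self):
        # coefficient magnitude as the importance score
        return np.abs(self.coef_)
```

Check examples/custom_data_example.ipynb and rampart/models.py for the method names and signatures the package actually requires before adapting this pattern.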

Reproducing Paper Results

Simulations (Section 4.1)

cd simulations

# Run single experiment
python run_simulations.py --task regression --covariance IID --algorithm rampart

# Run all experiments (100 seeds)
python run_batch.py --seeds 100 --parallel 4

# Generate plots
cd ../figures
python plot_results.py --results-dir ../simulations/results

Case Studies (Section 4.2)

See notebooks in case_studies/:

  • Drug response prediction (CCLE dataset)
  • Breast cancer subtype classification (TCGA dataset)

Parameters Guide

Parameter       Description                  Recommended
k               Top features to find         Based on domain knowledge
n_minipatches   Minipatches per iteration    1000-4000 (higher = more accurate)
n_obs           Observations per minipatch   N/4 to N/2
n_features      Features per minipatch       10-20 (should be > k)
model_cls       Base model class             RandomForestModel (default)
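
As an illustration, the guidelines above might translate as follows for a dataset with N = 400 observations (values are illustrative choices, not from the paper):

```python
# Hypothetical parameter choices following the guide, for N = 400 observations
N = 400
params = dict(
    k=10,                # top features to find (set from domain knowledge)
    n_minipatches=2000,  # 1000-4000; higher = more accurate
    n_obs=N // 2,        # between N/4 and N/2
    n_features=20,       # 10-20, and should exceed k
)
```

These would then be passed as keyword arguments to rampart() along with a model_cls.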

Citation

@article{chen2025topk,
  title={Top-k Feature Importance Ranking},
  author={Chen, Eric and Tang, Tiffany and Allen, Genevera I.},
  journal={Transactions on Machine Learning Research},
  year={2025},
  url={https://openreview.net/forum?id=2OSHpccsaV}
}
