Skip to content

bmsuisse/rusket

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

374 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

rusket logo

Ultra-fast Recommender Engines & Market Basket Analysis for Python, written in Rust.
Made with ❀️ by the Data & AI Team.

PyPI Python Rust License Docs


🎯 Goals

Goal Details
⚑ Blazing fast All algorithms run in compiled Rust (via PyO3) with multi-threaded Rayon parallelism and SIMD-accelerated kernels. ALS is 11Γ—, and FP-Growth is 140Γ— faster than PySpark.
πŸ“¦ Zero dependencies No TensorFlow, no PyTorch, no JVM. A single ~3 MB wheel is all you need β€” pip install rusket and go.
πŸ§‘β€πŸ’» Easy to use Common cases are one-liners: model.recommend_items(user_id), model.recommend_users(item_id), model.export_item_factors() for vector/embedding export. No boilerplate.
πŸ—οΈ Modern data stack Native Pandas, Polars, and Apache Spark support with zero-copy Arrow transfers. Works seamlessly with Delta Lake, Databricks, Snowflake, and any dbt/Parquet pipeline.

⚠️ Note: rusket is currently under heavy construction. The API will probably change in upcoming versions.

rusket is a modern, Rust-powered library for Market Basket Analysis and Recommender Engines. It delivers significant speed-ups and lower memory usage compared to traditional Python implementations, while natively supporting Pandas, Polars, and Spark out of the box.

Zero runtime dependencies. No TensorFlow, no PyTorch, no JVM β€” just pip install rusket and go. The entire engine is compiled Rust, distributed as a single ~3 MB wheel.

It features Collaborative Filtering (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE), Sequential Recommendation (FPMC, SASRec), Context-aware Prediction (FM), Pattern Mining (FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan), and built-in Hyperparameter Tuning (Optuna + MLflow tracking) with high performance and low memory footprints. Both functional and OOP APIs are available for seamless integration.


✨ Highlights

rusket LibRecommender implicit pyspark.ml
Core language Rust (PyO3) TF + PyTorch + Cython Cython / C++ Scala / Java (JVM)
Runtime deps 0 TF + PyTorch + gensim (~2 GB) OpenBLAS / MKL JVM + Spark
Install size ~3 MB ~2 GB ~50 MB ~300 MB
Algorithms ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE, FM, FPMC, SASRec, FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan ALS, BPR, SVD, LightGCN, ItemCF, FM, DeepFM, ... ALS, BPR ALS, FP-Growth, PrefixSpan
Recommender API βœ… Hybrid Engine + i2i Similarity βœ… βœ… βœ… (ALS only)
Graph & Embeddings βœ… NetworkX Export, Vector DB Export ❌ ❌ ❌
OOP class API βœ… ALS.from_transactions(df).fit() βœ… βœ… βœ…
Pandas / Polars / Spark βœ… / βœ… / βœ… βœ… / ❌ / ❌ ❌ / ❌ / ❌ ❌ / ❌ / βœ…
Parallel execution βœ… Rayon work-stealing βœ… TF/PyTorch threads βœ… OpenMP βœ… Spark Cluster
Memory Low (native Rust buffers) High (TF/PyTorch graphs) Low (C++ arrays) High (JVM overhead)

πŸ“¦ Installation

pip install rusket
# or with uv:
uv add rusket

Optional extras:

# Polars support
pip install "rusket[polars]"

# Pandas/NumPy support (usually already installed)
pip install "rusket[pandas]"

πŸš€ Quick Start

"Frequently Bought Together" β€” Grocery Checkout Data

Identify which products co-occur most in customer baskets β€” the foundation of cross-sell widgets, promotional bundles, and shelf placement decisions.

import pandas as pd
from rusket import FPGrowth

# One week of supermarket checkout data (1 row = 1 receipt, 1 col = 1 SKU)
receipts = pd.DataFrame({
    "milk":         [1, 1, 0, 1, 1, 0, 1],
    "bread":        [1, 0, 1, 1, 0, 1, 1],
    "butter":       [1, 0, 1, 0, 0, 1, 0],
    "eggs":         [0, 1, 1, 0, 1, 0, 1],
    "coffee":       [0, 1, 0, 0, 1, 1, 0],
    "orange_juice": [1, 0, 0, 1, 0, 0, 1],
}, dtype=bool)

# Step 1 β€” which SKU combinations appear in β‰₯40% of receipts?

model = FPGrowth(receipts, min_support=0.4)
freq = model.mine(use_colnames=True)

# Step 2 β€” keep rules with β‰₯60% confidence
rules = model.association_rules(metric="confidence", min_threshold=0.6)

# Lift > 1 means customers buy these together more than chance alone
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))

πŸ›’ E-Commerce Order Lines (Long Format)

Real-world data arrives as (order_id, sku) rows from a database β€” not one-hot matrices.

All mining algorithms expose a class-based API that goes straight from order lines to recommendations:

import pandas as pd
from rusket import FPGrowth

# Order line export from your e-commerce backend
orders = pd.DataFrame({
    "order_id": [1001, 1001, 1001, 1002, 1002, 1003, 1003],
    "sku":      ["HDPHONES", "USB_DAC", "AUX_CABLE",
                 "HDPHONES", "CARRY_CASE",
                 "USB_DAC",  "AUX_CABLE"],
})

model = FPGrowth.from_transactions(
    orders,
    transaction_col="order_id",
    item_col="sku",
    min_support=0.3,
)

freq  = model.mine(use_colnames=True)              # Miner classes: mine() never auto-fits
rules = model.association_rules(metric="confidence", min_threshold=0.6)

# Which accessories should be suggested when headphones are in the cart?
suggestions = model.recommend_items(["HDPHONES"], n=3)
# β†’ e.g. ["USB_DAC", "AUX_CABLE", "CARRY_CASE"]

Or use the explicit type variants:

from rusket import FPGrowth

ohe = FPGrowth.from_pandas(orders, transaction_col="order_id", item_col="sku")
ohe = FPGrowth.from_polars(pl_orders, transaction_col="order_id", item_col="sku")
ohe = FPGrowth.from_transactions([["HDPHONES", "USB_DAC"], ["HDPHONES", "CARRY_CASE"]])  # list of lists

Spark is also supported: FPGrowth.from_spark(spark_df) calls .toPandas() internally.


πŸ»β€β„οΈ Polars Input β€” Reading from Data Lake Parquet

For teams running a modern data stack with Parquet files on S3/GCS/Azure Blob, rusket natively accepts Polars DataFrames. Data is transferred via Arrow zero-copy buffers β€” no conversion overhead.

The fastest path from a data lake to "Frequently Bought Together" rules:

import polars as pl
from rusket import FPGrowth

# ── 1. Read a one-hot basket matrix directly from S3/GCS/local Parquet ──
# Columns = SKUs (bool), rows = receipts β€” produced by your dbt or Spark pipeline
baskets = pl.read_parquet("s3://data-lake/gold/basket_ohe.parquet")
print(f"Loaded {baskets.shape[0]:,} receipts Γ— {baskets.shape[1]} SKUs")

# ── 2. Instantiate FPGrowth (zero-copy from Polars) ─────────────────
model = FPGrowth(baskets, min_support=0.02, max_len=3)

# ── 3. Mine frequent combinations ────────────────────────────────────
freq = model.mine(use_colnames=True)
print(f"Found {len(freq):,} frequent itemsets")
print(freq.sort_values("support", ascending=False).head(10))

# ── 4. Generate cross-sell rules ────────────────────────────────────
rules = model.association_rules(metric="lift", min_threshold=1.2)
print(f"Rules with lift > 1.2: {len(rules):,}")
print(
    rules[["antecedents", "consequents", "confidence", "lift"]]
    .sort_values("lift", ascending=False)
    .head(8)
)

How it works under the hood:
Polars β†’ Arrow buffer β†’ np.uint8 (zero-copy) β†’ Rust fpgrowth_from_dense


πŸ’Ž High-Utility Pattern Mining (HUPM) β€” Profit-Driven Bundle Discovery

Frequent items aren't always the most profitable. HUPM finds product combinations that generate the highest total gross margin β€” even if they appear rarely. rusket implements the state-of-the-art EFIM algorithm in Rust.

import pandas as pd
from rusket import HUPM

# Specialty foods retailer: receipt line items with gross margin per unit sold
orders = pd.DataFrame({
    "receipt_id": [1, 1, 1, 2, 2, 3, 3],
    "product": ["aged_cheese", "wine_flight", "charcuterie",
                "aged_cheese", "charcuterie",
                "wine_flight", "charcuterie"],
    "margin": [8.50, 12.00, 6.50,   # receipt 1 β€” margin per item
               8.50, 6.50,           # receipt 2
               12.00, 6.50],         # receipt 3
})

# Find all product bundles generating β‰₯ €20 total margin across all receipts
high_margin = HUPM.from_transactions(
    orders,
    transaction_col="receipt_id",
    item_col="product",
    utility_col="margin",
    min_utility=20.0,
).mine()
print(high_margin.head())
# e.g. aged_cheese + wine_flight + charcuterie β†’ total margin 81.0

πŸ“Š Sparse Pandas Input

For very sparse datasets (e.g. e-commerce with thousands of SKUs), use Pandas SparseDtype to minimize memory. rusket passes the raw CSR arrays straight to Rust β€” no densification ever happens.

import pandas as pd
import numpy as np
from rusket import FPGrowth

rng = np.random.default_rng(7)
n_rows, n_cols = 30_000, 500

# Very sparse: average basket size β‰ˆ 3 items out of 500
p_buy = 3 / n_cols
matrix = rng.random((n_rows, n_cols)) < p_buy
products = [f"sku_{i:04d}" for i in range(n_cols)]

df_dense = pd.DataFrame(matrix.astype(bool), columns=products)
df_sparse = df_dense.astype(pd.SparseDtype("bool", fill_value=False))

dense_mb = df_dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = df_sparse.memory_usage(deep=True).sum() / 1e6
print(f"Dense  memory: {dense_mb:.1f} MB")
print(f"Sparse memory: {sparse_mb:.1f} MB  ({dense_mb / sparse_mb:.1f}Γ— smaller)")

# Same API, same results β€” just faster and lighter
freq = FPGrowth(df_sparse, min_support=0.01).mine(use_colnames=True)
print(f"Frequent itemsets: {len(freq):,}")

How it works under the hood:
Sparse DataFrame β†’ COO β†’ CSR β†’ (indptr, indices) β†’ Rust fpgrowth_from_csr


🌊 Out-of-Core Processing (FPMiner Streaming)

For datasets scaling to Billion-row sizes that don't fit in memory, use the FPMiner accumulator. It accepts chunks of (txn_id, item_id) pairs, sorting them in-place immediately, and uses a memory-safe k-way merge across all chunks to build the CSR matrix on the fly avoiding massive memory spikes.

import numpy as np
from rusket import FPMiner

n_items = 5_000
miner = FPMiner(n_items=n_items)

# Feed chunks incrementally (e.g. from Parquet/CSV/SQL)
for chunk in dataset:
    txn_ids = chunk["txn_id"].to_numpy(dtype=np.int64)
    item_ids = chunk["item_id"].to_numpy(dtype=np.int32)
    
    # Fast O(k log k) per-chunk sort
    miner.add_chunk(txn_ids, item_ids)

# Stream k-way merge and mine in one pass!
# Returns a DataFrame with 'support' and 'itemsets' just like fpgrowth()
freq = miner.mine(min_support=0.001, max_len=3)

Memory efficiency: The peak memory overhead at mine() time is just $O(k)$ for the cursors (where $k$ is the number of chunks), plus the final compressed CSR allocation.


🌩️ Distributed Computing with Apache Spark

rusket ships a full Spark integration layer in rusket.spark. All algorithms run as Native Arrow UDFs via applyInArrow β€” Rust is called directly on each executor, with zero Python overhead per row.

How it works

PySpark DataFrame
  └─► groupby(group_col).applyInArrow(...)
        └─► Arrow Table (per partition / per group)
              └─► Polars zero-copy conversion
                    └─► rusket Rust extension (on the executor)
                          └─► results β†’ PyArrow β†’ PySpark DataFrame

Full Example β€” Retail Basket Analysis per Store

from pyspark.sql import SparkSession
from rusket.spark import mine_grouped, rules_grouped

spark = SparkSession.builder.appName("rusket-demo").getOrCreate()

# ── 1. Load your OHE transaction table (one row = one basket) ──────────────
#    Schema: store_id (string), bread (bool), butter (bool), milk (bool), ...
spark_df = spark.read.parquet("s3://data/baskets/")

# ── 2. Mine frequent itemsets per store in parallel ──────────────────────────
#    Each Spark task calls the Rust FP-Growth/Eclat engine on its Arrow batch.
freq_df = mine_grouped(
    spark_df,
    group_col="store_id",
    min_support=0.05,    # 5% support per store

)
# freq_df schema: store_id | support (double) | itemsets (array<string>)

# ── 3. Count transactions per store (needed for rule support) ────────────────
from pyspark.sql import functions as F
counts = (
    spark_df.groupby("store_id")
    .agg(F.count("*").alias("n"))
    .rdd.collectAsMap()          # {"store_1": 12000, "store_2": 8500, ...}
)

# ── 4. Generate association rules per store ──────────────────────────────────
rules_df = rules_grouped(
    freq_df,
    group_col="store_id",
    num_itemsets=counts,         # pass per-group counts as a dict
    metric="confidence",
    min_threshold=0.6,
)
# rules_df schema: store_id | antecedents | consequents | confidence | lift | ...

rules_df.orderBy("lift", ascending=False).show(10, truncate=False)

Sequential Patterns per Category

from rusket.spark import prefixspan_grouped

# event_log schema: category_id, user_id, item_id, event_ts
event_log = spark.read.parquet("s3://data/events/")

seq_df = prefixspan_grouped(
    event_log,
    group_col="category_id",   # mine independently per product category
    user_col="user_id",        # sequence identifier within the group
    time_col="event_ts",       # ordering column
    item_col="item_id",
    min_support=50,            # absolute count: pattern must appear in β‰₯50 sessions
    max_len=4,
)
# seq_df schema: category_id | support (long) | sequence (array<string>)
seq_df.show(5, truncate=False)

High-Utility Patterns per Region

from rusket.spark import hupm_grouped

# profit_log schema: region_id, txn_id, item_id, profit
profit_log = spark.read.parquet("s3://data/profit/")

utility_df = hupm_grouped(
    profit_log,
    group_col="region_id",
    transaction_col="txn_id",
    item_col="item_id",
    utility_col="profit",
    min_utility=500.0,         # only itemsets with combined profit β‰₯ €500
)
# utility_df schema: region_id | utility (double) | itemset (array<long>)
utility_df.show(5, truncate=False)

Batch Recommendations across the Cluster

from rusket.spark import recommend_batches
from rusket import ALS

# 1. Train an ALS model locally (or load a pre-trained one)
als = ALS.from_transactions(
    events_pd,
    user_col="user_id",
    item_col="item_id",
).fit()  # ← always call .fit() after from_transactions()

# 2. Scale-out scoring: one recommendation row per user
user_df = spark.read.parquet("s3://data/users/").select("user_id")

recs_df = recommend_batches(user_df, model=als, user_col="user_id", k=10)
# recs_df schema: user_id (string) | recommended_items (array<int>)
recs_df.show(5, truncate=False)

Tip β€” Databricks / Delta Lake: All functions return a standard PySpark DataFrame, so you can write results back with .write.format("delta").save(...) or .saveAsTable(...) directly.


πŸ“– API Reference

OOP Class API

Every algorithm in rusket exposes a class-based API in addition to the functional helpers. All classes share a unified interface inherited from BaseModel:

Class Inherits from Description
FPGrowth Miner, RuleMinerMixin FP-Tree parallel mining
Eclat Miner, RuleMinerMixin Vertical bitset mining
FPGrowth Miner, RuleMinerMixin Frequent Pattern Growth algorithm
FIN Miner, RuleMinerMixin FP-tree Node-list intersection mining
LCM Miner, RuleMinerMixin Linear-time Closed itemset Mining
HUPM Miner High-Utility Pattern Mining (EFIM)
PrefixSpan Miner Sequential pattern mining
ALS ImplicitRecommender Alternating Least Squares CF
BPR ImplicitRecommender Bayesian Personalized Ranking CF
SVD ImplicitRecommender Funk SVD (biased SGD)
LightGCN ImplicitRecommender Graph Convolutional CF
ItemKNN ImplicitRecommender Item-based k-NN CF
UserKNN ImplicitRecommender User-based k-NN CF
EASE ImplicitRecommender Embarrassingly Shallow Autoencoders
FM BaseModel Factorization Machines (CTR prediction)
FPMC SequentialRecommender Factorizing Personalized Markov Chains
SASRec SequentialRecommender Self-Attentive Sequential Recommendation
HybridEmbeddingIndex β€” CF + semantic embedding fusion

All classes share the following data-ingestion class methods inherited from BaseModel:

# Load from long-format (transaction_id, item_id) DataFrame or list of lists
model = FPGrowth.from_transactions(df, transaction_col="order_id", item_col="item", min_support=0.3)

# Typed convenience aliases β€” same result
model = FPGrowth.from_pandas(df,  ...)
model = FPGrowth.from_polars(pl_df, ...)
model = FPGrowth.from_spark(spark_df, ...)

Miner subclasses (FPGrowth, Eclat) additionally expose RuleMinerMixin, giving a fluent pipeline:

model  = FPGrowth.from_transactions(df, min_support=0.3)
freq   = model.mine(use_colnames=True)             # pd.DataFrame [support, itemsets]
rules  = model.association_rules(metric="lift")    # pd.DataFrame [antecedents, consequents, ...]
recs   = model.recommend_items(["bread", "milk"])  # list of suggested items

ImplicitRecommender subclasses (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE) follow the scikit-learn fit()/predict() pattern. SequentialRecommender subclasses (FPMC, SASRec) use from_transactions(..., time_col=...).fit() for sequential next-item prediction:

# Option A β€” construct then fit with a sparse matrix
model = ALS(factors=64, iterations=15)
model.fit(user_item_csr)

# Option B β€” from event log, then explicit .fit()
model = ALS(factors=64).from_transactions(
    df, user_col="user_id", item_col="item_id"
).fit()  # ← .fit() is always required

# Predict / recommend
items, scores = model.recommend_items(user_id=42, n=10, exclude_seen=True)
users, scores = model.recommend_users(item_id=99, n=5)

Breaking change vs older versions: from_transactions() no longer auto-fits. Always chain .fit() after it.

🧠 Advanced Pattern & Recommendation Algorithms

rusket provides more than just basic market basket analysis. It includes an entire suite of modern algorithms and a high-level Business Recommender API.

🎯 ItemKNN & UserKNN β€” Nearest-Neighbor Collaborative Filtering

Two complementary memory-based methods that consistently rank among the top performers in academic benchmarks (see Anelli et al. 2022).

  • ItemKNN β€” Finds items similar to what the user already liked. Fast, stable, and scales well with pre-computed item-item similarity.
  • UserKNN β€” Finds users similar to the target user and recommends what they liked. Often more serendipitous and performs particularly well on dense datasets.

Both support BM25, TF-IDF, Cosine, and raw Count weighting, with the top-K neighbor pruning running in parallel Rust.

from rusket import ItemKNN, UserKNN

# ── Item-based: "Customers who bought X also bought Y" ────────────
item_knn = ItemKNN.from_transactions(
    purchases, user_col="user_id", item_col="item_id",
    method="bm25", k=100,
).fit()
items, scores = item_knn.recommend_items(user_id=42, n=10)

# ── User-based: "Users similar to you enjoyed these items" ────────
user_knn = UserKNN.from_transactions(
    purchases, user_col="user_id", item_col="item_id",
    method="cosine", k=50,
).fit()
items, scores = user_knn.recommend_items(user_id=42, n=10)

Which one to choose? Start with ItemKNN(method="bm25") β€” it's the fastest and most stable. Switch to UserKNN if you have a dense dataset or want more diverse recommendations. In production, try both and evaluate with rusket.evaluate().

🎯 ALS & BPR Collaborative Filtering

Both models learn user and item embeddings from implicit feedback (purchases, clicks, plays) and power personalised recommendations at scale. Use ALS for broad serendipitous discovery; use BPR when you care only about top-N ranking.

from rusket import ALS, BPR

# ── "For You" homepage β€” music streaming platform ────────────────────
# event log: user_id | track_id | plays (optional weight)
plays = pd.DataFrame({
    "user_id":  [101, 101, 102, 102, 103, 103, 103],
    "track_id": ["T01", "T03", "T01", "T05", "T02", "T03", "T05"],
    "plays":    [12, 5, 8, 3, 20, 1, 7],  # play count as confidence weight
})

als = ALS(factors=64, iterations=15, alpha=40.0).from_transactions(
    plays, user_col="user_id", item_col="track_id", rating_col="plays"
).fit()  # ← always call .fit() after from_transactions()

# Top-10 tracks for user 101, excluding already-played tracks
tracks, scores = als.recommend_items(user_id=101, n=10, exclude_seen=True)

# Which users are most likely to enjoy track T05? β€” useful for email campaigns
users, scores = als.recommend_users(item_id="T05", n=50)

# BPR β€” optimise ranking directly rather than reconstruction
bpr = BPR(factors=64, learning_rate=0.05, iterations=150).fit(user_item_csr)

🎯 Hybrid Recommender API

Combine Collaborative Filtering (ALS/BPR) with Frequent Pattern Mining to cover every placement surface β€” personalised homepage ("For You") and active cart ("Frequently Bought Together") β€” in a single engine.

from rusket import ALS, Recommender, FPGrowth

# 1. Train on purchase history (implicit feedback)
als = ALS(factors=64, iterations=15).fit(user_item_csr)

# 2. Mine co-purchase rules from basket data
miner = FPGrowth(basket_ohe, min_support=0.01)
freq  = miner.mine()
rules = miner.association_rules()

# 3. Create the Hybrid Engine
rec = Recommender(model=als, rules_df=rules)

# "For You" homepage β€” personalised for customer 1001
items, scores = rec.recommend_for_user(user_id=1001, n=5)

# Blend CF + product embeddings (e.g. from a PIM or sentence-transformer)
items, scores = rec.recommend_for_user(user_id=1001, n=5, alpha=0.7,
                                       target_item_for_semantic="HDPHONES")

# Active cart cross-sell β€” "Frequently Bought Together"
add_ons = rec.recommend_for_cart(["USB_DAC", "AUX_CABLE"], n=3)

# Overnight batch β€” score all customers, write to CRM
batch_df = rec.predict_next_chunk(user_history_df, user_col="customer_id", k=5)

🧬 Hybrid Embedding Fusion β€” CF + Semantic in One Vector Space

Collaborative filtering embeddings capture behavioral signals (who bought what); semantic text embeddings capture content meaning (product descriptions). Fusing them into a single vector space lets you do ANN retrieval, vector DB export, and clustering in one shot.

import rusket

# 1. Train ALS on implicit feedback
als = rusket.ALS(factors=64, iterations=15).fit(interactions)

# 2. Get semantic embeddings (e.g. from sentence-transformers)
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_vectors = encoder.encode(product_descriptions)  # (n_items, 384)

# 3. Fuse into a single hybrid vector space
hybrid = rusket.HybridEmbeddingIndex(
    cf_embeddings=als.item_factors,       # (n_items, 64)
    semantic_embeddings=text_vectors,      # (n_items, 384)
    strategy="weighted_concat",            # "concat" | "weighted_concat" | "projection"
    alpha=0.6,                             # 60% CF, 40% semantic
)

# 4. Similar items via cosine on the fused space
ids, scores = hybrid.query(item_id=42, n=10)

# 5. Build an ANN index for sub-millisecond retrieval
ann = hybrid.build_ann_index(backend="native")  # or "faiss"

# 6. Export to a vector DB for production serving
hybrid.export_vectors(qdrant_client, collection_name="hybrid_items")

# 7. Or export as separate named vectors for DB-side fusion
hybrid.export_vectors(qdrant_client, mode="multi", collection_name="hybrid_items")
# β†’ Qdrant/Meilisearch/Weaviate store "cf" and "semantic" as separate named vectors

Three fusion strategies:

Strategy Description Use Case
"concat" L2-normalise each space, concatenate Equal importance, no tuning
"weighted_concat" Scale by Ξ± / 1βˆ’Ξ±, then concat Default β€” tune alpha to balance CF vs semantic
"projection" Concat + PCA to projection_dim Compact vectors for large-scale deployment

Standalone function: If you just need the fused matrix without an index, use rusket.fuse_embeddings(cf, sem, strategy="weighted_concat", alpha=0.6).

🎯 Multi-Stage Recommendation Pipeline

For production systems requiring advanced retrieval and ranking, use the Pipeline class. This mirrors the "retrieve β†’ rerank β†’ filter" paradigm used by Twitter/X and modern ML stacks.

It chains multiple models together:

  1. Retrieve: Candidate generation
  2. Rerank: Re-score candidates using a heavier scoring function
  3. Filter: Apply business rules (e.g. exclude out-of-stock items, diversify)
from rusket import ALS, BPR, Pipeline, RuleBasedRecommender
import pandas as pd

# 1. Train multiple base models
als = ALS(factors=64).fit(interactions)
bpr = BPR(factors=128).fit(interactions)

# 2. Define explicit business rules (e.g. promoting warranties with laptops)
rules_df = pd.DataFrame({
    "antecedent": ["102"],   # Laptop SKU
    "consequent": ["999"],   # Warranty SKU
    "score": [2.0]
})
rules = RuleBasedRecommender.from_transactions(
    interactions, rules=rules_df, user_col="user", item_col="item"
).fit()

# 3. Compose the Pipeline (Retrieve from ALS, rerank with deeper BPR vectors)
# Items from the `rules` model receive an artificial +1,000,000 score 
# ensuring they rank at the top *after* the algorithmic reranking.
pipeline = Pipeline(
    retrieve=[als, bpr],
    merge_strategy="max",  # how to combine candidate scores
    rerank=bpr,
    rules=rules, 
)

# Recommend for a user
items, scores = pipeline.recommend(user_id=42, n=10, exclude_seen=True)

# Blazing-fast Batch Scoring utilizing Rust inner loops
batch_recs = pipeline.recommend_batch(
    user_ids=[1, 2, 3],
    n=10,
    format="polars"  # Returns a native Polars DataFrame instantly
)

πŸ’Ύ Saving, Loading and Serving (LanceDB / Vector DBs)

rusket models use a unified BaseModel that provides .save() and .load() functionality. You can also export trained models to a Vector Database for fast, real-time serving in production. We even provide load_model which automatically infers the model architecture from the pickle file.

import rusket

# 1. Train the model
model = rusket.ALS(factors=32).fit(interactions)

# 2. Save your trained model to disk
model.save("my_als_model.pkl")

# 3. Load it back using the generic loader
loaded_model = rusket.load_model("my_als_model.pkl")

# 4. Export the embeddings for a Vector Database
items_df = rusket.export_item_factors(
    loaded_model, 
    normalize=True,     # Best for Cosine Similarity search
    format="pandas"
)

# 5. Serve it in real-time (Example using LanceDB)
import lancedb

# Create a local vector database
db = lancedb.connect("./lancedb_store")
table = db.create_table("items", data=items_df)

# Query the table with a specific user's latent factors
user_emb = loaded_model.user_factors[0]

# Retrieve top 5 item recommendations for this user using L2-normalized vector search!
results = table.search(user_emb).limit(5).to_pandas()

πŸ” Analytics Helpers

from rusket import find_substitutes, customer_saturation

# Identify cannibalizing SKUs (lift < 1.0) for assortment rationalisation
subs = find_substitutes(rules_df, max_lift=0.8)
#  antecedents  consequents  lift
#  (Cola A,)    (Cola B,)    0.61   ← these products hurt each other's sales

# Segment customers by category penetration (decile 10 = buy everything; 1 = barely engaged)
saturation = customer_saturation(
    purchases_df, user_col="customer_id", category_col="category_id"
)

πŸ“ˆ BPR & Sequential Patterns

  • BPR (Bayesian Personalized Ranking): Directly optimises ranking of positive interactions over negative ones β€” ideal for newsfeeds, playlists, and app recommendation surfaces that prioritise top-N precision.
  • Sequential Pattern Mining (PrefixSpan): Discovers ordered patterns across time (e.g., "Subscriber signed up for broadband β†’ mobile plan β†’ premium bundle" or "Customer viewed Camera β†’ 2 weeks later bought Lens").

rusket natively extracts PrefixSpan sequences from Pandas, Polars, and PySpark event logs with zero-copy Arrow mapping:

from rusket import PrefixSpan

# Telco product adoption journeys β€” what sequence of subscriptions do customers follow?
# df: customer_id | subscription_date | product_id
model = PrefixSpan.from_transactions(
    subscription_events,
    transaction_col="customer_id",
    item_col="product_id",
    time_col="subscription_date",
    min_support=50,    # at least 50 customers follow this path
    max_len=4,
)
freq_seqs = model.mine()
# e.g. [broadband] β†’ [mobile] β†’ [tv_bundle] appears in 312 journeys

πŸ•ΈοΈ Graph Analytics & Embeddings

Integrate natively with the modern GenAI/LLM stack:

  • Vector Export: Export user/item factors to a Pandas DataFrame ready for FAISS/Qdrant using model.export_item_factors().
  • Item-to-Item Similarity: Fast Cosine Similarity on embeddings using model.similar_items(item_id).
  • Graph Generation: Automatically convert association rules into a networkx directed Graph for community detection using rusket.viz.to_networkx(rules).

πŸ”¬ MLOps: MLflow Tracking & Hyperparameter Tuning

rusket has built-in support for MLflow experiment tracking, mlflow.pyfunc packaging, and Bayesian hyperparameter optimisation using Optuna's TPE sampler. For ALS/eALS models, each Optuna trial runs the Rust-native cross-validation backend β€” making the entire search blazingly fast.

import rusket
import rusket.mlflow
from rusket import OptunaSearchSpace

# ── 1. Enable MLflow Autologging ─────────────────────────────────────
rusket.mlflow.autolog()

# ── 2. Train a single model with automatic tracking ──────────────────
# Hyperparameters (factors, iterations) and training_duration_seconds are logged!
import mlflow
with mlflow.start_run():
    model = rusket.ALS(factors=64, iterations=15).fit(df)

# Save/Load models as native MLflow pyfunc artifacts for easy deployment
rusket.mlflow.save_model(model, "my_als_model")
loaded_model = mlflow.pyfunc.load_model("my_als_model")  # Has a .predict(df) method

# ── 3. Quick hyperparameter search with sensible defaults ───────────
result = rusket.optuna_optimize(
    rusket.ALS,
    df,
    user_col="user_id",
    item_col="item_id",
    n_trials=50,
    metric="ndcg",
    k=10,
)
print(f"Best ndcg@10: {result.best_score:.4f}")
print(f"Best params:  {result.best_params}")

# ── Custom search space + refit best model ───────────────────────────
result = rusket.optuna_optimize(
    rusket.eALS,
    df,
    user_col="user_id",
    item_col="item_id",
    search_space=[
        OptunaSearchSpace.int("factors", 16, 256, log=True),
        OptunaSearchSpace.float("alpha", 1.0, 100.0, log=True),
        OptunaSearchSpace.float("regularization", 1e-4, 1.0, log=True),
        OptunaSearchSpace.int("iterations", 5, 30),
    ],
    n_trials=100,
    n_folds=3,
    metric="precision",
    refit_best=True,  # best model is already fitted
)
items, scores = result.best_model.recommend_items(user_id=42, n=10)

# ── MLflow experiment tracking ───────────────────────────────────────
# pip install mlflow optuna-integration
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("als-tuning")

result = rusket.optuna_optimize(
    rusket.ALS, df,
    user_col="user_id", item_col="item_id",
    n_trials=50, metric="ndcg",
    mlflow_tracking=True,   # ← every trial logged to MLflow
)

# ── Custom callbacks ─────────────────────────────────────────────────
result = rusket.optuna_optimize(
    rusket.ALS, df,
    user_col="user_id", item_col="item_id",
    n_trials=50,
    callbacks=[my_custom_callback],  # any Optuna-compatible callback
)

πŸš€ GPU Acceleration (CUDA)

rusket supports optional GPU acceleration via CuPy or PyTorch CUDA for models that benefit from large matrix operations. Enable it globally with a single call β€” no need to pass use_gpu=True to every model.

import rusket

# Enable GPU globally β€” every model created after this uses CUDA
rusket.enable_gpu()

# All models now default to GPU
als = rusket.ALS(factors=128, iterations=20).fit(interactions)
ease = rusket.EASE(regularization=500).fit(interactions)
bpr = rusket.BPR(factors=64).fit(interactions)

# Per-model override: force a specific model to CPU
small_model = rusket.SVD(factors=16, use_gpu=False)

# Turn it off globally
rusket.disable_gpu()

# Check the current state
rusket.is_gpu_enabled()  # β†’ False

Supported Models

All 12 recommender models respect the global GPU flag:

Model GPU-accelerated operations
ALS / eALS Gramian, Cholesky solve, batch scoring
BPR SGD updates, batch recommend
SVD Factor updates, batch scoring
EASE Gram matrix inversion
ItemKNN / UserKNN Similarity scoring
LightGCN Graph convolution, scoring
FM Prediction
FPMC Factor updates
SASRec / BERT4Rec Attention forward pass
NMF Multiplicative updates

Installation

# CuPy (recommended β€” fastest)
pip install cupy-cuda12x

# Or PyTorch
pip install torch

No GPU? No problem. rusket auto-detects whether a GPU backend is available. If neither CuPy nor PyTorch CUDA is installed, enable_gpu() will still succeed but models will raise an ImportError at fit-time. Use rusket.check_gpu_available() to test beforehand.


⚑ Benchmarks

Benchmark environment: Apple Silicon MacBook Air (M-series, arm64, 8 GB RAM). All timings are single-run wall-clock measurements.

Scale Benchmarks (1M β†’ 200M rows)

What's measured: from_transactions() converts long-format (txn_id, item_id) rows into a sparse OHE matrix. fpgrowth() then mines that matrix. Both steps have the same Rust mining cost β€” the only difference at large scale is whether you pay the conversion cost upfront.

Scale from_transactions (conversion) fpgrowth (mining) Total
1M rows 4.9s 0.1s 5.0s
10M rows 23.2s 1.2s 24.4s
50M rows 59.1s 4.0s 63.1s
100M rows (20M txns Γ— 200k items) 124.1s 10.1s 134.2s
200M rows (40M txns Γ— 200k items) 229.2s 17.6s 246.8s

The mining step is fast β€” the bottleneck at scale is the long-format β†’ sparse-matrix conversion. If your pipeline already produces a CSR/sparse matrix (e.g., from a Parquet/warehouse export), you skip the conversion entirely and only pay the mining cost.

Power-user path: Direct CSR β†’ Rust

import numpy as np
from scipy import sparse as sp
from rusket import FPGrowth

# Build CSR directly from integer IDs (no pandas!)
csr = sp.csr_matrix(
    (np.ones(len(txn_ids), dtype=np.int8), (txn_ids, item_ids)),
    shape=(n_transactions, n_items),
)
freq = FPGrowth(csr, item_names=item_names).mine(
    min_support=0.001, max_len=3, use_colnames=True
)

At 100M rows, the mining step itself takes 10.1 seconds. Building the CSR directly skips the from_transactions conversion cost (~124s) but does not change the mining time.

Real-World Datasets

Dataset Transactions Items rusket
andi_data.txt 8,416 119 9.7 s (22.8M itemsets)
andi_data2.txt 540,455 2,603 7.9 s

Run benchmarks yourself:

uv run pytest benchmarks/bench_scale.py -v -s   # Scale benchmark
uv run python benchmarks/bench_realworld.py     # Real-world datasets
uv run pytest tests/test_benchmark.py -v -s      # pytest-benchmark

Recommender Benchmarks vs LibRecommender

Measured with pytest-benchmark (5 rounds, warmed up, GC disabled). MovieLens 100k dataset (943 users, 1,682 items, 100k ratings). Only model.fit() is timed β€” no startup or data loading overhead.

Benchmark rusket LibRecommender Speedup
ALS (Cholesky) (64 factors, 15 epochs) 427 ms 1,324 ms 3.1Γ—
ALS (eALS) (64 factors, 15 epochs) 360 ms N/A β€”
BPR (64 factors, 10 epochs) 33 ms 681 ms 20.4Γ—
ItemKNN (k=100) 55 ms 287 ms 5.2Γ—
SVD (64 factors, 20 epochs) 55 ms ❌ TF-only (broken) β€”
EASE 71 ms N/A β€”

Note: LibRecommender requires TensorFlow + PyTorch + gensim + Cython (~2 GB of dependencies). rusket has zero runtime dependencies.

uv run pytest benchmarks/bench_pytest_librecommender.py -v --benchmark-columns=mean,stddev,rounds

πŸ— Architecture

Data Flow

pandas dense         ──► np.uint8 array (C-contiguous)  ──► Rust fpgrowth_from_dense
pandas Arrow backend ──► Arrow β†’ np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
pandas sparse        ──► CSR int32 arrays               ──► Rust fpgrowth_from_csr
polars               ──► Arrow β†’ np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
numpy ndarray        ──► np.uint8 (C-contiguous)        ──► Rust fpgrowth_from_dense

All mining and rule generation happens inside Rust. No Python loops, no round-trips.

The 1 Billion Row Architecture

To pass the "1 Billion Row" threshold without OOM crashes, rusket employs a zero-allocation mining loop:

  • Eclat Scratch Buffers: intersect_count_into writes intersections directly into thread-local pre-allocated memory bytes and computes popcnt in a single pass. It implements early-exit loop termination the moment it proves a combination cannot reach min_support.
  • FPGrowth Parallel Tree Build: Conditional FP-trees are collected concurrently inside the rayon parallel mining step, replacing the standard sequential loop and eliminating memory contention bottlenecks.
  • AHashMap Deduplication: Extremely fast O(N) duplicate basket counting replaces standard O(N log N) unstable sorts in the core pipeline.

πŸ§‘β€πŸ’» Development

Prerequisites

  • Rust 1.83+ (rustup update)
  • Python 3.10+
  • uv (recommended package manager)

Getting Started

# Clone
git clone https://github.com/bmsuisse/rusket.git
cd rusket

# Build Rust extension in dev mode
uv run maturin develop --release

# Run the full test suite
uv run pytest tests/ -x -q

# Type-check the Python layer
uv run pyright rusket/

# Cargo check (Rust)
cargo check

Run Examples

# Getting started
uv run python examples/01_getting_started.py

# Market basket analysis with Faker
uv run python examples/02_market_basket_faker.py

# Polars input
uv run python examples/03_polars_input.py

# Sparse input
uv run python examples/04_sparse_input.py

# Large-scale mining (100k+ rows)
uv run python examples/05_large_scale.py

πŸ€– AI Disclosure

A large part of this library β€” including the Rust core algorithms, the Python wrappers, the OOP class hierarchy, and the Spark integration layer β€” was written with substantial assistance from AI pair-programming tools (specifically Google Gemini / Antigravity). Human review, benchmarking, and architectural decisions were applied throughout.

We believe in transparency about AI-assisted development. The algorithms are correct, the tests pass, and the performance numbers are real β€” but if you find a bug or a piece of "AI slop", please open an issue!


πŸ“œ License

MIT License