GitHub - bmsuisse/rusket: rusket 🦀🧺

Ultra-fast Recommender Engines & Market Basket Analysis for Python, written in Rust.
Made with ❤️ by the Data & AI Team.

🎯 Goals

Goal	Details
⚡ Blazing fast	All algorithms run in compiled Rust (via PyO3) with multi-threaded Rayon parallelism and SIMD-accelerated kernels. ALS is 11× faster and FP-Growth 140× faster than PySpark.
📦 Zero dependencies	No TensorFlow, no PyTorch, no JVM. A single ~3 MB wheel is all you need — `pip install rusket` and go.
🧑‍💻 Easy to use	Common cases are one-liners: `model.recommend_items(user_id)`, `model.recommend_users(item_id)`, `model.export_item_factors()` for vector/embedding export. No boilerplate.
🏗️ Modern data stack	Native Pandas, Polars, and Apache Spark support with zero-copy Arrow transfers. Works seamlessly with Delta Lake, Databricks, Snowflake, and any dbt/Parquet pipeline.

⚠️ Note: rusket is currently under heavy construction. The API will probably change in upcoming versions.

rusket is a modern, Rust-powered library for Market Basket Analysis and Recommender Engines. It delivers significant speed-ups and lower memory usage compared to traditional Python implementations, while natively supporting Pandas, Polars, and Spark out of the box.

Zero runtime dependencies. No TensorFlow, no PyTorch, no JVM — just pip install rusket and go. The entire engine is compiled Rust, distributed as a single ~3 MB wheel.

It features Collaborative Filtering (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE), Sequential Recommendation (FPMC, SASRec), Context-aware Prediction (FM), Pattern Mining (FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan), and built-in Hyperparameter Tuning (Optuna + MLflow tracking) with high performance and low memory footprints. Both functional and OOP APIs are available for seamless integration.

✨ Highlights

	`rusket`	`LibRecommender`	`implicit`	`pyspark.ml`
Core language	Rust (PyO3)	TF + PyTorch + Cython	Cython / C++	Scala / Java (JVM)
Runtime deps	0	TF + PyTorch + gensim (~2 GB)	OpenBLAS / MKL	JVM + Spark
Install size	~3 MB	~2 GB	~50 MB	~300 MB
Algorithms	ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE, FM, FPMC, SASRec, FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan	ALS, BPR, SVD, LightGCN, ItemCF, FM, DeepFM, ...	ALS, BPR	ALS, FP-Growth, PrefixSpan
Recommender API	✅ Hybrid Engine + i2i Similarity	✅	✅	✅ (ALS only)
Graph & Embeddings	✅ NetworkX Export, Vector DB Export	❌	❌	❌
OOP class API	✅ `ALS.from_transactions(df).fit()`	✅	✅	✅
Pandas / Polars / Spark	✅ / ✅ / ✅	✅ / ❌ / ❌	❌ / ❌ / ❌	❌ / ❌ / ✅
Parallel execution	✅ Rayon work-stealing	✅ TF/PyTorch threads	✅ OpenMP	✅ Spark Cluster
Memory	Low (native Rust buffers)	High (TF/PyTorch graphs)	Low (C++ arrays)	High (JVM overhead)

📦 Installation

pip install rusket
# or with uv:
uv add rusket

Optional extras:

# Polars support
pip install "rusket[polars]"

# Pandas/NumPy support (usually already installed)
pip install "rusket[pandas]"

🚀 Quick Start

"Frequently Bought Together" — Grocery Checkout Data

Identify which products co-occur most in customer baskets — the foundation of cross-sell widgets, promotional bundles, and shelf placement decisions.

import pandas as pd
from rusket import FPGrowth

# One week of supermarket checkout data (1 row = 1 receipt, 1 col = 1 SKU)
receipts = pd.DataFrame({
    "milk":         [1, 1, 0, 1, 1, 0, 1],
    "bread":        [1, 0, 1, 1, 0, 1, 1],
    "butter":       [1, 0, 1, 0, 0, 1, 0],
    "eggs":         [0, 1, 1, 0, 1, 0, 1],
    "coffee":       [0, 1, 0, 0, 1, 1, 0],
    "orange_juice": [1, 0, 0, 1, 0, 0, 1],
}, dtype=bool)

# Step 1 — which SKU combinations appear in ≥40% of receipts?

model = FPGrowth(receipts, min_support=0.4)
freq = model.mine(use_colnames=True)

# Step 2 — keep rules with ≥60% confidence
rules = model.association_rules(metric="confidence", min_threshold=0.6)

# Lift > 1 means customers buy these together more than chance alone
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))
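
To make the columns of the rules table concrete, the sketch below recomputes support, confidence, and lift for one rule, butter → bread, straight from the receipts above using only the standard library. It assumes the usual association-rule definitions; `rule_metrics` is an illustrative helper, not part of rusket.

```python
# Columns copied from the `receipts` DataFrame above (1 = item on the receipt).
receipts = {
    "milk":   [1, 1, 0, 1, 1, 0, 1],
    "bread":  [1, 0, 1, 1, 0, 1, 1],
    "butter": [1, 0, 1, 0, 0, 1, 0],
}

def rule_metrics(data, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(data[antecedent])
    # Receipts that contain both sides of the rule.
    both = sum(a and c for a, c in zip(data[antecedent], data[consequent]))
    support = both / n
    confidence = both / sum(data[antecedent])
    lift = confidence / (sum(data[consequent]) / n)
    return support, confidence, lift

support, confidence, lift = rule_metrics(receipts, "butter", "bread")
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```

Butter is on three of the seven receipts and always alongside bread, so the rule has support 3/7, confidence 1.0, and lift 1.4 (bread on its own is on 5/7 of receipts).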

🛒 E-Commerce Order Lines (Long Format)

Real-world data arrives as (order_id, sku) rows from a database — not one-hot matrices.

All mining algorithms expose a class-based API that goes straight from order lines to recommendations:

import pandas as pd
from rusket import FPGrowth

# Order line export from your e-commerce backend
orders = pd.DataFrame({
    "order_id": [1001, 1001, 1001, 1002, 1002, 1003, 1003],
    "sku":      ["HDPHONES", "USB_DAC", "AUX_CABLE",
                 "HDPHONES", "CARRY_CASE",
                 "USB_DAC",  "AUX_CABLE"],
})

model = FPGrowth.from_transactions(
    orders,
    transaction_col="order_id",
    item_col="sku",
    min_support=0.3,
)

freq  = model.mine(use_colnames=True)              # Miner classes: mine() never auto-fits
rules = model.association_rules(metric="confidence", min_threshold=0.6)

# Which accessories should be suggested when headphones are in the cart?
suggestions = model.recommend_items(["HDPHONES"], n=3)
# → e.g. ["USB_DAC", "AUX_CABLE", "CARRY_CASE"]

Or use the explicit type variants:

from rusket import FPGrowth

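# Each variant returns a model, just like from_transactions() above.
# pl_orders is the same order-line table held as a Polars DataFrame.
model = FPGrowth.from_pandas(orders, transaction_col="order_id", item_col="sku")
model = FPGrowth.from_polars(pl_orders, transaction_col="order_id", item_col="sku")
model = FPGrowth.from_transactions([["HDPHONES", "USB_DAC"], ["HDPHONES", "CARRY_CASE"]])  # list of lists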

Spark is also supported: FPGrowth.from_spark(spark_df) calls .toPandas() internally.
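
Conceptually, going from order lines to baskets is a group-by on the transaction column followed by one-hot encoding. The standard-library sketch below illustrates that reshaping on the `orders` rows from this section. It is not rusket's internal implementation, and `to_baskets` and `one_hot` are hypothetical helper names.

```python
from collections import defaultdict

# (order_id, sku) rows, as in the `orders` DataFrame above.
order_lines = [
    (1001, "HDPHONES"), (1001, "USB_DAC"), (1001, "AUX_CABLE"),
    (1002, "HDPHONES"), (1002, "CARRY_CASE"),
    (1003, "USB_DAC"), (1003, "AUX_CABLE"),
]

def to_baskets(rows):
    """Group long-format rows into {transaction_id: set_of_items}."""
    baskets = defaultdict(set)
    for txn_id, item in rows:
        baskets[txn_id].add(item)
    return dict(baskets)

def one_hot(baskets):
    """Return (sorted item names, {txn_id: [bool per item]})."""
    items = sorted({item for basket in baskets.values() for item in basket})
    matrix = {txn: [item in basket for item in items] for txn, basket in baskets.items()}
    return items, matrix

baskets = to_baskets(order_lines)
items, matrix = one_hot(baskets)
print(items)
for txn_id, row in matrix.items():
    print(txn_id, row)
```

Each printed row corresponds to one receipt in the one-hot layout used in the Quick Start example.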

🐻‍❄️ Polars Input — Reading from Data Lake Parquet

For teams running a modern data stack with Parquet files on S3/GCS/Azure Blob, rusket natively accepts Polars DataFrames. Data is transferred via Arrow zero-copy buffers — no conversion overhead.

The fastest path from a data lake to "Frequently Bought Together" rules:

import polars as pl
from rusket import FPGrowth

# ── 1. Read a one-hot basket matrix directly from S3/GCS/local Parquet ──
# Columns = SKUs (bool), rows = receipts — produced by your dbt or Spark pipeline
baskets = pl.read_parquet("s3://data-lake/gold/basket_ohe.parquet")
print(f"Loaded {baskets.shape[0]:,} receipts × {baskets.shape[1]} SKUs")

# ── 2. Instantiate FPGrowth (zero-copy from Polars) ─────────────────
model = FPGrowth(baskets, min_support=0.02, max_len=3)

# ── 3. Mine frequent combinations ────────────────────────────────────
freq = model.mine(use_colnames=True)
print(f"Found {len(freq):,} frequent itemsets")
print(freq.sort_values("support", ascending=False).head(10))

# ── 4. Generate cross-sell rules ────────────────────────────────────
rules = model.association_rules(metric="lift", min_threshold=1.2)
print(f"Rules with lift ≥ 1.2: {len(rules):,}")
print(
    rules[["antecedents", "consequents", "confidence", "lift"]]
    .sort_values("lift", ascending=False)
    .head(8)
)

How it works under the hood:
Polars → Arrow buffer → np.uint8 (zero-copy) → Rust fpgrowth_from_dense

💎 High-Utility Pattern Mining (HUPM) — Profit-Driven Bundle Discovery

Frequent items aren't always the most profitable. HUPM finds product combinations that generate the highest total gross margin — even if they appear rarely. rusket implements the state-of-the-art EFIM algorithm in Rust.

import pandas as pd
from rusket import HUPM

# Specialty foods retailer: receipt line items with gross margin per unit sold
orders = pd.DataFrame({
    "receipt_id": [1, 1, 1, 2, 2, 3, 3],
    "product": ["aged_cheese", "wine_flight", "charcuterie",
                "aged_cheese", "charcuterie",
                "wine_flight", "charcuterie"],
    "margin": [8.50, 12.00, 6.50,   # receipt 1 — margin per item
               8.50, 6.50,           # receipt 2
               12.00, 6.50],         # receipt 3
})

# Find all product bundles generating ≥ €20 total margin across all receipts
high_margin = HUPM.from_transactions(
    orders,
    transaction_col="receipt_id",
    item_col="product",
    utility_col="margin",
    min_utility=20.0,
).mine()
print(high_margin.head())
# e.g. aged_cheese + wine_flight + charcuterie → total margin 27.0 (the bundle occurs only in receipt 1)
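
To show what "utility" means here, the sketch below brute-forces every bundle in the three receipts above and keeps those whose summed margin reaches the threshold. It assumes the standard high-utility itemset definition (sum the margins of the bundle's items over every receipt that contains the whole bundle). It is plain enumeration for illustration, not the EFIM algorithm, and `itemset_utility` is a hypothetical helper.

```python
from itertools import combinations

# receipt_id -> {product: margin}, same numbers as the `orders` DataFrame above.
receipts = {
    1: {"aged_cheese": 8.50, "wine_flight": 12.00, "charcuterie": 6.50},
    2: {"aged_cheese": 8.50, "charcuterie": 6.50},
    3: {"wine_flight": 12.00, "charcuterie": 6.50},
}

def itemset_utility(itemset, receipts):
    """Sum the margins of `itemset` over every receipt that contains all of it."""
    return sum(
        sum(lines[item] for item in itemset)
        for lines in receipts.values()
        if all(item in lines for item in itemset)
    )

products = sorted({p for lines in receipts.values() for p in lines})
high_utility = {}
for size in range(1, len(products) + 1):
    for itemset in combinations(products, size):
        utility = itemset_utility(itemset, receipts)
        if utility >= 20.0:  # same threshold as min_utility above
            high_utility[itemset] = utility

for itemset, utility in sorted(high_utility.items(), key=lambda kv: -kv[1]):
    print(" + ".join(itemset), "->", utility)
```

Charcuterie alone is on every receipt yet only reaches 19.5, so it misses the threshold, while wine_flight + charcuterie reaches 37.0 from just two receipts. That is the gap between frequency and utility the section describes.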

📊 Sparse Pandas Input

For very sparse datasets (e.g. e-commerce with thousands of SKUs), use Pandas SparseDtype to minimize memory. rusket passes the raw CSR arrays straight to Rust — no densification ever happens.

import pandas as pd
import numpy as np
from rusket import FPGrowth

rng = np.random.default_rng(7)
n_rows, n_cols = 30_000, 500

# Very sparse: average basket size ≈ 3 items out of 500
p_buy = 3 / n_cols
matrix = rng.random((n_rows, n_cols)) < p_buy
products = [f"sku_{i:04d}" for i in range(n_cols)]

df_dense = pd.DataFrame(matrix.astype(bool), columns=products)
df_sparse = df_dense.astype(pd.SparseDtype("bool", fill_value=False))

dense_mb = df_dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = df_sparse.memory_usage(deep=True).sum() / 1e6
print(f"Dense  memory: {dense_mb:.1f} MB")
print(f"Sparse memory: {sparse_mb:.1f} MB  ({dense_mb / sparse_mb:.1f}× smaller)")

# Same API, same results — just faster and lighter
freq = FPGrowth(df_sparse, min_support=0.01).mine(use_colnames=True)
print(f"Frequent itemsets: {len(freq):,}")

How it works under the hood:
Sparse DataFrame → COO → CSR → (indptr, indices) → Rust fpgrowth_from_csr
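
For readers unfamiliar with CSR, the sketch below builds the two arrays named above, `indptr` and `indices`, from a small boolean matrix. Because basket data is binary, no separate values array is needed. This is a pure-Python illustration of the layout, not the code path rusket uses, and `to_csr` is a hypothetical helper.

```python
def to_csr(rows):
    """Convert a list of boolean rows into CSR (indptr, indices) lists."""
    indptr, indices = [0], []
    for row in rows:
        # Store only the column positions that are set.
        indices.extend(col for col, flag in enumerate(row) if flag)
        indptr.append(len(indices))
    return indptr, indices

# 3 receipts x 5 SKUs, mostly empty.
matrix = [
    [False, True,  False, False, True],
    [False, False, False, False, False],
    [True,  False, False, True,  False],
]
indptr, indices = to_csr(matrix)
print(indptr)   # row r owns indices[indptr[r]:indptr[r + 1]]
print(indices)
```

Memory grows with the number of purchased items rather than with rows × columns, which is why the sparse path stays small for wide catalogues.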

🌊 Out-of-Core Processing (FPMiner Streaming)

For billion-row datasets that don't fit in memory, use the FPMiner accumulator. It accepts chunks of (txn_id, item_id) pairs and sorts each chunk in place as it arrives. At mine() time, a k-way merge across all sorted chunks builds the CSR matrix on the fly, which avoids large memory spikes.

import numpy as np
from rusket import FPMiner

n_items = 5_000
miner = FPMiner(n_items=n_items)

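# Feed chunks incrementally (e.g. from Parquet/CSV/SQL).
# `dataset` stands for any iterable of DataFrame chunks with txn_id and item_id columns.
for chunk in dataset:
    txn_ids = chunk["txn_id"].to_numpy(dtype=np.int64)
    item_ids = chunk["item_id"].to_numpy(dtype=np.int32)

    # Each chunk is sorted on arrival: O(n log n) for a chunk of n rows
    miner.add_chunk(txn_ids, item_ids)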

# Stream k-way merge and mine in one pass!
# Returns a DataFrame with 'support' and 'itemsets' just like fpgrowth()
freq = miner.mine(min_support=0.001, max_len=3)

Memory efficiency: The peak memory overhead at mine() time is just $O(k)$ for the cursors (where $k$ is the number of chunks), plus the final compressed CSR allocation.
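
The sort-then-merge idea can be sketched with the standard library: each chunk is sorted independently, `heapq.merge` lazily interleaves the sorted chunks while holding one cursor per chunk, and CSR rows are emitted as each transaction streams past. This is an illustration of the technique under the assumption that transaction IDs are contiguous integers starting at 0, not FPMiner's Rust implementation.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Two chunks of (txn_id, item_id) pairs arriving in arbitrary order.
chunks = [
    [(2, 7), (0, 3), (0, 1)],
    [(1, 4), (0, 2), (2, 5)],
]

# Step 1: sort each chunk on arrival.
sorted_chunks = [sorted(chunk) for chunk in chunks]

# Step 2: lazily merge the sorted chunks (one cursor per chunk).
merged = heapq.merge(*sorted_chunks)

# Step 3: emit CSR rows as each transaction's items stream past.
indptr, indices = [0], []
for txn_id, pairs in groupby(merged, key=itemgetter(0)):
    indices.extend(item for _, item in pairs)
    indptr.append(len(indices))

print(indptr)
print(indices)
```

Only the merge cursors and the growing CSR arrays are held at once, which mirrors the memory profile described above.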

🌩️ Distributed Computing with Apache Spark

rusket ships a full Spark integration layer in rusket.spark. All algorithms run as Native Arrow UDFs via applyInArrow — Rust is called directly on each executor, with zero Python overhead per row.

How it works

PySpark DataFrame
  └─► groupby(group_col).applyInArrow(...)
        └─► Arrow Table (per partition / per group)
              └─► Polars zero-copy conversion
                    └─► rusket Rust extension (on the executor)
                          └─► results → PyArrow → PySpark DataFrame

Full Example — Retail Basket Analysis per Store

from pyspark.sql import SparkSession
from rusket.spark import mine_grouped, rules_grouped

spark = SparkSession.builder.appName("rusket-demo").getOrCreate()

# ── 1. Load your OHE transaction table (one row = one basket) ──────────────
#    Schema: store_id (string), bread (bool), butter (bool), milk (bool), ...
spark_df = spark.read.parquet("s3://data/baskets/")

# ── 2. Mine frequent itemsets per store in parallel ──────────────────────────
#    Each Spark task calls the Rust FP-Growth/Eclat engine on its Arrow batch.
freq_df = mine_grouped(
    spark_df,
    group_col="store_id",
    min_support=0.05,    # 5% support per store
)
# freq_df schema: store_id | support (double) | itemsets (array<string>)

# ── 3. Count transactions per store (needed for rule support) ────────────────
from pyspark.sql import functions as F
counts = (
    spark_df.groupby("store_id")
    .agg(F.count("*").alias("n"))
    .rdd.collectAsMap()          # {"store_1": 12000, "store_2": 8500, ...}
)

# ── 4. Generate association rules per store ──────────────────────────────────
rules_df = rules_grouped(
    freq_df,
    group_col="store_id",
    num_itemsets=counts,         # pass per-group counts as a dict
    metric="confidence",
    min_threshold=0.6,
)
# rules_df schema: store_id | antecedents | consequents | confidence | lift | ...

rules_df.orderBy("lift", ascending=False).show(10, truncate=False)

Sequential Patterns per Category

from rusket.spark import prefixspan_grouped

# event_log schema: category_id, user_id, item_id, event_ts
event_log = spark.read.parquet("s3://data/events/")

seq_df = prefixspan_grouped(
    event_log,
    group_col="category_id",   # mine independently per product category
    user_col="user_id",        # sequence identifier within the group
    time_col="event_ts",       # ordering column
    item_col="item_id",
    min_support=50,            # absolute count: pattern must appear in ≥50 user sequences
    max_len=4,
)
# seq_df schema: category_id | support (long) | sequence (array<string>)
seq_df.show(5, truncate=False)

High-Utility Patterns per Region

from rusket.spark import hupm_grouped

# profit_log schema: region_id, txn_id, item_id, profit
profit_log = spark.read.parquet("s3://data/profit/")

utility_df = hupm_grouped(
    profit_log,
    group_col="region_id",
    transaction_col="txn_id",
    item_col="item_id",
    utility_col="profit",
    min_utility=500.0,         # only itemsets with combined profit ≥ €500
)
# utility_df schema: region_id | utility (double) | itemset (array<long>)
utility_df.show(5, truncate=False)

Batch Recommendations across the Cluster

from rusket.spark import recommend_batches
from rusket import ALS

# 1. Train an ALS model locally (or load a pre-trained one)
als = ALS.from_transactions(
    events_pd,
    user_col="user_id",
    item_col="item_id",
).fit()  # ← always call .fit() after from_transactions()

# 2. Scale-out scoring: one recommendation row per user
user_df = spark.read.parquet("s3://data/users/").select("user_id")

recs_df = recommend_batches(user_df, model=als, user_col="user_id", k=10)
# recs_df schema: user_id (string) | recommended_items (array<int>)
recs_df.show(5, truncate=False)

Tip — Databricks / Delta Lake: All functions return a standard PySpark DataFrame, so you can write results back with .write.format("delta").save(...) or .saveAsTable(...) directly.

📖 API Reference

OOP Class API

Every algorithm in rusket exposes a class-based API in addition to the functional helpers. All classes share a unified interface inherited from BaseModel:

Class	Inherits from	Description
`FPGrowth`	`Miner`, `RuleMinerMixin`	Frequent Pattern Growth (parallel FP-tree mining)
`Eclat`	`Miner`, `RuleMinerMixin`	Vertical bitset mining
`FIN`	`Miner`, `RuleMinerMixin`	FP-tree Node-list intersection mining
`LCM`	`Miner`, `RuleMinerMixin`	Linear-time Closed itemset Mining
`HUPM`	`Miner`	High-Utility Pattern Mining (EFIM)
`PrefixSpan`	`Miner`	Sequential pattern mining
`ALS`	`ImplicitRecommender`	Alternating Least Squares CF
`BPR`	`ImplicitRecommender`	Bayesian Personalized Ranking CF
`SVD`	`ImplicitRecommender`	Funk SVD (biased SGD)
`LightGCN`	`ImplicitRecommender`	Graph Convolutional CF
`ItemKNN`	`ImplicitRecommender`	Item-based k-NN CF
`UserKNN`	`ImplicitRecommender`	User-based k-NN CF
`EASE`	`ImplicitRecommender`	Embarrassingly Shallow Autoencoders
`FM`	`BaseModel`	Factorization Machines (CTR prediction)
`FPMC`	`SequentialRecommender`	Factorizing Personalized Markov Chains
`SASRec`	`SequentialRecommender`	Self-Attentive Sequential Recommendation
`HybridEmbeddingIndex`	—	CF + semantic embedding fusion

All classes share the following data-ingestion class methods inherited from BaseModel:

# Load from long-format (transaction_id, item_id) DataFrame or list of lists
model = FPGrowth.from_transactions(df, transaction_col="order_id", item_col="item", min_support=0.3)

# Typed convenience aliases — same result
model = FPGrowth.from_pandas(df,  ...)
model = FPGrowth.from_polars(pl_df, ...)
model = FPGrowth.from_spark(spark_df, ...)

Miner subclasses that also inherit RuleMinerMixin (FPGrowth, Eclat, FIN, LCM) additionally expose a fluent pipeline:

model  = FPGrowth.from_transactions(df, min_support=0.3)
freq   = model.mine(use_colnames=True)             # pd.DataFrame [support, itemsets]
rules  = model.association_rules(metric="lift")    # pd.DataFrame [antecedents, consequents, ...]
recs   = model.recommend_items(["bread", "milk"])  # list of suggested items

ImplicitRecommender subclasses (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE) follow the scikit-learn fit()/predict() pattern. SequentialRecommender subclasses (FPMC, SASRec) use from_transactions(..., time_col=...).fit() for sequential next-item prediction:

# Option A — construct then fit with a sparse matrix
model = ALS(factors=64, iterations=15)
model.fit(user_item_csr)

# Option B — from event log, then explicit .fit()
# from_transactions() is a class method, so hyperparameters are passed as keyword arguments
model = ALS.from_transactions(
    df, user_col="user_id", item_col="item_id", factors=64
).fit()  # ← .fit() is always required

# Predict / recommend
items, scores = model.recommend_items(user_id=42, n=10, exclude_seen=True)
users, scores = model.recommend_users(item_id=99, n=5)

Breaking change vs older versions: from_transactions() no longer auto-fits. Always chain .fit() after it.

🧠 Advanced Pattern & Recommendation Algorithms

rusket provides more than just basic market basket analysis. It includes an entire suite of modern algorithms and a high-level Business Recommender API.

🎯 ItemKNN & UserKNN — Nearest-Neighbor Collaborative Filtering

Two complementary memory-based methods that consistently rank among the top performers in academic benchmarks (see Anelli et al. 2022).

ItemKNN — Finds items similar to what the user already liked. Fast, stable, and scales well with pre-computed item-item similarity.
UserKNN — Finds users similar to the target user and recommends what they liked. Often more serendipitous and performs particularly well on dense datasets.

Both support BM25, TF-IDF, Cosine, and raw Count weighting, with the top-K neighbor pruning running in parallel Rust.

from rusket import ItemKNN, UserKNN

# ── Item-based: "Customers who bought X also bought Y" ────────────
item_knn = ItemKNN.from_transactions(
    purchases, user_col="user_id", item_col="item_id",
    method="bm25", k=100,
).fit()
items, scores = item_knn.recommend_items(user_id=42, n=10)

# ── User-based: "Users similar to you enjoyed these items" ────────
user_knn = UserKNN.from_transactions(
    purchases, user_col="user_id", item_col="item_id",
    method="cosine", k=50,
).fit()
items, scores = user_knn.recommend_items(user_id=42, n=10)

Which one to choose? Start with ItemKNN(method="bm25") — it's the fastest and most stable. Switch to UserKNN if you have a dense dataset or want more diverse recommendations. In production, try both and evaluate with rusket.evaluate().
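
As a rough picture of what item-based k-NN computes, the sketch below derives cosine similarities between items from binary purchase sets and scores a user's unseen items by summing similarities to what they already own. It is a toy illustration with made-up data: it omits BM25/TF-IDF weighting and top-K pruning, and `item_cosine` and `recommend` are hypothetical helpers, not rusket functions.

```python
from math import sqrt

# user -> set of purchased items (binary implicit feedback, toy data).
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
    "u4": {"A"},
}

def item_cosine(purchases):
    """Cosine similarity between items, treating each item as a binary user vector."""
    buyers = {}
    for user, items in purchases.items():
        for item in items:
            buyers.setdefault(item, set()).add(user)
    sims = {}
    for i in buyers:
        for j in buyers:
            if i != j:
                overlap = len(buyers[i] & buyers[j])
                sims[(i, j)] = overlap / sqrt(len(buyers[i]) * len(buyers[j]))
    return sims

def recommend(user, purchases, sims, n=3):
    """Score unseen items by summed similarity to the user's items."""
    seen = purchases[user]
    scores = {}
    for (i, j), s in sims.items():
        if i in seen and j not in seen:
            scores[j] = scores.get(j, 0.0) + s
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]

sims = item_cosine(purchases)
recs = recommend("u4", purchases, sims)
print(recs)
```

User u4 only bought A, so B (bought together with A by two users) ranks ahead of C (one shared buyer). UserKNN applies the same idea with the roles of users and items swapped.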

🎯 ALS & BPR Collaborative Filtering

Both models learn user and item embeddings from implicit feedback (purchases, clicks, plays) and power personalised recommendations at scale. Use ALS for broad serendipitous discovery; use BPR when you care only about top-N ranking.

import pandas as pd
from rusket import ALS, BPR

# ── "For You" homepage — music streaming platform ────────────────────
# event log: user_id | track_id | plays (optional weight)
plays = pd.DataFrame({
    "user_id":  [101, 101, 102, 102, 103, 103, 103],
    "track_id": ["T01", "T03", "T01", "T05", "T02", "T03", "T05"],
    "plays":    [12, 5, 8, 3, 20, 1, 7],  # play count as confidence weight
})

# from_transactions() is a class method, so hyperparameters are passed as keyword arguments
als = ALS.from_transactions(
    plays, user_col="user_id", item_col="track_id", rating_col="plays",
    factors=64, iterations=15, alpha=40.0,
).fit()  # ← always call .fit() after from_transactions()

# Top-10 tracks for user 101, excluding already-played tracks
tracks, scores = als.recommend_items(user_id=101, n=10, exclude_seen=True)

# Which users are most likely to enjoy track T05? — useful for email campaigns
users, scores = als.recommend_users(item_id="T05", n=50)

# BPR — optimise ranking directly rather than reconstruction
bpr = BPR(factors=64, learning_rate=0.05, iterations=150).fit(user_item_csr)

🎯 Hybrid Recommender API

Combine Collaborative Filtering (ALS/BPR) with Frequent Pattern Mining to cover every placement surface — personalised homepage ("For You") and active cart ("Frequently Bought Together") — in a single engine.

from rusket import ALS, Recommender, FPGrowth

# 1. Train on purchase history (implicit feedback)
als = ALS(factors=64, iterations=15).fit(user_item_csr)

# 2. Mine co-purchase rules from basket data
miner = FPGrowth(basket_ohe, min_support=0.01)
freq  = miner.mine()
rules = miner.association_rules()

# 3. Create the Hybrid Engine
rec = Recommender(model=als, rules_df=rules)

# "For You" homepage — personalised for customer 1001
items, scores = rec.recommend_for_user(user_id=1001, n=5)

# Blend CF + product embeddings (e.g. from a PIM or sentence-transformer)
items, scores = rec.recommend_for_user(user_id=1001, n=5, alpha=0.7,
                                       target_item_for_semantic="HDPHONES")

# Active cart cross-sell — "Frequently Bought Together"
add_ons = rec.recommend_for_cart(["USB_DAC", "AUX_CABLE"], n=3)

# Overnight batch — score all customers, write to CRM
batch_df = rec.predict_next_chunk(user_history_df, user_col="customer_id", k=5)

🧬 Hybrid Embedding Fusion — CF + Semantic in One Vector Space

Collaborative filtering embeddings capture behavioral signals (who bought what); semantic text embeddings capture content meaning (product descriptions). Fusing them into a single vector space lets you do ANN retrieval, vector DB export, and clustering in one shot.

import rusket

# 1. Train ALS on implicit feedback
als = rusket.ALS(factors=64, iterations=15).fit(interactions)

# 2. Get semantic embeddings (e.g. from sentence-transformers)
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_vectors = encoder.encode(product_descriptions)  # (n_items, 384)

# 3. Fuse into a single hybrid vector space
hybrid = rusket.HybridEmbeddingIndex(
    cf_embeddings=als.item_factors,       # (n_items, 64)
    semantic_embeddings=text_vectors,      # (n_items, 384)
    strategy="weighted_concat",            # "concat" | "weighted_concat" | "projection"
    alpha=0.6,                             # 60% CF, 40% semantic
)

# 4. Similar items via cosine on the fused space
ids, scores = hybrid.query(item_id=42, n=10)

# 5. Build an ANN index for sub-millisecond retrieval
ann = hybrid.build_ann_index(backend="native")  # or "faiss"

# 6. Export to a vector DB for production serving
hybrid.export_vectors(qdrant_client, collection_name="hybrid_items")

# 7. Or export as separate named vectors for DB-side fusion
hybrid.export_vectors(qdrant_client, mode="multi", collection_name="hybrid_items")
# → Qdrant/Meilisearch/Weaviate store "cf" and "semantic" as separate named vectors

Three fusion strategies:

Strategy	Description	Use Case
`"concat"`	L2-normalise each space, concatenate	Equal importance, no tuning
`"weighted_concat"`	Scale by `α` / `1−α`, then concat	Default — tune `alpha` to balance CF vs semantic
`"projection"`	Concat + PCA to `projection_dim`	Compact vectors for large-scale deployment

Standalone function: If you just need the fused matrix without an index, use rusket.fuse_embeddings(cf, sem, strategy="weighted_concat", alpha=0.6).
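
A minimal sketch of the `"weighted_concat"` idea for a single item, read literally from the table above: scale the CF vector by `alpha` and the semantic vector by `1 - alpha`, then concatenate. It assumes each space is L2-normalised first, as for `"concat"`; whether rusket normalises or weights in exactly this way is an assumption, and the helper names are hypothetical.

```python
from math import sqrt

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def weighted_concat(cf_vec, sem_vec, alpha=0.6):
    """Normalise each space, weight CF by alpha and semantic by 1 - alpha, concatenate."""
    cf = [alpha * x for x in l2_normalize(cf_vec)]
    sem = [(1.0 - alpha) * x for x in l2_normalize(sem_vec)]
    return cf + sem

# A 2-dimensional CF vector and a 3-dimensional semantic vector for one item.
fused = weighted_concat([3.0, 4.0], [0.0, 2.0, 0.0], alpha=0.6)
print(len(fused), fused)
```

The fused vector has the combined dimensionality of both spaces (64 + 384 = 448 in the example above), which is what motivates the `"projection"` strategy when vectors need to be compact.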

🎯 Multi-Stage Recommendation Pipeline

For production systems requiring advanced retrieval and ranking, use the Pipeline class. This mirrors the "retrieve → rerank → filter" paradigm used by Twitter/X and modern ML stacks.

It chains multiple models together:

Retrieve: Candidate generation
Rerank: Re-score candidates using a heavier scoring function
Filter: Apply business rules (e.g. exclude out-of-stock items, diversify)

from rusket import ALS, BPR, Pipeline, RuleBasedRecommender
import pandas as pd

# 1. Train multiple base models
als = ALS(factors=64).fit(interactions)
bpr = BPR(factors=128).fit(interactions)

# 2. Define explicit business rules (e.g. promoting warranties with laptops)
rules_df = pd.DataFrame({
    "antecedent": ["102"],   # Laptop SKU
    "consequent": ["999"],   # Warranty SKU
    "score": [2.0]
})
rules = RuleBasedRecommender.from_transactions(
    interactions, rules=rules_df, user_col="user", item_col="item"
).fit()

# 3. Compose the Pipeline (retrieve from ALS and BPR, then rerank with the BPR vectors)
# Items from the `rules` model receive an artificial +1,000,000 score,
# ensuring they rank at the top *after* the algorithmic reranking.
pipeline = Pipeline(
    retrieve=[als, bpr],
    merge_strategy="max",  # how to combine candidate scores
    rerank=bpr,
    rules=rules, 
)

# Recommend for a user
items, scores = pipeline.recommend(user_id=42, n=10, exclude_seen=True)

# Blazing-fast Batch Scoring utilizing Rust inner loops
batch_recs = pipeline.recommend_batch(
    user_ids=[1, 2, 3],
    n=10,
    format="polars"  # Returns a native Polars DataFrame instantly
)
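
The candidate-merging and rule-boost steps can be pictured with plain dictionaries. The sketch below assumes that `merge_strategy="max"` keeps the highest score any retriever assigned to an item and that rule-driven items get a large additive boost, as the comments above describe. The rerank step is left out, and `merge_candidates` and `apply_rule_boost` are hypothetical helpers, not rusket functions.

```python
def merge_candidates(candidate_lists):
    """Combine {item: score} dicts from several retrievers, keeping the max score."""
    merged = {}
    for candidates in candidate_lists:
        for item, score in candidates.items():
            merged[item] = max(score, merged.get(item, float("-inf")))
    return merged

def apply_rule_boost(scores, rule_items, boost=1_000_000.0):
    """Add a large constant to rule-driven items so they sort first."""
    return {item: s + (boost if item in rule_items else 0.0) for item, s in scores.items()}

# Toy candidate scores from two retrievers.
als_candidates = {"102": 0.9, "205": 0.4}
bpr_candidates = {"205": 0.7, "999": 0.1}

merged = merge_candidates([als_candidates, bpr_candidates])
boosted = apply_rule_boost(merged, rule_items={"999"})
ranked = sorted(boosted.items(), key=lambda kv: -kv[1])
print([item for item, _ in ranked])
```

The warranty SKU "999" has the weakest model score but ends up first because of the boost, while "205" keeps the higher of its two retriever scores.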

💾 Saving, Loading and Serving (LanceDB / Vector DBs)

rusket models use a unified BaseModel that provides .save() and .load() functionality. You can also export trained models to a Vector Database for fast, real-time serving in production. We even provide load_model which automatically infers the model architecture from the pickle file.

import rusket

# 1. Train the model
model = rusket.ALS(factors=32).fit(interactions)

# 2. Save your trained model to disk
model.save("my_als_model.pkl")

# 3. Load it back using the generic loader
loaded_model = rusket.load_model("my_als_model.pkl")

# 4. Export the embeddings for a Vector Database
items_df = rusket.export_item_factors(
    loaded_model, 
    normalize=True,     # Best for Cosine Similarity search
    format="pandas"
)

# 5. Serve it in real-time (Example using LanceDB)
import lancedb

# Create a local vector database
db = lancedb.connect("./lancedb_store")
table = db.create_table("items", data=items_df)

# Query the table with a specific user's latent factors
user_emb = loaded_model.user_factors[0]

# Retrieve top 5 item recommendations for this user using L2-normalized vector search!
results = table.search(user_emb).limit(5).to_pandas()
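
Why normalising the item factors matters for this query: when every item vector has unit length, ranking by Euclidean (L2) distance to the user vector gives the same order as ranking by cosine similarity. The sketch below checks this on toy vectors with the standard library; it assumes the vector search uses L2 distance and does not call LanceDB.

```python
from math import sqrt

def normalize(vec):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def l2_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy item factors, L2-normalised as with normalize=True above.
items = {
    "A": normalize([1.0, 0.2]),
    "B": normalize([0.5, 0.5]),
    "C": normalize([-0.3, 1.0]),
}
user = [2.0, 1.0]  # raw (unnormalised) user factor

by_l2 = sorted(items, key=lambda k: l2_distance(user, items[k]))
by_cosine = sorted(items, key=lambda k: -cosine(user, items[k]))
print(by_l2, by_cosine)
```

Both orderings agree because, for unit-length items, the squared distance differs from the dot product only by terms that are constant for a given user.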

🔍 Analytics Helpers

from rusket import find_substitutes, customer_saturation

# Identify cannibalizing SKUs (lift < 1.0) for assortment rationalisation
subs = find_substitutes(rules_df, max_lift=0.8)
#  antecedents  consequents  lift
#  (Cola A,)    (Cola B,)    0.61   ← these products hurt each other's sales

# Segment customers by category penetration (decile 10 = buy everything; 1 = barely engaged)
saturation = customer_saturation(
    purchases_df, user_col="customer_id", category_col="category_id"
)

📈 BPR & Sequential Patterns

BPR (Bayesian Personalized Ranking): Directly optimises ranking of positive interactions over negative ones — ideal for newsfeeds, playlists, and app recommendation surfaces that prioritise top-N precision.
Sequential Pattern Mining (PrefixSpan): Discovers ordered patterns across time (e.g., "Subscriber signed up for broadband → mobile plan → premium bundle" or "Customer viewed Camera → 2 weeks later bought Lens").

rusket natively extracts PrefixSpan sequences from Pandas, Polars, and PySpark event logs with zero-copy Arrow mapping:

from rusket import PrefixSpan

# Telco product adoption journeys — what sequence of subscriptions do customers follow?
# subscription_events: customer_id | subscription_date | product_id
model = PrefixSpan.from_transactions(
    subscription_events,
    transaction_col="customer_id",
    item_col="product_id",
    time_col="subscription_date",
    min_support=50,    # at least 50 customers follow this path
    max_len=4,
)
freq_seqs = model.mine()
# e.g. [broadband] → [mobile] → [tv_bundle] appears in 312 journeys
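
To clarify what the support of a sequential pattern counts, the sketch below checks, for each customer journey, whether the pattern occurs in order (gaps allowed) and tallies the matches. It treats every step as a single item and uses made-up journeys. PrefixSpan avoids this kind of brute-force scan, so the code only illustrates the definition; the helper names are hypothetical.

```python
def contains_sequence(journey, pattern):
    """True if `pattern` occurs in `journey` in order (gaps allowed)."""
    it = iter(journey)
    # `step in it` advances the iterator, so later steps must appear later.
    return all(step in it for step in pattern)

def sequence_support(journeys, pattern):
    """Number of journeys that contain `pattern` as an ordered subsequence."""
    return sum(contains_sequence(j, pattern) for j in journeys.values())

# customer_id -> products ordered by subscription date (toy data).
journeys = {
    "c1": ["broadband", "mobile", "tv_bundle"],
    "c2": ["broadband", "tv_bundle"],
    "c3": ["mobile", "broadband", "tv_bundle"],
    "c4": ["broadband", "mobile"],
}

support_mobile = sequence_support(journeys, ["broadband", "mobile"])
support_tv = sequence_support(journeys, ["broadband", "tv_bundle"])
print(support_mobile, support_tv)
```

Customer c3 bought mobile before broadband, so that journey does not count towards broadband → mobile. Order matters, unlike in the itemset mining examples earlier.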

🕸️ Graph Analytics & Embeddings

Integrate natively with the modern GenAI/LLM stack:

Vector Export: Export user/item factors to a Pandas DataFrame ready for FAISS/Qdrant using model.export_item_factors().
Item-to-Item Similarity: Fast Cosine Similarity on embeddings using model.similar_items(item_id).
Graph Generation: Automatically convert association rules into a networkx directed Graph for community detection using rusket.viz.to_networkx(rules).

🔬 MLOps: MLflow Tracking & Hyperparameter Tuning

rusket has built-in support for MLflow experiment tracking, mlflow.pyfunc packaging, and Bayesian hyperparameter optimisation using Optuna's TPE sampler. For ALS/eALS models, each Optuna trial runs the Rust-native cross-validation backend — making the entire search blazingly fast.

import rusket
import rusket.mlflow
from rusket import OptunaSearchSpace

# ── 1. Enable MLflow Autologging ─────────────────────────────────────
rusket.mlflow.autolog()

# ── 2. Train a single model with automatic tracking ──────────────────
# Hyperparameters (factors, iterations) and training_duration_seconds are logged!
import mlflow
with mlflow.start_run():
    model = rusket.ALS(factors=64, iterations=15).fit(df)

# Save/Load models as native MLflow pyfunc artifacts for easy deployment
rusket.mlflow.save_model(model, "my_als_model")
loaded_model = mlflow.pyfunc.load_model("my_als_model")  # Has a .predict(df) method

# ── 3. Quick hyperparameter search with sensible defaults ───────────
result = rusket.optuna_optimize(
    rusket.ALS,
    df,
    user_col="user_id",
    item_col="item_id",
    n_trials=50,
    metric="ndcg",
    k=10,
)
print(f"Best ndcg@10: {result.best_score:.4f}")
print(f"Best params:  {result.best_params}")

# ── Custom search space + refit best model ───────────────────────────
result = rusket.optuna_optimize(
    rusket.eALS,
    df,
    user_col="user_id",
    item_col="item_id",
    search_space=[
        OptunaSearchSpace.int("factors", 16, 256, log=True),
        OptunaSearchSpace.float("alpha", 1.0, 100.0, log=True),
        OptunaSearchSpace.float("regularization", 1e-4, 1.0, log=True),
        OptunaSearchSpace.int("iterations", 5, 30),
    ],
    n_trials=100,
    n_folds=3,
    metric="precision",
    refit_best=True,  # best model is already fitted
)
items, scores = result.best_model.recommend_items(user_id=42, n=10)

# ── MLflow experiment tracking ───────────────────────────────────────
# pip install mlflow optuna-integration
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("als-tuning")

result = rusket.optuna_optimize(
    rusket.ALS, df,
    user_col="user_id", item_col="item_id",
    n_trials=50, metric="ndcg",
    mlflow_tracking=True,   # ← every trial logged to MLflow
)

# ── Custom callbacks ─────────────────────────────────────────────────
result = rusket.optuna_optimize(
    rusket.ALS, df,
    user_col="user_id", item_col="item_id",
    n_trials=50,
    callbacks=[my_custom_callback],  # any Optuna-compatible callback
)
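
For reference, the `"ndcg"` metric optimised above rewards placing held-out items near the top of the list. The sketch below computes NDCG@k for one user with binary relevance using the textbook formula; whether rusket's evaluation backend uses exactly this variant is an assumption, and `ndcg_at_k` is a hypothetical helper.

```python
from math import log2

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@k for one user."""
    # Discounted gain of the hits actually present in the top k.
    dcg = sum(
        1.0 / log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Best achievable gain: all relevant items packed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

held_out = {"T03", "T05"}
perfect = ndcg_at_k(["T03", "T05", "T01"], held_out, k=3)
partial = ndcg_at_k(["T01", "T03", "T02"], held_out, k=3)
print(perfect, partial)
```

A list that puts both held-out tracks first scores 1.0, while finding only one of them at position two scores well below that.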

🚀 GPU Acceleration (CUDA)

rusket supports optional GPU acceleration via CuPy or PyTorch CUDA for models that benefit from large matrix operations. Enable it globally with a single call — no need to pass use_gpu=True to every model.

import rusket

# Enable GPU globally — every model created after this uses CUDA
rusket.enable_gpu()

# All models now default to GPU
als = rusket.ALS(factors=128, iterations=20).fit(interactions)
ease = rusket.EASE(regularization=500).fit(interactions)
bpr = rusket.BPR(factors=64).fit(interactions)

# Per-model override: force a specific model to CPU
small_model = rusket.SVD(factors=16, use_gpu=False)

# Turn it off globally
rusket.disable_gpu()

# Check the current state
rusket.is_gpu_enabled()  # → False

Supported Models

All recommender models listed below respect the global GPU flag:

Model	GPU-accelerated operations
ALS / eALS	Gramian, Cholesky solve, batch scoring
BPR	SGD updates, batch recommend
SVD	Factor updates, batch scoring
EASE	Gram matrix inversion
ItemKNN / UserKNN	Similarity scoring
LightGCN	Graph convolution, scoring
FM	Prediction
FPMC	Factor updates
SASRec / BERT4Rec	Attention forward pass
NMF	Multiplicative updates

Installation

# CuPy (recommended — fastest)
pip install cupy-cuda12x

# Or PyTorch
pip install torch

No GPU? No problem. rusket can detect whether a GPU backend is available. If neither CuPy nor PyTorch with CUDA is installed, enable_gpu() still succeeds, but models raise an ImportError at fit time. Call rusket.check_gpu_available() beforehand and leave the GPU flag off when it reports no backend.
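The check-before-enable pattern described above can be wrapped in a small helper. This is a sketch under one assumption: that `check_gpu_available()` returns a truthy value when a CUDA backend is usable (the README does not state its return type). The helper name `enable_gpu_if_available` is hypothetical. It only calls functions named in this section.

```python
def enable_gpu_if_available(backend):
    """Turn the global GPU flag on only when a CUDA backend is usable.

    `backend` is the imported `rusket` module. Assumption:
    `check_gpu_available()` returns a truthy value when CuPy or
    PyTorch CUDA can be used.
    """
    if backend.check_gpu_available():
        backend.enable_gpu()
    else:
        backend.disable_gpu()  # stay on the CPU path explicitly
    return backend.is_gpu_enabled()


# Hypothetical usage:
#   import rusket
#   on_gpu = enable_gpu_if_available(rusket)
```

Passing the module in as an argument keeps the helper importable on machines where rusket or a GPU backend is missing.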

⚡ Benchmarks

Benchmark environment: Apple Silicon MacBook Air (M-series, arm64, 8 GB RAM). Unless stated otherwise, timings are single-run wall-clock measurements.

Scale Benchmarks (1M → 200M rows)

What's measured: from_transactions() converts long-format (txn_id, item_id) rows into a sparse one-hot encoded (OHE) matrix. fpgrowth() then mines that matrix. The Rust mining cost is the same however the matrix is built; at large scale, the only difference is whether you pay the conversion cost upfront.

Scale	`from_transactions` (conversion)	`fpgrowth` (mining)	Total
1M rows	4.9s	0.1s	5.0s
10M rows	23.2s	1.2s	24.4s
50M rows	59.1s	4.0s	63.1s
100M rows (20M txns × 200k items)	124.1s	10.1s	134.2s
200M rows (40M txns × 200k items)	229.2s	17.6s	246.8s

The mining step is fast — the bottleneck at scale is the long-format → sparse-matrix conversion. If your pipeline already produces a CSR/sparse matrix (e.g., from a Parquet/warehouse export), you skip the conversion entirely and only pay the mining cost.
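To make the bottleneck claim concrete, the short script below computes the share of total time spent in conversion for each row of the table. The timings are copied from the table above; the names `SCALE_RUNS` and `conversion_share` are illustrative only.

```python
# (scale, from_transactions seconds, fpgrowth seconds), copied from the table above
SCALE_RUNS = [
    ("1M", 4.9, 0.1),
    ("10M", 23.2, 1.2),
    ("50M", 59.1, 4.0),
    ("100M", 124.1, 10.1),
    ("200M", 229.2, 17.6),
]


def conversion_share(convert_s, mine_s):
    """Fraction of the total wall-clock time spent converting to the sparse matrix."""
    return convert_s / (convert_s + mine_s)


for scale, convert_s, mine_s in SCALE_RUNS:
    share = conversion_share(convert_s, mine_s)
    print(f"{scale:>5} rows: conversion is {share:.1%} of the total")
```

With these figures, conversion accounts for more than 90% of the total at every scale.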

Power-user path: Direct CSR → Rust

import numpy as np
from scipy import sparse as sp
from rusket import FPGrowth

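# Build CSR directly from integer IDs (no pandas!).
# txn_ids and item_ids are assumed to be equal-length arrays of 0-based integer
# IDs, one entry per (transaction, item) row; item_names is assumed to hold one
# label per column. Deduplicate pairs first: csr_matrix sums duplicate entries.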
csr = sp.csr_matrix(
    (np.ones(len(txn_ids), dtype=np.int8), (txn_ids, item_ids)),
    shape=(n_transactions, n_items),
)
freq = FPGrowth(csr, item_names=item_names).mine(
    min_support=0.001, max_len=3, use_colnames=True
)

At 100M rows, the mining step itself takes 10.1 seconds. Building the CSR directly skips the from_transactions conversion cost (~124s) but does not change the mining time.
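The CSR snippet above assumes that `txn_ids`, `item_ids`, `n_transactions`, `n_items`, and `item_names` already exist. As a hedged illustration of where they might come from, the pure-Python sketch below maps raw labels to dense 0-based IDs. The function name `encode_long_format` is hypothetical. At the scales in the table you would want a vectorized equivalent; this version only shows the logic.

```python
def encode_long_format(rows):
    """Map (txn_label, item_label) pairs to dense 0-based integer IDs.

    Returns (txn_ids, item_ids, n_transactions, n_items, item_names).
    Duplicate (transaction, item) pairs are dropped so each cell is set once.
    """
    txn_index, item_index = {}, {}
    seen = set()
    txn_ids, item_ids = [], []
    for txn_label, item_label in rows:
        # setdefault assigns the next free ID the first time a label appears
        t = txn_index.setdefault(txn_label, len(txn_index))
        i = item_index.setdefault(item_label, len(item_index))
        if (t, i) not in seen:
            seen.add((t, i))
            txn_ids.append(t)
            item_ids.append(i)
    # dicts preserve insertion order, so list position equals the item ID
    item_names = list(item_index)
    return txn_ids, item_ids, len(txn_index), len(item_index), item_names


rows = [("t1", "milk"), ("t1", "bread"), ("t2", "milk"), ("t2", "milk")]
txn_ids, item_ids, n_transactions, n_items, item_names = encode_long_format(rows)
```

Dropping repeated pairs here matters because `scipy.sparse.csr_matrix` sums duplicate coordinates, which would otherwise produce cell values above 1.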

Real-World Datasets

Dataset	Transactions	Items	`rusket`
andi_data.txt	8,416	119	9.7 s (22.8M itemsets)
andi_data2.txt	540,455	2,603	7.9 s

Run benchmarks yourself:

uv run pytest benchmarks/bench_scale.py -v -s   # Scale benchmark
uv run python benchmarks/bench_realworld.py     # Real-world datasets
uv run pytest tests/test_benchmark.py -v -s      # pytest-benchmark

Recommender Benchmarks vs LibRecommender

Measured with pytest-benchmark (5 rounds, warmed up, GC disabled). MovieLens 100k dataset (943 users, 1,682 items, 100k ratings). Only model.fit() is timed — no startup or data loading overhead.

Benchmark	rusket	LibRecommender	Speedup
ALS (Cholesky) (64 factors, 15 epochs)	427 ms	1,324 ms	3.1×
ALS (eALS) (64 factors, 15 epochs)	360 ms	N/A	—
BPR (64 factors, 10 epochs)	33 ms	681 ms	20.4×
ItemKNN (k=100)	55 ms	287 ms	5.2×
SVD (64 factors, 20 epochs)	55 ms	❌ TF-only (broken)	—
EASE	71 ms	N/A	—

Note: LibRecommender requires TensorFlow + PyTorch + gensim + Cython (~2 GB of dependencies). rusket has zero runtime dependencies.

uv run pytest benchmarks/bench_pytest_librecommender.py -v --benchmark-columns=mean,stddev,rounds
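For readers who want to reproduce the "warmed up, GC disabled, fit-only" protocol without pytest-benchmark, here is a rough standard-library equivalent. It is a sketch of the measurement idea, not the project's benchmark harness, and the name `time_fit` is hypothetical.

```python
import gc
import time


def time_fit(fit, rounds=5, warmup=1):
    """Return the mean wall-clock seconds of `fit()` over `rounds` timed calls."""
    for _ in range(warmup):
        fit()  # warm-up calls are executed but not timed
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep garbage-collection pauses out of the measurement
    try:
        timings = []
        for _ in range(rounds):
            start = time.perf_counter()
            fit()
            timings.append(time.perf_counter() - start)
    finally:
        if gc_was_enabled:
            gc.enable()  # restore the caller's GC setting
    return sum(timings) / len(timings)


# Hypothetical usage, with data loading done outside the timed callable:
#   mean_s = time_fit(lambda: rusket.ALS(factors=64, iterations=15).fit(interactions))
```

Only the callable is timed, so model construction arguments and dataset loading should be prepared before calling `time_fit`.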

🏗 Architecture

Data Flow

pandas dense         ──► np.uint8 array (C-contiguous)  ──► Rust fpgrowth_from_dense
pandas Arrow backend ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
pandas sparse        ──► CSR int32 arrays               ──► Rust fpgrowth_from_csr
polars               ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
numpy ndarray        ──► np.uint8 (C-contiguous)        ──► Rust fpgrowth_from_dense

All mining and rule generation happens inside Rust. No Python loops, no round-trips.

The 1 Billion Row Architecture

To pass the "1 Billion Row" threshold without OOM crashes, rusket employs a zero-allocation mining loop:

Eclat Scratch Buffers: intersect_count_into writes intersections directly into thread-local, pre-allocated byte buffers and computes the popcount in the same pass. It exits the loop as soon as it can prove that a combination cannot reach min_support.
FPGrowth Parallel Tree Build: Conditional FP-trees are collected concurrently inside the Rayon parallel mining step, replacing the standard sequential loop and removing memory-contention bottlenecks.
AHashMap Deduplication: Hash-based duplicate-basket counting in O(N) expected time replaces the standard O(N log N) unstable sort in the core pipeline.
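Two of the ideas above, early-exit intersection counting and hash-based basket deduplication, can be illustrated in a few lines of Python. This is a conceptual sketch only: it does not reproduce the Rust memory layout, thread-local buffers, or SIMD kernels, and the function names `intersect_count` and `count_duplicate_baskets` are hypothetical.

```python
from collections import Counter

WORD_BITS = 64  # each list element stands in for one 64-bit machine word


def intersect_count(words_a, words_b, min_count):
    """Count the shared set bits of two bitsets stored as lists of words.

    Returns the count, or None as soon as the words still to be scanned
    cannot lift the count to `min_count` (the early exit).
    """
    count = 0
    n_words = len(words_a)
    for i in range(n_words):
        count += bin(words_a[i] & words_b[i]).count("1")  # popcount of the AND
        best_case = count + (n_words - i - 1) * WORD_BITS
        if best_case < min_count:
            return None  # support threshold is provably out of reach
    return count if count >= min_count else None


def count_duplicate_baskets(baskets):
    """Collapse identical baskets into (basket, multiplicity) counts via hashing."""
    return Counter(frozenset(basket) for basket in baskets)


support = intersect_count([0b1111, 0b1], [0b0101, 0b1], min_count=3)
counts = count_duplicate_baskets([["a", "b"], ["b", "a"], ["c"]])
```

The early-exit bound here is deliberately loose (it assumes every remaining bit could match); a tighter bound would prune sooner at the cost of extra bookkeeping.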

🧑‍💻 Development

Prerequisites

Rust 1.83+ (rustup update)
Python 3.10+
uv (recommended package manager)

Getting Started

# Clone
git clone https://github.com/bmsuisse/rusket.git
cd rusket

# Build Rust extension in dev mode
uv run maturin develop --release

# Run the full test suite
uv run pytest tests/ -x -q

# Type-check the Python layer
uv run pyright rusket/

# Cargo check (Rust)
cargo check

Run Examples

# Getting started
uv run python examples/01_getting_started.py

# Market basket analysis with Faker
uv run python examples/02_market_basket_faker.py

# Polars input
uv run python examples/03_polars_input.py

# Sparse input
uv run python examples/04_sparse_input.py

# Large-scale mining (100k+ rows)
uv run python examples/05_large_scale.py

🤖 AI Disclosure

A large part of this library — including the Rust core algorithms, the Python wrappers, the OOP class hierarchy, and the Spark integration layer — was written with substantial assistance from AI pair-programming tools (specifically Google Gemini / Antigravity). Human review, benchmarking, and architectural decisions were applied throughout.

We believe in transparency about AI-assisted development. The algorithms are correct, the tests pass, and the performance numbers are real — but if you find a bug or a piece of "AI slop", please open an issue!

📜 License

MIT License
