Ultra-fast Recommender Engines & Market Basket Analysis for Python, written in Rust.
Made with ❤️ by the Data & AI Team.
| Goal | Details |
|---|---|
| ⚡ Blazing fast | All algorithms run in compiled Rust (via PyO3) with multi-threaded Rayon parallelism and SIMD-accelerated kernels. ALS is 11× and FP-Growth 140× faster than PySpark. |
| 📦 Zero dependencies | No TensorFlow, no PyTorch, no JVM. A single ~3 MB wheel is all you need: pip install rusket and go. |
| 🧑‍💻 Easy to use | Common cases are one-liners: `model.recommend_items(user_id)`, `model.recommend_users(item_id)`, `model.export_item_factors()` for vector/embedding export. No boilerplate. |
| 🏗️ Modern data stack | Native Pandas, Polars, and Apache Spark support with zero-copy Arrow transfers. Works seamlessly with Delta Lake, Databricks, Snowflake, and any dbt/Parquet pipeline. |
⚠️ Note: `rusket` is currently under heavy construction. The API will probably change in upcoming versions.
rusket is a modern, Rust-powered library for Market Basket Analysis and Recommender Engines. It delivers significant speed-ups and lower memory usage compared to traditional Python implementations, while natively supporting Pandas, Polars, and Spark out of the box.
Zero runtime dependencies. No TensorFlow, no PyTorch, no JVM: just pip install rusket and go. The entire engine is compiled Rust, distributed as a single ~3 MB wheel.
It features Collaborative Filtering (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE), Sequential Recommendation (FPMC, SASRec), Context-aware Prediction (FM), Pattern Mining (FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan), and built-in Hyperparameter Tuning (Optuna + MLflow tracking) with high performance and low memory footprints. Both functional and OOP APIs are available for seamless integration.
|  | rusket | LibRecommender | implicit | pyspark.ml |
|---|---|---|---|---|
| Core language | Rust (PyO3) | TF + PyTorch + Cython | Cython / C++ | Scala / Java (JVM) |
| Runtime deps | 0 | TF + PyTorch + gensim (~2 GB) | OpenBLAS / MKL | JVM + Spark |
| Install size | ~3 MB | ~2 GB | ~50 MB | ~300 MB |
| Algorithms | ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE, FM, FPMC, SASRec, FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan | ALS, BPR, SVD, LightGCN, ItemCF, FM, DeepFM, ... | ALS, BPR | ALS, FP-Growth, PrefixSpan |
| Recommender API | ✅ Hybrid Engine + i2i Similarity | ❌ | ❌ | ✅ (ALS only) |
| Graph & Embeddings | ✅ NetworkX Export, Vector DB Export | ❌ | ❌ | ❌ |
| OOP class API | ✅ `ALS.from_transactions(df).fit()` | ❌ | ❌ | ❌ |
| Pandas / Polars / Spark | ✅ / ✅ / ✅ | ✅ / ❌ / ❌ | ✅ / ❌ / ❌ | ❌ / ❌ / ✅ |
| Parallel execution | ✅ Rayon work-stealing | ✅ TF/PyTorch threads | ✅ OpenMP | ✅ Spark Cluster |
| Memory | Low (native Rust buffers) | High (TF/PyTorch graphs) | Low (C++ arrays) | High (JVM overhead) |
pip install rusket
# or with uv:
uv add rusket

Optional extras:
# Polars support
pip install "rusket[polars]"
# Pandas/NumPy support (usually already installed)
pip install "rusket[pandas]"

Identify which products co-occur most in customer baskets: the foundation of cross-sell widgets, promotional bundles, and shelf placement decisions.
import pandas as pd
from rusket import FPGrowth
# One week of supermarket checkout data (1 row = 1 receipt, 1 col = 1 SKU)
receipts = pd.DataFrame({
"milk": [1, 1, 0, 1, 1, 0, 1],
"bread": [1, 0, 1, 1, 0, 1, 1],
"butter": [1, 0, 1, 0, 0, 1, 0],
"eggs": [0, 1, 1, 0, 1, 0, 1],
"coffee": [0, 1, 0, 0, 1, 1, 0],
"orange_juice": [1, 0, 0, 1, 0, 0, 1],
}, dtype=bool)
# Step 1: which SKU combinations appear in ≥40% of receipts?
model = FPGrowth(receipts, min_support=0.4)
freq = model.mine(use_colnames=True)
# Step 2: keep rules with ≥60% confidence
rules = model.association_rules(metric="confidence", min_threshold=0.6)
# Lift > 1 means customers buy these together more than chance alone
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
.sort_values("lift", ascending=False))

Real-world data arrives as (order_id, sku) rows from a database, not one-hot matrices.
All mining algorithms expose a class-based API that goes straight from order lines to recommendations:
import pandas as pd
from rusket import FPGrowth
# Order line export from your e-commerce backend
orders = pd.DataFrame({
"order_id": [1001, 1001, 1001, 1002, 1002, 1003, 1003],
"sku": ["HDPHONES", "USB_DAC", "AUX_CABLE",
"HDPHONES", "CARRY_CASE",
"USB_DAC", "AUX_CABLE"],
})
model = FPGrowth.from_transactions(
orders,
transaction_col="order_id",
item_col="sku",
min_support=0.3,
)
freq = model.mine(use_colnames=True) # Miner classes: mine() never auto-fits
rules = model.association_rules(metric="confidence", min_threshold=0.6)
# Which accessories should be suggested when headphones are in the cart?
suggestions = model.recommend_items(["HDPHONES"], n=3)
# → e.g. ["USB_DAC", "AUX_CABLE", "CARRY_CASE"]

Or use the explicit type variants:
from rusket import FPGrowth
ohe = FPGrowth.from_pandas(orders, transaction_col="order_id", item_col="sku")
ohe = FPGrowth.from_polars(pl_orders, transaction_col="order_id", item_col="sku")
ohe = FPGrowth.from_transactions([["HDPHONES", "USB_DAC"], ["HDPHONES", "CARRY_CASE"]])  # list of lists

Spark is also supported:
`FPGrowth.from_spark(spark_df)` calls `.toPandas()` internally.
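For reference, the rule metrics used throughout (support, confidence, lift) can be computed by hand. A minimal pure-Python sketch of the standard definitions, not rusket's implementation:

```python
# support(A→B) = P(A and B); confidence = P(B|A); lift = confidence / P(B)
baskets = [
    {"milk", "bread"}, {"milk", "eggs"}, {"bread", "butter"},
    {"milk", "bread"}, {"milk", "eggs"}, {"bread", "butter"},
    {"milk", "bread"},
]

def rule_metrics(antecedent, consequent):
    n = len(baskets)
    both = sum((antecedent | consequent) <= b for b in baskets)
    ante = sum(antecedent <= b for b in baskets)
    cons = sum(consequent <= b for b in baskets)
    support = both / n                    # how common is the combination?
    confidence = both / ante              # given A, how often does B follow?
    lift = confidence / (cons / n)        # vs. B's baseline popularity
    return support, confidence, lift

support, confidence, lift = rule_metrics({"milk"}, {"bread"})
```

A lift above 1 means the pair co-occurs more often than chance; below 1 (as here) suggests the items compete.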
For teams running a modern data stack with Parquet files on S3/GCS/Azure Blob, rusket natively accepts Polars DataFrames. Data is transferred via Arrow zero-copy buffers, with no conversion overhead.
The fastest path from a data lake to "Frequently Bought Together" rules:
import polars as pl
from rusket import FPGrowth
# ── 1. Read a one-hot basket matrix directly from S3/GCS/local Parquet ──
# Columns = SKUs (bool), rows = receipts, produced by your dbt or Spark pipeline
baskets = pl.read_parquet("s3://data-lake/gold/basket_ohe.parquet")
print(f"Loaded {baskets.shape[0]:,} receipts × {baskets.shape[1]} SKUs")
# ── 2. Instantiate FPGrowth (zero-copy from Polars) ──
model = FPGrowth(baskets, min_support=0.02, max_len=3)
# ── 3. Mine frequent combinations ──
freq = model.mine(use_colnames=True)
print(f"Found {len(freq):,} frequent itemsets")
print(freq.sort_values("support", ascending=False).head(10))
# ── 4. Generate cross-sell rules ──
rules = model.association_rules(metric="lift", min_threshold=1.2)
print(f"Rules with lift > 1.2: {len(rules):,}")
print(
    rules[["antecedents", "consequents", "confidence", "lift"]]
    .sort_values("lift", ascending=False)
    .head(8)
)

How it works under the hood:

Polars → Arrow buffer → `np.uint8` (zero-copy) → Rust `fpgrowth_from_dense`
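The zero-copy step relies on bool and uint8 sharing the same one-byte layout; a small numpy illustration of the reinterpretation (independent of rusket):

```python
import numpy as np

# A tiny one-hot basket matrix: rows = receipts, columns = SKUs
baskets = np.array([[1, 1, 0],
                    [0, 1, 1]], dtype=bool)

# Reinterpret the bool buffer as uint8: same bytes, no copy made
view = baskets.view(np.uint8)

assert view.dtype == np.uint8
assert np.shares_memory(baskets, view)  # proof that no data was copied
```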
Frequent items aren't always the most profitable. HUPM finds product combinations that generate the highest total gross margin, even if they appear rarely. rusket implements the state-of-the-art EFIM algorithm in Rust.
import pandas as pd
from rusket import HUPM
# Specialty foods retailer: receipt line items with gross margin per unit sold
orders = pd.DataFrame({
"receipt_id": [1, 1, 1, 2, 2, 3, 3],
"product": ["aged_cheese", "wine_flight", "charcuterie",
"aged_cheese", "charcuterie",
"wine_flight", "charcuterie"],
"margin": [8.50, 12.00, 6.50,   # receipt 1: margin per item
8.50, 6.50, # receipt 2
12.00, 6.50], # receipt 3
})
# Find all product bundles generating ≥ €20 total margin across all receipts
high_margin = HUPM.from_transactions(
orders,
transaction_col="receipt_id",
item_col="product",
utility_col="margin",
min_utility=20.0,
).mine()
print(high_margin.head())
# e.g. aged_cheese + wine_flight + charcuterie → total margin 27.0 (only receipt 1 contains all three)

For very sparse datasets (e.g. e-commerce with thousands of SKUs), use Pandas SparseDtype to minimize memory. rusket passes the raw CSR arrays straight to Rust: no densification ever happens.
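The utility of an itemset is the sum, over every receipt containing all of its items, of those items' margins. A brute-force pure-Python sketch over the same toy data, only to pin down the definition (EFIM prunes this exponential search; this is not rusket's code):

```python
from itertools import combinations

# receipt_id -> {product: margin}, same toy data as above
receipts = {
    1: {"aged_cheese": 8.50, "wine_flight": 12.00, "charcuterie": 6.50},
    2: {"aged_cheese": 8.50, "charcuterie": 6.50},
    3: {"wine_flight": 12.00, "charcuterie": 6.50},
}

def itemset_utility(itemset):
    # Sum the itemset's margins over every receipt containing all of its items
    return sum(
        sum(margins[i] for i in itemset)
        for margins in receipts.values()
        if set(itemset) <= margins.keys()
    )

# Enumerate every bundle clearing the €20 threshold
items = sorted({i for m in receipts.values() for i in m})
high_margin = {
    combo: itemset_utility(combo)
    for r in range(1, len(items) + 1)
    for combo in combinations(items, r)
    if itemset_utility(combo) >= 20.0
}
```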
import pandas as pd
import numpy as np
from rusket import FPGrowth
rng = np.random.default_rng(7)
n_rows, n_cols = 30_000, 500
# Very sparse: average basket size ≈ 3 items out of 500
p_buy = 3 / n_cols
matrix = rng.random((n_rows, n_cols)) < p_buy
products = [f"sku_{i:04d}" for i in range(n_cols)]
df_dense = pd.DataFrame(matrix.astype(bool), columns=products)
df_sparse = df_dense.astype(pd.SparseDtype("bool", fill_value=False))
dense_mb = df_dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = df_sparse.memory_usage(deep=True).sum() / 1e6
print(f"Dense memory: {dense_mb:.1f} MB")
print(f"Sparse memory: {sparse_mb:.1f} MB ({dense_mb / sparse_mb:.1f}× smaller)")
# Same API, same results, just faster and lighter
freq = FPGrowth(df_sparse, min_support=0.01).mine(use_colnames=True)
print(f"Frequent itemsets: {len(freq):,}")

How it works under the hood:

Sparse DataFrame → COO → CSR → `(indptr, indices)` → Rust `fpgrowth_from_csr`
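The CSR pair `(indptr, indices)` handed to Rust is easy to build by hand; a numpy sketch of the layout (illustrative, not rusket's code):

```python
import numpy as np

# Dense toy basket matrix: 3 receipts x 3 SKUs
dense = np.array([[1, 0, 1],
                  [0, 0, 0],
                  [1, 1, 0]], dtype=bool)

# CSR: `indices` lists the column of every nonzero, row by row;
# `indptr[r]:indptr[r+1]` slices out row r's entries
indices = np.concatenate([np.flatnonzero(row) for row in dense]).astype(np.int32)
indptr = np.concatenate([[0], np.cumsum(dense.sum(axis=1))]).astype(np.int32)

assert indptr.tolist() == [0, 2, 2, 4]   # row 1 is empty: equal consecutive pointers
assert indices.tolist() == [0, 2, 0, 1]
```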
For datasets scaling to billion-row sizes that don't fit in memory, use the FPMiner accumulator. It accepts chunks of (txn_id, item_id) pairs, sorts each chunk in place on arrival, and uses a memory-safe k-way merge across all chunks to build the CSR matrix on the fly, avoiding massive memory spikes.
import numpy as np
from rusket import FPMiner
n_items = 5_000
miner = FPMiner(n_items=n_items)
# Feed chunks incrementally (e.g. from Parquet/CSV/SQL)
for chunk in dataset:
txn_ids = chunk["txn_id"].to_numpy(dtype=np.int64)
item_ids = chunk["item_id"].to_numpy(dtype=np.int32)
# Fast O(k log k) per-chunk sort
miner.add_chunk(txn_ids, item_ids)
# Stream k-way merge and mine in one pass!
# Returns a DataFrame with 'support' and 'itemsets' just like fpgrowth()
freq = miner.mine(min_support=0.001, max_len=3)

Memory efficiency: because chunks are merged as a sorted stream rather than concatenated, the peak memory overhead at `mine()` time stays low.
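The k-way merge at the heart of FPMiner can be sketched with the stdlib: each chunk is sorted on arrival, then merged lazily into one globally ordered stream (conceptual only; rusket does this in Rust):

```python
import heapq

# Two chunks of (txn_id, item_id) pairs, each sorted when added
chunks = [
    sorted([(2, 7), (1, 3), (1, 5)]),
    sorted([(3, 2), (1, 9), (2, 4)]),
]

# heapq.merge keeps only one "head" element per chunk in memory,
# yielding a globally sorted stream without concatenating the chunks
merged = list(heapq.merge(*chunks))
```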
rusket ships a full Spark integration layer in rusket.spark. All algorithms run as Native Arrow UDFs via `applyInArrow`: Rust is called directly on each executor, with zero Python overhead per row.
PySpark DataFrame
  └─► groupby(group_col).applyInArrow(...)
      └─► Arrow Table (per partition / per group)
          └─► Polars zero-copy conversion
              └─► rusket Rust extension (on the executor)
                  └─► results → PyArrow → PySpark DataFrame
from pyspark.sql import SparkSession
from rusket.spark import mine_grouped, rules_grouped
spark = SparkSession.builder.appName("rusket-demo").getOrCreate()
# ── 1. Load your OHE transaction table (one row = one basket) ──
# Schema: store_id (string), bread (bool), butter (bool), milk (bool), ...
spark_df = spark.read.parquet("s3://data/baskets/")
# ── 2. Mine frequent itemsets per store in parallel ──
# Each Spark task calls the Rust FP-Growth/Eclat engine on its Arrow batch.
freq_df = mine_grouped(
spark_df,
group_col="store_id",
min_support=0.05, # 5% support per store
)
# freq_df schema: store_id | support (double) | itemsets (array<string>)
# ── 3. Count transactions per store (needed for rule support) ──
from pyspark.sql import functions as F
counts = (
spark_df.groupby("store_id")
.agg(F.count("*").alias("n"))
.rdd.collectAsMap() # {"store_1": 12000, "store_2": 8500, ...}
)
# ── 4. Generate association rules per store ──
rules_df = rules_grouped(
freq_df,
group_col="store_id",
num_itemsets=counts, # pass per-group counts as a dict
metric="confidence",
min_threshold=0.6,
)
# rules_df schema: store_id | antecedents | consequents | confidence | lift | ...
rules_df.orderBy("lift", ascending=False).show(10, truncate=False)

from rusket.spark import prefixspan_grouped
# event_log schema: category_id, user_id, item_id, event_ts
event_log = spark.read.parquet("s3://data/events/")
seq_df = prefixspan_grouped(
event_log,
group_col="category_id", # mine independently per product category
user_col="user_id", # sequence identifier within the group
time_col="event_ts", # ordering column
item_col="item_id",
min_support=50,          # absolute count: pattern must appear in ≥50 sessions
max_len=4,
)
# seq_df schema: category_id | support (long) | sequence (array<string>)
seq_df.show(5, truncate=False)

from rusket.spark import hupm_grouped
# profit_log schema: region_id, txn_id, item_id, profit
profit_log = spark.read.parquet("s3://data/profit/")
utility_df = hupm_grouped(
profit_log,
group_col="region_id",
transaction_col="txn_id",
item_col="item_id",
utility_col="profit",
min_utility=500.0,       # only itemsets with combined profit ≥ €500
)
# utility_df schema: region_id | utility (double) | itemset (array<long>)
utility_df.show(5, truncate=False)

from rusket.spark import recommend_batches
from rusket import ALS
# 1. Train an ALS model locally (or load a pre-trained one)
als = ALS.from_transactions(
events_pd,
user_col="user_id",
item_col="item_id",
).fit()  # ← always call .fit() after from_transactions()
# 2. Scale-out scoring: one recommendation row per user
user_df = spark.read.parquet("s3://data/users/").select("user_id")
recs_df = recommend_batches(user_df, model=als, user_col="user_id", k=10)
# recs_df schema: user_id (string) | recommended_items (array<int>)
recs_df.show(5, truncate=False)

Tip (Databricks / Delta Lake): all functions return a standard PySpark DataFrame, so you can write results back with `.write.format("delta").save(...)` or `.saveAsTable(...)` directly.
Every algorithm in rusket exposes a class-based API in addition to the functional helpers. All classes share a unified interface inherited from BaseModel:
| Class | Inherits from | Description |
|---|---|---|
| `FPGrowth` | `Miner`, `RuleMinerMixin` | Parallel FP-tree (Frequent Pattern Growth) mining |
| `Eclat` | `Miner`, `RuleMinerMixin` | Vertical bitset mining |
| `FIN` | `Miner`, `RuleMinerMixin` | Node-list intersection mining |
| `LCM` | `Miner`, `RuleMinerMixin` | Linear-time Closed itemset Mining |
| `HUPM` | `Miner` | High-Utility Pattern Mining (EFIM) |
| `PrefixSpan` | `Miner` | Sequential pattern mining |
| `ALS` | `ImplicitRecommender` | Alternating Least Squares CF |
| `BPR` | `ImplicitRecommender` | Bayesian Personalized Ranking CF |
| `SVD` | `ImplicitRecommender` | Funk SVD (biased SGD) |
| `LightGCN` | `ImplicitRecommender` | Graph Convolutional CF |
| `ItemKNN` | `ImplicitRecommender` | Item-based k-NN CF |
| `UserKNN` | `ImplicitRecommender` | User-based k-NN CF |
| `EASE` | `ImplicitRecommender` | Embarrassingly Shallow Autoencoders |
| `FM` | `BaseModel` | Factorization Machines (CTR prediction) |
| `FPMC` | `SequentialRecommender` | Factorizing Personalized Markov Chains |
| `SASRec` | `SequentialRecommender` | Self-Attentive Sequential Recommendation |
| `HybridEmbeddingIndex` | – | CF + semantic embedding fusion |
All classes share the following data-ingestion class methods inherited from BaseModel:
# Load from long-format (transaction_id, item_id) DataFrame or list of lists
model = FPGrowth.from_transactions(df, transaction_col="order_id", item_col="item", min_support=0.3)
# Typed convenience aliases: same result
model = FPGrowth.from_pandas(df, ...)
model = FPGrowth.from_polars(pl_df, ...)
model = FPGrowth.from_spark(spark_df, ...)

Miner subclasses (FPGrowth, Eclat) additionally expose RuleMinerMixin, giving a fluent pipeline:
model = FPGrowth.from_transactions(df, min_support=0.3)
freq = model.mine(use_colnames=True) # pd.DataFrame [support, itemsets]
rules = model.association_rules(metric="lift") # pd.DataFrame [antecedents, consequents, ...]
recs = model.recommend_items(["bread", "milk"])  # list of suggested items

ImplicitRecommender subclasses (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE) follow the scikit-learn fit()/predict() pattern.
SequentialRecommender subclasses (FPMC, SASRec) use from_transactions(..., time_col=...).fit() for sequential next-item prediction:
# Option A: construct, then fit with a sparse matrix
model = ALS(factors=64, iterations=15)
model.fit(user_item_csr)
# Option B: from event log, then explicit .fit()
model = ALS(factors=64).from_transactions(
    df, user_col="user_id", item_col="item_id"
).fit()  # ← .fit() is always required
# Predict / recommend
items, scores = model.recommend_items(user_id=42, n=10, exclude_seen=True)
users, scores = model.recommend_users(item_id=99, n=5)

Breaking change vs older versions: `from_transactions()` no longer auto-fits. Always chain `.fit()` after it.
rusket provides more than just basic market basket analysis. It includes an entire suite of modern algorithms and a high-level Business Recommender API.
Two complementary memory-based methods that consistently rank among the top performers in academic benchmarks (see Anelli et al. 2022).
- ItemKNN: finds items similar to what the user already liked. Fast, stable, and scales well with pre-computed item-item similarity.
- UserKNN: finds users similar to the target user and recommends what they liked. Often more serendipitous, and performs particularly well on dense datasets.
Both support BM25, TF-IDF, Cosine, and raw Count weighting, with the top-K neighbor pruning running in parallel Rust.
from rusket import ItemKNN, UserKNN
# ── Item-based: "Customers who bought X also bought Y" ──
item_knn = ItemKNN.from_transactions(
purchases, user_col="user_id", item_col="item_id",
method="bm25", k=100,
).fit()
items, scores = item_knn.recommend_items(user_id=42, n=10)
# ── User-based: "Users similar to you enjoyed these items" ──
user_knn = UserKNN.from_transactions(
purchases, user_col="user_id", item_col="item_id",
method="cosine", k=50,
).fit()
items, scores = user_knn.recommend_items(user_id=42, n=10)

Which one to choose? Start with `ItemKNN(method="bm25")`: it's the fastest and most stable. Switch to `UserKNN` if you have a dense dataset or want more diverse recommendations. In production, try both and evaluate with `rusket.evaluate()`.
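For intuition, BM25 reweighting of a user-item count matrix looks roughly like the following numpy sketch. This is a simplified textbook variant; rusket's exact formula and default parameters are an assumption here:

```python
import numpy as np

def bm25_weight(X, k1=1.2, b=0.75):
    # X: (n_users, n_items) interaction counts
    X = np.asarray(X, dtype=float)
    n_users = X.shape[0]
    # idf: interactions with rare items carry more signal
    df = (X > 0).sum(axis=0)
    idf = np.log(1.0 + (n_users - df + 0.5) / (df + 0.5))
    # normalise by user activity so heavy users don't dominate
    length = X.sum(axis=1, keepdims=True)
    denom = X + k1 * (1.0 - b + b * length / length.mean())
    return idf * X * (k1 + 1.0) / np.where(denom == 0.0, 1.0, denom)

X = np.array([[5, 0, 1],
              [3, 1, 0],
              [4, 0, 0]])
W = bm25_weight(X)
```

Note how a single interaction with a rare item can outweigh several interactions with an item everyone has.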
Both models learn user and item embeddings from implicit feedback (purchases, clicks, plays) and power personalised recommendations at scale. Use ALS for broad serendipitous discovery; use BPR when you care only about top-N ranking.
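Under the hood, implicit-feedback ALS typically converts raw counts into confidences following Hu, Koren & Volinsky (2008); assuming rusket's `alpha` follows that standard formulation, the weighting is simply:

```python
import numpy as np

plays = np.array([12, 5, 0, 1], dtype=float)  # raw play counts for one user
alpha = 40.0

# c_ui = 1 + alpha * r_ui: zero plays keep a weak prior of 1,
# heavy play counts become strong positive confidence
confidence = 1.0 + alpha * plays
```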
from rusket import ALS, BPR
# ── "For You" homepage: music streaming platform ──
# event log: user_id | track_id | plays (optional weight)
plays = pd.DataFrame({
"user_id": [101, 101, 102, 102, 103, 103, 103],
"track_id": ["T01", "T03", "T01", "T05", "T02", "T03", "T05"],
"plays": [12, 5, 8, 3, 20, 1, 7], # play count as confidence weight
})
als = ALS(factors=64, iterations=15, alpha=40.0).from_transactions(
plays, user_col="user_id", item_col="track_id", rating_col="plays"
).fit()  # ← always call .fit() after from_transactions()
# Top-10 tracks for user 101, excluding already-played tracks
tracks, scores = als.recommend_items(user_id=101, n=10, exclude_seen=True)
# Which users are most likely to enjoy track T05? Useful for email campaigns
users, scores = als.recommend_users(item_id="T05", n=50)
# BPR: optimise ranking directly rather than reconstruction
bpr = BPR(factors=64, learning_rate=0.05, iterations=150).fit(user_item_csr)

Combine Collaborative Filtering (ALS/BPR) with Frequent Pattern Mining to cover every placement surface, from the personalised homepage ("For You") to the active cart ("Frequently Bought Together"), in a single engine.
from rusket import ALS, Recommender, FPGrowth
# 1. Train on purchase history (implicit feedback)
als = ALS(factors=64, iterations=15).fit(user_item_csr)
# 2. Mine co-purchase rules from basket data
miner = FPGrowth(basket_ohe, min_support=0.01)
freq = miner.mine()
rules = miner.association_rules()
# 3. Create the Hybrid Engine
rec = Recommender(model=als, rules_df=rules)
# "For You" homepage: personalised for customer 1001
items, scores = rec.recommend_for_user(user_id=1001, n=5)
# Blend CF + product embeddings (e.g. from a PIM or sentence-transformer)
items, scores = rec.recommend_for_user(user_id=1001, n=5, alpha=0.7,
                                       target_item_for_semantic="HDPHONES")
# Active cart cross-sell: "Frequently Bought Together"
add_ons = rec.recommend_for_cart(["USB_DAC", "AUX_CABLE"], n=3)
# Overnight batch: score all customers, write to CRM
batch_df = rec.predict_next_chunk(user_history_df, user_col="customer_id", k=5)

Collaborative filtering embeddings capture behavioral signals (who bought what); semantic text embeddings capture content meaning (product descriptions). Fusing them into a single vector space lets you do ANN retrieval, vector DB export, and clustering in one shot.
import rusket
# 1. Train ALS on implicit feedback
als = rusket.ALS(factors=64, iterations=15).fit(interactions)
# 2. Get semantic embeddings (e.g. from sentence-transformers)
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_vectors = encoder.encode(product_descriptions) # (n_items, 384)
# 3. Fuse into a single hybrid vector space
hybrid = rusket.HybridEmbeddingIndex(
cf_embeddings=als.item_factors, # (n_items, 64)
semantic_embeddings=text_vectors, # (n_items, 384)
strategy="weighted_concat", # "concat" | "weighted_concat" | "projection"
alpha=0.6, # 60% CF, 40% semantic
)
# 4. Similar items via cosine on the fused space
ids, scores = hybrid.query(item_id=42, n=10)
# 5. Build an ANN index for sub-millisecond retrieval
ann = hybrid.build_ann_index(backend="native") # or "faiss"
# 6. Export to a vector DB for production serving
hybrid.export_vectors(qdrant_client, collection_name="hybrid_items")
# 7. Or export as separate named vectors for DB-side fusion
hybrid.export_vectors(qdrant_client, mode="multi", collection_name="hybrid_items")
# → Qdrant/Meilisearch/Weaviate store "cf" and "semantic" as separate named vectors

Three fusion strategies:
| Strategy | Description | Use Case |
|---|---|---|
| `"concat"` | L2-normalise each space, concatenate | Equal importance, no tuning |
| `"weighted_concat"` | Scale by α / 1−α, then concat | Default; tune `alpha` to balance CF vs semantic |
| `"projection"` | Concat + PCA to `projection_dim` | Compact vectors for large-scale deployment |
Standalone function: if you just need the fused matrix without an index, use `rusket.fuse_embeddings(cf, sem, strategy="weighted_concat", alpha=0.6)`.
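A numpy sketch of what `"weighted_concat"` plausibly does (L2-normalise each space, scale by alpha and 1 − alpha, concatenate); treat the exact scaling as an assumption, not rusket's verbatim code:

```python
import numpy as np

def weighted_concat(cf, sem, alpha=0.6):
    def l2norm(M):
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.where(norms == 0, 1.0, norms)
    # Each row's CF part ends up with norm alpha, its semantic part 1 - alpha
    return np.hstack([alpha * l2norm(cf), (1.0 - alpha) * l2norm(sem)])

rng = np.random.default_rng(0)
cf = rng.normal(size=(5, 4))    # e.g. ALS item factors
sem = rng.normal(size=(5, 8))   # e.g. sentence-transformer vectors
fused = weighted_concat(cf, sem, alpha=0.6)   # shape (5, 12)
```

Cosine similarity on the fused rows then weighs behavioral and semantic agreement by alpha.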
For production systems requiring advanced retrieval and ranking, use the Pipeline class. This mirrors the "retrieve → rerank → filter" paradigm used by Twitter/X and modern ML stacks.
It chains multiple models together:
- Retrieve: Candidate generation
- Rerank: Re-score candidates using a heavier scoring function
- Filter: Apply business rules (e.g. exclude out-of-stock items, diversify)
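As a toy illustration of candidate merging (hypothetical semantics for `merge_strategy="max"`: each candidate keeps its best score across retrievers):

```python
# Scores from two retrievers over overlapping candidate sets
als_scores = {"A": 0.9, "B": 0.4}
bpr_scores = {"B": 0.7, "C": 0.5}

# merge_strategy="max": a candidate keeps its best retriever score
merged = {
    item: max(s.get(item, float("-inf")) for s in (als_scores, bpr_scores))
    for item in als_scores.keys() | bpr_scores.keys()
}
```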
from rusket import ALS, BPR, Pipeline, RuleBasedRecommender
import pandas as pd
# 1. Train multiple base models
als = ALS(factors=64).fit(interactions)
bpr = BPR(factors=128).fit(interactions)
# 2. Define explicit business rules (e.g. promoting warranties with laptops)
rules_df = pd.DataFrame({
"antecedent": ["102"], # Laptop SKU
"consequent": ["999"], # Warranty SKU
"score": [2.0]
})
rules = RuleBasedRecommender.from_transactions(
interactions, rules=rules_df, user_col="user", item_col="item"
).fit()
# 3. Compose the Pipeline (Retrieve from ALS, rerank with deeper BPR vectors)
# Items from the `rules` model receive an artificial +1,000,000 score
# ensuring they rank at the top *after* the algorithmic reranking.
pipeline = Pipeline(
retrieve=[als, bpr],
merge_strategy="max", # how to combine candidate scores
rerank=bpr,
rules=rules,
)
# Recommend for a user
items, scores = pipeline.recommend(user_id=42, n=10, exclude_seen=True)
# Blazing-fast Batch Scoring utilizing Rust inner loops
batch_recs = pipeline.recommend_batch(
user_ids=[1, 2, 3],
n=10,
format="polars"  # Returns a native Polars DataFrame
)

rusket models use a unified BaseModel that provides .save() and .load() functionality. You can also export trained models to a Vector Database for fast, real-time serving in production. We also provide `load_model`, which automatically infers the model architecture from the pickle file.
import rusket
# 1. Train the model
model = rusket.ALS(factors=32).fit(interactions)
# 2. Save your trained model to disk
model.save("my_als_model.pkl")
# 3. Load it back using the generic loader
loaded_model = rusket.load_model("my_als_model.pkl")
# 4. Export the embeddings for a Vector Database
items_df = rusket.export_item_factors(
loaded_model,
normalize=True, # Best for Cosine Similarity search
format="pandas"
)
# 5. Serve it in real-time (Example using LanceDB)
import lancedb
# Create a local vector database
db = lancedb.connect("./lancedb_store")
table = db.create_table("items", data=items_df)
# Query the table with a specific user's latent factors
user_emb = loaded_model.user_factors[0]
# Retrieve top 5 item recommendations for this user using L2-normalized vector search!
results = table.search(user_emb).limit(5).to_pandas()

from rusket import find_substitutes, customer_saturation
# Identify cannibalizing SKUs (lift < 1.0) for assortment rationalisation
subs = find_substitutes(rules_df, max_lift=0.8)
# antecedents consequents lift
# (Cola A,)  (Cola B,)  0.61  ← these products hurt each other's sales
# Segment customers by category penetration (decile 10 = buy everything; 1 = barely engaged)
saturation = customer_saturation(
purchases_df, user_col="customer_id", category_col="category_id"
)

- BPR (Bayesian Personalized Ranking): directly optimises ranking of positive interactions over negative ones; ideal for newsfeeds, playlists, and app recommendation surfaces that prioritise top-N precision.
- Sequential Pattern Mining (PrefixSpan): discovers ordered patterns across time (e.g., "Subscriber signed up for broadband → mobile plan → premium bundle" or "Customer viewed Camera → 2 weeks later bought Lens").
rusket natively extracts PrefixSpan sequences from Pandas, Polars, and PySpark event logs with zero-copy Arrow mapping:
from rusket import PrefixSpan
# Telco product adoption journeys: what sequence of subscriptions do customers follow?
# df: customer_id | subscription_date | product_id
model = PrefixSpan.from_transactions(
subscription_events,
transaction_col="customer_id",
item_col="product_id",
time_col="subscription_date",
min_support=50, # at least 50 customers follow this path
max_len=4,
)
freq_seqs = model.mine()
# e.g. [broadband] → [mobile] → [tv_bundle] appears in 312 journeys

Integrate natively with the modern GenAI/LLM stack:
- Vector Export: export user/item factors to a Pandas DataFrame ready for FAISS/Qdrant using `model.export_item_factors()`.
- Item-to-Item Similarity: fast cosine similarity on embeddings using `model.similar_items(item_id)`.
- Graph Generation: automatically convert association rules into a `networkx` directed graph for community detection using `rusket.viz.to_networkx(rules)`.
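What item-to-item similarity computes is standard cosine similarity over item embeddings; a numpy sketch (illustrative, not rusket's implementation):

```python
import numpy as np

def similar_items(item_factors, item_id, n=5):
    # Normalise rows; a dot product against the query row is then cosine similarity
    F = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
    scores = F @ F[item_id]
    order = np.argsort(-scores)
    order = order[order != item_id][:n]  # drop the query item itself
    return order, scores[order]

factors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0]])
ids, scores = similar_items(factors, item_id=0, n=2)
```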
rusket has built-in support for MLflow experiment tracking, mlflow.pyfunc packaging, and Bayesian hyperparameter optimisation using Optuna's TPE sampler. For ALS/eALS models, each Optuna trial runs the Rust-native cross-validation backend β making the entire search blazingly fast.
import rusket
import rusket.mlflow
from rusket import OptunaSearchSpace
# ── 1. Enable MLflow Autologging ──
rusket.mlflow.autolog()
# ── 2. Train a single model with automatic tracking ──
# Hyperparameters (factors, iterations) and training_duration_seconds are logged!
import mlflow
with mlflow.start_run():
model = rusket.ALS(factors=64, iterations=15).fit(df)
# Save/Load models as native MLflow pyfunc artifacts for easy deployment
rusket.mlflow.save_model(model, "my_als_model")
loaded_model = mlflow.pyfunc.load_model("my_als_model") # Has a .predict(df) method
# ── 3. Quick hyperparameter search with sensible defaults ──
result = rusket.optuna_optimize(
rusket.ALS,
df,
user_col="user_id",
item_col="item_id",
n_trials=50,
metric="ndcg",
k=10,
)
print(f"Best ndcg@10: {result.best_score:.4f}")
print(f"Best params: {result.best_params}")
# ── Custom search space + refit best model ──
result = rusket.optuna_optimize(
rusket.eALS,
df,
user_col="user_id",
item_col="item_id",
search_space=[
OptunaSearchSpace.int("factors", 16, 256, log=True),
OptunaSearchSpace.float("alpha", 1.0, 100.0, log=True),
OptunaSearchSpace.float("regularization", 1e-4, 1.0, log=True),
OptunaSearchSpace.int("iterations", 5, 30),
],
n_trials=100,
n_folds=3,
metric="precision",
refit_best=True, # best model is already fitted
)
items, scores = result.best_model.recommend_items(user_id=42, n=10)
# ── MLflow experiment tracking ──
# pip install mlflow optuna-integration
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("als-tuning")
result = rusket.optuna_optimize(
rusket.ALS, df,
user_col="user_id", item_col="item_id",
n_trials=50, metric="ndcg",
mlflow_tracking=True,  # ← every trial logged to MLflow
)
# ── Custom callbacks ──
result = rusket.optuna_optimize(
rusket.ALS, df,
user_col="user_id", item_col="item_id",
n_trials=50,
callbacks=[my_custom_callback], # any Optuna-compatible callback
)

rusket supports optional GPU acceleration via CuPy or PyTorch CUDA for models that benefit from large matrix operations. Enable it globally with a single call; no need to pass `use_gpu=True` to every model.
import rusket
# Enable GPU globally: every model created after this uses CUDA
rusket.enable_gpu()
# All models now default to GPU
als = rusket.ALS(factors=128, iterations=20).fit(interactions)
ease = rusket.EASE(regularization=500).fit(interactions)
bpr = rusket.BPR(factors=64).fit(interactions)
# Per-model override: force a specific model to CPU
small_model = rusket.SVD(factors=16, use_gpu=False)
# Turn it off globally
rusket.disable_gpu()
# Check the current state
rusket.is_gpu_enabled()  # → False

All 12 recommender models respect the global GPU flag:
| Model | GPU-accelerated operations |
|---|---|
| ALS / eALS | Gramian, Cholesky solve, batch scoring |
| BPR | SGD updates, batch recommend |
| SVD | Factor updates, batch scoring |
| EASE | Gram matrix inversion |
| ItemKNN / UserKNN | Similarity scoring |
| LightGCN | Graph convolution, scoring |
| FM | Prediction |
| FPMC | Factor updates |
| SASRec / BERT4Rec | Attention forward pass |
| NMF | Multiplicative updates |
# CuPy (recommended: fastest)
pip install cupy-cuda12x
# Or PyTorch
pip install torch

No GPU? No problem.
`rusket` auto-detects whether a GPU backend is available. If neither CuPy nor PyTorch CUDA is installed, `enable_gpu()` will still succeed but models will raise an `ImportError` at fit-time. Use `rusket.check_gpu_available()` to test beforehand.
Benchmark environment: Apple Silicon MacBook Air (M-series, arm64, 8 GB RAM). All timings are single-run wall-clock measurements.
What's measured:
`from_transactions()` converts long-format `(txn_id, item_id)` rows into a sparse OHE matrix; `fpgrowth()` then mines that matrix. The Rust mining cost is the same either way; the only difference at large scale is whether you pay the conversion cost upfront.
| Scale | `from_transactions` (conversion) | `fpgrowth` (mining) | Total |
|---|---|---|---|
| 1M rows | 4.9s | 0.1s | 5.0s |
| 10M rows | 23.2s | 1.2s | 24.4s |
| 50M rows | 59.1s | 4.0s | 63.1s |
| 100M rows (20M txns × 200k items) | 124.1s | 10.1s | 134.2s |
| 200M rows (40M txns × 200k items) | 229.2s | 17.6s | 246.8s |
The mining step is fast; the bottleneck at scale is the long-format → sparse-matrix conversion. If your pipeline already produces a CSR/sparse matrix (e.g., from a Parquet/warehouse export), you skip the conversion entirely and only pay the mining cost.
import numpy as np
from scipy import sparse as sp
from rusket import FPGrowth
# Build CSR directly from integer IDs (no pandas!)
csr = sp.csr_matrix(
(np.ones(len(txn_ids), dtype=np.int8), (txn_ids, item_ids)),
shape=(n_transactions, n_items),
)
freq = FPGrowth(csr, item_names=item_names).mine(
min_support=0.001, max_len=3, use_colnames=True
)

At 100M rows, the mining step itself takes 10.1 seconds. Building the CSR directly skips the `from_transactions` conversion cost (~124s) but does not change the mining time.
| Dataset | Transactions | Items | rusket |
|---|---|---|---|
| andi_data.txt | 8,416 | 119 | 9.7 s (22.8M itemsets) |
| andi_data2.txt | 540,455 | 2,603 | 7.9 s |
Run benchmarks yourself:
uv run pytest benchmarks/bench_scale.py -v -s # Scale benchmark
uv run python benchmarks/bench_realworld.py # Real-world datasets
uv run pytest tests/test_benchmark.py -v -s  # pytest-benchmark

Measured with `pytest-benchmark` (5 rounds, warmed up, GC disabled). MovieLens 100k dataset (943 users, 1,682 items, 100k ratings). Only `model.fit()` is timed; no startup or data loading overhead.
| Benchmark | rusket | LibRecommender | Speedup |
|---|---|---|---|
| ALS (Cholesky) (64 factors, 15 epochs) | 427 ms | 1,324 ms | 3.1× |
| ALS (eALS) (64 factors, 15 epochs) | 360 ms | N/A | – |
| BPR (64 factors, 10 epochs) | 33 ms | 681 ms | 20.4× |
| ItemKNN (k=100) | 55 ms | 287 ms | 5.2× |
| SVD (64 factors, 20 epochs) | 55 ms | ❌ TF-only (broken) | – |
| EASE | 71 ms | N/A | – |
Note: LibRecommender requires TensorFlow + PyTorch + gensim + Cython (~2 GB of dependencies). rusket has zero runtime dependencies.
uv run pytest benchmarks/bench_pytest_librecommender.py -v --benchmark-columns=mean,stddev,rounds

pandas dense         ──► np.uint8 array (C-contiguous)  ──► Rust fpgrowth_from_dense
pandas Arrow backend ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
pandas sparse        ──► CSR int32 arrays               ──► Rust fpgrowth_from_csr
polars               ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
numpy ndarray        ──► np.uint8 (C-contiguous)        ──► Rust fpgrowth_from_dense
All mining and rule generation happens inside Rust. No Python loops, no round-trips.
To pass the "1 Billion Row" threshold without OOM crashes, rusket employs a zero-allocation mining loop:
- Eclat scratch buffers: `intersect_count_into` writes intersections directly into thread-local pre-allocated buffers and computes `popcnt` in a single pass. It terminates the loop early the moment it proves a combination cannot reach `min_support`.
- FP-Growth parallel tree build: conditional FP-trees are collected concurrently inside the Rayon parallel mining step, replacing the standard sequential loop and eliminating memory contention bottlenecks.
- `AHashMap` deduplication: extremely fast O(N) duplicate basket counting replaces standard O(N log N) unstable sorts in the core pipeline.
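The hash-based deduplication idea can be sketched in Python with `collections.Counter`: identical baskets collapse into (basket, weight) pairs in one O(N) expected-time pass before mining:

```python
from collections import Counter

baskets = [
    ("bread", "milk"),
    ("milk", "bread"),   # same basket, different order
    ("coffee",),
    ("bread", "milk"),
]

# frozenset hashing is order-insensitive, so duplicates collapse in one pass
weights = Counter(frozenset(b) for b in baskets)
```

The miner then processes each unique basket once, multiplying its contribution by the weight.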
- Rust 1.83+ (`rustup update`)
rustup update) - Python 3.10+
- uv (recommended package manager)
# Clone
git clone https://github.com/bmsuisse/rusket.git
cd rusket
# Build Rust extension in dev mode
uv run maturin develop --release
# Run the full test suite
uv run pytest tests/ -x -q
# Type-check the Python layer
uv run pyright rusket/
# Cargo check (Rust)
cargo check

# Getting started
uv run python examples/01_getting_started.py
# Market basket analysis with Faker
uv run python examples/02_market_basket_faker.py
# Polars input
uv run python examples/03_polars_input.py
# Sparse input
uv run python examples/04_sparse_input.py
# Large-scale mining (100k+ rows)
uv run python examples/05_large_scale.py
A large part of this library (including the Rust core algorithms, the Python wrappers, the OOP class hierarchy, and the Spark integration layer) was written with substantial assistance from AI pair-programming tools (specifically Google Gemini / Antigravity). Human review, benchmarking, and architectural decisions were applied throughout.
We believe in transparency about AI-assisted development. The algorithms are correct, the tests pass, and the performance numbers are real, but if you find a bug or a piece of "AI slop", please open an issue!