Cluster Rank Summarize is an intelligent data clustering framework that combines advanced itemset mining, hierarchical clustering, and LLM-powered analysis to discover meaningful patterns in complex datasets. Built for researchers, data scientists, and analysts who need to uncover hidden insights from tabular data.
- Intelligent Itemset Mining: Discover frequent patterns with configurable support thresholds
- Hierarchical Clustering: Automatically group similar itemsets with customizable similarity thresholds
- LLM-Enhanced Analysis: Generate human-readable itemset summaries and categorizations, plus advanced cluster analysis and insights
- Interactive Training: Iteratively improve clustering results with user feedback
- Rich Visualizations: Create interactive sunburst charts, network graphs, and heatmaps
- Similarity Search: Find similar itemsets and clusters for targeted analysis
- Flexible Configuration: Customize weights, parameters, and ranking strategies
# Clone the repository
git clone https://github.com/radiantlogicinc/cluster_rank_summarize.git
cd cluster_rank_summarize
# Install the package
pip install -e .
# Set up environment variables
mkdir -p env
cp .env.example env/.env
# Edit env/.env with your configuration (database path, API keys, etc.)

The framework works with SQLite databases. After installation:
- Prepare your SQLite database with your tabular data
- Configure environment variables in env/.env:
DB_PATH=/path/to/your/database.db
LITELLM_API_KEY_SUMMARIZATION=your_api_key_here # Optional, for LLM features
- Run the main script:
python cluster_rank_summarize/main.py train # First time training
python cluster_rank_summarize/main.py cluster # Generate clusters
The script will automatically:
- Connect to your SQLite database
- Show available tables for selection
- Display column information for row ID selection
- Guide you through the clustering process
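To preview what the table and column prompts will show before running the script, the database can be inspected with Python's standard `sqlite3` module. This is a minimal sketch, not the framework's own code: the helper names `list_tables` and `list_columns` are hypothetical, and the in-memory `admissions` table exists only for the demonstration.

```python
import sqlite3


def list_tables(conn):
    """Return the names of user tables, skipping SQLite's internal ones."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows if not name.startswith("sqlite_")]


def list_columns(conn, table):
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA statements do not accept bound parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    return [(row[1], row[2]) for row in rows]


# Demonstration on a throwaway in-memory database (assumed schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE admissions (id INTEGER PRIMARY KEY, admission_type TEXT)")
for table in list_tables(demo):
    print(table, list_columns(demo, table))
demo.close()
```

To inspect a real database, replace `":memory:"` with the path you set as `DB_PATH`.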
Important: Row ID Column Selection
When prompted, select a column that serves as a unique identifier for each row (e.g., id, patient_id, record_id). This column is essential for:
- Creating itemsets that group related data points
- Tracking which rows belong to which clusters
- Enabling similarity search and analysis
- Generating meaningful visualizations
The row ID column should contain unique values for each row in your dataset.
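A candidate row ID column can be checked for uniqueness before you start training. The sketch below uses only the standard `sqlite3` module; `is_unique_row_id` is a hypothetical helper, not part of the framework, and the `records` table is made up for the demonstration.

```python
import sqlite3


def _quote(identifier):
    """Quote an SQL identifier so unusual table or column names are safe."""
    return '"' + identifier.replace('"', '""') + '"'


def is_unique_row_id(conn, table, column):
    """Return True if `column` has no NULLs and no duplicates in `table`."""
    # Check the column exists first: SQLite may silently treat an unknown
    # double-quoted name as a string literal instead of raising an error.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({_quote(table)})")}
    if column not in existing:
        raise ValueError(f"no column {column!r} in table {table!r}")
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {_quote(column)}) FROM {_quote(table)}"
    ).fetchone()
    # COUNT(DISTINCT ...) ignores NULLs, so a NULL also makes the counts differ.
    return total == distinct


# Demonstration on a throwaway in-memory table (assumed schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE records (id INTEGER, kind TEXT)")
demo.executemany("INSERT INTO records VALUES (?, ?)", [(1, "a"), (2, "a"), (3, "b")])
print(is_unique_row_id(demo, "records", "id"))    # id values are all distinct
print(is_unique_row_id(demo, "records", "kind"))  # "a" appears twice
demo.close()
```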
If you want to use the framework as a package in your own applications, you can install it directly from GitHub:
# Install using uv (recommended)
uv pip install git+https://github.com/radiantlogicinc/cluster_rank_summarize.git
# Or install using pip
pip install git+https://github.com/radiantlogicinc/cluster_rank_summarize.git

After installation, you can import and use the framework in your Python applications:
from cluster_rank_summarize.main import train, get_clusters
from cluster_rank_summarize.data_preprocessing import prepare_data
import polars as pl
import pandas as pd
# Load your tabular data
df = pl.read_csv("your_data.csv")
# Step 1: Process the polars dataframe and prepare the pandas dataframe
TD = prepare_data(df, row_id_colname="id")
# Step 2: Train the model with your preferences
train(df, row_id_colname="id", table_name="your_table")
# Step 3: Get clustering results
get_clusters(
df,
row_id_colname="id",
table_name="your_table",
generate_visualizations=True,
generate_advanced_report=True
)

# Train the model interactively
python cluster_rank_summarize/main.py train
# Generate basic clustering results
python cluster_rank_summarize/main.py cluster
# Generate clusters with all advanced features
python cluster_rank_summarize/main.py cluster --visualize --advanced --itemset_summarization_categorization
# Optimize learning rate parameters (advanced users)
python cluster_rank_summarize/main.py tune_lr

The framework supports several optional flags to enhance your analysis:
| Flag | Description | Requirements |
|---|---|---|
| --advanced | Generate comprehensive cluster analysis reports with executive summaries, detailed analysis of all clusters, comparative analysis of clusters, and actionable insights | Requires API key |
| --visualize | Create interactive visualizations including sunburst charts, network graphs, heatmaps, and dendrograms | Requires API key for itemset summaries |
| --itemset_summarization_categorization | Generate human-readable summaries of itemsets and categorize them by interest level | Requires API key |
Example Usage:
# Basic clustering without LLM features
python cluster_rank_summarize/main.py cluster
# Full analysis with all features (requires API key)
python cluster_rank_summarize/main.py cluster --advanced --visualize --itemset_summarization_categorization
# Just visualizations
python cluster_rank_summarize/main.py cluster --visualize
# Just itemset summarization and categorization by LLM
python cluster_rank_summarize/main.py cluster --itemset_summarization_categorization
# Just advanced cluster analysis
python cluster_rank_summarize/main.py cluster --advanced

Note: Features requiring API keys will prompt you to enter your API key if not configured in the environment file.
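When launching the `cluster` mode from another script, it can help to know in advance which flags will trigger the API key prompt. The following is a rough sketch under stated assumptions: `build_cluster_command` is a hypothetical helper, and it only looks at the mapping you pass in (by default the process environment), so a key stored solely in env/.env is not visible to it unless you load that file yourself.

```python
import os
import sys

# Flags listed in the table above; each one relies on an API key.
LLM_FLAGS = ("--advanced", "--visualize", "--itemset_summarization_categorization")


def build_cluster_command(flags, env=None):
    """Return (argv, flags_that_will_prompt) for a `cluster` run."""
    env = os.environ if env is None else env
    unknown = [flag for flag in flags if flag not in LLM_FLAGS]
    if unknown:
        raise ValueError(f"unsupported flags: {unknown}")
    argv = [sys.executable, "cluster_rank_summarize/main.py", "cluster", *flags]
    has_key = bool(env.get("LITELLM_API_KEY_SUMMARIZATION"))
    will_prompt = [] if has_key else list(flags)
    return argv, will_prompt


argv, will_prompt = build_cluster_command(["--visualize"], env={})
print(" ".join(argv[1:]))
if will_prompt:
    print("No API key found; expect a prompt for:", ", ".join(will_prompt))
```

The returned `argv` list can be passed to `subprocess.run` from the repository root.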
The framework supports three main operation modes:
| Mode | Purpose | Description |
|---|---|---|
| train | Initial Training | Interactive training mode where you review itemsets and provide feedback to improve clustering results. Creates/updates configuration files. |
| cluster | Generate Results | Uses trained configuration to generate clustering results, visualizations, and analysis reports. |
| tune_lr | Parameter Optimization | Advanced mode for testing different learning rate combinations to find optimal parameters. |
Typical Workflow:
- First-time setup: Run train mode to set preferences and create configuration
- Generate results: Run cluster mode with desired flags for analysis
- Optimize parameters (optional): Run tune_lr mode to fine-tune learning rates
- Iterate: Re-run train mode if results need improvement
- Market Research: Segment customers and discover behavioral patterns
- Operational Analysis: Find process inefficiencies and optimization opportunities
- Academic Research: Explore complex datasets for publication-ready insights
cluster_rank_summarize/
├── main.py                    # Entry point and CLI interface
├── clustering.py              # Hierarchical clustering algorithms
├── data_preprocessing.py      # Data cleaning and preparation
├── itemset_mining.py          # Frequent itemset mining and algorithms
├── llm_analysis.py            # AI-powered analysis and summarization
├── parameter_optimization.py  # Learning rate and weight optimization
├── similarity_search.py       # Similarity computation and search
├── visualization.py           # Interactive charts and graphs
├── display_utils.py           # Output formatting and display
└── utils.py                   # Helper functions and utilities
The framework uses JSON configuration files stored in ranking_config/ to persist your preferences:
{
"weights": {
"column1": 0.3,
"column2": 0.2,
"column3": 0.5
},
"min_support": 0.1,
"max_collection": -1,
"gamma": 0.7,
"lr_weights": 0.12,
"lr_gamma": 0.8
}

The demo/ folder contains a complete example using the MIMIC-III dataset:
Demo Contents:
- ADMISSIONS.csv: Sample medical admissions data
- ADMISSIONS_config.json: Pre-trained configuration
- cluster_reports/: Detailed analysis reports for each cluster
- cluster_sunburst_with_summary.html: Interactive visualization with a complete hierarchy
- itemset_summaries.md: Human-readable itemset descriptions
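A configuration file can be inspected with the standard library before a `cluster` run. This is a minimal sketch that assumes the file follows the JSON layout shown earlier (a `weights` mapping plus scalar parameters); `load_ranking_config` is a hypothetical helper, and the temporary file stands in for a real file under ranking_config/ or for the demo's pre-trained configuration.

```python
import json
import tempfile
from pathlib import Path


def load_ranking_config(path):
    """Load a config file and return (config, columns ranked by weight)."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    weights = config.get("weights", {})
    for column, weight in weights.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(weight, bool) or not isinstance(weight, (int, float)):
            raise ValueError(f"weight for {column!r} is not a number: {weight!r}")
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return config, ranked


# Demonstration with the sample values from the configuration shown earlier.
sample = {
    "weights": {"column1": 0.3, "column2": 0.2, "column3": 0.5},
    "min_support": 0.1,
    "max_collection": -1,
    "gamma": 0.7,
    "lr_weights": 0.12,
    "lr_gamma": 0.8,
}
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "example_config.json"
    config_path.write_text(json.dumps(sample), encoding="utf-8")
    config, ranked = load_ranking_config(config_path)

for column, weight in ranked:
    print(f"{column}: {weight}")
print("min_support:", config["min_support"])
```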
- Itemset Summarization: Convert complex patterns into readable descriptions
- Interest Categorization: Automatically classify patterns by relevance
- Advanced Analysis: Generate executive summaries and actionable insights for each cluster
- DeepSeek API (default)
- OpenAI GPT models
- Any LiteLLM-compatible endpoint
- Sunburst Charts: Interactive hierarchical cluster visualization
- Network Graphs: Cluster relationships and connections
- Heatmaps: Similarity matrices and overlap analysis
- Dendrograms: Hierarchical clustering trees
# Find similar itemsets
similarities = get_similar_itemsets(itemsets, itemset_id=5, row_id_colname="id", top_n=10)
# Find similar clusters
cluster_similarities = get_similar_clusters(clusters, cluster_id=2, itemsets=itemsets, row_id_colname="id", top_n=5)

# Test different learning rate combinations
results = test_learning_rate_combinations(dataframe, row_id_colname)

The framework supports iterative improvement through user feedback:
- Review initial clustering results
- Promote/demote itemsets based on relevance
- Automatically adjust weights and parameters
- Re-cluster with improved settings
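The feedback loop above can be pictured with a small, purely illustrative example. This README does not document the framework's actual update rule, so everything here is an assumption: the hypothetical `adjust_weights` function applies a multiplicative nudge scaled by a learning rate (named after the `lr_weights` configuration key) and then renormalizes the weights to sum to 1.

```python
def adjust_weights(weights, feedback, lr_weights=0.12):
    """Nudge column weights by user feedback, then renormalize to sum to 1.

    `feedback` maps a column name to +1 (promoted) or -1 (demoted);
    columns without feedback are left un-nudged before renormalization.
    This is an illustrative rule, not the framework's documented behavior.
    """
    nudged = {
        column: max(weight * (1.0 + lr_weights * feedback.get(column, 0)), 0.0)
        for column, weight in weights.items()
    }
    total = sum(nudged.values())
    if total == 0:
        raise ValueError("all weights are zero; nothing to normalize")
    return {column: weight / total for column, weight in nudged.items()}


weights = {"column1": 0.3, "column2": 0.2, "column3": 0.5}
updated = adjust_weights(weights, {"column1": +1, "column3": -1})
for column, weight in updated.items():
    print(f"{column}: {weight:.4f}")
```

In this sketch a promoted column gains weight and a demoted column loses weight, while renormalization keeps the weights comparable from one round to the next.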
- Python 3.11+
- pandas >= 2.2.3
- polars >= 1.30.0
- scikit-learn (via mlxtend)
- plotly >= 6.1.2
- networkx >= 3.4.2
- litellm >= 1.74.0 (for LLM features)
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Ready to discover hidden patterns in your data? Start with our Quick Start Guide and explore the demo!