A multi-model system for analyzing natural language queries for SQL generation. Provides query classification, complexity analysis, and schema relevance ranking for text-to-SQL systems.
| Feature | Model | Accuracy | Use Case |
|---|---|---|---|
| Query Analyzer | DistilBERT | 83.2% | Classify query complexity, type, keywords |
| Schema Ranker | Sentence-BERT | 80% recall | Pre-filter tables for RAG systems |
```bash
# 1. Setup
./setup.sh

# 2. Download pre-trained models (recommended)
huggingface-cli download amannirala13/squery-lens-models --local-dir ./models

# Or train from scratch (~1 hour total)
# ./train.sh        # Query Analyzer (~16 min)
# ./train_ranker.sh # Schema Ranker (~46 min)

# 3. Test
./analyze.sh "Find customers who spent more than 1000"
./rank.sh "Show all orders" "customers,orders,products,logs"

# 4. Demo
python demo_ranker.py
```

| Resource | Link | Size |
|---|---|---|
| Pre-trained Models | amannirala13/squery-lens-models | 1.6 GB |
| Training Datasets | amannirala13/squery-lens-data | 262 MB |
```bash
# Download models only
huggingface-cli download amannirala13/squery-lens-models --local-dir ./models

# Download datasets (for training from scratch)
huggingface-cli download amannirala13/squery-lens-data --local-dir ./data --repo-type dataset
```

```
sql_classifier/
├── src/
│   ├── model/              # Query Analyzer (DistilBERT)
│   ├── schema_ranker/      # Schema Ranker (Sentence-BERT)
│   │   ├── ranker.py       # Python API for RAG integration
│   │   └── train.py
│   ├── training/
│   └── api/
│
├── models/
│   ├── best_model_enhanced.pt  # Query Analyzer
│   └── schema_ranker/          # Schema Ranker
│
├── # Scripts
├── setup.sh          # Install dependencies
├── train.sh          # Train Query Analyzer
├── train_ranker.sh   # Train Schema Ranker
├── serve.sh          # Start API server
├── analyze.sh        # Analyze a query
├── rank.sh           # Rank tables
└── demo_ranker.py    # Interactive demo
```
Classifies natural language queries to help route and optimize SQL generation.
- Complexity: simple, medium, complex
- Category: Data Manipulation, Schema Definition, etc.
- Subcategories: Read, Write, Aggregation, Filtering, Joining
- Keywords: SELECT, JOIN, WHERE, GROUP BY, etc.
- Estimated Tables: 1, 2, 3+
```
$ ./analyze.sh "Calculate total revenue by product category"

Complexity:    medium (92.8%)
Category:      Data Manipulation (99.7%)
Subcategories: ['Read', 'Aggregation', 'Grouping']
Keywords:      ['SELECT', 'FROM', 'GROUP BY', 'AS']
Est. Tables:   1
```

Ranks tables/columns by relevance to a natural language query.
| Scenario | Schema Ranker Needed? |
|---|---|
| RAG uses general embeddings (OpenAI, Cohere) | ✅ Yes - domain-specific ranking |
| RAG already has SQL-tuned embeddings | ❌ Redundant |
| 100+ tables, need fast pre-filtering | ✅ Yes - 7ms for 500 tables |
| Schema already in RAG vector DB | ❌ Query RAG directly |
```bash
# Simple table names
./rank.sh "Find customer orders" "customers,orders,products,logs"

# Full schemas (use | separator)
./rank.sh "Find customer orders" \
  "customers (id, name, email)|orders (id, customer_id, total)|products (id, name, price)"
```

```python
from src.schema_ranker import SchemaRanker

# Initialize (load once)
ranker = SchemaRanker()

# Load your schema
ranker.load_schema_with_columns({
    "customers": ["id", "name", "email"],
    "orders": ["id", "customer_id", "total"],
    "products": ["id", "name", "price"],
    ...
})

# Get hints for RAG
hints = ranker.get_hints("Find customers who ordered products", top_k=20)

# Use with RAG
rag.search(query, filter_tables=hints.table_names)   # Filter
rag.search(query, boost=hints.to_boost_weights())    # Boost
```

| Metric | Value |
|---|---|
| Table encoding | 0.6s for 500 tables (one-time) |
| Query ranking | 7ms per query |
| Recall @ top-15 | 80% |
| Recall @ top-30 | 80% |
The Schema Ranker performs text-similarity matching only. It does NOT understand:
- Database relationships (foreign keys)
- Business logic (e.g., payment = usage × rate)
- JOIN patterns

For complex queries that require relationship reasoning, use the ranker as a first-pass filter, then let the LLM reason over the candidates.
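A minimal sketch of that two-stage pattern. The word-overlap scorer below is a toy stand-in for the ranker's embedding similarity, and `build_prompt` and its prompt format are illustrative helpers, not part of this repo:

```python
from typing import Dict, List

def rank_tables(query: str, schemas: Dict[str, List[str]], top_k: int = 2) -> List[str]:
    # Toy first-pass scorer: count query words that prefix-match a table
    # or column name. Stand-in for the Schema Ranker's cosine similarity.
    q_words = set(query.lower().split())
    def score(name: str, cols: List[str]) -> int:
        t_words = {name.lower(), *[c.lower() for c in cols]}
        return sum(1 for q in q_words for t in t_words
                   if t.startswith(q) or q.startswith(t))
    ranked = sorted(schemas, key=lambda n: score(n, schemas[n]), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, schemas: Dict[str, List[str]], top_k: int = 2) -> str:
    # Only the surviving candidates reach the LLM, which then does the
    # relationship/JOIN reasoning the ranker cannot.
    tables = rank_tables(query, schemas, top_k)
    schema_text = "\n".join(f"{t}({', '.join(schemas[t])})" for t in tables)
    return f"Schema:\n{schema_text}\n\nQuestion: {query}\nSQL:"

schemas = {
    "customers": ["id", "name", "email"],
    "orders": ["id", "customer_id", "total"],
    "products": ["id", "name", "price"],
    "logs": ["id", "ts", "message"],
}
print(rank_tables("Find customer orders", schemas))  # → ['orders', 'customers']
```

The point of the split: the cheap scorer shrinks the candidate set, and the expensive LLM call only sees the few tables that survived.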
```bash
./serve.sh
# Server runs at http://localhost:8000
```

| Method | Endpoint | Description |
|---|---|---|
| POST | /analyze | Analyze query + rank tables |
| POST | /analyze/batch | Batch analysis |
| GET | /info | Model info |

```bash
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{"query": "Find customers", "tables": ["customers", "orders"]}'
```

```typescript
import { SQLQueryAnalyzer } from './client/classifier';

const analyzer = new SQLQueryAnalyzer('http://localhost:8000');
const result = await analyzer.analyze("Find customers", ["customers", "orders"]);

console.log(result.complexity); // "simple"
console.log(result.tables);     // [{table: "customers", confidence: 0.92}, ...]
```

```
Query → DistilBERT → Shared Layer (768→512)
                          ↓
         ┌────────┬───────┼────────┬────────┐
         ↓        ↓       ↓        ↓        ↓
    Complexity Keywords Category Subcat TableCount
```
```
Query  → Encoder → Query Embedding
                        ↓
              Cosine Similarity → Scores
                        ↑
Tables → Encoder → Table Embeddings (pre-computed)
```
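The diagram boils down to a dot product over unit-normalized embeddings: tables are encoded once, then each query costs a single matrix-vector product. A self-contained sketch, with a toy bag-of-words encoder standing in for Sentence-BERT:

```python
import re
import numpy as np

TABLE_TEXTS = {
    "customers": "customers (id, name, email)",
    "orders": "orders (id, customer_id, total)",
    "products": "products (id, name, price)",
}

# Toy encoder: bag-of-words over a shared vocabulary, unit-normalized.
# Stand-in for the Sentence-BERT encoder, which produces dense vectors.
def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

VOCAB = sorted({w for t in TABLE_TEXTS.values() for w in tokenize(t)})

def embed(text):
    v = np.array([tokenize(text).count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

# Pre-compute table embeddings once (the one-time encoding step above)
names = list(TABLE_TEXTS)
E = np.stack([embed(TABLE_TEXTS[n]) for n in names])

# Per-query cost is one matmul; rows are unit-norm, so this IS cosine similarity
q = embed("customer orders")
scores = E @ q
best = names[int(np.argmax(scores))]
print(best)  # → orders
```

"orders" wins here because `customer_id` tokenizes to `customer` and `id`, so it matches both query words; this is exactly the kind of surface-level similarity (and nothing deeper) the ranker provides.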
```
User Query
    ↓
┌─────────────────────────────────────┐
│ Query Analyzer                      │
│ → complexity, keywords, category    │
│ → Route: simple→template, complex→LLM
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Schema Ranker (optional)            │
│ → Pre-filter 500 tables → 20        │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ RAG Vector Search                   │
│ → Get detailed schema for top tables│
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ LLM                                 │
│ → Generate SQL with focused context │
└─────────────────────────────────────┘
```
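A skeleton of that pipeline as code, with each stage injected as a callable. The lambdas stand in for the real components (analyzer model, ranker, vector DB, LLM call); the stage names and return values are illustrative:

```python
def text_to_sql(query, analyze, rank, rag_search, generate):
    # Stage 1: Query Analyzer routes simple queries to a cheap template path
    analysis = analyze(query)
    if analysis["complexity"] == "simple":
        return f"-- template path for: {query}"
    # Stage 2: Schema Ranker pre-filters the candidate tables (e.g. 500 → 20)
    tables = rank(query)
    # Stage 3: RAG fetches detailed schema only for the surviving tables
    context = rag_search(query, tables)
    # Stage 4: LLM generates SQL with the focused context
    return generate(query, context)

sql = text_to_sql(
    "Total revenue by category",
    analyze=lambda q: {"complexity": "medium"},
    rank=lambda q: ["orders", "products"],
    rag_search=lambda q, t: [f"schema for {x}" for x in t],
    generate=lambda q, ctx: f"-- LLM sees {len(ctx)} schema chunks",
)
print(sql)  # → -- LLM sees 2 schema chunks
```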
| Output | Accuracy |
|---|---|
| Complexity | 74.9% |
| Keywords | 73.8% F1 |
| Category | 98.7% |
| Subcategory | 88.0% F1 |
| Table Count | 80.8% |
| Combined | 83.2% |
| Metric | Value |
|---|---|
| Pearson Correlation | 0.872 |
| Spearman Correlation | 0.805 |
| Training Samples | 350K pairs |
| Training Time | 46 min |
| Script | Description |
|---|---|
| ./setup.sh | Install dependencies |
| ./train.sh | Train Query Analyzer |
| ./train_ranker.sh | Train Schema Ranker |
| ./serve.sh | Start API server |
| ./analyze.sh "query" | Analyze a query |
| ./rank.sh "query" "tables" | Rank tables |
| python demo_ranker.py | Interactive demo |
| Dataset | Samples | Use |
|---|---|---|
| text-to-sql-100K | 100K | Query Analyzer training |
| data_table_2 | 2.8M | Schema Ranker training |
Training and inference automatically select the best available device:
- MPS (Metal) on Apple Silicon
- CUDA on NVIDIA GPUs
- CPU as fallback
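That fallback order can be written as a small selector. In PyTorch the flags come from `torch.backends.mps.is_available()` and `torch.cuda.is_available()`; they are injected as parameters here so the sketch runs anywhere:

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    # Fallback order from above: Metal on Apple Silicon, then CUDA, then CPU.
    # In real code: pick_device(torch.backends.mps.is_available(),
    #                           torch.cuda.is_available())
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"

print(pick_device(False, True))  # → cuda
```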