# AI-Powered Synthetic Data Engine

Generate realistic, multi-table synthetic datasets from natural-language descriptions. Perfect for testing, development, ML training, and demos.
## Features

| Feature | Description |
|---|---|
| 🗣️ Natural Language → Schema | Describe your data needs in plain English |
| 🤖 LLM-Powered Intelligence | Uses Groq/OpenAI for smart schema generation |
| 🔗 Multi-Table Relationships | Automatic foreign-key handling with referential integrity |
| 📊 Statistical Distributions | Normal, uniform, Poisson, exponential, and more |
| 🎯 Business Constraints | Sum limits, ratios, temporal ordering |
| ⚡ Blazing Fast | 250,000 rows/second with vectorized NumPy |
| 🧠 Smart Value Generation | Domain-aware realistic values (medical, HR, retail) |
| 📈 Reverse Engineering | Describe a chart, get matching data |
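The statistical distributions above map directly onto NumPy's vectorized samplers. A minimal sketch of the idea (illustrative only, not Misata's internal API):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# One vectorized draw fills an entire column at once
age = rng.normal(loc=35, scale=10, size=n)        # normal
price = rng.uniform(low=10, high=500, size=n)     # uniform
logins = rng.poisson(lam=3, size=n)               # Poisson
wait_s = rng.exponential(scale=2.0, size=n)       # exponential
```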
## Installation

```bash
pip install misata
```

## Configuration

```bash
# Option 1: Environment variable
export GROQ_API_KEY=your_key_here

# Option 2: .env file
echo "GROQ_API_KEY=your_key_here" > .env
```

Get a free API key at console.groq.com.
## Quick Start

```bash
# From natural language
misata generate --story "SaaS company with 10K users, 20% churn in Q3" --output-dir ./data

# With LLM intelligence
misata generate --story "Hospital with patients and doctors" --use-llm --output-dir ./hospital

# From industry template
misata template saas --output-dir ./saas_data
```

## Python API

```python
from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# With LLM (recommended)
llm = LLMSchemaGenerator()
config = llm.generate_from_story("Fitness app with 50K users, workout tracking")

# Generate data
simulator = DataSimulator(config, seed=42)
simulator.export_to_csv("./fitness_data")

# Or iterate over batches
for table_name, batch_df in simulator.generate_all():
    print(f"Generated {len(batch_df)} rows for {table_name}")
```

## Example Output

Generated from: *"Hospital with patients and doctors"*
```text
patients.csv (10,000 rows)
├── id, name, date_of_birth, blood_type, doctor_id
├── Referential integrity with doctors table ✓
└── Realistic column distributions ✓

doctors.csv (100 rows)
├── id, name, specialty, department, hire_date
└── LLM-generated realistic specialties ✓

appointments.csv (25,000 rows)
├── id, patient_id, doctor_id, date, diagnosis
├── Foreign keys to both tables ✓
└── Temporal constraints (appointment after hire) ✓
```
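Referential integrity and temporal ordering of this kind can be reproduced with a few vectorized operations. A simplified sketch of the idea (not Misata's internals), assuming integer ids and day-offset dates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_doctors, n_appointments = 100, 25_000

doctor_ids = np.arange(1, n_doctors + 1)
hire_day = rng.integers(0, 3650, size=n_doctors)    # hire date as a day offset

# Foreign key: every appointment references an existing doctor
appt_doctor = rng.choice(doctor_ids, size=n_appointments)

# Temporal constraint: appointment strictly after that doctor's hire date
appt_day = hire_day[appt_doctor - 1] + 1 + rng.integers(0, 365, size=n_appointments)

assert np.isin(appt_doctor, doctor_ids).all()        # referential integrity
assert (appt_day > hire_day[appt_doctor - 1]).all()  # temporal ordering
```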
## Use Cases

| Use Case | How Misata Helps |
|---|---|
| Unit Testing | Generate consistent test fixtures |
| Load Testing | Create millions of rows quickly |
| ML Training | Synthetic training data with realistic patterns |
| Demo Data | Beautiful, realistic data for demos |
| Development | No more waiting for production data |
| Privacy | No PII in synthetic data |
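For test fixtures, the key property is determinism: the same seed yields the same data on every run. A self-contained illustration of the seeding pattern (using plain NumPy rather than Misata's API):

```python
import numpy as np

def make_fixture(seed: int, n: int = 100) -> np.ndarray:
    """Generate a deterministic synthetic column from a seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=35, scale=10, size=n)

run_a = make_fixture(seed=42)
run_b = make_fixture(seed=42)
assert np.array_equal(run_a, run_b)  # identical across runs
```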
## CLI Reference

| Command | Description |
|---|---|
| `misata generate` | Generate data from story or config |
| `misata template` | Use an industry template |
| `misata graph` | Reverse-engineer from chart description |
| `misata parse` | Preview generated schema config |
| `misata serve` | Start API server for web UI |
| `misata templates` | List available templates |
### Advanced Usage

```bash
# Generate with a specific row count
misata generate --story "E-commerce with orders" --rows 100000

# Use a different LLM provider
misata generate --story "..." --use-llm --provider openai

# Export as Parquet
misata generate --story "..." --format parquet

# Seed for reproducibility
misata generate --story "..." --seed 42
```

## Architecture

```text
┌─────────────────────────────────────────────────────┐
│                  Natural Language                   │
│          "SaaS company with 50K users..."           │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│                 LLMSchemaGenerator                  │
│  • Groq (Llama 3.3) / OpenAI (GPT-4) / Ollama       │
│  • Generates SchemaConfig with relationships        │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│                    DataSimulator                    │
│  • Topological sort for dependencies                │
│  • Vectorized NumPy generation                      │
│  • Batch processing for large datasets              │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│                CSV / Parquet / JSON                 │
│         users.csv, orders.csv, events.csv           │
└─────────────────────────────────────────────────────┘
```
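The dependency-ordering step above can be sketched with the standard library's `graphlib`; a simplified illustration (Misata's actual implementation may differ):

```python
from graphlib import TopologicalSorter

# Child table -> parent tables it references via foreign keys.
dependencies = {
    "orders": {"users", "products"},
    "events": {"users"},
    "users": set(),
    "products": set(),
}

# static_order() yields parents before the children that depend on
# them, so foreign-key columns can sample from ids that already exist.
order = list(TopologicalSorter(dependencies).static_order())
```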
## Custom Schemas

```python
from misata import SchemaConfig, Table, Column, Relationship

config = SchemaConfig(
    name="My Schema",
    tables=[
        Table(name="users", row_count=10000),
        Table(name="orders", row_count=50000),
    ],
    columns={
        "users": [
            Column(name="id", type="int", distribution_params={"distribution": "sequence"}),
            Column(name="name", type="text", distribution_params={"text_type": "name"}),
            Column(name="email", type="text", distribution_params={"text_type": "email"}),
            Column(name="age", type="int", distribution_params={"distribution": "normal", "mean": 35, "std": 10}),
        ],
        "orders": [
            Column(name="id", type="int", distribution_params={"distribution": "sequence"}),
            Column(name="user_id", type="foreign_key"),
            Column(name="amount", type="float", distribution_params={"min": 10, "max": 500}),
            Column(name="status", type="categorical", distribution_params={"choices": ["pending", "shipped", "delivered"]}),
        ],
    },
    relationships=[
        Relationship(parent_table="users", child_table="orders", parent_key="id", child_key="user_id"),
    ],
)
```

## Templates

Pre-built schemas for common use cases:
| Template | Tables | Description |
|---|---|---|
| `saas` | users, subscriptions, events | SaaS company with churn |
| `ecommerce` | customers, products, orders | Online retail |
| `fitness` | users, exercises, workouts | Fitness app |
| `healthcare` | patients, doctors, appointments | Hospital system |

```bash
misata template ecommerce --scale 2.0 --output-dir ./data
```

## Performance

| Dataset Size | Time | Speed |
|---|---|---|
| 10,000 rows | 0.04s | 250K rows/sec |
| 100,000 rows | 0.4s | 250K rows/sec |
| 1,000,000 rows | 4s | 250K rows/sec |
Vectorized NumPy operations keep throughput roughly constant as dataset size grows, so generation time scales linearly with row count.
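The speedup comes from replacing per-row Python loops with whole-column NumPy operations. A minimal comparison of the two styles (illustrative only, not a benchmark of Misata itself):

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(42)

# Row-by-row: one Python-level call per value (slow)
slow = np.array([rng.normal(35, 10) for _ in range(n)])

# Vectorized: a single call fills the whole column (fast)
fast = np.random.default_rng(42).normal(35, 10, size=n)

# Both produce a normal column of the same shape; the vectorized
# form avoids the Python interpreter overhead on every row.
```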
## Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

```bash
# Clone the repo
git clone https://github.com/rasinmuhammed/misata.git
cd misata

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/
```

## License

MIT License - see LICENSE for details.
## Acknowledgments

- Groq for fast LLM inference
- NumPy for vectorized operations
- Pydantic for data validation
- Faker for realistic fake data
Made with ❤️ by Muhammed Rasin