High-performance open-source synthetic data engine. Uses LLMs for schema design and vectorized NumPy for deterministic, scalable generation.


# Misata

**AI-Powered Synthetic Data Engine**


Generate realistic, multi-table synthetic datasets from natural language descriptions.
Perfect for testing, development, ML training, and demos.


## ✨ Features

| Feature | Description |
|---------|-------------|
| 🗣️ Natural Language → Schema | Describe your data needs in plain English |
| 🤖 LLM-Powered Intelligence | Uses Groq/OpenAI for smart schema generation |
| 🔗 Multi-Table Relationships | Automatic foreign-key handling with referential integrity |
| 📊 Statistical Distributions | Normal, uniform, Poisson, exponential, and more |
| 🎯 Business Constraints | Sum limits, ratios, temporal ordering |
| ⚡ Blazing Fast | 250,000 rows/second with vectorized NumPy |
| 🧠 Smart Value Generation | Domain-aware realistic values (medical, HR, retail) |
| 📈 Reverse Engineering | Describe a chart, get matching data |

πŸš€ Quick Start

Installation

pip install misata

Set up your LLM API key

# Option 1: Environment variable
export GROQ_API_KEY=your_key_here

# Option 2: .env file
echo "GROQ_API_KEY=your_key_here" > .env

Get a free API key at console.groq.com

### Generate Your First Dataset

```bash
# From natural language
misata generate --story "SaaS company with 10K users, 20% churn in Q3" --output-dir ./data

# With LLM intelligence
misata generate --story "Hospital with patients and doctors" --use-llm --output-dir ./hospital

# From an industry template
misata template saas --output-dir ./saas_data
```

### Python API

```python
from misata import DataSimulator, SchemaConfig
from misata.llm_parser import LLMSchemaGenerator

# With LLM (recommended)
llm = LLMSchemaGenerator()
config = llm.generate_from_story("Fitness app with 50K users, workout tracking")

# Generate data
simulator = DataSimulator(config, seed=42)
simulator.export_to_csv("./fitness_data")

# Or iterate over batches
for table_name, batch_df in simulator.generate_all():
    print(f"Generated {len(batch_df)} rows for {table_name}")
```

πŸ“Š Example Output

# Generated from: "Hospital with patients and doctors"

patients.csv (10,000 rows)
β”œβ”€β”€ id, name, date_of_birth, blood_type, doctor_id
β”œβ”€β”€ Referential integrity with doctors table βœ“
└── Realistic column distributions βœ“

doctors.csv (100 rows)
β”œβ”€β”€ id, name, specialty, department, hire_date
└── LLM-generated realistic specialties βœ“

appointments.csv (25,000 rows)
β”œβ”€β”€ id, patient_id, doctor_id, date, diagnosis
β”œβ”€β”€ Foreign keys to both tables βœ“
└── Temporal constraints (appointment after hire) βœ“
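If you want to spot-check the referential-integrity claim on your own output, a small standard-library script like the one below works on any parent/child CSV pair. The function name, file names, and sample rows here are illustrative, not part of Misata's API:

```python
import csv
import tempfile
from pathlib import Path

def foreign_key_violations(child_csv, parent_csv, fk_col, pk_col="id"):
    """Return FK values in the child table with no matching parent row."""
    with open(parent_csv, newline="") as f:
        parent_ids = {row[pk_col] for row in csv.DictReader(f)}
    with open(child_csv, newline="") as f:
        return [row[fk_col] for row in csv.DictReader(f)
                if row[fk_col] not in parent_ids]

# Tiny self-contained demo with stand-in data
tmp = Path(tempfile.mkdtemp())
(tmp / "doctors.csv").write_text("id,name\n1,Dr. Lee\n2,Dr. Ruiz\n")
(tmp / "patients.csv").write_text("id,doctor_id\n10,1\n11,2\n12,1\n")

violations = foreign_key_violations(tmp / "patients.csv", tmp / "doctors.csv", "doctor_id")
print(violations)  # -> [] when every doctor_id exists in doctors.csv
```

Point it at the generated `patients.csv`/`doctors.csv` instead of the temp files to validate a real run.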

## 🎯 Use Cases

| Use Case | How Misata Helps |
|----------|------------------|
| Unit Testing | Generate consistent test fixtures |
| Load Testing | Create millions of rows quickly |
| ML Training | Synthetic training data with realistic patterns |
| Demo Data | Beautiful, realistic data for demos |
| Development | No more waiting for production data |
| Privacy | No PII in synthetic data |
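The "consistent test fixtures" row rests on seeded generation: the same seed always produces the same rows, which is what passing `seed=42` to `DataSimulator` buys you. A minimal NumPy sketch of the underlying idea (the function and its parameters are illustrative, not Misata internals):

```python
import numpy as np

def fixture_ages(seed, n=5):
    """Deterministic sample column: ages ~ Normal(35, 10), clipped to 18-90."""
    rng = np.random.default_rng(seed)
    return np.clip(rng.normal(35, 10, n), 18, 90).round().astype(int)

# Same seed -> byte-identical fixture data on every test run
assert (fixture_ages(42) == fixture_ages(42)).all()

# A different seed gives a different (but equally reproducible) table
print(fixture_ages(42), fixture_ages(7))
```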

πŸ“– Documentation

πŸ’» CLI Commands

Command Description
misata generate Generate data from story or config
misata template Use an industry template
misata graph Reverse-engineer from chart description
misata parse Preview generated schema config
misata serve Start API server for web UI
misata templates List available templates

### Examples

```bash
# Generate with a specific row count
misata generate --story "E-commerce with orders" --rows 100000

# Use a different LLM provider
misata generate --story "..." --use-llm --provider openai

# Export as Parquet
misata generate --story "..." --format parquet

# Set a seed for reproducibility
misata generate --story "..." --seed 42
```

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Natural Language                  β”‚
β”‚         "SaaS company with 50K users..."            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              LLMSchemaGenerator                      β”‚
β”‚  β€’ Groq (Llama 3.3) / OpenAI (GPT-4) / Ollama       β”‚
β”‚  β€’ Generates SchemaConfig with relationships        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  DataSimulator                       β”‚
β”‚  β€’ Topological sort for dependencies                β”‚
β”‚  β€’ Vectorized NumPy generation                      β”‚
β”‚  β€’ Batch processing for large datasets              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                CSV / Parquet / JSON                  β”‚
β”‚         users.csv, orders.csv, events.csv           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ Schema Configuration

from misata import SchemaConfig, Table, Column, Relationship

config = SchemaConfig(
    name="My Schema",
    tables=[
        Table(name="users", row_count=10000),
        Table(name="orders", row_count=50000),
    ],
    columns={
        "users": [
            Column(name="id", type="int", distribution_params={"distribution": "sequence"}),
            Column(name="name", type="text", distribution_params={"text_type": "name"}),
            Column(name="email", type="text", distribution_params={"text_type": "email"}),
            Column(name="age", type="int", distribution_params={"distribution": "normal", "mean": 35, "std": 10}),
        ],
        "orders": [
            Column(name="id", type="int", distribution_params={"distribution": "sequence"}),
            Column(name="user_id", type="foreign_key"),
            Column(name="amount", type="float", distribution_params={"min": 10, "max": 500}),
            Column(name="status", type="categorical", distribution_params={"choices": ["pending", "shipped", "delivered"]}),
        ],
    },
    relationships=[
        Relationship(parent_table="users", child_table="orders", parent_key="id", child_key="user_id"),
    ],
)

## 🏭 Templates

Pre-built schemas for common use cases:

| Template | Tables | Description |
|----------|--------|-------------|
| `saas` | users, subscriptions, events | SaaS company with churn |
| `ecommerce` | customers, products, orders | Online retail |
| `fitness` | users, exercises, workouts | Fitness app |
| `healthcare` | patients, doctors, appointments | Hospital system |

```bash
misata template ecommerce --scale 2.0 --output-dir ./data
```

## ⚡ Performance

| Dataset Size | Time | Speed |
|--------------|------|-------|
| 10,000 rows | 0.04s | 250K rows/sec |
| 100,000 rows | 0.4s | 250K rows/sec |
| 1,000,000 rows | 4s | 250K rows/sec |

Vectorized NumPy operations keep throughput roughly constant (~250K rows/sec), so generation time scales linearly with row count.
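You can get a feel for where numbers like these come from with a quick micro-benchmark. Note this times raw NumPy column generation, not Misata itself, and absolute throughput depends on your hardware:

```python
import time
import numpy as np

n = 1_000_000
rng = np.random.default_rng(42)

start = time.perf_counter()
ages = np.clip(rng.normal(35, 10, n), 18, 90)                  # one vectorized call per column
amounts = rng.uniform(10, 500, n)
status = rng.choice(["pending", "shipped", "delivered"], n)
elapsed = time.perf_counter() - start

print(f"{n / elapsed:,.0f} rows/sec for a 3-column table")
```

An equivalent per-row Python loop is typically orders of magnitude slower, which is the whole case for the vectorized design.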

## 🤝 Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

```bash
# Clone the repo
git clone https://github.com/rasinmuhammed/misata.git
cd misata

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/
```

πŸ“„ License

MIT License - see LICENSE for details.

πŸ™ Acknowledgments

  • Groq for fast LLM inference
  • NumPy for vectorized operations
  • Pydantic for data validation
  • Faker for realistic fake data

Made with ❀️ by Muhammed Rasin
