dataprof logo

dataprof

Fast, reliable data quality assessment for CSV, Parquet, and databases


20x faster than pandas, with streaming support for files of any size. ISO 8000/25012 compliant quality metrics, automatic pattern detection (emails, IPs, IBANs, etc.), and comprehensive statistics (mean, median, skewness, kurtosis). Available as a CLI, Rust library, or Python package.

🔒 Privacy First: 100% local processing, no telemetry, read-only DB access. See what dataprof analyzes →

Quick Start

CLI Installation

# Install from crates.io
cargo install dataprof

# Or use Python
pip install dataprof

CLI Usage

# Analyze a file
dataprof-cli analyze data.csv

# Generate HTML report
dataprof-cli report data.csv -o report.html

# Batch process directories
dataprof-cli batch /data/folder --recursive --parallel

# Database profiling
dataprof-cli database postgres://user:pass@host/db --table users

More options: dataprof-cli --help | Full CLI Guide
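
For pipelines and scheduled jobs, the CLI can also be driven from a script. Below is a minimal sketch using Python's standard subprocess module; it assumes dataprof-cli is on your PATH and that a non-zero exit code signals failure (an assumption, not documented behavior):

import subprocess
from pathlib import Path

def profile_to_html(csv_path: str, out_dir: str = "reports") -> Path:
    """Run `dataprof-cli report` on a CSV and return the HTML report path."""
    Path(out_dir).mkdir(exist_ok=True)
    out_file = Path(out_dir) / (Path(csv_path).stem + ".html")
    # Mirrors the `report data.csv -o report.html` usage shown above.
    subprocess.run(
        ["dataprof-cli", "report", csv_path, "-o", str(out_file)],
        check=True,  # raise CalledProcessError on non-zero exit (assumed convention)
    )
    return out_file

print(profile_to_html("data.csv"))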

Python API

import asyncio
import dataprof

# Quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)

# Async database profiling
async def profile_db():
    return await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users",
        batch_size=1000,
        calculate_quality=True
    )

db_report = asyncio.run(profile_db())
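
The per-column statistics and pattern detection advertised above (mean, median, skewness, kurtosis, email/IP/IBAN patterns) are exposed on the returned report. A hedged sketch of inspecting them; the attribute names here (column_profiles, name, mean, skewness) are hypothetical placeholders, not confirmed API, so check the Python documentation below for the real field names:

import dataprof

report = dataprof.analyze_csv_with_quality("data.csv")

# NOTE: `column_profiles`, `name`, `mean`, and `skewness` are illustrative
# attribute names only -- consult the Python docs for the actual accessors.
for col in report.column_profiles:
    print(col.name, col.mean, col.skewness)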

Python Documentation | Integrations (Pandas, scikit-learn, Jupyter, Airflow, dbt)
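
A common integration pattern with pandas is to gate a DataFrame load on the dataprof quality score. A minimal sketch, assuming pandas is installed; only analyze_csv_with_quality and quality_score() come from the API above, and the 80% threshold is an arbitrary illustration:

import dataprof
import pandas as pd

QUALITY_THRESHOLD = 80.0  # illustrative cut-off, tune per dataset

report = dataprof.analyze_csv_with_quality("data.csv")
score = report.quality_score()

if score >= QUALITY_THRESHOLD:
    df = pd.read_csv("data.csv")  # quality is acceptable, load for analysis
    print(f"Loaded {len(df)} rows at {score:.1f}% quality")
else:
    raise ValueError(f"Data quality too low: {score:.1f}%")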

Rust Library

use dataprof::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Adaptive profiling (recommended)
    let profiler = DataProfiler::auto();
    let report = profiler.analyze_file("dataset.csv")?;

    // Arrow for large files (>100MB, requires --features arrow)
    let profiler = DataProfiler::columnar();
    let report = profiler.analyze_csv_file("large_dataset.csv")?;

    Ok(())
}

Development

# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release

# Test databases (optional)
docker-compose -f .devcontainer/docker-compose.yml up -d

# Common tasks
cargo test          # Run tests
cargo bench         # Benchmarks
cargo clippy        # Linting

Development Guide | Performance Guide

Feature Flags

# Minimal (CSV/JSON only)
cargo build --release

# With Apache Arrow (large files >100MB)
cargo build --release --features arrow

# With Parquet support
cargo build --release --features parquet

# With databases
cargo build --release --features postgres,mysql,sqlite

# Python async support
maturin develop --features python-async,database,postgres

# All features
cargo build --release --all-features

When to use Arrow: large files (>100MB), many columns (>20), uniform types.
When to use Parquet: analytics, data lakes, Spark/Pandas integration.

Documentation

User Guides: CLI Reference | Python API | Python Integrations | Database Connectors | Apache Arrow

Developer: Development Guide | Performance Guide | Benchmarks

Privacy: What DataProf Does - Complete transparency with source verification

🤝 Contributing

We welcome contributions from everyone! Whether you want to:

  • Fix a bug πŸ›
  • Add a feature ✨
  • Improve documentation πŸ“š
  • Report an issue πŸ“

Quick Start for Contributors

  1. Fork & clone:

    git clone https://github.com/YOUR-USERNAME/dataprof.git
    cd dataprof
  2. Build & test:

    cargo build
    cargo test
  3. Create a feature branch:

    git checkout -b feature/your-feature-name
  4. Before submitting PR:

    cargo fmt --all
    cargo clippy --all --all-targets
    cargo test --all
  5. Submit a Pull Request with a clear description

📖 Full Contributing Guide →

All contributions are welcome. Please read CONTRIBUTING.md for guidelines and our Code of Conduct.

License

MIT License - See LICENSE for details.
