dataprof logo

dataprof

Fast, reliable data quality assessment for CSV, Parquet, and databases


20x faster than pandas, with streaming support for files of any size. ISO 8000/25012 compliant quality metrics, automatic pattern detection (emails, IPs, IBANs, etc.), and comprehensive statistics (mean, median, skewness, kurtosis). Available as a CLI, Rust library, or Python package.

🔒 Privacy First: 100% local processing, no telemetry, read-only DB access. See what dataprof analyzes →

Quick Start

CLI Installation

# Install from crates.io
cargo install dataprof

# Or use Python
pip install dataprof

CLI Usage

# Analyze a file
dataprof-cli analyze data.csv

# Generate HTML report
dataprof-cli report data.csv -o report.html

# Batch process directories
dataprof-cli batch /data/folder --recursive --parallel

# Database profiling
dataprof-cli database postgres://user:pass@host/db --table users

More options: dataprof-cli --help | Full CLI Guide
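
For pipelines and scheduled jobs, the CLI can also be driven from a script. Below is a minimal sketch using Python's standard subprocess module; it assumes dataprof-cli is on your PATH and that a non-zero exit code signals failure (an assumption, not documented behavior):

import subprocess
from pathlib import Path

def profile_to_html(csv_path: str, out_dir: str = "reports") -> Path:
    """Run `dataprof-cli report` on a CSV and return the HTML report path."""
    Path(out_dir).mkdir(exist_ok=True)
    out_file = Path(out_dir) / (Path(csv_path).stem + ".html")
    # Mirrors the `report data.csv -o report.html` usage shown above.
    subprocess.run(
        ["dataprof-cli", "report", csv_path, "-o", str(out_file)],
        check=True,  # raise CalledProcessError on non-zero exit (assumed convention)
    )
    return out_file

print(profile_to_html("data.csv"))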

Python API

import asyncio
import dataprof

# Quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)

# Async database profiling
async def profile_db():
    return await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users",
        batch_size=1000,
        calculate_quality=True
    )

db_report = asyncio.run(profile_db())
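
The per-column statistics and pattern detection advertised above (mean, median, skewness, kurtosis, email/IP/IBAN patterns) are exposed on the returned report. A hedged sketch of inspecting them; the attribute names here (column_profiles, name, mean, skewness) are hypothetical placeholders, not confirmed API, so check the Python documentation below for the real field names:

import dataprof

report = dataprof.analyze_csv_with_quality("data.csv")

# NOTE: `column_profiles`, `name`, `mean`, and `skewness` are illustrative
# attribute names only -- consult the Python docs for the actual accessors.
for col in report.column_profiles:
    print(col.name, col.mean, col.skewness)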

Python Documentation | Integrations (Pandas, scikit-learn, Jupyter, Airflow, dbt)
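
A common integration pattern with pandas is to gate a DataFrame load on the dataprof quality score. A minimal sketch, assuming pandas is installed; only analyze_csv_with_quality and quality_score() come from the API above, and the 80% threshold is an arbitrary illustration:

import dataprof
import pandas as pd

QUALITY_THRESHOLD = 80.0  # illustrative cut-off, tune per dataset

report = dataprof.analyze_csv_with_quality("data.csv")
score = report.quality_score()

if score >= QUALITY_THRESHOLD:
    df = pd.read_csv("data.csv")  # quality is acceptable, load for analysis
    print(f"Loaded {len(df)} rows at {score:.1f}% quality")
else:
    raise ValueError(f"Data quality too low: {score:.1f}%")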

Rust Library

use dataprof::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Adaptive profiling (recommended)
    let profiler = DataProfiler::auto();
    let report = profiler.analyze_file("dataset.csv")?;

    // Arrow for large files (>100MB, requires --features arrow)
    let profiler = DataProfiler::columnar();
    let report = profiler.analyze_csv_file("large_dataset.csv")?;

    Ok(())
}

Development

# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release

# Test databases (optional)
docker-compose -f .devcontainer/docker-compose.yml up -d

# Common tasks
cargo test          # Run tests
cargo bench         # Benchmarks
cargo clippy        # Linting

Development Guide | Performance Guide

Feature Flags

# Minimal (CSV/JSON only)
cargo build --release

# With Apache Arrow (large files >100MB)
cargo build --release --features arrow

# With Parquet support
cargo build --release --features parquet

# With databases
cargo build --release --features postgres,mysql,sqlite

# Python async support
maturin develop --features python-async,database,postgres

# All features
cargo build --release --all-features

When to use Arrow: large files (>100MB), many columns (>20), uniform types.
When to use Parquet: analytics, data lakes, Spark/Pandas integration.

Documentation

User Guides: CLI Reference | Python API | Python Integrations | Database Connectors | Apache Arrow

Developer: Development Guide | Performance Guide | Benchmarks

Privacy: What DataProf Does - Complete transparency with source verification

🤝 Contributing

We welcome contributions from everyone! Whether you want to:

  • Fix a bug πŸ›
  • Add a feature ✨
  • Improve documentation πŸ“š
  • Report an issue πŸ“

Quick Start for Contributors

  1. Fork & clone:

    git clone https://github.com/YOUR-USERNAME/dataprof.git
    cd dataprof
  2. Build & test:

    cargo build
    cargo test
  3. Create a feature branch:

    git checkout -b feature/your-feature-name
  4. Before submitting PR:

    cargo fmt --all
    cargo clippy --all --all-targets
    cargo test --all
  5. Submit a Pull Request with a clear description

📖 Full Contributing Guide →

All contributions are welcome. Please read CONTRIBUTING.md for guidelines and our Code of Conduct.

License

MIT License - See LICENSE for details.
