20x faster than pandas, with streaming support for arbitrarily large files. ISO 8000/25012-compliant quality metrics, automatic pattern detection (emails, IPs, IBANs, etc.), and comprehensive statistics (mean, median, skewness, kurtosis). Available as a CLI, Rust library, or Python package.
Privacy First: 100% local processing, no telemetry, read-only DB access. See what dataprof analyzes →
```bash
# Install from crates.io
cargo install dataprof

# Or use Python
pip install dataprof
```

```bash
# Analyze a file
dataprof-cli analyze data.csv

# Generate HTML report
dataprof-cli report data.csv -o report.html

# Batch process directories
dataprof-cli batch /data/folder --recursive --parallel

# Database profiling
dataprof-cli database postgres://user:pass@host/db --table users
```

More options: `dataprof-cli --help` | Full CLI Guide
```python
import dataprof

# Quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)

# Async database profiling
async def profile_db():
    result = await dataprof.profile_database_async(
        "postgresql://user:pass@localhost/db",
        "SELECT * FROM users",
        batch_size=1000,
        calculate_quality=True
    )
    return result
```

Python Documentation | Integrations (Pandas, scikit-learn, Jupyter, Airflow, dbt)
```rust
use dataprof::*;

// Adaptive profiling (recommended)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;

// Arrow for large files (>100MB, requires --features arrow)
let profiler = DataProfiler::columnar();
let report = profiler.analyze_csv_file("large_dataset.csv")?;
```
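The snippets above elide the surrounding function. A minimal complete program might look like the sketch below; it assumes only the calls shown above, that their errors convert into `Box<dyn Error>`, and that the report type derives `Debug` for printing.

```rust
use dataprof::*;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Adaptive profiling, as in the snippet above
    let profiler = DataProfiler::auto();
    let report = profiler.analyze_file("dataset.csv")?;

    // Assumes the report type derives Debug; swap in the
    // report's own accessors once you know its shape.
    println!("{:#?}", report);
    Ok(())
}
```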
```bash
# Setup
git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release

# Test databases (optional)
docker-compose -f .devcontainer/docker-compose.yml up -d

# Common tasks
cargo test    # Run tests
cargo bench   # Benchmarks
cargo clippy  # Linting
```

Development Guide | Performance Guide
```bash
# Minimal (CSV/JSON only)
cargo build --release

# With Apache Arrow (large files >100MB)
cargo build --release --features arrow

# With Parquet support
cargo build --release --features parquet

# With databases
cargo build --release --features postgres,mysql,sqlite

# Python async support
maturin develop --features python-async,database,postgres

# All features
cargo build --release --all-features
```

When to use Arrow: large files (>100MB), many columns (>20), uniform types.
When to use Parquet: analytics, data lakes, Spark/Pandas integration.
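The Arrow guidance above can also be applied at the call site. A minimal sketch, assuming the `columnar()`/`auto()` constructors shown earlier and a build with `--features arrow`; the 100 MB file-size check is an illustrative heuristic, not a dataprof built-in:

```rust
use dataprof::*;
use std::{error::Error, fs};

/// Profile `path`, preferring the Arrow-backed columnar engine for
/// large inputs. The 100 MB cutoff mirrors the guidance above and
/// is an assumption, not a library constant.
fn profile(path: &str) -> Result<(), Box<dyn Error>> {
    const ARROW_CUTOFF_BYTES: u64 = 100 * 1024 * 1024;
    if fs::metadata(path)?.len() > ARROW_CUTOFF_BYTES {
        // Columnar (Arrow) path for large files
        let report = DataProfiler::columnar().analyze_csv_file(path)?;
        println!("{:#?}", report); // assumes Debug on the report type
    } else {
        // Adaptive path for small and medium files
        let report = DataProfiler::auto().analyze_file(path)?;
        println!("{:#?}", report);
    }
    Ok(())
}
```

Since `DataProfiler::auto()` is described as adaptive, it may already make a similar choice internally; an explicit check like this is only useful when you want to force the columnar engine.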
User Guides: CLI Reference | Python API | Python Integrations | Database Connectors | Apache Arrow
Developer: Development Guide | Performance Guide | Benchmarks
Privacy: What DataProf Does - Complete transparency with source verification
We welcome contributions from everyone! Whether you want to:
- Fix a bug
- Add a feature
- Improve documentation
- Report an issue
- Fork & clone:
  ```bash
  git clone https://github.com/YOUR-USERNAME/dataprof.git
  cd dataprof
  ```
- Build & test:
  ```bash
  cargo build
  cargo test
  ```
- Create a feature branch:
  ```bash
  git checkout -b feature/your-feature-name
  ```
- Before submitting a PR:
  ```bash
  cargo fmt --all
  cargo clippy --all --all-targets
  cargo test --all
  ```
- Submit a Pull Request with a clear description

Full Contributing Guide →
All contributions are welcome. Please read CONTRIBUTING.md for guidelines and our Code of Conduct.
MIT License - See LICENSE for details.
