poinT92 edited this page Sep 25, 2025 · 14 revisions

dataprof Wiki

Welcome to the DataProf wiki! This is your comprehensive guide to using DataProf for fast, efficient data profiling, ML readiness assessment, and automated preprocessing code generation.

🚀 Key Features

🐍 Actionable Code Generation - DataProf generates ready-to-use Python code for every ML recommendation!

  • Immediate Implementation: Get executable code for preprocessing steps
  • Framework Integration: Works with pandas, scikit-learn, and popular ML libraries
  • Complete Workflows: Generate entire preprocessing pipelines
  • Smart Recommendations: Context-aware suggestions based on your data

Go from "Your data has missing values" to "Here's the exact code to fix it: df['age'].fillna(df['age'].median(), inplace=True)".
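The one-liner quoted above calls `fillna(..., inplace=True)` on a column selection, a pattern that newer pandas releases warn about and that may leave `df` unchanged when copy-on-write is enabled. As a minimal sketch, here is the same median imputation written as an explicit assignment. The `age` column and its values are made up for illustration and are not DataProf output.

```python
import pandas as pd

# Toy data; the 'age' column and its values are illustrative only.
df = pd.DataFrame({"age": [22.0, None, 35.0, 41.0, None]})

# Compute the median over the non-missing values (pandas skips NaN by default).
median_age = df["age"].median()

# Assign the filled column back instead of using inplace=True on a selection.
df["age"] = df["age"].fillna(median_age)

print(df["age"].tolist())
```

Assigning the result back keeps the intent explicit and behaves the same way whether or not copy-on-write is active.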

Written with ❤️ by your friendly neighborhood Maintainer, Andrea. I really hope you enjoy your stay here, using dataprof!

🚀 Quick Links

📚 Documentation Pages

Getting Started

  • Database Connectors - Direct database profiling for PostgreSQL, MySQL, SQLite, and DuckDB
  • CLI Guide - General usage guide to the dataprof CLI commands and features

Python Usage & ML Features

  • Python API Reference - Complete reference for all DataProf Python functions and classes, including code snippet generation APIs.
  • ML Features Guide - Complete guide to ML readiness assessment and automated preprocessing code generation.
  • Ecosystem Integrations - Complete guide to integrating DataProf with popular data science and ML tools.

Advanced Features

  • Apache Arrow Integration - High-performance columnar processing with 20x memory efficiency for large datasets
  • Performance Guide - Comprehensive performance analysis, benchmarks, and optimization tips
  • Benchmarking - The new dataprof benchmarking system explained, along with planned features

📖 Development & Contribution

Development Setup

Contribution Guidelines

🎯 Quick Examples

Get Actionable Code Snippets

import dataprof

# Get ML readiness with code snippets
ml_score = dataprof.ml_readiness_score("data.csv")

for rec in ml_score.recommendations:
    if rec.code_snippet:
        print(f"📋 {rec.category}: {rec.description}")
        print(f"💻 Code: {rec.code_snippet}")
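If you would rather collect the snippets than print them, the loop above can be wrapped in a small helper. The sketch below is only an illustration. It assumes each recommendation exposes the `category` and `code_snippet` attributes used above. The helper name `snippets_by_category`, the stand-in objects, and the category names are hypothetical and not part of the DataProf API.

```python
from collections import defaultdict
from types import SimpleNamespace


def snippets_by_category(recommendations):
    """Group non-empty code snippets by recommendation category."""
    grouped = defaultdict(list)
    for rec in recommendations:
        # Skip recommendations that carry no code snippet, as in the loop above.
        if rec.code_snippet:
            grouped[rec.category].append(rec.code_snippet)
    return dict(grouped)


# Stand-in objects for illustration only; real ones would come from
# ml_score.recommendations. Category names here are invented.
sample = [
    SimpleNamespace(
        category="missing_values",
        description="Impute age",
        code_snippet="df['age'] = df['age'].fillna(df['age'].median())",
    ),
    SimpleNamespace(
        category="encoding",
        description="No action needed",
        code_snippet=None,
    ),
]

for category, snippets in snippets_by_category(sample).items():
    print(category, len(snippets))
```

With real data you would pass `ml_score.recommendations` in place of `sample`.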

Generate Complete Preprocessing Script

# Generate full preprocessing pipeline
dataprof data.csv --ml-score --output-script preprocess.py

📖 Additional Resources

For more information, check out the main repository documentation:

  • README.md - Project overview and basic usage
  • CHANGELOG.md - Version history and latest features
  • Archive - Historical documentation and roadmaps
