A Python library for understanding dirty data in machine learning pipelines.
Saladin provides tools to analyze and understand messy, real-world datasets before preprocessing. It goes beyond basic statistics to offer insights into data quality, categorical stability, missing patterns, and transformation readiness.
- Data Quality Assessment: Multi-dimensional evaluation of completeness, consistency, validity, and uniqueness.
- Categorical Stability: Analyze how stable categorical features are across your dataset.
- Missing Patterns: Detect random, systematic, or mostly missing data patterns.
- Feature Relationships: Discover correlations and semantic groupings.
- Transformation Readiness: Estimate how hard it will be to clean and transform your data.
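To make a few of these dimensions concrete, here is a minimal sketch in plain Python of what a completeness ratio, a uniqueness ratio, and a coarse missing-data label could look like for a single column. It is not Saladin's implementation: `column_profile`, `missing_label`, and the 0.5 cutoff are hypothetical names and values chosen for illustration, and it makes no attempt to separate random from systematic missingness.

```python
# Illustrative only: plain-Python column checks, not Saladin's implementation.
from typing import Any


def column_profile(values: list[Any]) -> dict[str, float]:
    """Return completeness and uniqueness ratios for one column."""
    total = len(values)
    present = [v for v in values if v is not None]
    completeness = len(present) / total if total else 0.0
    # Uniqueness is measured over non-missing values only.
    uniqueness = len(set(present)) / len(present) if present else 0.0
    return {"completeness": completeness, "uniqueness": uniqueness}


def missing_label(completeness: float, mostly_missing_below: float = 0.5) -> str:
    """Hypothetical threshold-based label; the 0.5 cutoff is an assumption."""
    if completeness == 1.0:
        return "complete"
    if completeness < mostly_missing_below:
        return "mostly missing"
    return "partially missing"


data = {
    "age": [25, 30, None, 40],
    "income": [30000, 50000, 60000, 80000],
    "city": ["NYC", "LA", "NYC", "LA"],
}

profiles = {name: column_profile(col) for name, col in data.items()}
for name, profile in profiles.items():
    print(name, profile, missing_label(profile["completeness"]))
```

Ratios like these only describe one column at a time; checks for consistency, validity, or relationships between features need more context than this sketch uses.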
pip install saladin

Or from source:
git clone https://github.com/lycoriolis/saladin.git
cd saladin
pip install -e .

Usage:
import polars as pl
from saladin import DataUnderstandingEngine
# Load your dirty data
data = pl.DataFrame({
'age': [25, 30, None, 40],
'income': [30000, 50000, 60000, 80000],
'city': ['NYC', 'LA', 'NYC', 'LA']
})
engine = DataUnderstandingEngine()
understanding = engine.understand(data)
print(engine.summary(understanding))

Requirements:
- Python 3.10+
- Polars
- NumPy
This project is licensed under a custom license that prohibits commercial use. See the LICENSE file for details.