Saladin

A Python library for understanding dirty data in machine learning pipelines.

Overview

Saladin analyzes messy, real-world datasets before preprocessing. Beyond basic statistics, it reports on data quality, categorical stability, missing-data patterns, and transformation readiness.

Features

  • Data Quality Assessment: Multi-dimensional evaluation of completeness, consistency, validity, and uniqueness.
  • Categorical Stability: Analyze how stable categorical features are across your dataset.
  • Missing Patterns: Detect whether missing values look random, look systematic, or make up most of a column.
  • Feature Relationships: Discover correlations and semantic groupings.
  • Transformation Readiness: Estimate how hard it will be to clean and transform your data.
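
To make two of the quality dimensions above concrete, here is a minimal sketch of per-column completeness and uniqueness in plain Python. It is an illustration of the concepts only, not Saladin's implementation or API: the function name `column_profile` and the exact ratio definitions are assumptions chosen for this example.

```python
# Illustrative only: `column_profile` is a hypothetical helper, not part of Saladin.
def column_profile(columns: dict[str, list]) -> dict[str, dict[str, float]]:
    """Return completeness and uniqueness ratios for each column.

    completeness = non-missing values / all values
    uniqueness   = distinct non-missing values / non-missing values
    """
    profile = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        total = len(values)
        completeness = len(present) / total if total else 0.0
        uniqueness = len(set(present)) / len(present) if present else 0.0
        profile[name] = {"completeness": completeness, "uniqueness": uniqueness}
    return profile


data = {
    "age": [25, 30, None, 40],
    "income": [30000, 50000, 60000, 80000],
    "city": ["NYC", "LA", "NYC", "LA"],
}

profile = column_profile(data)
for name, stats in profile.items():
    print(name, stats)
```

In this sample, `age` has a completeness of 0.75 because one of its four values is missing, and `city` has a uniqueness of 0.5 because it holds two distinct values across four rows. A real assessment would likely weigh more dimensions, such as consistency and validity, than this sketch does.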

Installation

pip install saladin

Or from source:

git clone https://github.com/lycoriolis/saladin.git
cd saladin
pip install -e .

Usage

import polars as pl
from saladin import DataUnderstandingEngine

# Build a small example frame; note the missing value in 'age'
data = pl.DataFrame({
    'age': [25, 30, None, 40],
    'income': [30000, 50000, 60000, 80000],
    'city': ['NYC', 'LA', 'NYC', 'LA']
})

# Run the analysis on the frame
engine = DataUnderstandingEngine()
understanding = engine.understand(data)

# Print a summary of the result
print(engine.summary(understanding))

Requirements

  • Python 3.10+
  • Polars
  • NumPy

License

This project is licensed under a custom license that prohibits commercial use. See the LICENSE file for details.
