Version: 0.1.0 (Draft) · Date: 2026-03-13 · Status: Design Phase
f2a is a Python library that loads data from a variety of sources (local files, Hugging Face datasets, etc.) and automatically runs descriptive statistics and generates visualizations.
- One-click Analysis: Full descriptive statistics + visualization from a single file path or HuggingFace URL
- Diverse Input Support: CSV, JSON, Parquet, Excel, TSV, Hugging Face datasets
- Rich Statistics: Summary statistics, distribution analysis, correlation analysis, missing data analysis
- Automatic Visualization: Histograms, boxplots, correlation heatmaps, missing data maps, etc.
- Report Generation: Automatically produce HTML reports from analysis results
import f2a
# Analyze a local file
report = f2a.analyze("data/sales.csv")
report.show() # Print summary to console
report.to_html("out/") # Save HTML report
# Analyze a Hugging Face dataset
report = f2a.analyze("hf://imdb")
report.show()
# Detailed access
report.stats.summary() # Summary statistics DataFrame
report.stats.correlation() # Correlation matrix
report.viz.plot_distributions() # Distribution plots

┌─────────────────────────────────────────────┐
│                 Public API                  │
│         f2a.analyze() / f2a.load()          │
├─────────────────────────────────────────────┤
│              Core Orchestrator              │
│         Analyzer (pipeline control)         │
├──────────┬──────────┬──────────┬────────────┤
│ Loader   │ Stats    │ Viz      │ Reporter   │
│ Data     │ Stat     │ Visual   │ Report     │
│ Loading  │ Analysis │ ization  │ Generation │
├──────────┴──────────┴──────────┴────────────┤
│                  Utilities                  │
│    Type Inference · Validation · Logging    │
└─────────────────────────────────────────────┘
f2a/
├── pyproject.toml  # Build config (PEP 621)
├── README.md  # Project introduction
├── PLAN.md  # This document
├── LICENSE  # MIT License
│
├── src/
│   └── f2a/
│       ├── __init__.py  # Public API exports
│       ├── _version.py  # Version management
│       │
│       ├── core/
│       │   ├── __init__.py
│       │   ├── loader.py  # File/HF data loading
│       │   ├── analyzer.py  # Analysis orchestrator
│       │   └── schema.py  # Column type inference & schema
│       │
│       ├── stats/
│       │   ├── __init__.py
│       │   ├── descriptive.py  # Descriptive stats (mean, median, variance, etc.)
│       │   ├── distribution.py  # Distribution analysis (skewness, kurtosis, normality)
│       │   ├── correlation.py  # Correlation analysis
│       │   └── missing.py  # Missing data analysis
│       │
│       ├── viz/
│       │   ├── __init__.py
│       │   ├── theme.py  # Visualization theme/style
│       │   ├── plots.py  # Basic plots (histogram, bar, box)
│       │   ├── dist_plots.py  # Distribution visualization
│       │   ├── corr_plots.py  # Correlation visualization
│       │   └── missing_plots.py  # Missing data visualization
│       │
│       ├── report/
│       │   ├── __init__.py
│       │   ├── generator.py  # Report generation engine
│       │   └── templates/  # HTML templates
│       │       └── base.html
│       │
│       └── utils/
│           ├── __init__.py
│           ├── type_inference.py  # Automatic data type inference
│           ├── validators.py  # Input validation
│           └── logging.py  # Logging configuration
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py  # pytest fixtures
│   ├── test_loader.py
│   ├── test_descriptive.py
│   ├── test_correlation.py
│   ├── test_viz.py
│   └── test_report.py
│
└── examples/
    ├── quickstart.py  # Quick start example
    └── huggingface_demo.py  # HF dataset example
The loader automatically detects the type of the data source and converts the data to a pandas.DataFrame, whatever the input format.
| Input Type | Detection Method | Conversion Method |
|---|---|---|
| CSV / TSV | Extension .csv, .tsv | pd.read_csv() (sep="\t" for TSV) |
| JSON / JSONL | Extension .json, .jsonl | pd.read_json() (lines=True for JSONL) |
| Parquet | Extension .parquet | pd.read_parquet() |
| Excel | Extension .xlsx, .xls | pd.read_excel() |
| HuggingFace | hf:// prefix or org/dataset pattern | datasets.load_dataset() → .to_pandas() |
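
The detection rules in the table can be sketched as a small routing function. This is a minimal sketch, not the library's implementation: the `SourceType` members, the regular expression used for the org/dataset pattern, the order of the checks, and the choice to make `UnsupportedFormatError` a `ValueError` subclass are all assumptions made for illustration.

```python
import re
from enum import Enum
from pathlib import PurePosixPath


class SourceType(Enum):
    """Hypothetical enum; member names are illustrative."""
    CSV = "csv"
    JSON = "json"
    PARQUET = "parquet"
    EXCEL = "excel"
    HUGGINGFACE = "huggingface"


class UnsupportedFormatError(ValueError):
    """Raised when no detection rule matches (base class is an assumption)."""


# Extension table taken from the "Detection Method" column.
_EXTENSIONS = {
    ".csv": SourceType.CSV,
    ".tsv": SourceType.CSV,
    ".json": SourceType.JSON,
    ".jsonl": SourceType.JSON,
    ".parquet": SourceType.PARQUET,
    ".xlsx": SourceType.EXCEL,
    ".xls": SourceType.EXCEL,
}

# Assumed shape of an "org/dataset" identifier: exactly one slash.
_HF_REPO = re.compile(r"^[\w.-]+/[\w.-]+$")


def detect_source_type(source: str) -> SourceType:
    """Map a source string to a SourceType using the rules in the table."""
    if source.startswith("hf://"):
        return SourceType.HUGGINGFACE
    # Check the extension before the org/dataset pattern so that a relative
    # path such as "data/sales.csv" is not mistaken for a Hugging Face repo.
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in _EXTENSIONS:
        return _EXTENSIONS[suffix]
    if _HF_REPO.match(source):
        return SourceType.HUGGINGFACE
    raise UnsupportedFormatError(f"Unsupported data source: {source!r}")
```

Checking the extension first is a design choice of this sketch: a one-directory relative path and an org/dataset identifier look the same apart from the file extension, so the order of the two checks decides how the ambiguity is resolved.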
Core Interface:
class DataLoader:
    def load(self, source: str, **kwargs) -> pd.DataFrame:
        """Analyze source string and route to appropriate loader"""

    def _detect_source_type(self, source: str) -> SourceType:
        """Auto-detect source type"""

| Statistic | Numeric | Categorical |
|---|---|---|
| count / unique | ✅ | ✅ |
| mean / median | ✅ | — |
| std / variance | ✅ | — |
| min / max / range | ✅ | — |
| Q1, Q3, IQR | ✅ | — |
| top / freq | — | ✅ |
| mode | ✅ | ✅ |
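
The split between numeric and categorical statistics in the table could be computed along these lines. This is a sketch under assumptions: `summarize` is a hypothetical helper name, boolean columns are treated as categorical, and `std`/`variance` use the pandas default (sample statistics, `ddof=1`).

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of statistics per column (hypothetical helper)."""
    rows = {}
    for name, col in df.items():
        values = col.dropna()
        modes = values.mode()
        # Statistics that apply to both numeric and categorical columns.
        row = {
            "count": int(values.count()),
            "unique": int(values.nunique()),
            "mode": modes.iloc[0] if not modes.empty else None,
        }
        if is_numeric_dtype(col) and not is_bool_dtype(col):
            q1, q3 = values.quantile(0.25), values.quantile(0.75)
            row.update(
                mean=values.mean(),
                median=values.median(),
                std=values.std(),       # sample standard deviation (ddof=1)
                variance=values.var(),  # sample variance (ddof=1)
                min=values.min(),
                max=values.max(),
                range=values.max() - values.min(),
                q1=q1,
                q3=q3,
                iqr=q3 - q1,
            )
        else:
            counts = values.value_counts()
            row.update(
                top=counts.index[0] if not counts.empty else None,
                freq=int(counts.iloc[0]) if not counts.empty else 0,
            )
        rows[name] = row
    # Cells that do not apply to a column type are left as NaN.
    return pd.DataFrame.from_dict(rows, orient="index")
```

Cells marked "—" in the table come out as NaN in the resulting DataFrame, which keeps numeric and categorical columns in a single summary.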
- Skewness & Kurtosis
- Normality Tests: Shapiro-Wilk (n ≤ 5000), D'Agostino-Pearson
- Quantile Table: 5%, 10%, 25%, 50%, 75%, 90%, 95%
- Pearson correlation (numeric-numeric)
- Spearman rank correlation (numeric-numeric, monotonic relationships)
- Cramér's V (categorical-categorical)
- Multicollinearity warning (|r| > 0.9)
- Column-wise missing ratio
- Missing pattern analysis (MCAR / MAR hints)
- Row-wise missing distribution
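
Two of the checks listed above, the multicollinearity warning and the missing-data ratios, can be sketched with plain pandas. The function names and return shapes here are hypothetical; only the |r| > 0.9 threshold comes from the list.

```python
import pandas as pd


def high_correlation_pairs(
    df: pd.DataFrame, threshold: float = 0.9
) -> list[tuple[str, str, float]]:
    """Return numeric column pairs whose Pearson |r| exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        # Visit each unordered pair once and skip the diagonal.
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs


def missing_summary(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return (column-wise missing ratio, row-wise missing-count distribution)."""
    mask = df.isna()
    column_ratio = mask.mean()  # fraction of missing cells per column
    # How many rows have 0, 1, 2, ... missing cells.
    row_distribution = mask.sum(axis=1).value_counts().sort_index()
    return column_ratio, row_distribution
```

Each flagged pair could then be turned into one entry of the report's warnings list.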
| Chart Type | Target | Module |
|---|---|---|
| Histogram + KDE | Numeric columns | dist_plots.py |
| Boxplot | Numeric columns | plots.py |
| Bar chart (frequency) | Categorical columns | plots.py |
| Correlation heatmap | Numeric column pairs | corr_plots.py |
| Pairplot | Top N numeric columns | corr_plots.py |
| Missing data matrix | All columns | missing_plots.py |
| Violin plot | Numeric columns | dist_plots.py |
Visualization Theme: Unified style management in viz/theme.py (color palette, font size, etc.)
Generates comprehensive HTML reports from analysis results.
Report Structure:
- Overview Section: Dataset name, row/column counts, memory usage
- Variable Summary: Column types, missing ratios, key statistics
- Distribution Section: Distribution visualization for each column
- Correlation Section: Correlation heatmap + key correlated pairs
- Missing Data Section: Missing pattern visualization
- Warnings Section: Outliers, high correlation, high missing ratios, etc.
Input (file path / HF URL)
     │
     ▼
┌──────────┐
│  Loader  │ ──→ pd.DataFrame
└────┬─────┘
     │
     ▼
┌──────────┐
│  Schema  │ ──→ Column type inference (numeric/categorical/text/datetime)
└────┬─────┘
     │
     ├──→ Stats.descriptive()  ──→ StatResult
     ├──→ Stats.distribution() ──→ StatResult
     ├──→ Stats.correlation()  ──→ StatResult
     └──→ Stats.missing()      ──→ StatResult
           │
           ▼
     ┌────────────┐
     │ Viz Engine │ ──→ matplotlib Figure objects
     └─────┬──────┘
           │
           ▼
     ┌───────────┐
     │ Reporter  │ ──→ AnalysisReport
     └─────┬─────┘
           │
           ├──→ .show()    (console output)
           ├──→ .to_html() (HTML file)
           └──→ .to_dict() (programmatic access)
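
The Schema stage in the flow above labels each column as numeric, categorical, text, or datetime. One possible heuristic is sketched below. The function names, the `max_categories` cutoff of 20, and the treatment of booleans as categorical are assumptions for illustration, not decisions recorded in this plan.

```python
import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_numeric_dtype,
)


def infer_column_type(col: pd.Series, max_categories: int = 20) -> str:
    """Classify one column as numeric, categorical, text, or datetime."""
    if is_datetime64_any_dtype(col):
        return "datetime"
    if is_bool_dtype(col):
        return "categorical"  # assumption: booleans behave like two categories
    if is_numeric_dtype(col):
        return "numeric"
    # Non-numeric columns: few distinct values suggest categories,
    # many distinct values suggest free text.
    if col.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


def infer_schema(df: pd.DataFrame) -> dict[str, str]:
    """Map every column name to its inferred type label."""
    return {str(name): infer_column_type(col) for name, col in df.items()}
```

A fixed cardinality cutoff is crude; a ratio of unique values to row count may separate categorical from text columns better on large datasets.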
| Package | Version | Purpose |
|---|---|---|
| pandas | ≥ 2.0 | DataFrame core |
| numpy | ≥ 1.24 | Numeric operations |
| matplotlib | ≥ 3.7 | Basic visualization |
| seaborn | ≥ 0.13 | Statistical visualization |
| scipy | ≥ 1.11 | Statistical tests |
| Package | Purpose | Extras Name |
|---|---|---|
| datasets | HuggingFace dataset loading | [hf] |
| openpyxl | Excel file support | [excel] |
| pyarrow | Parquet file support | [parquet] |
| rich | Console output formatting | [rich] |
| jinja2 | HTML report templates | [report] |
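
Because these packages are optional, code that needs one of them has to cope with it being absent. A common pattern is to import lazily and point the user at the matching extra. The sketch below assumes hypothetical helper names (`install_hint`, `require_optional`); only the package-to-extra mapping is taken from the table.

```python
import importlib
from types import ModuleType

# Package name -> extras name, copied from the table above.
_EXTRAS = {
    "datasets": "hf",
    "openpyxl": "excel",
    "pyarrow": "parquet",
    "rich": "rich",
    "jinja2": "report",
}


def install_hint(package: str) -> str:
    """Build the pip command that would provide the given package."""
    extra = _EXTRAS.get(package)
    if extra is None:
        return f"pip install {package}"
    # Quoted so that shells which expand square brackets leave it alone.
    return f'pip install "f2a[{extra}]"'


def require_optional(package: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise ImportError(
            f"{package!r} is required for this feature. "
            f"Install it with: {install_hint(package)}"
        ) from exc
```

With this in place, the Hugging Face loader would call `require_optional("datasets")` at the point of use rather than importing `datasets` at module import time.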
# Basic install
pip install f2a
# With HuggingFace support
pip install f2a[hf]
# All features
pip install f2a[all]

- Project structure setup (pyproject.toml, directories)
- Basic Loader (CSV, JSON)
- Descriptive statistics module (descriptive.py)
- Basic visualization (histograms, boxplots)
- Console output (show)
- HuggingFace dataset loader
- Correlation analysis & heatmap
- Missing data analysis & visualization
- HTML report generation
- Distribution analysis (normality tests, etc.)
- Large dataset support (chunk loading)
- Interactive visualization (plotly option)
- CLI interface
- API stabilization & documentation
- Comprehensive test coverage > 80%
- PyPI deployment
- Tutorials & example notebooks
- Python: 3.10+
- Style: PEP 8, Black formatter, isort
- Type Hints: Required for all public APIs
- Docstrings: Google style
- Testing: pytest, unit testing principles
- Linting: ruff
from __future__ import annotations  # lets annotations reference types defined later

from dataclasses import dataclass
from pathlib import Path

import pandas as pd
from matplotlib.figure import Figure


@dataclass
class AnalysisReport:
    """Top-level container for analysis results"""
    dataset_name: str
    shape: tuple[int, int]
    schema: DataSchema  # defined in core/schema.py
    stats: StatsResult
    figures: dict[str, Figure]
    warnings: list[str]

    def show(self) -> None: ...
    def to_html(self, output_dir: str) -> Path: ...
    def to_dict(self) -> dict: ...


@dataclass
class StatsResult:
    """Container for statistical analysis results"""
    summary: pd.DataFrame  # Summary statistics
    correlation_matrix: pd.DataFrame  # Correlation matrix
    missing_info: pd.DataFrame  # Missing data info
    distribution_info: pd.DataFrame  # Distribution info

    def get_numeric_summary(self) -> pd.DataFrame: ...
    def get_categorical_summary(self) -> pd.DataFrame: ...

| Scenario | Handling |
|---|---|
| File not found | FileNotFoundError with clear message |
| Unsupported format | UnsupportedFormatError (custom) |
| HF dataset load failure | DataLoadError (custom) + cause chaining |
| Empty dataset | EmptyDataError (custom) |
| No numeric columns | Warning log + skip relevant analysis |
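
The custom exceptions and the cause chaining named in the table could look like the sketch below. The shared `F2AError` base class, the `load_hf_rows` helper, and its injected `fetch` callable are hypothetical and exist only to keep the example self-contained; the three exception names come from the table.

```python
from collections.abc import Callable, Sequence


class F2AError(Exception):
    """Hypothetical common base so callers can catch every f2a error at once."""


class UnsupportedFormatError(F2AError):
    """The source format is not one the loader recognizes."""


class DataLoadError(F2AError):
    """Loading failed; the original error is kept as __cause__."""


class EmptyDataError(F2AError):
    """The source loaded correctly but contains no rows."""


def load_hf_rows(name: str, fetch: Callable[[str], Sequence]) -> Sequence:
    """Load rows through `fetch`, translating failures into f2a errors."""
    try:
        rows = fetch(name)
    except Exception as exc:
        # "raise ... from exc" preserves the original traceback as the cause.
        raise DataLoadError(
            f"Failed to load Hugging Face dataset {name!r}"
        ) from exc
    if len(rows) == 0:
        raise EmptyDataError(f"Dataset {name!r} contains no rows")
    return rows
```

Chaining with `from exc` means the user sees the f2a-level message first while the underlying network or parsing error stays available in the traceback.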
This document is continuously updated as the project progresses.