diff --git a/CHANGELOG.md b/CHANGELOG.md index 38ddb7b..820ac1a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Added +- None + +### Changed +- None + +### Fixed +- None + +### Removed +- None + +## [0.4.2] - 2025-08-27 + ### Added - feat(cli): refactor check command interface from positional arguments to `--conn` and `--table` options - feat(cli): add comprehensive test coverage for new CLI interface functionality @@ -20,7 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - feat(tests): add multi-table Excel file validation test scenarios ### Changed -- **BREAKING CHANGE**: CLI interface changed from `vlite-cli check ` to `vlite-cli check --conn --table ` +- **BREAKING CHANGE**: CLI interface changed from `vlite check ` to `vlite check --conn --table ` - refactor(cli): update SourceParser to accept optional table_name parameter - refactor(cli): modify check command to pass table_name to SourceParser.parse_source() - refactor(tests): update all existing CLI tests to use new interface format @@ -47,7 +61,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **BREAKING CHANGE**: remove backward compatibility for old positional argument interface - remove(cli): eliminate support for `` positional argument in check command -## [0.4.0] - 2025-01-27 +## [0.4.0] - 2025-08-14 ### Added - feat(cli): add `schema` command skeleton @@ -61,7 +75,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - tests(cli): comprehensive unit tests for `schema` command covering argument parsing, rules file validation, decomposition/mapping, aggregation priority, output formats (table/json), and exit codes (AC satisfied) - tests(core): unit tests for `SCHEMA` rule covering normal/edge/error cases, strict type checks, and mypy compliance - tests(integration): database 
schema drift tests for MySQL and PostgreSQL (existence, type consistency, strict mode extras, case-insensitive) -- tests(e2e): end-to-end `vlite-cli schema` scenarios on database URLs covering happy path, drift (FIELD_MISSING/TYPE_MISMATCH), strict extras, empty rules minimal payload; JSON and table outputs +- tests(e2e): end-to-end `vlite schema` scenarios on database URLs covering happy path, drift (FIELD_MISSING/TYPE_MISMATCH), strict extras, empty rules minimal payload; JSON and table outputs ### Changed - docs: update README and USAGE with schema command overview and detailed usage diff --git a/README.md b/README.md index 51062e7..f336ae6 100644 --- a/README.md +++ b/README.md @@ -1,234 +1,116 @@ # ValidateLite -ValidateLite is a lightweight, zero-config Python CLI tool for validating data quality across files and SQL databases - built for modern data pipelines and CI/CD automation. This python data validation tool is a flexible, extensible command-line tool for automated data quality validation, profiling, and rule-based checks across diverse data sources. Designed for data engineers, analysts, and developers to ensure data reliability and compliance in modern data pipelines. - [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Code Coverage](https://img.shields.io/badge/coverage-80%25-green.svg)](https://github.com/litedatum/validatelite) ---- +**ValidateLite: A lightweight data validation tool for engineers who need answers, fast.** -## 📝 Development Blog +Unlike other complex **data validation tools**, ValidateLite provides two powerful, focused commands for different scenarios: -Follow the journey of building ValidateLite through our development blog posts: +* **`vlite check`**: For quick, ad-hoc data checks. Need to verify if a column is unique or not null *right now*?
The `check` command gets you an answer in 30 seconds, zero config required. -- **[DevLog #1: Building a Zero-Config Data Validation Tool](https://blog.litedatum.com/posts/Devlog01-data-validation-tool/)** - The initial vision and architecture of ValidateLite -- **[DevLog #2: Why I Scrapped My Half-Built Data Validation Platform](https://blog.litedatum.com/posts/Devlog02-Rethinking-My-Data-Validation-Tool/)** - Lessons learned from scope creep and the pivot to a focused CLI tool -- **[Rule-Driven Schema Validation: A Lightweight Solution](https://blog.litedatum.com/posts/Rule-Driven-Schema-Validation/)** - Deep dive into schema drift challenges and how ValidateLite's schema validation provides a lightweight alternative to complex frameworks +* **`vlite schema`**: For robust, repeatable **database schema validation**. It's your best defense against **schema drift**. Embed it in your CI/CD and ETL pipelines to enforce data contracts, ensuring data integrity before it becomes a problem. --- -## 🚀 Quick Start +## Core Use Case: Automated Schema Validation -### For Regular Users +The `vlite schema` command is key to ensuring the stability of your data pipelines. It allows you to quickly verify that a database table or data file conforms to a defined structure. -**Option 1: Install from [PyPI](https://pypi.org/project/validatelite/) (Recommended)** -```bash -pip install validatelite -vlite --help -``` +### Scenario 1: Gate Deployments in CI/CD -**Option 2: Install from pre-built package** -```bash -# Download the latest release from GitHub -pip install validatelite-0.1.0-py3-none-any.whl -vlite --help -``` +Automatically check for breaking schema changes before they get deployed, preventing production issues caused by unexpected modifications.
-**Option 3: Run from source** -```bash -git clone https://github.com/litedatum/validatelite.git -cd validatelite -pip install -r requirements.txt -python cli_main.py --help -``` - -**Option 4: Install with pip-tools (for development)** -```bash -git clone https://github.com/litedatum/validatelite.git -cd validatelite -pip install pip-tools -pip-compile requirements.in -pip install -r requirements.txt -python cli_main.py --help -``` +**Example Workflow (`.github/workflows/ci.yml`)** +```yaml +jobs: + validate-db-schema: + name: Validate Database Schema + runs-on: ubuntu-latest + steps: + - name: Checkout code + uses: actions/checkout@v3 -### For Developers & Contributors + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.9' -If you want to contribute to the project or need the latest development version: + - name: Install ValidateLite + run: pip install validatelite -```bash -git clone https://github.com/litedatum/validatelite.git -cd validatelite - -# Install dependencies (choose one approach) -# Option 1: Install from pinned requirements -pip install -r requirements.txt -pip install -r requirements-dev.txt - -# Option 2: Use pip-tools for development -pip install pip-tools -python scripts/update_requirements.py -pip install -r requirements.txt -pip install -r requirements-dev.txt - -# Install pre-commit hooks -pre-commit install + - name: Run Schema Validation + run: | + vlite schema --conn "mysql://${{ secrets.DB_USER }}:${{ secrets.DB_PASS }}@${{ secrets.DB_HOST }}/sales" \ + --rules ./schemas/customers_schema.json ``` -See [DEVELOPMENT_SETUP.md](docs/DEVELOPMENT_SETUP.md) for detailed development setup instructions. 
- ---- - -## ✨ Features - -- **🔧 Rule-based Data Quality Engine**: Supports completeness, uniqueness, validity, and custom rules -- **🖥️ Extensible CLI**: Easily integrate with CI/CD and automation workflows -- **🗄️ Multi-Source Support**: Validate data from files (CSV, Excel) and databases (MySQL, PostgreSQL, SQLite) -- **⚙️ Configurable & Modular**: Flexible configuration via TOML and environment variables -- **🛡️ Comprehensive Error Handling**: Robust exception and error classification system -- **🧪 Tested & Reliable**: High code coverage, modular tests, and pre-commit hooks -- **📝 Schema Drift Prevention**: Lightweight schema validation that prevents data pipeline failures from unexpected schema changes - a simple alternative to complex validation frameworks - ---- - -## 📖 Documentation - -- **[USAGE.md](docs/USAGE.md)** - Complete user guide with examples and best practices -- Schema command JSON output contract: `docs/schemas/schema_results.schema.json` -- **[DEVELOPMENT_SETUP.md](docs/DEVELOPMENT_SETUP.md)** - Development environment setup and contribution guidelines -- **[CONFIG_REFERENCE.md](docs/CONFIG_REFERENCE.md)** - Configuration file reference -- **[ROADMAP.md](docs/ROADMAP.md)** - Development roadmap and future plans -- **[CHANGELOG.md](CHANGELOG.md)** - Release history and changes - ---- - -## 🎯 Basic Usage - -### Validate a CSV file -```bash -vlite check data.csv --rule "not_null(id)" --rule "unique(email)" -``` - -### Validate a database table -```bash -vlite check "mysql://user:pass@host:3306/db.table" --rules validation_rules.json +### Scenario 2: Monitor ETL/ELT Pipelines + +Set up validation checkpoints at various stages of your data pipelines to guarantee data quality and avoid "garbage in, garbage out."
+ +**Example Rule File (`customers_schema.json`)** +```json +{ + "customers": { + "rules": [ + { "field": "id", "type": "integer", "required": true }, + { "field": "name", "type": "string", "required": true }, + { "field": "email", "type": "string", "required": true }, + { "field": "age", "type": "integer", "min": 18, "max": 100 }, + { "field": "gender", "enum": ["Male", "Female", "Other"] }, + { "field": "invalid_col" } + ] + } +} ``` -### Check with verbose output +**Run Command:** ```bash -vlite check data.csv --rules rules.json --verbose -``` - -### Validate against a schema file (single table) -```bash -# Table is derived from the data-source URL, the schema file is single-table in v1 -vlite schema "mysql://user:pass@host:3306/sales.users" --rules schema.json - -# Get aggregated JSON with column-level details (see docs/schemas/schema_results.schema.json) -vlite schema "mysql://.../sales.users" --rules schema.json --output json -``` - -For detailed usage examples and advanced features, see [USAGE.md](docs/USAGE.md). - ---- - -## 🏗️ Project Structure - -``` -validatelite/ -├── cli/ # CLI logic and commands -├── core/ # Rule engine and core validation logic -├── shared/ # Common utilities, enums, exceptions, and schemas -├── config/ # Example and template configuration files -├── tests/ # Unit, integration, and E2E tests -├── scripts/ # Utility scripts -├── docs/ # Documentation -└── examples/ # Usage examples and sample data +vlite schema --conn "mysql://user:pass@host:3306/sales" --rules customers_schema.json ``` --- -## 🧪 Testing +## Quick Start: Ad-Hoc Checks with `check` -### For Regular Users -The project includes comprehensive tests to ensure reliability. If you encounter issues, please check the [troubleshooting section](docs/USAGE.md#error-handling) in the usage guide. +For temporary, one-off validation needs, the `check` command is your best friend. -### For Developers +**1.
Install (if you haven't already):** ```bash -# Set up test databases (requires Docker) -./scripts/setup_test_databases.sh start - -# Run all tests with coverage -pytest -vv --cov - -# Run tests quietly (suppress debug messages) -python scripts/run_tests_quiet.py --cov - -# Run specific test categories -pytest tests/unit/ -v # Unit tests only -pytest tests/integration/ -v # Integration tests -pytest tests/e2e/ -v # End-to-end tests - -# Run specific tests quietly -python scripts/run_tests_quiet.py tests/unit/ -v +pip install validatelite +``` -# Code quality checks -pre-commit run --all-files +**2. Run a check:** +```bash +# Check for nulls in a CSV file's 'id' column +vlite check --conn "customers.csv" --table customers --rule "not_null(id)" -# Stop test databases when done -./scripts/setup_test_databases.sh stop +# Check for uniqueness in a database table's 'email' column +vlite check --conn "mysql://user:pass@host/db" --table customers --rule "unique(email)" ``` --- -## 🤝 Contributing +## Learn More -We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) and [Code of Conduct](CODE_OF_CONDUCT.md). - -### Development Setup -For detailed development setup instructions, see [DEVELOPMENT_SETUP.md](docs/DEVELOPMENT_SETUP.md). +- **[Usage Guide (USAGE.md)](docs/USAGE.md)**: Learn about all commands, arguments, and advanced features. +- **[Configuration Reference (CONFIG_REFERENCE.md)](docs/CONFIG_REFERENCE.md)**: See how to configure the tool via `toml` files. +- **[Contributing Guide (CONTRIBUTING.md)](CONTRIBUTING.md)**: We welcome contributions! --- -## ❓ FAQ: Why ValidateLite? - -### Q: What is ValidateLite, in one sentence? -A: ValidateLite is a lightweight, zero-config Python CLI tool for data quality validation, profiling, and rule-based checks across CSV files and SQL databases. - -### Q: How is it different from other tools like Great Expectations or Pandera?
-A: Unlike heavyweight frameworks, ValidateLite is built for simplicity and speed — no code generation, no DSLs, just one command to validate your data in pipelines or ad hoc scripts. - -### Q: What kind of data sources are supported? -A: Currently supports CSV, Excel, and SQL databases (MySQL, PostgreSQL, SQLite) with planned support for more cloud and file-based sources. - -### Q: Who should use this? -A: Data engineers, analysts, and Python developers who want to integrate fast, automated data quality checks into ETL jobs, CI/CD pipelines, or local workflows. - -### Q: Does it require writing Python code? -A: Not at all. You can specify rules inline in the command line or via a simple JSON config file — no coding needed. - -### Q: Is ValidateLite open-source? -A: Yes! It’s licensed under MIT and available on GitHub — stars and contributions are welcome! - -### Q: How can I use it in CI/CD? -A: Just install via pip and add a vlite check ... step in your data pipeline or GitHub Action. It returns exit codes you can use for gating deployments. - ---- +## 📝 Development Blog -## 🔒 Security +Follow the journey of building ValidateLite through our development blog posts: -For security issues, please review [SECURITY.md](SECURITY.md) and follow the recommended process. +- **[DevLog #1: Building a Zero-Config Data Validation Tool](https://blog.litedatum.com/posts/Devlog01-data-validation-tool/)** +- **[DevLog #2: Why I Scrapped My Half-Built Data Validation Platform](https://blog.litedatum.com/posts/Devlog02-Rethinking-My-Data-Validation-Tool/)** +- **[Rule-Driven Schema Validation: A Lightweight Solution](https://blog.litedatum.com/posts/Rule-Driven-Schema-Validation/)** --- ## 📄 License -This project is licensed under the terms of the [MIT License](LICENSE).
- ---- - -## 🙏 Acknowledgements - -- Inspired by best practices in data engineering and open-source data quality tools -- Thanks to all contributors and users for their feedback and support +This project is licensed under the [MIT License](LICENSE). diff --git a/cli/__init__.py b/cli/__init__.py index 640c839..8bbfd0e 100644 --- a/cli/__init__.py +++ b/cli/__init__.py @@ -2,10 +2,10 @@ ValidateLite CLI Package Command-line interface for the data quality validation tool. -Provides a unified `vlite-cli check` command for data quality checking. +Provides a unified `vlite check` command for data quality checking. """ -__version__ = "0.4.0" +__version__ = "0.4.2" from .app import cli_app diff --git a/cli/app.py b/cli/app.py index eca4c6a..a7c5d90 100644 --- a/cli/app.py +++ b/cli/app.py @@ -2,7 +2,7 @@ CLI Application Entry Point Main CLI application using Click framework. -Provides the unified `vlite-cli check` command for data quality validation. +Provides the unified `vlite check` command for data quality validation.
""" import sys @@ -67,8 +67,8 @@ def _setup_logging() -> None: logging.getLogger().setLevel(logging.WARNING) -@click.group(name="vlite-cli", invoke_without_command=True) -@click.version_option(version="0.4.0", prog_name="vlite-cli") +@click.group(name="vlite", invoke_without_command=True) +@click.version_option(version="0.4.2", prog_name="vlite") @click.pass_context def cli_app(ctx: click.Context) -> None: """ @@ -142,16 +142,16 @@ def rules_help() -> None: Usage Examples: # Single rule - vlite-cli check users.csv --rule "not_null(id)" + vlite check --conn users.csv --rule "not_null(id)" # Multiple rules - vlite-cli check users.csv --rule "not_null(id)" --rule "unique(email)" + vlite check --conn users.csv --rule "not_null(id)" --rule "unique(email)" # Rules file - vlite-cli check users.csv --rules validation.json + vlite check --conn users.csv --rules validation.json # Database check - vlite-cli check mysql://user:pass@host/db.users --rule "not_null(id)" + vlite check --conn mysql://user:pass@host/db --table users --rule "not_null(id)" """ safe_echo(help_text) diff --git a/cli/commands/check.py b/cli/commands/check.py index aa31bb6..cf8c531 100644 --- a/cli/commands/check.py +++ b/cli/commands/check.py @@ -1,7 +1,7 @@ """ Check Command Implementation -The core `vlite-cli check` command for data quality validation. +The core `vlite check` command for data quality validation. Supports smart source identification, rule parsing, and formatted output. """ @@ -76,7 +76,7 @@ def check_command( Check data quality for the given source. 
NEW FORMAT: - vlite-cli check --conn --table [options] + vlite check --conn --table [options] SOURCE can be: - File path: users.csv, data.xlsx, records.json @@ -84,8 +84,8 @@ def check_command( - SQLite file: sqlite:///path/to/file.db Examples: - vlite-cli check --conn users.csv --table users --rule "not_null(id)" - vlite-cli check --conn mysql://user:pass@host/db \ --table users --rules validation.json + vlite check --conn users.csv --table users --rule "not_null(id)" + vlite check --conn mysql://user:pass@host/db \ --table users --rules validation.json """ # Record start time @@ -300,17 +300,17 @@ def rules_help_command() -> None: enum(column,value1,value2...) - Check allowed enum values EXAMPLES: - vlite-cli check users.csv --rule "not_null(id)" - vlite-cli check users.csv --rule "length(name,2,50)" - vlite-cli check users.csv --rule "unique(email)" - vlite-cli check users.csv --rule "range(age,18,65)" - vlite-cli check users.csv --rule "regex(email,^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$)" + vlite check --conn users.csv --table users --rule "not_null(id)" + vlite check --conn users.csv --table users --rule "length(name,2,50)" + vlite check --conn users.csv --table users --rule "unique(email)" + vlite check --conn users.csv --table users --rule "range(age,18,65)" + vlite check --conn users.csv --table users --rule "regex(email,^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$)" MULTIPLE RULES: - vlite-cli check users.csv --rule "not_null(id)" --rule "unique(email)" + vlite check --conn users.csv --table users --rule "not_null(id)" --rule "unique(email)" RULES FILE: - vlite-cli check users.csv --rules validation.json + vlite check --conn users.csv --table users --rules validation.json Example validation.json: { diff --git a/cli/commands/schema.py b/cli/commands/schema.py index a216f6e..122205c 100644 --- a/cli/commands/schema.py +++ b/cli/commands/schema.py @@ -1,7 +1,7 @@ """ Schema Command -Adds `vlite-cli schema` command that parses parameters, performs minimal rules +Adds `vlite schema` command that parses parameters, performs minimal rules file validation (supports both single-table and multi-table formats), and prints output aligned with the existing CLI
style. """ diff --git a/cli_main.py b/cli_main.py index 7ac983d..7efead3 100644 --- a/cli_main.py +++ b/cli_main.py @@ -2,7 +2,7 @@ """ ValidateLite CLI Main Entry Point -Main entry point for the vlite-cli command-line tool. +Main entry point for the vlite command-line tool. """ import os diff --git a/docs/CONFIG_REFERENCE.md b/docs/CONFIG_REFERENCE.md index 5bb029e..78caf02 100644 --- a/docs/CONFIG_REFERENCE.md +++ b/docs/CONFIG_REFERENCE.md @@ -129,7 +129,7 @@ export CLI_CONFIG_PATH=/path/to/custom/cli.toml export LOGGING_CONFIG_PATH=/path/to/custom/logging.toml # Run the application -vlite-cli check data.csv --rule "not_null(id)" +vlite check --conn data.csv --table data --rule "not_null(id)" ``` ## Configuration Loading Order diff --git a/docs/USAGE.md b/docs/USAGE.md index 6f2b687..b91a7c5 100644 --- a/docs/USAGE.md +++ b/docs/USAGE.md @@ -37,7 +37,7 @@ pip install validatelite **Option 2: Install from pre-built package** ```bash -pip install validatelite-0.4.0-py3-none-any.whl +pip install validatelite-0.4.2-py3-none-any.whl ``` **Option 3: Run from source** @@ -57,13 +57,13 @@ Let's start with a simple validation to check that all records in a CSV file hav ```bash # Validate a CSV file -vlite check examples/sample_data.csv --rule "not_null(customer_id)" +vlite check --conn examples/sample_data.csv --table data --rule "not_null(customer_id)" # Validate a database table -vlite check "mysql://user:pass@localhost:3306/mydb.customers" --rule "unique(email)" +vlite check --conn "mysql://user:pass@localhost:3306/mydb" --table customers --rule "unique(email)" # Validate against a schema file -vlite schema "mysql://user:pass@localhost:3306/mydb.customers" --rules schema.json +vlite schema --conn "mysql://user:pass@localhost:3306/mydb" --rules schema.json ``` --- @@ -79,7 +79,7 @@ ValidateLite provides two main commands: Both commands follow this general pattern: ```bash -vlite [options] +vlite --conn --table [options] ``` ### Data Source Types @@ -89,9 +89,9 @@ 
ValidateLite supports multiple data source types: | Type | Format | Example | |------|--------|---------| | **Local Files** | CSV, Excel, JSON, JSONL | `data/customers.csv` | -| **MySQL** | Connection string | `mysql://user:pass@host:3306/db.table` | -| **PostgreSQL** | Connection string | `postgresql://user:pass@host:5432/db.table` | -| **SQLite** | File path with table | `sqlite:///path/to/db.sqlite.table` | +| **MySQL** | Connection string | `mysql://user:pass@host:3306/db` | +| **PostgreSQL** | Connection string | `postgresql://user:pass@host:5432/db` | +| **SQLite** | File path with table | `sqlite:///path/to/db.sqlite` | ### Rule Types Overview @@ -114,11 +114,12 @@ The `check` command allows you to specify validation rules either inline or thro #### Basic Syntax & Parameters ```bash -vlite check [options] +vlite check --conn --table [options] ``` **Required Parameters:** -- `` - Path to file or database connection string +- `--conn ` - Path to file or database connection string +- `--table ` - Table name or identifier for the data source **Options:** | Option | Description | @@ -137,10 +138,10 @@ Use `--rule` for simple, quick validations: ```bash # Single rule -vlite check data.csv --rule "not_null(id)" +vlite check --conn data.csv --table data --rule "not_null(id)" # Multiple rules -vlite check data.csv \ +vlite check --conn data.csv --table data \ --rule "not_null(name)" \ --rule "unique(id)" \ --rule "range(age, 18, 99)" @@ -221,12 +222,12 @@ Sample Failed Data: **1. Basic file validation:** ```bash -vlite check test_data/customers.xlsx --rule "not_null(name)" +vlite check --conn test_data/customers.xlsx --table customers --rule "not_null(name)" ``` **2. 
Multiple rules with verbose output:** ```bash -vlite check test_data/customers.xlsx \ +vlite check --conn test_data/customers.xlsx --table customers \ --rule "unique(email)" \ --rule "regex(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$')" \ --verbose @@ -234,14 +235,14 @@ vlite check test_data/customers.xlsx \ **3. Comprehensive validation using rules file:** ```bash -vlite check "mysql://root:password@localhost:3306/data_quality.customers" \ +vlite check --conn "mysql://root:password@localhost:3306/data_quality" --table customers \ --rules "validation_rules.json" \ --verbose ``` **4. CSV file with multiple constraints:** ```bash -vlite check examples/sample_data.csv \ +vlite check --conn examples/sample_data.csv --table data \ --rule "not_null(customer_id)" \ --rule "unique(customer_id)" \ --rule "length(email, 5, 100)" \ @@ -259,17 +260,17 @@ vlite check examples/sample_data.csv \ ### The `schema` Command - Schema Validation -The `schema` command validates tables against JSON schema files, automatically decomposing schemas into atomic rules with intelligent prioritization and aggregation. +The `schema` command validates tables against JSON schema files, automatically decomposing schemas into atomic rules with intelligent prioritization and aggregation. **NEW in v0.4.2**: Enhanced multi-table support, Excel multi-sheet file support, and improved output formatting. 
#### Basic Syntax & Parameters ```bash -vlite schema --rules [options] +vlite schema --conn --rules [options] ``` **Required Parameters:** -- `` - Database/table identifier (table derived from URL) -- `--rules ` - Path to JSON schema file +- `--conn ` - Database connection string or file path (now supports Excel multi-sheet files) +- `--rules ` - Path to JSON schema file (supports both single-table and multi-table formats) **Options:** | Option | Description | @@ -278,9 +279,10 @@ vlite schema --rules [options] | `--verbose` | Show detailed information in table mode | | `--help` | Display command help | -#### Schema File Structure (v1) +#### Schema File Structure -**Minimal Structure:** +**Single-Table Format (v1):** +_Only applicable to CSV file data sources_ ```json { "rules": [ @@ -295,6 +297,29 @@ vlite schema --rules [options] } ``` +**NEW: Multi-Table Format (v0.4.2):** +```json +{ + "customers": { + "rules": [ + { "field": "id", "type": "integer", "required": true }, + { "field": "name", "type": "string", "required": true }, + { "field": "email", "type": "string", "required": true } + ], + "strict_mode": true, + "case_insensitive": false + }, + "orders": { + "rules": [ + { "field": "order_id", "type": "integer", "required": true }, + { "field": "customer_id", "type": "integer", "required": true }, + { "field": "total", "type": "float", "min": 0.01 } + ], + "strict_mode": false + } +} +``` + **Supported Field Types:** - `string`, `integer`, `float`, `boolean`, `date`, `datetime` @@ -304,8 +329,24 @@ vlite schema --rules [options] - `required` - Generate NOT_NULL rule if true - `min`/`max` - Generate RANGE rule for numeric types - `enum` - Generate ENUM rule with allowed values -- `strict_mode` - Report extra columns as violations -- `case_insensitive` - Case-insensitive column matching +- `strict_mode` - Report extra columns as violations (table-level option) +- `case_insensitive` - Case-insensitive column matching (table-level option) + +#### NEW: 
Multi-Table and Excel Support + +**Excel Multi-Sheet Files:** +The schema command now supports Excel files with multiple worksheets as data sources. Each worksheet can be validated against its corresponding schema definition. + +```bash +# Validate Excel file with multiple sheets +vlite schema --conn "data.xlsx" --rules multi_table_schema.json +``` + +**Multi-Table Validation:** +- Support for validating multiple tables in a single command +- Table-level configuration options (strict_mode, case_insensitive) +- Automatic detection of multi-table data sources +- Grouped output display by table #### Rule Decomposition Logic @@ -328,7 +369,7 @@ Schema Field → Generated Rules #### Output Formats -**Table Mode (default)** - Column-grouped summary: +**Table Mode (default)** - Column-grouped summary with improved formatting: ``` Column Validation Results ═════════════════════════ @@ -345,42 +386,91 @@ Column: status ⚠ Dependent checks skipped ``` -**JSON Mode** (`--output json`) - Machine-readable format: +**NEW: Multi-Table Table Mode:** +``` +Table: customers +═══════════════ +Column: id + ✓ Field exists (integer) + ✓ Not null constraint + +Table: orders +═══════════════ +Column: order_id + ✓ Field exists (integer) + ✓ Not null constraint +``` + +**JSON Mode** (`--output json`) - Machine-readable format with enhanced structure: ```json { "summary": { - "total_checks": 8, - "passed": 5, - "failed": 2, - "skipped": 1 + "total_checks": 12, + "passed": 8, + "failed": 3, + "skipped": 1, + "execution_time_ms": 1250 }, "results": [...], "fields": { - "id": { "status": "passed", "checks": [...] }, - "age": { "status": "failed", "checks": [...]
} + "age": { + "status": "passed", + "checks": ["existence", "type", "not_null", "range"] + }, + "unknown_field": { + "status": "extra", + "checks": [] + } }, - "schema_extras": ["unknown_column"] + "schema_extras": ["unknown_field"], + "tables": { + "customers": { + "status": "passed", + "total_checks": 6, + "passed": 6 + }, + "orders": { + "status": "failed", + "total_checks": 6, + "passed": 2, + "failed": 4 + } + } } ``` +**Full JSON schema definition:** `docs/schemas/schema_results.schema.json` + #### Practical Examples **1. Basic schema validation:** ```bash -vlite schema "mysql://root:password@localhost:3306/data_quality.customers" \ +vlite schema --conn "mysql://root:password@localhost:3306/data_quality" \ --rules test_data/schema.json ``` -**2. JSON output for automation:** +**2. NEW: Multi-table schema validation:** +```bash +vlite schema --conn "mysql://user:pass@host:3306/sales" \ + --rules multi_table_schema.json +``` + +**3. NEW: Excel multi-sheet validation:** +```bash +vlite schema --conn "data.xlsx" \ + --rules excel_schema.json +``` + +**4. JSON output for automation:** ```bash -vlite schema "mysql://user:pass@host:3306/sales.users" \ +vlite schema --conn "mysql://user:pass@host:3306/sales" \ --rules schema.json \ --output json ``` -**3. Verbose table output:** +**5. 
Verbose table output:** ```bash -vlite schema "postgresql://user:pass@localhost:5432/app.customers" \ +vlite schema --conn "postgresql://user:pass@localhost:5432/app" \ --rules customer_schema.json \ --verbose ``` @@ -407,13 +497,13 @@ vlite schema "postgresql://user:pass@localhost:5432/app.customers" \ **Examples:** ```bash # CSV with custom delimiter (auto-detected) -vlite check data/customers.csv --rule "not_null(id)" +vlite check --conn data/customers.csv --table customers --rule "not_null(id)" # Excel file (auto-detects first sheet) -vlite check reports/monthly_data.xlsx --rule "unique(transaction_id)" +vlite check --conn reports/monthly_data.xlsx --table data --rule "unique(transaction_id)" # JSON Lines file -vlite check logs/events.jsonl --rule "not_null(timestamp)" +vlite check --conn logs/events.jsonl --table events --rule "not_null(timestamp)" ``` #### Database Sources @@ -422,30 +512,30 @@ vlite check logs/events.jsonl --rule "not_null(timestamp)" **MySQL:** ``` -mysql://[username[:password]@]host[:port]/database.table +mysql://[username[:password]@]host[:port]/database ``` **PostgreSQL:** ``` -postgresql://[username[:password]@]host[:port]/database.table +postgresql://[username[:password]@]host[:port]/database ``` **SQLite:** ``` -sqlite:///[absolute_path_to_file].table -sqlite://[relative_path_to_file].table +sqlite:///[absolute_path_to_file] +sqlite://[relative_path_to_file] ``` **Connection Examples:** ```bash # MySQL with authentication -vlite check "mysql://admin:secret123@db.company.com:3306/sales.customers" --rule "unique(id)" +vlite check --conn "mysql://admin:secret123@db.company.com:3306/sales" --table customers --rule "unique(id)" # PostgreSQL with default port -vlite check "postgresql://analyst@analytics-db/warehouse.orders" --rules validation.json +vlite check --conn "postgresql://analyst@analytics-db/warehouse" --table orders --rules validation.json # SQLite local file -vlite check "sqlite:///data/local.db.users" --rule "not_null(email)" 
+vlite check --conn "sqlite:///data/local.db" --table users --rule "not_null(email)" ``` ### Validation Rules Deep Dive diff --git a/examples/README.md b/examples/README.md index a276956..6629940 100644 --- a/examples/README.md +++ b/examples/README.md @@ -18,14 +18,14 @@ This directory contains examples and sample files to help you get started with V 2. **Validate the sample data:** ```bash - python cli_main.py check examples/sample_data.csv --rules examples/sample_rules.json + python cli_main.py check --conn examples/sample_data.csv --table data --rules examples/sample_rules.json ``` 3. **Test with your own data:** ```bash # Create your own rules file based on sample_rules.json # Then run validation - python cli_main.py check your_data.csv --rules your_rules.json + python cli_main.py check --conn your_data.csv --table data --rules your_rules.json ``` ## Example Rules diff --git a/examples/basic_usage.py b/examples/basic_usage.py index 9800698..c872876 100644 --- a/examples/basic_usage.py +++ b/examples/basic_usage.py @@ -68,7 +68,9 @@ def example_csv_validation() -> None: print(f"CSV file: {csv_file}") print(f"Rules file: {rules_file}") print("Run command:") - print(f"python cli_main.py check {csv_file} --rules {rules_file}") + print( + f"python cli_main.py check --conn {csv_file} --table data --rules {rules_file}" + ) print() @@ -114,7 +116,10 @@ def example_database_validation() -> None: print(f"Database: {db_connection}") print(f"Rules file: {rules_file}") print("Run command:") - print(f'python cli_main.py check "{db_connection}" --rules {rules_file}') + print( + f'python cli_main.py check --conn "{db_connection}" --table customers ' + f"--rules {rules_file}" + ) print() @@ -153,7 +158,10 @@ def example_excel_validation() -> None: for rule in rules: print(f" - {rule['name']}: {rule['description']}") print("Run command:") - print("python cli_main.py check products.xlsx --rules rules.json") + print( + "python cli_main.py check --conn products.xlsx --table 
products " + "--rules rules.json" + ) print() @@ -195,7 +203,7 @@ def example_custom_sql_validation() -> None: print(f" - {rule['name']}: {rule['description']}") print("Run command:") print( - "python cli_main.py check " + "python cli_main.py check --conn " '"mysql://:@localhost:3306/testdb.sales" ' "--rules custom_rules.json" ) diff --git a/pyproject.toml b/pyproject.toml index 2beff36..d07390c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "validatelite" -version = "0.4.0" +version = "0.4.2" description = "A flexible, extensible command-line tool for automated data quality validation" readme = "README.md" license = {text = "MIT"} diff --git a/scripts/generate_config_docs.py b/scripts/generate_config_docs.py index 8f8a893..a2ac108 100644 --- a/scripts/generate_config_docs.py +++ b/scripts/generate_config_docs.py @@ -158,7 +158,7 @@ def generate_environment_variables_docs() -> str: docs += "export LOGGING_CONFIG_PATH=/path/to/custom/logging.toml\n" docs += "\n" docs += "# Run the application\n" - docs += 'vlite-cli check data.csv --rule "not_null(id)"\n' + docs += 'vlite check data.csv --rule "not_null(id)"\n' docs += "```\n\n" return docs diff --git a/tests/e2e/cli_scenarios/test_schema_command_e2e.py b/tests/e2e/cli_scenarios/test_schema_command_e2e.py index eed2bd1..1a17013 100644 --- a/tests/e2e/cli_scenarios/test_schema_command_e2e.py +++ b/tests/e2e/cli_scenarios/test_schema_command_e2e.py @@ -1,5 +1,5 @@ """ -E2E: vlite-cli schema on databases and table/json outputs +E2E: vlite schema on databases and table/json outputs Scenarios derived from notes/ๆต‹่ฏ•ๆ–นๆกˆ-ๆ•ฐๆฎๅบ“SchemaDriftไธŽCLI-Schemaๅ‘ฝไปค.md: - Happy path on DB URL with table/json outputs diff --git a/tests/unit/cli/core/test_cli_app.py b/tests/unit/cli/core/test_cli_app.py index 54ebde1..1a63664 100644 --- a/tests/unit/cli/core/test_cli_app.py +++ b/tests/unit/cli/core/test_cli_app.py @@ -55,7 +55,7 @@ def 
test_cli_app_version_option(self: Any, runner: CliRunner) -> None: result = runner.invoke(cli_app, ["--version"]) assert result.exit_code == 0 - assert "vlite-cli" in result.output + assert "vlite" in result.output # assert "1.0.0" in result.output def test_cli_app_help_option(self: Any, runner: CliRunner) -> None: @@ -118,7 +118,7 @@ def test_rules_help_command_content(self: Any, runner: CliRunner) -> None: assert "not_null(id)" in result.output assert "unique(email)" in result.output assert "length(name,2,50)" in result.output - assert "mysql://user:pass@host/db.users" in result.output + assert "mysql://user:pass@host/db" in result.output def test_rules_help_json_schema_example(self: Any, runner: CliRunner) -> None: """Test rules-help includes valid JSON schema example""" @@ -146,9 +146,9 @@ def test_rules_help_usage_examples(self: Any, runner: CliRunner) -> None: # Check usage examples usage_examples = [ - "vlite-cli check users.csv --rule", - "vlite-cli check users.csv --rules validation.json", - "vlite-cli check mysql://user:pass@host/db.users", + "vlite check --conn users.csv --rule", + "vlite check --conn users.csv --rules validation.json", + "vlite check --conn mysql://user:pass@host/db", ] for example in usage_examples: @@ -411,7 +411,7 @@ def test_cli_app_contract_compliance(self: Any, runner: CliRunner) -> None: # Should have proper Click structure assert "Usage:" in result.output - assert "vlite-cli" in result.output + assert "vlite" in result.output assert "Commands:" in result.output def test_error_exit_codes_consistency(self: Any, runner: CliRunner) -> None: