Merged
20 changes: 17 additions & 3 deletions CHANGELOG.md
@@ -7,6 +7,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- None

### Changed
- None

### Fixed
- None

### Removed
- None

## [0.4.2] - 2025-08-27

### Added
- feat(cli): refactor check command interface from positional arguments to `--conn` and `--table` options
- feat(cli): add comprehensive test coverage for new CLI interface functionality
@@ -20,7 +34,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- feat(tests): add multi-table Excel file validation test scenarios

### Changed
- **BREAKING CHANGE**: CLI interface changed from `vlite-cli check <source>` to `vlite-cli check --conn <connection> --table <table_name>`
- **BREAKING CHANGE**: CLI interface changed from `vlite check <source>` to `vlite check --conn <connection> --table <table_name>`
- refactor(cli): update SourceParser to accept optional table_name parameter
- refactor(cli): modify check command to pass table_name to SourceParser.parse_source()
- refactor(tests): update all existing CLI tests to use new interface format
@@ -47,7 +61,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **BREAKING CHANGE**: remove backward compatibility for old positional argument interface
- remove(cli): eliminate support for `<source>` positional argument in check command

## [0.4.0] - 2025-01-27
## [0.4.0] - 2025-08-14

### Added
- feat(cli): add `schema` command skeleton
@@ -61,7 +75,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- tests(cli): comprehensive unit tests for `schema` command covering argument parsing, rules file validation, decomposition/mapping, aggregation priority, output formats (table/json), and exit codes (AC satisfied)
- tests(core): unit tests for `SCHEMA` rule covering normal/edge/error cases, strict type checks, and mypy compliance
- tests(integration): database schema drift tests for MySQL and PostgreSQL (existence, type consistency, strict mode extras, case-insensitive)
- tests(e2e): end-to-end `vlite-cli schema` scenarios on database URLs covering happy path, drift (FIELD_MISSING/TYPE_MISMATCH), strict extras, empty rules minimal payload; JSON and table outputs
- tests(e2e): end-to-end `vlite schema` scenarios on database URLs covering happy path, drift (FIELD_MISSING/TYPE_MISMATCH), strict extras, empty rules minimal payload; JSON and table outputs

### Changed
- docs: update README and USAGE with schema command overview and detailed usage
254 changes: 68 additions & 186 deletions README.md
@@ -1,234 +1,116 @@
# ValidateLite

ValidateLite is a lightweight, zero-config Python CLI tool for validating data quality across files and SQL databases, built for modern data pipelines and CI/CD automation. It provides automated, rule-based data quality checks and profiling across diverse data sources, and is designed for data engineers, analysts, and developers who need reliable, compliant data in their pipelines.

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code Coverage](https://img.shields.io/badge/coverage-80%25-green.svg)](https://github.com/litedatum/validatelite)

---
**ValidateLite: A lightweight data validation tool for engineers who need answers, fast.**

## 📝 Development Blog
Unlike other complex **data validation tools**, ValidateLite provides two powerful, focused commands for different scenarios:

Follow the journey of building ValidateLite through our development blog posts:
* **`vlite check`**: For quick, ad-hoc data checks. Need to verify if a column is unique or not null *right now*? The `check` command gets you an answer in 30 seconds, zero config required.

- **[DevLog #1: Building a Zero-Config Data Validation Tool](https://blog.litedatum.com/posts/Devlog01-data-validation-tool/)** - The initial vision and architecture of ValidateLite
- **[DevLog #2: Why I Scrapped My Half-Built Data Validation Platform](https://blog.litedatum.com/posts/Devlog02-Rethinking-My-Data-Validation-Tool/)** - Lessons learned from scope creep and the pivot to a focused CLI tool
- **[Rule-Driven Schema Validation: A Lightweight Solution](https://blog.litedatum.com/posts/Rule-Driven-Schema-Validation/)** - Deep dive into schema drift challenges and how ValidateLite's schema validation provides a lightweight alternative to complex frameworks
* **`vlite schema`**: For robust, repeatable **database schema validation**. It's your best defense against **schema drift**. Embed it in your CI/CD and ETL pipelines to enforce data contracts, ensuring data integrity before it becomes a problem.

---

## 🚀 Quick Start
## Core Use Case: Automated Schema Validation

### For Regular Users
The `vlite schema` command is key to ensuring the stability of your data pipelines. It allows you to quickly verify that a database table or data file conforms to a defined structure.

**Option 1: Install from [PyPI](https://pypi.org/project/validatelite/) (Recommended)**
```bash
pip install validatelite
vlite --help
```
### Scenario 1: Gate Deployments in CI/CD

**Option 2: Install from pre-built package**
```bash
# Download the latest release from GitHub
pip install validatelite-0.1.0-py3-none-any.whl
vlite --help
```
Automatically check for breaking schema changes before they get deployed, preventing production issues caused by unexpected modifications.

**Option 3: Run from source**
```bash
git clone https://github.com/litedatum/validatelite.git
cd validatelite
pip install -r requirements.txt
python cli_main.py --help
```

**Option 4: Install with pip-tools (for development)**
```bash
git clone https://github.com/litedatum/validatelite.git
cd validatelite
pip install pip-tools
pip-compile requirements.in
pip install -r requirements.txt
python cli_main.py --help
```
**Example Workflow (`.github/workflows/ci.yml`)**
```yaml
jobs:
  validate-db-schema:
    name: Validate Database Schema
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

### For Developers & Contributors
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

If you want to contribute to the project or need the latest development version:
      - name: Install ValidateLite
        run: pip install validatelite

```bash
git clone https://github.com/litedatum/validatelite.git
cd validatelite

# Install dependencies (choose one approach)
# Option 1: Install from pinned requirements
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Option 2: Use pip-tools for development
pip install pip-tools
python scripts/update_requirements.py
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install
      - name: Run Schema Validation
        run: |
          vlite schema --conn "mysql://${{ secrets.DB_USER }}:${{ secrets.DB_PASS }}@${{ secrets.DB_HOST }}/sales" \
            --rules ./schemas/customers_schema.json
```

See [DEVELOPMENT_SETUP.md](docs/DEVELOPMENT_SETUP.md) for detailed development setup instructions.

---

## ✨ Features

- **🔧 Rule-based Data Quality Engine**: Supports completeness, uniqueness, validity, and custom rules
- **🖥️ Extensible CLI**: Easily integrate with CI/CD and automation workflows
- **🗄️ Multi-Source Support**: Validate data from files (CSV, Excel) and databases (MySQL, PostgreSQL, SQLite)
- **⚙️ Configurable & Modular**: Flexible configuration via TOML and environment variables
- **🛡️ Comprehensive Error Handling**: Robust exception and error classification system
- **🧪 Tested & Reliable**: High code coverage, modular tests, and pre-commit hooks
- **📐 Schema Drift Prevention**: Lightweight schema validation that prevents data pipeline failures from unexpected schema changes - a simple alternative to complex validation frameworks

---

## 📖 Documentation

- **[USAGE.md](docs/USAGE.md)** - Complete user guide with examples and best practices
- Schema command JSON output contract: `docs/schemas/schema_results.schema.json`
- **[DEVELOPMENT_SETUP.md](docs/DEVELOPMENT_SETUP.md)** - Development environment setup and contribution guidelines
- **[CONFIG_REFERENCE.md](docs/CONFIG_REFERENCE.md)** - Configuration file reference
- **[ROADMAP.md](docs/ROADMAP.md)** - Development roadmap and future plans
- **[CHANGELOG.md](CHANGELOG.md)** - Release history and changes

---

## 🎯 Basic Usage

### Validate a CSV file
```bash
vlite check data.csv --rule "not_null(id)" --rule "unique(email)"
```

### Validate a database table
```bash
vlite check "mysql://user:pass@host:3306/db.table" --rules validation_rules.json
### Scenario 2: Monitor ETL/ELT Pipelines

Set up validation checkpoints at various stages of your data pipelines to guarantee data quality and avoid "garbage in, garbage out."

**Example Rule File (`customers_schema.json`)**
```json
{
  "customers": {
    "rules": [
      { "field": "id", "type": "integer", "required": true },
      { "field": "name", "type": "string", "required": true },
      { "field": "email", "type": "string", "required": true },
      { "field": "age", "type": "integer", "min": 18, "max": 100 },
      { "field": "gender", "enum": ["Male", "Female", "Other"] },
      { "field": "invalid_col" }
    ]
  }
}
```

### Check with verbose output
**Run Command:**
```bash
vlite check data.csv --rules rules.json --verbose
```

### Validate against a schema file (single table)
```bash
# Table is derived from the data-source URL, the schema file is single-table in v1
vlite schema "mysql://user:pass@host:3306/sales.users" --rules schema.json

# Get aggregated JSON with column-level details (see docs/schemas/schema_results.schema.json)
vlite schema "mysql://.../sales.users" --rules schema.json --output json
```

For detailed usage examples and advanced features, see [USAGE.md](docs/USAGE.md).

---

## 🏗️ Project Structure

```
validatelite/
├── cli/ # CLI logic and commands
├── core/ # Rule engine and core validation logic
├── shared/ # Common utilities, enums, exceptions, and schemas
├── config/ # Example and template configuration files
├── tests/ # Unit, integration, and E2E tests
├── scripts/ # Utility scripts
├── docs/ # Documentation
└── examples/ # Usage examples and sample data
vlite schema --conn "mysql://user:pass@host:3306/sales" --rules customers_schema.json
```
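
As a minimal sketch, assuming the rules file follows the shape of the `customers_schema.json` example (a top-level table name mapping to a `rules` list of field entries), a short Python pre-flight script can summarize what the file asks for before it is handed to `vlite schema`. The helper name `summarize_rules` is hypothetical and not part of ValidateLite, and the tool's own handling of rules files may differ.

```python
import json

# Inline copy of the example rules file; in practice this would be read from disk.
EXAMPLE_RULES = """
{
  "customers": {
    "rules": [
      { "field": "id", "type": "integer", "required": true },
      { "field": "name", "type": "string", "required": true },
      { "field": "email", "type": "string", "required": true },
      { "field": "age", "type": "integer", "min": 18, "max": 100 },
      { "field": "gender", "enum": ["Male", "Female", "Other"] },
      { "field": "invalid_col" }
    ]
  }
}
"""


def summarize_rules(rules_doc: dict) -> dict:
    """Hypothetical helper: per table, list required fields and name-only entries."""
    summary = {}
    for table, body in rules_doc.items():
        entries = body.get("rules", [])
        summary[table] = {
            "required": [e["field"] for e in entries if e.get("required")],
            # An entry that only names a field has no type or value constraint attached.
            "unconstrained": [e["field"] for e in entries if set(e) == {"field"}],
        }
    return summary


summary = summarize_rules(json.loads(EXAMPLE_RULES))
print(summary)
```

A summary like this makes it easy to spot entries such as `invalid_col` that name a column without constraining it, which may or may not be intentional in a given rules file.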

---

## 🧪 Testing
## Quick Start: Ad-Hoc Checks with `check`

### For Regular Users
The project includes comprehensive tests to ensure reliability. If you encounter issues, please check the [troubleshooting section](docs/USAGE.md#error-handling) in the usage guide.
For temporary, one-off validation needs, the `check` command is your best friend.

### For Developers
**1. Install (if you haven't already):**
```bash
# Set up test databases (requires Docker)
./scripts/setup_test_databases.sh start

# Run all tests with coverage
pytest -vv --cov

# Run tests quietly (suppress debug messages)
python scripts/run_tests_quiet.py --cov

# Run specific test categories
pytest tests/unit/ -v # Unit tests only
pytest tests/integration/ -v # Integration tests
pytest tests/e2e/ -v # End-to-end tests

# Run specific tests quietly
python scripts/run_tests_quiet.py tests/unit/ -v
pip install validatelite
```

# Code quality checks
pre-commit run --all-files
**2. Run a check:**
```bash
# Check for nulls in a CSV file's 'id' column
vlite check --conn "customers.csv" --table customers --rule "not_null(id)"

# Stop test databases when done
./scripts/setup_test_databases.sh stop
# Check for uniqueness in a database table's 'email' column
vlite check --conn "mysql://user:pass@host/db" --table customers --rule "unique(email)"
```
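
The invocations above share one shape, which can be sketched in Python for use from a script or scheduler. This is a minimal sketch, assuming only the `--conn`, `--table`, and `--rule` options that appear in this README; `build_check_command` and `run_check` are hypothetical helper names, and treating a non-zero exit code as a failed check is an assumption to confirm against USAGE.md.

```python
import subprocess
from typing import List, Sequence


def build_check_command(conn: str, table: str, rules: Sequence[str]) -> List[str]:
    """Assemble argv for `vlite check` using the --conn/--table interface."""
    cmd = ["vlite", "check", "--conn", conn, "--table", table]
    for rule in rules:
        # One --rule option per inline rule, e.g. "not_null(id)".
        cmd += ["--rule", rule]
    return cmd


def run_check(conn: str, table: str, rules: Sequence[str]) -> bool:
    """Run the command; assumes (unverified) that exit code 0 means all rules passed."""
    result = subprocess.run(build_check_command(conn, table, rules))
    return result.returncode == 0


cmd = build_check_command("customers.csv", "customers", ["not_null(id)"])
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with connection URLs and rule expressions that contain parentheses.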

---

## 🤝 Contributing
## Learn More

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) and [Code of Conduct](CODE_OF_CONDUCT.md).

### Development Setup
For detailed development setup instructions, see [DEVELOPMENT_SETUP.md](docs/DEVELOPMENT_SETUP.md).
- **[Usage Guide (USAGE.md)](docs/USAGE.md)**: Learn about all commands, arguments, and advanced features.
- **[Configuration Reference (CONFIG_REFERENCE.md)](docs/CONFIG_REFERENCE.md)**: See how to configure the tool via `toml` files.
- **[Contributing Guide (CONTRIBUTING.md)](CONTRIBUTING.md)**: We welcome contributions!

---

## ❓ FAQ: Why ValidateLite?

### Q: What is ValidateLite, in one sentence?
A: ValidateLite is a lightweight, zero-config Python CLI tool for data quality validation, profiling, and rule-based checks across CSV files and SQL databases.

### Q: How is it different from other tools like Great Expectations or Pandera?
A: Unlike heavyweight frameworks, ValidateLite is built for simplicity and speed — no code generation, no DSLs, just one command to validate your data in pipelines or ad hoc scripts.

### Q: What kind of data sources are supported?
A: Currently supports CSV, Excel, and SQL databases (MySQL, PostgreSQL, SQLite) with planned support for more cloud and file-based sources.

### Q: Who should use this?
A: Data engineers, analysts, and Python developers who want to integrate fast, automated data quality checks into ETL jobs, CI/CD pipelines, or local workflows.

### Q: Does it require writing Python code?
A: Not at all. You can specify rules inline in the command line or via a simple JSON config file — no coding needed.

### Q: Is ValidateLite open-source?
A: Yes! It’s licensed under MIT and available on GitHub — stars and contributions are welcome!

### Q: How can I use it in CI/CD?
A: Just install via pip and add a vlite check ... step in your data pipeline or GitHub Action. It returns exit codes you can use for gating deployments.

---
## 📝 Development Blog

## 🔒 Security
Follow the journey of building ValidateLite through our development blog posts:

For security issues, please review [SECURITY.md](SECURITY.md) and follow the recommended process.
- **[DevLog #1: Building a Zero-Config Data Validation Tool](https://blog.litedatum.com/posts/Devlog01-data-validation-tool/)**
- **[DevLog #2: Why I Scrapped My Half-Built Data Validation Platform](https://blog.litedatum.com/posts/Devlog02-Rethinking-My-Data-Validation-Tool/)**
- **[Rule-Driven Schema Validation: A Lightweight Solution](https://blog.litedatum.com/posts/Rule-Driven-Schema-Validation/)**

---

## 📄 License

This project is licensed under the terms of the [MIT License](LICENSE).

---

## 🙏 Acknowledgements

- Inspired by best practices in data engineering and open-source data quality tools
- Thanks to all contributors and users for their feedback and support
This project is licensed under the [MIT License](LICENSE).
4 changes: 2 additions & 2 deletions cli/__init__.py
@@ -2,10 +2,10 @@
ValidateLite CLI Package

Command-line interface for the data quality validation tool.
Provides a unified `vlite-cli check` command for data quality checking.
Provides a unified `vlite check` command for data quality checking.
"""

__version__ = "0.4.0"
__version__ = "0.4.2"

from .app import cli_app
