Blazing fast format validations for your CSV files
This is a Python library with a Rust core that lets you validate huge CSV files (GBs) in seconds, or in a few minutes for really huge files, while using a minimal amount of memory.
- ✨ Validate both plain and gzipped CSV files
- 🔍 Multiple validation types supported:
  - Correct column name and order
  - Regular expressions
  - Well-known formats (integer, decimal, etc.)
  - Minimum/maximum numerical value checks
  - Value set validation (allowed values)
- 🦀 + 🐍 Rust lib with Python bindings included
- 📝 Detailed validation summaries with sample invalid values
- 🚀 High performance with optimizations like regex pre-compilation
- 📊 Support for large CSV files
```shell
pip install csv_validation
# or
poetry add csv_validation
# or
uv add csv_validation
```

```python
from csv_validation import CSVValidator

validator = CSVValidator.from_file("validation_rules.yaml")
is_valid = validator.validate("data.csv")
```

```python
validation_rules = """
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
  - name: Age
    format: positive_integer
    max: 120
"""

validator = CSVValidator.from_string(validation_rules)

# Optionally set a custom column separator
validator.set_separator(";")  # Default is comma (,)

# Validate a CSV file
is_valid = validator.validate("data.csv")
```

Create a small, easy-to-read YAML file with your validation rules. Example:
```yaml
columns:
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$          # Letters and spaces, 2-50 characters
  - name: Family Name
    regex: ^[A-Za-z\s'-]{2,50}$        # Letters, spaces, hyphens and apostrophes
  - name: Age
    format: positive_integer           # Using a predefined format instead of a custom regex
    max: 120
  - name: Salary
    format: integer                    # Allows negative integers too
    min: 20000
  - name: Email
    regex: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$   # Standard email format
  - name: Phone
    regex: ^\+?[0-9]{10,15}$           # International phone format with optional +
  - name: Status
    values: [active, inactive, pending, suspended]   # Only these values are allowed
  - name: Gender
    values: [M, F, NB, O]              # M: Male, F: Female, NB: Non-Binary, O: Other
```
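To make the rules above concrete, here is a small stand-alone illustration of what some of those regex patterns accept and reject, using Python's `re` module. Note this is only an illustration of the patterns themselves; the library evaluates them in its Rust core, not via Python's `re`.

```python
import re

# The regex rules from the YAML above, reproduced for illustration
rules = {
    "Name": r"^[A-Za-z\s]{2,50}$",
    "Email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    "Phone": r"^\+?[0-9]{10,15}$",
}

def matches(column: str, value: str) -> bool:
    """Return True if the value satisfies the column's regex rule."""
    return re.fullmatch(rules[column], value) is not None

print(matches("Name", "Ada Lovelace"))       # True
print(matches("Name", "X"))                  # False: shorter than 2 characters
print(matches("Email", "user@example.com"))  # True
print(matches("Phone", "+34600111222"))      # True
print(matches("Phone", "12345"))             # False: fewer than 10 digits
```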
- Regular Expression (`regex`): Validate fields against custom regex patterns
- Format (`format`): Predefined formats for common validations (this is the recommended way to validate numeric fields)
  - Under the hood, the library uses regex patterns to validate formats.
  - Available formats:
    - `integer`: Validates any integer number (positive or negative)
    - `positive integer`: Validates positive integer numbers
    - `negative integer`: Validates negative integer numbers
    - `decimal` / `decimal point`: Validates any decimal number (positive or negative), using a point as decimal separator
    - `negative decimal point`: Validates negative decimal numbers, using a point as decimal separator
    - `decimal comma`: Validates decimal numbers using a comma as decimal separator
    - `positive decimal` / `positive decimal point`: Validates positive decimal numbers, using a point as decimal separator
    - `positive decimal comma`: Validates positive decimal numbers using a comma as decimal separator
    - `decimal scientific`: Validates decimal numbers in scientific notation (e.g. 23.02e-12)
    - `decimal scientific comma`: Validates decimal numbers in scientific notation using a comma as decimal separator
    - `positive decimal scientific`: Validates positive decimal numbers in scientific notation
  - More formats will be added in upcoming versions
- Minimum Value (`min`): Check that numeric fields are greater than or equal to a specified value
- Maximum Value (`max`): Check that numeric fields are less than or equal to a specified value
- Value Set (`values`): Ensure fields only contain values from a predefined set
- Extra (`extra`): Extra validations that normally complement one of the other types
  - Available extras:
    - `non_empty`: Validates that the field contains at least one character
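The sketch below shows illustrative regex equivalents for a few of the predefined formats, to make their semantics concrete. These are approximations written for this README; the library ships its own pre-compiled patterns internally, which may differ in detail.

```python
import re

# Illustrative regex equivalents for some predefined formats (approximations)
FORMATS = {
    "integer": r"-?\d+",
    "positive integer": r"\d+",
    "decimal point": r"-?\d+(\.\d+)?",
    "decimal comma": r"-?\d+(,\d+)?",
    "decimal scientific": r"-?\d+(\.\d+)?([eE][+-]?\d+)?",
}

def check(fmt: str, value: str) -> bool:
    """Return True if the whole value matches the illustrative format pattern."""
    return re.fullmatch(FORMATS[fmt], value) is not None

print(check("integer", "-42"))                  # True
print(check("positive integer", "-42"))         # False: leading minus sign
print(check("decimal comma", "3,14"))           # True
print(check("decimal scientific", "23.02e-12")) # True
```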
You can add as many validation types as you want for the same column, but note that a column is only considered correct if all of its validations pass.
Empty values (empty string `''`) are considered correct and accepted by default. However, if your data must always have some content, you can add a global `empty_not_ok` flag at the root level of your YAML definition to automatically add the `extra: non_empty` validation to all columns:
```yaml
empty_not_ok: true
columns:
  - name: Column1
    # other validations...
```

You can enforce uniqueness across one or more columns by adding a root-level `unique` field in your YAML definition. This enables memory-efficient duplicate detection that scales to very large files.
- Accepts a single column name (string) or a list of column names (array of strings).
- Works with both plain and gzipped CSV files.
- Uses a high-performance, disk-backed key-value store (redb) under the hood to keep memory usage low.
Examples:
```yaml
# Single-column unique key
unique: ID
columns:
  - name: ID
    format: positive integer
  - name: Name
    regex: ^[A-Za-z\s]{2,50}$
```

```yaml
# Composite unique key across multiple columns
unique: [isin, date, score_provider]
columns:
  - name: isin
    regex: ^[A-Z0-9]{12}$
  - name: date
    regex: ^\d{4}-\d{2}-\d{2}$
  - name: score_provider
    values: [sp1, sp2, sp3]
  - name: score
    format: decimal
```

What you will see in the summary:
- Always shows the number of duplicate key groups found.
- If there are more than 100 duplicate groups, only the first 100 are displayed as a sample (with their occurrence counts), and the total number is indicated.
- If no duplicates are found, it reports: "No duplicates found".
Sample output (composite key):
```
UNIQUE KEY: column(s): isin, date, score_provider
DUPLICATED KEYS (groups found: 134)
--------------------------------------------------
Showing first 100 of 134 duplicate key groups:
- (isin='US0004026250', date='2025-10-01', score_provider='sp1') -> occurrences: 3
- (isin='US5949181045', date='2025-10-01', score_provider='sp2') -> occurrences: 4
... up to 100 lines ...
```
If none:
```
UNIQUE KEYS CHECK
--------------------------------------------------
- No duplicates found
```
Performance note: Duplicate detection stores only the unique keys on disk and keeps a tiny in-memory summary used for printing the report. This allows validating huge datasets with minimal RAM usage.
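To make the idea behind this design concrete, here is a conceptual sketch of key-based duplicate detection backed by an embedded store. The library uses redb from Rust; this sketch substitutes Python's stdlib `sqlite3` (a different store, chosen only because it is universally available) to show the principle: each composite key is upserted once, and only keys with more than one occurrence are reported.

```python
import csv
import io
import sqlite3

def find_duplicate_keys(rows, key_columns):
    """Return {composite_key: occurrence_count} for keys seen more than once."""
    db = sqlite3.connect(":memory:")  # use a file path for true disk backing
    db.execute("CREATE TABLE seen (key TEXT PRIMARY KEY, count INTEGER)")
    for row in rows:
        key = "|".join(row[c] for c in key_columns)
        # Insert the key, or bump its count if it already exists (upsert)
        db.execute(
            "INSERT INTO seen VALUES (?, 1) "
            "ON CONFLICT(key) DO UPDATE SET count = count + 1",
            (key,),
        )
    return {k: n for k, n in db.execute("SELECT key, count FROM seen WHERE count > 1")}

data = "isin,date\nUS1,2025-10-01\nUS1,2025-10-01\nUS2,2025-10-01\n"
dups = find_duplicate_keys(csv.DictReader(io.StringIO(data)), ["isin", "date"])
print(dups)  # {'US1|2025-10-01': 2}
```

With a file-backed connection, memory usage stays flat regardless of how many unique keys the CSV contains, which is the same property the redb-based implementation relies on.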
By default, the library uses comma (`,`) as the column separator. You can change this using the `set_separator` method:

```python
validator = CSVValidator.from_file("rules.yml")
validator.set_separator(";")   # Use semicolon as separator
validator.set_separator("\t")  # Use tab as separator
```

By default, the library uses point (`.`) as the decimal separator. You can change this using the `set_decimal_separator` method:

```python
validator = CSVValidator.from_file("rules.yml")
validator.set_decimal_separator(",")  # Use comma
```

The library provides detailed validation reports through Python's logging system. When a validation fails, you'll get information about:
- Which columns failed validation
- What type of validation failed
- Sample of invalid values (including row numbers)
- Number of rows that failed each validation
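Since the reports go through Python's standard `logging` module, you need a configured handler to see them. A minimal setup (configuring the root logger; the library's specific logger name is not assumed here) looks like this:

```python
import logging

# Show INFO-level messages (such as validation summaries) on stderr.
# force=True replaces any handlers configured earlier (Python 3.8+).
logging.basicConfig(level=logging.INFO, force=True)

logging.getLogger(__name__).info("logging configured")
```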
```
VALIDATIONS SUMMARY
==================================================
FILE: /test.csv
Rows: 232230 | Columns: 5

CORRECT COLUMNS: 3/5
--------------------------------------------------
- County: [✔] OK
    ✔ - regex: '^.*$'
- City: [✔] OK
    ✔ - regex: '^.*$'
- Model Year: [✔] OK
    ✔ - regex: '^[0-9]+$'
    ✔ - min: 1999
    ✔ - max: 2025.78

WRONG COLUMNS: 2/5
--------------------------------------------------
- State: [✖] FAIL
    ✖ - values: 'WA', 'OR', 'NY', 'DC', 'CA', 'TX', 'FL', 'OK', 'MO', 'KS', 'VA', 'MA', 'MO', 'NC', 'IL', 'AL', 'WY', 'CO', 'PA', 'WI', 'MD', 'NV', 'AZ'
    "Wrong Rows: 93 | Sample: [row 12: 'GA'], [row 15: 'NJ'], [row 22: 'CT'], [row 23: 'NJ'], [row 24: 'CT']"
- Postal Code: [✖] FAIL
    ✔ - integer
    ✖ - non_empty
    "Wrong Rows: 4 | Sample: [row 45: ''], [row 123: ''], [row 456: ''], [row 789: '']"

VALIDATION RESULT
--------------------------------------------------
[✖] FAIL: File /test.csv DOESN'T match all validations
```
- Rust 2021 edition or later
- Python 3.6+ (for Python bindings)
- Cargo and standard Rust tooling
```shell
# Clone the repository
git clone https://github.com/charro/csv_validation
cd csv_validation

# Build the project
cargo build --release

# Run tests
cargo test
```

- `csv`: CSV parsing
- `flate2`: Compression support
- `pyo3`: Python bindings
- `regex`: Regular expression support
- `yaml-rust2`: YAML parsing
- `redb`: Disk-backed key-value storage used for memory-efficient duplicate detection
- Various utilities for logging and serialization
MIT License
Copyright (c) 2024 CSV Validation Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributions are welcome! Please feel free to submit a Pull Request.