A Python utility for cleaning CSV files by removing empty rows, trailing commas, and formatting issues while preserving UTF-8 encoding for international text (including Khmer script).
- Remove empty rows - Eliminates completely blank lines and rows with only commas
- Trim trailing commas - Removes extra commas at the end of rows that cause parsing issues
- Preserve UTF-8 encoding - Maintains international characters (Khmer, Chinese, Arabic, etc.)
- Multiple encoding support - Automatically detects and handles different file encodings
- Column structure validation - Ensures all rows have the same number of columns as the header
- Batch processing - Clean multiple files at once with glob patterns
- Flexible output options - Custom output paths or automatic naming with "_cleaned" suffix
- Detailed reporting - Shows processing statistics for each file
- Python 3.6 or higher
- No additional packages required (uses only standard library)
# Clean a single file (creates data_cleaned.csv)
python csv_cleaner.py data.csv
# Clean with custom output name
python csv_cleaner.py data.csv clean_data.csv
# Clean all CSV files in current directory
python csv_cleaner.py *.csv
# Clean multiple specific files
python csv_cleaner.py file1.csv file2.csv file3.csv

For Windows users, use the included batch file:
# Double-click clean_csv.bat for usage instructions
# Or use from command line:
clean_csv.bat data.csv
clean_csv.bat *.csv

# Single file processing
python csv_cleaner.py input_file.csv [output_file.csv]
# Multiple file processing
python csv_cleaner.py file1.csv file2.csv file3.csv
# Glob pattern processing
python csv_cleaner.py "*.csv"
# Output to specific directory
python csv_cleaner.py input.csv --output-dir cleaned/
python csv_cleaner.py *.csv --output-dir cleaned/

python csv_cleaner.py sample_data.csv
Output:
Cleaning file: sample_data.csv
✓ Cleaned file saved: sample_data_cleaned.csv
Original rows: 156
Cleaned rows: 154
Rows removed: 2

python csv_cleaner.py *.csv --output-dir cleaned
Output:
Cleaning file: data1.csv
✓ Cleaned file saved: cleaned/data1_cleaned.csv
Original rows: 156
Cleaned rows: 154
Rows removed: 2
Cleaning file: data2.csv
✓ Cleaned file saved: cleaned/data2_cleaned.csv
Original rows: 180
Cleaned rows: 178
Rows removed: 2
✓ Successfully cleaned 2 files
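
The output names in the sample runs (`sample_data_cleaned.csv`, `cleaned/data1_cleaned.csv`) follow a simple rule: keep the original file name, append `_cleaned` before the extension, and place the result in the output directory if one was given. A minimal sketch of that rule follows; the helper name `cleaned_path` is hypothetical and is not taken from `csv_cleaner.py`.

```python
from pathlib import Path
from typing import Optional


def cleaned_path(input_path: str, output_dir: Optional[str] = None) -> Path:
    """Build the path for the cleaned copy of input_path (hypothetical helper)."""
    src = Path(input_path)
    # Without an output directory, the cleaned copy sits next to the original.
    target_dir = Path(output_dir) if output_dir else src.parent
    # "data1.csv" becomes "data1_cleaned.csv"; the extension is kept as-is.
    return target_dir / f"{src.stem}_cleaned{src.suffix}"


if __name__ == "__main__":
    print(cleaned_path("sample_data.csv"))
    print(cleaned_path("data1.csv", "cleaned"))
```

Because the result is always a new path, the original file is left untouched.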
id,name,category,description,status,,,,,
,,,,,,,,
101,Product A,Electronics,High quality device,Active,,,,,
102,Product B,Clothing,Cotton material,Pending,,,,,
,,,,,,,,

After cleaning:
id,name,category,description,status
101,Product A,Electronics,High quality device,Active
102,Product B,Clothing,Cotton material,Pending

- ✓ Empty rows removed
- ✓ Trailing commas stripped
- ✓ Consistent column count
- ✓ UTF-8 encoding preserved
- ✓ Whitespace normalized
csv-cleaner/
├── csv_cleaner.py                   # Main Python script
├── clean_csv.bat                    # Windows batch file
├── README.md                        # This documentation
└── examples/
    ├── input/
    │   ├── sample_data.csv          # Sample input file
    │   └── inventory.csv            # Sample input file
    └── output/
        ├── sample_data_cleaned.csv  # Cleaned output
        └── inventory_cleaned.csv    # Cleaned output
The script automatically handles multiple text encodings:
- UTF-8 (primary) - For international text
- UTF-8-BOM - For files with byte order mark
- CP1252 - For Windows Latin characters
- Latin1 - Fallback encoding
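
The fallback order above could be implemented roughly as follows. This is a sketch under assumptions, not the script's actual code: the function name `decode_with_fallback` is hypothetical, and the sketch folds the first two entries into Python's `utf-8-sig` codec, which decodes plain UTF-8 and also strips a byte order mark when one is present.

```python
from typing import Tuple

# Assumed order: UTF-8 (with or without BOM), then CP1252, then Latin-1.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first codec that decodes raw."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Latin-1 maps every byte value, so this line is a safety net only.
    raise ValueError("Could not decode file with any common encoding")


if __name__ == "__main__":
    text, used = decode_with_fallback("សួស្តី,Khmer".encode("utf-8"))
    print(used, text)
```

One consequence of ending the list with Latin-1 is that decoding never fails outright: a file in an unlisted encoding is still read, but its non-ASCII characters may come out wrong, so it is worth checking the cleaned output for such files.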
- Read file with UTF-8 encoding (fallback to other encodings if needed)
- Strip whitespace and trailing commas from each line
- Filter empty rows and rows containing only commas
- Validate structure ensuring consistent column count
- Write clean file maintaining UTF-8 encoding
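
The strip, filter, and write steps above can be sketched in a few lines. The names `clean_lines` and `clean_file` are hypothetical, and the sketch works on raw lines, so it assumes no quoted field contains a line break or ends with a meaningful comma.

```python
from typing import Iterable, List


def clean_lines(lines: Iterable[str]) -> List[str]:
    """Strip whitespace and trailing commas; drop rows that end up empty."""
    cleaned = []
    for line in lines:
        # Remove surrounding whitespace, then any trailing commas or spaces.
        stripped = line.strip().rstrip(", \t")
        if stripped:  # blank and comma-only rows become "" and are skipped
            cleaned.append(stripped)
    return cleaned


def clean_file(src: str, dst: str) -> int:
    """Write a cleaned UTF-8 copy of src to dst and return the row count."""
    with open(src, encoding="utf-8") as infile:
        rows = clean_lines(infile)
    with open(dst, "w", encoding="utf-8", newline="\n") as outfile:
        outfile.writelines(row + "\n" for row in rows)
    return len(rows)


if __name__ == "__main__":
    sample = [
        "id,name,category,description,status,,,,,\n",
        ",,,,,,,,\n",
        "101,Product A,Electronics,High quality device,Active,,,,,\n",
    ]
    for row in clean_lines(sample):
        print(row)
```

Note that stripping every trailing comma also removes trailing fields that are legitimately empty, which is why a separate column-count check against the header is still needed.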
- File not found - Clear error message with file path
- Encoding errors - Automatic fallback to alternative encodings
- Permission errors - Informative error messages
- Invalid CSV structure - Graceful handling with warnings
- Original files are never modified - Only cleaned copies are created
- UTF-8 encoding preserved - International characters remain intact
- Data structure maintained - Column relationships preserved
- Memory efficient - Processes files line by line
- Fast processing - No external dependencies
- Batch optimized - Efficient handling of multiple files
- CSV format only - Does not process Excel files (.xlsx, .xls)
- Memory usage - Very large files (>100MB) may require more RAM
- Column detection - Assumes first non-empty row is the header
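
To illustrate the header assumption, the sketch below treats the first non-empty row as the header and reports every later row whose column count differs from it. It only reports mismatches; how `csv_cleaner.py` actually reacts to them is not shown here, and the function name `find_column_mismatches` is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def find_column_mismatches(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return the header and (record_number, column_count) for rows that differ."""
    header: List[str] = []
    mismatches: List[Tuple[int, int]] = []
    reader = csv.reader(io.StringIO(text, newline=""))
    for record_number, row in enumerate(reader, start=1):
        if not any(cell.strip() for cell in row):
            continue  # blank or comma-only row: ignored for header detection
        if not header:
            header = row  # first non-empty row is assumed to be the header
            continue
        if len(row) != len(header):
            mismatches.append((record_number, len(row)))
    return header, mismatches


if __name__ == "__main__":
    data = "\n,,\nid,name,status\n1,A,Active\n2,B\n"
    print(find_column_mismatches(data))
```

If a file starts with a title line or a comment above the real header, that line is taken as the header and most data rows are then reported as mismatches.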
FileNotFoundError: Input file not found: myfile.csv
Solution: Check the file path and make sure the file exists at that location.

UnicodeDecodeError: Could not decode file with any common encoding
Solution: The script already tries several encodings automatically. If the error persists, the file may be corrupted or in an unsupported format.

python csv_cleaner.py "*.csv"
# No output
Solution: Make sure CSV files exist in the current directory, and put quotes around the glob pattern.
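
A quoted pattern reaches the script unexpanded, and the Windows command prompt does not expand wildcards at all, so the script has to expand patterns itself. A minimal sketch of that expansion follows; the helper name `expand_inputs` is hypothetical. It shows why a pattern that matches nothing yields no files and therefore no output.

```python
import glob
from typing import Iterable, List


def expand_inputs(arguments: Iterable[str]) -> List[str]:
    """Expand wildcard arguments into file paths; keep plain names as given."""
    paths: List[str] = []
    for argument in arguments:
        if any(char in argument for char in "*?["):
            # A pattern contributes only the files it matches, possibly none.
            paths.extend(sorted(glob.glob(argument)))
        else:
            # A plain name is kept so a missing file can be reported later.
            paths.append(argument)
    return paths


if __name__ == "__main__":
    files = expand_inputs(["*.csv"])
    if not files:
        print("No files matched; check the directory and the pattern.")
    for path in files:
        print(path)
```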
- Check file permissions - Ensure you can read input files and write to output directory
- Verify Python version - Use Python 3.6 or higher
- Test with sample file - Try with a simple CSV file first
- Check file encoding - Open file in text editor to verify it's readable
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
git clone <repository-url>
cd csv-cleaner
python csv_cleaner.py --help

For support or questions:
- Create an issue in the repository
- Check the troubleshooting section above
- Verify your CSV file format and encoding
Made with ❤️ for cleaning messy CSV files