-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathReadMe.txt
More file actions
151 lines (133 loc) · 9.94 KB
/
ReadMe.txt
File metadata and controls
151 lines (133 loc) · 9.94 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# CsvReporter (HashTable, Streaming, Sort-Merge, Detailed Validation)
PowerShell scripts to compare two CSV files and produce a changes report. Choose by data size and validation depth. All provide robust header validation, configurable delimiter and encoding, case-sensitive/insensitive comparisons, and progress reporting.
Scripts
- CompareCSVs_small.ps1 — In-memory comparison (fastest for small/medium files; standard output)
- CompareCSVs_medium.ps1 — Streaming/batched comparison (handles larger files; bounded writes; standard output)
- CompareCSVs_large.ps1 — External sort-merge using temporary sorted files (bounded memory for very large files; standard output)
- CompareCSVs_Detailed.ps1 — In-memory comparison with field-level match tracking and summary statistics (validation/audit scenarios)
- CompareCSVs_DetailedV2.ps1 — In-memory comparison with field-level match tracking, value transforms, date normalization, and column filtering (advanced validation/audit scenarios)
Key features
- Robust header parsing (quoted headers, embedded delimiters) via Microsoft.VisualBasic.TextFieldParser
- Validation:
- Empty/blank header detection (with 1-based positions)
- Duplicate header detection after normalization (Trim + ToLowerInvariant)
- Required anchor column present in both files
- Duplicate anchor values within each file
- Consistent row length per file
- Blank/malformed data row detection
- Configurable delimiter: comma, tab, semicolon, pipe
- Configurable encoding for input and output
- Case-sensitive or insensitive comparisons
- Progress bars (Write-Progress) with phase updates; bars always clear even on errors
- Output filenames include millisecond-precision timestamps for uniqueness
Requirements
- Windows PowerShell 5.1 or PowerShell 7+ on Windows
- Microsoft.VisualBasic available (the scripts load it for TextFieldParser)
Parameters (common)
- -PreviousCSVFile: Path to the “Previous” CSV
- -CurrentCSVFile: Path to the “Current” CSV
- -AnchorColumn: Header name of the key column used to join rows
- -OutputFolder: Folder where the changes CSV will be written
- -DelimiterName: One of comma, tab, semicolon, pipe
- -EncodingName: One of auto, utf8, utf8BOM, unicode, oem, default
- auto: let Import-Csv detect BOM; export uses UTF-8 on PS 6+ (UTF-8 with BOM behavior on PS 5.1 mappings handled internally)
- utf8BOM: UTF-8 (Excel-friendly; default)
- utf8: UTF-8 (no BOM on PS 6+, BOM on PS 5.1)
- unicode: UTF-16 LE
- oem, default: legacy code pages as exposed by PowerShell
- -CaseSensitive: Use case-sensitive comparisons
Parameters (script-specific)
- CompareCSVs_medium.ps1: -BatchSize (default 1000)
- CompareCSVs_large.ps1: -BatchSize (default 1000)
- CompareCSVs_DetailedV2.ps1: -ValueTransforms (hashtable of column transformations), -DateFormats (hashtable of date format specs), -IgnoreColumns (array of column names to skip)
ValueTransforms syntax (DetailedV2 only)
-ValueTransforms applies transformations to Previous file values during comparison only (original values preserved in output). Format: @{ 'ColumnName' = @{ 'OldValue' = 'NewValue'; ... }; ... }
Three transformation types:
- Direct replacement: 'OldValue' = 'NewValue' (replaces value directly)
- Prefix modifier: 'OldValue' = '<<prefix' (prepends prefix to original value; useful for padding: '5' = '<<00' becomes '005')
- Suffix modifier: 'OldValue' = '>>suffix' (appends suffix to original value; useful for codes: 'USA' = '>>0' becomes 'USA0')
- Wildcard key '*': Matches any value not explicitly mapped (fallback rule)
Example wrapper script:
$transforms = @{
'worker_type' = @{
'Employee' = 'EMP'
'Contingent Worker' = 'CWK'
}
'profit_center' = @{
'*' = '>>0'
}
'legal_entity' = @{
'*' = '>>0'
}
}
.\CompareCSVs_DetailedV2.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv `
-AnchorColumn EmployeeID -OutputFolder .\out -ValueTransforms $transforms -DelimiterName comma -EncodingName utf8BOM
-DateFormats syntax (DetailedV2 only)
Normalizes dates across different formats during comparison. Format: @{ 'ColumnName' = @{ Previous = 'format'; Current = 'format'; Output = 'format' }; ... }
Skips empty/blank values. Non-empty values that fail to parse generate a warning but fall back to raw comparison.
Example: @{ 'HireDate' = @{ Previous = 'MM/dd/yyyy'; Current = 'yyyy-MM-dd'; Output = 'yyyy-MM-dd' } }
-IgnoreColumns syntax (DetailedV2 only)
Excludes specified columns from comparison. Format: @('ColumnName1', 'ColumnName2', ...)
Example: -IgnoreColumns @('LastModifiedDate', 'ProcessingNotes')
Cannot ignore the anchor column. Ignored columns still appear in output but are not compared.
Delimiters
- comma => ,
- tab => `t
- semicolon => ;
- pipe => |
Output
- A CSV written to -OutputFolder named:
- Changes_{CurrentFileBaseName}_GeneratedOn_{yyyy-MM-dd_HHmmssfff}.csv
- Standard output (Small, Medium, Large):
- Columns: AnchorColumn, ChangeType, and for each header: "old header" and "new header"
- Includes all rows (Add, Update, Delete, None)
- Detailed output (Detailed and DetailedV2):
- Columns: AnchorColumn, ChangeType, and for each header: "old header", "new header", "match header"
- Includes all rows (Add, Update, Delete, None) with per-field match indicators (True/False)
- Summary row inserted as first record showing mismatch counts per column ("X of Y FALSE")
- DetailedV2 only: Summary row's "old" column also shows applied transform rule counts (e.g., "Employee→EMP (5 applied), Contingent Worker→CWK (3 applied)")
- Output sorted by anchor column
- ChangeType meanings (all scripts):
- Add: Exists only in Current
- Delete: Exists only in Previous
- Update: Exists in both with at least one differing value
- None: Exists in both with no value changes
- Match column (Detailed only):
- True: Values match between Previous and Current
- False: Values differ between Previous and Current
- Empty: Not applicable (Add/Delete rows cannot be matched)
Usage examples
- Standard output, in-memory (small/medium):
- .\CompareCSVs_small.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -DelimiterName comma -EncodingName utf8BOM
- Standard output, streaming (larger files):
- .\CompareCSVs_medium.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -BatchSize 2000 -DelimiterName comma -EncodingName auto
- Standard output, sort-merge (very large files):
- .\CompareCSVs_large.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -BatchSize 5000 -DelimiterName comma -EncodingName utf8
- Detailed validation (field-level match tracking, audit trails):
- .\CompareCSVs_Detailed.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -DelimiterName comma -EncodingName utf8BOM
- Detailed validation with value transforms (normalize data during comparison):
- .\CompareCSVs_DetailedV2.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -ValueTransforms @{'Salary'='<<0.00'} -DelimiterName comma -EncodingName utf8BOM
- Detailed validation with column filtering (exclude audit columns):
- .\CompareCSVs_DetailedV2.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -IgnoreColumns @('LastModifiedDate','ProcessingNotes') -DelimiterName comma -EncodingName utf8BOM
- Detailed validation with date normalization (compare dates in different formats):
- .\CompareCSVs_DetailedV2.ps1 -PreviousCSVFile .\prev.csv -CurrentCSVFile .\curr.csv -AnchorColumn EmployeeID -OutputFolder .\out -DateFormats @{'HireDate'=@{Previous='d-MMM-yy';Current='yyyy-MM-dd';Output='yyyy-MM-dd'}} -DelimiterName comma -EncodingName utf8BOM
Choosing a script
- CompareCSVs_small.ps1: Entire files in memory. Fastest and simplest for small/medium datasets. Use when you need a quick list of changes.
- CompareCSVs_medium.ps1: Loads "Previous" into a dictionary; streams "Current" and writes in batches. Good for larger inputs where memory is a constraint.
- CompareCSVs_large.ps1: Sorts both files by anchor on disk, then merges. Best for very large inputs; temp files are always cleaned up. Use when standard output meets your needs but data is too large for memory.
- CompareCSVs_Detailed.ps1: In-memory with field-level match tracking. Use for validation scenarios where you need to identify exactly which fields differ and by how much. Ideal for pre-cutover validation (old process vs new process), audit trails, and Excel-based analysis with filtering. Not suitable for very large datasets due to memory requirements and output size.
- CompareCSVs_DetailedV2.ps1: Enhanced version of Detailed with -ValueTransforms (column-specific value mapping), -DateFormats (date normalization across formats), and -IgnoreColumns (skip columns from comparison). Use when you need to normalize data during comparison, handle dates in different formats, or exclude specific columns from analysis. Same memory/output limitations as Detailed.
Progress and diagnostics
- Progress bars show phases such as loading, streaming, sorting, and merging, with periodic updates.
- Clear error messages for header and row validation issues.
- A summary line with Adds/Updates/Deletes/Unchanged counts is printed at completion.
- Elapsed time (minutes and seconds) is printed at completion.
- "No changes detected; no CSV written" is printed if all records are unchanged (ChangeType=None).
Notes
- Paths are handled with -LiteralPath to avoid wildcard surprises.
- Temp file names in Large include millisecond timestamps and are removed in a finally block, even on errors.
Contributing
- File issues and PRs with minimal repro CSVs where possible.
- Keep behavior consistent across scripts.
License
- MIT (or your preferred license)