Data Reconciliation Agent is a local-first, CLI-first Python project for checking whether a target dataset preserved what mattered from a source dataset. It is built for migration, re-platforming, and source-to-target validation work where you need evidence, not guesswork.
Migration work usually fails in boring ways:
- rows go missing,
- extra rows show up,
- key fields do not line up,
- values change during transforms,
- and people sign off based on spreadsheet spot checks.
This project gives deterministic outputs you can rerun, inspect, and review.
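To make the failure modes above concrete, here is a minimal sketch of a keyed comparison that reports missing rows, unexpected rows, and value mismatches. It is not this project's engine: the function names (`index_by_key`, `reconcile`) and the inline sample data are assumptions made for illustration, and it assumes keys are unique and non-null.

```python
import csv
import io


def index_by_key(rows, key):
    """Index rows by key value. Assumes keys are unique and non-null."""
    return {row[key]: row for row in rows}


def reconcile(source_rows, target_rows, key, fields):
    """Compare two row sets by key and return sorted, repeatable findings."""
    source = index_by_key(source_rows, key)
    target = index_by_key(target_rows, key)

    # Set arithmetic on the keys gives the row-level gaps.
    missing = sorted(source.keys() - target.keys())
    unexpected = sorted(target.keys() - source.keys())

    # Exact, field-by-field comparison for keys present on both sides.
    mismatches = []
    for k in sorted(source.keys() & target.keys()):
        for field in fields:
            if source[k][field] != target[k][field]:
                mismatches.append((k, field, source[k][field], target[k][field]))

    return {
        "missing_in_target": missing,
        "unexpected_in_target": unexpected,
        "value_mismatches": mismatches,
    }


# Hypothetical inline data standing in for source and target CSV files.
SOURCE_CSV = "customer_id,email\n1,a@example.com\n2,b@example.com\n3,c@example.com\n"
TARGET_CSV = "customer_id,email\n1,a@example.com\n3,C@example.com\n4,d@example.com\n"

source_rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))
target_rows = list(csv.DictReader(io.StringIO(TARGET_CSV)))

result = reconcile(source_rows, target_rows, key="customer_id", fields=["email"])
print(result)
```

Sorting every output is what makes the result identical across reruns, which is the property a reviewer needs.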
LLMs are useful for summarising findings, but reconciliation should not depend on model judgement.
For repetitive, rules-heavy checks, deterministic code is usually cheaper, faster, and more accurate than asking a model to reason over the task each run. A good use of AI here is helping build and explain a reliable deterministic tool, not redoing deterministic reconciliation decisions every time.
This boundary matters when token cost, repeatability, auditability, and data handling are important.
Authority boundary in this repo:
- deterministic engine = evidence
- agent mode = orchestration
- LLM summary = readability only
The optional LLM layer reads structured deterministic metadata, not raw dataset rows. The trace, report, and exception artifacts remain the source of truth.
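One way to picture that boundary is a filter that passes only aggregate counts to the summary layer. This is a sketch under assumptions: the trace layout (`counts`, `examples`), the `ALLOWED_COUNTS` tuple, and both function names are hypothetical and do not describe the real artifact schema.

```python
# Hypothetical allow-list: only aggregate counts may reach the summary layer.
ALLOWED_COUNTS = (
    "source_rows",
    "target_rows",
    "missing_in_target",
    "unexpected_in_target",
    "value_mismatches",
)


def build_summary_payload(trace):
    """Keep allow-listed counts and drop everything row-level."""
    counts = trace.get("counts", {})
    return {name: counts[name] for name in ALLOWED_COUNTS if name in counts}


def fallback_summary(payload):
    """Plain-text summary built without any model call."""
    return "\n".join(f"{name}: {payload[name]}" for name in sorted(payload))


# Hypothetical trace: the "examples" entry holds row data and must not leak.
trace = {
    "counts": {
        "source_rows": 3,
        "target_rows": 3,
        "missing_in_target": 1,
        "unexpected_in_target": 1,
        "value_mismatches": 1,
    },
    "examples": [{"customer_id": "2", "email": "b@example.com"}],
}

payload = build_summary_payload(trace)
print(fallback_summary(payload))
```

An allow-list is used rather than a deny-list so that a new row-level field added to the trace later is excluded by default.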
I wanted to understand agents by building one from the inside out, starting with deterministic reconciliation that can stand on its own.
I deliberately separated responsibilities:
- deterministic reconciliation does the evidence-producing checks
- agent mode handles orchestration and assumption management
- optional LLM summary improves readability only
This is a learning project written so other people can follow the architecture without hidden automation.
The goal is not to make an LLM perform reconciliation. The goal is to show where an agent helps around reliable deterministic tools.
- source-to-target reconciliation for CSV/XLSX files
- YAML mapping config for different source/target key names and field names
- deterministic string/number/date/datetime comparators
- exception CSV outputs for missing/unexpected/duplicate/null/value mismatch cases
- bounded agent mode that plans and records assumptions
- optional OpenAI polish with deterministic fallback
- auditable trace/report artifacts
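As an illustration of what deterministic typed comparators can look like, the sketch below compares values as exact strings, decimal numbers, ISO dates, or ISO datetimes. The registry name `COMPARATORS`, the `compare` helper, and the choice to treat unparseable values as a mismatch are assumptions for this example, not the project's actual comparator API.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def compare_string(a, b):
    """Exact match only: no case folding, trimming, or fuzzy logic."""
    return a == b


def compare_number(a, b):
    """Compare numerically, so '10.0' and '10.00' are equal."""
    try:
        return Decimal(a) == Decimal(b)
    except InvalidOperation:
        return False  # assumption: unparseable input counts as a mismatch


def compare_date(a, b):
    """Compare ISO 8601 dates (YYYY-MM-DD)."""
    try:
        return date.fromisoformat(a) == date.fromisoformat(b)
    except ValueError:
        return False


def compare_datetime(a, b):
    """Compare ISO 8601 datetimes; 'T' and space separators both parse."""
    try:
        return datetime.fromisoformat(a) == datetime.fromisoformat(b)
    except ValueError:
        return False


# Hypothetical registry keyed by the comparator type named in a mapping.
COMPARATORS = {
    "string": compare_string,
    "number": compare_number,
    "date": compare_date,
    "datetime": compare_datetime,
}


def compare(kind, a, b):
    """Dispatch to the comparator registered for this type."""
    return COMPARATORS[kind](a, b)


print(compare("number", "10.0", "10.00"))
print(compare("string", "Ann", "ann"))
```

Each comparator is a pure function of its two inputs, so the same pair of values always produces the same verdict.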
Deterministic mode runs directly from user-provided key/mapping inputs.
Agent mode does bounded orchestration:
- inspects inputs,
- resolves key and mapping strategy,
- infers safe same-name keys when possible,
- blocks when assumptions are unsafe,
- then calls deterministic tools.
The agent coordinates work. It does not judge whether values match.
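The cautious key inference described above could be sketched as follows. The rule used here (accept a shared column only if it is unique and non-null on both sides, and block when zero or several columns qualify) is an assumption chosen for illustration, and `is_safe_key` and `infer_key` are hypothetical names rather than the planner's real functions.

```python
def is_safe_key(rows, column):
    """A column is a safe key if every value is present and unique."""
    values = [row.get(column) for row in rows]
    if any(v is None or v == "" for v in values):
        return False
    return len(set(values)) == len(values)


def infer_key(source_rows, target_rows):
    """Return (key, reason). key is None when inference is blocked."""
    if not source_rows or not target_rows:
        return None, "blocked: empty input"

    # Only same-name columns are considered.
    shared = sorted(set(source_rows[0]) & set(target_rows[0]))
    candidates = [
        c for c in shared
        if is_safe_key(source_rows, c) and is_safe_key(target_rows, c)
    ]

    if len(candidates) == 1:
        return candidates[0], "inferred: only shared unique, non-null column"
    if not candidates:
        return None, "blocked: no shared unique, non-null column"
    return None, "blocked: ambiguous candidates " + ", ".join(candidates)


orders = [
    {"order_id": "A1", "status": "open"},
    {"order_id": "A2", "status": "open"},
]
print(infer_key(orders, orders))

# Two columns qualify, so the sketch refuses to guess.
ambiguous = [
    {"order_id": "A1", "ref": "X"},
    {"order_id": "A2", "ref": "Y"},
]
print(infer_key(ambiguous, ambiguous))
```

Returning a reason alongside the decision lets the orchestration layer record why it proceeded or stopped, without ever comparing values itself.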
python -m venv .venv
source .venv/Scripts/activate # Windows Git Bash; on macOS/Linux use: source .venv/bin/activate
pip install -e ".[dev]"
Optional OpenAI polish dependencies:
pip install -e ".[dev,llm]"
- Deterministic same-name key:
python -m data_reconciliation_agent.cli \
--source sample_data/customers/source_customers.csv \
--target sample_data/customers/target_customers_clean.csv \
--key customer_id \
--mode deterministic \
--output-dir outputs/customers_clean_run
- Deterministic mapping with value mismatches:
python -m data_reconciliation_agent.cli \
--source sample_data/crm_migration/source_contacts_salesforce.csv \
--target sample_data/crm_migration/target_contacts_dynamics_issues.csv \
--mapping config/examples/crm_contacts_mapping.yaml \
--mode deterministic \
--output-dir outputs/crm_issues_run
- Agent mode inferred key:
python -m data_reconciliation_agent.cli \
--source sample_data/orders/source_orders.csv \
--target sample_data/orders/target_orders_clean.csv \
--mode agent \
--output-dir outputs/orders_agent_inferred
- Agent mode with mapping:
python -m data_reconciliation_agent.cli \
--source sample_data/crm_migration/source_contacts_salesforce.csv \
--target sample_data/crm_migration/target_contacts_dynamics_clean.csv \
--mapping config/examples/crm_contacts_mapping.yaml \
--mode agent \
--output-dir outputs/crm_agent_mapping
- Optional deterministic fallback summary:
python -m data_reconciliation_agent.cli \
--source sample_data/customers/source_customers.csv \
--target sample_data/customers/target_customers_clean.csv \
--key customer_id \
--mode deterministic \
--llm-summary \
--output-dir outputs/customers_fallback_summary
- Optional OpenAI summary:
export OPENAI_API_KEY="your-key"
export OPENAI_MODEL="gpt-4o-mini" # optional
python -m data_reconciliation_agent.cli \
--source sample_data/customers/source_customers.csv \
--target sample_data/customers/target_customers_clean.csv \
--key customer_id \
--mode deterministic \
--llm-summary \
--output-dir outputs/customers_openai_summary
Deterministic artifacts (evidence):
- reconciliation_trace.json
- reconciliation_report.md
- exception CSVs (missing_in_target.csv, unexpected_in_target.csv, etc.)
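For a sense of how an exception CSV might be produced, the sketch below finds rows with null or duplicate keys and serialises them as CSV text. The helper names (`find_key_exceptions`, `to_csv_text`), the column names, and the sample rows are hypothetical; the real exception files may use a different layout.

```python
import csv
import io
from collections import Counter


def find_key_exceptions(rows, key):
    """Split out rows whose key is empty or appears more than once."""
    null_rows = [r for r in rows if not r.get(key)]
    counts = Counter(r[key] for r in rows if r.get(key))
    duplicate_rows = [r for r in rows if r.get(key) and counts[r[key]] > 1]
    return null_rows, duplicate_rows


def to_csv_text(rows, fieldnames):
    """Serialise rows to CSV text, preserving input order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows: one duplicated key and one empty key.
rows = [
    {"customer_id": "1", "name": "Ann"},
    {"customer_id": "1", "name": "Ann B"},
    {"customer_id": "", "name": "Cy"},
    {"customer_id": "2", "name": "Di"},
]

null_rows, duplicate_rows = find_key_exceptions(rows, "customer_id")
print(to_csv_text(duplicate_rows, ["customer_id", "name"]))
```

Writing the offending rows out in full, rather than only counting them, is what lets a reviewer check each exception against the source data.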
Agent artifacts (orchestration context):
- agent_trace.json
- agent_report.md
Optional readability layer:
- llm_summary.md
Authority boundary:
- deterministic artifacts are evidence,
- agent artifacts explain orchestration,
- LLM summary is readability-only.
- src/data_reconciliation_agent/: core code (CLI, deterministic engine, planner, reports, optional LLM polish)
- config/examples/: mapping examples
- sample_data/: fixture datasets
- docs/: usage, architecture, and extension guides
- tests/: deterministic and orchestration test suite
python -m pytest
- no fuzzy matching
- no auto-correction
- no database integration in v1
- no web UI
- not a full migration platform
- LLM output is non-authoritative
- same-name key inference is deliberately cautious
- ARCHITECTURE.md
- PROJECT_SCOPE.md
- docs/how_it_works.md
- docs/interpreting_outputs.md
- docs/example_commands.md
- docs/extension_paths.md
- docs/adding_a_new_comparator.md
- FUTURE_WORK.md