Data Reconciliation Agent

Data Reconciliation Agent is a local-first, CLI-first Python project for checking whether a target dataset preserved what mattered from a source dataset. It is built for migration, re-platforming, and source-to-target validation work where you need evidence, not guesswork.

Why this exists

Migration work usually fails in boring ways:

rows go missing,
extra rows show up,
key fields do not line up,
values change during transforms,
and people sign off based on spreadsheet spot checks.

This project gives deterministic outputs you can rerun, inspect, and review.

Why not just ask an LLM?

LLMs are useful for summarising findings, but reconciliation should not depend on model judgement.

For repetitive, rules-heavy checks, deterministic code is usually cheaper, faster, and more accurate than asking a model to reason through the task on every run. A good use of AI here is helping build and explain a reliable deterministic tool, not remaking deterministic reconciliation decisions every time.

This boundary matters when token cost, repeatability, auditability, and data handling are important.

Authority boundary in this repo:

deterministic engine = evidence
agent mode = orchestration
LLM summary = readability only

The optional LLM layer reads structured deterministic metadata, not raw dataset rows, and trace/report/exception artifacts remain the source of truth.
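
One way to picture this boundary is a summary payload that carries counts only, never row values. The sketch below is illustrative and is not this repo's code: `build_summary_payload`, `fallback_summary`, and the payload field names are hypothetical.

```python
from typing import Dict, List


def build_summary_payload(exceptions: Dict[str, List[dict]], rows_compared: int) -> dict:
    """Reduce per-category exception rows to counts so no row values are passed on."""
    return {
        "rows_compared": rows_compared,
        # Sorted by category name so the payload is identical on every rerun.
        "exception_counts": {name: len(rows) for name, rows in sorted(exceptions.items())},
    }


def fallback_summary(payload: dict) -> str:
    """Deterministic plain-text summary that needs no model at all."""
    lines = [f"Rows compared: {payload['rows_compared']}"]
    for name, count in payload["exception_counts"].items():
        lines.append(f"{name}: {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    exceptions = {
        "missing_in_target": [{"customer_id": "C003"}],
        "unexpected_in_target": [],
    }
    print(fallback_summary(build_summary_payload(exceptions, rows_compared=3)))
```

Because the payload holds only category names and counts, whatever reads it (a model or the fallback) cannot change or leak the underlying evidence.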

Why I built it this way

I wanted to understand agents by building one from the inside out, starting with deterministic reconciliation that can stand on its own.

I deliberately separated responsibilities:

deterministic reconciliation does the evidence-producing checks
agent mode handles orchestration and assumption management
optional LLM summary improves readability only

This is a learning project written so other people can follow the architecture without hidden automation.

The goal is not to make an LLM perform reconciliation. The goal is to show where an agent helps around reliable deterministic tools.

What this project demonstrates

source-to-target reconciliation for CSV/XLSX files
YAML mapping config for different source/target key names and field names
deterministic string/number/date/datetime comparators
exception CSV outputs for missing/unexpected/duplicate/null/value mismatch cases
bounded agent mode that plans and records assumptions
optional OpenAI polish with deterministic fallback
auditable trace/report artifacts
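
To make "deterministic comparators" concrete, here is a minimal sketch of what typed comparison can look like. It is an assumption-laden illustration, not the project's implementation: the function names, the whitespace trimming, and the ISO 8601 date format are all choices made for this example.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def compare_string(source: str, target: str) -> bool:
    """Exact, case-sensitive match after trimming whitespace (assumed normalization)."""
    return source.strip() == target.strip()


def compare_number(source: str, target: str) -> bool:
    """Compare as Decimal so '10.50' equals '10.5' without float rounding."""
    try:
        return Decimal(source.strip()) == Decimal(target.strip())
    except InvalidOperation:
        # An unparseable value is a mismatch, never a guess.
        return False


def compare_date(source: str, target: str) -> bool:
    """Compare ISO 8601 dates (YYYY-MM-DD); other formats count as mismatches."""
    try:
        return date.fromisoformat(source.strip()) == date.fromisoformat(target.strip())
    except ValueError:
        return False


def compare_datetime(source: str, target: str) -> bool:
    """Compare ISO 8601 datetimes; 'T' and space separators are both accepted."""
    try:
        return datetime.fromisoformat(source.strip()) == datetime.fromisoformat(target.strip())
    except ValueError:
        return False


if __name__ == "__main__":
    print(compare_number("10.50", "10.5"))
    print(compare_date("2024-01-05", "05/01/2024"))
```

Each comparator returns a plain boolean from fixed rules, so the same inputs give the same answer on every run.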

Why this is an agent

Deterministic mode runs directly from user-provided key/mapping inputs.

Agent mode does bounded orchestration:

inspects inputs,
resolves key and mapping strategy,
infers safe same-name keys when possible,
blocks when assumptions are unsafe,
then calls deterministic tools.

The agent coordinates work. It does not judge whether values match.
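
The "infer when safe, block when unsafe" step above can be sketched as follows. This is a hypothetical illustration of cautious same-name key inference, not the planner in this repo: `is_safe_key`, `infer_same_name_key`, and the specific safety rule (unique and non-empty in both files, exactly one candidate) are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Rows = List[Dict[str, str]]


def is_safe_key(rows: Rows, column: str) -> bool:
    """A column qualifies only if every row has a non-empty, unique value."""
    values = [(row.get(column) or "").strip() for row in rows]
    return bool(values) and all(values) and len(set(values)) == len(values)


def infer_same_name_key(source: Rows, target: Rows) -> Tuple[Optional[str], str]:
    """Return (key, reason); key is None when the agent should block."""
    if not source or not target:
        return None, "blocked: empty input"
    # Only columns present under the same name in both files are considered.
    shared = [column for column in source[0] if column in target[0]]
    candidates = [c for c in shared if is_safe_key(source, c) and is_safe_key(target, c)]
    if len(candidates) == 1:
        return candidates[0], "inferred: only one shared column is unique and non-empty in both files"
    if not candidates:
        return None, "blocked: no shared column is unique and non-empty in both files"
    return None, "blocked: ambiguous candidates " + ", ".join(candidates)


if __name__ == "__main__":
    source = [{"order_id": "1", "status": "open"}, {"order_id": "2", "status": "open"}]
    target = [{"order_id": "1", "status": "open"}, {"order_id": "2", "status": "open"}]
    print(infer_same_name_key(source, target))
```

Note that the function only chooses which column to join on and records why. Whether values match is still left to the deterministic comparison step.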

Quick start

python -m venv .venv
source .venv/Scripts/activate  # Windows Git Bash; on macOS/Linux use: source .venv/bin/activate
pip install -e ".[dev]"

Optional OpenAI polish dependencies:

pip install -e ".[dev,llm]"

Example commands

Deterministic same-name key:

python -m data_reconciliation_agent.cli \
  --source sample_data/customers/source_customers.csv \
  --target sample_data/customers/target_customers_clean.csv \
  --key customer_id \
  --mode deterministic \
  --output-dir outputs/customers_clean_run

Deterministic mapping with value mismatches:

python -m data_reconciliation_agent.cli \
  --source sample_data/crm_migration/source_contacts_salesforce.csv \
  --target sample_data/crm_migration/target_contacts_dynamics_issues.csv \
  --mapping config/examples/crm_contacts_mapping.yaml \
  --mode deterministic \
  --output-dir outputs/crm_issues_run

Agent mode with inferred key:

python -m data_reconciliation_agent.cli \
  --source sample_data/orders/source_orders.csv \
  --target sample_data/orders/target_orders_clean.csv \
  --mode agent \
  --output-dir outputs/orders_agent_inferred

Agent mode with mapping:

python -m data_reconciliation_agent.cli \
  --source sample_data/crm_migration/source_contacts_salesforce.csv \
  --target sample_data/crm_migration/target_contacts_dynamics_clean.csv \
  --mapping config/examples/crm_contacts_mapping.yaml \
  --mode agent \
  --output-dir outputs/crm_agent_mapping

Optional summary using the deterministic fallback (--llm-summary without OpenAI configured):

python -m data_reconciliation_agent.cli \
  --source sample_data/customers/source_customers.csv \
  --target sample_data/customers/target_customers_clean.csv \
  --key customer_id \
  --mode deterministic \
  --llm-summary \
  --output-dir outputs/customers_fallback_summary

Optional OpenAI summary:

export OPENAI_API_KEY="your-key"
export OPENAI_MODEL="gpt-4o-mini"  # optional
python -m data_reconciliation_agent.cli \
  --source sample_data/customers/source_customers.csv \
  --target sample_data/customers/target_customers_clean.csv \
  --key customer_id \
  --mode deterministic \
  --llm-summary \
  --output-dir outputs/customers_openai_summary

Output artifacts

Deterministic artifacts (evidence):

reconciliation_trace.json
reconciliation_report.md
exception CSVs (missing_in_target.csv, unexpected_in_target.csv, etc.)
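
As a rough picture of how key-level exception files like these can be derived, the sketch below classifies keys with set arithmetic and renders one category as CSV text. It is a simplified assumption, not the engine's code: `classify_keys`, `to_csv_text`, and the `duplicate_in_target` category name are hypothetical, while `missing_in_target` and `unexpected_in_target` follow the file names above.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def classify_keys(source_keys: List[str], target_keys: List[str]) -> Dict[str, List[str]]:
    """Bucket keys into exception categories; sorted so reruns give identical output."""
    source_set, target_set = set(source_keys), set(target_keys)
    target_counts = Counter(target_keys)
    return {
        "missing_in_target": sorted(source_set - target_set),
        "unexpected_in_target": sorted(target_set - source_set),
        "duplicate_in_target": sorted(k for k, n in target_counts.items() if n > 1),
    }


def to_csv_text(key_name: str, keys: List[str]) -> str:
    """Render one category as CSV text with a single key column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([key_name])
    writer.writerows([key] for key in keys)
    return buffer.getvalue()


if __name__ == "__main__":
    result = classify_keys(["C001", "C002", "C003"], ["C001", "C001", "C004"])
    print(to_csv_text("customer_id", result["missing_in_target"]))
```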

Agent artifacts (orchestration context):

agent_trace.json
agent_report.md

Optional readability layer:

llm_summary.md

Authority boundary:

deterministic artifacts are evidence,
agent artifacts explain orchestration,
LLM summary is readability-only.

Project structure

src/data_reconciliation_agent/: core code (CLI, deterministic engine, planner, reports, optional LLM polish)
config/examples/: mapping examples
sample_data/: fixture datasets
docs/: usage, architecture, and extension guides
tests/: deterministic and orchestration test suite

Run tests

python -m pytest

Limitations and non-goals

no fuzzy matching
no auto-correction
no database integration in v1
no web UI
not a full migration platform
LLM output is non-authoritative
same-name key inference is deliberately cautious
