fix: Revolut CSV — BOM, lowercase headers, unknown operations, "USD -0.07" (#33)#67

Merged
pbialon merged 4 commits into main from fix/33-revolut-csv-bom on Apr 24, 2026

@przemyslawbialon przemyslawbialon commented Apr 21, 2026

Collaborator

Closes #33. Three layered problems originally reported by @inobrevi plus a follow-up CUSTODY FEE format.

Commit 1 (42e640b) — root-cause fix

Real Revolut CSV produced silent tax: 0 PLN because three bugs stacked:

  1. UTF-8 BOM: open(path, "r") left the BOM (U+FEFF) as a prefix on the first column name, so row['date'] raised KeyError.
  2. Lowercase column names: Revolut changed Date → date, so the hardcoded row['Date'] broke.
  3. Unknown operation types (CASH WITHDRAWAL, DEPOSIT, TRANSFER): RowParser.OPERATIONS.get() returned None, and the row was silently dropped.
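
The first bug in the list above can be reproduced in isolation. The snippet below is a minimal sketch, not code from this PR: the in-memory sample and the `first_row` helper are made up for illustration, and it assumes the file is decoded as UTF-8.

```python
import csv
import io

# Hypothetical stand-in for a Revolut export: UTF-8 BOM followed by lowercase headers.
raw = "\ufeffdate,ticker,type\n2024-03-15,AAPL,BUY\n".encode("utf-8")


def first_row(data: bytes, encoding: str) -> dict:
    """Decode the bytes with the given encoding and return the first CSV row."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return next(csv.DictReader(text))


plain = first_row(raw, "utf-8")      # BOM survives as U+FEFF in the first header
fixed = first_row(raw, "utf-8-sig")  # BOM is stripped while decoding

print("date" in plain)  # False: the key is '\ufeffdate', so plain['date'] raises KeyError
print("date" in fixed)  # True
```
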

Central CSV opener at pit38/data_sources/csv_utils.py (utf-8-sig + header normalization). All 6 broker readers migrated. Column refs lowercased. CsvService.read_with_summary() now returns (records, skipped_by_type: Counter) and the CLI prints a post-import summary so users can see what was ignored.

Commit 2 (c60990d) — test coverage

24 new tests pinning the building blocks introduced above:

  • tests/test_csv_utils.py — 11 unit tests for open_csv_reader (BOM, header normalization, edge cases)
  • tests/test_revolut_csv_service.py — 9 integration tests for read_with_summary flow
  • tests/test_generic_saver_no_bom.py — 4 tests asserting our output is BOM-free (Postel's law regression guard)

Commit 3 (d917af5) — follow-up: "USD -0.07" + EU formats + shared normalization

@inobrevi's subsequent CUSTODY FEE row exposed a second format:

2021-02-01T...,,CUSTODY FEE,,,USD -0.07,USD,0.26850...

Minus between code and amount (not before) — the existing regex [\d,.]+ didn't match. Row was silently dropped → users over-taxed by missing fee costs.

Rather than adding another regex variant, extracts normalization into a shared module used by Revolut and E*Trade:

  • pit38/plugins/normalization.py (new):
    • normalize_currency_layout(raw) — rewrites every observed variant (-USD X, USD -X, $X, -$X, $-X, €X, $25 001,75) to canonical <currency><space><signed_amount>
    • parse_amount(s) — delegates to babel.numbers.parse_decimal with strict=True, trying en_US then de_DE. Handles US "1,234.56", EU "1.234,56", "1317,06", negatives, nbsp thousand separator.
  • Revolut _fiat_value: single regex split + shared helpers (was: sign-strip + two distinct regex paths)
  • E*Trade FiatValueParser: migrated; drops custom _clean_up_raw_number / _resolve_currency (babel handles European comma-decimal + nbsp thousand)
  • pyproject.toml: adds babel~=2.14 (~6MB)
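
To make the layout rewriting concrete, here is a rough sketch of what a function like normalize_currency_layout might do. It is an illustration only: the regex, the `SYMBOLS` table, and the error handling are assumptions, not the code in pit38/plugins/normalization.py. The amount is passed through with its original separators, on the assumption that a later step such as parse_amount interprets them.

```python
import re

# Assumed symbol table; the real module may cover more currencies.
SYMBOLS = {"$": "USD", "€": "EUR"}

# Optional minus, a 3-letter code or a known symbol, optional minus, then the number.
_LAYOUT = re.compile(
    r"^\s*(?P<sign1>-)?\s*(?P<cur>[A-Z]{3}|[$€])\s*(?P<sign2>-)?\s*(?P<num>\d[\d.,\s]*)$"
)


def normalize_currency_layout(raw: str) -> str:
    """Rewrite e.g. '-USD 5', 'USD -5', '$-5' to '<currency> <signed_amount>'."""
    match = _LAYOUT.match(raw)
    if match is None:
        raise ValueError(f"unrecognized currency layout: {raw!r}")
    code = SYMBOLS.get(match["cur"], match["cur"])
    sign = "-" if (match["sign1"] or match["sign2"]) else ""
    # Separators (comma, dot, space, nbsp) are left untouched for the number parser.
    return f"{code} {sign}{match['num'].strip()}"


for sample in ["-USD 100.00", "USD -0.07", "$1,234.56", "-$5", "$-5", "€12,50", "$25 001,75"]:
    print(f"{sample!r:>14} -> {normalize_currency_layout(sample)!r}")
```

An unknown symbol such as "£5" does not match the pattern and raises ValueError here, which keeps the failure visible instead of dropping the row.
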

Why babel: battle-tested CLDR locale specs (used by Django/Flask ecosystem). Eliminates hand-rolled US/EU heuristics and covers edge cases we'd otherwise miss (nbsp, negative per-locale, etc.). strict=True ensures ambiguous inputs fail cleanly instead of silent mis-parsing.

Tests (total: 146, was 92 on main)

  • +53 tests across commits 2 and 3 (24 + 29), plus 1 regression test added in commit 1
  • tests/test_normalization.py (new, 22 tests): pins normalize_currency_layout + parse_amount behaviour
  • tests/test_revolut_row_parser.py (+3): regression for @inobrevi's USD -0.07, EU format, EU-negative combo
  • tests/test_etrade_fiat_value_parser.py (+4): euro symbol, negative, nbsp thousand, unknown symbol
  • tests/e2e/fixtures/revolut_stock_real.csv: anonymized CUSTODY FEE row (date 2024-03-15, value 0.15) preserving BOM
  • tests/e2e/test_stock_e2e.py: asserts ServiceFee reaches pipeline (was dropped before)

Verification

  • pytest tests/ → 146 passed
  • pit38 import revolut-stock on the real-world fixture: 2 transactions + 2 operations (vs 2+1 before — CUSTODY FEE now parses); 3 skipped rows reported transparently
  • pit38 stock downstream: 564 PLN profit / 107 PLN tax (non-zero, no silent zero)
  • grep -r OperationType pit38/ tests/ returns only local BinanceOperationType
  • @inobrevi repro: RowParser._fiat_value({'total amount': 'USD -0.07', 'currency': 'USD'}) → 0.07 USD (was ValueError)

Note to @inobrevi

Both formats from your issue are now covered:

  • "USD 1317.06" and "USD -100.00" — handled in commit 1
  • "USD -0.07" (CUSTODY FEE) — handled in commit 3

Once released, your pit38 stock -f revolut_2025.csv -y 2025 should produce:

  • Actual tax calculation (no more silent 0 PLN)
  • Transparent summary of which rows were ignored and why
  • Custody fees properly deducted as costs

If anything still doesn't parse, please open a new issue with the output's "Skipped N rows" summary — it'll tell us exactly which operation types are new.

@przemyslawbialon przemyslawbialon added this to the Tax correctness 1.0 milestone Apr 21, 2026
@przemyslawbialon przemyslawbialon added bug Something isn't working correctness Tax-correctness impact (high priority) labels Apr 21, 2026
@codecov

codecov Bot commented Apr 21, 2026


Codecov Report

❌ Patch coverage is 90.09901% with 10 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines                      Patch %   Lines
pit38/cli.py                                  57.14%    5 Missing and 1 partial ⚠️
pit38/plugins/stock/etrade/row_parser.py      83.33%    1 Missing and 1 partial ⚠️
pit38/plugins/stock/revolut/row_parser.py     89.47%    2 Missing ⚠️


@przemyslawbialon przemyslawbialon changed the title fix: Revolut CSV with BOM, lowercase headers, and unknown operations (#33) fix: Revolut CSV — BOM, lowercase headers, unknown operations, "USD -0.07" (#33) Apr 24, 2026
…33)

Reported by @inobrevi: real Revolut stock export produced `tax: 0 PLN`
silently instead of a correct calculation. Three collaborating root
causes, all silenced by the loader's `except (ValueError, KeyError)`
clause — so users saw "Loaded 0 operations" with no error trace.

1. UTF-8 BOM at start of file
   Revolut exports begin with `\xef\xbb\xbf`. Plain `open(path, "r")`
   doesn't strip it, so the first column name becomes `\ufeffdate`
   (BOM-prefixed) and `row['date']` raises KeyError.

2. Lowercase column names
   Revolut changed `Date` → `date`. Our parsers hardcoded `row['Date']`,
   `row['Ticker']`, etc. All broke.

3. Unknown operation types (CASH WITHDRAWAL, DEPOSIT, TRANSFER)
   Not listed in `RowParser.OPERATIONS`. `_operation_type()` returned
   None, downstream parser returned None, row silently dropped.

Central CSV opener at `pit38/data_sources/csv_utils.py`:
- `encoding="utf-8-sig"` — strips BOM if present, no-op otherwise.
- `newline=""` — Python csv module best practice for quoted fields.
- Normalize fieldnames to `strip().lower()` at read time so downstream
  code uses a stable lowercase form regardless of broker capitalization.
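
A minimal sketch of an opener with these three properties follows. The name `open_csv_reader` appears in this PR's test descriptions, but the signature and body below are assumptions for illustration, not the contents of `csv_utils.py`.

```python
import csv
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def open_csv_reader(path: str, delimiter: str = ",") -> Iterator[csv.DictReader]:
    """Yield a DictReader whose header names are stripped and lowercased."""
    # utf-8-sig drops a leading BOM if present; newline="" lets csv handle line endings.
    with open(path, "r", encoding="utf-8-sig", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        if reader.fieldnames is not None:  # None for an empty file
            reader.fieldnames = [name.strip().lower() for name in reader.fieldnames]
        yield reader


# Usage: a temporary file with a BOM, capitalized headers, and stray spaces.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "export.csv")
    with open(sample_path, "wb") as f:
        f.write(b"\xef\xbb\xbfDate, Ticker \n2024-03-15,AAPL\n")
    with open_csv_reader(sample_path) as reader:
        rows = list(reader)

print(rows)
```

Normalizing `fieldnames` once at read time means row parsers can rely on lowercase keys without each one repeating the cleanup.
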

Write side (`generic_saver.py`) unchanged — emits plain UTF-8 without
BOM. Postel's law: liberal in what we accept, strict in what we send.

All 6 broker readers migrated to the new util:
- `data_sources/stock_loader/csv_loader.py`
- `data_sources/crypto_loader/csv_loader.py`
- `plugins/stock/revolut/csv.py`
- `plugins/stock/etrade/csv.py`
- `plugins/crypto/revolut/csv.py`
- `plugins/crypto/binance/csv.py`

Column refs updated to lowercase across Revolut stock/crypto, E*Trade,
and Binance parsers.

`CsvService.read_with_summary()` returns a `ReadResult(records,
skipped_by_type: Counter)`. `pit38 import revolut-stock` prints a
post-import summary:

    Skipped 3 rows (operation types not recognized as tax-relevant):
      • CASH WITHDRAWAL: 1 rows
      • DEPOSIT: 1 rows
      • TRANSFER: 1 rows

No more silent drops. User can see what was ignored and flag
misclassifications (e.g. "INTEREST: 200 rows — is this taxable?").
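
The skip-counting flow can be sketched roughly as below. This is a hedged illustration, not the `CsvService` code: the free function `read_with_summary`, the `type` column name, the `KNOWN` set, and `format_skipped_summary` are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class ReadResult:
    records: list = field(default_factory=list)
    skipped_by_type: Counter = field(default_factory=Counter)

    @property
    def total_skipped(self) -> int:
        return sum(self.skipped_by_type.values())


def read_with_summary(rows: Iterable[dict], parse: Callable[[dict], Optional[dict]]) -> ReadResult:
    """Parse each row; count rows the parser rejects (returns None) by operation type."""
    result = ReadResult()
    for row in rows:
        record = parse(row)
        if record is None:
            label = (row.get("type") or "").strip() or "<empty>"
            result.skipped_by_type[label] += 1
        else:
            result.records.append(record)
    return result


def format_skipped_summary(skipped: Counter) -> str:
    total = sum(skipped.values())
    lines = [f"Skipped {total} rows (operation types not recognized as tax-relevant):"]
    lines += [f"  • {op}: {count} rows" for op, count in sorted(skipped.items())]
    return "\n".join(lines)


KNOWN = {"BUY", "SELL", "DIVIDEND"}  # hypothetical set of tax-relevant operations
sample_rows = [{"type": t} for t in ["BUY", "CASH WITHDRAWAL", "DEPOSIT", "TRANSFER", ""]]
result = read_with_summary(sample_rows, lambda row: row if row["type"] in KNOWN else None)
summary = format_skipped_summary(result.skipped_by_type)
print(summary)
```
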

New fixture `tests/e2e/fixtures/revolut_stock_real.csv` has BOM,
lowercase headers, and mixed known/unknown operation types — exactly
what the synthetic fixture lacked. `test_revolut_real_export_with_bom_and_unknown_operations`
verifies parser handles all three problems.

- 93 tests pass (was 92; added one regression test)
- `pit38 import revolut-stock` on the real-world fixture produces 2 tx
  + 1 dividend, reports 3 skipped rows, writes clean UTF-8 output
- `pit38 stock` on that output produces 564 PLN profit / 107 PLN tax
  (i.e. non-zero, no silent 0)

Closes #33.
Part of #33 fix — adds focused tests for the three building blocks
introduced in the preceding commit. Total: +24 tests (117 passing, was
93).

tests/test_csv_utils.py (11 tests)
Pins the behaviour of open_csv_reader:
- BOM stripping on leading position, preserved in middle of file
- Header normalization (lowercase, stripped, multi-word preserved)
- Edge cases (empty file, header-only, custom delimiter)
- Context manager file-close semantics

tests/test_revolut_csv_service.py (9 tests)
Integration tests for CsvService.read_with_summary:
- All-known-ops → empty skip counter
- Mixed known/unknown → correct per-type counts
- Empty operation column → tagged as <empty>
- OperationRowParser vs TransactionRowParser skip asymmetry
- ReadResult dataclass invariants (total_skipped)
- BOM tolerance at the CsvService boundary

tests/test_generic_saver_no_bom.py (4 tests)
Postel's law regression — our output must NOT contain BOM:
- Stock and crypto savers produce BOM-free output
- Output is valid UTF-8
- Round-trip: saver output reads back through our loader cleanly
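
The shape of such a write-side check might look like the sketch below. `save_rows` is a hypothetical stand-in for the saver in `generic_saver.py`, whose real interface is not shown in this PR.

```python
import codecs
import csv
import io


def save_rows(rows: list, fieldnames: list) -> bytes:
    """Hypothetical saver: serialize rows as CSV and encode as plain UTF-8 (no BOM)."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


rows_out = [{"date": "2024-03-15", "ticker": "AAPL"}]
payload = save_rows(rows_out, ["date", "ticker"])

# Write side: strict, the output must not start with the UTF-8 BOM.
has_bom = payload.startswith(codecs.BOM_UTF8)

# Read side: liberal, utf-8-sig accepts input with or without a BOM.
rows_back = list(csv.DictReader(io.StringIO(payload.decode("utf-8-sig"), newline="")))

print(has_bom, rows_back == rows_out)
```
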

These tests would have caught #33 before the user hit it. The
synthetic E2E fixture didn't exercise BOM or case drift — the new
unit tests do, in isolation, so future regressions are caught early.
…#33)

Follow-up on @inobrevi's CUSTODY FEE comment in #33:
    USD -0.07   (minus BETWEEN code and amount, not before)
raised ValueError in the previous regex ([\d,.]+) which didn't cover
a leading minus. Silent row-drop in downstream pipeline meant affected
users were over-taxed by missing fee costs.

Rather than bolting another regex variant on, extracts the two kinds
of CSV-value normalization used across broker plugins into a shared
module. Number parsing itself delegates to babel (CLDR locale-based).

pit38/plugins/normalization.py (new)
  normalize_currency_layout(raw) — rewrites to canonical
      "<currency><space><signed_amount>" form, covering all Revolut
      and E*Trade variants seen in real exports.
  parse_amount(s) — tries en_US then de_DE locales (strict=True).
      Handles US "1,234.56", EU "1.234,56" and "1317,06", negatives,
      nbsp thousand separator. Babel replaces the hand-rolled comma/
      space/replace chains both Revolut and E*Trade previously had.

Migrated:
  - Revolut row_parser._fiat_value: single regex split + shared helpers
  - E*Trade FiatValueParser: drops _clean_up_raw_number / _resolve_currency

pyproject.toml: adds babel~=2.14 (~6MB; lighter than pandas).

Tests: +29 (total 146, was 117)
  - tests/test_normalization.py (new) — 22 unit tests pinning behaviour
    of both normalize_currency_layout and parse_amount
  - tests/test_revolut_row_parser.py — 3 regression tests for
    @inobrevi's CUSTODY FEE + EU format + EU-negative combo
  - tests/test_etrade_fiat_value_parser.py — 4 edge cases (euro, neg,
    nbsp thousand, unknown symbol)
  - tests/e2e/fixtures/revolut_stock_real.csv — anonymized CUSTODY FEE
    row (2024-03-15, USD -0.15) preserving BOM
  - tests/e2e/test_stock_e2e.py — asserts CUSTODY FEE now reaches the
    ServiceFee pipeline (was silently dropped before)

Verification:
    pit38 import revolut-stock -i revolut_stock_real.csv -o out.csv
now yields "Saved 2 transactions and 2 operations" (vs 2+1 before —
CUSTODY FEE was dropped). No silent underreporting.

abs() on parse_amount result preserves existing absolute-magnitude
semantics — FiatValue(0.07, USD) matches current tax-calc expectations.
Signed-amount propagation is a separate refactor (#61 Decimal migration).
The `_print_skipped_summary` helper used `Counter` as a forward-ref
string, but Counter was never imported. CI flake8 (F821) caught it.

Replaced the string hint with a direct annotation and added the import.
@pbialon pbialon merged commit 50ca2bc into main Apr 24, 2026
4 of 5 checks passed
@pbialon pbialon deleted the fix/33-revolut-csv-bom branch April 24, 2026 13:09
Development

Successfully merging this pull request may close these issues.

Revolut: CSV has changed and script doesn't work anymore

2 participants