fix: Revolut CSV — BOM, lowercase headers, unknown operations, "USD -0.07" (#33)#67
Merged
Conversation
This was referenced Apr 21, 2026
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…33) Reported by @inobrevi: real Revolut stock export produced `tax: 0 PLN` silently instead of a correct calculation. Three collaborating root causes, all silenced by the loader's `except (ValueError, KeyError)` clause — so users saw "Loaded 0 operations" with no error trace. 1. UTF-8 BOM at start of file Revolut exports begin with `\xef\xbb\xbf`. Plain `open(path, "r")` doesn't strip it, so the first column name becomes `date` and `row['date']` raises KeyError. 2. Lowercase column names Revolut changed `Date` → `date`. Our parsers hardcoded `row['Date']`, `row['Ticker']`, etc. All broke. 3. Unknown operation types (CASH WITHDRAWAL, DEPOSIT, TRANSFER) Not listed in `RowParser.OPERATIONS`. `_operation_type()` returned None, downstream parser returned None, row silently dropped. Central CSV opener at `pit38/data_sources/csv_utils.py`: - `encoding="utf-8-sig"` — strips BOM if present, no-op otherwise. - `newline=""` — Python csv module best practice for quoted fields. - Normalize fieldnames to `strip().lower()` at read time so downstream code uses a stable lowercase form regardless of broker capitalization. Write side (`generic_saver.py`) unchanged — emits plain UTF-8 without BOM. Postel's law: liberal in what we accept, strict in what we send. All 5 broker readers migrated to the new util: - `data_sources/stock_loader/csv_loader.py` - `data_sources/crypto_loader/csv_loader.py` - `plugins/stock/revolut/csv.py` - `plugins/stock/etrade/csv.py` - `plugins/crypto/revolut/csv.py` - `plugins/crypto/binance/csv.py` Column refs updated to lowercase across Revolut stock/crypto, E*Trade, and Binance parsers. `CsvService.read_with_summary()` returns a `ReadResult(records, skipped_by_type: Counter)`. `pit38 import revolut-stock` prints a post-import summary: Skipped 3 rows (operation types not recognized as tax-relevant): • CASH WITHDRAWAL: 1 rows • DEPOSIT: 1 rows • TRANSFER: 1 rows No more silent drops. User can see what was ignored and flag misclassifications (e.g. "INTEREST: 200 rows — is this taxable?"). New fixture `tests/e2e/fixtures/revolut_stock_real.csv` has BOM, lowercase headers, and mixed known/unknown operation types — exactly what the synthetic fixture lacked. `test_revolut_real_export_with_bom_and_unknown_operations` verifies parser handles all three problems. - 93 tests pass (was 92; added one regression test) - `pit38 import revolut-stock` on the real-world fixture produces 2 tx + 1 dividend, reports 3 skipped rows, writes clean UTF-8 output - `pit38 stock` on that output produces 564 PLN profit / 107 PLN tax (i.e. non-zero, no silent 0) Closes #33.
Part of #33 fix — adds focused tests for the three building blocks introduced in the preceding commit. Total: +24 tests (117 passing, was 93). tests/test_csv_utils.py (11 tests) Pins the behaviour of open_csv_reader: - BOM stripping on leading position, preserved in middle of file - Header normalization (lowercase, stripped, multi-word preserved) - Edge cases (empty file, header-only, custom delimiter) - Context manager file-close semantics tests/test_revolut_csv_service.py (9 tests) Integration tests for CsvService.read_with_summary: - All-known-ops → empty skip counter - Mixed known/unknown → correct per-type counts - Empty operation column → tagged as <empty> - OperationRowParser vs TransactionRowParser skip asymmetry - ReadResult dataclass invariants (total_skipped) - BOM tolerance at the CsvService boundary tests/test_generic_saver_no_bom.py (4 tests) Postel's law regression — our output must NOT contain BOM: - Stock and crypto savers produce BOM-free output - Output is valid UTF-8 - Round-trip: saver output reads back through our loader cleanly These tests would have caught #33 before the user hit it. The synthetic E2E fixture didn't exercise BOM or case drift — the new unit tests do, in isolation, so future regressions are caught early.
…#33) Follow-up on @inobrevi's CUSTODY FEE comment in #33: USD -0.07 (minus BETWEEN code and amount, not before) raised ValueError in the previous regex ([\d,.]+) which didn't cover a leading minus. Silent row-drop in downstream pipeline meant affected users were over-taxed by missing fee costs. Rather than bolting another regex variant on, extracts the two kinds of CSV-value normalization used across broker plugins into a shared module. Number parsing itself delegates to babel (CLDR locale-based). pit38/plugins/normalization.py (new) normalize_currency_layout(raw) — rewrites to canonical "<currency><space><signed_amount>" form, covering all Revolut and E*Trade variants seen in real exports. parse_amount(s) — tries en_US then de_DE locales (strict=True). Handles US "1,234.56", EU "1.234,56" and "1317,06", negatives, nbsp thousand separator. Babel replaces the hand-rolled comma/ space/replace chains both Revolut and E*Trade previously had. Migrated: - Revolut row_parser._fiat_value: single regex split + shared helpers - E*Trade FiatValueParser: drops _clean_up_raw_number / _resolve_currency pyproject.toml: adds babel~=2.14 (~6MB; lighter than pandas). Tests: +29 (total 146, was 117) - tests/test_normalization.py (new) — 22 unit tests pinning behaviour of both normalize_currency_layout and parse_amount - tests/test_revolut_row_parser.py — 3 regression tests for @inobrevi's CUSTODY FEE + EU format + EU-negative combo - tests/test_etrade_fiat_value_parser.py — 4 edge cases (euro, neg, nbsp thousand, unknown symbol) - tests/e2e/fixtures/revolut_stock_real.csv — anonymized CUSTODY FEE row (2024-03-15, USD -0.15) preserving BOM - tests/e2e/test_stock_e2e.py — asserts CUSTODY FEE now reaches the ServiceFee pipeline (was silently dropped before) Verification: pit38 import revolut-stock -i revolut_stock_real.csv -o out.csv now yields "Saved 2 transactions and 2 operations" (vs 2+1 before — CUSTODY FEE was dropped). No silent underreporting. abs() on parse_amount result preserves existing absolute-magnitude semantics — FiatValue(0.07, USD) matches current tax-calc expectations. Signed-amount propagation is a separate refactor (#61 Decimal migration).
d917af5 to
1045344
Compare
The `_print_skipped_summary` helper used `Counter` as a forward-ref string, but Counter was never imported. CI flake8 (F821) caught it. Replaced the string hint with a direct annotation and added the import.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #33. Three layered problems originally reported by @inobrevi plus a follow-up CUSTODY FEE format.
Commit 1 (42e640b) — root-cause fix
Real Revolut CSV produced silent
tax: 0 PLNbecause three bugs stacked:open(path, "r")leftas prefix in the first column name,row['date']→KeyError.Date→date; hardcodedrow['Date']broke.CASH WITHDRAWAL,DEPOSIT,TRANSFER) —RowParser.OPERATIONS.get()returnedNone, row silently dropped.Central CSV opener at
pit38/data_sources/csv_utils.py(utf-8-sig+ header normalization). All 6 broker readers migrated. Column refs lowercased.CsvService.read_with_summary()now returns(records, skipped_by_type: Counter)and the CLI prints a post-import summary so users can see what was ignored.Commit 2 (c60990d) — test coverage
24 new tests pinning the building blocks introduced above:
tests/test_csv_utils.py— 11 unit tests foropen_csv_reader(BOM, header normalization, edge cases)tests/test_revolut_csv_service.py— 9 integration tests forread_with_summaryflowtests/test_generic_saver_no_bom.py— 4 tests asserting our output is BOM-free (Postel's law regression guard)Commit 3 (d917af5) — follow-up: "USD -0.07" + EU formats + shared normalization
@inobrevi's subsequent CUSTODY FEE row exposed a second format:
Minus between code and amount (not before) — the existing regex
[\d,.]+didn't match. Row was silently dropped → users over-taxed by missing fee costs.Rather than adding another regex variant, extracts normalization into a shared module used by Revolut and E*Trade:
pit38/plugins/normalization.py(new):normalize_currency_layout(raw)— rewrites every observed variant (-USD X,USD -X,$X,-$X,$-X,€X,$25 001,75) to canonical<currency><space><signed_amount>parse_amount(s)— delegates tobabel.numbers.parse_decimalwithstrict=True, tryingen_USthende_DE. Handles US "1,234.56", EU "1.234,56", "1317,06", negatives, nbsp thousand separator._fiat_value: single regex split + shared helpers (was: sign-strip + two distinct regex paths)FiatValueParser: migrated; drops custom_clean_up_raw_number/_resolve_currency(babel handles European comma-decimal + nbsp thousand)pyproject.toml: addsbabel~=2.14(~6MB)Why babel: battle-tested CLDR locale specs (used by Django/Flask ecosystem). Eliminates hand-rolled US/EU heuristics and covers edge cases we'd otherwise miss (nbsp, negative per-locale, etc.).
strict=Trueensures ambiguous inputs fail cleanly instead of silent mis-parsing.Tests (total: 146, was 92 on main)
tests/test_normalization.py(new, 22 tests): pinsnormalize_currency_layout+parse_amountbehaviourtests/test_revolut_row_parser.py(+3): regression for @inobrevi'sUSD -0.07, EU format, EU-negative combotests/test_etrade_fiat_value_parser.py(+4): euro symbol, negative, nbsp thousand, unknown symboltests/e2e/fixtures/revolut_stock_real.csv: anonymized CUSTODY FEE row (date 2024-03-15, value 0.15) preserving BOMtests/e2e/test_stock_e2e.py: asserts ServiceFee reaches pipeline (was dropped before)Verification
pytest tests/→ 146 passedpit38 import revolut-stockon the real-world fixture: 2 transactions + 2 operations (vs 2+1 before — CUSTODY FEE now parses); 3 skipped rows reported transparentlypit38 stockdownstream: 564 PLN profit / 107 PLN tax (non-zero, no silent zero)grep -r OperationType pit38/ tests/returns only localBinanceOperationType@inobrevirepro:RowParser._fiat_value({'total amount': 'USD -0.07', 'currency': 'USD'})→0.07 USD(wasValueError)Note to @inobrevi
Both formats from your issue are now covered:
"USD 1317.06"and"USD -100.00"— handled in commit 1"USD -0.07"(CUSTODY FEE) — handled in commit 3Once released, your
pit38 stock -f revolut_2025.csv -y 2025should produce:If anything still doesn't parse, please open a new issue with the output's "Skipped N rows" summary — it'll tell us exactly which operation types are new.