Fix JFinQA data construction pipeline and EDINET mappings #3
Merged
Root cause: EDINET XBRL reports the same element (``ProfitLoss``, ``NetSales``, ``TotalAssets`` ...) many times across different ``context_ref`` values: prior years, non-consolidated (parent-only) variants, and dimensional breakdowns. The previous ``_extract_items`` simply overwrote per element, so the value that ended up in the table was whichever came last in the raw list.

Concretely this meant:
- 親会社株主に帰属する当期純利益 (連結; net income attributable to owners of the parent, consolidated) and 当期純利益 (単体; net income, parent-only) could coexist in the same table, making the NI decomposition identity (当期純利益 = 親会社 + 非支配, i.e. net income = parent share + non-controlling share) algebraically impossible.
- 資産合計 (total assets, from the 連結 summary) often disagreed with 流動資産 + 固定資産 (current + fixed assets, 単体) by large margins.
- In at least one case (スズキ) 売上原価 (cost of sales, unclear context) exceeded 売上高 (net sales, 単体), which is impossible.

Changes:
- ``_is_canonical_context``: accept only ``CurrentYearDuration``, ``CurrentYearInstant``, and the ``Consolidated`` synonyms. Reject anything with ``Prior`` or ``_<...>Member``.
- ``_extract_items`` applies the filter in both the display-order loop and the fallback loop.
- Add IFRS-suffixed element names (``NetSalesIFRS``, ``ProfitLossIFRS``, ``AssetsIFRS``, etc.) to ``PL_ELEMENTS`` / ``BS_ELEMENTS`` and the display orders, so IFRS/US-GAAP filings are picked up from their consolidated IFRS contexts instead of silently falling back to parent-only J-GAAP rows.

Verified on マルハニチロ, スズキ, KDDI: balance sheets now balance and NI decomposition holds to the last yen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
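The context filter described in this commit can be illustrated with a minimal sketch. It is not the repository's ``_is_canonical_context``: the ``_ConsolidatedMember`` suffix spelling and the ``element`` / ``context_ref`` / ``value`` field names are assumptions made for illustration only.

```python
# Hypothetical sketch of a consolidated, current-period context filter.
# The suffix spelling and the item field names below are assumptions.
_CANONICAL = {"CurrentYearDuration", "CurrentYearInstant"}
_CONSOLIDATED_SUFFIX = "_ConsolidatedMember"


def is_canonical_context(context_ref: str) -> bool:
    """Return True only for consolidated, current-period contexts."""
    if "Prior" in context_ref:
        return False
    base = context_ref
    if base.endswith(_CONSOLIDATED_SUFFIX):
        base = base[: -len(_CONSOLIDATED_SUFFIX)]
    # Anything still carrying a "_<...>Member" suffix is a parent-only or
    # dimensional variant and fails the membership test.
    return base in _CANONICAL


def extract_items(raw_items):
    """Keep the first canonical value per element; skip everything else."""
    table = {}
    for item in raw_items:
        if not is_canonical_context(item["context_ref"]):
            continue
        table.setdefault(item["element"], item["value"])
    return table
```

With a filter of this kind, a parent-only or prior-year value can no longer overwrite the consolidated current-period one, regardless of the order of the raw list.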
Re-runs stages 2-4 of the pipeline with the context-filtering fix. Same 1000-question target, 61 unique companies (down from 68 because some filings had no canonical consolidated data for all their required line items).

All five accounting-identity audits pass in full:
- impossible COGS: 13 -> 0
- gross-profit identity: 26 -> 0
- asset-total decomposition: 301 -> 0
- balance-sheet identity: 217 -> 0
- NI decomposition: 334 -> 0

Distribution: J-GAAP 624, IFRS 332, US-GAAP 44; numerical 550, consistency 200, temporal 250.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scripts, intentionally runnable without CI:
- ``scripts/audit.py``: schema sanity, DSL executability, and DSL/gold answer agreement. It is tolerance-aware and handles boolean/categorical answers, so it does not flag rounding artifacts or 増収/減収 (revenue up / revenue down) style answers as mismatches.
- ``scripts/audit_quality.py``: cross-row accounting invariants: COGS <= sales, gross-profit identity, operating-income identity, NI decomposition, asset-total decomposition, balance-sheet equality, ROE-convention ambiguity, rounding edge cases.

Report files are checked in for visibility; regenerate with ``uv run python scripts/audit.py`` and ``uv run python scripts/audit_quality.py``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
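To make the cross-row invariants concrete, here is a minimal sketch of how such checks might look. It is not the code in ``scripts/audit_quality.py``: the English row keys, the finding names, and the tolerance of 1.0 are hypothetical, and the figures are toy values.

```python
def audit_row(row: dict, tol: float = 1.0) -> list:
    """Return the names of the accounting invariants this row violates."""

    def have(*keys):
        return all(row.get(k) is not None for k in keys)

    def off(lhs, rhs):
        return abs(lhs - rhs) > tol

    findings = []
    if have("cogs", "sales") and row["cogs"] > row["sales"]:
        findings.append("impossible_cogs")
    if have("sales", "cogs", "gross_profit") and off(
        row["sales"] - row["cogs"], row["gross_profit"]
    ):
        findings.append("gross_profit_mismatch")
    if have("net_income", "ni_parent", "ni_minority") and off(
        row["ni_parent"] + row["ni_minority"], row["net_income"]
    ):
        findings.append("ni_decomposition_mismatch")
    if have("total_assets", "current_assets", "fixed_assets") and off(
        row["current_assets"] + row["fixed_assets"], row["total_assets"]
    ):
        findings.append("asset_total_mismatch")
    if have("total_assets", "total_liabilities", "net_assets") and off(
        row["total_liabilities"] + row["net_assets"], row["total_assets"]
    ):
        findings.append("balance_sheet_mismatch")
    return findings


# Toy figures, not taken from any filing.
consistent = {
    "sales": 1000, "cogs": 700, "gross_profit": 300,
    "net_income": 100, "ni_parent": 90, "ni_minority": 10,
    "current_assets": 600, "fixed_assets": 400, "total_assets": 1000,
    "total_liabilities": 550, "net_assets": 450,
}
# Consolidated total mixed with parent-only components, as in the bug above.
mixed = dict(consistent, current_assets=300, fixed_assets=200)
```

Here ``audit_row(consistent)`` returns an empty list, while ``audit_row(mixed)`` reports only the asset-total decomposition, which is the signature of consolidated and parent-only values sharing one table.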
… format

## Three root-cause fixes

### 1. EDINET codes were wrong for 98 of 104 companies

``config.py`` had LLM-hallucinated ``edinet_code`` values that pointed to completely unrelated filers (``E00045`` = 大室温泉, not キリン; ``E00353`` = 三井物産, not 三井物産; etc.). Correct codes were cross-verified via the official EDINET CSV (``EdinetcodeDlInfo.zip``, 11,242 rows, fetched 2026-04-18) plus independent IRBank checks for top-cap names. Five codes were already correct (Toyota, Sony, KDDI, MUFG, 中部電力); マルハニチロ is E00015 despite the 2025-12 Umios renaming because EDINET codes are tied to the corporate number (法人番号), not the trade name (商号). 45 companies carry a ``# FIXME(consensus-only): CSV 照合のみ`` ("CSV match only") note because the Researcher-B time budget ran out before they could be independently verified, although the Researcher-A CSV match was exact and high-confidence.

### 2. s1_collect.py did not run on edinet-mcp >= 0.4

edinet-mcp 0.4.0 migrated ``EdinetClient`` to native async (``__aenter__``/``__aexit__`` only, ``get_filings`` / ``download_document`` coroutines). The existing synchronous ``with EdinetClient(...)`` call raised ``TypeError`` at the first iteration. Rewrote the module so ``_find_annual_report`` and ``collect_company`` are coroutines, ``run()`` wraps them in ``asyncio.run(_run_async(...))``, and the ``EdinetClient`` is entered via ``async with``. Rate limit reduced from 2.0 to 1.0 requests/second (120 to 60 requests/minute) to stay under EDINET's 100 req/min API cap under batch load.

### 3. _FILING_WINDOWS missed non-March fiscal years

The two-window heuristic (June-Aug for March FY, March-May for December FY) silently skipped August-FY filers like ファーストリテイリング (files in November). Widened to six windows covering fiscal year ends in January, March, April-May, August, September, and December, so every COMPANY_POOL entry has at least one viable search range.

## Dependency

Added ``defusedxml`` as a runtime dependency; edinet-mcp's XBRL parser now imports it at module load time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
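The sync-to-async conversion in fix 2 follows a common pattern, sketched below under stated assumptions. ``FakeClient`` is a stand-in for an async-only client and does not reproduce the edinet-mcp API; ``RateLimiter`` and the demo company codes are likewise hypothetical. Only the overall shape (coroutines, ``async with``, a synchronous ``run()`` wrapping ``asyncio.run``) mirrors the commit message.

```python
import asyncio
import time


class FakeClient:
    """Stand-in for an async-only client; NOT the edinet-mcp API."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def get_filings(self, code: str) -> list:
        await asyncio.sleep(0)  # pretend to do network I/O
        return [f"{code}-annual"]


class RateLimiter:
    """Space sequential calls at least 1/rate seconds apart."""

    def __init__(self, rate: float):
        self._interval = 1.0 / rate
        self._next = 0.0

    async def wait(self):
        now = time.monotonic()
        delay = max(0.0, self._next - now)
        self._next = max(now, self._next) + self._interval
        if delay:
            await asyncio.sleep(delay)


async def collect_company(client, limiter, code):
    await limiter.wait()
    return await client.get_filings(code)


async def _run_async(codes, rate=1.0):
    limiter = RateLimiter(rate)
    async with FakeClient() as client:  # a plain `with` would fail here
        return [await collect_company(client, limiter, c) for c in codes]


def run(codes, rate=1.0):
    """Synchronous entry point that drives the coroutines."""
    return asyncio.run(_run_async(codes, rate))
```

At ``rate=1.0`` the limiter allows at most 60 requests per minute, which leaves headroom under a 100 requests/minute cap, whereas 2.0 requests/second would reach 120.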
Full re-collection under the fixed pipeline: 104 companies, all consolidated current-period figures (values cross-checked against published reports, e.g. Toyota FY2024 売上高 (net sales) 45,095,325 百万円 (million yen), 資産合計 (total assets) 90,114,296 百万円).

Same 1000-question target; subtask split 550/200/250 preserved; J-GAAP 656 / IFRS 323 / US-GAAP 21; avg program steps 2.58; 13 rejections (answer_mismatch only).

Lite 150 regenerated on the new dataset with 82 unique companies, balanced strata: numerical 84 / consistency 29 / temporal 37, J-GAAP 88 / IFRS 55 / US-GAAP 7.

Audit results on the new data:
- ``scripts/audit.py``: 0 schema / DSL / duplicate findings.
- ``scripts/audit_quality.py``: 0 impossible-COGS, 0 NI-decomposition violations, 0 balance-sheet violations, 1 asset-total rounding edge case. The remaining 311 op-income findings are IFRS false positives (営業利益, operating income, is not ``売上総利益 - 販管費``, gross profit minus SG&A, for IFRS filers that have その他の収益 / その他の費用, other income / other expenses, lines); the 89 ROE-convention ambiguities are an inherent specification question, not a data bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ``scripts/run_baseline.py``: adds R0/R1 regime support (thinking
OFF / native-moderate) with per-provider routing for OpenAI GPT-5
family (``reasoning_effort``), Gemini 2.5 (``thinking_config``
budget), and Anthropic extended thinking. Captures per-question
token usage (including reasoning tokens), parse success, truncation
flag, latency, and cost using a built-in pricing table. Emits
separated ``{model}__{regime}__predictions.json`` and
``...__metrics.json`` artefacts with subtask / accounting-standard
breakdowns and p50/p90/p95 output-token distribution.
- ``scripts/build_lite.py``: deterministic stratified sampler
producing a 150-question ``jfinqa-Lite`` subset. Primary stratum is
``subtask × accounting_standard`` with a soft cap of 4 questions
per ``edinet_code`` and a minimum US-GAAP quota. Seed 42 for
reproducibility.
- ``scripts/check_raw_integrity.py``: post-collection sanity check.
Verifies every ``raw/E*.json`` has ``company.edinet_code`` and
``filings[*].filing.edinet_code`` matching the filename stem,
``doc_type == "120"``, and a present ``company_name``. Catches the
cache-skip-reuses-wrong-filing class of bug the EDINET code fix
was meant to close.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
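The integrity check described for ``scripts/check_raw_integrity.py`` can be sketched as follows. This is an illustration, not the script itself: where ``company_name`` and ``doc_type`` sit inside the JSON is an assumption (here ``company.company_name`` and ``filings[*].filing.doc_type``), and the function names are hypothetical.

```python
import json
from pathlib import Path


def check_raw_file(path: Path) -> list:
    """Return a list of problems found in one raw/E*.json file."""
    stem = path.stem  # expected EDINET code, taken from the filename
    data = json.loads(path.read_text(encoding="utf-8"))
    problems = []

    company = data.get("company") or {}
    if company.get("edinet_code") != stem:
        problems.append("company.edinet_code does not match filename")
    if not company.get("company_name"):  # assumed location of the name
        problems.append("company_name missing")

    for i, entry in enumerate(data.get("filings") or []):
        filing = entry.get("filing") or {}
        if filing.get("edinet_code") != stem:
            problems.append(f"filings[{i}].filing.edinet_code mismatch")
        if filing.get("doc_type") != "120":  # assumed location of doc_type
            problems.append(f"filings[{i}] doc_type is not '120'")
    return problems


def check_raw_dir(raw_dir) -> dict:
    """Map filename -> problems for every file that has at least one."""
    report = {}
    for path in sorted(Path(raw_dir).glob("E*.json")):
        problems = check_raw_file(path)
        if problems:
            report[path.name] = problems
    return report
```

Comparing against the filename stem is what catches a cached file that was written under one code but holds another filer's data.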
Four pilot runs on ``jfinqa-Lite`` (150 questions, stratified) using the corrected pipeline output. Canonical zero-shot prompt shared across all runs.

| Model            | Regime | Accuracy | Parse | Truncation | Cost    |
|------------------|--------|----------|-------|------------|---------|
| gemini-2.5-flash | R0     | 88.0 %   | 100 % | 0 %        | $0.0138 |
| gemini-2.5-flash | R1     | 85.33 %  | 100 % | 0 %        | $0.0342 |
| gpt-5.4-mini     | R0     | 92.0 %   | 100 % | 0 %        | $0.0527 |
| gpt-5.4-mini     | R1     | 91.33 %  | 100 % | 0 %        | $0.1717 |

Prior pilot numbers on the pre-fix v1 data are retained at ``scripts/data/baselines_pre_context_fix/`` for before/after comparison; they should **not** be cited as baselines because the tables they were answered against mixed 連結 (consolidated) and 単体 (parent-only) values.

Subtask breakdown (R0): temporal_reasoning saturates at 100 % for both models, consistency_checking at 93-100 %, numerical_reasoning is the discriminating subtask (80.95 / 85.71). Accounting-standard accuracy is close between J-GAAP and IFRS for both models.

Total pilot cost: $0.27.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes benchmark construction issues discovered in post-release auditing of the JFinQA data pipeline.
What was broken
- ``_extract_items`` in ``scripts/pipeline/s2_transform.py`` overwrote per element without looking at context, so a single table could combine consolidated and parent-only figures. 647 of the 1000 published questions violated basic accounting identities (balance-sheet equality, NI decomposition, impossible 売上原価 > 売上高, i.e. cost of sales exceeding net sales, etc.).
- EDINET codes in ``scripts/pipeline/config.py`` pointed to the wrong filer, e.g. ``E00045`` was labelled "キリンホールディングス" but actually resolves to 大室温泉. This meant every regenerated row used the wrong company's financial data.
- ``scripts/pipeline/s1_collect.py`` was synchronous against ``edinet-mcp >= 0.4.0``, which migrated to a native async API (``EdinetClient`` only exposes ``__aenter__``/``__aexit__``, and ``get_filings``/``download_document`` are coroutines). The collector failed with ``TypeError`` at the first iteration.
- ``_FILING_WINDOWS`` missed non-March fiscal-year filers such as ファーストリテイリング (Aug FY, files in November).
- The new edinet-mcp output format provides pre-reconciled ``{"科目": …, "前期": …, "当期": …}`` dicts (line item, prior period, current period) instead of raw XBRL elements with ``context`` fields. The old extractor saw no ``context`` and rejected everything.

What this PR does
- Pipeline: filter XBRL contexts to consolidated current-period, add IFRS/US-GAAP element mappings, convert the collector to async, expand filing windows to six, and support both the legacy and the new edinet-mcp output formats.
- Data: EDINET codes for 104 companies were independently verified via the official ``EdinetcodeDlInfo`` CSV (Researcher A), cross-checked against IRBank and company IR pages (企業 IR) for 16 top-cap names (Researcher B), and adversarially reviewed before application. Raw data regenerated end-to-end; 104 files pass the new ``scripts/check_raw_integrity.py``.
- Accounting audits: ``scripts/audit.py`` and ``scripts/audit_quality.py`` check DSL executability, schema, duplicates, balance-sheet equality, NI decomposition, asset-total decomposition, COGS-vs-sales, gross-profit identity, rounding pedantry, and ROE-convention ambiguity. All invariant audits pass on the regenerated data (the remaining ``op_income_mismatch`` findings are IFRS-specific false positives: 営業利益 (operating income) is not ``売上総利益 − 販管費`` (gross profit minus SG&A) when a filer reports その他の収益/費用 (other income/expenses)).
- Tooling: new baseline runner with R0 / R1 regimes (thinking-off vs native-moderate), stratified Lite-150 sampler, and the raw integrity check.

Baselines (pilot on Lite 150, new data):

Pre-audit baselines on the broken data are retained at ``scripts/data/baselines_pre_context_fix/`` for diagnostic comparison only; they should not be cited.
Public artifact status (not yet addressed)
The Hugging Face dataset at ``ajtgjmdjp/jfinqa`` and the mirrors in ``lm-evaluation-harness`` (PR #3570) and ``llm-jp-eval`` (PR #230) still reference the pre-fix revision. A corrected release and legacy tag will be uploaded in a follow-up; this PR focuses on the in-repository pipeline and regenerated data.
Test plan
- ``uv run python scripts/check_raw_integrity.py scripts/data_edinet_fix_*/raw``: 0 violations over 104 files.
- ``uv run python scripts/audit.py``: 0 DSL / duplicate / schema findings.
- ``uv run python scripts/audit_quality.py``: 0 accounting-identity violations (NI decomposition, balance sheet, asset total, impossible COGS all pass).
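- Regenerated values cross-checked against published reports (Toyota FY2024: 売上高 (net sales) ¥45,095,325 million, 資産合計 (total assets) ¥90,114,296 million, exact match).
- Pilot baselines on Lite 150 with Gemini 2.5 Flash and GPT-5.4 mini under R0 and R1 regimes.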
🤖 Generated with Claude Code