Fix JFinQA data construction pipeline and EDINET mappings by ajtgjmdjp · Pull Request #3 · ajtgjmdjp/jfinqa

ajtgjmdjp · 2026-04-18T11:05:44Z

Summary

Fixes benchmark construction issues discovered in post-release auditing of the JFinQA data pipeline.

What was broken

- XBRL context mixing (連結 vs 単体, i.e. consolidated vs parent-only). _extract_items in scripts/pipeline/s2_transform.py overwrote values per element without looking at the context, so a single table could combine consolidated and parent-only figures. 647 of the 1000 published questions violated basic accounting identities (balance-sheet equality, NI decomposition, impossible 売上原価 (cost of sales) > 売上高 (net sales), etc.).
- EDINET code mappings. 98 of 104 entries in scripts/pipeline/config.py pointed to the wrong filer; e.g. E00045 was labelled "キリンホールディングス" but actually resolves to 大室温泉. Every regenerated row for a mislabelled entry therefore used the wrong company's financial data.
- Broken collector. scripts/pipeline/s1_collect.py was synchronous, but edinet-mcp >= 0.4.0 migrated to a native async API (EdinetClient only exposes __aenter__ / __aexit__, and get_filings / download_document are coroutines). The collector failed with TypeError at the first iteration.
- Narrow filing windows. The two windows (Jun–Aug, Mar–May) missed filers whose fiscal year ends in neither March nor December, such as ファーストリテイリング (August fiscal year end, files in November).
- edinet-mcp output format change. The 0.4+ library returns pre-reconciled {"科目": …, "前期": …, "当期": …} dicts (line item, prior period, current period) instead of raw XBRL elements with context fields. The old extractor saw no context and rejected everything.
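Handling both output shapes can be sketched as a small normalizer. This is a minimal illustration, not the code in s2_transform.py: the function name `normalize_row` and the legacy field names `element` and `value` are assumptions; only the 科目 / 前期 / 当期 keys and the existence of a context field come from the description above.

```python
# Hypothetical sketch: accept either the pre-reconciled dict shape returned by
# edinet-mcp 0.4+ or a legacy raw-XBRL row that carries a context field.
CURRENT_CONTEXTS = frozenset({"CurrentYearDuration", "CurrentYearInstant"})


def normalize_row(row):
    """Return (label, current_period_value), or None if the row is unusable."""
    # New shape: the library has already reconciled contexts for us.
    if "科目" in row and "当期" in row:
        return row["科目"], row["当期"]
    # Legacy shape (field names assumed): keep only current-period contexts.
    if row.get("context_ref") in CURRENT_CONTEXTS:
        return row["element"], row["value"]
    return None


rows = [
    {"科目": "売上高", "前期": 90, "当期": 100},
    {"element": "NetSales", "context_ref": "CurrentYearDuration", "value": 100},
    {"element": "NetSales", "context_ref": "Prior1YearDuration", "value": 90},
]
usable = [r for r in map(normalize_row, rows) if r is not None]
```

Rows without a recognisable shape return None instead of raising, so the caller can count rejections rather than abort on the first unexpected record.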

What this PR does

- Pipeline: filter XBRL contexts to consolidated current-period, add IFRS/US-GAAP element mappings, convert the collector to async, expand the filing windows from two to six, and support both the legacy and the new edinet-mcp output formats.
- Data: EDINET codes for 104 companies were independently verified via the official EdinetcodeDlInfo CSV (Researcher A), cross-checked against IRBank / company IR pages for 16 top-cap names (Researcher B), and adversarially reviewed before application. Raw data regenerated end-to-end; 104 files pass the new scripts/check_raw_integrity.py.
- Accounting audits: scripts/audit.py and scripts/audit_quality.py check DSL executability, schema, duplicates, balance-sheet equality, NI decomposition, asset-total decomposition, COGS-vs-sales, gross-profit identity, rounding edge cases, and ROE-convention ambiguity. All invariant audits pass on the regenerated data. The remaining op_income_mismatch findings are IFRS-specific false positives: 営業利益 (operating income) is not 売上総利益 (gross profit) − 販管費 (SG&A) when a filer reports その他の収益/費用 (other income/expenses).
- Tooling: new baseline runner with R0 / R1 regimes (thinking-off vs native-moderate), stratified Lite-150 sampler, and the raw integrity check.
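Why two windows were not enough can be shown with a little month arithmetic. The sketch below assumes every window has the same shape as the two original ones (starting in the third month after fiscal year end and spanning three months); the actual six windows in _FILING_WINDOWS are not reproduced here, and `filing_window_months` is a hypothetical helper.

```python
def filing_window_months(fy_end_month, span=3):
    """Calendar months (1-12) searched for an annual report.

    Assumes the search starts in the third month after fiscal year end,
    matching the original Jun-Aug (March FY) and Mar-May (December FY) windows.
    """
    start = (fy_end_month + 3 - 1) % 12 + 1
    return [(start - 1 + k) % 12 + 1 for k in range(span)]


# The two original windows, derived from March and December year ends.
old_coverage = set(filing_window_months(3)) | set(filing_window_months(12))

# An August year end (e.g. ファーストリテイリング) files in November.
august_window = filing_window_months(8)
missed = 11 not in old_coverage
```

Here `filing_window_months(8)` wraps around the year end to November, December, and January, none of which fall inside the original two windows.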

Baselines (pilot on Lite 150, new data):

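| Model | Regime | Accuracy | Parse | Cost (USD, per run) |
|-------|--------|----------|-------|---------------------|
| gemini-2.5-flash | R0 | 88.0 % | 100 % | $0.0138 |
| gemini-2.5-flash | R1 | 85.33 % | 100 % | $0.0342 |
| gpt-5.4-mini | R0 | 92.0 % | 100 % | $0.0527 |
| gpt-5.4-mini | R1 | 91.33 % | 100 % | $0.1717 |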
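On a 150-question set each percentage maps to a whole number of correct answers, which makes the size of the R0/R1 gap easier to read. The snippet assumes accuracy is computed over all 150 questions, which the 100 % parse rate suggests but the table does not state.

```python
LITE_SIZE = 150  # questions in the Lite subset

accuracy_pct = {
    ("gemini-2.5-flash", "R0"): 88.0,
    ("gemini-2.5-flash", "R1"): 85.33,
    ("gpt-5.4-mini", "R0"): 92.0,
    ("gpt-5.4-mini", "R1"): 91.33,
}

# Convert each reported percentage back to a count of correct answers.
correct = {run: round(pct / 100 * LITE_SIZE) for run, pct in accuracy_pct.items()}

# Difference between regimes, in questions, per model.
gap = {
    model: correct[(model, "R0")] - correct[(model, "R1")]
    for model in ("gemini-2.5-flash", "gpt-5.4-mini")
}
```

Under that assumption the counts are 132, 128, 138, and 137, so the R0/R1 difference is four questions for gemini-2.5-flash and one question for gpt-5.4-mini.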

Pre-audit baselines on the broken data are retained at
scripts/data/baselines_pre_context_fix/ for diagnostic comparison
only — they should not be cited.

Public artifact status (not yet addressed)

The Hugging Face dataset at ajtgjmdjp/jfinqa and the mirrors in
lm-evaluation-harness (PR #3570) and llm-jp-eval (PR #230) still
reference the pre-fix revision. A corrected release and legacy tag
will be uploaded in a follow-up; this PR focuses on the
in-repository pipeline and regenerated data.

Test plan

- uv run python scripts/check_raw_integrity.py scripts/data_edinet_fix_*/raw: 0 violations over 104 files.
- uv run python scripts/audit.py: 0 DSL / duplicate / schema findings.
- uv run python scripts/audit_quality.py: 0 accounting-identity violations (NI decomposition, balance sheet, asset total, impossible COGS all pass).
- Spot-check real numbers against published annual reports (Toyota FY2024: 売上高 (net sales) ¥45,095,325 million, 資産合計 (total assets) ¥90,114,296 million; exact match).
- Baseline pilot runs on corrected Lite 150 for both Gemini 2.5 Flash and GPT-5.4 mini under R0 and R1 regimes.

🤖 Generated with Claude Code

Root cause: EDINET XBRL reports the same element (``ProfitLoss``, ``NetSales``, ``TotalAssets`` ...) many times across different ``context_ref`` values: prior years, non-consolidated (parent-only) variants, and dimensional breakdowns. The previous ``_extract_items`` just overwrote per element, so the value that ended up in the table was whichever happened to come last in the raw list.

Concretely this meant:
- 親会社株主に帰属する当期純利益 (連結; net income attributable to owners of the parent, consolidated) and 当期純利益 (単体; net income, parent-only) could coexist in the same table, making the NI decomposition identity (当期純利益 = 親会社 + 非支配, i.e. net income = parent share + non-controlling share) algebraically impossible.
- 資産合計 (total assets, from the consolidated summary) often disagreed with 流動資産 + 固定資産 (current + non-current assets, parent-only) by large margins.
- In at least one case (スズキ) 売上原価 (cost of sales, unclear context) exceeded 売上高 (net sales, parent-only), which is physically impossible.

Changes:
- ``_is_canonical_context``: accept only ``CurrentYearDuration``, ``CurrentYearInstant``, and the ``Consolidated`` synonyms. Reject anything with ``Prior`` or ``_<...>Member``.
- ``_extract_items`` applies the filter in both the display-order loop and the fallback loop.
- Add IFRS-suffixed element names (``NetSalesIFRS``, ``ProfitLossIFRS``, ``AssetsIFRS``, etc.) to ``PL_ELEMENTS`` / ``BS_ELEMENTS`` and the display orders, so IFRS/US-GAAP filings are picked up from their consolidated IFRS contexts instead of silently falling back to parent-only J-GAAP rows.

Verified on マルハニチロ, スズキ, KDDI: balance sheets now balance and NI decomposition holds to the last yen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
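The filtering rule can be sketched as follows. This is an illustrative reimplementation under assumptions, not the code in the commit: the function names are hypothetical, the item field names ``element`` and ``value`` are assumed, and the ``Consolidated`` synonyms are not listed in the commit message, so they are left as a caller-supplied parameter.

```python
import re

CANONICAL = frozenset({"CurrentYearDuration", "CurrentYearInstant"})
# Any context ending in "_...Member" is a dimensional or non-consolidated variant.
_MEMBER_SUFFIX = re.compile(r"_\w*Member$")


def is_canonical_context(context_ref, extra_synonyms=frozenset()):
    """True only for consolidated, current-period contexts."""
    if "Prior" in context_ref or _MEMBER_SUFFIX.search(context_ref):
        return False
    return context_ref in CANONICAL or context_ref in extra_synonyms


def extract_items(raw_items):
    """Keep the first canonical value per element instead of the last value seen."""
    table = {}
    for item in raw_items:
        if is_canonical_context(item["context_ref"]):
            table.setdefault(item["element"], item["value"])
    return table


raw = [
    {"element": "NetSales", "context_ref": "CurrentYearDuration", "value": 100},
    {"element": "NetSales", "context_ref": "Prior1YearDuration", "value": 90},
    {"element": "NetSales",
     "context_ref": "CurrentYearDuration_NonConsolidatedMember", "value": 60},
]
table = extract_items(raw)
```

With the old overwrite-per-element behaviour this input would have ended with the parent-only value 60, because it comes last; with the filter only the consolidated current-period value 100 survives.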

Re-runs stages 2-4 of the pipeline with the context-filtering fix. Same 1000-question target, 61 unique companies (down from 68 because some filings had no canonical consolidated data for all their required line items).

All five accounting-identity audits pass in full:
- impossible COGS: 13 -> 0
- gross-profit identity: 26 -> 0
- asset-total decomposition: 301 -> 0
- balance-sheet identity: 217 -> 0
- NI decomposition: 334 -> 0

Distribution: J-GAAP 624, IFRS 332, US-GAAP 44; numerical 550, consistency 200, temporal 250.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two scripts, intentionally runnable without CI:
- ``scripts/audit.py``: schema sanity, DSL executability, and DSL/gold answer agreement. Tolerance- and boolean-categorical-aware so it does not flag rounding artifacts or 増収/減収 (revenue up / revenue down) style answers as mismatches.
- ``scripts/audit_quality.py``: cross-row accounting invariants: COGS <= sales, gross-profit identity, operating-income identity, NI decomposition, asset-total decomposition, balance-sheet equality, ROE-convention ambiguity, rounding edge cases.

Report files are checked in for visibility; regenerate with ``uv run python scripts/audit.py`` and ``uv run python scripts/audit_quality.py``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
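A few of the invariants can be sketched as plain arithmetic checks over one table. This is a minimal illustration, not the logic of ``scripts/audit_quality.py``: the function name, the finding labels, the one-unit tolerance, and the keys 負債合計 (total liabilities) and 純資産合計 (total net assets) are assumptions.

```python
def _have(table, *keys):
    """True if every key needed by a check is present in the table."""
    return all(k in table for k in keys)


def check_invariants(t, tol=1):
    """Return labels of violated identities; a check is skipped if a key is missing."""
    findings = []
    if _have(t, "売上高", "売上原価") and t["売上原価"] > t["売上高"]:
        findings.append("impossible_cogs")
    if _have(t, "売上高", "売上原価", "売上総利益") and \
            abs(t["売上高"] - t["売上原価"] - t["売上総利益"]) > tol:
        findings.append("gross_profit_mismatch")
    if _have(t, "資産合計", "流動資産", "固定資産") and \
            abs(t["資産合計"] - (t["流動資産"] + t["固定資産"])) > tol:
        findings.append("asset_total_mismatch")
    if _have(t, "資産合計", "負債合計", "純資産合計") and \
            abs(t["資産合計"] - (t["負債合計"] + t["純資産合計"])) > tol:
        findings.append("balance_sheet_mismatch")
    return findings


consolidated = {
    "売上高": 1000, "売上原価": 700, "売上総利益": 300,
    "流動資産": 400, "固定資産": 600, "資産合計": 1000,
    "負債合計": 550, "純資産合計": 450,
}
# Same table, but with cost of sales taken from a different context.
mixed = dict(consolidated, 売上原価=1200)
```

The consistent table yields no findings, while the mixed one trips both the COGS check and the gross-profit identity, which is the pattern the pre-fix data showed.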

… format

## Three root-cause fixes

### 1. EDINET codes were wrong for 98 of 104 companies

``config.py`` had LLM-hallucinated ``edinet_code`` values that pointed to completely unrelated filers (``E00045`` = 大室温泉, not キリン; ``E00353`` = 三井物産, not 三井物産; etc.). Cross-verified correct codes via the official EDINET CSV (``EdinetcodeDlInfo.zip``, 11,242 rows, fetched 2026-04-18) plus independent IRBank checks for top-cap names. Five codes were already correct (Toyota, Sony, KDDI, MUFG, 中部電力); マルハニチロ is E00015 despite the 2025-12 Umios renaming because EDINET codes are tied to the 法人番号 (corporate number), not the 商号 (trade name). 45 companies carry a ``# FIXME(consensus-only): CSV 照合のみ`` (CSV match only) note because Researcher B's time budget ran out before they could be independently verified, although Researcher A's CSV match was exact and high-confidence.

### 2. s1_collect.py did not run on edinet-mcp >= 0.4

edinet-mcp 0.4.0 migrated ``EdinetClient`` to native async (``__aenter__``/``__aexit__`` only, ``get_filings`` / ``download_document`` coroutines). The existing synchronous ``with EdinetClient(...)`` call raised ``TypeError`` at the first iteration. Rewrote the module so ``_find_annual_report`` and ``collect_company`` are coroutines, ``run()`` wraps them in ``asyncio.run(_run_async(...))``, and the ``EdinetClient`` is entered via ``async with``. Rate limit reduced from 2.0 to 1.0 requests/second to stay under EDINET's 100 req/min API cap under batch load.

### 3. _FILING_WINDOWS missed non-March fiscal years

The two-window heuristic (June-Aug for March FY, March-May for December FY) silently skipped filers with an August fiscal year end like ファーストリテイリング (files in November). Widened to six windows covering fiscal year ends in January, March, April-May, August, September, and December, so every COMPANY_POOL entry has at least one viable search range.

## Dependency

Added ``defusedxml`` as a runtime dependency; edinet-mcp's XBRL parser now imports it at module load time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
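The shape of the sync-to-async conversion can be sketched with a stand-in client. ``StubEdinetClient`` is hypothetical and only mimics the behaviour described above (async context manager, coroutine ``get_filings``); its argument and return value are assumptions, not edinet-mcp's real signatures. The pacing sleep illustrates the 1.0 requests/second limit in the simplest possible way.

```python
import asyncio


class StubEdinetClient:
    """Hypothetical async-only client: no __enter__/__exit__ are defined."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def get_filings(self, date):  # signature is an assumption
        await asyncio.sleep(0)
        return [{"doc_id": f"S-{date}", "doc_type": "120"}]


async def collect_company(client, dates, min_interval=0.0):
    """Fetch filings date by date, waiting min_interval seconds between calls."""
    found = []
    for date in dates:
        found.extend(await client.get_filings(date))
        await asyncio.sleep(min_interval)  # 1.0 would give about 1 request/second
    return found


async def _run_async(dates):
    async with StubEdinetClient() as client:
        return await collect_company(client, dates)


def run(dates):
    """Synchronous entry point wrapping the coroutine, as the commit describes."""
    return asyncio.run(_run_async(dates))


def sync_with_fails():
    """A plain `with` on an async-only client cannot work."""
    try:
        with StubEdinetClient():
            pass
    except (TypeError, AttributeError):  # TypeError on Python 3.11+, AttributeError before
        return True
    return False
```

Keeping ``run()`` synchronous means callers of the collector stage do not need to change; only the internals become coroutines.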

Full re-collection under the fixed pipeline: 104 companies, all consolidated current-period figures (values cross-checked against published reports, e.g. Toyota FY2024 売上高 (net sales) 45,095,325 百万円 (million yen), 資産合計 (total assets) 90,114,296 百万円).

Same 1000-question target; subtask split 550/200/250 preserved; J-GAAP 656 / IFRS 323 / US-GAAP 21; avg program steps 2.58; 13 rejections (answer_mismatch only).

Lite 150 regenerated on the new dataset with 82 unique companies, balanced strata: numerical 84 / consistency 29 / temporal 37, J-GAAP 88 / IFRS 55 / US-GAAP 7.

Audit results on the new data:
- ``scripts/audit.py``: 0 schema / DSL / duplicate findings.
- ``scripts/audit_quality.py``: 0 impossible-COGS, 0 NI-decomposition violations, 0 balance-sheet violations, 1 asset-total rounding edge case. The remaining 311 op-income findings are IFRS false positives (営業利益 is not ``売上総利益 - 販管費`` for IFRS filers that have その他の収益 / その他の費用 lines); the 89 ROE-convention ambiguities are an inherent specification question, not a data bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- ``scripts/run_baseline.py``: adds R0/R1 regime support (thinking OFF / native-moderate) with per-provider routing for OpenAI GPT-5 family (``reasoning_effort``), Gemini 2.5 (``thinking_config`` budget), and Anthropic extended thinking. Captures per-question token usage (including reasoning tokens), parse success, truncation flag, latency, and cost using a built-in pricing table. Emits separate ``{model}__{regime}__predictions.json`` and ``...__metrics.json`` artefacts with subtask / accounting-standard breakdowns and p50/p90/p95 output-token distribution.
- ``scripts/build_lite.py``: deterministic stratified sampler producing a 150-question ``jfinqa-Lite`` subset. Primary stratum is ``subtask × accounting_standard`` with a soft cap of 4 questions per ``edinet_code`` and a minimum US-GAAP quota. Seed 42 for reproducibility.
- ``scripts/check_raw_integrity.py``: post-collection sanity check. Verifies every ``raw/E*.json`` has ``company.edinet_code`` and ``filings[*].filing.edinet_code`` matching the filename stem, ``doc_type == "120"``, and a present ``company_name``. Catches the cache-skip-reuses-wrong-filing class of bug the EDINET code fix was meant to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
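The integrity rules listed for ``scripts/check_raw_integrity.py`` can be sketched as below. This is a hedged reconstruction from the commit message, not the script itself: the function names are hypothetical, and the exact locations of ``doc_type`` (assumed under ``filings[*].filing``) and ``company_name`` (assumed under ``company``) are guesses.

```python
import json
from pathlib import Path


def check_raw_record(stem, data):
    """Return a list of problems for one parsed raw file; empty means it passes."""
    problems = []
    company = data.get("company", {})
    if company.get("edinet_code") != stem:
        problems.append("company.edinet_code does not match filename stem")
    if not company.get("company_name"):
        problems.append("company_name missing")
    for i, entry in enumerate(data.get("filings", [])):
        filing = entry.get("filing", {})
        if filing.get("edinet_code") != stem:
            problems.append(f"filings[{i}].filing.edinet_code does not match stem")
        if filing.get("doc_type") != "120":  # 120 is the annual securities report
            problems.append(f"filings[{i}] doc_type is not '120'")
    return problems


def check_raw_file(path):
    """Load raw/E*.json and check it against its own filename stem."""
    path = Path(path)
    return check_raw_record(path.stem, json.loads(path.read_text(encoding="utf-8")))


good = {
    "company": {"edinet_code": "E00015", "company_name": "マルハニチロ"},
    "filings": [{"filing": {"edinet_code": "E00015", "doc_type": "120"}}],
}
# A cached filing that belongs to a different filer than the file name claims.
stale = {
    "company": {"edinet_code": "E00015", "company_name": "マルハニチロ"},
    "filings": [{"filing": {"edinet_code": "E00045", "doc_type": "120"}}],
}
```

Comparing against the filename stem rather than against ``config.py`` keeps the check independent of the mapping it is meant to guard.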

Four pilot runs on ``jfinqa-Lite`` (150 questions, stratified) using the corrected pipeline output. Canonical zero-shot prompt shared across all runs.

| Model | Regime | Accuracy | Parse | Truncation | Cost |
|--------------------|--------|----------|-------|------------|----------|
| gemini-2.5-flash | R0 | 88.0 % | 100 % | 0 % | \$0.0138 |
| gemini-2.5-flash | R1 | 85.33 % | 100 % | 0 % | \$0.0342 |
| gpt-5.4-mini | R0 | 92.0 % | 100 % | 0 % | \$0.0527 |
| gpt-5.4-mini | R1 | 91.33 % | 100 % | 0 % | \$0.1717 |

Prior pilot numbers on the pre-fix v1 data are retained at ``scripts/data/baselines_pre_context_fix/`` for before/after comparison; they should **not** be cited as baselines because the tables they were answered against mixed 連結 (consolidated) and 単体 (parent-only) values.

Subtask breakdown (R0): temporal_reasoning saturates at 100 % for both models, consistency_checking at 93-100 %, numerical_reasoning is the discriminating subtask (80.95 / 85.71). Accounting-standard accuracy is close between J-GAAP and IFRS for both models.

Total pilot cost: \$0.27.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ajtgjmdjp and others added 7 commits April 18, 2026 17:03

ajtgjmdjp marked this pull request as ready for review April 18, 2026 11:58

ajtgjmdjp merged commit a4f06ff into main Apr 18, 2026
4 checks passed

ajtgjmdjp deleted the fix/xbrl-context-filtering branch April 18, 2026 11:58
