
Fix JFinQA data construction pipeline and EDINET mappings #3

Merged: ajtgjmdjp merged 7 commits into main from fix/xbrl-context-filtering on Apr 18, 2026


@ajtgjmdjp
Owner

Summary

Fixes benchmark construction issues discovered in post-release auditing of the JFinQA data pipeline.

What was broken

  1. XBRL context mixing (連結 consolidated vs 単体 parent-only).
    _extract_items in scripts/pipeline/s2_transform.py overwrote values
    per element without looking at context, so a single table could
    combine consolidated and parent-only figures. 647 of the 1000
    published questions violated basic accounting identities
    (balance-sheet equality, NI decomposition, impossible 売上原価 (COGS)
    > 売上高 (net sales), etc.).
  2. EDINET code mappings. 98 of 104 entries in
    scripts/pipeline/config.py pointed to the wrong filer — e.g.
    E00045 was labelled "キリンホールディングス" but actually resolves
    to 大室温泉. This meant every regenerated row used the wrong
    company's financial data.
  3. Broken collector. scripts/pipeline/s1_collect.py called
    EdinetClient synchronously, but edinet-mcp >= 0.4.0 migrated to a
    native async API (EdinetClient only exposes __aenter__ /
    __aexit__, and get_filings / download_document are
    coroutines). The collector failed with TypeError at the first
    iteration.
  4. Narrow filing windows. The two windows (Jun–Aug for March
    fiscal years, Mar–May for December fiscal years) missed filers
    with any other fiscal year end, such as ファーストリテイリング
    (Aug FY, files in November).
  5. edinet-mcp output format change. The 0.4+ library returns
    pre-reconciled {"科目": …, "前期": …, "当期": …} dicts instead of
    raw XBRL elements with context fields. The old extractor saw no
    context and rejected everything.
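
Item 5 can be illustrated with a minimal sketch that dispatches on row shape. The 科目 / 当期 keys come from the description above; the legacy field names (element, context_ref, value) and both function names are assumptions for illustration, not the pipeline's actual code.

```python
from typing import Callable, Iterable


def is_reconciled_row(row: dict) -> bool:
    """True for the 0.4+ shape: {"科目": ..., "前期": ..., "当期": ...}."""
    return "科目" in row and "当期" in row


def current_values(
    rows: Iterable[dict],
    is_canonical: Callable[[str], bool],
) -> dict[str, float]:
    """Collect current-period values from either output shape."""
    table: dict[str, float] = {}
    for row in rows:
        if is_reconciled_row(row):
            # New shape: already reconciled, so there is no context to inspect.
            table.setdefault(row["科目"], row["当期"])
        elif is_canonical(row.get("context_ref", "")):
            # Legacy shape (hypothetical field names): keep canonical contexts only.
            table.setdefault(row["element"], row["value"])
    return table
```

Passing the context predicate in as a parameter keeps the legacy filter and the new-format path independent, so either can change without touching the other.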

What this PR does

  • Pipeline: filter XBRL contexts to consolidated current-period,
    add IFRS/US-GAAP element mappings, convert the collector to async,
    expand filing windows to six, and support both the legacy and the
    new edinet-mcp output formats.

  • Data: EDINET codes for 104 companies were independently
    verified via the official EdinetcodeDlInfo CSV (Researcher A),
    cross-checked against IRBank and company IR pages (企業 IR) for 16 top-cap names
    (Researcher B), and adversarially reviewed before application.
    Raw data regenerated end-to-end; 104 files pass the new
    scripts/check_raw_integrity.py.

  • Accounting audits: scripts/audit.py and
    scripts/audit_quality.py check DSL executability, schema,
    duplicates, balance-sheet equality, NI decomposition, asset-total
    decomposition, COGS-vs-sales, gross-profit identity, rounding
    edge cases, and ROE-convention ambiguity. All invariant audits pass
    on the regenerated data. The remaining op_income_mismatch
    findings are IFRS-specific false positives: 営業利益 (operating
    income) is not 売上総利益 − 販管費 (gross profit − SG&A) when a
    filer reports その他の収益/費用 (other income/expenses).

  • Tooling: new baseline runner with R0 / R1 regimes
    (thinking-off vs native-moderate), stratified Lite-150 sampler,
    and the raw integrity check.

  • Baselines (pilot on Lite 150, new data):

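    | Model            | Regime | Accuracy | Parse | Cost    |
    |------------------|--------|----------|-------|---------|
    | gemini-2.5-flash | R0     | 88.0 %   | 100 % | $0.0138 |
    | gemini-2.5-flash | R1     | 85.33 %  | 100 % | $0.0342 |
    | gpt-5.4-mini     | R0     | 92.0 %   | 100 % | $0.0527 |
    | gpt-5.4-mini     | R1     | 91.33 %  | 100 % | $0.1717 |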

    Pre-audit baselines on the broken data are retained at
    scripts/data/baselines_pre_context_fix/ for diagnostic comparison
    only — they should not be cited.

Public artifact status (not yet addressed)

The Hugging Face dataset at ajtgjmdjp/jfinqa and the mirrors in
lm-evaluation-harness (PR #3570) and llm-jp-eval (PR #230) still
reference the pre-fix revision. A corrected release and legacy tag
will be uploaded in a follow-up; this PR focuses on the
in-repository pipeline and regenerated data.

Test plan

  • uv run python scripts/check_raw_integrity.py scripts/data_edinet_fix_*/raw
    — 0 violations over 104 files.
  • uv run python scripts/audit.py — 0 DSL / duplicate / schema
    findings.
  • uv run python scripts/audit_quality.py — 0 accounting-identity
    violations (NI decomposition, balance sheet, asset total, impossible
    COGS all pass).
  • Spot-check real numbers against published annual reports
    (Toyota FY2024: 売上高 (net sales) ¥45,095,325 million, 資産合計
    (total assets) ¥90,114,296 million; exact match).
  • Baseline pilot runs on corrected Lite 150 for both Gemini 2.5
    Flash and GPT-5.4 mini under R0 and R1 regimes.

🤖 Generated with Claude Code

ajtgjmdjp and others added 7 commits April 18, 2026 17:03
Root cause: EDINET XBRL reports the same element (``ProfitLoss``,
``NetSales``, ``TotalAssets`` ...) many times across different
``context_ref`` values — prior years, non-consolidated (parent-only)
variants, and dimensional breakdowns. The previous ``_extract_items``
just overwrote per-element, so the value that ended up in the table
was whichever happened to come last in the raw list.

Concretely this meant:

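- 親会社株主に帰属する当期純利益 (net income attributable to owners of
  the parent, 連結 consolidated) and 当期純利益 (net income, 単体
  parent-only) could coexist in the same table, so the NI
  decomposition identity (当期純利益 = 親会社 + 非支配, i.e. net income =
  parent share + non-controlling interests) could not hold.
- 資産合計 (total assets, 連結 summary) often disagreed with 流動資産 +
  固定資産 (current + non-current assets, 単体) by large margins.
- In at least one case (スズキ) 売上原価 (COGS, unclear context) exceeded
  売上高 (net sales, 単体), which the audits treat as impossible.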

Changes:

- ``_is_canonical_context``: accept only ``CurrentYearDuration``,
  ``CurrentYearInstant``, and the ``Consolidated`` synonyms. Reject
  anything with ``Prior`` or ``_<...>Member``.
- ``_extract_items`` applies the filter in both the display-order
  loop and the fallback loop.
- Add IFRS-suffixed element names (``NetSalesIFRS``, ``ProfitLossIFRS``,
  ``AssetsIFRS``, etc.) to ``PL_ELEMENTS`` / ``BS_ELEMENTS`` and the
  display orders, so IFRS/US-GAAP filings are picked up from their
  consolidated IFRS contexts instead of silently falling back to
  parent-only J-GAAP rows.
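
The filter described above could look roughly like the following sketch. The function names mirror the commit text, but the bodies, the raw-item field names (``element``, ``context_ref``, ``value``), and the example context IDs are assumptions; the real allow-list, including the ``Consolidated`` synonyms, lives in ``s2_transform.py``.

```python
import re

# Assumed allow-list; the actual synonyms are defined in s2_transform.py.
_CANONICAL_CONTEXTS = {"CurrentYearDuration", "CurrentYearInstant"}


def is_canonical_context(context_ref: str) -> bool:
    """Accept consolidated current-period contexts only."""
    if "Prior" in context_ref:
        return False  # prior-year comparatives
    if re.search(r"_\w*Member", context_ref):
        return False  # parent-only variants and dimensional breakdowns
    return context_ref in _CANONICAL_CONTEXTS


def extract_items(raw_items: list[dict], wanted: set[str]) -> dict[str, float]:
    """Keep one value per wanted element, taken from a canonical context."""
    table: dict[str, float] = {}
    for item in raw_items:
        if item["element"] not in wanted:
            continue
        if not is_canonical_context(item["context_ref"]):
            continue
        # setdefault: later duplicates can no longer overwrite the value.
        table.setdefault(item["element"], item["value"])
    return table
```

With this shape the result no longer depends on the order of the raw list, which was the root cause named above.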

Verified on マルハニチロ, スズキ, KDDI: balance sheets now balance
and NI decomposition holds to the last yen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-runs stages 2-4 of the pipeline with the context-filtering fix.
Same 1000-question target, 61 unique companies (down from 68 because
some filings had no canonical consolidated data for all their
required line items).

All five accounting-identity audits pass in full:

- impossible COGS: 13 -> 0
- gross-profit identity: 26 -> 0
- asset-total decomposition: 301 -> 0
- balance-sheet identity: 217 -> 0
- NI decomposition: 334 -> 0

Distribution: J-GAAP 624, IFRS 332, US-GAAP 44; numerical 550,
consistency 200, temporal 250.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scripts, intentionally runnable without CI:

- ``scripts/audit.py`` — schema sanity, DSL executability, and
  DSL/gold answer agreement. The comparison is tolerance-aware and
  handles boolean/categorical answers, so it does not flag rounding
  artifacts or 増収/減収 (revenue up / revenue down) style answers
  as mismatches.
- ``scripts/audit_quality.py`` — cross-row accounting invariants:
  COGS <= sales, gross-profit identity, operating-income identity,
  NI decomposition, asset-total decomposition, balance-sheet
  equality, ROE-convention ambiguity, rounding edge cases.
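
As a rough illustration of the cross-row invariants, the sketch below checks a single table keyed by Japanese line-item labels. The label set, the finding names, and the one-unit tolerance are assumptions; ``scripts/audit_quality.py`` may structure its checks differently.

```python
def check_identities(table: dict[str, float], tol: float = 1.0) -> list[str]:
    """Return the names of accounting identities that `table` violates."""
    findings: list[str] = []

    def has(*labels: str) -> bool:
        return all(label in table for label in labels)

    # COGS must not exceed net sales.
    if has("売上高", "売上原価") and table["売上原価"] > table["売上高"]:
        findings.append("impossible_cogs")
    # Gross profit = net sales - COGS.
    if has("売上高", "売上原価", "売上総利益"):
        if abs(table["売上高"] - table["売上原価"] - table["売上総利益"]) > tol:
            findings.append("gross_profit_mismatch")
    # Total assets = current assets + non-current assets.
    if has("資産合計", "流動資産", "固定資産"):
        if abs(table["資産合計"] - table["流動資産"] - table["固定資産"]) > tol:
            findings.append("asset_total_mismatch")
    # Total assets = total liabilities + total net assets.
    if has("資産合計", "負債合計", "純資産合計"):
        if abs(table["資産合計"] - table["負債合計"] - table["純資産合計"]) > tol:
            findings.append("balance_sheet_mismatch")
    # Net income = parent share + non-controlling interests.
    parent, nci = "親会社株主に帰属する当期純利益", "非支配株主に帰属する当期純利益"
    if has("当期純利益", parent, nci):
        if abs(table["当期純利益"] - table[parent] - table[nci]) > tol:
            findings.append("ni_decomposition_mismatch")
    return findings
```

Each check runs only when all of its line items are present, so a table that omits a line item is skipped for that check rather than reported as a violation.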

Report files are checked in for visibility; regenerate with
``uv run python scripts/audit.py`` and ``scripts/audit_quality.py``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… format

## Three root-cause fixes

### 1. EDINET codes were wrong for 98 of 104 companies

``config.py`` had LLM-hallucinated ``edinet_code`` values that pointed
to completely unrelated filers (``E00045`` = 大室温泉, not キリン;
``E00353`` = 三井物産, not 三井物産; etc.). Cross-verified correct
codes via EDINET official CSV (``EdinetcodeDlInfo.zip``, 11,242 rows,
fetched 2026-04-18) plus independent IRBank checks for top-cap names.
Five codes were already correct (Toyota, Sony, KDDI, MUFG, 中部電力);
マルハニチロ is E00015 despite the 2025-12 Umios renaming because
EDINET codes are tied to the 法人番号 (corporate number), not the 商号
(trade name). 45 companies carry a
``# FIXME(consensus-only): CSV 照合のみ`` ("CSV match only") note
because Researcher B ran out of time budget before verifying them
independently, although Researcher A's CSV match for each was exact
and high-confidence.

### 2. s1_collect.py did not run on edinet-mcp >= 0.4

edinet-mcp 0.4.0 migrated ``EdinetClient`` to native async
(``__aenter__``/``__aexit__`` only, ``get_filings`` / ``download_document``
coroutines). The existing synchronous ``with EdinetClient(...)`` call
raised ``TypeError`` at the first iteration. Rewrote the module so
``_find_annual_report`` and ``collect_company`` are coroutines,
``run()`` wraps them in ``asyncio.run(_run_async(...))``, and the
``EdinetClient`` is entered via ``async with``. Rate limit reduced
from 2.0 to 1.0 requests/second (120 to 60 requests/minute) to stay
under EDINET's 100 req/min API cap under batch load.
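
The overall shape of the rewrite can be sketched as follows. ``StubEdinetClient`` is a stand-in written for this example and is not the edinet-mcp API (the real ``get_filings`` signature is not shown here); ``RateLimiter`` and the returned dict layout are likewise assumptions.

```python
import asyncio
import time


class StubEdinetClient:
    """Async-only stand-in: usable via `async with`, not plain `with`."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def get_filings(self, edinet_code: str) -> list[dict]:
        await asyncio.sleep(0)  # pretend to do network I/O
        return [
            {"doc_id": f"{edinet_code}-q", "doc_type": "140"},
            {"doc_id": f"{edinet_code}-a", "doc_type": "120"},
        ]


class RateLimiter:
    """Spaces calls at least 1/per_second seconds apart."""

    def __init__(self, per_second: float) -> None:
        self._interval = 1.0 / per_second
        self._next = 0.0

    async def wait(self) -> None:
        now = time.monotonic()
        if self._next > now:
            await asyncio.sleep(self._next - now)
        self._next = max(now, self._next) + self._interval


async def collect_company(client, limiter: RateLimiter, edinet_code: str):
    await limiter.wait()
    filings = await client.get_filings(edinet_code)
    annual = [f for f in filings if f["doc_type"] == "120"]
    return annual[0] if annual else None


async def _run_async(codes: list[str], per_second: float) -> dict:
    limiter = RateLimiter(per_second)
    async with StubEdinetClient() as client:
        return {c: await collect_company(client, limiter, c) for c in codes}


def run(codes: list[str], per_second: float = 1.0) -> dict:
    """Synchronous entry point wrapping the coroutine pipeline."""
    return asyncio.run(_run_async(codes, per_second))
```

Calling ``run(["E00001", "E00002"])`` processes the codes sequentially, one request per second by default, which keeps the request rate below the cap described above.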

### 3. _FILING_WINDOWS missed non-March fiscal years

The two-window heuristic (June-Aug for March FY, March-May for
December FY) silently skipped August-FY filers like ファーストリテイリング
(files in November). Widened to six windows covering fiscal year ends
in January, March, April-May, August, September, and December, so
every COMPANY_POOL entry has at least one viable search range.
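
Both original windows open three months after fiscal year end and span three months, and an August-FY filer that files in November fits the same pattern. The sketch below derives a search window from the fiscal-year-end month under that assumption. It does not reproduce the six fixed windows in ``_FILING_WINDOWS``, whose exact dates are not shown here, and ``filing_window`` is a hypothetical helper.

```python
import calendar
from datetime import date


def _shift_month(year: int, month: int, offset: int) -> tuple[int, int]:
    """Add `offset` months to (year, month), rolling over the year."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1


def filing_window(fy_end_year: int, fy_end_month: int) -> tuple[date, date]:
    """Window covering the 3rd through 5th month after fiscal year end."""
    start_year, start_month = _shift_month(fy_end_year, fy_end_month, 3)
    end_year, end_month = _shift_month(fy_end_year, fy_end_month, 5)
    last_day = calendar.monthrange(end_year, end_month)[1]
    return date(start_year, start_month, 1), date(end_year, end_month, last_day)
```

For a fiscal year ending in August 2024 this yields 2024-11-01 through 2025-01-31, which covers a November filing.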

## Dependency

Added ``defusedxml`` as a runtime dependency; edinet-mcp's XBRL
parser now imports it at module load time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full re-collection under the fixed pipeline: 104 companies, all
consolidated current-period figures (values cross-checked against
published reports, e.g. Toyota FY2024 売上高 (net sales) ¥45,095,325
million, 資産合計 (total assets) ¥90,114,296 million). Same
1000-question target; subtask
split 550/200/250 preserved; J-GAAP 656 / IFRS 323 / US-GAAP 21;
avg program steps 2.58; 13 rejections (answer_mismatch only).

Lite 150 regenerated on the new dataset with 82 unique companies,
balanced strata: numerical 84 / consistency 29 / temporal 37,
J-GAAP 88 / IFRS 55 / US-GAAP 7.
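
A deterministic stratified sampler of this kind might be sketched as below, using proportional allocation per ``subtask × accounting_standard`` stratum and a soft per-company cap. The question field names, the allocation rule, and the top-up strategy are assumptions, and the minimum US-GAAP quota is omitted; ``scripts/build_lite.py`` is the authoritative implementation.

```python
import random
from collections import Counter, defaultdict


def build_lite(questions: list[dict], size: int = 150, cap: int = 4, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    # Sort by id first so the result does not depend on input order.
    for q in sorted(questions, key=lambda q: q["id"]):
        strata[(q["subtask"], q["accounting_standard"])].append(q)

    picked: list[dict] = []
    leftovers: list[dict] = []
    per_company: Counter = Counter()
    for key in sorted(strata):
        pool = strata[key]
        rng.shuffle(pool)
        quota = size * len(pool) // len(questions)  # proportional, rounded down
        taken = 0
        for q in pool:
            if taken < quota and per_company[q["edinet_code"]] < cap:
                picked.append(q)
                per_company[q["edinet_code"]] += 1
                taken += 1
            else:
                leftovers.append(q)

    # Top up what rounding or the cap left short: first honouring the cap,
    # then relaxing it, which is what makes the cap "soft".
    rng.shuffle(leftovers)
    for enforce_cap in (True, False):
        remaining = []
        for q in leftovers:
            under_cap = per_company[q["edinet_code"]] < cap
            if len(picked) < size and (under_cap or not enforce_cap):
                picked.append(q)
                per_company[q["edinet_code"]] += 1
            else:
                remaining.append(q)
        leftovers = remaining
    return picked
```

Seeding a private ``random.Random`` instance, rather than the module-level generator, keeps the sample reproducible even if other code draws random numbers in between.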

Audit results on the new data:

- ``scripts/audit.py``: 0 schema / DSL / duplicate findings.
- ``scripts/audit_quality.py``: 0 impossible-COGS, 0 NI-decomposition
  violations, 0 balance-sheet violations, 1 asset-total rounding
  edge case. The remaining 311 op-income findings are IFRS false
  positives (営業利益 is not ``売上総利益 - 販管費`` for IFRS filers
  that have その他の収益 / その他の費用 lines); the 89 ROE-convention
  ambiguities are an inherent specification question, not a data bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ``scripts/run_baseline.py``: adds R0/R1 regime support (thinking
  OFF / native-moderate) with per-provider routing for OpenAI GPT-5
  family (``reasoning_effort``), Gemini 2.5 (``thinking_config``
  budget), and Anthropic extended thinking. Captures per-question
  token usage (including reasoning tokens), parse success, truncation
  flag, latency, and cost using a built-in pricing table. Emits
  separate ``{model}__{regime}__predictions.json`` and
  ``...__metrics.json`` artefacts with subtask / accounting-standard
  breakdowns and p50/p90/p95 output-token distribution.

- ``scripts/build_lite.py``: deterministic stratified sampler
  producing a 150-question ``jfinqa-Lite`` subset. Primary stratum is
  ``subtask × accounting_standard`` with a soft cap of 4 questions
  per ``edinet_code`` and a minimum US-GAAP quota. Seed 42 for
  reproducibility.

- ``scripts/check_raw_integrity.py``: post-collection sanity check.
  Verifies every ``raw/E*.json`` has ``company.edinet_code`` and
  ``filings[*].filing.edinet_code`` matching the filename stem,
  ``doc_type == "120"``, and a present ``company_name``. Catches the
  cache-skip-reuses-wrong-filing class of bug the EDINET code fix
  was meant to close.
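
The integrity check described in the last item could be approximated as follows. Where ``doc_type`` and ``company_name`` sit in the JSON is not spelled out above, so this sketch assumes ``doc_type`` is on each ``filing`` object and ``company_name`` on ``company``; the function name and violation messages are hypothetical.

```python
import json
from pathlib import Path


def check_raw_file(path: Path) -> list[str]:
    """Return integrity violations for one raw/E*.json file."""
    data = json.loads(path.read_text(encoding="utf-8"))
    expected = path.stem  # e.g. "E00015" for raw/E00015.json
    violations: list[str] = []

    company = data.get("company", {})
    if company.get("edinet_code") != expected:
        violations.append(f"{path.name}: company.edinet_code != {expected}")
    if not company.get("company_name"):
        violations.append(f"{path.name}: company_name missing")

    for i, entry in enumerate(data.get("filings", [])):
        filing = entry.get("filing", {})
        if filing.get("edinet_code") != expected:
            violations.append(f"{path.name}: filings[{i}] belongs to another filer")
        if filing.get("doc_type") != "120":
            violations.append(f"{path.name}: filings[{i}] doc_type is not 120")
    return violations


def check_raw_dir(raw_dir: Path) -> list[str]:
    """Run the check over every E*.json in a directory, in sorted order."""
    found: list[str] = []
    for path in sorted(raw_dir.glob("E*.json")):
        found.extend(check_raw_file(path))
    return found
```

Comparing against the filename stem is what catches a cached file that was written under one code but contains another filer's documents.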

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four pilot runs on ``jfinqa-Lite`` (150 questions, stratified) using
the corrected pipeline output. Canonical zero-shot prompt shared
across all runs.

| Model              | Regime | Accuracy | Parse | Truncation | Cost    |
|--------------------|--------|----------|-------|------------|---------|
| gemini-2.5-flash   | R0     | 88.0 %   | 100 % | 0 %        | $0.0138 |
| gemini-2.5-flash   | R1     | 85.33 %  | 100 % | 0 %        | $0.0342 |
| gpt-5.4-mini       | R0     | 92.0 %   | 100 % | 0 %        | $0.0527 |
| gpt-5.4-mini       | R1     | 91.33 %  | 100 % | 0 %        | $0.1717 |

Prior pilot numbers on the pre-fix v1 data are retained at
``scripts/data/baselines_pre_context_fix/`` for before/after
comparison; they should **not** be cited as baselines because the
tables they were answered against mixed 連結 and 単体 values.

Subtask breakdown (R0): temporal_reasoning saturates at 100 % for
both models, consistency_checking at 93-100 %, numerical_reasoning
is the discriminating subtask (80.95 / 85.71). Accounting-standard
accuracy is close between J-GAAP and IFRS for both models.

Total pilot cost: $0.27.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajtgjmdjp ajtgjmdjp marked this pull request as ready for review April 18, 2026 11:58
@ajtgjmdjp ajtgjmdjp merged commit a4f06ff into main Apr 18, 2026
4 checks passed
@ajtgjmdjp ajtgjmdjp deleted the fix/xbrl-context-filtering branch April 18, 2026 11:58