Normalize whitespace + strip corporate suffix in Wikidata search #20

Merged
fleveque merged 1 commit into main from fix/wikidata-name-normalization
May 16, 2026

@fleveque

Why

Yesterday's prod log (after #19 surfaced the wikidata-miss reason) showed REP.MC failing here:

```
"wikidata provider miss" symbol=REP.MC company_name="REPSOL,  S.A."
error="wikidata search: no wikidata entity for \"REPSOL,  S.A.\""
```

Two structural problems with names coming through from Yahoo:

  • whitespace can be doubled (`REPSOL,  S.A.`, with two spaces after the comma)
  • corporate suffix + uppercase tank Wikidata's fuzzy match score

Live-tested against the real Wikidata API:

| Query | Result |
| --- | --- |
| `REPSOL,  S.A.` (double space) | 0 results |
| `REPSOL, S.A.` | Q174747 (Repsol) |
| `Repsol` | Q174747 (Repsol) |

So a whitespace collapse alone unblocks Repsol. As a secondary fallback, stripping a known list of corporate suffixes (`S.A.`, `Inc.`, `plc`, `Ltd`, `AG`, `GmbH`, `N.V.`, `Corp.`, `Co.`, …) catches the cases where the as-is search still misses.

Two Wikidata queries worst case — cheap against their free API.
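
The two-step lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: the project's language is not shown here, the names `normalize_whitespace`, `strip_corporate_suffix`, and `find_entity` are hypothetical, the suffix list is only the subset quoted in the description, and `search` stands in for whatever function actually queries Wikidata.

```python
import re
from typing import Callable, Optional

# Only the suffixes quoted in the PR description; the full list is elided there.
CORPORATE_SUFFIXES = ("S.A.", "Inc.", "plc", "Ltd", "AG", "GmbH", "N.V.", "Corp.", "Co.")


def normalize_whitespace(name: str) -> str:
    """Collapse runs of whitespace to a single space and trim both ends."""
    return " ".join(name.split())


def strip_corporate_suffix(name: str) -> str:
    """Drop one trailing corporate suffix, plus the comma or space before it."""
    for suffix in CORPORATE_SUFFIXES:
        pattern = r"[\s,]+" + re.escape(suffix) + r"$"
        stripped, count = re.subn(pattern, "", name, flags=re.IGNORECASE)
        if count:
            return stripped
    return name


def find_entity(raw_name: str, search: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the whitespace-normalized name first, then the suffix-stripped name.

    `search` is an assumed callable returning an entity ID or None.
    At most two calls to `search` are made.
    """
    name = normalize_whitespace(raw_name)
    hit = search(name)
    if hit is not None:
        return hit
    stripped = strip_corporate_suffix(name)
    if stripped != name:
        return search(stripped)
    return None
```

Passing the search function in as a parameter keeps the sketch testable without network access; the real provider presumably calls the Wikidata API directly.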

Test plan

🤖 Generated with Claude Code

Yesterday's prod log surfaced REP.MC's wikidata miss:

  "no wikidata entity for \"REPSOL,  S.A.\""

Two structural problems with names coming through from Yahoo:
- whitespace can be doubled ("REPSOL,  S.A.")
- corporate suffix + uppercasing tank Wikidata's fuzzy match score

Live-tested both forms against the Wikidata API:
- "REPSOL,  S.A."  → 0 results
- "REPSOL, S.A."   → Q174747 (Repsol)
- "Repsol"         → Q174747 (Repsol)

So a whitespace collapse alone unblocks Repsol. As a secondary
fallback, stripping a known list of corporate suffixes (S.A., Inc., plc,
Ltd, AG, GmbH, N.V., Corp., Co., …) catches the cases where the as-is
search still misses. Two queries worst case — cheap against Wikidata's
free API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fleveque fleveque merged commit e2db5af into main May 16, 2026
2 checks passed
@fleveque fleveque deleted the fix/wikidata-name-normalization branch May 16, 2026 22:05
fleveque added a commit that referenced this pull request May 16, 2026
Yesterday's normalization fix (#20) helped REPSOL, but DGE.L's
company_name from Yahoo arrived as the full security descriptor:

  "DIAGEO PLC ORD 28 101/108P"

(Diageo plc, Ordinary shares, par value 28 101/108 pence.)

My previous suffix-strip only handled trailing corporate forms ("plc"
at the END), so it left this string unchanged and Wikidata couldn't
match.

Replace the two helpers (normalizeWhitespace + stripCorporateSuffix)
with a single companyNameVariants function that scans for the corporate
form as a *word inside* the name and returns up to three search
variants in decreasing fidelity:

  "DIAGEO PLC ORD 28 101/108P"  ─►  ["DIAGEO PLC ORD 28 101/108P",
                                     "DIAGEO PLC",
                                     "DIAGEO"]
  "REPSOL,  S.A."                ─►  ["REPSOL, S.A.", "REPSOL"]
  "Apple Inc."                   ─►  ["Apple Inc.", "Apple"]
  "Berkshire Hathaway"           ─►  ["Berkshire Hathaway"]

Search now tries each variant in order, stopping at the first hit.
Three Wikidata queries worst case — still cheap against their free API.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
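
One way the variant generation and ordered search from the commit message above might look, as a hedged Python sketch. It is not the commit's actual `companyNameVariants`: the project's language is not shown here, `company_name_variants` and `search_first_hit` are hypothetical names, the set of corporate forms covers only those quoted earlier, and `search` is an assumed callable that returns an entity ID or `None`.

```python
from typing import Callable, List, Optional

# Corporate forms quoted in the PR, lowercased with trailing dots removed.
CORPORATE_FORMS = {"s.a", "inc", "plc", "ltd", "ag", "gmbh", "n.v", "corp", "co"}


def company_name_variants(raw_name: str) -> List[str]:
    """Return up to three search variants in decreasing fidelity.

    1. the whitespace-normalized name,
    2. the name cut off right after the first corporate form,
    3. the name cut off right before that form.
    Duplicates are skipped, so fewer than three variants may come back.
    """
    name = " ".join(raw_name.split())
    variants = [name]
    words = name.split(" ")
    for i, word in enumerate(words):
        # Skip index 0 so a company whose first word looks like a form survives.
        if i > 0 and word.lower().rstrip(".,") in CORPORATE_FORMS:
            through_form = " ".join(words[: i + 1])
            before_form = " ".join(words[:i]).rstrip(" ,")
            for candidate in (through_form, before_form):
                if candidate and candidate not in variants:
                    variants.append(candidate)
            break
    return variants


def search_first_hit(raw_name: str, search: Callable[[str], Optional[str]]) -> Optional[str]:
    """Query each variant in order and stop at the first non-None result."""
    for variant in company_name_variants(raw_name):
        hit = search(variant)
        if hit is not None:
            return hit
    return None
```

For the four inputs listed in the commit message, this sketch yields the same variant lists shown there. Matching the form as a whole word anywhere after the first word is what lets a security descriptor such as the Diageo one be trimmed, where a trailing-suffix match leaves it untouched.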