fix: retune AA index bounds and refresh fallback snapshot for the reworked AA scale #139

Open
devangpratap wants to merge 1 commit into Andyyyy64:main from devangpratap:fix/aa-index-retune-reworked-scale


@devangpratap devangpratap commented Jun 29, 2026

Contributor

Part of #101 (partial - this is the first slice you scoped, not the full issue, so please keep the issue open).

What this does

AA reworked their Intelligence Index and the open-weights raw scores compressed a lot, so the old 12.5/56.2 bounds and the 2026-05-14 snapshot no longer line up with the live data. On the old bounds, live raw for Qwen3-8B (around 7.4 now) normalizes to roughly 0, while the curated snapshot still reported a raw value that normalized to 40 for the same model.

  1. Re-derived the normalization bounds from the current distribution: _AA_INDEX_MIN = -19.4, _AA_INDEX_MAX = 47.6. I kept the same two-point calibration the code already used: the top open model (DeepSeek-V4-Pro, raw 44.3) maps to 95, and the 8B class (Qwen3-8B, raw 7.4) maps to 40. On the compressed scale that fit puts the floor below zero. Live AA raw values are always positive and the output is clamped to 0 to 100, so the negative floor only sets the offset of the line; it never yields a negative score.

  2. Refreshed the curated fallback from a fresh 2026-06-29 scrape. Models AA currently tracks carry their real new-scale raw values. The peer entries that AA does not track keep their previous normalized score: their raw value is set so it reproduces that score under the new bounds, so the bounds change does not move them.

  3. Added tests asserting the new normalized values for known models (top open model around 95, Qwen3-8B at 40, clamp behavior, and a few fallback values).
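
For readers who want to check the arithmetic behind item 1, here is a minimal sketch of a two-point linear calibration. The function names `derive_bounds` and `normalize` are hypothetical and are not the helpers in this repo; the sketch assumes the mapping is a straight line from `index_min` (score 0) to `index_max` (score 100) with clamping, which is what the description above implies.

```python
def derive_bounds(raw_hi, score_hi, raw_lo, score_lo):
    """Solve for (index_min, index_max) of a linear 0-100 map through two anchors.

    Hypothetical helper for illustration; not the repo's actual code.
    """
    slope = (score_hi - score_lo) / (raw_hi - raw_lo)  # score points per raw point
    index_min = raw_lo - score_lo / slope               # raw value that maps to 0
    index_max = index_min + 100.0 / slope               # raw value that maps to 100
    return index_min, index_max


def normalize(raw, index_min, index_max):
    """Map a raw index value onto 0-100 and clamp the result."""
    score = (raw - index_min) / (index_max - index_min) * 100.0
    return max(0.0, min(100.0, score))


# Anchors quoted in this PR: DeepSeek-V4-Pro 44.3 -> 95, Qwen3-8B 7.4 -> 40.
index_min, index_max = derive_bounds(44.3, 95.0, 7.4, 40.0)

# Same raw value under the old and the new bounds.
old_score = normalize(7.4, 12.5, 56.2)   # below the old floor, so it clamps
new_score = normalize(7.4, -19.4, 47.6)  # lands on the 8B anchor
```

The exact solution of the two-point fit comes out within about 0.1 of the -19.4/47.6 values used in the PR; with the rounded bounds the range is 67.0, so Qwen3-8B normalizes to 40 and DeepSeek-V4-Pro to just over 95.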

What this deliberately does not do

I did not touch the overlay merge policy in fetch_aa_index_scores. It is still max-merge: a live score only replaces a snapshot score when the live value is higher. The switch to live-wins is left for a separate PR, as you asked. The snapshot is still stored as raw AA values normalized on read, same as before.
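
To make the difference between the two policies concrete, here is a small sketch over plain dicts of normalized scores. `max_merge` and `live_wins_merge` are hypothetical names and do not reflect the real signature of `fetch_aa_index_scores`; how the real code handles a model that appears only in the live data is also an assumption here (the sketch simply adds it).

```python
def max_merge(snapshot, live):
    """Current policy as described: live replaces snapshot only when it is higher."""
    merged = dict(snapshot)
    for model, score in live.items():
        # Assumption: a model missing from the snapshot is taken from live as-is.
        if model not in merged or score > merged[model]:
            merged[model] = score
    return merged


def live_wins_merge(snapshot, live):
    """Proposed follow-up policy: any live score overrides the snapshot."""
    return {**snapshot, **live}


# Illustration using the mismatch described above: snapshot says 40, live says ~0.
snapshot = {"Qwen3-8B": 40.0}
live = {"Qwen3-8B": 0.0}

kept = max_merge(snapshot, live)            # snapshot value survives
replaced = live_wins_merge(snapshot, live)  # live value survives
```

Under max-merge a lower live score never shows through, so the snapshot and the bounds need to agree with the live scale for the merged result to be meaningful.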

One thing worth a look

On the reworked scale, several peer entries (models AA does not track) fall below the index floor and now read as negative raw values. They normalize correctly because the output is clamped, but storing negative numbers in a raw-index table is a bit odd. If you would rather, the snapshot could instead store already-normalized 0 to 100 values, which drops the negatives and means a future bounds retune cannot silently shift the fallback. I left that out of this PR since it is more than the issue asks for, but happy to do it in a follow-up if you want it.
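
As a sketch of where the negative values come from: if a peer entry's raw value is back-solved from the score it should keep, any preserved score below roughly 29 lands below zero under the new bounds. `raw_for_score` is a hypothetical helper, and the scores used below are made up for illustration, not values from the snapshot.

```python
AA_INDEX_MIN = -19.4
AA_INDEX_MAX = 47.6


def raw_for_score(score, index_min=AA_INDEX_MIN, index_max=AA_INDEX_MAX):
    """Invert the linear map: the raw value that normalizes to `score`."""
    return index_min + (score / 100.0) * (index_max - index_min)


def normalize(raw, index_min=AA_INDEX_MIN, index_max=AA_INDEX_MAX):
    """Forward map with clamping, used here only to check the round trip."""
    score = (raw - index_min) / (index_max - index_min) * 100.0
    return max(0.0, min(100.0, score))


low_peer_raw = raw_for_score(20.0)   # hypothetical peer score; raw comes out negative
mid_peer_raw = raw_for_score(40.0)   # same score as the Qwen3-8B anchor
round_trip = normalize(raw_for_score(33.0))
```

Storing normalized 0 to 100 values directly would skip this inversion entirely, which is the alternative floated above.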

Note on the README

The "What can I run?" example table is marked as a 2026-05 snapshot of illustrative scores. This retune shifts those numbers slightly, but I left the README untouched since refreshing it is outside this issue and needs a live run. Flagging it so you can update it whenever you next regenerate that example.

Testing

  • uv run pytest: 456 passed
  • ruff check . and ruff format --check .: clean

…d scale

AA reworked their Intelligence Index and the open-weights raw scores
compressed, so the old 12.5/56.2 bounds and the 2026-05-14 snapshot no longer
match the live data. Live raw for Qwen3-8B (around 7.4) normalized to roughly 0
under the old floor while the snapshot still mapped it to 40.

- Re-derive _AA_INDEX_MIN/_AA_INDEX_MAX (-19.4/47.6) from the current
  distribution, keeping the two-point calibration (DeepSeek-V4-Pro 44.3 -> 95,
  Qwen3-8B 7.4 -> 40). The floor goes below zero on the compressed scale; live
  values are always positive and clamp at 0.
- Refresh the curated fallback from a 2026-06-29 scrape: AA-tracked models get
  real new-scale raw values, untracked peers keep their prior normalized score.
- Add tests for the new normalized values of known models.

Scope is the first slice of Andyyyy64#101 only; the live-wins merge change is left for a
separate PR and the max-merge overlay is unchanged.

Refs Andyyyy64#101