fix: retune AA index bounds and refresh fallback snapshot for the reworked AA scale #139
Open
devangpratap wants to merge 1 commit into
…d scale

AA reworked their Intelligence Index and the open-weights raw scores compressed, so the old 12.5/56.2 bounds and the 2026-05-14 snapshot no longer match the live data. Live raw for Qwen3-8B (around 7.4) normalized to roughly 0 under the old floor while the snapshot still mapped it to 40.

- Re-derive _AA_INDEX_MIN/_AA_INDEX_MAX (-19.4/47.6) from the current distribution, keeping the two-point calibration (DeepSeek-V4-Pro 44.3 -> 95, Qwen3-8B 7.4 -> 40). The floor goes below zero on the compressed scale; live values are always positive and clamp at 0.
- Refresh the curated fallback from a 2026-06-29 scrape: AA-tracked models get real new-scale raw values, untracked peers keep their prior normalized score.
- Add tests for the new normalized values of known models.

Scope is the first slice of Andyyyy64#101 only; the live-wins merge change is left for a separate PR and the max-merge overlay is unchanged.

Refs Andyyyy64#101
Part of #101 (partial - this is the first slice you scoped, not the full issue, so please keep the issue open).
What this does
AA reworked their Intelligence Index and the open-weights raw scores compressed a lot, so the old 12.5/56.2 bounds and the 2026-05-14 snapshot no longer line up with the live data. On the old bounds, live raw for Qwen3-8B (around 7.4 now) normalizes to roughly 0, while the curated snapshot still reported a raw value that normalized to 40 for the same model.
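To make the mismatch concrete, here is a minimal sketch of the normalization, assuming it is a plain linear min-max scale clamped to 0 to 100. The function name `normalize_aa_index` and its signature are hypothetical and are not taken from the codebase; only the bounds and the raw value come from this description.

```python
def normalize_aa_index(raw: float, index_min: float, index_max: float) -> float:
    """Min-max scale a raw AA index value to 0-100, clamping out-of-range inputs.

    Assumption: the project's normalization is linear between the two bounds.
    """
    scaled = (raw - index_min) / (index_max - index_min) * 100.0
    return max(0.0, min(100.0, scaled))


# Old bounds from before this PR; 7.4 is the approximate live raw value for Qwen3-8B.
OLD_MIN, OLD_MAX = 12.5, 56.2

# 7.4 sits below the old floor of 12.5, so the result clamps to 0.0.
print(normalize_aa_index(7.4, OLD_MIN, OLD_MAX))
```

Under that assumption, any live raw value below 12.5 collapses to 0 on the old bounds, which is why the live score and the snapshot's 40 disagreed for the same model.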
Re-derived the normalization bounds from the current distribution: _AA_INDEX_MIN = -19.4, _AA_INDEX_MAX = 47.6. I kept the same two-point calibration the code already used: the top open model (DeepSeek-V4-Pro, raw 44.3) maps to 95, and the 8B class (Qwen3-8B, raw 7.4) maps to 40. On the compressed scale that fit puts the floor below zero. Live AA raw values are always positive and the output is clamped to 0 to 100, so the negative floor only sets where the curve sits.
Refreshed the curated fallback from a fresh 2026-06-29 scrape. Models AA currently tracks carry their real new-scale raw values. The peer entries that AA does not track keep their previous normalized score: their raw value is set so it reproduces that score under the new bounds, so the bounds change does not move them.
Added tests asserting the new normalized values for known models (top open model around 95, Qwen3-8B at 40, clamp behavior, and a few fallback values).
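The two-point calibration can be sketched as solving a straight line through the two anchor points and reading off where it crosses 0 and 100. This is an illustration under the assumption that the normalization is linear; `fit_bounds` is a hypothetical helper, not a function in the repository.

```python
def fit_bounds(
    raw_hi: float, score_hi: float, raw_lo: float, score_lo: float
) -> tuple[float, float]:
    """Return (index_min, index_max) so a linear 0-100 map passes through both points."""
    slope = (score_hi - score_lo) / (raw_hi - raw_lo)  # score points per raw point
    index_min = raw_lo - score_lo / slope  # raw value that maps to 0
    index_max = index_min + 100.0 / slope  # raw value that maps to 100
    return index_min, index_max


# Calibration points from this PR: DeepSeek-V4-Pro 44.3 -> 95, Qwen3-8B 7.4 -> 40.
index_min, index_max = fit_bounds(44.3, 95.0, 7.4, 40.0)
print(index_min, index_max)
```

The fitted values land within a few hundredths of the -19.4 and 47.6 used here. Plugging the rounded bounds back in gives a span of 67.0, so 44.3 normalizes to about 95.1 and 7.4 to 40.0.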
What this deliberately does not do
I did not touch the overlay merge policy in fetch_aa_index_scores. It is still max-merge: a live score only replaces a snapshot score when the live value is higher. The switch to live-wins is left for a separate PR, as you asked. The snapshot is still stored as raw AA values normalized on read, same as before.
One thing worth a look
On the reworked scale, several peer entries (models AA does not track) fall below the index floor and now read as negative raw values. They normalize correctly because the output is clamped, but storing negative numbers in a raw-index table is a bit odd. If you would rather, the snapshot could instead store already-normalized 0 to 100 values, which drops the negatives and means a future bounds retune cannot silently shift the fallback. I left that out of this PR since it is more than the issue asks for, but happy to do it in a follow-up if you want it.
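To show where the negative raw values come from, here is a small sketch of back-solving a raw value from a target normalized score, assuming the linear map between the new bounds. `raw_for_score` is a hypothetical helper and the sample scores are made up for illustration; they are not entries from the snapshot.

```python
AA_INDEX_MIN, AA_INDEX_MAX = -19.4, 47.6  # bounds from this PR


def raw_for_score(score: float) -> float:
    """Invert the linear map: the raw value that normalizes to `score` (0-100)."""
    return AA_INDEX_MIN + score / 100.0 * (AA_INDEX_MAX - AA_INDEX_MIN)


# Hypothetical peer scores, for illustration only.
for score in (40.0, 20.0, 10.0):
    print(score, raw_for_score(score))
```

With these bounds, zero raw corresponds to a normalized score of roughly 29, so any peer whose previous score was lower than that has to be stored as a negative raw value to keep its score unchanged.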
Note on the README
The "What can I run?" example table is marked as a 2026-05 snapshot of illustrative scores. This retune shifts those numbers slightly, but I left the README untouched since refreshing it is outside this issue and needs a live run. Flagging it so you can update it whenever you next regenerate that example.
Testing
uv run pytest: 456 passed
ruff check . and ruff format --check .: clean