Skip to content

Relax tolerance in Qwen Scope loader test to fix flaky CI#706

Merged
chanind merged 1 commit into
decoderesearch:mainfrom
robbiebusinessacc:contrib/saelens-qwen-scope-loader-test-tol
Jun 16, 2026
Merged

Relax tolerance in Qwen Scope loader test to fix flaky CI#706
chanind merged 1 commit into
decoderesearch:mainfrom
robbiebusinessacc:contrib/saelens-qwen-scope-loader-test-tol

Conversation

@robbiebusinessacc

Copy link
Copy Markdown
Contributor

Description

test_qwen_scope_sae_huggingface_loader_with_mocked_download intermittently
fails in CI (e.g. the two runs linked in #688). It compares two mathematically
equal matmuls: the loader stores W_enc/W_dec as .T.contiguous(), while the
reference in the test multiplies a transposed view (raw_W_enc.T). Over the
2048- and 32768-dim reductions, a contiguous tensor and a transposed view can
accumulate float32 in a different order depending on the BLAS backend, so the
two results agree only up to rounding. The default assert_close tolerance
(rtol=1e-5, atol=1e-8) is too tight to absorb that, which is what danra
diagnosed on the issue.

This compares with rtol=1e-3, atol=1e-2, matching the existing precedent for a
matmul comparison elsewhere in this file. The tolerance is generous enough to
absorb the accumulation noise but still catches a real loader regression, which
would be orders of magnitude off (verified locally: a 1% weight-scale error and
a wrong-bias error are both still rejected).

Note: the failure is BLAS/threading-dependent and reproduces on CI but not on
local arm64 macOS (where the two layouts produce bit-identical results), so this
is verified by the linked CI logs plus the regression check above rather than a
local repro.

Fixes #688

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and tests

  • I have run the formatter and linter (ruff format + ruff check) on the change.

test_qwen_scope_sae_huggingface_loader_with_mocked_download compares two
mathematically equal matmuls: the loader stores W_enc/W_dec as
`.T.contiguous()`, while the reference multiplies a transposed view. Over
the 2048- and 32768-dim reductions these layouts can accumulate float32 in
a different order depending on the BLAS backend, so the results differ by
small rounding noise that intermittently exceeds the default
rtol=1e-5/atol=1e-8 and fails CI. Compare with rtol=1e-3/atol=1e-2, which
absorbs the rounding while still catching any real loader regression.

Fixes decoderesearch#688

@chanind chanind left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for this!

@chanind chanind merged commit 10c26c8 into decoderesearch:main Jun 16, 2026
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Random test failure

2 participants