[BUG] Commit 94e55c7 causes precision failure in test_engram_gate_conv_bwd #5

@wanglei19991004

Describe the Bug

Starting with commit 94e55c7a3b4ffdaa11052e1d3d459505e6ed126c, the following test fails with a precision mismatch (the test ID is quoted so the shell does not treat the brackets as a glob pattern):
pytest "tests/ops/test_engram_bwd.py::test_engram_gate_conv_bwd[1-32-256-dtype0-False]"

To Reproduce

  1. Check out the problematic commit:
    git checkout 94e55c7a3b4ffdaa11052e1d3d459505e6ed126c
  2. Then run:
    pytest tests/ops/test_engram_bwd.py::test_engram_gate_conv_bwd

Expected Behavior

The test should pass, and the output should be numerically close to the reference output.

Actual Behavior

The test fails with an AssertionError:

FAILED tests/ops/test_engram_bwd.py::test_engram_gate_conv_bwd[1-32-256-dtype0-False]

AssertionError: Tensor-likes are not close!

output = tensor([[0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')

output_ref = tensor([[ 0.1562, -0.3986, -0.1758, ..., -0.0725, -0.5347, -0.2780],
                     [ 0.0719,  0.2237,  0.1933, ..., -0.3443,  0.1883, -0.0696],
                     [-0.3473, -0.0062, -0.2820, ..., -0.3163,  0.1547, -0.7590]],
                     device='cuda:0')

Mismatched elements: 388 / 1024 (37.9%)
Greatest absolute difference: 1.0608892440795898
Greatest relative difference: 1.0
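A greatest relative difference of exactly 1.0 is what an element-wise comparison reports when the output is 0 at a position where the reference is nonzero, since |0 - ref| / |ref| = 1. The sketch below is a minimal, dependency-free illustration of that arithmetic, not the test's actual comparison code. The helper name `summarize_mismatch` is hypothetical, and the default `rtol`/`atol` values are an assumption (they follow the float32 defaults of `torch.testing.assert_close`; the test in this repository may use different tolerances). The sample values are the first few entries of `output_ref` shown above.

```python
def summarize_mismatch(output, reference, rtol=1.3e-6, atol=1e-5):
    """Compare two flat sequences of floats.

    An element counts as mismatched when |a - b| > atol + rtol * |b|.
    The tolerances are assumed values, not taken from the test suite.
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, b in zip(output, reference):
        abs_diff = abs(a - b)
        if abs_diff > atol + rtol * abs(b):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        if b != 0.0:
            # Relative difference is measured against the reference value.
            max_rel = max(max_rel, abs_diff / abs(b))
    return {
        "all_zero_output": all(a == 0.0 for a in output),
        "mismatched": mismatched,
        "total": len(output),
        "max_abs_diff": max_abs,
        "max_rel_diff": max_rel,
    }


# First few reference values from the failure log, compared against zeros.
reference = [0.1562, -0.3986, -0.1758, -0.0725]
zeros = [0.0] * len(reference)
report = summarize_mismatch(zeros, reference)
print(report)
```

With real tensors, something like `output.flatten().tolist()` could feed the same helper. If `all_zero_output` is true, the failure is more likely a missing write than a rounding problem.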

It looks like the kernel output becomes all zeros after this commit, while the reference output remains valid.
The same test passes on the previous commit, so this issue appears to be introduced by commit 94e55c7.
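The claim that the previous commit passes can be checked mechanically with `git bisect run`, which expects exit code 0 for a good commit, 125 to skip, and any other code from 1 to 127 for a bad one. The following is a rough sketch of a driver that maps pytest's exit codes onto that convention. The file name `bisect_check.py` and the function names are hypothetical, and the sketch assumes the kernels do not need a separate rebuild step after each checkout.

```python
import subprocess
import sys

# The failing parametrization reported in this issue.
TEST_ID = "tests/ops/test_engram_bwd.py::test_engram_gate_conv_bwd[1-32-256-dtype0-False]"


def bisect_exit_code(pytest_returncode):
    """Translate a pytest exit code into the code `git bisect run` expects."""
    if pytest_returncode == 0:
        return 0    # all selected tests passed: mark the commit good
    if pytest_returncode == 1:
        return 1    # tests ran and at least one failed: mark the commit bad
    # Usage errors, collection problems, crashes, and interrupts say nothing
    # about the regression itself, so ask git bisect to skip the commit.
    return 125


def main():
    """Run the single failing test and return the bisect exit code."""
    result = subprocess.run([sys.executable, "-m", "pytest", "-x", "-q", TEST_ID])
    return bisect_exit_code(result.returncode)

# To use this as a bisect script, end the file with: sys.exit(main())
```

Assuming the script is saved as `bisect_check.py`, one could run `git bisect start 94e55c7a3b4ffdaa11052e1d3d459505e6ed126c 94e55c7a3b4ffdaa11052e1d3d459505e6ed126c~1` followed by `git bisect run python bisect_check.py`. Passing an older known-good commit instead of the parent would bisect over a wider range.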

Labels: bug
