fix: add TPU v5e compatibility for trace parsing #1
Open
midiareshadi wants to merge 1 commit into
- trace_parser.py: handle both list and dict JSON formats, since TPU v5e emits a raw list while TPU v4 wraps events in a traceEvents dict key
- flexible_validation.py: dynamically detect the TPU device pid from trace metadata instead of hardcoding pid=8, fixing compatibility with TPU v5e and future TPU generations

Tested on Google Colab with TPU v5e, JAX 0.7.2, Python 3.12.13
Problem
This repo did not produce usable results on TPU v5e (and likely any
TPU newer than v4). Two issues in the trace parsing pipeline caused all
Actual_Duration_us values to be 0.0 and MAPE to be inf%.
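To make the inf% symptom concrete, here is a minimal sketch of a MAPE calculation. The function name `mape_percent` and the sample durations are hypothetical and are not taken from the repo; the sketch only assumes the usual definition of MAPE and IEEE-style division semantics (as NumPy would give), under which a zero actual value with a nonzero prediction yields an infinite term.

```python
import math


def mape_percent(actual, predicted):
    """Mean absolute percentage error, in percent (illustrative helper).

    A zero actual value with a nonzero prediction contributes an
    infinite term, so the mean over all samples becomes inf.
    """
    terms = []
    for a, p in zip(actual, predicted):
        err = abs(a - p)
        if a == 0:
            # Mimic floating-point division by zero: x/0 -> inf, 0/0 -> nan.
            terms.append(math.inf if err > 0 else math.nan)
        else:
            terms.append(err / abs(a))
    return 100.0 * sum(terms) / len(terms)


# Hypothetical durations in microseconds.
healthy = mape_percent([10.0, 20.0], [12.0, 18.0])   # finite error
broken = mape_percent([0.0, 0.0], [12.0, 18.0])      # every actual is 0.0
print(healthy, broken)
```

When every measured duration is 0.0, each term divides by zero, which is why a parsing failure upstream surfaces as an infinite MAPE rather than as an explicit error.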
Root Causes and Fixes
1. trace_parser.py — JSON format difference
TPU v4 wraps trace events in a dict:
{"traceEvents": [...]}
TPU v5e emits a raw JSON list:
[...]
The parser called trace_data.get('traceEvents', []), which silently
returns [] on v5e, so no events were ever parsed.
Fix: check if trace_data is already a list before calling .get()
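A minimal sketch of that check, assuming `trace_data` is the already-decoded JSON object. The helper name `extract_trace_events` is hypothetical and may not match the actual function in trace_parser.py.

```python
import json


def extract_trace_events(trace_data):
    """Return the event list from either trace layout (illustrative helper)."""
    if isinstance(trace_data, list):
        # Raw list layout, as described for TPU v5e.
        return trace_data
    if isinstance(trace_data, dict):
        # Wrapped layout, as described for TPU v4.
        return trace_data.get("traceEvents", [])
    raise TypeError(f"unsupported trace layout: {type(trace_data).__name__}")


# Both layouts yield the same event list.
wrapped = json.loads('{"traceEvents": [{"pid": 8, "name": "op"}]}')
raw = json.loads('[{"pid": 3, "name": "op"}]')
print(extract_trace_events(wrapped))
print(extract_trace_events(raw))
```

Raising on an unrecognized layout, rather than returning an empty list, is a design choice in this sketch: it makes a future format change fail loudly instead of producing all-zero durations.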
2. flexible_validation.py — Hardcoded TPU device pid
The event filter hardcoded pid == 8 to identify TPU device events.
On TPU v5e the device pid is 3, causing all real hardware events to
be filtered out and Actual_Duration_us to always be 0.0.
Fix: dynamically detect the TPU device pid by scanning trace metadata
events for a process_name containing 'TPU', making it robust across
all TPU generations rather than hardcoding any specific pid value.
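The detection step could look roughly like the sketch below. It assumes the trace follows the Chrome trace event format, where metadata records have "ph": "M" and "name": "process_name", with the process label stored in args["name"]. The function names and the sample labels such as "/device:TPU:0" are hypothetical and may differ from what flexible_validation.py does or what a real trace contains.

```python
def find_tpu_device_pids(events):
    """Collect pids whose process_name metadata mentions 'TPU'."""
    pids = set()
    for ev in events:
        if not isinstance(ev, dict):
            continue
        # Only process_name metadata records carry the process label.
        if ev.get("ph") != "M" or ev.get("name") != "process_name":
            continue
        label = (ev.get("args") or {}).get("name", "")
        if "TPU" in str(label) and "pid" in ev:
            pids.add(ev["pid"])
    return pids


def filter_tpu_device_events(events):
    """Keep non-metadata events that belong to a detected TPU pid."""
    tpu_pids = find_tpu_device_pids(events)
    return [
        ev for ev in events
        if isinstance(ev, dict)
        and ev.get("ph") != "M"
        and ev.get("pid") in tpu_pids
    ]


# Hypothetical trace excerpt: pid 3 is the TPU device, pid 7 is the host.
sample = [
    {"ph": "M", "name": "process_name", "pid": 3, "args": {"name": "/device:TPU:0"}},
    {"ph": "M", "name": "process_name", "pid": 7, "args": {"name": "/host:CPU"}},
    {"ph": "X", "name": "fusion", "pid": 3, "dur": 12.5},
    {"ph": "X", "name": "host_op", "pid": 7, "dur": 4.0},
]
print(find_tpu_device_pids(sample))
print(filter_tpu_device_events(sample))
```

The sketch returns a set of pids rather than a single value because more than one process label in a trace may contain 'TPU'; whether all of them should count as device events is something the real filter would need to decide.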
Testing
Tested on Google Colab with TPU v5e, JAX 0.7.2, and Python 3.12.13.
Results after fixes:
The remaining MAPE is expected: the linear models were calibrated on
TPU v4 hardware, and TPU v5e runs matmuls ~1.9x faster on average.
Recalibrating the linear models for v5e would be a worthwhile
follow-up contribution.