Tool to inspect checkpoint structure by shuningjin · Pull Request #3139 · AI-Hypercomputer/maxtext

shuningjin · 2026-02-13T22:44:41Z

Description

Unified script to inspect checkpoint structure for HF/MaxText/Orbax, to help bringup and debugging

Fix: b/484416862

A unified tool to inspect checkpoint structures for:

HuggingFace/PyTorch

input: local path (safetensor or pth)
need load weight
print: param-name / shape

MaxText Model Architecture

input: model name, scan mode
lightweight, no memory or compute need (cpu or tpu)
how: jax.eval_shape
print: param-name / shape + param count

Orbax Checkpoints

input: gcs path or local path
lightweight, no memory or compute need (cpu or tpu)
how: read meta data
print: param-name / shape

Usage Examples:

[Mode 1: HF/PyTorch]   
   python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py hf --path <local_hf_path> --format <safetensors | pth>
[Mode 2: MaxText Arch] 
  python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py maxtext --model_name <maxtext_model_name> --scan_layers <True | False>
[Mode 3: Orbax]        
  python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py orbax --path <local_orbax_path | gcs_orbax_path>

Tests

1 HF checkpoint, locally downloaded

python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py hf \
--path ~/deepseek2-16b/hf-16b \
--format safetensors

https://paste.googleplex.com/5971112811954176

2 MaxText model, on-fly

python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py maxtext \
--model_name deepseek2-16b --scan_layers True

https://paste.googleplex.com/5636102443630592

python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py maxtext \
--model_name deepseek2-16b --scan_layers False

https://paste.googleplex.com/5609907941408768

3 Orbax checkpoint

python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py orbax \
--path gs://maxtext-deepseek/deepseek3-671b/2025-03-31/scanned/0/items

https://paste.googleplex.com/5320246790586368

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-02-13T22:52:56Z

Codecov Report

❌ Patch coverage is 0% with 109 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...axText/utils/ckpt_conversion/inspect_checkpoint.py	0.00%	109 Missing ⚠️

📢 Thoughts on this report? Let us know!

RissyRan

Thanks for making our process better!

RissyRan · 2026-02-18T02:27:51Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+Usage Examples:
+[Mode 1: HF/PyTorch]   
+   python src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py hf --path <local_hf_path> --format <safetensors | pth>
+[Mode 2: MaxText Arch] 


Nit: Shall we update MaxText Arch to Maxtext or MaxText Architecture to avoid potential confusion?

RissyRan · 2026-02-18T02:30:50Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+      os.path.join(MAXTEXT_PKG_DIR, "configs", "base.yml"),
+      f"model_name={args.model_name}",
+      f"scan_layers={args.scan_layers}",
+      "attention=dot_product",


Did you meet issues?

RissyRan · 2026-02-18T02:36:30Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+  parser_mt.add_argument(
+      "--scan_layers",
+      type=str,


Do you think it's better we leverage

maxtext/benchmarks/benchmark_utils.py

Line 33 in 2b06b9c

def str2bool(v: str) -> bool:

?

github-actions · 2026-02-18T02:37:29Z

🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request introduces a unified checkpoint inspector tool for HuggingFace, MaxText architecture, and Orbax. The tool is a great addition for debugging model bring-ups. However, there is a significant bug in the MaxText mode where layer indices are ignored during path flattening, leading to incomplete output. Additionally, there are opportunities to improve memory efficiency when inspecting large checkpoints.

🔍 General Feedback

Bug Fix Required: The MaxText architecture inspection currently collapses all layers into a single set of keys because it ignores SequenceKey indices. This needs to be addressed to provide a complete view of the model structure.
Memory Efficiency: For safetensors, using get_slice instead of get_tensor avoids unnecessary data loading. For PyTorch checkpoints, better handling of large files and common state-dict wrappers would make the tool more robust.
Consistency: Standardizing separators across different modes (e.g., using . everywhere) would improve the user experience.

github-actions · 2026-02-18T02:39:25Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+      from safetensors import safe_open
+    except ImportError:
+      sys.exit("Error: 'safetensors' is required. `pip install safetensors`")
+


🟡 f.get_tensor(k) loads the entire tensor into memory. Since you only need the shape, using f.get_slice(k) is much more memory-efficient, especially for large models.

Suggested change

chkpt_vars_raw[k] = f.get_slice(k).shape

Sounds like a good idea if it works.

github-actions · 2026-02-18T02:39:26Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+  elif args.format == "pth":
+    for i, ckpt_path in enumerate(ckpt_paths):
+      print(f"Loading {ckpt_path.name} ({i+1}/{len(ckpt_paths)})...")
+      checkpoint = torch.load(ckpt_path, map_location="cpu")


🟡 Loading a full .pth checkpoint into CPU memory can be very memory-intensive for large models. If the torch version allows, consider using mmap=True. Also, .pth files often wrap the state_dict in a dictionary (e.g., under a 'model' or 'state_dict' key).

Suggested change

checkpoint = torch.load(ckpt_path, map_location="cpu")

checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# Some checkpoints wrap the state_dict

if isinstance(checkpoint, dict) and "model" in checkpoint:

checkpoint = checkpoint["model"]

elif isinstance(checkpoint, dict) and "state_dict" in checkpoint:

checkpoint = checkpoint["state_dict"]

github-actions · 2026-02-18T02:39:26Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+
+  # Initialize without heavyweight runtime
+  config = pyconfig.initialize(argv)
+  devices_array = maxtext_utils.create_device_mesh(config)


🟠 This line only extracts the key attribute from the path components, which means SequenceKey (used for list indices, like layer numbers) is ignored. This will cause all layers in a model to have the same flattened key, leading to them overwriting each other in the flat_shapes dictionary and resulting in incomplete output.

Suggested change

devices_array = maxtext_utils.create_device_mesh(config)

key_parts = [str(getattr(k, "key", getattr(k, "idx", k))) for k in path_tuple]

github-actions · 2026-02-18T02:39:27Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+  except ImportError:
+    sys.exit("Error: 'orbax-checkpoint' or 'etils' not found. `pip install orbax-checkpoint etils[epath]`")
+
+  path = epath.Path(args.path)


🟡 The path_tuple in Orbax might contain integers for indices. Attempting to join them will raise a TypeError. Using map(str, k) ensures compatibility.

Suggested change

path = epath.Path(args.path)

key_str = ".".join(map(str, k))

github-actions · 2026-02-18T02:39:28Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+# ==============================================================================
+# Main CLI Driver
+# ==============================================================================
+def main():


🟢 Consider adding bin to the choices, as HuggingFace often uses pytorch_model.bin for PyTorch weights.

Suggested change

def main():

"--format", type=str, required=False, choices=["safetensors", "pth", "bin"], default="safetensors", help="File format"

github-actions · 2026-02-18T02:39:29Z

src/MaxText/utils/ckpt_conversion/inspect_checkpoint.py

+  config = pyconfig.initialize(argv)
+  devices_array = maxtext_utils.create_device_mesh(config)
+  mesh = jax.sharding.Mesh(devices_array, config.mesh_axes)
+  quant = quantizations.configure_quantization(config)


🟢 For consistency with the Orbax inspection mode and typical MaxText parameter paths, consider using . as a separator instead of -.

Suggested change

quant = quantizations.configure_quantization(config)

mt_param_key = "params." + ".".join(key_parts)

Tool to inspect checkpoint structure

e37a93d

clean

5243e93

shuningjin marked this pull request as ready for review February 13, 2026 23:37

shuningjin requested review from NicoGrande, RissyRan, bvandermoon, gagika, gobbleturk, hengtaoguo, jiangjy1982, parambole, richjames0, shralex and suexu1025 as code owners February 13, 2026 23:37

shuningjin assigned parambole, RissyRan and Rohan-Bierneni Feb 13, 2026

RissyRan mentioned this pull request Feb 17, 2026

Add external model bringup guide #3151

Open

4 tasks

RissyRan reviewed Feb 18, 2026

View reviewed changes

RissyRan added the gemini-review label Feb 18, 2026

github-actions bot reviewed Feb 18, 2026

View reviewed changes

-      checkpoint = torch.load(ckpt_path, map_location="cpu")
+      checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=True)
+      # Some checkpoints wrap the state_dict
+      if isinstance(checkpoint, dict) and "model" in checkpoint:
+        checkpoint = checkpoint["model"]
+      elif isinstance(checkpoint, dict) and "state_dict" in checkpoint:
+        checkpoint = checkpoint["state_dict"]

	devices_array = maxtext_utils.create_device_mesh(config)
	key_parts = [str(getattr(k, "key", getattr(k, "idx", k))) for k in path_tuple]

	def main():
	"--format", type=str, required=False, choices=["safetensors", "pth", "bin"], default="safetensors", help="File format"

	quant = quantizations.configure_quantization(config)
	mt_param_key = "params." + ".".join(key_parts)

Conversation

shuningjin commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests

Checklist

Uh oh!

codecov bot commented Feb 13, 2026

Codecov Report

Uh oh!

RissyRan left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RissyRan Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

RissyRan Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

RissyRan Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

🔍 General Feedback

Uh oh!

github-actions bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

RissyRan Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

shuningjin commented Feb 13, 2026 •

edited

Loading

RissyRan left a comment •

edited

Loading