Add OSDF cache-diagnostic appendix to JRA-3Q heat-index notebook by hrhampapura · Pull Request #57 · NCAR/osdf-examples

hrhampapura · 2026-06-30T15:33:48Z

What & why

Follow-up to the JRA-3Q d640000 read failures debugged in #56. The PelicanFS team confirmed that the most likely cause is a cache serving corrupt bytes: reads from Casper fail deterministically with "zlib: incorrect header check", while the same chunks decode cleanly from the origin and over HTTPS. They asked us for two things:

- Print which cache Casper is routed to. They operate many caches and have no easy way to find a bad object in one.
- Use preferred_caches to pin a known-good cache as the interim lever, until client-side checksumming / automatic failover lands in PelicanFS.

This PR adds a diagnostic appendix to notebooks/jja_heatindex.ipynb. The working direct_reads=True open is left untouched, so the notebook still runs end-to-end.
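
For context on the failure mode named above, here is a minimal stdlib-only sketch of why damaged bytes from a cache would produce a repeatable zlib header error while clean bytes from the origin decode fine. The payload, the two-byte corruption, and the helper names are illustrative assumptions; this does not reproduce the actual JRA-3Q chunks or the bytes the cache served.

```python
import hashlib
import zlib

# Stand-in for the contents of one compressed chunk (hypothetical data).
payload = b"JRA-3Q example chunk " * 64
good = zlib.compress(payload)

# Simulate a cache serving damaged bytes: overwrite the 2-byte zlib header.
corrupt = b"\xff\xff" + good[2:]


def try_decode(raw: bytes) -> str:
    """Return "ok" if raw decompresses, otherwise the zlib error text."""
    try:
        zlib.decompress(raw)
        return "ok"
    except zlib.error as exc:
        return str(exc)


def sha256_hex(raw: bytes) -> str:
    """Checksum used to compare bytes fetched from two different sources."""
    return hashlib.sha256(raw).hexdigest()


print("origin copy:", try_decode(good))
print("cached copy:", try_decode(corrupt))
# The same corrupt bytes fail the same way on every read, i.e. deterministically.
print("repeatable :", try_decode(corrupt) == try_decode(corrupt))
# Comparing checksums tells the two copies apart without decoding them.
print("same bytes :", sha256_hex(good) == sha256_hex(corrupt))
```

Comparing a checksum of the bytes fetched through the cache against the bytes fetched from the origin is the kind of check that client-side checksumming would automate.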

What's added (appendix, run on Casper)

- Cell A: prints the director-chosen cache for d640000, the full candidate list, and the origin. It also enables the fsspec.pelican logger so that cache selection and "Marking cache at <url> as bad" events print inline.
- Cell B: re-opens JRA-3Q through the cache (no direct_reads), reproduces the failing read, and dumps get_access_data() to name the cache that served or failed each object. It handles both the zlib-corrupt case, which is recorded as a success, and the ContentLengthError case, which is recorded as a failure.
- Cell C: guarded preferred_caches template that keeps "+" for director fallback. It is inert until a healthy cache host is filled in and USE_PREFERRED_CACHES=True.
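
The parts of Cells A and C that need no network access can be sketched roughly as follows. This is not the notebook's code: the logger name fsspec.pelican, the preferred_caches keyword, the "+" fallback marker, and the USE_PREFERRED_CACHES flag come from this description, while build_pelican_kwargs, PREFERRED_CACHE_URL, and the example URL in the usage note are hypothetical names chosen for illustration.

```python
import logging

# Cell A (logging part): surface pelicanfs log records inline in the notebook.
pelican_logger = logging.getLogger("fsspec.pelican")
pelican_logger.setLevel(logging.DEBUG)
if not pelican_logger.handlers:  # avoid stacking handlers when the cell is re-run
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    pelican_logger.addHandler(handler)

# Cell C: guarded template. Both values must be set before anything changes.
USE_PREFERRED_CACHES = False
PREFERRED_CACHE_URL = ""  # hypothetical placeholder; fill in a known-good cache


def build_pelican_kwargs(use_preferred: bool, cache_url: str) -> dict:
    """Return extra filesystem kwargs; empty unless the guard is on and a URL is set."""
    if use_preferred and cache_url:
        # "+" keeps the director's own cache list as a fallback after the pinned cache.
        return {"preferred_caches": [cache_url, "+"]}
    return {}


pelican_kwargs = build_pelican_kwargs(USE_PREFERRED_CACHES, PREFERRED_CACHE_URL)
print(pelican_kwargs)

# Assumed usage (not executed here): pass **pelican_kwargs to the pelicanfs
# filesystem constructor alongside the arguments the notebook already uses.
```

With the guard on, `build_pelican_kwargs(True, "https://cache.example.org:8443")` returns the pinned cache followed by "+"; with either value unset the dict stays empty, so the cell is a no-op by default.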

Notes

Diagnostic cells need to be run on Casper to capture the bad cache; their outputs are not committed.
Verified against pelicanfs 1.3.1 source (get_working_cache/get_origin_url return (url, director_response); get_access_data() keeps the last 3 responses per object path).
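
To make the "last 3 responses per object path" behavior concrete, here is a small stand-in model. It is an assumption-laden sketch, not pelicanfs's actual classes: AccessRecord, AccessLog, the field names, the object path, and the cache URLs are all hypothetical, and only the bounded-history idea is taken from the note above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Deque, List, Optional, Tuple

MAX_RESPONSES = 3  # history depth per object path, as described in the note above


@dataclass
class AccessRecord:
    """Hypothetical record of one attempt to fetch an object via a cache."""
    cache_url: str
    success: bool
    error: Optional[str] = None


class AccessLog:
    """Hypothetical bounded per-path history (not the pelicanfs implementation)."""

    def __init__(self, max_per_path: int = MAX_RESPONSES) -> None:
        self._max = max_per_path
        self._by_path: Dict[str, Deque[AccessRecord]] = {}

    def add(self, path: str, record: AccessRecord) -> None:
        # deque(maxlen=...) silently drops the oldest record once full.
        self._by_path.setdefault(path, deque(maxlen=self._max)).append(record)

    def caches_for(self, path: str) -> List[Tuple[str, bool]]:
        """List (cache_url, success) for every retained attempt on path."""
        return [(r.cache_url, r.success) for r in self._by_path.get(path, ())]


log = AccessLog()
path = "/example/object.nc"  # hypothetical object path
log.add(path, AccessRecord("https://cache-a.example.org", False, "ContentLengthError"))
log.add(path, AccessRecord("https://cache-b.example.org", True))
log.add(path, AccessRecord("https://cache-b.example.org", True))
log.add(path, AccessRecord("https://cache-c.example.org", True))

# Only the three most recent attempts remain; the cache-a failure was evicted.
for cache_url, success in log.caches_for(path):
    print(cache_url, "success" if success else "failure")
```

Two consequences follow for the diagnostic cells. A transfer that completes with corrupt bytes is still recorded as a success, so successful entries need to be listed too, and the history should be dumped right after the failing read, before later reads evict it.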

🤖 Generated with Claude Code

The PelicanFS team confirmed (a) a cache serving corrupt bytes is the most likely cause of the deterministic zlib failures, and (b) they have no easy way to find a bad object across their caches and asked us to print which cache Casper is routed to. They also endorsed pinning known-good caches via preferred_caches as the interim lever until client-side checksumming / automatic failover lands in PelicanFS.

This adds an appendix (run on Casper) that leaves the working direct_reads open untouched:

- Cell A: print the director-chosen cache + full candidate list + origin for d640000, and enable the fsspec.pelican logger.
- Cell B: re-open JRA-3Q through the cache, reproduce the failing read, and dump get_access_data() to name the serving/failing cache.
- Cell C: guarded preferred_caches template (inert until a good cache is set).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hrhampapura and others added 2 commits June 30, 2026 09:30

Added correct diagnostic cells

b1785aa

hrhampapura wants to merge 2 commits into main from jra3q-cache-diagnostics
