
Add OSDF cache-diagnostic appendix to JRA-3Q heat-index notebook #57

Open
hrhampapura wants to merge 2 commits into main from jra3q-cache-diagnostics


@hrhampapura

Collaborator

What & why

Follow-up to the JRA-3Q d640000 read failures debugged in #56. The PelicanFS team confirmed that the most likely cause is a cache serving corrupt bytes: reads from Casper fail deterministically with "zlib: incorrect header check", while the same chunks decode cleanly from the origin and over HTTPS. They asked us for two things:

  1. Print which cache Casper is routed to — they operate many caches and have no easy way to find a bad object in one.
  2. Use preferred_caches to pin a known-good cache as the interim lever, until client-side checksumming / automatic failover lands in PelicanFS.
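The diagnosis rests on a simple comparison: a chunk fails to decode when read through the cache but decodes cleanly when fetched from the origin. A minimal sketch of that check follows, assuming the chunks are plain zlib streams (the error text suggests this, but it is not confirmed here). The helper `classify_chunk` and the sample bytes are hypothetical and are not part of the notebook.

```python
import zlib


def classify_chunk(cache_bytes: bytes, origin_bytes: bytes) -> str:
    """Compare one compressed chunk as served by a cache and by the origin."""
    # The origin copy is the reference; if it does not decode, the cache is not to blame.
    try:
        zlib.decompress(origin_bytes)
    except zlib.error as exc:
        return f"origin copy does not decode: {exc}"

    if cache_bytes == origin_bytes:
        return "identical"

    try:
        zlib.decompress(cache_bytes)
    except zlib.error as exc:
        return f"cache copy corrupt: {exc}"
    return "bytes differ but cache copy still decodes"


# Hypothetical payload standing in for one compressed chunk.
good = zlib.compress(b"JRA-3Q sample payload" * 8)
# Overwrite the two-byte zlib header to simulate a cache serving corrupt bytes.
bad = b"\xff\xff" + good[2:]

result_same = classify_chunk(good, good)
result_bad = classify_chunk(bad, good)
print(result_same)
print(result_bad)
```

Because the corruption is in the stored bytes rather than in the transfer, the same input fails the same way on every read, which matches the deterministic failures described above.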

This PR adds a diagnostic appendix to notebooks/jja_heatindex.ipynb. The working direct_reads=True open is left untouched, so the notebook still runs end-to-end.

What's added (appendix, run on Casper)

  • Cell A: prints the director-chosen cache for d640000, the full candidate list, and the origin; enables the fsspec.pelican logger so that cache selection and "Marking cache at <url> as bad" events print inline.
  • Cell B: re-opens JRA-3Q through the cache (no direct_reads), reproduces the failing read, and dumps get_access_data() to name the cache that served or failed each object (handles both the zlib-corrupt case, recorded as success, and the ContentLengthError case, recorded as failure).
  • Cell C: guarded preferred_caches template (keeps "+" for director fallback); inert until a healthy cache host is filled in and USE_PREFERRED_CACHES=True.
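The logger setup in Cell A and the guard in Cell C can be sketched roughly as below. This is an illustration under assumptions, not the notebook's actual cells: the helper `pelican_kwargs` and the variable `HEALTHY_CACHE` are hypothetical names, and the example cache URL in the usage note is a placeholder. The logger name, the `preferred_caches` argument, and the "+" fallback entry are taken from the description above.

```python
import logging

# Cell A (sketch): surface cache-selection messages inline.
# basicConfig attaches a handler to the root logger if none is configured yet;
# the named logger is then lowered to DEBUG so its records reach that handler.
logging.basicConfig(level=logging.WARNING)
pelican_log = logging.getLogger("fsspec.pelican")
pelican_log.setLevel(logging.DEBUG)

# Cell C (sketch): guarded preferred_caches template.
USE_PREFERRED_CACHES = False  # flip to True only once a healthy cache is known
HEALTHY_CACHE = ""            # placeholder: fill in a known-good cache URL


def pelican_kwargs(use_preferred: bool, healthy_cache: str) -> dict:
    """Build extra keyword arguments for the filesystem constructor."""
    if not (use_preferred and healthy_cache):
        # Inert: no pinning, so the director's choice is used unchanged.
        return {}
    # "+" keeps the director-provided caches as a fallback after the pinned one.
    return {"preferred_caches": [healthy_cache, "+"]}


kwargs = pelican_kwargs(USE_PREFERRED_CACHES, HEALTHY_CACHE)
print(kwargs)
```

With the defaults shown, `kwargs` is empty and nothing changes. Once a healthy host is known, calling `pelican_kwargs(True, "https://cache.example.org")` would yield a dictionary intended to be unpacked into the filesystem constructor alongside the arguments the notebook already passes.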

Notes

  • Diagnostic cells need to be run on Casper to capture the bad cache; their outputs are not committed.
  • Verified against pelicanfs 1.3.1 source (get_working_cache/get_origin_url return (url, director_response); get_access_data() keeps the last 3 responses per object path).
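The three-response retention has a practical consequence for Cell B: the access data should be dumped right after the failing read, before retries push the interesting entry out. The toy model below illustrates that behavior with a bounded queue per path. It is a stand-in written for this explanation, not pelicanfs code; the `AccessLog` class, the object path, and the cache URLs are all hypothetical.

```python
from collections import deque

MAX_RESPONSES = 3  # retention stated in the note above for pelicanfs 1.3.1


class AccessLog:
    """Toy per-path access history that keeps only the most recent entries."""

    def __init__(self, max_responses: int = MAX_RESPONSES):
        self._max = max_responses
        self._by_path = {}

    def record(self, path: str, cache_url: str, success: bool) -> None:
        # A deque with maxlen silently drops the oldest entry once it is full.
        history = self._by_path.setdefault(path, deque(maxlen=self._max))
        history.append((cache_url, success))

    def responses(self, path: str) -> list:
        return list(self._by_path.get(path, ()))


log = AccessLog()
path = "/example/object/path"  # hypothetical object path
log.record(path, "https://cache-a.example.org", True)
log.record(path, "https://cache-b.example.org", False)
log.record(path, "https://cache-b.example.org", False)
log.record(path, "https://cache-c.example.org", True)

history = log.responses(path)
for cache_url, success in history:
    print(cache_url, "ok" if success else "failed")
```

After the fourth read, the first entry is gone. Note also that a transfer which completes but delivers corrupt bytes is recorded as a success, so in the zlib case it is the cache URL, not the success flag, that identifies the suspect cache.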

🤖 Generated with Claude Code

hrhampapura and others added 2 commits June 30, 2026 09:30
The PelicanFS team confirmed (a) a cache serving corrupt bytes is the most
likely cause of the deterministic zlib failures, and (b) they have no easy
way to find a bad object across their caches and asked us to print which
cache Casper is routed to. They also endorsed pinning known-good caches via
preferred_caches as the interim lever until client-side checksumming /
automatic failover lands in PelicanFS.

This adds an appendix (run on Casper) that leaves the working direct_reads
open untouched:
- Cell A: print the director-chosen cache + full candidate list + origin for
  d640000, and enable the fsspec.pelican logger.
- Cell B: re-open JRA-3Q through the cache, reproduce the failing read, and
  dump get_access_data() to name the serving/failing cache.
- Cell C: guarded preferred_caches template (inert until a good cache is set).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>