NVIDIA · xupinjie · May 15, 2026 · May 18, 2026 · May 18, 2026 · RmSchaffert
diff --git a/.claude/skills/docs-compliance/SKILL.md b/.claude/skills/docs-compliance/SKILL.md
@@ -0,0 +1,217 @@
+---
+name: docs-compliance
+description: ACCV-Lab documentation conventions and pre-PR compliance check. INVOKE when creating or editing any .md or .rst file under docs/ or packages/*/docs/, when modifying a Python module/class/function/method docstring, or before opening a PR that touches documentation. Provides hard rules for Sphinx role usage, docstring formatting, public-API export requirements, admonition syntax, and a pre-PR checklist with verification commands.
+---
+
+# ACCV-Lab Documentation Compliance
+
+## When this skill applies
+
+- Creating or editing any `.md` / `.rst` file under `docs/` or `packages/*/docs/`
+- Modifying a Python module, class, function, or method docstring (anything autodoc renders)
+- Preparing a PR that touches documentation, samples, or public-API docstrings
+
+## Authoritative project references
+
+Read these for ground truth before deviating from any rule below:
+
+- `docs/conf.py` — Sphinx configuration (extensions, autodoc options, custom handlers)
+- `docs/guides/DOCUMENTATION_SETUP_GUIDE.md` — build pipeline & directory structure
+- `docs/guides/FORMATTING_GUIDE.md` — Python/C++ formatting (also affects docstring rendering)
+- `docs/spelling_wordlist.txt` — accepted technical-term whitelist
+- `docs/_ext/` — local Sphinx extensions (`note_literalinclude`, `module_docstring`, `markdown_note_admonitions`)
+
+## Hard rules
+
+### Rule 1 — API references: use Sphinx roles, never bare backticks
+
+For any `accvlab.*` symbol mentioned in narrative text, use the appropriate role so it cross-links in the rendered HTML.
+
+| Symbol kind | MyST role (`.md`) | RST role (`.rst`) |
+|---|---|---|
+| Class | `` {py:class}`~accvlab.<pkg>.<Class>` `` | `` :class:`~accvlab.<pkg>.<Class>` `` |
+| Method | `` {py:meth}`~accvlab.<pkg>.<Class>.<method>` `` | `` :meth:`~accvlab.<pkg>.<Class>.<method>` `` |
+| Module function | `` {py:func}`~accvlab.<pkg>.<func>` `` | `` :func:`~accvlab.<pkg>.<func>` `` |
+| Attribute | `` {py:attr}`~accvlab.<pkg>.<Class>.<attr>` `` | `` :attr:`~accvlab.<pkg>.<Class>.<attr>` `` |
+
+```
+Bad:
+  See `lookup()` and `put()` for details.
+
+Good:
+  See {py:meth}`~accvlab.on_demand_video_decoder.SharedGopStore.lookup`
+  and  {py:meth}`~accvlab.on_demand_video_decoder.SharedGopStore.put`
+  for details.
+```
+
+**Exclusions** — keep bare backticks for:
+- Stdlib types (`RuntimeWarning`, `NamedTuple`, `multiprocessing.Lock`) — this project does not cross-ref stdlib in user docs
+- Parameter names, field names, prose terms (`access_tick`, `flock`, `spawn`)
+- API names appearing inside fenced code blocks (` ```python ` … ` ``` `) — only narrative prose gets roles
+
+### Rule 2 — `Returns:` block formatting gotcha
+
+In Google/NumPy-style docstrings, the **first line** after `Returns:` is silently parsed as a return-type annotation if it ends with `:`, even when a real type annotation is on the signature. This produces malformed return docs that look fine in the source but break in the rendered API table.
+
+```
+Bad:
+    Returns:
+        Tuple of three things:
+            - first
+            - second
+            - third
+
+Good:
+    Returns:
+        Tuple containing
+
+        - first
+        - second
+        - third
+```
+
+Lead with prose that does **not** end in `:`, then a blank line, then the bullets.
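+
+A scan for this pattern can be sketched in Python. The helper name `find_returns_colon_lines` is hypothetical and not part of the repository. It reports the first non-blank line after a `Returns:` header when that line ends with `:`.
+
+```python
+import re
+from typing import List, Tuple
+
+# A header line that contains only "Returns:" (plus indentation).
+_RETURNS_HEADER = re.compile(r"^\s*Returns:\s*$")
+
+
+def find_returns_colon_lines(source: str) -> List[Tuple[int, str]]:
+    """Return (line_number, text) for each suspicious line after a header.
+
+    A line is reported when it is the first non-blank line following a
+    header line and it ends with a colon. Line numbers are 1-based.
+    """
+    lines = source.splitlines()
+    hits: List[Tuple[int, str]] = []
+    for i, line in enumerate(lines):
+        if not _RETURNS_HEADER.match(line):
+            continue
+        # Look ahead to the first non-blank line after the header.
+        for j in range(i + 1, len(lines)):
+            candidate = lines[j].strip()
+            if candidate:
+                if candidate.endswith(":"):
+                    hits.append((j + 1, candidate))
+                break
+    return hits
+```
+
+Run over the text of a module, an empty result means no `Returns:` block in that file opens with a colon-terminated line.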
+
+### Rule 3 — Public API must be exported
+
+A new public class or function will not appear in the auto-generated `api.rst` unless **both** of these hold:
+
+1. It is imported in `packages/<pkg>/accvlab/<pkg>/__init__.py`
+2. It is listed in that file's `__all__`
+
+Internal helpers belong under `_internal/` and are not exported.
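+
+Both conditions can be checked programmatically. The sketch below is illustrative only: `missing_exports` is a hypothetical helper, not something the repository provides, and it assumes the package is importable in the current environment.
+
+```python
+from types import ModuleType
+from typing import Iterable, List
+
+
+def missing_exports(package: ModuleType, names: Iterable[str]) -> List[str]:
+    """Return the names that fail either export condition.
+
+    A name passes only if it is an attribute of ``package`` (it was imported
+    in ``__init__.py``) and it is listed in ``package.__all__``.
+    """
+    exported = set(getattr(package, "__all__", ()))
+    return [
+        name
+        for name in names
+        if not hasattr(package, name) or name not in exported
+    ]
+```
+
+Called with the imported package module and the names added in a PR, an empty list means both conditions hold for every name.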
+
+### Rule 4 — Type annotations on public APIs
+
+Every public function parameter and return value must have a type annotation. `sphinx_autodoc_typehints` renders them into the docs; missing annotations produce gaps in the rendered API table.
+
+```python
+# Good:
+def get_batch(self, refs: List[GopRef]) -> List[np.ndarray]:
+    ...
+```
+
+### Rule 5 — Annotation must match docstring
+
+When changing a function's signature (parameter types, return type, parameter names), update the corresponding `Args:` / `Returns:` lines in the docstring **in the same edit**. Stale docstrings vs. live signatures are caught in review.
+
+### Rule 6 — No implementation details in user-facing docs
+
+User-facing docs (`docs/`, `packages/*/docs/`, public-class docstrings) describe **what the user does**, not **how the framework is implemented**.
+
+```
+Bad (jargon / impl detail leaked to user):
+  - put() acquires an flock for atomicity (double-check after acquiring the lock)
+  - Returns the original decoder
+  - Uses C++ GetGOP under the hood
+
+Good:
+  - put() acquires an flock for atomicity
+  - Returns the underlying PyNvGopDecoder
+  - Returns cached data without re-demuxing
+```
+
+If a phrase would prompt the question *"is there something the user should do?"*, rewrite it. Implementation notes belong in source-level comments or developer-facing docstrings under `_internal/`, not user docs.
+
+### Rule 7 — Doc build must be warning-free
+
+`./scripts/build_docs.sh` warnings and errors are **blocking**. Before requesting review:
+
+```bash
+./scripts/build_docs.sh 2>&1 | tee /tmp/docs_build.log
+grep -iE 'warning|error' /tmp/docs_build.log
+```
+
+Resolve every new warning. Common sources: bad role syntax, missing `__all__` exports, malformed `Returns:` blocks, broken cross-refs, unknown spelling.
+
+### Rule 8 — Admonitions: blockquote form for dual-readable files
+
+Files that must render correctly in **both** GitHub/IDE preview **and** Sphinx HTML use the blockquote admonition pattern:
+
+```md
+> **ℹ️ Note**: Short tip for the reader.
+
+> **⚠️ Important**: Crucial warning users must not miss.
+```
+
+The local `markdown_note_admonitions` extension converts these to Sphinx admonitions at build time. Multi-line notes are supported as long as every line starts with `>`.
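+
+A multi-line note in this form, with every continuation line prefixed by `>`, would look like this (the wording is a placeholder):
+
+```md
+> **ℹ️ Note**: First line of the note.
+> The note continues here, still inside the blockquote,
+> and ends on this line.
+```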
+
+Use fenced admonitions ```` ```{note} ```` / ```` ```{important} ```` **only** in files that are exclusively part of the built docs and never opened in GitHub/IDE.
+
+### Rule 9 — Edit source, not mirror
+
+Source-of-truth lives at `packages/<pkg>/docs/`. Files under `docs/contained_package_docs_mirror/<pkg>/docs/` are symlinks regenerated by `mirror_referenced_dirs.py` at build time. Editing the mirror is at best a no-op and at worst destructive (overwritten on next build).
+
+### Rule 10 — Relative paths in include/image/literalinclude
+
+Paths inside `.md` / `.rst` directives (`include`, `image`, `literalinclude`, etc.) must be **relative to the current document**. This keeps docs portable: links resolve correctly both in the original package directory and after mirroring into `docs/contained_package_docs_mirror/`.
+
+```
+Bad:
+  .. literalinclude:: /home/user/project/packages/foo/examples/demo.py
+
+Good:
+  .. literalinclude:: ../examples/demo.py
+```
+
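+Inside Python docstrings, relative paths resolve against the document that pulls in the docstring (the location of the autodoc directive), not against the Python source file. Because that location can differ from the package directory, absolute paths are acceptable there. In Sphinx, a path with a leading `/` is resolved against the top source directory rather than the filesystem root.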
+
+### Rule 11 — Sample docs: explain real-use-case provenance and cross-link
+
+When a sample uses hard-coded values that would normally come from runtime sources (parser output, demuxer results, model outputs, etc.), explicitly document where those values come from in production. Cross-link to a related sample that demonstrates the real flow.
+
+```python
+# Good:
+# Each task tuple: (video_path, target_frame_id, gop_first_frame, gop_len).
+#
+# In a real pipeline, gop_first_frame and gop_len would come from a
+# demuxer (e.g. GetGOPList returning first_frame_ids / gop_lens).
+# See samples/SampleSeparationAccessGOPListAPI.py for an end-to-end
+# example. Hard-coded values here keep the demo dependency-free.
+tasks = [...]
+```
+
+## Pre-PR compliance checklist
+
+Run through these before requesting review on any PR that touches docs or docstrings:
+
+- [ ] All `accvlab.*` API references in narrative use `{py:class}` / `{py:meth}` / `{py:func}` / `{py:attr}` roles
+- [ ] No bare backticks for `accvlab.*` names except in code blocks
+- [ ] Admonitions in dual-readable files use the blockquote pattern
+- [ ] Edits land in `packages/<pkg>/docs/`, not in the mirror
+- [ ] Paths in directives are relative
+- [ ] New technical terms added to `docs/spelling_wordlist.txt`
+- [ ] `./scripts/build_docs.sh` runs with no new warnings or errors
+- [ ] `./scripts/build_docs.sh --spelling` reviewed; report at `docs/_build/spelling/output.txt`
+- [ ] Sample docs reference real-use-case origin and cross-link to related samples
+
+## Verification commands
+
+Quick scans to surface common violations before review:
+
+```bash
+# 1. Bare backtick API references that should be sphinx roles.
+#    Customise the regex with the symbols touched by your PR.
+grep -rnE '`(SharedGopStore|GopRef|CachedGopDecoder|PyNvGopDecoder|CreateGopDecoder)[A-Za-z_]*\(?\)?`' \
+    docs/ packages/*/docs/ 2>/dev/null | grep -v '```'
+
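+# 2. Returns: block immediately followed by a line that ends in ':' (Rule 2 violation).
+#    With -A1, grep prints the following (context) line as `path-NN-text`. Keep only
+#    those lines; otherwise every `Returns:` header matches, since it ends in ':' itself.
+grep -rEn -A1 'Returns:$' packages/*/accvlab/ | grep -E '^[^:]+-[0-9]+-.*:\s*$'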
+
+# 3. Public symbols not exported in root __init__.py (manual diff).
+#    After adding `class Foo` or `def bar`, confirm:
+#      - `from .<sub> import Foo` (or `bar`) appears in __init__.py
+#      - `'Foo'` (or `'bar'`) appears in __all__
+grep -E '^(class|def) [A-Za-z]' packages/<pkg>/accvlab/<pkg>/<file>.py
+grep -E "(<symbol>|__all__)" packages/<pkg>/accvlab/<pkg>/__init__.py
+
+# 4. Accidental edits inside the mirror directory (Rule 9 violation).
+git diff --name-only | grep contained_package_docs_mirror
+
+# 5. Full doc build with warning surface.
+./scripts/build_docs.sh 2>&1 | grep -iE 'warning|error' | grep -v -i 'INFO'
+
+# 6. Spelling check.
+./scripts/build_docs.sh --spelling
+cat docs/_build/spelling/output.txt 2>/dev/null
+```
diff --git a/docs/spelling_wordlist.txt b/docs/spelling_wordlist.txt
@@ -209,3 +209,4 @@ unlinking
 atomicity
 picklable
 ABI
+aggregator
diff --git a/packages/on_demand_video_decoder/accvlab/on_demand_video_decoder/__init__.py b/packages/on_demand_video_decoder/accvlab/on_demand_video_decoder/__init__.py
@@ -83,10 +83,12 @@ def _preload_local_ffmpeg() -> None:
     # C++ core interfaces
     'PyNvGopDecoder',
     'PyNvSampleReader',
+    'PyNvBatchAsyncStreamReader',
     'FastStreamInfo',
     'DecodedFrameExt',
     'RGBFrame',
     'CreateSampleReader',
+    'CreateBatchAsyncStreamReader',
     'GetFastInitInfo',
     'SavePacketsToFile',
     # Python decoder with caching

diff --git a/packages/on_demand_video_decoder/docs/sample.md b/packages/on_demand_video_decoder/docs/sample.md
@@ -25,6 +25,7 @@ section helps you quickly locate the sample code that matches your requirements.
 | [SampleDecodeFromGopFilesToListAPI.py](../samples/SampleDecodeFromGopFilesToListAPI.py) | Selective GOP loading | {py:meth}`~accvlab.on_demand_video_decoder.PyNvGopDecoder.LoadGopsToList`, {py:meth}`~accvlab.on_demand_video_decoder.PyNvGopDecoder.DecodeFromGOPListRGB` |
 | [SampleDecodeFromGopList.py](../samples/SampleDecodeFromGopList.py) | Batch decode from multiple demux results (N demux → 1 decode) | {py:meth}`~accvlab.on_demand_video_decoder.PyNvGopDecoder.DecodeFromGOPListRGB` |
 | [SampleStreamAsyncAccess.py](../samples/SampleStreamAsyncAccess.py) | Async stream decoding with prefetching | {py:func}`~accvlab.on_demand_video_decoder.CreateSampleReader`, {py:meth}`~accvlab.on_demand_video_decoder.PyNvSampleReader.DecodeN12ToRGBAsync`, {py:meth}`~accvlab.on_demand_video_decoder.PyNvSampleReader.DecodeN12ToRGBAsyncGetBuffer` |
+| [SampleBatchAsyncStreamAccess.py](../samples/SampleBatchAsyncStreamAccess.py) | 2D async stream decoding — multiple frames per video per call, with prefetching | {py:func}`~accvlab.on_demand_video_decoder.CreateBatchAsyncStreamReader`, {py:meth}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader.Decode`, {py:meth}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader.GetBuffer` |
 | [SampleSharedGopStore.py](../samples/SampleSharedGopStore.py) | Cross-process shared GOP cache for DataLoader | {py:class}`~accvlab.on_demand_video_decoder.SharedGopStore`, {py:class}`~accvlab.on_demand_video_decoder.GopRef` |
 
 For details on the **Key APIs**, please refer to the API documentation of the corresponding functions and classes.
@@ -43,7 +44,9 @@ If you need random frame access:
         → Use SampleRandomAccess
 
 If you need sequential frame decoding:
-    If you need async decoding with prefetching for lower latency:
+    If you need multiple frames per video per call (2D batch):
+        → Use SampleBatchAsyncStreamAccess
+    Else if you need async decoding with prefetching for lower latency:
         → Use SampleStreamAsyncAccess
     Otherwise:
         → Use SampleStreamAccess
@@ -554,6 +557,141 @@ cd packages/on_demand_video_decoder/samples
 python SampleStreamAsyncAccess.py
 ```
 
+#### 3.2.4 Sample: Batch Async Stream Access (2D)
+
+**File:** `packages/on_demand_video_decoder/samples/SampleBatchAsyncStreamAccess.py`
+
+**When to Use**
+
+The 2D batch async API is preferred over basic async stream access when:
+- Each iteration consumes **multiple frames per video** (e.g. multi-sweep
+  StreamPETR-like training where one batch needs F sweeps × V cameras)
+- You want a single in-flight submission to cover V × F frames instead of
+  V frames
+- You want the output as a 2D structure ``out[v][f]`` rather than re-batching
+  V results F times in Python
+
+The 1D async API ({py:meth}`~accvlab.on_demand_video_decoder.PyNvSampleReader.DecodeN12ToRGBAsync`)
+remains the right choice when you only need one frame per video per
+iteration.
+
+**Key Differences from 1D Async Stream Access**
+
+| Feature | 1D Async ({py:class}`~accvlab.on_demand_video_decoder.PyNvSampleReader`) | 2D Batch Async ({py:class}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader`) |
+|---------|---------|---------|
+| Frame ids shape | ``List[int]`` (len V) | ``List[List[int]]`` (V × F) |
+| Returned structure | ``List[RGBFrame]`` (len V) | ``List[List[RGBFrame]]`` (V × F) |
+| Frames decoded per call | V | V × F |
+| Result buffer | 1 result, V frames | 1 result, V × F frames |
+| Pool sized at construction by | (n/a — per-reader) | ``max_frames_per_decode_call`` |
+
+**Core APIs**
+
+- {py:func}`~accvlab.on_demand_video_decoder.CreateBatchAsyncStreamReader`: Construct a 2D batch async reader
+- {py:meth}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader.Decode`: Submit an async 2D decode (returns immediately)
+- {py:meth}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader.GetBuffer`: Block until decode is done and return decoded frames
+
+**Code Walkthrough**
+
+Construct the reader. ``max_frames_per_decode_call`` is the F upper bound
+(per ``Decode()`` call, not per video file):
+
+```python
+import accvlab.on_demand_video_decoder as nvc
+
+reader = nvc.CreateBatchAsyncStreamReader(
+    num_of_set=1,
+    num_of_file=6,                  # V upper bound
+    max_frames_per_decode_call=4,   # F upper bound (per Decode() call)
+    iGpu=0,
+)
+```
+
+Build a 2D frame_ids and submit:
+
+```python
+V = len(file_path_list)
+F = 4
+# frame_ids[v][f] = f-th frame requested for video v.
+# All inner lists must be the same length (jagged inner lengths are rejected).
+frame_ids = [[0, 7, 14, 21] for _ in range(V)]  # one list per video, no shared rows
+
+reader.Decode(file_path_list, frame_ids, as_bgr=False)
+# Returns immediately; decoding happens on a background worker thread.
+```
+
+Retrieve the result:
+
+```python
+out = reader.GetBuffer(file_path_list, frame_ids, as_bgr=False)
+# out is List[List[RGBFrame]] indexed [v][f].
+# out[v][f].shape == (H, W, 3), dtype uint8, GPU memory.
+```
+
+**Two Contracts to Remember**
+
+> **ℹ️ Note**: When 
+> {py:meth}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader.GetBuffer` 
+> returns, all GPU work (decode + internal copies) is already complete. You 
+> can read the returned frames on any CUDA stream — including PyTorch's 
+> default stream — without additional synchronization.
+
+> **⚠️ Important**: The returned 
+> {py:class}`~accvlab.on_demand_video_decoder.RGBFrame` objects are zero-copy 
+> views into the reader's internal aggregator pool. Submitting the next 
+> {py:meth}`~accvlab.on_demand_video_decoder.PyNvBatchAsyncStreamReader.Decode` 
+> reuses that memory. You **must** clone every frame you want to keep 
+> **before** the next ``Decode()`` call. Skipping the clone causes silent data 
+> corruption.
+
+**Canonical Prefetch Pattern**
+
+```python
+# Iteration 0: prime the pipeline
+reader.Decode(files, frame_ids_0, as_bgr=False)
+out = reader.GetBuffer(files, frame_ids_0, as_bgr=False)
+
+# Clone before submitting the next batch
+tensors_0 = [
+    [torch.as_tensor(out[v][f], device="cuda").clone() for f in range(F)]
+    for v in range(V)
+]
+
+# Prefetch iteration 1 in parallel with processing iteration 0
+reader.Decode(files, frame_ids_1, as_bgr=False)
+# ... process tensors_0 here (model forward, etc.) ...
+
+# Iteration 1: thanks to the prefetch, GetBuffer usually returns without waiting
+out = reader.GetBuffer(files, frame_ids_1, as_bgr=False)
+tensors_1 = [
+    [torch.as_tensor(out[v][f], device="cuda").clone() for f in range(F)]
+    for v in range(V)
+]
+reader.Decode(files, frame_ids_2, as_bgr=False)
+# ... process tensors_1 ...
+```
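+
+The same pattern can be written as a loop. This is a sketch rather than part of the sample: ``run_prefetch_loop``, ``clone_frames``, and ``process`` are hypothetical names, and the only reader calls assumed are ``Decode()`` and ``GetBuffer()`` with the arguments shown above.
+
+```python
+from typing import Any, Callable, List, Sequence
+
+FrameIds = List[List[int]]  # frame_ids[v][f], as in the walkthrough
+
+
+def run_prefetch_loop(
+    reader: Any,
+    files: Sequence[str],
+    batches: Sequence[FrameIds],
+    clone_frames: Callable[[Any], Any],
+    process: Callable[[Any], None],
+) -> None:
+    """Decode each batch while the previous one is being processed."""
+    if not batches:
+        return
+    # Prime the pipeline with the first batch.
+    reader.Decode(files, batches[0], as_bgr=False)
+    for i, frame_ids in enumerate(batches):
+        out = reader.GetBuffer(files, frame_ids, as_bgr=False)
+        # Clone before the next Decode() reuses the pool memory.
+        kept = clone_frames(out)
+        if i + 1 < len(batches):
+            # Prefetch the next batch; it decodes while process() runs.
+            reader.Decode(files, batches[i + 1], as_bgr=False)
+        process(kept)
+```
+
+Here ``clone_frames`` would wrap the ``torch.as_tensor(...).clone()`` comprehension from the pattern above, and ``process`` stands in for the model forward pass.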
+
+**Resolution Handling**
+
+Videos in a single ``Decode()`` call may have **different resolutions** — each
+video gets its own per-slot aggregator pool, sized lazily to that video's
+``F * H_v * W_v * 3`` on the first ``Decode()`` that hits the slot. If a later
+``Decode()`` swaps in a video at the same slot with a different resolution,
+the pool is reallocated automatically (grows if larger; reuses the existing
+allocation if same or smaller).
+
+Per-frame shape consequence: ``out[v][f].shape == (H_v, W_v, 3)`` may vary
+across ``v``. The frames cannot be stacked into a single
+``[V, F, H, W, 3]`` tensor without resizing or padding. That follows from
+the mixed-resolution input itself; it is not an API limitation.
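+
+As a rough illustration of the sizing arithmetic, the pure-Python sketch below models the ``F * H_v * W_v * 3`` pool size and the grow-only reuse rule. It is a model under stated assumptions, not the reader's implementation: ``required_bytes`` and ``next_capacity`` are hypothetical names, and one byte per channel (uint8) is assumed.
+
+```python
+def required_bytes(frames: int, height: int, width: int) -> int:
+    """Bytes needed for `frames` RGB uint8 frames: F * H * W * 3."""
+    return frames * height * width * 3
+
+
+def next_capacity(current: int, required: int) -> int:
+    """Grow-only rule: keep the allocation unless more space is required."""
+    return max(current, required)
+
+
+# A slot first sees a 1920x1080 video with F = 4, then a 1280x720 one.
+capacity = next_capacity(0, required_bytes(4, 1080, 1920))
+# The smaller video fits in the existing allocation, so nothing grows.
+capacity = next_capacity(capacity, required_bytes(4, 720, 1280))
+```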
+
+**Running the Sample**
+
+```bash
+cd packages/on_demand_video_decoder/samples
+python SampleBatchAsyncStreamAccess.py
+```
+
 ### 3.3 Separation Access Decoding
 
 Separation Access mode decouples demuxing and decoding into two separate stages. This provides fine-grained control over the video processing pipeline and enables advanced optimization strategies.