Skip to content

perf(datasets): datasets cli support --filter/--offset/--limit#1105

Open
xdlkc wants to merge 6 commits into
masterfrom
perf/datasets-oss-pagination
Open

perf(datasets): datasets cli support --filter/--offset/--limit#1105
xdlkc wants to merge 6 commits into
masterfrom
perf/datasets-oss-pagination

Conversation

@xdlkc

@xdlkc xdlkc commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Split out from #1064 (part 1 of 2). Contains the OSS registry performance baseline, the datasets tasks server-side pagination / --filter work, and the object-oriented refactor of the CLI command module.

The datasets fs file-access subcommand is split into #1106.

Changes

Performance / feature

  • Cache oss2.Bucket: lazy-init once per OssDatasetRegistry and reuse across calls.
  • Paginate all listings: org/dataset/split/all listings page past the 1000-key cap, with a hard page bound and a non-advancing-token guard.
  • Filter hidden entries: dot-prefixed names are skipped.
  • Server-side pagination for datasets tasks: offset / limit / task_filter pushed down through BaseDatasetRegistryDatasetClient → CLI; --limit stops scanning early; sequential pages resume via a continuation-token cache.
  • datasets tasks --filter PREFIX: pushed down as an OSS prefix.

Refactor (addresses @StephenRi's review on #1064 — "函数都封装到类里面")

  • Module-level CLI helpers folded into the command class: an OutputWriter collaborator for the -o json decision and JSON writing; argparse int validators become DatasetsCommand staticmethods.
  • Drops the --ouput typo alias.
  • Behavior unchanged; covered by existing tests plus new OutputWriter units.

Testing

uv run pytest tests/unit/datasets/ -q
# 75 passed
uv run ruff check rock/cli/command/datasets.py rock/sdk/envhub/datasets/ tests/unit/datasets/
# All checks passed!

Related

Refs #1063

…r/--offset/--limit

Lazy-init and reuse the oss2.Bucket per registry instance, paginate all
list operations past the 1000-key cap with a hard page bound, and filter
hidden (dot-prefixed) entries. Push offset/limit/task_filter down to the
OSS layer for `datasets tasks` via a continuation-token pagination cache,
so bounded queries stop scanning early.

Refs #1063

Co-Authored-By: Claude Code <noreply@anthropic.com>
AI-Model: claude-opus-4-8
AI-Contributed/Feature: 355/355
AI-Contributed/UT: 184/184
Move the module-level helper functions added for this feature into the
command class: JSON output and the -o json decision become an OutputWriter
collaborator, and the argparse int validators become DatasetsCommand
staticmethods. Also drop the `--ouput` typo alias. Behavior is unchanged;
covered by existing tests plus new OutputWriter units.

Refs #1063

Co-Authored-By: Claude Code <noreply@anthropic.com>
AI-Model: claude-opus-4-8
AI-Contributed/Feature: 97/97
AI-Contributed/UT: 26/26
@xdlkc xdlkc changed the title perf(datasets): cache OSS bucket, paginate listing, add tasks --filter/--offset/--limit (#1063) perf(datasets): OSS bucket cache, pagination, tasks --filter + CLI OO refactor (#1063) Jun 15, 2026
Comment thread rock/cli/command/datasets.py Outdated
Comment thread rock/cli/command/datasets.py Outdated
xdlkc and others added 4 commits June 15, 2026 16:50
Remove the _PaginationCache mechanism. It only helped repeated in-process
queries on the same registry instance with increasing offset, which the CLI
(one-shot process) never hits; it added cross-call state and unbounded memory
for no real benefit. _extract_tasks_from_split now reuses the existing
_list_objects_v2_pages generator and stops early once max_items distinct tasks
are collected. Pagination-termination safety is provided by the generator.
Behavior is unchanged; covered by existing tests.

Refs #1063

Co-Authored-By: Claude Code <noreply@anthropic.com>
AI-Model: claude-opus-4-8
AI-Contributed/Feature: 80/80
AI-Contributed/UT: 0/0
Per review feedback on #1105: rename the output option to `-f/--format`
accepting `table` (default) or `json`, instead of `-o/--output` that only
took `json`. The default `table` makes the human-readable output explicit.
OutputWriter now reads `args.format`; handler logic is unchanged.

Refs #1063

Co-Authored-By: Claude Code <noreply@anthropic.com>
AI-Model: claude-opus-4-8
AI-Contributed/Feature: 20/20
AI-Contributed/UT: 38/38
…token

The continuation token is returned by OSS per page and fed back into the next
request; the "empty or non-advancing token" guard already guarantees the
pagination loop terminates. The fixed page-count ceiling was a redundant
second safeguard, so the generator now uses a plain `while True` driven solely
by the token. Termination is still covered by the non-advancing-token test.

Refs #1063

Co-Authored-By: Claude Code <noreply@anthropic.com>
AI-Model: claude-opus-4-8
AI-Contributed/Feature: 16/16
AI-Contributed/UT: 0/0
…st coverage

Address review feedback on #1105:

- splits: honor --format json (previously parsed but silently ignored,
  always rendered as table). Now emits {dataset, splits, count}.
- Add add_output_arg to list/splits subparsers so -f/--format is accepted
  in a consistent position (after the subcommand) for all of
  list/tasks/splits/upload, matching tasks/upload.
- Rename JSON field total -> count in tasks output. After --offset/--limit
  the value is the number of items returned, not the split true total;
  count reflects that accurately and avoids misleading API consumers.
- Add unit tests for the new OSS pagination logic:
  * multi-page continuation-token chaining with merge/dedupe
  * early termination when max_items (--limit) is reached
  * task_filter push-down into the OSS prefix
  * hidden entry (.DS_Store/.git) filtering for tasks and organizations
  * splits JSON output (incl. empty)
  * --format flag accepted after every subcommand

All 87 dataset unit tests pass.

AI-Contributed/Feature: 0/17
AI-Contributed/UT: 0/178
@xdlkc xdlkc changed the title perf(datasets): OSS bucket cache, pagination, tasks --filter + CLI OO refactor (#1063) perf(datasets): datasets cli support --filter/--offset/--limit Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants