perf(datasets): datasets cli support --filter/--offset/--limit by xdlkc · Pull Request #1105 · alibaba/ROCK

xdlkc · 2026-06-15T06:12:21Z

Summary

Split out from #1064 (part 1 of 2). Contains the OSS registry performance baseline, the datasets tasks server-side pagination / --filter work, and the object-oriented refactor of the CLI command module.

The datasets fs file-access subcommand is split into #1106.

Changes

Performance / feature

Cache oss2.Bucket: lazy-init once per OssDatasetRegistry and reuse across calls.
Paginate all listings: org/dataset/split/all listings page past the 1000-key cap, with a hard page bound and a non-advancing-token guard.
Filter hidden entries: dot-prefixed names are skipped.
Server-side pagination for datasets tasks: offset / limit / task_filter pushed down through BaseDatasetRegistry → DatasetClient → CLI; --limit stops scanning early; sequential pages resume via a continuation-token cache.
datasets tasks --filter PREFIX: pushed down as an OSS prefix.

Refactor (addresses @StephenRi's review on #1064 — "函数都封装到类里面")

Module-level CLI helpers folded into the command class: an OutputWriter collaborator for the -o json decision and JSON writing; argparse int validators become DatasetsCommand staticmethods.
Drops the --ouput typo alias.
Behavior unchanged; covered by existing tests plus new OutputWriter units.

Testing

uv run pytest tests/unit/datasets/ -q
# 75 passed
uv run ruff check rock/cli/command/datasets.py rock/sdk/envhub/datasets/ tests/unit/datasets/
# All checks passed!

…r/--offset/--limit Lazy-init and reuse the oss2.Bucket per registry instance, paginate all list operations past the 1000-key cap with a hard page bound, and filter hidden (dot-prefixed) entries. Push offset/limit/task_filter down to the OSS layer for `datasets tasks` via a continuation-token pagination cache, so bounded queries stop scanning early. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 355/355 AI-Contributed/UT: 184/184

Move the module-level helper functions added for this feature into the command class: JSON output and the -o json decision become an OutputWriter collaborator, and the argparse int validators become DatasetsCommand staticmethods. Also drop the `--ouput` typo alias. Behavior is unchanged; covered by existing tests plus new OutputWriter units. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 97/97 AI-Contributed/UT: 26/26

Remove the _PaginationCache mechanism. It only helped repeated in-process queries on the same registry instance with increasing offset, which the CLI (one-shot process) never hits; it added cross-call state and unbounded memory for no real benefit. _extract_tasks_from_split now reuses the existing _list_objects_v2_pages generator and stops early once max_items distinct tasks are collected. Pagination-termination safety is provided by the generator. Behavior is unchanged; covered by existing tests. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 80/80 AI-Contributed/UT: 0/0

Per review feedback on #1105: rename the output option to `-f/--format` accepting `table` (default) or `json`, instead of `-o/--output` that only took `json`. The default `table` makes the human-readable output explicit. OutputWriter now reads `args.format`; handler logic is unchanged. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 20/20 AI-Contributed/UT: 38/38

…token The continuation token is returned by OSS per page and fed back into the next request; the "empty or non-advancing token" guard already guarantees the pagination loop terminates. The fixed page-count ceiling was a redundant second safeguard, so the generator now uses a plain `while True` driven solely by the token. Termination is still covered by the non-advancing-token test. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 16/16 AI-Contributed/UT: 0/0

…st coverage Address review feedback on #1105: - splits: honor --format json (previously parsed but silently ignored, always rendered as table). Now emits {dataset, splits, count}. - Add add_output_arg to list/splits subparsers so -f/--format is accepted in a consistent position (after the subcommand) for all of list/tasks/splits/upload, matching tasks/upload. - Rename JSON field total -> count in tasks output. After --offset/--limit the value is the number of items returned, not the split true total; count reflects that accurately and avoids misleading API consumers. - Add unit tests for the new OSS pagination logic: * multi-page continuation-token chaining with merge/dedupe * early termination when max_items (--limit) is reached * task_filter push-down into the OSS prefix * hidden entry (.DS_Store/.git) filtering for tasks and organizations * splits JSON output (incl. empty) * --format flag accepted after every subcommand All 87 dataset unit tests pass. AI-Contributed/Feature: 0/17 AI-Contributed/UT: 0/178

xdlkc mentioned this pull request Jun 15, 2026

feat(datasets): add fs subcommand to inspect and download task files (#1104) #1106

Open

xdlkc changed the title ~~perf(datasets): cache OSS bucket, paginate listing, add tasks --filter/--offset/--limit (#1063)~~ perf(datasets): OSS bucket cache, pagination, tasks --filter + CLI OO refactor (#1063) Jun 15, 2026

StephenRi reviewed Jun 15, 2026

View reviewed changes

Comment thread rock/cli/command/datasets.py Outdated

StephenRi reviewed Jun 15, 2026

View reviewed changes

Comment thread rock/cli/command/datasets.py Outdated

xdlkc and others added 4 commits June 15, 2026 16:50

xdlkc changed the title ~~perf(datasets): OSS bucket cache, pagination, tasks --filter + CLI OO refactor (#1063)~~ perf(datasets): datasets cli support --filter/--offset/--limit Jun 16, 2026

StephenRi approved these changes Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(datasets): datasets cli support --filter/--offset/--limit#1105

perf(datasets): datasets cli support --filter/--offset/--limit#1105
xdlkc wants to merge 6 commits into
masterfrom
perf/datasets-oss-pagination

xdlkc commented Jun 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xdlkc commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Related

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xdlkc commented Jun 15, 2026 •

edited

Loading