perf(datasets): datasets cli support --filter/--offset/--limit#1105
Open
xdlkc wants to merge 6 commits into
Open
perf(datasets): datasets cli support --filter/--offset/--limit#1105xdlkc wants to merge 6 commits into
xdlkc wants to merge 6 commits into
Conversation
…r/--offset/--limit Lazy-init and reuse the oss2.Bucket per registry instance, paginate all list operations past the 1000-key cap with a hard page bound, and filter hidden (dot-prefixed) entries. Push offset/limit/task_filter down to the OSS layer for `datasets tasks` via a continuation-token pagination cache, so bounded queries stop scanning early. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 355/355 AI-Contributed/UT: 184/184
Move the module-level helper functions added for this feature into the command class: JSON output and the -o json decision become an OutputWriter collaborator, and the argparse int validators become DatasetsCommand staticmethods. Also drop the `--ouput` typo alias. Behavior is unchanged; covered by existing tests plus new OutputWriter units. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 97/97 AI-Contributed/UT: 26/26
StephenRi
reviewed
Jun 15, 2026
StephenRi
reviewed
Jun 15, 2026
Remove the _PaginationCache mechanism. It only helped repeated in-process queries on the same registry instance with increasing offset, which the CLI (one-shot process) never hits; it added cross-call state and unbounded memory for no real benefit. _extract_tasks_from_split now reuses the existing _list_objects_v2_pages generator and stops early once max_items distinct tasks are collected. Pagination-termination safety is provided by the generator. Behavior is unchanged; covered by existing tests. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 80/80 AI-Contributed/UT: 0/0
Per review feedback on #1105: rename the output option to `-f/--format` accepting `table` (default) or `json`, instead of `-o/--output` that only took `json`. The default `table` makes the human-readable output explicit. OutputWriter now reads `args.format`; handler logic is unchanged. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 20/20 AI-Contributed/UT: 38/38
…token The continuation token is returned by OSS per page and fed back into the next request; the "empty or non-advancing token" guard already guarantees the pagination loop terminates. The fixed page-count ceiling was a redundant second safeguard, so the generator now uses a plain `while True` driven solely by the token. Termination is still covered by the non-advancing-token test. Refs #1063 Co-Authored-By: Claude Code <noreply@anthropic.com> AI-Model: claude-opus-4-8 AI-Contributed/Feature: 16/16 AI-Contributed/UT: 0/0
…st coverage Address review feedback on #1105: - splits: honor --format json (previously parsed but silently ignored, always rendered as table). Now emits {dataset, splits, count}. - Add add_output_arg to list/splits subparsers so -f/--format is accepted in a consistent position (after the subcommand) for all of list/tasks/splits/upload, matching tasks/upload. - Rename JSON field total -> count in tasks output. After --offset/--limit the value is the number of items returned, not the split true total; count reflects that accurately and avoids misleading API consumers. - Add unit tests for the new OSS pagination logic: * multi-page continuation-token chaining with merge/dedupe * early termination when max_items (--limit) is reached * task_filter push-down into the OSS prefix * hidden entry (.DS_Store/.git) filtering for tasks and organizations * splits JSON output (incl. empty) * --format flag accepted after every subcommand All 87 dataset unit tests pass. AI-Contributed/Feature: 0/17 AI-Contributed/UT: 0/178
StephenRi
approved these changes
Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Split out from #1064 (part 1 of 2). Contains the OSS registry performance baseline, the
datasets tasksserver-side pagination /--filterwork, and the object-oriented refactor of the CLI command module.The
datasets fsfile-access subcommand is split into #1106.Changes
Performance / feature
oss2.Bucket: lazy-init once perOssDatasetRegistryand reuse across calls.datasets tasks:offset/limit/task_filterpushed down throughBaseDatasetRegistry→DatasetClient→ CLI;--limitstops scanning early; sequential pages resume via a continuation-token cache.datasets tasks --filter PREFIX: pushed down as an OSS prefix.Refactor (addresses @StephenRi's review on #1064 — "函数都封装到类里面")
OutputWritercollaborator for the-o jsondecision and JSON writing; argparse int validators becomeDatasetsCommandstaticmethods.--ouputtypo alias.OutputWriterunits.Testing
Related
Refs #1063