feat(datasets): enhance SDK with pagination, file ops, query APIs, and format parsing by jake11-oho · Pull Request #1171 · alibaba/ROCK

jake11-oho · 2026-06-25T15:45:50Z

Summary

- Add PageResult[T] generic for paginated listing across all APIs (offset/limit)
- Add query APIs: get_dataset(), get_task(), get_task_metadata()
- Add task file operations: browse/list/read/download files
- Add pluggable format parsers for PinchBench, SWE-bench, TB2 benchmark datasets
- Add CLI subcommands: info, files, cat, download with --offset/--limit
- Cache OSS bucket instance; iterator-based listing for large result sets
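
For readers who have not opened the diff, here is a minimal sketch of what an offset/limit `PageResult[T]` could look like. Only the name `PageResult` and the offset/limit parameters come from this PR; the field names (`items`, `total`), the `has_more` property, and the `paginate()` helper are assumptions made for illustration and may not match the actual implementation.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PageResult(Generic[T]):
    """One page of results (field names are hypothetical)."""

    items: List[T]
    total: int   # number of items available across all pages
    offset: int  # index of the first item in this page
    limit: int   # maximum page size that was requested

    @property
    def has_more(self) -> bool:
        # True when items remain after the end of this page.
        return self.offset + len(self.items) < self.total


def paginate(source: Sequence[T], offset: int = 0, limit: int = 20) -> PageResult[T]:
    """Slice an in-memory sequence into a single page (illustrative helper)."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    page = list(source[offset:offset + limit])
    return PageResult(items=page, total=len(source), offset=offset, limit=limit)
```

A caller would typically loop, advancing `offset` by `limit` until `has_more` is false.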

Test plan

- Unit tests added for all new models (PageResult, DatasetInfo, TaskInfo, etc.)
- Unit tests for OssDatasetRegistry covering pagination, file listing, metadata discovery
- Unit tests for DatasetClient delegation layer
- Unit tests for CLI subcommands (info, files, cat, download)
- Unit tests for format parsers (PinchBench, SWE, TB2)
- CI passes all existing + new tests

🤖 Generated with Claude Code

…d format parsing

- Add PageResult[T] generic type for paginated responses across all listing APIs
- Add new query APIs: get_dataset(), get_task(), get_task_metadata()
- Add task file operations: browse/list/read/download files
- Add dataset format parsers (pinchbench, swe, tb2) for structured task loading
- Add CLI subcommands: info, files, cat, download with --offset/--limit pagination
- Cache OSS bucket instance; use iterator-based listing for large result sets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
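
The "pluggable format parsers" mentioned in this commit could be organized as a name-to-function registry. The sketch below is one such arrangement, not the code in this PR: `register_parser`, `parse_task`, and the output keys are hypothetical, and the `instance_id` / `problem_statement` input fields are assumed for the `swe` format.

```python
import json
from typing import Any, Callable, Dict

# A parser turns one raw task record (a JSON string here) into a normalized dict.
Parser = Callable[[str], Dict[str, Any]]

_PARSERS: Dict[str, Parser] = {}


def register_parser(name: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser under a format name (hypothetical API)."""
    def decorator(fn: Parser) -> Parser:
        _PARSERS[name] = fn
        return fn
    return decorator


@register_parser("swe")
def parse_swe(raw: str) -> Dict[str, Any]:
    # Assumed input fields; the real parser may read different ones.
    record = json.loads(raw)
    return {
        "task_id": record["instance_id"],
        "prompt": record.get("problem_statement", ""),
    }


def parse_task(fmt: str, raw: str) -> Dict[str, Any]:
    """Look up the parser for `fmt` and apply it to one raw record."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError(f"unknown dataset format: {fmt!r}") from None
    return parser(raw)
```

Adding another format under this design means registering one more function; callers of `parse_task()` do not change.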

Cache dataset task counts in `meta/{org}/{dataset}/meta.json` to avoid slow OSS listing on every `get_dataset()` call. Meta is auto-updated on upload and can be manually refreshed via `refresh_metadata()`.

Add `sync_dataset()` for incremental cross-bucket dataset sync, adapted from harbor viewer's DatasetSyncService. Supports dry-run diff preview, server-side copy with GET+PUT fallback, and optional delete-extra mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
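
To make the dry-run diff and delete-extra mode concrete, here is a small sketch of how a sync plan might be computed before any object is copied. It assumes each bucket listing has been reduced to a mapping of object key to ETag and that a differing ETag means the object needs copying; the commit message does not say how differences are detected, so `SyncPlan`, `plan_sync`, and the ETag comparison are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SyncPlan:
    """What a sync would do; returned as-is for a dry-run preview."""

    to_copy: List[str]    # keys missing from the target or with a different ETag
    to_delete: List[str]  # keys present only in the target (delete-extra mode)


def plan_sync(
    source: Dict[str, str],
    target: Dict[str, str],
    delete_extra: bool = False,
) -> SyncPlan:
    """Compare two {key: etag} listings without touching either bucket."""
    to_copy = sorted(
        key for key, etag in source.items() if target.get(key) != etag
    )
    to_delete = (
        sorted(key for key in target if key not in source) if delete_extra else []
    )
    return SyncPlan(to_copy=to_copy, to_delete=to_delete)
```

A dry run would print the plan and stop; a real run would walk `to_copy` (server-side copy first, GET+PUT as fallback) and then `to_delete`.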

Users should call refresh_metadata() explicitly after modifying datasets rather than having upload_dataset() and sync_dataset() update meta automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change meta granularity from per-dataset (meta.json) to per-split ({split}.json), reducing write contention and update cost. Add optional split parameter to refresh_metadata() for targeted refresh. Re-enable auto-refresh after upload_dataset() and sync_dataset() at split level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
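
A sketch of how the per-split layout might map to object keys, and how an optional `split` argument could narrow a refresh. The full key pattern `meta/{org}/{dataset}/{split}.json` is an assumption (the commit only names `{split}.json`), and `meta_key` and `keys_to_refresh` are hypothetical helpers, not functions from this PR.

```python
from typing import List, Optional, Sequence


def meta_key(org: str, dataset: str, split: str) -> str:
    """Object key for one split's cached metadata (assumed layout)."""
    return f"meta/{org}/{dataset}/{split}.json"


def keys_to_refresh(
    org: str,
    dataset: str,
    all_splits: Sequence[str],
    split: Optional[str] = None,
) -> List[str]:
    """Return the meta keys a refresh would rewrite.

    With split=None every split is refreshed; otherwise only the named one,
    so an upload to one split touches a single small file.
    """
    if split is None:
        return [meta_key(org, dataset, s) for s in all_splits]
    if split not in all_splits:
        raise ValueError(f"unknown split: {split!r}")
    return [meta_key(org, dataset, split)]
```

Because each split has its own file, concurrent uploads to different splits write different objects, which is the reduced write contention the commit describes.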

jake11-oho and others added 6 commits June 25, 2026 15:44

docs: add Datasets SDK v2 design document

adb4bca

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove unrelated changes (docs, pyproject.toml, uv.lock)

f42a67f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor(datasets): remove automatic meta updates from upload and sync

8e7f051

Users should call refresh_metadata() explicitly after modifying datasets rather than having upload_dataset() and sync_dataset() update meta automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jake11-oho wants to merge 6 commits into alibaba:master from jake11-oho:feat/datasets-sdk-v2
