feat(datasets): enhance SDK with pagination, file ops, query APIs, and format parsing #1171

Open
jake11-oho wants to merge 6 commits into alibaba:master from jake11-oho:feat/datasets-sdk-v2

jake11-oho (Collaborator) commented Jun 25, 2026

Summary

  • Add PageResult[T] generic for paginated listing across all APIs (offset/limit)
  • Add query APIs: get_dataset(), get_task(), get_task_metadata()
  • Add task file operations: browse/list/read/download files
  • Add pluggable format parsers for PinchBench, SWE-bench, TB2 benchmark datasets
  • Add CLI subcommands: info, files, cat, download with --offset/--limit
  • Cache OSS bucket instance; iterator-based listing for large result sets
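
As a rough illustration of the offset/limit pagination listed above, a generic page container could look like the following. The field names (`items`, `total`, `offset`, `limit`), the `has_more` property, and the `paginate` helper are assumptions made for this sketch; the actual `PageResult` definition is not shown in this description.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PageResult(Generic[T]):
    """One page of a listing (field names are hypothetical)."""

    items: List[T]
    total: int   # size of the full result set
    offset: int  # index of the first item in this page
    limit: int   # maximum page size that was requested

    @property
    def has_more(self) -> bool:
        # True while items remain beyond the end of this page.
        return self.offset + len(self.items) < self.total


def paginate(source: Sequence[T], offset: int = 0, limit: int = 20) -> PageResult[T]:
    """Slice an in-memory sequence into a single page (illustrative helper)."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    return PageResult(
        items=list(source[offset:offset + limit]),
        total=len(source),
        offset=offset,
        limit=limit,
    )
```

A caller would typically advance `offset` by `limit` until `has_more` is false.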

Fixes #1170

Test plan

  • Unit tests added for all new models (PageResult, DatasetInfo, TaskInfo, etc.)
  • Unit tests for OssDatasetRegistry covering pagination, file listing, metadata discovery
  • Unit tests for DatasetClient delegation layer
  • Unit tests for CLI subcommands (info, files, cat, download)
  • Unit tests for format parsers (PinchBench, SWE, TB2)
  • CI passes all existing + new tests

🤖 Generated with Claude Code

jake11-oho and others added 6 commits June 25, 2026 15:44
…d format parsing

- Add PageResult[T] generic type for paginated responses across all listing APIs
- Add new query APIs: get_dataset(), get_task(), get_task_metadata()
- Add task file operations: browse/list/read/download files
- Add dataset format parsers (pinchbench, swe, tb2) for structured task loading
- Add CLI subcommands: info, files, cat, download with --offset/--limit pagination
- Cache OSS bucket instance; use iterator-based listing for large result sets
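
One common way to make format parsers pluggable is a name-keyed registry, sketched below. Everything here is an assumption for illustration: `register_parser`, `parse_task`, and the parser bodies are hypothetical, and the placeholder bodies do not reflect the real SWE-bench or TB2 file layouts, which this commit message does not describe.

```python
import json
from typing import Callable, Dict

Parser = Callable[[str], dict]

# Hypothetical registry mapping a format name to its parser function.
_PARSERS: Dict[str, Parser] = {}


def register_parser(name: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser under a format name."""
    def decorator(func: Parser) -> Parser:
        if name in _PARSERS:
            raise ValueError(f"parser already registered: {name}")
        _PARSERS[name] = func
        return func
    return decorator


@register_parser("swe")
def parse_swe(raw: str) -> dict:
    # Placeholder only: treats the payload as a JSON object.
    return json.loads(raw)


@register_parser("tb2")
def parse_tb2(raw: str) -> dict:
    # Placeholder only: treats the payload as "key: value" lines.
    result = {}
    for line in raw.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            result[key.strip()] = value.strip()
    return result


def parse_task(fmt: str, raw: str) -> dict:
    """Dispatch to the parser registered for fmt."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError(f"unknown format {fmt!r}; known: {sorted(_PARSERS)}") from None
    return parser(raw)
```

With this shape, supporting another benchmark means adding one decorated function rather than editing the dispatch code.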

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Cache dataset task counts in `meta/{org}/{dataset}/meta.json` to avoid
slow OSS listing on every `get_dataset()` call. Meta is auto-updated on
upload and can be manually refreshed via `refresh_metadata()`.

Add `sync_dataset()` for incremental cross-bucket dataset sync, adapted
from harbor viewer's DatasetSyncService. Supports dry-run diff preview,
server-side copy with GET+PUT fallback, and optional delete-extra mode.
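
The dry-run diff preview can be pictured as a pure planning step that runs before any object is copied. The sketch below is an assumption about how such a diff might be computed, not the `sync_dataset()` implementation: `SyncPlan` and `plan_sync` are hypothetical names, and comparing objects by a fingerprint string (for example an ETag) is an illustrative choice.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SyncPlan:
    """Result of a dry run: what a sync would do, without doing it."""

    to_copy: List[str] = field(default_factory=list)
    to_delete: List[str] = field(default_factory=list)
    unchanged: List[str] = field(default_factory=list)


def plan_sync(
    source: Dict[str, str],
    target: Dict[str, str],
    delete_extra: bool = False,
) -> SyncPlan:
    """Diff two listings that map object key -> fingerprint string."""
    plan = SyncPlan()
    for key in sorted(source):
        if target.get(key) == source[key]:
            plan.unchanged.append(key)
        else:
            # Missing in the target, or present with different content.
            plan.to_copy.append(key)
    if delete_extra:
        # Only remove target-only objects when explicitly requested.
        plan.to_delete = sorted(set(target) - set(source))
    return plan
```

A real sync would then execute `to_copy` (and `to_delete`, if enabled) only when not in dry-run mode.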

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Users should call refresh_metadata() explicitly after modifying datasets
rather than having upload_dataset() and sync_dataset() update meta
automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change meta granularity from per-dataset (meta.json) to per-split
({split}.json), reducing write contention and update cost. Add optional
split parameter to refresh_metadata() for targeted refresh. Re-enable
auto-refresh after upload_dataset() and sync_dataset() at split level.
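
To make the per-split granularity concrete, the sketch below shows how an optional `split` argument could narrow a refresh to a single meta object. The key layout `meta/{org}/{dataset}/{split}.json` is an assumption extrapolated from the earlier per-dataset path, and `split_meta_key` and `keys_to_refresh` are hypothetical helpers, not functions from this PR.

```python
from typing import List, Optional


def split_meta_key(org: str, dataset: str, split: str) -> str:
    # Assumed layout: one meta object per split under the dataset prefix.
    return f"meta/{org}/{dataset}/{split}.json"


def keys_to_refresh(
    org: str,
    dataset: str,
    all_splits: List[str],
    split: Optional[str] = None,
) -> List[str]:
    """Return the meta keys a refresh would rewrite.

    With split=None every split is refreshed; otherwise only the named one.
    """
    if split is not None and split not in all_splits:
        raise ValueError(f"unknown split: {split!r}")
    targets = all_splits if split is None else [split]
    return [split_meta_key(org, dataset, s) for s in targets]
```

Under this layout, an upload that touches one split rewrites one small object instead of a shared per-dataset file, which is the contention reduction the commit message describes.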

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

feat: enhance Datasets SDK with pagination, file operations, and format parsing
