
feat: add cloud_type parameter for on-demand vs spot GPU selection #50

Closed
slacki-ai wants to merge 3 commits into longtermrisk:v0.9 from slacki-ai:support_cloud_types

Conversation

@slacki-ai
Contributor

Summary

  • Adds a cloud_type parameter ("SECURE" / "ALL" / "COMMUNITY") to control which RunPod cloud tier workers are provisioned on
  • Stored in the job's params JSONB — no database migration needed
  • Defaults to "SECURE" (on-demand) for full backward compatibility
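A minimal sketch of how the default and validation described above could work. The helper name `with_cloud_type` and the constants are hypothetical and are not taken from the diff; only the three allowed values, the `params` storage location, and the `"SECURE"` default come from this PR.

```python
from typing import Any, Optional

# Allowed values and default as stated in the PR summary.
VALID_CLOUD_TYPES = ("SECURE", "ALL", "COMMUNITY")
DEFAULT_CLOUD_TYPE = "SECURE"


def with_cloud_type(
    params: Optional[dict[str, Any]], cloud_type: Optional[str] = None
) -> dict[str, Any]:
    """Return a copy of params with a validated cloud_type (hypothetical helper)."""
    merged = dict(params or {})  # tolerate params=None
    # An explicit argument wins, then any value already in params, then the default.
    chosen = cloud_type or merged.get("cloud_type") or DEFAULT_CLOUD_TYPE
    if chosen not in VALID_CLOUD_TYPES:
        raise ValueError(
            f"cloud_type must be one of {VALID_CLOUD_TYPES}, got {chosen!r}"
        )
    merged["cloud_type"] = chosen
    return merged
```

Because the value lives inside the existing `params` JSON, older jobs that never set it simply fall back to the default when read.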

Changes

Threaded through the entire pipeline (8 files):

  • openweights/cli/exec.py: new --cloud-type CLI argument
  • openweights/client/jobs.py: base Jobs.create() extracts and stores cloud_type in params
  • openweights/jobs/inference/__init__.py: InferenceJobs.create()
  • openweights/jobs/unsloth/__init__.py: FineTuning.create() + LogProb.create()
  • openweights/jobs/vllm/__init__.py: API.create() + deploy() + multi_deploy()
  • openweights/jobs/weighted_sft/__init__.py: SFT.create() + MultipleChoice.create() + LogProb.create()
  • openweights/cluster/org_manager.py: groups jobs by (cloud_type, allowed_hardware) so each worker is launched on the correct tier
  • openweights/cluster/start_runpod.py: passes cloud_type to create_pod()
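The grouping step in the scheduler can be sketched as follows. This is an illustration under assumptions, not the code in `org_manager.py`: the function names `group_key` and `group_jobs` are hypothetical, and the job dictionaries are assumed to carry `params` and `allowed_hardware` as top-level keys.

```python
from collections import defaultdict
from typing import Any

DEFAULT_CLOUD_TYPE = "SECURE"


def group_key(job: dict[str, Any]) -> tuple[str, tuple[str, ...]]:
    """Build the (cloud_type, allowed_hardware) key for one pending job."""
    params = job.get("params") or {}  # params may be None
    cloud_type = params.get("cloud_type", DEFAULT_CLOUD_TYPE)
    # Sort the hardware list so that ordering differences do not split a group.
    hardware = tuple(sorted(job.get("allowed_hardware") or []))
    return cloud_type, hardware


def group_jobs(jobs: list[dict[str, Any]]) -> dict[tuple[str, tuple[str, ...]], list[dict[str, Any]]]:
    """Group pending jobs so each group can be served by one kind of worker."""
    groups = defaultdict(list)
    for job in jobs:
        groups[group_key(job)].append(job)
    return dict(groups)


if __name__ == "__main__":
    pending = [
        {"id": "a", "params": None, "allowed_hardware": ["1x H100", "1x A100"]},
        {"id": "b", "params": {"cloud_type": "COMMUNITY"}, "allowed_hardware": ["1x A100", "1x H100"]},
    ]
    for (cloud_type, hardware), members in group_jobs(pending).items():
        print(cloud_type, hardware, [j["id"] for j in members])
```

Keying on a tuple means two jobs with identical hardware requirements but different `cloud_type` values never share a worker, which is the behaviour the third test-plan item checks.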

Test plan

  • Submit a job with cloud_type="COMMUNITY" — verify RunPod pod is created as a spot instance
  • Submit a job with default (no cloud_type) — verify it uses SECURE (on-demand) as before
  • Submit two jobs with different cloud_type values and same allowed_hardware — verify they get separate workers
  • CLI: ow exec --cloud-type COMMUNITY "echo test" — verify argument is passed through
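For the CLI item, a standalone `argparse` sketch shows how a `--cloud-type` flag restricted to the three values might be declared. The real `ow exec` parser has other arguments that are not shown in this PR page; `build_parser` and the positional `command` argument here are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `ow exec` argument parser."""
    parser = argparse.ArgumentParser(prog="ow exec")
    parser.add_argument("command", help="shell command to run on the worker")
    parser.add_argument(
        "--cloud-type",
        choices=["SECURE", "ALL", "COMMUNITY"],
        default="SECURE",  # on-demand tier, matching the PR's default
        help="RunPod cloud tier to provision the worker on",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--cloud-type", "COMMUNITY", "echo test"])
    # argparse exposes --cloud-type as the attribute cloud_type.
    print(args.cloud_type, args.command)
```

With `choices` set, argparse rejects any other value and exits with a usage error before a job is submitted.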

🤖 Generated with Claude Code

slacki-ai and others added 3 commits March 26, 2026 08:58
Add a `cloud_type` parameter ("SECURE", "ALL", or "COMMUNITY") that
controls which RunPod cloud tier is used when provisioning workers.
This is stored in the job's `params` JSONB (no DB migration needed)
and threaded through the entire pipeline:

- CLI: `ow exec --cloud-type COMMUNITY ...`
- Client: `Jobs.create()` base class extracts and stores cloud_type
- All job types: inference, unsloth (fine-tuning + logprob), vllm
  (create + deploy + multi_deploy), weighted_sft (SFT + MC + logprob)
- Scheduler: groups pending jobs by (cloud_type, allowed_hardware)
  so each worker lands on the correct RunPod tier
- Worker start: passes cloud_type to RunPod pod creation

Defaults to "SECURE" (on-demand) for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Tests verify that:
- Jobs are grouped by (cloud_type, allowed_hardware) in the scheduler
- cloud_type defaults to "SECURE" when absent or params is None
- Different cloud_type values produce separate worker groups
- Group keys unpack correctly for the scale_workers loop
- CLI parser accepts valid choices and rejects invalid ones

All tests are pure-Python logic checks, no DB or RunPod needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keep 5 core tests that exercise the actual grouping logic (same group,
separate groups, SECURE default, params=None edge case, hardware sort
normalization). Remove 6 tests: argparse-only tests, redundant grouping
variants, and Python tuple-unpacking check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nielsrolf
Collaborator

Why do we need this?

@nielsrolf
Collaborator

Closing because I don't think we need it. Feel free to reopen if this is wrong, or explain why we need it.

@nielsrolf closed this on Apr 14, 2026