feat: Automatic Region/Provider Assignment for Instance Creation

## Summary

This feature proposal suggests **making the `region` field optional** in the Talis instance creation API. When a user does not specify a region (and potentially even a provider), the backend will automatically select a region (and provider, if applicable) for each requested instance. The selection will use a **round-robin strategy** across available providers and their regions, taking into account the **available capacity/quota per region and instance size**. This will allow Talis to intelligently distribute instances across regions (and cloud providers) on the user's behalf, improving resource utilization and user experience.

## Motivation

Currently, Talis requires the user to explicitly provide a cloud provider and region for every instance creation request. This rigid requirement can be inconvenient and limiting:

* **Ease of Use:** Users may not always have a preference for a specific region. In many cases, they simply need a certain number of instances and don't care where they run. Making the region optional simplifies the API for these cases.
* **Multi-Cloud Distribution:** Talis is a multi-cloud orchestrator. If a user has access to multiple providers, they might want their instances spread across providers for redundancy or capacity reasons. Today, the user would have to manually divide requests by provider and region. An automatic distribution would save this manual effort.
* **Quota Management:** Cloud providers impose quotas (limits) on resources **per region and per instance size**. For example, a user might only be allowed to run a certain number of `s-1vcpu-1gb` instances in region "nyc3" on DigitalOcean, or a limited number of a particular AWS EC2 instance type in a given region. If the user requests more instances than one region can accommodate, the request could fail or exhaust resources. Currently, Talis doesn’t check these limits and requires the user to manage them. By auto-selecting regions/providers, Talis can prevent overloading a single region and avoid hitting quotas.
* **High Availability and Performance:** Distributing instances across regions (and providers) can improve fault tolerance and potentially reduce latency (if instances are closer to end-users). Automating this distribution encourages best practices without burdening the user.
* **Current Gap:** Talis does **not yet support querying or caching provider quotas** or dynamic region selection. Introducing this feature addresses a gap in intelligent resource scheduling within the platform.

In summary, this feature is motivated by a desire to make the API more user-friendly and to improve system robustness by intelligently balancing instance creation across available infrastructure.

## Proposed Behavior

When a user creates instances **without specifying a region**, the backend will automatically determine the region and possibly the provider for each instance. The behavior can be outlined in several scenarios:

* **1. Region Omitted (Single Provider Specified):** If the user specifies a cloud provider (e.g. `"provider": "do"`) but leaves the `region` field empty or not provided, Talis will assume the user is flexible about the region on that provider. The system will then:

  * **Fetch available regions and capacities for that provider:** Talis will retrieve the list of regions for the given provider and check each region’s current capacity/quota for the requested instance size. This could involve querying the provider’s API or an internal cache of quotas. Only regions that have not reached the quota for the specified instance type will be considered eligible.
  * **Round-robin assignment across regions:** For the number of instances requested, the backend will cycle through the eligible regions in turn to assign each instance a region. This ensures an even distribution. For example, if a user requests 3 instances on DigitalOcean with no region specified and DigitalOcean has two eligible regions (say `nyc3` and `sfo3`) with capacity:

    * Instance 1 → `nyc3`
    * Instance 2 → `sfo3`
    * Instance 3 → `nyc3` (back to the start of the rotation)
  * If more than two regions are available, the assignment continues to rotate through all of them in sequence. This round-robin approach ensures no single region gets all the load if others are available.
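
To make the rotation in scenario 1 concrete, here is a minimal Go sketch. The function name `assignRegions` is hypothetical, and it assumes the list of eligible regions has already been filtered by capacity; it is an illustration, not the Talis implementation.

```go
package main

import "fmt"

// assignRegions cycles through the eligible regions in order and returns one
// region per requested instance. It returns nil when there is nothing to
// rotate through or nothing to assign.
func assignRegions(eligible []string, count int) []string {
	if len(eligible) == 0 || count <= 0 {
		return nil
	}
	out := make([]string, 0, count)
	for i := 0; i < count; i++ {
		// The modulo wraps the index back to the start of the rotation.
		out = append(out, eligible[i%len(eligible)])
	}
	return out
}

func main() {
	// Three instances over two eligible DigitalOcean regions.
	fmt.Println(assignRegions([]string{"nyc3", "sfo3"}, 3)) // [nyc3 sfo3 nyc3]
}
```

With a third eligible region in the slice, the same loop would visit all three before wrapping around.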

* **2. Region Omitted (Multiple Providers Allowed):** If the user has **not specified a provider** at all (or explicitly indicates that multiple providers can be used), the system will distribute instances across the different cloud providers accessible to the user:

  * **Determine provider pool:** The backend will gather all providers that the user has configured or that are specified for use. For example, if the user’s account is linked to both DigitalOcean and AWS, those providers will be considered. (If the user explicitly provided a list of providers in the request, only those will be used for selection.)
  * **Fetch capacities per provider and region:** For each provider in the pool, Talis will query the available quota per region for the requested instance size. For instance, it might find that DigitalOcean has capacity for 5 more of the given size in each of its regions, and AWS has certain EC2 instance limits per region.
  * **Round-robin assignment across providers and regions:** The system will first rotate through providers for each instance, and then assign a region within that provider:

    * Provider selection: Instances will be assigned to providers in a round-robin manner. For example, if 4 instances are requested and two providers (DO and AWS) are available, the assignment by provider could be: instance1 → DO, instance2 → AWS, instance3 → DO, instance4 → AWS.
    * Region selection within provider: Once a provider is chosen for a given instance, the region for that instance is selected by rotating through that provider’s region list (as in scenario 1). The selection considers only regions of that provider and ensures region-specific quotas are respected. For example, if instance1 is assigned to DigitalOcean, and DO’s eligible regions are `nyc3` and `sfo3`, instance1 might get `nyc3`. The next time an instance is assigned to DigitalOcean (instance3 in this example), it would get `sfo3` (the next region in DO’s round-robin cycle). Similarly for AWS regions when an instance goes to AWS.
  * **Distribution among multiple providers:** The net effect is that instances are evenly (or as evenly as possible) split across providers and across regions within each provider. This scenario covers cases where the user either **omitted the provider field** entirely or provided a set of multiple providers to use. For example, a request for 4 instances with no provider specified (and the user has two providers configured) might result in:

    * Instance 1 → DigitalOcean, region `nyc3`
    * Instance 2 → AWS, region `us-east-1`
    * Instance 3 → DigitalOcean, region `sfo3`
    * Instance 4 → AWS, region `us-west-2`
      (Here both the provider and region are auto-chosen for each instance.)

* **3. Multiple Providers Specified by User:** If the user explicitly **provides multiple provider entries in one request** (e.g., the request body contains one object with `"provider": "do"` and another with `"provider": "aws"`, each with some instances and no region), the system will treat each provider group separately for region assignment:

  * Within each provider’s group of instances, regions will be assigned in round-robin as described in scenario 1.
  * Across the groups, since the user explicitly split the count per provider, the system won’t alter those proportions (the distribution among providers is user-defined in this case). However, it will still ensure each provider’s instances are internally balanced across that provider’s regions.
  * For example, if the request asks for 2 DO instances and 2 AWS instances (with regions unspecified), Talis might assign DO instances to `nyc3` and `sfo3`, and AWS instances to `us-east-1` and `us-west-2` (assuming those regions are available).

* **Quota and Capacity Consideration:** In all cases, the selection logic will respect the **quota limits per region for the given instance size**:

  * Before assigning an instance to a region, Talis will ensure that creating the instance will not exceed the user’s allowed quota in that region for that instance type. (For example, if the user can only have 5 instances of size `s-1vcpu-1gb` in region `nyc3` and already has 5 running there, the system will not place a new instance in `nyc3` — it will choose a different region.)
  * If a region is at capacity, that region will be skipped in the round-robin rotation until capacity becomes available again. The algorithm may adjust the distribution if some regions/providers reach their limits. For instance, if out of 3 regions one is full, the instances will cycle through the remaining two regions.
  * If **all possible regions** for a given provider are exhausted for that instance type, and an instance is supposed to go to that provider, the system should fall back to an alternate provider (if available) or return an error if no capacity is available anywhere. This ensures that the request doesn’t partially fail silently – if the system cannot find any region to place a requested instance, it will surface that issue.
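
The skip-when-full rule can be sketched as a small extension of the rotation. Everything named here (`assignWithQuota`, `errNoCapacity`, the shape of the `remaining` map) is an assumption made for illustration; the sketch covers a single provider and returns an error instead of a partial result when capacity runs out.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoCapacity is a hypothetical sentinel; the proposal does not name one.
var errNoCapacity = errors.New("no region has remaining capacity")

// assignWithQuota rotates through regions in order and skips any region whose
// remaining capacity for the requested size is zero. remaining maps
// region -> free slots.
func assignWithQuota(regions []string, remaining map[string]int, count int) ([]string, error) {
	if count <= 0 {
		return nil, nil
	}
	// Work on a copy so a failed plan leaves the caller's counts untouched.
	left := make(map[string]int, len(remaining))
	for r, n := range remaining {
		left[r] = n
	}
	out := make([]string, 0, count)
	next := 0 // index of the region to try first for the next instance
	for len(out) < count {
		placed := false
		for tried := 0; tried < len(regions); tried++ {
			idx := (next + tried) % len(regions)
			if left[regions[idx]] > 0 {
				left[regions[idx]]--
				out = append(out, regions[idx])
				next = (idx + 1) % len(regions)
				placed = true
				break
			}
		}
		if !placed {
			// Every region is full: surface the problem, no partial result.
			return nil, errNoCapacity
		}
	}
	return out, nil
}

func main() {
	regions := []string{"ams3", "nyc3", "sfo3"}
	remaining := map[string]int{"ams3": 2, "nyc3": 0, "sfo3": 5}
	assigned, err := assignWithQuota(regions, remaining, 4)
	fmt.Println(assigned, err) // nyc3 is full, so only ams3 and sfo3 are used
}
```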

* **Examples:**

  1. *Single provider, no region:*
     **Request:**

     ```json
     [
       {
         "owner_id": 1,
         "provider": "do",
         "size": "s-1vcpu-1gb",
         "image": "ubuntu-20-04-x64",
         "project_name": "my-app",
         "ssh_key_name": "default-key",
         "number_of_instances": 3,
         "region": null
       }
     ]
     ```

     **Backend behavior:** Suppose the user’s DigitalOcean account has two regions (`nyc3` and `sfo3`) with capacity for this size. The system will create 3 instances, assigning the first to `nyc3`, the second to `sfo3`, and the third back to `nyc3` (round-robin). The response would show each instance with its assigned region.

  2. *Multiple providers, no region:*
     **Request:**

     ```json
     [
       {
         "owner_id": 1,
         "provider": null,
         "size": "s-1vcpu-1gb",
         "image": "ubuntu-20-04-x64",
         "project_name": "my-app",
         "ssh_key_name": "default-key",
         "number_of_instances": 4
       }
     ]
     ```

     (Here `"provider": null` indicates the user has not specified a provider, meaning any available provider can be used. Alternatively, this could be a new field or just omission of the provider field.)

     **Backend behavior:** If the user has two configured providers, say DO and AWS, the system will allocate 4 instances across them. For example: instance1 → DO (`nyc3`), instance2 → AWS (`us-east-1`), instance3 → DO (`sfo3`), instance4 → AWS (`us-west-2`). Each instance in the response will include the provider and region that was chosen.

  3. *Explicit multiple providers, no regions:*
     **Request:**

     ```json
     [
       {
         "owner_id": 1,
         "provider": "do",
         "size": "s-1vcpu-1gb",
         "image": "ubuntu-20-04-x64",
         "project_name": "my-app",
         "ssh_key_name": "default-key",
         "number_of_instances": 2,
         "region": null
       },
       {
         "owner_id": 1,
         "provider": "aws",
         "size": "t3.small",
         "image": "ami-0620bd5b2e1abc40b",
         "project_name": "my-app",
         "ssh_key_name": "default-key",
         "number_of_instances": 2,
         "region": null
       }
     ]
     ```

     **Backend behavior:** The system will treat the DigitalOcean portion and AWS portion separately. For the DO instances: suppose DO has regions `nyc3` and `sfo3` – one DO instance might go to `nyc3`, the other to `sfo3`. For the AWS instances: suppose AWS has regions `us-east-1` and `us-west-2` configured – one AWS instance in each. The result is 2 instances on DO (balanced across two DO regions) and 2 on AWS (balanced across two AWS regions). The user effectively gets a total of 4 instances across two providers, with the distribution handled automatically per provider group.

In all of the above cases, if the `region` is specified by the user for an instance request, the system will **honor the user’s choice** and **skip the automatic selection** for that instance. The round-robin logic and quota checks only apply when the region (or provider) is not explicitly given.

## Implementation Notes

Implementing this feature will require changes in several parts of the system, as well as careful consideration of how to integrate quota awareness and distribution logic into the instance creation workflow.

* **1. Making `region` Optional:**
  The `types.InstanceRequest` model and validation logic must be updated to mark the `region` field as optional (currently it’s a required field in the API). This means:

  * The API should no longer reject requests that omit the region. Instead, it will treat a missing or `null` region as a signal to auto-assign the region.
  * Documentation and client code will need to reflect that `region` is now optional. (We will detail API changes in the next section.)
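
As a rough illustration of how an optional `region` could be modeled, the sketch below uses pointer fields, which Go's `encoding/json` leaves as `nil` both when the key is missing and when it is `null`. The struct is a trimmed, hypothetical stand-in for `types.InstanceRequest`, and `decode` and `needsAutoRegion` are invented helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// instanceRequest is a trimmed, hypothetical stand-in for
// types.InstanceRequest. Pointer fields stay nil when the JSON key is missing
// or explicitly null.
type instanceRequest struct {
	Provider          *string `json:"provider"`
	Region            *string `json:"region"`
	Size              string  `json:"size"`
	NumberOfInstances int     `json:"number_of_instances"`
}

// decode parses one request object from a JSON string.
func decode(body string) (instanceRequest, error) {
	var req instanceRequest
	err := json.Unmarshal([]byte(body), &req)
	return req, err
}

// needsAutoRegion reports whether the backend should pick the region:
// the field is missing, null, or an empty string.
func needsAutoRegion(r instanceRequest) bool {
	return r.Region == nil || *r.Region == ""
}

func main() {
	bodies := []string{
		`{"provider":"do","size":"s-1vcpu-1gb","number_of_instances":3}`,
		`{"provider":"do","size":"s-1vcpu-1gb","number_of_instances":3,"region":null}`,
		`{"provider":"do","size":"s-1vcpu-1gb","number_of_instances":3,"region":"nyc3"}`,
	}
	for _, b := range bodies {
		req, err := decode(b)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Println(needsAutoRegion(req)) // true, true, false
	}
}
```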

* **2. Quota/Capacity Data Retrieval:**
  Since Talis currently does not have built-in support to query cloud provider quotas or capacities, we need to introduce a way to get this information:

  * *Provider APIs:* Investigate each supported provider’s API for retrieving usage limits. For example, AWS exposes per-region limits through **Service Quotas** (for EC2 On-Demand instances these are vCPU-based limits per instance family rather than counts per instance type), and GCP/Azure have similar quota endpoints. DigitalOcean’s API might not have a per-region quota API (it usually has an account-level droplet limit), so we may rely on known limits (e.g., DO droplet limits) and the number of running instances tracked by Talis.
  * *Caching Layer:* To avoid calling external APIs too frequently (which would slow down instance creation), implement a caching mechanism for quota data. For instance, Talis could fetch the quota limits for each provider’s regions once and store them in memory (with periodic refresh, say every X minutes) or in the database. Similarly, track the current usage (how many instances of each size are currently active per region, according to Talis’s database).
  * *Data Structure:* We might introduce an internal structure like `ProviderCapacity[provider][region][instanceSize] = remainingCapacity` (or a struct with capacity and quota info). This can be built by combining quota limits minus current usage. During a create request, this structure will be consulted to decide where capacity is available.
  * *Integration:* Likely, a new internal module or extension of the existing provider/“hypervisor” interface will handle fetching and caching these quotas. For example, a `QuotaManager` or part of the provider client that can list available regions and their limits.
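
One possible shape for the `ProviderCapacity[provider][region][instanceSize]` structure is a nested map, built by subtracting tracked usage from quota limits. The type name `capacity` and the function `remainingCapacity` are assumptions for this sketch, as are the sample numbers.

```go
package main

import "fmt"

// capacity maps provider -> region -> instance size -> a number of instances.
// The same shape is used for limits, current usage, and remaining capacity.
type capacity map[string]map[string]map[string]int

// remainingCapacity computes limit minus usage for every known
// provider/region/size, clamping at zero so over-quota usage never yields a
// negative count. Missing usage entries read as zero.
func remainingCapacity(limits, usage capacity) capacity {
	out := capacity{}
	for provider, regions := range limits {
		out[provider] = map[string]map[string]int{}
		for region, sizes := range regions {
			out[provider][region] = map[string]int{}
			for size, limit := range sizes {
				left := limit - usage[provider][region][size]
				if left < 0 {
					left = 0
				}
				out[provider][region][size] = left
			}
		}
	}
	return out
}

func main() {
	limits := capacity{"do": {
		"nyc3": {"s-1vcpu-1gb": 5},
		"sfo3": {"s-1vcpu-1gb": 5},
	}}
	usage := capacity{"do": {"nyc3": {"s-1vcpu-1gb": 5}}}

	rem := remainingCapacity(limits, usage)
	fmt.Println(rem["do"]["nyc3"]["s-1vcpu-1gb"], rem["do"]["sfo3"]["s-1vcpu-1gb"]) // 0 5
}
```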

* **3. Region & Provider Selection Logic:**
  Implement the round-robin selection algorithm as described:

  * Determine the list of target providers:

    * If the request’s `provider` is specified and not `null`, use that single provider.
    * If the `provider` is omitted or null (meaning no specific provider was given), gather all providers that the user is authorized to use. This could be derived from the user’s linked cloud accounts or a default set of providers configured for the deployment.
    * If the user explicitly provided multiple providers (e.g., in an array of requests or via a new field), use exactly that set.
  * For each provider in the set, retrieve the list of regions and their available capacities (from the cache/queries above). Filter out any regions that have **zero capacity** for the requested instance size.
  * If no region in a given provider has capacity for the instance size, skip that provider entirely for this request, so that any instance that would have gone to it goes to another provider instead. Similarly, if a provider itself is at some global capacity limit, take that into account during selection.
  * Create an ordered list for providers and, within each provider, an ordered list of regions. The ordering could be by a fixed list or perhaps randomized at the start to avoid always picking the same region first (to prevent bias). However, a simple fixed rotation (alphabetical or the order returned by provider API) is acceptable as a start.
  * As we iterate over the number of instances to create:

    1. Pick the next provider in round-robin order (if more than one provider is being used).
    2. Within that provider, pick the next region in that provider’s round-robin list.
    3. “Allocate” an instance to that provider-region. This means decrementing the internal capacity count for that region (so the next instance will see updated capacity).
    4. Record this choice (provider & region) for the instance.
    5. Move to the next instance and repeat the cycle.
  * By the end, we will have a concrete provider and region assignment for every instance requested. This assignment happens before any actual instance creation tasks are executed.
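
The five steps above can be sketched as a two-level rotation: providers first, then regions inside the chosen provider, with capacity decremented as instances are placed. All names here (`placement`, `providerPool`, `plan`) are hypothetical, and the capacity numbers are made up; a real planner would presumably work on a request-local copy of the cached counts.

```go
package main

import (
	"errors"
	"fmt"
)

// placement is one planned instance: which provider and region it goes to.
type placement struct {
	Provider string
	Region   string
}

// providerPool holds one provider's ordered regions, the remaining capacity
// per region for the requested size, and its round-robin cursor.
type providerPool struct {
	Name      string
	Regions   []string
	Remaining map[string]int
	cursor    int
}

// nextRegion returns the next region with free capacity and decrements that
// capacity, or returns false if every region of the provider is full.
func (p *providerPool) nextRegion() (string, bool) {
	for tried := 0; tried < len(p.Regions); tried++ {
		idx := (p.cursor + tried) % len(p.Regions)
		if r := p.Regions[idx]; p.Remaining[r] > 0 {
			p.Remaining[r]--
			p.cursor = (idx + 1) % len(p.Regions)
			return r, true
		}
	}
	return "", false
}

// plan assigns count instances by rotating over providers, then over regions
// within the chosen provider. Exhausted providers are skipped; if nothing has
// capacity, plan returns an error and no placements.
func plan(pools []*providerPool, count int) ([]placement, error) {
	if count <= 0 {
		return nil, nil
	}
	out := make([]placement, 0, count)
	next := 0
	for len(out) < count {
		placed := false
		for tried := 0; tried < len(pools); tried++ {
			idx := (next + tried) % len(pools)
			if region, ok := pools[idx].nextRegion(); ok {
				out = append(out, placement{Provider: pools[idx].Name, Region: region})
				next = (idx + 1) % len(pools)
				placed = true
				break
			}
		}
		if !placed {
			return nil, errors.New("no provider/region has remaining capacity")
		}
	}
	return out, nil
}

func main() {
	pools := []*providerPool{
		{Name: "do", Regions: []string{"nyc3", "sfo3"}, Remaining: map[string]int{"nyc3": 5, "sfo3": 5}},
		{Name: "aws", Regions: []string{"us-east-1", "us-west-2"}, Remaining: map[string]int{"us-east-1": 5, "us-west-2": 5}},
	}
	placements, err := plan(pools, 4)
	fmt.Println(placements, err)
}
```

Run with the sample pools, this reproduces the DO/AWS alternation from scenario 2: `nyc3`, `us-east-1`, `sfo3`, `us-west-2`.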

* **4. Changes to Task Creation Flow:**
  In the current design, creating instances is an asynchronous operation handled by tasks. We need to integrate the new logic into the task flow:

  * **Synchronous Planning:** The region/provider selection should occur **during the API request processing**, i.e., before tasks are enqueued. This ensures that the decisions are made in one place and can be reflected immediately in the database. For example, when the `CreateInstance` handler receives the request, it will perform the selection algorithm and split the work accordingly.
  * **Instance Records:** Instead of creating a single instance record for a request with multiple instances, the system may now create multiple instance records (one per actual instance to be created) each with a specific region (and possibly provider). For each of these, a task will be created. (If previously `number_of_instances` was handled by a single task looping to create multiple VMs, we might refactor it so that each instance gets its own task, or at least each region grouping gets its own task. Parallel creation is a goal, so likely one task per instance is appropriate.)
  * **Task Distribution:** Each task will carry out an instance creation on a specific provider and region. The TaskExecutor should handle these tasks in parallel as it normally would. There is no fundamental change in how tasks execute, except that tasks now may be targeting different providers/regions as determined by the new logic.
  * **Updating Task Workflow:** We should review any assumptions in the task execution code. For instance, if a task assumed the region came directly from user input, this is no longer true for auto-assigned cases – but by the time the task is actually running, the `Instance` object in the DB will have the region filled in by the planner logic, so the task can use it normally. We must ensure that the transition from “no region” in the request to “specific region” in the task is smooth:

    * The API handler will need to populate the `region` field for each Instance (and similarly `provider` if that was omitted) before saving/queueing tasks.
    * If the logic splits one request into multiple instances, we need to handle naming or identification (auto-generated instance names might incorporate region or provider info for uniqueness or clarity, for example, `instance-abc123-usw2` for an AWS us-west-2 instance).
  * **Example Flow:** Suppose a user requests 5 instances with region omitted on a single provider. The handler might create 5 instance entries (and 5 tasks) instead of one, each entry having the chosen region. These tasks could then execute concurrently, each calling the provider’s API to create one instance in its designated region. This may differ slightly from how bulk requests were handled before, but it aligns with the project’s goal of parallel instance creation.
  * **Error Handling in Tasks:** With the new system, a task could, in rare cases, encounter a region-level capacity error (e.g., if the quota information was outdated and the provider rejects the creation due to quota exceeded). We should plan for this: the task could catch such an error and possibly communicate back to schedule a retry in a different region. However, implementing an automatic fallback at task runtime can be complex (it would mean the task itself would need to know the alternate regions and check capacity again). As a simpler approach, we might choose to fail that task with an error indicating quota exhaustion for that region. The user could then retry the request, which would trigger the planner to pick a different region (assuming the cache is updated). In the future, we could enhance tasks to attempt a fallback region if an API returns a quota error. For now, the focus is on avoiding these situations by doing the upfront checks.
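
The one-task-per-instance flow with the simple "fail the task on quota error" policy could look roughly like the sketch below. It does not use Talis's real task types: `instanceTask`, `runTasks`, `errQuotaExceeded`, and the `create` callback are all placeholders standing in for the TaskExecutor and the provider call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// instanceTask is a hypothetical per-instance unit of work with its
// already-planned provider and region.
type instanceTask struct {
	Provider string
	Region   string
}

// errQuotaExceeded stands in for a provider rejecting a create call because
// the cached quota information was stale.
var errQuotaExceeded = errors.New("quota exceeded")

// runTasks runs one create call per task concurrently and returns the
// per-task result (nil on success) in task order. A quota error only fails
// its own task; no fallback region is attempted.
func runTasks(tasks []instanceTask, create func(instanceTask) error) []error {
	errs := make([]error, len(tasks))
	var wg sync.WaitGroup
	for i, t := range tasks {
		wg.Add(1)
		go func(i int, t instanceTask) {
			defer wg.Done()
			errs[i] = create(t) // each goroutine writes only its own slot
		}(i, t)
	}
	wg.Wait()
	return errs
}

func main() {
	tasks := []instanceTask{{Provider: "do", Region: "nyc3"}, {Provider: "do", Region: "sfo3"}}
	// Fake provider call: sfo3 pretends its quota was already used up.
	create := func(t instanceTask) error {
		if t.Region == "sfo3" {
			return errQuotaExceeded
		}
		return nil
	}
	for i, err := range runTasks(tasks, create) {
		fmt.Printf("%s/%s: %v\n", tasks[i].Provider, tasks[i].Region, err)
	}
}
```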

* **5. Caching and Performance Considerations:**

  * Introducing quota checks and round-trip API calls to providers in the request path could add latency. To mitigate this, caching is critical. For example, when a request comes in, if we have a cached snapshot of quota availability (even if it’s a few minutes old), we can make decisions quickly. We should design the cache to be updated periodically in the background, or on a cache miss, fetch once and reuse for subsequent instance creations.
  * The cache could be invalidated or refreshed when a creation task succeeds (since that reduces available capacity) or when one fails (since that frees the capacity that was reserved for it).
  * We should also consider what happens if multiple instance creation requests occur concurrently. A locking or synchronization mechanism might be needed when updating the cached capacity counts to avoid two requests oversubscribing the same region. Alternatively, each request can optimistically decrement the cached capacity and proceed, relying on the provider to enforce hard limits if a race condition occurs.
  * The system should log the auto-selected provider/region for each instance for audit purposes (so users know where their instances ended up, and for debugging distribution logic).
  * We must also ensure that existing functionality (like specifying a region or provider explicitly) continues to work as before with no regression. The new logic should be bypassed entirely in those cases.
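
For the concurrency concern, one option is to guard the cached counts with a mutex and expose reserve/release operations. This is a sketch under assumptions: `capacityTracker`, its methods, and the `provider/region/size` key format are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// capacityTracker guards cached remaining-capacity counts with a mutex so two
// concurrent requests cannot both claim the last free slot in a region.
type capacityTracker struct {
	mu        sync.Mutex
	remaining map[string]int // key: "provider/region/size" (hypothetical format)
}

func newCapacityTracker(initial map[string]int) *capacityTracker {
	c := &capacityTracker{remaining: make(map[string]int, len(initial))}
	for k, v := range initial {
		c.remaining[k] = v
	}
	return c
}

// Reserve claims one slot for key and reports whether it succeeded.
func (c *capacityTracker) Reserve(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.remaining[key] <= 0 {
		return false
	}
	c.remaining[key]--
	return true
}

// Release returns a slot, for example after a creation task fails.
func (c *capacityTracker) Release(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.remaining[key]++
}

// reserveConcurrently fires n goroutines at one key and counts how many
// reservations were granted.
func reserveConcurrently(c *capacityTracker, key string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	granted := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if c.Reserve(key) {
				mu.Lock()
				granted++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return granted
}

func main() {
	c := newCapacityTracker(map[string]int{"do/nyc3/s-1vcpu-1gb": 3})
	// Ten competing reservations against three free slots.
	fmt.Println(reserveConcurrently(c, "do/nyc3/s-1vcpu-1gb", 10)) // 3
}
```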

* **6. Data Model Impact:**

  * The `instances` database table will continue to store the region and provider for each instance. There’s no schema change required if those fields already exist. We just need to fill them differently when region is not provided by the user.
  * We might consider adding a field or indicator (e.g., a boolean like `auto_assigned`) to mark that the region was chosen by the system. This is not strictly necessary, but could be useful for debugging or future features (for example, to differentiate system-chosen region vs user-specified).
  * No changes are expected for the volumes data model – volumes already default to the instance’s region if none is provided, which aligns perfectly with this feature (the instance will have a region by the time of creation, and volumes will inherit that).

* **7. Architectural Considerations:**

  * The **Hypervisor/Provider abstraction** may need extension. If currently the provider interface only handles direct create/delete calls, we might extend it with methods like `ListRegions(provider)` and `GetQuota(provider, region, instanceSize)` or a combined `GetCapacity(provider, instanceSize)` returning all regions. This keeps provider-specific logic (like how to fetch quotas) encapsulated per provider plugin.
  * The **Task Executor** likely remains unchanged in structure, but we should ensure it can handle a burst of tasks (e.g., if a user requests 100 instances with no region, we might suddenly enqueue 100 tasks, one per instance). The system’s task runner and thread pool should be configured to handle such loads (which is in line with the scalability goal of parallel creation).
  * There may be an impact on how **Projects** or higher-level orchestration treats a group of instances. If some code assumed all instances in one request share the same region (which will no longer be true), it should be revisited. For example, if after creation the system triggers some configuration tasks per project, it should not assume a single region context.
  * **Testing:** This feature will require robust testing:

    * Unit tests for the selection algorithm (given a set of providers/regions and quotas, does it distribute correctly?).
    * Integration tests using mock provider responses to ensure that if one region is at capacity, the next is chosen.
    * End-to-end tests where we simulate a user with multiple providers to see that instances do get created in different providers/regions as expected.
    * Also, tests for backward compatibility: if region *is* specified, the behavior should remain exactly the same as before.
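
A possible shape for the provider extension, together with a mock suitable for the tests listed above, is sketched below. The interface `capacityProvider` and its method set are assumptions loosely modeled on the `ListRegions` and `GetQuota` ideas mentioned earlier; they are not an existing Talis interface.

```go
package main

import "fmt"

// capacityProvider is a hypothetical extension of the provider abstraction.
type capacityProvider interface {
	Name() string
	ListRegions() ([]string, error)
	// GetQuota returns how many more instances of size fit in region.
	GetQuota(region, size string) (int, error)
}

// mockProvider serves canned regions and quotas, as a test double would.
type mockProvider struct {
	name    string
	regions []string
	quota   map[string]int // region -> remaining capacity (size is ignored)
}

func (m mockProvider) Name() string                   { return m.name }
func (m mockProvider) ListRegions() ([]string, error) { return m.regions, nil }
func (m mockProvider) GetQuota(region, size string) (int, error) {
	return m.quota[region], nil
}

// eligibleRegions returns the regions that still have capacity for size, in
// the order the provider lists them.
func eligibleRegions(p capacityProvider, size string) ([]string, error) {
	regions, err := p.ListRegions()
	if err != nil {
		return nil, fmt.Errorf("list regions for %s: %w", p.Name(), err)
	}
	var out []string
	for _, r := range regions {
		left, err := p.GetQuota(r, size)
		if err != nil {
			return nil, fmt.Errorf("quota for %s/%s: %w", p.Name(), r, err)
		}
		if left > 0 {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	do := mockProvider{
		name:    "do",
		regions: []string{"nyc3", "sfo3"},
		quota:   map[string]int{"nyc3": 0, "sfo3": 2},
	}
	fmt.Println(eligibleRegions(do, "s-1vcpu-1gb")) // nyc3 is full, so only sfo3 remains
}
```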

In summary, the implementation will introduce a planning phase in the instance creation flow to decide on regions/providers using quota data, then proceed with task creation as normal. This requires new logic for querying and storing quotas and modifications to request handling, but largely keeps the asynchronous task execution model intact.

## API Changes

**Endpoint:** `POST /api/v1/instances` (Instance creation API)

The primary API change is that the `region` field in the request body will become **optional**:

* **Request Body Change:** The `region` attribute of each `InstanceRequest` object can now be omitted or set to `null`. If omitted, the backend interprets this as "no preference, auto-select region". If provided, the behavior is unchanged (use the specified region).

  * *Before:* Clients had to always specify a region (e.g., `"region": "nyc3"`). If it was missing, the request would be rejected as invalid.
  * *After:* Clients **may omit the region**, for example:

    ```json
    {
      "owner_id": 1,
      "provider": "do",
      "size": "s-1vcpu-1gb",
      "image": "ubuntu-20-04-x64",
      "project_name": "my-app",
      "ssh_key_name": "default-key",
      "number_of_instances": 3
    }
    ```

    This request is considered valid and will trigger the auto-assignment logic described above.

* **Provider Field Consideration:** The `provider` field remains **part of the API input** for now (since the system needs to know at least which cloud or clouds to target), and existing clients can keep sending it exactly as they do today. However, to fully support multi-provider distribution in one request, there are two options:

  * We could allow `provider` to be set to a special value (like `null` or `"any"`) meaning "no particular provider, use any available". In the examples above, we treated a `null` provider as the signal to use all providers the user has. This would be a new interpretation but keeps the field in place. We will update validation to accept `provider: null` as a valid case (only if region is also not specified).
  * Alternatively, we could introduce a new optional field like `"providers": ["do", "aws"]` to explicitly list multiple providers. If present, it would indicate the set of providers to distribute across. In that case, `provider` could be omitted or ignored. This approach is more explicit but would change the API schema. Given that the request already accepts an array of instances, a user can already specify multiple providers by giving multiple objects – so a new field may not be strictly necessary. **For this iteration, we lean towards using the `provider` field as follows**: if a single provider string is given, use it; if `provider` is null/omitted, use all available; if the user wants specific multiple providers, they can include multiple entries or we can consider adding support for an array in future.
  * **Backward compatibility:** Existing clients that always send a provider string will see no change (except they can now omit region if they choose). We will document the `null` provider behavior for advanced usage, but ensure that nothing breaks for clients not aware of it.

* **Response Changes:** The response format (`Instance` data in the `"data"` list) will not structurally change. Each created instance already includes fields for provider and region. The only difference is that when region was omitted in the request, the returned instances will show the region that was chosen by the system. For example, the response might look like:

  ```json
  {
    "slug": "success",
    "error": "",
    "data": [
      {
        "ID": 101,
        "owner_id": 1,
        "project_id": 42,
        "provider": "do",
        "region": "nyc3",
        "size": "s-1vcpu-1gb",
        "status": "pending",
        ... 
      },
      {
        "ID": 102,
        "owner_id": 1,
        "project_id": 42,
        "provider": "do",
        "region": "sfo3",
        "size": "s-1vcpu-1gb",
        "status": "pending",
        ...
      },
      ...
    ]
  }
  ```

  Here, even though the user didn’t specify regions, the response shows `nyc3` and `sfo3` for the instances, confirming the auto-selection. Clients should not expect the returned region to match anything in the request (none was given) and should treat it as the actual deployment location.

* **Error Cases:** We may introduce new error responses for scenarios like “no capacity available”. For instance, a `400 Bad Request` or `503 Service Unavailable` could be returned if the user requests X instances but none of the available providers/regions can fulfill it (e.g., all quotas exhausted or an invalid configuration where no providers are configured). The error message would explain that the system could not find a region/provider with enough capacity for the request. This is a new failure mode to document for the API.

  * Additionally, if the user provides an invalid combination (say `provider: null` together with an explicit region), the API should reject the request as malformed. We’ll add validation rules such as “if region is provided, provider cannot be null”, because a region without a provider is ambiguous (region codes can overlap across clouds).
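
The "region requires a provider" rule could be expressed as a small validation function. The struct and names below (`createRequest`, `validate`, `errRegionWithoutProvider`) are hypothetical, and the sketch covers only this one rule.

```go
package main

import (
	"errors"
	"fmt"
)

// createRequest is a hypothetical, trimmed request shape with optional
// provider and region.
type createRequest struct {
	Provider *string
	Region   *string
}

// errRegionWithoutProvider is an invented error for the malformed combination.
var errRegionWithoutProvider = errors.New("region requires an explicit provider")

// isSet treats nil and the empty string as "not provided".
func isSet(s *string) bool { return s != nil && *s != "" }

// validate rejects a region given without a provider; every other
// combination of the two fields is accepted.
func validate(r createRequest) error {
	if isSet(r.Region) && !isSet(r.Provider) {
		return errRegionWithoutProvider
	}
	return nil
}

// strPtr is a small helper for building requests in examples.
func strPtr(s string) *string { return &s }

func main() {
	fmt.Println(validate(createRequest{Provider: strPtr("do")}))                         // provider only: ok
	fmt.Println(validate(createRequest{}))                                               // neither: ok, auto-select both
	fmt.Println(validate(createRequest{Provider: strPtr("do"), Region: strPtr("nyc3")})) // both: ok
	fmt.Println(validate(createRequest{Region: strPtr("nyc3")}))                         // region only: rejected
}
```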

* **Documentation Updates:** The API documentation will be updated to reflect:

  * `region` is optional. If omitted, the backend will choose a region automatically.
  * Explanation of how automatic region assignment works (at a high level; we might say “the system will choose a region based on availability and load-balancing across regions”).
  * Any new fields or uses of `provider` (for multi-provider distribution) will be documented. If we support `provider: null` to mean auto-select provider, we must clearly state that.
  * Examples will be added to show a request without region, and the expected outcome.

By making these API changes, we empower users to let Talis handle the complexity of region and provider selection. This proposal ensures that the system can intelligently balance instance creation across cloud infrastructure while keeping the API flexible and user-friendly. All changes will be implemented with backward compatibility in mind, so existing workflows (specifying a region and provider explicitly) remain unaffected, but users opting into the new behavior get a powerful new feature for multi-region, multi-cloud deployments.

