UpdateTaskQueueConfig api changes#604
Conversation
SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted._ <!-- Describe what has changed in this PR --> **What changed?** + Boolean field added to DescribeTaskQueue request to control sending task queue config in the response. + New messages for RateLimit and TaskQueueConfig + UpdateTaskQueueConfig api implementation: Created a UpdateTaskQueueConfigRequest and UpdateTaskQueueConfigResponse. <!-- Tell your future self why have you made these changes --> **Why?** Server needs an API endpoint for update TaskQueue RateLimits. <!-- Are there any breaking changes on binary or code level? --> **Breaking changes** Yes. New Api's are breaking changes <!-- If this breaks the Server, please provide the Server PR to merge right after this PR was merged. --> **Server PR** None at the moment.
…o TaskQueueConfigApi
| bool report_task_reachability = 10 [deprecated = true]; | ||
|
|
||
| // Show Task Queue Config | ||
| bool show_task_queue_config = 11; |
There was a problem hiding this comment.
Can you confirm that task queue config is expensive to fetch therefore justifying it having to be opt-in?
There was a problem hiding this comment.
It's not expensive to fetch (it should be kept in memory always), but it could be a little large.. do we want selective returning in that case? I'm okay with always returning it to simplify things.
There was a problem hiding this comment.
Same, I have no strong opinion. I wish describe were infrequent, but with stats on it I suspect it may be a bit frequent. I am less concerned with data size than performance. Up to y'all, just wanted to toss this out there.
There was a problem hiding this comment.
How are we translating these opt-in bools in the sdk/cli so far? Is it annoying to add more? (for us or the user)
Size/performance: in some sense size affects performance since we have to serialize and send it. But we're not going to do an extra database read or rpc to get this data, fwiw
There was a problem hiding this comment.
How are we translating these opt-in bools in the sdk/cli so far? Is it annoying to add more?
We haven't really for any of this newer stuff. Unfortunately these APIs are being designed without cross-SDK design, but it should be just trivial booleans there too. So Go does have an old form at https://pkg.go.dev/go.temporal.io/sdk/internal#DescribeTaskQueueEnhancedOptions but I think I heard "enhanced" is being deprecated before it ever stabilized.
Is it annoying to add more?
Not really. Just opt-in bools that affect whether some data is in the response I think.
| string identity = 2; | ||
|
|
||
| // Selects the task queue to update. | ||
| string task_queue_name = 3; |
There was a problem hiding this comment.
In DescribeTaskQueue we accept a full TaskQueue type, would there be value in doing so here? Or maybe it was a mistake to do so there?
There was a problem hiding this comment.
There were a lot of historical mistakes... the main one was not to put type in TaskQueue. Kind is a mostly useless value and all these config/versioning/deployment apis don't work on sticky queues so there's not really any reason to use TaskQueue there (except on polls).
There is kind of a reason to take TaskQueue on Describe though: Describe will cause the tq to be loaded if it's not loaded, and when we load it we should have kind available to switch some behavior. And you can Describe a sticky queue. But you can't configure it, so the plain string here makes sense, even if it feels inconsistent.
(Arguably we shouldn't have had a kind enum either and gotten the sticky bit from somewhere higher level)
| // Unless modified, this is the system-defined rate limit. | ||
| RateLimitConfig queue_rate_limit = 3; | ||
| RateLimitConfig fairness_keys_rate_limit_default = 4; |
There was a problem hiding this comment.
To confirm, these two fields are never null?
There was a problem hiding this comment.
I think we'll have queue_rate_limit returned with some value no matter what, but the fairness key limit can be missing.
| message RateLimit { | ||
|
|
||
| // Zero is a valid rate limit. | ||
| float requests_per_second = 1; | ||
|
|
||
| } | ||
|
|
||
| message ConfigMetadata { |
There was a problem hiding this comment.
Seems like both of these should just be inlined into RateLimitConfig, esp. since they relate to each other.
There was a problem hiding this comment.
There’s a use case where omitting the rate_limit field signals that we want to reset to the worker defaults. By not inlining metadata within RateLimit, we retain the flexibility to provide context while intentionally unsetting the limit. This avoids ambiguity—since 0 is a valid value for requests_per_second.
| // Can be "system", "worker" or "api". | ||
| string source = 2; |
There was a problem hiding this comment.
Should be an enum if it has known fixed values
| // This map contains Task Queue information for each Build ID. Empty string as key value means unversioned. | ||
| map<string, temporal.api.taskqueue.v1.TaskQueueVersionInfo> versions_info = 3 [deprecated = true]; | ||
|
|
||
| temporal.api.taskqueue.v1.TaskQueueConfig configs = 6; |
There was a problem hiding this comment.
Plural would imply repeated here, and it seems like it should be? Otherwise how are we getting different configs for different task_queue_types?
There was a problem hiding this comment.
In the DescribeTaskQueue we do accept a TaskQueueType and if nothing is provided we default to the TASK_QUEUE_TYPE_WORKFLOW. So at any point of time we will only be returning a config corresponding to a TaskQueueType so it will be singular itself. I have fixed this. We have depricated the enhanced version of the api so these config changes will not be affecting that flow.
| }; | ||
| } | ||
|
|
||
| // Update taskqueue config and uncouple taskqueue settings from the worker lifecycle |
There was a problem hiding this comment.
This comment reads oddly. "...and uncouple" implies that the uncoupling happens as part of the update. Is that true?
IE: Does calling this mean that if, previously, workers had been setting a ratelimit, any future attempts by workers to do so will no longer be respected? If so, that should be spelled out in more detail here.
There was a problem hiding this comment.
Yes, a rate limit set by this api overrides the worker-set rate limit. If it's unset, then the worker-set limit takes effect again.
There was a problem hiding this comment.
OK, yeah, comment needs way more detail about that
There was a problem hiding this comment.
Updated the comment.
| message UpdateQueueRateLimit { | ||
| // Rate limit to apply to task queue. | ||
| // It must be less than or equal to the system-level setting at the time | ||
| // of the request. If null, removes the existing rate limit. | ||
| RateLimitUpdate rate_limit = 1; | ||
| } | ||
|
|
||
| message UpdateFairnessKeyRateLimitDefault { | ||
| // Default rate limit to apply for all fairness keys. | ||
| // If null, removes the existing rate limit. | ||
| RateLimitUpdate rate_limit = 1; | ||
| } |
There was a problem hiding this comment.
These also seem like they could be inlined unless we expect more things to go in here besides the rate limit
There was a problem hiding this comment.
The catch is that there needs to be a way to unset the rate limit (either one or both). With this schema you can do that by providing an empty UpdateQueueRateLimit or UpdateFairnessKeyRateLimitDefault. If they're inlined you need an extra bool unset in RateLimitUpdate. Which is a valid way to do it if you prefer that
New apis and messages are not breaking changes, let's not be unnecessarily scary here |
| bytes next_page_token = 2; | ||
| } | ||
|
|
||
| message UpdateTaskQueueConfigRequest { |
There was a problem hiding this comment.
Do we need some kind of conflict-token approach to ensure updates are not clashing?
There was a problem hiding this comment.
Hmm, good question. My initial inclination is no/add it later: for versioning/deployment stuff, the state is more complex and you might make changes that depend (implicitly) on other parts of the state. Here, we have a bag of LWW fields. It doesn't feel necessary. But if we want to add it later we can.
+ Only the config data set from the api is persisted the default configs w.r.t to workers are in memory. + Source need not be persisted in this case.
cretz
left a comment
There was a problem hiding this comment.
Nothing blocking from my POV
| // Only populated if show_task_queue_config is set to true. | ||
| temporal.api.taskqueue.v1.TaskQueueConfig config = 6; | ||
|
|
||
| // Source of the update |
There was a problem hiding this comment.
What is "the update"?
The source is a property of the effective rate limit, and there's no field here that has the effective rate limit. If you want to add one now, maybe we should add a submessage of DescribeTaskQueueResponse that has a taskqueue.RateLimit and a source?
There was a problem hiding this comment.
Added a new EffectiveRateLimit message and introduce a float to store the effective rate limit and the source of the update.
| // If the overall queue rate limit is unset, the worker-set rate limit takes effect. | ||
| rpc UpdateTaskQueueConfig (UpdateTaskQueueConfigRequest) returns (UpdateTaskQueueConfigResponse) { | ||
| option (google.api.http) = { | ||
| post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config" |
There was a problem hiding this comment.
Hmm, just looking at this now: how do we get the type in this url? it has just the name. Same question for DescribeTaskQueue, actually.
There was a problem hiding this comment.
We don't get the type in this url. In case of describeTaskQueue we pass the task queue type as a query parameter and in case if the UpdateTaskQueueConfig api we will get the TaskQueueType as part of the request body. Is there any reason why we would want it as part of the url?
There was a problem hiding this comment.
Updated the url and the additional binding to add the {task_queue_type} as well in the url to be indicative of the exact resource that is being updated as part of the POST request.
| message EffectiveRateLimit { | ||
| // The effective rate limit for the task queue. | ||
| float requests_per_second = 1; | ||
|
|
||
| // Source of the RateLimit Configuration,which can be one of the following values: | ||
| // - SOURCE_API: The rate limit that is set via the TaskQueueConfig api. | ||
| // - SOURCE_WORKER: The rate limit is the value set using the workerOptions in TaskQueueActivitiesPerSecond. | ||
| // - SOURCE_SYSTEM: The rate limit is the default value set by the system | ||
| temporal.api.enums.v1.RateLimitSource rate_limit_source = 2; | ||
| } | ||
|
|
||
| EffectiveRateLimit effective_rate_limit = 7; |
There was a problem hiding this comment.
Any reason why this wouldn't be considered stats or config?
There was a problem hiding this comment.
For historical reasons it is kind of a different thing:
I want "config" to be a more resource-oriented value: you set/unset it, you get back what you set/unset.
But because we previously had a rate limit passed from the workers, there can be a rate limit that is set, but not part of the "config". We want to report that for informative purposes and to avoid confusion, but not in the config.
I also don't think it's "stats", it has nothing to do with the actual flow of tasks, historical or present. It's just a sort of runtime piece of data that comes from one of those three places and one of them is in effect at the moment.
Even if it was sorta "stats", it shouldn't go in temporal.api.taskqueue.v1.TaskQueueStats because that's a per-physical-queue thing (which can be aggregated) and we are/will be using that message to report per-priority-level stats, but the rate limit applies to the whole queue, above priority levels.
There was a problem hiding this comment.
It just came off a bit inconsistent because we have these other two messages that you have to opt in to viewing, are effectively-single-use messages in the taskqueue package, etc. We just a bit strange. Also, I assume that unlike the config which you opt in to, this we have decided we will always return no matter what because it won't be large and config will?
There was a problem hiding this comment.
Yeah, there's no way to do it that's not awkward somewhere: either config has something that's not a config (or at least has a very different data flow from the rest of config), or stats has something that's not a stat, or we have this unusual "effective rate limit" thing. And yeah, config may be large, stats needs to be aggregated across partitions, but effective rate limit is a single value we can just pull from memory.
There was a problem hiding this comment.
👍 We will probably be able to make it look prettier in SDKs if/when we make high level forms of this. Can you confirm these new fields are set regardless of "mode"?
dnr
left a comment
There was a problem hiding this comment.
I'm good with this now, maybe just that one rename
| RateLimit rate_limit = 1; | ||
| ConfigMetadata metadata = 2; |
There was a problem hiding this comment.
It is all subjective and can be case-by-case. Personally I'm much more in favor of message reuse than we do. I think a RequestMetadata message would be great! There's lots of benefits from that.
And here, I do see some benefits from a common ConfigMetadata. But of course the SDK API can inline everything if that's how you want to do it in SDKs. Why does ConfigMetadata show up in intellisense when working with SDKs? It's a proto message in a different package/module.
| message EffectiveRateLimit { | ||
| // The effective rate limit for the task queue. | ||
| float requests_per_second = 1; | ||
|
|
||
| // Source of the RateLimit Configuration,which can be one of the following values: | ||
| // - SOURCE_API: The rate limit that is set via the TaskQueueConfig api. | ||
| // - SOURCE_WORKER: The rate limit is the value set using the workerOptions in TaskQueueActivitiesPerSecond. | ||
| // - SOURCE_SYSTEM: The rate limit is the default value set by the system | ||
| temporal.api.enums.v1.RateLimitSource rate_limit_source = 2; | ||
| } | ||
|
|
||
| EffectiveRateLimit effective_rate_limit = 7; |
There was a problem hiding this comment.
For historical reasons it is kind of a different thing:
I want "config" to be a more resource-oriented value: you set/unset it, you get back what you set/unset.
But because we previously had a rate limit passed from the workers, there can be a rate limit that is set, but not part of the "config". We want to report that for informative purposes and to avoid confusion, but not in the config.
I also don't think it's "stats", it has nothing to do with the actual flow of tasks, historical or present. It's just a sort of runtime piece of data that comes from one of those three places and one of them is in effect at the moment.
Even if it was sorta "stats", it shouldn't go in temporal.api.taskqueue.v1.TaskQueueStats because that's a per-physical-queue thing (which can be aggregated) and we are/will be using that message to report per-priority-level stats, but the rate limit applies to the whole queue, above priority levels.
| // If the overall queue rate limit is unset, the worker-set rate limit takes effect. | ||
| rpc UpdateTaskQueueConfig (UpdateTaskQueueConfigRequest) returns (UpdateTaskQueueConfigResponse) { | ||
| option (google.api.http) = { | ||
| post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config/{task_queue_type}" |
There was a problem hiding this comment.
Btw @cretz did you have an opinion on task queue type in URLs? In my mind it should clearly be in the URL since it's identifying the resource being modified. But why didn't we do it for DescribeTaskQueue?
There was a problem hiding this comment.
I think we should remove this from the URL (sorry I didn't see this before).
| post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config/{task_queue_type}" | |
| post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config" |
Traditionally the URLs have just been named qualifiers and if there are other qualifiers they are via body/query, even required ones (though in describe's case it defaults to workflow IIUC). Basically the URL alone isn't all things needed to qualify what resource your affecting, it's just a name for the API call on the resource (same for run ID on workflow calls), recognizing the URL may not represent a single resource like it would if we were more RESTful. If we had a single string that uniquely identified a task queue and type, that'd be worth using instead, but it's awkward to pack tuples into URLs.
GET https://temporal-api/namespaces/my-namespace/task-queues/my-task-queue?task_queue_type=TASK_QUEUE_TYPE_ACTIVITY works for describe, and here you'll have to use POST https://temporal-api/namespaces/my-namespace/task-queues/my-task-queue/update-config with the full body including type and other things. This is why it makes sense for this to be POST /namespaces/{namespace}/task-queues/{task_queue_name}/update-config instead of POST /namespaces/{namespace}/task-queues/{task_queue_name}, because we are not truly RESTful.
There was a problem hiding this comment.
It doesn't really make sense to me to have the namespace name and task queue name in one place and the type in another. You need all three to identify the task queue. But I also don't care, so I'm fine with removing the type here.
There was a problem hiding this comment.
Our URLs are not about uniquely identifying something because they can't do so well with our non-single-key resource design (same for workflows, activities, workers, etc). This is especially true with operations that sometimes have optional items to make up the key (such as some task queue and workflow calls). We just need a common base for operations on the general thing from an HTTP API simplicity POV.
There was a problem hiding this comment.
removed the {task_queue_type} from the url as suggested.
cretz
left a comment
There was a problem hiding this comment.
Only slight request to remove the enumerate from the task queue URL for consistency, otherwise LGTM
| string namespace = 1; | ||
| string identity = 2; | ||
| // Selects the task queue to update. | ||
| string task_queue = 3; |
There was a problem hiding this comment.
Updated the task_queue_name to task_queue as the routing magic login in temporal server needs this field to be named task_queue to get picked up automatically.
There was a problem hiding this comment.
Great catch! I am so happy we wrote (some of?) the impl server-side before merging this so we didn't have an API release with this other field name present.
| // If the overall queue rate limit is unset, the worker-set rate limit takes effect. | ||
| rpc UpdateTaskQueueConfig (UpdateTaskQueueConfigRequest) returns (UpdateTaskQueueConfigResponse) { | ||
| option (google.api.http) = { | ||
| post: "/namespaces/{namespace}/task-queues/{task_queue}/update-config" |
There was a problem hiding this comment.
@cretz had to update the task_queue_name mapping to task_queue as some routing logic in the server repo relies on this field being named task_queue. Hence updating this api gateway bindings as well.
_**READ BEFORE MERGING:** All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted._ <!-- Describe what has changed in this PR --> **What changed?** + Boolean field added to DescribeTaskQueue request to control sending task queue config in the response. + New messages for RateLimit and TaskQueueConfig + UpdateTaskQueueConfig api implementation: Created a UpdateTaskQueueConfigRequest and UpdateTaskQueueConfigResponse. <!-- Tell your future self why have you made these changes --> **Why?** + Server needs an API endpoint for update TaskQueue RateLimits. <!-- Are there any breaking changes on binary or code level? --> **Breaking changes** + No. <!-- If this breaks the Server, please provide the Server PR to merge right after this PR was merged. --> **Server PR** + PR for persistence of TaskQueueConfigs : temporalio/temporal#8004
READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.
What changed?
Why?
Breaking changes
Server PR