Add worker visibility API - RecordWorkerHeartbeat and ListWorkers#601
Conversation
|
|
||
| // Worker identity, set by the client, may not be unique. | ||
| // Usually host_name+(user group name)+process_id, but can be overwritten by the user. | ||
| string worker_identity = 3; |
There was a problem hiding this comment.
Mentioned on previous PR, this is not specific to host just because it has a default that is, and therefore this should be moved to top-level worker info IMO. I also think the other two fields don't deserve their own message any more than SDK name/version do, but not a strong opinion there.
There was a problem hiding this comment.
i think last time I either missread this or the context was different. I agree, worker_identity shouldn't be here. I will move it out. But I will keep the structure - unlike SDK I feel it will make sense.
There was a problem hiding this comment.
unlike SDK I feel it will make sense
Can you clarify how host name and process ID in its own object makes sense but SDK name and version does not?
There was a problem hiding this comment.
I already did in previous PR - this structure can be extended.
There was a problem hiding this comment.
The SDK structure could be extended too. I am not sure I understand why host uniquely deserves its own message just because it may be extended like any other. It's obviously not a big deal, just a strange inconsistency.
There was a problem hiding this comment.
The probability is different. I think I've shared what I can on this topic, not much to add. Let's call it here.
There was a problem hiding this comment.
I don't think the probability is different. I have noticed you have chosen not to put CPU and memory host info into host info, so I am suspecting you may not put other host info into host info? Also, how sure are you there is no more SDK info we may want in SDK info?
There was a problem hiding this comment.
I hear your concerns, but I'm not convinced the removal is necessary based on the arguments presented. Keeping has value. Let's keep this one for now.
There was a problem hiding this comment.
I don't think this is hugely important either way, but if you're convinced this is worth keeping, I'd agree with Chad that putting the CPU/Mem in there makes sense.
There was a problem hiding this comment.
Mmm.
Pro: it is host related, not worker related. So should be in host struct since we have it.
Con: not all metrics in one place.
Well. Yes, Pro > Con for me in this case, I will move them.
| //* SdkName | ||
| //* SdkVersion | ||
| //* StartTime | ||
| //* LastHeartbeatTime |
There was a problem hiding this comment.
Are there concerns on updating this value so frequently in visibility store?
There was a problem hiding this comment.
we are going to update them (in batch, but still) for every heartbeat. I don't think we can avoid updating this info/skip updates. Question is - will it be indexable? This is something to decide in P2 (I think it shouldn't, but may be users will lead us in different direction).
There was a problem hiding this comment.
👍 My only concern here is whether indexable. Do we want to remove this from the indexable item list here?
There was a problem hiding this comment.
we for sure want to be able to query by "last heartbeat time".
There was a problem hiding this comment.
Do we? Or do we need to query by a certain state that may relate to last heartbeat time (e.g. staleness)?
will it be indexable? [...] I think it shouldn't [...] we for sure want to
I am a bit confused here. Doesn't it have to be indexable to query on it? So if I have 5 million workers, as part of this first version, do we expect users to be able to query for all last heartbeat times before a certain timestamp? How can we defer indexability to a later decision but list it as an indexable attribute here?
There was a problem hiding this comment.
How can we defer indelibility to a later decision
Easily - this is a two-way door decision, and right now it is a comment.
We are going to use "in-memory" approach for MVP.
Id we decide not to index it later - we will remove it from the comment (and from filtering list).
| } | ||
|
|
||
| message ListWorkersResponse { | ||
| repeated temporal.api.worker.v1.WorkerInfo worker_info = 1; |
There was a problem hiding this comment.
Consider making a wrapper message for this instead of returning the same worker info. There may be a time where we want server-set info and we'll regret we had nowhere to put it. Consider something like:
message WorkerDescription {
temporal.api.worker.v1.WorkerInfo info = 1;
}Or WorkerListEntry or anything. Basically some way for server to have room to add server-only things and not assume that the only thing we will ever return from list is the exact thing worker gives us. I expect some server-set info in here one day.
There was a problem hiding this comment.
Can't we just add server-only fields to WorkerInfo and mark them as such?
There's no shortage of tag numbers...
There was a problem hiding this comment.
I think it gets a little confusing to only accept some fields from worker on the same message. If we're willing to make separate messages for trivial things like host info, we should be willing to create separate messages for worker-provided data vs non-worker-provided data. I think of it like spec vs status in k8s. Or like in schedules case, ScheduleListEntry/ScheduleListInfo that wraps ScheduleSpec. I don't think it's very future proof to reuse the spec type as the list type with out some kind of wrapping. A single-field message just for list entries is mostly harmless but lets us grow as needed with the list response items.
There was a problem hiding this comment.
I do expect to have something else (coming from server) in this answer, besides just heartbeat info. Creating separate structure for further extension make sense.
There was a problem hiding this comment.
also will allow us to separate receiving heartbeats from providing this info to the users.
| } | ||
|
|
||
| message ListWorkersResponse { | ||
| repeated temporal.api.worker.v1.WorkerInfo worker_info = 1; |
There was a problem hiding this comment.
Can't we just add server-only fields to WorkerInfo and mark them as such?
There's no shortage of tag numbers...
| // Number of slots used by the worker for specific tasks. | ||
| int32 current_slots_used = 2; |
There was a problem hiding this comment.
slots used by the worker for specific tasks.
@Sushisource - can you confirm that this is intentionally slots marked "used" and not meant to be slots reserved? Meaning you can't really get total slots on available + used (unless you also add active pollers, but even then there are races e.g. with eager activities/workflows).
There was a problem hiding this comment.
Yeah, I think having it be used makes good sense. It's more useful than reserved. I don't think being able to add them together to get the total is really useful in any meaningful way. Remaining available is the useful thing (if known).
Sushisource
left a comment
There was a problem hiding this comment.
Thanks for your work on this. Nothing blocking, but a couple last things worth considering.
|
|
||
| // Worker identity, set by the client, may not be unique. | ||
| // Usually host_name+(user group name)+process_id, but can be overwritten by the user. | ||
| string worker_identity = 3; |
There was a problem hiding this comment.
I don't think this is hugely important either way, but if you're convinced this is worth keeping, I'd agree with Chad that putting the CPU/Mem in there makes sense.
<!-- Describe what has changed in this PR --> **What changed?** What changed? * 2 new APIs added: * WorkerHeartbeat * ListWorkerStatus * WorkerInfo information added to: * ShutDownWorker request * Poll(Activity|Nexus|Workflow)TaskQueue requests This is "for merge". Here is a previous PR, where we discuss this at length: #599 <!-- Tell your future self why have you made these changes --> **Why?** Part of the worker visibility work. Workers should be able to report their status. Users should be able to get information about workers. <!-- Are there any breaking changes on binary or code level? --> **Breaking changes** Yes - new API is introduce. <!-- If this breaks the Server, please provide the Server PR to merge right after this PR was merged. --> **Server PR** [Server PR](temporalio/temporal#7870)
What changed?
What changed?
This is "for merge".
Here is a previous PR, where we discuss this at length: #599
Why?
Part of the worker visibility work. Workers should be able to report their status.
Users should be able to get information about workers.
Breaking changes
Yes - new API is introduce.
Server PR
Server PR