Real-world fixes for inspecting distributed clusters across hosts and data centers #18
…cking init, remove cookie from URL

- Replace sequential per-process/port/table erpc calls with batched erpc:send_request/receive_response for parallel data collection
- Add list_processes_with_info, list_ports_with_info, list_ets_tables_bulk FFI functions with 5s bulk timeout
- Make processes, ETS, and ports live component inits non-blocking to avoid mist's 500ms actor init timeout
- Remove cookie parameter from URL/connect widget (security concern)
- Bump dependencies via gleam update (gleam_stdlib 0.65→0.69, etc.)

Usage for remote node inspection:

    cd spectator_app
    ERL_FLAGS="-name spectator@127.0.0.1 -setcookie <cookie>" gleam run
That's super cool to hear, thanks for sharing! These changes sound good to me, I've been meaning to handle the data fetching more gracefully – thanks for your work, I will try to take a closer look when I can. I think not storing the cookie in the URL query parameter is a good idea; I had been thinking about moving both the cookie and target node name to command line arguments passed to spectator when starting it, because I'd generally prefer to avoid creating atoms from user input dynamically – do you think that's something that would work for your use case as well or do you benefit from being able to swap between inspection targets on a running spectator instance?
We run clusters of up to 8 nodes across multiple hosts in three different locations. Deployment is driven by a "cluster configuration" that defines all nodes and where they live. We modified spectator to accept a list of nodes on the command line and populate the connection dropdown for quick switching between them. This integrates with our central CLI tool ( On the feature side, we've added:
That last one is highly specific to our setup, and we expect to keep adding more. For example, we have centralized OpenTelemetry logging that collects into ETS tables; interpreting and surfacing that data in spectator is next on the list. I was happy to find this project as a starting point, especially since it uses Lustre server components for the UI, which is the same approach we use. We've actually taken it a step further: our views are abstracted and loaded dynamically into different panes of a coordinating "webui" cell.
    {some, NodeName} ->
        Pids = erpc:call(NodeName, erlang, processes, [], ?BULK_TIMEOUT),
        Reqs = [{P, erpc:send_request(NodeName, erlang, process_info, [P, Items])}
                || P <- Pids],
I'm curious about the idea behind this part – if I'm reading correctly this would still make N erpc requests for N processes, just in parallel instead of in series.
My thought for optimization would have been to do a single erpc call, with a function that collects all the process info on the remote node rather than locally (could still be done in parallel rather than series by fanning out processes on the node), so the whole thing would only be a single request over the network.
Is that something that you had considered but didn't go with for some specific reason? Am I missing something with that thought?
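For illustration, the single-request idea described above could look roughly like the following Erlang sketch. The module and function names are hypothetical and not part of spectator. Note that a fun defined in a compiled module only runs on a node where that same module is loaded, so this sketch assumes `single_call_sketch` is available on the target node.

```erlang
-module(single_call_sketch).
-export([process_info_bulk/3]).

%% Hypothetical helper, not part of spectator.
%% Collects process info for every process on Node in one erpc round trip.
%% Assumption: this module is also loaded on Node, because the fun below
%% belongs to this module and cannot run without it.
process_info_bulk(Node, Items, Timeout) ->
    erpc:call(Node,
              fun() ->
                  %% Runs on the remote node; process_info/2 returns
                  %% 'undefined' for processes that exited in the meantime.
                  [{P, erlang:process_info(P, Items)}
                   || P <- erlang:processes()]
              end,
              Timeout).
```

Called as `single_call_sketch:process_info_bulk(Node, [memory, message_queue_len], 5000)`, this returns a list of `{Pid, Info}` tuples in a single response.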
It was just easy to do. I believe this does not make multiple connections — the requests all go over the existing distribution channel, and they're processed concurrently on the remote node.
If two nodes are connected together, all their communications will tend to happen over a single TCP connection. Because we generally want to maintain message ordering between two processes (even across the network), messages will be sent sequentially over the connection. (LYSE)
Running it fully on the remote as a single call would be faster, but it's not trivial since you need to get the code onto the remote node using a helper module. We do that for other stuff in our system, but Spectator was just a quick hack to start with. Spectator could use erl_eval to get the code there. But in reality it was already fast enough so we didn't bother with it.
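As a rough sketch of the pipelined pattern discussed here: all requests are sent first without blocking, and the responses are collected afterwards. The module name, function names, and the per-response timeout handling are assumptions for illustration and may differ from the PR's actual FFI code.

```erlang
-module(pipelined_sketch).
-export([process_info_pipelined/3]).

%% Hypothetical helper, not the PR's implementation.
process_info_pipelined(Node, Items, Timeout) ->
    Pids = erpc:call(Node, erlang, processes, [], Timeout),
    %% Send every request up front; erpc:send_request/4 does not block.
    Reqs = [{P, erpc:send_request(Node, erlang, process_info, [P, Items])}
            || P <- Pids],
    %% Collect afterwards. Assumption: Timeout applies to each response
    %% separately, not to the whole batch.
    [{P, collect(Req, Timeout)} || {P, Req} <- Reqs].

%% Treat a timed-out or failed request like a vanished process.
collect(Req, Timeout) ->
    try
        erpc:receive_response(Req, Timeout)
    catch
        _Class:_Reason -> undefined
    end.
```

This needs only stock `erlang` and `erpc` functions on the target node, which is the trade-off mentioned above: no helper code has to be shipped to the remote side.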
Ah interesting. Since network round-trips were mentioned in the comment leading up to this, I was thinking this might still have to make those same round trips, just in parallel, but I have no idea if that's any actual overhead. I'd be interested in benchmarking this, but I suppose it would be tricky to recreate a realistic network environment for such a benchmark.
> you need to get the code onto the remote node using a helper module
I was thinking of an anonymous function rather than going through a module loaded onto the remote, though I also don't know if the overhead from the BEAM automatically serializing that every time is significant.
If the approach you've taken here works well on your infrastructure I'll probably just take it in like that, was just curious regarding the thoughts behind it :)
Well, it is much faster to just call it 10 times in "pseudo" parallel, have the requests "serialized" by the system, and get the results basically when they are done. This skips the RTT wait, because you do not wait for one result before sending the next request. With erl_eval this would not be compiled code but interpreted. I did not test it.
I am fixing something more problematic for us. We have > 100 MB of state data in some actors, and when you click on such a process the state gets fetched every x, which is crazy. I plan to add an update button and an "auto" checkmark, then guesstimate the "danger" from the memory usage of the process. If that is "big" we won't do automatic fetches. I think that I also want only one initial fetch for the others, until the checkmark gets set. If that is not satisfying we may need to add more queries though.
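A minimal sketch of such a guard, assuming a byte threshold chosen by the caller. The module name, function name, and the 5-second timeout are hypothetical. The `memory` item is only a rough estimate, since large binaries live off-heap and are not counted in it.

```erlang
-module(fetch_guard_sketch).
-export([auto_fetch_allowed/3]).

%% Hypothetical check, not part of spectator: allow automatic state
%% fetches only while the process stays at or below MaxBytes.
auto_fetch_allowed(Node, Pid, MaxBytes) ->
    case erpc:call(Node, erlang, process_info, [Pid, memory], 5000) of
        {memory, Bytes} -> Bytes =< MaxBytes;
        %% The process is gone, so there is nothing to refresh.
        undefined -> false
    end.
```

A UI could call this before each refresh tick and fall back to the manual update button whenever it returns `false`.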
That seems like a good idea yeah, even just inspecting the spectator server component process itself is way too heavy since it has a huge state that changes every time
I'm aware of the spawn changes (even if I sometimes lost track of how the Gleam Erlang and OTP modules got changed last year). I postponed thinking about it because I have tracing plans: our cells use a multi-subject protocol, and we want to build more specific introspection for that. But while inspecting stuff I already ended up with quite a lot ">10" message inboxes. Thank you for the heads-up. We will add
…cess message queues
@JonasGruenwald I added the is_otp_check. Is that how you would do it, or do you have a better idea?
That's perfect, I was also planning to just mimic what observer does, the added _bulk and _with_info functions also look good to me.
To merge this I think these things should be done:
- Since the _bulk and _with_info versions of the data fetching functions are quicker, and have replaced the 'regular' ones, I would just replace the regular ones with those and delete the old implementations, so that there are no unused functions in the ffi and api modules
- In spectator.gleam there is still the code that reads out the cookie from the query parameters and sets the node's cookie based on that, I would remove that since this functionality has been removed from the UI
- The docs need to be slightly adapted: in the README, the note about the cookie being stored plainly in the URL should be removed since it's no longer the case (in the production section); perhaps a note that the inspection target must have the same cookie set as spectator could be added. Same for the example/README.md – just a note about the cookie, and the outdated screenshot should be removed (don't really need a new one, can just get rid of it)
Let me know if you don't feel like working on this further, I can also branch off from here and make those changes myself on top of yours.
As you already identified those places it would be nice if you do that.
…andling code, update docs, update dokerfile versions
Thanks for your contribution!
We've been using Spectator to inspect production BEAM clusters with many processes across remote nodes. These are the changes we needed to make it work reliably at that scale. Feel free to cherry-pick, adapt, or ignore as you see fit.
Usage for remote node inspection (short node names will not work for that):