Real-world fixes for inspecting distributed clusters across hosts and data centers by oderwat · Pull Request #18 · JonasGruenwald/spectator

oderwat · 2026-02-26T02:54:16Z

We've been using Spectator to inspect production BEAM clusters with many processes across remote nodes. These are the changes we needed to make it work reliably at that scale. Feel free to cherry-pick, adapt, or ignore as you see fit.

- Replace sequential per-process/port/table erpc calls with batched erpc:send_request/receive_response for parallel data collection
- Add list_processes_with_info, list_ports_with_info, list_ets_tables_bulk FFI functions with 5s bulk timeout
- Make processes, ETS, and ports live component inits non-blocking to avoid mist's 500ms actor init timeout
- Remove cookie parameter from URL/connect widget (security concern)
- Bump dependencies via gleam update (gleam_stdlib 0.65→0.69, etc.)
- Changed the connection

Usage for remote node inspection (short node names will not work for that):

cd spectator_app
ERL_FLAGS="-name spectator@127.0.0.1 -setcookie <cookie>" gleam run

JonasGruenwald · 2026-02-27T17:09:09Z

> We've been using Spectator to inspect production BEAM clusters with many processes across remote nodes

That's super cool to hear, thanks for sharing!

These changes sound good to me, I've been meaning to handle the data fetching more gracefully – thanks for your work, I will try to take a closer look when I can.

I think not storing the cookie in the URL query parameter is a good idea; I had been thinking about moving both the cookie and target node name to command line arguments passed to spectator when starting it, because I'd generally prefer to avoid creating atoms from user input dynamically – do you think that's something that would work for your use case as well or do you benefit from being able to swap between inspection targets on a running spectator instance?

oderwat · 2026-02-27T18:10:20Z

> swap between inspection targets on a running spectator instance

We run clusters of up to 8 nodes across multiple hosts in three different locations. Deployment is driven by a "cluster configuration" that defines all nodes and where they live. We modified spectator to accept a list of nodes on the command line and populate the connection dropdown for quick switching between them. This integrates with our central CLI tool (cx), so inspecting a cluster is just:

cx -c <cluster_name> spectator

On the feature side, we've added:

- Reductions per second with a 10-second sliding window
- Basic Syn (process registry) support
- A cell name column in the process list. "Cells" are independently loadable Gleam projects that we hot-reload into running nodes. Since cells can spawn multiple processes and run as multi-instance, sorting by cell name gives an immediate overview of all the running parts.

That last one is highly specific to our setup, and we expect to keep adding more. For example, we have centralized OpenTelemetry logging that collects into ETS tables — interpreting and surfacing that data in spectator is next on the list.

I was happy to find this project as a starting point, especially since it uses Lustre server components for the UI — which is the same approach we use. We've actually taken it a step further: our views are abstracted and loaded dynamically into different panes of a coordinating "webui" cell.

JonasGruenwald · 2026-02-27T21:41:16Z

+            {some, NodeName} ->
+                Pids = erpc:call(NodeName, erlang, processes, [], ?BULK_TIMEOUT),
+                Reqs = [{P, erpc:send_request(NodeName, erlang, process_info, [P, Items])}
+                        || P <- Pids],


I'm curious about the idea behind this part – if I'm reading correctly this would still make N erpc requests for N processes, just in parallel instead of in series.

My thought for optimization would have been to do a single erpc call, with a function that collects all the process info on the remote node rather than locally (could still be done in parallel rather than series by fanning out processes on the node), so the whole thing would only be a single request over the network.
Is that something that you had considered but didn't go with for some specific reason? Am I missing something with that thought?

oderwat · 2026-02-28

It was just easy to do. I believe this does not make multiple connections: the requests all go over the existing distribution channel, and they're processed concurrently on the remote node.

> If two nodes are connected together, all their communications will tend to happen over a single TCP connection. Because we generally want to maintain message ordering between two processes (even across the network), messages will be sent sequentially over the connection. (LYSE)

Running it fully on the remote as a single call would be faster, but it's not trivial since you need to get the code onto the remote node using a helper module. We do that for other stuff in our system, but Spectator was just a quick hack to start with. Spectator could use erl_eval to get the code there. But in reality it was already fast enough so we didn't bother with it.

JonasGruenwald · 2026-02-28

Ah interesting. Since network round-trips were mentioned in the comment leading up to this, I was thinking this might still have to make those same round trips, just in parallel, but I have no idea if that's any actual overhead. I'd be interested in benchmarking this, but I suppose it would be tricky to recreate a realistic network environment for such a benchmark.

> you need to get the code onto the remote node using a helper module

I was thinking of an anonymous function rather than going through a module loaded onto the remote, though I also don't know if the overhead from the BEAM automatically serializing that every time is significant.

If the approach you've taken here works well on your infrastructure I'll probably just take it in like that, was just curious regarding the thoughts behind it :)
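For illustration, a minimal sketch of the single-call variant discussed above. This is not code from the PR; the module and function names are hypothetical. It assumes OTP 23 or later (for `erpc`) and collects all process info inside one fun that runs on the target node, so only one request and one response cross the network.

```erlang
-module(single_call_sketch).
-export([processes_with_info/3]).

%% Hypothetical helper, not part of Spectator's FFI.
%% Runs the whole collection loop on Node and returns [{Pid, InfoList}].
processes_with_info(Node, Items, Timeout) ->
    Collect = fun() ->
        lists:filtermap(
            fun(Pid) ->
                case erlang:process_info(Pid, Items) of
                    undefined -> false;            % process exited meanwhile
                    Info -> {true, {Pid, Info}}
                end
            end,
            erlang:processes())
    end,
    %% Caveat: a fun created in a compiled module can only be executed on
    %% Node if the same version of this module is loaded there as well.
    erpc:call(Node, Collect, Timeout).
```

The caveat in the last comment is the practical cost of this variant: an anonymous function defined in compiled code still depends on its defining module being present on the remote node, which is the "helper module" problem raised earlier in the thread.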

oderwat · 2026-02-28

Well, it is much faster to just call it 10 times in "pseudo" parallel, let the system "serialize" the requests, and get the results basically as they complete. This skips the RTT, because you do not wait for one result before sending the next request. With erl_eval, the code would be interpreted rather than compiled. I did not test it.
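The pipelining idea can be sketched roughly as follows. This is an illustrative reconstruction around the diff fragment quoted at the top of this thread, not the merged implementation; the module name and the way failures are dropped are assumptions. All requests are sent first, and responses are collected afterwards, so the total wait is close to one round trip plus processing time rather than one round trip per process.

```erlang
-module(bulk_info_sketch).
-export([processes_with_info/3]).

%% Hypothetical helper modelled on the quoted diff.
%% Returns [{Pid, InfoList}] for every process that answered in time.
processes_with_info(Node, Items, Timeout) ->
    Pids = erpc:call(Node, erlang, processes, [], Timeout),
    %% Phase 1: send every request without waiting for any reply.
    Reqs = [{P, erpc:send_request(Node, erlang, process_info, [P, Items])}
            || P <- Pids],
    %% Phase 2: collect replies; most have already arrived by now.
    lists:filtermap(fun(Req) -> collect(Req, Timeout) end, Reqs).

%% Note: Timeout applies to each receive_response call separately here,
%% so it is not a hard upper bound for the whole batch.
collect({Pid, ReqId}, Timeout) ->
    try erpc:receive_response(ReqId, Timeout) of
        undefined -> false;                 % process exited meanwhile
        Info -> {true, {Pid, Info}}
    catch
        _:_ -> false                        % timeout or remote error
    end.
```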

I am fixing something more problematic for us. We have > 100 MB of state data in some actors, and when you click on such a process its state gets fetched every x, which is crazy. I plan to add an update button and an "auto" checkbox, then guesstimate the "danger" from the memory usage of the process. If that is "big", we won't do automatic fetches. I think I also want only one initial fetch for the others, until the checkbox is set. If that is not satisfying we may need to add more queries, though.
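A rough sketch of such a memory-based guard, under stated assumptions: the module name, function name, and threshold parameter are hypothetical, and the check uses `process_info(Pid, memory)`, which works for local pids only (a remote pid would need the call to run on its node, for example via `erpc:call/5`). The `memory` item does not count large off-heap binaries, so it is only an estimate of how expensive a state fetch will be.

```erlang
-module(state_fetch_guard_sketch).
-export([auto_fetch_allowed/2]).

%% Hypothetical policy helper: allow automatic state refreshes only for
%% processes whose reported memory is at most MaxBytes.
auto_fetch_allowed(Pid, MaxBytes) ->
    case erlang:process_info(Pid, memory) of
        {memory, Bytes} -> Bytes =< MaxBytes;
        undefined -> false                  % process is gone
    end.
```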

JonasGruenwald · 2026-02-28

That seems like a good idea, yeah. Even just inspecting the spectator server component process itself is way too heavy, since it has a huge state that changes every time.

JonasGruenwald · 2026-02-28T11:58:17Z

@oderwat Also just want to note, since you are using your fork in production: there is one other smaller thing I see as an issue for production use currently, and it's something that should now be fixable since v1 of gleam_erlang: #19. I'll try to take a look at that at some point.

oderwat · 2026-02-28T15:06:56Z

> @oderwat Also just want to note since you are using your fork in production, there is one other smaller thing I see as an issue for production use currently, and it's something that should now be fixable since v1 of gleam_erlang: #19 – I'll try take a look at that at some point.

I'm aware of the spawn changes (even if I sometimes lost track of how the Gleam Erlang and OTP modules changed last year). I postponed thinking about it because I have tracing plans: our cells use a multi-subject protocol, and we want to build more specific introspection for that. But while inspecting stuff I already ended up with quite a few message inboxes holding more than 10 messages. Thank you for the heads-up. We will add an is_otp_process check as a guard before sending system messages. The guards are still useful to avoid stepping on the toes of busy OTP processes. I may add a commit that implements that after updating our fork and testing that it works.
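One way such a guard might look, as a sketch only: the module and function names are hypothetical, and the actual check in the PR may differ. It relies on `proc_lib:initial_call/1`, which returns `false` for processes that were not started through `proc_lib`, and only then sends the system message via `sys:get_status/2`.

```erlang
-module(otp_guard_sketch).
-export([is_otp_process/1, safe_get_status/2]).

%% A process started via proc_lib (gen_server, supervisor, ...) carries
%% initial-call information; a plain spawn/1 process does not.
is_otp_process(Pid) ->
    proc_lib:initial_call(Pid) =/= false.

%% Only send the system message when the target looks like an OTP process,
%% so plain processes do not accumulate unhandled messages in their inbox.
safe_get_status(Pid, Timeout) ->
    case is_otp_process(Pid) of
        false ->
            {error, not_otp_process};
        true ->
            try {ok, sys:get_status(Pid, Timeout)}
            catch _:Reason -> {error, Reason}
            end
    end.
```

This is a heuristic: a process started through `proc_lib` that never handles system messages would pass the check and still receive the request, which is why the timeout is kept.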

oderwat · 2026-02-28T18:00:43Z

@JonasGruenwald I added the is_otp_check. Is that how you would do it, or do you have a better idea?

JonasGruenwald

That's perfect, I was also planning to just mimic what observer does, the added _bulk and _with_info functions also look good to me.

To merge this I think these things should be done:

- Since the _bulk and _with_info versions of the data fetching functions are quicker and are now the ones in use, I would just replace the 'regular' ones with them and delete the old implementations, so that there are no unused functions in the ffi and api modules.
- In spectator.gleam there is still the code that reads the cookie from the query parameters and sets the node's cookie based on that. I would remove it, since this functionality has been removed from the UI.
- The docs need to be adapted slightly: in the README (production section), the note about the cookie being stored plainly in the URL should be removed, since that's no longer the case. Perhaps a note could be added that the inspection target must have the same cookie set as spectator.
- Same for example/README.md: just a note about the cookie, and the outdated screenshot should be removed (we don't really need a new one, can just get rid of it).

Let me know if you don't feel like working on this further, I can also branch off from here and make those changes myself on top of yours.

oderwat · 2026-02-28T21:18:21Z

Since you've already identified those places, it would be nice if you did that.

JonasGruenwald · 2026-02-28T21:48:47Z

Thanks for your contribution!

oderwat mentioned this pull request Feb 26, 2026

WebSocket "Invalid frame header" when inspecting remote nodes #17

Closed

Set Gleam version to 1.14

9922c9d

JonasGruenwald reviewed Feb 27, 2026

Guard sys:get_status with proc_lib check to avoid filling non-OTP process message queues

7e54821

JonasGruenwald requested changes Feb 28, 2026

remove unused data fetching functions, adjust naming, remove cookie handling code, update docs, update dokerfile versions

a373135

JonasGruenwald merged commit f49474c into JonasGruenwald:main Feb 28, 2026

JonasGruenwald mentioned this pull request Feb 28, 2026

Re-evaluate sending system messages to all processes #19

Closed
