Real-world fixes for inspecting distributed clusters across hosts and data centers #18

Merged: JonasGruenwald merged 4 commits into JonasGruenwald:main from oderwat:main on Feb 28, 2026


@oderwat (Contributor) commented Feb 26, 2026

We've been using Spectator to inspect production BEAM clusters with many processes across remote nodes. These are the changes we needed to make it work reliably at that scale. Feel free to cherry-pick, adapt, or ignore as you see fit.

  • Replace sequential per-process/port/table erpc calls with batched erpc:send_request/receive_response for parallel data collection
  • Add list_processes_with_info, list_ports_with_info, list_ets_tables_bulk FFI functions with 5s bulk timeout
  • Make processes, ETS, and ports live component inits non-blocking to avoid mist's 500ms actor init timeout
  • Remove cookie parameter from URL/connect widget (security concern)
  • Bump dependencies via gleam update (gleam_stdlib 0.65→0.69, etc.)
  • Changed the connection

Usage for remote node inspection (short node names will not work for that):

cd spectator_app
ERL_FLAGS="-name spectator@127.0.0.1 -setcookie <cookie>" gleam run
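
The batching described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the module name `bulk_info`, the function name `processes_with_info/3`, and the per-response use of the timeout are assumptions. Only `erpc:call/5`, `erpc:send_request/4`, and `erpc:receive_response/2` are taken from the description above.

```erlang
-module(bulk_info).
-export([processes_with_info/3]).

%% Hypothetical helper: collect process_info/2 for every process on Node.
%% All requests are sent first and the replies are collected afterwards,
%% so the caller does not wait one round trip per process.
processes_with_info(Node, Items, Timeout) ->
    Pids = erpc:call(Node, erlang, processes, [], Timeout),
    %% Phase 1: send every request without waiting for any reply.
    Reqs = [{P, erpc:send_request(Node, erlang, process_info, [P, Items])}
            || P <- Pids],
    %% Phase 2: collect the replies in order.
    lists:filtermap(
      fun({P, ReqId}) ->
              %% Note: Timeout applies per response here, not to the whole batch.
              try erpc:receive_response(ReqId, Timeout) of
                  undefined ->
                      %% The process exited between listing and inspection.
                      false;
                  Info when is_list(Info) ->
                      {true, {P, Info}}
              catch
                  %% Timeouts, lost connections, or remote exceptions:
                  %% skip this process rather than fail the whole batch.
                  _:_ ->
                      false
              end
      end,
      Reqs).
```

A call such as `bulk_info:processes_with_info(Node, [memory, message_queue_len], 5000)` would return a list of `{Pid, InfoList}` pairs. The function still issues one request per process, but they are pipelined over the existing distribution connection.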

@JonasGruenwald (Owner)

> We've been using Spectator to inspect production BEAM clusters with many processes across remote nodes

That's super cool to hear, thanks for sharing!

These changes sound good to me, I've been meaning to handle the data fetching more gracefully – thanks for your work, I will try to take a closer look when I can.

I think not storing the cookie in the URL query parameter is a good idea; I had been thinking about moving both the cookie and target node name to command line arguments passed to spectator when starting it, because I'd generally prefer to avoid creating atoms from user input dynamically – do you think that's something that would work for your use case as well or do you benefit from being able to swap between inspection targets on a running spectator instance?

@oderwat (Contributor, Author) commented Feb 27, 2026

> swap between inspection targets on a running spectator instance

We run clusters of up to 8 nodes across multiple hosts in three different locations. Deployment is driven by a "cluster configuration" that defines all nodes and where they live. We modified spectator to accept a list of nodes on the command line and populate the connection dropdown for quick switching between them. This integrates with our central CLI tool (cx), so inspecting a cluster is just:

cx -c <cluster_name> spectator

On the feature side, we've added:

  • Reductions per second with a 10-second sliding window
  • Basic Syn (process registry) support
  • A cell name column in the process list — "cells" are independently loadable Gleam projects that we hot-reload into running nodes. Since cells can spawn multiple processes and run as multi-instance, sorting by cell name gives an immediate overview of all the running parts.

That last one is highly specific to our setup, and we expect to keep adding more. For example, we have centralized OpenTelemetry logging that collects into ETS tables — interpreting and surfacing that data in spectator is next on the list.

I was happy to find this project as a starting point, especially since it uses Lustre server components for the UI — which is the same approach we use. We've actually taken it a step further: our views are abstracted and loaded dynamically into different panes of a coordinating "webui" cell.

    {some, NodeName} ->
        Pids = erpc:call(NodeName, erlang, processes, [], ?BULK_TIMEOUT),
        Reqs = [{P, erpc:send_request(NodeName, erlang, process_info, [P, Items])}
                || P <- Pids],

@JonasGruenwald (Owner)

I'm curious about the idea behind this part – if I'm reading correctly this would still make N erpc requests for N processes, just in parallel instead of in series.

My thought for optimization would have been to do a single erpc call, with a function that collects all the process info on the remote node rather than locally (could still be done in parallel rather than series by fanning out processes on the node), so the whole thing would only be a single request over the network.
Is that something that you had considered but didn't go with for some specific reason? Am I missing something with that thought?

@oderwat (Contributor, Author)

It was just easy to do. I believe this does not make multiple connections — the requests all go over the existing distribution channel, and they're processed concurrently on the remote node.

> If two nodes are connected together, all their communications will tend to happen over a single TCP connection. Because we generally want to maintain message ordering between two processes (even across the network), messages will be sent sequentially over the connection. (LYSE)

Running it fully on the remote as a single call would be faster, but it's not trivial since you need to get the code onto the remote node using a helper module. We do that for other stuff in our system, but Spectator was just a quick hack to start with. Spectator could use erl_eval to get the code there. But in reality it was already fast enough so we didn't bother with it.

@JonasGruenwald (Owner)

Ah, interesting. Since network round-trips were mentioned in the comment leading up to this, I was thinking this might still have to make those same round trips, just in parallel, but I have no idea if that's any actual overhead. I'd be interested in benchmarking this, but I suppose it would be tricky to recreate a realistic network environment for such a benchmark.

> you need to get the code onto the remote node using a helper module

I was thinking of an anonymous function rather than going through a module loaded onto the remote, though I also don't know if the overhead from the BEAM automatically serializing that every time is significant.

If the approach you've taken here works well on your infrastructure I'll probably just take it in like that, was just curious regarding the thoughts behind it :)

@oderwat (Contributor, Author)

Well, it is much faster to just call it 10 times in "pseudo" parallel, have the requests "serialized" by the system, and get the results basically as soon as they are done. This avoids paying the RTT once per call, because you do not wait for one result before sending the next request. With erl_eval this would not be compiled code but interpreted. I did not test it.

I am fixing something more problematic for us. We have > 100 MB of state data in some actors, and when you click on such a process, the state gets re-fetched at every refresh interval, which is crazy. I plan to add an update button and an "auto" checkbox, then guesstimate the "danger" from the memory usage of the process. If that is "big", we won't do automatic fetches. I think I also want only one initial fetch for the others until the checkbox gets set. If that is not satisfying, we may need to add more queries, though.
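
The memory-based gating idea above could look roughly like this. It is only a sketch under stated assumptions: the module name `fetch_policy`, the function `auto_fetch_allowed/2`, and the idea of passing the threshold as an argument are all hypothetical, and nothing here is taken from the fork.

```erlang
-module(fetch_policy).
-export([auto_fetch_allowed/2]).

%% Hypothetical policy: allow automatic state fetches only while the
%% process uses at most MaxBytes of memory. erlang:process_info/2 only
%% works for local pids; for a process on another node the same call
%% would have to go through erpc.
auto_fetch_allowed(Pid, MaxBytes) when is_pid(Pid), is_integer(MaxBytes) ->
    case erlang:process_info(Pid, memory) of
        {memory, Bytes} when Bytes =< MaxBytes ->
            true;
        {memory, _TooBig} ->
            %% "Big" process: require a manual update instead.
            false;
        undefined ->
            %% The process no longer exists, nothing to fetch.
            false
    end.
```

One caveat with this heuristic: the `memory` item does not account for off-heap (reference-counted) binaries, so a process whose state consists mostly of large binaries can look smaller than it is.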

@JonasGruenwald (Owner)

That seems like a good idea yeah, even just inspecting the spectator server component process itself is way too heavy since it has a huge state that changes every time

@JonasGruenwald (Owner) commented Feb 28, 2026

@oderwat Also just want to note since you are using your fork in production, there is one other smaller thing I see as an issue for production use currently, and it's something that should now be fixable since v1 of gleam_erlang: #19 – I'll try to take a look at that at some point.

@oderwat (Contributor, Author) commented Feb 28, 2026

> @oderwat Also just want to note since you are using your fork in production, there is one other smaller thing I see as an issue for production use currently, and it's something that should now be fixable since v1 of gleam_erlang: #19 – I'll try to take a look at that at some point.

I'm aware of the spawn changes (even if I sometimes lost track of how the Gleam Erlang and OTP modules changed last year). I postponed thinking about it because I have tracing plans: our cells use a multi-subject protocol, and we want to build more specific introspection for that. But while inspecting stuff I already ended up with quite a lot of message inboxes holding more than 10 messages. Thank you for the heads-up. We will add an is_otp_process check as a guard before sending system messages. The guards are still useful to avoid stepping on the toes of busy OTP processes. I may add a commit that implements that after updating our fork and testing that it works.
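
A guard of this kind might be sketched as below. This is an assumption about one possible approach, not the check that was committed in this PR: the module name `otp_guard` and the function `safe_get_state/2` are hypothetical. The heuristic relies on `proc_lib:translate_initial_call/1` returning `{proc_lib, init_p, 5}` for processes that were not started through `proc_lib`.

```erlang
-module(otp_guard).
-export([is_otp_process/1, safe_get_state/2]).

%% Heuristic: processes started through proc_lib (gen_server, supervisor,
%% and other OTP behaviours) record their initial call. For any other
%% process, translate_initial_call/1 falls back to {proc_lib, init_p, 5}.
is_otp_process(Pid) when is_pid(Pid) ->
    case proc_lib:translate_initial_call(Pid) of
        {proc_lib, init_p, 5} -> false;
        {_Mod, _Fun, _Arity} -> true
    end.

%% Hypothetical wrapper: only send a system message when the guard passes,
%% so plain processes do not collect unhandled messages in their inbox.
safe_get_state(Pid, Timeout) ->
    case is_otp_process(Pid) of
        false ->
            {error, not_otp_process};
        true ->
            try
                {ok, sys:get_state(Pid, Timeout)}
            catch
                Class:Reason -> {error, {Class, Reason}}
            end
    end.
```

The check is not airtight: a process started with `proc_lib:spawn/1` that never handles system messages would still pass the guard, so the timeout on `sys:get_state/2` remains necessary.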

@oderwat (Contributor, Author) commented Feb 28, 2026

@JonasGruenwald I added the is_otp_check. Is that how you would do it, or do you have a better idea?

@JonasGruenwald (Owner) left a comment

That's perfect, I was also planning to just mimic what observer does, the added _bulk and _with_info functions also look good to me.

To merge this I think these things should be done:

  • Since the _bulk and _with_info versions of the data fetching functions are quicker, and have replaced the 'regular' ones, I would just replace the regular ones with those and delete the old implementations, so that there are no unused functions in the ffi and api modules
  • In spectator.gleam there is still the code that reads out the cookie from the query parameters and sets the node's cookie based on that. I would remove that since this functionality has been removed from the UI
  • The docs need to be slightly adapted, in the README, the note about the cookie being stored plainly in the URL should be removed since it's no longer the case (in the production section), perhaps a note that the inspection target must have the same cookie set as spectator could be added.
    Same for the example/README.md – just a note about the cookie, and the outdated screenshot should be removed (don't really need a new one, can just get rid of it)

Let me know if you don't feel like working on this further, I can also branch off from here and make those changes myself on top of yours.

@oderwat (Contributor, Author) commented Feb 28, 2026

Since you have already identified those places, it would be nice if you did that.

…andling code, update docs, update Dockerfile versions
@JonasGruenwald JonasGruenwald merged commit f49474c into JonasGruenwald:main Feb 28, 2026
@JonasGruenwald (Owner)

Thanks for your contribution!
