Erlang Clustering: a survey

22 Jan 2023 12:10 erlang

A question on Mastodon asks “What are people using [for cluster management] in 2023?” I thought I’d address a couple of hidden assumptions in the question and do a quick survey of what’s available.

Mature or abandoned?

One of the things about the Erlang ecosystem is that it’s kinda slow-moving compared to some others – even compared to Elixir. It’s not unusual to find a package that hasn’t had any recent updates and looks to be abandoned. Frequently, however, this is because it’s finished. It’s stable, mature, and doesn’t need any updates.

Q: Age is no guarantee of efficiency.

James Bond: And youth is no guarantee of innovation.

– Skyfall (2012)

It’s a risk, of course. You might start relying on a quiet package, discover a bug, and only then find out whether it’s actually abandoned. I don’t have any solid advice here. You kind of get a feel for the difference after a while.

What is “clustering”?

When people talk about “clustering”, they’re usually conflating a number of different topics:

Node discovery.
Node communication.
Process discovery.
Leader elections.

Erlang/OTP, out of the box, handles node communication just fine. It also provides some very basic node discovery.

So, what’s missing?

Node discovery

Erlang’s node discovery was fine back when everyone was cool with hand-configuring things. These days, we’re looking for something easier.
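To make “hand-configuring” concrete, here’s a minimal sketch. It assumes you maintain the list of node names yourself; the module name static_discovery is made up for illustration. It pings each node with net_adm:ping/1, which also connects the two nodes when it succeeds.

```erlang
-module(static_discovery).
-export([connect_all/1]).

%% Ping every node in a hand-maintained list.
%% net_adm:ping/1 returns pong when the node is reachable (connecting to it
%% as a side effect) and pang when it is not.
%% Returns {Reachable, Unreachable}.
-spec connect_all([node()]) -> {[node()], [node()]}.
connect_all(Nodes) ->
    lists:partition(fun(Node) -> net_adm:ping(Node) =:= pong end, Nodes).
```

From a shell on a distributed node you’d call something like static_discovery:connect_all(['b@myhost', 'c@myhost']) (hypothetical node names) and then check nodes() to see who you ended up connected to. The obvious drawback is that somebody has to keep that list up to date.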

As far as I know, the current state of the art is libcluster, about which I’ve written in the past. It’s an Elixir package; for an Erlang-only application, you’ll have to put something together yourself.
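If you do end up putting something together yourself, one common approach is DNS polling. The following is only a sketch under some loud assumptions: every node is started with a long name of the form Basename@IP, a single DNS name resolves to the A records of all the nodes (as a Kubernetes headless service would, for example), and the module and function names are invented here. It is not how libcluster or any other library implements it.

```erlang
-module(dns_discovery).
-export([discover/2, node_names/2]).

%% Look up the A records for Hostname and try to connect to a node called
%% Basename@IP on each address. Hypothetical helper, not a library API.
%% Returns {Connected, Failed}.
-spec discover(string(), string()) -> {[node()], [node()]}.
discover(Basename, Hostname) ->
    IPs = inet_res:lookup(Hostname, in, a),
    %% Don't try to connect to ourselves.
    Nodes = node_names(Basename, IPs) -- [node()],
    %% connect_node/1 returns true, false, or ignored (local node not alive).
    lists:partition(fun(Node) -> net_kernel:connect_node(Node) =:= true end,
                    Nodes).

%% Build node names from IPv4 addresses. This creates atoms from DNS data,
%% so it assumes the set of addresses is small and trusted.
-spec node_names(string(), [inet:ip4_address()]) -> [node()].
node_names(Basename, IPs) ->
    [list_to_atom(Basename ++ "@" ++ inet:ntoa(IP)) || IP <- IPs].
```

You’d call discover/2 on a timer (from a gen_server, say) so that new nodes get picked up as they appear. Nodes that disappear are already handled by the runtime, which drops the connection for you.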

Process discovery

Of course, now that you’ve got your nodes talking to each other, you’ll need to get your processes talking to each other. Again, this just works out of the box. Where you’re going to struggle is finding the process you want to talk to. You’ll need a process registry.

Erlang/OTP comes with pg, which is used by Phoenix PubSub.
Still in Erlang-world, there’s gproc.
There’s syn, which was originally aimed at the IoT field.
I’ve used horde (an Elixir library) in the past.
There’s also swarm, another Elixir library. I don’t have a lot of experience with that.

They each make different trade-offs between consistency and availability, and they use different cross-cluster strategies: for example, horde uses CRDTs, while gproc uses an elected leader.
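As a taste of the simplest option above, here’s a minimal pg sketch. It assumes OTP 23 or later (where pg was introduced); the module name pg_sketch and the group name workers are made up for the example. Note that pg gives you process groups rather than unique names: a group can have many members, so it’s up to you to pick one.

```erlang
-module(pg_sketch).
-export([demo/0]).

%% Join a process to a pg group, look it up again, then clean up.
%% Returns true if the worker was found in the group.
-spec demo() -> boolean().
demo() ->
    %% Start the default pg scope. In a real application this belongs in a
    %% supervision tree; here we just tolerate it already running.
    case pg:start_link() of
        {ok, _} -> ok;
        {error, {already_started, _}} -> ok
    end,
    Worker = spawn(fun() -> receive stop -> ok end end),
    ok = pg:join(workers, Worker),
    %% get_members/1 returns members from every connected node running the
    %% same scope, not just the local one.
    Found = lists:member(Worker, pg:get_members(workers)),
    ok = pg:leave(workers, Worker),
    Worker ! stop,
    Found.
```

On a cluster, a process on any other connected node that has also started the default scope could call pg:get_members(workers) and send a message to whichever pid it picks.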

Leader elections

There’s gen_leader, though this one does seem to be in some kind of purgatory. The “latest” version I can find (and the one used by gproc) is this one: https://github.com/garret-smith/gen_leader_revival.git.

Uncategorised

ra, which is RabbitMQ’s implementation of Raft.
riak_core.
partisan.

What did I miss? Hit the discussion button at the top, or find me on Hachyderm.