Auto clustering and process distribution in Elixir

Writing this down because I keep coming back to solving this whenever I build and deploy a new Elixir/Phoenix app. This article answers the following questions:

How to automatically form a cluster of elixir nodes.
How to ensure only one copy of a process runs on an elixir cluster of two or more nodes.
How to ensure at least one copy of a process runs on the cluster when nodes can die and restart.

The easiest way is to use two libraries:

libcluster - for auto-clustering.
swarm - for process registration and distribution.

All you need is demonstrated in this sample app repository on github. Readme is pasted below for quick reference.

How to use swarm + libcluster

Relevant lines of code from example code https://github.com/emilsoman/elixir_clustering:

swarm and libcluster added in mix.exs
A new dynamic supervisor created at lib/my_app/dynamic_supervisor.ex
{DynamicSupervisor, strategy: :one_for_one, name: MyApp.DynamicSupervisor} added to lib/my_app/application.ex.
Example GenServer created at lib/my_app/singleton_worker.ex to demonstrate singleton behaviour in cluster.
Added cluster topology config in config. See config/dev.exs. Use K8S, EC2 or other strategy as desired in prod.
Added {Cluster.Supervisor, [Application.get_env(:libcluster, :topologies), [name: MyApp.ClusterSupervisor]]} to supervisor.

At this point, we can test the clustering by opening two terminal tabs and running:

# terminal 1
iex --sname a -S mix

# terminal 2
iex --sname b -S mix

When node b is started, it should print this in the log in terminal 1:

[tracker:ensure_swarm_started_on_remote_node] nodeup b@<hostname>

This means nodes are now connected and libcluster works. Now we can test process distribution by swarm.

Goals:

Process name should be unique per node and per cluster.
Process should move from node a to b when we kill node a.

# terminal 1 (where node a is running)
Swarm.register_name(SingletonName, MyApp.DynamicSupervisor, :register, [SingletonWorker])
# Prints the following:
[swarm on a@<hostname>] [tracker:handle_call] registering SingletonName as process started by Elixir.MyApp.DynamicSupervisor.register/1 with args [SingletonWorker]
[swarm on a@<hostname>] [tracker:do_track] starting SingletonName on a@<hostname>
SingletonWorker running on :"a@<hostname>"
[swarm on a@<hostname>] [tracker:do_track] started SingletonName on a@<hostname>
{:ok, #PID<0.187.0>}

# terminal 1 again
Swarm.register_name(SingletonName, MyApp.DynamicSupervisor, :register, [SingletonWorker])
# Prints the following:
[swarm on a@<hostname>] [tracker:handle_call] registering SingletonName as process started by Elixir.MyApp.DynamicSupervisor.register/1 with args [SingletonWorker]
{:error, {:already_registered, #PID<0.187.0>}}

# terminal 2 (where node b is running)
Swarm.register_name(SingletonName, MyApp.DynamicSupervisor, :register, [SingletonWorker])
# Prints the following:
[swarm on b@<hostname>] [tracker:handle_call] registering SingletonName as process started by Elixir.MyApp.DynamicSupervisor.register/1 with args [SingletonWorker]
[swarm on b@<hostname>] [tracker:do_track] found SingletonName already registered on a@<hostname>
{:error, {:already_registered, #PID<16338.187.0>}}

# Kill node a in terminal 1, prints the following in terminal 2 (node b)
[swarm on b@<hostname>] [tracker:handle_monitor] lost connection to SingletonName (#PID<16338.187.0>) on a@<hostname>, node is down
[swarm on b@<hostname>] [tracker:nodedown] nodedown a@<hostname>
[swarm on b@<hostname>] [tracker:handle_topology_change] topology change (nodedown for a@<hostname>)
[swarm on b@<hostname>] [tracker:handle_topology_change] restarting SingletonName on b@<hostname>
[swarm on b@<hostname>] [tracker:do_track] starting SingletonName on b@<hostname>
SingletonWorker running on :"b@<hostname>"
[swarm on b@<hostname>] [tracker:do_track] started SingletonName on b@<hostname>
[swarm on b@<hostname>] [tracker:handle_topology_change] topology change complete

Starting a scheduler process

Using SchedEx, creating a job that prints "Hello world" every minute:

schedex_child_spec = %{id: "every_minute", start: {SchedEx, :run_every, [IO, :inspect, ["Hello world"], "* * * * *"]}}
Swarm.register_name(MyAppScheduler, __MODULE__, :register, [schedex_child_spec])

Adding this to the supervision tree:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    distributed_workers =
      %{
        id: SlashTeam.StartupSupervisor,
        restart: :transient,
        start:
        {Task, :start_link,
          [
            fn ->
              schedex_child_spec = %{id: "every_minute", start: {SchedEx, :run_every, [IO, :inspect, ["Hello world"], "* * * * *"]}}
              Swarm.register_name(MyAppScheduler, __MODULE__, :register, [schedex_child_spec])
            end
          ]}
      }

    children = [
      {Cluster.Supervisor, [Application.get_env(:libcluster, :topologies), [name: MyApp.ClusterSupervisor]]},
      {DynamicSupervisor, strategy: :one_for_one, name: MyApp.DynamicSupervisor},
      distributed_workers
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Emil Soman

Auto clustering and process distribution in Elixir

How to use swarm + libcluster

Starting a scheduler process