Dealing with Multiple Failed Nodes

If any site in your deployment permanently loses half or more of its etcd master nodes, that site loses “quorum” and the underlying etcd cluster becomes read-only. While the etcd cluster is in this state, you can’t perform any scaling operations or change configuration and have it synced across the deployment. Use this process to recover the etcd cluster in the failed site.

If you haven’t lost half (or more) of your etcd master nodes in a site, then you can use the process described here for each of your failed nodes.

Procedure

The procedure creates a new etcd cluster to replace the existing one. The new cluster is populated with the configuration saved in the configuration files on disk, which allows the cluster to be recreated even when the existing cluster is too badly corrupted to read from.

This procedure won’t impact service. You should follow it through to completion - the behaviour is unspecified if the process is started but not completed. It is always safe to restart the process from the beginning (for example, if you encounter an error partway through).

If there are no live Vellum nodes in the site, you should continue with this process, missing out the steps that require running commands on Vellum. Once the etcd cluster is recovered, you should add the new replacement nodes.

Stop the etcd processes

Stop the etcd processes on every node in the affected site.

  • Run sudo monit stop -g etcd
  • Run sudo monit stop -g clearwater_cluster_manager
  • Run sudo monit stop -g clearwater_config_manager
  • Run sudo monit stop -g clearwater_queue_manager
  • Run sudo touch /etc/clearwater/no_cluster_manager
  • Run sudo rm -rf /var/lib/clearwater-etcd/*
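
If you manage your nodes over SSH, a loop along the following lines can run the stop sequence on each node in the site. The hostnames are placeholders, and the sketch assumes SSH access to each node and that sudo can prompt on the allocated terminal:

# Stop etcd and related processes on every node in the affected site (placeholder hostnames)
for host in vellum-1.site1.example.com vellum-2.site1.example.com sprout-1.site1.example.com; do
  ssh -t "$host" '
    sudo monit stop -g etcd &&
    sudo monit stop -g clearwater_cluster_manager &&
    sudo monit stop -g clearwater_config_manager &&
    sudo monit stop -g clearwater_queue_manager &&
    sudo touch /etc/clearwater/no_cluster_manager &&
    sudo rm -rf /var/lib/clearwater-etcd/*'
done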

Select your master nodes

To follow this process you need to choose some nodes to be the new masters of the etcd cluster:

  • If you have 3 or more working Vellum nodes in the site, you should use those
  • If not, you should use all the nodes in the site

Check the configuration on your nodes

The next step is to ensure that the configuration files on each node are correct.

Any of the master nodes - Shared configuration

The shared configuration is at /etc/clearwater/shared_config. Verify that this file is correct, then use cw-config to copy it onto every other master node. See the configuration options reference for more details on how to set the configuration values.

Vellum - Chronos configuration

  • The configuration file is at /etc/chronos/chronos_cluster.conf.
  • Verify that this is present and syntactically correct on all Vellum nodes in the affected site.
    • This should follow the format here.
    • If the file isn’t present, or is invalid, recreate it so that it lists every Vellum node in the site as a node (see the example below).
    • Otherwise, don’t change the states of any nodes in the file (even if you know the node has failed).
  • If there is more than one failed node then there will be timer failures until this process has been completed. This could prevent subscribers from receiving notifications when their registrations/subscriptions expire.
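
As a rough illustration (the linked format reference is authoritative), a site whose three Vellum nodes have placeholder addresses 10.0.0.1, 10.0.0.2 and 10.0.0.3 might have a chronos_cluster.conf on the 10.0.0.1 node along these lines, with every Vellum node in the site listed as a node:

[cluster]
localhost = 10.0.0.1
node = 10.0.0.1
node = 10.0.0.2
node = 10.0.0.3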

Vellum - Memcached configuration

  • The configuration file is at /etc/clearwater/cluster_settings.
  • Verify that this is present and syntactically correct on all Vellum nodes in the affected site.
    • This can have a servers line and a new_servers line - each line has the format <servers|new_servers>=<ip address>,<ip address>, ...
    • If the file isn’t present, or is invalid, then make the configuration file contain all Vellum nodes in the site on the servers line, and don’t add a new_servers line (see the example below).
    • Otherwise, don’t change the states of any nodes in the file (even if you know the node has failed).
  • If there is more than one failed node (and there is no remote site, or more than one failed node in the remote site) then there will be registration and call failures, and calls will be incorrectly billed (if using Ralf) until this process has been completed.
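
For example, using the format above with the same placeholder addresses, a healthy file for a site with three Vellum nodes would contain a single servers line and no new_servers line:

servers=10.0.0.1,10.0.0.2,10.0.0.3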

Vellum - Cassandra configuration

If you are using any of Homestead-Prov, Homer or Memento, check that the Cassandra cluster is healthy by running the following on a Vellum node:

sudo /usr/share/clearwater/bin/run-in-signaling-namespace nodetool status

If the Cassandra cluster isn’t healthy, you must fix it before continuing, and remove any failed nodes from the cluster.
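
For example, a failed node appears in the nodetool status output with a DN (down) status; one way to remove such a node is nodetool removenode, run in the signaling namespace as above and passing the failed node’s Host ID from the status output:

sudo /usr/share/clearwater/bin/run-in-signaling-namespace nodetool removenode <Host ID of failed node>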

Sprout - JSON configuration

Check the JSON configuration files on all Sprout nodes in the affected site:

  • Verify that the /etc/clearwater/enum.json file is correct; if it isn’t, fix it using cw-config.
  • Verify that the /etc/clearwater/s-cscf.json file is correct; if it isn’t, fix it using cw-config.
  • Verify that the /etc/clearwater/bgcf.json file is correct; if it isn’t, fix it using cw-config.

Sprout - XML configuration

Check the XML configuration files on all Sprout nodes in the affected site:

  • Verify that the /etc/clearwater/shared_ifcs.xml file is correct; if it isn’t, fix it using cw-config.
  • Verify that the /etc/clearwater/fallback_ifcs.xml file is correct; if it isn’t, fix it using cw-config.

Running sudo cw-validate_shared_ifcs_xml or sudo cw-validate_fallback_ifcs_xml will check whether the corresponding file is syntactically correct.

Recreate the etcd cluster

  • On your selected master nodes, set etcd_cluster in /etc/clearwater/local_config to a comma-separated list of the management IP addresses of your master nodes (see the example after this list).
  • Start etcd on the master nodes
    • Run sudo monit monitor -g etcd
    • Run sudo monit monitor -g clearwater_config_manager
    • Run sudo monit monitor -g clearwater_queue_manager
  • This creates the etcd cluster, and synchronises the shared configuration. It doesn’t recreate the data store cluster information in etcd yet.
  • Verify that the master nodes have formed a new etcd cluster successfully:
    • Running sudo monit summary on each master node should show that the etcd processes are running successfully, apart from clearwater_cluster_manager_process (which stays stopped until later in this procedure)
    • Running sudo clearwater-etcdctl cluster-health (on a single master node) should show that the etcd cluster is healthy
    • Running sudo clearwater-etcdctl member list should show that all the master nodes are members of the etcd cluster.
  • Verify that the configuration has successfully synchronised by running sudo /usr/share/clearwater/clearwater-config-manager/scripts/check_config_sync
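
For example, if the three master nodes have management IP addresses 10.0.0.1, 10.0.0.2 and 10.0.0.3 (placeholders), the line in /etc/clearwater/local_config on each master node would be along the lines of:

etcd_cluster=10.0.0.1,10.0.0.2,10.0.0.3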

Add the rest of the nodes to the etcd cluster

Run this process, in turn, on every node in the affected site that is not one of the master nodes. If all nodes in the site are master nodes, you can skip this step.

  • Set etcd_proxy in /etc/clearwater/local_config to a comma-separated list of the management IP addresses of your master nodes (see the example after this list).
  • Start etcd on the node
    • Run sudo monit monitor -g etcd
    • Run sudo monit monitor -g clearwater_config_manager
    • Run sudo monit monitor -g clearwater_queue_manager
  • Verify that the node has contacted the etcd cluster successfully:
    • Running sudo monit summary should show that the etcd processes are running successfully, apart from clearwater_cluster_manager_process (which stays stopped until later in this procedure)
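
Continuing the placeholder example above, the line in /etc/clearwater/local_config on each non-master node would be along the lines of:

etcd_proxy=10.0.0.1,10.0.0.2,10.0.0.3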

Recreate the data store cluster values in etcd

Run these commands on one Vellum node in the affected site:

sudo /usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_chronos_cluster vellum
sudo /usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_memcached_cluster vellum

If you are using any of Homestead-Prov, Homer or Memento, also run:

sudo /usr/share/clearwater/clearwater-cluster-manager/scripts/load_from_cassandra_cluster vellum

Verify the cluster state is correct in etcd by running sudo /usr/share/clearwater/clearwater-cluster-manager/scripts/check_cluster_state

Start the cluster manager on all nodes

Run this process on every node (including the master nodes) in the affected site in turn.

  • Run sudo rm /etc/clearwater/no_cluster_manager
  • Run sudo monit monitor -g clearwater_cluster_manager
  • Verify that the cluster-manager comes back up by running sudo monit summary.

Next steps

Your deployment now has a working etcd cluster. You now need to: