Rotating pet servers with SaltStack
A key infrastructure goal of ours is to replace all servers a maximum of 30 days after creating them.
Why? Because it reduces configuration sprawl and manual changes to machines, and it forces us to have adequate automation in place to reliably provision every aspect of our infrastructure. If a disaster ever strikes, our recovery procedure will be well rehearsed.
In this post I’ll explain how we use salt and salt-cloud to accomplish this.
Identifying machines
First we need to identify servers that are more than 30 days old.
There are a number of commands we could use:
# age of the root filesystem
tune2fs -l $(df / | awk 'NR==2 {print $1}') | awk '/created/ {$1=""; $2=""; print}'
# date the minion public key was created
ls -l /etc/salt/pki/minion/minion.pub | awk '{print $6 " " $7 " " $8}'
# Red Hat systems only
rpm -qi basesystem | awk '/Install Date/'
Let’s go with the tune2fs one-liner.
Run it on all servers with cmd.run (watch out for shell quoting!) and look for creation dates earlier than 30 days ago:
salt \* cmd.run "tune2fs -l \$(df / | awk 'NR==2 {print \$1}') | awk '/created/ {\$1=\"\"; \$2=\"\"; print}'"
# node-abc123.dc1.backbeat.tech:
# Wed Jul 17 18:33:49 2019
# lb1.dc1.backbeat.tech:
# Tue Aug 6 10:33:16 2019
# lb2.dc1.backbeat.tech:
# Fri Aug 16 11:31:56 2019
# lb3.dc1.backbeat.tech:
# Thu Jul 18 14:06:49 2019
# node-def456.dc1.backbeat.tech:
# Fri Aug 16 13:42:27 2019
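If you’d rather not eyeball the dates, a rough sketch like the following (not part of our tooling, and using the minion key’s timestamp as a proxy for server age) makes the comparison for you:
# report each minion's age in days, using the modification time of the minion
# public key as a proxy for when the server was provisioned
salt \* cmd.run 'echo $(( ($(date +%s) - $(stat -c %Y /etc/salt/pki/minion/minion.pub)) / 86400 )) days old'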
We’ve got two servers to replace: node-abc123.dc1.backbeat.tech and lb3.dc1.backbeat.tech.
Pets vs cattle
The ‘pets vs cattle’ analogy, coined by Bill Baker of Microsoft and popularised in DevOps circles, explains the difference between stateful ‘pet’ servers (distinct individuals that need looking after) and stateless ‘cattle’ servers (interchangeable members of a group that can be replaced with little thought).
It’s good to have as few ‘pet’ servers as possible, but sometimes you have no choice but to store important business state in these servers. You may also have legacy infrastructure - not everything can be easily migrated to the cloud.
Looking back at the naming scheme in Bootstrapping infrastructure with Salt Cloud and Terraform reveals we have one of each: node-abc123 is part of a Nomad cluster (cattle), while lb3 is a load balancer with a dedicated IP address (pet).
Replacing the Nomad cluster machine is easy, as it doesn’t contain any state we care about:
# add a new server to the cluster
salt-cloud -p nomad_profile node-ghi789.dc1.backbeat.tech
# move all Nomad jobs on the old server elsewhere (the new server)
salt node-abc123.dc1.backbeat.tech cmd.run 'nomad node drain -enable -self'
# (wait for Nomad jobs to be reallocated)
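# optional check (not in the original steps): confirm the drain finished and
# the node has no remaining allocations before removing it
salt node-abc123.dc1.backbeat.tech cmd.run 'nomad node status -self'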
# remove the old server
salt-cloud -d node-abc123.dc1.backbeat.tech
Replacing our pet lb3 server will be a little harder, however:
- It needs to keep serving traffic while its replacement is being provisioned.
- The static IP should remain assigned until its replacement is ready.
- Its replacement should also be called lb3.dc1.backbeat.tech, as we have SaltStack pillar data that mentions that minion ID explicitly.
Rename the existing machine
We use salt-cloud to create new machines and connect them to the Salt master.
Unfortunately, we can’t create a new machine while the original still exists:
salt-cloud -p lb_profile lb3.dc1.backbeat.tech
lb3.dc1.backbeat.tech:
----------
Error:
lb3.dc1.backbeat.tech already exists under my_cloud:provider
Let’s rename the minion’s ID from lb3.dc1.backbeat.tech to lb3.dc1.old.backbeat.tech:
# tell the machine its new name
salt lb3.dc1.backbeat.tech cmd.run 'echo lb3.dc1.old.backbeat.tech > /etc/salt/minion_id'
# restart salt-minion on the machine for the change to take effect
salt lb3.dc1.backbeat.tech service.restart salt-minion
# update the master to identify the already approved key with the new name
mv /etc/salt/pki/master/minions/lb3.dc1.backbeat.tech /etc/salt/pki/master/minions/lb3.dc1.old.backbeat.tech
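A quick sanity check (optional, and not part of the original steps) confirms the master now sees the minion under its new ID:
# the accepted key should now be listed under the new name only
salt-key -L | grep lb3.dc1
# and the renamed minion should still respond
salt lb3.dc1.old.backbeat.tech test.ping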
Make sure to pick a name that won’t conflict with other machines, and ideally won’t match any expressions in Salt’s top.sls:
base:
  '*.dc1.backbeat.tech':
    - network
    - dns
  'lb*.dc1.backbeat.tech':
    - haproxy
With this top.sls structure, the machine will no longer be managed by Salt.
It will continue to function as a load balancer, but won’t be updated by highstate runs.
We can consider it ‘retired’ now.
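To double-check that nothing in the top file targets the retired name, you can ask Salt what it would apply to it (an optional check, not in the original steps); an empty result confirms the minion is unmanaged:
salt lb3.dc1.old.backbeat.tech state.show_top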
You might also want to rename the machine using your cloud provider’s control panel or API.
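With DigitalOcean, for example, that could look something like this hedged sketch using doctl (the droplet ID shown is illustrative):
# rename the droplet at the provider level to match its new minion ID
doctl compute droplet-action rename 216508506 --droplet-name lb3.dc1.old.backbeat.tech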
Create the replacement machine
Now that the name is available, create lb3.dc1.backbeat.tech:
salt-cloud -p lb_profile lb3.dc1.backbeat.tech
The new machine should connect to the Salt master and run a highstate. The machine is ready to serve traffic but not ‘live’, as the static IP isn’t pointing to it yet.
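Before moving traffic across, it’s worth verifying the replacement is healthy and fully converged (optional checks, not in the original write-up):
# the new minion should respond under the reclaimed ID
salt lb3.dc1.backbeat.tech test.ping
# a dry-run highstate should report no pending changes
salt lb3.dc1.backbeat.tech state.highstate test=True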
Move state from the retired machine to the replacement
We now need to point the static IP address at the new machine; how to do this depends on how you provision your cloud resources.
Here’s a simple Terraform example using DigitalOcean and their floating_ip resource:
data "digitalocean_droplet" "lb3" {
  name = "lb3.dc1.backbeat.tech"
}

resource "digitalocean_floating_ip" "lb3" {
  droplet_id = "${data.digitalocean_droplet.lb3.id}"
  region     = "${data.digitalocean_droplet.lb3.region}"
}
terraform plan
# An execution plan has been generated and is shown below.
# Resource actions are indicated with the following symbols:
# ~ update in-place
# Terraform will perform the following actions:
# ~ digitalocean_floating_ip.lb3
# droplet_id: "216508506" => "216508947"
# Plan: 0 to add, 1 to change, 0 to destroy.
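If the plan shows nothing but the floating IP moving to the new droplet ID, apply it:
terraform apply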
The new machine will now receive traffic, and the old machine is ready to be removed.
Remove the old machine
With the new machine deployed and the state moved off the old machine, the old machine can be safely destroyed:
salt-cloud -d lb3.dc1.old.backbeat.tech
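Depending on your provider and salt-cloud settings, the renamed minion’s key may or may not be cleaned up automatically; an optional final check (not in the original steps) tidies it up:
# confirm no stray key remains for the retired name, and delete it if one was left behind
salt-key -L
salt-key -d lb3.dc1.old.backbeat.tech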
Conclusion and next steps
That’s it! We use these steps to safely rotate all of our machines:
- Rename
- Create
- Move state
- Remove
Of course, the real complexity lies in step 3. Moving state can get considerably more complicated than simply moving a static IP address:
- To migrate a database, you might create the new machine as a read-only replica of the old, then promote it to be the primary (see the sketch after this list).
- Old and new machines both take part in service discovery (e.g. Consul), so avoid having them registered in the service catalog at the same time.
- The Salt master machine is a high-risk rotation - ensure that all minions connect to the new master before deleting the old!
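As a rough illustration of the first point, here’s a minimal sketch assuming PostgreSQL and a hypothetical db1 pet; the details will differ for other databases:
# on the new machine (hypothetical name db1.dc1.backbeat.tech): seed it as a
# streaming replica of the old primary, assuming a 'replication' role already exists
pg_basebackup -h db1.dc1.old.backbeat.tech -U replication -D /var/lib/postgresql/data -R -P
# once replication has caught up and writes to the old primary have stopped,
# promote the replica to become the new primary
pg_ctl promote -D /var/lib/postgresql/data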
There’s room to improve upon the methods described in this post:
- Wrap the commands that determine a machine’s age in a custom Salt module, e.g. salt-call server_age.days could return 26.
- Automate the minion renaming process with the Salt orchestrate runner.
- Write a Salt engine to check the age of servers and automatically retire and replace them. Automating this could be extremely risky with ‘pets’, but very effective for managing ‘cattle’ (e.g. Nomad cluster workers).
Do you need help managing stateful server workloads? Send us an email to see how we could help!