Rotating pet servers with SaltStack
A key infrastructure goal of ours is to replace all servers a maximum of 30 days after creating them.
Why? Because it reduces configuration sprawl and manual changes to machines, and it forces us to have adequate automation in place to reliably provision every aspect of our infrastructure. If a disaster ever strikes, our recovery procedure will be well rehearsed.
In this post I’ll explain how we use salt and salt-cloud to accomplish this.
Identifying machines
First we need to identify servers that are more than 30 days old.
There are a number of commands we could use:
# age of the root filesystem
tune2fs -l $(df / | awk 'NR==2 {print $1}') | awk '/created/ {$1=""; $2=""; print}'
# date the minion public key was created
ls -l /etc/salt/pki/minion/minion.pub | awk '{print $6 " " $7 " " $8}'
# Red Hat systems only
rpm -qi basesystem | awk '/Install Date/'
Let’s go with the tune2fs one-liner.
Run it on all servers with cmd.run (watch out for shell quoting!) and look for creation dates earlier than 30 days ago:
salt \* cmd.run "tune2fs -l \$(df / | awk 'NR==2 {print \$1}') | awk '/created/ {\$1=\"\"; \$2=\"\"; print}'"
# node-abc123.dc1.backbeat.tech:
# Wed Jul 17 18:33:49 2019
# lb1.dc1.backbeat.tech:
# Tue Aug 6 10:33:16 2019
# lb2.dc1.backbeat.tech:
# Fri Aug 16 11:31:56 2019
# lb3.dc1.backbeat.tech:
# Thu Jul 18 14:06:49 2019
# node-def456.dc1.backbeat.tech:
# Fri Aug 16 13:42:27 2019
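If you’d rather not eyeball the dates, a rough sketch like the following (not part of our tooling, and using the minion key’s timestamp as a proxy for server age) makes the comparison for you:
# report each minion's age in days, using the modification time of the minion
# public key as a proxy for when the server was provisioned
salt \* cmd.run 'echo $(( ($(date +%s) - $(stat -c %Y /etc/salt/pki/minion/minion.pub)) / 86400 )) days old'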
We’ve got two servers to replace: node-abc123.dc1.backbeat.tech and lb3.dc1.backbeat.tech.
Pets vs cattle
The ‘pets vs cattle’ analogy, coined by Bill Baker of Microsoft and popularised in DevOps circles, explains the difference between stateful ‘pet’ servers (distinct individuals that need looking after) and stateless ‘cattle’ servers (interchangeable members of a group that can be replaced with little thought).
It’s good to have as few ‘pet’ servers as possible, but sometimes you have no choice but to store important business state in these servers. You may also have legacy infrastructure - not everything can be easily migrated to the cloud.
Looking back at the naming scheme in Bootstrapping infrastructure with Salt Cloud and Terraform reveals we have one of each: node-abc123 is part of a Nomad cluster (cattle), while lb3 is a load balancer with a dedicated IP address (pet).
Replacing the Nomad cluster machine is easy, as it doesn’t contain any state we care about:
# add a new server to the cluster
salt-cloud -p nomad_profile node-ghi789.dc1.backbeat.tech
# move all Nomad jobs on the old server elsewhere (the new server)
salt node-abc123.dc1.backbeat.tech cmd.run 'nomad node drain -enable -self'
# (wait for Nomad jobs to be reallocated)
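# optional check (not in the original steps): confirm the drain finished and
# the node has no remaining allocations before removing it
salt node-abc123.dc1.backbeat.tech cmd.run 'nomad node status -self'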
# remove the old server
salt-cloud -d node-abc123.dc1.backbeat.tech
Replacing our pet lb3 server will be a little harder, however:
- It needs to keep serving traffic while its replacement is being provisioned.
- The static IP should remain assigned until its replacement is ready.
- Its replacement should also be called lb3.dc1.backbeat.tech, as we have SaltStack pillar data that mentions that minion ID explicitly.
Rename the existing machine
We use salt-cloud to create new machines and connect them to the Salt master.
Unfortunately, we can’t create a new machine while the original still exists:
salt-cloud -p lb_profile lb3.dc1.backbeat.tech
lb3.dc1.backbeat.tech:
----------
Error:
lb3.dc1.backbeat.tech already exists under my_cloud:provider
Let’s rename the minion’s ID from lb3.dc1.backbeat.tech to lb3.dc1.old.backbeat.tech:
# tell the machine its new name
salt lb3.dc1.backbeat.tech cmd.run 'echo lb3.dc1.old.backbeat.tech > /etc/salt/minion_id'
# restart salt-minion on the machine for the change to take effect
salt lb3.dc1.backbeat.tech service.restart salt-minion
# update the master to identify the already approved key with the new name
mv /etc/salt/pki/master/minions/lb3.dc1.backbeat.tech /etc/salt/pki/master/minions/lb3.dc1.old.backbeat.tech
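A quick sanity check (optional, and not part of the original steps) confirms the master now sees the minion under its new ID:
# the accepted key should now be listed under the new name only
salt-key -L | grep lb3.dc1
# and the renamed minion should still respond
salt lb3.dc1.old.backbeat.tech test.ping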
Make sure to pick a name that won’t conflict with other machines, and ideally won’t match any expressions in Salt’s top.sls:
base:
  '*.dc1.backbeat.tech':
    - network
    - dns
  'lb*.dc1.backbeat.tech':
    - haproxy
With this top.sls structure, the machine will no longer be managed by Salt.
It will continue to function as a load balancer, but won’t be updated by highstate runs.
We can consider it ‘retired’ now.
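To double-check that nothing in the top file targets the retired name, you can ask Salt what it would apply to it (an optional check, not in the original steps); an empty result confirms the minion is unmanaged:
salt lb3.dc1.old.backbeat.tech state.show_top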
You might also want to rename the machine using your cloud provider’s control panel or API.
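With DigitalOcean, for example, that could look something like this hedged sketch using doctl (the droplet ID shown is illustrative):
# rename the droplet at the provider level to match its new minion ID
doctl compute droplet-action rename 216508506 --droplet-name lb3.dc1.old.backbeat.tech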
Create the replacement machine
Now that the name is available, create lb3.dc1.backbeat.tech:
salt-cloud -p lb_profile lb3.dc1.backbeat.tech
The new machine should connect to the Salt master and run a highstate. The machine is ready to serve traffic but not ‘live’, as the static IP isn’t pointing to it yet.
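Before moving traffic across, it’s worth verifying the replacement is healthy and fully converged (optional checks, not in the original write-up):
# the new minion should respond under the reclaimed ID
salt lb3.dc1.backbeat.tech test.ping
# a dry-run highstate should report no pending changes
salt lb3.dc1.backbeat.tech state.highstate test=True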
Move state from the retired machine to the replacement
We now need to point the static IP address at the new machine; how to do this depends on how you provision your cloud resources.
Here’s a simple Terraform example using DigitalOcean and their floating_ip resource:
data "digitalocean_droplet" "lb3" {
  name = "lb3.dc1.backbeat.tech"
}

resource "digitalocean_floating_ip" "lb3" {
  droplet_id = "${data.digitalocean_droplet.lb3.id}"
  region     = "${data.digitalocean_droplet.lb3.region}"
}
terraform plan
# An execution plan has been generated and is shown below.
# Resource actions are indicated with the following symbols:
# ~ update in-place
# Terraform will perform the following actions:
# ~ digitalocean_floating_ip.lb3
# droplet_id: "216508506" => "216508947"
# Plan: 0 to add, 1 to change, 0 to destroy.
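If the plan shows nothing but the floating IP moving to the new droplet ID, apply it:
terraform apply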
The new machine will now receive traffic, and the old machine is ready to be removed.
Remove the old machine
With the new machine deployed and the state moved off the old machine, the old machine can be safely destroyed:
salt-cloud -d lb3.dc1.old.backbeat.tech
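Depending on your provider and salt-cloud settings, the renamed minion’s key may or may not be cleaned up automatically; an optional final check (not in the original steps) tidies it up:
# confirm no stray key remains for the retired name, and delete it if one was left behind
salt-key -L
salt-key -d lb3.dc1.old.backbeat.tech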
Conclusion and next steps
That’s it! We use these steps to safely rotate all of our machines:
- Rename
- Create
- Move state
- Remove
Of course, the real complexity lies in step 3. Moving state can get considerably more complicated than simply moving a static IP address:
- To migrate a database, you might create the new machine as a read-only replica of the old, then promote it to be the primary (see the sketch after this list).
- Old and new machines both take part in service discovery (e.g. Consul), so avoid having them registered in the service catalog at the same time.
- The Salt master machine is a high-risk rotation - ensure that all minions connect to the new master before deleting the old!
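As a rough illustration of the first point, here’s a minimal sketch assuming PostgreSQL and a hypothetical db1 pet; the details will differ for other databases:
# on the new machine (hypothetical name db1.dc1.backbeat.tech): seed it as a
# streaming replica of the old primary, assuming a 'replication' role already exists
pg_basebackup -h db1.dc1.old.backbeat.tech -U replication -D /var/lib/postgresql/data -R -P
# once replication has caught up and writes to the old primary have stopped,
# promote the replica to become the new primary
pg_ctl promote -D /var/lib/postgresql/data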
There’s room to improve upon the methods described in this post:
- Wrap the commands that determine a machine’s age in a custom Salt module, e.g. salt-call server_age.days could return 26.
- Automate the minion renaming process with the Salt orchestrate runner.
- Write a Salt engine to check the age of servers and automatically retire and replace them. Automating this could be extremely risky with ‘pets’, but very effective for managing ‘cattle’ (e.g. Nomad cluster workers).
Do you need help managing stateful server workloads? Send us an email to see how we could help!