Wednesday, April 18, 2018

Scripting Consul Backups

Consul is a HashiCorp tool for discovering and configuring services in your infrastructure. Among its many uses is its key-value store, which can be used to dynamically store passwords, SSH keys, and encryption keys.

You can read more about Consul on the HashiCorp website.

The aim of this write-up is to show how you can back up your Consul data to S3 if you do not have the Enterprise version, which normally ships with a Consul snapshot agent.

Here is a sample of my script:


#!/bin/bash

BAK_DEST=/tmp/consul/backup

#Polling associated AWS variables

REGION=$(/usr/bin/curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)

INSTANCE_ID=$(/usr/bin/curl --silent http://169.254.169.254/latest/meta-data/instance-id)

#The S3 bucket is published in AWS Parameter Store. You might decide to hardcode this.
S3_BUCKET=$(/usr/local/bin/aws ssm get-parameter --name "/keystore/$REGION/consul_s3_destination" --region "$REGION" | jq -r .Parameter.Value)

#The hostname is a Name tag on the EC2 instance, so the server can easily poll that information.
HOSTNAME=$(/usr/local/bin/aws ec2 describe-tags --region="$REGION" --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=Name" --output=text | cut -f5)


#number of days to keep archives
KEEP_DAYS=2

#script variables
BAK_DATE=$(date +%F)
BAK_DATETIME=$(date +%F-%H%M)
BAK_FOLDER=${BAK_DEST}
BAK_DB=${BAK_DEST}/${HOSTNAME}-${BAK_DATETIME}

#CREATE the folder where the backup is to be placed (uncomment if it does not already exist)
#echo 'Creating consul backup folder ' ${BAK_FOLDER}
#mkdir -p ${BAK_FOLDER}

#PERFORM Consul backup
echo 'Creating archive file ' ${BAK_DB}'.tar.gz Please wait ......'
/usr/local/bin/consul snapshot save ${BAK_DB}.snap
tar czPf ${BAK_DB}.tar.gz ${BAK_DB}.snap


#Moving backups to AWS. This uses AWS CLI to copy snapshots to S3
echo 'Copying consul backups to S3'
/usr/local/bin/aws s3 cp ${BAK_DB}.snap s3://${S3_BUCKET}/dailybackup/${HOSTNAME}-${BAK_DATETIME}.snap


# DELETE local backup files older than KEEP_DAYS days
echo 'Deleting backups older than '${KEEP_DAYS}' days'
find ${BAK_FOLDER} -type f -mtime +${KEEP_DAYS} -name '*.gz' -execdir rm -- {} \;
find ${BAK_FOLDER} -type f -mtime +${KEEP_DAYS} -name '*.snap' -execdir rm -- {} \;


A few items of interest: the script works on the premise that your Consul server has permission to read from AWS Parameter Store and to write to S3. The servers should also have the AWS CLI installed.
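If you want to sanity-check those permissions before scheduling the script, something like the sketch below will exercise each of them from the instance itself. The parameter path mirrors my setup, and the bucket name is a placeholder you would substitute.

#!/bin/bash
# Rough permission checks for the backup script (bucket name is a placeholder)
REGION=$(curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
INSTANCE_ID=$(curl --silent http://169.254.169.254/latest/meta-data/instance-id)

# ssm:GetParameter - read the published S3 destination
aws ssm get-parameter --name "/keystore/$REGION/consul_s3_destination" --region "$REGION"

# ec2:DescribeTags - read the Name tag used as the hostname
aws ec2 describe-tags --region "$REGION" --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=Name"

# s3:PutObject - confirm the instance can write to the backup bucket
echo "permission test" > /tmp/consul-backup-permtest.txt
aws s3 cp /tmp/consul-backup-permtest.txt s3://<your-backup-bucket>/dailybackup/permtest.txt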

I also created a local destination for the Consul backups on the host at /tmp/consul/backup.

The hostname is a tag in AWS, so the server must also be able to describe EC2 instances to pull its tag information.

The script uses the command "consul snapshot save" to take a snapshot and save it to the local destination, then uses the AWS CLI to copy the snapshot to a predefined S3 destination. The S3 destination is published in Parameter Store, which the EC2 instance also reads at run time.
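For completeness, here is a rough sketch of the other half of that flow: publishing the S3 destination to Parameter Store once, and pulling a snapshot back down for a restore with "consul snapshot restore". The region, bucket, and snapshot name below are assumptions based on my setup.

# Publish the S3 destination once; the backup script reads this value at run time
aws ssm put-parameter \
  --name "/keystore/us-east-1/consul_s3_destination" \
  --type String \
  --value "my-consul-backup-bucket" \
  --region us-east-1

# To restore, copy a snapshot back from S3 and hand it to consul
aws s3 cp s3://my-consul-backup-bucket/dailybackup/consul-1-2018-04-18-0100.snap /tmp/restore.snap
consul snapshot restore /tmp/restore.snap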

Tuesday, April 17, 2018

Building a Vault-Consul Cluster in AWS

Vault and Consul are HashiCorp tools that can be powerfully combined to store key values. These values can vary from passwords and encryption keys to SSH keys.

You can read up on Consul HERE

You can read up on Vault HERE

One of my biggest challenges in getting this setup right was identifying the right architecture while providing the right combination of resiliency and high availability (HA).

Vault can be used all by itself, installed with a file backend; it will write the keys to the path specified in the config. The downside to this is that when you lose the host, you lose all your keys. A file backend also does not provide high availability.

Other backends include etcd, MongoDB, DynamoDB, and PostgreSQL. Some of these backends provide HA while others do not.

I decided to go with a Consul backend for the following reasons:


  • Consul is a HashiCorp product, just like Vault: one house to solve all my problems. This might be a good choice if you plan to buy enterprise support, since in most cases HashiCorp offers a combined Vault and Consul support package.
  • HA persistent data. Just like DynamoDB, etcd, and Google Cloud Storage, the Consul storage backend provides HA.

Technical Implementation

Number of servers - 5
Number of servers running CONSUL - 3, each running the Consul server agent.
Number of servers running VAULT - 2, each running the Consul client agent and the Vault server.

In my case, the environment is deployed in AWS, but bear in mind the concepts are similar even if you were deploying this in a traditional datacenter or another cloud provider's environment.


If you need to know how to configure Consul, there is good documentation from DigitalOcean.

One of the important highlights of this setup is the configuration of the Consul servers.


{
  "node_name" : "consul-1",
  "bind_addr": "10.43.51.118",
  "advertise_addr": "10.43.51.118",
  "server" : true,
  "data_dir": "/var/consul",
  "log_level" : "INFO",
  "client_addr" : "0.0.0.0",
  "bootstrap_expect": 3,
  "disable_remote_exec": true,
  "disable_update_check": true,
  "leave_on_terminate": true,
  "retry_join": [
     "provider=aws tag_key=consul-role tag_value=server"
  ]
}


node_name = the hostname the node registers with in the quorum
bind_addr/advertise_addr = the IP address of the node
bootstrap_expect = the number of servers to expect before forming a quorum
retry_join = this is an interesting concept in automation. These are basically EC2 tags when running this setup in AWS; any server that comes up with the matching tag attempts to join the quorum.

"retry-join accepts a unified interface using the go-discover library for doing automatic cluster joining using cloud metadata. To use retry-join with a supported cloud provider, specify the configuration on the command line or configuration file as a key=value key=value ... string."

For the Consul clients, the configuration will look like this:


{
  "node_name" : "vault-0",
  "bind_addr": "10.43.51.67",
  "advertise_addr": "10.43.51.67",
  "server" : false,
  "data_dir": "/var/consul",
  "log_level" : "INFO",
  "client_addr" : "0.0.0.0",
  "disable_remote_exec": true,
  "disable_update_check": true,
  "leave_on_terminate": true,
  "retry_join": [
     "provider=aws tag_key=consul-role tag_value=server"
  ]
}

Note that the "server" item is set to false. This is because the Consul agents running on the Vault nodes are only clients, so they technically do not hold your keys but act as communication forwarders to the quorum of Consul servers.
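Once the agents are up, a quick way to confirm the topology is to check the member list from any node; the three Consul servers should report a Type of "server" and the two Vault hosts a Type of "client".

# Run from any node in the cluster
consul members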

The Vault config on both Vault nodes, however, would look like this:

backend "consul" {
address = "127.0.0.1:8500"
path = "vault/"
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = 1
}

As you can see above, it uses Consul as the backend, sending traffic to localhost on port 8500 (which is the Consul client agent running on this host).
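After Vault is started against this backend, it still has to be initialised once for the whole cluster and then unsealed on each Vault node. A minimal sketch follows; the exact subcommands depend on your Vault version (older releases use vault init / vault unseal).

# TLS is disabled in this sample config, so point the CLI at plain HTTP
export VAULT_ADDR=http://127.0.0.1:8200

vault operator init      # run once for the cluster; HA data lands under the consul "vault/" path
vault operator unseal    # run on each vault node, repeating for the required key threshold
vault status             # confirm the node is unsealed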

For a production environment, you might want to use a tool like Terraform to deploy your infrastructure. I built custom AMIs for Consul and Vault with Packer, and any additional configuration is applied with an Ansible playbook on startup. The Vault and Consul configuration was baked into the Packer AMI, while the retry_join values were set using Terraform.

Here is an excerpt of my Terraform config:

resource "aws_instance" "consul" {
  count                      = "${var.consul_count}"
  ami                        = "${data.aws_ami.consul_ami.id}"
  instance_type              = "${var.consul_instance_type}"
  key_name                   = "${var.ssh_keyname}"
  subnet_id                  = "${element(local.subnet_ids, count.index)}"
  iam_instance_profile       = "${aws_iam_instance_profile.ec2.id}"
  vpc_security_group_ids     = ["${module.consul.this_security_group_id}",   "${module.ec2_utility.this_security_group_id}"]
  
  root_block_device {
    volume_type = "gp2"
    volume_size = 20
  }

  tags = "${merge(local.tags_server, map("Name", "consul-${count.index}"))}"
}

resource "aws_instance" "vault" {
  count                      = "${var.vault_count}"
  ami                        = "${data.aws_ami.vault_ami.id}"
  instance_type              = "${var.vault_instance_type}"
  key_name                   = "${var.ssh_keyname}"
  subnet_id                  = "${element(local.subnet_ids, count.index)}"
  iam_instance_profile       = "${aws_iam_instance_profile.ec2.id}"
  vpc_security_group_ids     = ["${module.vault.this_security_group_id}", "${module.ec2_utility.this_security_group_id}", "${module.consul.this_security_group_id}"]

  lifecycle {
    ignore_changes = ["ebs_block_device"]
  }

  root_block_device {
    volume_type = "gp2"
    volume_size = 20
  }

  tags = "${merge(local.tags_client, map("Name", "vault-${count.index}"))}"
}
NOTE: you will need to be fairly conversant with Terraform to understand what I have above, since some parts of the config are missing.

As part of an effort to provide some resiliency for the environment, I have a Lambda function that takes a snapshot of the Consul EBS volumes once a day. I also have a cronjob that runs a script that uses consul snapshot save to back up the Consul data.
You might not need all of this if you run Consul Enterprise, because I understand it comes with a backup (snapshot) agent.
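As a rough illustration of the scheduling piece, the cron entry on the Consul servers could look like the line below; the script path, log path, and schedule are placeholders, and the Lambda side is essentially the EBS equivalent of aws ec2 create-snapshot run against the Consul data volumes.

# /etc/cron.d/consul-backup (hypothetical path and schedule)
# Run the snapshot/S3 script from the backup write-up every night at 01:00
0 1 * * * root /opt/scripts/consul_backup.sh >> /var/log/consul_backup.log 2>&1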

Maybe one day I will write something about Terraform and Packer. I find them impressively useful for automation and some forms of configuration management.

Let me know if you find this useful and if you have any other questions.