Tuesday, April 17, 2018

Building a Vault-Consul Cluster in AWS

Vault and Consul are HashiCorp tools that can be powerfully combined to store key values. These values can vary from passwords to encryption keys, SSH keys, etc.

You can read up on Consul HERE

You can read up on Vault HERE

One of my biggest challenges in getting this setup right was identifying an architecture that provides the right combination of resiliency and high availability (HA).

Vault can be used all by itself with a file backend; it writes the keys to the path specified in the config. The downside is that when you lose the host... you lose all your keys. A file backend also does not provide for high availability.
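For illustration, a minimal standalone Vault config using the file backend might look something like the sketch below (the storage path is just an example, not from my setup):

# Minimal sketch: standalone Vault with a file backend.
# The path is an example only; any local directory works.
backend "file" {
  path = "/var/lib/vault/data"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}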

Other backends include etcd, MongoDB, DynamoDB, and PostgreSQL. Some of these backends provide HA and some do not.

I decided to go with a Consul backend for the following reasons:


  • Consul is a HashiCorp product, just like Vault: one house to solve all my problems. This is also a good fit if you plan to buy enterprise support, since in most cases HashiCorp offers Vault and Consul support as a package. 
  • HA, persistent data. Just like DynamoDB, etcd, and Google Cloud Storage, the Consul storage backend provides HA.
Technical Implementation

Number of servers - 5
Number of servers running CONSUL - 3 servers, each running the Consul server agent.
Number of servers running VAULT - 2 servers, each running the Consul client agent plus the Vault server itself. 

In my case the environment is deployed in AWS, but bear in mind the concepts are similar even if you are deploying this in a traditional datacenter or another cloud provider environment.


If you need to know how to configure Consul, here is good documentation from DigitalOcean.

One important highlight is the configuration of the Consul servers in this setup. 


{
  "node_name" : "consul-1",
  "bind_addr": "10.43.51.118",
  "advertise_addr": "10.43.51.118",
  "server" : true,
  "data_dir": "/var/consul",
  "log_level" : "INFO",
  "client_addr" : "0.0.0.0",
  "bootstrap_expect": 3,
  "disable_remote_exec": true,
  "disable_update_check": true,
  "leave_on_terminate": true,
  "retry_join": [
     "provider=aws tag_key=consul-role tag_value=server"
  ]
}


node_name = the hostname the node registers with in the quorum
bind_addr / advertise_addr = the IP address of the node
bootstrap_expect = the number of servers to expect before forming a quorum
retry_join = this is an interesting piece of automation. In AWS it matches on EC2 tags, so every server that comes up with the consul-role=server tag attempts to join and form a quorum. 

"retry-join accepts a unified interface using the go-discover library for doing automatic cluster joining using cloud metadata. To use retry-join with a supported cloud provider, specify the configuration on the command line or configuration file as a key=value key=value ... string."

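A practical note on the AWS provider above: for go-discover to find the other nodes by tag, the instances need IAM permission to call ec2:DescribeInstances. A sketch of how that could be granted in Terraform is below; the role and policy names are placeholders, not the ones from my setup.

resource "aws_iam_role_policy" "consul_auto_join" {
  # Placeholder names; attach this to the role behind the instance profile.
  name = "consul-auto-join"
  role = "${aws_iam_role.ec2.id}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeInstances"],
      "Resource": "*"
    }
  ]
}
EOF
}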
For the Consul clients, the configuration looks like this: 


{
  "node_name" : "vault-0",
  "bind_addr": "10.43.51.67",
  "advertise_addr": "10.43.51.67",
  "server" : false,
  "data_dir": "/var/consul",
  "log_level" : "INFO",
  "client_addr" : "0.0.0.0",
  "disable_remote_exec": true,
  "disable_update_check": true,
  "leave_on_terminate": true,
  "retry_join": [
     "provider=aws tag_key=consul-role tag_value=server"
  ]
}

Note that the "server" item is set to false. The Consul agents running on the Vault nodes are clients only: they do not hold your keys themselves, but act as forwarders to the quorum of Consul servers. 

The Vault config on both Vault nodes would look like this:

backend "consul" {
address = "127.0.0.1:8500"
path = "vault/"
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = 1
}

As you can see above, Vault uses Consul as its backend and sends traffic to localhost on port 8500, which is the Consul client running on the same host.
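One optional addition (an assumption about your layout, not part of the config above): with the Consul backend in HA mode, each Vault node can also advertise the address it should be reached on, so standby nodes can point clients at the active one. That would be a couple of top-level lines in the same Vault config, for example:

# Hedged sketch: advertise addresses for Vault HA.
# The values are examples, using the vault-0 node IP from the client config earlier.
api_addr     = "http://10.43.51.67:8200"
cluster_addr = "https://10.43.51.67:8201"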

For a production environment you might want to use a tool like Terraform to deploy your infrastructure. I build custom AMIs for Consul and Vault, and any additional configuration is applied with an Ansible playbook on startup. The Vault and Consul configuration was baked into the Packer AMI, while the EC2 tags that retry_join matches on were set with Terraform. 

Here is an excerpt of my Terraform config: 

resource "aws_instance" "consul" {
  count                      = "${var.consul_count}"
  ami                        = "${data.aws_ami.consul_ami.id}"
  instance_type              = "${var.consul_instance_type}"
  key_name                   = "${var.ssh_keyname}"
  subnet_id                  = "${element(local.subnet_ids, count.index)}"
  iam_instance_profile       = "${aws_iam_instance_profile.ec2.id}"
  vpc_security_group_ids     = ["${module.consul.this_security_group_id}",   "${module.ec2_utility.this_security_group_id}"]
  
 root_block_device {
               volume_type = "gp2"
               volume_size = 20
              }
  tags = "${merge (local.tags_server, map ("Name", "consul-${count.index}"))}"
 }

resource "aws_instance" "vault" {
  count                      = "${var.vault_count}"
  ami                        = "${data.aws_ami.vault_ami.id}"
  instance_type              = "${var.vault_instance_type}"
  key_name                   = "${var.ssh_keyname}"
  subnet_id                  = "${element(local.subnet_ids, count.index)}"
  iam_instance_profile       = "${aws_iam_instance_profile.ec2.id}"
  vpc_security_group_ids     = ["${module.vault.this_security_group_id}", "${module.ec2_utility.this_security_group_id}", "${module.consul.this_security_group_id}"]

  lifecycle {
    ignore_changes = ["ebs_block_device"]
  }

  root_block_device {
             volume_type = "gp2"
             volume_size = 20
             }
 tags                        = "${merge (local.tags_client, map ("Name", "vault-${count.index}"))}"
}
NOTE: you will need to be fairly conversant with Terraform to follow the above, since some referenced parts of the config (the AMI data sources, locals, modules, and variables) are not shown. 
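To make the excerpt easier to follow, here is a rough sketch of what a couple of the missing pieces could look like. This is a guess at the shape, not my original code; the AMI filter values are placeholders, and the only tag that really matters is the consul-role tag that retry_join matches on.

# Sketch only: the AMI lookup referenced as data.aws_ami.consul_ami above.
# A similar block would exist for data.aws_ami.vault_ami.
data "aws_ami" "consul_ami" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["consul-*"]
  }
}

# Sketch only: locals carrying the tags merged onto each instance.
# consul-role=server is what the retry_join stanza looks for.
locals {
  tags_server = {
    "consul-role" = "server"
  }

  tags_client = {
    "consul-role" = "client"
  }
}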

To provide some resiliency for the environment, I have a Lambda function that takes a snapshot of the Consul EBS volumes once a day. I also have a cronjob that runs a script to back up Consul data with Consul itself, i.e. consul snapshot save.
You might not need all of this if you run Consul Enterprise, because I understand it ships with a snapshot agent that handles backups. 
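If you wanted to drive the daily snapshot schedule from Terraform as well, the scheduling side might look something like the sketch below. It assumes a Lambda function (named ebs_snapshot here purely as a placeholder) is defined elsewhere; the snapshot logic itself lives inside that function and is not shown.

# Sketch: invoke a (hypothetical) snapshot Lambda once a day.
# aws_lambda_function.ebs_snapshot is assumed to exist elsewhere in the config.
resource "aws_cloudwatch_event_rule" "consul_ebs_snapshot" {
  name                = "consul-ebs-snapshot-daily"
  schedule_expression = "rate(1 day)"
}

resource "aws_cloudwatch_event_target" "consul_ebs_snapshot" {
  rule = "${aws_cloudwatch_event_rule.consul_ebs_snapshot.name}"
  arn  = "${aws_lambda_function.ebs_snapshot.arn}"
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = "${aws_lambda_function.ebs_snapshot.function_name}"
  principal     = "events.amazonaws.com"
  source_arn    = "${aws_cloudwatch_event_rule.consul_ebs_snapshot.arn}"
}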

Maybe one day I will write something about Terraform and Packer. I find them impressively useful for automation and a degree of configuration management. 

Let me know if you find this useful and if you have any other questions. 


