Thursday, May 3, 2018

Setting up Python Environments for Ansible

There is often a need for a different version of Ansible than what is installed on your running OS. Initially, to avoid this headache, I ran Ansible from Docker containers. I would simply build a Docker image with my preferred version of Ansible and any other packages I might need, then run my Ansible playbooks from within the container.
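As a rough illustration of that workflow, here is a minimal sketch; the image tag and mount paths are assumptions for the example, not my exact setup.

# Build an image that pins the Ansible version you want (hypothetical tag)
docker build -t my-ansible:2.4.0.0 .

# Run a playbook from inside the container, mounting the project directory
docker run --rm -it \
  -v "$PWD":/playbooks -w /playbooks \
  my-ansible:2.4.0.0 \
  ansible-playbook -i inventory site.yml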

While Docker is a great tool, managing the various Docker images and containers can sometimes become a pain. Another method is to create Python virtual environments for specific versions of Ansible.

The steps typically involve:

  • installing pip
  • installing the Python virtualenv module
  • installing your preferred Ansible version
  • creating the Python virtual environments

All of these steps have been combined in the script below.


#!/bin/bash

ANSIBLE_VERSIONS=( "2.2.0.0" "2.2.1.0" "2.2.2.0" "2.2.3.0" "2.3.0.0" "2.3.1.0" "2.4.0.0")

VIRTUALENV_PATH="/home/$USER"

if [[ $(uname) == "Linux" ]]; then
  # Debian / Ubuntu
  if [ -f /etc/debian_version ]; then
    codename="$(lsb_release -c | awk '{print $2}')"
    sudo apt-get update
    sudo apt-get -y install build-essential libffi-dev libssl-dev python-dev \
    python-minimal python-pip python-setuptools python-virtualenv
  fi

  # Amazon Linux
  if [ -f /etc/system-release ]; then
    codename="$(awk '{print $1}' /etc/system-release)"
    if [[ $codename == "Amazon" ]]; then
      sudo yum -y install gmp-devel libffi-devel openssl-devel python-crypto \
      python-devel python-pip python-setuptools python-virtualenv && \
      sudo yum -y group install "Development Tools"
    fi
  fi

  # Fedora / CentOS / RHEL
  if [ -f /etc/redhat-release ]; then
    codename="$(awk '{print $1}' /etc/redhat-release)"
    if [[ $codename == "Fedora" ]]; then
      sudo dnf -y install gmp-devel libffi-devel openssl-devel python-crypto \
      python-devel python-dnf python-pip python-setuptools python-virtualenv \
      redhat-rpm-config && \
      sudo dnf -y group install "C Development Tools and Libraries"
    elif [[ $codename == "CentOS" ]]; then
      sudo yum -y install gmp-devel libffi-devel openssl-devel python-crypto \
      python-devel python-pip python-setuptools python-virtualenv \
      redhat-rpm-config && \
      sudo yum -y group install "Development Tools"
    fi
  fi
fi


if  ! python -c "import virtualenv" &> /dev/null;
then
    sudo pip install virtualenv;
    # Allow users to use virtualenv in their homedirs
    sudo chmod -R o+rX /usr/local/lib/python2.7/dist-packages/virtualenv*
fi


# Setup Ansible Virtual Environments
for ANSVER in "${ANSIBLE_VERSIONS[@]}"
do
  if [ ! -d "$VIRTUALENV_PATH/ansible-$ANSVER" ]; then
    virtualenv "$VIRTUALENV_PATH/ansible-$ANSVER"
    source "$VIRTUALENV_PATH/ansible-$ANSVER/bin/activate"
    pip install "ansible==$ANSVER" ansible-lint
    deactivate
  fi
done

printf "\nRun this to activate virtualenv:\n\n\tsource ~/ansible-(version)/bin/activate\n\n"
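Once the script completes, activating one of the environments and confirming the Ansible version is straightforward; the version and playbook names below are just examples.

# Activate a specific Ansible environment, verify it, then run a playbook
source ~/ansible-2.4.0.0/bin/activate
ansible --version
ansible-playbook -i inventory site.yml
deactivate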

I hope this helps.

Wednesday, April 18, 2018

Scripting Consul Backups

Consul is a HashiCorp tool for discovering and configuring services in your infrastructure. Among its many uses is as a key/value store, where it can dynamically store passwords, SSH keys, and encryption keys.

You can read more on Consul on the HashiCorp website.
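As a quick illustration of the key/value store, the consul CLI can write and read keys directly; the key names here are made up for the example.

# Store and read back a value in the Consul KV store
consul kv put app/db_password 's3cr3t'
consul kv get app/db_password

# Dump an entire prefix as JSON (handy for ad-hoc checks)
consul kv export app/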

The aim of this write-up is to show how you can back up your Consul data to S3 if you do not have the Enterprise version, which normally comes with a backup agent.

Here is a sample of my script


#!/bin/bash

BAK_DEST=/tmp/consul/backup

#Polling associated AWS variables

REGION=$(/usr/bin/curl --silent http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)

INSTANCE_ID=$(/usr/bin/curl --silent http://169.254.169.254/latest/meta-data/instance-id)

#S3 bucket is published in AWS Parameter store. You might decide to hardcode this.
S3_BUCKET=$(/usr/local/bin/aws ssm get-parameter --name "/keystore/$REGION/consul_s3_destination" --region $REGION | jq -r .Parameter.Value)

#hostname is a tag on the EC2 instance so my server can easily poll that information.
HOSTNAME=$(/usr/local/bin/aws ec2 describe-tags --region=$REGION --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=Name" --output=text | cut -f5)


#number of days to keep archives
KEEP_DAYS=2

#script variables
BAK_DATE=`date +%F`
BAK_DATETIME=`date +%F-%H%M`
BAK_FOLDER=${BAK_DEST}
BAK_DB=${BAK_DEST}/${HOSTNAME}-${BAK_DATETIME}

#CREATE the folder where the backup is to be placed
#echo 'Creating consul back up ' ${BAK_FOLDER}
#mkdir ${BAK_FOLDER}

#PERFORM Consul backup
echo 'Creating archive file ' ${BAK_DB}'.tar.gz Please wait ......'
/usr/local/bin/consul snapshot save ${BAK_DB}.snap
tar czPf ${BAK_DB}.tar.gz ${BAK_DB}.snap


#Moving backups to AWS. This uses AWS CLI to copy snapshots to S3
echo 'Copying consul backups to S3'
/usr/local/bin/aws s3 cp ${BAK_DB}.snap s3://${S3_BUCKET}/dailybackup/${HOSTNAME}-${BAK_DATETIME}.snap


# DELETE FILES OLDER THAN 2 days
echo 'Deleting backup older than '${KEEP_DAYS}' days'
find ${BAK_FOLDER} -type f -mtime +${KEEP_DAYS} -name '*.gz' -execdir rm -- {} \;
find ${BAK_FOLDER} -type f -mtime +${KEEP_DAYS} -name '*.snap' -execdir rm -- {} \;


A few items of interest: the script works on the premise that your Consul server has permission to read from AWS Parameter Store and to write to S3. The servers should also have the AWS CLI installed.

I also created a local destination for consul backups on the host /tmp/consul/backup

The hostname is a tag in AWS, so the server must also be able to describe EC2 instances to pull its tag information.

The script uses the command "consul snapshot save" to take a snapshot and save it to the local destination, then uses the AWS CLI to copy the snapshot to a predefined S3 destination. The S3 destination is published in Parameter Store, which the EC2 instance also reads.
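To round this out, a snapshot taken this way can be restored with the matching consul command, and the backup script itself can be scheduled with cron. The paths, script name, and schedule below are assumptions for illustration.

# Restore a previously saved snapshot on a consul server
consul snapshot restore /tmp/consul/backup/myhost-2018-04-18-0130.snap

# Example crontab entry: run the backup script daily at 01:30
30 1 * * * /usr/local/bin/consul_backup.sh >> /var/log/consul_backup.log 2>&1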

Tuesday, April 17, 2018

Building a Vault-Consul Cluster in AWS

Vault and Consul are HashiCorp tools that can be powerfully combined to store key values. These values can vary from passwords to encryption keys, SSH keys, etc.

You can read up on consul HERE

You can read up on vault HERE

One of my biggest challenges in getting this setup right was identifying an architecture that provided the right combination of resiliency and high availability (HA).

Vault can be used all by itself, installed with a file backend. It will write the keys to the path specified in the config. The downside is that when you lose the host, you lose all your keys. A file backend also does not provide high availability.

Other backends include etcd, MongoDB, DynamoDB, and PostgreSQL. Some of these backends provide HA while others do not.

I decided to go with a consul backend for the following reasons.


  • Consul is a HashiCorp product just like Vault: one house to solve all my problems. This is a good choice if you plan to buy enterprise support, since in most cases HashiCorp offers a combined Vault and Consul support package.
  • HA persistent data. Just like DynamoDB, etcd, and Google Cloud Storage, the Consul storage backend provides HA.

Technical Implementation

Number of servers - 5
Number of servers running CONSUL - 3 servers running the consul server agent.
Number of servers running VAULT - 2 servers running the consul client agent, plus the vault agent.

In my case the environment is deployed in AWS, but bear in mind the concepts are similar even if you were deploying this in a traditional datacenter or another cloud provider's environment.


If you need to know how to configure Consul, there is good documentation from DigitalOcean.

One important highlight is the configuration of the consul servers in this setup.


{
  "node_name" : "consul-1",
  "bind_addr": "10.43.51.118",
  "advertise_addr": "10.43.51.118",
  "server" : true,
  "data_dir": "/var/consul",
  "log_level" : "INFO",
  "client_addr" : "0.0.0.0",
  "bootstrap_expect": 3,
  "disable_remote_exec": true,
  "disable_update_check": true,
  "leave_on_terminate": true,
  "retry_join": [
     "provider=aws tag_key=consul-role tag_value=server"
  ]
}


node_name = the hostname it registers with in the quorum
bind_addr/advertise_addr = IP address of the node
bootstrap_expect = the number of servers expected to form the quorum
retry_join = an interesting concept in automation. When running this setup in AWS, it is driven by tags: every server that comes up with the matching tag attempts to join and form a quorum.

"retry-join accepts a unified interface using the go-discover library for doing automatic cluster joining using cloud metadata. To use retry-join with a supported cloud provider, specify the configuration on the command line or configuration file as a key=value key=value ... string."

For the consul clients, the configuration will look like this 


{
  "node_name" : "vault-0",
  "bind_addr": "10.43.51.67",
  "advertise_addr": "10.43.51.67",
  "server" : false,
  "data_dir": "/var/consul",
  "log_level" : "INFO",
  "client_addr" : "0.0.0.0",
  "disable_remote_exec": true,
  "disable_update_check": true,
  "leave_on_terminate": true,
  "retry_join": [
     "provider=aws tag_key=consul-role tag_value=server"
  ]
}

note that the "server" item is set to false. This is because the consul running on the vault nodes are only clients so technically do not hold your keys but act as a communication forwarder to the quorum of consul servers. 

The vault config on both vault nodes, however, would look like this:

backend "consul" {
address = "127.0.0.1:8500"
path = "vault/"
}
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = 1
}

As you can see above, it uses consul as the backend, sending traffic to localhost port 8500 (which is the consul client running on this host).
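With the consul client reachable on localhost, bringing Vault up on each vault node follows the usual init/unseal flow. A minimal sketch, assuming the config above is saved at /etc/vault/vault.hcl; exact subcommand names vary slightly between Vault versions (e.g. "vault init" vs "vault operator init").

# Start vault against the consul backend, then initialize and unseal it
vault server -config=/etc/vault/vault.hcl &

export VAULT_ADDR=http://127.0.0.1:8200
vault operator init
vault operator unseal    # repeat until the required number of unseal keys is supplied
vault status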

For a production environment, you might want to use a tool like Terraform to deploy your infrastructure. I built custom AMIs for consul and vault, and any additional configuration is deployed with an Ansible playbook on startup. The vault and consul configuration were baked into the Packer AMI, while the retry_join values were set using Terraform.

Here is an excerpt of my terraform config 

resource "aws_instance" "consul" {
  count                      = "${var.consul_count}"
  ami                        = "${data.aws_ami.consul_ami.id}"
  instance_type              = "${var.consul_instance_type}"
  key_name                   = "${var.ssh_keyname}"
  subnet_id                  = "${element(local.subnet_ids, count.index)}"
  iam_instance_profile       = "${aws_iam_instance_profile.ec2.id}"
  vpc_security_group_ids     = ["${module.consul.this_security_group_id}", "${module.ec2_utility.this_security_group_id}"]

  root_block_device {
    volume_type = "gp2"
    volume_size = 20
  }

  tags = "${merge(local.tags_server, map("Name", "consul-${count.index}"))}"
}

resource "aws_instance" "vault" {
  count                      = "${var.vault_count}"
  ami                        = "${data.aws_ami.vault_ami.id}"
  instance_type              = "${var.vault_instance_type}"
  key_name                   = "${var.ssh_keyname}"
  subnet_id                  = "${element(local.subnet_ids, count.index)}"
  iam_instance_profile       = "${aws_iam_instance_profile.ec2.id}"
  vpc_security_group_ids     = ["${module.vault.this_security_group_id}", "${module.ec2_utility.this_security_group_id}", "${module.consul.this_security_group_id}"]

  lifecycle {
    ignore_changes = ["ebs_block_device"]
  }

  root_block_device {
    volume_type = "gp2"
    volume_size = 20
  }

  tags = "${merge(local.tags_client, map("Name", "vault-${count.index}"))}"
}
NOTE: you will need to be fairly conversant with Terraform to understand what I have above, since some parts of the config are missing.

As part of an effort to provide some resiliency for the environment, I have a Lambda function that takes a snapshot of the consul EBS volumes once a day. I also have a cronjob that runs a script that uses the consul binary to back up consul data, i.e. consul snapshot save.
You might not need all of this if you run Consul Enterprise, because I understand it comes with a backup agent.

Maybe one day I will write something about Terraform and Packer. I find them impressively useful for automation and some forms of configuration management.

Let me know if you find this useful and if you have any other questions.

Tuesday, January 30, 2018

Automating backups of your AWS EBS volumes using Lambda and Cloudwatch

I found the write-up below very insightful for creating scheduled snapshots of EBS volumes in AWS.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/TakeScheduledSnapshot.html

The problem I found, however, was that the snapshots created had no description, and the approach provided no means for cleanup.

A better approach is to use a Lambda function to make API calls that take snapshots of these volumes, and another Lambda function to clean up older snapshots. This is what I ended up doing.

In Summary, this is what we are trying to achieve

- Filter information about the volumes we want to take snapshots of
- Take the snapshots of these volumes
- Label and tag these snapshots with names of their parent volume
- Delete snapshots that are past the specified retention period
- Write logs to CloudWatch

IAM Role Assignment
First we need to create an IAM role that has permission to do the following:
- retrieve information about EC2 instances
- take snapshots of their EBS volumes
- create tags for the snapshots
- delete snapshots
- make log entries to CloudWatch

In your AWS Management console, select IAM > Roles > Create Role.
Under "AWS service", select Lambda, then click on "Next: Permissions".
Skip the next page; we will create a policy later and attach it to this role.
Name your role and add a quick description.



In your AWS Management console, select IAM  > Policies > Create policy

This policy would include the tasks listed in the IAM role assignment; paste the JSON below in the JSON editor.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:*"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": "ec2:Describe*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSnapshot",
                "ec2:DeleteSnapshot",
                "ec2:CreateTags",
                "ec2:ModifySnapshotAttribute",
                "ec2:ResetSnapshotAttribute"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Select the policy you just created and attach it to the role you created earlier.
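If you prefer the CLI over the console, the same role and policy can be created with the AWS CLI; the names and file paths here are placeholders.

# Create the role with a Lambda trust policy, create the policy, then attach it
aws iam create-role --role-name ebs-snapshot-lambda \
  --assume-role-policy-document file://lambda-trust.json

aws iam create-policy --policy-name ebs-snapshot-policy \
  --policy-document file://ebs-snapshot-policy.json

aws iam attach-role-policy --role-name ebs-snapshot-lambda \
  --policy-arn arn:aws:iam::<account-id>:policy/ebs-snapshot-policy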

Now that we have our permissions in place, we will go and create our Lambda function.

Select Lambda in the list of AWS services and click on "Create a Function".
Type in the name of your function; for runtime, select Python 2.7.
For Role, select "existing role" and choose the IAM role you created earlier from the list.

Click on "Create function" and the next screen takes you to the function's main page.
We will be using the boto3 library for the Lambda function; paste the following in the function section.

import boto3
import collections
import datetime

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    
    # Get list of regions
    regions = ec2.describe_regions().get('Regions',[] )

    # Iterate over regions
    for region in regions:
        print "Checking region %s " % region['RegionName']
        reg=region['RegionName']

        # Connect to region
        ec2 = boto3.client('ec2', region_name=reg)
    
        # Get all in-use volumes in all regions  
        result = ec2.describe_volumes( Filters=[{'Name': 'tag:service', 'Values': ['keypads']}])
        
        for volume in result['Volumes']:
            print "Backing up %s in %s" % (volume['VolumeId'], volume['AvailabilityZone'])
        
            # Create snapshot
            result = ec2.create_snapshot(VolumeId=volume['VolumeId'],Description='Created by Lambda backup function ebs_snapshot_consul')
        
            # Get snapshot resource 
            ec2resource = boto3.resource('ec2', region_name=reg)
            snapshot = ec2resource.Snapshot(result['SnapshotId'])
        
            # Default values in case the volume is missing one of the tags
            volumename = 'N/A'
            servicename = 'N/A'
        
            # Find the following tags for volume if it exists
            if 'Tags' in volume:
                for tags in volume['Tags']:
                    if tags["Key"] == 'Name':
                        volumename = tags["Value"]
                    if tags["Key"] == 'service':
                        servicename = tags["Value"]
        
            # Add volume name to snapshot for easier identification
            snapshot.create_tags(Tags=[{'Key': 'Name','Value': volumename},{'Key': 'service','Value': servicename}])

The code above scans through all regions and filters volumes that are tagged with "service" == "keypads", as seen below:

result = ec2.describe_volumes( Filters=[{'Name': 'tag:service', 'Values': ['keypads']}])

It also tags the resulting snapshot with the name of the parent volume and the parent service tag, as seen below:

# Find the tags for volume if it exists
  if 'Tags' in volume:
      for tags in volume['Tags']:
          if tags["Key"] == 'Name':
              volumename = tags["Value"]
          if tags["Key"] == 'service':
              servicename = tags["Value"]

Note that you can use other tags for filtering your volumes, and you can leave out the filters entirely if you want to take snapshots of all volumes.
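If you want to preview which volumes a given tag filter will match before the function runs, the same filter works with the AWS CLI; the tag key and value below match the example above.

# List volume IDs carrying the tag service=keypads in the current region
aws ec2 describe-volumes \
  --filters Name=tag:service,Values=keypads \
  --query 'Volumes[].VolumeId' --output text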

For your function, adjust some basic settings:

- set the timeout value to 2 minutes (enough time for the code to run)
- set memory to 128 MB
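The same settings can also be applied from the CLI; the function name below is a placeholder for whatever you named yours.

# Give the function a 2 minute timeout and 128 MB of memory
aws lambda update-function-configuration \
  --function-name ebs_snapshot_backup \
  --timeout 120 --memory-size 128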

Go ahead and save, then test your function.

The log output will look like this if the run was a success:


Checking region ap-south-1 
Checking region eu-west-3 
Checking region eu-west-2 
Checking region eu-west-1 
Checking region ap-northeast-2 
Checking region ap-northeast-1 
Checking region sa-east-1 
Checking region ca-central-1 
Checking region ap-southeast-1 
Checking region ap-southeast-2 
Checking region eu-central-1 
Checking region us-east-1 
Checking region us-east-2 
Checking region us-west-1 
Checking region us-west-2 
END RequestId: d610f18d-05c9-11e8-a062-d72177d9ae3e
REPORT RequestId: d610f18d-05c9-11e8-a062-d72177d9ae3e Duration: 13258.99 ms Billed Duration: 13300 ms Memory Size: 128 MB Max Memory Used: 63 MB
Now our snapshot creation function is complete. However, we need to create a schedule around it; this is where triggers come into play. In the list of triggers on the left-hand side, click on CloudWatch Events.
Create a new rule, enter the rule name and description, and choose a cron expression.
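The same trigger can be wired up from the CLI as well; rule and function names here are placeholders, and the schedule is an arbitrary daily 03:00 UTC example.

# Create the scheduled rule, let it invoke the function, then add the target
aws events put-rule --name daily-ebs-snapshot \
  --schedule-expression 'cron(0 3 * * ? *)'

aws lambda add-permission --function-name ebs_snapshot_backup \
  --statement-id daily-ebs-snapshot --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:<region>:<account-id>:rule/daily-ebs-snapshot

aws events put-targets --rule daily-ebs-snapshot \
  --targets 'Id=1,Arn=arn:aws:lambda:<region>:<account-id>:function:ebs_snapshot_backup'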

Click Add, and save your function.
And you are all done with snapshot creation. 
Now we need to create another function to clean up older snapshots.

Follow the same process for creating a Lambda function as above; here is what the lambda_function.py will look like:

import boto3
from botocore.exceptions import ClientError

from datetime import datetime,timedelta

def delete_snapshot(snapshot_id, reg):
    print "Deleting snapshot %s " % (snapshot_id)
    try:
        ec2resource = boto3.resource('ec2', region_name=reg)
        snapshot = ec2resource.Snapshot(snapshot_id)
        snapshot.delete()
    except ClientError as e:
        print "Caught exception: %s" % e

    return
    
def lambda_handler(event, context):
    
    # Get current timestamp in UTC (snapshot StartTime is reported in UTC)
    now = datetime.utcnow()

    # AWS Account ID    
    account_id = '1234567890'
    
    # Define retention period in days
    retention_days = 3
    
    # Create EC2 client
    ec2 = boto3.client('ec2')
    
    # Get list of regions
    regions = ec2.describe_regions().get('Regions',[] )

    # Iterate over regions
    for region in regions:
        print "Checking region %s " % region['RegionName']
        reg=region['RegionName']
        
        # Connect to region
        ec2 = boto3.client('ec2', region_name=reg)
        
        # Lets grab all snapshot id's by Tags
        result = ec2.describe_snapshots( Filters=[{'Name': 'tag:service', 'Values': ['keystore']}] )
    
        for snapshot in result['Snapshots']:
            print "Checking snapshot %s which was created on %s" % (snapshot['SnapshotId'],snapshot['StartTime'])
       
            # Remove timezone info from snapshot in order for comparison to work below
            snapshot_time = snapshot['StartTime'].replace(tzinfo=None)
        
            # Subtract snapshot time from now returns a timedelta 
            # Check if the timedelta is greater than retention days
            if (now - snapshot_time) > timedelta(retention_days):
                print "Snapshot is older than configured retention of %d days" % (retention_days)
                delete_snapshot(snapshot['SnapshotId'], reg)
            else:
                print "Snapshot is newer than configured retention of %d days so we keep it"(retention_days)

When you test your Lambda function, the logs will look like this:
      
    Deleting snapshot snap-010444196f3597b6b
    Checking region us-east-2
    Checking region us-west-1
    Checking region us-west-2
    END RequestId: 5630a0fb-05e4-11e8-8aa1-3d6c72fb82d4
    REPORT RequestId: 5630a0fb-05e4-11e8-8aa1-3d6c72fb82d4 Duration: 17820.37 ms Billed Duration: 17900 ms Memory Size: 128 MB Max Memory Used: 67 MB
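You can also trigger a test run from the CLI instead of the console; the function name is again a placeholder.

# Invoke the cleanup function manually and inspect its output
aws lambda invoke --function-name ebs_snapshot_cleanup out.json
cat out.json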

Monday, January 29, 2018

New Season, New Me

So my old posts were basically papers I wrote while studying for my Master's degree in Information Systems Management.

I finally completed my studies sometime in 2013, after which I immediately changed fields and started working in IT full time. My previous role was more like a hybrid business systems analyst: I acted as an intermediary between core system/software developers and the front-end users of those products.

I had always wanted a full-blown IT position, which was why I took IT certifications (self-paced learning) while in my old role. I bagged a CCNA certification without ever touching a physical switch or router.

I got a chance in 2013 to jump ship; it was a fresh start for me. Six years after my first degree, I was starting a job as a "Storage Engineer". It was a fertile learning ground, and I put everything I had into it. I bagged 8 storage certifications in the space of 2 years and soon became a SNIA certified Storage Network Expert.

With all my storage/server/network knowledge, I moved again, this time into cloud engineering, working with AWS automation tools. I also hold a CISSP certification.

I love what I do.

I remembered that I had this blog from a long time ago, so I decided it might be a good platform to share some of the interesting things I have been working on.

Watch out for more cool stuff.