Friday, April 19, 2019

Vault and Consul as a personal credential store


The combination of Vault and Consul has been proclaimed the answer to every secrets-management problem. In production environments, Consul on its own has been used as a KV store, for service discovery and even as a backend datastore. Vault on its own is more of a secrets management system that can be used with different storage backends, of which Consul is one.

A production environment can combine these two tools (Vault as a secure store for passwords, tokens and certificates; Consul as its backend store) for a complete secrets management solution.

If this can be used in production, I thought I could adapt it as my own personal password manager, and in this post I intend to show how I achieved that.

I have a repo on GitHub with everything you will need for this setup.

https://github.com/sksegha/vault-consul-docker

Here is a screen recording:



The repo creates two Docker containers - one Vault and one Consul. Vault is the actual secret management tool, while Consul is the backend storage.

The repo also contains a script that helps set up the "cluster", writing the generated keys out to a keys.txt file. The Consul data is saved in the data directory; as long as that directory exists, your secrets are safe.

If you decide that you want to blow away the installation + data, the cleanup script will do that for you.

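Once the containers are up and Vault is unsealed with the keys from keys.txt, day-to-day use as a personal password manager is just the vault CLI. Here is a minimal sketch, assuming the Vault container is published on localhost:8200 and a KV secrets engine is mounted at secret/ (check the repo's compose file and setup script for the actual values):

# point the CLI at the local Vault container (address is an assumption)
export VAULT_ADDR=http://127.0.0.1:8200

# authenticate with the root token the setup script wrote to keys.txt
vault login <root-token-from-keys.txt>

# store and retrieve a credential under the KV engine mounted at secret/
vault kv put secret/personal/email username=me@example.com password='s3cr3t!'
vault kv get secret/personal/email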


Saturday, April 13, 2019

Grokking Bitbucket Logs

Bitbucket has a non-standard log format, so many collectors are not able to parse its logs into a time-series database like InfluxDB.

This is where Grok patterning comes in.

But before I show you the grok pattern for Bitbucket logs, I would like to show you the difference between the customary nginx log format and Bitbucket logs.

Bitbucket has a couple of logs - atlassian-bitbucket-access.log, atlassian-bitbucket.log, atlassian-bitbucket-audit.log and a couple of others - but for the sake of this post we will compare nginx's access.log and Atlassian's atlassian-bitbucket-access.log.

This is assuming you are using nginx as a load balancer for your Bitbucket Server installation, either for proxy redirection or some form of TLS termination. Whatever your reason is, nginx will front your service and hand communication over to the Bitbucket service running on the backend. So we have two access points: one is nginx, the other the Atlassian Bitbucket service itself.

Here are some excerpts

NGINX Logs

172.16.68.113 - - [13/Apr/2019:11:43:24 -0500] "GET /scm/dt/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 401 5 "-" "git/2.17.1"
172.16.68.113 - user [13/Apr/2019:11:43:24 -0500] "GET /scm/dt/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 200 2656 "-" "git/2.17.1"
172.16.68.113 - - [13/Apr/2019:11:43:26 -0500] "GET /scm/dt/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 401 5 "-" "git/2.17.1"
172.16.68.113 - user [13/Apr/2019:11:43:26 -0500] "GET /scm/dt/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 200 2656 "-" "git/2.17.1"
172.16.68.133 - - [13/Apr/2019:11:43:31 -0500] "GET /scm/dt/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 401 5 "-" "git/2.17.1"
172.16.68.133 - user [13/Apr/2019:11:43:31 -0500] "GET /scm/dt/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 200 2656 "-" "git/2.17.1"

Bitbucket Logs

172.16.68.113,127.0.0.1 | https | i@1Q4STNAx702x1743109x3 | - | 2019-04-13 11:42:49,034 | "GET /scm/dt/repo.git/info/refs HTTP/1.0" | "" "git/2.17.1" | - | - | - | - | - | - |
172.16.68.113,127.0.0.1 | https | o@1Q4STNAx702x1743109x3 | - | 2019-04-13 11:42:49,036 | "GET /scm/dt/repo.git/info/refs HTTP/1.0" | "" "git/2.17.1" | 401 | 0 | 0 | - | 2 | - |
172.16.68.113,0:0:0:0:0:0:0:1 | https | i@1Q4STNAx702x1743110x3 | - | 2019-04-13 11:42:49,038 | "GET /scm/dt/repo.git/info/refs HTTP/1.0" | "" "git/2.17.1" | - | - | - | - | - | - |
172.16.68.113,0:0:0:0:0:0:0:1 | https | o@1Q4STNAx702x1743110x3 | cicd | 2019-04-13 11:42:49,127 | "GET /scm/dt/repo.git/info/refs HTTP/1.0" | "" "git/2.17.1" | 200 | 0 | 2644 | cache:hit, refs | 89 | - |

These are log entries from the same session, from both the frontend nginx and the backend Bitbucket process.

If you want to read more on interpreting Atlassian access logs, you can read up here, and if you wonder why a session is giving so many 401s, read up here.

Collecting nginx logs is easy enough: if you run a Telegraf agent on the host, you can configure it with the inputs below to monitor the nginx process and parse its access logs.


[[inputs.nginx]]
  ## An array of nginx stub_status URIs to gather stats from.
  ## (stub_status is usually served over plain HTTP.)
  urls = ["http://localhost:80/nginx_status"]

## Stream and parse log file(s).
[[inputs.logparser]]
  files = ["/var/log/nginx/access.log"]
  ## Only tail new lines; do not read the file from the beginning.
  from_beginning = false
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
    measurement = "nginx_access_log"

However, Bitbucket logs do not follow the conventional combined log format we saw above, so we need to convert these log strings into something InfluxDB can use.

This is where grok patterns come in.

The grok pattern for Bitbucket, based on the log strings above, could look like this:

%{IP:clientIP},%{IP:localIP} \| %{WORD:protocol:tag} \| %{DATA:requestID} \| %{USERNAME:username} \| %{TIMESTAMP_ISO8601:timestamp} \| "%{DATA:action} %{DATA:resource} %{DATA:http_version}" \| "" "%{DATA:request_details}" \| %{NUMBER:resp_code} \| %{NUMBER:bytes_read} \| %{NUMBER:bytes_written} \| %{DATA:labels} \| %{NUMBER:resp_time} \| %{DATA:session_id} \|

I have tried to follow this format

%{<capture_syntax>[:<semantic_name>][:<modifier>]}  (the almighty Grok format)

The semantic_name for each capture syntax closely follows Atlassian's recommendations and what's obtainable in the combined log format. This allows for comparison between the nginx logs and the Atlassian logs.

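As a sanity check, here is (roughly) how the last sample line above should break down under that pattern - the values are taken straight from that line:

clientIP        = 172.16.68.113
localIP         = 0:0:0:0:0:0:0:1
protocol        = https          (stored as a tag)
requestID       = o@1Q4STNAx702x1743110x3
username        = cicd
timestamp       = 2019-04-13 11:42:49,127
action          = GET
resource        = /scm/dt/repo.git/info/refs
http_version    = HTTP/1.0
request_details = git/2.17.1
resp_code       = 200
bytes_read      = 0
bytes_written   = 2644
labels          = cache:hit, refs
resp_time       = 89
session_id      = -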
Creating an input block to collect these logs should be pretty easy from here.

Add this block to the telegraf config:

[[inputs.logparser]]
  files = ["/var/atlassian/application-data/stash/log/atlassian-bitbucket-access.log"]
  from_beginning = false
  [inputs.logparser.grok]
   patterns = ["%{HTTP}"]
   measurement = "bitbucket_access_log"
   custom_patterns = '''
     HTTP %{IP:clientIP},%{IP:localIP} \| %{WORD:protocol:tag} \| %{DATA:requestID} \| %{USERNAME:username} \| %{TIMESTAMP_ISO8601:timestamp} \| "%{DATA:action} %{DATA:resource} %{DATA:http_version}" \| "" "%{DATA:request_details}" \| %{NUMBER:resp_code} \| %{NUMBER:bytes_read} \| %{NUMBER:bytes_written} \| %{DATA:labels} \| %{NUMBER:resp_time} \| %{DATA:session_id} \|
   '''

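Before restarting the agent, you can dry-run just this input against the log file to confirm the pattern actually matches. Note that because the plugin tails the file, you may need to temporarily set from_beginning = true (or append a fresh line to the log) to see output in test mode:

telegraf --config /etc/telegraf/telegraf.conf --input-filter logparser --test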
You could also collect the Atlassian audit logs with this block in your telegraf configuration:

[[inputs.logparser]]
  files = ["/var/atlassian/application-data/stash/log/audit/atlassian-bitbucket-audit.log"]
  from_beginning = false
  [inputs.logparser.grok]
   patterns = ["%{AUDIT_LOG}"]
   measurement = "stash_audit_log"
   custom_patterns = '''
     AUDIT_LOG %{IP:clientIP},%{IP:local} \| %{WORD:eventType} \| %{USERNAME:user:tag} \| %{INT:msSinceJan11970} \| %{USERNAME:eventDetails} \| \{\"authentication-method\":\"%{WORD:authenticationMethod}\"\,\"error\"\:\"%{DATA:authError}\"\} \| %{DATA:requestID} \| %{DATA:sessionID}
   '''

Friday, April 12, 2019

Bitbucket and TIG (Telegraf InfluxDB and Grafana)



Bitbucket is our Git repository management solution designed for professional teams. It gives you a central place to manage git repositories, collaborate on your source code and guide you through the development flow. (ref: https://confluence.atlassian.com/confeval/development-tools-evaluator-resources/bitbucket/bitbucket-what-is-bitbucket)

Other competing tools used for the same purpose are GitHub and GitLab.

It is basically a combination of a Git server and a web interface written in Java (served by Apache Tomcat). You also have the option of using a local Postgres installation or an external one. Git also needs to be installed on the server.

But rather than talking about Bitbucket itself, I really just want to talk about gathering operational stats on the performance of Bitbucket installations. One thing I did run into was the scarcity of public information about how best to collect these stats, but after a lot of trial and error, I found an approach that actually works.

Since Bitbucket is basically a Java application, I needed to collect JVM metrics in addition to the usual system/host-level metrics. Here is the list of metrics we will be hoping to collect:


  • System/ Host Metrics - cpu, disk, diskio, kernel, memory, network, network stats etc.
  • JVM metrics - java memory, java class loading, java threading etc.
  • nginx stats - installations could have a frontend loadbalancer like nginx. Mostly just connection metrics.
  • postgres - DB performance metrics/stats.
So how do we collect all of this information?

Telegraf
Telegraf is a plugin-driven server agent for collecting and reporting metrics. It also allows you to define where you want to send these metrics; in this case we will be sending everything to InfluxDB.

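In telegraf.conf that just means pointing the InfluxDB output at your server. A minimal sketch, with placeholder URL, database name and credentials:

[[outputs.influxdb]]
  ## Placeholder values - point these at the InfluxDB server and database you set up.
  urls = ["http://influxdb.example.com:8086"]
  database = "telegraf"
  username = "telegraf"
  password = "sup3rs3cret"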
So assuming metrics are collected, what do I do with them?

Influxdb 
Influxdb is a time-series database.

OK, metrics are collected by Telegraf and warehoused by InfluxDB; how do I see what this all looks like?

Grafana
Grafana is a data visualization and monitoring tool which is also capable of sending notifications when alert thresholds are breached.

So what should I do with all of these?

1. Install Telegraf on the bitbucket host
Refer to documentation here

2. Set up the destination InfluxDB server
Refer to documentation here
For InfluxDB, you will also need to create a database and a user with write permissions on that database (a sketch of the InfluxQL for this follows after step 3).
Refer to documentation here

3. Set up a Grafana server 
Refer to documentation here

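For reference, the database and write-only user from step 2 can be created from the influx shell with something like this (names and password are placeholders):

> CREATE DATABASE telegraf
> CREATE USER telegraf WITH PASSWORD 'sup3rs3cret'
> GRANT WRITE ON telegraf TO telegraf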
The 3 steps above have enough documentation out there if you need extra help, but one part that seems clouded in mystery is enabling the Jolokia plugin for Bitbucket. As mentioned earlier, Bitbucket itself is nothing more than a Java application, in most cases running behind a load balancer like nginx. If you use a monitoring tool like Prometheus, you could easily poll metrics from the JVM by enabling the Prometheus plugin and have your Prometheus server do the rest of the job. With Telegraf, however, the story is different. Telegraf cannot poll data from the JVM directly, hence it needs an agent like Jolokia to attach itself to the running JVM (I think this is what biology calls symbiosis) and then collect that data.

The Jolokia plugin can be enabled from the Bitbucket UI; however, you also need to enable JMX monitoring in Bitbucket to actually make it all come together.

This is documentation from Atlassian (it might be old too, so some steps may be cumbersome or unnecessary).

As far as enabling JMX is concerned, here is all you need 

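The exact file and variable name vary by Bitbucket version, so treat this as a sketch; the -D flags themselves mirror what shows up in the process listing further down. Roughly, you pass the remote JMX options to the JVM via setenv.sh (or however your version injects extra JVM arguments):

# Hypothetical snippet for <bitbucket-install>/bin/setenv.sh -- the variable name is an
# assumption, but the jmxremote flags match the running process shown below.
JMX_OPTS="-Dcom.sun.management.jmxremote.port=3333 \
  -Dcom.sun.management.jmxremote.rmi.port=45625 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.password.file=/var/atlassian/application-data/stash/shared/config/jmx.access"
JVM_SUPPORT_RECOMMENDED_ARGS="${JVM_SUPPORT_RECOMMENDED_ARGS} ${JMX_OPTS}"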
So if this is done correctly, the process listing for Bitbucket should look like this:

atlbitb+  8966  176 14.5 19871572 7155652 ?    Sl   14:39 615:44 /opt/atlassian/bitbucket/5.7.1/jre/bin/java -classpath /opt/atlassian/bitbucket/5.7.1/app -Datlassian.standalone=BITBUCKET -Dbitbucket.home=/var/atlassian/application-data/stash -Dbitbucket.install=/opt/atlassian/bitbucket/5.7.1 -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.rmi.port=45625 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/var/atlassian/application-data/stash/shared/config/jmx.access -Xms4g -Xmx4g -XX:+UseG1GC -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Djava.io.tmpdir=/var/atlassian/application-data/stash/tmp -Djava.library.path=/opt/atlassian/bitbucket/5.7.1/lib/native;/var/atlassian/application-data/stash/lib/native -Xloggc:/var/atlassian/application-data/stash/logs/2019-04-12_14-39-51-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/atlassian/application-data/stash/logs/heap.log com.atlassian.bitbucket.internal.launcher.BitbucketServerLauncher start
That excerpt shows that we have a couple of flags turned on for Bitbucket, including the ones that turn on JMX monitoring.

So here is where it gets interesting.

Telegraf configuration is determined by a file in /etc/telegraf/ named telegraf.conf. Let's go ahead and configure that file to collect all the metrics we are interested in.

Edit telegraf.conf with your favorite editor (vi, vim or nano); it should end up looking like this:

telegraf.conf

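The linked file is long, but the host-level portion is mostly stock plugins enabled with their defaults. A trimmed sketch of what that section might look like (the postgres connection string is a placeholder):

# Host-level metrics, mostly fine with defaults
[[inputs.cpu]]
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

# Database stats - connection string is a placeholder
[[inputs.postgresql]]
  address = "host=localhost user=postgres sslmode=disable"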
There are bits of the configuration above I would like to explain in a little more detail: the Jolokia piece and the log parsing piece. Maybe I will write another post about matching log lines in Bitbucket and exporting those logs to a time-series database like InfluxDB.

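For context, the Jolokia bit is just a [[inputs.jolokia2_agent]] section pointed at wherever the Jolokia plugin exposes its HTTP endpoint, with one metric block per MBean you care about. A minimal sketch - the URL and the metric selection are assumptions, so adjust them to your installation:

[[inputs.jolokia2_agent]]
  ## URL of the Jolokia agent -- assumption: adjust host/port/path to wherever
  ## the Bitbucket Jolokia plugin actually listens.
  urls = ["http://localhost:8778/jolokia"]

  ## JVM uptime (feeds the java_runtime measurement used in the Grafana query later).
  [[inputs.jolokia2_agent.metric]]
    name  = "java_runtime"
    mbean = "java.lang:type=Runtime"
    paths = ["Uptime"]

  ## Heap and non-heap memory usage.
  [[inputs.jolokia2_agent.metric]]
    name  = "java_memory"
    mbean = "java.lang:type=Memory"
    paths = ["HeapMemoryUsage", "NonHeapMemoryUsage"]

  ## Thread counts.
  [[inputs.jolokia2_agent.metric]]
    name  = "java_threading"
    mbean = "java.lang:type=Threading"
    paths = ["ThreadCount", "DaemonThreadCount", "PeakThreadCount"]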
Anyway, now we have Telegraf configured. Once the Telegraf agent on the box is restarted, series should start showing up in InfluxDB.

On the InfluxDB server:



> show measurements
name: measurements
name
----
bitbucket.atlassian
bitbucket.jvm_class_loading
bitbucket.jvm_memory
bitbucket.jvm_operatingsystem
bitbucket.jvm_runtime
bitbucket.jvm_thread
bitbucket.thread_pools
bitbucket.webhooks
bitbucket_access_log
cpu
disk
diskio
java_class_loading
java_garbage_collector
java_last_garbage_collection
java_memory
java_memory_pool
java_runtime
java_threading
kernel
linux_sysctl_fs
mem
net
netstat
nginx
nginx_access_log
postgresql
processes
stash_access_log
stash_audit_log
swap
system
This does show that we are in business. From this point on, setting up visualization for the data in influx should come easy.

Here is a visualization in grafana

Each panel consists of queries against the series in InfluxDB, e.g. the JVM Uptime panel comes from this query:


SELECT last("Uptime") FROM "java_runtime" WHERE ("host" =~ /^$host$/) AND $timeFilter GROUP BY time($interval) fill(null)

I apologize if this post skipped some details. Hopefully I can answer any questions.