Friday, April 12, 2019

Bitbucket and TIG (Telegraf, InfluxDB and Grafana)



Bitbucket is our Git repository management solution designed for professional teams. It gives you a central place to manage git repositories, collaborate on your source code and guide you through the development flow. (ref: https://confluence.atlassian.com/confeval/development-tools-evaluator-resources/bitbucket/bitbucket-what-is-bitbucket)

Other competing tools used for the same purpose are GitHub and GitLab.

It is basically a combination of a Git server and a web interface, written in Java and running on an embedded Apache Tomcat. You also have the option of using a local PostgreSQL installation or an external one, and Git itself needs to be installed on the server.

But rather than talking about Bitbucket itself, I really just wanted to talk about gathering operational stats on the performance of Bitbucket installations. One thing I did run into was the scarcity of public information about how best to collect these stats, but after much trial and error, I found an approach that actually works.

Since Bitbucket is basically a Java application, I needed to collect JVM metrics alongside the usual system/host level metrics. Here is the list of metrics we will be hoping to collect:


  • System/host metrics - cpu, disk, diskio, kernel, memory, network, netstat etc.
  • JVM metrics - java memory, java class loading, java threading etc.
  • nginx stats - installations often have a frontend load balancer like nginx; mostly just connection metrics.
  • postgres - DB performance metrics/stats.
So how do we collect all of this information?

Telegraf
Telegraf is a plugin-driven server agent for collecting and reporting metrics. It also allows you to define where you want to send these metrics, and in this case we will be sending them all to InfluxDB.

So assuming the metrics are collected, what do I do with them?

InfluxDB
InfluxDB is a time-series database.

OK, metrics are collected by Telegraf and warehoused by InfluxDB; how do I see what this all looks like?

Grafana
Grafana is a data visualization and monitoring tool which is also capable of sending notifications when alert thresholds are crossed.

So what should I do with all of these?

1. Install Telegraf on the Bitbucket host
Refer to documentation here
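
If you are on a Debian/Ubuntu host, installing Telegraf from the InfluxData package repository looks roughly like this (the repository URL, key and distribution codename below are assumptions based on the standard InfluxData instructions; check the current docs for your distro):

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
echo "deb https://repos.influxdata.com/ubuntu bionic stable" | sudo tee /etc/apt/sources.list.d/influxdata.list
sudo apt-get update && sudo apt-get install telegraf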

2. Set up the destination InfluxDB server
Refer to documentation here
Also, for InfluxDB you will need to create a database and a user with write permissions on that database.
Refer to documentation here
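
For example, from the influx CLI, creating a database for Telegraf and a user that can write to it looks like this (the database name, user and password are just placeholders):

> CREATE DATABASE telegraf
> CREATE USER telegraf WITH PASSWORD 'sup3rs3cr3t'
> GRANT WRITE ON telegraf TO telegraf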

3. Set up a Grafana server 
Refer to documentation here
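
If you just want something up quickly, Grafana also runs fine as a container (assuming Docker is available; grafana/grafana is the official image):

docker run -d --name grafana -p 3000:3000 grafana/grafana

Once it is up, add the InfluxDB database from step 2 as a data source in the Grafana UI so the dashboards further down have something to query.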

The 3 steps above have enough documentation out there if you need extra help, but one part that seems clouded in mystery is enabling the Jolokia plugin for Bitbucket. As mentioned earlier, Bitbucket itself is little more than a Java application running on an embedded Apache Tomcat, in most cases behind a load balancer like nginx. If you use a monitoring tool like Prometheus, you could easily poll metrics from the JVM by enabling the Prometheus plugin and having your Prometheus server do the rest of the job. With Telegraf, however, the story is different. Telegraf cannot poll data from the JVM directly, so it needs an agent like Jolokia to attach itself to the running JVM (I think the term for this in biology is symbiosis) and expose that data for collection.

The Jolokia plugin can be enabled from the Bitbucket UI; however, you also need to enable JMX monitoring in Bitbucket to actually make it all come together.
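
A quick way to confirm Jolokia is attached and reachable before wiring up Telegraf is to hit its version endpoint (the port and path below are the Jolokia agent defaults and may differ depending on how the plugin/agent is configured in your installation):

curl -s http://localhost:8778/jolokia/version

If the agent is up, this returns a small JSON document with the agent version and protocol details.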

Here is some documentation from Atlassian (it might be old, so some steps may be cumbersome or unnecessary).

As far as enabling JMX is concerned, here is all you need 
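
The original snippet is not reproduced here, but going by the flags visible in the process listing further down, the gist is: turn on JMX in bitbucket.properties and pass the standard com.sun.management.jmxremote options to the JVM. A sketch (the property name, file locations, ports and password file path are taken from this particular installation and may differ in yours):

# $BITBUCKET_HOME/shared/bitbucket.properties
jmx.enabled=true

# JVM options (e.g. via JAVA_OPTS in setenv.sh for the Bitbucket service)
-Dcom.sun.management.jmxremote.port=3333
-Dcom.sun.management.jmxremote.rmi.port=45625
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.password.file=/var/atlassian/application-data/stash/shared/config/jmx.access
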
So if this is done correctly, the process listing for Bitbucket should look like this:



atlbitb+  8966  176 14.5 19871572 7155652 ?    Sl   14:39 615:44 /opt/atlassian/bitbucket/5.7.1/jre/bin/java -classpath /opt/atlassian/bitbucket/5.7.1/app -Datlassian.standalone=BITBUCKET -Dbitbucket.home=/var/atlassian/application-data/stash -Dbitbucket.install=/opt/atlassian/bitbucket/5.7.1 -Dcom.sun.management.jmxremote.port=3333 -Dcom.sun.management.jmxremote.rmi.port=45625 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/var/atlassian/application-data/stash/shared/config/jmx.access -Xms4g -Xmx4g -XX:+UseG1GC -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Djava.io.tmpdir=/var/atlassian/application-data/stash/tmp -Djava.library.path=/opt/atlassian/bitbucket/5.7.1/lib/native;/var/atlassian/application-data/stash/lib/native -Xloggc:/var/atlassian/application-data/stash/logs/2019-04-12_14-39-51-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/atlassian/application-data/stash/logs/heap.log com.atlassian.bitbucket.internal.launcher.BitbucketServerLauncher start
That excerpt shows that we have a couple of flags turned on for Bitbucket, including the ones that enable JMX monitoring.

So here is where it gets interesting.

Telegraf configuration is determined by a file in /etc/telegraf/ named telegraf.conf. Let's go ahead and configure said file to collect all the metrics we are interested in.

Edit telegraf.conf with your favorite editor (vi, vim or nano); it should end up looking something like this:

telegraf.conf
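The full configuration from the original post is not embedded here, but a trimmed-down sketch covering the inputs discussed above plus the InfluxDB output would look something like the following. The InfluxDB URL and credentials, the Jolokia URL, the nginx status URL and the postgres connection string are all placeholders to replace with your own values, and only a couple of the Jolokia metric definitions are shown (the Telegraf jolokia2 sample configs carry the full JVM set):

[agent]
  interval = "10s"

# Where all collected metrics are written
[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]
  database = "telegraf"
  username = "telegraf"
  password = "sup3rs3cr3t"

# System/host level metrics
[[inputs.cpu]]
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

# nginx connection metrics (requires the stub_status endpoint to be enabled)
[[inputs.nginx]]
  urls = ["http://localhost/nginx_status"]

# postgres performance stats
[[inputs.postgresql]]
  address = "host=localhost user=postgres sslmode=disable"

# JVM metrics via the Jolokia agent attached to Bitbucket
[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8778/jolokia"]

  [[inputs.jolokia2_agent.metric]]
    name  = "java_runtime"
    mbean = "java.lang:type=Runtime"
    paths = ["Uptime"]

  [[inputs.jolokia2_agent.metric]]
    name  = "java_memory"
    mbean = "java.lang:type=Memory"
    paths = ["HeapMemoryUsage", "NonHeapMemoryUsage", "ObjectPendingFinalizationCount"]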

There are bits of the configuration above I would like to explain in a little more detail, particularly the Jolokia piece and the log parsing piece. Maybe I will write another post about matching log lines in Bitbucket and exporting said logs to a time-series database like InfluxDB.

Anyway, now we have Telegraf configured. Once the Telegraf agent on the box is restarted, series should start flowing into InfluxDB.
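
Assuming Telegraf was installed as a systemd service, sanity-checking the config and restarting the agent looks like this:

sudo telegraf --config /etc/telegraf/telegraf.conf --test
sudo systemctl restart telegraf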

On the InfluxDB server:



> show measurements
name: measurements
name
----
bitbucket.atlassian
bitbucket.jvm_class_loading
bitbucket.jvm_memory
bitbucket.jvm_operatingsystem
bitbucket.jvm_runtime
bitbucket.jvm_thread
bitbucket.thread_pools
bitbucket.webhooks
bitbucket_access_log
cpu
disk
diskio
java_class_loading
java_garbage_collector
java_last_garbage_collection
java_memory
java_memory_pool
java_runtime
java_threading
kernel
linux_sysctl_fs
mem
net
netstat
nginx
nginx_access_log
postgresql
processes
stash_access_log
stash_audit_log
swap
system
This shows that we are in business. From this point on, setting up visualizations for the data in InfluxDB should come easy.

Here is a visualization in Grafana:

Each panel consists of queries against the series in InfluxDB; e.g., the JVM Uptime panel comes from the query below ($host, $timeFilter and $interval are Grafana template/built-in variables filled in from the dashboard):


SELECT last("Uptime") FROM "java_runtime" WHERE ("host" =~ /^$host$/) AND $timeFilter GROUP BY time($interval) fill(null)

I apologize if this post skipped some details. Hopefully I can answer any questions.
