Motivation

Why would you run a monitoring system like this? Normally I would say "because we can", but this time such a stack is actually necessary. I run autonomous systems and an IT company, and Emile is my companion in most of the non-customer projects. After two years of maintaining a shitload of Icinga2 and check_mk based systems, I decided to migrate the whole monitoring to a shiny new system. After two weeks of evaluation I tried the setup we describe in this blog post, and I can recommend it!

Quickstats

Monitoring in a nutshell: have a master to which the workers report their status. Scalable, simple, efficient: the ETVGA stack (Exporter, Telegraf, Victoria Metrics, Grafana, Alertmanager).

The bird exporter exports metrics that are scraped by Telegraf. Telegraf then sends the scraped data to Victoria Metrics. Grafana then accesses the data exposed for it by Victoria Metrics.

Overall concept

The individual nodes report their stats to the master. This makes it possible to dynamically add nodes without needing to adjust stuff on the master node.

In the example schema above, the worker nodes node[1-n].company.com report their stats to the master node located at masternode.company.com.

Setup

All files needed are located in this git repository.

The setup works like this: the Ansible inventory is built from the data provided by NetBox. The Ansible runner then uses this inventory to create the exporter and the sidecar Telegraf service that export the data on the individual nodes.

Set up the main node

Install docker + docker-compose

  • install docker
    $ curl -s https://get.docker.com | sh
  • install docker-compose
    $ sudo curl -L "https://github.com/docker/compose/releases/download/1.25.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
    $ sudo chmod +x /usr/local/bin/docker-compose

Set up the directory structure

  • create a docker directory
    $ mkdir -p /docker/monitoring
    $ cd /docker/monitoring

Insert the needed files

  • insert docker-compose here
  • insert the grafana.env here

Adjust the docker-compose to suit your needs

  • adjust host rules (replace "yourdomain.com" with your domain)
    $ sed -i 's/yourdomain.com/newdomain.com/g' docker-compose.yml
  • create passwords using htpasswd
    • create passwords for the auth (traefik, victoria-metrics)
    • create a password for grafana in the grafana.env
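
A minimal sketch of the password step, assuming the hashes end up inline in docker-compose.yml (for example in a traefik label). In that case every "$" in the hash has to be doubled, otherwise docker-compose tries to interpolate it as a variable. The user name "admin", the password "change-me" and the helper name escape_for_compose are placeholders that do not come from the repository.

```shell
# Hypothetical helper: double every "$" so that docker-compose does not
# interpret parts of the hash as variable references.
escape_for_compose() {
  printf '%s\n' "$1" | sed 's/\$/$$/g'
}

# Generate a bcrypt entry for the assumed user "admin" if htpasswd is installed
# (it ships with apache2-utils on Debian/Ubuntu) and print the escaped form.
if command -v htpasswd >/dev/null 2>&1; then
  entry="$(htpasswd -nbB admin 'change-me')"
  escape_for_compose "$entry"
fi
```

If the compose file reads the credentials from a separate file instead, the plain htpasswd output can be used without escaping.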

Deploy the compose

    $ docker-compose up -d

Set up Grafana

  • log in to Grafana using the user admin and the password defined in grafana.env
  • add the Victoria Metrics endpoint as a data source
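
Instead of clicking through the UI, the data source can probably also be created through Grafana's HTTP API. The following is only a sketch: Victoria Metrics speaks the Prometheus query API, so the data source type is "prometheus". The Grafana URL, the service name victoria-metrics, the variable GRAFANA_ADMIN_PASSWORD and both function names are assumptions; only port 8428 is the Victoria Metrics default.

```shell
# Assumed values; adjust them to your deployment.
GRAFANA_URL="${GRAFANA_URL:-https://grafana.yourdomain.com}"
VM_URL="${VM_URL:-http://victoria-metrics:8428}"

# Print the JSON body for a Prometheus-type data source pointing at the given URL.
datasource_payload() {
  printf '{"name":"VictoriaMetrics","type":"prometheus","access":"proxy","url":"%s"}\n' "$1"
}

# Create the data source via POST /api/datasources.
# This function is only defined here; call it once Grafana is reachable.
add_datasource() {
  curl -fsS -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
    -H 'Content-Type: application/json' \
    -X POST "${GRAFANA_URL}/api/datasources" \
    -d "$(datasource_payload "$VM_URL")"
}
```

Running add_datasource with GRAFANA_ADMIN_PASSWORD set to the value from grafana.env should register the endpoint; if basic auth from traefik sits in front of Grafana, the request needs to pass that as well.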

Set up the worker nodes

Ansible setup

We use Ansible to deploy Telegraf and the exporter onto the devices. The playbook performs these tasks:

  • Add influx repo
  • Add influx repo gpg key
  • Update apt cache
  • Install Telegraf
  • Build config from template
  • Restart telegraf
  • add the host to the ansible inventory

This is done as follows:

  • adjust the telegraf config file
    • host
    • password
  • run the ansible playbook
    $ ansible-playbook -i <inventory> Playboks/setup-telegraf.yml --limit "<ip>"

Master

This is the master node, which bundles the metrics. All other nodes push their metrics here, and Victoria Metrics stores the results so that Grafana can display them.

Nodes

These are worker nodes that aggregate the metrics that should be monitored. This happens in two steps:

  1. Aggregate the metrics using an exporter (such as bird_exporter)
  2. Scrape the exported data on the node using Telegraf. This periodically collects the results from the exporter and pushes the data to the Victoria Metrics instance on the master node.
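
The two steps above map onto a Telegraf config roughly like the one below. This is a sketch and not the template from the repository: it assumes bird_exporter listens on its default port 9324, that Victoria Metrics is reachable at victoria-metrics.yourdomain.com behind basic auth, and that the user is called "telegraf". The snippet only writes the config to a scratch file so it can be reviewed first.

```shell
# Write an example Telegraf config to a temporary file for review.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
[agent]
  interval = "30s"

# Step 1 output: scrape the exporter running on this node.
# 9324 is assumed to be the bird_exporter listen port.
[[inputs.prometheus]]
  urls = ["http://localhost:9324/metrics"]

# Step 2: push to Victoria Metrics on the master, which accepts
# the Influx line protocol. Host, user and password are placeholders.
[[outputs.influxdb]]
  urls = ["https://victoria-metrics.yourdomain.com"]
  username = "telegraf"
  password = "change-me"
  skip_database_creation = true
EOF
echo "wrote $conf"
```

With Telegraf installed, telegraf --config "$conf" --test should run the inputs once and print the gathered metrics without sending anything, which is a cheap way to check the scrape side before touching the master.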

Alerting

Alerting is currently done by Grafana. I want to attach the Prometheus Alertmanager, but this is currently not supported by Victoria Metrics.