Authors: Aécio Pires, Alan Santos e Gabriel Monteiro.
Application monitoring is a major challenge, therefore, several tools are used to deliver high-level control to the teams involved in data control and analysis. Considering it and bringing a little of Sensedia's experience, we can mention the following systems that can assist in monitoring a company's applications and services:
• Prometheus: a system for collecting metrics of applications and services for storage in a time series database. It is very efficient.
• AlertManager: works in an integrated way with Prometheus to assess alert rules and send notifications by email, Jira, Slack, and other supported systems.
• Grafana: an analysis and observability solution that supports several log and metrics collection systems. When it is integrated with Prometheus, it serves to display metrics in very elegant and useful dashboards for different areas of a company/organization.
• Prometheus-Operator: software to simplify and automate the installation and configuration of Prometheus, AlertManager, Grafana and exporters in Kubernetes clusters.
• VictoriaMetrics: a fast, cost-effective, and scalable time series database and monitoring solution. But in our case, it is used for long-term and centralized storage of the metrics collected by different Prometheis servers (the plural of Prometheus).
All these tools are open source and available on Github.
In the previous tutorial we learned how to create a Kubernetes cluster on AWS using Terraform. You can create the cluster and install Prometheus-Operator by following the tutorial that is available in the sensedia/open-tools repository. If you don't know what Prometheus-Operator is, I strongly recommend watching this lecture presented by Daniel Requena during StayAtHomeConf and/or see the links that are in the references.
We can highlight several advantages of using these tools to increase data control. As we are looking a lot at Open finance this year, we can talk a little bit about how these tools can help bring control and governance over banking processes.
When you integrate your software with the bank's APIs, you open a range of possibilities, but what about controlling all of that? How to monitor if the services that support these interactions are working correctly? The answer to these questions is monitoring, which can be done using Prometheus, AlertManager, Grafana and other systems.
By having effective monitoring, you can:
And these are some of the many benefits that tracking data can bring to your business.
Our production environment is multi cloud and we have several Kubernetes clusters distributed on AWS and GCP. It is there that we run the applications and services used by customers.
Basically, in each Kubernetes cluster the prometheus-operator is installed to manage only the prometheus and the exporters needed to collect the metrics. Instead of installing Grafana and Alertmanager in each cluster, we chose to install these services in a single cluster.
All metrics collected by Prometheis are sent to VictoriaMetrics, which centralizes their storage and queries. By default, Prometheus-Operator stores metrics locally for just 2 hours. With VictoriaMetrics we can store all metrics from all Prometheis for a long time.
Grafana is then configured to connect to VictoriaMetrics to query and display the metrics. Regarding alerts, all Prometheis send to an AlertManager pool. So, we can guarantee the availability and centralization of metrics and alerts. It is worth mentioning that all data is transmitted in encrypted form and access to the data is done through authentication and from authorized source IPs.
VictoriaMetrics
This tool deserves special attention, and we will explain the reason below.
From January to October 2020, we used InfluxDB as the storage solution for Prometheis' long-term metrics. But we started to have serious problems in the storage and visualization of the metrics in face of the growing demand for collecting new metrics. This does not mean that InfluxDB is a bad solution, on the contrary, it served us very well for a long time, but after extensive research, adjustments in the configuration files and a considerable increase in CPU and memory resources, we saw that running InfluxDB without using the cluster mode (which is paid) was not serving us well.
So, we decided to do a search for alternative solutions and among them were:
· Thanos;
· Cortex;
· TimeScaleDB;
After detailed analysis of each solution, conversations with friends and colleagues who already used some of them in other environments, in addition to reflecting on the business needs and the return on investment that each one would add, we opted for VictoriaMetrics.
The key factors in making this decision were:
• The simplicity of installation and configuration, after comparison with other tools;
• It was not necessary to redo the dashboards in Grafana, as VictoriaMetrics supports PromQL;
• It was not necessary to have a read-only Prometheus (that is, using the remote_read API, intermediating the communication between Grafana and VictoriaMetrics);
• Little use of CPU and memory resources to store and process a high volume of metrics. This has been observed in some use cases in production environments of globally relevant companies. See the page: https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/CaseStudies;
• Benchmark with positive results compared to other solutions, done by a person who works at Addidas and in an environment with a high volume of metrics.
We did a PoC using VictoriaMetrics (cluster mode) during Black Friday (the period of the year when the greatest volume of metrics generation occurs according to the demand and business model of our customers). During this period, we had no problems monitoring stack and it has remained that way ever since. Before, we had recurrent problems monitoring stack almost every day. It did not impact the customer’s production environment, but it demanded a good amount of time from the operating teams during troubleshooting.
In cluster mode, VictoriaMetrics consists of the following components:
• vmstorage: stores metrics in a persistent volume. In our case, the data is stored on EBS disks;
• vmselect: used for reading/querying metrics. It receives the reading requests, interacts with vmstorage to obtain access to them and then sends the requested data.
• vminsert: used for writing metrics. It receives the writing requests and passes them to vmstorage for storage;
• vmauth: is a simple authentication proxy that redirects read/write requests to vmselect/vminsert. It validates the username and password for Basic Auth headers, compares them with the defined redirection settings, and acts as a proxy for HTTP requests.
About the numbers:
• 28 Prometheis (one for each cluster) sending metrics to VictoriaMetrics;
• 2 pods for the vmselect component
o CPU: 500 millicores minimum and 1 CPU maximum
o Memory: 500 MB minimum and 3 GB maximum
Volume: 8 GB
• 2 pods for the vminsert component
o CPU: 500 millicores minimum and 1 CPU maximum
o Memory: 500 MB minimum and 3 GB maximum
Volume: 8 GB
• 2 pods for the vmstorage components:
o CPU: 500 millicores minimum and 1 CPU maximum
o Memory: 500 MB minimum and 4 GB maximum
Volume: 1 TB
• 2 pods for the vmauth component
o CPU: 200 millicores minimum and 1 CPU maximum
o Memory: 128 MB minimum and 1 GB maximum
• 2 Load Balancers for ALB (Application Load Balancer) type vmauth:
o 1 internal type to receive the Prometheis metrics that are in the same VPC as VictoriaMetrics;
o 1 internet-facing type to receive metrics from other Prometheis that are in other VPCs or in another Cloud Provider.
• Active Time Series: 2.3 Million
• Disk Space Usage: 67 GB (metrics are rotated every 15 days)
• Ingestion Rate: 40.1 K points/second
• Requests Rate: 134 req/second
• Total Datapoints: 90.3 billion (metrics are rotated every 15 days)
• Network Usage:
o 20 Mbps used only to receive metrics from all Prometheis.
o 70 Kbps used only for checking metrics through Grafana.
During migration activities to VictoriaMetrics, we also had to develop the Helm Chart for the vmauth component. The code for this Helm Chart was shared in the official VictoriaMetrics repository and we also helped to organize the documentation of other Helm Charts using helm-docs. This was a simple way of giving back and thanking the open-source community.
If you are interested in doing a test with VictoriaMetrics on Kubernetes clusters, we recommend using Helm Charts which are available at: https://github.com/VictoriaMetrics/helm-charts. For more information, access the links that are in the references.
There are several dashboards available on the Grafana website to view VictoriaMetrics internal metrics, but we recommend using this one: https://grafana.com/grafana/dashboards/11831.
Final considerations
In this tutorial, we learned more about a set of tools used for monitoring services and applications and discovered a solution that can be used for long-term and centralized storage of metrics collected by Prometheus.