One of the main features of Prometheus is alerting. With Prometheus it’s possible to create advanced, universal alerting rules. In this context, universal means having one alerting rule instead of many. Maintaining many alerting rules is inconvenient and time-consuming, because each rule needs its own condition or threshold value. So how can we avoid this?🤔
Suppose we have several hundred servers with different resource specifications (such as the number of CPU cores) and we need to write system load alerting rules for all of them. There are two ways to do it:
1. Write rules for each group of instances (or each instance) with the same number of CPU cores.
2. Write one universal rule which will apply to all instances.
In general, simple alerting rules have a constant threshold value. A typical example is a rule that monitors the 1-minute average of the server's system load and fires when the returned value stays greater than 4 for 3 minutes.
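A rule of this kind could look like the following sketch. It assumes the `node_load1` metric exposed by node_exporter; the group name, alert name, severity label and annotation text are illustrative and not prescribed by Prometheus.

```yaml
groups:
  - name: system-load              # illustrative group name
    rules:
      - alert: HighSystemLoad      # illustrative alert name
        # Constant threshold: the same value (4) is applied to every instance.
        expr: node_load1 > 4
        # The condition must hold for 3 minutes before the alert fires.
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High system load on {{ $labels.instance }}"
```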
Continuing this way, we would have to create further rules for instances with a different number of CPU cores.
Obviously, the second solution is better, so it is the focus of this post.
To create the universal rule, we need an expression on each side of the inequality instead of a constant threshold value. First, let’s create a recording rule that will serve as our “dynamic” threshold: the number of CPU cores of the corresponding node.
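One way to sketch such a recording rule is shown here. It assumes node_exporter 0.16 or later, where the per-core CPU metric is named `node_cpu_seconds_total` (older versions call it `node_cpu`); the rule name `instance:node_cpus:count` and the group name are illustrative choices.

```yaml
groups:
  - name: node-recording-rules     # illustrative group name
    rules:
      # One "idle" series exists per CPU core, so counting those series
      # while dropping the cpu and mode labels gives the core count per node.
      # The remaining labels (e.g. instance, job) are kept, so the result
      # can be compared directly with other per-node metrics.
      - record: instance:node_cpus:count
        expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
```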
Then we can create our universal alerting rule using the recording rule we have just defined.
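A minimal sketch of the universal rule follows. It assumes a recording rule named `instance:node_cpus:count` (a hypothetical name) that holds the CPU core count of each node and carries the same labels as `node_load1`, so the two sides of the comparison match one-to-one; the alert name and annotation text are illustrative as well.

```yaml
groups:
  - name: system-load              # illustrative group name
    rules:
      - alert: HighSystemLoad      # illustrative alert name
        # Dynamic threshold: each node is compared with its own core count.
        expr: node_load1 > instance:node_cpus:count
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "System load on {{ $labels.instance }} exceeds its CPU core count"
```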
This alerting rule is flexible and can be used for all node instances. The alert fires for any node whose system load exceeds its number of CPU cores. You can easily customize the rule: for example, to add a condition for warning severity, multiply the right side of the inequality by a constant.
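As an illustration of that customization, a warning-level variant might look like the sketch below. The factor `0.8` is an arbitrary value chosen for the example, and `instance:node_cpus:count` is again the hypothetical name of the recording rule holding the core count per node.

```yaml
groups:
  - name: system-load-warning      # illustrative group name
    rules:
      - alert: SystemLoadWarning   # illustrative alert name
        # Fires earlier than the critical rule: at 80% of the core count.
        expr: node_load1 > 0.8 * instance:node_cpus:count
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "System load on {{ $labels.instance }} is approaching its CPU core count"
```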
We covered just one use case here, server system load, but this technique can be applied in many other cases when creating alerting rules with Prometheus.