If you’ve been following along with the previous two posts, you’ve already learned how to configure your Java application to expose metrics to Prometheus, and you’ve augmented your monitoring with better visualization through Grafana!
Configuring thresholds and alerts with AlertManager
All of this data-gathering and visualization is great, but in order to truly protect your application and assets, you need a way to alert your teams when something goes wrong. Prometheus addresses this with its modular AlertManager component. Revisiting our diagram from Part 1, we can see that AlertManager runs alongside Prometheus and can integrate with third-party notification tools like PagerDuty:
In order to generate notifications for events like downtime or anomalous behavior, we’ll need to perform the following high-level tasks:
• Stand up an instance of AlertManager
• Configure our Prometheus instance to use AlertManager
• Create alert rules in Prometheus
So, let’s get to it!
Goal 1: Standing up AlertManager
AlertManager runs standalone alongside Prometheus. It takes care of handling alerts: grouping them, deduplicating them, and routing them to the appropriate services like email, chat, and PagerDuty. It can also silence and inhibit certain alerts during planned maintenance or downtime; that’s beyond the scope of this article, but we’ll detail it in future posts.
AlertManager rules are conceptualized as routes, giving you the ability to write sophisticated sets of rules to determine where notifications should end up. A default receiver catches every notification, and additional services can then be configured through child routes that match certain conditions, like so:
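A routing tree along these lines might look like the following sketch (the receiver names and label values here are illustrative placeholders, not values from a real config):

```yaml
route:
  receiver: default-receiver     # every alert falls through to this receiver
  routes:
    # Outages page the on-call engineer via PagerDuty
    - match:
        severity: outage
      receiver: pagerduty
      continue: true             # keep evaluating sibling routes (matches are not exclusive)
    # Alerts labeled for a specific team go to chat
    - match:
        team: middleware
      receiver: chat
      continue: true
    # Alerts labeled for a particular group go to a mailing list
    - match:
        group: ops
      receiver: mailing-list
```

Note the `continue: true` flag, which is what allows a single alert to match multiple routes and fan out to multiple destinations.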
In this example, we have instructed AlertManager to route any notifications classified as an outage to PagerDuty. Further, if the alert matches a specific team we send it to a chat solution, and if the alert matches a particular group we send it to a mailing list. We’ll see how to apply these labels to alerts further down when we configure alerts in Prometheus. Note that these matches are not exclusive: an alert can match multiple conditions and go to multiple destinations.
Until then, to set up AlertManager:
1) Download the official distribution and unarchive the bundle. It’s pretty barebones:
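Something like the following will fetch and unpack it (the version number and platform here are assumptions; grab the current release for your platform from the Prometheus downloads page):

```shell
# Download and unpack the AlertManager release bundle (version/platform assumed)
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64
ls   # just the alertmanager and amtool binaries plus a sample alertmanager.yml
```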
As always, don’t do this as root in production.
2) The bulk of your configuration will be in a .yml configuration file, similar to how we configured Prometheus. The default configuration that ships in alertmanager.yml is enough to get AlertManager up and running, but it doesn’t contain any integration points with notification services. Configurations can be simple or complex depending on the number of notification services you integrate with. You can see a full configuration reference here.
For this example, we’ll set up an alert receiver in Slack. To do so, we’ll make use of the slack_configs receiver that ships with AlertManager, documented here.
You’ll need to configure an Incoming Webhook in your Slack instance which is outside the scope of this article, but well-documented here.
Once we have an Incoming Webhook URL from Slack, the rest of the configuration is simple. We create our receiver and use the slack_configs module. Our final configuration (which I’ve placed in a file called demo-config.yml) looks like:
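A minimal demo-config.yml along these lines might look like the sketch below. The SMTP host, email addresses, webhook URL, and channel name are all placeholders you’d replace with your own values:

```yaml
global:
  smtp_smarthost: 'localhost:25'            # placeholder SMTP relay for email alerts
  smtp_from: 'alertmanager@example.com'     # placeholder sender address

route:
  receiver: default-receiver
  routes:
    # Child route: alerts whose "service" label regex-matches "activemq" go to Slack
    - match_re:
        service: activemq
      receiver: slack

receivers:
  - name: default-receiver
    email_configs:
      - to: 'alerts@example.com'            # placeholder inbox for default alerts
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # your Incoming Webhook URL
        channel: '#alerts'                  # placeholder channel
```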
We first set some global parameters to manage our default behavior, which will be to send an email alert. We start with a “default-receiver” route, which has a corresponding “receivers” section configuring the inbox for our alert. We then add a child route to a receiver called “slack” that will be invoked whenever the “service” label on the alert regex-matches “activemq.”
3) At this point, we can fire up AlertManager with:
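Assuming you saved the configuration above as demo-config.yml in the unpacked directory, this is a single command:

```shell
# Start AlertManager with our custom configuration (listens on port 9093 by default)
./alertmanager --config.file=demo-config.yml
```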
That’s it! If AlertManager starts successfully, you should see output similar to the following:
Now we’ll need to head over to Prometheus and attach it to our new AlertManager instance.
Goal 2: Integrating Prometheus with AlertManager
This part is pretty easy — we just need to modify our existing Prometheus YAML configuration to let Prometheus know that we have an AlertManager instance ready to go. We’ll configure it statically, but discovery mechanisms are available as well. We’ll add a new section to the config called “alerting” and create a static_configs entry for our AlertManager instance. The final configuration file will look like:
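A prometheus.yml along these lines might look like the following sketch; the scrape target port for the ActiveMQ JMX exporter is an assumption carried over from the earlier posts, so adjust it to match your setup:

```yaml
global:
  scrape_interval: 30s

# Point Prometheus at our local AlertManager instance (default port 9093)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Alert rules live in a separate file, referenced here
rule_files:
  - rules-amq.yml

scrape_configs:
  - job_name: activemq
    static_configs:
      - targets:
          - localhost:8080   # JMX exporter endpoint from the earlier posts (port assumed)
```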
In this config, we are creating an alertmanagers entry pointing to the instance of AlertManager we have running locally in this demo. You might have also noticed a new entry for rule_files. This is where we configure our actual thresholds and alerts.
Goal 3: Configuring alerts
Now that we have Prometheus integrated with AlertManager, we will need to configure some rules. Prometheus gives you a great deal of functionality and flexibility to create your alert conditions which you can read more about here.
For this post, we’ll just be looking at two metrics within our ActiveMQ instance. We’ll first check to make sure the broker is actually up, and then we’ll check to see if any messages have entered the Dead Letter Queue, which is a standard place for failed messages to go in a JMS provider like ActiveMQ.
Let’s get cracking!
1) The first metric is a generic metric used by Prometheus to indicate whether it was able to scrape metrics. It is simply called “up” and carries a value of 1 for “up” and 0 for “down”:
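Querying it in the Prometheus expression browser for our ActiveMQ job returns a sample like this (the instance address is assumed from our scrape config):

```
up{instance="localhost:8080", job="activemq"}    1
```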
So in our new rules-amq.yml file, we can add an entry to catch when this metric equals “0” and react appropriately. Rule files use a top-level “groups” entry, and then are split into any number of groups. We’ll look at the whole configuration file in a moment, but for this rule specifically we will create the following configuration:
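A sketch of that rule is below. Note that Prometheus alert names must follow metric-name rules (no spaces), so the friendly name “Broker Down” is written as BrokerDown; the dashboard and runbook URLs are hypothetical placeholders:

```yaml
groups:
  - name: activemq-alerts
    rules:
      - alert: BrokerDown
        # Fire when Prometheus can no longer scrape the ActiveMQ job
        expr: up{job="activemq"} == 0
        labels:
          severity: outage
          service: activemq
        annotations:
          summary: "The ActiveMQ broker is down"
          dashboard: "http://localhost:3000/d/activemq"          # Grafana dashboard (URL assumed)
          runbook: "https://wiki.example.com/runbooks/activemq"  # internal runbook (hypothetical)
```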
In this section, we have a few things going on. First, the “- alert:” tag simply gives the alert a friendly name; we’ve specified “BrokerDown” (alert names follow metric-name rules, so they can’t contain spaces). The “expr” section is where you craft your rules, and you can perform pretty sophisticated comparisons in here. In this case, we’re simply specifying that we’ll trigger this alert if the “up” metric for the “activemq” job is ever 0. We can then create labels and annotations, including a link to our Grafana dashboard under the “dashboard” param, and even an internal runbook for the service!
2) Our next metric will concern the depth of the DLQ, so we can use the org_apache_activemq_Broker_DLQ metric, a JMX metric being scraped that contains the depth of the ActiveMQ DLQ. We’ll use a comparison operator so that the alert fires whenever the metric rises above 0:
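A sketch of that rule, which slots into the same rules group (the alert name and label values are illustrative):

```yaml
      - alert: DLQDepthNonZero
        # Fire when any message lands in the Dead Letter Queue
        expr: org_apache_activemq_Broker_DLQ > 0
        for: 1m                  # condition must hold for 1 minute before firing
        labels:
          severity: warning
          service: activemq
        annotations:
          summary: "Messages have arrived in the ActiveMQ DLQ"
```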
Note: We added the “for” parameter here, which means that this condition must be true for, in our case, 1 minute before an alert will fire. This can help cut down on unnecessary alerts for conditions which may be self-healing.
3) That’s it — our complete configuration should look like:
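Putting the two rules together, a complete rules-amq.yml sketch (alert names, label values, and annotation URLs are illustrative assumptions):

```yaml
groups:
  - name: activemq-alerts
    rules:
      - alert: BrokerDown
        expr: up{job="activemq"} == 0
        labels:
          severity: outage
          service: activemq
        annotations:
          summary: "The ActiveMQ broker is down"
          dashboard: "http://localhost:3000/d/activemq"          # Grafana dashboard (URL assumed)
          runbook: "https://wiki.example.com/runbooks/activemq"  # internal runbook (hypothetical)
      - alert: DLQDepthNonZero
        expr: org_apache_activemq_Broker_DLQ > 0
        for: 1m
        labels:
          severity: warning
          service: activemq
        annotations:
          summary: "Messages have arrived in the ActiveMQ DLQ"
```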
4) Save that configuration into the rules-amq.yml file that we are now referencing, and fire up Prometheus in the normal manner:
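With rules-amq.yml saved next to prometheus.yml, starting Prometheus looks the same as in the earlier posts:

```shell
# Start Prometheus with the config that now references AlertManager and our rules file
./prometheus --config.file=prometheus.yml
```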
That’s it for our configuration, let’s test it all out!
Goal 4: Testing our new alerts
First, let’s make sure that Prometheus has accepted our alert configuration. Open it up in a browser and click “Alerts.” In here you should see the two alerts we just configured:
Looks great! Let’s make an alert fire by bringing down our ActiveMQ broker process. Once it’s down, it could take up to 30 seconds for Prometheus to notice, since that’s our configured scrape interval. But once it does notice, you’ll see that the alert has fired:
And we should see our Alert in Slack!
Of course, this is pretty bare-bones, but you can reference the slack_configs configuration guide linked above to learn how to further customize the alert, including custom graphics, emojis, and of course text.
We’re here if you need us
This is the last in our three-part series detailing how to use Prometheus and Grafana to fully monitor a Java application. Stay tuned for future posts where we’ll dive deeper into these powerful open source monitoring solutions.
In the meantime, our OpenLogic open source support team is available 24×7 to assist you with this and other open source software!