Prometheus Alerts

Introduction

Effective monitoring isn't just about collecting metrics; it's about being notified when things go wrong. Prometheus has a powerful, built-in alerting system that allows you to define alert conditions using the same PromQL query language you use for graphing. When an alert's condition is met, it enters a "firing" state.

In this lab, you will learn the fundamentals of Prometheus alerting. You will start with a pre-configured environment running Prometheus and a Node Exporter. Your task will be to create a separate alerting rule file, define a rule to detect high CPU usage, configure Prometheus to load this file, and finally, simulate a high CPU load to watch your alert trigger in the Prometheus UI.

Understand the Alerting Environment

In this step, you will familiarize yourself with the lab environment. The setup script has already started two Docker containers for you: one for Prometheus and one for Node Exporter.

First, let's verify that both containers are running. Open a terminal and execute the docker ps command:

docker ps

You should see output similar to the following, showing the prometheus and node-exporter containers in an "Up" status.

CONTAINER ID   IMAGE                           COMMAND                  CREATED          STATUS          PORTS                                       NAMES
...            prom/prometheus                 "/bin/prometheus --c…"   15 seconds ago   Up 14 seconds   0.0.0.0:9090->9090/tcp, :::9090->9090/tcp   prometheus
...            prom/node-exporter               "/bin/node_exporter …"   16 seconds ago   Up 15 seconds   0.0.0.0:9100->9100/tcp, :::9100->9100/tcp   node-exporter

The node-exporter container exposes metrics about the host system (our lab VM), and the prometheus container is configured to scrape (collect) those metrics.

Now, let's check the Prometheus UI. To access it:

  1. In the LabEx interface, click the + (New Tab) button in the top navigation bar
  2. Choose Web Service from the dropdown menu
  3. Enter 9090 for the port number
  4. Click Open to launch the Prometheus web interface

When the new tab opens, you should see the Prometheus Expression Browser landing page. Navigate to Status -> Targets from the top navigation menu. You should see that the node_exporter job has a green "UP" state, confirming that Prometheus is successfully collecting data from it. This connection is the foundation for our alerting rule.

Prometheus Targets UI

Create alert-rules.yml for High CPU Alert

In this step, you will create a dedicated file for your alerting rules. It's a best practice to keep rules separate from the main Prometheus configuration for better organization.

We will create a file named alert-rules.yml inside your project directory. Use the nano editor to create and edit the file:

nano ~/project/alert-rules.yml

Now, copy and paste the following YAML content into the nano editor. This defines a rule group containing a single alert that triggers when CPU usage is high.

groups:
  - name: node_alerts
    rules:
      - alert: HighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is > 10% (current value: {{ $value }}%)"

Let's break down this rule:

  • groups: Rules are organized into groups. All rules in a group are evaluated sequentially, at a regular evaluation interval.
  • alert: The name of our alert, HighCpuLoad.
  • expr: The PromQL expression that is evaluated. If it returns one or more results, the alert becomes active for those series. Here, we calculate the percentage of non-idle CPU time over the last minute; if it is greater than 10%, the condition is met.
  • for: This clause specifies that the condition must be true for a continuous duration (1 minute) before the alert becomes "Firing". This prevents alerts from triggering on brief spikes.
  • labels: Extra labels attached to the alert, such as severity: warning. In a full setup, an Alertmanager can use these to group and route alerts.
  • annotations: These add human-readable information to the alert. summary and description are standard annotations. You can use template variables like {{ $labels.instance }} and {{ $value }} to include dynamic data in your alert messages.
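
To make the expr arithmetic concrete, here is a small self-contained sketch of the same calculation; the idle_rate value is a made-up sample, not real Prometheus output:

```shell
# Hypothetical average idle rate: 0.85 seconds of idle CPU time per second,
# roughly what rate(node_cpu_seconds_total{mode="idle"}[1m]) might report
# under light load.
idle_rate=0.85

# The rule computes usage = 100 - idle_rate * 100 and compares it to 10.
usage=$(awk -v r="$idle_rate" 'BEGIN { printf "%d", 100 - r * 100 }')
echo "CPU usage: ${usage}%"

if [ "$usage" -gt 10 ]; then
  echo "HighCpuLoad condition met; alert would enter Pending"
fi
```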

After pasting the content, save the file and exit nano by pressing Ctrl+X, then Y, and then Enter.
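
If you prefer a non-interactive alternative to nano, the same file can be created in one shot with a heredoc (the quoted 'EOF' keeps the shell from expanding the {{ ... }} template variables):

```shell
mkdir -p ~/project

# Write the alerting rules file without an interactive editor
cat > ~/project/alert-rules.yml <<'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: HighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load is > 10% (current value: {{ $value }}%)"
EOF
```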

Run Prometheus Container with Mounted Rules File

In this step, you'll tell Prometheus to load your new rules file and restart the container with the updated configuration.

First, you need to edit the main configuration file, prometheus.yml, to include a reference to your rules file. Open it with nano:

nano ~/project/prometheus.yml

Add a rule_files section at the top level of the file, as a sibling of global (not nested inside it — Prometheus will reject the configuration if rule_files appears under global). The file should look like this after your changes:

global:
  scrape_interval: 15s

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

Save the file and exit nano (Ctrl+X, Y, Enter).
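
As a non-interactive alternative to editing in nano, the whole configuration can also be written with a heredoc; note that rule_files sits at the top level of the file, as a sibling of global:

```shell
mkdir -p ~/project

# Write the full Prometheus configuration, including the rule_files reference
cat > ~/project/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]
  - job_name: "node_exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
EOF
```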

Now that the configuration is updated, you must restart the Prometheus container to apply the changes. First, stop and remove the old container:

docker stop prometheus
docker rm prometheus

Finally, run a new Prometheus container. This command is similar to the one from the setup script, but it includes a second -v flag to mount your alert-rules.yml file into the container.

docker run -d --name prometheus -p 9090:9090 \
  --network monitoring \
  -v /home/labex/project/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /home/labex/project/alert-rules.yml:/etc/prometheus/alert-rules.yml \
  prom/prometheus

This command ensures that both the main configuration and the alerting rules are available inside the Prometheus container.

Verify Alert Rules Loaded in Prometheus UI

In this step, you will confirm that Prometheus has successfully loaded your new alerting rule.

Go back to the Prometheus UI tab in your browser (or open a new Web Service tab on port 9090 if needed). If the page doesn't load, wait a few seconds for the new container to start up and then refresh the page.

From the top navigation bar, click on the Alerts menu item.

You should now see your HighCpuLoad alert listed. The alert will be in the Inactive section, indicated by a green background. This is the expected state because the CPU load on the system is currently low, so the alert's expression (expr) evaluates to false.

Prometheus Inactive Alert

It's important to understand the three states of an alert:

  • Inactive (Green): The alert condition is false.
  • Pending (Yellow): The alert condition has become true, but the for duration has not yet passed. Prometheus is waiting to see if the condition persists.
  • Firing (Red): The alert condition has been true for the entire for duration. In a production setup, this is when Prometheus would send the alert to an Alertmanager.

Your alert is currently inactive, which is correct. In the next step, we will cause it to fire.

Simulate Load to Test Alert Firing

In this final step, you will intentionally increase the CPU load on the system to test if your alert triggers correctly.

We can generate CPU load using a simple, infinite shell loop. In your terminal, run the following command. The & at the end will run the process in the background, so you can continue to use your terminal.

while true; do true; done &

This command starts a process that consumes 100% of a single CPU core. Now, quickly switch back to the Alerts page in the Prometheus UI (accessible via the Web Service tab on port 9090).

You will observe the state of the HighCpuLoad alert change:

  1. At the next rule evaluation after the expression becomes true (with this configuration, Prometheus uses its default evaluation interval of one minute, so this can take up to a minute), the alert will move to the Pending section and turn yellow. This means Prometheus has detected the high CPU load but is waiting for the 1m duration specified in the for clause.
  2. After being in the Pending state for one minute, the alert will move to the Firing section and turn red. This confirms that your alerting rule works as expected! You can expand the alert to see the annotations you defined, complete with the current value.

Prometheus Firing Alert

Once you have seen the alert fire, stop the load generation.

Important: To save LabEx VM server resources, please make sure you go back to your terminal and run the following command to kill the background loop process:

kill $!

After stopping the load, watch the Prometheus UI again. The alert will soon return to the Inactive (green) state, completing the test cycle.
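
As an aside, $! always expands to the most recently started background job, so if you launch other background processes in between, it can point at the wrong one. A slightly more robust sketch (the same busy loop, but with its PID captured explicitly) looks like this:

```shell
# Start the CPU burner and remember its PID explicitly
while true; do true; done &
load_pid=$!
echo "Load generator running as PID ${load_pid}"

# ...watch the alert fire in the Prometheus UI, then stop the load:
kill "$load_pid"
wait "$load_pid" 2>/dev/null || true  # reap the job so the PID is fully gone
```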

Summary

Congratulations! You have successfully configured and tested a Prometheus alert.

In this lab, you learned how to:

  • Structure alerting rules in a separate YAML file.
  • Write a PromQL expression to define an alert condition for high CPU usage.
  • Use annotations to create meaningful, human-readable alert messages.
  • Configure Prometheus to load your rule files and restart it to apply changes.
  • Observe the lifecycle of an alert in the Prometheus UI, from Inactive to Pending to Firing.
  • Simulate a condition to trigger and test your alert.

This is the first half of the alerting picture. The next logical step, which is outside the scope of this lab, would be to set up an Alertmanager instance. Prometheus would send its firing alerts to the Alertmanager, which would then be responsible for deduplicating, grouping, and routing them to actual notification channels like email, Slack, or PagerDuty.