Using Prometheus and Grafana to Monitor Your Email Pipeline

Why Email Pipeline Metrics Matter

Most mail administrators operate blind. They know something is wrong when users complain, and they fix it by digging through raw log files. Prometheus and Grafana change that: instead of grep-and-hope, you get time-series graphs, threshold alerts, and at-a-glance health dashboards that show exactly what is happening across every domain you protect.

Spam Killer exposes a Prometheus-compatible metrics endpoint that covers every meaningful event in the pipeline: incoming connections, authentication outcomes, filtering decisions, forwarding results, and per-domain breakdowns. Once you wire this into Grafana, a single screen tells you whether your mail infrastructure is healthy right now.

The Metrics That Actually Matter

Not all metrics are equally useful. These are the ones worth putting on a dashboard:

Message volume — total messages accepted per minute, broken down by domain. A sudden drop usually means a DNS problem; a sudden spike usually means you are under attack or someone has published your address on a spam list.
Spam rate — percentage of accepted messages classified as spam. Trending upward over a week? Your domain may have been added to a bulk mailing list or a botnet has found your MX records.
SPF pass/fail ratios — SPF failures per domain tell you whether legitimate senders are misconfigured, or whether someone is actively spoofing your domain. A sustained SPF fail rate above 5% deserves investigation.
DKIM pass/fail ratios — a similar signal to SPF, but at the message-signature level. DKIM failures from a sending server you trust usually indicate a key rotation that was not propagated to DNS.
Forwarding errors — errors returned by your upstream mail server when Spam Killer attempts delivery. A sustained or rising error rate here means your upstream server is unhealthy or is rejecting messages it should accept.
Rate-limited connections — how many inbound connections were throttled because the sending IP exceeded your configured rate limit. A high count from a single IP block is a strong signal of a spam campaign.
Greylisted counts — messages deferred on first attempt. Legitimate senders retry; spambots typically do not. Watching this metric over time shows you how effective greylisting is at your volume.

Enabling the Prometheus Endpoint

The metrics endpoint is disabled by default. To enable it, add the following block to your spamkiller.yaml configuration file:

metrics:
  enabled: true
  listen: "127.0.0.1:9150"
  path: "/metrics"
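  # Optional: restrict scraping to your Prometheus server IP.
  # With listen bound to 127.0.0.1 only local clients can connect, so bind a
  # reachable address if Prometheus scrapes from another host.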
  allowed_ips:
    - "10.0.1.20"

After restarting Spam Killer, you can verify the endpoint is live with a quick curl:

curl -s http://127.0.0.1:9150/metrics | head -40

You should see the standard Prometheus text exposition format, with lines beginning # HELP and # TYPE followed by metric samples.
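
With the endpoint live, Prometheus still needs a scrape job that points at it. The fragment below is a minimal sketch for prometheus.yml, assuming Prometheus runs on the same host as Spam Killer (the listener above is bound to 127.0.0.1). The job name and scrape interval are illustrative choices, not values from the Spam Killer documentation.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "spamkiller"            # hypothetical job name
    metrics_path: "/metrics"          # matches the path set in spamkiller.yaml
    scrape_interval: 15s              # assumed; tune for your traffic volume
    static_configs:
      - targets: ["127.0.0.1:9150"]   # matches the listen address set in spamkiller.yaml
```

After reloading Prometheus, the target should appear as UP on the Status → Targets page.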

Key Metrics Reference

Here is a summary of the most useful metrics exposed by the endpoint:

spamkiller_messages_accepted_total — Counter. Total messages accepted from inbound connections, labeled by domain.
spamkiller_messages_spam_total — Counter. Messages classified as spam, labeled by domain.
spamkiller_messages_forwarded_total — Counter. Messages successfully forwarded to upstream, labeled by domain.
spamkiller_forward_errors_total — Counter. Delivery failures when forwarding to upstream, labeled by domain and error_code.
spamkiller_spf_results_total — Counter. SPF check results, labeled by domain and result (pass, fail, softfail, neutral, none).
spamkiller_dkim_results_total — Counter. DKIM verification results, labeled by domain and result.
spamkiller_greylisted_total — Counter. Messages deferred by greylisting, labeled by domain.
spamkiller_rate_limited_total — Counter. Connections dropped or throttled due to rate limiting, labeled by sender_ip_range.
spamkiller_active_connections — Gauge. Current number of open SMTP connections.
spamkiller_queue_depth — Gauge. Messages currently awaiting forwarding retry.

Setting Up Grafana

If you do not already have Grafana running, the quickest path is via Docker:

docker run -d -p 3000:3000 --name grafana grafana/grafana-oss

Once Grafana is up, add your Prometheus instance as a data source: go to Connections → Data sources → Add data source, select Prometheus, and enter the URL of your Prometheus server (e.g. http://prometheus:9090). Click Save & Test to confirm connectivity.
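
If you prefer configuration files to clicking through the UI, Grafana can load the same data source from a provisioning file at startup. This is a sketch only: the file name is hypothetical, and the URL assumes Prometheus is reachable from Grafana as http://prometheus:9090, as in the example above.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090    # assumed address of your Prometheus server
    isDefault: true
```

When Grafana runs in Docker, the file has to be mounted into the container's provisioning directory for it to take effect.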

Create a new dashboard and add the following panels:

Time series: Message volume by domain — shows inbound mail volume over time, one line per domain. Useful for spotting volume anomalies immediately.
Gauge: Current spam rate — a single-number view of what fraction of mail is being classified as spam right now. Good for a top-of-dashboard health indicator.
Time series: SPF and DKIM failures — overlay SPF fail and DKIM fail rates. Divergence between the two often points to a specific misconfiguration.
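Table: Top sender domains — a table panel showing which sending domains are generating the most volume, and their spam rates. Helps you spot new problem senders quickly. Note that the metrics in the reference above are labeled by protected domain rather than by sender, so this panel needs a sender-level label from your deployment; without one, rank your own domains by volume and spam rate instead.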
Time series: Forwarding errors — a low-but-non-zero error rate is normal; a spike means your upstream server needs attention.
Stat: Queue depth — should normally be near zero. A non-zero queue depth that is growing indicates a delivery backlog.

Useful PromQL Queries

Copy these into Grafana panel queries to get started quickly:

# Messages per minute by domain (averaged over the last 5 minutes)
sum by (domain) (rate(spamkiller_messages_accepted_total[5m])) * 60

# Spam rate as a percentage (last 10 minutes)
100 * rate(spamkiller_messages_spam_total[10m])
  / rate(spamkiller_messages_accepted_total[10m])

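# SPF failure rate by domain (fraction of all SPF results).
# Both sides are summed by domain so the result label does not break matching.
sum by (domain) (rate(spamkiller_spf_results_total{result="fail"}[10m]))
  / sum by (domain) (rate(spamkiller_spf_results_total[10m]))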

# Forwarding error rate (errors per second, per domain and error_code)
rate(spamkiller_forward_errors_total[5m])

# Greylisting effectiveness (deferred vs accepted ratio)
rate(spamkiller_greylisted_total[10m])
  / rate(spamkiller_messages_accepted_total[10m])
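
If several panels and alerts reuse the same ratios, Prometheus recording rules can precompute them so each dashboard query stays cheap. The rules file below is a sketch under assumptions: the group name and record names are hypothetical (they follow the common level:metric:operations naming convention), and the file must be listed under rule_files in prometheus.yml.

```yaml
# spamkiller_recording.yml (hypothetical file name)
groups:
  - name: spamkiller_recording
    interval: 1m
    rules:
      # Accepted messages per second, per domain
      - record: domain:spamkiller_messages_accepted:rate5m
        expr: sum by (domain) (rate(spamkiller_messages_accepted_total[5m]))

      # Fraction of accepted messages classified as spam, per domain
      - record: domain:spamkiller_spam_ratio:rate10m
        expr: |
          sum by (domain) (rate(spamkiller_messages_spam_total[10m]))
            /
          sum by (domain) (rate(spamkiller_messages_accepted_total[10m]))

      # Fraction of SPF checks that failed, per domain
      - record: domain:spamkiller_spf_fail_ratio:rate10m
        expr: |
          sum by (domain) (rate(spamkiller_spf_results_total{result="fail"}[10m]))
            /
          sum by (domain) (rate(spamkiller_spf_results_total[10m]))
```

A Grafana panel can then query domain:spamkiller_spam_ratio:rate10m directly instead of repeating the full expression.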

Alert Thresholds Worth Configuring

Start with these alert rules in your Prometheus configuration and tune the thresholds for your traffic volume:

Spam rate above 30% for 15 minutes — indicates a sustained campaign targeting one of your domains. Check the top senders table immediately.
SPF failure rate above 10% for 10 minutes — could indicate active spoofing of your domain or a newly deployed sending server that is not in your SPF record.
Forwarding errors above 5/minute for 5 minutes — your upstream mail server may be down, overloaded, or rejecting messages with a 5xx code.
Queue depth above 50 for more than 5 minutes — messages are backing up and not being forwarded. Investigate the forwarding errors metric alongside this one.
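Active connections drop to zero for more than 2 minutes during business hours — Spam Killer may have stopped accepting connections or lost its network binding. This alert should page someone. If the process has crashed outright, the scrape fails and the gauge disappears rather than reading zero, so pair this with an alert on Prometheus's built-in up metric for the scrape target.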
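
The thresholds above could be written as a Prometheus rules file along these lines. Treat it as a starting sketch rather than a tested configuration: the alert names, severity labels, the job name in the last rule, and the 08:00 to 18:00 business-hours window are all assumptions, and hour() evaluates in UTC.

```yaml
# spamkiller_alerts.yml (hypothetical file name; list it under rule_files in prometheus.yml)
groups:
  - name: spamkiller_alerts
    rules:
      - alert: SpamKillerHighSpamRate
        expr: |
          100 * sum by (domain) (rate(spamkiller_messages_spam_total[10m]))
            / sum by (domain) (rate(spamkiller_messages_accepted_total[10m]))
            > 30
        for: 15m
        labels: {severity: warning}
        annotations:
          summary: "Spam rate above 30% for {{ $labels.domain }}"

      - alert: SpamKillerHighSpfFailureRate
        expr: |
          sum by (domain) (rate(spamkiller_spf_results_total{result="fail"}[10m]))
            / sum by (domain) (rate(spamkiller_spf_results_total[10m]))
            > 0.10
        for: 10m
        labels: {severity: warning}

      - alert: SpamKillerForwardingErrors
        # rate() is per second, so multiply by 60 to compare against 5/minute
        expr: sum(rate(spamkiller_forward_errors_total[5m])) * 60 > 5
        for: 5m
        labels: {severity: critical}

      - alert: SpamKillerQueueBacklog
        expr: spamkiller_queue_depth > 50
        for: 5m
        labels: {severity: warning}

      - alert: SpamKillerNoConnections
        # Only fires between 08:00 and 18:00 UTC (assumed business hours)
        expr: |
          sum(spamkiller_active_connections) == 0
            and on() (hour() >= 8 < 18)
        for: 2m
        labels: {severity: page}

      - alert: SpamKillerScrapeDown
        # Catches a crashed process, where the gauges vanish instead of reading 0
        expr: up{job="spamkiller"} == 0   # job name is an assumption
        for: 2m
        labels: {severity: page}
```

Rules defined this way fire inside Prometheus; delivering them as notifications requires routing through Alertmanager.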

Alerting on the right thresholds takes a week or two of baseline observation to calibrate. Start with conservative thresholds and tighten them once you understand your normal traffic patterns. If you would rather not maintain Prometheus rule files and a separate Alertmanager configuration for basic notifications, Grafana Alerting can evaluate the same PromQL queries shown above and send the notifications itself.
