Prometheus - Chaos Edition

Despite its dramatic name, Prometheus Chaos Edition (PCE) is not an official Prometheus release. It is a concept (and accompanying script/container) popularized by the Prometheus community and by chaos experiments built around tools like kube-prometheus-stack.

In this post, we’ll explore what PCE is, how to deploy it, and why chaos engineering your observability pipeline is the smartest gamble you’ll make this quarter.

Once running, the sidecar exposes an HTTP API on `:9091`. You can now inject failures:
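The sidecar's API surface isn't documented in this post, so treat the following as a sketch: the `/chaos/inject` endpoint, the payload fields, and the experiment type below are assumptions used for illustration, not a published contract.

```python
# inject_chaos.py -- hedged sketch; endpoint path and payload shape are assumptions
import requests

CHAOS_API = "http://localhost:9091"  # the sidecar's HTTP API from the deployment step

def inject(experiment: dict) -> None:
    """POST a chaos experiment to the (hypothetical) /chaos/inject endpoint."""
    resp = requests.post(f"{CHAOS_API}/chaos/inject", json=experiment, timeout=5)
    resp.raise_for_status()
    print(f"injected: {experiment}")

if __name__ == "__main__":
    # Example experiment: delay scrapes of one target for five minutes.
    inject({
        "type": "scrape_latency",
        "target": "node-exporter:9100",
        "delay_ms": 30000,
        "duration": "5m",
    })
```

However the real sidecar spells its endpoints, the point is the same: failures are injected over HTTP, so experiments can be scripted, scheduled, and torn down automatically.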

The sidecar isn't the only way to cause trouble. You can also stand up a deliberately misbehaving exporter that serves malformed metrics a fraction of the time:

```python
# malicious_exporter.py
from flask import Flask, Response
import random

app = Flask(__name__)

def real_metrics():
    # Placeholder for the exporter's genuine metrics payload (stubbed so the example runs).
    return "up 1\n"

@app.route('/metrics')
def metrics():
    if random.random() < 0.2:  # 20% of the time, serve a syntactically invalid payload
        return "malformed_metric{ invalid syntax", 200
    return Response(real_metrics(), mimetype='text/plain')
```

To deploy, run the chaos sidecar in the same network namespace as your existing Prometheus container:

```bash
# Pull the chaos edition sidecar
docker pull quay.io/prometheuschaos/chaos-sidecar:latest

# Run it in the network namespace of the existing Prometheus container
docker run -d --name prometheus-chaos \
  --network container:prometheus \
  quay.io/prometheuschaos/chaos-sidecar
```


Before you run any experiment, plan for the obvious risks:

| Risk | Mitigation |
| --- | --- |
| PCE accidentally runs on production | Use namespace isolation and an explicit `--chaos.enabled=false` flag in prod. |
| Permanent data loss | Run against a replica Prometheus with `--storage.tsdb.retention.time=6h`. |
| Alert fatigue | Notify a separate “chaos channel” during experiments. |
| Control plane overload | Limit chaos duration (e.g., 5 minutes max). |
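The first two mitigations are easy to automate. As a sketch (the environment variable names here are illustrative, not a PCE convention), a pre-flight check might refuse to start unless chaos is explicitly enabled and the target is clearly non-production:

```python
# preflight.py -- hypothetical guard; CHAOS_ENABLED / PROMETHEUS_ENV are illustrative names
import os
import sys

def assert_safe_to_run() -> None:
    """Abort unless chaos was explicitly opted into and the target is not production."""
    if os.environ.get("CHAOS_ENABLED", "false").lower() != "true":
        sys.exit("CHAOS_ENABLED is not 'true'; refusing to start the experiment.")
    if os.environ.get("PROMETHEUS_ENV", "production").lower() == "production":
        sys.exit("Target environment looks like production; refusing to start.")

if __name__ == "__main__":
    assert_safe_to_run()
    print("Pre-flight checks passed; keep the experiment under 5 minutes.")
```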

Enter Prometheus Chaos Edition – a little-known, experimental tool designed to do the unthinkable: intentionally break your Prometheus deployment so you can fix it before a real disaster strikes.

Before we dive into code, let’s address the obvious question: Why would I voluntarily break my monitoring?

What happens when your Prometheus server runs out of memory? What if a metric scrape takes 30 seconds because a target is thrashing? What if your alerting rules become corrupt?
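To make the second question concrete, here is a minimal sketch (not part of PCE; the port, metric name, and delay are arbitrary choices for illustration) of a target that answers scrapes only after a long pause, long enough to blow past Prometheus' default 10-second scrape_timeout:

```python
# slow_exporter.py -- hypothetical sketch of a thrashing target, not part of PCE
import time
from flask import Flask, Response

app = Flask(__name__)

@app.route('/metrics')
def metrics():
    time.sleep(30)  # answer after 30 seconds, well past the default 10s scrape_timeout
    return Response("slow_metric 1\n", mimetype='text/plain')

if __name__ == "__main__":
    app.run(port=8001)  # port is arbitrary for this example
```

Point a throwaway scrape job at it and watch what your dashboards and the target's `up` series do while the scrapes time out.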