Smarter Self-Healing: Teaching Our System to Measure Blast Radius

The remediation pipeline restarted Caddy on a Tuesday afternoon. Caddy is the reverse proxy. It sits in front of Grafana, Gitea, the wiki, and the client portal. All four went down for six seconds while Caddy came back up.
The fix was correct — Caddy had a stale config. But the blast radius was wrong. Restarting a standalone monitoring agent is fine. Restarting the service that everything else routes through should require a second opinion.
That incident is why we spent this week teaching the system to measure consequences before it acts.
Blast radius analysis
Every remediation action now goes through a blast radius gate before execution. The system maps the dependency graph — which services depend on the thing being fixed, which hosts are affected, and how many customers would notice if the fix went sideways.
If the blast radius exceeds a threshold, the system escalates to a human instead of acting on its own. A restart of a standalone monitoring agent? Go ahead. A restart of a database that three services depend on? That gets a notification and waits for approval.
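The core of that gate is a walk over the dependency graph. Here is a minimal sketch, assuming a hand-maintained map of who depends on whom (the service names and the `DEPENDENTS` structure are illustrative, not our real topology):

```python
from collections import deque

# Hypothetical dependency map: service -> services that depend on it.
DEPENDENTS = {
    "caddy": ["grafana", "gitea", "wiki", "client-portal"],
    "postgres": ["gitea", "wiki"],
    "node-exporter": [],
}

def blast_radius(service: str) -> set[str]:
    """Collect every service that could notice if `service` goes
    down, directly or transitively."""
    affected, queue = set(), deque([service])
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

def requires_approval(service: str, threshold: int = 2) -> bool:
    """Escalate to a human when too many services sit downstream."""
    return len(blast_radius(service)) > threshold

# requires_approval("node-exporter") -> False: standalone agent, safe.
# requires_approval("caddy") -> True: four services route through it.
```

This shows only the simplest threshold gate; the actual decision layers a tiered risk score on top, as described next.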
This is not just a binary “safe or unsafe” check. The system calculates a tiered risk score based on:
- Service dependencies. How many other services talk to this one?
- Customer impact. Is this a shared resource or dedicated?
- Time of day. A restart during business hours has a different risk profile than one at 3am.
- Recent history. If this service was already restarted twice today, the third time warrants a closer look.
Confidence calibration
The system also learned to be honest about how confident it is in its own fixes.
Every remediation outcome — success or failure — feeds back into a confidence score. If a particular playbook has been succeeding 95% of the time, the system knows it can trust that approach. If a newer playbook has only been tried three times, the system treats it with appropriate skepticism.
When confidence is low, the system defaults to a more conservative action: notify the operator, provide the diagnosis, and suggest the fix rather than executing it automatically.
This creates a natural learning curve. New remediation patterns start supervised, earn trust through repeated success, and eventually graduate to fully autonomous execution. Patterns that start failing get automatically demoted back to supervised mode.
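One way to implement that graduation curve is to smooth the raw success rate so small samples never look fully trusted. A sketch, assuming a per-playbook tracker and an arbitrary 0.9 autonomy threshold:

```python
class PlaybookConfidence:
    """Track a playbook's success rate with Laplace smoothing so a
    playbook tried only a few times never looks fully trusted."""

    def __init__(self, autonomy_threshold: float = 0.9):
        self.successes = 0
        self.attempts = 0
        self.autonomy_threshold = autonomy_threshold

    def record(self, succeeded: bool) -> None:
        self.attempts += 1
        self.successes += int(succeeded)

    @property
    def confidence(self) -> float:
        # Laplace smoothing: (s + 1) / (n + 2) pulls small samples
        # toward 0.5, encoding skepticism about untested playbooks.
        return (self.successes + 1) / (self.attempts + 2)

    @property
    def mode(self) -> str:
        # Low confidence -> supervised: diagnose and suggest, don't act.
        return ("autonomous" if self.confidence >= self.autonomy_threshold
                else "supervised")
```

Three straight successes still yields a confidence of only 0.8, keeping the playbook supervised; a long success streak crosses the threshold, and a run of failures pulls it back below, which is the automatic demotion described above.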
Seven new remediation playbooks
We also expanded the library of things the system knows how to fix:
- Service restart with rollback. If a service does not come back healthy after restart, revert to the previous configuration.
- Disk cleanup with safety checks. Clear log files and temp directories, but only after verifying the logs have been shipped to central storage.
- Certificate renewal. Detect expiring TLS certificates and trigger renewal before they cause outages.
- Configuration drift correction. When a host’s configuration drifts from the desired state, automatically reapply the correct configuration.
- Backup retry with escalation. If a backup fails, retry with exponential backoff. If it keeps failing, escalate.
- DNS resolution verification. When DNS-related alerts fire, verify resolution from multiple vantage points before declaring the issue resolved.
- Resource threshold response. When CPU, memory, or disk approaches critical thresholds, take proactive action rather than waiting for the alert to fire.
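To make one of these concrete, the backup retry pattern can be sketched as exponential backoff with an escalation hook. The function name and parameters are illustrative; `escalate` and `sleep` are injected so the behavior is testable:

```python
import time

def retry_with_escalation(action, attempts=4, base_delay=1.0,
                          escalate=print, sleep=time.sleep):
    """Run `action`; on failure, retry with exponential backoff
    (1s, 2s, 4s, ...). If every attempt fails, call `escalate`
    and re-raise instead of looping forever."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts - 1:
                escalate(f"still failing after {attempts} attempts: {exc}")
                raise
            sleep(base_delay * 2 ** attempt)
```

Injecting the side effects keeps the escalation path exercisable in tests without actually sleeping or paging anyone.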
Each playbook goes through the same blast radius and confidence checks before execution. Nothing runs unchecked.
What this means for managed infrastructure
If you are running infrastructure that needs to stay up, you should not have to wake up at 3am for problems with known fixes. But you also should not trust a system that blindly applies fixes without understanding the consequences.
The goal is a system that is both autonomous and careful — one that fixes what it can, asks for help when it should, and gets better at telling the difference over time.
That is what we shipped this week. Next up: making the whole system observable enough that you can watch it learn in real time.


