My Homelab Fixes Itself — Here's the Ansible Setup

By OpsKern · 5 min read

Editor’s note (March 2026): Since this post was published, the “200-line Python bridge” has grown into a full operations agent — 94 alert rules, 41 automated remediations, vulnerability scanning, config drift detection, three-tier approval gates, and Slack integration. The repo has been renamed to ops-kernel-stack. The architecture described below still forms the core of the system; everything else is layers on top.

At 2am last Tuesday, one of my Docker containers crashed. I know because ntfy pinged my phone. I also know because ntfy pinged my phone again 47 seconds later to tell me it had already fixed itself.

I did not wake up. I did not SSH in. I did not do anything.

This is the thing I’m most proud of in my homelab, and I want to show you exactly how it works.


The problem with self-hosted infrastructure

When you run your own services, things break. Containers crash. Disks fill up. Services fail after updates. This is the maintenance tax of self-hosting: you traded a monthly SaaS fee for an on-call rotation you didn’t sign up for.

The standard homelab answer is Uptime Kuma — a dashboard that tells you something is down. The problem with Uptime Kuma is it still requires you to fix the thing. You’re the pager, and you’re also the on-call engineer.

What I wanted was something that closed the loop: detect the problem, run the fix, tell me what happened. No AI, no complexity, no cloud dependency. Just deterministic automation that runs while I sleep.


What I built

[Figure: Four-piece self-healing stack]

The stack has four pieces:

1. Prometheus + node_exporter — metrics from every host. CPU, memory, disk, systemd service state, Docker container health. One playbook deploys node_exporter to the entire fleet.

2. Alertmanager — Prometheus evaluates the alert rules, and when one fires, Alertmanager routes the alert to a webhook receiver. I have rules for disk usage above 85%, a failed systemd service, a container not running, and a TLS cert expiring soon.

3. The remediation bridge — this is the interesting part. A small FastAPI service running on my Ansible control node. When Alertmanager sends a webhook, the bridge looks up the alert name in a YAML map, picks the right Ansible playbook, runs it against the affected host, and sends a notification with the result.

4. Ansible remediation playbooks — one-shot playbooks for each failure mode: restart a container, restart a service, clean disk space, reload Caddy. These are the actual fixes.
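The bridge's lookup-run-notify flow can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual bridge code: the function name `handle_alert`, the injected `run_playbook` and `notify` callables, and the assumption that the `instance` label holds the inventory hostname are all hypothetical. The `status` and `labels` fields do follow the shape of an Alertmanager webhook payload.

```python
from typing import Callable

# Hypothetical in-memory copy of the alert-to-playbook map.
MAPPINGS = {
    "ContainerDown": {"playbook": "remediation/restart-container.yml"},
}


def handle_alert(
    alert: dict,
    mappings: dict,
    run_playbook: Callable[[str, str], bool],
    notify: Callable[[str], None],
) -> str:
    """Process one entry from the "alerts" list of an Alertmanager webhook."""
    labels = alert.get("labels", {})
    name = labels.get("alertname", "")
    # Assumption: the "instance" label matches the Ansible inventory hostname.
    host = labels.get("instance", "")

    if alert.get("status") != "firing":
        return "ignored"  # resolved alerts need no fix

    entry = mappings.get(name)
    if entry is None:
        notify(f"No remediation mapped for {name} on {host}")
        return "unmapped"

    if run_playbook(entry["playbook"], host):
        notify(f"Auto-remediated: {name} on {host}")
        return "remediated"

    notify(f"Remediation FAILED: {name} on {host}")
    return "failed"
```

Passing the runner and the notifier in as arguments keeps the decision logic testable without Ansible or ntfy installed. In a real service they would wrap `ansible-playbook` and an HTTP POST to the ntfy server.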

Here’s the loop:

Container crashes
      |
      v
Prometheus detects (container health check fails)
      |
      v
ContainerDown alert fires; Alertmanager routes it to the webhook
      |
      v
Remediation bridge receives webhook at :9999/hook
      |
      v
Looks up "ContainerDown" in remediation-map.yml
      -> playbook: remediation/restart-container.yml
      -> cooldown: 10 minutes
      |
      v
ansible-playbook remediation/restart-container.yml --limit docker-host -e container_name=...
      |
      v
ntfy: "Auto-remediated: ContainerDown (container restarted on docker-host)"

Total time from crash to fix: under 60 seconds.
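For illustration, here is one way the `ansible-playbook` invocation from the loop above could be assembled. The helper name `build_command` and the container name `web` are hypothetical; the flags (`--limit`, `-e`) are the ones shown in the diagram.

```python
def build_command(playbook: str, host: str, extra_vars: dict[str, str]) -> list[str]:
    """Build the ansible-playbook argument list for one remediation run."""
    argv = ["ansible-playbook", playbook, "--limit", host]
    # Sort the variables so the same alert always yields the same command.
    for key, value in sorted(extra_vars.items()):
        argv += ["-e", f"{key}={value}"]
    return argv


command = build_command(
    "remediation/restart-container.yml",
    "docker-host",
    {"container_name": "web"},  # hypothetical container name
)
print(" ".join(command))
```

Building a list instead of a shell string means the result can go straight to `subprocess.run(command)` with no shell involved, so label values from an alert are never interpreted by a shell. Values containing spaces would still need care, since Ansible splits `key=value` extra vars on whitespace.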


The remediation map

The bridge uses a YAML config file to map alert names to playbooks. Adding a new remediation is just adding a few lines — no code changes, no restart required:

mappings:

  ContainerDown:
    playbook: remediation/restart-container.yml
    cooldown_minutes: 10

  DiskSpaceHigh:
    playbook: remediation/cleanup-disk.yml
    cooldown_minutes: 60

  SystemdServiceFailed:
    playbook: remediation/service-restart.yml
    cooldown_minutes: 15

  TLSCertExpiringSoon:
    playbook: remediation/caddy-reload.yml
    cooldown_minutes: 360

The cooldown prevents remediation loops. If a container keeps crashing every 2 minutes (OOM kill, bad config), the bridge runs the fix once and then backs off for the cooldown window. You get a failure notification for the second crash — that one needs a human.
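The cooldown check itself is small. The sketch below is an assumption about how such a tracker might look, not the bridge's actual code: the class name `CooldownTracker` and the choice to key cooldowns per alert and host are both hypothetical.

```python
import time


class CooldownTracker:
    """Remember when each (alert, host) pair was last remediated."""

    def __init__(self):
        self._last_run = {}  # (alert name, host) -> timestamp in seconds

    def allow(self, alert, host, cooldown_minutes, now=None):
        """Return True and record the run if the cooldown has elapsed."""
        now = time.time() if now is None else now
        key = (alert, host)
        last = self._last_run.get(key)
        if last is not None and now - last < cooldown_minutes * 60:
            return False  # still cooling down: notify a human instead
        self._last_run[key] = now
        return True


tracker = CooldownTracker()
print(tracker.allow("ContainerDown", "docker-host", 10, now=0))    # first crash
print(tracker.allow("ContainerDown", "docker-host", 10, now=120))  # 2 minutes later
```

With a 10-minute cooldown, the first call is allowed and the second, two minutes later, is refused. That refusal is the point where the bridge would send the failure notification.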


Why Ansible and not an AI agent?

I tried an AI-based approach first. The idea was appealing: describe the problem in natural language, let the agent figure out the fix. In practice, it was slower, less reliable, and required internet access. For a homelab that’s supposed to run independently of external services, that’s a problem.

Ansible playbooks are deterministic. They do exactly what they say, every time, in the same order. I can --check them before deploying, I can read them in five minutes, I can run them manually when I want. There’s no reasoning step that might produce a different answer on a Tuesday.

The bridge runs entirely on-premises. No tokens, no API calls, no LLM in the loop. It reads a YAML file, runs a command, sends a notification. The entire codebase is about 200 lines of Python.


The backup side

[Figure: Backup status across 6 hosts]

While I was at it, I also automated the part of homelab ownership nobody talks about: backups you can actually trust.

Restic on every host. SFTP target on my NAS. Systemd timers staggered across the fleet so they don’t all hammer the NAS simultaneously. Prune policies so the repo doesn’t grow forever. Repository integrity checks on a schedule.
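One simple way to stagger the timers is to spread the hosts evenly across a backup window. This is a sketch under assumptions: the function name `stagger_offsets`, the hostnames, and the 60-minute window are hypothetical, and the post does not say how its offsets are actually computed.

```python
def stagger_offsets(hosts, window_minutes):
    """Map each host to a start offset (in minutes) inside the window."""
    ordered = sorted(hosts)  # sort so every run assigns the same offsets
    if not ordered:
        return {}
    step = window_minutes / len(ordered)
    return {host: int(i * step) for i, host in enumerate(ordered)}


# Hypothetical hostnames, spread across a one-hour window.
print(stagger_offsets(["nas-client-c", "nas-client-a", "nas-client-b"], 60))
```

Each offset could then be templated into that host's timer unit. systemd's `RandomizedDelaySec=` timer option is an alternative when exact, repeatable start times do not matter.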

One playbook deploys this to every host in the backup_servers group. Exit code 3 (some files skipped — normal for non-root) is treated as success. The logs go to journald.
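The exit-code rule can be expressed as a tiny check. This is an illustrative sketch with hypothetical names (`RESTIC_OK`, `backup_succeeded`), assuming the wrapper around `restic backup` only needs a pass/fail answer.

```python
# Exit codes treated as success: 0 (clean run) and 3 (snapshot created,
# but some source files could not be read).
RESTIC_OK = {0, 3}


def backup_succeeded(returncode: int) -> bool:
    """Decide whether a restic backup run counts as a success."""
    return returncode in RESTIC_OK


for code in (0, 1, 3):
    print(code, backup_succeeded(code))
```

Accepting code 3 is a trade-off: it hides unreadable files from the success signal, so it is worth checking the journald logs now and then to confirm the skipped files are the ones you expect.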

I’ve tested restores. That’s the only part that matters.


The full stack

Everything in this setup is driven by Ansible playbooks:

What                            Playbook
Restic backups                  deploy-restic-fleet.yml
Prometheus node_exporter        deploy-node-exporter.yml
Loki log aggregation            deploy-loki.yml
Promtail log shipper (fleet)    deploy-promtail-fleet.yml
Remediation bridge              deploy-remediation-bridge.yml

No hardcoded IPs. No Wyrdix-specific anything. The whole thing is parameterized — swap in your hostnames, your NAS IP, your ntfy server, and it deploys.

All playbooks are --check safe. All tasks are tagged. You can run just the configure tag to push a config change without touching the install steps.


Get it

The full collection is on GitHub: ops-kernel-stack

If you want the companion guide — the why behind every design decision, plus chapters on Proxmox provisioning, BIND9 DNS, Caddy + TLS, and a full walkthrough of building the remediation bridge from scratch — I’m working on that now.

Get the free getting started guide — a walkthrough that takes you from zero to a working Ansible homelab. Subscribe for a note when the full book is ready.


What I’d do differently

The bridge tracks cooldowns in memory, so if the service restarts, every cooldown resets. For my use case this is fine: the bridge itself crashes rarely enough that an occasional double remediation isn’t a problem. For a higher-stakes setup, persist the cooldowns to a file or SQLite.
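A SQLite-backed version of the cooldown check might look like the sketch below. Everything here is hypothetical (the class name `PersistentCooldowns`, the table layout, the `alert:host` key format); it only shows that the standard-library `sqlite3` module is enough to make cooldowns survive a restart.

```python
import sqlite3
import time


class PersistentCooldowns:
    """Cooldown tracker that stores last-run timestamps in SQLite."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cooldowns ("
            "key TEXT PRIMARY KEY, last_run REAL NOT NULL)"
        )
        self._db.commit()

    def allow(self, key, cooldown_minutes, now=None):
        """Return True and record the run if the cooldown has elapsed."""
        now = time.time() if now is None else now
        row = self._db.execute(
            "SELECT last_run FROM cooldowns WHERE key = ?", (key,)
        ).fetchone()
        if row is not None and now - row[0] < cooldown_minutes * 60:
            return False
        self._db.execute(
            "INSERT OR REPLACE INTO cooldowns (key, last_run) VALUES (?, ?)",
            (key, now),
        )
        self._db.commit()
        return True

    def close(self):
        self._db.close()
```

Opening the same database file after a restart picks up the stored timestamps, so a crash loop in the bridge itself no longer triggers a second remediation inside the cooldown window.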

The alerting rules are simple threshold checks. Smarter anomaly detection (rate of change, multi-signal correlation) would reduce false positives, but it is not worth the effort for a homelab.
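To make "rate of change" concrete, here is a small sketch that estimates how long until a disk fills, based on two usage samples. The function name `hours_until_full` and the sample numbers are hypothetical; Prometheus's `predict_linear()` function covers a similar idea inside an alert rule.

```python
def hours_until_full(used_then, used_now, hours_between):
    """Extrapolate linearly from two disk-usage percentages.

    Returns the estimated hours until 100% usage, or None if usage
    is flat or shrinking.
    """
    rate = (used_now - used_then) / hours_between  # percentage points per hour
    if rate <= 0:
        return None
    return (100.0 - used_now) / rate


# Usage grew from 80% to 85% over 5 hours: 1 point per hour, 15 points left.
print(hours_until_full(80, 85, 5))
```

A rule built on this would fire for a disk at 60% that is filling fast, and stay quiet for one that has sat at 86% for months, which a plain 85% threshold gets backwards.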


Questions? Open an issue on GitHub or email hello@opskern.io.

Everything in this post is included in OpsKern managed hosting. Starting at $75/server/month.
