Hardening the Foundation: Logging, Secrets, and Multi-Site Operations

A DNS query log from stonecircle-host would have told us exactly when resolution started failing last month. We didn’t have one. The log existed on the host, but nothing was collecting it. We found the root cause in 40 minutes instead of 5.
That gap — and a dozen others like it — drove a week of hardening across four areas: centralized logging, secrets management, network enrollment, and multi-site high availability.
Unified logging
Every host in our fleet now ships logs to the same central destination. That sounds obvious, but the reality of a growing fleet is that logging tends to happen in layers — some hosts log to syslog, some to journald, some to files that get rotated and forgotten.
We deployed log collection agents to every host, including our hypervisors, and wired in sources that were previously invisible:
- Audit logs. System-level audit events that track who did what and when. These were being generated but not collected.
- DNS query logs. Every DNS resolution is now visible in the same search interface as application logs.
- Automation logs. When our config management runs, the output goes to central logging, not just the local terminal.
The result is a single place to search across every host, every service, and every layer of the stack. When something goes wrong at 2am, the answer is in one place instead of scattered across twelve hosts.
Secrets consolidation
We audited every credential in the fleet and moved them into a proper secrets vault. API tokens, database passwords, SSH keys, webhook secrets — everything that was previously stored in config files, environment variables, or scattered across hosts.
This also meant rotating credentials that should have been rotated sooner. We found a few that had been in place since initial setup and never changed. That is the kind of debt that accumulates quietly until it becomes a security incident.
Every credential now has:
- A defined rotation schedule
- An owner responsible for renewal
- Automated scanning that flags any secret appearing somewhere it should not, such as a git commit or a config file
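The first two items amount to metadata attached to every credential. A minimal sketch of that record (the `Credential` class and example values are hypothetical, not our vault's schema) shows how a rotation check falls out of it:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Credential:
    name: str
    owner: str            # person responsible for renewal
    last_rotated: date
    rotation_days: int    # defined rotation schedule

    def rotation_due(self, today: date) -> bool:
        """True once the credential has outlived its rotation window."""
        return today >= self.last_rotated + timedelta(days=self.rotation_days)

db_password = Credential("db/primary", "ops@example.com", date(2024, 1, 10), 90)
```

A nightly job over these records is enough to surface the "set at initial setup and never changed" debt before it ages into an incident.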
We also added secret scanning to our CI pipeline. Any commit that contains something that looks like a credential gets blocked before it can be pushed.
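The core of such a check is pattern matching over the diff. A stripped-down sketch (the patterns below are illustrative examples; production scanners ship hundreds of curated rules):

```python
import re

# A few high-signal patterns; a real scanner carries many more.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key ID shape
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)(?:api[_-]?key|token|password)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
]

def find_secrets(text: str) -> list[str]:
    """Return every line that matches a secret pattern."""
    return [
        line for line in text.splitlines()
        if any(p.search(line) for p in SECRET_PATTERNS)
    ]

# A CI hook fails the build whenever find_secrets() returns anything.
```

The trade-off with pattern-based scanning is false positives, which is why the high-entropy, well-known key formats come first: they almost never match by accident.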
Mesh network enrollment
Several hosts were missing from our encrypted mesh network. They were reachable via the local network, but not through the overlay network that provides encrypted, authenticated connections between sites.
All hosts are now enrolled with proper access control tags. Each host has a defined role and can reach only the services that role requires. The network boundary matches the trust boundary.
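Conceptually, the tag policy reduces to a role-to-services map evaluated at connection time. A minimal sketch (the roles, service names, and `may_reach` helper are hypothetical, not our actual ACL syntax):

```python
# Hypothetical role -> allowed-services mapping derived from enrollment tags.
ALLOWED = {
    "web": {"dns", "app-api"},
    "db": {"dns", "backup"},
    "hypervisor": {"dns", "logging", "vm-migration"},
}

def may_reach(role: str, service: str) -> bool:
    """A host may reach only the services its role requires."""
    return service in ALLOWED.get(role, set())
```

Default-deny falls out naturally: an unknown role, or a service not in the role's set, is simply unreachable.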
Multi-site high availability
The biggest change this week: our management agent now runs in an active-standby configuration across two geographically separate sites. If the primary goes down (hardware failure, network outage, power loss), the secondary takes over automatically. When the primary comes back, it reclaims leadership.
This includes:
- Alert deduplication across sites, so the same problem does not generate duplicate notifications during failover.
- Configuration sync via automation, so both sites run identical configurations at all times.
- Health-based failover that promotes the secondary only when the primary is genuinely unreachable, not just slow.
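The "unreachable, not just slow" distinction comes down to requiring consecutive hard failures before promoting. A sketch of that state machine (the `FailoverMonitor` class and threshold are illustrative, not our agent's implementation):

```python
FAIL_THRESHOLD = 3  # consecutive hard failures before promotion

class FailoverMonitor:
    """Promote the standby only after repeated hard failures,
    so a briefly slow primary does not cause a flap."""

    def __init__(self) -> None:
        self.failures = 0
        self.active = "primary"

    def observe(self, primary_reachable: bool) -> None:
        if primary_reachable:
            self.failures = 0  # slow-but-alive resets the counter
            if self.active == "secondary":
                self.active = "primary"  # primary reclaims leadership
        else:
            self.failures += 1
            if self.failures >= FAIL_THRESHOLD:
                self.active = "secondary"
```

A slow health check that still succeeds resets the counter, so only a genuinely unreachable primary crosses the threshold; and because a successful check hands leadership back, failback needs no manual step.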
For managed infrastructure customers, this means the monitoring and remediation system itself is no longer a single point of failure. The system that watches your infrastructure is now watched by another copy of itself.
Why hardening weeks matter
It is tempting to skip this kind of work in favor of features. Features are visible. Hardening is invisible — until the day it saves you.
A centralized log pipeline means a 2am incident takes minutes to diagnose instead of hours. A secrets vault means a leaked credential is a rotation, not a breach. Multi-site HA means a hardware failure is a log entry, not a customer outage.
We will be back to shipping features next week. But this week made everything we build on top of the stack meaningfully more reliable.