Services

You run the servers. We keep them healthy.

Everything is configured with Ansible and version-controlled in Git. If a host dies, we rebuild it from the repo.

What We Manage

>_

Monitoring

Metrics are scraped every 15 seconds across your entire fleet. Dashboards show CPU, memory, disk, network, and container health, with 90 days of retention. You get a read-only link to the same dashboards we use.
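For a sense of scale, the figures above can be turned into a quick back-of-the-envelope calculation. This is plain arithmetic on the stated 15-second interval and 90-day retention, not a description of the storage backend; the function names and the fleet size in the usage line are made up for illustration.

```python
SCRAPE_INTERVAL_S = 15   # scrape interval stated above
RETENTION_DAYS = 90      # retention stated above


def samples_per_series(interval_s: int = SCRAPE_INTERVAL_S,
                       retention_days: int = RETENTION_DAYS) -> int:
    """Data points a single metric series holds over the retention window."""
    seconds_retained = retention_days * 24 * 60 * 60
    return seconds_retained // interval_s


def fleet_samples(hosts: int, series_per_host: int) -> int:
    """Total data points for a fleet (both arguments are hypothetical inputs)."""
    return hosts * series_per_host * samples_per_series()


print(samples_per_series())    # 518400
print(fleet_samples(10, 500))  # 2592000000
```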

!!

Alerting

Dozens of alert rules tuned to the things that actually break: host unreachable, disk pressure, service failures, TLS cert expiry, container health, backup staleness. Notifications hit your phone within seconds.
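As a rough illustration of what a few of these rules check, here is a minimal sketch in Python. It is not the production rule set: the `HostSnapshot` and `AlertRule` types, the rule names, and the thresholds (90% disk usage, 14 days before certificate expiry) are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List


@dataclass
class HostSnapshot:
    """Hypothetical point-in-time view of one host."""
    reachable: bool
    disk_used_pct: float
    cert_not_after: datetime  # expiry time of the host's TLS certificate


@dataclass
class AlertRule:
    name: str
    fires: Callable[[HostSnapshot, datetime], bool]


# Thresholds below are illustrative assumptions, not the real tuning.
RULES: List[AlertRule] = [
    AlertRule("host_unreachable", lambda h, now: not h.reachable),
    AlertRule("disk_pressure", lambda h, now: h.disk_used_pct >= 90.0),
    AlertRule("tls_cert_expiry",
              lambda h, now: h.cert_not_after - now <= timedelta(days=14)),
]


def firing_alerts(host: HostSnapshot, now: datetime) -> List[str]:
    """Return the names of all rules that fire for this host, in rule order."""
    return [rule.name for rule in RULES if rule.fires(host, now)]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
host = HostSnapshot(reachable=True, disk_used_pct=95.0,
                    cert_not_after=now + timedelta(days=7))
print(firing_alerts(host, now))  # ['disk_pressure', 'tls_cert_expiry']
```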

~>

Self-Healing

When an alert fires, the system assesses risk and selects the right response. Disk full? Cleanup runs. Container crashed? It restarts. Vulnerability detected? Patch pipeline queues. Every remediation is logged, and every outcome is reported to you.

[]

Backups

Daily snapshots, verified every morning at 07:00 UTC. If a snapshot is stale or missing, you hear about it before it matters. Managed plans include 90-day retention.
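The morning verification can be pictured as a simple freshness check. The sketch below assumes a snapshot counts as stale once it is more than 26 hours old (24 hours plus a grace period); that threshold and the `snapshot_status` helper are illustrative, not the actual verification job.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: "daily" plus a 2-hour grace period for slow backup jobs.
MAX_AGE = timedelta(hours=26)


def snapshot_status(latest: Optional[datetime], now: datetime) -> str:
    """Classify the most recent snapshot as 'missing', 'stale', or 'ok'."""
    if latest is None:
        return "missing"
    if now - latest > MAX_AGE:
        return "stale"
    return "ok"


# The check runs at 07:00 UTC, as described above.
check_time = datetime(2024, 5, 1, 7, 0, tzinfo=timezone.utc)
print(snapshot_status(check_time - timedelta(hours=5), check_time))   # ok
print(snapshot_status(check_time - timedelta(hours=30), check_time))  # stale
print(snapshot_status(None, check_time))                              # missing
```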

{}

Provisioning

One Ansible run and your infrastructure is configured. Need another server? Add it to the inventory and apply. DNS, firewall rules, and networking are all handled in the same run.

##

Security

Key-only SSH, fail2ban, automatic security patches, and Tailscale VPN for all management traffic. No management ports on the public internet.

<<

Log Aggregation

Every host ships logs to a central aggregator. Search them from your dashboard. 90-day retention. When something breaks at 3am, the logs are already there in the morning.

::

Client Dashboard

Private dashboard for every client. Magic-link login, no passwords. It shows 30-day uptime, infrastructure status, alert history, and billing, and it refreshes automatically.

The Difference

Autonomous remediation. Not just alerting.

Most monitoring stacks stop at the notification. Ours closes the loop. Every alert is assessed by risk and matched to the right response. Low-risk fixes are applied without human intervention, while escalation gates ensure high-risk changes still require approval.

1. Alert fires: Anomaly detected automatically

2. Risk assessed: Severity and response determined

3. Fix applied: Automated remediation runs

4. You get notified: Under 60 seconds, no human required
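The four steps above, together with the escalation gate for high-risk changes, can be sketched as a small decision function. This is an illustration only: the alert names, the risk levels assigned to each response, the `kernel_update_pending` entry, and the `handle_alert` helper are all hypothetical, and the sketch decides what would happen without executing any fix.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Playbook:
    action: str
    risk: Risk


@dataclass(frozen=True)
class Outcome:
    alert: str
    action: Optional[str]
    status: str  # "applied", "awaiting_approval", or "escalated_to_human"


# Alert-to-response mapping. Risk levels are assumptions for the example;
# "kernel_update_pending" is a made-up high-risk case.
PLAYBOOKS = {
    "disk_full": Playbook("run_cleanup", Risk.LOW),
    "container_crashed": Playbook("restart_container", Risk.LOW),
    "vulnerability_detected": Playbook("queue_patch_pipeline", Risk.LOW),
    "kernel_update_pending": Playbook("reboot_host", Risk.HIGH),
}


def handle_alert(alert: str, approved: bool = False) -> Outcome:
    """Steps 2-3: assess risk, then apply the fix or hold it at the gate."""
    playbook = PLAYBOOKS.get(alert)
    if playbook is None:
        # No known response: hand the alert to a person.
        return Outcome(alert, None, "escalated_to_human")
    if playbook.risk is Risk.HIGH and not approved:
        # Escalation gate: high-risk changes wait for explicit approval.
        return Outcome(alert, playbook.action, "awaiting_approval")
    return Outcome(alert, playbook.action, "applied")


for name in ("disk_full", "kernel_update_pending", "unknown_alert"):
    result = handle_alert(name)
    # Step 4: every outcome is reported, whatever the status.
    print(f"{result.alert}: {result.status}")
```

Keeping the risk level on the playbook rather than on the alert means the gate applies to what the system is about to do, so the same alert could be paired with a safer response later without touching the gating logic.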