The Case for a 6MB Monitoring Agent (And Why We Built One in Go)

By OpsKern · 4 min read

We deployed monitoring to a new LXC container last month — 512MB of RAM, running a single service. After installing node_exporter, Filebeat, and the log forwarder, the monitoring stack was using 240MB. The actual service was using 80MB. The tools watching the server were consuming three times the resources of the thing they were watching.

On our Raspberry Pi node, it was worse. The monitoring overhead ate half the available memory before collecting a single metric. We needed something lighter. So we built it.


What a monitoring agent actually needs to do


Before writing any code, we listed the jobs a monitoring agent has to do on a managed host:

  1. Heartbeat. Tell the control plane it is alive. Every 30 seconds is fine. Every 5 seconds is paranoid but acceptable.
  2. System metrics. CPU, memory, disk, network. The basics. Collected every minute, not every second — you don’t need second-level granularity for capacity planning.
  3. Log forwarding. Send syslog and container logs to a central location. Not all of them — just the ones that matter for alerting and debugging.
  4. Command execution. Accept instructions from the control plane. “Restart this container.” “Run this health check.” “Apply this update.” The remediation pipeline needs hands on every host.

That is the whole list. Four jobs. None of them require a JVM. None of them require 200MB of RAM. None of them require a configuration file longer than 20 lines.


Why Go

Go compiles to a single static binary. No runtime dependencies. No package managers. No “install Python 3.11 and also pip and also these six libraries.” You copy one file to the host and run it.

The compiled binary for our agent is 5.7MB. It runs with a baseline memory footprint of about 8MB. On a host with 512MB of RAM, that is 1.6% overhead — compared to 40% or more for a traditional monitoring stack.

Deployment is a single command: copy the binary, create a systemd unit, start the service. On a fresh host, the agent is collecting metrics within 10 seconds of the first SSH connection.
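Go binaries in that size range are typically produced with a plain `go build`, `CGO_ENABLED=0`, and `-ldflags="-s -w"` to strip debug symbols. On the host side, the whole install can be as small as a unit file along these lines (paths, names, and limits here are illustrative, not our exact unit):

```ini
# /etc/systemd/system/monitoring-agent.service — illustrative example
[Unit]
Description=Lightweight monitoring agent
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/agent
Restart=always
RestartSec=5
# Optional: cap what the agent itself may use, to keep the footprint honest
MemoryMax=32M

[Install]
WantedBy=multi-user.target
```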

Go also handles concurrency well, which matters when you are running a heartbeat loop, a metrics collector, a log forwarder, and a command listener all in the same process. Goroutines handle this naturally without the complexity of threading or async frameworks.
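As a rough sketch of that shape (the task names and intervals are placeholders, not our production code), the whole agent can be one process with one goroutine per responsibility:

```go
// Illustrative sketch only: four responsibilities, one process, one goroutine each.
package main

import (
	"context"
	"log"
	"time"
)

func main() {
	ctx := context.Background()

	go loop(ctx, 30*time.Second, heartbeat)      // tell the control plane we're alive
	go loop(ctx, 60*time.Second, collectMetrics) // CPU, memory, disk, network
	go loop(ctx, 10*time.Second, forwardLogs)    // ship buffered log lines
	loop(ctx, 15*time.Second, pollCommands)      // ask the control plane for work
}

// loop runs fn on a fixed interval until the context is cancelled.
func loop(ctx context.Context, every time.Duration, fn func(context.Context) error) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := fn(ctx); err != nil {
				log.Printf("task failed: %v", err) // log and keep going; never crash the agent
			}
		}
	}
}

// Placeholder implementations for the sketch.
func heartbeat(ctx context.Context) error      { return nil }
func collectMetrics(ctx context.Context) error { return nil }
func forwardLogs(ctx context.Context) error    { return nil }
func pollCommands(ctx context.Context) error   { return nil }
```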


Design decisions that keep it light

Several deliberate choices keep the agent small:

Pull, don’t push configuration. The agent checks in with the control plane and asks “what should I be doing?” rather than storing a complex configuration file locally. If the monitoring requirements change, the control plane sends new instructions on the next check-in. No need to SSH in and edit config files across the fleet.
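In code, a check-in can be a single stdlib HTTP request. This is a minimal sketch; the endpoint path and field names are hypothetical, not our actual API:

```go
// Pull, don't push: ask the control plane what this host should be doing.
package agent

import (
	"context"
	"encoding/json"
	"net/http"
)

// Instructions is a hypothetical shape for what the control plane returns.
type Instructions struct {
	MetricsIntervalSec int      `json:"metrics_interval_seconds"`
	LogPaths           []string `json:"log_paths"`
}

func checkIn(ctx context.Context, baseURL, hostID string) (*Instructions, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		baseURL+"/v1/hosts/"+hostID+"/instructions", nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // unreachable: keep running with the last instructions received
	}
	defer resp.Body.Close()

	var ins Instructions
	if err := json.NewDecoder(resp.Body).Decode(&ins); err != nil {
		return nil, err
	}
	return &ins, nil
}
```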

Batch metrics, don’t stream them. Collecting metrics every 60 seconds and sending them in a batch every 5 minutes uses dramatically less bandwidth and CPU than streaming every metric in real time. For infrastructure monitoring — where you care about trends, not millisecond resolution — batching is the right trade-off.
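A batching loop needs nothing beyond the standard library. The sample shape and ingest endpoint below are illustrative, assuming a JSON API on the control plane:

```go
// Batching sketch: sample every minute, ship one JSON payload every five minutes.
package agent

import (
	"bytes"
	"context"
	"encoding/json"
	"net/http"
	"time"
)

type Sample struct {
	Time    time.Time `json:"time"`
	CPUPct  float64   `json:"cpu_pct"`
	MemUsed uint64    `json:"mem_used_bytes"`
}

func runMetrics(ctx context.Context, ingestURL string, collect func() Sample) {
	var batch []Sample
	sample := time.NewTicker(60 * time.Second)
	flush := time.NewTicker(5 * time.Minute)
	defer sample.Stop()
	defer flush.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-sample.C:
			batch = append(batch, collect())
		case <-flush.C:
			if len(batch) == 0 {
				continue
			}
			body, _ := json.Marshal(batch)
			resp, err := http.Post(ingestURL, "application/json", bytes.NewReader(body))
			if err != nil {
				continue // keep the batch and try again on the next flush
			}
			resp.Body.Close()
			batch = batch[:0] // sent: reset the buffer
		}
	}
}
```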

Minimal dependencies. The agent uses Go’s standard library for HTTP, JSON, and system calls. No external frameworks. No ORM. No logging library with 47 configuration options. Fewer dependencies means fewer things to break and fewer things to update.

Graceful degradation. If the agent cannot reach the control plane, it buffers metrics locally and retries. It does not crash. It does not fill the disk with debug logs. It waits quietly and reconnects when it can.
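The buffering itself can be a small bounded queue with capped backoff. This is a sketch under assumed names; the real agent's internals may differ:

```go
// Graceful degradation: bounded local buffer plus retry with capped backoff.
package agent

import "time"

const maxBuffered = 1000 // cap memory use; drop the oldest entries rather than grow forever

type buffer struct {
	items [][]byte
}

func (b *buffer) add(item []byte) {
	if len(b.items) >= maxBuffered {
		b.items = b.items[1:] // oldest data is the least useful when we're offline
	}
	b.items = append(b.items, item)
}

// drain keeps trying to send buffered items, backing off between failures
// instead of hammering an unreachable control plane or filling the disk with logs.
func (b *buffer) drain(send func([]byte) error) {
	backoff := time.Second
	for len(b.items) > 0 {
		if err := send(b.items[0]); err != nil {
			time.Sleep(backoff)
			if backoff < time.Minute {
				backoff *= 2
			}
			continue
		}
		b.items = b.items[1:]
		backoff = time.Second
	}
}
```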


What you can learn from this approach

Even if you are not building your own agent, the principles apply to any monitoring setup:

1. Measure the monitor. How much RAM and CPU is your monitoring stack using? Measure it and find out. If your exporters and collectors are using more resources than your actual services, something is wrong.

2. Collect less, not more. Most homelabs collect every metric their tools offer and look at three of them. Configure your exporters to collect only what you actually alert on or graph. Everything else is storage cost with no return.

3. Prefer pull over push. Prometheus already does this — it scrapes targets rather than having targets push to it. The same principle applies to agent configuration. Central configuration that gets pulled is easier to manage than distributed configuration that gets pushed.

4. Single binaries are worth the trade-off. If you are choosing between a tool that requires Docker and one that ships as a single binary, the binary wins for monitoring agents. You want your monitoring to work even when Docker is the thing that is broken.


Where this fits in managed infrastructure

For OpsKern customers, the agent is invisible. It gets deployed automatically when a host is onboarded, it runs in the background, and it provides the telemetry that powers everything else — alerting, remediation, vulnerability scanning, capacity planning.

The customer never installs it, configures it, or thinks about it. That is the point. Monitoring infrastructure should be infrastructure, not a project.

If you are running your own fleet and want something lighter than the standard monitoring stack, consider what jobs your agent actually needs to do. The answer is usually fewer than you think.

Everything in this post is included in OpsKern managed hosting. Starting at $75/server/month.

