The pager goes off at 3 a.m. A service is down. You SSH into the box and find silence — a log that says started, a log that says crashed, and nothing in between. The system worked right up until it did not, and it left you no story of how. That gap is not bad luck. It is a design decision, made months earlier, to ship features and defer visibility.
Three signals, three jobs
Logs, metrics, and traces are not interchangeable. Logs are the narrative: discrete events with context, what happened and why. Metrics are the vital signs: cheap numbers aggregated over time that tell you how much and how fast. Traces are the map: the path of a single request as it crosses services, showing where the time actually went. Reach for a log to explain one event, a metric to spot a trend, a trace to find the slow hop. Skip one and you get blind spots exactly where incidents live.
A log line meant only for human eyes is a dead end at scale. Structured logging emits events as key-value data — a request id, a tenant, a duration, an outcome — so you can filter, group, and alert instead of grepping prose. The discipline is to log decisions, not decoration: the branch taken, the input that failed validation, the downstream call that timed out. Free text tells you a story you have to read one line at a time. Structured events let you ask the whole fleet a question and get an answer in seconds.
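To make the contrast concrete, here is a minimal sketch of structured logging using only Python's standard library. The service name `checkout`, the event names, the field values, and the `fields` convention for passing key-value data are all illustrative assumptions, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {"level": record.levelname, "event": record.getMessage()}
        # Key-value context arrives via extra={"fields": {...}} (our own convention).
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger, event, **fields):
    """Log a decision with its context, not a sentence of prose."""
    logger.info(event, extra={"fields": fields})


# Emit a few events into an in-memory buffer (stand-in for a log pipeline).
buffer = io.StringIO()
logger = make_logger(buffer)
log_event(logger, "payment_authorized", request_id="r-1", tenant="acme",
          duration_ms=42, outcome="ok")
log_event(logger, "downstream_timeout", request_id="r-2", tenant="acme",
          duration_ms=2000, outcome="error")
log_event(logger, "validation_failed", request_id="r-3", tenant="globex",
          duration_ms=3, outcome="error")

# Because every line is data, "which tenants saw errors?" is a query, not a grep.
events = [json.loads(line) for line in buffer.getvalue().splitlines()]
failed_by_tenant = {}
for e in events:
    if e["outcome"] == "error":
        failed_by_tenant[e["tenant"]] = failed_by_tenant.get(e["tenant"], 0) + 1
print(failed_by_tenant)
```

In a real deployment the same grouping would likely run in a log backend across the fleet rather than in process; the point of the sketch is only that the question becomes expressible once the fields exist.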
The handful of metrics that matter
You can drown in dashboards. The signals that actually predict user pain are few — latency, traffic, errors, and saturation, the golden signals. How long requests take, how many arrive, how many fail, and how close you are to a resource limit. Track those per service, with percentiles rather than averages, and you can see trouble building before customers feel it. Everything else is context you pull once the golden signals tell you where to look. Instrument the four first; add the rest when a specific question demands it.
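A small sketch shows why percentiles matter more than averages here. It computes the four golden signals for one window of made-up request data; the sample values, the ten-second window, and the connection-pool numbers used for saturation are assumptions chosen for illustration, and the percentile uses the simple nearest-rank method rather than whatever your metrics backend implements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Summarize one window of (duration_ms, ok) request records."""
    durations = [duration for duration, _ in requests]
    failures = sum(1 for _, ok in requests if not ok)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
        "latency_mean_ms": sum(durations) / len(durations),
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "saturation": pool_in_use / pool_size,  # e.g. a connection pool
    }


# Hypothetical window: 98 fast successes and 2 slow failures.
window = [(50, True)] * 98 + [(3000, False)] * 2
signals = golden_signals(window, window_seconds=10, pool_in_use=45, pool_size=50)

for name, value in signals.items():
    print(f"{name}: {value}")
```

With this data the mean latency comes out at 109 ms, which looks healthy, while the 99th percentile is 3000 ms: the two users who waited three seconds are invisible in the average and obvious in the tail.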
Traces stitch the services back together
In a distributed system no single service holds the truth about a slow request. A trace propagates one id across every hop and records the timing of each, so a request that takes two seconds stops being a mystery and becomes a bar chart — this call was fast, this queue was deep, this database did the damage. Without tracing you debug microservices by guessing which team to blame. With it, the request tells you itself.
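The mechanism can be sketched in a few dozen lines. This toy tracer keeps spans in memory and passes the trace id through a plain dictionary standing in for request headers; the header names, the three service functions, and the sleep that simulates a slow database are all hypothetical. A production system would normally use an established tracing library and a standard propagation format instead.

```python
import time
import uuid
from contextlib import contextmanager

TRACE_HEADER = "x-trace-id"         # hypothetical header names
PARENT_HEADER = "x-parent-span-id"


class Tracer:
    """Collects spans in memory; a real tracer would export them."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, incoming_headers, service):
        # Reuse the caller's trace id, or start a new trace at the edge.
        trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
        span = {
            "trace_id": trace_id,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": incoming_headers.get(PARENT_HEADER),
            "service": service,
        }
        start = time.perf_counter()
        try:
            # Headers the service must attach to every outgoing call.
            yield {TRACE_HEADER: trace_id, PARENT_HEADER: span["span_id"]}
        finally:
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            self.spans.append(span)


def database(tracer, headers):
    with tracer.span(headers, "database"):
        time.sleep(0.02)  # stand-in for the slow hop


def orders(tracer, headers):
    with tracer.span(headers, "orders") as outgoing:
        database(tracer, outgoing)


def gateway(tracer, headers):
    with tracer.span(headers, "gateway") as outgoing:
        orders(tracer, outgoing)


def self_times(tracer, trace_id):
    """Time spent in each service itself, excluding time spent in its callees."""
    spans = [s for s in tracer.spans if s["trace_id"] == trace_id]
    result = {}
    for s in spans:
        children = sum(c["duration_ms"] for c in spans
                       if c["parent_id"] == s["span_id"])
        result[s["service"]] = s["duration_ms"] - children
    return result


tracer = Tracer()
gateway(tracer, {})  # no incoming headers: the gateway starts the trace
trace_id = tracer.spans[0]["trace_id"]
times = self_times(tracer, trace_id)
slowest = max(times, key=times.get)
print(f"slowest hop: {slowest}")
```

Subtracting child time from each span is what turns nested durations into the bar chart described above: the gateway's total includes everything beneath it, but its own share is tiny, so the database stands out as the hop that did the damage.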
Operability is a feature you design in
A system is not done when it passes tests; it is done when someone other than its author can run it at 3 a.m. That means instrumentation is part of the design, not a patch after the first outage. We treat infrastructure as a product that is operated, not just delivered: every service ships with structured logs, the golden signals, and traces, and has its dashboards and alerts wired before it takes production traffic. The question we ask in review is not only "does it work?" but "can you see it working?"
A system you cannot observe is a system you can only hope about, never operate.
Protocore · Backend engineering
On a payments platform we ran, instrumentation was the first commit, not the last. When latency crept up one afternoon, a trace pointed at a single slow dependency in minutes, and the golden signals confirmed no money was at risk while we fixed it — no war room, no guessing. That is what instrumented by default buys: a system you can actually run, one where an incident is a question with an answer instead of a night spent in the dark.