When a request slows down in a distributed system, "the API is slow" is not a diagnosis — it is the start of an investigation. Observability is what turns that investigation from guesswork into a few targeted queries. Over the years I have settled on a small set of patterns for making Go services observable without drowning them in instrumentation.
The three signals
Observability is usually framed as traces, metrics, and logs. They answer different questions:
- Metrics answer "is something wrong, and how bad?" — they are cheap, aggregate, and great for alerting.
- Traces answer "where is the time going?" — they follow a single request across service boundaries.
- Logs answer "what exactly happened?" — they carry the detail you need once a trace points you at the right span.
The mistake I see most often is treating these as three separate systems. They are far more powerful when correlated by a shared trace ID.
Start with OpenTelemetry
OpenTelemetry gives you a vendor-neutral API for all three signals. Wiring up a tracer provider once, at startup, keeps the rest of the code clean:
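
    import (
        "context"
        "fmt"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace" // aliased so it cannot clash with otel/trace
        semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // version assumed; match your SDK
    )

    // initTracer installs a global tracer provider and returns its shutdown function.
    func initTracer(ctx context.Context) (func(context.Context) error, error) {
        // With no options, the exporter takes its endpoint from the OTEL_EXPORTER_OTLP_* variables.
        exp, err := otlptracegrpc.New(ctx)
        if err != nil {
            return nil, fmt.Errorf("create otlp exporter: %w", err)
        }
        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exp), // batch spans instead of exporting one at a time
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceName("loan-screening"),
            )),
        )
        otel.SetTracerProvider(tp)
        return tp.Shutdown, nil // call on exit to flush buffered spans
    }

From there, instrument the boundaries that matter (inbound handlers, outbound calls, and any expensive work in between):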
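
    // Needs go.opentelemetry.io/otel and go.opentelemetry.io/otel/attribute.
    func (s *Service) Screen(ctx context.Context, app Application) (Decision, error) {
        // Start a child span; the returned ctx carries it to downstream calls.
        ctx, span := otel.Tracer("screening").Start(ctx, "Screen")
        defer span.End()
        // Bounded attributes like the segment make traces filterable later.
        span.SetAttributes(attribute.String("applicant.segment", app.Segment))
        var decision Decision // populated by the elided work below
        // ... real work ...
        return decision, nil
    }

Make logs carry the trace ID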
A trace is only useful if you can jump from a log line to it. With structured logging via slog, attach the trace ID to every log emitted inside a request:
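
    // logger returns a logger tagged with the current trace ID, if there is one.
    // Here trace is go.opentelemetry.io/otel/trace (the API package, not the SDK).
    func logger(ctx context.Context) *slog.Logger {
        sc := trace.SpanContextFromContext(ctx)
        if !sc.HasTraceID() {
            // No active span: fall back to the untagged default logger.
            return slog.Default()
        }
        return slog.Default().With("trace_id", sc.TraceID().String())
    }

Now a single trace_id ties together your Prometheus exemplars, your Jaeger trace, and your Loki logs. That correlation is the whole point.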
Keep metrics low-cardinality
Metrics are cheap until you label them with something unbounded — a user ID, a request path with IDs in it, a raw error string. Cardinality explosions are the most common way teams accidentally take down their own monitoring stack.
Rule of thumb: a label is safe only if you could enumerate its possible values on a whiteboard.
Use a small, fixed set of labels (method, route template, status class) and push the high-cardinality detail into traces and logs instead.
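One way to enforce that fixed set is to pass every value through a small normalizing function before it becomes a label, so that unexpected input collapses into a catch-all bucket instead of creating a new series. The helpers below are a sketch under assumptions: the function names, the "other" and "unknown" buckets, and the two route templates are all hypothetical, and a real service would take the template from its router.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusClass collapses an HTTP status code into a handful of values
// ("1xx" through "5xx"), with "unknown" for anything outside that range.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// methodLabel keeps the common HTTP methods and maps everything else to
// "other", so arbitrary client-supplied methods cannot create new series.
func methodLabel(m string) string {
	switch m {
	case http.MethodGet, http.MethodHead, http.MethodPost, http.MethodPut,
		http.MethodPatch, http.MethodDelete, http.MethodOptions:
		return m
	}
	return "other"
}

// knownRoutes is a hypothetical allowlist of route templates.
var knownRoutes = map[string]bool{
	"/applications":      true,
	"/applications/{id}": true,
}

// routeLabel returns the template if it is on the allowlist and "other"
// otherwise, so raw paths with embedded IDs never become label values.
func routeLabel(template string) string {
	if knownRoutes[template] {
		return template
	}
	return "other"
}

func main() {
	fmt.Println(methodLabel("GET"), routeLabel("/applications/{id}"), statusClass(503))
	// prints: GET /applications/{id} 5xx

	fmt.Println(methodLabel("PROPFIND"), routeLabel("/applications/8812"), statusClass(42))
	// prints: other other unknown
}
```

The raw path, the exact status code, and the user involved still get recorded, just as span attributes and log fields rather than as metric labels.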
What good looks like
A healthy setup lets you go from an alert to a root cause in three hops:
- A Prometheus alert fires on elevated p99 latency.
- An exemplar on that metric links to a slow trace in Jaeger.
- The slow span's trace_id pulls the exact logs in Loki.
No SSH-ing into boxes, no grep across hosts. That is the difference between observability as a buzzword and observability as a tool you actually reach for at 2 a.m.