📊 Observability for MCP Servers: Structured Logging, Distributed Tracing, and Metrics
See inside your production MCP server — every tool call, every session, every latency spike — with pino, OpenTelemetry, and Prometheus
Hi 👋, I'm Tushar Patil. I currently work as a Frontend Developer (Angular) and also have experience with .NET Core and .NET Framework.
This is Part 7 of the AI Engineering with TypeScript series.
Prerequisites: Part 1 · Part 2 · Part 3 · Part 4 · Part 5 · Part 6
Stack: Node.js 20+ · pino · @opentelemetry/sdk-node · prom-client · Jaeger · Prometheus · Grafana
🗺️ What we'll cover
After Parts 5 and 6 you have a hardened, multi-tenant, horizontally scaled MCP server. But right now, when something goes wrong in production, you are flying blind. Which tool call took 4 seconds? Which tenant is hammering the server? Why did three sessions die at 2 AM?
Observability answers all of these questions. The three pillars are:
- 🪵 Logs — what happened, with full context
- 🔭 Traces — how long each step took and where time was lost
- 📊 Metrics — aggregated numbers you can alert on
By the end you will have:
- 🪵 Structured JSON logging with pino — correlation IDs, session context, request metadata
- 🔭 Distributed tracing with OpenTelemetry — spans for every tool call, Redis lookup, and HTTP request
- 📊 Prometheus metrics — request counts, tool call durations, active session gauges
- 🖥️ Grafana dashboards — logs, traces, and metrics in one unified view
- 🐳 A docker-compose stack that spins up the whole observability stack locally
🪵 Part 1: Structured Logging with pino
console.log is fine during development. In production it produces unstructured strings that are impossible to query, filter, or alert on. pino emits newline-delimited JSON — every log line is machine-readable and carries structured fields.
npm install pino pino-http
npm install -D pino-pretty
// src/logger.ts
import pino from "pino";
export const logger = pino({
level: process.env.LOG_LEVEL ?? "info",
// In production, skip the pretty transport and ship raw JSON to your log collector
transport:
process.env.NODE_ENV !== "production"
? { target: "pino-pretty", options: { colorize: true } }
: undefined,
base: {
service: "weather-mcp-server",
version: process.env.npm_package_version ?? "unknown",
},
});
Every log line now carries service and version automatically. In production, pipe the JSON output to Loki, Datadog, or CloudWatch — they index the fields for free. 🎯
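To see the payoff, compare a structured call against string interpolation. A small sketch using the logger above (the field names are just examples):
// Context fields become top-level JSON keys your log backend can index and filter on
import { logger } from "./logger.js";
logger.info(
  { tool: "get_current_weather", city: "Pune", durationMs: 312 },
  "tool call completed"
);
// Avoid this: the same context is trapped inside an unqueryable string
logger.info(`get_current_weather for Pune took 312ms`);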
🔗 Part 2: Correlation IDs — Tying a Request Together
When 50 concurrent requests are interleaved in your logs, you need a way to filter down to everything that happened for one specific request. A correlation ID (also called a trace ID or request ID) is a UUID generated at the start of each request and attached to every log line that request produces.
// src/middleware/correlation.ts
import type { Request, Response, NextFunction } from "express";
import type pino from "pino";
import { randomUUID } from "crypto";
import { logger } from "../logger.js";
declare global {
  namespace Express {
    interface Request {
      correlationId: string;
      log: pino.Logger;
    }
  }
}
export function correlationMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
// Accept an upstream correlation ID (from a gateway) or generate one
const correlationId =
(req.headers["x-correlation-id"] as string) ?? randomUUID();
req.correlationId = correlationId;
// Child logger inherits all pino config + adds request-scoped fields
req.log = logger.child({
correlationId,
sessionId: req.headers["mcp-session-id"],
method: req.method,
path: req.path,
});
res.setHeader("X-Correlation-Id", correlationId);
next();
}
Now inside any route handler or tool, use req.log.info(...) instead of logger.info(...). Every line automatically carries the correlation ID, session ID, and request metadata — no manual field attachment needed. ✅
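For completeness, here is a minimal sketch of wiring the middleware into the Express app and using the request-scoped logger in a handler (the handler body is illustrative; the real MCP handling is unchanged from earlier parts):
// In server.ts
import express from "express";
import { correlationMiddleware } from "./middleware/correlation.js";
const app = express();
app.use(express.json());
app.use(correlationMiddleware); // every request now gets req.correlationId and req.log
app.post("/mcp", async (req, res) => {
  // correlationId, sessionId, method, and path are attached automatically
  req.log.info({ rpcMethod: req.body?.method }, "incoming MCP request");
  // ... hand the request off to the MCP transport exactly as in earlier parts
});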
A real log line looks like this:
{
"level": "info",
"time": 1748000000000,
"service": "weather-mcp-server",
"correlationId": "a3f1c2d4-...",
"sessionId": "abc-123",
"method": "POST",
"path": "/mcp",
"tool": "get_current_weather",
"city": "Pune",
"durationMs": 312,
"msg": "tool call completed"
}
Paste that correlationId into your log explorer and you instantly see every log line for that request across all instances. 🔍
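If your logs land in Loki, that lookup is a single LogQL query. A sketch, assuming your collector ships the pino output with a service label (adjust the stream selector to however your setup tags the service):
{service="weather-mcp-server"} | json | correlationId="a3f1c2d4-..."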
🔭 Part 3: Distributed Tracing with OpenTelemetry
Logs tell you what happened. Traces tell you how long each step took and where time was lost. A trace is a tree of spans — each span represents one unit of work (an HTTP request, a Redis lookup, a tool call).
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
Create the instrumentation file and import it before anything else in your entry point:
// src/instrumentation.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION }
from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: "weather-mcp-server",
[SEMRESATTRS_SERVICE_VERSION]: process.env.npm_package_version ?? "0.0.0",
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
}),
instrumentations: [
getNodeAutoInstrumentations({
"@opentelemetry/instrumentation-fs": { enabled: false }, // too noisy
}),
],
});
sdk.start();
process.on("SIGTERM", () => sdk.shutdown());
process.on("SIGINT", () => sdk.shutdown());
// src/server.ts — import instrumentation FIRST
import "./instrumentation.js";
import express from "express";
// ... rest of your imports
getNodeAutoInstrumentations automatically creates spans for every HTTP request, Redis call, and outbound fetch. You get distributed tracing with zero manual instrumentation for all the common operations. ✅
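A useful follow-up once the SDK is running: stamp the active trace ID onto your request logs so you can jump from a log line straight to the matching trace in Jaeger. A minimal sketch using the OpenTelemetry API (the withTraceContext helper is an assumption of mine, not something defined earlier in the series):
// src/trace-context.ts
import { trace } from "@opentelemetry/api";
import type pino from "pino";
// Returns a child logger stamped with the current trace/span IDs, if a span is active
export function withTraceContext(log: pino.Logger): pino.Logger {
  const span = trace.getActiveSpan();
  if (!span) return log;
  const { traceId, spanId } = span.spanContext();
  return log.child({ traceId, spanId });
}
You could call this inside the correlation middleware when building req.log, so every log line carries both the correlation ID and the trace ID.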
🔭 Part 4: Manual Spans for Tool Calls
Auto-instrumentation covers infrastructure. For your own business logic — tool calls, session lookups, cache hits — add manual spans:
// src/tracing.ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("weather-mcp-server", "1.0.0");
export async function withSpan<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: () => Promise<T>
): Promise<T> {
return tracer.startActiveSpan(name, async (span) => {
span.setAttributes(attributes);
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: err instanceof Error ? err.message : String(err),
});
span.recordException(err as Error);
throw err;
} finally {
span.end();
}
});
}
Use it in your tool handler:
server.tool(
"get_current_weather",
"Get current weather for a city",
GetWeatherInput.shape,
async (args) => {
const input = GetWeatherInput.parse(args);
return withSpan(
"tool.get_current_weather",
{
"tool.name": "get_current_weather",
"tool.input.city": input.city,
"tool.input.units": input.units ?? "metric",
"session.id": currentSessionId, // from closure
},
async () => {
const data = await fetchCurrentWeather(input.city, input.country, input.units);
return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
}
);
}
);
In Jaeger or Grafana Tempo you now see a waterfall: the outer HTTP span contains the tool.get_current_weather span, which contains the outbound fetch span to the weather API. If the tool call is slow, you can see exactly which child span is the bottleneck. 🔍
📊 Part 5: Prometheus Metrics with prom-client
Logs and traces answer historical questions. Metrics answer "right now" questions: "How many requests per second are we handling? What is the p99 latency of tool calls? How many sessions are active?"
npm install prom-client
// src/metrics.ts
import client from "prom-client";
// Enable default Node.js metrics (heap, GC, event loop lag, etc.)
client.collectDefaultMetrics({ prefix: "mcp_" });
export const httpRequestDuration = new client.Histogram({
name: "mcp_http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "path", "status_code"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
export const toolCallDuration = new client.Histogram({
name: "mcp_tool_call_duration_seconds",
help: "Duration of MCP tool calls in seconds",
labelNames: ["tool_name", "session_id", "status"],
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
export const toolCallTotal = new client.Counter({
name: "mcp_tool_calls_total",
help: "Total number of MCP tool calls",
labelNames: ["tool_name", "status"],
});
export const activeSessions = new client.Gauge({
name: "mcp_active_sessions",
help: "Number of currently active MCP sessions",
});
export const authFailures = new client.Counter({
name: "mcp_auth_failures_total",
help: "Total number of authentication failures",
labelNames: ["reason"],
});
export const registry = client.register;
Expose a /metrics endpoint that Prometheus scrapes:
// In server.ts
import { registry } from "./metrics.js";
// No auth on /metrics — Prometheus scrapes this internally
app.get("/metrics", async (_req, res) => {
res.set("Content-Type", registry.contentType);
res.send(await registry.metrics());
});
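The httpRequestDuration histogram defined earlier still needs to be fed. Here is a small sketch of an Express middleware that observes every request; register it early, before the routes (the file path is just a suggestion):
// src/middleware/http-metrics.ts
import type { Request, Response, NextFunction } from "express";
import { httpRequestDuration } from "../metrics.js";
export function httpMetricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    // Label with the matched route path (not the raw URL) to keep cardinality bounded
    end({
      method: req.method,
      path: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    });
  });
  next();
}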
📊 Part 6: Instrumenting Tool Calls with Metrics
Wire the metrics into your tool execution path:
// src/utils/instrument-tool.ts
import { toolCallDuration, toolCallTotal, activeSessions } from "../metrics.js";
import { withSpan } from "../tracing.js";
import type { Request } from "express";
export async function instrumentedToolCall<T>(
toolName: string,
sessionId: string,
fn: () => Promise<T>
): Promise<T> {
const end = toolCallDuration.startTimer({ tool_name: toolName, session_id: sessionId });
return withSpan(`tool.${toolName}`, { "tool.name": toolName, "session.id": sessionId }, async () => {
try {
const result = await fn();
end({ status: "success" });
toolCallTotal.inc({ tool_name: toolName, status: "success" });
return result;
} catch (err) {
end({ status: "error" });
toolCallTotal.inc({ tool_name: toolName, status: "error" });
throw err;
}
});
}
Now wrap every tool handler body:
server.tool(
"get_current_weather",
"Get current weather for a city",
GetWeatherInput.shape,
async (args) => {
const input = GetWeatherInput.parse(args);
return instrumentedToolCall("get_current_weather", currentSessionId, async () => {
const data = await fetchCurrentWeather(input.city, input.country, input.units);
return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
});
}
);
One wrapper gives you histogram timings, counters, and a trace span — all three pillars covered in a single call. 🎯
📈 Part 7: Session Gauge — Track Active Sessions
Update the active session gauge whenever sessions are created or destroyed:
// In handleMcpRequest — when a new session is created:
activeSessions.inc();
// In the DELETE /mcp handler — when a session is cleaned up:
activeSessions.dec();
// In InProcessSessionStore.evictExpired():
for (const [id, state] of this.sessions) {
if (now - state.lastActiveAt.getTime() > this.TTL_MS) {
this.sessions.delete(id);
activeSessions.dec(); // track evictions too
}
}
A Prometheus alert on this gauge tells you when sessions are leaking (gauge grows without bound) or when a deployment drained all sessions (gauge drops to zero mid-traffic). 🚨
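One caveat: with the Redis-backed session store from Part 6, an in-process inc()/dec() gauge only counts the sessions owned by that single instance. A hedged sketch of a cluster-wide alternative uses prom-client's collect() hook, which runs on every scrape (sessionStore.count() is a hypothetical helper; substitute whatever count your store exposes):
import client from "prom-client";
import { sessionStore } from "./session-store.js"; // hypothetical path from Part 6
export const activeSessionsCluster = new client.Gauge({
  name: "mcp_active_sessions_cluster",
  help: "Active MCP sessions across all instances, read from the shared store",
  async collect() {
    // prom-client calls collect() on every /metrics scrape
    this.set(await sessionStore.count()); // hypothetical helper
  },
});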
🐳 Part 8: Local Observability Stack with docker-compose
Spin up the full observability stack locally — Jaeger for traces, Prometheus for metrics, and Grafana as the unified UI (a Loki service for logs can be added to the same file following the same pattern):
version: "3.9"
services:
mcp-server:
build: .
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
- VALID_TOKENS=dev-token
- WEATHER_API_KEY=${WEATHER_API_KEY}
- REDIS_URL=redis://redis:6379
depends_on:
- jaeger
- prometheus
- redis
redis:
image: redis:7-alpine
ports:
- "6379:6379"
jaeger:
image: jaegertracing/all-in-one:1.57
ports:
- "16686:16686" # Jaeger UI
- "4318:4318" # OTLP HTTP receiver
prometheus:
image: prom/prometheus:v2.51.0
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:10.4.0
ports:
- "3001:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
prometheus.yml scrape config:
global:
scrape_interval: 15s
scrape_configs:
- job_name: weather-mcp-server
static_configs:
- targets: ["mcp-server:3000"]
metrics_path: /metrics
Start everything:
docker compose up
Then open:
- Jaeger UI at http://localhost:16686 — search for weather-mcp-server traces
- Prometheus at http://localhost:9090 — query mcp_tool_call_duration_seconds
- Grafana at http://localhost:3001 — build unified dashboards
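The compose file mounts ./grafana/provisioning, which we have not created yet. Here is a minimal datasource provisioning file so Grafana boots already connected to Prometheus and Jaeger (the folder layout follows Grafana's provisioning convention; names are illustrative):
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686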
🖥️ Part 9: Useful Prometheus Queries
Once Prometheus is scraping your server, these PromQL queries give you the most useful signals:
Request throughput (requests per second, 5-minute window):
rate(mcp_http_request_duration_seconds_count[5m])
p99 tool call latency:
histogram_quantile(0.99, rate(mcp_tool_call_duration_seconds_bucket[5m]))
Tool call error rate:
rate(mcp_tool_calls_total{status="error"}[5m])
/
rate(mcp_tool_calls_total[5m])
Active sessions right now:
mcp_active_sessions
Auth failure rate (alert if this spikes — it could mean a token is compromised):
rate(mcp_auth_failures_total[5m])
Node.js event loop lag (alert if consistently above 100ms — your server is CPU-starved):
mcp_nodejs_eventloop_lag_seconds
Paste these into Grafana panels and you have a production-grade dashboard in minutes. 📊
🚨 Part 10: Alerting Rules
Add these alert rules to a rules file that Prometheus loads via rule_files (Alertmanager then handles routing and paging; the rule_files wiring is shown after the rules) to get paged before users notice:
groups:
- name: mcp-server
rules:
- alert: HighToolCallErrorRate
expr: |
rate(mcp_tool_calls_total{status="error"}[5m])
/ rate(mcp_tool_calls_total[5m]) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "Tool call error rate above 5% for 2 minutes"
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
rate(mcp_tool_call_duration_seconds_bucket[5m])
) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "p99 tool call latency above 3s"
- alert: SessionLeak
expr: mcp_active_sessions > 500
for: 10m
labels:
severity: critical
annotations:
summary: "Active sessions above 500 for 10 minutes — possible leak"
- alert: AuthSpiking
expr: rate(mcp_auth_failures_total[5m]) > 10
for: 1m
labels:
severity: critical
annotations:
summary: "Auth failures spiking — possible credential attack"
💡 Key Takeaways
Log structured JSON, not strings. Strings are for humans. JSON is for both humans and machines — your log aggregator can index, filter, and alert on individual fields.
Correlation IDs are non-negotiable in distributed systems. Without them, debugging a multi-instance, multi-session server is guesswork.
Auto-instrumentation covers 80% of the work. getNodeAutoInstrumentations automatically traces all HTTP, Redis, and DNS operations. Only add manual spans for your own business logic.
Histograms beat averages. An average latency of 200ms hides the fact that 1% of requests take 5 seconds. Always look at p95 and p99.
Alerts should be actionable. Every alert should have a clear remediation — "restart the service", "scale up", "rotate tokens". Alerts with no action are noise that trains your team to ignore pages.
🎯 Summary
In Part 7 you wired full observability into the MCP server:
- 🪵 pino — structured JSON logs with correlation IDs and session context on every line
- 🔗 Correlation middleware — one UUID ties all logs for a request together across instances
- 🔭 OpenTelemetry — auto-instrumented spans for HTTP, Redis, and DNS plus manual spans for tool calls
- 📊 prom-client — request duration histograms, tool call counters, active session gauges, auth failure counters
- 🐳 docker-compose — one command spins up Jaeger, Prometheus, and Grafana locally
- 🚨 Alert rules — error rate, p99 latency, session leak, and auth spike detectors
In Part 8 we will build a TypeScript MCP client SDK wrapper — a clean, reusable abstraction you can publish as an npm package so other teams can connect to your MCP server without touching raw JSON-RPC. 📦