📊 Observability for MCP Servers: Structured Logging, Distributed Tracing, and Metrics
See inside your production MCP server — every tool call, every session, every latency spike — with pino, OpenTelemetry, and Prometheus
Hi 👋, I'm Tushar Patil. I currently work as a Frontend Developer (Angular) and also have experience with .NET Core and .NET Framework.
This is Part 7 of the AI Engineering with TypeScript series.
Prerequisites: Part 1 · Part 2 · Part 3 · Part 4 · Part 5 · Part 6
Stack: Node.js 20+ · pino · @opentelemetry/sdk-node · prom-client · Jaeger · Prometheus · Grafana
🗺️ What we'll cover
After Parts 5 and 6 you have a hardened, multi-tenant, horizontally scaled MCP server. But right now, when something goes wrong in production, you are flying blind. Which tool call took 4 seconds? Which tenant is hammering the server? Why did three sessions die at 2 AM?
Observability answers all of these questions. The three pillars are:
- 🪵 Logs — what happened, with full context
- 🔭 Traces — how long each step took and where time was lost
- 📊 Metrics — aggregated numbers you can alert on
By the end you will have:
- 🪵 Structured JSON logging with pino — correlation IDs, session context, request metadata
- 🔭 Distributed tracing with OpenTelemetry — spans for every tool call, Redis lookup, and HTTP request
- 📊 Prometheus metrics — request counts, tool call durations, active session gauges
- 🖥️ Grafana dashboards — logs, traces, and metrics in one unified view
- 🐳 A docker-compose stack that spins up the whole observability stack locally
🪵 Part 1: Structured Logging with pino
console.log is fine during development. In production it produces unstructured strings that are impossible to query, filter, or alert on. pino emits newline-delimited JSON — every log line is machine-readable and carries structured fields.
npm install pino pino-http
npm install -D pino-pretty
// src/logger.ts
import pino from "pino";
export const logger = pino({
level: process.env.LOG_LEVEL ?? "info",
// In production, skip the pretty transport and ship raw JSON to your log collector
transport:
process.env.NODE_ENV !== "production"
? { target: "pino-pretty", options: { colorize: true } }
: undefined,
base: {
service: "weather-mcp-server",
version: process.env.npm_package_version ?? "unknown",
},
});
Every log line now carries service and version automatically. In production, pipe the JSON output to Loki, Datadog, or CloudWatch — they index the fields for free. 🎯
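To see the payoff, compare a structured call against string interpolation. A small sketch using the logger above (the field names are just examples):
// Context fields become top-level JSON keys your log backend can index and filter on
import { logger } from "./logger.js";
logger.info(
  { tool: "get_current_weather", city: "Pune", durationMs: 312 },
  "tool call completed"
);
// Avoid this: the same context is trapped inside an unqueryable string
logger.info(`get_current_weather for Pune took 312ms`);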
🔗 Part 2: Correlation IDs — Tying a Request Together
When 50 concurrent requests are interleaved in your logs, you need a way to filter down to everything that happened for one specific request. A correlation ID (also called a trace ID or request ID) is a UUID generated at the start of each request and attached to every log line that request produces.
// src/middleware/correlation.ts
import type { Request, Response, NextFunction } from "express";
import type pino from "pino";
import { randomUUID } from "crypto";
import { logger } from "../logger.js";
declare global {
  namespace Express {
    interface Request {
      correlationId: string;
      log: pino.Logger;
    }
  }
}
export function correlationMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
// Accept an upstream correlation ID (from a gateway) or generate one
const correlationId =
(req.headers["x-correlation-id"] as string) ?? randomUUID();
req.correlationId = correlationId;
// Child logger inherits all pino config + adds request-scoped fields
req.log = logger.child({
correlationId,
sessionId: req.headers["mcp-session-id"],
method: req.method,
path: req.path,
});
res.setHeader("X-Correlation-Id", correlationId);
next();
}
Now inside any route handler or tool, use req.log.info(...) instead of logger.info(...). Every line automatically carries the correlation ID, session ID, and request metadata — no manual field attachment needed. ✅
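For completeness, here is a minimal sketch of wiring the middleware into the Express app and using the request-scoped logger in a handler (the handler body is illustrative; the real MCP handling is unchanged from earlier parts):
// In server.ts
import express from "express";
import { correlationMiddleware } from "./middleware/correlation.js";
const app = express();
app.use(express.json());
app.use(correlationMiddleware); // every request now gets req.correlationId and req.log
app.post("/mcp", async (req, res) => {
  // correlationId, sessionId, method, and path are attached automatically
  req.log.info({ rpcMethod: req.body?.method }, "incoming MCP request");
  // ... hand the request off to the MCP transport exactly as in earlier parts
});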
A real log line looks like this:
{
"level": "info",
"time": 1748000000000,
"service": "weather-mcp-server",
"correlationId": "a3f1c2d4-...",
"sessionId": "abc-123",
"method": "POST",
"path": "/mcp",
"tool": "get_current_weather",
"city": "Pune",
"durationMs": 312,
"msg": "tool call completed"
}
Paste that correlationId into your log explorer and you instantly see every log line for that request across all instances. 🔍
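If your logs land in Loki, that lookup is a single LogQL query. A sketch, assuming your collector ships the pino output with a service label (adjust the stream selector to however your setup tags the service):
{service="weather-mcp-server"} | json | correlationId="a3f1c2d4-..."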
🔭 Part 3: Distributed Tracing with OpenTelemetry
Logs tell you what happened. Traces tell you how long each step took and where time was lost. A trace is a tree of spans — each span represents one unit of work (an HTTP request, a Redis lookup, a tool call).
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/resources \
@opentelemetry/semantic-conventions
Create the instrumentation file and import it before anything else in your entry point:
// src/instrumentation.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION }
from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: "weather-mcp-server",
[SEMRESATTRS_SERVICE_VERSION]: process.env.npm_package_version ?? "0.0.0",
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
}),
instrumentations: [
getNodeAutoInstrumentations({
"@opentelemetry/instrumentation-fs": { enabled: false }, // too noisy
}),
],
});
sdk.start();
process.on("SIGTERM", () => sdk.shutdown());
process.on("SIGINT", () => sdk.shutdown());
// src/server.ts — import instrumentation FIRST
import "./instrumentation.js";
import express from "express";
// ... rest of your imports
getNodeAutoInstrumentations automatically creates spans for every HTTP request, Redis call, and outbound fetch. You get distributed tracing with zero manual instrumentation for all the common operations. ✅
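A useful follow-up once the SDK is running: stamp the active trace ID onto your request logs so you can jump from a log line straight to the matching trace in Jaeger. A minimal sketch using the OpenTelemetry API (the withTraceContext helper is an assumption of mine, not something defined earlier in the series):
// src/trace-context.ts
import { trace } from "@opentelemetry/api";
import type pino from "pino";
// Returns a child logger stamped with the current trace/span IDs, if a span is active
export function withTraceContext(log: pino.Logger): pino.Logger {
  const span = trace.getActiveSpan();
  if (!span) return log;
  const { traceId, spanId } = span.spanContext();
  return log.child({ traceId, spanId });
}
You could call this inside the correlation middleware when building req.log, so every log line carries both the correlation ID and the trace ID.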
🔭 Part 4: Manual Spans for Tool Calls
Auto-instrumentation covers infrastructure. For your own business logic — tool calls, session lookups, cache hits — add manual spans:
// src/tracing.ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("weather-mcp-server", "1.0.0");
export async function withSpan<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: () => Promise<T>
): Promise<T> {
return tracer.startActiveSpan(name, async (span) => {
span.setAttributes(attributes);
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: err instanceof Error ? err.message : String(err),
});
span.recordException(err as Error);
throw err;
} finally {
span.end();
}
});
}
Use it in your tool handler:
server.tool(
"get_current_weather",
"Get current weather for a city",
GetWeatherInput.shape,
async (args) => {
const input = GetWeatherInput.parse(args);
return withSpan(
"tool.get_current_weather",
{
"tool.name": "get_current_weather",
"tool.input.city": input.city,
"tool.input.units": input.units ?? "metric",
"session.id": currentSessionId, // from closure
},
async () => {
const data = await fetchCurrentWeather(input.city, input.country, input.units);
return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
}
);
}
);
In Jaeger or Grafana Tempo you now see a waterfall: the outer HTTP span contains the tool.get_current_weather span, which contains the outbound fetch span to the weather API. If the tool call is slow, you can see exactly which child span is the bottleneck. 🔍
📊 Part 5: Prometheus Metrics with prom-client
Logs and traces answer historical questions. Metrics answer "right now" questions: "How many requests per second are we handling? What is the p99 latency of tool calls? How many sessions are active?"
npm install prom-client
// src/metrics.ts
import client from "prom-client";
// Enable default Node.js metrics (heap, GC, event loop lag, etc.)
client.collectDefaultMetrics({ prefix: "mcp_" });
export const httpRequestDuration = new client.Histogram({
name: "mcp_http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "path", "status_code"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
export const toolCallDuration = new client.Histogram({
name: "mcp_tool_call_duration_seconds",
help: "Duration of MCP tool calls in seconds",
labelNames: ["tool_name", "session_id", "status"],
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
export const toolCallTotal = new client.Counter({
name: "mcp_tool_calls_total",
help: "Total number of MCP tool calls",
labelNames: ["tool_name", "status"],
});
export const activeSessions = new client.Gauge({
name: "mcp_active_sessions",
help: "Number of currently active MCP sessions",
});
export const authFailures = new client.Counter({
name: "mcp_auth_failures_total",
help: "Total number of authentication failures",
labelNames: ["reason"],
});
export const registry = client.register;
Expose a /metrics endpoint that Prometheus scrapes:
// In server.ts
import { registry } from "./metrics.js";
// No auth on /metrics — Prometheus scrapes this internally
app.get("/metrics", async (_req, res) => {
res.set("Content-Type", registry.contentType);
res.send(await registry.metrics());
});
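The httpRequestDuration histogram defined earlier still needs to be fed. Here is a small sketch of an Express middleware that observes every request; register it early, before the routes (the file path is just a suggestion):
// src/middleware/http-metrics.ts
import type { Request, Response, NextFunction } from "express";
import { httpRequestDuration } from "../metrics.js";
export function httpMetricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    // Label with the matched route path (not the raw URL) to keep cardinality bounded
    end({
      method: req.method,
      path: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    });
  });
  next();
}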
📊 Part 6: Instrumenting Tool Calls with Metrics
Wire the metrics into your tool execution path:
// src/utils/instrument-tool.ts
import { toolCallDuration, toolCallTotal, activeSessions } from "../metrics.js";
import { withSpan } from "../tracing.js";
import type { Request } from "express";
export async function instrumentedToolCall<T>(
toolName: string,
sessionId: string,
fn: () => Promise<T>
): Promise<T> {
const end = toolCallDuration.startTimer({ tool_name: toolName, session_id: sessionId });
return withSpan(`tool.${toolName}`, { "tool.name": toolName, "session.id": sessionId }, async () => {
try {
const result = await fn();
end({ status: "success" });
toolCallTotal.inc({ tool_name: toolName, status: "success" });
return result;
} catch (err) {
end({ status: "error" });
toolCallTotal.inc({ tool_name: toolName, status: "error" });
throw err;
}
});
}
Now wrap every tool handler body:
server.tool(
"get_current_weather",
"Get current weather for a city",
GetWeatherInput.shape,
async (args) => {
const input = GetWeatherInput.parse(args);
return instrumentedToolCall("get_current_weather", currentSessionId, async () => {
const data = await fetchCurrentWeather(input.city, input.country, input.units);
return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
});
}
);
One wrapper gives you histogram timings, counters, and a trace span — all three pillars covered in a single call. 🎯
📈 Part 7: Session Gauge — Track Active Sessions
Update the active session gauge whenever sessions are created or destroyed:
// In handleMcpRequest — when a new session is created:
activeSessions.inc();
// In the DELETE /mcp handler — when a session is cleaned up:
activeSessions.dec();
// In InProcessSessionStore.evictExpired():
for (const [id, state] of this.sessions) {
if (now - state.lastActiveAt.getTime() > this.TTL_MS) {
this.sessions.delete(id);
activeSessions.dec(); // track evictions too
}
}
A Prometheus alert on this gauge tells you when sessions are leaking (gauge grows without bound) or when a deployment drained all sessions (gauge drops to zero mid-traffic). 🚨
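One caveat: with the Redis-backed session store from Part 6, an in-process inc()/dec() gauge only counts the sessions owned by that single instance. A hedged sketch of a cluster-wide alternative uses prom-client's collect() hook, which runs on every scrape (sessionStore.count() is a hypothetical helper; substitute whatever count your store exposes):
import client from "prom-client";
import { sessionStore } from "./session-store.js"; // hypothetical path from Part 6
export const activeSessionsCluster = new client.Gauge({
  name: "mcp_active_sessions_cluster",
  help: "Active MCP sessions across all instances, read from the shared store",
  async collect() {
    // prom-client calls collect() on every /metrics scrape
    this.set(await sessionStore.count()); // hypothetical helper
  },
});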
🐳 Part 8: Local Observability Stack with docker-compose
Spin up the full observability stack locally — Jaeger for traces, Prometheus for metrics, and Grafana as the unified UI (a Loki service for logs can be added to the same file following the same pattern):
version: "3.9"
services:
mcp-server:
build: .
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
- VALID_TOKENS=dev-token
- WEATHER_API_KEY=${WEATHER_API_KEY}
- REDIS_URL=redis://redis:6379
depends_on:
- jaeger
- prometheus
- redis
redis:
image: redis:7-alpine
ports:
- "6379:6379"
jaeger:
image: jaegertracing/all-in-one:1.57
ports:
- "16686:16686" # Jaeger UI
- "4318:4318" # OTLP HTTP receiver
prometheus:
image: prom/prometheus:v2.51.0
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:10.4.0
ports:
- "3001:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
prometheus.yml scrape config:
global:
scrape_interval: 15s
scrape_configs:
- job_name: weather-mcp-server
static_configs:
- targets: ["mcp-server:3000"]
metrics_path: /metrics
Start everything:
docker compose up
Then open:
- Jaeger UI at http://localhost:16686 — search for weather-mcp-server traces
- Prometheus at http://localhost:9090 — query mcp_tool_call_duration_seconds
- Grafana at http://localhost:3001 — build unified dashboards
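The compose file mounts ./grafana/provisioning, which we have not created yet. Here is a minimal datasource provisioning file so Grafana boots already connected to Prometheus and Jaeger (the folder layout follows Grafana's provisioning convention; names are illustrative):
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686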
🖥️ Part 9: Useful Prometheus Queries
Once Prometheus is scraping your server, these PromQL queries give you the most useful signals:
Request throughput (requests per second, 5-minute window):
rate(mcp_http_request_duration_seconds_count[5m])
p99 tool call latency:
histogram_quantile(0.99, rate(mcp_tool_call_duration_seconds_bucket[5m]))
Tool call error rate:
rate(mcp_tool_calls_total{status="error"}[5m])
/
rate(mcp_tool_calls_total[5m])
Active sessions right now:
mcp_active_sessions
Auth failure rate (alert if this spikes — it could mean a token is compromised):
rate(mcp_auth_failures_total[5m])
Node.js event loop lag (alert if consistently above 100ms — your server is CPU-starved):
mcp_nodejs_eventloop_lag_seconds
Paste these into Grafana panels and you have a production-grade dashboard in minutes. 📊
🚨 Part 10: Alerting Rules
Add these alert rules to a rules file that Prometheus loads via rule_files (Alertmanager then handles routing and paging; the rule_files wiring is shown after the rules) to get paged before users notice:
groups:
- name: mcp-server
rules:
- alert: HighToolCallErrorRate
expr: |
rate(mcp_tool_calls_total{status="error"}[5m])
/ rate(mcp_tool_calls_total[5m]) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "Tool call error rate above 5% for 2 minutes"
- alert: HighP99Latency
expr: |
histogram_quantile(0.99,
rate(mcp_tool_call_duration_seconds_bucket[5m])
) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "p99 tool call latency above 3s"
- alert: SessionLeak
expr: mcp_active_sessions > 500
for: 10m
labels:
severity: critical
annotations:
summary: "Active sessions above 500 for 10 minutes — possible leak"
- alert: AuthSpiking
expr: rate(mcp_auth_failures_total[5m]) > 10
for: 1m
labels:
severity: critical
annotations:
summary: "Auth failures spiking — possible credential attack"
💡 Key Takeaways
Log structured JSON, not strings. Strings are for humans. JSON is for both humans and machines — your log aggregator can index, filter, and alert on individual fields.
Correlation IDs are non-negotiable in distributed systems. Without them, debugging a multi-instance, multi-session server is guesswork.
Auto-instrumentation covers 80% of the work. getNodeAutoInstrumentations automatically traces all HTTP, Redis, and DNS operations. Only add manual spans for your own business logic.
Histograms beat averages. An average latency of 200ms hides the fact that 1% of requests take 5 seconds. Always look at p95 and p99.
Alerts should be actionable. Every alert should have a clear remediation — "restart the service", "scale up", "rotate tokens". Alerts with no action are noise that trains your team to ignore pages.
🎯 Summary
In Part 7 you wired full observability into the MCP server:
- 🪵 pino — structured JSON logs with correlation IDs and session context on every line
- 🔗 Correlation middleware — one UUID ties all logs for a request together across instances
- 🔭 OpenTelemetry — auto-instrumented spans for HTTP, Redis, and DNS plus manual spans for tool calls
- 📊 prom-client — request duration histograms, tool call counters, active session gauges, auth failure counters
- 🐳 docker-compose — one command spins up Jaeger, Prometheus, and Grafana locally
- 🚨 Alert rules — error rate, p99 latency, session leak, and auth spike detectors
In Part 8 we will build a TypeScript MCP client SDK wrapper — a clean, reusable abstraction you can publish as an npm package so other teams can connect to your MCP server without touching raw JSON-RPC. 📦