# 📊 Observability for MCP Servers: Structured Logging, Distributed Tracing, and Metrics

---

*This is Part 7 of the **AI Engineering with TypeScript** series.*

*Prerequisites: [Part 1](https://blog.techtush.in/what-is-mcp-model-context-protocol-a-typescript-developer-s-guide) · [Part 2](https://blog.techtush.in/mcp-fundamentals-tools-resources-prompts-and-capability-negotiation-deep-dive) · [Part 3](https://blog.techtush.in/building-an-ai-agent-with-mcp-multi-step-tool-orchestration-in-typescript) · [Part 4](https://blog.techtush.in/streaming-ai-agents-and-an-interactive-cli-real-time-mcp-in-typescript) · [Part 5](https://blog.techtush.in/production-mcp-servers-streamable-http-oauth-zod-and-docker) · [Part 6](https://blog.techtush.in/multi-tenant-mcp-session-management-state-isolation-and-horizontal-scaling)*

*Stack: Node.js 20+ · pino · @opentelemetry/sdk-node · prom-client · Jaeger · Prometheus · Grafana*

---

## 🗺️ What we'll cover

After Parts 5 and 6 you have a hardened, multi-tenant, horizontally scaled MCP server. But right now, when something goes wrong in production, you are flying blind. Which tool call took 4 seconds? Which tenant is hammering the server? Why did three sessions die at 2 AM?

Observability answers all of these questions. The three pillars are:

- 🪵 **Logs** — what happened, with full context
- 🔭 **Traces** — how long each step took and where time was lost
- 📊 **Metrics** — aggregated numbers you can alert on

By the end you will have:

- 🪵 **Structured JSON logging** with `pino` — correlation IDs, session context, request metadata
- 🔭 **Distributed tracing** with OpenTelemetry — spans for every tool call, Redis lookup, and HTTP request
- 📊 **Prometheus metrics** — request counts, tool call durations, active session gauges
- 🖥️ **Grafana dashboards** — logs, traces, and metrics in one unified view
- 🐳 A `docker-compose` stack that spins up the whole observability stack locally

---

## 🪵 Part 1: Structured Logging with pino

`console.log` is fine during development. In production it produces unstructured strings that are impossible to query, filter, or alert on. `pino` emits newline-delimited JSON — every log line is machine-readable and carries structured fields.

    npm install pino pino-http
    npm install -D pino-pretty

    // src/logger.ts
    import pino from "pino";

    export const logger = pino({
      level: process.env.LOG_LEVEL ?? "info",
      // In production, remove prettyPrint and pipe output to your log collector
      transport:
        process.env.NODE_ENV !== "production"
          ? { target: "pino-pretty", options: { colorize: true } }
          : undefined,
      base: {
        service: "weather-mcp-server",
        version: process.env.npm_package_version ?? "unknown",
      },
    });

Every log line now carries `service` and `version` automatically. In production, pipe the JSON output to Loki, Datadog, or Cloudwatch — they index the fields for free. 🎯

---

## 🔗 Part 2: Correlation IDs — Tying a Request Together

When 50 concurrent requests are interleaved in your logs, you need a way to filter down to everything that happened for one specific request. A **correlation ID** (also called a trace ID or request ID) is a UUID generated at the start of each request and attached to every log line that request produces.

    // src/middleware/correlation.ts
    import type { Request, Response, NextFunction } from "express";
    import { randomUUID } from "crypto";
    import { logger } from "../logger.js";

    declare global {
      namespace Express {
        interface Request {
          correlationId: string;
          log: pino.Logger;
        }
      }
    }

    import type pino from "pino";

    export function correlationMiddleware(
      req: Request,
      res: Response,
      next: NextFunction
    ) {
      // Accept an upstream correlation ID (from a gateway) or generate one
      const correlationId =
        (req.headers["x-correlation-id"] as string) ?? randomUUID();

      req.correlationId = correlationId;

      // Child logger inherits all pino config + adds request-scoped fields
      req.log = logger.child({
        correlationId,
        sessionId: req.headers["mcp-session-id"],
        method: req.method,
        path: req.path,
      });

      res.setHeader("X-Correlation-Id", correlationId);
      next();
    }

Now inside any route handler or tool, use `req.log.info(...)` instead of `logger.info(...)`. Every line automatically carries the correlation ID, session ID, and request metadata — no manual field attachment needed. ✅

A real log line looks like this:

    {
      "level": "info",
      "time": 1748000000000,
      "service": "weather-mcp-server",
      "correlationId": "a3f1c2d4-...",
      "sessionId": "abc-123",
      "method": "POST",
      "path": "/mcp",
      "tool": "get_current_weather",
      "city": "Pune",
      "durationMs": 312,
      "msg": "tool call completed"
    }

Paste that `correlationId` into your log explorer and you instantly see every log line for that request across all instances. 🔍

---

## 🔭 Part 3: Distributed Tracing with OpenTelemetry

Logs tell you *what* happened. Traces tell you *how long each step took* and *where time was lost*. A trace is a tree of **spans** — each span represents one unit of work (an HTTP request, a Redis lookup, a tool call).

    npm install @opentelemetry/sdk-node \
                 @opentelemetry/auto-instrumentations-node \
                 @opentelemetry/exporter-trace-otlp-http \
                 @opentelemetry/resources \
                 @opentelemetry/semantic-conventions

Create the instrumentation file and import it **before anything else** in your entry point:

    // src/instrumentation.ts
    import { NodeSDK } from "@opentelemetry/sdk-node";
    import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
    import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
    import { Resource } from "@opentelemetry/resources";
    import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION }
      from "@opentelemetry/semantic-conventions";

    const sdk = new NodeSDK({
      resource: new Resource({
        [SEMRESATTRS_SERVICE_NAME]: "weather-mcp-server",
        [SEMRESATTRS_SERVICE_VERSION]: process.env.npm_package_version ?? "0.0.0",
      }),
      traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
      }),
      instrumentations: [
        getNodeAutoInstrumentations({
          "@opentelemetry/instrumentation-fs": { enabled: false }, // too noisy
        }),
      ],
    });

    sdk.start();

    process.on("SIGTERM", () => sdk.shutdown());
    process.on("SIGINT", () => sdk.shutdown());

    // src/server.ts — import instrumentation FIRST
    import "./instrumentation.js";
    import express from "express";
    // ... rest of your imports

`getNodeAutoInstrumentations` automatically creates spans for every HTTP request, Redis call, and outbound `fetch`. You get distributed tracing with zero manual instrumentation for all the common operations. ✅

---

## 🔭 Part 4: Manual Spans for Tool Calls

Auto-instrumentation covers infrastructure. For your own business logic — tool calls, session lookups, cache hits — add manual spans:

    // src/tracing.ts
    import { trace, SpanStatusCode } from "@opentelemetry/api";

    const tracer = trace.getTracer("weather-mcp-server", "1.0.0");

    export async function withSpan<T>(
      name: string,
      attributes: Record<string, string | number | boolean>,
      fn: () => Promise<T>
    ): Promise<T> {
      return tracer.startActiveSpan(name, async (span) => {
        span.setAttributes(attributes);
        try {
          const result = await fn();
          span.setStatus({ code: SpanStatusCode.OK });
          return result;
        } catch (err) {
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: err instanceof Error ? err.message : String(err),
          });
          span.recordException(err as Error);
          throw err;
        } finally {
          span.end();
        }
      });
    }

Use it in your tool handler:

    server.tool(
      "get_current_weather",
      "Get current weather for a city",
      GetWeatherInput.shape,
      async (args) => {
        const input = GetWeatherInput.parse(args);

        return withSpan(
          "tool.get_current_weather",
          {
            "tool.name": "get_current_weather",
            "tool.input.city": input.city,
            "tool.input.units": input.units ?? "metric",
            "session.id": currentSessionId, // from closure
          },
          async () => {
            const data = await fetchCurrentWeather(input.city, input.country, input.units);
            return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
          }
        );
      }
    );

In Jaeger or Grafana Tempo you now see a waterfall: the outer HTTP span contains the `tool.get_current_weather` span, which contains the outbound `fetch` span to the weather API. If the tool call is slow, you can see exactly which child span is the bottleneck. 🔍

---

## 📊 Part 5: Prometheus Metrics with prom-client

Logs and traces answer historical questions. Metrics answer *right now* questions: "How many requests per second are we handling? What is the p99 latency of tool calls? How many sessions are active?"

    npm install prom-client

    // src/metrics.ts
    import client from "prom-client";

    // Enable default Node.js metrics (heap, GC, event loop lag, etc.)
    client.collectDefaultMetrics({ prefix: "mcp_" });

    export const httpRequestDuration = new client.Histogram({
      name: "mcp_http_request_duration_seconds",
      help: "Duration of HTTP requests in seconds",
      labelNames: ["method", "path", "status_code"],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
    });

    export const toolCallDuration = new client.Histogram({
      name: "mcp_tool_call_duration_seconds",
      help: "Duration of MCP tool calls in seconds",
      labelNames: ["tool_name", "session_id", "status"],
      buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    });

    export const toolCallTotal = new client.Counter({
      name: "mcp_tool_calls_total",
      help: "Total number of MCP tool calls",
      labelNames: ["tool_name", "status"],
    });

    export const activeSessions = new client.Gauge({
      name: "mcp_active_sessions",
      help: "Number of currently active MCP sessions",
    });

    export const authFailures = new client.Counter({
      name: "mcp_auth_failures_total",
      help: "Total number of authentication failures",
      labelNames: ["reason"],
    });

    export const registry = client.register;

Expose a `/metrics` endpoint that Prometheus scrapes:

    // In server.ts
    import { registry } from "./metrics.js";

    // No auth on /metrics — Prometheus scrapes this internally
    app.get("/metrics", async (_req, res) => {
      res.set("Content-Type", registry.contentType);
      res.send(await registry.metrics());
    });

---

## 📊 Part 6: Instrumenting Tool Calls with Metrics

Wire the metrics into your tool execution path:

    // src/utils/instrument-tool.ts
    import { toolCallDuration, toolCallTotal, activeSessions } from "../metrics.js";
    import { withSpan } from "../tracing.js";
    import type { Request } from "express";

    export async function instrumentedToolCall<T>(
      toolName: string,
      sessionId: string,
      fn: () => Promise<T>
    ): Promise<T> {
      const end = toolCallDuration.startTimer({ tool_name: toolName, session_id: sessionId });

      return withSpan(`tool.${toolName}`, { "tool.name": toolName, "session.id": sessionId }, async () => {
        try {
          const result = await fn();
          end({ status: "success" });
          toolCallTotal.inc({ tool_name: toolName, status: "success" });
          return result;
        } catch (err) {
          end({ status: "error" });
          toolCallTotal.inc({ tool_name: toolName, status: "error" });
          throw err;
        }
      });
    }

Now wrap every tool handler body:

    server.tool(
      "get_current_weather",
      "Get current weather for a city",
      GetWeatherInput.shape,
      async (args) => {
        const input = GetWeatherInput.parse(args);

        return instrumentedToolCall("get_current_weather", currentSessionId, async () => {
          const data = await fetchCurrentWeather(input.city, input.country, input.units);
          return { content: [{ type: "text", text: JSON.stringify(data, null, 2) }] };
        });
      }
    );

One wrapper gives you histogram timings, counters, and a trace span — all three pillars covered in a single call. 🎯

---

## 📈 Part 7: Session Gauge — Track Active Sessions

Update the active session gauge whenever sessions are created or destroyed:

    // In handleMcpRequest — when a new session is created:
    activeSessions.inc();

    // In the DELETE /mcp handler — when a session is cleaned up:
    activeSessions.dec();

    // In InProcessSessionStore.evictExpired():
    for (const [id, state] of this.sessions) {
      if (now - state.lastActiveAt.getTime() > this.TTL_MS) {
        this.sessions.delete(id);
        activeSessions.dec();  // track evictions too
      }
    }

A Prometheus alert on this gauge tells you when sessions are leaking (gauge grows without bound) or when a deployment drained all sessions (gauge drops to zero mid-traffic). 🚨

---

## 🐳 Part 8: Local Observability Stack with docker-compose

Spin up the full observability stack locally — Jaeger for traces, Prometheus for metrics, Loki for logs, and Grafana as the unified UI:

    version: "3.9"

    services:
      mcp-server:
        build: .
        ports:
          - "3000:3000"
        environment:
          - NODE_ENV=production
          - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318/v1/traces
          - VALID_TOKENS=dev-token
          - WEATHER_API_KEY=${WEATHER_API_KEY}
          - REDIS_URL=redis://redis:6379
        depends_on:
          - jaeger
          - prometheus
          - redis

      redis:
        image: redis:7-alpine
        ports:
          - "6379:6379"

      jaeger:
        image: jaegertracing/all-in-one:1.57
        ports:
          - "16686:16686"   # Jaeger UI
          - "4318:4318"     # OTLP HTTP receiver

      prometheus:
        image: prom/prometheus:v2.51.0
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
        ports:
          - "9090:9090"

      grafana:
        image: grafana/grafana:10.4.0
        ports:
          - "3001:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin
        volumes:
          - ./grafana/provisioning:/etc/grafana/provisioning

`prometheus.yml` scrape config:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: weather-mcp-server
        static_configs:
          - targets: ["mcp-server:3000"]
        metrics_path: /metrics

Start everything:

    docker compose up

Then open:

- Jaeger UI at `http://localhost:16686` — search for `weather-mcp-server` traces
- Prometheus at `http://localhost:9090` — query `mcp_tool_call_duration_seconds`
- Grafana at `http://localhost:3001` — build unified dashboards

---

## 🖥️ Part 9: Useful Prometheus Queries

Once Prometheus is scraping your server, these PromQL queries give you the most useful signals:

Request throughput (requests per second, 5-minute window):

    rate(mcp_http_request_duration_seconds_count[5m])

p99 tool call latency:

    histogram_quantile(0.99, rate(mcp_tool_call_duration_seconds_bucket[5m]))

Tool call error rate:

    rate(mcp_tool_calls_total{status="error"}[5m])
    /
    rate(mcp_tool_calls_total[5m])

Active sessions right now:

    mcp_active_sessions

Auth failure rate (alert if this spikes — it could mean a token is compromised):

    rate(mcp_auth_failures_total[5m])

Node.js event loop lag (alert if consistently above 100ms — your server is CPU-starved):

    mcp_nodejs_eventloop_lag_seconds

Paste these into Grafana panels and you have a production-grade dashboard in minutes. 📊

---

## 🚨 Part 10: Alerting Rules

Add these alert rules to `prometheus.yml` (or your Alertmanager config) to get paged before users notice:

    groups:
      - name: mcp-server
        rules:
          - alert: HighToolCallErrorRate
            expr: |
              rate(mcp_tool_calls_total{status="error"}[5m])
              / rate(mcp_tool_calls_total[5m]) > 0.05
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "Tool call error rate above 5% for 2 minutes"

          - alert: HighP99Latency
            expr: |
              histogram_quantile(0.99,
                rate(mcp_tool_call_duration_seconds_bucket[5m])
              ) > 3
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "p99 tool call latency above 3s"

          - alert: SessionLeak
            expr: mcp_active_sessions > 500
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Active sessions above 500 for 10 minutes — possible leak"

          - alert: AuthSpiking
            expr: rate(mcp_auth_failures_total[5m]) > 10
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Auth failures spiking — possible credential attack"

---

## 💡 Key Takeaways

**Log structured JSON, not strings.** Strings are for humans. JSON is for both humans and machines — your log aggregator can index, filter, and alert on individual fields.

**Correlation IDs are non-negotiable in distributed systems.** Without them, debugging a multi-instance, multi-session server is guesswork.

**Auto-instrumentation covers 80% of the work.** `getNodeAutoInstrumentations` automatically traces all HTTP, Redis, and DNS operations. Only add manual spans for your own business logic.

**Histograms beat averages.** An average latency of 200ms hides the fact that 1% of requests take 5 seconds. Always look at p95 and p99.

**Alerts should be actionable.** Every alert should have a clear remediation — "restart the service", "scale up", "rotate tokens". Alerts with no action are noise that trains your team to ignore pages.

---

## 🎯 Summary

In Part 7 you wired full observability into the MCP server:

- 🪵 **pino** — structured JSON logs with correlation IDs and session context on every line
- 🔗 **Correlation middleware** — one UUID ties all logs for a request together across instances
- 🔭 **OpenTelemetry** — auto-instrumented spans for HTTP, Redis, and DNS plus manual spans for tool calls
- 📊 **prom-client** — request duration histograms, tool call counters, active session gauges, auth failure counters
- 🐳 **docker-compose** — one command spins up Jaeger, Prometheus, and Grafana locally
- 🚨 **Alert rules** — error rate, p99 latency, session leak, and auth spike detectors

In **Part 8** we will build a **TypeScript MCP client SDK wrapper** — a clean, reusable abstraction you can publish as an npm package so other teams can connect to your MCP server without touching raw JSON-RPC. 📦

---

## 📚 Further Reading

- 🪵 [pino documentation](https://getpino.io)
- 🔭 [OpenTelemetry Node.js SDK](https://opentelemetry.io/docs/languages/js/getting-started/nodejs/)
- 📊 [prom-client on GitHub](https://github.com/siimon/prom-client)
- 🖥️ [Grafana dashboards](https://grafana.com/docs/grafana/latest/dashboards/)
- 🔭 [Jaeger distributed tracing](https://www.jaegertracing.io/)
- 🏗️ [Part 6: Multi-Tenant MCP — Session Management and Horizontal Scaling](https://blog.techtush.in/multi-tenant-mcp-session-management-state-isolation-and-horizontal-scaling)
