RAG Evaluation Framework for MCP Agents: Measuring What Actually Matters

Stop guessing whether your RAG agent is getting better or worse. Build a systematic eval framework that scores faithfulness, relevance, and precision, then gate your CI pipeline on it.


Hi, I'm Tushar Patil. I currently work as a Frontend Developer (Angular) and also have experience with .NET Core and Framework.


This is Part 12 of the AI Engineering with TypeScript series.

Prerequisites: Part 3 (AI Agent) · Part 10 (MCP + RAG) · Part 11 (Multi-Tenant RAG)

Stack: Node.js 20+ · TypeScript 5.x · Vitest · @anthropic-ai/sdk · pgvector · Zod


What we'll cover

You have a working RAG agent. It retrieves documents and generates answers. But how do you know if it is actually good? And how do you know if it gets worse after you:

  • Swap text-embedding-3-small for a different model
  • Change your chunk size from 512 to 256 words
  • Add 500 new documents to the knowledge base
  • Upgrade to a new version of the MCP SDK

Without a measurement framework you are flying blind. You push a change, the RAG pipeline behaves differently, and you find out three days later when a user reports wrong answers.

Evals fix this. An eval is a structured test that runs your RAG pipeline on a fixed set of questions with known correct answers, scores the outputs on multiple dimensions, stores the results over time, and fails your CI pipeline when quality regresses.

By the end you will have:

  • A golden eval dataset: question/answer pairs with ground-truth citations
  • Three scoring dimensions (Faithfulness, Answer Relevance, and Context Precision), each computed programmatically using Claude as a judge
  • A results store that tracks scores over time so you can see trends
  • A CI gate that blocks merges when scores drop below a threshold
  • An eval harness that runs the full RAG pipeline end-to-end and compares outputs
  • A score reporter that prints a clear summary table after every eval run

🧠 Part 1: The Three Dimensions of RAG Quality

RAG quality breaks down into three orthogonal dimensions. Getting all three right is what separates a trustworthy system from a hallucination machine. 🎯

Faithfulness: does the generated answer contain only information that is supported by the retrieved context? A high faithfulness score means the model is not hallucinating beyond what was retrieved. This is the most important dimension for production systems.

Answer Relevance: is the generated answer actually responsive to the question asked? A model can be faithful (not hallucinating) and still give an irrelevant answer, for example when it retrieves the right documents but responds to the wrong part of the question.

Context Precision: of the chunks retrieved by the search tool, what fraction were actually useful for answering the question? Low context precision means your retrieval is noisy: it returns topically adjacent chunks that crowd out, or entirely replace, the specific passage that contains the answer. This usually points to chunking or embedding problems.

Each score is a float between 0.0 and 1.0. Production thresholds that work well in practice:

Faithfulness       >= 0.85  (non-negotiable: below this, users get wrong information)
Answer Relevance   >= 0.80  (below this, answers feel off-topic)
Context Precision  >= 0.70  (below this, retrieval is too noisy; fix chunking)
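
To make the Context Precision arithmetic concrete, here is a minimal sketch that computes the fraction from per-chunk relevance labels and compares it with the 0.70 threshold. The `ChunkLabel` type and `contextPrecision` helper are hypothetical names used only for illustration; in the framework itself the judge model produces this score.

```typescript
// Hypothetical per-chunk label: was this retrieved chunk useful for the answer?
interface ChunkLabel {
  source: string;
  relevant: boolean;
}

// Context precision = relevant chunks / total retrieved chunks.
// Returns 0 when nothing was retrieved, to avoid dividing by zero.
function contextPrecision(labels: ChunkLabel[]): number {
  if (labels.length === 0) return 0;
  const relevant = labels.filter((l) => l.relevant).length;
  return relevant / labels.length;
}

const labels: ChunkLabel[] = [
  { source: "mcp-guide.md", relevant: true },
  { source: "mcp-guide.md", relevant: true },
  { source: "runbook.md", relevant: false },
  { source: "rag-guide.md", relevant: false },
];

const precision = contextPrecision(labels);
console.log(precision.toFixed(2), precision >= 0.7 ? "pass" : "fail"); // prints "0.50 fail"
```

Two useful chunks out of four gives 0.50, which falls below the 0.70 threshold and would fail the case.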

Part 2: Building the Golden Dataset

A golden dataset is a JSON file of question/answer pairs where the answer is manually verified against your actual documents. This is the ground truth that every eval run is measured against.

// src/evals/dataset.ts

export interface EvalCase {
  id: string;
  question: string;
  groundTruthAnswer: string;
  expectedSources: string[];       // document names that should be retrieved
  tags: string[];                  // categories for grouping results
}

export interface EvalDataset {
  version: string;
  tenantId: string;
  cases: EvalCase[];
}

A sample eval-dataset.json for our weather MCP knowledge base:

{
  "version": "1.0.0",
  "tenantId": "acme",
  "cases": [
    {
      "id": "session-ttl-001",
      "question": "What is the recommended TTL for MCP sessions and how should it be implemented?",
      "groundTruthAnswer": "The recommended TTL is 30 minutes, implemented as a sliding expiry that resets on every active request using Redis set with EX 1800.",
      "expectedSources": ["mcp-guide.md"],
      "tags": ["sessions", "redis"]
    },
    {
      "id": "deploy-fly-001",
      "question": "How do I deploy the MCP server to Fly.io?",
      "groundTruthAnswer": "Install flyctl, run flyctl launch to create fly.toml, set secrets with flyctl secrets set, then deploy with flyctl deploy.",
      "expectedSources": ["runbook.md"],
      "tags": ["deployment", "flyio"]
    },
    {
      "id": "rls-policy-001",
      "question": "How does row-level security prevent cross-tenant data leakage in pgvector?",
      "groundTruthAnswer": "PostgreSQL RLS policies use current_setting to match tenant_id on every row. FORCE ROW LEVEL SECURITY ensures even the table owner cannot bypass the policy.",
      "expectedSources": ["security-guide.md"],
      "tags": ["security", "multitenancy"]
    },
    {
      "id": "chunk-size-001",
      "question": "What chunk size is recommended for RAG ingestion?",
      "groundTruthAnswer": "512 words with a 64-word overlap is a good default. Too small loses context; too large dilutes embedding precision.",
      "expectedSources": ["rag-guide.md"],
      "tags": ["rag", "ingestion"]
    }
  ]
}

Invest time in building a high-quality golden dataset. Twenty well-chosen questions with verified answers are worth more than two hundred auto-generated ones. Cover your most important use cases, known edge cases, and questions that tripped up the agent during manual testing.
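
Hand-edited datasets pick up small mistakes easily (duplicate ids, cases with no expected sources), so it can help to sanity-check the file before a run. The sketch below is one way to do that. `validateDataset` is a hypothetical helper, not part of the series code, and the rules it enforces are assumptions you may want to adjust.

```typescript
// Same shapes as EvalCase / EvalDataset in src/evals/dataset.ts.
interface EvalCase {
  id: string;
  question: string;
  groundTruthAnswer: string;
  expectedSources: string[];
  tags: string[];
}

interface EvalDataset {
  version: string;
  tenantId: string;
  cases: EvalCase[];
}

// Hypothetical helper: returns a list of human-readable problems.
// An empty list means the dataset passed these basic checks.
function validateDataset(dataset: EvalDataset): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();

  for (const c of dataset.cases) {
    if (seenIds.has(c.id)) problems.push(`duplicate id: ${c.id}`);
    seenIds.add(c.id);

    if (c.question.trim() === "") problems.push(`${c.id}: empty question`);
    if (c.groundTruthAnswer.trim() === "") problems.push(`${c.id}: empty groundTruthAnswer`);
    if (c.expectedSources.length === 0) problems.push(`${c.id}: no expectedSources`);
  }

  return problems;
}

// Illustrative draft with two deliberate mistakes.
const draft: EvalDataset = {
  version: "1.0.0",
  tenantId: "acme",
  cases: [
    {
      id: "chunk-size-001",
      question: "What chunk size is recommended for RAG ingestion?",
      groundTruthAnswer: "512 words with a 64-word overlap is a good default.",
      expectedSources: ["rag-guide.md"],
      tags: ["rag"],
    },
    {
      id: "chunk-size-001", // duplicate id
      question: "How large should the chunk overlap be?",
      groundTruthAnswer: "64 words.",
      expectedSources: [], // missing sources
      tags: ["rag"],
    },
  ],
};

for (const problem of validateDataset(draft)) console.log(problem);
```

Running a check like this at the top of the eval entry point means a malformed dataset fails fast instead of producing misleading scores.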


Part 3: Claude as a Judge

We use Claude to score each dimension. This is called LLM-as-judge evaluation and is now standard practice in AI engineering. The key is giving Claude a precise scoring rubric as a structured prompt and requiring it to respond with JSON.

// src/evals/judge.ts
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

const anthropic = new Anthropic();

const ScoreSchema = z.object({
  score: z.number().min(0).max(1),
  reasoning: z.string(),
});

export type ScoreResult = z.infer<typeof ScoreSchema>;

async function judgeWithPrompt(prompt: string): Promise<ScoreResult> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
    system:
      "You are a precise evaluator for RAG systems. Always respond with valid JSON only. No markdown, no explanation outside the JSON.",
  });

  const text = response.content
    .filter((b) => b.type === "text")
    .map((b) => (b as Anthropic.TextBlock).text)
    .join("");

  return ScoreSchema.parse(JSON.parse(text));
}

export async function scoreFaithfulness(
  question: string,
  retrievedContext: string,
  generatedAnswer: string
): Promise<ScoreResult> {
  return judgeWithPrompt(`
You are evaluating whether a generated answer is faithful to its retrieved context.
Faithful means: every factual claim in the answer is directly supported by the context.
An answer that adds information not present in the context is unfaithful, even if that information is true.

Question: ${question}

Retrieved context:
${retrievedContext}

Generated answer:
${generatedAnswer}

Score from 0.0 to 1.0 where:
1.0 = every claim is directly supported by the context
0.5 = some claims are supported, some are not
0.0 = the answer is entirely unsupported by the context

Respond with JSON only:
{"score": <number>, "reasoning": "<one sentence explanation>"}
`);
}

export async function scoreAnswerRelevance(
  question: string,
  generatedAnswer: string
): Promise<ScoreResult> {
  return judgeWithPrompt(`
You are evaluating whether a generated answer is relevant to the question asked.

Question: ${question}

Generated answer:
${generatedAnswer}

Score from 0.0 to 1.0 where:
1.0 = the answer directly and completely addresses the question
0.5 = the answer is partially relevant but misses key aspects
0.0 = the answer does not address the question at all

Respond with JSON only:
{"score": <number>, "reasoning": "<one sentence explanation>"}
`);
}

export async function scoreContextPrecision(
  question: string,
  retrievedChunks: RetrievedChunk[],
  groundTruthAnswer: string
): Promise<ScoreResult> {
  const chunksText = retrievedChunks
    .map((c, i) => `[Chunk ${i + 1}] (source: ${c.source})\n${c.content}`)
    .join("\n\n---\n\n");

  return judgeWithPrompt(`
You are evaluating the precision of retrieved context chunks.
Precision = what fraction of the retrieved chunks were actually useful for answering the question.

Question: ${question}
Ground truth answer: ${groundTruthAnswer}

Retrieved chunks:
${chunksText}

Count how many chunks contain information that is directly relevant to answering the question.
Score = relevant_chunks / total_chunks

Score from 0.0 to 1.0.

Respond with JSON only:
{"score": <number>, "reasoning": "<one sentence explanation>"}
`);
}

export interface RetrievedChunk {
  source: string;
  content: string;
  score: number;
}
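
Even with a "JSON only" instruction, a judge reply can occasionally arrive wrapped in Markdown code fences or with a stray sentence around it, which makes a bare `JSON.parse` throw. The sketch below shows one defensive way to handle that. `parseJudgeReply` is a hypothetical helper, and it checks the shape by hand so the example stays dependency-free; in the real module you would keep the Zod schema for that part.

```typescript
interface ScoreResult {
  score: number;
  reasoning: string;
}

// Hypothetical helper: tolerate replies where the JSON object is wrapped in
// Markdown code fences or surrounded by extra text.
function parseJudgeReply(text: string): ScoreResult {
  // Take everything from the first "{" to the last "}".
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in judge reply");
  }

  const parsed = JSON.parse(text.slice(start, end + 1)) as Record<string, unknown>;
  const score = parsed["score"];
  const reasoning = parsed["reasoning"];

  if (typeof score !== "number" || score < 0 || score > 1) {
    throw new Error("Judge reply has a missing or out-of-range score");
  }
  if (typeof reasoning !== "string") {
    throw new Error("Judge reply has no reasoning string");
  }
  return { score, reasoning };
}

// Build the fence marker programmatically to keep this example readable.
const fence = "`".repeat(3);
const fencedReply =
  fence + "json\n" + '{"score": 0.9, "reasoning": "All claims are supported."}' + "\n" + fence;

console.log(parseJudgeReply(fencedReply).score); // prints 0.9
```

The first-brace-to-last-brace slice is a deliberate simplification: it copes with braces inside the reasoning string, but not with extra braces in text that follows the JSON object.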

Part 4: The Eval Harness

The harness runs the RAG pipeline end-to-end for each eval case and collects the raw outputs needed for scoring:

// src/evals/harness.ts
import Anthropic from "@anthropic-ai/sdk";
import { createMcpClient } from "../client.js";
import type { EvalCase } from "./dataset.js";
import type { RetrievedChunk } from "./judge.js";

const anthropic = new Anthropic();

export interface HarnessResult {
  evalId: string;
  question: string;
  generatedAnswer: string;
  retrievedChunks: RetrievedChunk[];
  toolCallCount: number;
  latencyMs: number;
}

export async function runEvalCase(
  evalCase: EvalCase,
  mcpServerArgs: string[]
): Promise<HarnessResult> {
  const client = await createMcpClient("node", mcpServerArgs);
  const { tools } = await client.listTools();

  const anthropicTools: Anthropic.Tool[] = tools.map((t) => ({
    name: t.name,
    description: t.description ?? "",
    input_schema: t.inputSchema as Anthropic.Tool["input_schema"],
  }));

  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: evalCase.question },
  ];

  const retrievedChunks: RetrievedChunk[] = [];
  let toolCallCount = 0;
  const start = Date.now();

  // Run the agent loop (same pattern as Part 3)
  while (true) {
    const response = await anthropic.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 2048,
      tools: anthropicTools,
      messages,
      system:
        "You are a knowledge assistant. Always use search_knowledge_base before answering factual questions.",
    });

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      const answer = response.content
        .filter((b): b is Anthropic.TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("\n");

      await client.close();

      return {
        evalId: evalCase.id,
        question: evalCase.question,
        generatedAnswer: answer,
        retrievedChunks,
        toolCallCount,
        latencyMs: Date.now() - start,
      };
    }

    if (response.stop_reason === "tool_use") {
      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const block of response.content) {
        if (block.type !== "tool_use") continue;

        toolCallCount++;
        const result = await client.callTool({
          name: block.name,
          arguments: block.input as Record<string, unknown>,
        });

        const resultText = result.content
          .filter((c) => c.type === "text")
          .map((c) => (c as { type: "text"; text: string }).text)
          .join("\n");

        // Parse retrieved chunks from search_knowledge_base results
        if (block.name === "search_knowledge_base") {
          const chunkMatches = [...resultText.matchAll(
            /\[(\d+)\] Source: ([^\s(]+) \(score: ([0-9.]+)\)\n([\s\S]*?)(?=\n---|\n\[|$)/g
          )];

          for (const match of chunkMatches) {
            retrievedChunks.push({
              source: match[2],
              score: parseFloat(match[3]),
              content: match[4].trim(),
            });
          }
        }

        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: resultText,
        });
      }

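      messages.push({ role: "user", content: toolResults });
    } else {
      // Any other stop reason (such as "max_tokens") would otherwise loop forever
      await client.close();
      throw new Error(
        `Unexpected stop_reason "${response.stop_reason}" for eval case ${evalCase.id}`
      );
    }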
  }
}
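
The chunk-parsing step is the most fragile part of the harness, so it is worth pulling out and testing on its own. The sketch below isolates it as a hypothetical `parseChunks` function. The tool output format shown in the comment is an assumption inferred from the regular expression in the harness, not a documented contract of `search_knowledge_base`. Note that JavaScript regular expressions have no `\Z` anchor, so the end of input is matched with `$`.

```typescript
interface RetrievedChunk {
  source: string;
  content: string;
  score: number;
}

// Assumed tool output format (inferred from the harness regex):
//   [1] Source: mcp-guide.md (score: 0.912)
//   <chunk text>
//   ---
function parseChunks(resultText: string): RetrievedChunk[] {
  // Created per call because a /g regex keeps state in lastIndex.
  const pattern =
    /\[(\d+)\] Source: ([^\s(]+) \(score: ([0-9.]+)\)\n([\s\S]*?)(?=\n---|\n\[|$)/g;

  const chunks: RetrievedChunk[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(resultText)) !== null) {
    chunks.push({
      source: match[2],
      score: parseFloat(match[3]),
      content: match[4].trim(),
    });
  }
  return chunks;
}

const sampleOutput = [
  "[1] Source: mcp-guide.md (score: 0.912)",
  "Sessions use a sliding 30-minute TTL.",
  "---",
  "[2] Source: runbook.md (score: 0.455)",
  "Deploy with flyctl deploy.",
].join("\n");

console.log(parseChunks(sampleOutput));
```

One caveat with this pattern: a chunk whose own text contains a line starting with `[` is cut short at that point. If your documents contain such lines, returning structured JSON from the tool is likely a more robust option than regex parsing.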

Part 5: The Full Eval Runner

Now wire the harness and judge together into a runner that processes the entire dataset and produces structured results:

// src/evals/runner.ts
import fs from "fs";
import path from "path";
import type { EvalDataset, EvalCase } from "./dataset.js";
import { runEvalCase, type HarnessResult } from "./harness.js";
import {
  scoreFaithfulness,
  scoreAnswerRelevance,
  scoreContextPrecision,
} from "./judge.js";

export interface EvalResult {
  evalId: string;
  question: string;
  faithfulness: number;
  answerRelevance: number;
  contextPrecision: number;
  faithfulnessReasoning: string;
  answerRelevanceReasoning: string;
  contextPrecisionReasoning: string;
  latencyMs: number;
  toolCallCount: number;
  passed: boolean;
}

export interface EvalRunSummary {
  runId: string;
  timestamp: string;
  totalCases: number;
  passedCases: number;
  meanFaithfulness: number;
  meanAnswerRelevance: number;
  meanContextPrecision: number;
  results: EvalResult[];
}

const THRESHOLDS = {
  faithfulness: 0.85,
  answerRelevance: 0.80,
  contextPrecision: 0.70,
};

export async function runEvals(
  dataset: EvalDataset,
  mcpServerArgs: string[],
  concurrency = 2
): Promise<EvalRunSummary> {
  const results: EvalResult[] = [];
  const cases = dataset.cases;

  // Process in batches to avoid hammering the API
  for (let i = 0; i < cases.length; i += concurrency) {
    const batch = cases.slice(i, i + concurrency);

    const batchResults = await Promise.all(
      batch.map(async (evalCase) => {
        console.log(`  Running: ${evalCase.id}...`);
        const harness = await runEvalCase(evalCase, mcpServerArgs);
        return scoreHarnessResult(harness, evalCase);
      })
    );

    results.push(...batchResults);
    console.log(`  Completed ${Math.min(i + concurrency, cases.length)}/${cases.length}`);
  }

  // Returns 0 for an empty list instead of NaN
  const mean = (arr: number[]) =>
    arr.length === 0 ? 0 : arr.reduce((a, b) => a + b, 0) / arr.length;

  return {
    runId: `run-${Date.now()}`,
    timestamp: new Date().toISOString(),
    totalCases: results.length,
    passedCases: results.filter((r) => r.passed).length,
    meanFaithfulness: mean(results.map((r) => r.faithfulness)),
    meanAnswerRelevance: mean(results.map((r) => r.answerRelevance)),
    meanContextPrecision: mean(results.map((r) => r.contextPrecision)),
    results,
  };
}

async function scoreHarnessResult(
  harness: HarnessResult,
  evalCase: EvalCase
): Promise<EvalResult> {
  const contextText = harness.retrievedChunks
    .map((c) => `[${c.source}]\n${c.content}`)
    .join("\n\n");

  const [faith, relevance, precision] = await Promise.all([
    scoreFaithfulness(harness.question, contextText, harness.generatedAnswer),
    scoreAnswerRelevance(harness.question, harness.generatedAnswer),
    scoreContextPrecision(harness.question, harness.retrievedChunks, evalCase.groundTruthAnswer),
  ]);

  const passed =
    faith.score >= THRESHOLDS.faithfulness &&
    relevance.score >= THRESHOLDS.answerRelevance &&
    precision.score >= THRESHOLDS.contextPrecision;

  return {
    evalId: harness.evalId,
    question: harness.question,
    faithfulness: faith.score,
    answerRelevance: relevance.score,
    contextPrecision: precision.score,
    faithfulnessReasoning: faith.reasoning,
    answerRelevanceReasoning: relevance.reasoning,
    contextPrecisionReasoning: precision.reasoning,
    latencyMs: harness.latencyMs,
    toolCallCount: harness.toolCallCount,
    passed,
  };
}
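
The `tags` field on each eval case exists for grouping results, but the runner above only reports overall means. As a sketch of how per-tag grouping might look, the example below averages faithfulness by tag. `TaggedScore` is a hypothetical shape that assumes you have joined each result with the tags of its eval case (for example by `evalId`), and the ids and scores are illustrative only.

```typescript
// Hypothetical shape: an eval score joined with the tags from its EvalCase.
interface TaggedScore {
  evalId: string;
  tags: string[];
  faithfulness: number;
}

// Mean faithfulness per tag. A case with several tags counts towards each one.
function meanFaithfulnessByTag(scores: TaggedScore[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();

  for (const s of scores) {
    for (const tag of s.tags) {
      const entry = totals.get(tag) ?? { sum: 0, count: 0 };
      entry.sum += s.faithfulness;
      entry.count += 1;
      totals.set(tag, entry);
    }
  }

  const means = new Map<string, number>();
  totals.forEach((entry, tag) => means.set(tag, entry.sum / entry.count));
  return means;
}

const byTag = meanFaithfulnessByTag([
  { evalId: "session-ttl-001", tags: ["sessions", "redis"], faithfulness: 1.0 },
  { evalId: "session-ttl-002", tags: ["sessions"], faithfulness: 0.5 },
  { evalId: "deploy-fly-001", tags: ["deployment"], faithfulness: 0.75 },
]);

byTag.forEach((mean, tag) => console.log(`${tag}: ${mean.toFixed(2)}`));
```

A per-tag view like this makes it easier to see when a regression is confined to one topic area rather than spread across the whole dataset.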

Part 6: The Score Reporter

A clean terminal report makes eval results easy to interpret at a glance:

// src/evals/reporter.ts
import type { EvalRunSummary } from "./runner.js";

const GREEN = "\x1b[32m";
const RED = "\x1b[31m";
const YELLOW = "\x1b[33m";
const RESET = "\x1b[0m";
const BOLD = "\x1b[1m";

function colour(score: number, threshold: number): string {
  if (score >= threshold) return GREEN;
  if (score >= threshold * 0.9) return YELLOW;
  return RED;
}

function fmt(score: number, threshold: number): string {
  return `${colour(score, threshold)}${score.toFixed(3)}${RESET}`;
}

export function printReport(summary: EvalRunSummary): void {
  const passRate = ((summary.passedCases / summary.totalCases) * 100).toFixed(1);

  console.log("\n" + "=".repeat(72));
  console.log(`${BOLD}RAG Eval Run: ${summary.runId}${RESET}`);
  console.log(`Timestamp  : ${summary.timestamp}`);
  console.log(`Cases      : ${summary.passedCases}/${summary.totalCases} passed (${passRate}%)`);
  console.log("─".repeat(72));
  console.log(
    `Mean Faithfulness    : ${fmt(summary.meanFaithfulness, 0.85)}  (threshold: 0.85)`
  );
  console.log(
    `Mean Answer Relevance: ${fmt(summary.meanAnswerRelevance, 0.80)}  (threshold: 0.80)`
  );
  console.log(
    `Mean Context Precision: ${fmt(summary.meanContextPrecision, 0.70)}  (threshold: 0.70)`
  );
  console.log("─".repeat(72));

  for (const r of summary.results) {
    const status = r.passed ? `${GREEN}PASS${RESET}` : `${RED}FAIL${RESET}`;
    console.log(`\n${status} [${r.evalId}] ${r.question.slice(0, 60)}...`);
    console.log(
      `       Faithfulness: ${fmt(r.faithfulness, 0.85)}  Relevance: ${fmt(r.answerRelevance, 0.80)}  Precision: ${fmt(r.contextPrecision, 0.70)}  (${r.latencyMs}ms)`
    );
    if (!r.passed) {
      if (r.faithfulness < 0.85)
        console.log(`       ${RED}↳ Faithfulness: ${r.faithfulnessReasoning}${RESET}`);
      if (r.answerRelevance < 0.80)
        console.log(`       ${RED}↳ Relevance: ${r.answerRelevanceReasoning}${RESET}`);
      if (r.contextPrecision < 0.70)
        console.log(`       ${RED}↳ Precision: ${r.contextPrecisionReasoning}${RESET}`);
    }
  }

  console.log("\n" + "=".repeat(72));
}

export function exitCode(summary: EvalRunSummary): number {
  return summary.passedCases === summary.totalCases ? 0 : 1;
}

Part 7: Storing Results for Trend Analysis

Save every run to a JSON file (or a database) so you can track scores over time and detect gradual regressions:

// src/evals/store.ts
import fs from "fs";
import path from "path";
import type { EvalRunSummary } from "./runner.js";

const STORE_DIR = ".eval-results";

export function saveRun(summary: EvalRunSummary): string {
  fs.mkdirSync(STORE_DIR, { recursive: true });
  const filename = `${summary.runId}.json`;
  const filepath = path.join(STORE_DIR, filename);
  fs.writeFileSync(filepath, JSON.stringify(summary, null, 2));
  return filepath;
}

export function loadHistory(): EvalRunSummary[] {
  if (!fs.existsSync(STORE_DIR)) return [];

  return fs
    .readdirSync(STORE_DIR)
    .filter((f) => f.endsWith(".json"))
    .map((f) => JSON.parse(fs.readFileSync(path.join(STORE_DIR, f), "utf-8")))
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}

export function detectRegression(
  current: EvalRunSummary,
  threshold = 0.05
): string[] {
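  // Exclude the current run so this works whether or not it has been saved yet
  const history = loadHistory().filter((r) => r.runId !== current.runId);
  if (history.length === 0) return [];

  // Compare against the most recent previous run
  const previous = history[history.length - 1];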
  const regressions: string[] = [];

  const dims = [
    { name: "faithfulness", cur: current.meanFaithfulness, prev: previous.meanFaithfulness },
    { name: "answerRelevance", cur: current.meanAnswerRelevance, prev: previous.meanAnswerRelevance },
    { name: "contextPrecision", cur: current.meanContextPrecision, prev: previous.meanContextPrecision },
  ];

  for (const d of dims) {
    const delta = d.prev - d.cur;
    if (delta > threshold) {
      regressions.push(
        `${d.name} dropped ${delta.toFixed(3)} (${d.prev.toFixed(3)} → ${d.cur.toFixed(3)})`
      );
    }
  }

  return regressions;
}
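
Comparing only against the previous run can miss a slow slide, where every run is slightly worse than the last but no single step crosses the threshold. One possible complement, sketched below, is to compare the current run with the average of the last few runs. `dropFromBaseline` and the `RunMeans` shape are hypothetical, and the window size of 5 is an arbitrary assumption.

```typescript
// Only the fields this sketch needs from EvalRunSummary.
interface RunMeans {
  runId: string;
  meanFaithfulness: number;
}

// Hypothetical helper: how far the current run sits below the average of the
// last `window` earlier runs. Positive values mean the score has dropped.
function dropFromBaseline(earlier: RunMeans[], current: RunMeans, window = 5): number {
  const recent = earlier.slice(-window);
  if (recent.length === 0) return 0;
  const baseline =
    recent.reduce((sum, r) => sum + r.meanFaithfulness, 0) / recent.length;
  return baseline - current.meanFaithfulness;
}

// Each run is only 0.03 below the one before it, so a run-to-run check with a
// 0.05 threshold never fires, but the drop against the baseline exceeds it.
const earlierRuns: RunMeans[] = [
  { runId: "run-1", meanFaithfulness: 0.95 },
  { runId: "run-2", meanFaithfulness: 0.92 },
  { runId: "run-3", meanFaithfulness: 0.89 },
];
const latest: RunMeans = { runId: "run-4", meanFaithfulness: 0.86 };

const drop = dropFromBaseline(earlierRuns, latest);
console.log(drop.toFixed(3), drop > 0.05 ? "regression" : "ok");
```

In this example the baseline is about 0.92, so the latest run sits roughly 0.06 below it and is flagged even though it is only 0.03 below its immediate predecessor.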

Part 8: The Eval Entry Point

// src/evals/run.ts
import "dotenv/config";
import { readFileSync } from "fs";
import type { EvalDataset } from "./dataset.js";
import { runEvals } from "./runner.js";
import { printReport, exitCode } from "./reporter.js";
import { saveRun, detectRegression } from "./store.js";

const dataset: EvalDataset = JSON.parse(
  readFileSync("src/evals/eval-dataset.json", "utf-8")
);

const mcpServerArgs = ["dist/server.js"];

console.log(`\nRunning ${dataset.cases.length} eval cases...\n`);

const summary = await runEvals(dataset, mcpServerArgs, 2);

printReport(summary);

const savedPath = saveRun(summary);
console.log(`\nResults saved to: ${savedPath}`);

const regressions = detectRegression(summary);
if (regressions.length > 0) {
  console.error("\nREGRESSIONS DETECTED:");
  regressions.forEach((r) => console.error(`   ↳ ${r}`));
  process.exit(1);
}

process.exit(exitCode(summary));

Run it:

npx tsx src/evals/run.ts

Running 4 eval cases...

  Running: session-ttl-001...
  Running: deploy-fly-001...
  Completed 2/4
  Running: rls-policy-001...
  Running: chunk-size-001...
  Completed 4/4

========================================================================
RAG Eval Run: run-1748012345678
Timestamp  : 2026-05-16T10:30:00.000Z
Cases      : 4/4 passed (100.0%)
────────────────────────────────────────────────────────────────────────
Mean Faithfulness    : 0.921  (threshold: 0.85)
Mean Answer Relevance: 0.887  (threshold: 0.80)
Mean Context Precision: 0.812  (threshold: 0.70)
────────────────────────────────────────────────────────────────────────

PASS [session-ttl-001] What is the recommended TTL for MCP sessions a...
     Faithfulness: 0.950  Relevance: 0.920  Precision: 0.875  (1842ms)

PASS [deploy-fly-001] How do I deploy the MCP server to Fly.io?...
     Faithfulness: 0.900  Relevance: 0.860  Precision: 0.800  (2103ms)
========================================================================

Part 9: CI Gate with GitHub Actions

Add evals to your CI pipeline so quality regressions block merges automatically:

# .github/workflows/eval.yml
name: RAG Evals

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: pgvector/pgvector:pg16
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: rag_test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - run: npm ci

      - name: Run database migrations
        run: npx tsx src/db/migrate.ts
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/rag_test

      - name: Ingest eval documents
        run: npx tsx src/ingestion/ingest.ts acme ./eval-docs/*.md
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/rag_test
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Run RAG evals
        run: npx tsx src/evals/run.ts
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/rag_test
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Upload eval results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: .eval-results/

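The if: always() on the upload step means you get the results artifact even when the eval job fails, which is essential for debugging regressions. Note that each CI run starts from a fresh checkout, so .eval-results/ holds no earlier runs and detectRegression has nothing to compare against unless you restore previous results first (for example from a cache or a committed baseline). The threshold gate works either way.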


Part 10: Practical Tips for Production Evals

Start small and grow. Twenty high-quality eval cases beat two hundred auto-generated ones. Focus your first dataset on your most common user questions and your highest-risk failure modes.

Separate retrieval evals from generation evals. Retrieval quality (context precision, recall) is a function of your embedding model and chunk size. Generation quality (faithfulness, relevance) is a function of your prompt and model. Running them separately makes it easier to pinpoint which layer regressed.

Use a fixed model version for the judge. Pin your judge to claude-sonnet-4-20250514 (or equivalent). If the judge model changes, your historical scores become incomparable. The judge's own variation adds noise; mitigate it by running each eval case 3 times and averaging.
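
A minimal sketch of the run-three-times-and-average idea follows. Both `combineJudgeRuns` and `judgeRepeated` are hypothetical helpers. Keeping the reasoning from the lowest-scoring run is a design choice made here, on the assumption that it is the most helpful one when a case fails.

```typescript
interface ScoreResult {
  score: number;
  reasoning: string;
}

// Hypothetical helper: combine repeated judge results for one eval case.
// The score is the mean; the reasoning comes from the lowest-scoring run.
function combineJudgeRuns(runs: ScoreResult[]): ScoreResult {
  if (runs.length === 0) throw new Error("Need at least one judge run");
  const mean = runs.reduce((sum, r) => sum + r.score, 0) / runs.length;
  const worst = runs.reduce((a, b) => (b.score < a.score ? b : a));
  return { score: mean, reasoning: worst.reasoning };
}

// Calls the judge `times` times, one after another, then combines the results.
async function judgeRepeated(
  judge: () => Promise<ScoreResult>,
  times = 3
): Promise<ScoreResult> {
  const runs: ScoreResult[] = [];
  for (let i = 0; i < times; i++) runs.push(await judge());
  return combineJudgeRuns(runs);
}

const combined = combineJudgeRuns([
  { score: 1.0, reasoning: "All claims supported." },
  { score: 0.5, reasoning: "One claim is not in the context." },
  { score: 0.75, reasoning: "Minor unsupported detail." },
]);
console.log(combined.score, combined.reasoning); // prints: 0.75 One claim is not in the context.
```

In the runner you could then wrap any scorer, for example `judgeRepeated(() => scoreFaithfulness(question, contextText, answer))`, at the cost of three times as many judge calls per dimension.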

Track source recall separately. Add a fourth metric: what fraction of expectedSources actually appeared in retrievedChunks? A source recall below 0.80 means your vector search is missing the right documents, which is usually a chunking or embedding problem, not a generation problem.
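
Source recall needs no judge call, because it is a plain set comparison. The sketch below is one way to compute it. `sourceRecall` is a hypothetical helper, and it assumes the names in `expectedSources` match `chunk.source` exactly.

```typescript
interface RetrievedChunk {
  source: string;
  content: string;
  score: number;
}

// Fraction of expected source documents that appear at least once
// among the retrieved chunks.
function sourceRecall(
  expectedSources: string[],
  retrievedChunks: RetrievedChunk[]
): number {
  // Nothing was expected, so nothing can have been missed.
  if (expectedSources.length === 0) return 1;

  const retrieved = new Set(retrievedChunks.map((c) => c.source));
  const hits = expectedSources.filter((s) => retrieved.has(s)).length;
  return hits / expectedSources.length;
}

const recall = sourceRecall(
  ["security-guide.md", "mcp-guide.md"],
  [
    { source: "security-guide.md", content: "RLS policies use current_setting...", score: 0.91 },
    { source: "runbook.md", content: "Install flyctl...", score: 0.52 },
  ]
);
console.log(recall); // prints 0.5
```

Because this metric is deterministic and free, it is a good first signal to check before spending judge tokens on a failing case.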

Keep eval docs in version control. Your eval-dataset.json and eval-docs/ folder belong in the repo. When the underlying documents change, update the ground truth answers and commit them together. Drift between the knowledge base and the golden dataset is the most common cause of mysterious eval failures.


Summary

In Part 12 you built a complete RAG evaluation framework:

  • Golden eval dataset: manually verified Q/A pairs with expected sources
  • LLM-as-judge scoring: Claude evaluates Faithfulness, Answer Relevance, and Context Precision with structured JSON output
  • End-to-end eval harness: runs the full RAG agent loop and captures retrieved chunks
  • Score reporter: colour-coded terminal output showing pass/fail per case
  • Trend store: JSON results persisted per run with automated regression detection
  • CI gate: GitHub Actions workflow that blocks merges when scores drop below thresholds

In Part 13 we will take the complete MCP stack and add real-time collaborative features: multiple agents sharing a session, tool call events broadcast to connected clients over SSE, and a simple dashboard that shows live agent activity.

