πŸ” MCP + RAG: Building a Knowledge-Base Tool with Vector Search in TypeScript

Give your AI agent long-term memory β€” ingest documents, embed them into a vector database, and expose semantic search as an MCP tool so Claude can answer questions grounded in your own knowledge

Published β€’ 16 min read

Hi πŸ‘‹, I'm Tushar Patil. I currently work as a Frontend Developer (Angular) and also have experience with .NET Core and .NET Framework.


This is Part 10 of the AI Engineering with TypeScript series.

Prerequisites: Part 2 β€” MCP Fundamentals Β· Part 3 β€” AI Agent Β· Part 5 β€” Production MCP Server

Stack: Node.js 20+ Β· TypeScript 5.x Β· OpenAI Embeddings API Β· pgvector Β· Qdrant Β· @modelcontextprotocol/sdk Β· Zod


πŸ—ΊοΈ What we'll cover

Every MCP server we built in this series answers questions from live APIs β€” weather data, current conditions, real-time forecasts. But many of the most valuable questions an AI agent needs to answer come from your own documents β€” product docs, internal wikis, support tickets, research papers, code comments, runbooks.

That is what RAG (Retrieval-Augmented Generation) solves. Instead of stuffing entire documents into the context window (expensive, slow, hits the limit quickly), RAG:

  1. Splits documents into chunks and stores them as embedding vectors in a vector database
  2. At query time, embeds the user's question and finds the most semantically similar chunks
  3. Passes only those relevant chunks to the model as context

When you combine RAG with MCP, your AI agent can call a search_knowledge_base tool mid-reasoning β€” exactly like it calls a weather tool β€” and ground its answers in your actual knowledge. 🧠

By the end you will have:

  • πŸ“₯ An ingestion pipeline that chunks, embeds, and stores documents in pgvector or Qdrant
  • πŸ” A search_knowledge_base MCP tool that does semantic similarity search
  • πŸ“„ A list_sources MCP resource listing all indexed documents
  • πŸ€– An agent that uses retrieval before answering knowledge questions
  • πŸ—„οΈ Guidance on choosing between pgvector (SQL-native) and Qdrant (purpose-built vector DB)
  • 🧹 Metadata filtering so searches can be scoped to a specific source, date range, or tag

🧠 Part 1: How RAG + MCP Fits Together

Without RAG, your agent's knowledge is frozen at the model's training cutoff and whatever you put in the system prompt. With RAG + MCP the flow becomes:

User: "What is the recommended Redis TTL for MCP sessions?"

Agent thinks: I should search the knowledge base first.

β†’ calls search_knowledge_base({ query: "Redis TTL MCP sessions" })

MCP server:
  1. embeds the query β†’ [0.021, -0.834, 0.441, ...]
  2. cosine-similarity search in pgvector
  3. returns top 3 matching chunks from your internal docs

Agent: Based on the documentation [chunk 1], the recommended TTL is 30 minutes,
       reset on every session access (sliding expiry)...

The agent did not hallucinate. It retrieved the answer from your actual content and cited it. That is the core value of RAG. 🎯


πŸ“¦ Part 2: Project Setup

mkdir mcp-rag-server && cd mcp-rag-server
npm init -y
npm install @modelcontextprotocol/sdk openai zod pg pgvector @qdrant/js-client-rest dotenv pdfjs-dist marked
npm install -D typescript @types/pg @types/node tsx tsup

Your .env:

# Embeddings (OpenAI)
OPENAI_API_KEY=sk-...

# Choose one vector store
DATABASE_URL=postgresql://localhost:5432/rag_db   # for pgvector
QDRANT_URL=http://localhost:6333                  # for Qdrant

# MCP server
PORT=3001
VALID_TOKENS=your-secret-token

The embedding model we use is text-embedding-3-small from OpenAI β€” 1536 dimensions, fast, cheap ($0.02 per million tokens). It works with both pgvector and Qdrant without changing the ingestion logic. βœ…
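If you want to sanity-check the model before wiring up a store, a minimal call looks like this (it assumes OPENAI_API_KEY is set in your environment):

// Quick check that embeddings come back with the expected dimensionality
import OpenAI from "openai";

const openai = new OpenAI(); // picks up OPENAI_API_KEY automatically

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "Redis TTL for MCP sessions",
});

console.log(response.data[0].embedding.length); // 1536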


πŸ—„οΈ Part 3: Choosing Your Vector Store

Both options are excellent. Here is when to pick each one:

pgvector β€” the PostgreSQL extension that adds a vector column type and cosine/L2/inner-product index operators.

Pick pgvector when:

  • You already run PostgreSQL (Supabase, Neon, Railway, RDS)
  • You want to join vector search results with relational data (e.g. filter by tenant_id, created_at)
  • You want one database for everything β€” sessions, metadata, and vectors
  • You prefer SQL and want to query embeddings with familiar tooling

Qdrant β€” a purpose-built vector database written in Rust, with a REST + gRPC API.

Pick Qdrant when:

  • You are handling millions of vectors and need maximum search throughput
  • You want built-in payload filtering, named vectors, and sparse vector support
  • You prefer a dedicated service that is tuned exclusively for vector search
  • You want the Qdrant cloud managed tier for zero-ops hosting

For this post we implement both and you can swap between them by changing one import. πŸ”„


πŸ—„οΈ Part 4: pgvector Setup

Install the extension and create the table:

-- Run this once in your PostgreSQL database
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
  id          BIGSERIAL PRIMARY KEY,
  source      TEXT NOT NULL,           -- filename or URL
  chunk_index INTEGER NOT NULL,        -- position within source
  content     TEXT NOT NULL,           -- raw text of this chunk
  embedding   vector(1536),            -- text-embedding-3-small output
  metadata    JSONB DEFAULT '{}',      -- tags, page numbers, section headers
  created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest-neighbour search
-- lists = roughly sqrt(total_rows) is a good starting point
CREATE INDEX IF NOT EXISTS documents_embedding_idx
  ON documents
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

The ivfflat index trades a tiny amount of recall for dramatically faster search at scale. For under 100k rows you can omit the index and use exact search β€” simply remove the CREATE INDEX statement. For over 1 million rows, switch to hnsw which gives better recall at the cost of higher memory:

-- Alternative for large collections (1M+ rows)
CREATE INDEX documents_embedding_hnsw_idx
  ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
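Both index types also have a query-time recall knob. A small sketch of the relevant settings β€” the values shown are starting points to tune, not recommendations:

-- ivfflat: number of lists scanned per query (default 1; higher = better recall, slower)
SET ivfflat.probes = 10;

-- hnsw: size of the candidate list during search (default 40; higher = better recall, slower)
SET hnsw.ef_search = 100;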

Now the pgvector client:

// src/stores/pgvector-store.ts
import pg from "pg";
import { toSql } from "pgvector/pg";

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

export interface DocumentChunk {
  id?: number;
  source: string;
  chunkIndex: number;
  content: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

export interface SearchResult {
  id: number;
  source: string;
  content: string;
  score: number;
  metadata: Record<string, unknown>;
}

export async function insertChunk(chunk: DocumentChunk): Promise<void> {
  await pool.query(
    `INSERT INTO documents (source, chunk_index, content, embedding, metadata)
     VALUES ($1, $2, $3, $4, $5)`,
    [
      chunk.source,
      chunk.chunkIndex,
      chunk.content,
      toSql(chunk.embedding),
      JSON.stringify(chunk.metadata ?? {}),
    ]
  );
}

export async function similaritySearch(
  queryEmbedding: number[],
  topK = 5,
  filter?: { source?: string; tags?: string[] }
): Promise<SearchResult[]> {
  let whereClause = "";
  const params: unknown[] = [toSql(queryEmbedding), topK];

  if (filter?.source) {
    params.push(filter.source);
    whereClause += ` AND source = $${params.length}`;
  }

  const result = await pool.query(
    `SELECT id, source, content, metadata,
            1 - (embedding <=> $1) AS score
     FROM documents
     WHERE 1=1 ${whereClause}
     ORDER BY embedding <=> $1
     LIMIT $2`,
    params
  );

  return result.rows.map((row) => ({
    id: row.id,
    source: row.source,
    content: row.content,
    score: parseFloat(row.score),
    metadata: row.metadata,
  }));
}

export async function listSources(): Promise<{ source: string; chunkCount: number }[]> {
  const result = await pool.query(
    `SELECT source, COUNT(*) as chunk_count
     FROM documents
     GROUP BY source
     ORDER BY source`
  );
  return result.rows.map((r) => ({
    source: r.source,
    chunkCount: parseInt(r.chunk_count),
  }));
}

The <=> operator is pgvector's cosine distance operator. 1 - (embedding <=> $1) converts distance to similarity β€” 1.0 means identical, 0.0 means orthogonal (no semantic overlap). We order by distance (ascending) but return similarity (descending) so scores are intuitive for the caller. βœ…


πŸ¦€ Part 5: Qdrant Setup

Start Qdrant locally with Docker:

docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
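To confirm the container is up, hit the REST API β€” an empty collections list means Qdrant is ready for the ensureCollection call below (response shape abridged):

curl http://localhost:6333/collections
# {"result":{"collections":[]},"status":"ok", ...}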

Now the Qdrant client:

// src/stores/qdrant-store.ts
import { randomUUID } from "node:crypto";
import { QdrantClient } from "@qdrant/js-client-rest";
// Assumption: DocumentChunk and SearchResult live in a shared types module used by
// both stores (with id typed string | number so Qdrant UUIDs fit)
import type { DocumentChunk, SearchResult } from "./types.js";

const client = new QdrantClient({ url: process.env.QDRANT_URL ?? "http://localhost:6333" });

const COLLECTION = "documents";
const VECTOR_SIZE = 1536;

export async function ensureCollection(): Promise<void> {
  const collections = await client.getCollections();
  const exists = collections.collections.some((c) => c.name === COLLECTION);

  if (!exists) {
    await client.createCollection(COLLECTION, {
      vectors: { size: VECTOR_SIZE, distance: "Cosine" },
    });

    // Payload index for fast metadata filtering
    await client.createPayloadIndex(COLLECTION, {
      field_name: "source",
      field_schema: "keyword",
    });
  }
}

export async function insertChunk(chunk: DocumentChunk): Promise<void> {
  await client.upsert(COLLECTION, {
    points: [
      {
        // Qdrant point IDs must be unsigned integers or UUIDs β€” an arbitrary string
        // like "runbook.md-0" is rejected. randomUUID() is the simplest fix; a
        // deterministic UUID (e.g. uuid v5 of `${chunk.source}-${chunk.chunkIndex}`)
        // would preserve upsert semantics on re-ingestion.
        id: randomUUID(),
        vector: chunk.embedding,
        payload: {
          source: chunk.source,
          chunkIndex: chunk.chunkIndex,
          content: chunk.content,
          metadata: chunk.metadata ?? {},
        },
      },
    ],
  });
}

export async function similaritySearch(
  queryEmbedding: number[],
  topK = 5,
  filter?: { source?: string }
): Promise<SearchResult[]> {
  const qdrantFilter = filter?.source
    ? { must: [{ key: "source", match: { value: filter.source } }] }
    : undefined;

  const results = await client.search(COLLECTION, {
    vector: queryEmbedding,
    limit: topK,
    filter: qdrantFilter,
    with_payload: true,
  });

  return results.map((r) => ({
    id: String(r.id),
    source: r.payload?.source as string,
    content: r.payload?.content as string,
    score: r.score,
    metadata: (r.payload?.metadata as Record<string, unknown>) ?? {},
  }));
}

export async function listSources(): Promise<{ source: string; chunkCount: number }[]> {
  // Scroll through all points and aggregate by source
  const counts = new Map<string, number>();
  let offset: string | number | null = null;

  do {
    const page = await client.scroll(COLLECTION, {
      limit: 100,
      offset: offset ?? undefined,
      with_payload: ["source"],
    });

    for (const point of page.points) {
      const src = point.payload?.source as string;
      counts.set(src, (counts.get(src) ?? 0) + 1);
    }

    offset = page.next_page_offset;
  } while (offset !== null);

  return Array.from(counts.entries()).map(([source, chunkCount]) => ({
    source,
    chunkCount,
  }));
}

Both stores expose the same interface β€” insertChunk, similaritySearch, listSources. The MCP server imports from a single store.ts file that re-exports whichever backend you configure via env var. πŸ”„
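A minimal sketch of that store.ts β€” the VECTOR_STORE env var name and the shared types module are my own naming, pick whatever convention you prefer:

// src/store.ts β€” re-export whichever backend is configured
// (DocumentChunk / SearchResult assumed to live in src/stores/types.ts, shared by both stores)
export type { DocumentChunk, SearchResult } from "./stores/types.js";

const backend =
  process.env.VECTOR_STORE === "qdrant"   // assumed env var name
    ? await import("./stores/qdrant-store.js")
    : await import("./stores/pgvector-store.js");

export const { insertChunk, similaritySearch, listSources } = backend;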


βœ‚οΈ Part 6: The Ingestion Pipeline

Ingestion has three steps: load the document, chunk it into overlapping passages, embed each chunk and store it.

// src/ingestion/chunker.ts

export interface Chunk {
  content: string;
  index: number;
}

export function chunkText(
  text: string,
  chunkSize = 512,
  overlap = 64
): Chunk[] {
  const words = text.split(/\s+/);
  const chunks: Chunk[] = [];
  let i = 0;
  let index = 0;

  while (i < words.length) {
    const slice = words.slice(i, i + chunkSize);
    chunks.push({ content: slice.join(" "), index: index++ });
    i += chunkSize - overlap;
  }

  return chunks;
}

The overlap parameter (64 words by default) means adjacent chunks share content at their boundaries. This prevents important sentences from being split across two chunks with neither containing enough context to be useful. πŸ’‘
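A quick way to see the overlap at work β€” longDocText here stands for any reasonably long string of your own:

// Adjacent chunks share their boundary words
const chunks = chunkText(longDocText, 512, 64);

const tailOfFirst = chunks[0].content.split(/\s+/).slice(-64).join(" ");
const headOfSecond = chunks[1].content.split(/\s+/).slice(0, 64).join(" ");

console.log(tailOfFirst === headOfSecond); // true β€” the last 64 words of chunk 0 open chunk 1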

Now the embedder:

// src/ingestion/embedder.ts
import OpenAI from "openai";

const openai = new OpenAI();

export async function embedTexts(texts: string[]): Promise<number[][]> {
  // Batch up to 100 texts per API call for efficiency
  const batches: string[][] = [];
  for (let i = 0; i < texts.length; i += 100) {
    batches.push(texts.slice(i, i + 100));
  }

  const allEmbeddings: number[][] = [];

  for (const batch of batches) {
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
      encoding_format: "float",
    });

    allEmbeddings.push(...response.data.map((d) => d.embedding));
  }

  return allEmbeddings;
}

And the full ingestion script that ties it together:

// src/ingestion/ingest.ts
import fs from "fs";
import path from "path";
import { chunkText } from "./chunker.js";
import { embedTexts } from "./embedder.js";
import { insertChunk } from "../store.js";

export async function ingestFile(filePath: string, tags: string[] = []): Promise<void> {
  const source = path.basename(filePath);
  const raw = fs.readFileSync(filePath, "utf-8");

console.log(`πŸ“„ Ingesting: ${source} (${raw.length} chars)`);

  const chunks = chunkText(raw);
  console.log(`  βœ‚οΈ  ${chunks.length} chunks created`);

  const embeddings = await embedTexts(chunks.map((c) => c.content));
  console.log(`  πŸ”’ ${embeddings.length} embeddings computed`);

  for (let i = 0; i < chunks.length; i++) {
    await insertChunk({
      source,
      chunkIndex: chunks[i].index,
      content: chunks[i].content,
      embedding: embeddings[i],
      metadata: { tags, filePath },
    });
  }

  console.log(`  βœ… ${source} indexed successfully\n`);
}

// CLI: node dist/ingestion/ingest.js ./docs/runbook.md ./docs/api-reference.md
const files = process.argv.slice(2);
for (const f of files) {
  await ingestFile(f);
}

Run it:

npx tsx src/ingestion/ingest.ts \
  ./docs/runbook.md \
  ./docs/mcp-guide.md \
  ./docs/api-reference.md

πŸ“„ Ingesting: runbook.md (18432 chars)
  βœ‚οΈ  42 chunks created
  πŸ”’ 42 embeddings computed
  βœ… runbook.md indexed successfully

πŸ“„ Ingesting: mcp-guide.md (24100 chars)
  βœ‚οΈ  56 chunks created
  πŸ”’ 56 embeddings computed
  βœ… mcp-guide.md indexed successfully

πŸ”§ Part 7: The MCP Server with RAG Tools

Now the MCP server itself β€” two tools and one resource:

// src/server.ts
import "dotenv/config";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { embedTexts } from "./ingestion/embedder.js";
import { chunkText } from "./ingestion/chunker.js";
import { similaritySearch, insertChunk, listSources } from "./store.js";

const server = new McpServer({
  name: "knowledge-base-server",
  version: "1.0.0",
});

// Tool 1: semantic search
server.tool(
  "search_knowledge_base",
  "Search the internal knowledge base for information relevant to a query. Use this before answering any question about internal processes, documentation, or technical guides.",
  {
    query: z
      .string()
      .min(3)
      .describe("The natural language question or search phrase"),
    top_k: z
      .number()
      .int()
      .min(1)
      .max(10)
      .default(4)
      .describe("Number of results to return"),
    source_filter: z
      .string()
      .optional()
      .describe("Optional: restrict search to a specific document source"),
  },
  async (args) => {
    const [queryEmbedding] = await embedTexts([args.query]);

    const results = await similaritySearch(
      queryEmbedding,
      args.top_k,
      args.source_filter ? { source: args.source_filter } : undefined
    );

    if (results.length === 0) {
      return {
        content: [
          {
            type: "text",
            text: "No relevant documents found for this query. The knowledge base may not contain information on this topic.",
          },
        ],
      };
    }

    const formatted = results
      .map(
        (r, i) =>
          `[${i + 1}] Source: ${r.source} (score: ${r.score.toFixed(3)})\n${r.content}`
      )
      .join("\n\n---\n\n");

    return {
      content: [
        {
          type: "text",
          text: `Found ${results.length} relevant passages:\n\n${formatted}`,
        },
      ],
    };
  }
);

// Tool 2: add a new document at runtime
server.tool(
  "index_document",
  "Add a new text document to the knowledge base so it can be searched immediately.",
  {
    source: z.string().describe("A name or identifier for this document"),
    content: z.string().min(10).describe("The full text content to index"),
    tags: z.array(z.string()).default([]).describe("Optional tags for filtering"),
  },
  async (args) => {
    const chunks = chunkText(args.content);
    const embeddings = await embedTexts(chunks.map((c) => c.content));

    for (let i = 0; i < chunks.length; i++) {
      await insertChunk({
        source: args.source,
        chunkIndex: i,
        content: chunks[i].content,
        embedding: embeddings[i],
        metadata: { tags: args.tags },
      });
    }

    return {
      content: [
        {
          type: "text",
          text: `Indexed ${chunks.length} chunks from "${args.source}" successfully.`,
        },
      ],
    };
  }
);

// Resource: list all indexed sources
server.resource(
  "indexed-sources",
  "knowledge://sources",
  { description: "List of all documents currently indexed in the knowledge base" },
  async () => {
    const sources = await listSources();
    const text = sources
      .map((s) => `${s.source} β€” ${s.chunkCount} chunks`)
      .join("\n");

    return {
      contents: [
        {
          uri: "knowledge://sources",
          text: sources.length
            ? `Indexed documents:\n\n${text}`
            : "No documents indexed yet.",
        },
      ],
    };
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Knowledge base MCP server running");
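Before wiring the server into an agent, you can exercise the tools interactively with the MCP Inspector (the command below assumes you have already built to dist/):

npx @modelcontextprotocol/inspector node dist/server.js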

πŸ€– Part 8: The RAG Agent in Action

Connect the agent from Part 3 to the knowledge base server and watch it retrieve before it answers:

// Quick test β€” run the agent against the knowledge base
import { createMcpClient } from "./client.js";
import { runStreamingAgent } from "./streaming-agent.js";

const client = await createMcpClient("node", ["dist/server.js"]);
const { tools } = await client.listTools();

const anthropicTools = tools.map((t) => ({
  name: t.name,
  description: t.description ?? "",
  input_schema: t.inputSchema,
}));

const messages = [
  {
    role: "user" as const,
    content:
      "What is the recommended session TTL for the MCP server and how should it be implemented?",
  },
];

await runStreamingAgent(client, anthropicTools, messages);

Terminal output:

πŸ€– Agent: Let me search the knowledge base for information on MCP session TTL.
  πŸ”§ [tool_use] search_knowledge_base({"query":"MCP session TTL Redis implementation","top_k":4})
  βœ… [result] Found 4 relevant passages:
              [1] Source: mcp-guide.md (score: 0.891)
              The recommended TTL for MCP sessions is 30 minutes, implemented
              as a sliding expiry...

πŸ€– Agent: Based on the documentation, the recommended session TTL is **30 minutes**
with a sliding expiry β€” meaning the TTL resets on every active request.
This is implemented in Redis using:

    await redis.set(KEY(sessionId), JSON.stringify(state), "EX", 1800);

The sliding TTL ensures active sessions never expire while idle sessions
clean themselves up automatically. [Source: mcp-guide.md]

The agent cited the source file and provided the exact implementation from your docs β€” no hallucination. 🎯


🧹 Part 9: Metadata Filtering

Sometimes you want to search only within a specific document or tag. The source_filter parameter on search_knowledge_base enables this:

// Agent can now narrow search to a specific document
search_knowledge_base({
  query: "Docker deployment steps",
  top_k: 3,
  source_filter: "runbook.md"   // only search the runbook
})

For tag-based filtering with pgvector, add a WHERE clause on the JSONB metadata:

-- Filter by tag in pgvector
SELECT id, source, content, metadata,
       1 - (embedding <=> $1) AS score
FROM documents
WHERE metadata->'tags' ? 'docker'   -- contains tag
ORDER BY embedding <=> $1
LIMIT $2;

For Qdrant, use payload filters:

await client.search(COLLECTION, {
  vector: queryEmbedding,
  limit: topK,
  filter: {
    must: [
      { key: "metadata.tags", match: { any: ["docker"] } }
    ]
  },
  with_payload: true,
});

πŸ’‘ Part 10: Production Tips

Chunk size matters more than you think. 512 words is a good default. Too small (under 100 words) and chunks lose context. Too large (over 1000 words) and the embedding averages over too much content, making similarity search less precise. Experiment with your specific documents.

Embed the question and the answer separately. HyDE (Hypothetical Document Embeddings) is a technique where you ask the model to generate a hypothetical answer to the query, embed that hypothetical answer, and search for chunks similar to the answer rather than the question. This dramatically improves recall for questions phrased very differently from the documentation.
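A minimal HyDE sketch on top of the pieces already built in this post β€” the gpt-4o-mini model name and the prompt wording are illustrative placeholders, not recommendations:

// src/retrieval/hyde.ts β€” sketch of Hypothetical Document Embeddings
import OpenAI from "openai";
import { embedTexts } from "../ingestion/embedder.js";
import { similaritySearch } from "../store.js";

const openai = new OpenAI();

export async function hydeSearch(question: string, topK = 5) {
  // 1. Let the model draft a plausible (possibly wrong) answer
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Write a short documentation-style passage that plausibly answers the question.",
      },
      { role: "user", content: question },
    ],
  });
  const hypothetical = completion.choices[0].message.content ?? question;

  // 2. Embed the hypothetical answer instead of the raw question
  const [embedding] = await embedTexts([hypothetical]);

  // 3. Search as usual β€” answer-shaped text lands closer to answer-shaped chunks
  return similaritySearch(embedding, topK);
}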

Re-rank after retrieval. Cosine similarity is fast but imprecise. After retrieving the top 20 candidates, run a cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-6-v2 via a local model) to reorder them and keep only the top 4. This two-stage approach gives you both speed and precision.
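The two-stage shape looks roughly like this β€” scoreWithCrossEncoder is a hypothetical helper standing in for whichever reranker you choose (local cross-encoder, hosted rerank API, etc.):

// Two-stage retrieval: wide vector search, then precise reranking
import { embedTexts } from "./ingestion/embedder.js";
import { similaritySearch } from "./store.js";

// Hypothetical: returns a relevance score for a (query, passage) pair
declare function scoreWithCrossEncoder(query: string, passage: string): Promise<number>;

export async function searchWithRerank(query: string, finalK = 4) {
  const [embedding] = await embedTexts([query]);

  // Stage 1: cheap, wide vector search
  const candidates = await similaritySearch(embedding, 20);

  // Stage 2: cross-encoder scoring on just those 20 candidates
  const scored = await Promise.all(
    candidates.map(async (c) => ({
      ...c,
      rerankScore: await scoreWithCrossEncoder(query, c.content),
    }))
  );

  return scored.sort((a, b) => b.rerankScore - a.rerankScore).slice(0, finalK);
}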

Track which chunks get cited. Add a citations table that logs which chunk IDs were included in a response. After a week, you can identify which documents are being consulted most β€” and which were indexed but never retrieved, meaning they may be poorly chunked or the embedding model does not represent them well.
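One way to log this with pgvector is a small citations table β€” the table and column names here are illustrative, not part of the schema above:

-- Log every chunk that made it into an answer
CREATE TABLE IF NOT EXISTS citations (
  id        BIGSERIAL PRIMARY KEY,
  chunk_id  BIGINT REFERENCES documents(id),
  query     TEXT NOT NULL,              -- the question that triggered retrieval
  cited_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Which sources actually get consulted?
SELECT d.source, COUNT(*) AS citation_count
FROM citations c
JOIN documents d ON d.id = c.chunk_id
GROUP BY d.source
ORDER BY citation_count DESC;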

Set a minimum similarity threshold. If the best match scores below 0.6, the knowledge base probably does not contain relevant information. Return a clear "not found" instead of low-quality results that confuse the model:

const results = await similaritySearch(queryEmbedding, topK);
const relevant = results.filter((r) => r.score >= 0.6);

if (relevant.length === 0) {
  return { content: [{ type: "text", text: "No relevant information found." }] };
}

🎯 Summary

In Part 10 you built a complete RAG pipeline exposed as MCP tools:

  • πŸ“₯ Ingestion pipeline β€” chunk, embed with text-embedding-3-small, store in pgvector or Qdrant
  • πŸ—„οΈ Two vector stores β€” pgvector for SQL-native deployments, Qdrant for dedicated high-throughput search
  • πŸ” search_knowledge_base MCP tool β€” semantic similarity search with metadata filtering
  • πŸ“ index_document MCP tool β€” add new content at runtime without restarting the server
  • πŸ“„ knowledge://sources MCP resource β€” list all indexed documents
  • πŸ€– RAG agent β€” Claude retrieves before it answers, citing sources and avoiding hallucination
  • πŸ’‘ Production tips β€” chunk size, HyDE, reranking, citation tracking, similarity thresholds

In Part 11 we will harden the knowledge base with access control per tenant β€” so that when Client A asks a question, the search only returns chunks that Client A is authorised to see. Multi-tenant RAG with row-level security in pgvector. πŸ”


πŸ“š Further Reading