MCP + RAG: Building a Knowledge-Base Tool with Vector Search in TypeScript
Give your AI agent long-term memory: ingest documents, embed them into a vector database, and expose semantic search as an MCP tool so Claude can answer questions grounded in your own knowledge
Hi, I'm Tushar Patil. I currently work as a frontend developer (Angular) and also have expertise in .NET Core and .NET Framework.
This is Part 10 of the AI Engineering with TypeScript series.
Prerequisites: Part 2 – MCP Fundamentals · Part 3 – AI Agent · Part 5 – Production MCP Server
Stack: Node.js 20+ · TypeScript 5.x · OpenAI Embeddings API · pgvector · Qdrant · @modelcontextprotocol/sdk · Zod
What we'll cover
Every MCP server we built in this series answers questions from live APIs: weather data, current conditions, real-time forecasts. But many of the most valuable questions an AI agent needs to answer come from your own documents: product docs, internal wikis, support tickets, research papers, code comments, runbooks.
That is what RAG (Retrieval-Augmented Generation) solves. Instead of stuffing entire documents into the context window (expensive, slow, hits the limit quickly), RAG:
- Splits documents into chunks and stores them as embedding vectors in a vector database
- At query time, embeds the user's question and finds the most semantically similar chunks
- Passes only those relevant chunks to the model as context
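"Most semantically similar" in step 2 means cosine similarity between embedding vectors. The vector store computes this (or its distance form) internally; a minimal version just to make the idea concrete:

```typescript
// Cosine similarity: 1.0 = same direction (semantically close),
// 0.0 = orthogonal (unrelated). Vector databases index exactly this.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In production the vectors have 1536 dimensions and the store uses an approximate index, but the ranking principle is the same.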
When you combine RAG with MCP, your AI agent can call a search_knowledge_base tool mid-reasoning, exactly like it calls a weather tool, and ground its answers in your actual knowledge.
By the end you will have:
- An ingestion pipeline that chunks, embeds, and stores documents in pgvector or Qdrant
- A `search_knowledge_base` MCP tool that does semantic similarity search
- A `list_sources` MCP resource listing all indexed documents
- An agent that uses retrieval before answering knowledge questions
- Guidance on choosing between pgvector (SQL-native) and Qdrant (purpose-built vector DB)
- Metadata filtering so searches can be scoped to a specific source, date range, or tag
Part 1: How RAG + MCP Fits Together
Without RAG, your agent's knowledge is frozen at the model's training cutoff and whatever you put in the system prompt. With RAG + MCP the flow becomes:
User: "What is the recommended Redis TTL for MCP sessions?"
Agent thinks: I should search the knowledge base first.
→ calls search_knowledge_base({ query: "Redis TTL MCP sessions" })
MCP server:
1. embeds the query → [0.021, -0.834, 0.441, ...]
2. cosine-similarity search in pgvector
3. returns top 3 matching chunks from your internal docs
Agent: Based on the documentation [chunk 1], the recommended TTL is 30 minutes,
reset on every session access (sliding expiry)...
The agent did not hallucinate. It retrieved the answer from your actual content and cited it. That is the core value of RAG.
Part 2: Project Setup
mkdir mcp-rag-server && cd mcp-rag-server
npm init -y
npm install @modelcontextprotocol/sdk openai zod pg pgvector @qdrant/js-client-rest dotenv pdfjs-dist marked
npm install -D typescript @types/pg @types/node tsx tsup
Your .env:
# Embeddings (OpenAI)
OPENAI_API_KEY=sk-...
# Choose one vector store
DATABASE_URL=postgresql://localhost:5432/rag_db # for pgvector
QDRANT_URL=http://localhost:6333 # for Qdrant
# MCP server
PORT=3001
VALID_TOKENS=your-secret-token
The embedding model we use is text-embedding-3-small from OpenAI: 1536 dimensions, fast, and cheap ($0.02 per million tokens). It works with both pgvector and Qdrant without changing the ingestion logic.
Part 3: Choosing Your Vector Store
Both options are excellent. Here is when to pick each one:
pgvector – the PostgreSQL extension that adds a vector column type and cosine/L2/inner-product index operators.
Pick pgvector when:
- You already run PostgreSQL (Supabase, Neon, Railway, RDS)
- You want to join vector search results with relational data (e.g. filter by `tenant_id`, `created_at`)
- You want one database for everything: sessions, metadata, and vectors
- You prefer SQL and want to query embeddings with familiar tooling
Qdrant – a purpose-built vector database written in Rust, with a REST + gRPC API.
Pick Qdrant when:
- You are handling millions of vectors and need maximum search throughput
- You want built-in payload filtering, named vectors, and sparse vector support
- You prefer a dedicated service that is tuned exclusively for vector search
- You want the Qdrant cloud managed tier for zero-ops hosting
For this post we implement both, and you can swap between them by changing one import.
Part 4: pgvector Setup
Install the extension and create the table:
-- Run this once in your PostgreSQL database
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL, -- filename or URL
chunk_index INTEGER NOT NULL, -- position within source
content TEXT NOT NULL, -- raw text of this chunk
embedding vector(1536), -- text-embedding-3-small output
metadata JSONB DEFAULT '{}', -- tags, page numbers, section headers
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- IVFFlat index for approximate nearest-neighbour search
-- lists = roughly sqrt(total_rows) is a good starting point
CREATE INDEX IF NOT EXISTS documents_embedding_idx
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
The ivfflat index trades a tiny amount of recall for dramatically faster search at scale. For under 100k rows you can omit the index and use exact search; simply remove the CREATE INDEX statement. For over 1 million rows, switch to hnsw, which gives better recall at the cost of higher memory:
-- Alternative for large collections (1M+ rows)
CREATE INDEX documents_embedding_hnsw_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
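Recall is also tunable at query time. pgvector exposes session-level settings for both index types (the values below are starting points to experiment with, not tuned recommendations):

```sql
-- More probes = better recall, slower queries (ivfflat default is 1)
SET ivfflat.probes = 10;

-- The equivalent knob for hnsw indexes (default is 40)
SET hnsw.ef_search = 100;
```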
Now the pgvector client:
// src/stores/pgvector-store.ts
import pg from "pg";
import { toSql } from "pgvector/pg";
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
export interface DocumentChunk {
id?: number;
source: string;
chunkIndex: number;
content: string;
embedding: number[];
metadata?: Record<string, unknown>;
}
export interface SearchResult {
  id: number | string; // pgvector returns bigserial ids; Qdrant returns UUID strings
  source: string;
  content: string;
  score: number;
  metadata: Record<string, unknown>;
}
export async function insertChunk(chunk: DocumentChunk): Promise<void> {
await pool.query(
`INSERT INTO documents (source, chunk_index, content, embedding, metadata)
VALUES ($1, $2, $3, $4, $5)`,
[
chunk.source,
chunk.chunkIndex,
chunk.content,
toSql(chunk.embedding),
JSON.stringify(chunk.metadata ?? {}),
]
);
}
export async function similaritySearch(
queryEmbedding: number[],
topK = 5,
filter?: { source?: string; tags?: string[] }
): Promise<SearchResult[]> {
  let whereClause = "";
  const params: unknown[] = [toSql(queryEmbedding), topK];
  if (filter?.source) {
    params.push(filter.source);
    whereClause += ` AND source = $${params.length}`;
  }
  if (filter?.tags?.length) {
    // jsonb ?| matches rows whose tags array contains ANY of the given tags
    params.push(filter.tags);
    whereClause += ` AND metadata->'tags' ?| $${params.length}::text[]`;
  }
const result = await pool.query(
`SELECT id, source, content, metadata,
1 - (embedding <=> $1) AS score
FROM documents
WHERE 1=1 ${whereClause}
ORDER BY embedding <=> $1
LIMIT $2`,
params
);
return result.rows.map((row) => ({
id: row.id,
source: row.source,
content: row.content,
score: parseFloat(row.score),
metadata: row.metadata,
}));
}
export async function listSources(): Promise<{ source: string; chunkCount: number }[]> {
const result = await pool.query(
`SELECT source, COUNT(*) as chunk_count
FROM documents
GROUP BY source
ORDER BY source`
);
return result.rows.map((r) => ({
source: r.source,
chunkCount: parseInt(r.chunk_count),
}));
}
The <=> operator is pgvector's cosine distance operator. 1 - (embedding <=> $1) converts distance to similarity: 1.0 means identical, 0.0 means completely unrelated. We order by distance (ascending) but return similarity (descending) so scores are intuitive for the caller.
Part 5: Qdrant Setup
Start Qdrant locally with Docker:
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant
Now the Qdrant client:
// src/stores/qdrant-store.ts
import { QdrantClient } from "@qdrant/js-client-rest";
// Reuse the chunk/result types defined alongside the pgvector store
import type { DocumentChunk, SearchResult } from "./pgvector-store.js";
const client = new QdrantClient({ url: process.env.QDRANT_URL ?? "http://localhost:6333" });
const COLLECTION = "documents";
const VECTOR_SIZE = 1536;
export async function ensureCollection(): Promise<void> {
const collections = await client.getCollections();
const exists = collections.collections.some((c) => c.name === COLLECTION);
if (!exists) {
await client.createCollection(COLLECTION, {
vectors: { size: VECTOR_SIZE, distance: "Cosine" },
});
// Payload index for fast metadata filtering
await client.createPayloadIndex(COLLECTION, {
field_name: "source",
field_schema: "keyword",
});
}
}
import { createHash } from "node:crypto";
// Qdrant point IDs must be unsigned integers or UUIDs, so a plain
// "source-chunkIndex" string would be rejected. Derive a deterministic
// UUID instead: re-ingesting the same chunk then overwrites the old
// point rather than creating a duplicate.
function pointId(source: string, chunkIndex: number): string {
  const hex = createHash("md5").update(`${source}-${chunkIndex}`).digest("hex");
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20, 32)].join("-");
}
export async function insertChunk(chunk: DocumentChunk): Promise<void> {
  await client.upsert(COLLECTION, {
    points: [
      {
        id: pointId(chunk.source, chunk.chunkIndex),
        vector: chunk.embedding,
        payload: {
          source: chunk.source,
          chunkIndex: chunk.chunkIndex,
          content: chunk.content,
          metadata: chunk.metadata ?? {},
        },
      },
    ],
  });
}
export async function similaritySearch(
queryEmbedding: number[],
topK = 5,
filter?: { source?: string }
): Promise<SearchResult[]> {
const qdrantFilter = filter?.source
? { must: [{ key: "source", match: { value: filter.source } }] }
: undefined;
const results = await client.search(COLLECTION, {
vector: queryEmbedding,
limit: topK,
filter: qdrantFilter,
with_payload: true,
});
return results.map((r) => ({
id: String(r.id),
source: r.payload?.source as string,
content: r.payload?.content as string,
score: r.score,
metadata: (r.payload?.metadata as Record<string, unknown>) ?? {},
}));
}
export async function listSources(): Promise<{ source: string; chunkCount: number }[]> {
// Scroll through all points and aggregate by source
const counts = new Map<string, number>();
let offset: string | number | null = null;
do {
const page = await client.scroll(COLLECTION, {
limit: 100,
offset: offset ?? undefined,
with_payload: ["source"],
});
for (const point of page.points) {
const src = point.payload?.source as string;
counts.set(src, (counts.get(src) ?? 0) + 1);
}
    offset = page.next_page_offset ?? null;
  } while (offset !== null);
return Array.from(counts.entries()).map(([source, chunkCount]) => ({
source,
chunkCount,
}));
}
Both stores expose the same interface: insertChunk, similaritySearch, listSources. The MCP server imports from a single store.ts file that re-exports whichever backend you configure via an env var.
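A minimal sketch of that dispatch file. The `VECTOR_STORE` env var name is an assumption, and the two backends are stubbed inline here so the wiring is self-contained; the real file would `await import()` the actual store modules instead:

```typescript
// src/store.ts (sketch): pick the backend once at startup and re-export
// its functions. In the real file, replace the inline stubs with
//   await import("./stores/pgvector-store.js")
//   await import("./stores/qdrant-store.js")
type Store = {
  insertChunk: (chunk: { source: string; content: string }) => Promise<void>;
  listSources: () => Promise<{ source: string; chunkCount: number }[]>;
};

const pgvectorStore: Store = {
  insertChunk: async () => {},
  listSources: async () => [{ source: "pgvector-stub", chunkCount: 0 }],
};
const qdrantStore: Store = {
  insertChunk: async () => {},
  listSources: async () => [{ source: "qdrant-stub", chunkCount: 0 }],
};

// VECTOR_STORE is an assumed env var name; defaults to pgvector
const backend = process.env.VECTOR_STORE ?? "pgvector";
const { insertChunk, listSources } =
  backend === "qdrant" ? qdrantStore : pgvectorStore;
// (export insertChunk / similaritySearch / listSources in the real file)
```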
βοΈ Part 6: The Ingestion Pipeline
Ingestion has three steps: load the document, chunk it into overlapping passages, and embed and store each chunk.
// src/ingestion/chunker.ts
export interface Chunk {
content: string;
index: number;
}
export function chunkText(
text: string,
chunkSize = 512,
overlap = 64
): Chunk[] {
const words = text.split(/\s+/);
const chunks: Chunk[] = [];
let i = 0;
let index = 0;
while (i < words.length) {
const slice = words.slice(i, i + chunkSize);
chunks.push({ content: slice.join(" "), index: index++ });
i += chunkSize - overlap;
}
return chunks;
}
The overlap parameter (64 words by default) means adjacent chunks share content at their boundaries. This prevents important sentences from being split across two chunks with neither containing enough context to be useful.
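To see the overlap concretely, here is the chunker applied to a tiny ten-word input with a small chunk size (the function body is repeated from above so this snippet runs standalone):

```typescript
// Same chunker as above, repeated so the demo is self-contained.
interface Chunk {
  content: string;
  index: number;
}

function chunkText(text: string, chunkSize = 512, overlap = 64): Chunk[] {
  const words = text.split(/\s+/);
  const chunks: Chunk[] = [];
  let i = 0;
  let index = 0;
  while (i < words.length) {
    chunks.push({ content: words.slice(i, i + chunkSize).join(" "), index: index++ });
    i += chunkSize - overlap;
  }
  return chunks;
}

// chunkSize = 4, overlap = 1: each chunk repeats the last word of the
// previous one, so no sentence boundary is ever lost entirely.
const demo = chunkText("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", 4, 1);
console.log(demo.map((c) => c.content));
// → ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9", "w9"]
```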
Now the embedder:
// src/ingestion/embedder.ts
import OpenAI from "openai";
const openai = new OpenAI();
export async function embedTexts(texts: string[]): Promise<number[][]> {
// Batch up to 100 texts per API call for efficiency
const batches: string[][] = [];
for (let i = 0; i < texts.length; i += 100) {
batches.push(texts.slice(i, i + 100));
}
const allEmbeddings: number[][] = [];
for (const batch of batches) {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: batch,
encoding_format: "float",
});
allEmbeddings.push(...response.data.map((d) => d.embedding));
}
return allEmbeddings;
}
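Embedding a large corpus will eventually hit rate limits. A small retry helper with exponential backoff is worth wrapping around each batch call (a sketch; the helper is generic over any async function):

```typescript
// Retry an async operation with exponential backoff: waits baseDelayMs,
// then 2x, 4x, ... between attempts, and rethrows after maxAttempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```

Inside embedTexts, each batch call would then become something like `await withRetry(() => openai.embeddings.create({ ... }))`.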
And the full ingestion script that ties it together:
// src/ingestion/ingest.ts
import "dotenv/config"; // load OPENAI_API_KEY / DATABASE_URL when run as a CLI
import fs from "fs";
import path from "path";
import { chunkText } from "./chunker.js";
import { embedTexts } from "./embedder.js";
import { insertChunk } from "../store.js";
export async function ingestFile(filePath: string, tags: string[] = []): Promise<void> {
const source = path.basename(filePath);
const raw = fs.readFileSync(filePath, "utf-8");
  console.log(`Ingesting: ${source} (${raw.length} chars)`);
  const chunks = chunkText(raw);
  console.log(`  ${chunks.length} chunks created`);
  const embeddings = await embedTexts(chunks.map((c) => c.content));
  console.log(`  ${embeddings.length} embeddings computed`);
for (let i = 0; i < chunks.length; i++) {
await insertChunk({
source,
chunkIndex: chunks[i].index,
content: chunks[i].content,
embedding: embeddings[i],
metadata: { tags, filePath },
});
}
  console.log(`  ${source} indexed successfully\n`);
}
// CLI: node dist/ingestion/ingest.js ./docs/runbook.md ./docs/api-reference.md
const files = process.argv.slice(2);
for (const f of files) {
await ingestFile(f);
}
Run it:
npx tsx src/ingestion/ingest.ts \
./docs/runbook.md \
./docs/mcp-guide.md \
./docs/api-reference.md
Ingesting: runbook.md (18432 chars)
  42 chunks created
  42 embeddings computed
  runbook.md indexed successfully

Ingesting: mcp-guide.md (24100 chars)
  56 chunks created
  56 embeddings computed
  mcp-guide.md indexed successfully
Part 7: The MCP Server with RAG Tools
Now the MCP server itself β two tools and one resource:
// src/server.ts
import "dotenv/config";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { embedTexts } from "./ingestion/embedder.js";
import { similaritySearch, listSources } from "./store.js";
const server = new McpServer({
name: "knowledge-base-server",
version: "1.0.0",
});
// Tool 1: semantic search
server.tool(
"search_knowledge_base",
"Search the internal knowledge base for information relevant to a query. Use this before answering any question about internal processes, documentation, or technical guides.",
{
query: z
.string()
.min(3)
.describe("The natural language question or search phrase"),
top_k: z
.number()
.int()
.min(1)
.max(10)
.default(4)
.describe("Number of results to return"),
source_filter: z
.string()
.optional()
.describe("Optional: restrict search to a specific document source"),
},
async (args) => {
const [queryEmbedding] = await embedTexts([args.query]);
const results = await similaritySearch(
queryEmbedding,
args.top_k,
args.source_filter ? { source: args.source_filter } : undefined
);
if (results.length === 0) {
return {
content: [
{
type: "text",
text: "No relevant documents found for this query. The knowledge base may not contain information on this topic.",
},
],
};
}
const formatted = results
.map(
(r, i) =>
`[${i + 1}] Source: ${r.source} (score: ${r.score.toFixed(3)})\n${r.content}`
)
.join("\n\n---\n\n");
return {
content: [
{
type: "text",
text: `Found ${results.length} relevant passages:\n\n${formatted}`,
},
],
};
}
);
// Tool 2: add a new document at runtime
server.tool(
"index_document",
"Add a new text document to the knowledge base so it can be searched immediately.",
{
source: z.string().describe("A name or identifier for this document"),
content: z.string().min(10).describe("The full text content to index"),
tags: z.array(z.string()).default([]).describe("Optional tags for filtering"),
},
async (args) => {
  // Lazy-load the ingestion helpers once per call to keep server startup fast
  const { chunkText } = await import("./ingestion/chunker.js");
  const { insertChunk } = await import("./store.js");
  const chunks = chunkText(args.content);
  const embeddings = await embedTexts(chunks.map((c) => c.content));
  for (let i = 0; i < chunks.length; i++) {
    await insertChunk({
      source: args.source,
      chunkIndex: chunks[i].index,
      content: chunks[i].content,
      embedding: embeddings[i],
      metadata: { tags: args.tags },
    });
  }
return {
content: [
{
type: "text",
text: `Indexed ${chunks.length} chunks from "${args.source}" successfully.`,
},
],
};
}
);
// Resource: list all indexed sources
server.resource(
"indexed-sources",
"knowledge://sources",
{ description: "List of all documents currently indexed in the knowledge base" },
async () => {
const sources = await listSources();
const text = sources
.map((s) => `${s.source}: ${s.chunkCount} chunks`)
.join("\n");
return {
contents: [
{
uri: "knowledge://sources",
text: sources.length
? `Indexed documents:\n\n${text}`
: "No documents indexed yet.",
},
],
};
}
);
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Knowledge base MCP server running");
Part 8: The RAG Agent in Action
Connect the agent from Part 3 to the knowledge base server and watch it retrieve before it answers:
// Quick test β run the agent against the knowledge base
import { createMcpClient } from "./client.js";
import { runStreamingAgent } from "./streaming-agent.js";
const client = await createMcpClient("node", ["dist/server.js"]);
const { tools } = await client.listTools();
const anthropicTools = tools.map((t) => ({
name: t.name,
description: t.description ?? "",
input_schema: t.inputSchema,
}));
const messages = [
{
role: "user" as const,
content:
"What is the recommended session TTL for the MCP server and how should it be implemented?",
},
];
await runStreamingAgent(client, anthropicTools, messages);
Terminal output:
Agent: Let me search the knowledge base for information on MCP session TTL.
[tool_use] search_knowledge_base({"query":"MCP session TTL Redis implementation","top_k":4})
[result] Found 4 relevant passages:
[1] Source: mcp-guide.md (score: 0.891)
The recommended TTL for MCP sessions is 30 minutes, implemented
as a sliding expiry...
Agent: Based on the documentation, the recommended session TTL is **30 minutes**
with a sliding expiry, meaning the TTL resets on every active request.
This is implemented in Redis using:
await redis.set(KEY(sessionId), JSON.stringify(state), "EX", 1800);
The sliding TTL ensures active sessions never expire while idle sessions
clean themselves up automatically. [Source: mcp-guide.md]
The agent cited the source file and provided the exact implementation from your docs, with no hallucination.
Part 9: Metadata Filtering – Scoped Search
Sometimes you want to search only within a specific document or tag. The source_filter parameter on search_knowledge_base enables this:
// Agent can now narrow search to a specific document
search_knowledge_base({
query: "Docker deployment steps",
top_k: 3,
source_filter: "runbook.md" // only search the runbook
})
For tag-based filtering with pgvector, add a WHERE clause on the JSONB metadata:
-- Filter by tag in pgvector
SELECT id, source, content, metadata,
1 - (embedding <=> $1) AS score
FROM documents
WHERE metadata->'tags' ? 'docker' -- contains tag
ORDER BY embedding <=> $1
LIMIT $2;
For Qdrant, use payload filters:
await client.search(COLLECTION, {
vector: queryEmbedding,
limit: topK,
filter: {
must: [
{ key: "metadata.tags", match: { any: ["docker"] } }
]
},
with_payload: true,
});
Part 10: Production Tips
Chunk size matters more than you think. 512 words is a good default. Too small (under 100 words) and chunks lose context. Too large (over 1000 words) and the embedding averages over too much content, making similarity search less precise. Experiment with your specific documents.
Embed the question and the answer separately. HyDE (Hypothetical Document Embeddings) is a technique where you ask the model to generate a hypothetical answer to the query, embed that hypothetical answer, and search for chunks similar to the answer rather than the question. This dramatically improves recall for questions phrased very differently from the documentation.
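A sketch of the HyDE flow, with the model call and the embedding call injected as functions so the shape is clear without tying it to a specific API (in this project they would wrap a chat-completion call and embedTexts respectively):

```typescript
// HyDE (sketch): embed a hypothetical ANSWER instead of the question,
// because documentation chunks look like answers, not questions.
type Generate = (prompt: string) => Promise<string>;
type Embed = (text: string) => Promise<number[]>;

async function hydeEmbedding(
  query: string,
  generate: Generate,
  embed: Embed
): Promise<number[]> {
  // 1. Ask the model to write a plausible doc passage answering the query
  const hypothetical = await generate(
    `Write a short documentation passage that answers: ${query}`
  );
  // 2. Search with the embedding of that passage, not of the question
  return embed(hypothetical);
}
```

The returned vector is then passed to similaritySearch exactly as before; only the query-embedding step changes.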
Re-rank after retrieval. Cosine similarity is fast but imprecise. After retrieving the top 20 candidates, run a cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-6-v2 via a local model) to reorder them and keep only the top 4. This two-stage approach gives you both speed and precision.
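The two-stage shape can be sketched like this, with the cross-encoder call injected as a function (any async `(query, text) => score` works, whether it wraps a local model or an API):

```typescript
// Two-stage retrieval (sketch): take the over-fetched candidates from the
// fast vector search, re-score each with a slower but more precise scorer,
// and keep only the best few.
interface Candidate {
  content: string;
  score: number; // cosine similarity from stage 1, replaced in stage 2
}

async function retrieveAndRerank(
  candidates: Candidate[],
  query: string,
  rerank: (query: string, text: string) => Promise<number>,
  keep = 4
): Promise<Candidate[]> {
  const rescored = await Promise.all(
    candidates.map(async (c) => ({ ...c, score: await rerank(query, c.content) }))
  );
  return rescored.sort((a, b) => b.score - a.score).slice(0, keep);
}
```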
Track which chunks get cited. Add a citations table that logs which chunk IDs were included in a response. After a week, you can identify which documents are being consulted most β and which were indexed but never retrieved, meaning they may be poorly chunked or the embedding model does not represent them well.
Set a minimum similarity threshold. If the best match scores below 0.6, the knowledge base probably does not contain relevant information. Return a clear "not found" instead of low-quality results that confuse the model:
const results = await similaritySearch(queryEmbedding, topK);
const relevant = results.filter((r) => r.score >= 0.6);
if (relevant.length === 0) {
return { content: [{ type: "text", text: "No relevant information found." }] };
}
Summary
In Part 10 you built a complete RAG pipeline exposed as MCP tools:
- Ingestion pipeline: chunk, embed with text-embedding-3-small, store in pgvector or Qdrant
- Two vector stores: pgvector for SQL-native deployments, Qdrant for dedicated high-throughput search
- `search_knowledge_base` MCP tool: semantic similarity search with metadata filtering
- `index_document` MCP tool: add new content at runtime without restarting the server
- `knowledge://sources` MCP resource: list all indexed documents
- RAG agent: Claude retrieves before it answers, citing sources and avoiding hallucination
- Production tips: chunk size, HyDE, reranking, citation tracking, similarity thresholds
In Part 11 we will harden the knowledge base with per-tenant access control, so that when Client A asks a question, the search only returns chunks that Client A is authorised to see. Multi-tenant RAG with row-level security in pgvector.