Contextual Retrieval-Augmented Generation (RAG) on Cloudflare Workers
Today I am sharing a powerful pattern I’ve implemented on the Cloudflare Developer Platform while building my weekend project: Contextual Retrieval-Augmented Generation (RAG).
Traditional RAG implementations often fall short when presented with nuanced queries or complex documents. Anthropic introduced Contextual Retrieval as an answer to this problem.
In this blog post, I’ll walk through implementing a contextual RAG system entirely on the Cloudflare Developer Platform. You’ll see how combining vector search with full-text search, query rewriting, and smart ranking techniques creates a system that truly understands your content.
What is Contextual RAG
If you’re doing any kind of RAG today, you’re probably:
- Chunking documents
- Storing vectors
- Searching embeddings
- Stuffing the results into your LLM context window
The problem is simple: chunking is often not good enough. When you retrieve a chunk based on cosine similarity, it might be totally out of context and irrelevant to the user query.
In a nutshell, key issues with traditional RAG systems are:
- Context loss: When documents are chunked into smaller pieces, the broader context that gives meaning to each chunk is often lost
- Relevance issues: Vector search alone might retrieve semantically similar content that isn’t actually relevant to the query
- Query limitations: User queries are often ambiguous or lack specificity needed for effective retrieval
Contextual RAG provides a solution to these issues. It consists of prepending a short explanation to every chunk, situating it within the context of the entire document. Every chunk is passed through an LLM to generate this context.
The prompt looks like:
```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```
The result is a short sentence situating the chunk within the context of the entire document. It is prepended to the actual chunk before inserting it into the vector database. This provides a solution to the context loss challenge.
Additionally, the query side of the RAG system is enhanced with full-text search, reciprocal rank fusion, and an LLM reranker.


With this technique the challenges of traditional RAG are addressed by:
- Enhancing chunks with context: Each chunk is augmented with contextual information from the full document
- Using hybrid search: Combining vector similarity with keyword-based search
- Rewriting queries: Expanding and transforming user queries into multiple search variations
- Intelligent ranking: Using sophisticated algorithms to merge and rank results from different search methods
Let’s dive deeper and see how to implement this on Cloudflare Workers.
Project Overview
We’ll be using:
- Cloudflare Workers - Our serverless platform
- Cloudflare D1 - SQL database for document storage
- Cloudflare Vectorize - Vector search engine
- Workers AI - Cloudflare AI platform
- Drizzle ORM - Type-safe database access
- Hono - Lightweight framework for our API routes
Here’s the basic structure of the project:
```
.
├── bindings.ts        # Bindings TypeScript definition
├── bootstrap.sh       # Automate creating the Cloudflare resources
├── drizzle.config.ts  # Database configuration
├── fts5.sql           # Full-text search SQL triggers
├── package.json       # Dependencies and scripts
├── src
│   ├── db
│   │   ├── index.ts   # Database CRUD operations
│   │   ├── schemas.ts # Database schema definition
│   │   └── types.ts   # Database TypeScript types
│   ├── index.ts       # Main Worker code and Durable Object
│   ├── search.ts      # Search functionality
│   ├── utils.ts       # Util functions
│   └── vectorize.ts   # Vectorize operations
├── tsconfig.json      # TypeScript configuration
└── wrangler.json      # Cloudflare Workers configuration
```
The final solution implements the following architecture. The goal of this blog post is to walk through each component individually and then explain how they all work together.


The complete source code for this blog post is available on GitHub. I suggest keeping it open for reference as you follow along.
Bootstrapping the RAG System
We first need to create our cloud resources. Let’s create a `bootstrap.sh` script that creates a Vectorize index and a D1 database:
```bash
#!/bin/bash
set -e

npx wrangler vectorize create contextual-rag-index --dimensions=1024 --metric=cosine
npx wrangler vectorize create-metadata-index contextual-rag-index --property-name=timestamp --type=number

npx wrangler d1 create contextual-rag
```
It uses wrangler to create:
- A Vectorize index for storing embeddings
- A metadata index for timestamp-based filtering
- A D1 database for document storage
Running this script will output a database ID that we will use in our `wrangler.json` and `drizzle.config.ts` files.
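The `drizzle.config.ts` file isn’t reproduced in this post; as a rough sketch (assuming drizzle-kit’s `d1-http` driver, with placeholder environment variable names for the Cloudflare credentials), it could look like this:

```ts
// drizzle.config.ts — a sketch; the credential env var names are placeholders
import { defineConfig } from "drizzle-kit";

export default defineConfig({
  dialect: "sqlite",
  driver: "d1-http",
  schema: "./src/db/schemas.ts",
  out: "./migrations",
  dbCredentials: {
    accountId: process.env.CLOUDFLARE_ACCOUNT_ID!,
    databaseId: process.env.CLOUDFLARE_DATABASE_ID!, // the ID printed by bootstrap.sh
    token: process.env.CLOUDFLARE_D1_TOKEN!,
  },
});
```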
Next, after creating a Node.js project with npm and configuring TypeScript, let’s create our `wrangler.json` file to configure our Cloudflare Worker and connect our resources to it.
1{2 "name": "contextual-rag",3 "compatibility_date": "2024-11-12",4 "workers_dev": true,5 "upload_source_maps": true,6 "observability": {7 "enabled": true8 },9 "main": "./src/index.ts",10 "vectorize": [11 {12 "binding": "VECTORIZE",13 "index_name": "contextual-rag-index"14 }15 ],16 "d1_databases": [17 {18 "binding": "D1",19 "database_name": "contextual-rag",20 "database_id": "<REPLACE WITH YOUR DATABASE ID>"21 }22 ],23 "ai": {24 "binding": "AI"25 },26 "vars": {}27}
Document Ingestion Pipeline
After initializing the project, the first step is to ingest documents we want to retrieve. We are building everything on a Hono application deployed on Cloudflare Workers.
Our API endpoint for document uploads looks like this:
```ts
app.post('/', async (c) => {
  const { contents } = await c.req.json();
  if (!contents || typeof contents !== "string") return c.json({ message: "Bad Request" }, 400)

  const doc = await createDoc(c.env, { contents });

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1024,
    chunkOverlap: 200,
  });
  const raw = await splitter.splitText(contents);
  const chunks = await contextualizeChunks(c.env, contents, raw)
  await insertChunkVectors(c.env, { docId: doc.id, created: doc.created }, chunks);

  return c.json(doc)
});
```
It’s a `POST` endpoint that takes the contents of the document to process. The processing steps are:
- Store the complete document in the database
- Split the document into manageable chunks
- Contextualize each chunk
- Generate vector embeddings for each chunk
- Store the chunks and their embeddings
Text Splitting
We split the documents because LLMs have limited context windows, and retrieving entire documents for every query would be inefficient. By splitting documents into smaller chunks, we can retrieve only the most relevant pieces.
We use the RecursiveCharacterTextSplitter from LangChain, which intelligently splits text based on content boundaries:
```ts
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 200,
});
const raw = await splitter.splitText(contents);
```
The `chunkSize` parameter controls the maximum size of each chunk, while `chunkOverlap` creates an overlap between adjacent chunks. This overlap helps maintain context across chunk boundaries and prevents information from being lost at the dividing points.
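To see the overlap in action, here is a small standalone example. The tiny `chunkSize` and `chunkOverlap` values are only for illustration, and the import path assumes LangChain’s `@langchain/textsplitters` package:

```ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Deliberately tiny sizes so the overlap is easy to spot in the output
const demo = new RecursiveCharacterTextSplitter({ chunkSize: 40, chunkOverlap: 15 });
const parts = await demo.splitText(
  "Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance."
);
// Adjacent chunks share up to ~15 characters, so text cut at a boundary
// is still partially present in both neighbouring chunks.
console.log(parts);
```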
Context Enhancement
This is where the “contextual” part of Contextual RAG happens. Instead of storing raw chunks, we first enhance each chunk with information situating it within the full document:
```ts
export async function contextualizeChunks(env: { AI: Ai }, content: string, chunks: string[]): Promise<string[]> {
  const promises = chunks.map(async c => {

    const prompt = `<document>
${content}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
${c}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else. `;

    // @ts-ignore
    const res = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fast", {
      prompt,
    }) as { response: string }

    return `${res.response}; ${c}`;
  })

  return await Promise.all(promises);
}
```
Without this added context, chunks become isolated islands of information, divorced from the document they came from. For example, a chunk containing “it increases efficiency by 40%” is meaningless without knowing what “it” refers to. By adding contextual information to each chunk, we make them more self-contained and improve retrieval accuracy.
We use an LLM to analyze the relationship between the chunk and the full document, then generate a short context summary that precedes the chunk text. This enhancement makes each chunk more self-contained and improves retrieval relevance.
For example, I tried this with a short text describing Paris. After chunking the text, one of the chunks was:
The city doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions, grand boulevards designed for kings now flooded with delivery bikes and tourists holding maps upside-down. Crumbling stone facades wear ivy and grime like couture. Every corner feels staged, but somehow still effortless, like the city isn’t even trying to impress you.
There is no indication that this chunk is talking about Paris.
The enhanced chunk after running through the LLM is:
The chunk describes the essence and atmosphere of Paris, highlighting its unique blend of beauty, history, and contradictions.; The city doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions, grand boulevards designed for kings now flooded with delivery bikes and tourists holding maps upside-down. Crumbling stone facades wear ivy and grime like couture. Every corner feels staged, but somehow still effortless, like the city isn’t even trying to impress you.
A retrieval query consisting of just the word Paris is more likely to match the enhanced chunk than the raw chunk without context.
Vector Generation and Storage
After contextualizing the chunks, we generate vector embeddings and store them in Cloudflare Vectorize and D1:
```ts
export async function insertChunkVectors(
  env: { D1: D1Database, AI: Ai, VECTORIZE: Vectorize },
  data: { docId: string, created: Date },
  chunks: string[],
) {
  const { docId, created } = data;
  const batchSize = 10;
  const insertPromises = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const chunkBatch = chunks.slice(i, i + batchSize);

    insertPromises.push(
      (async () => {
        const embeddingResult = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
          text: chunkBatch,
        });
        const embeddingBatch: number[][] = embeddingResult.data;

        const chunkInsertResults = await Promise.all(chunkBatch.map(c => createChunk(env, { docId, text: c })))
        const chunkIds = chunkInsertResults.map((result) => result.id);

        await env.VECTORIZE.insert(
          embeddingBatch.map((embedding, index) => ({
            id: chunkIds[index],
            values: embedding,
            metadata: {
              docId,
              chunkId: chunkIds[index],
              text: chunkBatch[index],
              timestamp: created.getTime(),
            },
          }))
        );
      })()
    );
  }

  await Promise.all(insertPromises);
}
```
This function:
- Divides chunks into batches of 10
- Generates embeddings for each batch using Cloudflare AI
- Stores the chunks in the D1 database
- Stores the embeddings and associated metadata in Vectorize
The `createChunk` function inserts each chunk into the D1 database and gives us a unique ID per chunk, which is then shared with Vectorize. The next section gives more details on the D1 database.
The metadata attached to each vector includes the original text, document ID, and timestamp, which enables filtering and improves retrieval performance.
D1 Database Schema
Our RAG system must combine full-text and vector search.
We have two data models to store outside of the vector database: full documents and the chunks derived from them. For this, I’m using Cloudflare D1 (SQLite) with Drizzle ORM.
Here’s the schema definition:
```ts
import { index, integer, sqliteTable, text } from "drizzle-orm/sqlite-core";

export const docs = sqliteTable(
  "docs",
  {
    id: text("id")
      .notNull()
      .primaryKey()
      .$defaultFn(() => randomString()),
    contents: text("contents"),
    created: integer("created", { mode: "timestamp_ms" })
      .$defaultFn(() => new Date())
      .notNull(),
    updated: integer("updated", { mode: "timestamp_ms" })
      .$onUpdate(() => new Date())
      .notNull(),
  },
  (table) => ([
    index("docs.created.idx").on(table.created),
  ]),
);

export const chunks = sqliteTable(
  "chunks",
  {
    id: text("id")
      .notNull()
      .primaryKey()
      .$defaultFn(() => randomString()),
    docId: text('doc_id').notNull(),
    text: text("text").notNull(),
    created: integer("created", { mode: "timestamp_ms" })
      .$defaultFn(() => new Date())
      .notNull(),
  },
  (table) => ([
    index("chunks.doc_id.idx").on(table.docId),
  ]),
);

function randomString(length = 16): string {
  const chars = "abcdefghijklmnopqrstuvwxyz";
  const resultArray = new Array(length);

  for (let i = 0; i < length; i++) {
    const randomIndex = Math.floor(Math.random() * chars.length);
    resultArray[i] = chars[randomIndex];
  }

  return resultArray.join("");
}
```
With these schemas, we can add CRUD functions to read and write documents and chunks in D1.
```ts
export function getDrizzleClient(env: DB) {
  return drizzle(env.D1, {
    schema,
  });
}

export async function createDoc(env: DB, doc: InsertDoc): Promise<Doc> {
  const d1 = getDrizzleClient(env);

  const [res] = await d1
    .insert(docs)
    .values(doc)
    .onConflictDoUpdate({
      target: [docs.id],
      set: doc,
    })
    .returning();

  return res;
}

export async function listDocsByIds(
  env: DB,
  params: { ids: string[] },
): Promise<Doc[]> {
  const d1 = getDrizzleClient(env);

  const qs = await d1
    .select()
    .from(docs)
    .where(inArray(docs.id, params.ids))
  return qs;
}

export async function createChunk(env: DB, chunk: InsertChunk): Promise<Chunk> {
  const d1 = getDrizzleClient(env);

  const [res] = await d1
    .insert(chunks)
    .values(chunk)
    .onConflictDoUpdate({
      target: [chunks.id],
      set: chunk,
    })
    .returning();

  return res;
}
```
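The CRUD helpers above reference types like `DB`, `Doc`, and `InsertChunk` from `types.ts`, which isn’t reproduced in this post. A plausible minimal version, derived from the Drizzle schema using drizzle-orm’s type helpers, might be:

```ts
// types.ts — a sketch inferred from how the helpers above use these types
import type { InferInsertModel, InferSelectModel } from "drizzle-orm";
import { chunks, docs } from "./schemas";

// The bindings the database helpers need (D1Database comes from @cloudflare/workers-types)
export type DB = { D1: D1Database };

export type Doc = InferSelectModel<typeof docs>;
export type InsertDoc = InferInsertModel<typeof docs>;
export type Chunk = InferSelectModel<typeof chunks>;
export type InsertChunk = InferInsertModel<typeof chunks>;
```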
SQLite Full-Text Search
While vector search excels at semantic similarity, it often misses exact keyword matches. Full-text search handles those perfectly and can retrieve relevant content even when the semantic meaning is ambiguous. By combining both approaches, we get the best of both worlds.
SQLite provides a powerful full-text search extension called FTS5. We’ll create a virtual table that mirrors our `chunks` table but with full-text search capabilities:
```sql
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  id UNINDEXED,
  doc_id UNINDEXED,
  text,
  content = 'chunks',
  created
);

CREATE TRIGGER chunks_ai
AFTER INSERT ON chunks BEGIN
  INSERT INTO chunks_fts(id, doc_id, text, created)
  VALUES (new.id, new.doc_id, new.text, new.created);
END;

CREATE TRIGGER chunks_ad
AFTER DELETE ON chunks
FOR EACH ROW
BEGIN
  DELETE FROM chunks_fts WHERE id = old.id;
  INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild');
END;

CREATE TRIGGER chunks_au
AFTER UPDATE ON chunks BEGIN
  DELETE FROM chunks_fts WHERE id = old.id;
  INSERT INTO chunks_fts(id, doc_id, text, created)
  VALUES (new.id, new.doc_id, new.text, new.created);
  INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild');
END;
```
The triggers ensure that our FTS table stays in sync with the main chunks table. Whenever a chunk is inserted, updated, or deleted, the corresponding entry in the FTS table is also modified.
We use DrizzleKit to create migration files for our D1 database with this schema.
```bash
drizzle-kit generate
```
The above SQL for the FTS tables must be included manually in the migration file before running the migration in D1. Otherwise, we will not have full-text search capabilities in our SQLite database.
```bash
drizzle-kit migrate
```
Once the migration is complete, we’re ready to start processing documents on the ingestion side of our RAG system.
Query Processing and Search
Now let’s examine how queries are handled. Our query endpoint looks like this:
```ts
app.post('/query', async (c) => {
  const { prompt, timeframe } = await c.req.json();

  if (!prompt) return c.json({ message: "Bad Request" }, 400);

  const searchOptions: {
    timeframe?: { from?: number; to?: number };
  } = {};

  if (timeframe) {
    searchOptions.timeframe = timeframe;
  }

  const ai = createWorkersAI({ binding: c.env.AI });
  // @ts-ignore
  const model = ai("@cf/meta/llama-3.1-8b-instruct-fast") as LanguageModel;

  const { queries, keywords } = await rewriteToQueries(model, { prompt });

  const { chunks } = await searchDocs(c.env, {
    questions: queries,
    query: prompt,
    keywords,
    topK: 8,
    scoreThreshold: 0.501,
    ...searchOptions,
  });

  const uniques = getUniqueListBy(chunks, "docId").map((r) => {
    const arr = chunks
      .filter((f) => f.docId === r.docId)
      .map((v) => v.score);
    return {
      id: r.docId,
      score: Math.max(...arr),
    };
  });

  const res = await listDocsByIds(c.env, { ids: uniques.map(u => u.id) });
  const answer = await c.env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
    // Interpolate the chunk texts; the raw objects would stringify as "[object Object]"
    prompt: `${prompt}

Context: ${chunks.map((ch) => ch.text).join("\n\n")}`
  })

  return c.json({
    keywords,
    queries,
    chunks,
    answer,
    docs: res.map(doc => ({ ...doc, score: uniques.find(u => u.id === doc.id)?.score || 0 })).sort((a, b) => b.score - a.score)
  })
});

function getUniqueListBy<T extends Record<string, unknown>>(arr: T[], key: keyof T): T[] {
  const result: T[] = [];
  for (const elt of arr) {
    const found = result.find((t) => t[key] === elt[key]);
    if (!found) {
      result.push(elt);
    }
  }
  return result;
}
```
It’s a `POST` endpoint that uses the AI SDK and the Workers AI Provider to complete the following steps:
- Query rewriting: Rewrite the user prompt into multiple related questions and keywords to improve RAG performance
- Hybrid search: Combining vector and text search
- Result fusion and reranking
- Answer generation
Let’s examine each in detail.
Query Rewriting
Users rarely express their needs perfectly on the first try. Their queries are often ambiguous or lack specific keywords that would make retrieval effective. Query rewriting expands the original query into multiple variations:
```ts
export async function rewriteToQueries(model: LanguageModel, params: { prompt: string }): Promise<{ keywords: string[], queries: string[] }> {
  const prompt = `Given the following user message,
rewrite it into 5 distinct queries that could be used to search for relevant information,
and provide additional keywords related to the query.
Each query should focus on different aspects or potential interpretations of the original message.
Each keyword should be derived from an interpretation of the provided user message.

User message: ${params.prompt}`;

  try {
    const res = await generateObject({
      model,
      prompt,
      schema: z.object({
        queries: z.array(z.string()).describe(
          "Similar queries to the user's query. Be concise but comprehensive."
        ),
        keywords: z.array(z.string()).describe(
          "Keywords from the query to use for full-text search"
        ),
      }),
    })

    return res.object;
  } catch (err) {
    return {
      queries: [params.prompt],
      keywords: []
    }
  }
}
```
By generating multiple interpretations of the original query, we can capture different aspects and increase the likelihood of finding relevant information.
The AI model generates:
- A set of expanded queries that explore different interpretations of the user’s question
- Keywords extracted from the query for full-text search
This approach dramatically improves search recall compared to using just the original query.
For example, I sent this prompt:
what’s special about paris?
and it was rewritten as:
1{2 "keywords": [3 "paris",4 "eiffel tower",5 "monet",6 "loire river",7 "montmartre",8 "notre dame",9 "art",10 "history",11 "culture",12 "tourism"13 ],14 "queries": [15 "paris attractions",16 "paris landmarks",17 "paris history",18 "paris culture",19 "paris tourism"20 ]21}
By searching the databases with all these combinations, we increase the likelihood of finding relevant chunks.
Hybrid Search
With our rewritten queries and extracted keywords, we now perform a hybrid search using both vector similarity and full-text search:
```ts
export async function searchDocs(env: SearchBindings, params: DocSearchParams): Promise<{ chunks: Array<{ text: string, id: string, docId: string; score: number }> }> {
  const { timeframe, questions, keywords } = params;

  const [vectors, sql] = (await Promise.all([
    queryChunkVectors(env, { queries: questions, timeframe }),
    searchChunks(env, { needles: keywords, timeframe }),
  ]));

  const searchResults = {
    vectors,
    sql: sql.map(item => {
      return {
        id: item.id,
        text: item.text,
        docId: item.doc_id,
        rank: item.rank,
      }
    })
  };

  const mergedResults = performReciprocalRankFusion(searchResults.sql, searchResults.vectors);
  const res = await processSearchResults(env, params, mergedResults);
  return res;
}
```
Vector search excels at finding semantically similar content but can miss exact keyword matches, while full-text search is great at finding keyword matches but lacks semantic understanding.
By running both search types in parallel and then merging the results, we get more comprehensive coverage.
The vector search looks for embeddings similar to our query embeddings, applying timestamp filters if they are provided:
```ts
export async function queryChunkVectors(
  env: { AI: Ai, VECTORIZE: Vectorize },
  params: { queries: string[], timeframe?: { from?: number, to?: number } }) {
  const { queries, timeframe } = params;
  const queryVectors = await Promise.all(
    queries.map((q) => env.AI.run("@cf/baai/bge-large-en-v1.5", { text: [q] }))
  );

  const filter: VectorizeVectorMetadataFilter = {};
  if (timeframe?.from) {
    // @ts-expect-error error in the package
    filter.timestamp = { "$gte": timeframe.from }
  }
  if (timeframe?.to) {
    // Merge with the "$gte" bound (if any) so that `from` and `to` can both apply
    // @ts-expect-error error in the package
    filter.timestamp = { ...filter.timestamp, "$lt": timeframe.to }
  }

  const results = await Promise.all(
    queryVectors.map((qv) =>
      env.VECTORIZE.query(qv.data[0], {
        topK: 20,
        returnValues: false,
        returnMetadata: "all",
        filter,
      })
    )
  );

  return results;
}
```
The full-text search uses SQLite’s FTS5 to find keyword matches:
```ts
export async function searchChunks(env: DB, params: { needles: string[], timeframe?: { from?: number, to?: number }; }, limit = 40) {
  const d1 = getDrizzleClient(env);

  const { needles, timeframe } = params;
  const queries = needles.filter(Boolean).map(
    (term) => {
      const sanitizedTerm = term.trim().replace(/[^\w\s]/g, '');

      return `
        SELECT chunks.*, bm25(chunks_fts) AS rank
        FROM chunks_fts
        JOIN chunks ON chunks_fts.id = chunks.id
        WHERE chunks_fts MATCH '${sanitizedTerm}'
        ${timeframe?.from ? `AND created > ${timeframe.from}` : ''}
        ${timeframe?.to ? `AND created < ${timeframe.to}` : ''}
        ORDER BY rank
        LIMIT ${limit}
      `;
    }
  );

  const results = await Promise.all(
    queries.map(async (query) => {
      const res = await d1.run(query);
      return res.results as ChunkSearch[];
    })
  );

  return results.flat()
}
```
We’re using the BM25 ranking algorithm (built into FTS5) to sort results by relevance. It considers term frequency, document length, and other factors to determine relevance. Note that FTS5’s bm25() function returns more negative values for better matches, so ordering by rank ascending puts the most relevant chunks first.
Reciprocal Rank Fusion
After getting results from both search methods, we need to merge them. This is where Reciprocal Rank Fusion (RRF) comes in.
It’s a rank aggregation method that combines rankings from multiple sources into a single unified ranking.
When you have multiple ranked lists from different systems, each with their own scoring method, it’s challenging to merge them fairly. RRF provides a principled way to combine these lists by:
- Considering the rank position rather than raw scores (which may not be comparable)
- Giving higher weights to items that appear at high ranks in multiple lists
- Using a constant `k` to mitigate the impact of outliers
The formula gives each item a score based on its rank in each list: `score = 1 / (k + rank)`. Items that appear high in multiple lists get the highest combined scores.
The constant `k` (set to 60 in our implementation) is crucial as it prevents items that appear at the very top of only one list from completely dominating the results. A larger `k` makes the algorithm more conservative, reducing the advantage of top-ranked items and giving more consideration to items further down the lists.
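To make the formula concrete, here is a toy calculation using the same 0-based rank indexing as the implementation below:

```ts
const k = 60;

// A chunk ranked 1st in full-text search (index 0) and 4th in vector search (index 3):
const inBothLists = 1 / (k + 0) + 1 / (k + 3); // ≈ 0.0167 + 0.0159 ≈ 0.0325

// A chunk ranked 1st in only one of the two lists:
const inOneList = 1 / (k + 0); // ≈ 0.0167

console.log(inBothLists > inOneList); // true: agreement across both lists wins
```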
```ts
export function performReciprocalRankFusion(
  fullTextResults: DocMatch[],
  vectorResults: VectorizeMatches[]
): { docId: string, id: string; score: number; text?: string }[] {

  const vectors = uniqueVectorMatches(vectorResults.flatMap(r => r.matches));
  const sql = uniqueDocMatches(fullTextResults);

  const k = 60; // Constant for fusion, can be adjusted
  const scores: { [key: string]: { id: string, text?: string; docId: string, score: number } } = {};

  // Process full-text search results
  sql.forEach((result, index) => {
    const key = result.id;
    const score = 1 / (k + index);
    scores[key] = {
      id: result.id,
      docId: result.docId,
      text: result.text,
      score: (scores[key]?.score || 0) + score,
    };
  });

  // Process vector search results
  vectors.forEach((match, index) => {
    const key = match.id;
    const score = 1 / (k + index);
    scores[key] = {
      id: match.id,
      docId: match.metadata?.docId as string,
      text: match.metadata?.text as string,
      score: (scores[key]?.score || 0) + score,
    };
  });

  const res = Object.entries(scores)
    .map(([key, { id, score, docId, text }]) => ({ docId, id, score, text }))
    .sort((a, b) => b?.score - a?.score);

  return res.slice(0, 150);
}
```
AI Reranking
After merging the results, we use another LLM to rerank them based on their relevance to the original query.
The initial search and fusion steps are based on broader relevance signals. The reranker performs a more focused assessment of whether each result directly answers the user’s question. It uses baai/bge-reranker-base, a reranker model. Reranker models are language models that reorder search results based on relevance to the user query, improving the quality of the RAG pipeline. They take the user query and a list of documents as input, and return the documents ordered from most to least relevant to the query.
```ts
async function processSearchResults(env: SearchBindings, params: DocSearchParams, mergedResults: { docId: string, id: string; score: number; text?: string }[]) {
  if (!mergedResults.length) return { chunks: [] };
  const { query, scoreThreshold, topK } = params;
  const chunks: Array<{ text: string; id: string; docId: string } & { score: number }> = [];

  const response = await env.AI.run(
    "@cf/baai/bge-reranker-base",
    {
      // @ts-ignore
      query,
      contexts: mergedResults.map(r => ({ id: r.id, text: r.text }))
    },
  ) as { response: Array<{ id: number, score: number }> };

  const scores = response.response.map(i => i.score);
  let indices = response.response.map((i, index) => ({ id: i.id, score: sigmoid(scores[index]) }));
  if (scoreThreshold && scoreThreshold > 0) {
    indices = indices.filter(i => i.score >= scoreThreshold);
  }
  if (topK && topK > 0) {
    indices = indices.slice(0, topK)
  }

  const slice = reorderArray(mergedResults, indices.map(i => i.id)).map((v, index) => ({ ...v, score: indices[index]?.score }));

  await Promise.all(slice.map(async result => {
    if (!result) return;
    const a = {
      text: result.text || (await getChunk(env, { docId: result.docId, id: result.id }))?.text || "",
      docId: result.docId,
      id: result.id,
      score: result.score,
    };

    chunks.push(a)
  }));

  return { chunks };
}
```
After reranking, we apply a sigmoid function to transform the raw scores:
```ts
function sigmoid(score: number, k: number = 0.4): number {
  return 1 / (1 + Math.exp(-score / k));
}
```
The reranker’s raw scores can have a wide range and don’t directly translate to a probability of relevance. The sigmoid function has a few compelling characteristics:
- Bounded output range: Sigmoid squashes values into a fixed range (0,1), creating a probability-like score that’s easier to interpret and threshold.
- Non-linear transformation: Sigmoid emphasizes differences in the middle range while compressing extremes, which is ideal for relevance scoring where we need to distinguish between “somewhat relevant” and “very relevant” items.
- Stability with outliers: Extreme scores don’t disproportionately affect the normalized range.
- Ordering: Sigmoid maintains the relative ordering of results.
The parameter `k` controls the steepness of the sigmoid curve: a smaller value creates a sharper distinction between relevant and irrelevant results. The specific value of `k=0.4` was chosen after experimentation to create an effective decision boundary around the 0.5 mark. Results above this threshold are considered sufficiently relevant for inclusion in the final context.
Below is a plot of our sigmoid function:


We pass the reranker score of each chunk through the `sigmoid` function, and because the sigmoid normalises the scores, we can set a fixed threshold. Anything below the threshold can be ignored, as those chunks are likely not relevant to the user query. Without the `sigmoid` function, each ranking would have a different score distribution and it would be difficult to fairly select a threshold score.
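To get a feel for where the 0.5 boundary falls, here are a few sample values of the sigmoid with the same k = 0.4:

```ts
const sig = (score: number, k = 0.4) => 1 / (1 + Math.exp(-score / k));

[-1, -0.2, 0, 0.2, 1].forEach((s) => console.log(s, sig(s).toFixed(3)));
// -1   -> 0.076  (well below the threshold: discarded)
// -0.2 -> 0.378
//  0   -> 0.500  (the inflection point)
//  0.2 -> 0.622
//  1   -> 0.924  (well above the threshold: kept)
```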
Search Parameter Tuning
The `searchDocs` function includes several parameters that significantly impact retrieval quality:
```ts
const { chunks } = await searchDocs(c.env, {
  questions: queries,
  query: prompt,
  keywords,
  topK: 8,
  scoreThreshold: 0.501,
  ...searchOptions,
});
```
The `topK` parameter limits the number of chunks we retrieve, which is essential for:
- Managing the LLM context window size limitations
- Reducing noise in the context that could distract the model
- Minimizing cost and latency
The `scoreThreshold` is calibrated to work with our `sigmoid` normalization. Since the sigmoid transforms scores to the (0,1) range with 0.5 representing the inflection point, setting the threshold just above 0.5 ensures we only include chunks that the reranker determined are more likely relevant than not.
This threshold prevents the inclusion of marginally relevant content that might dilute the quality of our context window.
Answer Generation
Finally, we generate an answer using the retrieved chunks as context:
```ts
const answer = await c.env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
  // Interpolate the chunk texts, not the raw chunk objects
  prompt: `${prompt}

Context: ${chunks.map((ch) => ch.text).join("\n\n")}`
})
```
This step is where the “Generation” part of RAG comes in. The LLM receives both the user’s question and the retrieved context, then generates an answer that draws on that context.
Stitching it all together
The entire system is an integrated pipeline where each component builds upon the previous ones:
- When a query arrives, it first hits the Query Rewriter, which expands it into multiple variations to improve search coverage.
- These expanded queries, alongside extracted keywords, simultaneously feed into two separate search systems:
  - Vector Search (semantic similarity using embeddings)
  - Full-Text Search (keyword matching using FTS5)
- The results from both search methods then enter the Reciprocal Rank Fusion function, which combines them based on rank position. This solves the challenge of comparing scores from fundamentally different systems.
- The fused results are then passed to the AI Reranker, which performs an LLM-based relevance assessment focused on answering the original query.
- The reranked results are then filtered by score threshold and count limits before being passed into the context window for the final LLM.
- The LLM receives both the original query and the curated context to produce the final answer.
This multi-stage approach creates a system that outperforms traditional RAG systems.
Testing It Out
Now let’s see how our contextual RAG system works in practice. After deploying your Worker with wrangler (`npx wrangler deploy`), you’ll get a URL for your Worker:
```
https://contextual-rag.account-name.workers.dev
```
where `account-name` is the name of your Cloudflare account.
Here are some example API calls:
Uploading a document
```bash
curl -X POST https://contextual-rag.account-name.workers.dev/ \
  -H "Content-Type: application/json" \
  -d '{"contents":"Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions — grand boulevards designed for kings now flooded with delivery bikes and tourists holding maps upside-down. Crumbling stone facades wear ivy and grime like couture. Every corner feels staged, but somehow still effortless, like the city isn’t even trying to impress you. It just is. Each arrondissement spins its own little universe. In the Marais, tiny alleys drip with charm — cafés packed so tightly you can hear every clink of every espresso cup. Montmartre clings to its hill like a stubborn old cat... <REDACTED>"}'

# Response:
# {
#   "id": "jtmnofvveptdalwl",
#   "contents": "Paris doesn’t shout. It smirks. It moves with a kind of...",
#   "created": "2025-04-26T19:50:39.941Z",
#   "updated": "2025-04-26T19:50:39.941Z",
# }
```
Querying
```bash
curl -X POST https://contextual-rag.account-name.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is special about Paris?"}'

# Response:
# {
#   "keywords": [
#     "paris",
#     "eiffel tower",
#     "monet",
#     "loire river",
#     "montmartre",
#     "notre dame",
#     "art",
#     "history",
#     "culture",
#     "tourism"
#   ],
#   "queries": [
#     "paris attractions",
#     "paris landmarks",
#     "paris history",
#     "paris culture",
#     "paris tourism"
#   ],
#   "chunks": [
#     {
#       "text": "This chunk describes Paris, the capital of France, its location on the Seine River, and its famous landmarks and characteristics.; Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions...",
#       "id": "abcdefghijklmno",
#       "docId": "qwertyuiopasdfg",
#       "score": 0.8085310556071287
#     }
#   ],
#   "answer": "Paris, the capital of France, is known as the City of Light. It has been a hub for intellectuals and artists for centuries, and its stunning architecture, art museums, and romantic atmosphere make it one of the most popular tourist destinations in the world. Some of the most famous landmarks in Paris include the Eiffel Tower, the Louvre Museum...",
#   "docs": [
#     {
#       "id": "jtmnofvveptdalwl",
#       "contents": "Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions...",
#       "created": "2025-04-26T19:50:39.941Z",
#       "updated": "2025-04-26T19:50:39.941Z",
#       "score": 0.8085310556071287
#     }
#   ]
# }
```
Limitations and Considerations
As powerful as this contextual RAG system is, there are several limitations and considerations to keep in mind:
- AI Costs: The contextual enhancement process requires running each chunk through an LLM, which increases both the computation time and cost compared to traditional RAG systems. For very large document collections, this can become a significant consideration.
- Latency Trade-offs: Adding multiple stages of processing (query rewriting, hybrid search, reranking) improves result quality but increases end-to-end latency. For applications where response time is critical, you might need to optimize certain components or make trade-offs.
- Storage Growth: The contextualized chunks are longer than raw chunks, requiring more storage in both D1 and Vectorize. It’s important to monitor the size of your databases and remain below limits.
- Rate Limits: All components of the pipeline have rate limits, which can affect high-throughput applications. Consider offloading document ingestion to a queue in production.
- Context Window Limitations: When contextualizing chunks, very long documents may exceed the context window of the LLM. You might need to implement a hierarchical approach for extremely large documents.
On Basebrain, I’ve addressed some of these challenges by using Cloudflare Workflows for document ingestion, and Durable Objects instead of D1 for storage, where I implement one database per user.