Contextual Retrieval-Augmented Generation (RAG) on Cloudflare Workers
Today I am sharing a powerful pattern I’ve implemented on the Cloudflare Developer Platform while building my weekend project: Contextual Retrieval-Augmented Generation (RAG).
Traditional RAG implementations often fall short when presented with nuanced queries or complex documents. Anthropic introduced Contextual Retrieval as an answer to this problem.
In this blog post, I’ll walk through implementing a contextual RAG system entirely on the Cloudflare Developer Platform. You’ll see how combining vector search with full-text search, query rewriting, and smart ranking techniques creates a system that truly understands your content.
What is Contextual RAG
If you’re doing any kind of RAG today, you’re probably:
- Chunking documents
- Storing vectors
- Searching embeddings
- Stuffing the results into your LLM context window
The problem is simple: chunking is often not good enough. When you retrieve a chunk based on cosine similarity, it might be totally out of context and irrelevant to the user query.
In a nutshell, key issues with traditional RAG systems are:
- Context loss: When documents are chunked into smaller pieces, the broader context that gives meaning to each chunk is often lost
- Relevance issues: Vector search alone might retrieve semantically similar content that isn’t actually relevant to the query
- Query limitations: User queries are often ambiguous or lack specificity needed for effective retrieval
Contextual RAG provides a solution to these issues. It consists of prepending a short explanation to every chunk, situating it within the context of the entire document. Every chunk is passed through an LLM to generate this context.
The prompt looks like:
```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```
The result is a short sentence situating the chunk within the context of the entire document. It is prepended to the actual chunk before inserting it into the vector database. This provides a solution to the context loss challenge.
Additionally, the query side of the RAG system is enhanced with full-text search, reciprocal rank fusion, and an LLM reranker.


With this technique the challenges of traditional RAG are addressed by:
- Enhancing chunks with context: Each chunk is augmented with contextual information from the full document
- Using hybrid search: Combining vector similarity with keyword-based search
- Rewriting queries: Expanding and transforming user queries into multiple search variations
- Intelligent ranking: Using sophisticated algorithms to merge and rank results from different search methods
Let’s dive deeper and see how to implement this on Cloudflare Workers.
Project Overview
We’ll be using:
- Cloudflare Workers - Our serverless platform
- Cloudflare D1 - SQL database for document storage
- Cloudflare Vectorize - Vector search engine
- Workers AI - Cloudflare AI platform
- Drizzle ORM - Type-safe database access
- Hono - Lightweight framework for our API routes
Here’s the basic structure of the project:
```
.
├── bindings.ts        # Bindings TypeScript definition
├── bootstrap.sh       # Automate creating the Cloudflare resources
├── drizzle.config.ts  # Database configuration
├── fts5.sql           # Full-text search SQL triggers
├── package.json       # Dependencies and scripts
├── src
│   ├── db
│   │   ├── index.ts   # Database CRUD operations
│   │   ├── schemas.ts # Database schema definition
│   │   └── types.ts   # Database TypeScript types
│   ├── index.ts       # Main Worker code and Durable Object
│   ├── search.ts      # Search functionality
│   ├── utils.ts       # Util functions
│   └── vectorize.ts   # Vectorize operations
├── tsconfig.json      # TypeScript configuration
└── wrangler.json      # Cloudflare Workers configuration
```
The final solution implements the following architecture. The goal of this blog post is to walk through each component individually and then explain how they all work together.


The complete source code for this blog post is available on GitHub. I suggest keeping it open for reference as you follow along.
Bootstrapping the RAG System
We first need to create our cloud resources. Let’s create a `bootstrap.sh` script that creates a Vectorize index and a D1 database:
```bash
#!/bin/bash
set -e

npx wrangler vectorize create contextual-rag-index --dimensions=1024 --metric=cosine
npx wrangler vectorize create-metadata-index contextual-rag-index --property-name=timestamp --type=number

npx wrangler d1 create contextual-rag
```
It uses wrangler to create:
- A Vectorize index for storing embeddings
- A metadata index for timestamp-based filtering
- A D1 database for document storage
Running this script will output a database ID that we will use in our `wrangler.json` and `drizzle.config.ts` files.
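The `drizzle.config.ts` file isn’t reproduced in this post; as a rough sketch (assuming drizzle-kit’s `d1-http` driver, with placeholder environment variable names for the Cloudflare credentials), it could look like this:

```ts
// drizzle.config.ts — a sketch; the credential env var names are placeholders
import { defineConfig } from "drizzle-kit";

export default defineConfig({
  dialect: "sqlite",
  driver: "d1-http",
  schema: "./src/db/schemas.ts",
  out: "./migrations",
  dbCredentials: {
    accountId: process.env.CLOUDFLARE_ACCOUNT_ID!,
    databaseId: process.env.CLOUDFLARE_DATABASE_ID!, // the ID printed by bootstrap.sh
    token: process.env.CLOUDFLARE_D1_TOKEN!,
  },
});
```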
Next, after creating a Node.js project with npm and configuring TypeScript, let’s create our `wrangler.json` file to configure our Cloudflare Worker and connect our resources to it.
1{2 "name": "contextual-rag",3 "compatibility_date": "2024-11-12",4 "workers_dev": true,5 "upload_source_maps": true,6 "observability": {7 "enabled": true8 },9 "main": "./src/index.ts",10 "vectorize": [11 {12 "binding": "VECTORIZE",13 "index_name": "contextual-rag-index"14 }15 ],16 "d1_databases": [17 {18 "binding": "D1",19 "database_name": "contextual-rag",20 "database_id": "<REPLACE WITH YOUR DATABASE ID>"21 }22 ],23 "ai": {24 "binding": "AI"25 },26 "vars": {}27}
Document Ingestion Pipeline
After initializing the project, the first step is to ingest documents we want to retrieve. We are building everything on a Hono application deployed on Cloudflare Workers.
Our API endpoint for document uploads looks like this:
```ts
app.post('/', async (c) => {
  const { contents } = await c.req.json();
  if (!contents || typeof contents !== "string") return c.json({ message: "Bad Request" }, 400)

  const doc = await createDoc(c.env, { contents });

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1024,
    chunkOverlap: 200,
  });
  const raw = await splitter.splitText(contents);
  const chunks = await contextualizeChunks(c.env, contents, raw)
  await insertChunkVectors(c.env, { docId: doc.id, created: doc.created }, chunks);

  return c.json(doc)
});
```
It’s a `POST` endpoint that takes the contents of the document to process. The processing steps are:
- Store the complete document in the database
- Split the document into manageable chunks
- Contextualize each chunk
- Generate vector embeddings for each chunk
- Store the chunks and their embeddings
Text Splitting
We split the documents because LLMs have limited context windows, and retrieving entire documents for every query would be inefficient. By splitting documents into smaller chunks, we can retrieve only the most relevant pieces.
We use the RecursiveCharacterTextSplitter from LangChain, which intelligently splits text based on content boundaries:
```ts
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 200,
});
const raw = await splitter.splitText(contents);
```
The `chunkSize` parameter controls the maximum size of each chunk, while `chunkOverlap` creates an overlap between adjacent chunks. This overlap helps maintain context across chunk boundaries and prevents information from being lost at the dividing points.
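To see the overlap in action, here is a small standalone example. The tiny `chunkSize` and `chunkOverlap` values are only for illustration, and the import path assumes LangChain’s `@langchain/textsplitters` package:

```ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Deliberately tiny sizes so the overlap is easy to spot in the output
const demo = new RecursiveCharacterTextSplitter({ chunkSize: 40, chunkOverlap: 15 });
const parts = await demo.splitText(
  "Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance."
);
// Adjacent chunks share up to ~15 characters, so text cut at a boundary
// is still partially present in both neighbouring chunks.
console.log(parts);
```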
Context Enhancement
This is where the “contextual” part of Contextual RAG happens. Instead of storing raw chunks, we first enhance each chunk with information situating it within the full document:
```ts
export async function contextualizeChunks(env: { AI: Ai }, content: string, chunks: string[]): Promise<string[]> {
  const promises = chunks.map(async c => {

    const prompt = `<document>
${content}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
${c}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else. `;

    // @ts-ignore
    const res = await env.AI.run("@cf/meta/llama-3.1-8b-instruct-fast", {
      prompt,
    }) as { response: string }

    return `${res.response}; ${c}`;
  })

  return await Promise.all(promises);
}
```
Without this added context, chunks become isolated islands of information, divorced from the document they came from. For example, a chunk containing “it increases efficiency by 40%” is meaningless without knowing what “it” refers to. By adding contextual information to each chunk, we make them more self-contained and improve retrieval accuracy.
We use an LLM to analyze the relationship between the chunk and the full document, then generate a short context summary that precedes the chunk text. This enhancement makes each chunk more self-contained and improves retrieval relevance.
For example, I tried this with a short text describing Paris. After chunking the text, one of the chunks was:
The city doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions, grand boulevards designed for kings now flooded with delivery bikes and tourists holding maps upside-down. Crumbling stone facades wear ivy and grime like couture. Every corner feels staged, but somehow still effortless, like the city isn’t even trying to impress you.
There is no indication that this chunk is talking about Paris.
The enhanced chunk after running through the LLM is:
The chunk describes the essence and atmosphere of Paris, highlighting its unique blend of beauty, history, and contradictions.; The city doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions, grand boulevards designed for kings now flooded with delivery bikes and tourists holding maps upside-down. Crumbling stone facades wear ivy and grime like couture. Every corner feels staged, but somehow still effortless, like the city isn’t even trying to impress you.
A retrieval query consisting of just the word Paris is more likely to match the enhanced chunk than the raw chunk without context.
Vector Generation and Storage
After contextualizing the chunks, we generate vector embeddings and store them in Cloudflare Vectorize and D1:
```ts
export async function insertChunkVectors(
  env: { D1: D1Database, AI: Ai, VECTORIZE: Vectorize },
  data: { docId: string, created: Date },
  chunks: string[],
) {
  const { docId, created } = data;
  const batchSize = 10;
  const insertPromises = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const chunkBatch = chunks.slice(i, i + batchSize);

    insertPromises.push(
      (async () => {
        const embeddingResult = await env.AI.run("@cf/baai/bge-large-en-v1.5", {
          text: chunkBatch,
        });
        const embeddingBatch: number[][] = embeddingResult.data;

        const chunkInsertResults = await Promise.all(chunkBatch.map(c => createChunk(env, { docId, text: c })))
        const chunkIds = chunkInsertResults.map((result) => result.id);

        await env.VECTORIZE.insert(
          embeddingBatch.map((embedding, index) => ({
            id: chunkIds[index],
            values: embedding,
            metadata: {
              docId,
              chunkId: chunkIds[index],
              text: chunkBatch[index],
              timestamp: created.getTime(),
            },
          }))
        );
      })()
    );
  }

  await Promise.all(insertPromises);
}
```
This function:
- Divides chunks into batches of 10
- Generates embeddings for each batch using Cloudflare AI
- Stores the chunks in the D1 database
- Stores the embeddings and associated metadata in Vectorize
The `createChunk` function inserts each chunk into the D1 database and gives us a unique ID per chunk, which is then shared with Vectorize. The next section gives more details on the D1 database.
The metadata attached to each vector includes the original text, document ID, and timestamp, which enables filtering and improves retrieval performance.
D1 Database Schema
Our RAG system must combine full-text and vector search.
We have two data models to store outside of the vector database: full documents and the chunks derived from them. For this, I’m using Cloudflare D1 (SQLite) with Drizzle ORM.
Here’s the schema definition:
```ts
import { index, integer, sqliteTable, text } from "drizzle-orm/sqlite-core";

export const docs = sqliteTable(
  "docs",
  {
    id: text("id")
      .notNull()
      .primaryKey()
      .$defaultFn(() => randomString()),
    contents: text("contents"),
    created: integer("created", { mode: "timestamp_ms" })
      .$defaultFn(() => new Date())
      .notNull(),
    updated: integer("updated", { mode: "timestamp_ms" })
      .$onUpdate(() => new Date())
      .notNull(),
  },
  (table) => ([
    index("docs.created.idx").on(table.created),
  ]),
);

export const chunks = sqliteTable(
  "chunks",
  {
    id: text("id")
      .notNull()
      .primaryKey()
      .$defaultFn(() => randomString()),
    docId: text('doc_id').notNull(),
    text: text("text").notNull(),
    created: integer("created", { mode: "timestamp_ms" })
      .$defaultFn(() => new Date())
      .notNull(),
  },
  (table) => ([
    index("chunks.doc_id.idx").on(table.docId),
  ]),
);

function randomString(length = 16): string {
  const chars = "abcdefghijklmnopqrstuvwxyz";
  const resultArray = new Array(length);

  for (let i = 0; i < length; i++) {
    const randomIndex = Math.floor(Math.random() * chars.length);
    resultArray[i] = chars[randomIndex];
  }

  return resultArray.join("");
}
```
With these schemas, we can add CRUD functions to read and write documents and chunks in D1.
```ts
export function getDrizzleClient(env: DB) {
  return drizzle(env.D1, {
    schema,
  });
}

export async function createDoc(env: DB, doc: InsertDoc): Promise<Doc> {
  const d1 = getDrizzleClient(env);

  const [res] = await d1
    .insert(docs)
    .values(doc)
    .onConflictDoUpdate({
      target: [docs.id],
      set: doc,
    })
    .returning();

  return res;
}

export async function listDocsByIds(
  env: DB,
  params: { ids: string[] },
): Promise<Doc[]> {
  const d1 = getDrizzleClient(env);

  const qs = await d1
    .select()
    .from(docs)
    .where(inArray(docs.id, params.ids))
  return qs;
}

export async function createChunk(env: DB, chunk: InsertChunk): Promise<Chunk> {
  const d1 = getDrizzleClient(env);

  const [res] = await d1
    .insert(chunks)
    .values(chunk)
    .onConflictDoUpdate({
      target: [chunks.id],
      set: chunk,
    })
    .returning();

  return res;
}
```
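The CRUD helpers above reference types like `DB`, `Doc`, and `InsertChunk` from `types.ts`, which isn’t reproduced in this post. A plausible minimal version, derived from the Drizzle schema using drizzle-orm’s type helpers, might be:

```ts
// types.ts — a sketch inferred from how the helpers above use these types
import type { InferInsertModel, InferSelectModel } from "drizzle-orm";
import { chunks, docs } from "./schemas";

// The bindings the database helpers need (D1Database comes from @cloudflare/workers-types)
export type DB = { D1: D1Database };

export type Doc = InferSelectModel<typeof docs>;
export type InsertDoc = InferInsertModel<typeof docs>;
export type Chunk = InferSelectModel<typeof chunks>;
export type InsertChunk = InferInsertModel<typeof chunks>;
```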
SQLite Full-Text Search
While vector search excels at semantic similarity, it often misses exact keyword matches. Full-text search handles those perfectly and can retrieve relevant content even when the semantic meaning is ambiguous. By combining both approaches, we get the best of both worlds.
SQLite provides a powerful full-text search extension called FTS5. We’ll create a virtual table that mirrors our `chunks` table but with full-text search capabilities:
```sql
CREATE VIRTUAL TABLE chunks_fts USING fts5(
  id UNINDEXED,
  doc_id UNINDEXED,
  text,
  content = 'chunks',
  created
);

CREATE TRIGGER chunks_ai
AFTER INSERT ON chunks BEGIN
  INSERT INTO chunks_fts(id, doc_id, text, created)
  VALUES (new.id, new.doc_id, new.text, new.created);
END;

CREATE TRIGGER chunks_ad
AFTER DELETE ON chunks
FOR EACH ROW
BEGIN
  DELETE FROM chunks_fts WHERE id = old.id;
  INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild');
END;

CREATE TRIGGER chunks_au
AFTER UPDATE ON chunks BEGIN
  DELETE FROM chunks_fts WHERE id = old.id;
  INSERT INTO chunks_fts(id, doc_id, text, created)
  VALUES (new.id, new.doc_id, new.text, new.created);
  INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild');
END;
```
The triggers ensure that our FTS table stays in sync with the main chunks table. Whenever a chunk is inserted, updated, or deleted, the corresponding entry in the FTS table is also modified.
We use DrizzleKit to create migration files for our D1 database with this schema.
```bash
drizzle-kit generate
```
The above SQL for the FTS tables must be included manually in the migration file before running the migration in D1. Otherwise, we will not have full-text search capabilities in our SQLite database.
```bash
drizzle-kit migrate
```
Once the migration is complete, we’re ready to start processing documents on the ingestion side of our RAG system.
Query Processing and Search
Now let’s examine how queries are handled. Our query endpoint looks like this:
```ts
app.post('/query', async (c) => {
  const { prompt, timeframe } = await c.req.json();

  if (!prompt) return c.json({ message: "Bad Request" }, 400);

  const searchOptions: {
    timeframe?: { from?: number; to?: number };
  } = {};

  if (timeframe) {
    searchOptions.timeframe = timeframe;
  }

  const ai = createWorkersAI({ binding: c.env.AI });
  // @ts-ignore
  const model = ai("@cf/meta/llama-3.1-8b-instruct-fast") as LanguageModel;

  const { queries, keywords } = await rewriteToQueries(model, { prompt });

  const { chunks } = await searchDocs(c.env, {
    questions: queries,
    query: prompt,
    keywords,
    topK: 8,
    scoreThreshold: 0.501,
    ...searchOptions,
  });

  const uniques = getUniqueListBy(chunks, "docId").map((r) => {
    const arr = chunks
      .filter((f) => f.docId === r.docId)
      .map((v) => v.score);
    return {
      id: r.docId,
      score: Math.max(...arr),
    };
  });

  const res = await listDocsByIds(c.env, { ids: uniques.map(u => u.id) });
  const answer = await c.env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
    // Interpolate the chunk texts; the raw objects would stringify as "[object Object]"
    prompt: `${prompt}

Context: ${chunks.map((ch) => ch.text).join("\n\n")}`
  })

  return c.json({
    keywords,
    queries,
    chunks,
    answer,
    docs: res.map(doc => ({ ...doc, score: uniques.find(u => u.id === doc.id)?.score || 0 })).sort((a, b) => b.score - a.score)
  })
});

function getUniqueListBy<T extends Record<string, unknown>>(arr: T[], key: keyof T): T[] {
  const result: T[] = [];
  for (const elt of arr) {
    const found = result.find((t) => t[key] === elt[key]);
    if (!found) {
      result.push(elt);
    }
  }
  return result;
}
```
It’s a `POST` endpoint that uses the AI SDK and the Workers AI Provider to complete the following steps:
- Query rewriting: Rewrite the user prompt into multiple related questions and keywords to improve RAG performance
- Hybrid search: Combining vector and text search
- Result fusion and reranking
- Answer generation
Let’s examine each in detail.
Query Rewriting
Users rarely express their needs perfectly on the first try. Their queries are often ambiguous or lack specific keywords that would make retrieval effective. Query rewriting expands the original query into multiple variations:
```ts
export async function rewriteToQueries(model: LanguageModel, params: { prompt: string }): Promise<{ keywords: string[], queries: string[] }> {
  const prompt = `Given the following user message,
rewrite it into 5 distinct queries that could be used to search for relevant information,
and provide additional keywords related to the query.
Each query should focus on different aspects or potential interpretations of the original message.
Each keyword should be derived from an interpretation of the provided user message.

User message: ${params.prompt}`;

  try {
    const res = await generateObject({
      model,
      prompt,
      schema: z.object({
        queries: z.array(z.string()).describe(
          "Similar queries to the user's query. Be concise but comprehensive."
        ),
        keywords: z.array(z.string()).describe(
          "Keywords from the query to use for full-text search"
        ),
      }),
    })

    return res.object;
  } catch (err) {
    return {
      queries: [params.prompt],
      keywords: []
    }
  }
}
```
By generating multiple interpretations of the original query, we can capture different aspects and increase the likelihood of finding relevant information.
The AI model generates:
- A set of expanded queries that explore different interpretations of the user’s question
- Keywords extracted from the query for full-text search
This approach dramatically improves search recall compared to using just the original query.
For example, I sent this prompt:
what’s special about paris?
and it was rewritten as:
1{2 "keywords": [3 "paris",4 "eiffel tower",5 "monet",6 "loire river",7 "montmartre",8 "notre dame",9 "art",10 "history",11 "culture",12 "tourism"13 ],14 "queries": [15 "paris attractions",16 "paris landmarks",17 "paris history",18 "paris culture",19 "paris tourism"20 ]21}
By searching the databases with all these combinations, we increase the likelihood of finding relevant chunks.
Hybrid Search
With our rewritten queries and extracted keywords, we now perform a hybrid search using both vector similarity and full-text search:
```ts
export async function searchDocs(env: SearchBindings, params: DocSearchParams): Promise<{ chunks: Array<{ text: string, id: string, docId: string; score: number }> }> {
  const { timeframe, questions, keywords } = params;

  const [vectors, sql] = (await Promise.all([
    queryChunkVectors(env, { queries: questions, timeframe }),
    searchChunks(env, { needles: keywords, timeframe }),
  ]));

  const searchResults = {
    vectors,
    sql: sql.map(item => {
      return {
        id: item.id,
        text: item.text,
        docId: item.doc_id,
        rank: item.rank,
      }
    })
  };

  const mergedResults = performReciprocalRankFusion(searchResults.sql, searchResults.vectors);
  const res = await processSearchResults(env, params, mergedResults);
  return res;
}
```
Vector search excels at finding semantically similar content but can miss exact keyword matches, while full-text search is great at finding keyword matches but lacks semantic understanding.
By running both search types in parallel and then merging the results, we get more comprehensive coverage.
The vector search looks for embeddings similar to our query embeddings, applying timestamp filters if they are provided:
```ts
export async function queryChunkVectors(
  env: { AI: Ai, VECTORIZE: Vectorize },
  params: { queries: string[], timeframe?: { from?: number, to?: number } }) {
  const { queries, timeframe } = params;
  const queryVectors = await Promise.all(
    queries.map((q) => env.AI.run("@cf/baai/bge-large-en-v1.5", { text: [q] }))
  );

  const filter: VectorizeVectorMetadataFilter = {};
  if (timeframe?.from) {
    // @ts-expect-error error in the package
    filter.timestamp = { "$gte": timeframe.from }
  }
  if (timeframe?.to) {
    // Merge with the "$gte" bound (if any) so that `from` and `to` can both apply
    // @ts-expect-error error in the package
    filter.timestamp = { ...filter.timestamp, "$lt": timeframe.to }
  }

  const results = await Promise.all(
    queryVectors.map((qv) =>
      env.VECTORIZE.query(qv.data[0], {
        topK: 20,
        returnValues: false,
        returnMetadata: "all",
        filter,
      })
    )
  );

  return results;
}
```
The full-text search uses SQLite’s FTS5 to find keyword matches:
```ts
export async function searchChunks(env: DB, params: { needles: string[], timeframe?: { from?: number, to?: number }; }, limit = 40) {
  const d1 = getDrizzleClient(env);

  const { needles, timeframe } = params;
  const queries = needles.filter(Boolean).map(
    (term) => {
      const sanitizedTerm = term.trim().replace(/[^\w\s]/g, '');

      return `
        SELECT chunks.*, bm25(chunks_fts) AS rank
        FROM chunks_fts
        JOIN chunks ON chunks_fts.id = chunks.id
        WHERE chunks_fts MATCH '${sanitizedTerm}'
        ${timeframe?.from ? `AND created > ${timeframe.from}` : ''}
        ${timeframe?.to ? `AND created < ${timeframe.to}` : ''}
        ORDER BY rank
        LIMIT ${limit}
      `;
    }
  );

  const results = await Promise.all(
    queries.map(async (query) => {
      const res = await d1.run(query);
      return res.results as ChunkSearch[];
    })
  );

  return results.flat()
}
```
We’re using the BM25 ranking algorithm (built into FTS5) to sort results by relevance. It considers term frequency, document length, and other factors to determine relevance. Note that FTS5’s bm25() function returns more negative values for better matches, so ordering by rank ascending puts the most relevant chunks first.
Reciprocal Rank Fusion
After getting results from both search methods, we need to merge them. This is where Reciprocal Rank Fusion (RRF) comes in.
It’s a rank aggregation method that combines rankings from multiple sources into a single unified ranking.
When you have multiple ranked lists from different systems, each with their own scoring method, it’s challenging to merge them fairly. RRF provides a principled way to combine these lists by:
- Considering the rank position rather than raw scores (which may not be comparable)
- Giving higher weights to items that appear at high ranks in multiple lists
- Using a constant `k` to mitigate the impact of outliers
The formula gives each item a score based on its rank in each list: `score = 1 / (k + rank)`. Items that appear high in multiple lists get the highest combined scores.
The constant `k` (set to 60 in our implementation) is crucial as it prevents items that appear at the very top of only one list from completely dominating the results. A larger `k` makes the algorithm more conservative, reducing the advantage of top-ranked items and giving more consideration to items further down the lists.
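To make the formula concrete, here is a toy calculation using the same 0-based rank indexing as the implementation below:

```ts
const k = 60;

// A chunk ranked 1st in full-text search (index 0) and 4th in vector search (index 3):
const inBothLists = 1 / (k + 0) + 1 / (k + 3); // ≈ 0.0167 + 0.0159 ≈ 0.0325

// A chunk ranked 1st in only one of the two lists:
const inOneList = 1 / (k + 0); // ≈ 0.0167

console.log(inBothLists > inOneList); // true: agreement across both lists wins
```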
```ts
export function performReciprocalRankFusion(
  fullTextResults: DocMatch[],
  vectorResults: VectorizeMatches[]
): { docId: string, id: string; score: number; text?: string }[] {

  const vectors = uniqueVectorMatches(vectorResults.flatMap(r => r.matches));
  const sql = uniqueDocMatches(fullTextResults);

  const k = 60; // Constant for fusion, can be adjusted
  const scores: { [key: string]: { id: string, text?: string; docId: string, score: number } } = {};

  // Process full-text search results
  sql.forEach((result, index) => {
    const key = result.id;
    const score = 1 / (k + index);
    scores[key] = {
      id: result.id,
      docId: result.docId,
      text: result.text,
      score: (scores[key]?.score || 0) + score,
    };
  });

  // Process vector search results
  vectors.forEach((match, index) => {
    const key = match.id;
    const score = 1 / (k + index);
    scores[key] = {
      id: match.id,
      docId: match.metadata?.docId as string,
      text: match.metadata?.text as string,
      score: (scores[key]?.score || 0) + score,
    };
  });

  const res = Object.entries(scores)
    .map(([key, { id, score, docId, text }]) => ({ docId, id, score, text }))
    .sort((a, b) => b?.score - a?.score);

  return res.slice(0, 150);
}
```
AI Reranking
After merging the results, we use another LLM to rerank them based on their relevance to the original query.
The initial search and fusion steps are based on broader relevance signals. The reranker performs a more focused assessment of whether each result directly answers the user’s question. It uses baai/bge-reranker-base, a reranker model. Reranker models are language models that reorder search results based on relevance to the user query, improving the quality of the RAG pipeline. They take the user query and a list of documents as input, and return the documents ordered from most to least relevant to the query.
```ts
async function processSearchResults(env: SearchBindings, params: DocSearchParams, mergedResults: { docId: string, id: string; score: number; text?: string }[]) {
  if (!mergedResults.length) return { chunks: [] };
  const { query, scoreThreshold, topK } = params;
  const chunks: Array<{ text: string; id: string; docId: string } & { score: number }> = [];

  const response = await env.AI.run(
    "@cf/baai/bge-reranker-base",
    {
      // @ts-ignore
      query,
      contexts: mergedResults.map(r => ({ id: r.id, text: r.text }))
    },
  ) as { response: Array<{ id: number, score: number }> };

  const scores = response.response.map(i => i.score);
  let indices = response.response.map((i, index) => ({ id: i.id, score: sigmoid(scores[index]) }));
  if (scoreThreshold && scoreThreshold > 0) {
    indices = indices.filter(i => i.score >= scoreThreshold);
  }
  if (topK && topK > 0) {
    indices = indices.slice(0, topK)
  }

  const slice = reorderArray(mergedResults, indices.map(i => i.id)).map((v, index) => ({ ...v, score: indices[index]?.score }));

  await Promise.all(slice.map(async result => {
    if (!result) return;
    const a = {
      text: result.text || (await getChunk(env, { docId: result.docId, id: result.id }))?.text || "",
      docId: result.docId,
      id: result.id,
      score: result.score,
    };

    chunks.push(a)
  }));

  return { chunks };
}
```
After reranking, we apply a sigmoid function to transform the raw scores:
```ts
function sigmoid(score: number, k: number = 0.4): number {
  return 1 / (1 + Math.exp(-score / k));
}
```
The reranker’s raw scores can have a wide range and don’t directly translate to a probability of relevance. The sigmoid function has a few compelling characteristics:
- Bounded output range: Sigmoid squashes values into a fixed range (0,1), creating a probability-like score that’s easier to interpret and threshold.
- Non-linear transformation: Sigmoid emphasizes differences in the middle range while compressing extremes, which is ideal for relevance scoring where we need to distinguish between “somewhat relevant” and “very relevant” items.
- Stability with outliers: Extreme scores don’t disproportionately affect the normalized range.
- Ordering: Sigmoid maintains the relative ordering of results.
The parameter `k` controls the steepness of the sigmoid curve: a smaller value creates a sharper distinction between relevant and irrelevant results. The specific value of `k=0.4` was chosen after experimentation to create an effective decision boundary around the 0.5 mark. Results above this threshold are considered sufficiently relevant for inclusion in the final context.
Below is a plot of our sigmoid function:


We pass the reranker score of each chunk through the `sigmoid` function, and because the sigmoid normalises the scores, we can set a fixed threshold. Anything below the threshold can be ignored, as those chunks are likely not relevant to the user query. Without the `sigmoid` function, each ranking would have a different score distribution and it would be difficult to fairly select a threshold score.
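To get a feel for where the 0.5 boundary falls, here are a few sample values of the sigmoid with the same k = 0.4:

```ts
const sig = (score: number, k = 0.4) => 1 / (1 + Math.exp(-score / k));

[-1, -0.2, 0, 0.2, 1].forEach((s) => console.log(s, sig(s).toFixed(3)));
// -1   -> 0.076  (well below the threshold: discarded)
// -0.2 -> 0.378
//  0   -> 0.500  (the inflection point)
//  0.2 -> 0.622
//  1   -> 0.924  (well above the threshold: kept)
```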
Search Parameter Tuning
The `searchDocs` function includes several parameters that significantly impact retrieval quality:
```ts
const { chunks } = await searchDocs(c.env, {
  questions: queries,
  query: prompt,
  keywords,
  topK: 8,
  scoreThreshold: 0.501,
  ...searchOptions,
});
```
The `topK` parameter limits the number of chunks we retrieve, which is essential for:
- Managing the LLM context window size limitations
- Reducing noise in the context that could distract the model
- Minimizing cost and latency
The `scoreThreshold` is calibrated to work with our `sigmoid` normalization. Since the sigmoid transforms scores to the (0,1) range with 0.5 representing the inflection point, setting the threshold just above 0.5 ensures we only include chunks that the reranker determined are more likely relevant than not.
This threshold prevents the inclusion of marginally relevant content that might dilute the quality of our context window.
Answer Generation
Finally, we generate an answer using the retrieved chunks as context:
```ts
const answer = await c.env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
  // Interpolate the chunk texts, not the raw chunk objects
  prompt: `${prompt}

Context: ${chunks.map((ch) => ch.text).join("\n\n")}`
})
```
This step is where the “Generation” part of RAG comes in. The LLM receives both the user’s question and the retrieved context, then generates an answer that draws on that context.
Stitching it all together
The entire system is an integrated pipeline where each component builds upon the previous ones:
- When a query arrives, it first hits the Query Rewriter, which expands it into multiple variations to improve search coverage.
- These expanded queries, alongside extracted keywords, simultaneously feed into two separate search systems:
  - Vector Search (semantic similarity using embeddings)
  - Full-Text Search (keyword matching using FTS5)
- The results from both search methods then enter the Reciprocal Rank Fusion function, which combines them based on rank position. This solves the challenge of comparing scores from fundamentally different systems.
- The fused results are then passed to the AI Reranker, which performs an LLM-based relevance assessment focused on answering the original query.
- The reranked results are then filtered by score threshold and count limits before being passed into the context window for the final LLM.
- The LLM receives both the original query and the curated context to produce the final answer.
This multi-stage approach creates a system that outperforms traditional RAG systems.
Testing It Out
Now let’s see how our contextual RAG system works in practice. After deploying your Worker with wrangler (`npx wrangler deploy`), you’ll get a URL for your Worker:
```
https://contextual-rag.account-name.workers.dev
```
where `account-name` is the name of your Cloudflare account.
Here are some example API calls:
Uploading a document
```bash
curl -X POST https://contextual-rag.account-name.workers.dev/ \
  -H "Content-Type: application/json" \
  -d '{"contents":"Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions — grand boulevards designed for kings now flooded with delivery bikes and tourists holding maps upside-down. Crumbling stone facades wear ivy and grime like couture. Every corner feels staged, but somehow still effortless, like the city isn’t even trying to impress you. It just is. Each arrondissement spins its own little universe. In the Marais, tiny alleys drip with charm — cafés packed so tightly you can hear every clink of every espresso cup. Montmartre clings to its hill like a stubborn old cat... <REDACTED>"}'

# Response:
# {
#   "id": "jtmnofvveptdalwl",
#   "contents": "Paris doesn’t shout. It smirks. It moves with a kind of...",
#   "created": "2025-04-26T19:50:39.941Z",
#   "updated": "2025-04-26T19:50:39.941Z",
# }
```
Querying
```bash
curl -X POST https://contextual-rag.account-name.workers.dev/query \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What is special about Paris?"}'

# Response:
# {
#   "keywords": [
#     "paris",
#     "eiffel tower",
#     "monet",
#     "loire river",
#     "montmartre",
#     "notre dame",
#     "art",
#     "history",
#     "culture",
#     "tourism"
#   ],
#   "queries": [
#     "paris attractions",
#     "paris landmarks",
#     "paris history",
#     "paris culture",
#     "paris tourism"
#   ],
#   "chunks": [
#     {
#       "text": "This chunk describes Paris, the capital of France, its location on the Seine River, and its famous landmarks and characteristics.; Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions...",
#       "id": "abcdefghijklmno",
#       "docId": "qwertyuiopasdfg",
#       "score": 0.8085310556071287
#     }
#   ],
#   "answer": "Paris, the capital of France, is known as the City of Light. It has been a hub for intellectuals and artists for centuries, and its stunning architecture, art museums, and romantic atmosphere make it one of the most popular tourist destinations in the world. Some of the most famous landmarks in Paris include the Eiffel Tower, the Louvre Museum...",
#   "docs": [
#     {
#       "id": "jtmnofvveptdalwl",
#       "contents": "Paris doesn’t shout. It smirks. It moves with a kind of practiced nonchalance, a shrug that says, of course it’s beautiful here. It’s a city built for daydreams and contradictions...",
#       "created": "2025-04-26T19:50:39.941Z",
#       "updated": "2025-04-26T19:50:39.941Z",
#       "score": 0.8085310556071287
#     }
#   ]
# }
```
Limitations and Considerations
As powerful as this contextual RAG system is, there are several limitations and considerations to keep in mind:
- AI Costs: The contextual enhancement process requires running each chunk through an LLM, which increases both the computation time and cost compared to traditional RAG systems. For very large document collections, this can become a significant consideration.
- Latency Trade-offs: Adding multiple stages of processing (query rewriting, hybrid search, reranking) improves result quality but increases end-to-end latency. For applications where response time is critical, you might need to optimize certain components or make trade-offs.
- Storage Growth: The contextualized chunks are longer than raw chunks, requiring more storage in both D1 and Vectorize. It’s important to monitor the size of your databases and remain below limits.
- Rate Limits: All components of the pipeline have rate limits, which can affect high-throughput applications. Consider offloading document ingestion to a queue in production.
- Context Window Limitations: When contextualizing chunks, very long documents may exceed the context window of the LLM. You might need to implement a hierarchical approach for extremely large documents.
On Basebrain, I’ve addressed some of these challenges by using Cloudflare Workflows for document ingestion, and Durable Objects instead of D1 for storage, where I implement one database per user.