B4. RAG Essentials: Chunking, Embeddings, and Hybrid Retrieval

About this article

This article is part of Building with Claude — A Practitioner's Guide to the Anthropic API, a study-notes-plus-commentary series based on Anthropic's official "Building with the Claude API" course (hosted on Coursera) and the public Anthropic API documentation at docs.anthropic.com.

Original course and documentation material is © Anthropic. Direct quotes are cited inline. Commentary, code adaptations, and examples are © DataMy. This series is independent and not affiliated with or endorsed by Anthropic.

Companion notebook: B4_rag_essentials.ipynb
Setup: see README.md
Datasets: data/runbook_warehouse_cost.md, data/runbook_data_quality.md, data/qbr_q3_2025.md
Requires: VOYAGE_API_KEY in .env (free tier at voyageai.com)

Why retrieval augmentation exists

There is a hard limit on how much text you can stuff into a single Claude call. Even with a 200,000-token context window, the economics get uncomfortable quickly: sending a 500-page knowledge base on every call is slow and expensive. More importantly, most questions only need a small slice of that knowledge base — the rest is noise.

Retrieval-augmented generation (RAG) is the architectural answer. Instead of sending everything to the model, you:

Index your knowledge base offline (once, or on update).
Retrieve only the relevant fragments at query time.
Inject those fragments into the model's context alongside the question.
Generate an answer grounded in what was retrieved.

The Anthropic documentation frames this as a core pattern for handling large corpora:

"RAG allows you to provide Claude with relevant information from large databases or knowledge bases that wouldn't fit in a single context window." — Anthropic API Docs, "Retrieval augmented generation (RAG)".

The pipeline sounds simple. The practitioner skill is in the details: how you split documents into retrievable units, how you represent them numerically, and how you combine different retrieval signals to find the right chunk even when the question is phrased nothing like the answer. That is what this article is about.

B5 covers the advanced layer — reranking and contextual retrieval — that you add once the basic pipeline is working. Build the foundation here first.

The RAG pipeline at a glance

Every RAG implementation is a variation on the same six-stage loop:

Documents
    |
    v
[1] CHUNK        Split each document into retrievable units
    |
    v
[2] EMBED        Encode each chunk as a dense vector
    |
    v
[3] INDEX        Store chunks + vectors in a searchable structure
    |
    v  (at query time)
[4] RETRIEVE     Find the chunks most relevant to the user's question
    |
    v
[5] INJECT       Assemble retrieved chunks into the model's context
    |
    v
[6] GENERATE     Claude answers, grounded in the retrieved context

Stages 1–3 are offline (or run on document update). Stages 4–6 happen on every user query.

The companion notebook builds all six stages from scratch using three synthetic documents: a Snowflake cost runbook, a data quality runbook, and a Q3 QBR report. No vector database required — the index is just a Python list, which is the right starting point for understanding what is actually happening before you add infrastructure.

1. Chunking: the first and most important decision

Chunking is the process of splitting a document into smaller, independently retrievable units. It sounds mechanical but has a large effect on retrieval quality: chunks that are too small lose context; chunks that are too large dilute relevance and approach the "send everything" problem you were trying to avoid.

Fixed-size chunking with overlap

The simplest strategy: split on word or character count, with a sliding window of overlap between adjacent chunks.

def chunk_fixed(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word-count chunks with overlap."""
    if not (0 <= overlap < chunk_size):
        raise ValueError(
            f"overlap must satisfy 0 <= overlap < chunk_size, "
            f"got overlap={overlap}, chunk_size={chunk_size}"
        )
    words = text.split()
    chunks, start = [], 0
    step = chunk_size - overlap  # how far the window advances each iteration
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            # The window reached the end; stopping here avoids emitting a
            # trailing chunk that only repeats the previous chunk's overlap.
            break
        start += step
    return chunks

The overlap prevents a relevant sentence from falling exactly on a chunk boundary and being split across two chunks, neither of which retrieves well. A typical overlap is 10–15% of the chunk size; the defaults above (50 of 400 words) give 12.5%.
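To make the window arithmetic concrete, the sketch below computes only the word offsets of each window rather than the chunk text. The helper names chunk_spans and expected_chunk_count are illustrative and not part of the notebook; the sketch assumes the window stops as soon as it reaches the end of the text.

```python
import math

def chunk_spans(n_words: int, chunk_size: int = 400, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) word offsets for each fixed-size window."""
    if not (0 <= overlap < chunk_size):
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    spans, start = [], 0
    step = chunk_size - overlap  # each window starts `step` words after the last
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:  # this window already covers the tail of the text
            break
        start += step
    return spans

def expected_chunk_count(n_words: int, chunk_size: int = 400, overlap: int = 50) -> int:
    """Closed-form chunk count that matches chunk_spans for the same arguments."""
    if n_words == 0:
        return 0
    return max(1, math.ceil((n_words - overlap) / (chunk_size - overlap)))

print(chunk_spans(1000))           # [(0, 400), (350, 750), (700, 1000)]
print(expected_chunk_count(1000))  # 3
```

Each window starts 350 words after the previous one, so words 350–399 and 700–749 appear in two chunks each. A larger overlap lowers the step and raises the chunk count, which is the indexing cost you pay for boundary safety.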

Use when: the document has no meaningful structure (e.g., a long narrative, a legal contract with no headers). Also useful as a baseline to benchmark against structured strategies.

Avoid when: the document has meaningful structural divisions (Markdown headers, numbered sections) — fixed splitting will break semantic units arbitrarily.

Section-boundary chunking

For structured documents (runbooks, wikis, API documentation, meeting notes), splitting on natural boundaries — Markdown headers, numbered sections, paragraph breaks — preserves semantic coherence.

import re

def chunk_by_section(text: str, max_words: int = 600) -> list[str]:
    """Split on ## headers; further split sections that exceed max_words."""
    raw_sections = re.split(r'\n(?=## )', text)
    chunks = []
    for section in raw_sections:
        words = section.split()
        if len(words) <= max_words:
            chunks.append(section)
        else:
            # Split long sections at paragraph boundaries
            paras = section.split('\n\n')
            current, current_words = [], 0
            for para in paras:
                pw = len(para.split())
                if current_words + pw > max_words and current:
                    chunks.append('\n\n'.join(current))
                    current, current_words = [], 0
                current.append(para)
                current_words += pw
            if current:
                chunks.append('\n\n'.join(current))
    return [c for c in chunks if c.strip()]

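A section header travels with the content beneath it, so a chunk that starts at a header is self-describing. When the model reads "## 4. Diagnosis playbook — Step 2: check if the failure is new or recurring", it knows where it is without needing the surrounding document context. The exception is a long section that the function above splits at paragraph boundaries: only the first piece keeps the header, so the continuation pieces lose that anchor unless you re-attach it.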

Use when: documents are structured with headers. Runbooks, wikis, technical documentation, meeting transcripts with agenda sections.

Avoid when: documents are flat prose without clear divisions.
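When a long section is split at paragraph boundaries, only the first piece naturally carries the header. One possible way to keep every piece self-describing is to repeat the header line at the top of each continuation chunk. The following is a minimal sketch of that idea; the function name split_section_keep_header is hypothetical, and it assumes the section begins with a "## " header line.

```python
def split_section_keep_header(section: str, max_words: int = 600) -> list[str]:
    """Split one '## ' section at paragraph breaks, repeating its header
    line at the top of every continuation chunk."""
    paras = [p for p in section.split("\n\n") if p.strip()]
    if not paras:
        return []
    first_line = paras[0].strip().splitlines()[0]
    # Only treat the first line as a header if it looks like one
    header = first_line if first_line.startswith("## ") else None

    chunks, current, count = [], [], 0
    for para in paras:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            # Start the next chunk with the header so it stays self-describing
            current = [header] if header else []
            count = len(header.split()) if header else 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Made-up section: one header followed by five 50-word paragraphs
body = "\n\n".join(" ".join(["word"] * 50) for _ in range(5))
pieces = split_section_keep_header("## 4. Diagnosis playbook\n\n" + body, max_words=120)
print(len(pieces), [p.splitlines()[0] for p in pieces])
```

The repeated header costs a few words per chunk and slightly duplicates keyword matches for header terms, so it is worth checking retrieval results before and after adopting it.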

Semantic chunking

The most sophisticated strategy: embed the text sentence by sentence, then identify split points where the embedding similarity between adjacent sentences drops below a threshold. The result is chunks that follow the actual topic shifts in the text rather than an arbitrary structural rule.

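# Conceptual sketch -- not in the notebook (it requires per-sentence embeddings).
# Uses cosine_sim, defined in the "Cosine similarity" section below.
def chunk_semantic(sentences: list[str], embeddings: list[list[float]],
                   similarity_threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []  # nothing to chunk; also avoids indexing sentences[0]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence with the one immediately before it
        sim = cosine_sim(embeddings[i-1], embeddings[i])
        if sim < similarity_threshold:
            # Similarity dropped: close the current chunk and start a new one
            chunks.append(' '.join(current))
            current = []
        current.append(sentences[i])
    if current:
        chunks.append(' '.join(current))
    return chunks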

The trade-off: semantic chunking tends to produce the most coherent chunks, but it requires embedding every sentence during indexing, which is expensive for large corpora even when the sentences are sent in batches. It suits long-form documents with frequent topic shifts (research papers, multi-section reports) and is harder to justify for structured runbooks that already have explicit boundaries.
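To see the threshold rule in action without calling an embedding API, the toy example below uses hand-made 2-D vectors in place of real sentence embeddings. The sentences, vectors, and helper names (cosine, split_points, group_sentences) are all invented for illustration; real embeddings have hundreds or thousands of dimensions and will need a threshold tuned to the model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def split_points(embeddings: list[list[float]], threshold: float = 0.7) -> list[int]:
    """Indices i at which sentence i starts a new chunk."""
    return [i for i in range(1, len(embeddings))
            if cosine(embeddings[i - 1], embeddings[i]) < threshold]

def group_sentences(sentences: list[str], embeddings: list[list[float]],
                    threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []
    breaks = set(split_points(embeddings, threshold))
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if i in breaks:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Invented sentences; the first two vectors point one way, the last two another
sentences = [
    "Warehouse costs rose in May.",
    "Credits doubled week over week.",
    "The null-rate check failed on Tuesday.",
    "Row counts also dropped.",
]
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

print(split_points(toy))  # [2]
for chunk in group_sentences(sentences, toy):
    print(repr(chunk))
```

The adjacent-pair similarities here are roughly 0.99, 0.31, and 0.99, so the only drop below 0.7 falls between the second and third sentences.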

Choosing chunk size

There is no universal right answer, but some practical anchors:

Chunk size (tokens)	Typical use case
100–200	Precise fact retrieval (Q&A over a dense FAQ)
300–600	General-purpose (most runbooks, wiki pages, meeting notes)
600–1,200	Context-heavy retrieval (code files, legal documents where paragraph context matters)

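Larger chunks retrieve more context per hit but reduce retrieval precision (a chunk about five topics returns five topics when you asked about one). Start at around 400 tokens with a 50-token overlap and measure retrieval quality before tuning. Note that chunk_fixed above counts whitespace-separated words, not tokens; English prose usually runs to somewhat more than one token per word, so its default of 400 words sits toward the upper end of the general-purpose band in the table.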

2. Embeddings: turning text into vectors

An embedding model converts a text chunk into a dense vector of floating-point numbers — a point in a high-dimensional space where semantically similar texts land near each other. Retrieval then becomes a nearest-neighbour search: find the chunks whose vectors are closest to the vector of the user's question.

For this series, we use VoyageAI's voyage-3 model. VoyageAI is Anthropic's recommended embedding partner for Claude-based RAG applications and offers a free tier that covers experimentation.

"Anthropic recommends Voyage AI for embeddings. Voyage AI builds embedding models optimized for retrieval performance." — Anthropic API Docs, "Embeddings".

The API is straightforward:

import voyageai

vc = voyageai.Client()  # reads VOYAGE_API_KEY from environment

# Embed a batch of document chunks
result = vc.embed(
    [chunk["text"] for chunk in all_chunks],
    model="voyage-3",
    input_type="document",    # use "query" for the user's question at retrieval time
)
embeddings = result.embeddings  # list of float lists, one per chunk

Note input_type: VoyageAI treats documents (being indexed) and queries (being searched) differently, prepending a different instruction prompt to the text before encoding it. Using the right type for each improves retrieval quality. Always embed chunks with input_type="document" and embed the user query with input_type="query".
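The snippets in this article assume a list called all_chunks whose items carry "text", "source", and "embedding" fields. A minimal sketch of how such a list might be assembled is shown below. The function build_index, the batch_size value, and the fake_embed stand-in are assumptions made for illustration; the embedding function is passed in so the sketch runs offline, and in practice you would check your provider's per-request batch limits.

```python
from typing import Callable

EmbedFn = Callable[[list[str]], list[list[float]]]

def build_index(docs: dict[str, str],
                chunker: Callable[[str], list[str]],
                embed_fn: EmbedFn,
                batch_size: int = 64) -> list[dict]:
    """Chunk every document, embed the chunks in batches, return chunk records."""
    records = [
        {"source": name, "text": chunk}
        for name, text in docs.items()
        for chunk in chunker(text)
    ]
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        vectors = embed_fn([r["text"] for r in batch])
        if len(vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input text")
        for rec, vec in zip(batch, vectors):
            rec["embedding"] = vec
    return records

def fake_embed(texts: list[str]) -> list[list[float]]:
    """Offline stand-in for a real embedding call (NOT semantically meaningful)."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

# With the Voyage client from this article, embed_fn could instead be:
#   lambda texts: vc.embed(texts, model="voyage-3", input_type="document").embeddings
docs = {"a.md": "one two three four", "b.md": "five six"}
index = build_index(docs, chunker=lambda t: [t], embed_fn=fake_embed, batch_size=1)
print([(r["source"], r["embedding"]) for r in index])
```

Passing the embedder as a parameter also makes it easy to swap models later and re-run the same indexing code.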

Cosine similarity

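The standard similarity measure for embedding retrieval is cosine similarity: the cosine of the angle between two vectors. It ranges from -1 (opposite directions) to 1 (same direction) and ignores vector length. Semantically similar pairs often land somewhere around 0.7–0.95, but the typical range depends on the embedding model, so compare scores within one model rather than against a fixed cut-off.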

import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + 1e-10)

def vector_search(query: str, all_chunks: list, k: int = 5) -> list:
    q_emb = vc.embed([query], model="voyage-3", input_type="query").embeddings[0]
    scored = [(cosine_sim(q_emb, c["embedding"]), i) for i, c in enumerate(all_chunks)]
    scored.sort(reverse=True)
    return [(all_chunks[i], score) for score, i in scored[:k]]

For a few hundred chunks this pure-Python scan is fast enough. It slows down linearly with corpus size, though: by the tens of thousands of chunks you would either hold the vectors in a NumPy matrix and score them with one matrix-vector product, or move the embeddings into a vector database (Pinecone, Weaviate, pgvector, Chroma) and use approximate nearest-neighbour search. The logic above stays the same; only the storage and search mechanism changes.
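Before reaching for extra infrastructure, one cheap improvement to the brute-force scan is to normalise every vector once at index time. With unit-length vectors, cosine similarity reduces to a plain dot product, and heapq.nlargest avoids sorting the whole score list. The sketch below uses only the standard library; the function names and the tiny vectors are illustrative.

```python
import heapq
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k_normalized(q_unit: list[float], unit_vectors: list[list[float]],
                     k: int = 5) -> list[tuple[float, int]]:
    """Return (score, index) pairs, best first.

    With unit-length inputs, the dot product equals cosine similarity.
    """
    scored = ((sum(x * y for x, y in zip(q_unit, v)), i)
              for i, v in enumerate(unit_vectors))
    return heapq.nlargest(k, scored)

# Normalise once when building the index...
unit_vectors = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
# ...and once per query at search time
result = top_k_normalized(normalize([2.0, 0.1]), unit_vectors, k=2)
print([i for _, i in result])  # [0, 2]
```

This removes two square roots and two extra passes per comparison, but it is still a Python-level loop; the larger speed-up comes from vectorised or approximate search.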

3. BM25: keyword search as the essential baseline

Embedding retrieval is powerful but has a well-known failure mode: it misses exact-match keywords. Ask "what happened in incident 2025-05-18?" and the embedding model may return semantically adjacent text about cost spikes rather than the specific incident, because "2025-05-18" is just a date string with no semantic weight.

BM25 (Best Match 25) is a classical information-retrieval algorithm that scores documents based on term frequency and inverse document frequency — the same family of signals that powered search engines before neural embeddings existed. It is excellent at exact keyword matches and terrible at semantic paraphrase. That makes it complementary to embedding retrieval, not a substitute for it.

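import string
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, strip punctuation from token edges."""
    # Without the strip, "2025-05-18?" in a query would not match "2025-05-18"
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t]

# Build the index once
tokenized_corpus = [tokenize(c["text"]) for c in all_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_search(query: str, k: int = 5) -> list:
    tokens = tokenize(query)
    scores = bm25.get_scores(tokens)
    top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(all_chunks[i], scores[i]) for i in top_idx]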

BM25 needs no API key, no external service, and no GPU. It is the right baseline to build first, and the right component to keep in production even after you add embeddings.
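To show what "term frequency and inverse document frequency" mean in practice, here is a from-scratch sketch of BM25 scoring using only the standard library. It is not the rank_bm25 implementation: it uses the variant of the IDF term with +1 inside the logarithm (which keeps scores non-negative), so absolute scores may differ from the library's. The parameter values k1=1.5 and b=0.75 are common defaults, and the three-document corpus is made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens: list[str], corpus_tokens: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Longer-than-average documents are penalised, controlled by b
        length_norm = 1 - b + b * (len(doc) / avg_len if avg_len else 0.0)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms (low df) get a high IDF weight
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates: the 10th occurrence adds less than the 1st
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    "warehouse credits spiked during the incident".split(),
    "incident 2025-05-18 was caused by a runaway query".split(),
    "the data quality checks run nightly".split(),
]
scores = bm25_scores("incident 2025-05-18".split(), corpus)
print(max(range(len(scores)), key=scores.__getitem__))  # 1
```

The second document wins because it contains both query terms and "2025-05-18" appears in only one document, which gives it a higher IDF weight than the more common "incident".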

4. Hybrid retrieval with Reciprocal Rank Fusion

Running both retrieval methods and combining their results usually outperforms either method alone. The combination strategy matters: you cannot simply add raw scores, because BM25 scores are unbounded while cosine similarities fall between -1 and 1, so the two have no common scale.

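Reciprocal Rank Fusion (RRF) solves this by discarding the raw scores and using only each result's rank. Each result gets a score of 1 / (k + rank) from each retriever that returned it (where k is a smoothing constant, typically 60, that damps the advantage of the very top ranks), and the scores are summed. This k is unrelated to the k that means "top-k" elsewhere in the code. The method is scale-invariant and simple to implement: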

def rrf_score(rank: int, k: int = 60) -> float:
    return 1.0 / (k + rank)

def hybrid_search(query: str, k: int = 5, fetch: int = 20) -> list:
    """Combine vector and BM25 results via Reciprocal Rank Fusion."""
    vec_results  = vector_search(query, all_chunks, k=fetch)
    bm25_results = bm25_search(query, k=fetch)

    # Precomputed position map: avoids O(n) list scan and handles duplicate chunk text
    _idx = {id(c): i for i, c in enumerate(all_chunks)}
    vec_ranks  = {_idx[id(c)]: r + 1 for r, (c, _) in enumerate(vec_results)}
    bm25_ranks = {_idx[id(c)]: r + 1 for r, (c, _) in enumerate(bm25_results)}

    # Score all candidates that appeared in either list
    candidates = set(vec_ranks) | set(bm25_ranks)
    rrf = {}
    for idx in candidates:
        rrf[idx] = (rrf_score(vec_ranks[idx]) if idx in vec_ranks else 0.0) + \
                   (rrf_score(bm25_ranks[idx]) if idx in bm25_ranks else 0.0)

    top = sorted(rrf, key=lambda i: rrf[i], reverse=True)[:k]
    return [(all_chunks[i], rrf[i]) for i in top]

Why this combination works: embedding retrieval finds semantically related chunks that use different words than the query; BM25 finds chunks that contain the exact terms in the query. Questions with named entities, dates, product names, or identifiers tend to be poorly served by embeddings and well served by BM25. Questions that paraphrase the document are the reverse. A hybrid retriever covers both failure modes simultaneously.

The fetch parameter controls how many candidates each retriever brings to the fusion step. Fetching 20 from each and fusing to top 5 is a typical starting configuration.
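A small worked example may help show how rank fusion behaves. The sketch below fuses two made-up rankings of chunk ids; rrf_fuse is an illustrative helper that works on ids rather than on the chunk dictionaries used elsewhere in this article.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

vector_ranking = ["a", "b", "c"]  # made-up embedding results, best first
bm25_ranking = ["c", "a", "d"]    # made-up BM25 results, best first

fused = rrf_fuse([vector_ranking, bm25_ranking])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Chunks "a" and "c" appear in both lists, so each collects two contributions and outranks "b" and "d", which appear in only one. Between the two, "a" wins narrowly: its ranks are 1 and 2 (1/61 + 1/62), against ranks 3 and 1 for "c" (1/63 + 1/61).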

5. The full RAG loop: retrieve, inject, generate

With chunking, embedding, BM25, and hybrid retrieval in place, the full loop is a straightforward composition:

def rag_answer(question: str, k: int = 4) -> tuple[str, list]:
    retrieved = hybrid_search(question, k=k)

    # Assemble the retrieved context with source attribution
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}"
        for c, _ in retrieved
    )

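    # cc is the series' shared client helper from the notebook setup; it is
    # assumed here to expose an Anthropic client and a default model name.
    resp = cc.client.messages.create(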
        model=cc.default_model,
        max_tokens=800,
        temperature=0,
        system=(
            "You are a data platform assistant for Acme SaaS Co. "
            "Answer questions using ONLY the context provided. "
            "Cite the source document name when referencing specific facts. "
            "If the answer is not in the context, say so explicitly."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n\n{context}\n\nQuestion: {question}",
        }],
    )
    return resp.content[0].text, retrieved

Three things to note in this implementation:

Source attribution in the context block. Each chunk is prefixed with its source document name. This lets Claude cite the source in its answer and lets you verify retrieval quality by checking whether the cited source makes sense for the question.

Explicit grounding instruction. The system prompt tells the model to use ONLY the provided context and to say so explicitly if the answer is not there. Without this, Claude will draw on its training knowledge to fill gaps — usually helpful in an open-ended assistant, but undesirable in a RAG application where you want answers traceable to your specific corpus.

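Temperature zero. Factual retrieval answers should be as repeatable as possible. Setting temperature to 0 minimises sampling variation, although the output is still not guaranteed to be fully deterministic. Creativity adds noise, not value, when the model is reading from a provided document.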
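The context assembly step can be pulled out into its own function so that it can be tested without calling the model. The sketch below keeps the same "[Source: ...]" format and adds two safeguards that the loop above does not have: it skips chunks with identical text and stops before exceeding a size budget. The function name build_context and the word-based max_words budget are assumptions for illustration; a production version would more likely budget in tokens.

```python
def build_context(retrieved: list[tuple[dict, float]], max_words: int = 1500) -> str:
    """Join retrieved chunks into one context string, best-ranked first.

    Skips chunks whose text was already included and stops before the
    word budget is exceeded (the top-ranked chunk is always kept).
    """
    parts, seen, used = [], set(), 0
    for chunk, _score in retrieved:
        text = chunk["text"].strip()
        if not text or text in seen:
            continue
        n = len(text.split())
        if parts and used + n > max_words:
            break
        seen.add(text)
        parts.append(f"[Source: {chunk['source']}]\n{text}")
        used += n
    return "\n\n---\n\n".join(parts)

# Made-up retrieval results: (chunk, score) pairs as returned by hybrid_search
retrieved = [
    ({"source": "a.md", "text": "one two three"}, 0.9),
    ({"source": "a.md", "text": "one two three"}, 0.8),   # duplicate text
    ({"source": "b.md", "text": "four five six seven"}, 0.7),
    ({"source": "c.md", "text": "eight"}, 0.6),
]
print(build_context(retrieved, max_words=5))
```

With a budget of five words, only the first chunk fits: the duplicate is skipped and the third chunk would push the total to seven words, so assembly stops there.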

6. Practical corpus design

The notebook uses three documents as the retrieval corpus. A few observations from building it that apply to real corpora:

Thematic coherence aids retrieval quality. The three documents (cost runbook, data quality runbook, QBR) all describe the same fictional company's data platform. Questions about a specific incident ("what happened on 2025-05-18?") have an unambiguous answer in the corpus. Diverse corpora with topic overlap require more careful chunking to avoid retrieval confusion.

Document-level metadata belongs on every chunk. Each chunk in the notebook carries a "source" field (the document name). In production, extend this to include the document's last-modified date, the section path, and any categorical tags. Metadata enables filtered retrieval ("only search the Q3 report") and attribution in the generated answer.
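Filtered retrieval can be as simple as narrowing the candidate list before scoring. The sketch below assumes each chunk carries the "source" field used in the notebook plus a hypothetical "last_modified" field holding a datetime.date; the function name, field name, and sample dates are invented for illustration.

```python
from datetime import date
from typing import Optional

def filter_chunks(chunks: list[dict],
                  sources: Optional[set[str]] = None,
                  modified_since: Optional[date] = None) -> list[dict]:
    """Keep only chunks that match the given metadata constraints."""
    kept = []
    for c in chunks:
        if sources is not None and c.get("source") not in sources:
            continue
        if modified_since is not None:
            modified = c.get("last_modified")  # hypothetical metadata field
            # Chunks with no date are treated as stale and dropped
            if modified is None or modified < modified_since:
                continue
        kept.append(c)
    return kept

# Made-up chunk records
sample = [
    {"source": "qbr_q3_2025.md", "text": "...", "last_modified": date(2025, 10, 1)},
    {"source": "runbook_data_quality.md", "text": "...", "last_modified": date(2024, 1, 15)},
    {"source": "runbook_warehouse_cost.md", "text": "..."},
]
only_qbr = filter_chunks(sample, sources={"qbr_q3_2025.md"})
recent = filter_chunks(sample, modified_since=date(2025, 1, 1))
print(len(only_qbr), len(recent))  # 1 1
```

The filtered list can then be passed to the scoring functions in place of the full corpus. Note that a BM25 index built over the full corpus would need to be rebuilt or have its results post-filtered, since its scores are indexed by position.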

The corpus is a codebase, not a pile of files. Good RAG corpora are maintained the same way good codebases are: with ownership, versioning, and a removal process. A chunk that refers to an incident from 2022 as "recent" erodes trust. Treat stale content as technical debt.

Practitioner Notes

Section-boundary chunking beats fixed-size chunking for structured documents. The notebook demonstrates both; the section-boundary chunks retrieve more coherently for the runbook corpus because each chunk is a self-contained section with a title.
Always embed queries with input_type="query", not "document". VoyageAI and a number of other embedding providers distinguish between documents and queries. Getting this wrong produces subtly worse retrieval that is hard to diagnose.
BM25 is not optional. Embedding-only retrieval reliably fails on named entities, dates, product names, and error codes — exactly the terms practitioners care most about in operational contexts. Add BM25 before you add anything else.
Log retrieval quality, not just answer quality. A RAG system that returns the wrong chunks but generates a fluent answer is silently broken. Track which chunks were retrieved for a sample of production queries and inspect them manually on a regular cadence.
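Start without a vector database. For corpora under ~50,000 chunks, a brute-force cosine scan held in memory is usually fast enough (roughly under 100 ms on typical hardware), provided the vectors are stored as a NumPy matrix rather than scored one by one in a pure-Python loop, which is noticeably slower at that size. Add a vector database when latency or memory becomes the constraint, not as a default architectural choice.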
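For the retrieval logging note above, one lightweight option is to write one JSON line per query recording which chunks came back and in what order. The sketch below builds that line; the function name, the record fields, and the 120-character preview length are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone

def retrieval_log_line(query: str, retrieved: list[tuple[dict, float]],
                       preview_chars: int = 120) -> str:
    """Serialise one retrieval event as a single JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "hits": [
            {
                "rank": rank,
                "source": chunk.get("source"),
                "score": float(score),  # also converts NumPy floats
                "preview": chunk["text"][:preview_chars],
            }
            for rank, (chunk, score) in enumerate(retrieved, start=1)
        ],
    }
    return json.dumps(record, ensure_ascii=False)

# Made-up retrieval result
line = retrieval_log_line(
    "what happened on 2025-05-18?",
    [({"source": "runbook_warehouse_cost.md", "text": "x" * 200}, 0.031)],
)
print(line[:80])

# Appending to a log file (path is hypothetical):
#   with open("retrieval_log.jsonl", "a", encoding="utf-8") as f:
#       f.write(line + "\n")
```

Storing a short preview rather than the full chunk keeps the log readable during manual review while still showing whether the hit was plausible.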

Beyond the Docs

The Anthropic course introduces RAG as a single concept. Two distinctions the course leaves implicit:

Retrieval strategy and generation are independently tunable. A common mistake is to iterate on the prompt when retrieval is the real problem, and vice versa. Evaluate them separately: first check whether the right chunks are being retrieved (retrieval recall); then check whether the model is using them correctly (generation faithfulness). Confounding the two means you can never tell which change helped.
Hybrid retrieval is not an optimisation; it is the baseline. Embedding-only RAG is what you build to understand the pipeline. Hybrid retrieval with RRF is what you deploy. The overhead of running two retrievers is small, and the coverage improvement over either retriever alone is usually worth it, particularly for queries that contain identifiers, dates, or error codes. Build hybrid from the start rather than treating it as an upgrade.
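To evaluate retrieval separately from generation, you need a small set of questions labelled with the chunk ids that should come back. The sketch below computes two simple measures over such a set: hit rate at k (did any relevant chunk appear in the top k?) and mean reciprocal rank. These are related to, but not the same as, full recall. The function names, the search_fn signature, and the labelled data are assumptions for illustration; search_fn stands in for whichever retriever you are testing.

```python
from typing import Callable

SearchFn = Callable[[str, int], list[str]]       # (query, k) -> ranked chunk ids
Labelled = list[tuple[str, set[str]]]            # (query, relevant chunk ids)

def hit_rate_at_k(labelled: Labelled, search_fn: SearchFn, k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not labelled:
        return 0.0
    hits = sum(1 for query, relevant in labelled
               if relevant & set(search_fn(query, k)[:k]))
    return hits / len(labelled)

def mean_reciprocal_rank(labelled: Labelled, search_fn: SearchFn, k: int = 5) -> float:
    """Average of 1/rank of the first relevant chunk (0 if none in the top k)."""
    if not labelled:
        return 0.0
    total = 0.0
    for query, relevant in labelled:
        for rank, chunk_id in enumerate(search_fn(query, k)[:k], start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(labelled)

# Made-up retriever output and labels
canned = {"q1": ["c1", "c2"], "q2": ["c9", "c3"], "q3": ["c7", "c8"]}
fake_search = lambda query, k: canned[query][:k]
labelled = [("q1", {"c1"}), ("q2", {"c3"}), ("q3", {"c0"})]

print(hit_rate_at_k(labelled, fake_search, k=2))         # 2 of 3 queries hit
print(mean_reciprocal_rank(labelled, fake_search, k=2))  # (1 + 0.5 + 0) / 3 = 0.5
```

Running the same labelled set against vector-only, BM25-only, and hybrid retrieval gives a direct comparison that does not depend on the prompt or the model.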
