B5. RAG Advanced: Reranking and Contextual Retrieval

About this article This article is part of Building with Claude — A Practitioner's Guide to the Anthropic API, a study-notes-plus-commentary series based on Anthropic's official "Building with the Claude API" course (hosted on Coursera) and the public Anthropic API documentation at docs.anthropic.com.

Original course and documentation material is © Anthropic. Direct quotes are cited inline. Commentary, code adaptations, and examples are © DataMy. This series is independent and not affiliated with or endorsed by Anthropic.

Companion notebook: B5_rag_advanced.ipynb
Setup: see README.md
Same corpus as B4: data/runbook_warehouse_cost.md, data/runbook_data_quality.md, data/qbr_q3_2025.md
Requires: ANTHROPIC_API_KEY and VOYAGE_API_KEY in .env

The two gaps that basic RAG leaves open

After B4 you have a working pipeline: section-boundary chunking, VoyageAI embeddings, BM25, RRF hybrid fusion. For many applications that is sufficient. Two failure modes remain.

Gap 1: context loss at chunk boundaries. When a chunk is sliced out of a document, it loses the surrounding context that makes it retrievable. A chunk that reads "Step 3: identify which query pattern" is precise inside the runbook's diagnosis playbook section, but to a retriever that sees only the chunk text, it looks like a generic procedural step — not something about Snowflake cost attribution. The embedding is faithful to the words that are there; the problem is the words that are not there.

Gap 2: bi-encoder ceiling. The embedding that drives vector search is computed independently for the chunk and the query — they never interact. A cross-encoder model, which reads the query and each candidate jointly, can score relevance more accurately. It is too expensive to run over the full corpus; running it over a small candidate set is practical.

Both gaps have standard solutions. This article covers them.

1. Reranking: a second, sharper scoring pass

The setup: retrieve a larger candidate set (20–50 chunks) with fast hybrid search, then re-score those candidates with a cross-encoder reranker and keep the top-k.

Anthropic recommends VoyageAI's reranker alongside their embedding models:

"We recommend Voyage AI's rerankers as they are among the best performing on the market. Reranking should be used after initial retrieval to improve the quality of the results." — Anthropic API Docs, "Contextual retrieval".

The rerank call:

result = vc.rerank(
    query=query,
    documents=[c["text"] for c in candidates],
    model="rerank-2",
    top_k=k,
)

# result is a RerankingObject; result.results is a list of RerankingResult items,
# each with .index, .document and .relevance_score, sorted by descending relevance
reranked = [(candidates[r.index], r.relevance_score) for r in result.results]

Cosine similarity has a fixed range of [-1, 1], but reranker relevance scores should not be read as calibrated probabilities: treat them as a ranking signal in which only the relative order within a single query matters. The top-ranked item is the one the cross-encoder judged most relevant to the query after reading both side-by-side.

Why this consistently improves results: the bi-encoder that produced the initial candidates encoded chunk and query in separate forward passes. The cross-encoder attends jointly over both, producing a signal that captures specific relationships between query tokens and document tokens rather than just geometric proximity in embedding space. It is qualitatively different information, not just better information.

The practical rule for when to add reranking: default yes for production and high-value corpora. Reranking cannot surface a chunk that hybrid retrieval missed (it only re-sorts the candidate set), so retrieve generously (20–50 candidates) before cutting to the top-k. The latency overhead over a 20-chunk candidate set is typically on the order of 100–300ms, and the cost is small relative to the generation call that follows. For small, low-stakes corpora (a handful of documents, internal-only queries), the B4 hybrid baseline is often sufficient; there, add reranking only once retrieval precision is a measurable problem.
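The two-stage shape described in this section can be sketched without any network calls. This is a minimal illustration, not the notebook's implementation: `rerank_candidates` and `toy_overlap_score` are hypothetical names, and the token-overlap scorer is only a stand-in so the example runs offline. In a real pipeline the scoring step would be the `vc.rerank(...)` call shown earlier.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (query, document texts) -> one score per text.
ScoreFn = Callable[[str, Sequence[str]], Sequence[float]]


def rerank_candidates(query: str, candidates: list, score_fn: ScoreFn, k: int) -> list:
    """Re-score an already-retrieved candidate set and keep the k best.

    Returns (candidate, score) pairs, highest score first. Nothing outside
    `candidates` can appear in the output: reranking only re-sorts.
    """
    if not candidates:
        return []
    scores = score_fn(query, [c["text"] for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def toy_overlap_score(query: str, texts: Sequence[str]) -> list:
    """Stand-in scorer (NOT a cross-encoder): share of query tokens found in each text."""
    q_tokens = set(query.lower().split())
    return [len(q_tokens & set(t.lower().split())) / max(len(q_tokens), 1) for t in texts]


# Made-up candidate chunks, as if returned by the hybrid search stage.
candidates = [
    {"id": "a", "text": "Rotate the API keys every quarter"},
    {"id": "b", "text": "Diagnose a warehouse cost spike by grouping queries"},
    {"id": "c", "text": "Warehouse sizing guidelines"},
]
top = rerank_candidates("warehouse cost spike", candidates, toy_overlap_score, k=2)
print([chunk["id"] for chunk, _ in top])
```

Keeping the scorer injectable makes it easy to unit-test the pipeline with a cheap stub and swap in the hosted reranker only at the edge.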

2. Contextual retrieval: making chunks self-aware

Anthropic published contextual retrieval in September 2024 as a direct answer to gap 1:

"We've developed a method called Contextual Retrieval that significantly improves the retrieval step in RAG. We've found that Contextual Retrieval reduces the number of failed retrievals by 49% and, when combined with reranking, by 67%." — Anthropic Blog, "Contextual Retrieval", September 2024.

The technique is straightforward: for each chunk, ask Claude to write a short context paragraph that situates the chunk within its source document, then prepend that paragraph to the chunk before embedding.

The prompt below adapts the context-generation prompt Anthropic recommends. The document goes in the system block (with cache_control so it is cached across all chunks from the same document); the user message carries only the chunk instruction, with no <document> wrapper in the user turn:

# System block: the full document, cached per document
# User message: chunk-only instruction -- no <document> tags here

CHUNK_INSTRUCTION = (
    "Here is a chunk from this document:\n\n"
    "<chunk>\n{chunk_text}\n</chunk>\n\n"
    "Write a short 2-3 sentence context that situates this chunk within the "
    "overall document. Include the document type, the section it belongs to, "
    "and key terms a search query might use to find this content. "
    "Answer with only the context paragraph, nothing else."
)

resp = cc.client.messages.create(
    model=cc.default_model,
    max_tokens=150,
    temperature=0,
    system=[{
        "type": "text",
        "text": f"<document>\n{document_text}\n</document>",
        "cache_control": {"type": "ephemeral"},   # cached per document
    }],
    messages=[{
        "role": "user",
        "content": CHUNK_INSTRUCTION.format(chunk_text=chunk_text),
    }],
)

Applied to the "Step 3: identify which query pattern" chunk from the cost runbook, this might produce:

This chunk is from the Snowflake Warehouse Cost Runbook's diagnosis playbook (Section 4). It describes Step 3 of the cost spike investigation procedure: grouping QUERY_HISTORY by parameterized hash to identify the top credit-consuming query patterns. Relevant terms include Snowflake cost attribution, warehouse spend investigation, and query pattern analysis.

After prepending, the embedding now carries the conceptual weight of the whole section — not just the isolated step text. The BM25 index built on contextual chunks similarly benefits from the added terminology.
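One way to hold the result of the prepending step is a small record that stores the raw chunk and the generated context separately and derives the text to index from both. This is a sketch under assumptions: `ContextualChunk`, `index_text`, and the chunk id are hypothetical names, not something the course or notebook defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualChunk:
    """A chunk plus its generated context paragraph (hypothetical structure)."""

    chunk_id: str
    raw_text: str   # original chunk text, unchanged
    context: str    # context paragraph returned by the model; may be empty

    @property
    def index_text(self) -> str:
        """Text to embed and to feed into BM25: context first, then the chunk."""
        ctx = self.context.strip()
        return f"{ctx}\n\n{self.raw_text}" if ctx else self.raw_text


chunk = ContextualChunk(
    chunk_id="cost-runbook-step-3",  # made-up identifier
    raw_text="Step 3: identify which query pattern",
    context="From the Snowflake Warehouse Cost Runbook's diagnosis playbook.",
)
print(chunk.index_text)
```

Storing the two parts separately means the index can use `index_text` while later stages still have access to the untouched `raw_text`.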

Cost structure of contextual indexing

The cost is real: one Claude call per chunk, with the full source document as input on every call. For a 30-chunk corpus split across three documents, that is 30 calls.

Prompt caching reduces this dramatically. The full document text is identical for every chunk from that document. Mark it with cache_control in the system block, and keep the user message to the chunk instruction only, as in the call shown above.

On the first chunk from a document, the full document is written to cache (~1.25× input cost). Every subsequent chunk from the same document reads the document from cache (~0.1× input cost). For a 10-chunk document, that is one write plus nine reads: 1.25 + 9 × 0.1 = 2.15× the document's input cost, against 10× without caching. That is a saving of roughly 78%, approaching 90% as the chunk count grows. This assumes the calls for a document run back to back, so the cache entry does not expire between chunks, and that the document is long enough to meet the model's minimum cacheable prompt length.
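The amortisation arithmetic can be checked with a few lines of Python. This is a back-of-envelope sketch: `relative_indexing_cost` is a hypothetical helper, the 1.25× and 0.1× multipliers are the approximate figures quoted above, and only the document's input tokens are counted (the chunk instruction and the output tokens are extra and unaffected by caching).

```python
def relative_indexing_cost(n_chunks: int, write_mult: float = 1.25, read_mult: float = 0.10):
    """Document-input cost of contextualising n_chunks chunks of one document.

    Costs are expressed in units of "one uncached read of the full document".
    Assumes every call after the first hits the cache.
    Returns (cached_cost, uncached_cost, fractional_saving).
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    cached = write_mult + (n_chunks - 1) * read_mult  # one write, then reads
    uncached = float(n_chunks)                        # full price on every call
    return cached, uncached, 1 - cached / uncached


cached, uncached, saving = relative_indexing_cost(10)
print(f"{cached:.2f}x vs {uncached:.0f}x, saving {saving:.1%}")
# prints: 2.15x vs 10x, saving 78.5%
```

For a single-chunk document the "saving" is negative (1.25× against 1×), which is a reminder that caching only pays off when a document yields several chunks.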

The contextual indexing runs offline — it is not on the query path. Pay it once when the document is first indexed, and again only when the document changes.

3. The full advanced pipeline

The two upgrades compose cleanly:

OFFLINE — indexing (runs once per document, or on update)
  For each document D:
    For each chunk C in D:
      ctx = claude(system=[D cached], user=[C]) -> context paragraph
      contextual_chunk = ctx + "\n\n" + C
    embed(contextual_chunks, input_type="document")  -> contextual embeddings
    bm25.build(contextual_chunks)                    -> contextual BM25 index

ONLINE — query path (runs on every user question)
  candidates = hybrid_search(query, k=20)   -> fast: embed query + BM25 scan + RRF
  top_k = rerank(query, candidates, k=5)    -> slower: cross-encoder over 20 candidates
  answer = claude(system + top_k_context + question) -> generation

Relative to the B4 baseline:

Indexing cost is higher (one Claude call per chunk, mitigated by prompt caching).
Query latency is slightly higher (one reranking call, ~100–300ms).
Retrieval precision is substantially higher: Anthropic reports a 67% reduction in failed retrievals on its own benchmarks, though the gain on any given corpus should be measured rather than assumed.

For high-value knowledge bases — internal runbooks, compliance documents, technical wikis — the precision improvement justifies both costs easily. For a casual Q&A over a few documents, B4's basic pipeline may be sufficient.

4. A comparison on the Acme corpus

Running both pipelines on the same three test questions against the B4 corpus:

Question type	Baseline (B4 hybrid)	Advanced (contextual hybrid + rerank)
Paraphrased semantic	Usually top-3	Top-1 after reranking
Named entity / date	Top-1 via BM25	Unchanged; BM25 already handles this
Specific procedural step	Top-3 to top-7	Top-1 or top-2; context paragraph adds section label that embedding was missing

The biggest gain is on specific procedural step questions — exactly the queries that an on-call engineer running a runbook assistant would ask. The context paragraph adds the section title and surrounding terminology that the chunk itself lacks, making the embedding highly discriminative.

The companion notebook runs this comparison live and prints both retrieved chunks and generated answers side-by-side, making the improvement concrete.

5. Latency and cost trade-offs in production

Two practical decisions for any production deployment:

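Caching reranked results. If many users ask the same questions (e.g., "how do I diagnose a cost spike?" asked daily by on-call engineers), the reranked candidate set is effectively the same each time. Cache the reranked chunks for the TTL of your document update cycle. A hash key only matches identical input, so key on a hash of the normalized query text to catch exact repeats; catching paraphrases requires a nearest-neighbour lookup over cached query embeddings with a similarity threshold, not a hash of the embedding. This converts the reranking latency from a per-query to a per-unique-query cost.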
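A minimal in-memory sketch of such a cache follows. It is an illustration under assumptions, not the notebook's code: `RerankCache` is a hypothetical class, and it keys on a hash of the normalized query text, because a hash matches only identical input and so handles exact repeats rather than paraphrases. The injectable `clock` exists only to make expiry testable.

```python
import hashlib
import time


class RerankCache:
    """TTL cache for reranked chunk lists, keyed on normalized query text."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, chunks)

    @staticmethod
    def key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str):
        """Return cached chunks, or None on a miss or an expired entry."""
        k = self.key(query)
        entry = self._store.get(k)
        if entry is None:
            return None
        expires_at, chunks = entry
        if self.clock() >= expires_at:
            del self._store[k]  # drop stale entries lazily
            return None
        return chunks

    def put(self, query: str, chunks) -> None:
        self._store[self.key(query)] = (self.clock() + self.ttl, list(chunks))


cache = RerankCache(ttl_seconds=3600)
cache.put("How do I diagnose a cost spike?", ["chunk-12", "chunk-07"])
print(cache.get("how do I   diagnose a cost spike?"))
```

In production the same interface could sit in front of a shared store, and the cache should also be cleared whenever the index is rebuilt so that stale chunk ids are never served.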

Selective contextual enrichment. Not all chunks need contextual enrichment equally. Chunks that begin with a clear section header ("## 4. Diagnosis playbook") are already self-describing. Chunks in the middle of long sections, or that contain dense technical content with no surrounding label, benefit the most. A heuristic: generate context for any chunk whose first 50 characters do not contain a heading or a self-identifying sentence.
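The heuristic can be approximated in a few lines. This sketch makes simplifying assumptions: `needs_context` is a hypothetical helper, it recognises only Markdown-style `#` headings, and it does not attempt to detect a "self-identifying sentence", which would need a model or corpus-specific rules.

```python
import re

# Matches a Markdown heading such as "## 4. Diagnosis playbook" at the start of a line.
HEADING_RE = re.compile(r"^\s*#{1,6}\s+\S")


def needs_context(chunk_text: str, window: int = 50) -> bool:
    """Return True when no heading line starts within the first `window` characters."""
    head = chunk_text.lstrip()[:window]
    has_heading = any(HEADING_RE.match(line) for line in head.splitlines())
    return not has_heading


print(needs_context("## 4. Diagnosis playbook\nStart with the billing view."))
print(needs_context("Step 3: identify which query pattern"))
```

Chunks for which this returns False skip the context-generation call; whether that saving is worth the occasional missed enrichment is something to check against an evaluation set.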

Practitioner Notes

Cache the document during contextual indexing, not just during query time. The B3 caching pattern (stable prefix, volatile suffix) applies perfectly here: the document is the stable prefix; the chunk prompt is the volatile suffix. One cache marker on the document block saves most of the document-input cost: roughly 78% for a 10-chunk document at the multipliers above, approaching 90% for documents with many chunks.
Keep both the raw chunk and the contextual chunk. The context paragraph helps retrieval. For generation, you may want to pass only the original chunk text to avoid the model echoing meta-commentary in its answer.
Reranking scores are relative, not absolute. Do not use a fixed relevance score threshold to filter reranked results — the scores are not comparable across queries. Use a fixed top-k instead.
Rebuild contextual embeddings when the document is updated. A context paragraph generated from an old version of a document will describe the wrong section number or wrong content after the document is edited. Treat contextual embeddings as a derived artifact of the document, not as a stable index.
Measure before and after with known-answer pairs. Write 15–20 question/expected-chunk pairs for your corpus before adding contextual retrieval. Run both pipelines. The improvement should be visible and quantifiable. If it is not, the corpus or chunking strategy may be the root cause rather than missing contextual enrichment.
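The before-and-after measurement in the last note can be sketched as a hit-rate function over question/expected-chunk pairs. The names here are hypothetical, and the two stub retrievers return made-up placeholder rankings purely so the example runs; they are not results from the Acme corpus.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical retriever signature: (question, k) -> ranked list of chunk ids.
Retriever = Callable[[str, int], List[str]]


def hit_rate_at_k(eval_pairs: Sequence[Tuple[str, str]], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    if not eval_pairs:
        raise ValueError("need at least one question/expected-chunk pair")
    hits = sum(
        1 for question, expected_id in eval_pairs
        if expected_id in retrieve(question, k)[:k]
    )
    return hits / len(eval_pairs)


# Placeholder data standing in for real pipelines and a real evaluation set.
pairs = [("q1", "c1"), ("q2", "c2")]
baseline_rankings = {"q1": ["c1", "c9"], "q2": ["c7", "c8"]}
advanced_rankings = {"q1": ["c1", "c9"], "q2": ["c2", "c8"]}

baseline_score = hit_rate_at_k(pairs, lambda q, k: baseline_rankings[q], k=2)
advanced_score = hit_rate_at_k(pairs, lambda q, k: advanced_rankings[q], k=2)
print(baseline_score, advanced_score)
```

Running the same pairs through both pipelines at the same k keeps the comparison fair; with only 15–20 pairs, small differences should be read cautiously.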

Beyond the Docs

The course covers contextual retrieval and reranking as independent techniques. Two connections worth making explicit:

Contextual retrieval and prompt caching are designed for each other. The Anthropic blog on contextual retrieval mentions caching as a cost reduction. In practice they are architecturally linked: contextual indexing is only affordable at corpus scale because prompt caching makes the full-document prefix nearly free after the first chunk. The two features were announced separately but belong in the same section of any serious RAG architecture discussion.
The 67% failure reduction figure may be a floor, not a ceiling. Anthropic's published benchmark uses a general-purpose corpus and standard queries. Domain-specific corpora with dense jargon, internal codenames, and procedural content (exactly what an enterprise RAG deployment looks like) plausibly gain more from contextual enrichment because they suffer more severely from gap 1, but treat that as a hypothesis to confirm on your own evaluation set. The improvement can compound: better contextual enrichment gives better embeddings, which widens hybrid recall, which in turn gives the reranker a stronger candidate set.
