B3. Augmenting Model Reasoning: Extended Thinking and Prompt Caching | Claude API Practitioner's Guide | DataMy

About this article This article is part of Building with Claude — A Practitioner's Guide to the Anthropic API, a study-notes-plus-commentary series based on Anthropic's official "Building with the Claude API" course (hosted on Coursera) and the public Anthropic API documentation at docs.anthropic.com.

Original course and documentation material is © Anthropic. Direct quotes are cited inline. Commentary, code adaptations, and examples are © DataMy. This series is independent and not affiliated with or endorsed by Anthropic.

Companion notebook: B3_caching_and_thinking.ipynb · llm_client.py
Setup: see README.md in the series repo
Dataset: data/runbook_warehouse_cost.md

Two features, one theme

Extended thinking and prompt caching look like unrelated features at first glance — one expands the model's reasoning budget, the other reduces the cost of long prompts. The reason this article treats them together is that both attack the same underlying problem from opposite sides:

How do you make Claude reason harder over more context, without the latency and cost going through the roof?

Extended thinking gives the model more time to reason on each call. You spend more output tokens (and a bit more latency) to get a deeper, more deliberate answer.
Prompt caching lets you stop paying full input price every time you re-send the same context. You spend more storage (and slightly more on the first call) to make every subsequent call dramatically cheaper.

Used separately, each is useful for a narrow class of problems. Used together, they unlock a pattern that beginner Claude code rarely reaches: hard analytical reasoning over a long, stable corpus, at production cost. That is what a real operational copilot looks like — a runbook in context, a hard question asked, a careful answer returned.

This article walks both features end to end, then composes them.

1. Extended thinking: more reasoning budget per call

By default, when you call the API, the model generates a response immediately — token by token, no scratchpad, no internal deliberation visible to the caller. For straightforward questions this is exactly what you want.

For hard problems — multi-step reasoning, mathematical work, complex analysis, ambiguous trade-off decisions — there is a different mode. Extended thinking gives the model a private space to "think before it speaks":

"Extended thinking gives Claude enhanced reasoning capabilities for complex tasks, while also providing transparency into its step-by-step thought process before it delivers its final answer." — Anthropic API Docs, "Extended thinking".

The API shape is simple:

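import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,  # must be larger than budget_tokens
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": hard_question}],  # hard_question: your prompt string
)

budget_tokens caps the tokens the model may spend reasoning before it produces its answer. It is a soft ceiling: Claude will use what it needs up to that limit, and easy questions often use far less. The budget must be at least 1,024 tokens and smaller than max_tokens, because thinking tokens count toward max_tokens; a request with max_tokens=4096 and budget_tokens=8000 is rejected. The response object carries both the thinking content and the final answer as separate content blocks.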
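Because the thinking and the answer arrive as separate content blocks, most callers want a small helper that pulls them apart. The following is a minimal sketch, assuming blocks expose a `type` attribute with the reasoning in `.thinking` and the answer in `.text`; the helper name `split_thinking_and_answer` and the stand-in blocks are illustrative, not part of the SDK.

```python
from types import SimpleNamespace


def split_thinking_and_answer(content_blocks):
    """Separate reasoning blocks from final-answer text blocks."""
    thinking_parts = []
    answer_parts = []
    for block in content_blocks:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            answer_parts.append(block.text)
        # Any other block type (for example redacted thinking) is skipped here.
    return "\n".join(thinking_parts), "".join(answer_parts)


# Stand-in for response.content so the sketch runs without an API call.
fake_content = [
    SimpleNamespace(type="thinking", thinking="Compare warehouse sizes first."),
    SimpleNamespace(type="text", text="Downsize the nightly job."),
]
thinking_text, answer_text = split_thinking_and_answer(fake_content)
print(answer_text)
```

In a real call you would pass `response.content` instead of `fake_content`, show `answer_text` to the user, and keep `thinking_text` for logs or an optional "show reasoning" view.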

When extended thinking helps

In practical use, three patterns reliably benefit:

Multi-step analytical questions. "Given this runbook, our recent incident logs, and these cost trends, what is the single highest-leverage change we could make this quarter?" Multiple facts need to be combined, weighed, and ranked. Thinking gives the model room to work.
Math-heavy or financial reasoning. Anything that requires multi-step calculation, sensitivity analysis, or unit reconciliation. Thinking measurably improves correctness here.
Ambiguous trade-off decisions. "Should we keep this query running on the larger warehouse or refactor it?" When the answer depends on weighing factors against each other, thinking surfaces the weighting itself, which is often what the user actually wants to see.

When extended thinking is overkill

The list of things thinking does not help with is just as important:

Classification, extraction, formatting. A persona that returns structured JSON gains nothing from thinking. It just slows down and costs more.
Conversational responses. Chat-style interactions feel sluggish with thinking enabled; the model's natural response is already good enough.
Tool routing. Decisions about which tool to call are typically simple lookups; thinking adds latency without improving accuracy.
Anything where consistency matters more than depth. Thinking introduces additional variation; for repeatable tasks you usually want it off.

A reasonable rule: turn thinking on by default for analytical or advisory features, off by default for transactional features. Then measure.
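One way to encode that rule is a single request builder that switches thinking on per feature rather than per call site. This is a sketch under stated assumptions: the feature names in `ANALYTICAL_FEATURES` and the function `request_kwargs` are hypothetical, and the default token numbers are placeholders to tune.

```python
# Hypothetical feature names; replace with the ones in your own service.
ANALYTICAL_FEATURES = {"cost_review", "incident_analysis"}


def request_kwargs(feature, question, budget_tokens=8000, answer_tokens=4096):
    """Build keyword arguments for client.messages.create for one feature."""
    kwargs = {
        "model": "claude-sonnet-4-5",
        "max_tokens": answer_tokens,
        "messages": [{"role": "user", "content": question}],
    }
    if feature in ANALYTICAL_FEATURES:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
        # max_tokens has to exceed the thinking budget, so reserve room
        # for the answer on top of it.
        kwargs["max_tokens"] = budget_tokens + answer_tokens
    return kwargs


advisory = request_kwargs("cost_review", "Which change saves the most?")
transactional = request_kwargs("classify_ticket", "Label this ticket.")
print("thinking" in advisory, "thinking" in transactional)
```

Keeping the decision in one place makes the "then measure" step practical: flipping a feature in or out of the set is a one-line change you can A/B test.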

Cost and latency trade-off

Extended thinking is billed as output tokens. A call with budget_tokens=8000 that uses 6000 thinking tokens will charge for those 6000 tokens at the output rate, on top of the final answer's tokens. Latency scales similarly — expect 2–10x longer round-trip times depending on how deeply the model thinks.

The right way to think about the cost: thinking is a quality knob. You are paying for better answers on the calls where better answers matter. The way to keep the bill sane is to be deliberate about which calls get thinking — not to disable it across the board.
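To make the quality knob concrete, here is a small sketch of the output-side arithmetic. The rate `ASSUMED_USD_PER_MILLION_OUTPUT` is a placeholder, not a quoted price, and the 500-token answer length is an assumption for illustration; plug in the current rate for your model.

```python
def output_cost_usd(thinking_tokens, answer_tokens, usd_per_million_output):
    """Thinking and answer tokens are both billed at the output rate."""
    total_output_tokens = thinking_tokens + answer_tokens
    return total_output_tokens * usd_per_million_output / 1_000_000


ASSUMED_USD_PER_MILLION_OUTPUT = 15.0  # placeholder rate, confirm before use

with_thinking = output_cost_usd(6000, 500, ASSUMED_USD_PER_MILLION_OUTPUT)
without_thinking = output_cost_usd(0, 500, ASSUMED_USD_PER_MILLION_OUTPUT)
print(f"with thinking: ${with_thinking:.4f}, without: ${without_thinking:.4f}")
```

Under these assumptions the thinking call costs many times more on the output side, which is why the decision belongs at the feature level rather than being applied to every endpoint.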

2. Prompt caching: stop paying twice for the same context

The other side of the same problem. As your application matures, your prompts get longer: a detailed system prompt with persona instructions, a runbook the model references, a set of tool definitions, perhaps a few-shot example block. Every one of those tokens is sent on every call, processed by the model on every call, and billed on every call. For high-traffic features, the math gets ugly quickly.

Prompt caching is Anthropic's answer. You mark a section of your prompt as cacheable, and on subsequent calls within a time window, that section is served from a server-side cache rather than re-processed from scratch:

"Prompt caching enables you to store and reuse frequently used context in your API calls, reducing both costs and latency for repetitive tasks." — Anthropic API Docs, "Prompt caching".

The mechanics are explicit but simple. You attach a cache_control marker to the last block of whatever you want cached:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,                       # persona, instructions
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": runbook_markdown,                          # long stable document
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_question}],
)

Everything before and including a cache_control marker becomes the cached prefix. Everything after it is uncached and re-processed on every call. That ordering rule is critical — see §5.

Cache TTL

By default, cache entries expire 5 minutes after their last use; every hit resets the clock. For sustained workloads where the same prefix is hit dozens of times per hour, the default is fine. For workloads with longer gaps between calls (a runbook accessed by an on-call rotation, say), Anthropic also offers a 1-hour cache option via "cache_control": {"type": "ephemeral", "ttl": "1h"}. The 1-hour cache costs more to write (at the time of writing, roughly 2x the base input rate versus roughly 1.25x for the 5-minute cache) but eliminates the wasted re-write when calls are spaced more than five minutes apart.
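A small builder keeps the TTL choice in one place instead of scattering dict literals across call sites. This is a minimal sketch; the helper name `cached_text_block` is hypothetical, and it only accepts the two TTL strings discussed above.

```python
def cached_text_block(text, ttl=None):
    """Return a system text block that ends a cacheable prefix.

    ttl=None uses the default 5-minute cache; ttl="1h" asks for the
    1-hour cache.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        if ttl not in ("5m", "1h"):
            raise ValueError(f"unsupported ttl: {ttl!r}")
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}


default_block = cached_text_block("persona and instructions")
on_call_block = cached_text_block("long runbook text", ttl="1h")
print(on_call_block["cache_control"])
```

Passing `system=[default_block, on_call_block]` to `client.messages.create` would then mirror the request shape shown earlier, with the runbook held for the longer window.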

Cost economics

The numbers move over time, so always confirm at anthropic.com/pricing. At the time of writing, the approximate shape is:

Cache write costs slightly more than a regular input token (the first call that creates the cache).
Cache read costs roughly 10% of a regular input token (every subsequent call that hits the cache).

The implication: caching pays off the moment you call the same prefix more than once in a TTL window. For high-traffic features, the savings compound dramatically — a 50,000-token system prompt hit 100 times an hour goes from "this is going to bankrupt us" to "barely shows up in the bill."

The usage object on the response tells you which side of the cache you landed on:

response.usage.cache_creation_input_tokens   # tokens written to cache on this call
response.usage.cache_read_input_tokens       # tokens read from cache on this call
response.usage.input_tokens                  # uncached input tokens

A healthy production trace shows cache_creation_input_tokens only on the first call of each TTL window, with cache_read_input_tokens dominating thereafter.
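Those three counters are easy to turn into a log-friendly label. The sketch below assumes only the three usage fields named above; the function name `cache_status` and its labels are illustrative choices, and the `SimpleNamespace` objects stand in for `response.usage`.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify one call from its usage counters."""
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read and created:
        return "hit+write"  # read an existing prefix and extended it
    if read:
        return "hit"
    if created:
        return "write"
    return "uncached"


# Stand-ins for response.usage on a first and a second call.
first_call = SimpleNamespace(
    cache_creation_input_tokens=5200, cache_read_input_tokens=0, input_tokens=40
)
second_call = SimpleNamespace(
    cache_creation_input_tokens=0, cache_read_input_tokens=5200, input_tokens=38
)
print(cache_status(first_call), cache_status(second_call))
```

Emitting this label alongside latency in your request logs makes a broken cache show up as a run of "write" or "uncached" entries rather than as a surprise on the invoice.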

3. The two together

The pattern that makes both features earn their keep:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,                             # must be larger than budget_tokens
    thinking={"type": "enabled", "budget_tokens": 8000},
    system=[
        {
            "type": "text",
            "text": ANALYST_PERSONA,                  # stable
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": warehouse_runbook,                 # stable, long
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": hard_user_question}],   # volatile
)

Read top to bottom: a thoughtful, deliberate answer (thinking) grounded in a long, stable corpus (the cached system prompt + runbook) about a specific user question that changes every time.

This is the canonical shape for a production operational copilot — the kind of thing a data platform team would deploy for their own on-call rotation, or that DataMy might build for a client whose engineers need a runbook-grounded assistant. The cached context is essentially free on the second call onward; the thinking budget makes the answer worth reading; the user message is the only volatile part. Every architectural choice in this article was leading toward this shape.

4. A back-of-envelope cost model for caching

Caching's break-even is not particularly subtle, but practitioners often skip the calculation and either over-cache or under-cache.

Let:

P = size of the cacheable prefix in tokens
N = number of calls per TTL window that hit the same prefix
r_input = price per input token
r_write = price per cache-write token (a bit higher than r_input)
r_read = price per cache-read token (much lower than r_input)

Without caching, the prefix costs N × P × r_input per TTL window.

With caching, the prefix costs P × r_write + (N − 1) × P × r_read per TTL window.

The break-even — where caching starts to pay — is at very small N. For typical Anthropic pricing where r_read ≈ 0.1 × r_input and r_write ≈ 1.25 × r_input, caching is a net win whenever you call the same prefix twice or more within a TTL window. The savings grow linearly from there.

Worked example: a 5,000-token runbook hit 20 times in a 5-minute window:

Without caching: 20 × 5,000 × r_input = 100,000 × r_input
With caching: 5,000 × r_write + 19 × 5,000 × r_read ≈ (6,250 + 9,500) × r_input = 15,750 × r_input

Savings: roughly 84%. Prefix size cancels out of that ratio; under the assumed rates it depends only on N, as 0.9 − 1.15/N. Caching therefore saves about 67% of the cached portion at N = 5, passes 80% at around N = 12, and approaches (but never reaches) 90% as N grows. The companion notebook implements this calculator with current pricing as inputs so you can replace the assumed rates and plug in your real numbers.
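The two formulas above fit in a few lines. This is a sketch of the same model, not the notebook's implementation: costs are expressed in units of r_input, and the multipliers 1.25 and 0.10 are the assumed rates from this section, passed as parameters so you can substitute real ones.

```python
def prefix_cost(prefix_tokens, calls, write_mult=1.25, read_mult=0.10):
    """Cost of the cacheable prefix per TTL window, in units of r_input.

    Returns (uncached_cost, cached_cost, savings_fraction).
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    uncached = calls * prefix_tokens
    cached = prefix_tokens * write_mult + (calls - 1) * prefix_tokens * read_mult
    return uncached, cached, 1 - cached / uncached


uncached, cached, savings = prefix_cost(prefix_tokens=5000, calls=20)
print(f"uncached={uncached:.0f} cached={cached:.0f} savings={savings:.0%}")

# A single call in the window is the only case where caching costs more.
_, _, single_call_savings = prefix_cost(prefix_tokens=5000, calls=1)
```

The single-call case is worth keeping in view: with one call per window you pay the write premium and never collect the read discount, which is the scenario the 1-hour TTL exists to avoid.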

There are two failure modes for the calculation worth flagging:

Calls per TTL window matters more than calls per day. A feature with 2,000 calls per day spread evenly across 24 hours (288 five-minute windows) hits the same prefix only ~7 times per window. Caching still helps, but the savings shape is "always cache" rather than "extreme savings." The same 2,000 calls clustered in a 4-hour business window hit the prefix ~42 times per 5-minute window, which is extreme-savings territory. Because each hit resets the TTL, steady traffic can keep an entry warm across windows, so treat the per-window model as a conservative estimate.
The cache is per-organization, not per-process. If your service runs across many workers, they all share the same cache. Good news for cost; potentially confusing news for debugging.
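Converting daily volume into calls per TTL window is a one-line calculation that is easy to get wrong by hand. The sketch below assumes traffic is spread evenly over the active hours; the function name `calls_per_window` and the 2,000-calls-per-day figure are illustrative.

```python
def calls_per_window(calls_per_day, active_hours=24.0, ttl_minutes=5.0):
    """Average calls landing in one TTL window, assuming even traffic."""
    windows_per_day = active_hours * 60 / ttl_minutes
    return calls_per_day / windows_per_day


spread_out = calls_per_window(2000, active_hours=24)
clustered = calls_per_window(2000, active_hours=4)
print(f"even over 24h: {spread_out:.1f} per window")
print(f"clustered in 4h: {clustered:.1f} per window")
```

Feeding the result into the break-even formula from this section shows how much the traffic shape, rather than the daily total, drives the savings.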

5. Cache invalidation realities

The most common source of surprise with prompt caching: developers mark something as cached and then wonder why the cache miss rate stays high. The reason is usually that something before the cache marker is changing call to call.

The rule is simple but easy to get wrong:

The cache key is the byte-exact content of everything from the start of the prompt up to and including the cache_control marker.

If a single token before the marker is different between two calls, you get a cache miss. This includes:

A timestamp injected into the system prompt ("Today is 2026-06-08...") — moves daily, cache misses daily.
A user-specific personalization line embedded above the persona ("Greeting: 'Hi Raymond'") — different per user, cache misses per user.
Conversation history that grows over turns — every appended turn pushes a new cache key.

The practical pattern: stable content goes before the cache marker, volatile content goes after. The API assembles the prefix in a fixed order (tools, then system, then messages), so a useful mental model is two layers:

[ tool definitions ][ persona ][ runbook ]      ← cached prefix, hits on every call
[ cache_control marker ]
[ user message, today's data, conversation tail ]   ← volatile suffix, uncached
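A local fingerprint of the prefix is a cheap way to catch volatile content before it reaches production. To be clear about assumptions: this hash is an illustration of the rule, not the key the server actually computes, and `prefix_fingerprint` is a hypothetical helper. It hashes every system block up to and including the last one carrying `cache_control`.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks):
    """Hash all blocks up to and including the last cache_control marker."""
    marked = [i for i, block in enumerate(system_blocks) if "cache_control" in block]
    if not marked:
        return None  # nothing is marked, so nothing is cached
    prefix = system_blocks[: marked[-1] + 1]
    payload = json.dumps(prefix, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def stable_block():
    return {
        "type": "text",
        "text": "PERSONA AND RUNBOOK",
        "cache_control": {"type": "ephemeral"},
    }


def date_block(day):
    return {"type": "text", "text": f"Today is {day}."}


# Date before the marker: the fingerprint changes every day.
bad_mon = prefix_fingerprint([date_block("2026-06-08"), stable_block()])
bad_tue = prefix_fingerprint([date_block("2026-06-09"), stable_block()])

# Date after the marker: the cached prefix stays identical.
good_mon = prefix_fingerprint([stable_block(), date_block("2026-06-08")])
good_tue = prefix_fingerprint([stable_block(), date_block("2026-06-09")])

print(bad_mon == bad_tue, good_mon == good_tue)
```

Logging the fingerprint next to `cache_read_input_tokens` turns "why is this missing?" into a diff between two prompts with different hashes.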

For multi-turn conversations specifically, the trick is to put a cache marker at the end of the conversation history on each call. The first turn's marker is hit on subsequent turns; the new turn that was uncached on call N becomes the new cached frontier on call N+1. The Anthropic docs describe this as "incremental caching" and it is the right pattern for chat applications with long histories.
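The moving marker can be applied mechanically just before each request. The following is a sketch, assuming messages are dicts whose `content` is either a string or a list of text blocks; the helper name `mark_conversation_tail` is hypothetical. It strips older markers from the history so a long conversation does not accumulate them (the API allows only a small number of breakpoints per request), then marks the final block.

```python
import copy


def mark_conversation_tail(messages):
    """Return a copy of messages with one cache marker on the last block."""
    marked = copy.deepcopy(messages)  # never mutate the caller's history
    for message in marked:
        if isinstance(message["content"], str):
            # Markers attach to content blocks, so expand plain strings.
            message["content"] = [{"type": "text", "text": message["content"]}]
        for block in message["content"]:
            block.pop("cache_control", None)
    if marked and marked[-1]["content"]:
        marked[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "Why did warehouse spend jump last week?"},
    {"role": "assistant", "content": "Two nightly jobs were resized."},
    {"role": "user", "content": "Which one should we roll back first?"},
]
to_send = mark_conversation_tail(history)
print(to_send[-1]["content"][-1])
```

You would pass `to_send` as `messages=` on each turn while keeping `history` unmarked, so the marker always sits at the current end of the conversation.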

Practitioner Notes

Cache the system prompt before you cache anything else. Even a moderately detailed persona-style system prompt repays caching after the second call. Adding cache_control to the system prompt is the single highest-leverage caching change in most codebases.
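Inspect usage.cache_read_input_tokens on every call in production. This is your single best signal that caching is actually working. If it stays at zero, either something before the marker is changing call to call (find it and move it after the marker) or the prefix is shorter than the model's minimum cacheable length, in which case the marker is silently ignored.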
Don't enable extended thinking by default on every endpoint. It is a feature with a cost shape: 2–10x latency, additional output tokens. Enable it on advisory and analytical paths, leave it off on transactional ones, and measure both quality and cost on the boundary cases.
Combine the two before you optimise prompts. It is tempting to shorten a long system prompt to save tokens. Often the right move is the opposite — keep the rich system prompt and cache it. Token reduction has diminishing returns; caching is a step-change.
Log thinking tokens separately from answer tokens. Both are billed as output tokens, but they tell you different things. A spike in thinking tokens with stable answer length usually means the model encountered a genuinely harder question, which is a useful operational signal.

Beyond the Docs

The official course covers extended thinking and prompt caching as separate topics. Two connections the docs leave implicit:

Caching changes how you should design system prompts. Without caching, every additional token of system prompt is a tax on every call. With caching, a richer system prompt is essentially free after the first call — so the optimal prompt design shifts toward "more detail, more structure, more examples" rather than "as terse as possible." This is the single biggest design-level consequence of caching and the docs don't make it explicit.
Extended thinking + caching is the canonical shape of a serious assistant. Almost every production analytics or operational assistant ends up looking like the §3 example — a cached corpus, a thinking-enabled call, a volatile user message. The features are individually useful and jointly transformative. Reach the combination earlier than you think you need it.
