About this article
This article is part of Building with Claude: A Practitioner's Guide to the Anthropic API, a study-notes-plus-commentary series based on Anthropic's official "Building with the Claude API" course (hosted on Coursera) and the public Anthropic API documentation at docs.anthropic.com.
Original course and documentation material is © Anthropic. Direct quotes are cited inline. Commentary, code adaptations, and examples are © DataMy. This series is independent and not affiliated with or endorsed by Anthropic.
Companion notebook:
A2_setup_and_first_call.ipynb · llm_client.py · requirements.txt · Setup: see README.md in the series repo · Dataset: none
What "robust" means in this article
Most readers of this series have already made a Claude API call. The smallest possible version looks like this:
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, Claude."}],
)
print(response.content[0].text)
That code works. It will also fall over the first time the network blips, the first time a user submits a prompt that exceeds the token budget, the first time you want to know what the call cost, and the first time you want to stream output to a UI.
The rest of this article is about closing that gap — turning the four-line snippet into a small, reusable client wrapper that retries on transient errors, streams when appropriate, logs cost per call, and keeps secrets out of your codebase. The companion notebook contains the full implementation; this article explains the why behind each piece.
By the end you'll have:
- A working Python environment with the anthropic SDK installed.
- An ANTHROPIC_API_KEY loaded from .env, not hard-coded.
- A call_claude(...) function that handles retries, streaming, and usage logging.
- A clear mental model of which parameters in messages.create() actually matter.
1. Setting up your environment
The Anthropic Python SDK supports Python 3.8+, but for this series we target Python 3.11+ — newer pattern-matching syntax, better error messages, and the ecosystem (especially pandas 2.x and the MCP SDK) is happier there.
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If you prefer uv (a much faster installer that's become standard in 2025–26), the equivalent is:
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
Either works. The requirements.txt in the companion repo pins everything you'll need for all eleven articles.
Putting your API key in .env, not in your code
The single most common mistake I see when reviewing client code is a hard-coded API key. It usually starts as api_key="sk-ant-..." in a notebook, gets copied into a script, the script gets pushed to a shared repo, and three weeks later someone is rotating keys at 11pm.
The pattern this series uses everywhere:
from dotenv import load_dotenv
from anthropic import Anthropic
load_dotenv() # reads .env into os.environ
client = Anthropic() # SDK reads ANTHROPIC_API_KEY from environment
Your .env file (gitignored) looks like:
ANTHROPIC_API_KEY=sk-ant-...
Anthropic's documentation is explicit about this:
"Authenticate to the API by including your API key in the x-api-key HTTP header." (Anthropic API Docs, "Authentication")
The SDK does that header injection for you as long as the key is in the environment. Don't fight the convention.
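Rather than discovering a missing key on the first request, a wrapper can check for it at startup. The following is a minimal sketch; `require_api_key` is a hypothetical helper name used here for illustration and is not part of the anthropic SDK.

```python
import os


def require_api_key(env=None, var="ANTHROPIC_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of the anthropic SDK.
    """
    env = os.environ if env is None else env
    key = (env.get(var) or "").strip()
    if not key:
        raise RuntimeError(
            f"{var} is not set. Add it to your .env file and call load_dotenv() first."
        )
    return key
```

Calling `require_api_key()` right after `load_dotenv()` turns a confusing authentication failure later on into an immediate, readable error.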
2. The anatomy of a real call
client.messages.create() accepts many parameters. Four of them deserve real understanding before you do anything else.
model
As of June 2026, Anthropic's current flagship is claude-opus-4-8, with parallel Sonnet and Haiku models in the 4-8 generation. The Anthropic "Building with the Claude API" course on Coursera was published on the 4-5 generation, and this series follows that course's lineup so the code stays consistent with what readers see in the official material — everything below works identically on newer models, and you should substitute the current ID whenever you actually deploy.
The practical model-selection rule is generation-independent:
- Opus (current: claude-opus-4-8; this series uses claude-opus-4-5): the heaviest, most capable model. Use for hard reasoning, complex multi-step problems, and final-quality drafts.
- Sonnet (this series uses claude-sonnet-4-5): the workhorse. ~80% of production calls in real applications should be running here. Strong general intelligence at moderate cost.
- Haiku (this series uses claude-haiku-4-5): the speed-and-cost option. Right choice for classification, summarisation, tool routing, and high-volume backend jobs.
A useful rule from real engagements: start every new feature on Sonnet, profile it, then move down to Haiku if quality holds or up to Opus if it doesn't. Picking Opus first because "we want it to be good" is how API bills end up 5× higher than they need to be.
Always check the current model list at docs.anthropic.com/en/docs/about-claude/models — model IDs change, and the page lists the current and deprecated lineups side by side.
max_tokens
This is the maximum number of tokens the model is allowed to generate. It is required. It does not affect the prompt size, only the response.
Two failure modes to be aware of:
- Set it too low and your response gets truncated mid-sentence with stop_reason: "max_tokens". Always check response.stop_reason in production code.
- Set it absurdly high "to be safe" and you gain nothing on cost, because you're billed only for the tokens actually generated, but you lose your early warning that the model is rambling.
A sensible default for chat-style interactions is 1024–2048. For structured-output tasks where you can predict the response shape, set it tighter.
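The stop_reason check described above can be sketched as a small guard. `text_or_raise` and `TruncatedResponseError` are hypothetical names for this example, and the `SimpleNamespace` object is only a stand-in for a real SDK response so the sketch runs offline.

```python
from types import SimpleNamespace


class TruncatedResponseError(RuntimeError):
    """Raised when generation stopped because max_tokens was reached."""


def text_or_raise(response):
    """Return the first text block, refusing silently truncated output."""
    if response.stop_reason == "max_tokens":
        raise TruncatedResponseError(
            "Response hit max_tokens; raise the limit or shorten the task."
        )
    return response.content[0].text


# Stand-in object for demonstration; a real call returns an SDK message object.
fake = SimpleNamespace(stop_reason="end_turn", content=[SimpleNamespace(text="Hi")])
print(text_or_raise(fake))
```

Whether truncation should raise, log a warning, or trigger a follow-up call depends on the feature; raising is simply the loudest option.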
messages
The conversation, as an ordered list. Each item is {"role": "user" | "assistant", "content": ...}. The official course phrases it cleanly:
"Messages must alternate between user and assistant roles. The first message must always be from the user." — Anthropic "Building with the Claude API" course, Module: Messages API.
For multi-turn conversations you maintain this list yourself, append the model's response as an assistant message, then send the whole thing back. The API is stateless — Claude does not remember your previous calls unless you replay them.
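Because the API is stateless, the history list is ordinary application state. The sketch below shows one way to maintain it while enforcing the alternation rule quoted above; `append_turn` is a hypothetical helper, and it assumes the history starts empty and contains only plain user and assistant turns.

```python
def append_turn(history, role, content):
    """Append a message, enforcing user-first and strict alternation."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unknown role: {role!r}")
    # Even positions (0, 2, ...) must be user turns, odd positions assistant turns.
    expected = "user" if len(history) % 2 == 0 else "assistant"
    if role != expected:
        raise ValueError(f"expected a {expected!r} message, got {role!r}")
    history.append({"role": role, "content": content})
    return history


history = []
append_turn(history, "user", "What is hybrid retrieval?")
append_turn(history, "assistant", "It combines keyword and vector search.")
append_turn(history, "user", "Give me one example.")
# `history` is now what you would pass as messages=... on the next call.
```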
temperature
temperature is not a required parameter — if you omit it, the API uses its default. But once you understand what it does, you'll almost always set it explicitly. It controls randomness. The default is 1.0. Two settings matter in practice:
- 0.0: deterministic-ish. Use for structured output, classification, anything where you want the same input to produce the same output.
- default (1.0): natural variation. Use for creative writing, chat, anything where small variation is fine.
Values between are rarely useful in production. Pick one of the two and move on.
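One way to make the two-setting rule explicit in code is to derive temperature from the kind of task instead of scattering literals around. This is only a sketch: `sampling_kwargs` is a hypothetical helper, and the request is built as a plain dict so the example runs without a network call.

```python
def sampling_kwargs(deterministic):
    """Return the temperature setting for the two cases discussed above."""
    return {"temperature": 0.0 if deterministic else 1.0}


request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Label this ticket: 'refund not received'"}
    ],
    **sampling_kwargs(deterministic=True),
}
# client.messages.create(**request) would send it; omitted so the sketch runs offline.
```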
The system parameter (for system prompts) gets its own article — see B1: System Prompts, Roles & Output Control.
3. Streaming as the default UX
For any user-facing interaction longer than ~200 tokens, streaming is not optional. A two-second wait before any text appears feels broken; the same two seconds of incremental text appearing feels fast. The cost is the same.
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain hybrid retrieval."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final_message = stream.get_final_message()
The context-manager form is the right pattern: it guarantees the connection is closed even if your downstream code throws.
A few practitioner notes:
- Streaming works in backend code too. Even without a UI, streaming lets you start logging, parsing, or piping output before generation finishes. For agent loops where the next step depends on the model's response, this can shave 200–500ms off each turn.
- get_final_message() gives you the full response object, including usage for cost logging, once the stream completes. Don't drop it.
- Streaming and tool use compose. Tool-call events arrive as their own event types in the stream, which the SDK normalises. We pick this up in C2: Custom Tools & Function Calling.
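The notes above suggest a small reusable shape: forward each chunk to a callback, then hand back the final message so usage is not lost. A minimal sketch follows, using only the SDK calls already shown in this section; `stream_text` and `on_text` are hypothetical names.

```python
def stream_text(client, on_text, **kwargs):
    """Stream a response, pass each text chunk to on_text, return the final message."""
    with client.messages.stream(**kwargs) as stream:
        for chunk in stream.text_stream:
            on_text(chunk)
        # Still inside the context manager, so the stream has not been closed yet.
        return stream.get_final_message()
```

A UI would pass a function that pushes chunks to the browser; a backend job might pass `buffer.append`. In both cases the caller can read `final.usage` from the returned object for cost logging.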
4. Errors, retries, and rate limits
The Anthropic SDK raises typed exceptions you can catch. The ones you'll actually see in production:
| Exception | When | Should you retry? |
|---|---|---|
| anthropic.APIConnectionError | Network / DNS / TLS issue | Yes, with backoff |
| anthropic.RateLimitError (HTTP 429) | You exceeded your tokens-per-minute or requests-per-minute limit | Yes, with backoff; respect the retry-after header |
| anthropic.APIStatusError 5xx (incl. 529 "overloaded") | Transient server issue | Yes, with backoff |
| anthropic.BadRequestError (400) | Malformed request, oversized prompt | No; fix the request |
| anthropic.AuthenticationError (401) | Bad / missing API key | No; fix the key |
| anthropic.PermissionDeniedError (403) | Org/project doesn't have access to the model or feature | No; fix the entitlement |
The Anthropic documentation summarises the retry posture as:
"Some errors are transient and can be safely retried. The SDK automatically retries certain errors with exponential backoff by default." — Anthropic API Docs, "Errors".
The SDK does retry by default for the transient classes — but the default retry count is conservative. In a wrapper you control, the right pattern is:
import time
import random
import anthropic
def call_with_retry(client, *, max_retries=4, **kwargs):
    delay = 1.0  # base delay in seconds, doubled on each attempt
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except (anthropic.APIConnectionError,
                anthropic.RateLimitError,
                anthropic.InternalServerError):
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the original exception
            sleep = delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(sleep)
Three things to notice:
- Jitter. The random.uniform(0, 0.5) term prevents a thundering herd if many workers retry at once.
- Re-raise on final attempt. If retries are exhausted, the caller still gets the original exception, not a None or a silent swallow.
- Non-retriable errors are not caught. A 400 should reach the caller immediately. Don't sleep through a bug.
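The table above says to respect the retry-after header, which call_with_retry does not read. One possible extension is to compute the sleep in a separate function that prefers a server-supplied value and falls back to capped exponential backoff. This is a sketch: `backoff_delay` and its parameters are hypothetical, and how you obtain the header (for example from the HTTP response attached to the exception) is an assumption to verify against your SDK version.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0, jitter=0.5):
    """Seconds to sleep before retry number `attempt` (0-based).

    Prefers a numeric retry-after value when one is supplied; otherwise
    falls back to capped exponential backoff. Jitter is added in both cases.
    """
    delay = None
    if retry_after is not None:
        try:
            delay = max(0.0, float(retry_after))
        except (TypeError, ValueError):
            delay = None  # e.g. an HTTP-date value; fall back to backoff
    if delay is None:
        delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)
```

Inside the except branch, `time.sleep(backoff_delay(attempt, retry_after=header_value))` would replace the inline calculation, where `header_value` stands for whatever your code extracts from the failed response.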
5. Cost and token awareness
Every successful response includes a usage object. It is part of the public SDK surface and is documented by Anthropic:
response.usage.input_tokens # tokens you sent
response.usage.output_tokens # tokens the model generated
response.usage.cache_read_input_tokens # cached input tokens (see B3)
response.usage.cache_creation_input_tokens # tokens written to cache
This is the single most under-used field in beginner code. Logging it on every call is the cheapest production observability you can buy.
The official pricing table — per-million-token rates, by model and by token type (input, output, cache write, cache read) — lives at anthropic.com/pricing. Prices change over time. Don't hard-code rates without a process to refresh them. A defensible cost helper takes the rate table as data rather than baking specific numbers into the source:
# Populate from the current numbers at https://www.anthropic.com/pricing.
# Per-million-token USD; refresh whenever Anthropic updates prices.
PRICES_PER_M_TOKENS: dict = {
    # "claude-sonnet-4-5": {"input": ..., "output": ...},
    # add models you actually use, with values copied from the pricing page
}
def estimate_cost_usd(model: str, usage) -> float:
    p = PRICES_PER_M_TOKENS.get(model)
    if not p:
        return 0.0
    return (
        usage.input_tokens * p["input"] / 1_000_000
        + usage.output_tokens * p["output"] / 1_000_000
    )
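estimate_cost_usd counts only input and output tokens. Once prompt caching is in play, the two cache fields listed earlier are billed at their own rates, so a fuller estimate might look like the sketch below. The rate-table key names ("cache_write", "cache_read") are this example's own convention, and the numbers are placeholders chosen for easy arithmetic, not real prices.

```python
from types import SimpleNamespace


def estimate_cost_with_cache(rates, usage):
    """Estimate USD cost from per-million-token rates, including cache fields.

    `rates` needs the keys "input", "output", "cache_write", "cache_read";
    those key names are this sketch's own convention.
    """
    def tokens(name):
        # Cache fields may be absent or None when caching is not used.
        return getattr(usage, name, None) or 0

    return (
        tokens("input_tokens") * rates["input"]
        + tokens("output_tokens") * rates["output"]
        + tokens("cache_creation_input_tokens") * rates["cache_write"]
        + tokens("cache_read_input_tokens") * rates["cache_read"]
    ) / 1_000_000


# Placeholder rates for the arithmetic only. These are NOT real prices.
example_rates = {"input": 1.0, "output": 2.0, "cache_write": 1.5, "cache_read": 0.5}
example_usage = SimpleNamespace(
    input_tokens=1_000_000,
    output_tokens=500_000,
    cache_creation_input_tokens=None,
    cache_read_input_tokens=200_000,
)
print(estimate_cost_with_cache(example_rates, example_usage))
```

With these placeholder numbers the terms are 1.00 + 1.00 + 0 + 0.10, so the estimate comes to 2.1.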
Two authoritative references to keep bookmarked:
- Current pricing: anthropic.com/pricing
- Token / usage object reference: the Messages API docs at docs.anthropic.com/en/api/messages describe the usage fields returned with every response.
Wire this helper into your client wrapper once and you'll never have to wonder where your monthly bill went. The first time someone asks "why did our spend double last month?", a per-call JSONL log of {model, input_tokens, output_tokens, latency_ms} answers the question in two minutes of pandas.
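The per-call JSONL log mentioned above needs very little code. Here is a minimal sketch using only the standard library; `log_call` and `total_tokens_by_model` are hypothetical names, and the added "ts" timestamp field is an assumption beyond the fields listed in the text.

```python
import json
import time


def log_call(path, model, usage, latency_ms):
    """Append one JSON line describing a single API call."""
    record = {
        "ts": time.time(),
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "latency_ms": round(latency_ms, 1),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def total_tokens_by_model(path):
    """Sum input and output tokens per model from the JSONL log."""
    totals = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            t = totals.setdefault(
                rec["model"], {"input_tokens": 0, "output_tokens": 0}
            )
            t["input_tokens"] += rec["input_tokens"]
            t["output_tokens"] += rec["output_tokens"]
    return totals
```

The same file loads directly into pandas for heavier analysis; the pure-Python reader is here only to keep the example dependency-free.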
6. Putting it together
The companion notebook builds the components above into a single ClaudeClient wrapper that:
- Loads credentials from .env.
- Retries transient errors with jittered exponential backoff.
- Streams when stream=True, returns the complete response in one piece otherwise.
- Logs {model, input_tokens, output_tokens, estimated_cost_usd, latency_ms} for every call.
- Surfaces non-retriable errors immediately.
Once you have a wrapper like this, every later article in the series imports it and focuses on what's new — prompt design, tool use, MCP — rather than rebuilding boilerplate.
Open the notebook now and walk through Sections 1–5 to build it. Section 6 (the Practitioner Lab) gives you an open-ended extension: add a per-day token-budget cap that refuses calls once the daily spend exceeds a configured threshold.
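As a possible starting point for that lab, the sketch below tracks tokens per calendar day in memory and refuses further calls once a limit is reached. It is not the notebook's solution: `DailyTokenBudget` and `BudgetExceeded` are hypothetical names, the counter is per-process only, and it counts tokens instead of dollars.

```python
import datetime as dt


class BudgetExceeded(RuntimeError):
    """Raised when the configured daily token budget is used up."""


class DailyTokenBudget:
    """Track tokens used per calendar day and refuse calls past a limit."""

    def __init__(self, max_tokens_per_day, today=dt.date.today):
        self.max_tokens_per_day = max_tokens_per_day
        self._today = today  # injectable clock, useful for testing
        self._day = today()
        self._used = 0

    def _roll_over(self):
        # Reset the counter when the date changes.
        current = self._today()
        if current != self._day:
            self._day, self._used = current, 0

    def check(self):
        """Call before a request; raises if the budget is already spent."""
        self._roll_over()
        if self._used >= self.max_tokens_per_day:
            raise BudgetExceeded(
                f"daily budget of {self.max_tokens_per_day} tokens reached"
            )

    def record(self, usage):
        """Call after a request with response.usage."""
        self._roll_over()
        self._used += usage.input_tokens + usage.output_tokens
```

A wrapper would call `budget.check()` before each request and `budget.record(response.usage)` after it. Sharing the counter across workers needs external storage, which is left to the lab.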
Practitioner Notes
- Don't put the wrapper in your application code. Put it in a separate module (llm_client.py) and import it everywhere. The number of times I've seen retry logic duplicated across six files inside one repo is depressing.
- Log to a file by default, not just stdout. Token usage rolled up by day is the first thing finance and engineering will ask you for. JSONL is the right format: one line per call, easy to load with pandas.read_json(..., lines=True).
- Pick one model per feature and stick to it for some time. Constantly flipping between Sonnet and Haiku makes it impossible to attribute quality regressions. Profile, then change deliberately.
- Streaming is also a debugging tool. If you're seeing weird latency, streaming will tell you whether the model is slow to start generating (prompt processing) or slow to finish (long output). The two have completely different fixes.
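To separate "slow to start" from "slow to finish", it is enough to time the first chunk and the whole stream. The sketch below works on any iterable of text chunks, such as the stream's text_stream; `time_stream` is a hypothetical helper, and the injectable `clock` exists only to make it testable.

```python
import time


def time_stream(chunks, start=None, clock=time.perf_counter):
    """Consume text chunks; report time to first chunk and total time.

    Pass `start` (a clock reading taken before the request was opened) to
    include connection and prompt-processing time in both numbers.
    """
    if start is None:
        start = clock()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = clock() - start  # latency until the first text arrived
        parts.append(chunk)
    total = clock() - start
    return {"text": "".join(parts), "first_chunk_s": first, "total_s": total}
```

A large `first_chunk_s` with a small remainder points at prompt size or queueing; a small `first_chunk_s` with a large `total_s` points at output length.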
Beyond the Docs
The official course and docs cover most of what's above, but two things they don't emphasise enough for a working developer:
- The usage object is silent gold. The docs document it, the course mentions it, but neither makes it clear that you should be logging it on every call from day one. Retrofit this later and you'll regret it.
- Backoff jitter. The SDK's built-in retry is fine for a single client, but if you have multiple workers calling the API in parallel, backoff without enough randomness can synchronise them. Custom wrappers with explicit jitter, like the one above, avoid this. Worth doing before you scale, not after.
Based on Anthropic's "Building with the Claude API" course (Coursera) and public API documentation. Commentary © 2026 DataMy. Not affiliated with Anthropic.