B1. System Prompts, Roles, and Output Control

About this article This article is part of Building with Claude — A Practitioner's Guide to the Anthropic API, a study-notes-plus-commentary series based on Anthropic's official "Building with the Claude API" course (hosted on Coursera) and the public Anthropic API documentation at docs.anthropic.com.

Original course and documentation material is © Anthropic. Direct quotes are cited inline. Commentary, code adaptations, and examples are © DataMy. This series is independent and not affiliated with or endorsed by Anthropic.

Companion notebook: B1_system_prompts_output.ipynb · llm_client.py
Setup: see README.md in the series repo
Dataset: data/saas_metrics.csv

Three levers, one call

Most beginner code touches one parameter: the user message. That's the prompt. It works, but it gives up three other levers the API hands you for free:

The system prompt — who Claude is in this conversation, what it must always do, what it must never do.
Temperature — how much variation you accept between identical calls.
The shape of the response — how the output starts, where it ends, and whether it conforms to a schema you can parse.

This article walks through all three. The companion notebook applies them to an analytics-copilot persona that reads a SaaS metrics table and produces clean, schema-validated JSON answers ready for downstream code.

1. The three roles

Every call to messages.create() involves three conceptual roles, even when you only set two of them explicitly.

Role	Set by	Purpose
`system`	The `system` parameter	Claude's persistent identity and constraints for this entire conversation
`user`	An item in `messages` with `"role": "user"`	What the human (or your app) is asking right now
`assistant`	An item in `messages` with `"role": "assistant"`	Claude's reply — but you can also pre-write the beginning of it

The first two are straightforward. The third one is the lever most beginners miss: you are allowed to write the assistant's reply yourself, partially, before the API is even called. Claude will continue from where you left off. We come back to this in section 4.

The official course phrases the structural rule cleanly:

"Messages must alternate between user and assistant roles. The first message must always be from the user." — Anthropic "Building with the Claude API" course, Module: Messages API, accessed 2026-06-08.

That means the system parameter is not part of the messages list. It is passed separately and applies to the whole conversation regardless of how many user/assistant turns follow.
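As a minimal sketch of that separation, the helper below checks the alternation rule quoted above and returns the system prompt and the messages as two distinct request fields. The names check_alternation and make_call_kwargs are hypothetical, introduced only for illustration; they are not part of the SDK.

```python
from typing import Dict, List

Message = Dict[str, str]


def check_alternation(messages: List[Message]) -> None:
    """Raise ValueError if the list breaks the user-first, alternating rule."""
    if not messages:
        raise ValueError("messages must contain at least one item")
    if messages[0]["role"] != "user":
        raise ValueError("the first message must have role 'user'")
    for prev, cur in zip(messages, messages[1:]):
        if prev["role"] == cur["role"]:
            raise ValueError(f"two consecutive {cur['role']!r} messages")


def make_call_kwargs(system: str, messages: List[Message]) -> dict:
    """Keep the system prompt outside the messages list, as the API expects."""
    check_alternation(messages)
    return {"system": system, "messages": messages}


conversation = [
    {"role": "user", "content": "What was MRR in September?"},
    {"role": "assistant", "content": "I need the data table to answer that."},
    {"role": "user", "content": "Here it is: ..."},
]
kwargs = make_call_kwargs("You are a SaaS metrics analyst.", conversation)
```

The resulting dict can be unpacked into client.messages.create(model=..., max_tokens=..., **kwargs); the system text never appears as an item in messages.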

2. System prompts: who Claude is

A system prompt is the place to encode everything that should be true for every turn of a conversation. The Anthropic documentation describes its role as:

"System prompts are a way to provide context, instructions, and guidelines to Claude before presenting it with a question or task." — Anthropic API Docs, "System prompts", accessed 2026-06-08.

The mechanics are trivial:

client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system="You are a SaaS metrics analyst. Always cite the exact numbers from the data provided.",
    messages=[{"role": "user", "content": "What's our net revenue retention last quarter?"}],
)

What's not trivial is writing one that actually helps. After enough trial and error, a structure that holds up looks like this:

<role>
You are X — describe the persona in one sentence.
</role>

<constraints>
- What Claude must always do.
- What Claude must never do.
- Domain knowledge or jargon it should assume / not assume.
</constraints>

<style>
- Tone (formal / casual / direct).
- Length preference.
- Whether to ask clarifying questions or commit to answers.
</style>

<output_format>
- Default response shape (paragraph / bullets / JSON).
- Citation expectations (e.g. "always quote the exact figure from the data").
</output_format>

The XML-style tags are not magic, but they help Claude segment the system prompt into discrete instruction blocks — and they give you a clean place to update one section without rewriting the whole thing.
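One way to get that "update one section" property in code is to keep each section as a separate string and render the tags at call time. This is a sketch under that assumption; render_system_prompt is a hypothetical helper, not something the course or the SDK provides.

```python
from typing import Dict


def render_system_prompt(sections: Dict[str, str]) -> str:
    """Wrap each section body in <name>...</name> tags, in insertion order."""
    blocks = []
    for name, body in sections.items():
        blocks.append(f"<{name}>\n{body.strip()}\n</{name}>")
    return "\n\n".join(blocks)


sections = {
    "role": "You are a SaaS metrics analyst.",
    "constraints": "- Never invent figures.",
}
prompt_v1 = render_system_prompt(sections)

# Changing one section leaves every other block byte-for-byte identical.
sections["role"] = "You are a SaaS metrics analyst supporting a revenue operations team."
prompt_v2 = render_system_prompt(sections)
```

Because each section lives under its own key, a diff between two prompt versions shows exactly which block changed.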

A concrete example, for the analytics-copilot persona used in the companion notebook:

SYSTEM_PROMPT = """
<role>
You are a SaaS metrics analyst supporting a revenue operations team.
</role>

<constraints>
- Only answer using the data provided in the user message. Never invent figures.
- If a metric the user asks for is not in the data, say so explicitly.
- Use standard SaaS terminology: MRR, ARR, NRR, gross retention, expansion MRR, churn MRR.
</constraints>

<style>
- Direct. No hedging. No "I hope this helps".
- Numbers first, explanation second.
</style>

<output_format>
- Default to a short paragraph followed by a bullet list of the supporting figures.
- When the user asks for "JSON", return ONLY a JSON object — no preamble.
</output_format>
"""

The notebook compares this against the same user question asked with no system prompt at all. The difference in response discipline is hard to overstate.

What system prompts are not good for

Two anti-patterns I see in real codebases:

Business logic in the system prompt. "If the user asks about Customer A's revenue, never mention contract terms." This works until it doesn't, and then nobody knows why. Business rules belong in code, not in a paragraph the model interprets.
Long compliance disclaimers. Stuffing five hundred tokens of "you must always be polite, you must never mention competitors, you must always disclose…" into the system prompt every call burns input tokens without much behavioural payoff. Prompt caching (article B3) makes long system prompts cheaper, but it doesn't make them better-designed.
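To make the first anti-pattern concrete, here is a minimal sketch of moving a rule out of the prompt and into code: restricted fields are stripped from the data before it is ever placed in the user message, so the model cannot mention what it never saw. The field names and the sample row are hypothetical.

```python
from typing import Dict, List

# Hypothetical rule: these columns must never reach the model.
RESTRICTED_FIELDS = {"contract_terms", "discount_pct"}


def redact_rows(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return copies of the rows with restricted fields removed."""
    return [
        {key: value for key, value in row.items() if key not in RESTRICTED_FIELDS}
        for row in rows
    ]


rows = [
    {"customer": "Customer A", "mrr": 12000, "contract_terms": "36-month, auto-renew"},
]
safe_rows = redact_rows(rows)  # only this version goes into the prompt
```

Unlike a sentence in the system prompt, this rule can be unit-tested and fails loudly when someone changes it.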

3. Temperature

Mentioned briefly in A2; expanded here because it pairs naturally with the system prompt as a behaviour-shaping lever.

Temperature controls randomness in token selection. The default is 1.0. In practical use there are really two settings:

temperature=0 — deterministic-leaning. Use for structured output, classification, data extraction, code generation, anything where you want the same input to produce the same output.
temperature=1.0 — the default. Use for chat, creative output, and anything where variation is fine or welcome.

For a SaaS metrics analyst, temperature=0 is almost always the right setting. You don't want last quarter's NRR to be 112% one call and 109% the next.

The official docs phrase the trade-off as:

"Use temperature closer to 0 for analytical and multiple-choice tasks, and closer to 1 for creative and generative tasks." — Anthropic API Docs, "Messages — temperature".

Note that temperature=0 is not perfectly deterministic — there is still some non-determinism in the underlying inference pipeline. For most applications it's deterministic enough; for applications that need byte-exact reproducibility (regulated reporting, for example), you'll need a layer above the API to cache and replay model responses.
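A cache-and-replay layer can be sketched in a few lines. The version below is an assumption about how such a layer might look, not a recommended production design: it keys on a SHA-256 of the canonicalised request and stores responses in memory, where a real system would use a durable store. ReplayCache and the call_model callable are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict


class ReplayCache:
    """Store the first response for each distinct request and replay it afterwards."""

    def __init__(self, call_model: Callable[[dict], str]) -> None:
        self._call_model = call_model  # e.g. a thin wrapper around client.messages.create
        self._store: Dict[str, str] = {}

    @staticmethod
    def key_for(request: dict) -> str:
        # Sorting keys makes the hash independent of dict ordering.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def complete(self, request: dict) -> str:
        key = self.key_for(request)
        if key not in self._store:
            self._store[key] = self._call_model(request)
        return self._store[key]
```

Any change to the model name, system prompt, temperature, or messages produces a new key, so a replayed answer always corresponds to exactly the request that generated it.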

4. Pre-filling the assistant turn

Most beginners don't know this lever exists, and it is the most direct way to get structured output with no preamble around it.

You are allowed to add an assistant message to the end of the messages list, before calling the API. Claude treats your pre-fill as its own prior words and continues writing from there.

messages = [
    {"role": "user", "content": "Return the Q3 SMB metrics as a JSON object."},
    {"role": "assistant", "content": "{"},   # ← pre-fill: forces JSON, no preamble
]

Claude's next generated token will continue the JSON object. No "Sure, here's the JSON:" preamble, no markdown code fence, no apologetic prose. The response starts with what comes after {.

This solves the most annoying problem with naive structured-output prompts: the model wanting to be helpful by wrapping the output in explanation. Pre-fill removes the choice.

A common pattern for forcing JSON output:

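import json

# `client`, `SYSTEM_PROMPT`, and `user_question_plus_data` are defined earlier.
PREFILL = "{"

messages = [
    {"role": "user", "content": user_question_plus_data},
    {"role": "assistant", "content": PREFILL},   # pre-fill: the reply must continue from "{"
]
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=messages,
    system=SYSTEM_PROMPT,
    temperature=0,
)
# The API returns only the continuation, so prepend the pre-fill to get full JSON.
raw = PREFILL + response.content[0].text
data = json.loads(raw)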

Two gotchas worth knowing:

Pre-fill is for the start, not the end. If you pre-fill with {, you still need a way to make Claude stop after the JSON object closes — covered in section 5.
Trailing whitespace in the pre-fill matters. The API rejects a final assistant message that ends in whitespace, so pre-fill with ```json rather than ```json\n and let Claude emit the newline itself. Test what you actually ship.

5. Stop sequences: bounding the end

stop_sequences is a list of strings; if Claude generates any of them, generation halts immediately and the matched string is not included in the response. It is the natural counterpart to pre-fill.

Classic JSON-extraction recipe:

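messages = [
    {"role": "user", "content": "Generate a JSON summary of the data."},
    {"role": "assistant", "content": "```json"},   # no trailing newline: a pre-fill ending in whitespace is rejected
]
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=messages,
    stop_sequences=["```"],
    temperature=0,
)
# The continuation starts after the opening fence and stops before the closing one.
# Strip the surrounding newlines and what remains is JSON with no fence and no prose.
clean_json = response.content[0].text.strip()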

The pre-fill forces the JSON to start; the stop sequence forces it to end before the closing fence is emitted.

Always check response.stop_reason. If it equals "stop_sequence", the model stopped at a stop_sequences match. If it equals "max_tokens", the output was truncated — your JSON is likely incomplete and will fail to parse.
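A small guard makes that check hard to forget. The sketch below assumes you pass in the response text and stop_reason yourself; parse_json_reply and TruncatedOutput are hypothetical names, and the function only handles replies that are expected to be a single JSON object.

```python
import json


class TruncatedOutput(Exception):
    """Raised when generation hit max_tokens before the JSON finished."""


def parse_json_reply(text: str, stop_reason: str, prefill: str = "") -> dict:
    """Refuse truncated output, then parse prefill + continuation as a JSON object."""
    if stop_reason == "max_tokens":
        raise TruncatedOutput("output was cut off at max_tokens; raise the limit or shrink the task")
    parsed = json.loads(prefill + text)  # json.loads tolerates surrounding whitespace
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

With the fence recipe above, the call would be parse_json_reply(response.content[0].text, response.stop_reason); with a "{" pre-fill, pass prefill="{" so the opening brace is restored before parsing.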

6. The right way for production: tool schemas

Pre-fill plus stop sequences gets you most of the way, but nothing in that recipe checks the output against a schema. For real production pipelines where you need stronger guarantees, structured output via the tool-use API is the better answer.

You define your output as a tool schema, force Claude to "call" the tool, and read the structured arguments off the response:

metrics_schema = {
    "name": "report_metrics",
    "description": "Reports SaaS metrics extracted from a data table.",
    "input_schema": {
        "type": "object",
        "properties": {
            "month":         {"type": "string", "description": "ISO month, e.g. 2025-09"},
            "segment":       {"type": "string", "enum": ["SMB", "Mid-Market", "Enterprise"]},
            "mrr_end":       {"type": "number"},
            "net_new_mrr":   {"type": "number"},
            "churn_mrr":     {"type": "number"},
            "commentary":    {"type": "string", "description": "1-sentence interpretation"},
        },
        "required": ["month", "segment", "mrr_end", "net_new_mrr", "churn_mrr"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_question_plus_data}],
    tools=[metrics_schema],
    tool_choice={"type": "tool", "name": "report_metrics"},
    temperature=0,
)

# The tool call's input is already a parsed dict — no json.loads() needed.
tool_use_block = next(b for b in response.content if b.type == "tool_use")
structured = tool_use_block.input

Why this is better than pre-fill + stop sequences for production:

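The schema shapes the output. Claude fills in the tool's input_schema, so required fields, enum values, and types are followed far more reliably than in free-form text. Treat that as strong guidance rather than a hard guarantee, and validate the dict in code before you trust it.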
You get a dict, not a string. No json.loads() failure mode, no surprise whitespace handling.
Self-documenting. The schema is your output contract — when someone reads the code six months later, the shape is obvious.

The pre-fill technique remains useful for one-off scripts and for shaping prose outputs (forcing a specific opening). Tool schemas are the right default for any structured extraction that ships.
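Even with a forced tool call, a defensive check on the returned dict is cheap. The sketch below is a deliberately small validator written for this article, covering only required keys, enum membership, and string/number types; schema_violations is a hypothetical helper, and a full JSON Schema validator library would be the more complete choice.

```python
from typing import Any, Dict, List


def schema_violations(payload: Dict[str, Any], input_schema: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the payload passed."""
    problems: List[str] = []
    properties = input_schema.get("properties", {})
    for key in input_schema.get("required", []):
        if key not in payload:
            problems.append(f"missing required field: {key}")
    for key, value in payload.items():
        spec = properties.get(key)
        if spec is None:
            continue  # unknown keys are ignored in this sketch
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key}: {value!r} is not one of {spec['enum']}")
        expected = spec.get("type")
        if expected == "string" and not isinstance(value, str):
            problems.append(f"{key}: expected a string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if expected == "number" and (isinstance(value, bool) or not isinstance(value, (int, float))):
            problems.append(f"{key}: expected a number")
    return problems


# Trimmed copy of the input_schema from metrics_schema above.
INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "month": {"type": "string"},
        "segment": {"type": "string", "enum": ["SMB", "Mid-Market", "Enterprise"]},
        "mrr_end": {"type": "number"},
    },
    "required": ["month", "segment", "mrr_end"],
}

# Hypothetical payloads standing in for tool_use_block.input.
good = {"month": "2025-09", "segment": "SMB", "mrr_end": 48200.0}
bad = {"month": "2025-09", "segment": "Startup", "mrr_end": "48k"}
```

In the pipeline above this would run as schema_violations(tool_use_block.input, metrics_schema["input_schema"]), with a non-empty result routed to a retry or an error log rather than downstream code.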

We come back to tool schemas in depth in C2: Custom Tools & Function Calling; here we're using them only as an output enforcer, not as a way to let the model take actions.

Practitioner Notes

Treat the system prompt as code. Version-control it. Put it in a file, not inline in client.messages.create() calls scattered across your codebase. The first time you A/B two versions of a system prompt against the same evaluation set, you'll be glad it had a stable path.
Default to temperature=0 for any backend call. Then raise it deliberately when you need variation. The reverse pattern — defaulting to 1.0 and trying to debug why your tests are flaky — wastes weeks.
Don't mix output techniques. A single call should either use a tool schema or use pre-fill + stop sequences. Trying to do both at once produces confusing behaviour and harder debugging.
Log the system prompt hash with each call. When responses start drifting, you want to know whether someone changed the prompt or whether the model itself changed. A simple SHA-256 of the system prompt logged alongside {model, usage, latency_ms} answers this question instantly.
Long system prompt? Use prompt caching. Once a system prompt grows past a few thousand tokens — common for personas with extensive domain context — prompt caching (B3) makes it dramatically cheaper to keep using. Don't shorten a good system prompt just to save tokens.
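The prompt-hash note can be sketched as follows. The record layout and the function names are assumptions for illustration, and the usage figures are made up; only hashlib and json from the standard library are used.

```python
import hashlib
import json


def prompt_fingerprint(system_prompt: str) -> str:
    """SHA-256 hex digest of the exact system prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def call_log_line(system_prompt: str, model: str, usage: dict, latency_ms: float) -> str:
    """Build one JSON log line that ties a call to the prompt version it used."""
    record = {
        "system_prompt_sha256": prompt_fingerprint(system_prompt),
        "model": model,
        "usage": usage,
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical values standing in for a real response's usage and timing.
line = call_log_line(
    "You are a SaaS metrics analyst.",
    "claude-sonnet-4-5",
    {"input_tokens": 412, "output_tokens": 96},
    850.0,
)
```

Since the hash covers the exact text, even a whitespace change to the prompt shows up as a new fingerprint in the logs.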

Beyond the Docs

The official course covers system prompts, temperature, stop sequences, and structured data extraction as separate topics. Two things it doesn't connect for you:

The natural sequence is roles → system → temperature → pre-fill → stop → tools. Each layer constrains a different axis. Beginner code tries to fix output problems by rewriting the user message; almost every output problem is better solved by adding one of the layers above instead.
Tool schemas are not just for tools. The course introduces them in the context of function calling, where the model takes actions. Used as a structured-output enforcer (forcing tool_choice to a specific schema and ignoring the "action" framing), they're the cleanest way to get schema-shaped JSON from the API. The official docs hint at this; the course treats it as a side note. In production, this pattern is the workhorse.
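The layering can be made explicit in one request builder. This is a sketch, not the companion notebook's code: layered_request is a hypothetical helper that only assembles keyword arguments for client.messages.create, and it refuses to combine a tool schema with pre-fill, in line with the practitioner note on not mixing output techniques.

```python
from typing import Any, Dict, List, Optional


def layered_request(
    user_content: str,
    *,
    model: str,
    system: Optional[str] = None,
    temperature: float = 0.0,
    prefill: Optional[str] = None,
    stop_sequences: Optional[List[str]] = None,
    tools: Optional[List[dict]] = None,
    forced_tool: Optional[str] = None,
    max_tokens: int = 1024,
) -> Dict[str, Any]:
    """Assemble request kwargs one layer at a time."""
    if tools and prefill:
        raise ValueError("use either a tool schema or pre-fill + stop sequences, not both")
    messages = [{"role": "user", "content": user_content}]   # roles
    if prefill:
        messages.append({"role": "assistant", "content": prefill})   # pre-fill
    request: Dict[str, Any] = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,   # temperature
        "messages": messages,
    }
    if system:
        request["system"] = system   # system
    if stop_sequences:
        request["stop_sequences"] = list(stop_sequences)   # stop
    if tools:
        request["tools"] = tools   # tools
        if forced_tool:
            request["tool_choice"] = {"type": "tool", "name": forced_tool}
    return request
```

A call then reads as client.messages.create(**layered_request(...)), and each output problem maps to one argument rather than another rewrite of the user message.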
