Lesson 11 of 17

Anatomy of an LLM request

What's actually on the wire when you call Claude, GPT, or Gemini, and why every field matters.

The five things every LLM request has

No matter which provider you call, a chat-completion request has five parts. Learn them once, apply everywhere:

1. The model

A string like claude-sonnet-4-5, gpt-5, gemini-2.5-pro. This is the single biggest decision in the request. It changes:

  • Capability. Stronger models reason better, code better, follow nuanced instructions better.
  • Cost per token. Often 10x-50x between families.
  • Latency. Bigger models can take multiple seconds for the first token.
  • Context window. How much input you can feed in.

A very common production mistake: using your best model for every request, when a cheaper one would be fine for 80% of them.
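One way to avoid that mistake is a routing function in front of your LLM calls. This is an illustrative sketch: the model names and the length threshold are assumptions you'd tune for your own workload, not recommendations.

```typescript
// Hypothetical model router: send short, low-stakes prompts to a cheaper
// tier and reserve the strongest model for everything else.
// Model names and the 500-char cutoff are illustrative assumptions.
function chooseModel(prompt: string, highStakes: boolean): string {
  if (!highStakes && prompt.length < 500) {
    return "claude-haiku-4-5"; // cheaper, faster tier
  }
  return "claude-sonnet-4-5"; // stronger default
}
```

Even a crude heuristic like this can cut your bill substantially if most traffic is simple.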

2. The system prompt

The "character" the model plays. This is where you put:

  • who the model is ("You are a senior React engineer who…"),
  • what it's allowed to do ("never return markdown code fences"),
  • what format the reply must be in ("respond with a JSON object like…").

System prompts are sticky — once a model is "in character" it'll stay there across the whole conversation. Put your strongest constraints here, not in the user message.
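In practice it helps to assemble the system prompt from those three ingredients explicitly. The wording below is made up for illustration; the structure (role, then rules, then output format) is the point.

```typescript
// Assembling a system prompt from the three ingredients above:
// who the model is, what it may do, and what format it must reply in.
const systemPrompt = [
  "You are a senior React engineer reviewing pull requests.", // who
  "Never return markdown code fences.", // constraints
  'Respond with a JSON object like {"verdict": "...", "notes": []}.', // format
].join("\n");
```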

3. The messages

An ordered list of {role, content} turns. Roles are usually "user", "assistant", and sometimes "tool".

"messages": [
  { "role": "user", "content": "What's the capital of France?" },
  { "role": "assistant", "content": "Paris." },
  { "role": "user", "content": "And Germany?" }
]

Two gotchas:

  • Many providers require strict user/assistant alternation — two user messages in a row may be rejected (or silently merged), so normalize your conversation history before sending.
  • In Anthropic's API the system prompt is NOT in this array — it's a separate top-level field. (OpenAI instead passes it as a "system"/"developer" message inside the array.) Mixing up the two conventions is the #1 beginner mistake.
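The alternation rule is easy to check before you send a request. A minimal sketch of such a guard (the `Msg` type and function name are my own, not part of any SDK):

```typescript
// Minimal alternation check matching the stricter providers' rule:
// no two consecutive turns may share a role.
type Msg = { role: "user" | "assistant"; content: string };

function rolesAlternate(messages: Msg[]): boolean {
  return messages.every((m, i) => i === 0 || m.role !== messages[i - 1].role);
}
```

Running this on your history before every call turns a confusing 400 error into an assertion you control.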

4. The generation parameters

max_tokens, temperature, top_p, stop_sequences. Two of them really matter:

  • max_tokens — the hard limit on how long the reply can be. If you set it too low, the model will cut off mid-sentence. If you set it too high, you've capped your worst-case cost at a very high number.
  • temperature — randomness. 0 = always pick the most likely next token (good for code, structured output). ~0.7 = conversational variety (good for chat). >1 = creative and unreliable.
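The max_tokens point is worth making concrete: it's the only hard cap on reply length, so it bounds your worst-case spend per request. A back-of-envelope helper (the $15-per-million price here is purely illustrative, not a real quote):

```typescript
// Worst-case output cost for one request: max_tokens bounds the reply,
// so it bounds spend. The price argument is an illustrative assumption.
function worstCaseOutputCost(maxTokens: number, usdPerMillionTokens: number): number {
  return (maxTokens / 1_000_000) * usdPerMillionTokens;
}

// worstCaseOutputCost(4096, 15) ≈ $0.061 per request — about $61 per
// thousand requests if every reply hits the cap.
```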

5. Tools (optional but increasingly standard)

A list of functions the model is allowed to "call." The provider doesn't actually execute them — instead, the model replies with "please call tool X with arguments Y." You execute, feed the result back, and the model continues.

Tools get their own lesson later in the course.
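To make the shape concrete, here's what a single tool definition looks like in the Anthropic request format (OpenAI's is similar, nested under a "function" key). The weather tool itself is a made-up example:

```typescript
// A tool definition in the Anthropic request shape. The model never
// executes this — it only replies asking YOU to call it with arguments.
const tools = [
  {
    name: "get_weather",
    description: "Get the current weather for a city.",
    input_schema: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  },
];
```

The `input_schema` is plain JSON Schema — the model uses the `description` and schema to decide when and how to call the tool.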

Reading a real request

Here's a minimal real call using the Anthropic SDK:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const res = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: "You are a terse senior engineer. Reply in ≤5 bullets.",
  messages: [
    { role: "user", content: "Why do we need a PR process at all?" },
  ],
});

console.log(res.content[0].text);

Notice:

  • system is a top-level field, not in messages.
  • max_tokens is required (not optional) in the Anthropic API.
  • res.content is an array of content blocks — [0].text is the common access pattern.
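One more field worth inspecting in production: the response's `stop_reason` tells you *why* generation ended. A small sketch of the check (the function name is mine):

```typescript
// If stop_reason is "max_tokens", the reply hit the cap and was cut off
// mid-thought — you probably want to retry with a higher limit or warn.
function wasTruncated(res: { stop_reason: string | null }): boolean {
  return res.stop_reason === "max_tokens";
}
```

A normal completion ends with `"end_turn"`; silently shipping truncated replies is a common source of "the model just stopped" bug reports.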

Homework

Pick any provider you have a key for (Anthropic, OpenAI, Google). Write a 10-line script that calls the chat API with a system prompt forcing it into a character (pirate, 19th-century professor, angry compiler). Observe how strongly the system prompt steers the reply, even at temperature: 0.7.

Next: streaming — because your users won't wait 6 seconds for a wall of text.


Inspired by Anthropic's "Building with the Claude API".
