The five things every LLM request has
No matter which provider you call, a chat-completion request has five parts. Learn them once, apply everywhere:
1. The model
A string like claude-sonnet-4-5, gpt-5, gemini-2.5-pro. This is
the single biggest decision in the request. It changes:
- Capability. Stronger models reason better, code better, follow nuanced instructions better.
- Cost per token. Pricing often differs 10x-50x between model families.
- Latency. Bigger models can take multiple seconds for the first token.
- Context window. How much input you can feed in.
A very common production mistake: using your best model for every request, when a cheaper one would be fine for 80% of them.
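A common fix is a tiny routing layer that defaults to the cheap model and escalates only when the task demands it. The model names and the heuristic below are illustrative assumptions, not a recommendation:

```javascript
// Sketch: route most requests to a cheaper model, escalate the rest.
// Model ids and the escalation heuristic are illustrative.
const STRONG_MODEL = "claude-sonnet-4-5";
const CHEAP_MODEL = "claude-haiku-4-5"; // assumed cheaper sibling model

function pickModel(task) {
  // Heuristic: explicit reasoning tasks or very long inputs get the strong model.
  const needsStrong = task.kind === "reasoning" || task.input.length > 4000;
  return needsStrong ? STRONG_MODEL : CHEAP_MODEL;
}
```

Even a crude rule like this can capture most of the savings, since the cheap model handles the high-volume simple requests.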
2. The system prompt
The "character" the model plays. This is where you put:
- who the model is ("You are a senior React engineer who…"),
- what it's allowed to do ("never return markdown code fences"),
- what format the reply must be in ("respond with a JSON object like…").
System prompts are sticky — once a model is "in character" it'll stay there across the whole conversation. Put your strongest constraints here, not in the user message.
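Putting the three ingredients together, a system prompt might be assembled like this (the wording is a sketch, not a canonical template):

```javascript
// Sketch: a system prompt built from identity, constraints, and output format.
// The exact wording is illustrative.
const systemPrompt = [
  "You are a senior React engineer reviewing a junior developer's code.", // who
  "Never return markdown code fences.",                                   // allowed behavior
  'Respond with a JSON object like {"verdict": "...", "issues": [...]}.', // output format
].join("\n");
```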
3. The messages
An ordered list of {role, content} turns. Roles are usually
"user", "assistant", and sometimes "tool".
"messages": [
{ "role": "user", "content": "What's the capital of France?" },
{ "role": "assistant", "content": "Paris." },
{ "role": "user", "content": "And Germany?" }
]
Two gotchas:
- Messages have to alternate in most providers — two user messages in a row will error.
- In the Anthropic API, the system prompt is NOT in this array — it's a separate top-level field. (OpenAI instead accepts it as a "system" or "developer" role inside the array.) Mixing up the two conventions is the #1 beginner mistake.
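You can catch the alternation gotcha before the API does with a small validator. This is a sketch — real SDKs will surface their own error for malformed arrays:

```javascript
// Sketch: reject message arrays where two adjacent turns share a role.
function validateMessages(messages) {
  for (let i = 1; i < messages.length; i++) {
    if (messages[i].role === messages[i - 1].role) {
      throw new Error(
        `messages[${i - 1}] and messages[${i}] both have role "${messages[i].role}"`
      );
    }
  }
  return messages;
}
```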
4. The generation parameters
max_tokens, temperature, top_p, stop_sequences. Two of them really matter:
- max_tokens — the hard limit on how long the reply can be. Set it too low and the model will cut off mid-sentence; set it too high and you've capped your worst-case cost at a very high number.
- temperature — randomness. 0 = always pick the most likely next token (good for code, structured output). ~0.7 = conversational variety (good for chat). >1 = creative and unreliable.
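One practical pattern is to keep per-task presets for these two parameters instead of hardcoding them at each call site. The numbers below are illustrative defaults, not provider recommendations:

```javascript
// Illustrative presets: deterministic for structured output, varied for chat.
const GENERATION_PRESETS = {
  code: { max_tokens: 2048, temperature: 0 },   // structured output: no randomness
  chat: { max_tokens: 1024, temperature: 0.7 }, // conversational variety
};

function paramsFor(task) {
  // Fall back to the chat preset for unknown task types.
  return GENERATION_PRESETS[task] ?? GENERATION_PRESETS.chat;
}
```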
5. Tools (optional but increasingly standard)
A list of functions the model is allowed to "call." The provider doesn't actually execute them — instead, the model replies with "please call tool X with arguments Y." You execute, feed the result back, and the model continues.
You'll meet these in lesson 3.
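To preview the shape: in the Anthropic API a tool is declared with a name, a description, and a JSON Schema for its arguments. The weather tool itself is a made-up example:

```javascript
// A tool the model may "call" — the provider never executes it; you do.
const tools = [
  {
    name: "get_weather", // hypothetical tool for illustration
    description: "Get the current weather for a city.",
    input_schema: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. Paris" },
      },
      required: ["city"],
    },
  },
];
```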
Reading a real request
Here's a minimal real call using the Anthropic SDK:
```javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const res = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: "You are a terse senior engineer. Reply in ≤5 bullets.",
  messages: [
    { role: "user", content: "Why do we need a PR process at all?" },
  ],
});

console.log(res.content[0].text);
```
Notice:
- system is a top-level field, not in messages.
- max_tokens is required (not optional) in the Anthropic API.
- res.content is an array of content blocks — res.content[0].text is the common access pattern.
Homework
Pick any provider you have a key for (Anthropic, OpenAI, Google).
Write a 10-line script that calls the chat API with a system prompt
forcing it into a character (pirate, 19th-century professor, angry
compiler). Observe how strongly the system prompt steers the reply,
even at temperature: 0.7.
Next: streaming — because your users won't wait 6 seconds for a wall of text.
Inspired by Anthropic's "Building with the Claude API".