🤖
AI Course

⚡Module 2 of 12

Working with LLM APIs

⏱ 5–6 hours
📘 Beginner
🔧 Python, curl
What you'll learn
  • Call LLM APIs programmatically with Python
  • Understand system/user/assistant message structure
  • Implement multi-turn conversations with history management
  • Handle streaming responses
  • Estimate token costs and implement retry logic

Anatomy of an API Call

Before writing a single line of Python, it helps to see a raw API call. The Anthropic Messages API is a straightforward HTTP endpoint — strip away all the SDK magic and this is what happens:

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5",
    "max_tokens": 256,
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

The response comes back as JSON with this structure:

{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "The capital of France is Paris."
    }
  ],
  "model": "claude-haiku-4-5",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 14,
    "output_tokens": 10
  }
}

Five fields you'll interact with constantly:

| Field | What it tells you |
|-------|------------------|
| content[0].text | The model's actual response text |
| stop_reason | Why generation stopped: end_turn, max_tokens, stop_sequence, or tool_use |
| usage.input_tokens | Tokens in your request (billed at the input rate) |
| usage.output_tokens | Tokens in the response (typically billed at a higher rate) |
| id | Unique message ID, useful for logging and debugging |
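To make the field table concrete, here is a minimal sketch that pulls those fields out of the raw JSON shown above using only the standard library. The helper name summarize_response is hypothetical, and the sketch assumes the response has the same shape as the sample.

```python
import json

# The sample response from above, as the raw JSON text returned by the API.
RAW_RESPONSE = """
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "The capital of France is Paris."}],
  "model": "claude-haiku-4-5",
  "stop_reason": "end_turn",
  "usage": {"input_tokens": 14, "output_tokens": 10}
}
"""


def summarize_response(raw: str) -> dict:
    """Extract the commonly used fields from a Messages API response body."""
    data = json.loads(raw)
    # content is a list of blocks; join the text blocks instead of assuming one.
    text = "".join(b["text"] for b in data["content"] if b["type"] == "text")
    return {
        "id": data["id"],
        "text": text,
        "stop_reason": data["stop_reason"],
        # max_tokens means the reply was cut off at the limit you set.
        "truncated": data["stop_reason"] == "max_tokens",
        "input_tokens": data["usage"]["input_tokens"],
        "output_tokens": data["usage"]["output_tokens"],
    }


summary = summarize_response(RAW_RESPONSE)
print(summary["text"])
print(summary["stop_reason"], summary["truncated"])
```

Checking stop_reason before using the text is a cheap habit: a reply that ended with max_tokens is incomplete, and your application may want to flag it or ask for a continuation.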

Now the Python SDK equivalent — far less boilerplate for the same call:

import anthropic
 
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env
 
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=256,
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
 
print(response.content[0].text)
# → "The capital of France is Paris."
 
print(f"Used {response.usage.input_tokens} input + {response.usage.output_tokens} output tokens")

The SDK handles authentication, retries on transient errors, and deserialization. Use it in production code. Use raw curl when you want to debug or understand exactly what's being sent.
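If you want to inspect exactly what is being sent without leaving Python, one option is to build the same request as the curl example with the standard library and look at it before sending. This is a sketch, not a replacement for the SDK: build_messages_request is a hypothetical helper name and the API key is a placeholder.

```python
import json
import urllib.request


def build_messages_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    body = {
        "model": "claude-haiku-4-5",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.anthropic.com/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


req = build_messages_request("sk-placeholder", "What is the capital of France?")
print(req.get_method(), req.full_url)
print(json.loads(req.data)["messages"])
# To actually send it, pass a real key and call urllib.request.urlopen(req).
```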


The Message Format — System, User, Assistant

Every LLM API call is a conversation structured as a list of messages. There are three roles:

  • system: Instructions processed before the conversation. Sets persona, format, constraints. In the Anthropic API this is the top-level system parameter, not an entry in the messages list.
  • user: The human turn. Your application sends these.
  • assistant: The model's turn. When doing multi-turn conversations, you include past assistant responses.
ℹ️
The System Prompt Is Your Control Plane

The system prompt is processed first, before any user message. Everything you want the model to consistently do — maintain a persona, respond in a specific format, refuse certain topics — belongs here. Don't put critical instructions only in the user message; the model may follow them inconsistently.
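In the calls shown in this module the system prompt travels separately from the messages list. If you have a transcript stored with a system entry inside the list, a convention some other chat APIs use, a small helper can split it into the two pieces. This is a sketch under that assumption; split_system is a hypothetical name.

```python
def split_system(messages: list[dict]) -> tuple[str, list[dict]]:
    """Separate system entries from user/assistant turns.

    Returns (system_prompt, turns). Multiple system entries are joined
    with blank lines; unknown roles raise ValueError.
    """
    system_parts = []
    turns = []
    for msg in messages:
        role = msg["role"]
        if role == "system":
            system_parts.append(msg["content"])
        elif role in ("user", "assistant"):
            turns.append({"role": role, "content": msg["content"]})
        else:
            raise ValueError(f"Unsupported role: {role!r}")
    return "\n\n".join(system_parts), turns


transcript = [
    {"role": "system", "content": "You are a senior Python engineer."},
    {"role": "user", "content": "Please review this function."},
]
system_prompt, turns = split_system(transcript)
print(system_prompt)
print(turns)
# Then: client.messages.create(..., system=system_prompt, messages=turns)
```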

Here's an example using all three roles together:

import anthropic
 
client = anthropic.Anthropic()
 
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    system="""You are a senior Python engineer conducting a code review.
    
Your review format:
1. Overall assessment (1 sentence)
2. Issues found (bulleted list, severity: CRITICAL / WARNING / SUGGESTION)
3. Recommendation (APPROVE / NEEDS CHANGES)
 
Be direct. Do not add pleasantries.""",
    messages=[
        {
            "role": "user",
            "content": """Please review this function:
 
def divide(a, b):
    return a / b
"""
        },
        {
            "role": "assistant",
            "content": "Overall: This function lacks basic safety guardrails.\n\nIssues:\n- CRITICAL: No zero-division check"
        },
        {
            "role": "user",
            "content": "What would a safe version look like?"
        }
    ]
)
 
print(response.content[0].text)

Notice the pattern: the messages list contains the full conversation so far, including the assistant's previous response. This is how multi-turn conversations work — which brings us to the most important thing to understand about LLM APIs.


Multi-Turn Conversations — LLMs Are Stateless

Here is a fact that surprises many developers: the LLM has zero memory between API calls. There is no session. There is no server-side history. Each call to the API is completely independent.

The "memory" of a conversation is nothing more than the messages array you construct and send. If you want the model to remember what was said three messages ago, you must include those three messages in your next call.

⚠️
History Grows With Every Turn

After 20 back-and-forth exchanges, you're sending thousands of tokens of history with every request. This has two consequences: higher cost (you pay for every input token, every call) and eventually hitting the context window limit. Production chatbots must implement a history management strategy — sliding window, summarization, or selective pruning.
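To put rough numbers on that growth, the sketch below adds up the input tokens sent over a conversation when the full history is resent each turn. The token counts (a 50-token system prompt, 30-token user messages, 120-token replies) are illustrative assumptions, not measurements.

```python
def input_tokens_for_turn(turn: int, system_tokens: int,
                          user_tokens: int, reply_tokens: int) -> int:
    """Input tokens sent on a given turn (1-based) when full history is resent."""
    # Every earlier exchange (user message + assistant reply) is included again.
    prior_history = (turn - 1) * (user_tokens + reply_tokens)
    return system_tokens + prior_history + user_tokens


def total_input_tokens(turns: int, system_tokens: int,
                       user_tokens: int, reply_tokens: int) -> int:
    """Cumulative input tokens billed across the whole conversation."""
    return sum(
        input_tokens_for_turn(t, system_tokens, user_tokens, reply_tokens)
        for t in range(1, turns + 1)
    )


# Illustrative sizes only.
SYSTEM, USER, REPLY = 50, 30, 120

print(input_tokens_for_turn(1, SYSTEM, USER, REPLY))   # first turn
print(input_tokens_for_turn(20, SYSTEM, USER, REPLY))  # twentieth turn
print(total_input_tokens(20, SYSTEM, USER, REPLY))     # whole conversation
```

Under these assumptions the twentieth turn alone sends 2,930 input tokens, and the conversation as a whole sends 30,100, compared with 1,600 if each turn were sent without history. Per-turn input grows linearly, so the cumulative total grows roughly with the square of the number of turns.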

Here is a complete CLI chatbot that handles history correctly:

import anthropic
 
def run_chatbot(system_prompt: str) -> None:
    client = anthropic.Anthropic()
    history = []
    total_input_tokens = 0
    total_output_tokens = 0
 
    print(f"System: {system_prompt}")
    print("Type 'quit' to exit, 'reset' to clear history.\n")
 
    while True:
        user_input = input("You: ").strip()
        
        if user_input.lower() == "quit":
            break
        if user_input.lower() == "reset":
            history = []
            total_input_tokens = 0
            total_output_tokens = 0
            print("[History cleared]\n")
            continue
        if not user_input:
            continue
 
        # Append the new user message to history
        history.append({"role": "user", "content": user_input})
 
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1024,
            system=system_prompt,
            messages=history  # Send the FULL history every time
        )
 
        assistant_reply = response.content[0].text
        
        # Append the model's reply so next turn includes it
        history.append({"role": "assistant", "content": assistant_reply})
 
        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
 
        print(f"\nAssistant: {assistant_reply}")
        print(f"[This turn: {response.usage.input_tokens}in / {response.usage.output_tokens}out tokens]")
        print(f"[Session total: {total_input_tokens}in / {total_output_tokens}out tokens]\n")
 
    print(f"\nSession complete. Total tokens: {total_input_tokens + total_output_tokens}")
 
if __name__ == "__main__":
    system = input("System prompt (or press Enter for default): ").strip()
    if not system:
        system = "You are a helpful assistant. Be concise."
    run_chatbot(system)

The critical insight is in messages=history — you're sending the entire accumulated conversation every single time. The model doesn't "remember" anything; it re-reads the full transcript on every call.
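The sliding-window strategy mentioned in the warning above could look like the following sketch. It counts messages rather than tokens, which is a simplification, and trim_history is a hypothetical helper name. The window is adjusted to start on a user turn so the model never sees a reply without the message that prompted it.

```python
def trim_history(history: list[dict], max_messages: int) -> list[dict]:
    """Return at most the last max_messages entries, starting on a user turn."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    trimmed = history[-max_messages:]
    # Drop leading assistant messages left over from a cut-off exchange.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


history = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
    {"role": "assistant", "content": "a3"},
    {"role": "user", "content": "u4"},
]

window = trim_history(history, max_messages=4)
print([m["content"] for m in window])
```

In the chatbot above you would pass messages=trim_history(history, N) to the API while keeping the full history list for logging. The trade-off is that anything outside the window is forgotten, which is why summarization is the usual next step.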


Streaming Responses

By default, the API waits until the model finishes generating the entire response before returning it. For short responses, this is fine. For long responses — a 2,000-word article, a 100-line code file — users stare at a blank screen for several seconds.

Streaming sends tokens to your client as they are generated, so users see the response appearing progressively. It's a significant UX improvement for interactive applications.

import anthropic
 
client = anthropic.Anthropic()
 
print("Assistant: ", end="", flush=True)
 
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain recursion with a short code example."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
 
    # Get the final message (with usage stats) while the stream context is still open
    final_message = stream.get_final_message()

print(f"\n\n[Tokens: {final_message.usage.input_tokens}in / {final_message.usage.output_tokens}out]")

Streaming uses a Server-Sent Events (SSE) connection under the hood. The SDK abstracts this into a simple iterator over text chunks. The flush=True on print forces each chunk to appear immediately rather than sitting in the output buffer.
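To see roughly what the SDK's iterator is doing, here is a sketch that extracts text from a hand-written, abbreviated SSE transcript. The sample events are simplified for illustration: a real stream carries additional event types and fields, and iter_text_deltas is a hypothetical helper, not part of the SDK.

```python
import json

# Abbreviated, hand-written sample of an SSE stream (not captured output).
SAMPLE_SSE = """\
event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}}

event: message_stop
data: {"type": "message_stop"}
"""


def iter_text_deltas(sse_text: str):
    """Yield the text fragments carried by content_block_delta events."""
    for line in sse_text.splitlines():
        # Each event's JSON payload sits on a line starting with "data:".
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("type") != "content_block_delta":
            continue
        delta = payload.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]


for chunk in iter_text_deltas(SAMPLE_SSE):
    print(chunk, end="", flush=True)
print()
```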

For web applications, you'd typically forward this stream to the browser using a framework's streaming response mechanism (FastAPI's StreamingResponse, Next.js's edge runtime, etc.). We cover that in Module 7.


Token Costs — What You're Actually Paying For

LLM APIs charge per token, with input and output tokens typically priced differently. Output tokens cost more because they are generated one at a time, each requiring its own forward pass through the model, whereas the input tokens of a request can be processed in parallel.

def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str = "claude-haiku-4-5"
) -> float:
    """Rough cost estimate in USD. Check Anthropic pricing page for current rates."""
    
    # Approximate pricing per 1M tokens (verify at anthropic.com/pricing)
    pricing = {
        "claude-haiku-4-5":   {"input": 0.80,  "output": 4.00},
        "claude-sonnet-4-5":  {"input": 3.00,  "output": 15.00},
        "claude-opus-4-5":    {"input": 15.00, "output": 75.00},
    }
    
    rates = pricing.get(model, pricing["claude-haiku-4-5"])
    input_cost  = (input_tokens  / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    
    return input_cost + output_cost
 
# Example: a customer support bot handling 10,000 turns per day
avg_input_per_turn  = 800   # history + system prompt + user message
avg_output_per_turn = 200   # assistant reply
 
cost_per_turn = estimate_cost(avg_input_per_turn, avg_output_per_turn)
daily_cost = cost_per_turn * 10_000
 
print(f"Cost per turn: ${cost_per_turn:.5f}")
print(f"10K turns/day: ${daily_cost:.2f}/day  (${daily_cost * 30:.2f}/month)")
✅
Output Tokens Are Expensive — Set max_tokens Deliberately

Output tokens typically cost 3–5x more than input tokens. Always set max_tokens to a reasonable limit for your use case. A customer support bot rarely needs 4,000 output tokens; 512 is usually plenty. You are billed only for tokens actually generated, so max_tokens is a cap rather than a target: a tight limit does not make a short reply cheaper, but it bounds the worst-case cost when the model produces a far longer response than you need.

Practical cost levers, in order of impact:

  1. Choose the right model tier — Haiku is 10–20x cheaper than Opus for the same task.
  2. Set max_tokens tightly — don't leave slack you don't need.
  3. Keep system prompts concise — they're sent with every request.
  4. Implement prompt caching — reuse repeated prefix tokens (covered in Module 7).
  5. Batch non-urgent requests — Anthropic's Batch API offers 50% discounts.
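To compare the first and last levers side by side, the sketch below reuses the approximate per-1M-token rates from the estimate_cost table above and the same 800-in / 200-out workload at 10,000 turns per day. The rates are the document's rough figures, so verify them before budgeting; monthly_cost is a hypothetical helper.

```python
# Approximate USD per 1M tokens, copied from the pricing table above.
PRICING = {
    "claude-haiku-4-5":  {"input": 0.80,  "output": 4.00},
    "claude-sonnet-4-5": {"input": 3.00,  "output": 15.00},
    "claude-opus-4-5":   {"input": 15.00, "output": 75.00},
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 turns_per_day: int, days: int = 30,
                 batch_discount: float = 0.0) -> float:
    """Estimated monthly spend in USD for a fixed per-turn workload."""
    rates = PRICING[model]
    per_turn = ((input_tokens / 1_000_000) * rates["input"]
                + (output_tokens / 1_000_000) * rates["output"])
    # batch_discount is a fraction, e.g. 0.5 for the 50% Batch API discount.
    return per_turn * turns_per_day * days * (1 - batch_discount)


for model in PRICING:
    cost = monthly_cost(model, 800, 200, turns_per_day=10_000)
    print(f"{model:<20} ${cost:>9,.2f}/month")

batched = monthly_cost("claude-haiku-4-5", 800, 200, 10_000, batch_discount=0.5)
print(f"{'haiku, batched':<20} ${batched:>9,.2f}/month")
```

At these rates the same workload comes to about $432 per month on Haiku, $1,620 on Sonnet, and $8,100 on Opus, which is why model choice sits at the top of the list.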

Rate Limits & Retry Logic

LLM APIs enforce rate limits on requests per minute (RPM) and tokens per minute (TPM). When you exceed a limit, the API returns an HTTP 429 response. This is normal in production systems — the right response is to wait and retry, not to fail immediately.

The standard pattern is exponential backoff with jitter:

import anthropic
import time
import random
from typing import Any
 
def call_with_retry(
    client: anthropic.Anthropic,
    max_retries: int = 5,
    **kwargs: Any
) -> anthropic.types.Message:
    """
    Call the Anthropic API with exponential backoff on retriable errors.
    
    Retriable:  429 (rate limit), 500, 529 (overloaded)
    Not retriable: 400 (bad request), 401 (auth), 404 (not found)
    """
    retriable_status_codes = {429, 500, 529}
    
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff: 1s, 2s, 4s, 8s (plus up to 1s of jitter);
            # with the default max_retries=5 the fifth failure is re-raised.
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait_time:.1f}s before retry {attempt + 1}/{max_retries}")
            time.sleep(wait_time)
        
        except anthropic.APIStatusError as e:
            if e.status_code not in retriable_status_codes:
                raise  # Don't retry 400/401/404 — these won't fix themselves
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"API error {e.status_code}. Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
    
    raise RuntimeError("Max retries exceeded")  # Should not reach here
 
# Usage
client = anthropic.Anthropic(max_retries=0)  # turn off the SDK's own retries so they don't stack with this wrapper
 
response = call_with_retry(
    client,
    model="claude-haiku-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

Error taxonomy — what each HTTP status means and how to handle it:

| Status | Meaning | Action |
|--------|---------|--------|
| 400 | Bad request (malformed JSON, invalid params) | Fix your request; don't retry |
| 401 | Invalid API key | Check your key; don't retry |
| 404 | Model not found | Check model name; don't retry |
| 429 | Rate limited | Exponential backoff, then retry |
| 500 | Server error | Backoff, then retry |
| 529 | API overloaded | Backoff, then retry |
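The retry decision and the wait calculation can be separated from the API call, which makes them easy to test without a network. The sketch below adds two things the wrapper above does not have: a cap on the delay, and an optional retry_after value. It assumes your calling code has already read a server-provided retry hint (for example a retry-after response header) and converted it to seconds; both function names are hypothetical.

```python
import random
from typing import Optional

# Same retriable set as the table above.
RETRIABLE_STATUS_CODES = {429, 500, 529}


def should_retry(status_code: int) -> bool:
    """True for errors that may succeed on a later attempt."""
    return status_code in RETRIABLE_STATUS_CODES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  retry_after: Optional[float] = None) -> float:
    """Seconds to wait before the next try; attempt is 0-based."""
    if retry_after is not None:
        # Assumption: the caller parsed a server hint into seconds. Prefer it.
        return float(retry_after)
    # Exponential growth, capped so late attempts don't wait unboundedly.
    delay = min(cap, base * (2 ** attempt))
    # Jitter spreads out clients that were rate limited at the same moment.
    return delay + random.uniform(0, 1)


for status in (400, 429, 529):
    print(status, "retry" if should_retry(status) else "fail fast")

print([round(backoff_delay(a)) for a in range(4)])
print(backoff_delay(3, retry_after=7))
```

The jitter matters more than it looks: without it, every client that hit the limit at the same moment retries at the same moment and hits it again.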


Provider Comparison

The Anthropic API is not the only option. Here's a practical comparison for the models you'll encounter:

| Provider | Best Models | Context Window | Strengths | Notes |
|----------|------------|---------------|-----------|-------|
| Anthropic (Claude) | Haiku, Sonnet, Opus | 200K tokens | Safety-focused, strong reasoning, large context | This course uses Claude |
| OpenAI | GPT-4o, o3 | 128K tokens | Broad ecosystem, rich tool support | GPT-4o for general; o3 for hard reasoning |
| Google | Gemini 1.5 Pro, Flash | 1M tokens | Largest context, Google integrations | Flash is the cost-efficient option |
| Meta | Llama 3.1, 3.3 | 128K tokens | Open weights, self-hostable, free | Requires your own infrastructure |

For local development and privacy-sensitive use cases, you can run Llama models via Ollama with an OpenAI-compatible API — useful when you can't send data to cloud providers.


💻 Hands-on: CLI Chatbot with Cost Tracking

Goal: Build a working CLI chatbot that maintains conversation history and tracks cumulative token costs.

Requirements:

  1. Accept a system prompt at startup (with a sensible default)
  2. Maintain full conversation history across turns
  3. Print token usage after each turn (this turn + session total)
  4. Print an estimated cost after each turn using the pricing table above
  5. Support a reset command to clear history
  6. Support a stats command to print session summary
  7. Gracefully handle rate limit errors with a retry

Starter structure:

import anthropic
import time
import random
 
PRICING = {
    "claude-haiku-4-5": {"input": 0.80, "output": 4.00},
}
 
def estimate_cost(input_tokens, output_tokens, model="claude-haiku-4-5"):
    rates = PRICING[model]
    return ((input_tokens / 1e6) * rates["input"]) + ((output_tokens / 1e6) * rates["output"])
 
def main():
    client = anthropic.Anthropic()
    system = input("System prompt: ").strip() or "You are a helpful, concise assistant."
    history = []
    stats = {"input": 0, "output": 0, "turns": 0, "cost": 0.0}
    
    while True:
        user_input = input("\nYou: ").strip()
        # TODO: handle quit, reset, stats commands
        # TODO: append to history
        # TODO: call API with retry
        # TODO: print response and token stats
        pass
 
if __name__ == "__main__":
    main()

Stretch goals:

  • Add a sliding window that keeps only the last N turns when history gets long
  • Switch between models mid-conversation with a !model haiku command
  • Save conversation to a JSON file on exit
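As a starting point for the last stretch goal, here is one possible sketch of saving and reloading a conversation with the standard library. The file layout (a system key plus a messages key) and the function names are assumptions made for this example, not a required format.

```python
import json
import os
import tempfile


def save_conversation(path: str, system: str, history: list[dict]) -> None:
    """Write the system prompt and message history to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"system": system, "messages": history}, f,
                  indent=2, ensure_ascii=False)


def load_conversation(path: str) -> tuple[str, list[dict]]:
    """Read back a conversation written by save_conversation."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["system"], data["messages"]


history = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Round-trip through a temporary file to show the two helpers agree.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conversation.json")
    save_conversation(path, "You are a helpful, concise assistant.", history)
    restored_system, restored_history = load_conversation(path)

print(restored_system)
print(restored_history == history)
```

Because the history is already a list of plain dictionaries in the shape the API expects, a reloaded conversation can be passed straight back as messages to resume where you left off.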
🧪
Knowledge Check

Q1. What is the purpose of the 'system' message in an LLM API call?

Q2. Why must you send the full conversation history with every API call?

Q3. You receive a 429 status code from the API. What should you do?
