What Is an LLM?
- Explain what an LLM is without using jargon
- Understand what a token is and why it matters
- Describe embeddings, attention, and next-token prediction conceptually
- Know when to use different temperature values
- Navigate model families (Claude, GPT-4o, Gemini, Llama)
The 60-Second History
Large Language Models did not appear overnight. The path to GPT-4 and Claude stretches back decades, and understanding that path makes LLMs far less mysterious.
In the early days of natural language processing, researchers wrote rule-based systems: giant dictionaries of hand-crafted grammar rules. These worked for narrow tasks but collapsed the moment real-world language surprised them. Next came n-gram models, which learned statistical associations between words from large text corpora. Better, but still brittle: an n-gram model only looks backward a handful of words.
The deep learning era brought Recurrent Neural Networks (RNNs), which could in principle remember across long sequences. In practice they suffered from vanishing gradients: information from 50 words ago simply evaporated before it could influence the output. LSTMs improved things, but training was slow and scaling was painful.
Then in 2017, Google researchers published a paper called "Attention Is All You Need." It introduced the Transformer architecture, and nothing has been the same since. The key insight: instead of processing tokens sequentially, transformers process all tokens in parallel and let every token attend to every other token simultaneously. This made training on massive datasets practical for the first time.
Modern LLMs (Claude, GPT-4, Gemini, Llama) are all massive transformer models. They were pretrained on hundreds of billions to trillions of tokens from the internet, books, academic papers, and code. They didn't learn language by being programmed; they learned it by being shown a large share of what humans have written and asked to predict the next token, over and over, at a scale difficult to comprehend.
The practical upshot: these models have absorbed an enormous amount of implicit knowledge about language, facts, reasoning patterns, and code. Your job as an engineer is to learn how to direct that knowledge toward useful tasks.
Tokens: The Atom of LLMs
Here is the most important thing to understand about LLMs: they do not see words, characters, or sentences. They see tokens.
A token is roughly 4 characters of text, or about 0.75 of a word in English. The exact boundaries depend on the tokenizer, the model-specific algorithm that splits raw text into tokens before any computation happens.

    Input:  "Hello, world! How are you today?"
    Tokens: ["Hello", ",", " world", "!", " How", " are", " you", " today", "?"]
    Count:  9 tokens (not 6 words)

    Input:  "strawberry"
    Tokens: ["straw", "berry"]  →  2 tokens, not 1 word

    Input:  "https://www.example.com/path?q=1"
    Tokens: ["https", "://", "www", ".", "example", ".", "com", "/", "path", "?", "q", "=", "1"]
    Count:  13 tokens for one URL

"strawberry" tokenizes as roughly 2 to 3 tokens depending on the model. This is a large part of why early versions of GPT famously struggled with counting the letter "r" in "strawberry": the letter boundaries and token boundaries don't align. Models have improved, but the mismatch between human intuitions about "words" and the token reality causes subtle bugs if you're not aware of it.
Tokens matter for three practical reasons:
- Cost: API pricing is per token (input and output separately). A 1,000-word document is ~1,300 tokens.
- Context limits: the context window is measured in tokens, not words.
- Performance: tasks that require character-level reasoning (anagrams, counting letters) are harder for models because the unit of computation is tokens, not characters.
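To make the character-versus-token mismatch concrete, here is a toy sketch of a tokenizer. It uses greedy longest-match against a tiny hand-written vocabulary, which is an assumption made for illustration; production tokenizers (for example byte-pair encoding) learn their vocabulary from data and split text differently. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical.

```python
# Toy vocabulary, invented for this example. Real tokenizers learn tens of
# thousands of entries from data.
TOY_VOCAB = {"straw", "berry", "st", "raw", "s", "t", "r", "a", "w", "b", "e", "y"}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            # Unknown character: emit it as its own token so the loop always advances.
            tokens.append(text[pos])
            pos += 1
    return tokens


tokens = toy_tokenize("strawberry")
print("Model-side view:    ", tokens)
print("Character-side view:", list("strawberry"))
print("Letter r count:     ", "strawberry".count("r"))
```

The model only ever receives the first view. Counting letters requires information that sits inside the tokens rather than at their boundaries, which is why character-level tasks are comparatively hard.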
You can count tokens before sending a request using the Anthropic API:
    import anthropic

    client = anthropic.Anthropic()

    # Count tokens without actually sending the message
    response = client.messages.count_tokens(
        model="claude-opus-4-5",
        system="You are a helpful assistant.",
        messages=[
            {"role": "user", "content": "Hello, world! How are you today?"}
        ],
    )
    print(f"Input tokens: {response.input_tokens}")
    # Useful for estimating cost before a large batch job

A rough rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words. For English text, divide word count by 0.75 to estimate tokens. For code, URLs, or non-English text, token counts are often higher.
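The rule of thumb can be turned into a quick offline estimator, sketched below. The function names are hypothetical, and the prices in the usage example are placeholders rather than any provider's real rates; substitute the figures from your provider's pricing page. For exact counts, use the provider's token-counting endpoint.

```python
def estimate_tokens_from_words(word_count: int) -> int:
    """Rule of thumb: about 0.75 English words per token."""
    return round(word_count / 0.75)


def estimate_tokens_from_chars(char_count: int) -> int:
    """Rule of thumb: about 4 characters per token."""
    return round(char_count / 4)


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Input and output tokens are priced separately, per million tokens."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


doc_tokens = estimate_tokens_from_words(1_000)
print(f"1,000 words is roughly {doc_tokens} tokens")

# Placeholder prices (USD per million tokens), NOT real rates.
cost = estimate_cost_usd(doc_tokens, 500, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"Estimated cost for one call: ${cost:.6f}")
```

Treat the result as a planning figure only: the estimate drifts for code, URLs, and non-English text, where real token counts tend to run higher.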
Embeddings: Meaning as Numbers
Before a transformer can process tokens, each token must be converted into a vector (a list of numbers). This conversion is called an embedding.
The magic of embeddings is that they capture semantic meaning geometrically. Words with similar meanings end up close together in the high-dimensional vector space. The most famous demonstration:
king - man + woman ≈ queen
Subtract the "man-ness" from "king," add "woman-ness," and you arrive near "queen" in vector space. The model never explicitly learned this rule; it emerged from predicting billions of tokens.
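The arithmetic can be reproduced with a toy sketch. The three-dimensional vectors below are hand-made for illustration and are an assumption of this example; real embeddings have hundreds or thousands of learned dimensions that do not map onto neat labels. `TOY_EMBEDDINGS` and `analogy` are hypothetical names.

```python
import math

# Hand-made vectors, invented for illustration only.
# Dimensions are loosely (royalty, maleness, fruitness).
TOY_EMBEDDINGS = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a: str, b: str, c: str, embeddings=TOY_EMBEDDINGS) -> str:
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, embeddings[word]))


print(analogy("king", "man", "woman"))
```

Excluding the input words from the candidates is the usual convention for this demonstration; without it, the nearest vector is often one of the inputs.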
High-dimensional space (simplified to 2D): "dog", "cat", and "wolf" sit in one region; "automobile", "vehicle", "car", and "truck" sit in another. Animals cluster together. Vehicles cluster together. The clusters are far apart.
Why does this matter for you as an engineer? Embeddings are the backbone of Retrieval-Augmented Generation (RAG), covered in depth in Module 4. When you want to search a document store for text semantically similar to a query, you embed both the query and the documents and find the closest vectors. This is why a search for "automobile accident" can find documents containing "car crash": they're close in embedding space even though they share no words.
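A minimal sketch of that search step follows. The vectors are hand-made stand-ins for what an embedding model would return, so the numbers are assumptions chosen to illustrate the ranking, and `DOCUMENTS`, `QUERY_VECTOR`, and `rank_documents` are hypothetical names.

```python
import math

# Hand-made vectors standing in for real embedding-model output.
# Dimensions are loosely (vehicles, accidents, food).
DOCUMENTS = {
    "Car crash on the highway": [0.9, 0.8, 0.0],
    "Banana bread recipe":      [0.0, 0.1, 0.9],
    "Truck maintenance guide":  [0.9, 0.1, 0.0],
}

# Pretend embedding of the query "automobile accident".
QUERY_VECTOR = [0.8, 0.9, 0.0]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query: list[float], documents: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (title, similarity) pairs, most similar first."""
    scored = [(title, cosine_similarity(query, vector)) for title, vector in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_documents(QUERY_VECTOR, DOCUMENTS)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```

The query shares no words with the top result; the match comes entirely from vector proximity. A real system would replace the dictionary with calls to an embedding model and a vector index.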
Attention: Which Words Matter?
The transformer's core innovation is the attention mechanism. Here's the intuition without any math:
For every token in the input, the model asks: "which other tokens in the sequence are most relevant for understanding this one?" It assigns attention weights (higher weight to more relevant tokens) and uses a weighted combination to compute a richer representation of each token.
Consider the pronoun "it" in these two sentences:
    Sentence 1: "The animal didn't cross the street because it was tired."
                → "it" attends strongly to "animal"
    Sentence 2: "The animal didn't cross the street because it was too wide."
                → "it" attends strongly to "street"

A human reader instantly resolves "it" based on context. Attention is the mechanism that gives transformers a similar ability, and unlike RNNs, it connects tokens directly whether they are 1 token apart or 1,000 tokens apart.
Transformers have multiple attention heads running in parallel, each learning to attend to different kinds of relationships: grammatical roles, coreference, topic relevance, and more. The outputs of all heads are combined, giving the model a rich, context-aware representation of every token.
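For readers who want to see the weighting step, here is a small sketch of scaled dot-product attention for a single query and a single head. The key, value, and query vectors are hand-picked assumptions; in a real transformer they come from learned projections of the token embeddings, and many heads run side by side.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the weighted average of the value vectors.
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hand-picked toy vectors (assumptions for illustration).
tokens = ["animal", "street", "because"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]  # a query that "looks for" the first key direction

weights, output = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:8s} {weight:.3f}")
```

With these numbers the query aligns with the key for "animal", so that token receives the largest weight and dominates the output vector.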
You never interact with attention directly as an API user, but understanding it explains why models handle long-range dependencies so much better than their predecessors, and why context window size matters for quality, not just quantity.
Next-Token Prediction
Strip away all the architecture details and the core of what an LLM does is elegantly simple: given all the tokens seen so far, predict the most likely next token.
That's it. Pre-training is just this task, applied to trillions of examples.
The model sees "The capital of France is" and learns to predict "Paris" with high probability. It sees "def fibonacci(n):" and learns what typically follows Python function definitions. Over enough data, the model internalizes an enormous amount of world knowledge, coding patterns, logical reasoning, and linguistic structure, because all of these were necessary to predict text well.
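The predict-then-append loop can be sketched with a deliberately tiny stand-in. The code below counts which word follows which in a three-sentence corpus (a bigram model, not a transformer) and then generates greedily. It conditions on only the previous word, whereas an LLM conditions on everything in context; the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus: list[str]) -> dict[str, Counter]:
    """Count, for every word, how often each other word follows it."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts: dict[str, Counter], token: str):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts: dict[str, Counter], start: str, max_new_tokens: int = 5) -> list[str]:
    """Greedy generation: predict one token, append it, repeat."""
    tokens = [start]
    for _ in range(max_new_tokens):
        next_token = predict_next(counts, tokens[-1])
        if next_token is None:
            break
        tokens.append(next_token)
    return tokens


CORPUS = [
    "the capital of France is Paris",
    "the capital of Italy is Rome",
    "the capital of France is Paris",
]

counts = train_bigram_counts(CORPUS)
print(predict_next(counts, "is"))
print(" ".join(generate(counts, "the")))
```

The generation loop is the part that carries over to LLMs: produce one token, append it to the input, and predict again.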
Fine-tuning with RLHF (Reinforcement Learning from Human Feedback) is the second stage. Raw pretraining produces a model that continues text; RLHF shapes it into a model that is helpful, harmless, and honest. Human raters compare model outputs, and the model is trained to produce outputs humans prefer. This is what transforms a text predictor into a useful assistant.
Pretraining a frontier model from scratch costs tens of millions of dollars and requires massive GPU clusters running for months. Fine-tuning an existing model takes hours to days and costs hundreds to thousands of dollars. You almost never need to pretrain from scratch. Use existing foundation models and, if necessary, fine-tune them for your domain.
Context Window
The context window is the maximum number of tokens the model can process in a single call, including your system prompt, the entire conversation history, any retrieved documents, and the model's response.

| Model | Context Window | Approx. equivalent |
|-------|---------------|-------------------|
| Claude (Opus/Sonnet) | 200,000 tokens | ~500-page book |
| GPT-4o | 128,000 tokens | ~320-page book |
| Gemini 1.5 Pro | 1,000,000 tokens | ~2,500-page book |
| Llama 3.1 (70B) | 128,000 tokens | ~320-page book |

When input exceeds the context window, one of two things happens depending on the implementation: either the API returns an error, or, in a managed conversation system, the oldest tokens are silently dropped. Either way, the model cannot see what falls outside the window.
This is why long conversations eventually lose coherence: the early turns are no longer in context. In Module 4, you'll learn how RAG lets you inject only the most relevant content rather than stuffing everything into context.
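The "oldest turns are dropped" behaviour can be sketched as a small helper. This is one possible policy, not how any particular product manages history; the token count uses the rough 4-characters-per-token approximation, and `trim_history` is a hypothetical name.

```python
def rough_token_count(text: str) -> int:
    """Approximate tokens with the 4-characters-per-token rule of thumb."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = rough_token_count(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 40},       # about 10 tokens
    {"role": "assistant", "content": "b" * 40},  # about 10 tokens
    {"role": "user", "content": "c" * 40},       # about 10 tokens
]

trimmed = trim_history(history, max_tokens=25)
print([message["content"][0] for message in trimmed])
```

In a real application you would budget with exact token counts and leave headroom for the system prompt and the model's response, since both share the same window.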
Temperature & Sampling
When the model predicts the next token, it produces a probability distribution over its entire vocabulary (50,000+ tokens). Temperature controls how that distribution is sampled:
- Temperature = 0: always pick the highest-probability token. Deterministic in principle, although hosted APIs can still show small run-to-run differences.
- Temperature = 1: sample from the distribution as computed. Balanced.
- Temperature > 1: flatten the distribution, making unlikely tokens more probable. More "random."
| Use Case | Recommended Temperature |
|----------|------------------------|
| Code generation | 0.1 to 0.3 |
| Factual Q&A | 0.0 to 0.3 |
| Summarization | 0.3 to 0.5 |
| Creative writing | 0.7 to 1.0 |
| Brainstorming / ideation | 0.8 to 1.0 |

A common misconception: high temperature makes the model "more creative" or "smarter." It does neither. High temperature makes the model more willing to produce low-probability tokens, which can mean creative output, or nonsense. The underlying capability of the model doesn't change. For most production tasks, stay between 0.0 and 0.7.
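The effect on the distribution is easy to see numerically. The sketch below divides raw scores (logits) by the temperature before applying softmax, which is the standard formulation; the three logits are invented for illustration, and temperature 0 is handled as a special case that puts all probability on the top token.

```python
import math


def temperature_probs(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores into sampling probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest-scoring token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


LOGITS = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens

for temperature in (0, 0.5, 1.0, 2.0):
    probs = temperature_probs(LOGITS, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The ranking of tokens never changes; only how much probability the lower-ranked ones receive. That is the sense in which temperature alters randomness rather than capability.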
Two other sampling parameters worth knowing:
- top_p (nucleus sampling): instead of sampling from all tokens, only consider the smallest set of top tokens whose combined probability reaches p. top_p=0.9 is a common default.
- top_k: limit sampling to the top k tokens by probability. Less commonly used than top_p.
In practice, you'll mostly tune temperature. Leave top_p and top_k at defaults unless you have a specific reason to change them.
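If you do need to reason about these two parameters, the filtering they perform can be sketched in a few lines. The probability table is invented for illustration, and the function names are hypothetical; providers apply the same idea over the full vocabulary before sampling.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Invented next-token probabilities for "The capital of France is".
PROBS = {"Paris": 0.7, "Lyon": 0.15, "London": 0.1, "banana": 0.05}

print(top_k_filter(PROBS, k=2))
print(top_p_filter(PROBS, p=0.9))
```

Both filters remove the unlikely tail ("banana" here) before sampling. The difference is that top_k keeps a fixed number of candidates, while top_p adapts the number to how concentrated the distribution is.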
Model Families
You'll encounter several families of models in this course:
| Provider | Models | Tier System | Notes |
|----------|--------|------------|-------|
| Anthropic | Claude | Haiku → Sonnet → Opus | Haiku: fast/cheap, Sonnet: balanced, Opus: most capable |
| OpenAI | GPT-4o, o1, o3 | Varies | Broad ecosystem, strong tool support |
| Google | Gemini 1.5 Flash/Pro | Flash: fast, Pro: capable | 1M token context window on Pro |
| Meta | Llama 3.1, 3.2, 3.3 | 8B, 70B, 405B | Open weights, run locally or via API |
How to choose: For most production applications in this course, you'll use Claude Sonnet or Haiku. Sonnet offers the best capability-to-cost ratio for complex tasks; Haiku is the right choice for high-volume tasks where latency and cost matter more than maximum quality.
When evaluating models for your use case, look at:
- Context window: does it fit your documents?
- Pricing: input tokens, output tokens, and any batch discounts
- Benchmark scores: MMLU, HumanEval, MATH for reasoning and coding tasks
- Speed (tokens/second): matters for interactive applications
Goal: Directly observe the effect of temperature on model output.
Setup: Install the Anthropic Python SDK and get an API key.
    pip install anthropic
    export ANTHROPIC_API_KEY="your-key-here"

Script:
    import anthropic

    client = anthropic.Anthropic()
    prompt = "Complete this sentence in one sentence: 'The most interesting thing about space is'"

    print("=== temperature=0 (5 runs) ===")
    for i in range(5):
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=60,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"Run {i+1}: {response.content[0].text.strip()}")

    print("\n=== temperature=1 (5 runs) ===")
    for i in range(5):
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=60,
            temperature=1,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"Run {i+1}: {response.content[0].text.strip()}")

    # Bonus: count tokens in a message
    token_count = client.messages.count_tokens(
        model="claude-haiku-4-5",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"\nPrompt token count: {token_count.input_tokens}")

What to observe:
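- Temperature 0: Are all 5 outputs identical? (They should be identical or nearly so; hosted APIs do not guarantee exact determinism even at temperature 0.)
- Temperature 1: How much variation is there? Is it always coherent?
- Try an intermediate value such as temperature=0.5: how much of the variation remains? (The Anthropic API accepts temperature values from 0.0 to 1.0, so temperature=1.5 is rejected with an error.)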
Stretch goal: Replace the prompt with a Python function signature and observe how temperature affects code completion quality.
Q1. What is a 'token' in the context of LLMs?
Q2. You set temperature=0 on a model. What does this do?
Q3. What happens when a conversation exceeds the context window?