Capstone Project
- Choose and plan a full AI application
- Write a design document before coding
- Apply all course concepts in one integrated project
- Evaluate your system with a real eval harness
The Capstone Purpose
Every module in this course taught a concept in isolation. You built a RAG pipeline. You wrote an MCP server. You set up an eval harness. Each piece made sense on its own.
The capstone is where isolation ends. Real AI systems require all these pieces working together: retrieval informing generation, tools extending capability, evals validating quality, logging surfacing problems, caching controlling costs. Integrating concepts is harder than applying them individually, and that difficulty is where the real learning happens.
The capstone also forces architectural decisions that tutorial exercises don't. You'll choose a project, write a design document, encounter a failure mode you didn't anticipate, and figure out how to handle it. You can't learn these things from reading. You can only learn them by building.
Choose one of the three projects below. They're different in domain but similar in scope: each takes 10-15 focused hours and uses the full stack of techniques from this course.
Choose Your Project
Option A: AI Code Review Agent
An agent that accepts a GitHub Pull Request URL, fetches the diff, runs a structured analysis through a reflection loop, and posts a review comment directly to GitHub.
This project emphasizes: tool use, multi-step agent loops, reflection/self-critique patterns, eval harness for code review quality.
    User Input
               │
               ▼
    GitHub PR URL
               │
               ▼
    ┌─────────────────────┐
    │    Fetch PR Diff    │  Tool: github_get_pr_diff(url)
    │  (GitHub API tool)  │
    └──────────┬──────────┘
               │  Raw diff + PR description
               ▼
    ┌──────────────────────────────────────┐
    │           REFLECTION LOOP            │
    │                                      │
    │  ┌─────────────────────────────┐     │
    │  │  Step 1: Draft Review       │     │
    │  │  (identify issues, bugs,    │     │
    │  │   style problems, risks)    │     │
    │  └──────────────┬──────────────┘     │
    │                 │                    │
    │  ┌──────────────▼──────────────┐     │
    │  │  Step 2: Critique Draft     │     │
    │  │  (what's wrong? what's      │     │
    │  │   missing? too harsh?)      │     │
    │  └──────────────┬──────────────┘     │
    │                 │                    │
    │  ┌──────────────▼──────────────┐     │
    │  │  Step 3: Improved Review    │     │
    │  │  (apply critique, finalize) │     │
    │  └──────────────┬──────────────┘     │
    │                 │                    │
    │     Repeat if score < threshold      │
    └──────────┬───────────────────────────┘
               │  Final review text
               ▼
    ┌─────────────────────┐
    │ Post GitHub Comment │  Tool: github_post_comment(url, text)
    └─────────────────────┘
Key techniques used: Tool use (Module 6), Agent loops (Module 6), MCP server for GitHub (Module 7), Eval harness with LLM-as-judge (Module 9), Prompt caching for the system prompt (Module 10).
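The reflection loop in the diagram can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the `draft`, `critique`, and `improve` callables, the 0.8 threshold, and the three-round cap are all assumptions, and the `fake_*` functions are stubs standing in for real model calls.

```python
from typing import Callable

def reflection_loop(
    diff: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], tuple[float, str]],
    improve: Callable[[str, str, str], str],
    threshold: float = 0.8,   # assumed quality bar
    max_rounds: int = 3,      # assumed cap so the loop always ends
) -> str:
    """Draft a review, then critique and improve it until the score passes."""
    review = draft(diff)
    for _ in range(max_rounds):
        score, notes = critique(diff, review)
        if score >= threshold:
            break
        review = improve(diff, review, notes)
    return review

# Stubs standing in for real model calls (hypothetical behavior).
def fake_draft(diff: str) -> str:
    return "Looks fine."

def fake_critique(diff: str, review: str) -> tuple[float, str]:
    # The score only passes once the review mentions the null check.
    if "null check" in review:
        return 0.9, ""
    return 0.4, "Mention the missing null check."

def fake_improve(diff: str, review: str, notes: str) -> str:
    return review + " Add a null check before dereferencing `user`."

final_review = reflection_loop("- user.name", fake_draft, fake_critique, fake_improve)
print(final_review)
```

Passing the three steps in as callables keeps the loop testable with stubs before any model is wired in.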
Eval strategy: Collect 10 real PRs with known issues. Have the agent review them. Use LLM-as-judge to score whether the agent correctly identified the known issues. Track: recall (did it find the bugs?), precision (did it flag false positives?), tone (was it constructive?).
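Recall and precision for one PR reduce to set arithmetic once each flagged issue has been matched to a label. The sketch below assumes that matching has already happened (for example, by your LLM-as-judge step); the issue labels are made up for illustration.

```python
def precision_recall(flagged: set[str], known: set[str]) -> tuple[float, float]:
    """Precision: share of flagged issues that are real.
    Recall: share of known issues that were found."""
    true_positives = len(flagged & known)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall

# Hypothetical labels for a single PR.
known_issues = {"sql-injection", "missing-null-check", "unused-import"}
agent_flagged = {"sql-injection", "missing-null-check", "naming-style", "long-function"}

p, r = precision_recall(agent_flagged, known_issues)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```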
Option B: Private Knowledge Assistant + MCP Server
A RAG-powered assistant over your own documents (notes, PDFs, codebase, wiki) exposed as both a chat interface and an MCP server that Claude Desktop can query.
This project emphasizes: RAG pipeline design, MCP server development, eval harness for groundedness, prompt caching for the system prompt.
    Documents (PDFs, markdown, code)
                │
                ▼
    ┌───────────────────────┐
    │  Ingestion Pipeline   │
    │ • Parse documents     │
    │ • Chunk text          │
    │ • Generate embeddings │
    │ • Store in vector DB  │
    └───────────┬───────────┘
                │
         ┌──────┴────────────┐
         │                   │
         ▼                   ▼
    Chat Interface       MCP Server
         │                   │
         │                   │  Tools:
         │                   │  • search_knowledge_base(query)
         │                   │  • get_document(id)
         │                   │  • list_topics()
         ▼
    User Query
         │
         ▼
    ┌────────────────────────────────┐
    │  RAG Pipeline                  │
    │  1. Embed query                │
    │  2. Retrieve top-k chunks      │
    │  3. Rerank by relevance        │
    │  4. Generate grounded answer   │
    │  5. Cite sources               │
    └────┬───────────────────────────┘
         │
         ▼
    Answer + Citations
Key techniques used: RAG pipeline (Module 4), Chunking strategies (Module 4), MCP server (Module 7), Groundedness eval (Module 9), Prompt caching (Module 10).
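Step 2 of the RAG pipeline, retrieving the top-k chunks, can be illustrated without a vector DB. This sketch assumes embeddings are plain lists of floats and uses toy three-dimensional vectors; the chunk ids are hypothetical, and a real project would get vectors from an embedding model and store them in whichever vector DB you chose.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec: list[float], chunks: dict[str, list[float]], k: int = 2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy embeddings for illustration only.
chunks = {
    "setup.md#1": [1.0, 0.0, 0.0],
    "billing.md#3": [0.0, 1.0, 0.0],
    "setup.md#2": [0.9, 0.1, 0.0],
}
top = retrieve_top_k([1.0, 0.0, 0.0], chunks, k=2)
print(top)
```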
Eval strategy: Write 20 questions you know the answers to from your documents. Measure: answer correctness (LLM-as-judge), groundedness score (does every claim appear in retrieved context?), citation accuracy (do cited documents actually support the claim?). Track these metrics across changes to chunking strategy and retrieval parameters.
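Before wiring up an LLM-as-judge, a crude lexical proxy can give a first groundedness number. The sketch below is an assumption-heavy stand-in, not the judge-based metric: it counts an answer sentence as supported when at least 60% of its words appear in the retrieved context, and both that cutoff and the sample texts are invented for illustration.

```python
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def groundedness(answer: str, context: str, min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose words mostly appear in the context."""
    context_words = _words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & context_words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)

context = "The API key is stored in the .env file. Rotate keys every 90 days."
answer = "The API key is stored in the .env file. Keys expire after one year."
score = groundedness(answer, context)
print(f"groundedness={score:.2f}")  # groundedness=0.50
```

Word overlap misses paraphrases and can be fooled by shared vocabulary, so treat it as a smoke test to run alongside the judge, not a replacement for it.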
Option C: Multi-Agent Research Pipeline
A pipeline of three specialized agents that collaborate to research a topic, analyze findings, and produce a structured report, with web search, file writing, and reflection throughout.
This project emphasizes: multi-agent orchestration, CrewAI or custom orchestration, agent-to-agent communication, reflection loops, structured output.
    Research Topic (from user)
               │
               ▼
    ┌─────────────────────┐
    │  RESEARCHER AGENT   │
    │                     │
    │  Tools:             │
    │  • web_search()     │
    │  • fetch_url()      │
    │  • save_notes()     │
    │                     │
    │  Output: Raw        │
    │  research notes     │
    └──────────┬──────────┘
               │  Research notes
               ▼
    ┌─────────────────────┐
    │    ANALYST AGENT    │
    │                     │
    │  Tools:             │
    │  • read_notes()     │
    │  • fact_check()     │
    │  • cross_ref()      │
    │                     │
    │  Output: Structured │
    │  analysis + gaps    │
    └──────────┬──────────┘
               │  Analysis
               ▼
    ┌─────────────────────┐
    │    WRITER AGENT     │
    │                     │
    │  Tools:             │
    │  • read_analysis()  │
    │  • write_report()   │
    │  • format_md()      │
    │                     │
    │  Output: Final      │
    │  markdown report    │
    └──────────┬──────────┘
               │
               ▼
    Final Report (saved to disk)
Key techniques used: Multi-agent systems (Module 5), Tool use (Module 6), Reflection loops (Module 6), Structured logging (Modules 9, 10), Eval harness for report quality (Module 9).
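The handoffs between the three agents are easier to reason about when each one has a typed input and output. The sketch below shows only that data flow under assumed names: `ResearchNotes`, `Analysis`, and the three functions are hypothetical, and each function body is a stub where a real agent with tools would go.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchNotes:
    topic: str
    findings: list[str] = field(default_factory=list)

@dataclass
class Analysis:
    key_points: list[str]
    gaps: list[str]

def researcher(topic: str) -> ResearchNotes:
    # Stub: a real agent would call web_search() and fetch_url() here.
    return ResearchNotes(topic, [f"{topic}: finding one", f"{topic}: finding two"])

def analyst(notes: ResearchNotes) -> Analysis:
    # Stub: a real agent would fact-check and cross-reference the notes.
    return Analysis(key_points=notes.findings, gaps=["no primary sources yet"])

def writer(analysis: Analysis) -> str:
    # Stub: a real agent would draft prose; this only formats markdown.
    lines = ["# Report", "", "## Key points"]
    lines += [f"- {point}" for point in analysis.key_points]
    lines += ["", "## Open gaps"] + [f"- {gap}" for gap in analysis.gaps]
    return "\n".join(lines)

def run_pipeline(topic: str) -> str:
    return writer(analyst(researcher(topic)))

report = run_pipeline("vector databases")
print(report)
```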
Eval strategy: Define 5 research topics with known good answers. Run the pipeline. Score the final report on: factual accuracy, completeness (are key points covered?), structure (is it well-organized?), citation quality (are claims supported?). Use LLM-as-judge with a research report rubric.
Design Document Template
Write this document before writing any code. Engineers who write design documents first tend to ship faster and with fewer rewrites, not because documents are magic, but because the act of writing surfaces assumptions, contradictions, and missing pieces before they become bugs.
# Project Design Document
## [Project Name]
**Author**: [Your name]
**Date**: [Today]
**Option**: [A / B / C]
---
## Problem Statement
What problem does this solve? Who benefits and how?
(2-3 sentences max; if you can't say it in 3 sentences,
the problem isn't clear enough yet.)
---
## Architecture Overview
[ASCII diagram of your system; copy and modify from the option above]
Key components:
- **Component 1**: What it does, what it inputs, what it outputs
- **Component 2**: Same
- **Component 3**: Same
---
## Agent/Pipeline Design
For each agent or pipeline stage:
- **Purpose**: What decision or transformation happens here?
- **Inputs**: What does it receive?
- **Outputs**: What does it produce?
- **Tools**: What tools can it call?
- **Model**: Which Claude model and why?
---
## Failure Modes & Mitigations
| Failure Mode | How to Detect | Mitigation |
|---|---|---|
| API rate limit hit | 429 error code | Exponential backoff + retry |
| LLM hallucination | Groundedness score < 0.8 | Retry with explicit grounding instruction |
| Tool call fails | Error in tool result | Return structured error, let agent decide |
| Agent loop infinite | Iteration counter > 10 | Force stop, return partial result |
| Empty retrieval | 0 chunks returned | Fallback: broaden query, inform user |
(Add your project-specific failure modes)
---
## Eval Strategy
- **Number of test cases**: [minimum 10]
- **Eval type**: Unit / LLM-as-judge / Human (explain your choice)
- **Pass threshold**: [e.g., 70% pass rate]
- **Key metrics**: [e.g., groundedness score, precision, recall]
- **What regression would you catch**: [describe a specific scenario]
---
## Technical Decisions
| Decision | Options Considered | Choice | Reason |
|---|---|---|---|
| Vector DB | Chroma, Pinecone, FAISS | [your choice] | [reason] |
| Chunking strategy | Fixed-size, semantic, paragraph | [your choice] | [reason] |
| Agent framework | Custom, LangChain, CrewAI | [your choice] | [reason] |
| Model routing | Single model, tiered | [your choice] | [reason] |
---
## Definition of Done
Check every box before considering the capstone complete:
- [ ] Design document written and reviewed
- [ ] Architecture diagram drawn
- [ ] Failure modes documented with mitigations
- [ ] Core functionality working end-to-end
- [ ] Eval harness with ≥ 10 test cases
- [ ] Pass rate ≥ 70% on eval suite
- [ ] Structured logging on every LLM call
- [ ] README with setup instructions and usage examples
- [ ] Cost estimate: how much would this cost at 1,000 calls/day?
- [ ] 5-minute demo recorded or live demo ready
Deliverables
The capstone is complete when all ten of these exist:
- Design document: the template above, filled out and written before coding
- Architecture diagram: ASCII or drawn, matching what you actually built
- Failure modes table: at least 5 failure modes with detection and mitigation
- Working core: the main use case works end-to-end without manual intervention
- Eval harness: a Python file you can run with python eval.py that tests your system
- Pass rate ≥ 70%: at least 7 of your 10 test cases pass
- Structured logging: every LLM call produces a JSON log line with cost and latency
- README: setup instructions from scratch (pretend you're handing it to a colleague)
- Cost estimate: calculated cost at 100, 1,000, and 10,000 calls/day
- Demo: 5-minute recorded walkthrough or live demonstration
Write the design document BEFORE writing a single line of implementation code. It takes 1-2 hours. It will save you 4-6 hours of rework when you discover the failure mode you didn't think about on day one. Architects who write first ship faster and with fewer rewrites.
Getting Unstuck
Every capstone project hits at least one wall. Here's how to get through it.
The feature isn't working at all: Simplify until something works. Strip out everything non-essential. Get a single call returning successfully. Then add complexity back one layer at a time. The failure is almost always in one specific component; binary search for it.
The eval pass rate is below 70%: Don't tune the system and the evals at the same time, or you'll game the evals rather than improve the system. Fix the system first: read the failing cases carefully, look for patterns, hypothesize a cause, change one thing, re-run. Then check if that actually fixed the pattern.
The agent is getting stuck in a loop: Add a hard iteration limit and structured logging to every step. Print (or log) what decision the agent made at each step. You'll quickly see where it's confused. Usually the fix is a clearer system prompt or a better tool description.
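A hard iteration limit with per-step logging might look like the sketch below. The `decide` callable and the action dictionary shape are assumptions made for illustration; `stuck_policy` is a stub that deliberately repeats itself so the limit triggers.

```python
import json

def run_agent(decide, max_steps: int = 10) -> dict:
    """Run `decide` until it returns a final answer or the step limit is hit."""
    history = []
    for step in range(1, max_steps + 1):
        action = decide(history)
        # One JSON line per step makes a repeating decision easy to spot.
        print(json.dumps({"step": step, "action": action}))
        history.append(action)
        if action.get("type") == "final":
            return {"status": "done", "answer": action["answer"], "steps": step}
    # Limit reached: stop and hand back what was gathered so far.
    return {"status": "stopped", "answer": None, "steps": max_steps, "history": history}

# Stub policy that issues the same tool call forever (hypothetical).
def stuck_policy(history):
    return {"type": "tool", "name": "search", "args": {"q": "same query"}}

result = run_agent(stuck_policy, max_steps=3)
print(result["status"])  # stopped
```

Three identical log lines in a row point straight at the step where the agent is confused.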
The costs are too high: Profile first: which calls are the most expensive? Usually one component dominates. Apply prompt caching to static content in that component. Route simpler sub-tasks to Haiku. Batch calls where possible.
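If each LLM call already writes a log record with a cost, profiling is a short aggregation. This sketch assumes records carry `component` and `cost_usd` fields; those field names and the sample numbers are hypothetical.

```python
from collections import defaultdict

def cost_by_component(log_records: list[dict]) -> list[tuple[str, float]]:
    """Sum cost per component, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for record in log_records:
        totals[record["component"]] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up records for illustration.
records = [
    {"component": "draft_review", "cost_usd": 0.012},
    {"component": "critique", "cost_usd": 0.004},
    {"component": "draft_review", "cost_usd": 0.015},
    {"component": "post_comment", "cost_usd": 0.001},
]
ranking = cost_by_component(records)
for component, total in ranking:
    print(f"{component:15s} ${total:.3f}")
```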
Capstone Execution Plan
Use this structure to stay on track across the 10-15 hours:
Hours 1-2: Design
- Choose your option
- Write the design document (full template above)
- Draw the architecture diagram
- Identify your 5 most likely failure modes
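Each failure mode you list pairs a detection signal with a mitigation. As one illustration, the rate-limit row from the template (429 error, exponential backoff + retry) could be sketched as follows. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises, and the retry count and base delay are assumptions.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the 429 error your API client raises."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Delay doubles each attempt; jitter keeps clients from retrying in sync.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# Stub that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

recorded_delays = []
outcome = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(outcome, len(recorded_delays))  # ok 2
```

The `sleep` parameter is injectable so the example (and your tests) can record delays instead of actually waiting.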
Hours 3-5: Skeleton
- Set up the project structure and dependencies
- Implement the main data flow with stubs (functions that return hardcoded data)
- Verify the skeleton runs end-to-end without errors
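A skeleton for Option A might look like the sketch below: every function has its final signature but returns hardcoded data. The function names and the PR URL are placeholders chosen for illustration.

```python
def fetch_pr_diff(pr_url: str) -> str:
    # Stub: hardcoded diff instead of a GitHub API call.
    return "+ print('debug')\n- return total"

def review_diff(diff: str) -> str:
    # Stub: hardcoded review instead of a model call.
    return "Remove the leftover debug print."

def post_comment(pr_url: str, text: str) -> dict:
    # Stub: returns what would be posted instead of calling GitHub.
    return {"url": pr_url, "body": text, "posted": False}

def main(pr_url: str) -> dict:
    diff = fetch_pr_diff(pr_url)
    review = review_diff(diff)
    return post_comment(pr_url, review)

# Placeholder URL, not a real pull request.
result = main("https://github.com/example/repo/pull/1")
print(result)
```

Once this runs end-to-end, each stub can be swapped for a real implementation without touching `main`.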
Hours 6-9: Core Implementation
- Replace stubs with real implementations one at a time
- Add structured logging from the start; it will save debugging time
- Test each component independently before integrating
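A structured log line with cost and latency can be as simple as one JSON object per call. In this sketch the prices, the field names, and the model name `example-model` are all assumptions; substitute your provider's real per-token rates and the token counts from the API response.

```python
import json
import time

# Hypothetical USD prices per million tokens; replace with real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def log_llm_call(component: str, model: str, input_tokens: int,
                 output_tokens: int, latency_s: float) -> str:
    """Emit one JSON log line for a single LLM call and return it."""
    cost = (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
    record = {
        "ts": time.time(),
        "component": component,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": round(latency_s * 1000, 1),
        "cost_usd": round(cost, 6),
    }
    line = json.dumps(record)
    print(line)
    return line

start = time.perf_counter()
# ... the real model call would go here ...
line = log_llm_call("draft_review", "example-model", 1200, 300,
                    time.perf_counter() - start)
```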
Hours 10-12: Evals
- Write your 10+ eval cases
- Run the harness against your working system
- Identify and fix the top failures
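The smallest useful eval harness is a list of cases, a function that runs your system, and a pass-rate check. The sketch below is one possible shape for eval.py: the cases are toy examples, `system_under_test` is a stub, and the substring check is an assumed stand-in for whatever grader (unit check or LLM-as-judge) you chose.

```python
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]

def system_under_test(prompt: str) -> str:
    # Stub: replace with a call into your real pipeline.
    canned = {"2 + 2": "The answer is 4.", "capital of France": "Paris", "3 * 3": "6"}
    return canned.get(prompt, "")

def run_evals(cases, system, threshold: float = 0.7):
    """Return (pass_rate, passed) where passed means pass_rate >= threshold."""
    results = [case["expected"] in system(case["input"]) for case in cases]
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= threshold

pass_rate, passed = run_evals(EVAL_CASES, system_under_test)
print(f"pass rate: {pass_rate:.0%} ({'PASS' if passed else 'FAIL'})")
```

Here the stub gets one case wrong on purpose, so the run reports a failure against the 70% threshold. In a real harness you might also exit with a non-zero status on failure so the check can run in CI.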
Hours 13-14: Production Hardening
- Add prompt caching to static content
- Add PII redaction to logging if applicable
- Add model routing if using multiple task types
- Verify eval pass rate ≥ 70%
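PII redaction for logs can start with a pair of regular expressions applied before anything is written. This sketch covers only email addresses and one US-style phone format, both chosen as assumptions for illustration; real data usually needs broader patterns or a dedicated library.

```python
import re

# Deliberately narrow patterns; extend them for your own data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    """Replace emails and phone numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```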
Hour 15: Documentation and Demo
- Write README with setup and usage
- Calculate cost estimates
- Record or prepare your demo
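The cost estimate is plain arithmetic once you know average token counts per call. All numbers in this sketch are assumptions for illustration: 2,000 input and 500 output tokens per call, priced at $3 and $15 per million tokens. Replace them with averages from your own logs and your provider's current pricing.

```python
def daily_cost(calls_per_day: int, input_tokens: int, output_tokens: int,
               price_in: float, price_out: float) -> float:
    """Daily cost in USD; prices are USD per million tokens."""
    per_call = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return calls_per_day * per_call

# Assumed token counts and prices, for illustration only.
for volume in (100, 1_000, 10_000):
    cost = daily_cost(volume, 2_000, 500, price_in=3.0, price_out=15.0)
    print(f"{volume:>6,} calls/day: ${cost:,.2f}/day, ${cost * 30:,.2f}/month")
```

Under these assumptions each call costs $0.0135, so 1,000 calls/day comes to $13.50 per day.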
The capstone is done when all 10 deliverables are checked off. Don't skip the design document. Don't skip the evals. These are the parts that make the project real.