🏆Module 11 of 12

Capstone Project

⏱ 10–15 hours

📘 Advanced

🔧 Python, your choice

What you'll learn

→Choose and plan a full AI application
→Write a design document before coding
→Apply all course concepts in one integrated project
→Evaluate your system with a real eval harness

The Capstone Purpose

Every module in this course taught a concept in isolation. You built a RAG pipeline. You wrote an MCP server. You set up an eval harness. Each piece made sense on its own.

The capstone is where isolation ends. Real AI systems require all these pieces working together: retrieval informing generation, tools extending capability, evals validating quality, logging surfacing problems, caching controlling costs. Integrating concepts is harder than applying them individually — and that difficulty is where the real learning happens.

The capstone also forces architectural decisions that tutorial exercises don't. You'll choose a project, write a design document, encounter a failure mode you didn't anticipate, and figure out how to handle it. You can't learn these things from reading. You can only learn them by building.

Choose one of the three projects below. They're different in domain but similar in scope — each takes 10–15 focused hours and uses the full stack of techniques from this course.

Choose Your Project

Option A — AI Code Review Agent

An agent that accepts a GitHub Pull Request URL, fetches the diff, runs a structured analysis through a reflection loop, and posts a review comment directly to GitHub.

This project emphasizes: tool use, multi-step agent loops, reflection/self-critique patterns, eval harness for code review quality.

Option A Architecture

User Input
    │
    ▼
GitHub PR URL
    │
    ▼
┌─────────────────────┐
│  Fetch PR Diff      │  Tool: github_get_pr_diff(url)
│  (GitHub API tool)  │
└──────────┬──────────┘
           │ Raw diff + PR description
           ▼
┌──────────────────────────────────────┐
│         REFLECTION LOOP              │
│                                      │
│  ┌─────────────────────────────┐     │
│  │  Step 1: Draft Review       │     │
│  │  (identify issues, bugs,    │     │
│  │   style problems, risks)    │     │
│  └──────────┬──────────────────┘     │
│             │                        │
│  ┌──────────▼──────────────────┐     │
│  │  Step 2: Critique Draft     │     │
│  │  (what's wrong? what's      │     │
│  │   missing? too harsh?)      │     │
│  └──────────┬──────────────────┘     │
│             │                        │
│  ┌──────────▼──────────────────┐     │
│  │  Step 3: Improved Review    │     │
│  │  (apply critique, finalize) │     │
│  └──────────┬──────────────────┘     │
│             │                        │
│   Repeat if score < threshold        │
└─────────────┬────────────────────────┘
              │ Final review text
              ▼
┌──────────────────────┐
│  Post GitHub Comment │  Tool: github_post_comment(url, text)
└──────────────────────┘
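
The reflection loop in the diagram can be sketched as a small driver function. This is a minimal sketch, not a reference implementation: `draft`, `critique`, and `revise` are hypothetical callables standing in for your model calls, and the `threshold` and `max_rounds` defaults are illustrative assumptions. The fake functions exist only so the sketch runs offline.

```python
from typing import Callable


def reflect(
    diff: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], tuple[float, str]],
    revise: Callable[[str, str, str], str],
    threshold: float = 0.8,   # assumed quality bar, tune for your project
    max_rounds: int = 3,      # hard cap so the loop always terminates
) -> tuple[str, float, int]:
    """Draft a review, then critique and revise until the score clears the threshold."""
    review = draft(diff)
    score, notes = critique(diff, review)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        review = revise(diff, review, notes)
        score, notes = critique(diff, review)
        rounds += 1
    return review, score, rounds


# Hypothetical stand-ins for model calls.
def fake_draft(diff: str) -> str:
    return "Looks fine."


def fake_critique(diff: str, review: str) -> tuple[float, str]:
    # Pretend the critic penalizes reviews that miss the risky call in the diff.
    if "eval(" in diff and "eval" not in review:
        return 0.4, "Missed the use of eval() on user input."
    return 0.9, "No further notes."


def fake_revise(diff: str, review: str, notes: str) -> str:
    return f"{review} Also: {notes.replace('Missed the', 'Flagging the')}"


final_review, final_score, rounds_used = reflect(
    "+ result = eval(user_input)", fake_draft, fake_critique, fake_revise
)
print(rounds_used, final_score)
```

Capping the rounds matters as much as the threshold: without `max_rounds`, a critic that never awards a passing score would keep the loop running indefinitely.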

Key techniques used: Tool use (Module 6), Agent loops (Module 6), MCP server for GitHub (Module 7), Eval harness with LLM-as-judge (Module 9), Prompt caching for the system prompt (Module 10).

Eval strategy: Collect 10 real PRs with known issues and have the agent review them. Use LLM-as-judge to score whether the agent identified the known issues. Track recall (what share of the known issues did it find?), precision (what share of the issues it flagged were real?), and tone (was it constructive?).
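
Once a judge has mapped each agent comment to an issue label, precision and recall reduce to set arithmetic. The sketch below assumes that mapping already exists; the issue labels are made up for illustration, and tone would still need a separate rubric.

```python
def precision_recall(known: set[str], flagged: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one PR, given issue labels as sets."""
    true_positives = len(known & flagged)
    # Both metrics are reported as 0.0 when their denominator is empty
    # (a convention chosen for this sketch).
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical labels for a single PR.
known_issues = {"sql-injection", "missing-null-check", "unclosed-file"}
agent_flags = {"sql-injection", "unclosed-file", "naming-style"}

p, r = precision_recall(known_issues, agent_flags)
print(f"precision={p:.2f} recall={r:.2f}")
```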

Option B — Private Knowledge Assistant + MCP Server

A RAG-powered assistant over your own documents (notes, PDFs, codebase, wiki) exposed as both a chat interface and an MCP server that Claude Desktop can query.

This project emphasizes: RAG pipeline design, MCP server development, eval harness for groundedness, prompt caching for the system prompt.

Option B Architecture

Documents (PDFs, markdown, code)
    │
    ▼
┌────────────────────────┐
│  Ingestion Pipeline    │
│  • Parse documents     │
│  • Chunk text          │
│  • Generate embeddings │
│  • Store in vector DB  │
└──────────┬─────────────┘
           │
    ┌──────┴──────┐
    │             │
    ▼             ▼
Chat Interface   MCP Server
    │            │
    │            │ Tools:
    │            │  • search_knowledge_base(query)
    │            │  • get_document(id)
    │            │  • list_topics()
    │
    ▼
User Query
    │
    ▼
┌──────────────────────────────────┐
│  RAG Pipeline                    │
│  1. Embed query                  │
│  2. Retrieve top-k chunks        │
│  3. Rerank by relevance          │
│  4. Generate grounded answer     │
│  5. Cite sources                 │
└──────────┬───────────────────────┘
           │
           ▼
Answer + Citations

Key techniques used: RAG pipeline (Module 4), Chunking strategies (Module 4), MCP server (Module 7), Groundedness eval (Module 9), Prompt caching (Module 10).
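
The embed, retrieve, and cite steps of the RAG pipeline can be sketched without any external services. This is a toy under loud assumptions: `embed` is a word-count stand-in for a real embedding model, the chunk ids and texts are invented, and the rerank and generation steps are left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Return up to k (chunk_id, score) pairs, best first, dropping zero-score chunks."""
    q = embed(query)
    scored = [(cid, cosine(q, embed(text))) for cid, text in chunks.items()]
    scored = [s for s in scored if s[1] > 0]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


def build_prompt(query: str, chunks: dict[str, str], hits: list[tuple[str, float]]) -> str:
    """Put retrieved chunks in the prompt with their ids so the answer can cite them."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid, _ in hits)
    return f"Answer using only the sources below and cite their ids.\n{context}\n\nQuestion: {query}"


# Hypothetical chunks keyed by a source id.
chunks = {
    "notes.md#1": "The deploy script uploads the build to the staging bucket.",
    "notes.md#2": "Rollbacks are done by redeploying the previous tag.",
    "wiki.md#1": "Lunch is served at noon on Fridays.",
}
query = "How are rollbacks done after a deploy?"
hits = retrieve(query, chunks)
prompt = build_prompt(query, chunks, hits)
print([cid for cid, _ in hits])
```

Keeping the chunk id next to each chunk in the prompt is what makes citation accuracy measurable later: the eval can check that every cited id was actually retrieved.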

Eval strategy: Write 20 questions you know the answers to from your documents. Measure: answer correctness (LLM-as-judge), groundedness score (does every claim appear in retrieved context?), citation accuracy (do cited documents actually support the claim?). Track these metrics across changes to chunking strategy and retrieval parameters.
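
Before wiring up an LLM judge, a crude lexical proxy for groundedness can catch the most obvious unsupported sentences. The sketch below is an assumption-heavy stand-in, not a real groundedness metric: it counts a sentence as supported when at least `min_overlap` of its words appear in the retrieved context, and the 0.6 default is arbitrary.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, context: str, min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose words mostly appear in the context."""
    ctx = tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = tokens(sentence)
        if words and len(words & ctx) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


# Hypothetical retrieved context and model answer.
context = "Rollbacks are done by redeploying the previous tag."
answer = "Rollbacks are done by redeploying the previous tag. This usually takes two minutes."
score = groundedness(answer, context)
print(score)
```

A paraphrased but correct sentence would fail this check, which is why the lesson's eval strategy relies on LLM-as-judge for the real measurement.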

Option C — Multi-Agent Research Pipeline

A pipeline of three specialized agents that collaborate to research a topic, analyze findings, and produce a structured report — with web search, file writing, and reflection throughout.

This project emphasizes: multi-agent orchestration, CrewAI or custom orchestration, agent-to-agent communication, reflection loops, structured output.

Option C Architecture

Research Topic (from user)
        │
        ▼
┌───────────────────┐
│  RESEARCHER AGENT │
│                   │
│  Tools:           │
│  • web_search()   │
│  • fetch_url()    │
│  • save_notes()   │
│                   │
│  Output: Raw      │
│  research notes   │
└────────┬──────────┘
         │ Research notes
         ▼
┌───────────────────┐
│  ANALYST AGENT    │
│                   │
│  Tools:           │
│  • read_notes()   │
│  • fact_check()   │
│  • cross_ref()    │
│                   │
│  Output:          │
│  Structured       │
│  analysis + gaps  │
└────────┬──────────┘
         │ Analysis
         ▼
┌───────────────────┐
│  WRITER AGENT     │
│                   │
│  Tools:           │
│  • read_analysis()│
│  • write_report() │
│  • format_md()    │
│                   │
│  Output: Final    │
│  markdown report  │
└────────┬──────────┘
         │
         ▼
    Final Report
    (saved to disk)

Key techniques used: Multi-agent systems (Module 5), Tool use (Module 6), Reflection loops (Module 6), Structured logging (Modules 9, 10), Eval harness for report quality (Module 9).
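
One way to wire the three agents together is to give each handoff an explicit type, so the analyst and writer fail loudly when upstream output changes shape. The sketch below assumes custom orchestration rather than CrewAI; the dataclasses and the stubbed agent bodies are illustrative, with no real search or model calls.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchNotes:
    topic: str
    notes: list[str]


@dataclass
class Analysis:
    topic: str
    findings: list[str]
    gaps: list[str] = field(default_factory=list)


def researcher(topic: str) -> ResearchNotes:
    # Stub: a real agent would call web_search() and fetch_url() here.
    return ResearchNotes(topic, [f"{topic} note 1", f"{topic} note 2", ""])


def analyst(notes: ResearchNotes) -> Analysis:
    # Stub: keeps non-empty notes and records anything it had to drop as a gap.
    findings = [n for n in notes.notes if n.strip()]
    gaps = ["empty note discarded"] if len(findings) < len(notes.notes) else []
    return Analysis(notes.topic, findings, gaps)


def writer(analysis: Analysis) -> str:
    lines = [f"# {analysis.topic}", "", "## Findings"]
    lines += [f"- {finding}" for finding in analysis.findings]
    if analysis.gaps:
        lines += ["", "## Gaps"] + [f"- {gap}" for gap in analysis.gaps]
    return "\n".join(lines)


def run_pipeline(topic: str) -> str:
    return writer(analyst(researcher(topic)))


report = run_pipeline("Prompt caching")
print(report)
```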

Eval strategy: Define 5 research topics with known good answers. Run the pipeline. Score the final report on: factual accuracy, completeness (are key points covered?), structure (is it well-organized?), citation quality (are claims supported?). Use LLM-as-judge with a research report rubric.

Design Document Template

Write this document before writing any code. Engineers who write design documents first ship faster and with fewer rewrites — not because documents are magic, but because the act of writing surfaces assumptions, contradictions, and missing pieces before they become bugs.

# Project Design Document
## [Project Name]
**Author**: [Your name]  
**Date**: [Today]  
**Option**: [A / B / C]
 
---
 
## Problem Statement
What problem does this solve? Who benefits and how?
(2–3 sentences max — if you can't say it in 3 sentences, 
the problem isn't clear enough yet.)
 
---
 
## Architecture Overview
[ASCII diagram of your system — copy and modify from the option above]
 
Key components:
- **Component 1**: What it does, what it takes as input, what it produces
- **Component 2**: Same
- **Component 3**: Same
 
---
 
## Agent/Pipeline Design
For each agent or pipeline stage:
- **Purpose**: What decision or transformation happens here?
- **Inputs**: What does it receive?
- **Outputs**: What does it produce?
- **Tools**: What tools can it call?
- **Model**: Which Claude model and why?
 
---
 
## Failure Modes & Mitigations
 
| Failure Mode | How to Detect | Mitigation |
|---|---|---|
| API rate limit hit | 429 error code | Exponential backoff + retry |
| LLM hallucination | Groundedness score < 0.8 | Retry with explicit grounding instruction |
| Tool call fails | Error in tool result | Return structured error, let agent decide |
| Infinite agent loop | Iteration counter > 10 | Force stop, return partial result |
| Empty retrieval | 0 chunks returned | Fallback: broaden query, inform user |
 
(Add your project-specific failure modes)
 
---
 
## Eval Strategy
- **Number of test cases**: [minimum 10]
- **Eval type**: Unit / LLM-as-judge / Human (explain your choice)
- **Pass threshold**: [e.g., 70% pass rate]
- **Key metrics**: [e.g., groundedness score, precision, recall]
- **What regression would you catch**: [describe a specific scenario]
 
---
 
## Technical Decisions
 
| Decision | Options Considered | Choice | Reason |
|---|---|---|---|
| Vector DB | Chroma, Pinecone, FAISS | [your choice] | [reason] |
| Chunking strategy | Fixed-size, semantic, paragraph | [your choice] | [reason] |
| Agent framework | Custom, LangChain, CrewAI | [your choice] | [reason] |
| Model routing | Single model, tiered | [your choice] | [reason] |
 
---
 
## Definition of Done
Check every box before considering the capstone complete:
 
- [ ] Design document written and reviewed
- [ ] Architecture diagram drawn
- [ ] Failure modes documented with mitigations
- [ ] Core functionality working end-to-end
- [ ] Eval harness with ≥ 10 test cases
- [ ] Pass rate ≥ 70% on eval suite
- [ ] Structured logging on every LLM call
- [ ] README with setup instructions and usage examples
- [ ] Cost estimate: how much would this cost at 100, 1,000, and 10,000 calls/day?
- [ ] 5-minute demo recorded or live demo ready
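
The first row of the failure modes table in the template pairs a 429 error with exponential backoff and retry. A minimal sketch of that mitigation follows. `RateLimitError` is a hypothetical stand-in for whatever exception your API client raises on HTTP 429, and the retry count and base delay are assumptions to tune.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests do not wait
) -> T:
    """Retry call() on rate limits, doubling the wait each time and adding jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)


# Simulated API that is rate limited twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429")
    return "ok"


delays: list[float] = []
result = with_backoff(flaky, sleep=delays.append)
print(result, len(delays))
```

The jitter term spreads retries out so that several workers hitting the limit at once do not all retry at the same instant.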

Deliverables

The capstone is complete when all ten of these exist:

Design document — filled out template above, written before coding
Architecture diagram — ASCII or drawn, matching what you actually built
Failure modes table — at least 5 failure modes with detection and mitigation
Working core — the main use case works end-to-end without manual intervention
Eval harness — a Python file you can run with python eval.py that tests your system
Pass rate ≥ 70% — at least 7 of your 10 test cases pass
Structured logging — every LLM call produces a JSON log line with cost and latency
README — setup instructions from scratch (pretend you're handing it to a colleague)
Cost estimate — calculated cost at 100, 1,000, and 10,000 calls/day
Demo — 5-minute recorded walkthrough or live demonstration
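
For the eval harness deliverable, `eval.py` can start very small. The sketch below is one possible shape, not a required structure: `run_system` is a placeholder for your capstone's entry point, the cases are toy examples (a real suite needs at least 10), and each `check` is a plain function where an LLM-as-judge call could go.

```python
from dataclasses import dataclass
from typing import Callable

PASS_THRESHOLD = 0.70  # matches the 70% pass rate in the deliverables


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_system(prompt: str) -> str:
    # Placeholder for your capstone's entry point.
    return prompt.upper()


def run_evals(cases: list[EvalCase], system: Callable[[str], str]) -> float:
    """Run every case, print PASS/FAIL per case, and return the pass rate."""
    passed = 0
    for case in cases:
        try:
            ok = bool(case.check(system(case.prompt)))
        except Exception as exc:  # a crash counts as a failure, not a harness error
            ok = False
            print(f"ERROR {case.name}: {exc}")
        print(f"{'PASS' if ok else 'FAIL'} {case.name}")
        passed += ok
    return passed / len(cases) if cases else 0.0


CASES = [
    EvalCase("uppercases", "hello", lambda out: out == "HELLO"),
    EvalCase("keeps digits", "abc123", lambda out: "123" in out),
    EvalCase("non-empty", "x", lambda out: bool(out)),
    EvalCase("deliberate failure", "hi", lambda out: out == "hi"),
]


def main() -> int:
    rate = run_evals(CASES, run_system)
    print(f"pass rate: {rate:.0%}")
    return 0 if rate >= PASS_THRESHOLD else 1


exit_code = main()
```

In a real `eval.py`, ending the file with `raise SystemExit(main())` instead of the last line turns the pass threshold into a process exit code that CI can act on.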

✅

Design First, Then Code

Write the design document before writing a single line of implementation code. It takes 1–2 hours, and it can save you 4–6 hours of rework when you discover the failure mode you didn't think about on day one. Engineers who design first tend to ship faster and with fewer rewrites.

Getting Unstuck

Every capstone project hits at least one wall. Here's how to get through it.

The feature isn't working at all: Simplify until something works. Strip out everything non-essential. Get a single call returning successfully. Then add complexity back one layer at a time. The failure is almost always in one specific component — binary search for it.

The eval pass rate is below 70%: Don't tune the system and the evals at the same time — you'll game the evals rather than improve the system. Fix the system first: read the failing cases carefully, look for patterns, hypothesize a cause, change one thing, re-run. Then check if that actually fixed the pattern.

The agent is getting stuck in a loop: Add a hard iteration limit and structured logging to every step. Print (or log) what decision the agent made at each step. You'll quickly see where it's confused. Usually the fix is a clearer system prompt or a better tool description.
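
A hard iteration limit plus one JSON log line per step might look like the sketch below. It assumes your agent can be expressed as a `step` function that takes and returns a state dict with `action` and `done` keys; that interface is invented for illustration. The limit of 10 mirrors the failure modes table in the design document template.

```python
import json
import time
from typing import Callable

MAX_ITERATIONS = 10


def run_agent(
    step: Callable[[dict], dict],
    state: dict,
    log: Callable[[str], None] = print,
) -> dict:
    """Call step() until it reports done or the hard limit is reached."""
    for i in range(1, MAX_ITERATIONS + 1):
        started = time.perf_counter()
        state = step(state)
        log(json.dumps({
            "iteration": i,
            "action": state.get("action"),
            "done": state.get("done", False),
            "latency_ms": round((time.perf_counter() - started) * 1000, 2),
        }))
        if state.get("done"):
            return {**state, "stopped": "finished", "iterations": i}
    # Limit reached: force stop and hand back whatever partial state exists.
    return {**state, "stopped": "iteration_limit", "iterations": MAX_ITERATIONS}


def confused_step(state: dict) -> dict:
    # Simulates an agent that keeps choosing the same tool.
    return {**state, "action": "search_again", "done": False}


lines: list[str] = []
result = run_agent(confused_step, {"task": "demo"}, log=lines.append)
print(result["stopped"], len(lines))
```

Ten identical `search_again` lines in the log make the loop obvious at a glance, which is usually enough to point at the prompt or tool description that needs rewording.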

The costs are too high: Profile first — which calls are the most expensive? Usually one component dominates. Apply prompt caching to static content in that component. Route simpler sub-tasks to Haiku. Batch calls where possible.
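
Profiling cost can begin as back-of-the-envelope arithmetic per component. In the sketch below every number is a placeholder: the prices are not any provider's real rates, and the component names, call counts, and token counts are invented. Substitute measurements from your own logs and current pricing.

```python
# Hypothetical prices in USD per million tokens. Replace with current rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

# component -> (LLM calls per request, input tokens per call, output tokens per call)
COMPONENTS = {
    "retrieve": (1, 500, 50),
    "draft": (1, 4_000, 800),
    "critique": (2, 1_500, 200),
}


def cost_per_request(components: dict[str, tuple[int, int, int]]) -> dict[str, float]:
    """Return the USD cost each component adds to a single request."""
    return {
        name: calls * (tokens_in * PRICE_IN + tokens_out * PRICE_OUT) / 1_000_000
        for name, (calls, tokens_in, tokens_out) in components.items()
    }


costs = cost_per_request(COMPONENTS)
total = sum(costs.values())
dominant = max(costs, key=costs.get)

print(f"most expensive component: {dominant}")
for requests_per_day in (100, 1_000, 10_000):
    print(f"{requests_per_day:>6} requests/day: ${requests_per_day * total:,.2f}/day")
```

With these placeholder numbers the `draft` component dominates, so it would be the first candidate for prompt caching or a cheaper model.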

💻Hands-on Exercise

Capstone Execution Plan

Use this structure to stay on track across the 10–15 hours:

Hours 1–2: Design

Choose your option
Write the design document (full template above)
Draw the architecture diagram
Identify your 5 most likely failure modes

Hours 3–5: Skeleton

Set up the project structure and dependencies
Implement the main data flow with stubs (functions that return hardcoded data)
Verify the skeleton runs end-to-end without errors
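
For Option A, a stubbed skeleton could be as small as the sketch below. The function names, hardcoded return values, and example URL are all hypothetical. The goal is only to prove the data flow runs end-to-end before any real API is involved.

```python
def fetch_pr_diff(url: str) -> str:
    return "+ print('hello')"  # stub: hardcoded diff, no GitHub call


def draft_review(diff: str) -> str:
    added = sum(1 for line in diff.splitlines() if line.startswith("+"))
    return f"Stub review: {added} line(s) added."  # stub: no model call


def post_comment(url: str, text: str) -> dict:
    return {"url": url, "posted": False, "text": text}  # stub: nothing is posted


def main(url: str) -> dict:
    diff = fetch_pr_diff(url)
    review = draft_review(diff)
    return post_comment(url, review)


result = main("https://github.com/example/repo/pull/1")  # hypothetical URL
print(result)
```

Each stub keeps the signature the real implementation will have, so replacing them one at a time does not change `main`.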

Hours 6–9: Core Implementation

Replace stubs with real implementations one at a time
Add structured logging from the start — it will save debugging time
Test each component independently before integrating
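
Structured logging is easiest to add as a thin wrapper around every model call. The sketch below assumes the wrapped `call` returns a dict containing `input_tokens` and `output_tokens`; that shape, the model name, and the prices are illustrative assumptions, so adapt the field access to whatever your client actually returns.

```python
import json
import time
from typing import Callable


def logged_call(
    model: str,
    prompt: str,
    call: Callable[[str], dict],
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
    emit: Callable[[str], None] = print,
) -> dict:
    """Run one model call and emit a JSON log line with token counts, cost, and latency."""
    started = time.perf_counter()
    response = call(prompt)
    latency_ms = (time.perf_counter() - started) * 1000
    cost = (
        response["input_tokens"] * usd_per_mtok_in
        + response["output_tokens"] * usd_per_mtok_out
    ) / 1_000_000
    emit(json.dumps({
        "ts": time.time(),
        "model": model,
        "input_tokens": response["input_tokens"],
        "output_tokens": response["output_tokens"],
        "cost_usd": round(cost, 6),
        "latency_ms": round(latency_ms, 2),
    }))
    return response


def fake_model(prompt: str) -> dict:
    # Hypothetical response shape standing in for a real client.
    return {"text": "stub", "input_tokens": 1200, "output_tokens": 300}


log_lines: list[str] = []
logged_call("example-model", "Summarize this diff.", fake_model, 3.00, 15.00, emit=log_lines.append)
print(log_lines[0])
```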

Hours 10–12: Evals

Write your 10+ eval cases
Run the harness against your working system
Identify and fix the top failures

Hours 13–14: Production Hardening

Add prompt caching to static content
Add PII redaction to logging if applicable
Add model routing if using multiple task types
Verify eval pass rate ≥ 70%
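
PII redaction for logs can begin with a couple of regular expressions applied before a line is written. The patterns below are deliberately crude assumptions: they catch common email and phone formats, will miss many others, and the phone pattern can also match other long digit runs.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Rough phone matcher: a digit, at least 7 digits/separators, then a digit.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


sample = "Contact jane.doe@example.com or +1 (555) 010-9999 about PR 42."
print(redact(sample))
```

Running redaction inside the logging function, rather than at each call site, means a new log statement cannot forget it.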

Hour 15: Documentation and Demo

Write README with setup and usage
Calculate cost estimates
Record or prepare your demo

The capstone is done when all 10 deliverables are checked off. Don't skip the design document. Don't skip the evals. These are the parts that make the project real.
