Production & AI Safety
- Implement prompt caching to cut API costs by up to 90%
- Apply security best practices (API keys, PII redaction)
- Set up production monitoring for AI systems
- Know when to fine-tune vs when to prompt engineer
From Prototype to Production
Getting a Claude integration to work in a Jupyter notebook is different from running it at scale. In production you're dealing with: real costs that compound over millions of calls, real users who will try to break your system, real latency requirements, and real regulatory constraints around data handling.
This module covers the four gaps between a working prototype and a production-ready system: cost optimization, security, monitoring, and the fine-tuning decision.
Prompt Caching: 90% Cost Reduction
The biggest cost lever in production AI is prompt caching. Here's the situation: most AI applications send the same large system prompt with every request. A customer support bot might have a 5,000-token system prompt containing product docs, policies, and examples. At scale, you're paying for those 5,000 tokens on every single API call, even though they never change.
Prompt caching lets you pay a one-time premium to write a block to the cache, then approximately 10% of the normal input price on subsequent calls where the cached block appears.
A block must meet a minimum length to be eligible: 1,024 tokens on most models, and more on some, so check the documentation for the model you use. The cache TTL is 5 minutes by default, and it resets every time the cached block is read. For a constantly used chatbot, the cache stays warm indefinitely. Cache writes cost 25% more than normal input tokens; cache reads cost 10% of normal. The write premium is recovered on the first cache hit, so caching pays off from the second call onward.
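The break-even arithmetic can be sketched in a few lines. This is a minimal illustration, assuming the 1.25x write and 0.10x read multipliers quoted above and assuming every call after the first lands inside the TTL window; the function names are hypothetical and not part of any SDK.

```python
def cost_without_cache(prefix_tokens: int, calls: int, price_per_mtok: float) -> float:
    """Cost in USD of sending the same prefix at full price on every call."""
    return calls * prefix_tokens / 1_000_000 * price_per_mtok


def cost_with_cache(
    prefix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_mult: float = 1.25,  # assumed cache-write premium
    read_mult: float = 0.10,   # assumed cache-read discount
) -> float:
    """Cost in USD when the first call writes the cache and the rest read it."""
    if calls <= 0:
        return 0.0
    per_call = prefix_tokens / 1_000_000 * price_per_mtok
    return per_call * (write_mult + read_mult * (calls - 1))


def break_even_calls(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest number of calls at which caching is cheaper than not caching."""
    calls = 1
    while write_mult + read_mult * (calls - 1) >= calls:
        calls += 1
    return calls
```

With these multipliers a single call costs more with caching than without, and the second call already tips the balance. A cache that expires between calls pays the write premium again, so low-traffic endpoints see smaller savings.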
Here's how to enable it:
import anthropic
client = anthropic.Anthropic()
# This large system prompt will be cached
LARGE_SYSTEM_PROMPT = """
You are an expert customer support agent for Acme Corp.
== PRODUCT CATALOG ==
[...5,000 tokens of product documentation, pricing, policies...]
== TROUBLESHOOTING GUIDES ==
[...detailed troubleshooting steps for 50 common issues...]
== ESCALATION PROCEDURES ==
[...when and how to escalate, internal contact list...]
"""
def chat_with_caching(user_message: str, conversation_history: list) -> str:
    """
    Sends a message using prompt caching for the static system content.
    First call: writes the system prompt to the cache (~25% above the normal input price).
    Subsequent calls within the TTL: pay ~10% for the system prompt (cache hit).
    """
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark for caching
            }
        ],
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
    )
    # Check cache performance in the response's usage object
    usage = response.usage
    # These fields may be missing or None, so fall back to 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if cache_read > 0:
        print(f" Cache HIT: {cache_read} tokens read from cache (saved ~90%)")
    elif cache_write > 0:
        print(f" Cache WRITE: {cache_write} tokens written to cache")
    return response.content[0].text
# Cost comparison (approximate, at claude-sonnet-4-5 pricing):
# Without caching: 5,000 tokens × $3/1M = $0.015 per call
# With caching (after first call): 5,000 tokens × $0.30/1M = $0.0015 per call
# Savings at 10,000 calls/day: $0.015 × 10,000 = $150/day -> $15/day
# Monthly savings: ~$4,050
# Multiple cache points: you can cache more than just the system prompt
def process_document_with_caching(document: str, questions: list[str]) -> list[str]:
    """
    Cache a large document and ask multiple questions against it.
    Each question only pays for its own tokens, not the document again.
    """
    answers = []
    for question in questions:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Document:\n{document}",
                            "cache_control": {"type": "ephemeral"},  # Cache the document
                        },
                        {
                            "type": "text",
                            "text": f"\nQuestion: {question}",
                            # No cache_control: this changes every time
                        },
                    ],
                }
            ],
        )
        answers.append(response.content[0].text)
    return answers

The pattern: identify what's static in your prompts (system instructions, document context, few-shot examples) and mark it with cache_control. What changes per request (the user's message, conversation history) gets no cache control.
Model Tiering: Right Model for the Right Task
Not every task needs your most capable model. Routing requests to the appropriate model tier saves significant money with minimal quality loss.
from enum import Enum

class ModelTier(Enum):
    FAST = "claude-haiku-4-5"       # Fast, cheap: simple tasks
    BALANCED = "claude-sonnet-4-5"  # Best capability/cost ratio: most tasks
    POWERFUL = "claude-opus-4-5"    # Most capable: complex reasoning

def route_to_model(task_type: str, complexity: str = "medium") -> str:
    """
    Route a task to the appropriate model tier.
    Returns the model identifier string.
    """
    routing = {
        # Simple, high-volume tasks -> Haiku
        ("classification", "low"): ModelTier.FAST,
        ("classification", "medium"): ModelTier.FAST,
        ("extraction", "low"): ModelTier.FAST,
        ("sentiment", "low"): ModelTier.FAST,
        ("simple_qa", "low"): ModelTier.FAST,
        # Standard development tasks -> Sonnet
        ("code_review", "medium"): ModelTier.BALANCED,
        ("rag_answer", "medium"): ModelTier.BALANCED,
        ("summarization", "medium"): ModelTier.BALANCED,
        ("agent_task", "medium"): ModelTier.BALANCED,
        # High-stakes complex reasoning -> Opus
        ("architecture_review", "high"): ModelTier.POWERFUL,
        ("complex_reasoning", "high"): ModelTier.POWERFUL,
        ("legal_analysis", "high"): ModelTier.POWERFUL,
    }
    # Unknown (task_type, complexity) pairs fall back to the balanced tier
    tier = routing.get((task_type, complexity), ModelTier.BALANCED)
    return tier.value
# Cost comparison per 1,000 calls (rough estimates, input tokens only):
# | Task                  | Model  | Cost   |
# |-----------------------|--------|--------|
# | Sentiment (100 tok)   | Haiku  | $0.08  |
# | Code review (1K tok)  | Sonnet | $3.00  |
# | Architecture (2K tok) | Opus   | $30.00 |

| Model | Best for | Approx. input cost |
|---|---|---|
| Claude Haiku | Classification, simple Q&A, extraction, high-volume pipelines | $0.80/1M tokens |
| Claude Sonnet | Code review, RAG pipelines, agents, complex instructions | $3.00/1M tokens |
| Claude Opus | Multi-step reasoning, architecture decisions, nuanced judgment | $15.00/1M tokens |

Prices differ between model versions, so confirm the current rates for the exact model you deploy before budgeting.
Build model routing into your agent framework from day one. Starting with Haiku and routing up when needed is much easier than starting with Opus and trying to cut costs later.
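One way to "route up when needed" is to try the cheapest tier first and escalate only when the answer fails a check. The sketch below assumes two caller-supplied functions that are not part of any SDK: `call_model(model, prompt)` wraps your actual API call, and `is_acceptable(text)` is whatever validation fits your task (schema check, length check, an eval score). The model identifiers are the ones used earlier in this module.

```python
from typing import Callable, Sequence

# Cheapest tier first; identifiers taken from the ModelTier enum in this module.
DEFAULT_TIERS = ("claude-haiku-4-5", "claude-sonnet-4-5", "claude-opus-4-5")


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical wrapper: (model, prompt) -> text
    is_acceptable: Callable[[str], bool],   # hypothetical validator for the task
    tiers: Sequence[str] = DEFAULT_TIERS,
) -> tuple[str, str]:
    """Try each tier in order and return (model_used, text) for the first acceptable answer.

    If no tier produces an acceptable answer, the last tier's output is returned
    so the caller can decide how to handle it.
    """
    if not tiers:
        raise ValueError("tiers must contain at least one model")
    text = ""
    model = tiers[0]
    for model in tiers:
        text = call_model(model, prompt)
        if is_acceptable(text):
            break
    return model, text
```

Escalation pays for the failed cheap call as well as the successful expensive one, so it suits workloads where most requests succeed on the first tier. Logging which tier answered each request shows whether the routing table needs adjusting.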
Security: API Keys and PII
Never Hardcode API Keys
# WRONG: never do this
client = anthropic.Anthropic(api_key="sk-ant-api03-...")

# RIGHT: use environment variables
import os
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Even better: use a secrets manager in production
import boto3  # AWS Secrets Manager example

def get_api_key() -> str:
    sm = boto3.client("secretsmanager", region_name="us-east-1")
    secret = sm.get_secret_value(SecretId="anthropic/api-key")
    return secret["SecretString"]

client = anthropic.Anthropic(api_key=get_api_key())

PII Redaction Before Logging
Never log raw user input without PII redaction. Many AI applications have inadvertently stored SSNs, medical record numbers, and credit card numbers in plain text logs. This can breach GDPR, CCPA, HIPAA, and PCI-DSS, resulting in fines in the millions. Redact before your logging pipeline, not after.
import re
from typing import NamedTuple

class RedactionResult(NamedTuple):
    redacted_text: str
    redaction_count: int
    types_found: list[str]

# Patterns for common PII types (applied in this order)
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    # [A-Za-z] rather than [A-Z|a-z]: inside a character class, "|" is a literal pipe
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PHONE_US": re.compile(r"\b(?:\+?1[-.]?)?\(?\d{3}\)?[-.]?\d{3}[-.]?\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "DATE_OF_BIRTH": re.compile(
        r"\b(?:DOB|date of birth|born)[:\s]+\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
        re.IGNORECASE,
    ),
}

def redact_pii(text: str) -> RedactionResult:
    """
    Redact known PII patterns from text before logging.
    Returns the redacted text and a summary of what was found.
    """
    redacted = text
    types_found = []
    total_count = 0
    for pii_type, pattern in PII_PATTERNS.items():
        # subn returns the new text plus the number of substitutions made
        redacted, count = pattern.subn(f"[{pii_type}_REDACTED]", redacted)
        if count:
            types_found.append(pii_type)
            total_count += count
    return RedactionResult(
        redacted_text=redacted,
        redaction_count=total_count,
        types_found=types_found,
    )

def safe_log_user_input(user_message: str, logger) -> str:
    """Redact PII and log. Returns the redacted version for use downstream."""
    result = redact_pii(user_message)
    if result.redaction_count > 0:
        logger.warning(
            f"PII detected and redacted: {result.types_found} "
            f"({result.redaction_count} instances)"
        )
    # Log at most the first 500 characters of the redacted text
    logger.info(f"User input (redacted): {result.redacted_text[:500]}")
    return result.redacted_text

# Example:
# Input: "My SSN is 123-45-6789 and email is alice@example.com"
# Output: "My SSN is [SSN_REDACTED] and email is [EMAIL_REDACTED]"

Note that regex-based redaction catches known patterns but won't catch everything: "my social is 123456789" (no dashes) would pass through. For production healthcare or financial applications, use a dedicated PII detection service.
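To enforce "redact before your logging pipeline" even when a developer forgets to call a redaction helper, one option is to attach the redaction to the log handler itself. The following is a minimal sketch using the standard library's `logging.Filter`; the class name `PIIRedactionFilter`, the helper `build_redacting_logger`, and the reduced two-pattern table are illustrative assumptions, not an established API.

```python
import logging
import re
import sys

# A reduced pattern table for illustration; a real deployment would reuse
# the full set of patterns it already maintains.
_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


class PIIRedactionFilter(logging.Filter):
    """Rewrites each log record so known PII patterns never reach the output."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges record.msg with record.args, so PII passed
        # as a format argument is covered as well.
        message = record.getMessage()
        for label, pattern in _PATTERNS.items():
            message = pattern.sub(f"[{label}_REDACTED]", message)
        record.msg = message
        record.args = ()  # arguments are already merged into msg
        return True       # keep the record, now redacted


def build_redacting_logger(name: str, stream=sys.stderr) -> logging.Logger:
    """Create a logger whose handler redacts PII before writing."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(PIIRedactionFilter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid unredacted copies via the root logger
    return logger
```

A filter on the handler acts as a safety net rather than a replacement for redacting at the call site: it has the same blind spots as the regexes it applies, and it does not cover data written through other channels such as tracing or analytics events.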
AI Safety for Developers
"AI Safety" often sounds abstract, but there are four concrete considerations every developer shipping AI products needs to address:
Hallucination risk in high-stakes domains: A chatbot that hallucinates product specs is annoying. A chatbot that hallucinates medication dosages is dangerous. If your AI system operates in a high-stakes domain (medical, legal, financial), implement mandatory human review for any output that falls below a confidence threshold, and cite sources for every claim.
Demographic bias testing: Models can perform differently across demographic groups. If your system classifies resumes, screens support tickets, or makes recommendations, test explicitly on inputs with names and language patterns from different demographic groups. Documented disparate impact is a legal liability.
Responsible disclosure: If you discover your AI system can be manipulated to produce harmful outputs, report it to the model provider. Keeping it quiet while users could be harmed is an ethical failure.
Regulatory awareness: The EU AI Act classifies certain AI applications as "high-risk", including employment decisions, credit scoring, and critical infrastructure. High-risk systems require conformity assessments, human oversight, and documentation. Know which category your system falls into before launch.
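The demographic bias testing described above can start as a simple counterfactual check: feed the system inputs that are identical except for the name, then compare outcome rates across groups. The sketch below is only a starting point under stated assumptions. `classify` stands in for your own model call, the function names are hypothetical, and a real audit needs far larger samples and a proper statistical test.

```python
from typing import Callable


def label_rate_by_group(
    template: str,                         # must contain a "{name}" placeholder
    names_by_group: dict[str, list[str]],  # group label -> test names you supply
    classify: Callable[[str], str],        # hypothetical wrapper around your system
    label: str,                            # the outcome to measure, e.g. "advance"
) -> dict[str, float]:
    """Fraction of inputs per group that receive `label`."""
    rates = {}
    for group, names in names_by_group.items():
        if not names:
            continue  # skip empty groups rather than divide by zero
        hits = sum(
            1 for name in names if classify(template.format(name=name)) == label
        )
        rates[group] = hits / len(names)
    return rates


def max_rate_gap(rates: dict[str, float]) -> float:
    """Largest difference in outcome rate between any two groups."""
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())
```

Because the inputs differ only in the name, any gap is attributable to the name itself. What counts as an acceptable gap is a policy and legal question, so the threshold is deliberately left to the caller.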
Production Monitoring
Logging individual calls is necessary but not sufficient. Production monitoring tracks aggregate behavior over time: cost trends, error rates, latency percentiles.
from collections import defaultdict
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class AIMonitor:
    """Thread-safe aggregate monitor for production AI systems."""
    _lock: Lock = field(default_factory=Lock)
    _call_count: int = 0
    _error_count: int = 0
    _total_cost_usd: float = 0.0
    _latencies_ms: list[float] = field(default_factory=list)
    _cost_by_model: dict = field(default_factory=lambda: defaultdict(float))
    _errors_by_type: dict = field(default_factory=lambda: defaultdict(int))

    def record_call(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        latency_ms: float,
        error: str | None = None,
    ) -> None:
        cost = self._calculate_cost(model, input_tokens, output_tokens)
        with self._lock:
            self._call_count += 1
            self._total_cost_usd += cost
            self._latencies_ms.append(latency_ms)
            self._cost_by_model[model] += cost
            if error:
                self._error_count += 1
                # Use the text before the first colon as the error type
                error_type = error.split(":")[0] if ":" in error else error[:50]
                self._errors_by_type[error_type] += 1

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # USD per 1M tokens. Approximate; update to match current pricing.
        input_rates = {
            "claude-haiku-4-5": 0.80,
            "claude-sonnet-4-5": 3.00,
            "claude-opus-4-5": 15.00,
        }
        output_rates = {
            "claude-haiku-4-5": 4.00,
            "claude-sonnet-4-5": 15.00,
            "claude-opus-4-5": 75.00,
        }
        # Unknown models are priced at the Sonnet rates
        input_cost = (input_tokens / 1_000_000) * input_rates.get(model, 3.0)
        output_cost = (output_tokens / 1_000_000) * output_rates.get(model, 15.0)
        return input_cost + output_cost

    def get_stats(self) -> dict:
        with self._lock:
            latencies = self._latencies_ms or [0]
            sorted_lat = sorted(latencies)
            n = len(sorted_lat)
            return {
                "total_calls": self._call_count,
                "error_rate": self._error_count / max(self._call_count, 1),
                "total_cost_usd": round(self._total_cost_usd, 4),
                "cost_by_model": dict(self._cost_by_model),
                "latency_p50_ms": round(sorted_lat[int(n * 0.50)], 1),
                "latency_p95_ms": round(sorted_lat[int(n * 0.95)], 1),
                "latency_p99_ms": round(sorted_lat[int(n * 0.99)], 1),
                "errors_by_type": dict(self._errors_by_type),
            }

    def report(self) -> None:
        stats = self.get_stats()
        print("AI Monitor Report")
        print(f" Calls: {stats['total_calls']} | Error rate: {stats['error_rate']:.1%}")
        print(f" Cost: ${stats['total_cost_usd']:.4f} total")
        print(f" Latency: p50={stats['latency_p50_ms']}ms | p99={stats['latency_p99_ms']}ms")
        for model, cost in stats["cost_by_model"].items():
            print(f" {model}: ${cost:.4f}")

# Global monitor, shared across your application
monitor = AIMonitor()

# Integrate with the logging decorator from Module 9:
# monitor.record_call(model, input_tokens, output_tokens, latency_ms, error)

Set up alerts on: error_rate > 5%, cost exceeding daily budget, p99_latency > 10 seconds, and any spike in a specific error type.
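The alert conditions listed above can be expressed as a small pure function over the dictionary that `get_stats()` returns. This is a sketch under assumptions: the function name `check_alerts`, the `baseline_errors` argument (error counts from a previous window), and the 3x `spike_factor` are illustrative choices, not values prescribed by this module.

```python
from typing import Optional


def check_alerts(
    stats: dict,
    daily_budget_usd: float,
    baseline_errors: Optional[dict] = None,  # error counts from a prior window (assumed)
    max_error_rate: float = 0.05,
    max_p99_ms: float = 10_000.0,
    spike_factor: float = 3.0,               # assumed definition of a "spike"
) -> list[str]:
    """Return one human-readable message per triggered alert condition."""
    alerts = []
    if stats["error_rate"] > max_error_rate:
        alerts.append(f"error_rate: {stats['error_rate']:.1%} exceeds {max_error_rate:.0%}")
    if stats["total_cost_usd"] > daily_budget_usd:
        alerts.append(
            f"cost: ${stats['total_cost_usd']:.2f} exceeds budget ${daily_budget_usd:.2f}"
        )
    if stats["latency_p99_ms"] > max_p99_ms:
        alerts.append(f"latency_p99: {stats['latency_p99_ms']}ms exceeds {max_p99_ms}ms")
    baseline = baseline_errors or {}
    for error_type, count in stats["errors_by_type"].items():
        previous = baseline.get(error_type, 0)
        # Treat a missing baseline as 1 so brand-new error types need
        # more than spike_factor occurrences before alerting.
        if count > spike_factor * max(previous, 1):
            alerts.append(f"spike: {error_type} at {count} (baseline {previous})")
    return alerts
```

Keeping the check separate from the monitor makes the thresholds easy to unit-test and lets the same function feed whichever paging or chat integration you already use.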
When to Fine-Tune
Fine-tuning is often the first tool engineers reach for when a model doesn't behave exactly as desired. It's almost never the right first move.
Before considering fine-tuning, exhaust these options in order:
- Better system prompt: Most behavior changes come from clearer instructions.
- Few-shot examples in the prompt: 3 to 10 examples of ideal input/output pairs fix many issues.
- RAG for knowledge gaps: If the model lacks current or proprietary knowledge, retrieve it.
- Output validation + retry: Catch malformed outputs and retry with a clarifying message.
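The last option in the list, output validation plus retry, can be sketched as follows. The example assumes a caller-supplied `call_model(messages)` function that wraps your API call and returns the reply text; that wrapper and the names `get_valid_json` and `find_problem` are hypothetical. The messages use the same role/content shape as the earlier examples in this module.

```python
import json
from typing import Callable, Optional


def find_problem(raw: str, required_keys: set[str]) -> Optional[str]:
    """Return a description of what is wrong with `raw`, or None if it is valid."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON ({exc.msg})"
    if not isinstance(parsed, dict):
        return "expected a JSON object"
    missing = required_keys - set(parsed)
    if missing:
        return f"missing keys {sorted(missing)}"
    return None


def get_valid_json(
    prompt: str,
    call_model: Callable[[list[dict]], str],  # hypothetical wrapper around the API
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and retry with a clarifying message."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        raw = call_model(messages)
        problem = find_problem(raw, required_keys)
        if problem is None:
            return json.loads(raw)
        # Show the model its rejected reply and say exactly what to fix.
        messages.append({"role": "assistant", "content": raw or "(empty reply)"})
        messages.append({
            "role": "user",
            "content": (
                f"Your previous reply was rejected: {problem}. Reply with only "
                f"a JSON object containing the keys {sorted(required_keys)}."
            ),
        })
    raise ValueError(f"No valid output after {max_attempts} attempts")
```

Capping the number of attempts matters because every retry resends the growing conversation. If the failure rate stays high after a couple of retries, the prompt or the examples are the more likely culprit.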
Fine-tune only when all of the above have genuinely hit their ceiling. Specific situations where fine-tuning makes sense:
- Consistent output format: You need JSON in a very specific schema, and prompt engineering produces variation even with examples.
- Domain-specific style: Technical writing in a highly specialized style that's hard to capture in a prompt.
- Latency/cost at scale: A fine-tuned smaller model can outperform a larger model on narrow tasks, saving significant cost at high volume.
- You have 1,000+ high-quality examples: Fine-tuning with fewer examples than this rarely improves over good prompting.
In 95% of production AI applications, better prompts plus RAG outperform fine-tuning. A great system prompt takes hours to write. Fine-tuning takes weeks of data collection, training runs, and evaluation, plus the ongoing cost of maintaining a custom model. Exhaust prompting first. Always.
| Situation | Right solution |
|---|---|
| Model doesn't know recent facts | RAG: retrieve current information |
| Output format is inconsistent | Few-shot examples in the prompt + output validation |
| Behavior is close but needs polish | System prompt iteration |
| Style needs to match a specific voice | Few-shot examples, possibly fine-tuning if very distinct |
| Narrow high-volume task at cost limit | Fine-tune a smaller model |
| Fewer than 100 examples available | Prompt engineering; do not fine-tune |
Exercise: Production-Hardening Your Agent
Take the agent you built in Module 6 or 8 and make it production-ready.
Step 1. Add prompt caching: Identify the static portion of your system prompt. Mark it with cache_control: {"type": "ephemeral"}. Run 10 consecutive calls and verify that cache hits appear in usage (cache_read_input_tokens greater than 0 after the first call).
Step 2. Add structured logging: Apply the with_logging decorator from Module 9 to your main LLM call function. Verify that each call produces a JSON log line with cost and latency.
Step 3. Add PII redaction: Wrap your user input pipeline with safe_log_user_input. Test with synthetic PII: "My email is test@example.com and phone is 555-123-4567." Verify it's redacted before the log.
Step 4. Add model routing: Define at least two task types in your agent. Route simple tasks to Haiku, complex ones to Sonnet. Verify the right model is called by checking the model field in log output.
Step 5. Run your evals: Run the eval harness you built in Module 9 against this production-hardened version. Did your changes affect eval scores? Document the before/after pass rates.
Step 6. Write a production readiness checklist: Before you'd ship any AI feature, what would you check? Write a 10-item checklist based on this module. At minimum include: caching enabled, API keys in env vars, PII redacted before logging, error handling on all tool calls, eval pass rate ≥ 70%, cost estimate per 1,000 calls.
Q1. How does prompt caching work in the Anthropic API?
Q2. Before logging user inputs from your AI app, what must you do?
Q3. When SHOULD you fine-tune a model?