🛠️ Module 8 of 12

Tools & Skills in AI Systems

⏱ 5–6 hours

📘 Intermediate

🔧 Python, JSON Schema

What you'll learn

→ Distinguish tools (executable actions) from skills (packaged expertise)
→ Design well-typed tool schemas using JSON Schema
→ Write tool implementations robust to LLM-generated inputs
→ Understand how models select which tool to call

Tools vs. Skills

These two words are often used interchangeably, but they describe distinct concepts in AI system design. Getting clear on the difference makes your architecture cleaner and your code more reusable.

| | Tool | Skill |
|---|---|---|
| What it is | An executable function the model can call | A packaged bundle of expertise shaping agent behavior |
| What it does | Performs an action: reads, writes, calls an API | Configures how an agent approaches a class of problems |
| Contains | Python function + JSON Schema | System prompt + tools + few-shot examples |
| Example | search_web(query), write_file(path, content) | PythonDebuggingSkill, CustomerSupportSkill |
| Reusability | Individual function | Entire behavioral profile for an agent |

A Tool is the mechanism. It does one concrete thing: query a database, send a message, call an external API. The model decides when to use it based on the tool's name and description.

A Skill is the strategy. It's an opinionated configuration — system prompt + curated tools + examples — that tells an agent how to think about a domain. You apply a skill to an agent to give it expertise. The same underlying model becomes a different specialist depending on which skill is applied.

Think of it this way: a carpenter's hammer is a tool. "Finish carpentry" is a skill. The skill determines which tools to use and how to use them well.

Tool Design Principles

There is one rule that matters more than all others: the tool description is the API contract for the model.

The model never reads your source code. It never sees docstrings (unless you put them in the description). It decides whether and when to call your tool based entirely on the name and description fields you write. A bad description causes wrong tool calls. A good description makes the model a reliable collaborator.
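To make the point about docstrings concrete, here is a minimal sketch of a helper that copies a function's docstring into the description field. The helper name tool_from_function and its small type mapping are assumptions made for illustration, not part of any SDK; the point is only that the docstring reaches the model because it is copied explicitly.

```python
import inspect
from typing import Callable

# Hypothetical mapping from a few Python annotations to JSON Schema types.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_from_function(fn: Callable) -> dict:
    """Build a tool definition whose description is the function's docstring."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        # The docstring only reaches the model because it is copied here.
        "description": inspect.getdoc(fn) or "",
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def search_customer_orders(email: str, limit: int = 50) -> list:
    """Search orders for a specific customer by email address.

    Use when the user asks about their orders or order history.
    Returns at most 50 orders sorted by date descending.
    """
    return []  # the body is invisible to the model


tool = tool_from_function(search_customer_orders)
print(tool["description"])
```

Everything the model will ever know about search_customer_orders is in the printed text and the schema; the function body never leaves your process.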

Compare these two descriptions for the same tool:

Bad description:

{
  "name": "get_data",
  "description": "Gets data from the system.",
  "inputSchema": { ... }
}

This is useless. "Gets data" describes nothing. The model has no idea when to call this, what kind of data, or what it returns.

Good description:

{
  "name": "search_customer_orders",
  "description": "Search orders for a specific customer by email address. Returns a list of orders with order ID, date, total, status (pending/shipped/delivered/cancelled), and items. Use this when the user asks about their orders, order history, or the status of a specific order. Returns at most 50 orders sorted by date descending.",
  "inputSchema": { ... }
}

This tells the model: what data it returns, what format, when to call it, and limits to be aware of.

✅

Write Descriptions for a Junior Developer

Write your description as if explaining to a junior developer who needs to decide when to use this function. Be specific about: when to call it, what it returns, any constraints or limits, what NOT to use it for if there are common confusions. The model is an expert reader — give it the information it needs.

Four principles for excellent tool design:

One tool, one responsibility: search_orders and cancel_order are two tools, not one. Mixed-responsibility tools create ambiguous descriptions and fragile implementations.
Name the domain in the tool name: search_customer_orders is better than search. When a model sees ten tools, specific names reduce confusion.
Describe the return value: the model needs to know what it gets back to use the result correctly.
Explain selection criteria: if two tools could plausibly handle the same request, describe when to choose each one.

JSON Schema Anatomy

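Every tool's inputSchema is a standard JSON Schema object. (The Anthropic Messages API spells the same field input_schema, as in the Python examples later in this module.) Here's a fully annotated example for a ticket creation tool: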

{
  "name": "create_ticket",
  "description": "Create a new support ticket in the ticketing system. Use when a user reports a bug, requests a feature, or has an issue that needs tracking. Returns the ticket ID and URL.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "A concise title for the ticket, max 100 characters.",
        "maxLength": 100
      },
      "description": {
        "type": "string",
        "description": "Full description of the issue or request. Include steps to reproduce for bugs."
      },
      "priority": {
        "type": "string",
        "enum": ["low", "medium", "high", "critical"],
        "description": "Ticket priority. Use 'critical' only for production outages or data loss.",
        "default": "medium"
      },
      "category": {
        "type": "string",
        "enum": ["bug", "feature", "question", "documentation"],
        "description": "Ticket category — determines routing to the right team."
      },
      "affected_users": {
        "type": "integer",
        "description": "Estimated number of users affected. Used for priority triage.",
        "minimum": 0
      },
      "labels": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Optional tags for filtering, e.g. ['auth', 'mobile', 'payments'].",
        "maxItems": 10
      }
    },
    "required": ["title", "description", "priority", "category"]
  }
}

Key schema features to use:

enum constrains values to a fixed set — prevents the model from inventing invalid categories.
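default tells the model what to use when the user doesn't specify. JSON Schema treats it as an annotation only, so your implementation still has to apply the fallback itself. (In the example above, priority is also listed in required, so the model is expected to send it explicitly.)
minimum/maximum on integers declare the valid range; re-check the range in code as well.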
maxLength prevents accidentally huge string values.
required tells the model which fields it must supply.
description on each property is just as important as the tool description — it's how the model knows what to put in each field.

Write descriptions on every property, not just the required ones. The model filling in optional fields correctly makes your tool much more useful.
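The schema features listed above only help if something checks them at call time. The following is a minimal, standard-library-only sketch of such a check. The function name validate_args is an assumption, and it covers just the keywords used in this module (type, enum, default, minimum, maximum, maxLength, maxItems, required); it is not a full JSON Schema validator.

```python
def validate_args(schema: dict, args: dict) -> tuple[dict, list[str]]:
    """Check args against a small subset of JSON Schema.

    Returns (cleaned_args, errors). Defaults are applied to cleaned_args.
    """
    type_map = {"string": str, "integer": int, "array": list}
    errors: list[str] = []
    cleaned = dict(args)

    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field '{name}'")

    for name, spec in schema.get("properties", {}).items():
        if name not in cleaned:
            # 'default' is only an annotation, so the code applies it.
            if "default" in spec:
                cleaned[name] = spec["default"]
            continue

        value = cleaned[name]
        expected = type_map.get(spec.get("type"))
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (expected is int and isinstance(value, bool))  # bool subclasses int
        )
        if wrong_type:
            errors.append(f"'{name}' must be of type {spec['type']}")
            continue

        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"'{name}' must be one of {spec['enum']}")
        if "maxLength" in spec and isinstance(value, str) and len(value) > spec["maxLength"]:
            errors.append(f"'{name}' must be at most {spec['maxLength']} characters")
        if "minimum" in spec and isinstance(value, int) and value < spec["minimum"]:
            errors.append(f"'{name}' must be >= {spec['minimum']}")
        if "maximum" in spec and isinstance(value, int) and value > spec["maximum"]:
            errors.append(f"'{name}' must be <= {spec['maximum']}")
        if "maxItems" in spec and isinstance(value, list) and len(value) > spec["maxItems"]:
            errors.append(f"'{name}' must have at most {spec['maxItems']} items")

    return cleaned, errors


# A trimmed-down version of the create_ticket schema, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "maxLength": 100},
        "priority": {
            "type": "string",
            "enum": ["low", "medium", "high", "critical"],
            "default": "medium",
        },
        "affected_users": {"type": "integer", "minimum": 0},
    },
    "required": ["title"],
}

cleaned, errors = validate_args(schema, {"title": "Login fails", "affected_users": -3})
print(cleaned["priority"])
print(errors)
```

Returning the error list instead of raising keeps the messages available to send back to the model as a structured result.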

Writing Robust Tool Implementations

The model generates tool arguments. The model makes mistakes. Your tool implementation must handle bad input gracefully.

from dataclasses import dataclass
from typing import Optional
 
 
@dataclass
class TicketResult:
    success: bool
    ticket_id: Optional[str] = None
    url: Optional[str] = None
    error: Optional[str] = None
 
    def to_dict(self) -> dict:
        return {k: v for k, v in self.__dict__.items() if v is not None}
 
 
def create_ticket(
    title: str,
    description: str,
    priority: str,
    category: str,
    affected_users: int = 0,
    labels: list[str] | None = None,
) -> dict:
    """Tool implementation with robust error handling."""
 
    # 1. Input validation: don't trust model-generated arguments
    if not isinstance(title, str) or not title.strip():
        return TicketResult(success=False, error="title must be a non-empty string").to_dict()
 
    if len(title) > 100:
        return TicketResult(
            success=False,
            error=f"title too long ({len(title)} chars). Max 100."
        ).to_dict()
 
    valid_priorities = {"low", "medium", "high", "critical"}
    if priority not in valid_priorities:
        return TicketResult(
            success=False,
            # sorted() keeps the message stable; set ordering is arbitrary
            error=f"Invalid priority '{priority}'. Must be one of: {sorted(valid_priorities)}"
        ).to_dict()

    valid_categories = {"bug", "feature", "question", "documentation"}
    if category not in valid_categories:
        return TicketResult(
            success=False,
            error=f"Invalid category '{category}'. Must be one of: {sorted(valid_categories)}"
        ).to_dict()
 
    labels = labels or []
 
    # 2. Type coercion: the model sometimes sends strings for integers
    try:
        affected_users = int(affected_users)
    except (ValueError, TypeError):
        affected_users = 0
    # Mirror the schema's "minimum": 0 constraint
    affected_users = max(affected_users, 0)
 
    # 3. Business logic
    try:
        # Call your actual ticketing API here
        ticket_id = f"TICK-{hash(title) % 10000:04d}"  # mock; str hashes differ between interpreter runs
        url = f"https://tickets.example.com/{ticket_id}"
 
        return TicketResult(
            success=True,
            ticket_id=ticket_id,
            url=url,
        ).to_dict()
 
    except Exception as e:
        # 4. Return structured errors, not exceptions
        return TicketResult(
            success=False,
            error=f"Ticket creation failed: {e}"
        ).to_dict()

⚠️

Return Errors as Data, Not Exceptions

Return errors as structured data so the model can read the error message and decide what to do next: retry with different input, ask the user for clarification, or report the issue. Tools run in your own process, not inside the API, so an unhandled exception either crashes the agent loop or forces it to send back a generic failure, and the model loses context about what went wrong.
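As a last line of defense, the agent loop itself can convert any exception that escapes a tool into a structured result. The sketch below assumes tools return a dict with a success flag, as create_ticket does; run_tool_safely and flaky_lookup are hypothetical names. It also sets is_error, a field the Anthropic Messages API accepts on tool_result blocks.

```python
import json
from typing import Callable


def run_tool_safely(tool_use_id: str, fn: Callable[..., dict], tool_input: dict) -> dict:
    """Run a tool and always return a tool_result block, even on failure."""
    try:
        payload = fn(**tool_input)
        failed = not payload.get("success", True)
    except Exception as exc:  # last-resort guard around the tool call
        payload = {"success": False, "error": f"{type(exc).__name__}: {exc}"}
        failed = True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(payload),
        "is_error": failed,
    }


def flaky_lookup(order_id: str) -> dict:
    """Hypothetical tool that fails, to show what the model would receive."""
    raise TimeoutError("upstream service did not respond")


block = run_tool_safely("toolu_example", flaky_lookup, {"order_id": "ORD-001"})
print(block["content"])
```

Because tool_input is unpacked as keyword arguments inside the try block, a call with unexpected or missing arguments is reported to the model the same way instead of crashing the loop.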

Three patterns that make tools robust:

Structured error returns: Include a success boolean and an error string. The model reads the error message and can respond intelligently.

Type coercion: Models sometimes send "5" instead of 5. Defensively coerce types before using them.

Idempotency: Where possible, design tools that can be called twice safely. If the model gets confused and calls create_ticket twice with the same title, it should return the existing ticket rather than creating a duplicate.
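The idempotency pattern can be sketched as follows, under the assumption that two tickets with the same normalized title and category count as duplicates. The in-memory dict and the function names are hypothetical; a real system would rely on the ticketing backend or a database uniqueness constraint.

```python
import hashlib

# Hypothetical in-memory store standing in for the ticketing backend.
_tickets_by_key: dict[str, dict] = {}


def _idempotency_key(title: str, category: str) -> str:
    """Derive a stable key from the fields that define a duplicate."""
    normalized = f"{title.strip().lower()}|{category}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def create_ticket_idempotent(title: str, category: str) -> dict:
    """Create a ticket, or return the existing one for the same key."""
    key = _idempotency_key(title, category)
    existing = _tickets_by_key.get(key)
    if existing is not None:
        # Second call with the same input: report the existing ticket.
        return {**existing, "created": False}

    ticket_id = f"TICK-{len(_tickets_by_key) + 1:04d}"
    ticket = {
        "success": True,
        "ticket_id": ticket_id,
        "url": f"https://tickets.example.com/{ticket_id}",
    }
    _tickets_by_key[key] = ticket
    return {**ticket, "created": True}


first = create_ticket_idempotent("Login page returns 500", "bug")
second = create_ticket_idempotent("  login page returns 500 ", "bug")
print(first["ticket_id"], second["ticket_id"], second["created"])
```

The created flag tells the model whether its call had an effect, which is useful context when it explains the outcome to the user.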

Parallel Tool Calls

When you give a model multiple tools and a complex request, it will often request several tool calls in a single response rather than one per turn. Running those calls concurrently is faster than running them one after another, but your tool implementations must then be safe under concurrent execution.

import asyncio
import anthropic
import json
 
client = anthropic.Anthropic()
 
 
async def execute_tool_call(tool_name: str, tool_input: dict) -> str:
    """Execute a single tool call asynchronously."""
    if tool_name == "search_orders":
        await asyncio.sleep(0.1)  # Simulate API latency
        return json.dumps({"orders": [{"id": "ORD-001", "status": "shipped"}]})
    elif tool_name == "get_customer_profile":
        await asyncio.sleep(0.1)
        return json.dumps({"name": "Alice", "tier": "premium", "since": "2022-01"})
    return json.dumps({"error": f"Unknown tool: {tool_name}"})
 
 
async def run_agent_with_parallel_tools(user_message: str) -> str:
    """Agent loop that handles parallel tool calls."""
    messages = [{"role": "user", "content": user_message}]
    tools = [
        {
            "name": "search_orders",
            "description": "Search recent orders for the current customer. Returns order list with status.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "limit": {"type": "integer", "description": "Max orders to return", "default": 10}
                },
            },
        },
        {
            "name": "get_customer_profile",
            "description": "Get profile information for the current customer including tier and account age.",
            "input_schema": {"type": "object", "properties": {}},
        },
    ]
 
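    # The module-level client is synchronous and would block the event loop,
    # so this coroutine awaits the SDK's async client instead.
    async_client = anthropic.AsyncAnthropic()

    while True:
        response = await async_client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason != "tool_use":
            # end_turn, max_tokens, etc.: return whatever text was produced
            # rather than looping forever on an unexpected stop reason
            for block in response.content:
                if block.type == "text":
                    return block.text
            return ""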
 
        if response.stop_reason == "tool_use":
            # Collect all tool calls from this response
            tool_calls = [b for b in response.content if b.type == "tool_use"]
 
            # Execute all tool calls IN PARALLEL
            tool_results = await asyncio.gather(
                *[execute_tool_call(tc.name, tc.input) for tc in tool_calls]
            )
 
            # Add assistant's response to messages
            messages.append({"role": "assistant", "content": response.content})
 
            # Add all tool results
            messages.append({
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tc.id,
                        "content": result,
                    }
                    for tc, result in zip(tool_calls, tool_results)
                ],
            })

The key pattern: when stop_reason is "tool_use", collect all tool_use blocks from the response, execute them concurrently with asyncio.gather(), then return every result in a single user message, one tool_result block per call, matched by tool_use_id. Never send results back one at a time when the model asked for several: it is slower, it costs extra API calls, and the API expects every tool_use block to be answered in the next user message.
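One detail the loop above leaves open is what happens when one of the concurrent calls raises. A possible approach, sketched here with a hypothetical slow_tool standing in for real tools, is to pass return_exceptions=True to asyncio.gather() so that a single failure becomes one error result instead of aborting the whole batch.

```python
import asyncio
import json
import time


async def slow_tool(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical tool: sleeps to simulate latency, optionally fails."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} backend unavailable")
    return json.dumps({"tool": name, "ok": True})


async def run_all(calls: list[tuple[str, str, float, bool]]) -> list[dict]:
    """calls holds (tool_use_id, name, delay, fail); returns one result per call."""
    outcomes = await asyncio.gather(
        *[slow_tool(name, delay, fail) for _, name, delay, fail in calls],
        return_exceptions=True,  # exceptions come back as values, in call order
    )
    results = []
    for (tool_use_id, _, _, _), outcome in zip(calls, outcomes):
        failed = isinstance(outcome, Exception)
        content = json.dumps({"error": str(outcome)}) if failed else outcome
        results.append({
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": content,
            "is_error": failed,
        })
    return results


calls = [
    ("toolu_1", "search_orders", 0.1, False),
    ("toolu_2", "get_customer_profile", 0.1, True),
]
start = time.perf_counter()
results = asyncio.run(run_all(calls))
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.2f}s")
```

Both calls sleep at the same time, so the elapsed time should be close to the longest single delay rather than the sum, and the model still receives a result for every tool_use_id.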

Building a Reusable Skill

A Skill packages a system prompt, curated tools, and few-shot examples into a reusable unit you can apply to any agent. Here's a complete Python debugging skill:

from dataclasses import dataclass, field
from typing import Any
 
 
@dataclass
class PythonDebuggingSkill:
    """
    A reusable skill that makes an agent expert at debugging Python code.
    Apply this to any agent that needs to analyze and fix Python errors.
    """
 
    SYSTEM_PROMPT: str = """You are an expert Python debugger with 15 years of experience.
 
When debugging:
1. Read the full error traceback before drawing conclusions
2. Identify the root cause, not just the symptom  
3. Check for common Python gotchas: mutable defaults, late binding, off-by-one errors
4. Suggest a minimal reproduction case when the bug isn't obvious
5. Explain WHY the fix works, not just what to change
 
Always run the code mentally before claiming a fix works."""
 
    TOOLS: list[dict] = field(default_factory=lambda: [
        {
            "name": "run_python",
            "description": (
                "Execute a Python code snippet in a sandboxed environment. "
                "Returns stdout, stderr, and exit code. "
                "Use to verify fixes or test hypotheses about the bug."
            ),
            "input_schema": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to execute",
                    },
                    "timeout_seconds": {
                        "type": "integer",
                        "description": "Max execution time. Default 10.",
                        "default": 10,
                        "maximum": 60,
                    },
                },
                "required": ["code"],
            },
        },
        {
            "name": "search_python_docs",
            "description": (
                "Search Python official documentation for a function, class, or concept. "
                "Returns the relevant documentation section. "
                "Use when you need to verify exact behavior of a built-in or standard library."
            ),
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "What to look up, e.g. 'list.sort key parameter'",
                    }
                },
                "required": ["query"],
            },
        },
    ])
 
    EXAMPLES: list[dict] = field(default_factory=lambda: [
        {
            "user": "Why does my list modification inside a function not persist outside it?",
            "assistant": (
                "This is Python's scoping: assigning `my_list = [...]` creates a new local variable. "
                "But `my_list.append(...)` mutates the original object. "
                "To modify the list in place, use `.append()`, `.extend()`, or slice assignment `my_list[:] = new_values`. "
                "To replace it entirely, return the new list and reassign at the call site."
            ),
        },
    ])
 
    def apply_to_agent(self, agent_config: dict) -> dict:
        """Apply this skill to an agent configuration dict."""
        agent_config["system"] = self.SYSTEM_PROMPT
        agent_config["tools"] = self.TOOLS
 
        # Prepend the examples as few-shot messages ahead of the real conversation
        few_shot = []
        for example in self.EXAMPLES:
            few_shot.append({"role": "user", "content": example["user"]})
            few_shot.append({"role": "assistant", "content": example["assistant"]})

        agent_config["messages"] = few_shot + agent_config.get("messages", [])
        return agent_config
 
 
# Usage
import anthropic
 
client = anthropic.Anthropic()
skill = PythonDebuggingSkill()
 
agent_config = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 2048,
    "messages": [
        {"role": "user", "content": "My code raises: 'TypeError: 'int' object is not iterable'. Here's the code: for x in len(my_list): ..."}
    ],
}
 
configured = skill.apply_to_agent(agent_config)
response = client.messages.create(**configured)
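# With tools enabled the reply may contain tool_use blocks, so print only the text blocks
print("".join(block.text for block in response.content if block.type == "text"))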

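Skills can also be composed. A customer support agent might apply three skills: one for ticket handling, one for product knowledge, and one for empathetic communication. For that to work, each skill has to contribute its system prompt section and tool set to a merged configuration; apply_to_agent as written above replaces system and tools outright, so applying a second skill would overwrite the first.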
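A minimal sketch of how several skills could each contribute a prompt section and a tool set to one configuration. SimpleSkill and compose_skills are hypothetical names introduced for illustration, and the choice to keep the first definition when two skills share a tool name is an assumption rather than an established rule.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleSkill:
    """Hypothetical stand-in for a skill: a prompt section plus tools."""
    name: str
    system_prompt: str
    tools: list[dict] = field(default_factory=list)


def compose_skills(skills: list[SimpleSkill]) -> dict:
    """Merge skills: prompts are concatenated, tools deduplicated by name."""
    sections = [f"## {s.name}\n{s.system_prompt}" for s in skills]
    tools_by_name: dict[str, dict] = {}
    for skill in skills:
        for tool in skill.tools:
            # First definition wins; duplicate names would make tool selection ambiguous.
            tools_by_name.setdefault(tool["name"], tool)
    return {
        "system": "\n\n".join(sections),
        "tools": list(tools_by_name.values()),
    }


support = compose_skills([
    SimpleSkill("Ticket handling", "Create a ticket for every reported issue.",
                [{"name": "create_ticket"}]),
    SimpleSkill("Product knowledge", "Answer from the product docs.",
                [{"name": "search_docs"}]),
    SimpleSkill("Empathetic communication", "Acknowledge the user's frustration first.",
                [{"name": "search_docs"}]),
])
print([t["name"] for t in support["tools"]])
```

Labeling each prompt section with the skill name keeps the merged system prompt readable and makes it easier to trace behavior back to the skill that caused it.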

Tool Design Anti-Patterns

These mistakes show up constantly in production AI systems. Avoid them from the start.

1. Vague descriptions

{ "description": "Does stuff with users." }

The model cannot make reliable decisions with this. Describe specifically what it does, when to use it, and what it returns.

2. Too many responsibilities

A manage_database tool that can create tables, insert rows, run queries, and delete databases is impossible to describe accurately. Split it.

3. Silent failures

def get_user(user_id: str) -> dict:
    user = db.find(user_id)
    return user or {}  # Returns empty dict on failure

The model receives {} and doesn't know if the user doesn't exist or the database is down. Return {"found": false, "error": "User not found"} instead.
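A sketch of the explicit alternative, which distinguishes "user not found" from "database down". FakeDB and DatabaseUnavailable are hypothetical stand-ins for the db object in the snippet above, included so the example runs on its own; the retryable field is an added assumption that hints to the model whether trying again makes sense.

```python
from typing import Optional


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the data layer when the database is down."""


class FakeDB:
    """In-memory stand-in for `db` so the sketch is self-contained."""

    def __init__(self, users: dict, available: bool = True):
        self.users = users
        self.available = available

    def find(self, user_id: str) -> Optional[dict]:
        if not self.available:
            raise DatabaseUnavailable("connection refused")
        return self.users.get(user_id)


def get_user(db: FakeDB, user_id: str) -> dict:
    """Return the user, or a result that says exactly why there is none."""
    try:
        user = db.find(user_id)
    except DatabaseUnavailable as exc:
        return {"found": False, "error": f"Database unavailable: {exc}", "retryable": True}
    if user is None:
        return {"found": False, "error": f"No user with id '{user_id}'", "retryable": False}
    return {"found": True, "user": user}


db = FakeDB({"u1": {"name": "Alice"}})
print(get_user(db, "u1"))
print(get_user(db, "u2"))
print(get_user(FakeDB({}, available=False), "u1"))
```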

4. No input validation

Trust nothing from the model. Validate types, ranges, enum values, and string lengths before using any argument.

5. Non-idempotent operations without guards

If calling create_order twice with the same cart creates two orders, you'll have bugs when the model retries on a timeout. Design idempotent tools or require explicit confirmation on retry.

6. No dry_run for destructive tools

def delete_all_records(table: str, dry_run: bool = True) -> dict:
    if dry_run:
        count = db.count(table)
        return {"would_delete": count, "dry_run": True, "action": "none"}
    
    # Only actually delete when dry_run=False
    db.delete_all(table)
    return {"deleted": True}

Any tool that deletes, sends, charges, or otherwise does something irreversible needs a dry_run mode. Default it to True. The agent discovers what it would do, you review, then you call with dry_run=False.
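The "you review" step is only reliable if the model cannot set dry_run=False on its own. One possible way to enforce that in the agent loop is sketched below; the DESTRUCTIVE_TOOLS registry and the human_approved flag are assumptions about how your application tracks tools and approvals.

```python
# Hypothetical registry of tools that must never run for real without approval.
DESTRUCTIVE_TOOLS = {"delete_all_records"}


def enforce_dry_run(tool_name: str, tool_input: dict, human_approved: bool) -> dict:
    """Return the arguments that should actually be passed to the tool."""
    args = dict(tool_input)  # never mutate the model's original arguments
    if tool_name in DESTRUCTIVE_TOOLS and not human_approved:
        # Override whatever the model asked for until a person has approved it.
        args["dry_run"] = True
    return args


requested = {"table": "users", "dry_run": False}
print(enforce_dry_run("delete_all_records", requested, human_approved=False))
print(enforce_dry_run("delete_all_records", requested, human_approved=True))
```

With this gate in place, the first call from the model always produces a preview, and the real deletion happens only after the application records an approval.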

💻 Hands-on Exercise

Exercise — Developer Assistant Agent

Design and implement all four tools for a Developer Assistant agent. For each tool, write the full JSON Schema definition and a Python stub with proper error handling.

Tool 1: run_tests. Runs a test suite and returns results. Think about: how do you represent pass/fail per test? What if the test command itself fails to run?

Tool 2: search_docs. Searches internal documentation for a query. Returns top 3 matching documents with title, excerpt, and URL. What happens when there are no results?

Tool 3: create_github_issue. Creates a GitHub issue with title, body, labels, and assignees. Add a dry_run parameter. What does the dry run return?

Tool 4: explain_error. Takes an error message and stack trace, returns a structured explanation with: probable cause, common fixes, relevant documentation links. What's the schema?

Once you have the schemas, wire them into a real Anthropic API call using the tool use pattern from Module 6. Test by asking the agent to "run the tests and if they fail, create a GitHub issue with the error details."

🧪

Knowledge Check

Q1. How does a model decide which tool to call?

Q2. What's the difference between a Tool and a Skill?

Q3. You build a send_email tool. How do you prevent the agent from bulk-emailing 5,000 users?
