
Week 3: Agents

Week 3 is all about agents: workflows, tools, multi-step agents, and the protocols and frameworks involved.

Defining Agents

The accompanying project code is helpful to see many of these concepts in action. I also paired with Claude to create a sample MCP client and server to help me understand how the plumbing works.

LLMs are static

LLMs are static, grounded in a fixed body of knowledge. They don’t have the autonomy or agency to plan or perform actions. For a next-token predictor, answering something like ‘Write an email to my boss to take one day off’ might return a good result, but ‘Write a full report on the housing market and share existing opportunities’ is going to miss expectations.

We can use RAG to augment content, and fine-tuning to make models more domain-specific. But there are limits to the complexity of task that even a fine-tuned, RAG-augmented model can handle on its own.

Our goal is to make LLMs more capable.

  • “How’s the weather in San Francisco today?” > LLM > + Weather API
  • “What is 1234532 + 56528” > LLM > + Calculator
  • “Where is my order?” > LLM > + Database access
  • “Who scored in the Barcelona game today?” > LLM > + Web search
  • “What is your refund policy?” > LLM > + RAG

Definition

An agent is¹:

A software system that uses LLM(s) to pursue goals and complete tasks on behalf of users. They plan, reason, call tools, and rely on memory to complete complex tasks.

D2 diagram from agents.d2
An agent is a system of software components

Agents have autonomy; LLMs don’t. Agentic systems have different levels of agency.

  1. Simple processor
  2. Workflows
  3. Tool caller
  4. Multi-step agents
  5. Multi-agent systems

Types of Agent

Let’s talk a little more about each of those 5 agent levels.

Simple Processor Agents

This is seen less as an agent than as a simple piece of software. Even though it may be ‘simple’ in the agentic sense, there are typically many back-and-forth calls between the software and the LLM, and there’s more happening in those loops: tokenization, response templating, and prompt engineering in general.

D2 diagram from simple-processor.d2

Workflow Agents

This is where the predefined code paths are designed offline using some kind of orchestrator. At runtime, the LLM is not asked to plan. The benefit is more determinism.

D2 diagram from workflow.d2

Common workflow patterns:

Prompt chaining

A single prompt can overwhelm the token limits and the complexity that the LLM can handle. It can also lead to hallucination. Something like “Analyze the housing market, summarize the results, etc. etc. …” needs to be decomposed into smaller tasks. Something like this for a user query:

  1. Prompt 1: analyze the housing market specified in {user query}

  2. Prompt 2: summarize findings in {output1}

    The ONLY thing it does is summarization. Much simpler!

  3. Prompt 3: Identify trends in {output2}

    Again, a focused task.

  4. Prompt 4: Share opportunities given {output1} and {output3}

Prompt chaining fits when a task can be easily decomposed into fixed subtasks. It’s a tradeoff between accuracy and latency. Good examples might be (a minimal code sketch of the chain above follows this list):

  • Content generation (write a document outline -> check the outline -> write the document).
  • Data extraction, like converting unstructured text into a structured format.
  • Information processing with transform1, transform2, transform3, etc.
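
Here’s a minimal sketch of the housing chain above. The call_llm helper is a stand-in for whatever model client you use, and the exact prompts are illustrative, not from any particular framework:

# Prompt chaining: each call does ONE focused job and feeds the next.
def call_llm(prompt: str) -> str:
    # Stand-in: wire this to your model provider (OpenAI, Anthropic, local, ...).
    raise NotImplementedError

def housing_report(user_query: str) -> str:
    analysis = call_llm(f"Analyze the housing market specified in: {user_query}")  # output1
    summary = call_llm(f"Summarize these findings:\n{analysis}")                   # output2
    trends = call_llm(f"Identify trends in this summary:\n{summary}")              # output3
    return call_llm(
        "Share opportunities given this analysis and these trends:\n"
        f"Analysis:\n{analysis}\n\nTrends:\n{trends}"
    )

Each step can be inspected and evaluated independently, which is exactly what makes the chain easier to debug than one giant prompt.
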
Routing

There’s some example code in my agent-learn/2-routing-pattern repo.

This goes beyond deterministic steps by introducing conditional logic. There are really two key steps:

  1. Determine the intent of the user prompt.
  2. Route the prompt to the appropriate LLM.

Routers could be one of the following types:

  • Rule based (if/else)
  • ML-based: Use a traditional ML model that can determine the path: classifier model (A or B)
  • Embedding similarity: query -> embedding -> nearest specialized embedding. This is useful for semantic routing.
  • LLM Routers: Just prompt the LLM to classify the intent in a deliberately structured way.

Routers are commonly used for efficiency. For example, you could send common or simple questions to smaller and cheaper models, whereas you could send complex questions to larger and more expensive models.

D2 diagram from router.d2

Here’s the key: the decision logic for routing does not have to be an LLM! It could even be human in the loop.
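
Taking the LLM-router option as a concrete example, here’s a hedged sketch (reusing the call_llm stand-in from the prompt-chaining sketch; the route labels and prompts are made up for illustration):

# LLM-based router: classify the intent first, then dispatch to a handler.
ROUTES = {
    "billing": lambda q: call_llm(f"You are a billing specialist. {q}"),
    "technical": lambda q: call_llm(f"You are a support engineer. {q}"),
    "general": lambda q: call_llm(q),  # a smaller, cheaper model could back this one
}

def route(query: str) -> str:
    intent = call_llm(
        "Classify the intent of this question as exactly one of "
        f"{list(ROUTES)}. Reply with the label only.\n\n{query}"
    ).strip().lower()
    handler = ROUTES.get(intent, ROUTES["general"])  # fall back if the label is unexpected
    return handler(query)

Swapping the classifier for an if/else rule, an embedding lookup, or a human reviewer keeps the rest of the code identical.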

Reflection

There’s some example code and a detailed README in my agent-learn/4-reflection repo.

This is also called Evaluator-Optimizer. The Evaluator is also called a Critic. A Router couldn’t handle this on its own: we’re entering a loop that runs until the Critic is satisfied.

In practice, it’s often more effective to use a separate, specialized LLM as the Critic so that you avoid feedback bias.

Reflection is good when there are clear evaluation criteria and where iterative refinement helps. Code generation is a great example: the Critic could write tests to assess the output and then ask the Generator to iterate on the code.
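
Here’s a minimal sketch of that loop, reusing the hypothetical call_llm stand-in; the APPROVED token and the round limit are arbitrary choices, not anything prescribed by the pattern:

# Reflection (Evaluator-Optimizer): loop until the Critic approves or we give up.
def reflect(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Complete this task:\n{task}")
    for _ in range(max_rounds):
        critique = call_llm(
            "You are a strict critic. If the answer fully satisfies the task, "
            f"reply APPROVED. Otherwise list concrete fixes.\nTask: {task}\nAnswer:\n{draft}"
        )
        if "APPROVED" in critique:
            break  # the Critic is satisfied
        draft = call_llm(
            f"Revise the answer to address this critique.\nTask: {task}\n"
            f"Answer:\n{draft}\nCritique:\n{critique}"
        )
    return draft

Ideally the Critic prompt (or model) is different enough from the Generator to avoid the feedback bias mentioned above.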

Parallelization

Independent subtasks are fanned out concurrently and their results aggregated afterwards, for example by sectioning a task into parallel pieces or voting across several attempts. A minimal sketch follows the diagram.

D2 diagram from parallelization.d2
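
A hedged sketch of the fan-out/fan-in shape, reusing the call_llm stand-in; the section prompts and the aggregation step are illustrative:

# Parallelization: fan out independent subtasks, then aggregate the results.
from concurrent.futures import ThreadPoolExecutor

def parallel_sections(task: str, sections: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda s: call_llm(f"For the task '{task}', cover only this section: {s}"),
            sections,
        ))
    # Fan-in: one final call combines the independent results.
    return call_llm("Combine these sections into one answer:\n" + "\n---\n".join(results))
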
Tool Caller

I’ve written a bunch of examples of the tool-caller pattern.

D2 diagram from tool-caller.d2

Tool calling workflow:

  1. Define a Tool
  2. Let LLM know about the tool
  3. When the LLM wants to use it, call the tool and return the response.

One way to let the LLM know about a tool is to define it in the system prompt. The key is to have the LLM spit out a structured format that the Agent can recognize:

  • System Prompt:

    You can use a tool called add to add two numbers. It takes two inputs: number 1 and number 2. It returns their sum. Use it like this:
    
    <tool>
      add(number1, number2)
    </tool>
    
    Just replace number1 and number2 with the numbers you want to add.
    
  • User prompt:

    What is 184322 + 54821?
    
  • Response:

    <tool>
    add(184322, 54821)
    </tool>
    

At this point the agent software can recognize the format of the response and call an appropriate existing tool. In this case it might be a Python function with an add(n1, n2) signature, or you could even have the LLM write the function. LangChain makes tool integration pretty easy.
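
Roughly, the agent-side plumbing looks like this. It’s a sketch that assumes the exact <tool>add(n1, n2)</tool> format from the system prompt above; frameworks like LangChain handle this parsing and dispatch for you:

import re

def add(n1: float, n2: float) -> float:
    return n1 + n2

TOOLS = {"add": add}  # registry of tools the agent knows how to run

def handle_llm_response(text: str) -> str:
    # Look for <tool>name(arg1, arg2)</tool> in the model's output.
    match = re.search(r"<tool>\s*(\w+)\((.*?)\)\s*</tool>", text, re.DOTALL)
    if not match:
        return text  # no tool call, just return the model's answer
    name, raw_args = match.groups()
    args = [float(a.strip()) for a in raw_args.split(",")]
    result = TOOLS[name](*args)
    return str(result)  # typically fed back to the LLM as an observation

print(handle_llm_response("<tool>\nadd(184322, 54821)\n</tool>"))  # 239143.0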

But this isn’t scalable: jamming stuff into a system prompt is non-standard, impossible to maintain, and leads to brittle code.

Model Context Protocol (MCP) is the new standard for tool integration. It’s a protocol: the service provider writes the tool functions once and exposes them via an MCP-compliant server for broad consumption.

D2 diagram from mcp.d2

You introduce the MCP Servers to your LLM through a standard mechanism and protocol.

D2 diagram from mcp-for-llm.d2

This is more maintainable: adding a new tool is as simple as adding a new server entry to the MCP server.json. See the GitHub MCP Registry for a searchable list of available MCP servers.
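
For the server side, here’s a rough sketch of the same add tool exposed over MCP, assuming the official MCP Python SDK and its FastMCP helper (pip install mcp); the server name is arbitrary:

# A minimal MCP server exposing one tool via the MCP Python SDK's FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calculator")

@mcp.tool()
def add(number1: float, number2: float) -> float:
    """Add two numbers and return their sum."""
    return number1 + number2

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; clients connect via their MCP server config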

If you revisit Workflows and plug in the tools concept, everything becomes more maintainable and scalable and flexible. Imagine that the Tools in this workflow are augmenting the system and user prompt as data flows through the system.

D2 diagram from workflow-with-tools.d2

This fixes the issue of access to current information.

Multi-step agents

Multi-step agents solve the problem of fixed workflows by giving decision-making authority to the agent itself.

D2 diagram from multi-step.d2

Making good decisions involves access to both tools and memory of what happened before. We’ve covered the tools. Memory is for storing and observing past interactions as additional context.

Logically this is a continuous Planning cycle consisting of think, act, observe:

graph LR
    T[Think] --> A[Act] --> M[Observe]
    M --> T

Pseudo-prompt:

Thought: I need to check the current weather for New York.
Action:
{
    "action": "get_weather",
    "action_input": {
        "location": "New York"
    }
}
Observe: I got the answer "<result>34F</result>"
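
Stitched together, the loop is roughly the following sketch. It assumes the JSON action format above plus the call_llm and TOOLS stand-ins from earlier; real implementations add stop conditions, retries, and much better parsing:

import json
import re

def run_agent(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        # Think: the model reasons over the question plus everything observed so far.
        reply = call_llm(history + "\nThink step by step, then give either "
                         "'Final Answer: ...' or 'Action:' followed by a JSON tool call.")
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        action_match = re.search(r"Action:\s*(\{.*\})", reply, re.DOTALL)
        if not action_match:
            history += reply + "\n"  # no action found; keep the thought and try again
            continue
        # Act: run the requested tool with the supplied arguments.
        action = json.loads(action_match.group(1))
        observation = TOOLS[action["action"]](**action["action_input"])
        # Observe: append the result so the next Think step can see it.
        history += f"{reply}\nObserve: {observation}\n"
    return "Stopped after max_steps without a final answer."

The history string doubles as short-term memory: every Think step sees all previous actions and observations.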

Anthropic’s Building Effective Agents Cookbook is a key resource to understand these patterns (example). For example, their prompts for research agents demonstrate how to instruct the LLM to act in specific ways, including being critical of tool results ("…do not take tool results at face value…") and watching out for speculative information pointing to the future (looking for ‘could’, ‘maybe’, ‘future’ in search results).

There’s a lot of prompt engineering going on here (look for <tags> in prompts).

There are different approaches to plan > act > adapt. ReAct (2023) is probably the most popular approach as of October 2025. It’s just a prompting technique.

Other papers describing how to implement this think > plan > act loop:

  • Reflexion - Reflexion: Language Agents with Verbal Reinforcement Learning
  • ReWOO - ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
  • Tree Search for Language Model Agents - Research on using tree search algorithms for improved agent planning and decision-making

In summary, workflows vs multi-step agents:

  • Workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale.
  • Agents can adapt to new situations, like answering a customer’s unique question. Workflows are rigid, better for repeating the same process, like scheduling maintenance.

When evaluating whether to use multi-step agents:

  • The autonomous nature of agents means higher costs, and the potential for compounding errors.
  • Evaluation and debugging are challenging due to their dynamic nature: agents can be unreliable, illogical, or prone to infinite loops.
  • When a problem’s solution is already well-understood and repeatable, constraining the agent to a predetermined, fixed workflow is more effective.
  • Agents can be used for open-ended problems where it’s difficult to predict the required number of steps, and where you can’t hardcode a fixed path.

Multi-agent system

A single agent may not be capable enough, or the problem may be too complex for one agent. An agent may fail or move in an incorrect direction, so multi-agent systems can self-correct.

This is not easy! Common challenges include:

  • Coordination. Sharing planning, results.
  • Memory management. When to share, when to isolate.
  • Compounding errors. The inherent loop means errors can be amplified.

The Anthropic multi-agent research system is a good read that shows the level of effort. OpenAI has shared a practical guide that covers design patterns.

The coordination challenge between agents led Google to define and release the Agent2Agent (A2A) protocol. Agents can communicate regardless of who built them. A2A is to inter-agent communication what MCP is to tool sharing.

How do you evaluate agents?

Things to think about:

  • Accuracy: How well does the agent achieve its goal?
  • Efficiency: How quickly does the agent achieve its goal?
  • Robustness: How well does the agent handle unexpected inputs or changes in the environment?
  • Safety: How well does the agent avoid harmful actions or outcomes?
  • Fairness: How well does the agent treat different groups or individuals?

Some common metrics include:

  • METRIC: Token consumption, e.g., avg token usage per request.
  • METRIC: Tool execution success rate.
  • METRIC: Observability. How easily can you find and trace errors?
  • METRIC: Task success rate.
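
A couple of these can be computed straight from run logs. The log shape here is purely hypothetical, just to make the metrics concrete:

# Hypothetical run logs: one record per agent run.
runs = [
    {"task_ok": True, "tokens": 812, "tool_calls": 3, "tool_errors": 0},
    {"task_ok": False, "tokens": 2404, "tool_calls": 5, "tool_errors": 2},
]

task_success_rate = sum(r["task_ok"] for r in runs) / len(runs)
avg_tokens = sum(r["tokens"] for r in runs) / len(runs)
tool_success_rate = 1 - sum(r["tool_errors"] for r in runs) / sum(r["tool_calls"] for r in runs)

print(f"task success: {task_success_rate:.0%}, avg tokens/run: {avg_tokens:.0f}, "
      f"tool success: {tool_success_rate:.0%}")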

The AI Agent Stack

This came from a Jan 2025 post and nicely summarizes a lot of the tooling options so that you don’t think you have to build everything from scratch!

diagram showing elements of the AI agent stack grouped by logical area of the architecture

source

Resources

Agentic Tool Stack Jan 2025

source

  • Model Serving & Inference
  • Storage
  • Agent Development
  • Tool Execution
  • Observability
  • Memory Management
