
Week 3: Agents

Week 3 is all about agents: workflows, tools, multi-step agents, and the protocols and frameworks involved.

Defining Agents

The accompanying project code is helpful to see many of these concepts in action. I also paired with Claude to create a sample MCP client and server to help me understand how the plumbing works.

LLMs are static

LLMs are static, grounded in a fixed body of knowledge. They don’t have the autonomy or agency to plan or perform actions. For a next-token predictor, answering something like ‘Write an email to my boss to take one day off’ might return a good result, but ‘Write a full report on the housing market and share existing opportunities’ is going to miss expectations.

We can use RAG to augment content, and fine-tuning to make models more domain-specific. But there are limits to the complexity of task that even a fine-tuned, RAG-augmented model can handle on its own.

Our goal is to make LLMs more capable.

  • “How’s the weather in San Francisco today?” > LLM > + Weather API
  • “What is 1234532 + 56528” > LLM > + Calculator
  • “Where is my order?” > LLM > + Database access
  • “Who scored in the Barcelona game today?” > LLM > + Web search
  • “What is your refund policy?” > LLM > + RAG

Definition

An agent is¹:

A software system that uses LLM(s) to pursue goals and complete tasks on behalf of users. They plan, reason, call tools, and rely on memory to complete complex tasks.

D2 diagram from agents.d2
An agent is a system of software components

Agents have autonomy; LLMs don’t. Agentic systems have different levels of agency.

  1. Simple processor
  2. Workflows
  3. Tool caller
  4. Multi-step agents
  5. Multi-agent systems

Types of Agent

Let’s talk a little more about each of those 5 agent levels.

Simple Processor Agents

This is seen less as an agent than as a simple piece of software. Even though it may be ‘simple’ in the agentic sense, there are typically many back-and-forth calls between the software and the LLM, and there’s more happening in those loops: tokenization, response templating, and prompt engineering in general.

D2 diagram from simple-processor.d2

Workflow Agents

This is where the predefined code paths are designed offline using some kind of orchestrator. At runtime, the LLM is not asked to plan. The benefit is more determinism.

D2 diagram from workflow.d2

Common workflow patterns:

Prompt chaining

A single prompt can overwhelm the token limits and the complexity that the LLM can handle. It can also lead to hallucination. Something like “Analyze the housing market, summarize the results, etc. etc. …” needs to be decomposed into smaller tasks. Something like this for a user query:

  1. Prompt 1: analyze the housing market specified in {user query}

  2. Prompt 2: summarize findings in {output1}

    The ONLY thing it does is summarization. Much simpler!

  3. Prompt 3: Identify trends in {output2}

    Again, a focused task.

  4. Prompt 4: Share opportunities given {output1} and {output3}

Prompt chaining fits when a task can be easily decomposed into fixed subtasks. It’s a tradeoff between accuracy and latency. Good examples might be (a minimal code sketch of the chain above follows this list):

  • Content generation (write a document outline -> check the outline -> write the document).
  • Data extraction, like converting unstructured text into a structured format.
  • Information processing with transform1, transform2, transform3, etc.
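
Here’s a minimal sketch of the housing chain above. The call_llm helper is a stand-in for whatever model client you use, and the exact prompts are illustrative, not from any particular framework:

# Prompt chaining: each call does ONE focused job and feeds the next.
def call_llm(prompt: str) -> str:
    # Stand-in: wire this to your model provider (OpenAI, Anthropic, local, ...).
    raise NotImplementedError

def housing_report(user_query: str) -> str:
    analysis = call_llm(f"Analyze the housing market specified in: {user_query}")  # output1
    summary = call_llm(f"Summarize these findings:\n{analysis}")                   # output2
    trends = call_llm(f"Identify trends in this summary:\n{summary}")              # output3
    return call_llm(
        "Share opportunities given this analysis and these trends:\n"
        f"Analysis:\n{analysis}\n\nTrends:\n{trends}"
    )

Each step can be inspected and evaluated independently, which is exactly what makes the chain easier to debug than one giant prompt.
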
Routing

There’s some example code in my agent-learn/2-routing-pattern repo.

This goes beyond deterministic steps by introducing conditional logic. There are really two key steps:

  1. Determine the intent of the user prompt.
  2. Route the prompt to the appropriate LLM.

Routers could be one of the following types:

  • Rule based (if/else)
  • ML-based: Use a traditional ML model that can determine the path: classifier model (A or B)
  • Embedding similarity: query -> embedding -> nearest specialized embedding. This is useful for semantic routing.
  • LLM Routers: Just prompt the LLM to classify the intent in a deliberately structured way.

Routers are commonly used for efficiency. For example, you could send common or simple questions to smaller and cheaper models, whereas you could send complex questions to larger and more expensive models.

D2 diagram from router.d2

Here’s the key: the decision logic for routing does not have to be an LLM! It could even be human in the loop.
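
Taking the LLM-router option as a concrete example, here’s a hedged sketch (reusing the call_llm stand-in from the prompt-chaining sketch; the route labels and prompts are made up for illustration):

# LLM-based router: classify the intent first, then dispatch to a handler.
ROUTES = {
    "billing": lambda q: call_llm(f"You are a billing specialist. {q}"),
    "technical": lambda q: call_llm(f"You are a support engineer. {q}"),
    "general": lambda q: call_llm(q),  # a smaller, cheaper model could back this one
}

def route(query: str) -> str:
    intent = call_llm(
        "Classify the intent of this question as exactly one of "
        f"{list(ROUTES)}. Reply with the label only.\n\n{query}"
    ).strip().lower()
    handler = ROUTES.get(intent, ROUTES["general"])  # fall back if the label is unexpected
    return handler(query)

Swapping the classifier for an if/else rule, an embedding lookup, or a human reviewer keeps the rest of the code identical.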

Reflection

There’s some example code and a detailed README in my agent-learn/4-reflection repo.

This is also called Evaluator-Optimizer. The Evaluator is also called a Critic. A Router couldn’t handle this on its own: we’re entering a loop that runs until the Critic is satisfied.

In practice, it’s often more effective to use a separate, specialized LLM as the Critic so that you avoid feedback bias.

Reflection is good when there are clear evaluation criteria and where iterative refinement helps. Code generation is a great example: the Critic could write tests to assess the output and then ask the Generator to iterate on the code.
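
Here’s a minimal sketch of that loop, reusing the hypothetical call_llm stand-in; the APPROVED token and the round limit are arbitrary choices, not anything prescribed by the pattern:

# Reflection (Evaluator-Optimizer): loop until the Critic approves or we give up.
def reflect(task: str, max_rounds: int = 3) -> str:
    draft = call_llm(f"Complete this task:\n{task}")
    for _ in range(max_rounds):
        critique = call_llm(
            "You are a strict critic. If the answer fully satisfies the task, "
            f"reply APPROVED. Otherwise list concrete fixes.\nTask: {task}\nAnswer:\n{draft}"
        )
        if "APPROVED" in critique:
            break  # the Critic is satisfied
        draft = call_llm(
            f"Revise the answer to address this critique.\nTask: {task}\n"
            f"Answer:\n{draft}\nCritique:\n{critique}"
        )
    return draft

Ideally the Critic prompt (or model) is different enough from the Generator to avoid the feedback bias mentioned above.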

Parallelization

Independent subtasks are fanned out concurrently and their results aggregated afterwards, for example by sectioning a task into parallel pieces or voting across several attempts. A minimal sketch follows the diagram.

D2 diagram from parallelization.d2
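
A hedged sketch of the fan-out/fan-in shape, reusing the call_llm stand-in; the section prompts and the aggregation step are illustrative:

# Parallelization: fan out independent subtasks, then aggregate the results.
from concurrent.futures import ThreadPoolExecutor

def parallel_sections(task: str, sections: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda s: call_llm(f"For the task '{task}', cover only this section: {s}"),
            sections,
        ))
    # Fan-in: one final call combines the independent results.
    return call_llm("Combine these sections into one answer:\n" + "\n---\n".join(results))
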
Tool Caller

I’ve written a bunch of examples of the tool-caller pattern.

D2 diagram from tool-caller.d2

Tool calling workflow:

  1. Define a Tool
  2. Let LLM know about the tool
  3. When the LLM wants to use it, call the tool and return the response.

One way to let the LLM know about a tool is to define it in the system prompt. The key is to have the LLM spit out a structured format that the Agent can recognize:

  • System Prompt:

    You can use a tool called add to add two numbers. It takes two inputs: number 1 and number 2. It returns their sum. Use it like this:
    
    <tool>
      add(number1, number2)
    </tool>
    
    Just replace number1 and number2 with the numbers you want to add.
    
  • User prompt:

    What is 184322 + 54821?
    
  • Response:

    <tool>
    add(184322, 54821)
    </tool>
    

At this point the agent software can recognize the format of the response and call an appropriate existing tool. In this case it might be a Python function with an add(n1, n2) signature, or you could even have the LLM write the function. LangChain makes tool integration pretty easy.
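
Roughly, the agent-side plumbing looks like this. It’s a sketch that assumes the exact <tool>add(n1, n2)</tool> format from the system prompt above; frameworks like LangChain handle this parsing and dispatch for you:

import re

def add(n1: float, n2: float) -> float:
    return n1 + n2

TOOLS = {"add": add}  # registry of tools the agent knows how to run

def handle_llm_response(text: str) -> str:
    # Look for <tool>name(arg1, arg2)</tool> in the model's output.
    match = re.search(r"<tool>\s*(\w+)\((.*?)\)\s*</tool>", text, re.DOTALL)
    if not match:
        return text  # no tool call, just return the model's answer
    name, raw_args = match.groups()
    args = [float(a.strip()) for a in raw_args.split(",")]
    result = TOOLS[name](*args)
    return str(result)  # typically fed back to the LLM as an observation

print(handle_llm_response("<tool>\nadd(184322, 54821)\n</tool>"))  # 239143.0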

But this isn’t scalable: jamming stuff into a system prompt is non-standard, impossible to maintain, and leads to brittle code.

Model Context Protocol (MCP) is the new standard for tool integration. It’s a protocol: the service provider writes the tool functions once and exposes them via an MCP-compliant server for broad consumption.

D2 diagram from mcp.d2

You introduce the MCP Servers to your LLM through a standard mechanism and protocol.

D2 diagram from mcp-for-llm.d2

This is more maintainable: adding a new tool is as simple as adding a new server entry to the MCP server.json. See the GitHub MCP Registry for a searchable list of available MCP servers.
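
For the server side, here’s a rough sketch of the same add tool exposed over MCP, assuming the official MCP Python SDK and its FastMCP helper (pip install mcp); the server name is arbitrary:

# A minimal MCP server exposing one tool via the MCP Python SDK's FastMCP helper.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calculator")

@mcp.tool()
def add(number1: float, number2: float) -> float:
    """Add two numbers and return their sum."""
    return number1 + number2

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; clients connect via their MCP server config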

If you revisit Workflows and plug in the tools concept, everything becomes more maintainable and scalable and flexible. Imagine that the Tools in this workflow are augmenting the system and user prompt as data flows through the system.

D2 diagram from workflow-with-tools.d2

This fixes the issue of access to current information.

Multi-step agents

Multi-step agents solve the problem of fixed workflows by giving decision-making authority to the agent itself.

D2 diagram from multi-step.d2

Making good decisions involves access to both tools and memory of what happened before. We’ve covered the tools. Memory is for storing and observing past interactions as additional context.

Logically this is a continuous Planning cycle consisting of think, act, observe:

graph LR
    T[Think] --> A[Act] --> M[Observe]
    M --> T

Pseudo-prompt:

Thought: I need to check the current weather for New York.
Action:
{
    "action": "get_weather",
    "action_input": {
        "location": "New York"
    }
}
Observe: I got the answer "<result>34F</result>"
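
Stitched together, the loop is roughly the following sketch. It assumes the JSON action format above plus the call_llm and TOOLS stand-ins from earlier; real implementations add stop conditions, retries, and much better parsing:

import json
import re

def run_agent(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        # Think: the model reasons over the question plus everything observed so far.
        reply = call_llm(history + "\nThink step by step, then give either "
                         "'Final Answer: ...' or 'Action:' followed by a JSON tool call.")
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        action_match = re.search(r"Action:\s*(\{.*\})", reply, re.DOTALL)
        if not action_match:
            history += reply + "\n"  # no action found; keep the thought and try again
            continue
        # Act: run the requested tool with the supplied arguments.
        action = json.loads(action_match.group(1))
        observation = TOOLS[action["action"]](**action["action_input"])
        # Observe: append the result so the next Think step can see it.
        history += f"{reply}\nObserve: {observation}\n"
    return "Stopped after max_steps without a final answer."

The history string doubles as short-term memory: every Think step sees all previous actions and observations.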

Anthropic’s Building Effective Agents Cookbook is a key resource to understand these patterns (example). For example, their prompts for research agents demonstrate how to instruct the LLM to act in specific ways, including being critical of tool results ("…do not take tool results at face value…") and watching out for speculative information pointing to the future (looking for ‘could’, ‘maybe’, ‘future’ in search results).

There’s a lot of prompt engineering going on here (look for <tags> in prompts).

There are different approaches to plan > act > adapt. ReAct (2023) is probably the most popular approach as of October 2025. It’s just a prompting technique.

Other papers describing how to implement this think > plan > act loop:

  • Reflexion - Reflexion: Language Agents with Verbal Reinforcement Learning
  • ReWOO - ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
  • Tree Search for Language Model Agents - Research on using tree search algorithms for improved agent planning and decision-making

In summary, workflows vs multi-step agents:

  • Workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale.
  • Agents can adapt to new situations, like answering a customer’s unique question. Workflows are rigid, better for repeating the same process, like scheduling maintenance.

When evaluating whether to use multi-step agents:

  • The autonomous nature of agents means higher costs, and the potential for compounding errors.
  • Evaluation and debugging are challenging due to their dynamic nature: agents can be unreliable, illogical, or prone to infinite loops.
  • When a problem’s solution is already well-understood and repeatable, constraining the agent to a predetermined, fixed workflow is more effective.
  • Agents can be used for open-ended problems where it’s difficult to predict the required number of steps, and where you can’t hardcode a fixed path.

Multi-agent system

A single agent may not be capable enough, or the problem may be too complex for one agent. An agent may fail or move in an incorrect direction, so multi-agent systems can self-correct.

This is not easy! Common challenges include:

  • Coordination. Sharing planning, results.
  • Memory management. When to share, when to isolate.
  • Compounding errors. The inherent loop means errors can be amplified.

The Anthropic multi-agent research system is a good read that shows the level of effort. OpenAI has shared a practical guide that covers design patterns.

The coordination challenge between agents led Google to define and release the Agent2Agent (A2A) protocol. Agents can communicate regardless of who built them. A2A is to inter-agent communication what MCP is to tool sharing.

How do you evaluate agents?

Things to think about:

  • Accuracy: How well does the agent achieve its goal?
  • Efficiency: How quickly does the agent achieve its goal?
  • Robustness: How well does the agent handle unexpected inputs or changes in the environment?
  • Safety: How well does the agent avoid harmful actions or outcomes?
  • Fairness: How well does the agent treat different groups or individuals?

Some common metrics include:

  • METRIC: Token consumption, e.g., avg token usage per request.
  • METRIC: Tool execution success rate.
  • METRIC: Observability. How easily can you find and trace errors?
  • METRIC: Task success rate.
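
A couple of these can be computed straight from run logs. The log shape here is purely hypothetical, just to make the metrics concrete:

# Hypothetical run logs: one record per agent run.
runs = [
    {"task_ok": True, "tokens": 812, "tool_calls": 3, "tool_errors": 0},
    {"task_ok": False, "tokens": 2404, "tool_calls": 5, "tool_errors": 2},
]

task_success_rate = sum(r["task_ok"] for r in runs) / len(runs)
avg_tokens = sum(r["tokens"] for r in runs) / len(runs)
tool_success_rate = 1 - sum(r["tool_errors"] for r in runs) / sum(r["tool_calls"] for r in runs)

print(f"task success: {task_success_rate:.0%}, avg tokens/run: {avg_tokens:.0f}, "
      f"tool success: {tool_success_rate:.0%}")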

The AI Agent Stack

This came from a Jan 2025 post and nicely summarizes a lot of the tooling options so that you don’t think you have to build everything from scratch!

diagram showing elements of the AI agent stack grouped by logical area of the architecture

source

Resources

Agentic Tool Stack Jan 2025

source

  • Model Serving & Inference
  • Storage
  • Agent Development
  • Tool Execution
  • Observability
  • Memory Management
