TestForge Blog

AI Agent Architecture — From ReAct to Multi-Agent Systems

How to design production AI Agent systems. A practical guide covering the ReAct pattern, Tool Use, Memory management, Multi-Agent orchestration, and safety design.

TestForge Team

What Is an AI Agent?

Agent = LLM + Tools + Memory + Planning

The difference from a simple LLM call:

  • LLM: single input → single output
  • Agent: iteratively plans and acts until the goal is reached
User: "What's the weather in Seoul today?"
LLM: "I'm sorry, I don't have access to real-time information."
Agent:
  1. Think: Need to call a weather API
  2. Act: weather_tool("Seoul")
  3. Observe: {"temp": 18, "condition": "clear"}
  4. Answer: "It's 18°C and clear in Seoul today."

The ReAct Pattern

Reasoning + Acting in a loop.

from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "search_web",
        "description": "Search the web for up-to-date information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "run_code",
        "description": "Execute Python code",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string"}
            },
            "required": ["code"]
        }
    }
]

def run_agent(user_message: str, max_iterations: int = 10):
    messages = [{"role": "user", "content": user_message}]
    
    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        
        if response.stop_reason == "end_turn":
            # Join text blocks; content can also contain non-text blocks
            return "".join(b.text for b in response.content if b.type == "text")
        
        if response.stop_reason == "tool_use":
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            tool_results = []
            
            for call in tool_calls:
                result = execute_tool(call.name, call.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": call.id,
                    "content": str(result)
                })
            
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
    
    raise RuntimeError("Max iterations reached")
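The loop above relies on an `execute_tool` helper that isn't shown. A minimal sketch of a dispatcher, assuming a simple name-to-handler registry (the `TOOL_HANDLERS` dict and stub handlers here are illustrative, not part of the Anthropic SDK):

```python
# Hypothetical registry mapping tool names to handler callables; real
# handlers would call a search API and a sandboxed interpreter.
TOOL_HANDLERS = {
    "search_web": lambda query: f"Results for: {query}",  # stub
    "run_code": lambda code: "<execution output>",        # stub
}

def execute_tool(name: str, params: dict):
    """Dispatch a tool call. Errors are returned as values rather than
    raised, so the model can observe the failure and recover."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        return handler(**params)
    except Exception as e:
        return {"error": f"{type(e).__name__}: {e}"}
```

Returning errors as values instead of raising matters: the error string goes back into the conversation as a tool result, letting the model decide whether to retry or change strategy.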

Memory Design

Agents need three types of memory.

class AgentMemory:
    def __init__(self):
        # 1. Working Memory: current conversation context
        self.messages: list[dict] = []
        
        # 2. Episodic Memory: summaries of past conversations
        #    (ChromaDB here stands in for a thin vector-DB wrapper,
        #    not the chromadb client API itself)
        self.vector_store = ChromaDB("agent_episodes")

        # 3. Semantic Memory: domain knowledge (searchable)
        self.knowledge_base = ChromaDB("knowledge")
    
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # If context exceeds limit, summarize and compress
        if len(self.messages) > 20:
            self._compress()
    
    def retrieve_relevant(self, query: str, k: int = 3) -> list[str]:
        """Search past episodes for similar experiences"""
        return self.vector_store.similarity_search(query, k=k)
    
    def _compress(self):
        """Summarize old messages and store in vector DB"""
        old_messages = self.messages[:10]
        summary = summarize(old_messages)  # Summarize with LLM
        self.vector_store.add(summary)
        self.messages = self.messages[10:]
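The `summarize` helper used in `_compress` is left to the reader. A minimal sketch, assuming the module-level `client` from the ReAct example; the prompt wording and 3-sentence budget are assumptions:

```python
def format_transcript(messages: list[dict]) -> str:
    """Flatten message dicts into plain text for the summarizer prompt."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def summarize(messages: list[dict]) -> str:
    """Collapse old turns into a short summary for the episodic store.
    Reuses the module-level Anthropic `client` from the ReAct section."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation in 3 sentences:\n\n"
                       + format_transcript(messages),
        }],
    )
    return response.content[0].text
```

A cheap, fast model is usually a better fit for summarization than the main agent model, since the summary only needs to preserve facts, not reasoning.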

Multi-Agent Architecture

Complex tasks benefit from multiple specialized agents working together.

Orchestrator Agent
├── Research Agent    → Information gathering
├── Analysis Agent    → Data analysis
├── Code Agent        → Code writing/execution
└── Writer Agent      → Compiling results

import asyncio

class OrchestratorAgent:
    def __init__(self):
        self.agents = {
            "research": ResearchAgent(),
            "analysis": AnalysisAgent(),
            "code": CodeAgent(),
            "writer": WriterAgent(),
        }
    
    async def run(self, task: str) -> str:
        # 1. Decompose the task
        subtasks = await self.plan(task)
        
        # 2. Run independent tasks in parallel
        results = await asyncio.gather(*[
            self.agents[st.agent].run(st.description)
            for st in subtasks
        ])
        
        # 3. Synthesize results
        return await self.synthesize(task, results)
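`plan` and `synthesize` are not shown. The subtasks the planner returns only need an agent name and a description; one way to parse a planner LLM's output into that shape (the JSON contract and `Subtask` dataclass are assumptions of this sketch):

```python
import json
from dataclasses import dataclass

@dataclass
class Subtask:
    agent: str        # key into the orchestrator's agent registry
    description: str  # natural-language instruction for that agent

def parse_plan(raw: str, known_agents: set[str]) -> list[Subtask]:
    """Parse the planner LLM's JSON array into Subtask objects,
    dropping any subtask routed to an agent we don't have."""
    items = json.loads(raw)
    return [
        Subtask(it["agent"], it["description"])
        for it in items
        if it["agent"] in known_agents
    ]
```

Validating agent names at parse time matters because planner models occasionally invent agents that don't exist; dropping (or re-planning) those is cheaper than a `KeyError` mid-run.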

Safety Design

class SafeAgent:
    # High-risk tools require approval before execution
    REQUIRES_APPROVAL = {"delete_file", "send_email", "deploy_production"}
    
    async def execute_tool(self, tool_name: str, params: dict):
        if tool_name in self.REQUIRES_APPROVAL:
            if not await self.request_human_approval(tool_name, params):
                return {"error": "User denied the action"}
        
        # Validate inputs before execution
        validated = self.validate_inputs(tool_name, params)
        
        # Set timeout
        try:
            return await asyncio.wait_for(
                self.tools[tool_name](**validated),
                timeout=30.0
            )
        except asyncio.TimeoutError:
            return {"error": "Tool execution timeout"}
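`request_human_approval` is not shown. A minimal sketch with an injectable prompt function, so the same gate works from a console or a web UI (this factory helper is an illustrative assumption, not part of the post's `SafeAgent`):

```python
import asyncio

def make_approval_gate(ask=input):
    """Build an async approval function. `ask` is injectable so tests
    or a web UI can replace the blocking console prompt."""
    async def request_human_approval(tool_name: str, params: dict) -> bool:
        loop = asyncio.get_running_loop()
        prompt = f"Agent wants to run {tool_name}({params}). Approve? [y/N] "
        # Run the blocking prompt in a thread so the event loop stays free
        answer = await loop.run_in_executor(None, ask, prompt)
        return answer.strip().lower() == "y"
    return request_human_approval
```

Defaulting to "deny" on anything but an explicit "y" keeps the gate fail-safe: a distracted operator hitting Enter never approves a risky action.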

Production Checklist

  • Max iterations cap (prevent infinite loops)
  • Tool timeout configuration
  • Cost ceiling (monitor token usage)
  • Human-in-the-loop: require approval for risky actions
  • Execution logs saved (for debugging and auditing)
  • Failure recovery: fallback paths when tools fail
  • Input validation: defend against prompt injection
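The cost-ceiling item can be enforced with a small guard that records token usage after every model call and aborts the run once a budget is exhausted. A sketch, with per-million-token prices as constructor assumptions:

```python
class CostCeiling:
    """Track cumulative token spend and abort when a USD budget is
    exceeded. Prices are per million tokens (defaults are assumptions
    matching the Opus figures quoted in this post)."""

    def __init__(self, budget_usd: float,
                 input_price: float = 15.0, output_price: float = 75.0):
        self.budget_usd = budget_usd
        self.input_price = input_price
        self.output_price = output_price
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int):
        self.spent_usd += (input_tokens * self.input_price
                           + output_tokens * self.output_price) / 1_000_000
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(
                f"Cost ceiling exceeded: ${self.spent_usd:.4f}")
```

Call `record(...)` once per iteration inside the agent loop; the raised exception becomes the run's hard stop, just like the max-iterations cap.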

Cost Estimation

# Based on claude-opus-4-6 pricing
# Input: $15/1M tokens, Output: $75/1M tokens

def estimate_cost(iterations: int, avg_tokens_per_turn: int = 1000):
    # Each iteration resends the full conversation so far, so input
    # tokens grow with every turn rather than staying flat
    input_tokens = sum(i * avg_tokens_per_turn
                       for i in range(1, iterations + 1))
    output_tokens = iterations * 500

    cost = (input_tokens * 15 + output_tokens * 75) / 1_000_000
    return f"Estimated cost: ${cost:.4f} / task"

Agent architectures are powerful, but costs escalate quickly because every iteration resends the growing conversation history. Keep them in check with prompt caching and model selection (Haiku for routine steps, Opus for the hard ones).
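One way to act on the model-selection advice is a simple router that sends easy tasks to a small model and reserves the large one for heavy reasoning. The keyword heuristic and the small-model name below are deliberate placeholders; substitute your provider's current models:

```python
def pick_model(task: str) -> str:
    """Route simple lookups to a small, cheap model and reserve the
    large model for multi-step reasoning. The keyword heuristic is a
    naive placeholder; a classifier call would be more robust."""
    hard_markers = ("analyze", "design", "refactor", "prove", "plan")
    if any(marker in task.lower() for marker in hard_markers):
        return "claude-opus-4-6"   # model name taken from this post
    return "claude-haiku-4-5"      # assumed small-model name
```

In practice a cheap LLM call asking "does this task need deep reasoning?" routes far better than keywords, and its cost is negligible next to a single Opus turn.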