AI Agent Architecture — From ReAct to Multi-Agent Systems
How to design production AI Agent systems. A practical guide covering the ReAct pattern, Tool Use, Memory management, Multi-Agent orchestration, and safety design.
TestForge Team
What Is an AI Agent?
Agent = LLM + Tools + Memory + Planning
The difference from a simple LLM call:
- LLM: single input → single output
- Agent: iteratively plans and acts until the goal is reached
User: "What's the weather in Seoul today?"
LLM: "I'm sorry, I don't have access to real-time information."
Agent:
1. Think: Need to call a weather API
2. Act: weather_tool("Seoul")
3. Observe: {"temp": 18, "condition": "clear"}
4. Answer: "It's 18°C and clear in Seoul today."
The ReAct Pattern
Reasoning + Acting in a loop.
from anthropic import Anthropic
client = Anthropic()
tools = [
{
"name": "search_web",
"description": "Search the web for up-to-date information",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
},
{
"name": "run_code",
"description": "Execute Python code",
"input_schema": {
"type": "object",
"properties": {
"code": {"type": "string"}
},
"required": ["code"]
}
}
]
def run_agent(user_message: str, max_iterations: int = 10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "end_turn":
            # Return the final text block (content can include non-text blocks)
            return next(b.text for b in response.content if b.type == "text")
        if response.stop_reason == "tool_use":
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            tool_results = []
            for call in tool_calls:
                result = execute_tool(call.name, call.input)  # your tool dispatcher
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": call.id,
                    "content": str(result),
                })
            # Feed the tool output back into the conversation and loop again
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            raise RuntimeError(f"Unexpected stop reason: {response.stop_reason}")
    raise RuntimeError("Max iterations reached")
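The loop above calls `execute_tool`, which is left undefined. A minimal dispatcher might look like the sketch below; the two handler functions are illustrative stubs, not real integrations:

```python
# Minimal tool dispatcher for the ReAct loop above.
# The handlers are stubs; a real search_web would call a search API, and
# run_code must use a sandbox -- never exec untrusted model output directly.

def search_web(query: str) -> str:
    # Stub: replace with a real search API call.
    return f"[search results for: {query}]"

def run_code(code: str) -> str:
    # Stub: replace with sandboxed execution.
    return "[execution output]"

TOOL_HANDLERS = {
    "search_web": lambda params: search_web(params["query"]),
    "run_code": lambda params: run_code(params["code"]),
}

def execute_tool(name: str, params: dict) -> str:
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Report unknown tools back to the model instead of crashing the loop
        return f"Error: unknown tool '{name}'"
    try:
        return handler(params)
    except Exception as e:
        # Tool failures become observations the model can react to
        return f"Error: {e}"
```

Returning error strings instead of raising keeps the loop alive: the model sees the failure as an observation and can try a different approach.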
Memory Design
Agents need three types of memory.
class AgentMemory:
    def __init__(self):
        # 1. Working Memory: current conversation context
        self.messages: list[dict] = []
        # 2. Episodic Memory: summaries of past conversations (vector DB)
        self.vector_store = ChromaDB("agent_episodes")  # illustrative wrapper
        # 3. Semantic Memory: domain knowledge (searchable)
        self.knowledge_base = ChromaDB("knowledge")
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # If the context exceeds the limit, summarize and compress
        if len(self.messages) > 20:
            self._compress()
    def retrieve_relevant(self, query: str, k: int = 3) -> list[str]:
        """Search past episodes for similar experiences."""
        return self.vector_store.similarity_search(query, k=k)
    def _compress(self):
        """Summarize the oldest messages and move them to the vector DB."""
        old_messages = self.messages[:10]
        summary = summarize(old_messages)  # LLM summarization call
        self.vector_store.add(summary)
        self.messages = self.messages[10:]
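A quick sanity check of the compression policy, with the vector store and LLM summarizer replaced by stubs (both stubs are assumptions; the original relies on an external vector DB and a summarization call):

```python
# Self-contained sketch of the compression policy above: once working
# memory exceeds 20 messages, the oldest 10 are summarized into one episode.

class StubStore:
    """Stand-in for the vector DB; just collects summaries."""
    def __init__(self):
        self.episodes = []
    def add(self, text: str):
        self.episodes.append(text)

def stub_summarize(messages: list[dict]) -> str:
    # Stand-in for an LLM summary call.
    return f"summary of {len(messages)} messages"

class AgentMemory:
    def __init__(self):
        self.messages: list[dict] = []
        self.vector_store = StubStore()
    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > 20:
            self._compress()
    def _compress(self):
        old, self.messages = self.messages[:10], self.messages[10:]
        self.vector_store.add(stub_summarize(old))

memory = AgentMemory()
for i in range(25):
    memory.add_message("user", f"message {i}")
# Compression fires once at message 21 (21 -> 11), then memory grows to 15
print(len(memory.messages), len(memory.vector_store.episodes))  # 15 1
```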
Multi-Agent Architecture
Complex tasks benefit from multiple specialized agents working together.
Orchestrator Agent
├── Research Agent → Information gathering
├── Analysis Agent → Data analysis
├── Code Agent → Code writing/execution
└── Writer Agent → Compiling results
import asyncio

class OrchestratorAgent:
    def __init__(self):
        self.agents = {
            "research": ResearchAgent(),
            "analysis": AnalysisAgent(),
            "code": CodeAgent(),
            "writer": WriterAgent(),
        }
    async def run(self, task: str) -> str:
        # 1. Decompose the task into subtasks, each assigned to an agent
        subtasks = await self.plan(task)
        # 2. Run independent subtasks in parallel
        results = await asyncio.gather(*[
            self.agents[st.agent].run(st.description)
            for st in subtasks
        ])
        # 3. Synthesize the partial results into a final answer
        return await self.synthesize(task, results)
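The orchestrator's `plan` step is left abstract above. One plausible shape for its output, and for the parallel fan-out, is sketched below with placeholder agents (the `Subtask` fields and stub agents are assumptions):

```python
# Sketch of the data plan() might return, plus a stub run showing the
# fan-out. The agent classes here are placeholders, not real agents.
import asyncio
from dataclasses import dataclass

@dataclass
class Subtask:
    agent: str        # key into the orchestrator's agent registry
    description: str  # natural-language instruction for that agent

class StubAgent:
    def __init__(self, name: str):
        self.name = name
    async def run(self, description: str) -> str:
        return f"{self.name}: done ({description})"

async def main():
    agents = {"research": StubAgent("research"), "code": StubAgent("code")}
    subtasks = [
        Subtask("research", "gather sources on topic X"),
        Subtask("code", "prototype the parser"),
    ]
    # Independent subtasks fan out in parallel, as in step 2 above
    return await asyncio.gather(*[
        agents[st.agent].run(st.description) for st in subtasks
    ])

print(asyncio.run(main()))
```

Note that `asyncio.gather` preserves input order, so the synthesis step can match each result back to its subtask by position.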
Safety Design
import asyncio

class SafeAgent:
    # High-risk tools require human approval before execution
    REQUIRES_APPROVAL = {"delete_file", "send_email", "deploy_production"}

    async def execute_tool(self, tool_name: str, params: dict):
        if tool_name in self.REQUIRES_APPROVAL:
            if not await self.request_human_approval(tool_name, params):
                return {"error": "User denied the action"}
        # Validate inputs before execution
        validated = self.validate_inputs(tool_name, params)
        # Enforce a timeout so a hung tool cannot stall the agent
        try:
            return await asyncio.wait_for(
                self.tools[tool_name](**validated),
                timeout=30.0,
            )
        except asyncio.TimeoutError:
            return {"error": "Tool execution timed out"}
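The `validate_inputs` step is not spelled out above. One possible shape is a per-tool schema that allowlists parameter names and checks basic types, so injected extra parameters never reach the tool; the schemas below are assumptions for illustration:

```python
# Hypothetical per-tool input schemas: allowlisted parameter names with
# expected types. Unknown keys are dropped, missing or mistyped keys fail.

TOOL_SCHEMAS = {
    "send_email": {"to": str, "subject": str, "body": str},
    "delete_file": {"path": str},
}

def validate_inputs(tool_name: str, params: dict) -> dict:
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        raise ValueError(f"No schema registered for tool '{tool_name}'")
    validated = {}
    for key, expected_type in schema.items():
        if key not in params:
            raise ValueError(f"Missing required parameter '{key}'")
        if not isinstance(params[key], expected_type):
            raise TypeError(f"'{key}' must be {expected_type.__name__}")
        # Only allowlisted keys are copied; anything extra is discarded
        validated[key] = params[key]
    return validated

print(validate_inputs("delete_file", {"path": "/tmp/x", "force": True}))
# the unexpected 'force' key is silently stripped
```

Dropping unknown keys (rather than passing `params` through) is the important design choice: a prompt-injected `force=True` or `bcc=attacker@...` never reaches the tool.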
Production Checklist
- Max iterations cap (prevent infinite loops)
- Tool timeout configuration
- Cost ceiling (monitor token usage)
- Human-in-the-loop: require approval for risky actions
- Execution logs saved (for debugging and auditing)
- Failure recovery: fallback paths when tools fail
- Input validation: defend against prompt injection
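The cost-ceiling item from the checklist can be sketched as a small guard that accumulates estimated spend per task and aborts over budget. The class and its thresholds are assumptions; the rates are the example prices used in the next section:

```python
# Sketch of a cost ceiling: track token usage per task and abort once an
# estimated dollar budget is exceeded ($15/$75 per 1M tokens, as below).

class CostGuard:
    def __init__(self, budget_usd: float,
                 input_price: float = 15.0, output_price: float = 75.0):
        self.budget_usd = budget_usd
        self.input_price = input_price    # USD per 1M input tokens
        self.output_price = output_price  # USD per 1M output tokens
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int):
        self.spent_usd += (input_tokens * self.input_price
                           + output_tokens * self.output_price) / 1_000_000
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(
                f"Cost ceiling exceeded: ${self.spent_usd:.4f} > ${self.budget_usd}"
            )

guard = CostGuard(budget_usd=0.50)
guard.record(input_tokens=10_000, output_tokens=2_000)  # $0.15 + $0.15
print(f"{guard.spent_usd:.2f}")  # 0.30
```

Call `record` after every model response (the API returns token counts in its usage metadata) so a runaway loop is stopped mid-task rather than discovered on the invoice.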
Cost Estimation
# Based on claude-opus-4-6 pricing
# Input: $15/1M tokens, Output: $75/1M tokens
def estimate_cost(iterations: int, avg_tokens_per_turn: int = 1000):
    # Flat estimate: assumes each iteration sends roughly the same tokens
    input_tokens = iterations * avg_tokens_per_turn
    output_tokens = iterations * 500  # assumed average output per turn
    cost = (input_tokens * 15 + output_tokens * 75) / 1_000_000
    return f"Estimated cost: ${cost:.4f} / task"
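The flat estimate ignores that an agent re-sends its whole conversation each iteration, so input cost actually grows with the loop. A variant that models the growing context (same assumed rates, same assumed averages):

```python
# Growing-context variant: iteration i re-sends the history accumulated
# over the previous i-1 turns, so input cost is quadratic in iterations.

def estimate_cost_growing(iterations: int, avg_tokens_per_turn: int = 1000,
                          avg_output_tokens: int = 500):
    # History grows by roughly one user/tool turn plus one model turn per loop
    input_tokens = sum(i * (avg_tokens_per_turn + avg_output_tokens)
                       for i in range(1, iterations + 1))
    output_tokens = iterations * avg_output_tokens
    cost = (input_tokens * 15 + output_tokens * 75) / 1_000_000
    return round(cost, 4)

print(estimate_cost_growing(10))  # vs ~$0.525 from the flat estimate
```

For a 10-iteration task this roughly triples the flat estimate, which is why prompt caching matters so much for agents.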
Agent architectures are powerful, but costs escalate quickly.
Control them through prompt caching and model selection (Haiku for routine steps, Opus for hard ones).
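Model selection can be as simple as a routing function in front of the API call. The model names and the heuristic below are placeholders for illustration:

```python
# Toy model router: send short, tool-free steps to a cheap model and
# reserve the strong model for everything else. Names are placeholders.

CHEAP_MODEL = "claude-haiku"   # placeholder identifier
STRONG_MODEL = "claude-opus"   # placeholder identifier

def pick_model(task: str, requires_tools: bool) -> str:
    # Heuristic: tool use and long instructions signal a hard step
    if not requires_tools and len(task) < 200:
        return CHEAP_MODEL
    return STRONG_MODEL

print(pick_model("Summarize this paragraph.", requires_tools=False))
print(pick_model("Research and fix the failing CI pipeline.", requires_tools=True))
```

In production you might route on a classifier's judgment rather than string length, but even this crude split can cut the bill substantially when most steps are routine.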