SKILL.md
Autonomous Agents
Autonomous agents are AI systems that can independently decompose goals,
plan actions, execute tools, and self-correct without constant human guidance.
The challenge isn't making them capable - it's making them reliable. Every
extra step is another chance to fail, and those chances multiply.
This skill covers agent loops (ReAct, Plan-Execute), goal decomposition,
reflection patterns, and production reliability. Key insight: compounding
error rates kill autonomous agents. A 95% per-step success rate leaves only
about 60% end-to-end success over 10 steps (0.95^10 ≈ 0.60). Build for
reliability first, autonomy second.
2025 lesson: The winners are constrained, domain-specific agents with clear
boundaries, not "autonomous everything." Treat AI outputs as proposals,
not truth.
Principles
- Reliability over autonomy - every step compounds error probability
- Constrain scope - domain-specific beats general-purpose
- Treat outputs as proposals, not truth
- Build guardrails before expanding capabilities
- Human-in-the-loop for critical decisions is non-negotiable
- Log everything - every action must be auditable
- Fail safely with rollback, not silently with corruption
Capabilities
- autonomous-agents
- agent-loops
- goal-decomposition
- self-correction
- reflection-patterns
- react-pattern
- plan-execute
- agent-reliability
- agent-guardrails
Scope
- multi-agent-systems → multi-agent-orchestration
- tool-building → agent-tool-builder
- memory-systems → agent-memory-systems
- workflow-orchestration → workflow-automation
Tooling
Frameworks
- LangGraph - When: Production agents with state management Note: 1.0 released Oct 2025, checkpointing, human-in-loop
- AutoGPT - When: Research/experimentation, open-ended exploration Note: Needs external guardrails for production
- CrewAI - When: Role-based agent teams Note: Good for specialized agent collaboration
- Claude Agent SDK - When: Anthropic ecosystem agents Note: Computer use, tool execution
Patterns
- ReAct - When: Reasoning + Acting in alternating steps Note: Foundation for most modern agents
- Plan-Execute - When: Separate planning from execution Note: Better for complex multi-step tasks
- Reflection - When: Self-evaluation and correction Note: Evaluator-optimizer loop
Patterns
ReAct Agent Loop
Alternating reasoning and action steps
When to use: Interactive problem-solving, tool use, exploration
REACT PATTERN:
"""
The ReAct loop:
- Thought: Reason about what to do next
- Action: Choose and execute a tool
- Observation: Receive result
- Repeat until goal achieved
Key: Explicit reasoning traces make debugging possible
"""
Basic ReAct Implementation
"""
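# Classic LangChain agent API (pre-1.0 releases)
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Define the ReAct prompt template. create_react_agent requires the
# {tools}, {tool_names} and {agent_scratchpad} variables.
react_prompt = PromptTemplate.from_template('''
Answer the question using the following format. You can use these tools:
{tools}

Question: the input question
Thought: reason about what to do
Action: tool_name, one of [{tool_names}]
Action Input: input to the tool
Observation: result of the action
... (repeat Thought/Action/Observation as needed)
Thought: I now know the final answer
Final Answer: the answer

Question: {input}
Thought:{agent_scratchpad}
''')

# Create the agent (tools is your list of LangChain tools)
agent = create_react_agent(
    llm=ChatOpenAI(model="gpt-4o"),
    tools=tools,
    prompt=react_prompt,
)

# The step limit belongs on the executor, not the invoke config
executor = AgentExecutor(agent=agent, tools=tools, max_iterations=10)  # Prevent runaway loops
result = executor.invoke({"input": query})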
"""
LangGraph ReAct (Production)
"""
import os

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.prebuilt import create_react_agent

# Production checkpointer. from_conn_string is used as a context manager.
with PostgresSaver.from_conn_string(os.environ["POSTGRES_URL"]) as checkpointer:
    checkpointer.setup()  # Create checkpoint tables on first use

    agent = create_react_agent(
        model=llm,
        tools=tools,
        checkpointer=checkpointer,  # Durable state
    )

    # Invoke with a thread ID for state persistence
    config = {"configurable": {"thread_id": "user-123"}}
    result = agent.invoke({"messages": [("user", query)]}, config)
"""
Plan-Execute Pattern
Separate planning phase from execution
When to use: Complex multi-step tasks, when full plan visibility matters
PLAN-EXECUTE PATTERN:
"""
Two-phase approach:
- Planning: Decompose goal into subtasks
- Execution: Execute subtasks, potentially re-plan
Advantages:
- Full visibility into plan before execution
- Can validate/modify plan with human
- Cleaner separation of concerns
Disadvantages:
- Less adaptive to mid-task discoveries
- Plan may become stale
"""
LangGraph Plan-Execute
"""
# NOTE: create_plan_and_execute_agent is a hypothetical helper used for
# illustration; check your LangGraph version before relying on this import.
# Plan-and-execute is typically built as a custom StateGraph with planner,
# executor, and re-plan nodes.
from langgraph.prebuilt import create_plan_and_execute_agent

# Planner creates the task list
planner_prompt = '''
For the given objective, create a step-by-step plan.
Each step should be atomic and actionable.
Format: numbered list of steps.
'''

# Executor handles individual steps
executor_prompt = '''
You are executing step {step_number} of the plan.
Previous results: {previous_results}
Current step: {current_step}
Execute this step using available tools.
'''

agent = create_plan_and_execute_agent(
    planner=planner_llm,
    executor=executor_llm,
    tools=tools,
    replan_on_error=True,  # Re-plan if step fails
)

# Human approval of plan (pausing requires a checkpointer)
config = {"configurable": {"thread_id": "task-456"}}

# First call creates the plan and pauses before the "execute" node
plan = agent.invoke({"objective": goal}, config, interrupt_before=["execute"])

# Review plan, then continue
if human_approves(plan):
    result = agent.invoke(None, config)  # Continue from checkpoint
"""
Decomposition Strategies
"""
# Decomposition-First: Plan everything, then execute
#   Best for: Stable tasks, need full plan approval
# Interleaved: Plan one step, execute, repeat
#   Best for: Dynamic tasks, learning as you go
def interleaved_execute(goal, max_steps=10):
    state = {"goal": goal, "completed": [], "remaining": [goal]}
    for step in range(max_steps):
        # Plan next action based on current state
        next_action = planner.plan_next(state)
        if next_action == "DONE":
            break
        # Execute and update state
        result = executor.execute(next_action)
        state["completed"].append((next_action, result))
        # Re-evaluate remaining work
        state["remaining"] = planner.reassess(state)
    return state
"""
Reflection Pattern
Self-evaluation and iterative improvement
When to use: Quality matters, complex outputs, creative tasks
REFLECTION PATTERN:
"""
Self-correction loop:
- Generate initial output
- Evaluate against criteria
- Critique and identify issues
- Refine based on critique
- Repeat until satisfactory
Also called: Evaluator-Optimizer, Self-Critique
"""
Basic Reflection
"""
def reflect_and_improve(task, max_iterations=3):
    # Initial generation
    output = generator.generate(task)
    for i in range(max_iterations):
        # Evaluate output
        critique = evaluator.critique(
            task=task,
            output=output,
            criteria=[
                "Correctness",
                "Completeness",
                "Clarity",
            ],
        )
        if critique["passes_all"]:
            return output
        # Refine based on critique
        output = generator.refine(
            task=task,
            previous_output=output,
            critique=critique["feedback"],
        )
    return output  # Best effort after max iterations
"""
LangGraph Reflection
"""
from langgraph.graph import END, START, StateGraph

def build_reflection_graph():
    graph = StateGraph(ReflectionState)
    # Nodes
    graph.add_node("generate", generate_node)
    graph.add_node("reflect", reflect_node)
    graph.add_node("output", output_node)
    # Edges (the graph needs an entry point and a terminal edge to compile)
    graph.add_edge(START, "generate")
    graph.add_edge("generate", "reflect")
    graph.add_conditional_edges(
        "reflect",
        should_continue,
        {
            "continue": "generate",  # Loop back
            "end": "output",
        },
    )
    graph.add_edge("output", END)
    return graph.compile()

def should_continue(state):
    # Stop after 3 iterations or once the score is high enough
    if state["iteration"] >= 3:
        return "end"
    if state["score"] >= 0.9:
        return "end"
    return "continue"
"""
Separate Evaluator (More Robust)
"""
# Use a different model for evaluation to avoid self-bias
generator = ChatOpenAI(model="gpt-4o")
evaluator = ChatOpenAI(model="gpt-4o-mini")  # Different perspective

# Or use specialized evaluators
from langchain.evaluation import load_evaluator
# "correctness" is judged against a reference answer, so use the labeled variant
evaluator = load_evaluator("labeled_criteria", criteria="correctness")
"""
Guardrailed Autonomy
Constrained agents with safety boundaries
When to use: Production systems, critical operations
GUARDRAILED AUTONOMY:
"""
Production agents need multiple safety layers:
- Input validation
- Action constraints
- Output validation
- Cost limits
- Human escalation
- Rollback capability
"""
Multi-Layer Guardrails
"""
class GuardedAgent:
    def __init__(self, agent, config):
        self.agent = agent
        self.max_cost = config.get("max_cost_usd", 1.0)
        self.max_steps = config.get("max_steps", 10)
        self.allowed_actions = config.get("allowed_actions", [])
        self.require_approval = config.get("require_approval", [])

    async def execute(self, goal):
        total_cost = 0
        steps = 0
        while steps < self.max_steps:
            # Get next action
            action = await self.agent.plan_next(goal)
            # Validate action is allowed
            if action.name not in self.allowed_actions:
                raise ActionNotAllowedError(action.name)
            # Check if approval needed
            if action.name in self.require_approval:
                approved = await self.request_human_approval(action)
                if not approved:
                    return {"status": "rejected", "action": action}
            # Estimate cost
            estimated_cost = self.estimate_cost(action)
            if total_cost + estimated_cost > self.max_cost:
                raise CostLimitExceededError(total_cost)
            # Execute with rollback capability
            checkpoint = await self.save_checkpoint()
            try:
                result = await self.agent.execute(action)
                total_cost += self.actual_cost(action)
                steps += 1
            except Exception:
                await self.rollback_to(checkpoint)
                raise
            if result.is_complete:
                return {"status": "complete", "total_cost": total_cost}
        # Step budget exhausted before the goal was reached
        return {"status": "step_limit_reached", "total_cost": total_cost}
"""
Least Privilege Principle
"""
# Define minimal permissions per task type
TASK_PERMISSIONS = {
    "research": ["web_search", "read_file"],
    "coding": ["read_file", "write_file", "run_tests"],
    "admin": ["all"],  # Rarely grant this
}

def create_scoped_agent(task_type):
    # Unknown task types get no tools at all
    allowed = TASK_PERMISSIONS.get(task_type, [])
    if "all" in allowed:
        tools = list(ALL_TOOLS)  # "all" is a wildcard, not a tool name
    else:
        tools = [t for t in ALL_TOOLS if t.name in allowed]
    return Agent(tools=tools)
"""
Cost Control
"""
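# Every call is billed per token, so re-sending a growing history each turn
# makes cumulative cost grow roughly quadratically with the number of turns.
# Double the turns = about 4x total cost
def trim_context(messages, max_tokens=4000):
    # Nothing to do while the history fits the budget
    if count_tokens(messages) <= max_tokens:
        return messages
    # Keep system message and recent messages
    system = messages[0]
    recent = messages[-10:]
    # Summarize middle if needed
    if len(messages) > 11:
        middle = messages[1:-10]
        summary = summarize(middle)
        return [system, summary] + recent
    return messages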
"""
Durable Execution Pattern
Agents that survive failures and resume
When to use: Long-running tasks, production systems, multi-day processes
DURABLE EXECUTION:
"""
Production agents must:
- Survive server restarts
- Resume from exact point of failure
- Handle hours/days of runtime
- Allow human intervention mid-process
LangGraph 1.0 provides this natively.
"""
LangGraph Checkpointing
"""
import os

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

# Build graph
graph = StateGraph(AgentState)
# ... add nodes and edges ...

# Production checkpointer (not MemorySaver!)
with PostgresSaver.from_conn_string(os.environ["POSTGRES_URL"]) as checkpointer:
    checkpointer.setup()  # Create checkpoint tables on first use
    agent = graph.compile(checkpointer=checkpointer)

    # Each invocation saves state under this thread ID
    config = {"configurable": {"thread_id": "long-task-789"}}

    # Start task
    agent.invoke({"goal": complex_goal}, config)

    # If server dies, resume later:
    state = agent.get_state(config)
    if state.next:  # Non-empty while nodes are still pending
        agent.invoke(None, config)  # Continues from checkpoint
"""
Human-in-the-Loop Interrupts
"""
# Pause at specific nodes
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["critical_action"],  # Pause before
    interrupt_after=["validation"],  # Pause after
)
# First invocation pauses at interrupt
result = agent.invoke({"goal": goal}, config)
# Human reviews state
state = agent.get_state(config)
if human_approves(state):
    # Continue from pause point
    agent.invoke(None, config)
else:
    # Modify state and continue
    agent.update_state(config, {"approved": False})
    agent.invoke(None, config)
"""
Time-Travel Debugging
"""
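# LangGraph stores full history (most recent checkpoint first)
history = list(agent.get_state_history(config))
# Go back to any previous state
past_state = history[5]
# Replay from that checkpoint unchanged
agent.invoke(None, past_state.config)
# Or fork it with modifications, then continue from the fork
forked_config = agent.update_state(past_state.config, {"approved": True})
agent.invoke(None, forked_config)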
"""
Sharp Edges
Error Probability Compounds Exponentially
Severity: CRITICAL
Situation: Building multi-step autonomous agents
Symptoms:
Agent works in demos but fails in production. Simple tasks succeed,
complex tasks fail mysteriously. Success rate drops dramatically
as task complexity increases. Users lose trust.
Why this breaks:
Each step has independent failure probability. A 95% success rate
per step sounds great until you realize:
- 5 steps: 77% success (0.95^5)
- 10 steps: 60% success (0.95^10)
- 20 steps: 36% success (0.95^20)
This is the fundamental limit of autonomous agents. Every additional
step multiplies failure probability.
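The arithmetic above is easy to reproduce and to invert. The sketch below assumes independent steps with identical success probability, which is a simplification; `end_to_end_success` and `max_steps_for` are names made up for this example.

```python
import math


def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def max_steps_for(target: float, per_step: float) -> int:
    """Largest step count that keeps end-to-end success at or above `target`."""
    return math.floor(math.log(target) / math.log(per_step))


for n in (5, 10, 20):
    print(n, "steps:", round(end_to_end_success(0.95, n), 2))

# How many steps can you afford if the task must succeed 90% of the time?
print(max_steps_for(0.90, 0.95), "steps at 95% per step")
print(max_steps_for(0.90, 0.99), "steps at 99% per step")
```

Under these assumptions, a 90% end-to-end target allows only 2 steps at 95% per-step reliability but 10 steps at 99%, which is why per-step reliability is the lever to pull first.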
Recommended fix:
Reduce step count
- Combine steps where possible
- Prefer fewer, more capable steps over many small ones
Increase per-step reliability
- Use structured outputs (JSON schemas)
- Add validation at each step
- Use better models for critical steps
Design for failure
class RobustAgent:
    def execute_with_retry(self, step, max_retries=3):
        for attempt in range(max_retries):
            try:
                result = step.execute()
                if self.validate(result):
                    return result
                error = ValueError("validation failed")
            except Exception as e:
                error = e
            # Out of retries: surface the failure instead of returning None
            if attempt == max_retries - 1:
                raise error
            self.log_retry(step, attempt, error)
Break into checkpointed segments
- Human review at each segment
- Resume from last good checkpoint
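A rough estimate of what validated retries buy, assuming attempts fail independently and validation catches every bad result (both optimistic assumptions; correlated failures will do worse). `with_retries` is a hypothetical helper for this sketch.

```python
def with_retries(per_attempt: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - per_attempt) ** attempts


per_step = with_retries(0.95, 3)   # 95% per attempt, up to 3 attempts
twenty_steps = per_step ** 20      # End-to-end success over 20 such steps

print(round(per_step, 6), round(twenty_steps, 4))
```

Under these assumptions a 20-step task goes from roughly 36% success without retries to above 99% with three validated attempts per step. Real gains are smaller when a step fails for a systematic reason that retrying cannot fix.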
API Costs Explode with Context Growth
Severity: CRITICAL
Situation: Running agents with growing conversation context
Symptoms:
$47 to close a single support ticket. Thousands in surprise API bills.
Agents getting slower as they run longer. Token counts exceeding
model limits.
Why this breaks:
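Attention compute scales quadratically with context length, and API billing
charges for every token on every call. A long-running agent that re-sends
its full conversation each turn pays for the whole history again every
time, so cumulative cost grows roughly quadratically with the number of
turns. Most agents append to context without trimming. Context grows
(illustrative per-message cost):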
- Turn 1: 500 tokens → $0.01
- Turn 10: 5000 tokens → $0.10
- Turn 50: 25000 tokens → $0.50
- Turn 100: 50000 tokens → $1.00+ per message
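Summing the per-message figures shows where the bill comes from. The sketch assumes 500 new tokens per turn and the price implied by the list above ($0.01 per 500 tokens, i.e. $0.02 per 1K tokens); both numbers are illustrative, not real pricing.

```python
PRICE_PER_1K_TOKENS = 0.02  # Implied by the list above; illustrative only


def cumulative_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Total tokens sent when every turn re-sends the full history so far."""
    # Turn k carries k * tokens_per_turn tokens of context
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def cumulative_cost(turns: int) -> float:
    return cumulative_tokens(turns) / 1000 * PRICE_PER_1K_TOKENS


for turns in (10, 50, 100):
    print(turns, "turns:", cumulative_tokens(turns), "tokens,",
          f"${cumulative_cost(turns):.2f}")
```

Each message looks cheap on its own, but the running total grows with the square of the turn count: 10x the turns costs roughly 100x as much in total.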
Recommended fix:
Set hard cost limits
class CostLimitedAgent:
    MAX_COST_PER_TASK = 1.00  # USD

    def __init__(self):
        self.total_cost = 0

    def before_call(self, estimated_tokens):
        estimated_cost = self.estimate_cost(estimated_tokens)
        if self.total_cost + estimated_cost > self.MAX_COST_PER_TASK:
            raise CostLimitExceeded(
                f"Would exceed ${self.MAX_COST_PER_TASK} limit"
            )

    def after_call(self, response):
        self.total_cost += self.calculate_actual_cost(response)
Trim context aggressively
def trim_context(messages, max_tokens=4000):
    # Keep: system prompt + last N messages
    # Summarize: everything in between
    if count_tokens(messages) <= max_tokens:
        return messages
    system = messages[0]
    recent = messages[1:][-5:]  # Never duplicate the system prompt
    middle = messages[1:-5]
    if middle:
        summary = summarize(middle)  # Compress history
        return [system, summary] + recent
    return [system] + recent
Use streaming to track costs in real-time
- Alert at 50% of budget, halt at 90%
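The alert and halt thresholds can be enforced with a small tracker. A minimal sketch: `BudgetMonitor` and `BudgetExceeded` are hypothetical names, and the caller is assumed to report the actual cost of each model call.

```python
class BudgetExceeded(RuntimeError):
    """Raised when spending reaches the halt threshold."""


class BudgetMonitor:
    def __init__(self, budget_usd: float, alert_at: float = 0.5, halt_at: float = 0.9):
        self.budget = budget_usd
        self.alert_at = alert_at
        self.halt_at = halt_at
        self.spent = 0.0
        self.alerted = False

    def record(self, cost_usd: float) -> str:
        """Add one call's cost; return "ok" or "alert", or raise at the halt line."""
        self.spent += cost_usd
        used = self.spent / self.budget
        if used >= self.halt_at:
            raise BudgetExceeded(f"{used:.0%} of budget used, halting")
        if used >= self.alert_at and not self.alerted:
            self.alerted = True  # Alert once, not on every later call
            return "alert"
        return "ok"


monitor = BudgetMonitor(budget_usd=1.00)
events = [monitor.record(cost) for cost in (0.30, 0.30, 0.10)]

try:
    monitor.record(0.25)
    halted = False
except BudgetExceeded:
    halted = True

print(events, halted)
```

Halting at 90% rather than 100% leaves headroom for the call already in flight, whose real cost is only known after it returns.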
Demo Works But Production Fails
Severity: CRITICAL
Situation: Moving from prototype to production
Symptoms:
Impressive demo to stakeholders. Months of failure in production.
Works for the founder's use case, fails for real users. Edge cases
overwhelm the system.
Why this breaks:
Demos show the happy path with curated inputs. Production means:
- Unexpected inputs (typos, ambiguity, adversarial)
- Scale (1000 users, not 3)
- Reliability (99.9% uptime, not "usually works")
- Edge cases (the 1% that breaks everything)
The methodology is questionable, but the core problem is real.
The gap between a working demo and a reliable production system
is where projects die.
Recommended fix:
Test at scale before production
- Run 1000+ test cases, not 10
- Measure P95/P99 success rate, not average
- Include adversarial inputs
Build observability first
import structlog
logger = structlog.get_logger()
class ObservableAgent:
    def execute(self, task):
        # bind() returns a new logger carrying task_id; it is not a context manager
        log = logger.bind(task_id=task.id)
        log.info("task_started")
        try:
            result = self._execute(task)
            log.info("task_completed", result=result)
            return result
        except Exception as e:
            log.error("task_failed", error=str(e))
            raise
Have escape hatches
- Human takeover when confidence is low
- user needs multiple agents working together -> multi-agent-orchestration (Multiple agents working together)
- user needs to test/evaluate agent -> agent-evaluation (Benchmarking and testing)
- user needs tools for agent -> agent-tool-builder (Tool design and implementation)
- user needs persistent memory -> agent-memory-systems (Long-term memory architecture)
- user needs workflow automation -> workflow-automation (When agent is overkill for the task)
- user needs computer control -> computer-use-agents (GUI automation, screen interaction)
Related Skills
Works well with: agent-tool-builder, agent-memory-systems, multi-agent-orchestration, agent-evaluation
When to Use
- User mentions or implies: autonomous agent
- User mentions or implies: autogpt
- User mentions or implies: babyagi
- User mentions or implies: self-prompting
- User mentions or implies: goal decomposition
- User mentions or implies: react pattern
- User mentions or implies: agent loop
- User mentions or implies: self-correcting agent
- User mentions or implies: reflection agent
- User mentions or implies: langgraph
- User mentions or implies: agentic ai
- User mentions or implies: agent planning
Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.