SKILL.md

Autonomous Agents

Autonomous agents are AI systems that can independently decompose goals,

plan actions, execute tools, and self-correct without constant human guidance.

The challenge isn't making them capable - it's making them reliable. Every

extra decision multiplies failure probability.

This skill covers agent loops (ReAct, Plan-Execute), goal decomposition,

reflection patterns, and production reliability. Key insight: compounding

error rates kill autonomous agents. A 95% success rate per step drops to

60% by step 10. Build for reliability first, autonomy second.

2025 lesson: The winners are constrained, domain-specific agents with clear

boundaries, not "autonomous everything." Treat AI outputs as proposals,

not truth.

Principles

Reliability over autonomy - every step compounds error probability

Constrain scope - domain-specific beats general-purpose

Treat outputs as proposals, not truth

Build guardrails before expanding capabilities

Human-in-the-loop for critical decisions is non-negotiable

Log everything - every action must be auditable

Fail safely with rollback, not silently with corruption

Capabilities

autonomous-agents

agent-loops

goal-decomposition

self-correction

reflection-patterns

react-pattern

plan-execute

agent-reliability

agent-guardrails

Scope

multi-agent-systems → multi-agent-orchestration

tool-building → agent-tool-builder

memory-systems → agent-memory-systems

workflow-orchestration → workflow-automation

Tooling

Frameworks

LangGraph - When: Production agents with state management Note: 1.0 released Oct 2025, checkpointing, human-in-loop

AutoGPT - When: Research/experimentation, open-ended exploration Note: Needs external guardrails for production

CrewAI - When: Role-based agent teams Note: Good for specialized agent collaboration

Claude Agent SDK - When: Anthropic ecosystem agents Note: Computer use, tool execution

Patterns

ReAct - When: Reasoning + Acting in alternating steps Note: Foundation for most modern agents

Plan-Execute - When: Separate planning from execution Note: Better for complex multi-step tasks

Reflection - When: Self-evaluation and correction Note: Evaluator-optimizer loop

Patterns

ReAct Agent Loop

Alternating reasoning and action steps

When to use: Interactive problem-solving, tool use, exploration

REACT PATTERN:

"""

The ReAct loop:

Thought: Reason about what to do next

Action: Choose and execute a tool

Observation: Receive result

Repeat until goal achieved

Key: Explicit reasoning traces make debugging possible

"""

Basic ReAct Implementation

"""

from langchain.agents import create_react_agent

from langchain_openai import ChatOpenAI

Define the ReAct prompt template

react_prompt = '''

Answer the question using the following format:

Question: the input question

Thought: reason about what to do

Action: tool_name

Action Input: input to the tool

Observation: result of the action

... (repeat Thought/Action/Observation as needed)

Thought: I now know the final answer

Final Answer: the answer

'''

Create the agent

agent = create_react_agent(

llm=ChatOpenAI(model="gpt-4o"),

tools=tools,

prompt=react_prompt,

)

Execute with step limit

result = agent.invoke(

{"input": query},

config={"max_iterations": 10} # Prevent runaway loops

)

"""

LangGraph ReAct (Production)

"""

from langgraph.prebuilt import create_react_agent

from langgraph.checkpoint.postgres import PostgresSaver

Production checkpointer

checkpointer = PostgresSaver.from_conn_string(

os.environ["POSTGRES_URL"]

)

agent = create_react_agent(

model=llm,

tools=tools,

checkpointer=checkpointer, # Durable state

)

Invoke with thread for state persistence

config = {"configurable": {"thread_id": "user-123"}}

result = agent.invoke({"messages": [query]}, config)

"""

Plan-Execute Pattern

Separate planning phase from execution

When to use: Complex multi-step tasks, when full plan visibility matters

PLAN-EXECUTE PATTERN:

"""

Two-phase approach:

Planning: Decompose goal into subtasks

Execution: Execute subtasks, potentially re-plan

Advantages:

Full visibility into plan before execution

Can validate/modify plan with human

Cleaner separation of concerns

Disadvantages:

Less adaptive to mid-task discoveries

Plan may become stale

"""

LangGraph Plan-Execute

"""

from langgraph.prebuilt import create_plan_and_execute_agent

Planner creates the task list

planner_prompt = '''

For the given objective, create a step-by-step plan.

Each step should be atomic and actionable.

Format: numbered list of steps.

'''

Executor handles individual steps

executor_prompt = '''

You are executing step {step_number} of the plan.

Previous results: {previous_results}

Current step: {current_step}

Execute this step using available tools.

'''

agent = create_plan_and_execute_agent(

planner=planner_llm,

executor=executor_llm,

tools=tools,

replan_on_error=True, # Re-plan if step fails

)

Human approval of plan

config = {

"configurable": {

"thread_id": "task-456",

"interrupt_before": ["execute"], # Pause before execution

}

First call creates plan

plan = agent.invoke({"objective": goal}, config)

Review plan, then continue

if human_approves(plan):

result = agent.invoke(None, config) # Continue from checkpoint

"""

Decomposition Strategies

"""

Decomposition-First: Plan everything, then execute

Best for: Stable tasks, need full plan approval

Interleaved: Plan one step, execute, repeat

Best for: Dynamic tasks, learning as you go

def interleaved_execute(goal, max_steps=10):

state = {"goal": goal, "completed": [], "remaining": [goal]}

for step in range(max_steps):

    # Plan next action based on current state

    next_action = planner.plan_next(state)

    if next_action == "DONE":

        break

    # Execute and update state

    result = executor.execute(next_action)

    state["completed"].append((next_action, result))

    # Re-evaluate remaining work

    state["remaining"] = planner.reassess(state)

return state

"""

Reflection Pattern

Self-evaluation and iterative improvement

When to use: Quality matters, complex outputs, creative tasks

REFLECTION PATTERN:

"""

Self-correction loop:

Generate initial output

Evaluate against criteria

Critique and identify issues

Refine based on critique

Repeat until satisfactory

Also called: Evaluator-Optimizer, Self-Critique

"""

Basic Reflection

"""

def reflect_and_improve(task, max_iterations=3):

Initial generation

output = generator.generate(task)

for i in range(max_iterations):

    # Evaluate output

    critique = evaluator.critique(

        task=task,

        output=output,

        criteria=[

            "Correctness",

            "Completeness",

            "Clarity",

        ]

    )

    if critique["passes_all"]:

        return output

    # Refine based on critique

    output = generator.refine(

        task=task,

        previous_output=output,

        critique=critique["feedback"],

    )

return output  # Best effort after max iterations

"""

LangGraph Reflection

"""

from langgraph.graph import StateGraph

def build_reflection_graph():

graph = StateGraph(ReflectionState)

# Nodes

graph.add_node("generate", generate_node)

graph.add_node("reflect", reflect_node)

graph.add_node("output", output_node)

# Edges

graph.add_edge("generate", "reflect")

graph.add_conditional_edges(

    "reflect",

    should_continue,

    {

        "continue": "generate",  # Loop back

        "end": "output",

    }

)

return graph.compile()

def should_continue(state):

if state["iteration"] >= 3:

return "end"

if state["score"] >= 0.9:

return "end"

return "continue"

"""

Separate Evaluator (More Robust)

"""

Use different model for evaluation to avoid self-bias

generator = ChatOpenAI(model="gpt-4o")

evaluator = ChatOpenAI(model="gpt-4o-mini") # Different perspective

Or use specialized evaluators

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="correctness")

"""

Guardrailed Autonomy

Constrained agents with safety boundaries

When to use: Production systems, critical operations

GUARDRAILED AUTONOMY:

"""

Production agents need multiple safety layers:

Input validation

Action constraints

Output validation

Cost limits

Human escalation

Rollback capability

"""

Multi-Layer Guardrails

"""

class GuardedAgent:

def init(self, agent, config):

self.agent = agent

self.max_cost = config.get("max_cost_usd", 1.0)

self.max_steps = config.get("max_steps", 10)

self.allowed_actions = config.get("allowed_actions", [])

self.require_approval = config.get("require_approval", [])

async def execute(self, goal):

    total_cost = 0

    steps = 0

    while steps < self.max_steps:

        # Get next action

        action = await self.agent.plan_next(goal)

        # Validate action is allowed

        if action.name not in self.allowed_actions:

            raise ActionNotAllowedError(action.name)

        # Check if approval needed

        if action.name in self.require_approval:

            approved = await self.request_human_approval(action)

            if not approved:

                return {"status": "rejected", "action": action}

        # Estimate cost

        estimated_cost = self.estimate_cost(action)

        if total_cost + estimated_cost > self.max_cost:

            raise CostLimitExceededError(total_cost)

        # Execute with rollback capability

        checkpoint = await self.save_checkpoint()

        try:

            result = await self.agent.execute(action)

            total_cost += self.actual_cost(action)

            steps += 1

        except Exception as e:

            await self.rollback_to(checkpoint)

            raise

        if result.is_complete:

            break

    return {"status": "complete", "total_cost": total_cost}

"""

Least Privilege Principle

"""

Define minimal permissions per task type

TASK_PERMISSIONS = {

"research": ["web_search", "read_file"],

"coding": ["read_file", "write_file", "run_tests"],

"admin": ["all"], # Rarely grant this

}

def create_scoped_agent(task_type):

allowed = TASK_PERMISSIONS.get(task_type, [])

tools = [t for t in ALL_TOOLS if t.name in allowed]

return Agent(tools=tools)

"""

Cost Control

"""

Context length grows quadratically in cost

Double context = 4x cost

def trim_context(messages, max_tokens=4000):

Keep system message and recent messages

system = messages[0]

recent = messages[-10:]

# Summarize middle if needed

if len(messages) > 11:

    middle = messages[1:-10]

    summary = summarize(middle)

    return [system, summary] + recent

return messages

"""

Durable Execution Pattern

Agents that survive failures and resume

When to use: Long-running tasks, production systems, multi-day processes

DURABLE EXECUTION:

"""

Production agents must:

Survive server restarts

Resume from exact point of failure

Handle hours/days of runtime

Allow human intervention mid-process

LangGraph 1.0 provides this natively.

"""

LangGraph Checkpointing

"""

from langgraph.checkpoint.postgres import PostgresSaver

from langgraph.graph import StateGraph

Production checkpointer (not MemorySaver!)

checkpointer = PostgresSaver.from_conn_string(

os.environ["POSTGRES_URL"]

)

Build graph with checkpointing

graph = StateGraph(AgentState)

... add nodes and edges ...

agent = graph.compile(checkpointer=checkpointer)

Each invocation saves state

config = {"configurable": {"thread_id": "long-task-789"}}

Start task

agent.invoke({"goal": complex_goal}, config)

If server dies, resume later:

state = agent.get_state(config)

if not state.is_complete:

agent.invoke(None, config) # Continues from checkpoint

"""

Human-in-the-Loop Interrupts

"""

Pause at specific nodes

agent = graph.compile(

checkpointer=checkpointer,

interrupt_before=["critical_action"], # Pause before

interrupt_after=["validation"], # Pause after

)

First invocation pauses at interrupt

result = agent.invoke({"goal": goal}, config)

Human reviews state

state = agent.get_state(config)

if human_approves(state):

Continue from pause point

agent.invoke(None, config)

else:

Modify state and continue

agent.update_state(config, {"approved": False})

agent.invoke(None, config)

"""

Time-Travel Debugging

"""

LangGraph stores full history

history = list(agent.get_state_history(config))

Go back to any previous state

past_state = history[5]

agent.update_state(config, past_state.values)

Replay from that point with modifications

agent.invoke(None, config)

"""

Sharp Edges

Error Probability Compounds Exponentially

Severity: CRITICAL

Situation: Building multi-step autonomous agents

Symptoms:

Agent works in demos but fails in production. Simple tasks succeed,

complex tasks fail mysteriously. Success rate drops dramatically

as task complexity increases. Users lose trust.

Why this breaks:

Each step has independent failure probability. A 95% success rate

per step sounds great until you realize:

5 steps: 77% success (0.95^5)

10 steps: 60% success (0.95^10)

20 steps: 36% success (0.95^20)

This is the fundamental limit of autonomous agents. Every additional

step multiplies failure probability.

Recommended fix:

Reduce step count

Combine steps where possible

Prefer fewer, more capable steps over many small ones

Increase per-step reliability

Use structured outputs (JSON schemas)

Add validation at each step

Use better models for critical steps

Design for failure

class RobustAgent:

def execute_with_retry(self, step, max_retries=3):

for attempt in range(max_retries):

try:

result = step.execute()

if self.validate(result):

return result

except Exception as e:

if attempt == max_retries - 1:

raise

self.log_retry(step, attempt, e)

Break into checkpointed segments

Human review at each segment

Resume from last good checkpoint

API Costs Explode with Context Growth

Severity: CRITICAL

Situation: Running agents with growing conversation context

Symptoms:

$47 to close a single support ticket. Thousands in surprise API bills.

Agents getting slower as they run longer. Token counts exceeding

model limits.

Why this breaks:

Transformer costs scale quadratically with context length. Double

the context, quadruple the compute. A long-running agent that

re-sends its full conversation each turn can burn money exponentially.

Most agents append to context without trimming. Context grows:

Turn 1: 500 tokens → $0.01

Turn 10: 5000 tokens → $0.10

Turn 50: 25000 tokens → $0.50

Turn 100: 50000 tokens → $1.00+ per message

Recommended fix:

Set hard cost limits

class CostLimitedAgent:

MAX_COST_PER_TASK = 1.00 # USD

def __init__(self):

    self.total_cost = 0

def before_call(self, estimated_tokens):

    estimated_cost = self.estimate_cost(estimated_tokens)

    if self.total_cost + estimated_cost > self.MAX_COST_PER_TASK:

        raise CostLimitExceeded(

            f"Would exceed ${self.MAX_COST_PER_TASK} limit"

        )

def after_call(self, response):

    self.total_cost += self.calculate_actual_cost(response)

Trim context aggressively

def trim_context(messages, max_tokens=4000):

Keep: system prompt + last N messages

Summarize: everything in between

if count_tokens(messages) <= max_tokens:

return messages

system = messages[0]

recent = messages[-5:]

middle = messages[1:-5]

if middle:

    summary = summarize(middle)  # Compress history

    return [system, summary] + recent

return [system] + recent

Use streaming to track costs in real-time

Alert at 50% of budget, halt at 90%

Demo Works But Production Fails

Severity: CRITICAL

Situation: Moving from prototype to production

Symptoms:

Impressive demo to stakeholders. Months of failure in production.

Works for the founder's use case, fails for real users. Edge cases

overwhelm the system.

Why this breaks:

Demos show the happy path with curated inputs. Production means:

Unexpected inputs (typos, ambiguity, adversarial)

Scale (1000 users, not 3)

Reliability (99.9% uptime, not "usually works")

Edge cases (the 1% that breaks everything)

The methodology is questionable, but the core problem is real.

The gap between a working demo and a reliable production system

is where projects die.

Recommended fix:

Test at scale before production

Run 1000+ test cases, not 10

Measure P95/P99 success rate, not average

Include adversarial inputs

Build observability first

import structlog

logger = structlog.get_logger()

class ObservableAgent:

def execute(self, task):

with logger.bind(task_id=task.id):

logger.info("task_started")

try:

result = self._execute(task)

logger.info("task_completed", result=result)

return result

except Exception as e:

logger.error("task_failed", error=str(e))

raise

Have escape hatches

Human takeover when confidence multi-agent-orchestration (Multiple agents working together)

user needs to test/evaluate agent -> agent-evaluation (Benchmarking and testing)

user needs tools for agent -> agent-tool-builder (Tool design and implementation)

user needs persistent memory -> agent-memory-systems (Long-term memory architecture)

user needs workflow automation -> workflow-automation (When agent is overkill for the task)

user needs computer control -> computer-use-agents (GUI automation, screen interaction)

Related Skills

Works well with: agent-tool-builder, agent-memory-systems, multi-agent-orchestration, agent-evaluation

When to Use

User mentions or implies: autonomous agent

User mentions or implies: autogpt

User mentions or implies: babyagi

User mentions or implies: self-prompting

User mentions or implies: goal decomposition

User mentions or implies: react pattern

User mentions or implies: agent loop

User mentions or implies: self-correcting agent

User mentions or implies: reflection agent

User mentions or implies: langgraph

User mentions or implies: agentic ai

User mentions or implies: agent planning

Limitations

Use this skill only when the task clearly matches the scope described above.

Do not treat the output as a substitute for environment-specific validation, testing, or expert review.

Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

autonomous-agents

SKILL.md

Autonomous Agents

Principles

Capabilities

Scope

Tooling

Frameworks

Patterns

Patterns

ReAct Agent Loop

REACT PATTERN:

Basic ReAct Implementation

Define the ReAct prompt template

Create the agent

Execute with step limit

LangGraph ReAct (Production)

Production checkpointer

Invoke with thread for state persistence

Plan-Execute Pattern

PLAN-EXECUTE PATTERN:

LangGraph Plan-Execute

Planner creates the task list

Executor handles individual steps

Human approval of plan

First call creates plan

Review plan, then continue

Decomposition Strategies

Decomposition-First: Plan everything, then execute

Best for: Stable tasks, need full plan approval

Interleaved: Plan one step, execute, repeat

Best for: Dynamic tasks, learning as you go

Reflection Pattern

REFLECTION PATTERN:

Basic Reflection

Initial generation

LangGraph Reflection

Separate Evaluator (More Robust)

Use different model for evaluation to avoid self-bias

Or use specialized evaluators

Guardrailed Autonomy

GUARDRAILED AUTONOMY:

Multi-Layer Guardrails

Least Privilege Principle

Define minimal permissions per task type

Cost Control

Context length grows quadratically in cost

Double context = 4x cost

Keep system message and recent messages

Durable Execution Pattern

DURABLE EXECUTION:

LangGraph Checkpointing

Production checkpointer (not MemorySaver!)

Build graph with checkpointing

... add nodes and edges ...

Each invocation saves state

Start task

If server dies, resume later:

Human-in-the-Loop Interrupts

Pause at specific nodes

First invocation pauses at interrupt

Human reviews state

Continue from pause point

Modify state and continue

Time-Travel Debugging

LangGraph stores full history

Go back to any previous state

Replay from that point with modifications

Sharp Edges

Error Probability Compounds Exponentially

Reduce step count

Combine steps where possible

Prefer fewer, more capable steps over many small ones

Increase per-step reliability

Use structured outputs (JSON schemas)

Add validation at each step

Use better models for critical steps

Design for failure

Break into checkpointed segments

Human review at each segment