Agentic AI in Production

Cognition Devin

Persistent workspace · multi-hour coding

Problem

LLMs can write functions but cannot maintain state across hours of debugging

Software engineering tasks require multi-step reasoning, tool use, and iterative problem-solving over hours or days. Traditional LLMs can't maintain context, execute code, debug errors, or autonomously complete complex tasks without constant human guidance. Engineers spend significant time on repetitive debugging, refactoring, and boilerplate that requires understanding code, forming hypotheses about bugs, and testing solutions iteratively.

Scale

Tasks completed: 10K+
Code repos: 1K+
Tools available: 50+
SWE-bench pass: 14%
Avg task duration: 30–120 min
User sessions: 100K+

Solution

ReAct loop + persistent workspace + Docker sandbox

Devin combines a ReAct (Reason-Act-Observe) loop with 50+ tools, persistent workspaces that maintain files and state, and sandboxed Docker execution to safely run untrusted code. The agent decomposes tasks into sub-goals with verification criteria, maintains 200K+ token context with intelligent pruning, and checkpoints every 5 minutes for recovery.

GPT-4 (planning + reasoning) · Custom Docker sandbox · LangGraph (orchestration) · Playwright (browser) · GitHub API · Terminal emulation · Vector DB (code search)

ReAct loop: Reason → Act → Observe cycle for autonomous execution
Long-term planning: decompose tasks into multi-hour plans with sub-goals
Tool orchestration: 50+ tools (code, terminal, browser, git)
Sandboxed execution: isolated Docker for running untrusted code
Checkpoint/resume: save state every 5 min to recover from failures
Human-in-the-loop: request clarification when stuck or uncertain
Observability: full transcript of agent reasoning and actions
Self-debugging: detect → read logs → form hypothesis → fix (about 80% of errors resolved without human help)
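
The "sub-goals with verification criteria" idea can be sketched in a few lines. This is a minimal illustration, not Devin's implementation: `SubGoal`, `run_plan`, and the `max_attempts` parameter are hypothetical names, and the workspace is assumed to be a plain dict.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubGoal:
    description: str
    action: Callable[[dict], None]   # mutates the shared workspace dict
    verify: Callable[[dict], bool]   # True once the sub-goal is met


def run_plan(plan: List[SubGoal], workspace: dict, max_attempts: int = 3) -> List[str]:
    """Run sub-goals in order; stop at the first one that cannot be verified."""
    log = []
    for goal in plan:
        for attempt in range(1, max_attempts + 1):
            goal.action(workspace)
            if goal.verify(workspace):
                log.append(f"ok: {goal.description} (attempt {attempt})")
                break
        else:
            # Every attempt failed verification: hand off instead of continuing.
            log.append(f"needs human: {goal.description}")
            break
    return log


plan = [
    SubGoal("write module",
            lambda ws: ws.update(code="def add(a, b): return a + b"),
            lambda ws: "def add" in ws.get("code", "")),
    SubGoal("run tests",
            lambda ws: ws.update(tests_passed=True),
            lambda ws: ws.get("tests_passed") is True),
]
print(run_plan(plan, {}))
```

Each sub-goal carries its own check, so the loop knows when a step is done and when to escalate, rather than relying on the model's own claim of success.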

System architecture

[Diagrams omitted: service topology, data flow, and request sequence]

Code

Python · devin_react_loop.py · ReAct loop with 50+ tools, checkpointing, self-debugging

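# Uses LangChain's legacy initialize_agent API (deprecated in recent releases).
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_openai import ChatOpenAI

# PersistentWorkspace and Checkpointer are project-specific helpers not shown here.

class DevinAgent:
    """ReAct loop: Reason → Act → Observe. Maintains state across hours."""

    def __init__(self):
        self.tools = self.initialize_tools()
        self.agent = initialize_agent(
            tools=self.tools,
            llm=ChatOpenAI(model="gpt-4"),                # an LLM object, not a model-name string
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # ReAct-style agent
            max_iterations=500,      # cap runaway loops
            early_stopping_method="generate",
        )
        self.workspace = PersistentWorkspace()
        self.checkpointer = Checkpointer(interval_seconds=300)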

    def initialize_tools(self):
        return [
            Tool(name="code_editor", func=self.edit_code,
                 description="Read, write, or modify code files"),
            Tool(name="terminal", func=self.run_command,
                 description="Execute shell commands in sandbox"),
            Tool(name="browser", func=self.browse_web,
                 description="Navigate web pages, read documentation"),
            Tool(name="git", func=self.git_operation,
                 description="Git operations: commit, branch, PR"),
        ]

    def execute_task(self, task_description):
        """Execute multi-hour task with checkpointing."""
        plan = self.create_plan(task_description)

        for i, sub_task in enumerate(plan):
            print(f"[{i+1}/{len(plan)}]: {sub_task}")
            try:
                result = self.agent.run(sub_task)
                self.checkpointer.save({
                    'task': task_description,
                    'completed_subtasks': i + 1,
                    'workspace': self.workspace.state,
                    'result': result,
                })
            except Exception as e:
                print(f"Error: {e}. Self-debugging…")
                if not self.self_debug(sub_task, e):
                    self.request_human_help(sub_task, e)

        return self.workspace.state

    def self_debug(self, task, error):
        """Read logs, form hypothesis, try fixes."""
        debug_prompt = f"""
Task: {task}
Error: {error}

Analyze and attempt to fix:
1. Read relevant logs
2. Form hypothesis about root cause
3. Propose and test fix
4. Verify solution
"""
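        # agent.run() returns a non-empty string even when the fix fails, so
        # report success only if the debugging run itself did not raise.
        try:
            self.agent.run(debug_prompt)
            return True
        except Exception:
            return False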
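
The listing above relies on a `Checkpointer` that is not shown. One way such a helper could work is sketched below, assuming the state is JSON-serializable and stored in a single local file; `FileCheckpointer`, `maybe_save`, and `load` are hypothetical names, not part of any library.

```python
import json
import os
import tempfile
import time


class FileCheckpointer:
    """Write state to disk at most once per interval; reload it after a crash."""

    def __init__(self, path, interval_seconds=300):
        self.path = path
        self.interval_seconds = interval_seconds
        self._last_save = None  # monotonic time of the last write

    def maybe_save(self, state, now=None):
        """Save if the interval has elapsed. Returns True when a write happened."""
        now = time.monotonic() if now is None else now
        if self._last_save is not None and now - self._last_save < self.interval_seconds:
            return False
        # Write to a temp file, then atomically swap it in, so a crash
        # mid-write never leaves a truncated checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)
        self._last_save = now
        return True

    def load(self):
        """Return the last saved state, or None if no checkpoint exists."""
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)
```

On restart, the agent would call `load()` and skip the sub-tasks already counted in `completed_subtasks`.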

Python · sandbox.py · Docker isolation: 2GB RAM, 50% CPU, network whitelisted

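import io
import os
import posixpath
import tarfile
import tempfile

import docker

class SandboxedExecutor:
    """Isolated environment that protects the host system."""

    def __init__(self, workspace_path):
        self.client = docker.from_env()
        self.container = None
        # Host directory mounted at /workspace inside the container.
        self.workspace_path = os.path.abspath(workspace_path)

    def create_sandbox(self):
        self.container = self.client.containers.run(
            "python:3.11-slim",
            command="sleep infinity",   # keep container alive
            detach=True,
            mem_limit="2g",             # 2 GB memory cap
            cpu_quota=50000,            # 50% of 1 core (default cpu_period is 100000)
            # "bridge" is Docker's default network and still allows outbound
            # traffic. A whitelist needs extra firewall or proxy rules;
            # network_mode="none" disables networking entirely.
            network_mode="bridge",
            remove=True,                # auto-cleanup on stop
            volumes={
                self.workspace_path: {'bind': '/workspace', 'mode': 'rw'},
            },
        )

    def copy_to_container(self, src_path, dest_path):
        """Copy one host file into the container via an in-memory tar archive."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode='w') as tar:
            tar.add(src_path, arcname=posixpath.basename(dest_path))
        self.container.put_archive(posixpath.dirname(dest_path), buf.getvalue())

    def execute_code(self, code, timeout=30):
        with tempfile.NamedTemporaryFile(
            mode='w', suffix='.py', delete=False,
        ) as f:
            f.write(code)
            code_file = f.name

        try:
            self.copy_to_container(code_file, '/tmp/code.py')
            exit_code, output = self.container.exec_run(
                # coreutils `timeout` enforces the limit inside the container
                # and exits with status 124 when the limit is hit.
                cmd=f"timeout {int(timeout)} python /tmp/code.py",
                workdir="/workspace",
                demux=True,             # return stdout and stderr separately
                stream=False,
            )
            stdout, stderr = output
            return {
                'exit_code': exit_code,
                'stdout': stdout.decode() if stdout else '',
                'stderr': stderr.decode() if stderr else '',
                'success': exit_code == 0,
            }
        except Exception as e:
            return {'exit_code': -1, 'stdout': '', 'stderr': str(e),
                    'success': False}
        finally:
            os.remove(code_file)        # always delete the host-side temp file

    def cleanup(self):
        if self.container:
            self.container.stop()       # remove=True deletes it once stopped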

Outcomes

Business outcomes

14% SWE-bench pass rate (vs 2% for GPT-4 alone)
Completed real-world Upwork freelance tasks
Reduced engineer time on repetitive work by 60%
Enabled non-engineers to build functional prototypes

Technical outcomes

Avg completion: 45 min (vs 2–3 h for a human engineer)
Maintains context across 8+ hour debugging sessions
Self-debugs errors in 80% of cases without human help
Handles end-to-end: bug report → fix → PR → tests

Impact

From 2% to 14%: the first AI that debugs like a senior engineer

Devin completed real-world Upwork tasks and cut engineering time on repetitive work by 60%, proving autonomous multi-hour AI coding is viable.

Takeaways

Long-term planning is hard. Break tasks into 5–15 min sub-tasks with clear verification.
Observability is critical. Show full reasoning so users understand and can intervene.
Sandboxing prevents disasters. Isolate code execution to prevent host damage or data leaks.
Error recovery > avoiding errors. Agents will make mistakes — design for quick recovery.
Human-in-the-loop for high stakes. Let the agent work autonomously but pause for destructive actions.
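
The last takeaway, pausing for destructive actions, can be approximated with a simple gate in front of the terminal tool. The sketch below is an assumption-laden illustration: the pattern list is deliberately short and far from exhaustive, and `needs_approval` and `run_with_gate` are hypothetical names.

```python
import re

# Illustrative patterns only; a real deployment would need a broader,
# reviewed list (or an allowlist of safe commands instead).
DESTRUCTIVE_PATTERNS = [
    re.compile(r"^\s*rm\s"),
    re.compile(r"^\s*git\s+push\s.*--force"),
    re.compile(r"^\s*git\s+reset\s+--hard"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
]


def needs_approval(command: str) -> bool:
    """True if the command matches any known-destructive pattern."""
    return any(p.search(command) for p in DESTRUCTIVE_PATTERNS)


def run_with_gate(command, execute, approve):
    """Run `command` via `execute`, asking `approve` first when it looks destructive."""
    if needs_approval(command) and not approve(command):
        return {"status": "blocked", "command": command}
    return {"status": "ran", "output": execute(command)}
```

Here `execute` would be the sandboxed terminal tool and `approve` a callback that prompts the user; everything that does not match a pattern runs autonomously.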

AutoGPT

Self-prompting · open-source pioneer

Problem

Chaining LLM prompts required humans to decompose every task

LLMs are powerful, but using them for multi-step work means a human has to decompose the goal, supply context, write step-by-step instructions, and manually pass outputs from one prompt to the next. Users wanted a 'just do it for me' experience in which the AI autonomously researches, plans, executes, and adapts without constant prompting.

Scale

GitHub stars: 160K+
Community: 50K+ members
Agent runs: 1M+
Plugins: 100+
Max steps/run: 500+
License: Open source

Solution

Self-prompting loop + hybrid memory + plugin ecosystem

AutoGPT pioneered the self-prompting architecture, in which the agent generates its own next action from the goal and its progress so far, with no human-provided task list. It combines short-term memory (the last N messages) with long-term semantic search over a vector DB (Pinecone or ChromaDB), which enables multi-session learning. A plugin system with 100+ community tools makes the agent extensible without modifying the core.

GPT-4 (planning) · GPT-3.5 Turbo (execution) · Pinecone (memory) · ChromaDB (local memory) · Selenium (browser) · Python plugins · File system access

Goal-driven architecture: user provides high-level goal, agent decomposes
Memory system: short-term (last N) + long-term (vector DB)
Self-prompting loop: agent generates own next action
Plugin system: extensible tools via Python plugins
Self-evaluation: assess action success, adjust strategy
Resource management: token budget + API rate limits + max iterations
User confirmation: optional approval mode for destructive actions
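
The resource-management item (token budget plus max iterations) might look like the following. This is a sketch under assumptions: `RunBudget` and `BudgetExceeded` are hypothetical names, and the token count per step is assumed to come from the model API's usage report.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its iteration or token cap."""


@dataclass
class RunBudget:
    max_iterations: int = 100
    max_tokens: int = 200_000
    iterations: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> None:
        """Record one loop step; raise once either cap is exceeded."""
        self.iterations += 1
        self.tokens += tokens_used
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"iteration cap {self.max_iterations} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap {self.max_tokens} exceeded")
```

Calling `budget.charge(...)` once per loop iteration turns a runaway agent into a caught exception instead of an open-ended bill.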

System architecture

[Diagrams omitted: service topology, data flow, and request sequence]

Code

Python · self_prompting_loop.py · Agent generates its own next prompt, with no human task list

import time
import uuid

import chromadb

class AutoGPTAgent:
    """Self-prompting loop. Agent generates own next action.

    Helper methods (create_plan, get_context, call_gpt4, parse_action,
    execute_action, is_goal_complete) are not shown here.
    """

    def __init__(self, goal, max_iterations=100):
        self.goal = goal
        self.max_iterations = max_iterations
        self.memory = []                  # short-term
        # long-term: a Chroma collection on an in-memory client
        self.vector_db = chromadb.Client().get_or_create_collection("agent_memory")

    def run(self):
        plan = self.create_plan(self.goal)

        for iteration in range(self.max_iterations):
            context = self.get_context(plan)
            prompt = self.build_prompt(context, plan)
            response = self.call_gpt4(prompt)

            action = self.parse_action(response)
            result = self.execute_action(action)

            self.memory.append({
                'action': action, 'result': result,
                'timestamp': time.time(),
            })
            # Chroma needs a unique id per stored document.
            self.vector_db.add(
                documents=[str(result)],
                ids=[f"step_{uuid.uuid4().hex}"],
            )

            if self.is_goal_complete(result, plan):
                return {"status": "success", "result": result}

            if len(self.memory) > 10:    # prune short-term
                self.memory = self.memory[-10:]

        return {"status": "max_iterations_reached"}

    def build_prompt(self, context, plan):
        return f"""
Goal: {self.goal}
Plan: {plan}
Current Progress: {context}

What should you do next to make progress toward the goal?
Respond with a specific action (e.g., search_web, write_code, read_file).
"""

Python · hybrid_memory.py · Short-term (last 10) + long-term (semantic vector retrieval)

import time
import uuid

class HybridMemorySystem:
    """Short-term for immediate context, long-term for semantic retrieval."""

    def __init__(self, chroma_collection):
        self.short_term = []
        self.long_term = chroma_collection   # a chromadb Collection
        self.max_short_term = 10

    def add(self, action, result):
        entry = {
            'action': action, 'result': result,
            'timestamp': time.time(),
        }
        self.short_term.append(entry)
        if len(self.short_term) > self.max_short_term:
            self.short_term.pop(0)           # drop the oldest entry

        self.long_term.add(
            documents=[f"{action}: {result}"],
            metadatas=[{"type": "agent_action",
                        "timestamp": entry['timestamp']}],
            # uuid avoids id collisions when two adds share a timestamp
            ids=[f"action_{uuid.uuid4().hex}"],
        )

    def get_context(self, current_goal):
        relevant = self.long_term.query(
            query_texts=[current_goal], n_results=5,
        )
        return {
            'recent_actions': self.short_term.copy(),
            'relevant_past':  relevant['documents'][0],
        }

Python · plugins.py · Plugin system: extend without modifying core

class PluginSystem:
    """Extensible tool execution — community adds capabilities."""

    def __init__(self):
        self.plugins = {}
        self.load_plugins()

    def load_plugins(self):
        self.plugins['web_search']     = GoogleSearchPlugin()
        self.plugins['browse_web']     = SeleniumBrowserPlugin()
        self.plugins['write_file']     = FileSystemPlugin()
        self.plugins['execute_python'] = PythonExecutorPlugin()
        self.plugins['read_file']      = FileSystemPlugin()
        # ... 100+ community plugins

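    def execute(self, action):
        plugin_name = action['command']
        args = action.get('args', {})   # tolerate actions with no arguments

        if plugin_name not in self.plugins:
            # Same shape as the other failure path, so callers can check "success".
            return {"success": False, "error": f"Unknown command: {plugin_name}"}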

        try:
            result = self.plugins[plugin_name].execute(**args)
            return {"success": True, "result": result}
        except Exception as e:
            return {"success": False, "error": str(e)}


class GoogleSearchPlugin:
    def execute(self, query, num_results=10):
        results = google_search(query, num_results)
        return [{"title": r.title, "url": r.url, "snippet": r.snippet}
                for r in results]

Outcomes

Business outcomes

Pioneered the autonomous agent paradigm (March 2023)
Inspired BabyAGI, AgentGPT, and dozens of frameworks
Demonstrated feasibility of self-directed AI systems
Created blueprint for modern agent architectures

Technical outcomes

Completion: ~30% on simple, ~5% on complex (unassisted)
Runs for 100+ steps without human intervention
Memory enables multi-session tasks
Plugin ecosystem: 100+ community-built tools

Impact

The agent framework that inspired a thousand startups

AutoGPT pioneered autonomous AI agents in March 2023, demonstrating self-directed task decomposition and becoming a template for many of the agent frameworks that followed.

Takeaways

Unbounded loops are dangerous. Set hard limits on iterations and API calls.
Short-term memory fills fast. Aggressively prune or move to vector DB after N messages.
Evaluation is hard. How do you know if an agent succeeded? Need concrete verification.
Plugins expand capabilities but increase unpredictability. More tools = more failure modes.
User goals must be concrete. Vague goals lead to infinite loops; specific goals work.

Common pitfalls

The mistakes both teams hit on the road from "it works on my laptop" to "it runs the business."

Unsandboxed code execution

Problem

Early versions ran code directly in the user's environment. Agents deleted files, exposed API keys in logs, crashed systems, and corrupted data. This destroyed user trust and created legal liability.

Solution

Mandatory Docker sandboxing with resource limits (2GB RAM, 50% CPU), isolated filesystem, whitelisted network. All code execution must be sandboxed.

Impact

Zero host system damage reported post-sandboxing. Users can confidently let agents experiment.

Unbounded iteration loops

Problem

No hard limits on reasoning iterations. Agents got stuck trying the same failed approach 200+ times or endlessly researching. Costs spiked to $500+ per task.

Solution

Hard iteration limits (500 max for Devin, 100–500 for AutoGPT). Track repeated actions: after N failed attempts, force human help. Add token budget monitoring.
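
The "track repeated actions" part of this fix can be sketched as a small guard. It assumes actions are JSON-serializable dicts; `LoopGuard` and its `max_repeats` threshold are hypothetical names standing in for the "N failed attempts" above.

```python
import json
from collections import Counter


class LoopGuard:
    """Count failures per distinct action and signal when to escalate."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.failures = Counter()

    @staticmethod
    def _key(action: dict) -> str:
        # Canonical JSON so that key order does not hide a repeat.
        return json.dumps(action, sort_keys=True)

    def record(self, action: dict, succeeded: bool) -> bool:
        """Return True when the agent should stop and ask a human for help."""
        key = self._key(action)
        if succeeded:
            self.failures.pop(key, None)   # a success resets the count
            return False
        self.failures[key] += 1
        return self.failures[key] >= self.max_repeats
```

The agent loop would call `record(action, succeeded)` after each step and route to the human-help path as soon as it returns True, instead of retrying the same failing action until the iteration cap is hit.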

Impact

AutoGPT cost: $200+/run → <$20. Devin capped at $50/task max. Agents learned to ask for help.

Vague user goals with no success criteria

Problem

AutoGPT failed most often on vague goals like 'make me money.' Without concrete success criteria, an agent loops forever because it never knows when it is done.

Solution

Require concrete, measurable goals with explicit success criteria. Example: 'Research 10 AI startups and write a 1000-word report.' Reject vague goals; ask user to clarify.
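
A first-pass filter for vague goals can be as simple as the heuristic below. It is a rough sketch with loud assumptions: the two checks and the deliverable word list are illustrative guesses, `missing_criteria` is a hypothetical name, and a production system would more likely ask the model itself to judge the goal.

```python
import re

# Illustrative word list only; extend it for the deliverables you support.
DELIVERABLE_WORDS = r"\b(write|report|list|file|summary|table|csv|script)\b"


def missing_criteria(goal: str) -> list:
    """Return the reasons a goal looks too vague; an empty list means it passes."""
    problems = []
    if not re.search(r"\d", goal):
        problems.append("no measurable quantity (count, length, deadline)")
    if not re.search(DELIVERABLE_WORDS, goal, re.IGNORECASE):
        problems.append("no concrete deliverable")
    return problems


print(missing_criteria("make me money"))
print(missing_criteria("Research 10 AI startups and write a 1000-word report"))
```

When the list is non-empty, the agent can return it to the user as the clarifying question instead of starting a run it cannot finish.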

Impact

Success rate: 5% on vague goals → 30% on specific goals (6× improvement).

Build it, don't just read about it

Build your own agent

Devin and AutoGPT both came down to the same primitives: a planning loop, bounded iteration, a sandbox, hybrid memory, and a graceful human handoff. Those are the parts you need to wire up — and they are the parts every agent framework you build on top of will keep getting wrong unless you understand them.

Our Agentic AI module covers the full stack: ReAct vs self-prompting, planning + sub-task decomposition, the Docker sandbox pattern, short-term + vector memory, iteration + cost caps, and the observability transcript that makes agents trustable.
