P2 methodology Incident LogAI EngineeringSynapse

After an AI Agent Deleted My Files: A Painful Retro on Execution Boundaries

After an AI Agent Deleted My Files: A Painful Retro on Execution Boundaries

It happened on an ordinary Wednesday afternoon. I asked Claude Code to “clean up the temporary files in the project.” What I meant was: get rid of the .tmp files, the __pycache__ directories, that kind of thing. The agent did the job thoroughly — and along the way, it also wiped out an uncommitted config directory I had been working on. Three days of hand-tuned parameters, no backup, no git history. Gone.

My first reaction was to blame myself. My instructions weren’t precise enough. But once I cooled down, I realized this wasn’t just an “expression problem.” It was a systemic design flaw. I had handed a task with destructive potential to a system that has remarkable execution capability but no built-in instinct to stop and confirm.

Why AI Agents Are Especially Prone to Overreach

When a traditional script performs a dangerous operation, you’d at least habitually pass -n for a dry run first. AI Agents are different — their execution path is generated dynamically. The moment you fire off an instruction, you don’t actually know which files it will touch. Worse, large language models naturally tend toward “completing the task” rather than “confirming I understood correctly before completing the task.” That isn’t a model bug, it’s a side effect of RLHF training: the models that get praised by users are the ones that talk less and act more.

I later went back through the tool-call log of that conversation. Before deleting anything, the agent had run a find command, gathered the target list, and then went straight to rm -rf. No pause in between. No “I found these directories, confirm deletion?” output. The whole thing took under five seconds. Speed itself is a risk — when execution outpaces human reaction time, every confirmation mechanism becomes ornamental.

The Root Cause Isn’t “AI Is Stupid” — It’s That No Boundary Was Drawn

After the incident, the most useful thing I did was classify every operation by reversibility. File reads, code searches, status queries — these can run automatically; if they fail, just retry. File writes, config changes — risk is manageable, but a diff confirmation is preferable. File deletion, database operations, calls to external services — these must pause and wait for human confirmation, no matter how confident the agent is.

The classification sounds trivial, but it has to be written explicitly into your agent configuration, not left to the model to “just figure out.” How I eventually landed it in my own engineering system: destructive operations go on a blacklist that the main conversation context cannot directly invoke. They have to be routed through a controlled sub-agent, and that sub-agent is required to print an operation manifest before executing. It’s not the most efficient setup, but it’s the best one I’ve found at preventing “five seconds wipes out three days of work.”

Auto Mode Is a Double-Edged Sword

Claude Code’s Auto Mode lets the agent run continuously without interrupting you. For repetitive, low-risk work — running tests, generating reports, formatting code — flipping it on is enormously valuable. But Auto Mode has no built-in “danger operation escalation” mechanism. Once you authorize it, that authorization persists for anything the agent thinks aligns with your intent.

My current strategy: Auto Mode only on tasks where I’ve explicitly scoped the execution range. “Process files under ./reports/” — yes. “Process the project files” — no. The vaguer the scope, the more dangerous Auto Mode becomes. This isn’t a profound principle; it’s just least-privilege — a principle that traditional security has used for decades and that applies even more in the AI Agent era, because you can’t audit the agent’s plan up front.

A Few Principles You Can Use Directly

One: classify operations rigorously, not by feel. Enumerate every operation an agent might perform in your workflow, and bucket them by “can the action be undone?” into three tiers: safe, manageable, irreversible. Irreversible operations require human confirmation, full stop, regardless of agent confidence.

Two: print the operation manifest before execution, not the result after. For any task that involves writes or deletes, force the agent to list what it intends to do before doing it. That gives you a window to intervene. The “act first, report later” pattern is a known high-risk pattern for AI Agents.

Three: tool whitelists are more reliable than prompt-level constraints. Controlling tool permissions at the harness layer beats relying on a “please be careful” line in the prompt. LLMs forget context; tool-call permissions don’t. If an agent doesn’t have access to rm, it cannot accidentally delete files.

Four: audit logs aren’t post-incident comfort, they’re pre-incident design. Add audit logging on day one of building your agent system — caller, timestamp, parameters, every tool call. By the time an incident makes you want logs, you probably no longer know what the agent did.

I didn’t disable my AI Agent after this incident, because the time it saved me far exceeds three days of work. But my trust in it shifted from “general trust” to “controlled trust” — trusting its execution while constraining its execution boundary through institutional design. That distinction is what lets me sleep at night.

A Reusable Pattern

If you’re building an AI engineering team, take a look at our open-source Synapse framework. It’s the Agent Harness system I distilled out of falling into pits exactly like this one. The core idea is execution-chain constraints and permission tiering — encoding “what AI should and shouldn’t do” into config, instead of praying through prompt rewrites every time. Repo: github.com/[your-org]/synapse-mini.