P1 intelligence-evolution Multi-AgentAI EngineeringSynapseCapability Audit

How to Keep a Multi-Agent System From Quietly Rotting

The quiet death of a multi-agent system

Six months ago I stood up a fourteen-person AI execution team. Every agent had a clear function — one ran intelligence collection, another did code work, another maintained the knowledge base. With initial configuration done, the whole thing ran smoothly: the principal issued instructions, our AI CEO Lysander triaged and dispatched, specialist agents executed and delivered, Slack pushed the results.

Then, in month three, something started going wrong.

The intelligence agent began citing sources that were stale. The code agent had no idea about the new framework versions we had migrated to. The harness rule file had ballooned to five hundred lines of contradictory constraints. The most insidious part: nothing in the system was telling me it was degrading. AI teams do not file complaints about their own decay. They keep running the rules they were given, until a task fails badly enough to expose the rot.

This is the central death trap of multi-agent systems: building one is easy, evolving one is hard. You spend real effort on the initial configuration and then assume the system stays at that quality level. It does not. Without explicit evolution mechanics, time turns it into technical debt.

What follows is the three-mechanism framework we ended up building inside Synapse to fight this. Each one is small. Together they are the difference between a multi-agent system that compounds and one that decays.

Mechanism one: capability-card audits on a cycle

We maintain a capability card for every agent in obs/01-team-knowledge/HR/personnel/. The card is not a static résumé. It is a timestamped snapshot: declared skills, recent task types executed, historical QA scores, and one critical field — last verified date.

Every twenty-one days, our harness_engineer triggers an audit_harness() function that scans every card. The audit logic is deliberately blunt: if an agent has not executed a task that exercises a declared skill in the last three weeks, that skill is marked unverified. Skills unverified for more than forty-two days drop into a deprecation candidate pool, awaiting a human decision.

The mechanism encodes one belief: an agent’s capabilities are not real because we declared them. They are real because they keep being used.

The unexpected upside was that the audit surfaced resource misallocation we hadn’t seen. We discovered our growth team’s SEO specialist had not done any SEO work in three months — they were spending all their time fixing formatting for the content team. The capability card made that role drift visible in a way nothing else had.

Mechanism two: rule entropy as a budgeted resource

The rule file is the most corrosion-prone part of any multi-agent system. Every time you hit an edge case, the engineer’s instinct is to add a rule. Six months in, your CLAUDE.md is a five-hundred-line tangle of contradictory constraints, and nobody actually knows which rules are still in force.

We introduced an entropy budget: CLAUDE.md has a hard ceiling of three hundred and fifty lines for the current phase. Exceeding the ceiling means deleting before adding. No exceptions. Every rule must carry a # [ADDED: YYYY-MM-DD] timestamp. Any rule older than one hundred and eighty days that hasn’t been re-affirmed or tested becomes a deprecation candidate at the next review.

The engineering logic: rules are not free. Every rule the agent reads costs context tokens and adds an inference step about precedence between rules. The decision chain gets longer. Error rates climb. A line ceiling is a brutal forcing function — it makes you ask “which old rule can come out?” every time you want to add a new one.

We also stratified rule changes by impact:

  • P0 changes (CEO execution boundary, decision system) require explicit principal approval in a session, with a change record kept.
  • P1/P2 changes can be proposed by the harness engineer and approved by the AI CEO.

The stratification exists because in a fast-iteration phase we genuinely had cases where high-frequency small changes accidentally modified core constraints. Tiering prevents that.

Mechanism three: an intelligence pipeline that actually changes the system

This is the most interesting of the three, and the easiest one to skip building.

Most AI teams’ intelligence intake is one-way: a human reads industry news, then maybe updates the system prompt. The frequency of “maybe” is “whenever I happen to remember,” which in practice means never.

We made the intelligence pipeline a closed loop. Every morning at 8 AM Dubai, the intelligence agent collects AI frontier updates and pushes a daily report as HTML to git. At 10 AM, an intelligence-action agent extracts the actionable recommendations from the report, runs them past four expert agents for evaluation, and the recommendations that pass review are handed directly to the harness ops team for execution — config updates, strategy adjustments, capability card edits.

The crucial design choice is the word acts, not summarizes. Intelligence is not for human consumption. It is fuel for the execution chain. A piece of intel about a new prompt-optimization technique, once it passes review, lands inside the agent’s actual prompt configuration within twenty-four hours. The latency between “this technique exists” and “our system uses it” collapses from “human-attention-driven random cycle” to “one business day.”

That latency is, in our experience, the single biggest determinant of whether a multi-agent team gets stronger over time or weaker.

The three reusable principles

Pull the three mechanisms apart and three principles fall out:

Capability must be proven, not declared. An agent’s listed skills are unreliable until they keep being exercised. Build the verification loop, or assume the capability list is fiction.

Rules are liabilities, not assets. Every rule has a maintenance cost. The simpler the system, the easier it is for the agent to reason inside. Total volume control is engineering discipline, not aesthetic preference.

Intelligence consumption must produce system change. If your reading list outputs reports but never modifies the running system, all you’ve added is human reading load. The intelligence loop that doesn’t close is a vanity loop.

The shared assumption underneath all three: a multi-agent team’s capability is a moving quantity that requires active maintenance, not a static configuration that requires only initial care. Without that assumption, your system is a carefully tuned starting state, and time is the thing that turns it into technical debt.

If you are building an AI engineering team, the full implementation of capability audits, rule-entropy control, and intelligence-pipeline feedback is open in the Synapse framework that runs our Janus Digital operations.