Multi-Agent Orchestration: Decoupling Execution from Decision Chains
TL;DR
- Separating execution and decision chains prevents silent failures in 23-agent workflows
- State synchronization delays exceeding 500ms break decision chain continuity
- Finite state machine routing replaces hard-coded priority logic
- Supervisor pattern reduces token consumption by 67% compared to direct broadcast
- Define agent boundaries before designing communication protocols
Problem and Solution
During the Synapse project, we built an AI workflow system with 23 specialized agents handling enterprise document understanding and knowledge graph construction. Each document flows through intent detection, entity extraction, relation reasoning, and quality validation stages—with conditional branches and loop-back requirements that break simple linear pipelines.
Initial demos ran smoothly, but under load (2000+ documents daily), we encountered “ghost tasks”: completion states with no downstream processing. Three days of log investigation revealed that a single configuration parameter caused message queue consumption order corruption between two agents. This issue never surfaced in single-machine debug environments.
The root problem proved deeper than communication protocols. With 23 agents running in parallel, every state change triggered cascading effects, and we lacked global visibility into which agent made which decision at any moment. More critically, we had tightly coupled execution logic (entity extraction) with routing decisions (which agent processes next). Modifying entity extraction logic inadvertently changed routing branches for certain document types—producing no error logs, only silently degrading accuracy.
Our solution separated the Execution Chain (stable domain capabilities like entity extraction) from the Decision Chain (independent routing engine based on task state and business rules). The Decision Chain uses a finite state machine driven by configuration rather than hard-coded priorities, enabling new business branches without modifying core routing logic. The TaskContext data structure maintains both execution_history and decision_trace, ensuring every state transition records its trigger condition for traceability.
Key Takeaways
- If you design multi-agent roles, decouple execution logic from routing decisions to prevent cascading side effects
- If you scale beyond 10 agents, implement global state tracking with decision traces across all stages
- If you modify agent behavior, use configuration-driven state machine transitions instead of embedded branching logic
- If you optimize for cost, apply the Supervisor pattern for targeted routing rather than broadcasting to all agents
- If you build portable systems, model agent boundaries first, then design protocols around those boundaries
Read the Full Article (Chinese)
This is an abstract. The full technical walkthrough, including the TaskContext and DecisionChain implementation details, is available in Chinese.