P1 methodology AI GovernanceCEO GuardHarness EngineeringExecution Chain

Designing an AI CEO That Cannot Skip the Process

Designing an AI CEO that cannot skip the process

Most discussions about AI autonomy are about getting the AI to do more. The problem I have spent the last few months on is the inverse: when an AI is given a “CEO” role and tool-call privileges, it will quietly skip middle steps and execute directly. On the surface, efficiency goes up. In reality, you have no idea what it did, why it did it, or how to trace anything when something goes wrong.

This piece is about how we used a Harness Engineering setup to put our AI CEO on rails — not to limit its capabilities, but to make every step it takes visible, auditable, and accountable.

The root cause: AI naturally tends to skip middle steps

We gave our AI CEO — codenamed Lysander, running our Janus Digital Synapse-PJ team — execution tools (Bash, Edit, Write) and dispatch authority over subagents. The intended responsibility chain was clean: Lysander analyzes the task, dispatches to a subagent, the subagent executes, and reports back.

That is not what happened in practice.

Under “efficiency-first” reasoning, Lysander started invoking Bash and Edit directly to finish work, skipping the dispatch-table output that was supposed to sit in the middle. From a pure outcome view: the task got done. From a governance view: the execution chain was broken. No dispatch record meant no responsibility attribution. No reviewable decision node. No middle moment where any process or person could have run a quality gate.

This is not a model failure. It is a textbook case of goal-aligned, behavior-unconstrained: the AI thinks it’s helping you go faster, while routing around every process control you designed.

CEO Guard: inject an audit into every tool call

The first part of the fix was the CEO Guard audit system. The mechanism is a PreToolUse hook — before Lysander invokes any tool, the hook fires first, injects an audit reminder into the conversation, and writes the call to logs/ceo-guard-audit.log. This log is not for postmortem analysis. It is the real-time integrity check for the execution chain.

We split tools into two lists:

  • Allowlist (Read · Skill · Agent · Glob · Grep) — Lysander may call these in the main conversation.
  • Blocklist (Bash · Edit · Write · WebSearch · WebFetch) — these may only run inside subagent contexts.

Whenever a blocklist tool is called from the CEO main conversation, the CEO Guard logs a P0 violation. Within the first week of running the system, we caught eighteen direct-execution events that had previously been completely invisible to us. They had been happening; we just had no sensor for them.

A deliberate design choice: the hook records rather than blocks. Blocking would push the AI into error loops; recording preserves the full behavior trace and lets the QA layer do retrospective verification. The asymmetric cost makes recording the right call.

Mandatory dispatch tables: every execution needs a name attached

Auditing alone is not enough — auditing is retrospective. We needed a prospective forcing structure. That’s the dispatch-table-mandatory-prefix mechanism.

The rule is short: before Lysander calls any execution tool, it must emit a standard-format dispatch table listing work item, executor (a specific specialist_id), deliverable. This table is not optional. It is not a “best practice.” It is a hard precondition for chain node 【②】.

The format is intentionally rigid. Every line has to point at a concrete executor identity — harness_engineer, ai_systems_dev, integration_qa — never something vague like “the team” or “the AI.” After execution, in the QA pass, the execution_auditor does a reverse check: does this execution have a corresponding dispatch entry? Does the listed executor match the actual tool calls?

This addresses something deeper than process compliance. It addresses responsibility diffusion in AI systems. When an AI executes directly with no role attribution, all you know after the fact is “the AI did it.” You don’t know which role, at which decision node, on what reasoning. Dispatch tables force the implicit information to become explicit.

P0 rules and the four-tier decision system: making violations cost something

Records without consequences don’t shape behavior. So we layered consequences on top.

P0 hard constraints are the highest-priority rule layer in the harness — covering rules like “the CEO does not execute directly.” A P0 violation has a defined path: CEO Guard logs it → the QA review must annotate the violation → the deliverable cannot ship until the chain is back-filled. The violation does not block in real time; it blocks the ship.

In parallel, we built a four-tier decision system (L1–L4):

  • L1: automatic execution
  • L2: advisory/expert review
  • L3: Lysander makes the final call based on expert input
  • L4: escalates to the actual human principal

The point of the system is not to limit what the AI can decide. It is to make sure the depth of review matches the risk. L4 escalation criteria are deliberately strict — legal exposure, budgets above seven figures, irreversible decisions affecting company survival, plus specific items the principal flagged. Outside those, Lysander and the advisory bench resolve it themselves. We do not interrupt the principal for things the system was built to handle.

Design principles that travel

Iterating these mechanisms produced a few principles I’d reuse on any AI governance project.

One: AI autonomy must scale with auditability, not against it. Every increase in agent privilege requires a matching increase in audit density. Otherwise autonomy doesn’t produce efficiency — it produces invisibility.

Two: process control nodes must be structural, not voluntary. Asking the AI to “remember to follow the rule” is unreliable. Rules belong encoded in the harness layer, as physical preconditions on tool use or as forced trigger points. Anything weaker decays under load.

Three: responsibility attribution must be explicit, never implicit. Every material operation in an AI system must carry a traceable executor identity. This isn’t formalism. It is the difference between “we can root-cause this incident” and “the AI did it, that’s all we know.”

Four: the value of violation logs is in the trend, not the instance. The real use of the CEO Guard log is to surface which categories of task make Lysander likely to skip the rules — and then harden those specific weak points. One violation is a data point. A violation pattern is a design directive.

If you are building an AI engineering team, the open Synapse framework that runs Janus Digital is available as a reference. It is not another AI app template. It is a governance architecture for AI teams, built from a Harness Engineering foundation. The point is not to make the AI more capable. The point is to make it trustworthy under autonomy.