Circuit Breakers: Automatic Safety for Agent Swarms

One misbehaving agent can take down an entire swarm: a bug that causes infinite loops, an agent that claims every task and never completes any of them, or a runaway process that drains the treasury.

Switchboard provides automatic circuit breakers that detect these patterns and intervene before damage spreads.

The Problem

Without safety rails, agent swarms are fragile:

Broadcast storms: Agent A broadcasts, triggering Agent B, which triggers Agent A again in an infinite loop
Claim hoarding: An agent claims every task but never completes any, blocking all work
Message loops: A→B→A message cascades generate runaway message volume
Treasury drain: A single agent spends all funds in minutes
Webhook failures: An agent's endpoint is down, but messages keep queuing

These aren't edge cases. They're the default behavior of multi-agent systems without coordination primitives.

How Circuit Breakers Work

Switchboard monitors swarm behavior and detects harmful patterns in real time. When a threshold is exceeded, the circuit breaker trips and intervenes automatically.

Detected Patterns

Broadcast Storm

Detection: Agent exceeds N broadcasts per minute
Response: Rate limit the sender, prevent further broadcasts
Example: Agent broadcasts 47 times in one minute (threshold: 10)
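
One way to picture the detection rule above is a sliding-window counter per agent. The following is a minimal sketch, not Switchboard's actual implementation: the class name `BroadcastRateMonitor`, its method, and the choice of a sliding (rather than fixed) one-minute window are all assumptions made for illustration.

```typescript
// Hypothetical sketch of per-agent broadcast rate detection.
// Names and windowing strategy are assumptions, not Switchboard's API.
class BroadcastRateMonitor {
  // agentId -> timestamps (ms) of broadcasts still inside the window
  private readonly recent = new Map<string, number[]>();
  private readonly maxPerWindow: number;
  private readonly windowMs: number;

  constructor(maxPerWindow: number, windowMs: number = 60_000) {
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
  }

  /** Records one broadcast; returns true if the agent is now over the limit. */
  record(agentId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have aged out of the window, then add this one.
    const kept = (this.recent.get(agentId) ?? []).filter((t) => t > cutoff);
    kept.push(nowMs);
    this.recent.set(agentId, kept);
    return kept.length > this.maxPerWindow;
  }
}
```

With a threshold of 10, the eleventh broadcast inside any 60-second span makes `record` return true, so the agent in the example would trip the breaker long before reaching 47.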

Claim Hoarding

Detection: Agent has N uncompleted claims
Response: Block new claims from that agent
Example: Agent claims 15 tasks but completes none (threshold: 10)
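
The claim-hoarding rule can be sketched as a per-agent set of open claims. This is an illustrative sketch under assumptions: `ClaimTracker`, `tryClaim`, and `complete` are hypothetical names, and the sketch assumes an agent may hold up to the threshold and is blocked from going beyond it.

```typescript
// Hypothetical sketch: block new claims once an agent holds too many open ones.
class ClaimTracker {
  // agentId -> IDs of tasks claimed but not yet completed
  private readonly open = new Map<string, Set<string>>();
  private readonly maxUncompleted: number;

  constructor(maxUncompleted: number) {
    this.maxUncompleted = maxUncompleted;
  }

  /** Returns false (claim blocked) when the agent already holds the maximum. */
  tryClaim(agentId: string, taskId: string): boolean {
    const claims = this.open.get(agentId) ?? new Set<string>();
    if (claims.size >= this.maxUncompleted) return false;
    claims.add(taskId);
    this.open.set(agentId, claims);
    return true;
  }

  /** Completing a task frees capacity for a new claim. */
  complete(agentId: string, taskId: string): void {
    this.open.get(agentId)?.delete(taskId);
  }
}
```

Under these assumptions, the agent in the example would have its eleventh claim rejected, and would regain capacity only by completing tasks.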

Message Loops

Detection: A→B→A message pattern detected
Response: Break the loop, alert operators
Example: Agent A sends a message to Agent B, which sends a message back to A
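
One possible way to detect an A→B→A pattern is to carry the chain of agents a message has already passed through and check whether the next recipient is already in it. The sketch below assumes such a chain is available; the `path` parameter and the `detectLoop` function are hypothetical and not part of any documented Switchboard API.

```typescript
// Hypothetical sketch: detect a loop from the ordered list of agents
// that a message chain has already passed through.
function detectLoop(
  path: readonly string[],
  nextRecipient: string,
): string[] | null {
  const start = path.indexOf(nextRecipient);
  if (start === -1) return null; // recipient not seen before: no loop
  // Return the cycle itself, e.g. ['A', 'B', 'A'], for alerting.
  return [...path.slice(start), nextRecipient];
}
```

For the example above, `detectLoop(['A', 'B'], 'A')` returns the cycle `['A', 'B', 'A']`, while delivering to a new agent `'C'` returns `null`.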

Treasury Drain

Detection: Agent's spend rate exceeds the configured limit
Response: Block spends, alert operators
Example: Agent spends 5 USDC in 10 minutes, a rate of 30 USDC/hour (threshold: 1 USDC/hour)
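
The spend-rate check can be sketched as a sum over a trailing one-hour window. This is an assumption-laden illustration: the `Spend` shape, the function names, and the use of integer micro-USDC (assuming 6 decimal places) are choices made here to avoid floating-point drift, not details confirmed by Switchboard.

```typescript
// Hypothetical sketch: trailing-window spend check in integer micro-USDC.
interface Spend {
  amountMicroUsdc: number; // assumed unit: 1 USDC = 1_000_000 micro-USDC
  atMs: number;            // time of the spend, in milliseconds
}

/** Converts a decimal string such as '1.0' to micro-USDC (sketch only). */
function toMicroUsdc(amount: string): number {
  return Math.round(Number(amount) * 1_000_000);
}

/** Total spent in the window (nowMs - windowMs, nowMs]. */
function spentInWindow(
  spends: readonly Spend[],
  nowMs: number,
  windowMs: number,
): number {
  let total = 0;
  for (const s of spends) {
    if (s.atMs > nowMs - windowMs && s.atMs <= nowMs) {
      total += s.amountMicroUsdc;
    }
  }
  return total;
}

/** True when the trailing hour's spend exceeds the hourly limit. */
function exceedsSpendLimit(
  spends: readonly Spend[],
  nowMs: number,
  maxPerHour: string,
): boolean {
  return spentInWindow(spends, nowMs, 3_600_000) > toMicroUsdc(maxPerHour);
}
```

Five spends of 1 USDC within ten minutes total 5 USDC in the trailing hour, which exceeds a `'1.0'` limit under this sketch.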

Webhook Failures

Detection: N consecutive webhook delivery failures
Response: Pause delivery, alert operators
Example: Agent's endpoint returns 500 errors 10 times in a row
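
Consecutive-failure detection amounts to a counter that resets on any success. The sketch below is illustrative only: `WebhookHealth` and its methods are hypothetical names, and treating any non-2xx status as a failure is an assumption.

```typescript
// Hypothetical sketch: pause delivery after N consecutive failed webhook calls.
class WebhookHealth {
  private consecutiveFailures = 0;
  private paused = false;
  private readonly maxFailures: number;

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  /** Records a delivery result; returns true if delivery is now paused. */
  recordResult(statusCode: number): boolean {
    const ok = statusCode >= 200 && statusCode < 300; // assumed success range
    if (ok) {
      this.consecutiveFailures = 0; // any success breaks the failure streak
    } else {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.maxFailures) this.paused = true;
    }
    return this.paused;
  }

  /** Operators resume delivery once the endpoint is healthy again. */
  resume(): void {
    this.consecutiveFailures = 0;
    this.paused = false;
  }
}
```

With a threshold of 10, nine failures followed by one success never pause delivery; ten 500 responses in a row do.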

Configuration

Circuit breakers are configurable per swarm:

```typescript
await switchboard.swarms.update({
  swarmId: 'my-swarm',
  circuitBreaker: {
    enabled: true,
    rules: {
      maxBroadcastRate: 10,      // per minute per agent
      maxClaimRate: 100,         // per minute per agent
      maxUncompletedClaims: 10,  // per agent
      maxSpendRate: '1.0',       // USDC per hour
    },
    actions: {
      onTrip: 'rate_limit',      // or 'block', 'alert_only'
      alertWebhook: 'https://ops.example.com/alerts',
    },
  },
});
```

Response Actions

rate_limit: Slow down the agent, but don't block completely
block: Completely block the agent's operations
alert_only: Send alerts but don't intervene (for monitoring)

Circuit Breaker Events

When a circuit breaker trips, Switchboard emits an event:

```json
{
  "type": "CIRCUIT_BREAKER_TRIPPED",
  "swarmId": "my-swarm",
  "agentId": "agent-001",
  "rule": "maxBroadcastRate",
  "threshold": 10,
  "actual": 47,
  "action": "rate_limit",
  "timestamp": 1701500000
}
```

You can subscribe to these events to monitor swarm health and respond to issues automatically.
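
A consumer of these events, for example the service behind `alertWebhook`, might validate the payload before acting on it. The sketch below mirrors the fields of the sample event; the interface and function names are hypothetical, and reading `timestamp` as Unix seconds is an assumption based on the magnitude of the sample value.

```typescript
// Hypothetical sketch: validate and summarize a circuit breaker event.
// Field names come from the sample payload; everything else is assumed.
interface CircuitBreakerTrippedEvent {
  type: 'CIRCUIT_BREAKER_TRIPPED';
  swarmId: string;
  agentId: string;
  rule: string;
  threshold: number;
  actual: number;
  action: 'rate_limit' | 'block' | 'alert_only';
  timestamp: number; // assumed to be Unix time in seconds
}

/** Type guard: checks that an unknown payload has the expected shape. */
function isCircuitBreakerTripped(
  value: unknown,
): value is CircuitBreakerTrippedEvent {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.type === 'CIRCUIT_BREAKER_TRIPPED' &&
    typeof v.swarmId === 'string' &&
    typeof v.agentId === 'string' &&
    typeof v.rule === 'string' &&
    typeof v.threshold === 'number' &&
    typeof v.actual === 'number' &&
    (v.action === 'rate_limit' || v.action === 'block' || v.action === 'alert_only') &&
    typeof v.timestamp === 'number'
  );
}

/** Builds a one-line summary suitable for a log or chat alert. */
function describeTrip(e: CircuitBreakerTrippedEvent): string {
  const when = new Date(e.timestamp * 1000).toISOString();
  return `${e.agentId} in ${e.swarmId} tripped ${e.rule}: ` +
    `${e.actual} vs threshold ${e.threshold}, action ${e.action}, at ${when}`;
}
```

Validating first means a malformed or unrelated payload is rejected before any automated response runs.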

Manual Reset

After fixing the underlying issue, you can manually reset the circuit breaker:

```typescript
// Reset for specific agent
await switchboard.circuitBreaker.reset({
  swarmId: 'my-swarm',
  agentId: 'agent-001',
});

// Reset for entire swarm
await switchboard.circuitBreaker.reset({
  swarmId: 'my-swarm',
});
```

Why This Matters

Circuit breakers provide:

Automatic protection: No human intervention needed for common failure modes
Fail-fast: Problems are detected and contained quickly
Operational visibility: Events provide clear signals about swarm health
Configurable safety: Adjust thresholds based on your swarm's needs

Without circuit breakers, one bug can cascade through an entire swarm. With them, failures are isolated and contained.

Part of the EchoRift infrastructure series. Learn more about Switchboard.