The Story
I’ve been running Squad — a multi-agent AI framework — for a couple of weeks now. It orchestrates a team of AI agents that handle code review, architecture decisions, infrastructure, docs, and more. A reconciliation loop runs every 5 minutes, picking up work and dispatching agents. Most of the time it works great.
As I started planning to run Squad at scale — thinking about platforms like AKS, Azure VMs, or similar — I realized rate limiting with multiple agents is fundamentally different from single-service rate limiting. I found this thread on r/GithubCopilot where other people were describing the exact same problem I was hitting. So I went and did some research and reading, stress-tested the system, and designed 6 patterns to handle it.
Here’s what triggered the deep dive. As I added more machines and more Ralph processes (Ralph is my background polling agent), things started breaking.
Nine agents launched simultaneously. In 22 minutes they opened 10 pull requests. Impressive — until minute 8, when GitHub started returning 429 Too Many Requests.
Every agent retried at the same time. The retry wave triggered a second 429 wave. That triggered a third. Within 90 seconds I’d burned through GitHub’s 5,000 requests/hour limit and was locked out entirely. Meanwhile, Picard — my lead agent making critical architecture decisions — was stuck behind Ralph, a background polling agent that had eaten the remaining Copilot completions doing low-priority issue triage.
Even in just a couple of weeks of running the system, I’d already hit memory issues, resource contention, and agent crashes. But rate limiting with multiple agents sharing the same quotas? That was a different problem entirely — and one that gets worse the more I scale.
The core lesson:
Rate limiting in multi-agent systems is a coordination problem, not a retry problem.
Every tool I evaluated — Azure API Management, Resilience4j, LangGraph — treats rate limiting as something each caller handles independently. But when 9 agents share the same API quotas, independent retry logic doesn’t just fail. It actively makes things worse.
The Three Failure Modes
Before designing anything, I had to understand why standard retry logic breaks down. I identified three patterns from my logs as the system scaled:
1. Thundering Herd
After a 429, all agents wait the same Retry-After duration and retry simultaneously. They collide again, triggering another 429. In my logs, ralph-self-heal.log showed 60+ chained failures in a single incident. Classic distributed systems problem — except the “services” are AI agents that don’t know about each other.
2. Priority Inversion
Ralph’s background polling (checking for new GitHub issues every 5 minutes) consumed API quota that Picard needed for blocking architecture decisions. Both agents had equal retry priority. There was no way to say “Picard goes first” — so critical work waited behind background noise.
3. Cascade Amplification
A single GitHub secondary-rate-limit hit caused multiple agents to queue their pending work. When the limit lifted, they all flushed their queues at once — immediately re-triggering the limit. One 429 became a system-wide outage that took up to 60 minutes to recover from.
6 Patterns I Designed
Based on the research and what I observed, I designed a Rate Governor — a coordination layer that all agents consult before making API calls. Here are the six patterns inside it, each one a direct response to a failure mode I observed or anticipated as the system scales.
Pattern 1: Traffic Light Throttling
What broke: Agents only reacted after hitting a 429. By then, the entire quota window was gone. Recovery meant waiting up to 60 seconds while every agent sat idle.
What I learned: When making direct API calls (e.g., via gh api or REST clients), every response includes x-ratelimit-remaining and x-ratelimit-reset headers. Nobody was reading them. (Note: these headers aren’t directly exposed when using Copilot CLI with -p — this pattern applies when you’re consuming APIs directly.)
I added a traffic-light system that reads remaining quota after every API call and adjusts behavior before hitting the wall:
| Zone | When | What happens |
|---|---|---|
| 🟢 Green | >40% quota left | Normal operation |
| 🟡 Amber | 15–40% left | Add proportional delays — background agents slow down first |
| 🔴 Red | <15% left | Background agents park. Standard agents slow to 1 req/sec. Critical agents pass through. |
Here’s what the header parsing looks like for the GitHub REST API (which returns standard x-ratelimit-* headers):
```powershell
# Read rate-limit state from API response headers
$remaining = [int]$response.Headers["x-ratelimit-remaining"]
$limit = [int]$response.Headers["x-ratelimit-limit"]
$resetAt = $response.Headers["x-ratelimit-reset"]
$ratio = $remaining / $limit

if ($ratio -ge 0.40) {
    # GREEN — no throttling
} elseif ($ratio -ge 0.15) {
    # AMBER — proportional delay for non-critical agents
    $delayMs = 2000 * (0.40 - $ratio) / 0.25
    Start-Sleep -Milliseconds $delayMs
} else {
    # RED — park background agents, slow standard agents
    if ($Priority -eq 2) { return "PARKED" }
    if ($Priority -eq 1) { Start-Sleep -Seconds 1 }
    # P0 passes through immediately
}
```
Key insight: Don’t wait for a 429 to tell you you’re out of quota. The headers tell you 10 calls in advance. Read them.
Pattern 2: Shared Token Pool
What broke: All agents share API quotas (80 completions/hour on Copilot, 5,000 requests/hour on GitHub REST) but tracked consumption independently. When Ralph was idle, Picard couldn’t borrow his unused allocation. When Ralph was busy triaging, he starved Data’s code generation.
What I learned: Agents need a shared ledger. I created rate-pool.json — a single file (with file-locking) that tracks the shared quota, per-agent soft reservations, and a donation register where idle agents release unused capacity.
```jsonc
// rate-pool.json
{
  "github": {
    "window_completions_total": 80,
    "window_completions_remaining": 48,
    "agent_allocations": {
      "picard": { "reserved": 20, "used": 8 },
      "ralph": { "reserved": 12, "used": 2 },
      "data": { "reserved": 20, "used": 18 }
    },
    "donation_pool": 10
  }
}
```
The rules are simple:
- P0 agents (Picard, Worf) always get completions if any remain
- P1 agents (Data, Seven) use their reservation, then pull from the donation pool
- P2 agents (Ralph) yield when the pool is under 30% capacity
- Idle agents donate unused reservations back to the pool automatically
- Starvation prevention: any P2 agent denied for 5+ minutes gets promoted to P1
There’s no circular wait — an agent either gets completions immediately or yields and retries next round. No deadlocks possible.
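To make the rules concrete, here is a minimal sketch of the allocation check. The function name and the pool-as-parameter shape are mine for illustration, not Squad’s actual implementation; the field names mirror rate-pool.json above.

```powershell
# Illustrative sketch of the pool rules — names are mine, not Squad's.
# Takes the parsed rate-pool.json object and answers "may this agent call now?"
function Request-Completion {
    param($Pool, [string]$Agent, [int]$Priority)  # 0=critical, 1=standard, 2=background

    $gh    = $Pool.github
    $alloc = $gh.agent_allocations.$Agent

    if ($gh.window_completions_remaining -le 0) { return $false }

    # P0 agents always get completions if any remain
    if ($Priority -eq 0) { return $true }

    # P2 agents yield when the pool is under 30% capacity
    $capacity = $gh.window_completions_remaining / $gh.window_completions_total
    if ($Priority -eq 2 -and $capacity -lt 0.30) { return $false }

    # Use the agent's own reservation first, then (P1 only) the donation pool
    if ($alloc.used -lt $alloc.reserved) { return $true }
    if ($Priority -eq 1 -and $gh.donation_pool -gt 0) { return $true }

    return $false  # yield and retry next reconciliation round
}
```

An agent that gets `$false` simply waits for the next round; the starvation-prevention promotion (P2 to P1 after 5 minutes of denials) would sit on top of this check.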
Key insight: Treat your API quota like a shared bank account, not separate wallets. Idle agents should donate, critical agents should overdraw.
Pattern 3: Predictive Circuit Breaker
What broke: My existing circuit breaker opened only after receiving a 429. That’s like pulling the fire alarm after the building is already on fire. The quota was gone, and recovery meant waiting the full cooldown window.
What I learned: You can predict exhaustion before it happens. If you’re burning 1,000 tokens/second and you have 2,000 left, you’ve got 2 seconds — not enough time for the next agent request to complete.
I added a PRE-EMPTIVE_OPEN state to the circuit breaker:
Before switching models entirely, the circuit breaker first tries reducing load on the same model — cutting max_tokens, compressing prompts. Only if that doesn’t help does it walk down the fallback chain:
```
claude-sonnet-4.6 → gpt-5.4-mini → gpt-5-mini → gpt-4.1
```
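The prediction itself is one division. Here is a sketch of the pre-emptive check; the function and parameter names are mine, not Squad’s:

```powershell
# Illustrative sketch: open the breaker pre-emptively when the projected
# time-to-exhaustion is shorter than the next request needs to complete
function Test-PreemptiveOpen {
    param(
        [double]$TokensRemaining,
        [double]$BurnRatePerSecond,  # measured over a recent sliding window
        [double]$NextRequestSeconds  # expected duration of the next call
    )
    if ($BurnRatePerSecond -le 0) { return $false }
    $secondsToExhaustion = $TokensRemaining / $BurnRatePerSecond
    return ($secondsToExhaustion -lt $NextRequestSeconds)  # PRE-EMPTIVE_OPEN
}
```

With the numbers above (2,000 tokens left, burning 1,000/second), a request expected to take 3 seconds trips the breaker, so load reduction or the fallback chain kicks in before the 429 ever arrives.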
Key insight: The difference between “locked out for 10 minutes” and “gracefully downgraded for 30 seconds” is prediction. If you can see the wall coming, you can brake instead of crashing.
Pattern 4: Cascade Detector
What broke: Squad workflows are sequential — Picard makes an architecture decision, Data implements it, Belanna deploys it, Neelix announces it. A rate limit hit at any stage blocked everything downstream. But no agent knew about its dependencies.
What I learned: You need a dependency graph. When one agent gets rate-limited, every downstream agent should know before it attempts its next call.
When 3+ agents get rate-limited within a 30-second window, the cascade detector switches to sequential mode — agents take an ordered lock and go one at a time instead of all at once. This kills the thundering herd instantly.
I encode the workflow DAG in a simple config:
```yaml
# backpressure.yaml
workflows:
  issue-to-deploy:
    - ralph    # triage
    - picard   # architecture
    - data     # implementation
    - belanna  # deployment
    - neelix   # announcement
cascade_threshold: 3  # agents hit in 30s triggers sequential mode
```
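The detector itself can be a sliding window over recent 429 timestamps. A sketch, with variable and function names that are illustrative rather than Squad’s:

```powershell
# Illustrative sketch: record each 429, keep a 30-second sliding window,
# and report when the cascade threshold from backpressure.yaml is crossed
$script:RateLimitHits = @()

function Register-RateLimitHit {
    param([int]$CascadeThreshold = 3, [int]$WindowSeconds = 30)
    $now = Get-Date
    $script:RateLimitHits += $now
    # Drop events that have aged out of the window
    $script:RateLimitHits = @($script:RateLimitHits |
        Where-Object { ($now - $_).TotalSeconds -le $WindowSeconds })
    # Threshold crossed: switch agents to sequential (ordered-lock) mode
    return ($script:RateLimitHits.Count -ge $CascadeThreshold)
}
```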
Key insight: A rate limit isn’t a local event — it’s a signal that propagates through your agent dependency chain. Map the chain, propagate the signal.
Pattern 5: Lease-Based Cleanup
What broke: When an agent crashed mid-round, its reservation in the shared pool was never released. Even in a couple of weeks of running, I saw phantom allocations start to accumulate — agents got denied completions despite actual API quota being available. At scale, this would get much worse.
What I learned: Every allocation needs a lease with an expiry. I tag each reservation with a timestamp and tie it to the agent’s heartbeat. A background sweep every 30 seconds checks:
```powershell
# Reclaim tokens from dead agents
$heartbeatFiles = Get-ChildItem "$env:SQUAD_DIR/heartbeats/*.json"
foreach ($hb in $heartbeatFiles) {
    $agent = $hb.BaseName
    $lastBeat = (Get-Content $hb.FullName | ConvertFrom-Json).timestamp
    $staleness = (Get-Date) - [datetime]$lastBeat
    if ($staleness.TotalMinutes -gt 2) {
        # Agent is dead — reclaim its tokens
        $pool = Get-Content "rate-pool.json" | ConvertFrom-Json
        $unused = $pool.github.agent_allocations.$agent.reserved -
                  $pool.github.agent_allocations.$agent.used
        $pool.github.donation_pool += [Math]::Max(0, $unused)
        $pool.github.agent_allocations.$agent.reserved = 0
        $pool | ConvertTo-Json -Depth 5 | Set-Content "rate-pool.json"
        Write-Host "♻️ Reclaimed $unused tokens from crashed agent: $agent"
    }
}
```
This hooks directly into Squad’s existing ralph-heartbeat.ps1 — the heartbeat files are already there. I just started reading them.
Key insight: In any environment where agents can crash — and they will — allocations outlive the processes that made them. Add a lease, or your quota pool will slowly starve.
Pattern 6: Priority Retry Windows
What broke: The standard AWS exponential-backoff-with-jitter formula treats every caller equally. When Picard (critical architecture decisions) and Ralph (background polling) both get a 429 at the same time, they both retry in the same random window. Ralph can get lucky and grab the quota before Picard. That’s priority inversion.
What I learned: Give each priority tier its own non-overlapping retry window. P0 retries first. P1 retries after P0 is done. P2 goes last.
| Priority | Agents | Retry Window |
|---|---|---|
| P0 Critical | Picard, Worf | 0 – 0.5s |
| P1 Standard | Data, Seven, Belanna, Troi, Neelix | 0.5 – 3.5s |
| P2 Background | Ralph, Scribe | 3.5 – 9.5s |
This guarantees P0 agents consume available quota before P1 agents even begin retrying. Priority inversion becomes structurally impossible.
```powershell
function Get-RetryDelay {
    param(
        [int]$RetryAfterSeconds,
        [int]$Attempt,
        [int]$Priority  # 0=critical, 1=standard, 2=background
    )

    # Base delay from Retry-After header (or exponential backoff)
    if (-not $RetryAfterSeconds) {
        $RetryAfterSeconds = [Math]::Min(60, [Math]::Pow(2, $Attempt))
    }

    # Non-overlapping priority windows
    switch ($Priority) {
        0 { $windowStart = 0;   $windowEnd = 0.5 }  # P0: first 500ms
        1 { $windowStart = 0.5; $windowEnd = 3.5 }  # P1: 500ms–3.5s
        2 { $windowStart = 3.5; $windowEnd = 9.5 }  # P2: 3.5s–9.5s
    }

    $jitter = Get-Random -Minimum 0 -Maximum (($windowEnd - $windowStart) * 1000)
    return $RetryAfterSeconds + $windowStart + ($jitter / 1000.0)
}
```
Key insight: Standard jitter treats all callers as equal. In a multi-agent system, they’re not. Separate the retry windows by priority and the problem disappears.
The Full Architecture
All six patterns feed into a shared Rate State Store — a pair of JSON files (rate-pool.json and rate-state.json) with file locking. Every agent reads state before calling an API and writes state after receiving a response. No central server needed — it’s cooperative coordination through the filesystem.
Important caveat: This file-based approach works on a single machine (or a shared filesystem with strong POSIX semantics). For the multi-node case, see Pattern 7 below.
```
┌─────────────────────────────────────────────────┐
│              Squad Rate Governor                │
│                                                 │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
│ │ Traffic  │ │ Shared   │ │ Lease-Based      │  │
│ │ Light    │ │ Token    │ │ Cleanup          │  │
│ │ Throttle │ │ Pool     │ │ (heartbeat-tied) │  │
│ └────┬─────┘ └────┬─────┘ └────────┬─────────┘  │
│      │            │                │            │
│      ▼            ▼                ▼            │
│ ┌─────────────────────────────────────────┐     │
│ │           Rate State Store              │     │
│ │    rate-pool.json · rate-state.json     │     │
│ └────────────┬────────────────────────────┘     │
│              │                                  │
│      ┌───────┼───────┐                          │
│      ▼       ▼       ▼                          │
│ ┌────────┐ ┌──────┐ ┌───────────┐               │
│ │Cascade │ │Retry │ │Predictive │               │
│ │Detector│ │Window│ │Circuit    │               │
│ │        │ │      │ │Breaker    │               │
│ └────────┘ └──────┘ └───────────┘               │
└─────────────────────────────────────────────────┘
       │              │             │
       ▼              ▼             ▼
  GitHub API   GitHub Copilot  Azure OpenAI
```
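Each agent’s interaction with the Rate State Store is a locked read-modify-write cycle. Here is a minimal sketch of that cycle; the function name and the jittered retry are mine, and it assumes the state file already exists:

```powershell
# Illustrative sketch of a locked read-modify-write on the Rate State Store.
# Opening with FileShare 'None' makes the handle itself the lock (single machine only).
function Update-RateState {
    param([string]$Path, [scriptblock]$Mutation)

    # Spin with jitter until we win the exclusive handle
    while ($true) {
        try {
            $fs = [System.IO.File]::Open($Path, 'Open', 'ReadWrite', 'None')
            break
        } catch [System.IO.IOException] {
            Start-Sleep -Milliseconds (Get-Random -Minimum 20 -Maximum 100)
        }
    }
    try {
        $reader = New-Object System.IO.StreamReader($fs)
        $state  = $reader.ReadToEnd() | ConvertFrom-Json
        $state  = & $Mutation $state   # apply the caller's change
        $fs.SetLength(0); $fs.Position = 0
        $writer = New-Object System.IO.StreamWriter($fs)
        $writer.Write(($state | ConvertTo-Json -Depth 5))
        $writer.Flush()
    } finally {
        $fs.Dispose()                  # closing the handle releases the lock
    }
}
```

Concurrent agents that lose the race get an IOException on Open and retry with jitter, so updates serialize without any coordinator process.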
Pattern 7: When You Outgrow One Machine
Here’s where I need to be honest: The file-based Rate State Store I described above only works on a single node. If you’re running Squad on your dev machine or a single Azure VM, you’re fine. But the moment you scale to multiple AKS pods or separate VMs, the whole design breaks.
Why File Locking Doesn’t Work Across Nodes
The patterns I designed above rely on three things:
- POSIX file locks that guarantee mutual exclusion when accessing `rate-pool.json`
- Heartbeat files that let me detect when an agent crashes and reclaim its tokens
- Immediate consistency — when agent A writes to the pool, agent B reads the updated state instantly
On a single machine, all three work. On multiple machines? None of them do.
- File locks don’t propagate across networked filesystems. NFS has `lockd` and `statd`, but lock semantics are unreliable across network partitions. Azure Files supports SMB locking, but it’s eventually consistent — not atomic.
- Heartbeat files are local. Each pod writes to its own filesystem. There’s no shared view of “which agents are still alive” without a coordination service.
- No fencing tokens. If a pod gets network-partitioned, it might still think it owns tokens and keep writing to the shared state — corrupting the pool with stale data.
- Eventual consistency on networked FS means stale reads. Agent A writes that it consumed 10 tokens. Agent B reads 2 seconds later and sees the old value. Both agents think they have quota. Both call the API. 429.
What I’d Use for Multi-Node Squad
If I needed to run Squad across multiple AKS pods (which I don’t yet — I’m still on a single machine), here’s what I’d reach for:
Option 1: Redis as the Rate State Store
Why it works:
- Atomic operations (`INCR`, `DECR`, `GETSET`) guarantee no race conditions
- TTL on keys gives me automatic lease expiry (no manual heartbeat cleanup)
- Pub/sub channels let me propagate backpressure signals instantly
- Already battle-tested for distributed rate limiting (see Stripe, GitHub, Shopify implementations)
What changes:
- Replace `rate-pool.json` with Redis hashes: `HSET rate:pool github:remaining 48`
- Replace file locks with Redis transactions (`MULTI`/`EXEC`)
- Heartbeats become Redis keys with TTL: `SET heartbeat:picard alive EX 30`
- Cascade detection uses Redis pub/sub: `PUBLISH backpressure:github "429 detected"`
Code sketch:
```shell
# Reserve tokens atomically
redis-cli --eval reserve-tokens.lua github picard 10
# Lua script ensures INCR + HSET happen as one atomic operation
```
I’d probably use Valkey (Redis fork) on Azure since it’s OSS and well-supported.
Option 2: etcd for Distributed Locking
Why it works:
- Already running in AKS clusters (it’s what powers Kubernetes itself)
- Strong consistency guarantees (Raft consensus)
- Lease-based locking with automatic expiry
- Watch API for propagating state changes
What changes:
- Replace `rate-pool.json` with the etcd key-value store
- Use etcd’s lease mechanism for heartbeats and token reservations
- Watch for changes to `/rate-pool/github/remaining` to detect quota exhaustion
- Use etcd transactions for atomic compare-and-swap on token allocation
Trade-off: etcd is heavier than Redis and optimized for configuration, not high-throughput counters. But if I’m already on AKS, it’s there and I don’t need another service.
Option 3: Sidecar / DaemonSet Pattern
Why it works:
- Run one “rate governor” per AKS node as a DaemonSet
- All agents on that node talk to their local governor (fast, no network)
- Governors coordinate centrally (Redis or etcd) but aggregate locally
- Reduces coordination overhead — only N governors talking, not N×M agents
What changes:
- Each agent calls `http://localhost:8080/reserve-tokens` (local sidecar)
- Sidecar maintains a local soft reservation (e.g., 20 tokens/node)
- When local pool is low, sidecar requests more from the central Redis pool
- Heartbeat = sidecar process liveness (Kubernetes handles this)
Trade-off: More complexity (another service to deploy), but much better performance at scale. This is how large-scale API gateways work (Envoy, Istio).
What I’m Actually Doing
Right now, I’m running Squad on a single machine. The file-based approach works perfectly and is way simpler than running Redis or etcd just for rate coordination. When I hit the point where I need multi-node Squad (probably when I start running multiple customer instances or large-scale load testing), I’ll migrate to Option 1 (Valkey on Azure) — it’s the most natural fit for high-frequency counter updates and already has proven multi-tenant rate limiting patterns from GitHub and Stripe.
The lesson: Start simple. Ship the file-based version. When you outgrow one machine, migrate to distributed state. Don’t build distributed infrastructure before you need it.
5 Things to Do Today
If you’re running multiple AI agents against shared API quotas, here’s the practical checklist:
1. Read the rate-limit headers
Every response from GitHub REST API and Azure OpenAI includes x-ratelimit-remaining. Parse it. Log it. React to it before hitting a 429. This is free and takes 20 minutes to implement. (Note: this applies when making direct API calls — not when using Copilot CLI with -p, where headers aren’t directly exposed.)
2. Assign priority tiers to your agents
Not all agents are equal. My architecture decision-maker should not compete with my background poller. I defined P0 (critical), P1 (standard), P2 (background) tiers and staggered retry windows accordingly.
3. Share quota state across agents
If agents track consumption independently, they will over-consume. A shared JSON file with file-locking is good enough to start. I didn’t need Redis or a coordinator service on day one.
4. Add lease expiry to allocations
If agents can crash (and they will), every token reservation needs a TTL. Dead agents shouldn’t hold quota hostage. Tie it to a heartbeat file — if the heartbeat stops, reclaim the tokens.
5. Map the agent dependency chain
Which agents depend on which other agents’ output? I wrote it down. When one agent gets rate-limited, I propagate a backpressure signal to everything downstream before they waste their own API calls.
References & Further Reading
These patterns didn’t come from nowhere — they’re adaptations of well-established distributed systems concepts applied to the multi-agent AI context:
- Circuit Breaker Pattern — Michael Nygard, Release It! (2007/2018). The original formulation. Also implemented in Resilience4j (Java) and Polly (.NET).
- Token Bucket / Leaky Bucket — classic rate limiting algorithms. See Stripe’s rate limiter blog post for a great practical introduction.
- Thundering Herd — well-documented in distributed systems literature. AWS’s exponential backoff and jitter article is the standard reference.
- Priority Inversion — originally a real-time OS concept (see the Mars Pathfinder bug). I adapted it to API quota scheduling.
- Backpressure — reactive systems concept from the Reactive Manifesto. Also central to Reactive Extensions (Rx) which I wrote about in Rx.NET in Action.
- Lease-Based Resource Management — inspired by Chubby (Google’s distributed lock service) and etcd lease mechanics.
- GitHub API Rate Limiting — GitHub REST API rate limits docs. The `x-ratelimit-remaining` headers are documented there.
- Reddit discussion — r/GithubCopilot thread where others reported similar multi-agent rate limit issues.
- Adaptive Rate Limiting with Deep RL — arXiv 2511.03279 — research on multi-objective adaptive rate limiting in microservices using deep reinforcement learning. Shows 15–30% throughput improvements over static algorithms.
- Lamport, L. (1978) — “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM. The foundational paper — multi-agent coordination is fundamentally the same problem with LLM-shaped nodes.
- Resilient Microservices: A Systematic Review of Recovery Patterns — arXiv 2512.16959 — comprehensive survey of recovery patterns in distributed systems.
- Patterns of Distributed Systems — Martin Fowler’s catalog and Unmesh Joshi’s book (Addison-Wesley, 2024).
I’m currently experimenting with these patterns across multiple DevBoxes and AKS clusters. Squad manages 8–12 autonomous AI agents performing code review, architecture decisions, infrastructure deployment, research, and communication — the patterns described here are what I’m building toward as the system scales.
📚 Series: Scaling Your AI Development Team
- Part 0: Organized by AI — How Squad Changed My Daily Workflow
- Part 1: Resistance is Futile — Your First AI Engineering Team
- Part 2: The Collective — Organizational Knowledge for AI Teams
- Part 3: Unimatrix Zero — Many Teams, One Repo with SubSquads
- Part 4: When Eight Ralphs Fight Over One Login — Distributed Systems in AI Teams
- Part 5: Knowledge is Power — How an AI Squad Learns to Evolve Itself
- Part 6: 9 AI Agents, One API Quota — The Rate Limiting Problem ← You are here