
Cascade Failure in Distributed Systems

A cascade failure is a chain reaction where the failure of one component spreads to others through shared resources. When a single service slows down or stops responding, callers hold open connections and pile up retries. This consumes threads, memory, and connection pools across the system until healthy services begin failing too.

How It Works

  1. A downstream service becomes slow or unresponsive
  2. Upstream callers accumulate waiting requests, holding connections open
  3. Thread pools and connection pools fill up on the calling services
  4. Those callers start rejecting their own incoming requests
  5. The failure propagates further upstream until broad outages occur

Service A (healthy)
└── calls Service B (healthy)
    └── calls Service C (failing)

C stalls → B's connection pool fills → B starts failing
B stalls → A's connection pool fills → A starts failing
        → full system outage
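
The pool-exhaustion step can be sketched as a toy model (hypothetical names, not a real client library): a caller holds one pool slot per in-flight downstream call, so a hung dependency stops slots from recycling and the caller starts rejecting its own traffic.

```python
class BoundedPool:
    """Toy model of a caller's connection pool with a fixed number
    of slots. A slot is held for the full duration of a downstream call."""

    def __init__(self, size):
        self.size = size
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.size:
            return False  # pool full: caller rejects its own incoming request
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


pool = BoundedPool(size=3)

# Healthy downstream: calls complete, slots recycle, nothing is rejected.
for _ in range(10):
    assert pool.try_acquire()
    pool.release()

# Hung downstream: calls never complete, so slots are never released.
# The first 3 requests occupy the pool; the other 7 are rejected —
# the failure has now propagated one hop upstream.
rejected = 0
for _ in range(10):
    if pool.try_acquire():
        pass  # call hangs; slot stays occupied
    else:
        rejected += 1
print(rejected)  # → 7
```

Real pools add blocking with timeouts rather than rejecting immediately, but the dynamic is the same: the pool's capacity bounds how long a caller can absorb a stalled dependency before it starts failing itself.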

The Amplification Effect in Serverless

In serverless environments, cascade failures take a different shape. A slow downstream dependency causes functions to run longer, which spins up more concurrent instances. Each new instance makes another call to the struggling dependency, amplifying the load. This can exhaust concurrency limits across your entire account and affect unrelated functions that share the same quota.
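
The amplification is just Little's Law: average concurrency equals arrival rate times time in system. A back-of-the-envelope sketch (the numbers below are illustrative, not from the text):

```python
def concurrent_instances(requests_per_second, avg_duration_seconds):
    # Little's Law: average concurrency = arrival rate x time in system.
    return requests_per_second * avg_duration_seconds


# 100 req/s against a healthy dependency answering in 200 ms:
print(concurrent_instances(100, 0.2))  # → 20.0 instances

# The same traffic when the dependency degrades to 5 s responses:
print(concurrent_instances(100, 5.0))  # → 500.0 instances
```

A 25x latency regression becomes a 25x jump in concurrent instances at constant traffic, which is how one slow dependency can eat a shared account-level concurrency quota.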

Prevention Strategies

  • Circuit breakers stop calling a service after repeated failures
  • Timeouts prevent callers from waiting indefinitely for responses
  • Backpressure signals upstream services to slow their request rate
  • Rate limiting caps the volume of outbound calls to protect fragile dependencies
  • Task queues absorb traffic spikes by buffering requests instead of forwarding them directly
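
The first strategy can be sketched as a minimal circuit breaker (a simplified model, not any particular library's API): after enough consecutive failures the breaker "opens" and fails fast without calling the dependency, then allows a trial call after a cooldown.

```python
import time


class CircuitBreaker:
    """Minimal sketch: open after `max_failures` consecutive failures,
    fail fast until `reset_after` seconds pass, then allow one trial
    call (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: no thread, connection, or time is spent
                # on a dependency that is known to be struggling.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit again
        return result


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)


def flaky():
    raise ConnectionError("downstream timed out")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
# The circuit is now open: further calls fail fast for 30 s
# instead of holding a connection while the dependency struggles.
```

Failing fast is what breaks the chain reaction: the caller's pools stay free, so the failure stops propagating upstream.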

When Cascade Failures Strike

  • A database connection pool is exhausted by one bad query pattern
  • A third-party API goes down and every service retries it simultaneously
  • A deploy introduces a latency regression that propagates through the call graph
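
The simultaneous-retry scenario is why clients should retry with exponential backoff and jitter rather than on a fixed schedule. A minimal sketch (function name and parameters are illustrative):

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: pick a random delay in
    [0, min(cap, base * 2^attempt)] so that retries from many
    clients spread out instead of arriving in synchronized waves."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


for attempt in range(5):
    ceiling = min(30.0, 0.5 * 2 ** attempt)
    print(f"attempt {attempt}: wait up to {ceiling:.1f}s")
```

Without jitter, every client that saw the same failure retries at the same instant, hammering the recovering service back down; randomized delays turn the retry wave into a trickle it can absorb.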