
Cascade Failure in Distributed Systems

A cascade failure is a chain reaction where the failure of one component spreads to others through shared resources. When a single service slows down or stops responding, callers hold open connections and pile up retries. This consumes threads, memory, and connection pools across the system until healthy services begin failing too.

How It Works

  1. A downstream service becomes slow or unresponsive
  2. Upstream callers accumulate waiting requests, holding connections open
  3. Thread pools and connection pools fill up on the calling services
  4. Those callers start rejecting their own incoming requests
  5. The failure propagates further upstream until broad outages occur

Service A (healthy)
└── calls Service B (healthy)
    └── calls Service C (failing)

C stalls → B's connection pool fills → B starts failing
B stalls → A's connection pool fills → A starts failing
        → full system outage
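
The pool-exhaustion step can be sketched as a toy model (hypothetical names, not a real client library): a caller holds one pool slot per in-flight downstream call, so a hung dependency stops slots from recycling and the caller starts rejecting its own traffic.

```python
class BoundedPool:
    """Toy model of a caller's connection pool with a fixed number
    of slots. A slot is held for the full duration of a downstream call."""

    def __init__(self, size):
        self.size = size
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.size:
            return False  # pool full: caller rejects its own incoming request
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


pool = BoundedPool(size=3)

# Healthy downstream: calls complete, slots recycle, nothing is rejected.
for _ in range(10):
    assert pool.try_acquire()
    pool.release()

# Hung downstream: calls never complete, so slots are never released.
# The first 3 requests occupy the pool; the other 7 are rejected —
# the failure has now propagated one hop upstream.
rejected = 0
for _ in range(10):
    if pool.try_acquire():
        pass  # call hangs; slot stays occupied
    else:
        rejected += 1
print(rejected)  # → 7
```

Real pools add blocking with timeouts rather than rejecting immediately, but the dynamic is the same: the pool's capacity bounds how long a caller can absorb a stalled dependency before it starts failing itself.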

The Amplification Effect in Serverless

In serverless environments, cascade failures take a different shape. A slow downstream dependency causes functions to run longer, which spins up more concurrent instances. Each new instance makes another call to the struggling dependency, amplifying the load. This can exhaust concurrency limits across your entire account and affect unrelated functions that share the same quota.
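
The amplification is just Little's Law: average concurrency equals arrival rate times time in system. A back-of-the-envelope sketch (the numbers below are illustrative, not from the text):

```python
def concurrent_instances(requests_per_second, avg_duration_seconds):
    # Little's Law: average concurrency = arrival rate x time in system.
    return requests_per_second * avg_duration_seconds


# 100 req/s against a healthy dependency answering in 200 ms:
print(concurrent_instances(100, 0.2))  # → 20.0 instances

# The same traffic when the dependency degrades to 5 s responses:
print(concurrent_instances(100, 5.0))  # → 500.0 instances
```

A 25x latency regression becomes a 25x jump in concurrent instances at constant traffic, which is how one slow dependency can eat a shared account-level concurrency quota.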

Prevention Strategies

  • Circuit breakers stop calling a service after repeated failures
  • Timeouts prevent callers from waiting indefinitely for responses
  • Backpressure signals upstream services to slow their request rate
  • Rate limiting caps the volume of outbound calls to protect fragile dependencies
  • Task queues absorb traffic spikes by buffering requests instead of forwarding them directly
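
The first strategy can be sketched as a minimal circuit breaker (a simplified model, not any particular library's API): after enough consecutive failures the breaker "opens" and fails fast without calling the dependency, then allows a trial call after a cooldown.

```python
import time


class CircuitBreaker:
    """Minimal sketch: open after `max_failures` consecutive failures,
    fail fast until `reset_after` seconds pass, then allow one trial
    call (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: no thread, connection, or time is spent
                # on a dependency that is known to be struggling.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit again
        return result


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)


def flaky():
    raise ConnectionError("downstream timed out")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
# The circuit is now open: further calls fail fast for 30 s
# instead of holding a connection while the dependency struggles.
```

Failing fast is what breaks the chain reaction: the caller's pools stay free, so the failure stops propagating upstream.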

When Cascade Failures Strike

  • A database connection pool is exhausted by one bad query pattern
  • A third-party API goes down and every service retries it simultaneously
  • A deploy introduces a latency regression that propagates through the call graph
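
The simultaneous-retry scenario is why clients should retry with exponential backoff and jitter rather than on a fixed schedule. A minimal sketch (function name and parameters are illustrative):

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter backoff: pick a random delay in
    [0, min(cap, base * 2^attempt)] so that retries from many
    clients spread out instead of arriving in synchronized waves."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


for attempt in range(5):
    ceiling = min(30.0, 0.5 * 2 ** attempt)
    print(f"attempt {attempt}: wait up to {ceiling:.1f}s")
```

Without jitter, every client that saw the same failure retries at the same instant, hammering the recovering service back down; randomized delays turn the retry wave into a trickle it can absorb.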