# Load Shedding for System Protection
Load shedding is the practice of intentionally rejecting requests when a system is overwhelmed. Rather than letting every request degrade in quality or crash the entire service, load shedding sacrifices some traffic to preserve stability for the rest.
## How It Works
When system resources approach critical thresholds, a load shedding layer begins rejecting incoming requests. The goal is to keep the system operating within its capacity so that accepted requests complete successfully.
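As a minimal sketch of that idea, the handler below rejects new work once an in-flight counter reaches an assumed capacity; `CAPACITY` and `handleRequest` are illustrative placeholders, not part of any particular framework:

```javascript
const http = require('http');

const CAPACITY = 6000; // assumed concurrency ceiling; tune to your service
let inFlight = 0;      // live load signal: requests currently being processed

async function handleRequest(req, res) {
  // Placeholder for the real request handling work
  res.writeHead(200);
  res.end('ok');
}

const server = http.createServer((req, res) => {
  if (inFlight >= CAPACITY) {
    // At capacity: shed immediately rather than letting every request slow down
    res.writeHead(503, { 'Retry-After': '1' });
    res.end('overloaded');
    return;
  }
  inFlight++;
  handleRequest(req, res).finally(() => inFlight--);
});

server.listen(8080);
```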
```
Incoming traffic: 10,000 req/sec
System capacity:   6,000 req/sec

Without load shedding:
  → All 10,000 slow down → timeouts → cascading failure

With load shedding:
  → 6,000 accepted and processed normally
  → 4,000 rejected immediately with HTTP 503
```

## Load Shedding vs. Rate Limiting
These two mechanisms serve different purposes:
| Aspect | Rate Limiting | Load Shedding |
|---|---|---|
| Trigger | Per-client request count | Overall system resource pressure |
| Goal | Fair usage enforcement | System survival |
| Timing | Proactive - applied before overload | Reactive - kicks in during overload |
| Scope | Per user, per API key, per IP | Global across all traffic |
Rate limiting prevents any single client from consuming too many resources. Load shedding protects the entire system when aggregate demand exceeds capacity, regardless of how many clients contribute to that demand.
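The distinction is easy to see in code. A hedged sketch, using an assumed in-memory per-client counter and a global load gauge as stand-ins for real infrastructure:

```javascript
// Illustrative only: `requestCounts` and `systemLoad` are assumed stand-ins,
// not a real library's API.
const requestCounts = new Map(); // per-client counts (window reset omitted)
let systemLoad = 0.0;            // global utilization signal, 0.0 to 1.0

// Rate limiting: a per-client decision, applied regardless of overall load
function isRateLimited(clientId, limitPerWindow = 100) {
  const count = (requestCounts.get(clientId) || 0) + 1;
  requestCounts.set(clientId, count);
  return count > limitPerWindow;
}

// Load shedding: a global decision, applied regardless of which client asks
function isShed() {
  return systemLoad > 0.9;
}
```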
## Strategies for Choosing What to Drop
Not all requests carry equal importance. Effective load shedding prioritizes what matters most:
- Priority-based: Assign priority levels to request types and drop the lowest tier first
- Random sampling: Reject a percentage of all requests evenly for simple implementation
- Age-based: Drop requests that have already waited too long, since callers may have given up
- Cost-based: Shed expensive operations first to free the most resources per rejection
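For example, the snippet below sketches the priority-based approach: below 80% load nothing is shed, above 95% everything except critical traffic is dropped, and in between only low-priority requests are rejected.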
```javascript
function shouldShed(request, systemLoad) {
  if (systemLoad < 0.8) return false; // No shedding needed

  const priority = request.headers['x-priority'] || 'low';

  if (systemLoad > 0.95) return priority !== 'critical';
  if (systemLoad > 0.85) return priority === 'low';

  return false;
}
```

## Implementing Load Shedding
Key decisions when adding load shedding to a service (a sketch tying them together follows the list):
- Pick your signal: CPU usage, memory pressure, queue depth, or request latency
- Set thresholds: Define when shedding begins and when it stops
- Return fast: Rejected requests should get an immediate 503 response with a `Retry-After` header
- Monitor the shed rate: Track how often shedding activates to guide capacity planning
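A minimal sketch of those decisions wired together, assuming a hypothetical `getQueueDepth()` signal and hand-picked thresholds; the gap between the start threshold (1,000) and the stop threshold (600) adds hysteresis so the shedder does not flap on and off:

```javascript
const SHED_START = 1000; // assumed queue depth at which shedding begins
const SHED_STOP  = 600;  // lower stop threshold provides hysteresis

let shedding = false;
let shedCount = 0; // monitor: total requests rejected, for capacity planning

function getQueueDepth() {
  // Placeholder signal; a real service would read its actual queue depth,
  // CPU usage, or request latency here.
  return 0;
}

function maybeShed(req, res) {
  const depth = getQueueDepth();

  // Start shedding above SHED_START, stop only once below SHED_STOP
  if (!shedding && depth > SHED_START) shedding = true;
  if (shedding && depth < SHED_STOP) shedding = false;

  if (shedding) {
    shedCount++; // feeds the shed-rate metric
    res.writeHead(503, { 'Retry-After': '2' }); // return fast with a retry hint
    res.end('overloaded');
    return true; // request was shed
  }
  return false; // caller proceeds with normal handling
}
```

Queue depth is only one possible signal; the same structure works with any of the signals from the first bullet.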
## When to Use Load Shedding
- Services that experience sudden traffic spikes arriving faster than normal scaling can respond
- Systems where partial availability beats total unavailability
- APIs serving mixed workloads with clear priority distinctions
- Infrastructure with hard resource ceilings that cannot auto-scale quickly