Handling Failures in API Chains
With a single API call, recovery is simple: retry or return an error. In a multi-step chain, recovery becomes a design problem. Step 3 fails, but steps 1 and 2 already committed data. Do you roll back? Retry? Skip and continue? The answer depends on what failed, why, and what already succeeded.
This guide covers practical patterns for the most common failure scenarios in async API chains. The goal: workflows that recover from transient issues on their own and degrade gracefully when recovery proves impossible.
Prerequisites: Read How to Chain Multiple API Calls and Managing State in API Workflows first.
Step 1: Classify Your Failure Types
Not all failures are equal. Your response strategy depends on the failure category.
Transient failures
The service is temporarily unavailable. A retry will likely succeed.
- HTTP 429 (rate limited), 502, 503, 504
- Connection timeouts and DNS resolution errors
- Database connection pool exhaustion
Strategy: Retry with exponential backoff.
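A minimal sketch of that retry strategy (the function names and defaults here are illustrative, not part of any SDK): exponential backoff with full jitter, so simultaneous retries from many clients don't hammer the recovering service in lockstep.

```javascript
// Delay grows as baseMs * 2^attempt, capped at maxMs, with full jitter
// (a random delay between 0 and the cap). attempt is 0-based.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs);
  return Math.floor(Math.random() * exp);
}

// Retry a transient-failing async operation up to maxAttempts times.
async function retryTransient(fn, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Only apply this to failures classified as transient; retrying a permanent failure just delays the inevitable compensation.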
Permanent failures
The request is invalid or the operation will never succeed regardless of retries.
- HTTP 400 (bad request), 401, 403, 404, 422
- Business logic rejections (insufficient funds, item out of stock)
- Schema validation errors
Strategy: Stop retrying, trigger compensation for completed steps.
Partial failures
The call succeeded but returned incomplete or degraded data.
- A batch operation where 90 out of 100 items succeeded
- A service that returned stale cached data instead of fresh results
- A non-critical enrichment that returned empty
Strategy: Continue the chain with what you have, flag for follow-up.
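One way to apply that strategy to a batch step, sketched below. The names (`splitBatchResults`, `handlePartialBatch`, `followUpQueue`) are assumptions for illustration; the idea is simply to carry the successes forward and persist the failures for follow-up rather than failing the whole chain.

```javascript
// Split a batch response into successes and failures.
function splitBatchResults(items) {
  const succeeded = items.filter((item) => item.status === 'ok');
  const failed = items.filter((item) => item.status !== 'ok');
  return { succeeded, failed };
}

// Continue the chain with partial results; flag the rest for follow-up.
async function handlePartialBatch(items, followUpQueue) {
  const { succeeded, failed } = splitBatchResults(items);

  if (failed.length > 0) {
    // Record the failed items instead of aborting the workflow
    await followUpQueue.insert({
      failed,
      flaggedAt: new Date().toISOString(),
    });
  }

  return succeeded; // the next step runs on whatever completed
}
```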
Categorizing in code
```javascript
function classifyError(error, response) {
  if (!response) {
    return 'transient'; // Network error, no response received
  }

  const status = response.status;

  if (status === 429 || status >= 500) {
    return 'transient';
  }

  if (status >= 400 && status < 500) {
    return 'permanent';
  }

  // Check for partial success in the response body
  if (response.body?.partialResults) {
    return 'partial';
  }

  return 'permanent'; // Default to permanent to avoid infinite retries
}
```
Step 2: Configure Per-Step Retry Strategies
Each service in your chain has different reliability characteristics. A payment gateway needs cautious retries with idempotency keys. A notification service can be retried freely since sending duplicate messages is acceptable.
```javascript
const STEP_CONFIG = {
  'validate-order': {
    targetUrl: 'https://your-app.com/api/steps/validate',
    retries: 1,
    timeout: 10,
    // Fast, rarely fails. One retry is enough.
  },
  'charge-payment': {
    targetUrl: 'https://your-app.com/api/steps/charge',
    retries: 2,
    timeout: 30,
    // Critical and slow. Requires idempotency key.
    // Never retry more than 2x to avoid double charges.
  },
  'reserve-inventory': {
    targetUrl: 'https://your-app.com/api/steps/reserve',
    retries: 3,
    timeout: 15,
    // Internal service, reliable but occasionally slow under load.
  },
  'create-shipment': {
    targetUrl: 'https://your-app.com/api/steps/ship',
    retries: 3,
    timeout: 45,
    // Third-party API, variable latency, occasional 503s.
  },
  'send-confirmation': {
    targetUrl: 'https://your-app.com/api/steps/notify',
    retries: 5,
    timeout: 10,
    // Non-critical. Retry freely, sending twice is fine.
  },
};
```
Idempotency for payment steps
Payment retries without idempotency risk double charges. Pass a deterministic key so the provider can deduplicate.
```javascript
app.post('/api/steps/charge', async (req, res) => {
  const { workflowId, amount, card } = req.body;

  // Same workflowId always produces the same idempotency key
  const idempotencyKey = `charge-${workflowId}`;

  const result = await paymentProvider.charge({
    amount,
    card,
    idempotencyKey,
  });

  res.json({ transactionId: result.id, amount: result.amount });
});
```
If AsyncQueue retries this step, the payment provider recognizes the key and returns the original result instead of processing a duplicate charge.
Step 3: Implement Compensation for Completed Steps
When a step fails permanently, you must undo work from earlier steps. This is the Saga pattern: each step has a corresponding compensation action.
Define compensation actions
```javascript
const COMPENSATIONS = {
  'charge-payment': async (results) => {
    await paymentProvider.refund(results['charge-payment'].transactionId);
  },
  'reserve-inventory': async (results) => {
    await inventoryService.release(results['reserve-inventory'].reservationId);
  },
  'create-shipment': async (results) => {
    await shippingProvider.cancel(results['create-shipment'].shipmentId);
  },
  // No compensation for validate-order (read-only)
  // No compensation for send-confirmation (cannot unsend)
};
```
Execute compensations in reverse order
```javascript
async function compensateWorkflow(workflowId) {
  const workflow = await db.workflows.findById(workflowId);

  // Reverse order: undo the most recent step first
  const stepsToCompensate = [...workflow.completedSteps].reverse();

  for (const stepName of stepsToCompensate) {
    const compensate = COMPENSATIONS[stepName];
    if (!compensate) continue;

    try {
      await compensate(workflow.results);

      await db.workflows.update(workflowId, {
        [`steps.${stepName}.status`]: 'compensated',
        [`steps.${stepName}.compensatedAt`]: new Date().toISOString(),
      });
    } catch (err) {
      // Compensation itself failed - this needs human attention
      await db.workflows.update(workflowId, {
        [`steps.${stepName}.status`]: 'compensation-failed',
        [`steps.${stepName}.compensationError`]: err.message,
      });

      await alertOpsTeam({
        workflowId,
        step: stepName,
        message: `Compensation failed: ${err.message}`,
        severity: 'critical',
      });
    }
  }

  await db.workflows.update(workflowId, {
    status: 'compensated',
    updatedAt: new Date().toISOString(),
  });
}
```
Key principles:
- Make compensations idempotent. They may need retries of their own.
- Log every compensation. You need an audit trail for every reversal.
- Alert on compensation failures. A failed refund demands immediate human attention.
- Not every step needs compensation. Read-only steps and idempotent notifications can be skipped.
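The first principle deserves an example. Below is a sketch of an idempotent refund compensation, safe to run more than once; the `db.refunds` store and the `paymentProvider.refund` signature are assumptions to adapt to your own data layer and provider SDK (dependencies are injected to keep the sketch testable).

```javascript
async function refundIdempotent({ db, paymentProvider }, workflowId, transactionId) {
  // Deterministic key, mirroring the charge-side idempotency key
  const refundKey = `refund-${workflowId}`;

  // If a refund was already recorded, return it instead of refunding again
  const existing = await db.refunds.findByKey(refundKey);
  if (existing) return existing;

  const refund = await paymentProvider.refund(transactionId, {
    idempotencyKey: refundKey,
  });

  // Log the reversal for the audit trail
  const record = {
    key: refundKey,
    refundId: refund.id,
    refundedAt: new Date().toISOString(),
  };
  await db.refunds.insert(record);
  return record;
}
```

Running this twice for the same workflow issues at most one refund; the second call short-circuits on the stored record.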
Step 4: Add Dead Letter Handling
When a task exhausts all retries, it should not vanish. Route it to a dead letter queue for inspection and potential replay.
```javascript
app.post('/api/orchestrator/failed', async (req, res) => {
  const { workflowId, step } = req.body.payload;
  const error = req.body.error;

  // Classify the failure
  const failureType = classifyError(error, req.body.response);

  if (failureType === 'transient') {
    // AsyncQueue already exhausted retries - this is a prolonged outage
    await db.deadLetterQueue.insert({
      workflowId,
      step,
      error,
      payload: req.body.payload,
      reason: 'retries_exhausted',
      createdAt: new Date().toISOString(),
    });

    // Do not compensate yet - the service may recover
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'dead-lettered',
      status: 'awaiting-retry',
    });
  } else {
    // Permanent failure - compensate immediately
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'failed',
      [`steps.${step}.error`]: error,
      status: 'compensating',
    });

    await compensateWorkflow(workflowId);
  }

  await alertOpsTeam({ workflowId, step, error, failureType });
  res.status(200).json({ received: true });
});
```
Replaying dead letter items
When the downstream service recovers, replay the stored items.
```javascript
async function replayDeadLetters(stepName) {
  const items = await db.deadLetterQueue.find({
    step: stepName,
    replayed: false,
  });

  for (const item of items) {
    await aq.tasks.create({
      targetUrl: STEP_CONFIG[stepName].targetUrl,
      payload: item.payload,
      webhookUrl: 'https://your-app.com/api/orchestrator',
      retries: STEP_CONFIG[stepName].retries,
      timeout: STEP_CONFIG[stepName].timeout,
    });

    await db.deadLetterQueue.update(item.id, {
      replayed: true,
      replayedAt: new Date().toISOString(),
    });
  }

  console.log(`Replayed ${items.length} dead letter items for step: ${stepName}`);
}
```
Step 5: Design for Graceful Degradation
Not every step in a chain is equally critical. Identify which steps can fail without blocking the entire workflow.
```javascript
const STEP_CONFIG = {
  'charge-payment': {
    critical: true, // Workflow cannot continue without this
  },
  'send-confirmation': {
    critical: false, // Nice to have, but order still ships
  },
  'sync-to-crm': {
    critical: false, // Can be retried independently later
  },
  'update-analytics': {
    critical: false, // Losing one event is acceptable
  },
};
```
Update your orchestrator to bypass non-critical failures.
```javascript
app.post('/api/orchestrator/failed', async (req, res) => {
  const { workflowId, step } = req.body.payload;
  const error = req.body.error;
  const stepConfig = STEP_CONFIG[step];

  if (stepConfig.critical) {
    // Critical step failed - compensate and stop
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'failed',
      status: 'compensating',
    });
    await compensateWorkflow(workflowId);
  } else {
    // Non-critical step failed - log it and continue
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'skipped',
      [`steps.${step}.error`]: error,
    });

    // Queue for async retry outside the main workflow
    await db.deferredRetries.insert({
      workflowId,
      step,
      payload: req.body.payload,
      scheduledFor: new Date(Date.now() + 30 * 60 * 1000), // retry in 30 min
    });

    // Continue to the next step in the chain
    const nextStep = getNextStep(step);
    if (nextStep) {
      const workflow = await db.workflows.findById(workflowId);
      await dispatchStep(workflowId, nextStep, {
        ...workflow.context,
        previousResults: workflow.results,
      });
    }
  }

  res.status(200).json({ received: true });
});
```
This keeps the core workflow moving while deferring non-critical failures for later resolution.
Decision Flowchart
When a step fails, follow this decision logic.
```
Step failed
├─ Transient error?
│  ├─ Retries remaining? → Retry (handled by AsyncQueue)
│  └─ Retries exhausted? → Dead letter queue, alert ops
├─ Permanent error?
│  ├─ Critical step? → Compensate completed steps, fail workflow
│  └─ Non-critical step? → Skip, defer for later retry, continue chain
└─ Partial failure?
   └─ Continue with partial data, flag for follow-up
```
Checklist
Before shipping a chain with failure handling, confirm each item below.
- Every step is classified as critical or non-critical
- Retry count and timeout are tuned per step
- Payment and inventory steps use idempotency keys
- Compensation actions exist for all critical steps that mutate state
- Compensations are idempotent and logged
- Dead letter queue captures tasks that exhaust retries
- Dead letter items can be replayed with a single command
- Non-critical step failures do not block the workflow
- Ops team is alerted on compensation failures and dead letter items