
Handling Failures in API Chains

With a single API call, recovery is simple: retry or return an error. In a multi-step chain, recovery becomes a design problem. Step 3 fails, but steps 1 and 2 already committed data. Do you roll back? Retry? Skip and continue? The answer depends on what failed, why, and what already succeeded.

This guide covers practical patterns for every failure scenario in async API chains. The goal: workflows that recover from transient issues on their own and degrade gracefully when recovery proves impossible.

Prerequisites: Read How to Chain Multiple API Calls and Managing State in API Workflows first.

Step 1: Classify Your Failure Types

Not all failures are equal. Your response strategy depends on the failure category.

Transient failures

The service is temporarily unavailable. A retry will likely succeed.

  • HTTP 429 (rate limited), 502, 503, 504
  • Connection timeouts and DNS resolution errors
  • Database connection pool exhaustion

Strategy: Retry with exponential backoff.
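As a sketch, a backoff helper might look like the following. The base delay, cap, and attempt count are illustrative defaults, and `retryTransient` and `backoffDelay` are hypothetical helpers, not part of any step in this guide:

```javascript
// Exponential backoff with full jitter: delay grows 2x per attempt,
// capped at maxMs, then randomized so clients don't retry in lockstep.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

// Retry an async call on any error, waiting a jittered backoff between
// attempts. Rethrows once attempts are exhausted.
async function retryTransient(fn, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Full jitter matters at scale: when many clients hit the same outage, randomized delays spread the recovery traffic instead of producing synchronized retry stampedes.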

Permanent failures

The request is invalid or the operation will never succeed regardless of retries.

  • HTTP 400 (bad request), 401, 403, 404, 422
  • Business logic rejections (insufficient funds, item out of stock)
  • Schema validation errors

Strategy: Stop retrying, trigger compensation for completed steps.

Partial failures

The call succeeded but returned incomplete or degraded data.

  • A batch operation where 90 out of 100 items succeeded
  • A service that returned stale cached data instead of fresh results
  • A non-critical enrichment that returned empty

Strategy: Continue the chain with what you have, flag for follow-up.
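As an illustration, a handler for a partially successful batch can continue with the successful subset while recording the rest. The `db.followUps` store below is an in-memory stand-in for this sketch, not a real API from this guide:

```javascript
// In-memory stand-in for a follow-up store (assumption for this sketch)
const db = {
  followUps: {
    rows: [],
    insert(row) {
      this.rows.push(row);
    },
  },
};

// Continue the chain on partial batch success: record the failed items
// for follow-up instead of failing the whole workflow.
function handleBatchResult(workflowId, result) {
  const { succeeded, failed } = result; // e.g. 90 succeeded, 10 failed
  if (failed.length > 0) {
    db.followUps.insert({
      workflowId,
      items: failed,
      reason: 'partial_batch_failure',
    });
  }
  // Proceed with what succeeded, flagging the result as partial
  return { items: succeeded, partial: failed.length > 0 };
}
```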

Categorizing in code

function classifyError(error, response) {
  if (!response) {
    return 'transient'; // Network error, no response received
  }
  const status = response.status;
  if (status === 429 || status >= 500) {
    return 'transient';
  }
  if (status >= 400 && status < 500) {
    return 'permanent';
  }
  // Check for partial success in the response body
  if (response.body?.partialResults) {
    return 'partial';
  }
  return 'permanent'; // Default to permanent to avoid infinite retries
}

Step 2: Configure Per-Step Retry Strategies

Each service in your chain has different reliability characteristics. A payment gateway needs cautious retries with idempotency keys. A notification service can be retried freely since sending duplicate messages is acceptable.

const STEP_CONFIG = {
  'validate-order': {
    targetUrl: 'https://your-app.com/api/steps/validate',
    retries: 1,
    timeout: 10,
    // Fast, rarely fails. One retry is enough.
  },
  'charge-payment': {
    targetUrl: 'https://your-app.com/api/steps/charge',
    retries: 2,
    timeout: 30,
    // Critical and slow. Requires idempotency key.
    // Never retry more than 2x to avoid double charges.
  },
  'reserve-inventory': {
    targetUrl: 'https://your-app.com/api/steps/reserve',
    retries: 3,
    timeout: 15,
    // Internal service, reliable but occasionally slow under load.
  },
  'create-shipment': {
    targetUrl: 'https://your-app.com/api/steps/ship',
    retries: 3,
    timeout: 45,
    // Third-party API, variable latency, occasional 503s.
  },
  'send-confirmation': {
    targetUrl: 'https://your-app.com/api/steps/notify',
    retries: 5,
    timeout: 10,
    // Non-critical. Retry freely, sending twice is fine.
  },
};

Idempotency for payment steps

Payment retries without idempotency risk double charges. Pass a deterministic key so the provider can deduplicate.

app.post('/api/steps/charge', async (req, res) => {
  const { workflowId, amount, card } = req.body;
  // Same workflowId always produces the same idempotency key
  const idempotencyKey = `charge-${workflowId}`;
  const result = await paymentProvider.charge({
    amount,
    card,
    idempotencyKey,
  });
  res.json({ transactionId: result.id, amount: result.amount });
});

If AsyncQueue retries this step, the payment provider recognizes the key and returns the original result instead of processing a duplicate charge.

Step 3: Implement Compensation for Completed Steps

When a step fails permanently, you must undo work from earlier steps. This is the Saga pattern: each step has a corresponding compensation action.

Define compensation actions

const COMPENSATIONS = {
  'charge-payment': async (results) => {
    await paymentProvider.refund(results['charge-payment'].transactionId);
  },
  'reserve-inventory': async (results) => {
    await inventoryService.release(results['reserve-inventory'].reservationId);
  },
  'create-shipment': async (results) => {
    await shippingProvider.cancel(results['create-shipment'].shipmentId);
  },
  // No compensation for validate-order (read-only)
  // No compensation for send-confirmation (cannot unsend)
};

Execute compensations in reverse order

async function compensateWorkflow(workflowId) {
  const workflow = await db.workflows.findById(workflowId);
  // Reverse order: undo the most recent step first
  const stepsToCompensate = [...workflow.completedSteps].reverse();
  let allSucceeded = true;
  for (const stepName of stepsToCompensate) {
    const compensate = COMPENSATIONS[stepName];
    if (!compensate) continue;
    try {
      await compensate(workflow.results);
      await db.workflows.update(workflowId, {
        [`steps.${stepName}.status`]: 'compensated',
        [`steps.${stepName}.compensatedAt`]: new Date().toISOString(),
      });
    } catch (err) {
      // Compensation itself failed - this needs human attention
      allSucceeded = false;
      await db.workflows.update(workflowId, {
        [`steps.${stepName}.status`]: 'compensation-failed',
        [`steps.${stepName}.compensationError`]: err.message,
      });
      await alertOpsTeam({
        workflowId,
        step: stepName,
        message: `Compensation failed: ${err.message}`,
        severity: 'critical',
      });
    }
  }
  // Only mark the workflow compensated if every compensation succeeded
  await db.workflows.update(workflowId, {
    status: allSucceeded ? 'compensated' : 'compensation-failed',
    updatedAt: new Date().toISOString(),
  });
}

Key principles:

  • Make compensations idempotent. They may need retries of their own.
  • Log every compensation. You need an audit trail for every reversal.
  • Alert on compensation failures. A failed refund demands immediate human attention.
  • Not every step needs compensation. Read-only steps and idempotent notifications can be skipped.
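For example, a refund compensation can be made idempotent by recording completed refunds and short-circuiting repeats. The `refunds` map and the `paymentProvider.refund` signature below are stand-ins for this sketch, not the provider's actual API:

```javascript
// In-memory stand-ins for a refund ledger and a payment provider
const refunds = new Map();
const paymentProvider = {
  refund: async (transactionId) => ({ refundId: `re-${transactionId}` }),
};

// Idempotent compensation: a second invocation for the same transaction
// returns the recorded result instead of issuing a second refund.
async function compensateCharge(transactionId) {
  if (refunds.has(transactionId)) {
    return refunds.get(transactionId); // already refunded: no-op
  }
  const result = await paymentProvider.refund(transactionId);
  refunds.set(transactionId, result); // log the reversal for the audit trail
  return result;
}
```

In production the ledger would live in the workflow database rather than memory, but the shape is the same: check, act, record, in that order.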

Step 4: Add Dead Letter Handling

When a task exhausts all retries, it should not vanish. Route it to a dead letter queue for inspection and potential replay.

app.post('/api/orchestrator/failed', async (req, res) => {
  const { workflowId, step } = req.body.payload;
  const error = req.body.error;
  // Classify the failure
  const failureType = classifyError(error, req.body.response);
  if (failureType === 'transient') {
    // AsyncQueue already exhausted retries - this is a prolonged outage
    await db.deadLetterQueue.insert({
      workflowId,
      step,
      error,
      payload: req.body.payload,
      reason: 'retries_exhausted',
      createdAt: new Date().toISOString(),
    });
    // Do not compensate yet - the service may recover
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'dead-lettered',
      status: 'awaiting-retry',
    });
  } else {
    // Permanent failure - compensate immediately
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'failed',
      [`steps.${step}.error`]: error,
      status: 'compensating',
    });
    await compensateWorkflow(workflowId);
  }
  await alertOpsTeam({ workflowId, step, error, failureType });
  res.status(200).json({ received: true });
});

Replaying dead letter items

When the downstream service recovers, replay the stored items.

async function replayDeadLetters(stepName) {
  const items = await db.deadLetterQueue.find({
    step: stepName,
    replayed: false,
  });
  for (const item of items) {
    await aq.tasks.create({
      targetUrl: STEP_CONFIG[stepName].targetUrl,
      payload: item.payload,
      webhookUrl: 'https://your-app.com/api/orchestrator',
      retries: STEP_CONFIG[stepName].retries,
      timeout: STEP_CONFIG[stepName].timeout,
    });
    await db.deadLetterQueue.update(item.id, {
      replayed: true,
      replayedAt: new Date().toISOString(),
    });
  }
  console.log(`Replayed ${items.length} dead letter items for step: ${stepName}`);
}

Step 5: Design for Graceful Degradation

Not every step in a chain is equally critical. Identify which steps can fail without blocking the entire workflow.

// Criticality flags - in practice, merge these into the per-step
// config from Step 2 rather than declaring a second object
const STEP_CONFIG = {
  'charge-payment': {
    critical: true, // Workflow cannot continue without this
  },
  'send-confirmation': {
    critical: false, // Nice to have, but order still ships
  },
  'sync-to-crm': {
    critical: false, // Can be retried independently later
  },
  'update-analytics': {
    critical: false, // Losing one event is acceptable
  },
};

Update your orchestrator to bypass non-critical failures.

app.post('/api/orchestrator/failed', async (req, res) => {
  const { workflowId, step } = req.body.payload;
  const error = req.body.error;
  const stepConfig = STEP_CONFIG[step];
  if (stepConfig.critical) {
    // Critical step failed - compensate and stop
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'failed',
      status: 'compensating',
    });
    await compensateWorkflow(workflowId);
  } else {
    // Non-critical step failed - log it and continue
    await db.workflows.update(workflowId, {
      [`steps.${step}.status`]: 'skipped',
      [`steps.${step}.error`]: error,
    });
    // Queue for async retry outside the main workflow
    await db.deferredRetries.insert({
      workflowId,
      step,
      payload: req.body.payload,
      scheduledFor: new Date(Date.now() + 30 * 60 * 1000), // retry in 30 min
    });
    // Continue to the next step in the chain
    const nextStep = getNextStep(step);
    if (nextStep) {
      const workflow = await db.workflows.findById(workflowId);
      await dispatchStep(workflowId, nextStep, {
        ...workflow.context,
        previousResults: workflow.results,
      });
    }
  }
  res.status(200).json({ received: true });
});

This keeps the core workflow moving while deferring non-critical failures for later resolution.
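A worker that later drains those deferred retries might look like the sketch below. The in-memory stores stand in for the database table and AsyncQueue client used elsewhere in this guide, and the poll interval is illustrative:

```javascript
// In-memory stand-ins for db.deferredRetries, the queue client, and the
// per-step config from Step 2 (assumptions for this sketch)
const deferredRetries = [];
const dispatched = [];
const aq = {
  tasks: {
    create: async (task) => {
      dispatched.push(task);
    },
  },
};
const STEP_CONFIG = {
  'sync-to-crm': {
    targetUrl: 'https://your-app.com/api/steps/crm',
    retries: 3,
    timeout: 15,
  },
};

// Re-dispatch every deferred retry whose scheduled time has passed,
// marking each one completed so it is not dispatched twice.
async function processDeferredRetries(now = new Date()) {
  const due = deferredRetries.filter(
    (item) => !item.completed && item.scheduledFor <= now
  );
  for (const item of due) {
    await aq.tasks.create({
      targetUrl: STEP_CONFIG[item.step].targetUrl,
      payload: item.payload,
      webhookUrl: 'https://your-app.com/api/orchestrator',
      retries: STEP_CONFIG[item.step].retries,
      timeout: STEP_CONFIG[item.step].timeout,
    });
    item.completed = true;
  }
}

// In production, run this on a schedule, e.g.
// setInterval(processDeferredRetries, 5 * 60 * 1000);
```

Because the deferred items re-enter through the same orchestrator webhook, a retry that fails again simply lands back in the failure handler with no special casing.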

Decision Flowchart

When a step fails, follow this decision logic.

Step failed
├─ Transient error?
│  ├─ Retries remaining? → Retry (handled by AsyncQueue)
│  └─ Retries exhausted? → Dead letter queue, alert ops
├─ Permanent error?
│  ├─ Critical step? → Compensate completed steps, fail workflow
│  └─ Non-critical step? → Skip, defer for later retry, continue chain
└─ Partial failure?
   └─ Continue with partial data, flag for follow-up

Checklist

Before shipping a chain with failure handling, confirm each item below.

  • Every step is classified as critical or non-critical
  • Retry count and timeout are tuned per step
  • Payment and inventory steps use idempotency keys
  • Compensation actions exist for all critical steps that mutate state
  • Compensations are idempotent and logged
  • Dead letter queue captures tasks that exhaust retries
  • Dead letter items can be replayed with a single command
  • Non-critical step failures do not block the workflow
  • Ops team is alerted on compensation failures and dead letter items