When a webhook delivery fails, the event is not lost yet, but it will be if you have no recovery plan. A 500 error, a network timeout, or a crashed server can each cause a delivery to fail. Without proper handling, the sender retries a few times and then drops the event for good.
This guide shows you how to build webhook handling that never loses an event, even when things break.
Step 1: Understand How Webhook Deliveries Fail
Webhook delivery can fail at multiple points:

    Sender ──HTTP POST──> Your Server ──Process──> Database
              │                │                      │
              ▼                ▼                      ▼
         Network error    Server crash           DB timeout
         DNS failure      500 error              Constraint violation
         TLS error        Memory limit           Connection refused

What happens after a failure:
Most webhook senders (including AsyncQueue) retry failed deliveries with exponential backoff. But retries have limits. After 3-5 attempts over a few hours, the sender gives up.
| Failure Type | Retried? | Data Lost? |
|---|---|---|
| Your server returns 5xx | Yes | Only if all retries fail |
| Your server returns 4xx | Usually not | Yes (sender assumes the request is invalid) |
| Network timeout | Yes | Only if all retries fail |
| Your server is down for hours | Maybe | Likely - retries may exhaust |
| Your handler throws unhandled exception | Yes (if 5xx) | Only if all retries fail |
The critical insight: you must not rely on the sender’s retry mechanism as your sole safety net. You need your own.
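To see how quickly a retry budget runs out, here is a minimal sketch of an exponential backoff schedule. The base delay (60 seconds) and growth factor (4) are illustrative assumptions, not any particular sender's documented policy; `retryDelays` and `retryWindowSeconds` are hypothetical helper names.

```javascript
// Hypothetical retry schedule: the delay before retry i is base * factor^i.
// The base (60s) and factor (4) are assumptions chosen for illustration.
function retryDelays(maxRetries, baseSeconds, factor) {
  const delays = [];
  for (let i = 0; i < maxRetries; i++) {
    delays.push(baseSeconds * Math.pow(factor, i));
  }
  return delays;
}

// Total time from the first failure until the sender gives up.
function retryWindowSeconds(maxRetries, baseSeconds, factor) {
  return retryDelays(maxRetries, baseSeconds, factor)
    .reduce((sum, delay) => sum + delay, 0);
}

const delays = retryDelays(5, 60, 4);
const windowHours = retryWindowSeconds(5, 60, 4) / 3600;
console.log(delays);                 // [ 60, 240, 960, 3840, 15360 ]
console.log(windowHours.toFixed(1)); // 5.7
```

Under these assumed numbers, any outage longer than about 5.7 hours outlasts every retry, and the event is dropped unless you have your own recovery path.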
Step 2: Respond Fast to Prevent False Failures
The most common cause of “failed” webhook deliveries is slow response time. If your handler takes 10 seconds to process and the sender has a 5-second timeout, the delivery gets marked as failed even though your handler finishes eventually.

    // Assumes `app` is an Express-style app and `aq` is an initialized task queue client.

    // BAD - slow response triggers false failure
    app.post('/api/webhook', async (req, res) => {
      await validateSignature(req);          // 50ms
      await lookupOrder(req.body.orderId);   // 200ms
      await updateInventory(req.body);       // 3000ms
      await sendConfirmation(req.body);      // 2000ms
      await updateAnalytics(req.body);       // 1000ms
      // Total: 6.25 seconds - sender may have already timed out
      res.json({ received: true });
    });

    // GOOD - respond instantly, process later
    app.post('/api/webhook', async (req, res) => {
      await validateSignature(req); // 50ms - must do this synchronously

      // Queue for reliable background processing
      await aq.tasks.create({
        targetUrl: 'https://your-app.com/api/process-webhook-event',
        payload: req.body,
        maxRetries: 5,
        retryBackoff: 'exponential',
      });

      res.json({ received: true }); // 100ms total
    });

Rule of thumb: Your webhook endpoint should respond in under 1 second. Anything slower belongs in a background task.
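The signature check is the one step that has to stay in the request path. The guide does not show how it works, so the following is only a sketch of one common scheme: a hex-encoded HMAC-SHA256 of the raw request body, compared in constant time. The scheme, the secret format, and the helper names `computeSignature` and `isValidSignature` are assumptions; check your sender's documentation for its actual algorithm and header.

```javascript
const crypto = require('crypto');

// Compute the hex-encoded HMAC-SHA256 of the raw request body.
function computeSignature(rawBody, secret) {
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Compare in constant time. timingSafeEqual throws when the lengths differ,
// so the lengths are checked first.
function isValidSignature(rawBody, receivedSignature, secret) {
  const expected = Buffer.from(computeSignature(rawBody, secret), 'utf8');
  const received = Buffer.from(String(receivedSignature || ''), 'utf8');
  if (expected.length !== received.length) {
    return false;
  }
  return crypto.timingSafeEqual(expected, received);
}

// Usage with hypothetical values
const secret = 'whsec_example';
const body = '{"id":"evt_1","type":"order.paid"}';
const goodSignature = computeSignature(body, secret);

console.log(isValidSignature(body, goodSignature, secret));       // true
console.log(isValidSignature(body + ' ', goodSignature, secret)); // false
console.log(isValidSignature(body, 'not-a-signature', secret));   // false
```

The signature must be computed over the exact bytes the sender transmitted, not over re-serialized JSON. In Express, one way to keep those bytes is the `verify` option of `express.json()`, which receives the raw buffer before parsing.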
Step 3: Use a Task Queue as a Reliability Buffer
By routing incoming webhooks through a task queue, you gain automatic retries, persistence, and observability with very little extra code.

    // Webhook receiver - minimal, fast, reliable
    app.post('/api/webhook', async (req, res) => {
      if (!verifySignature(req)) {
        return res.status(401).json({ error: 'Invalid signature' });
      }

      // Build the payload once so the queued task and the audit record match
      const payload = {
        eventId: req.body.id,
        eventType: req.body.type,
        data: req.body.data,
        receivedAt: new Date().toISOString(),
      };

      const { task } = await aq.tasks.create({
        targetUrl: 'https://your-app.com/api/handle-event',
        payload,
        maxRetries: 5,
        retryBackoff: 'exponential',
        timeout: 30,
      });

      // Store event reference for auditing, including the payload so the event can be replayed later
      await db.webhookEvents.insert({
        eventId: req.body.id,
        taskId: task.id,
        eventType: req.body.type,
        payload,
        receivedAt: new Date(),
        status: 'queued',
      });

      res.json({ received: true });
    });

What this gives you:
- Webhook sender sees a fast 200 response (no retries on their end)
- If your handler fails, the task queue retries 5 times with backoff
- Every event lands in your database for auditing
- You can inspect failed events in the task dashboard
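Because every event is recorded with a `status` and a `receivedAt` timestamp, the audit table can also be used to spot events that never left the `queued` state. The sketch below runs over in-memory rows so it is self-contained; `findStuckEvents`, the one-hour threshold, and the sample rows are hypothetical, and a real version would query your database instead.

```javascript
// Return events still in 'queued' status after maxAgeMs, oldest first.
// `status` and `receivedAt` mirror the audit record fields; the threshold is an assumption.
function findStuckEvents(events, now, maxAgeMs) {
  return events
    .filter((event) => event.status === 'queued')
    .filter((event) => now.getTime() - event.receivedAt.getTime() > maxAgeMs)
    .sort((a, b) => a.receivedAt.getTime() - b.receivedAt.getTime());
}

// Usage with in-memory sample rows (hypothetical data)
const now = new Date('2024-01-01T12:00:00Z');
const sampleEvents = [
  { eventId: 'evt_1', status: 'processed', receivedAt: new Date('2024-01-01T09:00:00Z') },
  { eventId: 'evt_2', status: 'queued', receivedAt: new Date('2024-01-01T10:00:00Z') },
  { eventId: 'evt_3', status: 'queued', receivedAt: new Date('2024-01-01T11:55:00Z') },
  { eventId: 'evt_4', status: 'queued', receivedAt: new Date('2024-01-01T08:00:00Z') },
];

const ONE_HOUR_MS = 60 * 60 * 1000;
const stuck = findStuckEvents(sampleEvents, now, ONE_HOUR_MS);
console.log(stuck.map((event) => event.eventId)); // [ 'evt_4', 'evt_2' ]
```

Running a check like this on a schedule surfaces events whose task was lost or never completed, which the sender's own retries would not reveal.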
Step 4: Store Failed Events in a Dead Letter Queue
Even with retries, some events will fail for good. A bug in your handler, a schema change you missed, or corrupted data can cause persistent failures. These events need a durable home where you can find and fix them.

    // Event handler with dead letter fallback
    app.post('/api/handle-event', async (req, res) => {
      try {
        await processEvent(req.body);
        await db.webhookEvents.update(req.body.eventId, { status: 'processed' });
        res.json({ received: true });
      } catch (error) {
        // Check if this is likely a permanent failure
        if (isPermanentError(error)) {
          // Store in dead letter queue instead of retrying
          await db.deadLetterEvents.insert({
            eventId: req.body.eventId,
            eventType: req.body.eventType,
            payload: JSON.stringify(req.body.data),
            error: error.message,
            failedAt: new Date(),
          });
          await db.webhookEvents.update(req.body.eventId, { status: 'dead_letter' });

          // Return 200 to prevent further retries
          return res.json({ received: true, deadLettered: true });
        }

        // Transient error - let the task queue retry
        res.status(500).json({ error: 'Processing failed' });
      }
    });

    function isPermanentError(error) {
      // Validation errors, missing data, schema mismatches
      const message = error.message || '';
      return error.name === 'ValidationError' ||
        message.includes('not found') ||
        message.includes('invalid format');
    }

Review dead letter events on a regular schedule:

    // Admin endpoint to list dead letter events
    app.get('/api/admin/dead-letters', async (req, res) => {
      // Look back over the last seven days
      const sevenDaysAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
      const events = await db.deadLetterEvents.find({
        failedAt: { gte: sevenDaysAgo },
      });
      res.json({ events, count: events.length });
    });

Step 5: Build a Replay Mechanism
The ultimate safety net is the ability to reprocess any historical event. This proves invaluable when you fix a bug and need to reprocess every event that failed because of it.

    // Replay a single event
    app.post('/api/admin/replay-event', async (req, res) => {
      const { eventId } = req.body;

      // Find the original event
      const event = await db.webhookEvents.findOne({ eventId });
      if (!event) {
        return res.status(404).json({ error: 'Event not found' });
      }

      // Create a new task to reprocess
      // (requires the original payload to be stored with the event record)
      const { task } = await aq.tasks.create({
        targetUrl: 'https://your-app.com/api/handle-event',
        payload: event.payload,
        maxRetries: 3,
      });

      await db.webhookEvents.update(eventId, {
        status: 'replayed',
        replayTaskId: task.id,
      });

      res.json({ taskId: task.id, status: 'replayed' });
    });

    // Replay all dead letter events from a date range
    app.post('/api/admin/replay-dead-letters', async (req, res) => {
      const { since, until } = req.body;
      const events = await db.deadLetterEvents.find({
        failedAt: { gte: since, lte: until },
      });

      const tasks = [];
      for (const event of events) {
        const { task } = await aq.tasks.create({
          targetUrl: 'https://your-app.com/api/handle-event',
          // Rebuild the shape the handler expects; the dead letter row stores only `data`
          payload: {
            eventId: event.eventId,
            eventType: event.eventType,
            data: JSON.parse(event.payload),
          },
          maxRetries: 3,
        });
        tasks.push(task.id);
      }

      res.json({ replayed: tasks.length, taskIds: tasks });
    });

Best practices for replay:
- Always replay through the same handler, not a special path
- Log replayed events distinctly so you can trace their origin
- Ensure your handler is idempotent so replays run safely
- Test replay on a staging environment before running against production
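The idempotency requirement above can be sketched as a guard keyed by event ID. This is a minimal illustration, not a production design: `createIdempotentHandler` is a hypothetical helper, and the in-memory `Set` stands in for a durable store such as a table with a unique key on the event ID.

```javascript
// Minimal idempotency guard keyed by event ID.
// Assumption: the in-memory Set stands in for a durable store with a unique key.
function createIdempotentHandler(processEvent) {
  const processedIds = new Set();

  return function handle(event) {
    if (processedIds.has(event.eventId)) {
      return { processed: false, duplicate: true };
    }
    processEvent(event);
    // Record the ID only after processing succeeds, so a thrown error
    // leaves the event eligible for retry.
    processedIds.add(event.eventId);
    return { processed: true, duplicate: false };
  };
}

// Usage: the second delivery of evt_1 (for example, a replay) has no side effect.
let chargesApplied = 0;
const handle = createIdempotentHandler(() => { chargesApplied += 1; });

console.log(handle({ eventId: 'evt_1' })); // { processed: true, duplicate: false }
console.log(handle({ eventId: 'evt_1' })); // { processed: false, duplicate: true }
console.log(chargesApplied);               // 1
```

With several workers running at once, an in-process set is not enough; the check and the write would need to happen atomically in shared storage.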