When a webhook delivery fails, the event is not lost right away, but it will be if you have no recovery plan. A 500 error, a network timeout, or a crashed server can all cause a delivery to fail. Without proper handling, those events get retried a few times and then dropped for good.

This guide shows you how to build webhook handling that never loses an event, even when things break.

Step 1: Understand How Webhook Deliveries Fail

Webhook delivery can fail at multiple points:

Sender ──HTTP POST──> Your Server ──Process──> Database
         │                │              │
         ▼                ▼              ▼
    Network error    Server crash    DB timeout
    DNS failure      500 error       Constraint violation
    TLS error        Memory limit    Connection refused

What happens after a failure:

Most webhook senders (including AsyncQueue) retry failed deliveries with exponential backoff. But retries have limits. The exact numbers vary by sender, but a typical policy is 3-5 attempts over a few hours, after which the sender gives up.

Failure Type	Retried?	Data Lost?
Your server returns 5xx	Yes	Only if all retries fail
Your server returns 4xx	Usually not	Yes (sender assumes the request is invalid)
Network timeout	Yes	Only if all retries fail
Your server is down for hours	Maybe	Likely - retries may exhaust
Your handler throws unhandled exception	Yes (if 5xx)	Only if all retries fail
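
To make the retry limit concrete, the sketch below computes how long a sender keeps trying under exponential backoff. The function names and the policy (five retries, starting at 5 minutes, doubling each time) are assumptions chosen for illustration, not the documented behavior of any particular sender.

```javascript
// Hypothetical retry policy: the numbers used below are assumptions for illustration.
// Returns the wait before each retry, in milliseconds.
function backoffSchedule(maxRetries, baseDelayMs, factor = 2) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseDelayMs * factor ** attempt);
  }
  return delays;
}

// Total time between the first failure and the final retry.
function retryWindowMs(maxRetries, baseDelayMs, factor = 2) {
  return backoffSchedule(maxRetries, baseDelayMs, factor)
    .reduce((sum, delay) => sum + delay, 0);
}

// Five retries, starting at 5 minutes and doubling each time
const delays = backoffSchedule(5, 5 * 60 * 1000);
const windowMinutes = retryWindowMs(5, 5 * 60 * 1000) / 60000;

console.log(delays.map((d) => d / 60000)); // waits in minutes: 5, 10, 20, 40, 80
console.log(windowMinutes);                // 155 minutes in total
```

Under this assumed policy, an outage longer than about two and a half hours outlasts every retry, and the event is dropped.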

The critical insight: you must not rely on the sender’s retry mechanism as your sole safety net. You need your own.

Step 2: Respond Fast to Prevent False Failures

The most common cause of “failed” webhook deliveries is slow response time. If your handler takes 10 seconds to process and the sender has a 5-second timeout, the delivery gets marked as failed even though your handler finishes eventually.

// BAD - slow response triggers false failure
app.post('/api/webhook', async (req, res) => {
  await validateSignature(req);          // 50ms
  await lookupOrder(req.body.orderId);   // 200ms
  await updateInventory(req.body);       // 3000ms
  await sendConfirmation(req.body);      // 2000ms
  await updateAnalytics(req.body);       // 1000ms
  // Total: 6.25 seconds - sender may have already timed out
  res.json({ received: true });
});

// GOOD - respond instantly, process later
app.post('/api/webhook', async (req, res) => {
  await validateSignature(req);  // 50ms - must do this synchronously

  // Queue for reliable background processing
  await aq.tasks.create({
    targetUrl: 'https://your-app.com/api/process-webhook-event',
    payload: req.body,
    maxRetries: 5,
    retryBackoff: 'exponential',
  });

  res.json({ received: true });  // 100ms total
});

Rule of thumb: Your webhook endpoint should respond in under 1 second. Anything slower belongs in a background task.
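
The handlers above call a signature check without defining it. Here is a minimal sketch of one, assuming the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header. The helper names, the signing scheme, and the secret are all assumptions; check your sender's documentation for the real scheme.

```javascript
const crypto = require('node:crypto');

// Assumption: the signature is the hex-encoded HMAC-SHA256 of the raw body.
function computeSignature(rawBody, secret) {
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Hypothetical helper: returns true only when the received signature matches.
function isValidSignature(rawBody, receivedSignature, secret) {
  if (typeof receivedSignature !== 'string') return false;

  const expected = Buffer.from(computeSignature(rawBody, secret), 'hex');
  const received = Buffer.from(receivedSignature, 'hex');

  // timingSafeEqual throws when lengths differ, so check that first
  if (expected.length !== received.length) return false;

  // Constant-time comparison avoids leaking how many bytes matched
  return crypto.timingSafeEqual(expected, received);
}

// Example with made-up values
const exampleBody = '{"id":"evt_123","type":"order.created"}';
const exampleSignature = computeSignature(exampleBody, 'example-secret');
console.log(isValidSignature(exampleBody, exampleSignature, 'example-secret')); // true
console.log(isValidSignature(exampleBody, exampleSignature, 'wrong-secret'));   // false
```

The check has to run against the exact bytes the sender signed. With Express, that usually means capturing the raw body (for example through the `verify` option of `express.json()`) instead of re-serializing `req.body`.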

Step 3: Use a Task Queue as a Reliability Buffer

By routing incoming webhooks through a task queue, you gain automatic retries, persistence, and observability with very little extra code.

// Webhook receiver - minimal, fast, reliable
app.post('/api/webhook', async (req, res) => {
  if (!verifySignature(req)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  // Build the task payload once so the same shape can be stored and replayed
  const payload = {
    eventId: req.body.id,
    eventType: req.body.type,
    data: req.body.data,
    receivedAt: new Date().toISOString(),
  };

  const { task } = await aq.tasks.create({
    targetUrl: 'https://your-app.com/api/handle-event',
    payload,
    maxRetries: 5,
    retryBackoff: 'exponential',
    timeout: 30,
  });

  // Store the event and its payload for auditing and later replay
  await db.webhookEvents.insert({
    eventId: req.body.id,
    taskId: task.id,
    eventType: req.body.type,
    payload,
    receivedAt: new Date(),
    status: 'queued',
  });

  res.json({ received: true });
});

What this gives you:

Webhook sender sees a fast 200 response (no retries on their end)
If your handler fails, the task queue retries 5 times with backoff
Every event lands in your database for auditing
You can inspect failed events in the task dashboard

Step 4: Store Failed Events in a Dead Letter Queue

Even with retries, some events will fail for good. A bug in your handler, a schema change you missed, or corrupted data can cause persistent failures. These events need a durable home where you can find and fix them.

// Event handler with dead letter fallback
app.post('/api/handle-event', async (req, res) => {
  try {
    await processEvent(req.body);
    await db.webhookEvents.update(req.body.eventId, { status: 'processed' });
    res.json({ received: true });
  } catch (error) {
    // Check if this is likely a permanent failure
    if (isPermanentError(error)) {
      // Store in dead letter queue instead of retrying
      await db.deadLetterEvents.insert({
        eventId: req.body.eventId,
        eventType: req.body.eventType,
        payload: JSON.stringify(req.body.data),
        error: error.message,
        failedAt: new Date(),
      });
      await db.webhookEvents.update(req.body.eventId, { status: 'dead_letter' });

      // Return 200 to prevent further retries
      return res.json({ received: true, deadLettered: true });
    }

    // Transient error - let the task queue retry
    res.status(500).json({ error: 'Processing failed' });
  }
});

function isPermanentError(error) {
  // Validation errors, missing data, schema mismatches
  return error.name === 'ValidationError'
    || error.message.includes('not found')
    || error.message.includes('invalid format');
}
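
Matching on message text is brittle, because a reworded error message silently changes how the failure is classified. One alternative, sketched here, is to throw a dedicated error class for failures that retrying cannot fix. The `PermanentError` class and the `isPermanent` helper are hypothetical names, not part of any library.

```javascript
// Hypothetical error class for failures that retrying cannot fix
class PermanentError extends Error {
  constructor(message) {
    super(message);
    this.name = 'PermanentError';
  }
}

// Classify by type instead of by message text
function isPermanent(error) {
  return error instanceof PermanentError
    || (error != null && error.name === 'ValidationError');
}

// Example: processing code marks bad input explicitly
function parseAmount(data) {
  if (typeof data.amount !== 'number') {
    throw new PermanentError('amount must be a number');
  }
  return data.amount;
}

try {
  parseAmount({ amount: 'ten' });
} catch (error) {
  console.log(isPermanent(error)); // true
}
console.log(isPermanent(new Error('connection reset'))); // false
```

Anything not explicitly marked as permanent is treated as transient and left for the task queue to retry.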

Review dead letter events on a regular schedule:

// Admin endpoint to list dead letter events
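app.get('/api/admin/dead-letters', async (req, res) => {
  // Only list events that failed within the last seven days
  const sevenDaysAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
  const events = await db.deadLetterEvents.find({
    failedAt: { gte: sevenDaysAgo },
  });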
  res.json({ events, count: events.length });
});
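
A raw list gets hard to review once it grows. As a rough sketch, the helper below groups dead letter events by event type and error message so the most common failure shows up first. It assumes rows shaped like the `deadLetterEvents` records above; the function name is hypothetical.

```javascript
// Group dead letter rows by "eventType: error" and sort by frequency.
// Assumes each row has eventId, eventType, and error fields.
function summarizeDeadLetters(events) {
  const groups = new Map();
  for (const event of events) {
    const key = `${event.eventType}: ${event.error}`;
    const group = groups.get(key) ?? { key, count: 0, eventIds: [] };
    group.count += 1;
    group.eventIds.push(event.eventId);
    groups.set(key, group);
  }
  // Most frequent failure first
  return [...groups.values()].sort((a, b) => b.count - a.count);
}

// Example with made-up rows
const summary = summarizeDeadLetters([
  { eventId: 'evt_1', eventType: 'order.created', error: 'customer not found' },
  { eventId: 'evt_2', eventType: 'invoice.paid', error: 'invalid format' },
  { eventId: 'evt_3', eventType: 'order.created', error: 'customer not found' },
]);
console.log(summary[0].key);   // order.created: customer not found
console.log(summary[0].count); // 2
```

A group with a high count usually points at a single bug, which makes it a good candidate for a fix followed by a bulk replay.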

Step 5: Build a Replay Mechanism

The ultimate safety net: the ability to re-process any historical event. This proves invaluable when you fix a bug and need to reprocess all events that failed because of the defect.

// Replay a single event
app.post('/api/admin/replay-event', async (req, res) => {
  const { eventId } = req.body;

  // Find the original event
  const event = await db.webhookEvents.findOne({ eventId });
  if (!event) {
    return res.status(404).json({ error: 'Event not found' });
  }

  // Create a new task to reprocess
  const { task } = await aq.tasks.create({
    targetUrl: 'https://your-app.com/api/handle-event',
    payload: event.payload,
    maxRetries: 3,
  });

  await db.webhookEvents.update(eventId, {
    status: 'replayed',
    replayTaskId: task.id,
  });

  res.json({ taskId: task.id, status: 'replayed' });
});

// Replay all dead letter events from a date range
app.post('/api/admin/replay-dead-letters', async (req, res) => {
  const { since, until } = req.body;
  const events = await db.deadLetterEvents.find({
    failedAt: { gte: since, lte: until },
  });

  const tasks = [];
  for (const event of events) {
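    const { task } = await aq.tasks.create({
      targetUrl: 'https://your-app.com/api/handle-event',
      // Dead letter rows store only `data`, so rebuild the shape the handler expects
      payload: {
        eventId: event.eventId,
        eventType: event.eventType,
        data: JSON.parse(event.payload),
      },
      maxRetries: 3,
    });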
    tasks.push(task.id);
  }

  res.json({ replayed: tasks.length, taskIds: tasks });
});
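
In the bulk replay above, one failed `aq.tasks.create` call aborts the loop and leaves the remaining events untouched. The sketch below is one way to keep going and report failures separately. The `createTask` argument stands in for `aq.tasks.create` so the function can be tested without the queue, and the `replay: true` flag is an assumed convention for marking replayed events, not part of any API.

```javascript
// Replay dead letter rows one by one, collecting successes and failures.
// createTask is injected (for example, (opts) => aq.tasks.create(opts)).
async function replayEvents(events, createTask) {
  const replayed = [];
  const failed = [];

  for (const event of events) {
    try {
      const { task } = await createTask({
        targetUrl: 'https://your-app.com/api/handle-event',
        payload: {
          eventId: event.eventId,
          eventType: event.eventType,
          data: JSON.parse(event.payload), // rows store the data as a JSON string
          replay: true,                    // assumed flag so logs can tell replays apart
        },
        maxRetries: 3,
      });
      replayed.push({ eventId: event.eventId, taskId: task.id });
    } catch (error) {
      // A bad row or a queue error should not stop the rest of the batch
      failed.push({ eventId: event.eventId, error: error.message });
    }
  }

  return { replayed, failed };
}
```

The caller can then return both lists, so an operator sees which events still need attention instead of a generic 500 halfway through the batch.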

Best practices for replay:

Always replay through the same handler, not a special path
Log replayed events distinctly so you can trace their origin
Ensure your handler is idempotent so replays run safely
Test replay on a staging environment before running against production
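
Idempotency is what makes both retries and replays safe. A minimal sketch of the idea follows, using an in-memory `Set` as a stand-in for durable storage. The function name is hypothetical, and a real deployment would more likely rely on a unique constraint on the event ID in the database, which also covers concurrent deliveries that this in-memory version does not.

```javascript
// Wrap a processing function so each eventId is processed at most once.
// processedIds is an in-memory stand-in for a durable store (assumption).
function createIdempotentProcessor(processEvent, processedIds = new Set()) {
  return async function handle(event) {
    if (processedIds.has(event.eventId)) {
      return { status: 'skipped' }; // already handled, safe to acknowledge
    }
    await processEvent(event);
    // Record the ID only after success so a failed attempt can be retried
    processedIds.add(event.eventId);
    return { status: 'processed' };
  };
}

// Example: the same event delivered twice is processed once
async function demo() {
  let charges = 0;
  const handle = createIdempotentProcessor(async () => { charges += 1; });
  await handle({ eventId: 'evt_1' });
  await handle({ eventId: 'evt_1' }); // replay or duplicate delivery
  return charges;
}

demo().then((charges) => console.log(charges)); // 1
```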
