Your webhook endpoint will go down. Deployments, server crashes, database outages, and cloud provider incidents all cause downtime. When an outage hits during a critical event - like a payment confirmation or an order fulfillment trigger - the consequences grow severe fast.
This guide walks through what happens during downtime and how to build systems that survive disruptions.
Step 1: Understand the Failure Timeline
Here is what happens when your endpoint goes down while a payment processor tries to deliver a webhook:

    T+0s:    Customer pays $200. Payment processor queues webhook.
    T+1s:    Payment processor sends POST to your endpoint.
             Your server: connection refused (server is down).
    T+1s:    Delivery fails. Payment processor schedules retry.
    T+60s:   Retry #1. Your server: still down.
    T+300s:  Retry #2. Your server: still down.
    T+3600s: Retry #3. Your server: back up. 200 OK.

    Result: Event delivered 1 hour late. Customer waited 1 hour for order confirmation.

Best case: The sender retries enough times, and your server recovers before the retries run out.
Worst case:

    T+0s:     Customer pays. Webhook queued.
    T+1s:     Delivery fails.
    T+60s:    Retry #1 fails.
    T+300s:   Retry #2 fails.
    T+3600s:  Retry #3 fails.
    T+7200s:  Retry #4 fails.
    T+14400s: Retry #5 fails. Sender gives up.

    Result: Event lost for good. Customer paid but order never fulfilled. Support ticket incoming.

The gap between “temporary delay” and “permanent data loss” depends on your uptime relative to the sender’s retry window.
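To make the two timelines concrete, here is a minimal sketch that checks whether any delivery attempt lands outside an outage. The schedule is copied from the worst-case timeline above, and the function name `firstSuccessfulAttempt` is made up for this illustration. Real senders use their own schedules, so treat the numbers as placeholders.

```javascript
// Illustrative schedule taken from the timelines above: seconds after the
// event at which the sender attempts delivery (first attempt, then 5 retries).
const ATTEMPT_OFFSETS_S = [1, 60, 300, 3600, 7200, 14400];

/**
 * Returns the offset (in seconds) of the first attempt that lands while the
 * endpoint is up, or null if every attempt falls inside the outage.
 * The outage is the half-open interval [outageStartS, outageEndS).
 */
function firstSuccessfulAttempt(outageStartS, outageEndS, offsets = ATTEMPT_OFFSETS_S) {
  for (const offset of offsets) {
    const endpointIsDown = offset >= outageStartS && offset < outageEndS;
    if (!endpointIsDown) return offset;
  }
  // Retries exhausted: the event is lost unless you recover it yourself.
  return null;
}

console.log(firstSuccessfulAttempt(0, 3000));  // 3600 (delivered an hour late)
console.log(firstSuccessfulAttempt(0, 20000)); // null (event lost)
```

An outage that ends 50 minutes in is caught by the one-hour retry. An outage that outlasts the final attempt at 14400 seconds loses the event.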
Step 2: Know Your Sender’s Retry Behavior
Different services follow different retry policies:
| Service | Retry Attempts | Retry Window | Behavior After Exhaustion |
|---|---|---|---|
| AsyncQueue | Configurable (1-10) | Minutes to hours | Task marked as failed |
| Stripe | ~15 retries | 3 days | Event visible in dashboard, webhook disabled |
| GitHub | No automatic retries | None | Failed delivery visible in recent deliveries; redeliver manually or through the API |
| Shopify | 19 retries | 48 hours | Webhook registration removed |
| Twilio | Variable | Hours | Event logged but not retried further |
Key takeaway: Most services give you a recovery window measured in hours or days, and some give you far less. If your downtime exceeds that window, events vanish unless you have your own recovery mechanism.
Note: Shopify removes your webhook registration after too many failures. That means you stop receiving all future events, not only the ones that failed during the outage.
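Because a removed registration silently stops all future deliveries, it can help to verify the registration after every outage. The sketch below assumes a hypothetical `client` object that wraps the sender's admin API with two methods, `listWebhooks()` and `createWebhook()`. Neither is a real SDK call, so map them onto whatever your provider offers.

```javascript
/**
 * Re-creates a webhook registration if the sender removed it.
 * `client` is a hypothetical wrapper around the sender's admin API exposing
 *   listWebhooks()  -> Promise<Array<{ topic, address }>>
 *   createWebhook({ topic, address }) -> Promise<void>
 */
async function ensureWebhookRegistered(client, topic, address) {
  const existing = await client.listWebhooks();
  const found = existing.some((w) => w.topic === topic && w.address === address);
  if (found) return { created: false };

  await client.createWebhook({ topic, address });
  return { created: true };
}

// In-memory stand-in for the real API client, for demonstration only.
function createFakeClient(initial = []) {
  const webhooks = [...initial];
  return {
    listWebhooks: async () => [...webhooks],
    createWebhook: async (webhook) => {
      webhooks.push(webhook);
    },
  };
}
```

Running `ensureWebhookRegistered` as part of your recovery checklist, or on a schedule, turns a silent loss of all events into a self-healing step.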
Step 3: Use a Queue as a Buffer to Survive Downtime
The most effective strategy places a task queue between the webhook sender and your processing logic. The queue acts as a separate, always-available service that stores events durably.
Architecture without a buffer (fragile):

    Payment Processor -> Your Server -> Your Database
                         (single point of failure)

Architecture with a buffer (resilient):

    Payment Processor -> Your Webhook Endpoint -> Task Queue -> Your Processing Logic
                         (fast, minimal)          (durable)     (can be down temporarily)

The webhook endpoint does as little as possible:

    // Your webhook endpoint - minimal, always up
    app.post('/api/stripe-webhook', async (req, res) => {
      if (!verifyStripeSignature(req)) {
        return res.status(401).end();
      }

      try {
        // Store in the task queue - this is the durability layer
        await aq.tasks.create({
          targetUrl: 'https://your-app.com/api/process-stripe-event',
          payload: {
            eventId: req.body.id,
            type: req.body.type,
            data: req.body.data.object,
          },
          maxRetries: 10,
          retryBackoff: 'exponential',
        });
      } catch (err) {
        // Queueing failed: answer with a 5xx so the sender retries the delivery
        console.error('Failed to queue webhook event', err);
        return res.status(500).end();
      }

      res.json({ received: true });
    });

If your processing endpoint is down when the task runs, the queue retries. Your webhook receiver only needs to stay up long enough to accept the event and queue the work, not long enough to process the full payload.
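Retries from the sender and from the queue mean the processing endpoint can see the same event more than once, so it should be idempotent. The following is a minimal sketch of that idea. The `processOnce` helper and the in-memory `Set` are illustrative assumptions; in production the set would be a durable store, such as a table with a unique index on the event ID.

```javascript
// In-memory stand-in for a durable store (assumption for this sketch).
const processedEventIds = new Set();

/**
 * Runs `handler` at most once per successfully processed event ID.
 * Returns true if the handler ran, false if the event was a duplicate.
 */
async function processOnce(event, handler) {
  if (processedEventIds.has(event.eventId)) {
    return false; // duplicate delivery: acknowledge without redoing the work
  }

  await handler(event);

  // Record the ID only after the handler succeeds, so a failed attempt
  // stays eligible for the queue's next retry.
  processedEventIds.add(event.eventId);
  return true;
}
```

This version does not guard against two copies of the same event being processed concurrently. A database-level unique constraint is the usual way to close that gap.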
Step 4: Design for Zero-Downtime Deployments
Most webhook endpoint downtime comes from deployments, not crashes. A typical deploy leaves a 5-30 second gap while the old server shuts down and the new one starts.
Strategies for zero-downtime deployment:
Rolling deployments: Run multiple instances of your server. Update one at a time so at least one instance always accepts requests.
Blue-green deployments: Start the new version alongside the old one. Switch traffic only after the new version passes health checks.
Graceful shutdown: When your server receives a shutdown signal, stop accepting new connections but finish processing in-flight requests.

    // Graceful shutdown handler
    process.on('SIGTERM', () => {
      console.log('Shutting down gracefully...');

      // Stop accepting new connections
      server.close(() => {
        console.log('All connections closed');
        process.exit(0);
      });

      // Force shutdown after 30 seconds
      setTimeout(() => {
        console.log('Forcing shutdown');
        process.exit(1);
      }, 30000);
    });

Queue-based processing eliminates the deployment problem:
If your webhook endpoint only queues events, a few seconds of downtime during deployment costs you at most a handful of deliveries, which the sender retries. Your processing logic (the task target URL) can deploy independently because the queue holds events until your processor becomes available.
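Rolling and blue-green deployments both depend on a health check that tells the load balancer when an instance should stop receiving traffic. Here is one way that could look using Node's built-in `http` module. The `/healthz` path, the `state` object, and the function names are assumptions for this sketch, not part of any framework.

```javascript
const http = require('node:http');

// Shared flag: a SIGTERM handler sets this to true before calling server.close().
const state = { shuttingDown: false };

// Pure helper so the status logic is easy to test in isolation.
function readinessStatus(currentState) {
  return currentState.shuttingDown ? 503 : 200;
}

// Builds a server whose /healthz route reports 503 once shutdown has begun,
// which lets the load balancer drain traffic away from this instance.
function createHealthServer(currentState) {
  return http.createServer((req, res) => {
    if (req.url === '/healthz') {
      res.writeHead(readinessStatus(currentState));
      res.end();
      return;
    }
    res.writeHead(404);
    res.end();
  });
}
```

In a real service you would call `createHealthServer(state).listen(port)` and set `state.shuttingDown = true` as the first line of the SIGTERM handler, so health checks start failing before connections are closed.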
Step 5: Build Recovery Procedures for Extended Outages
For outages longer than your sender’s retry window, you need a way to recover missed events.
Strategy 1: Sender event log
Most webhook senders keep a log of dispatched events. After recovering from an outage, query their API to find missed events:

    // After recovering from downtime, backfill missed Stripe events
    async function backfillStripeEvents(downStart, downEnd) {
      // Stripe expects Unix timestamps in seconds; downStart/downEnd are in milliseconds
      const events = stripe.events.list({
        created: {
          gte: Math.floor(downStart / 1000),
          lte: Math.ceil(downEnd / 1000),
        },
        limit: 100,
      });

      // for await auto-paginates, so outages with more than 100 events are fully covered
      for await (const event of events) {
        const alreadyProcessed = await db.processedEvents.findOne({
          eventId: event.id,
        });

        if (!alreadyProcessed) {
          await aq.tasks.create({
            targetUrl: 'https://your-app.com/api/process-stripe-event',
            payload: { eventId: event.id, type: event.type, data: event.data.object },
            maxRetries: 3,
          });
          console.log(`Backfilled event: ${event.id}`);
        }
      }
    }

Strategy 2: Reconciliation jobs
Run periodic reconciliation that compares your data against the source of truth:

    // Daily reconciliation - find orders that were paid but not fulfilled
    async function reconcileOrders() {
      // Look back 24 hours (Unix seconds here; adjust to what your payment API expects)
      const yesterday = Math.floor((Date.now() - 24 * 60 * 60 * 1000) / 1000);

      const paidOrders = await paymentApi.list({
        status: 'succeeded',
        created: { gte: yesterday },
      });

      for (const payment of paidOrders) {
        const order = await db.orders.findOne({ paymentId: payment.id });
        if (order && order.status === 'pending') {
          // This order was paid but our webhook handler never processed the confirmation
          await aq.tasks.create({
            targetUrl: 'https://your-app.com/api/fulfill-order',
            payload: { orderId: order.id, paymentId: payment.id },
          });
          await alertTeam(`Reconciliation: fulfilling order ${order.id}`);
        }
      }
    }

Strategy 3: Health check and auto-recovery
Monitor your webhook endpoint health and trigger backfill on recovery from downtime:

    // Track uptime windows
    let lastHealthy = Date.now();

    setInterval(async () => {
      const healthy = await checkEndpointHealth();
      if (healthy) {
        const downtime = Date.now() - lastHealthy;
        if (downtime > 60000) {
          // Was down for more than 1 minute - trigger backfill
          console.log(`Recovered after ${downtime}ms downtime, backfilling...`);
          await backfillStripeEvents(lastHealthy, Date.now());
        }
        lastHealthy = Date.now();
      }
    }, 30000);
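The window-tracking logic is easier to reason about, and to test, when it is separated from the timer. The sketch below pulls it into a small helper that takes timestamps as arguments. `createDowntimeTracker` is a name invented for this example, and the 60-second threshold mirrors the snippet above.

```javascript
/**
 * Tracks health-check results and reports the window to backfill when the
 * endpoint recovers. All timestamps are milliseconds since the Unix epoch.
 */
function createDowntimeTracker(startMs, minDowntimeMs = 60_000) {
  let lastHealthyMs = startMs;

  return {
    // Call once per health check. Returns { from, to } when a backfill is
    // due, otherwise null.
    record(healthy, nowMs) {
      if (!healthy) return null;

      const downtimeMs = nowMs - lastHealthyMs;
      const backfillWindow =
        downtimeMs > minDowntimeMs ? { from: lastHealthyMs, to: nowMs } : null;

      lastHealthyMs = nowMs;
      return backfillWindow;
    },
  };
}

const tracker = createDowntimeTracker(0);
console.log(tracker.record(true, 30_000));   // null (no outage)
console.log(tracker.record(false, 60_000));  // null (still down)
console.log(tracker.record(false, 90_000));  // null (still down)
console.log(tracker.record(true, 120_000));  // { from: 30000, to: 120000 }
```

The returned window starts at the last successful check rather than at the first failed one, so the backfill errs on the side of fetching slightly too many events. Checking whether an event was already processed before queueing it makes that overlap harmless.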