Your webhook endpoint will go down. Deployments, server crashes, database outages, and cloud provider incidents all cause downtime. When an outage hits during a critical event, such as a payment confirmation or an order fulfillment trigger, the consequences escalate quickly.

This guide walks through what happens during downtime and how to build systems that survive disruptions.

Step 1: Understand the Failure Timeline

Here is what happens when your endpoint goes down while a payment processor tries to deliver a webhook:

T+0s:     Customer pays $200. Payment processor queues webhook.
T+1s:     Payment processor sends POST to your endpoint.
          Your server: connection refused (server is down).
T+1s:     Delivery fails. Payment processor schedules retry.
T+60s:    Retry #1. Your server: still down.
T+300s:   Retry #2. Your server: still down.
T+3600s:  Retry #3. Your server: back up. 200 OK.
Result:   Event delivered 1 hour late. Customer waited 1 hour
          for order confirmation.

That is the best case: the sender keeps retrying, and your server recovers before the retries are exhausted.

Worst case:

T+0s:      Customer pays. Webhook queued.
T+1s:      Delivery fails.
T+60s:     Retry #1 fails.
T+300s:    Retry #2 fails.
T+3600s:   Retry #3 fails.
T+7200s:   Retry #4 fails.
T+14400s:  Retry #5 fails. Sender gives up.
Result:    Event lost for good. Customer paid but order
           never fulfilled. Support ticket incoming.

Whether an outage causes a temporary delay or permanent data loss depends on how long you stay down relative to the sender’s retry window.
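
To make that relationship concrete, here is a minimal sketch that replays a retry schedule against an outage. The function name `simulateDelivery` and the schedule array are illustrative assumptions that mirror the two timelines above; they do not describe any particular sender's real policy.

```javascript
// Hypothetical helper: replay a sender's delivery attempts against an outage.
// attemptTimes: seconds after the event at which the sender tries to deliver
//               (first attempt plus retries), in ascending order.
// outageEnd:    seconds after the event at which your endpoint is back up.
function simulateDelivery(attemptTimes, outageEnd) {
  for (const t of attemptTimes) {
    if (t >= outageEnd) {
      // The first attempt that lands after recovery succeeds.
      return { delivered: true, delaySeconds: t };
    }
  }
  // Every attempt hit the outage, so the sender gives up.
  return { delivered: false, delaySeconds: null };
}

// Schedule taken from the timelines above: first attempt at T+1s, then retries.
const schedule = [1, 60, 300, 3600, 7200, 14400];

console.log(simulateDelivery(schedule, 1800));  // endpoint recovers after 30 minutes
console.log(simulateDelivery(schedule, 20000)); // endpoint still down at the last retry
```

With this schedule, a 30-minute outage still delays the event by a full hour, because the next retry after recovery is the one at T+3600s. An outage that outlasts the final retry loses the event.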

Step 2: Know Your Sender’s Retry Behavior

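Different services follow different retry policies. The figures below are approximate and change over time, so confirm them against each sender's current documentation: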

| Service    | Retry Attempts      | Retry Window     | Behavior After Exhaustion                    |
|------------|---------------------|------------------|----------------------------------------------|
| AsyncQueue | Configurable (1-10) | Minutes to hours | Task marked as failed                        |
| Stripe     | ~15 retries         | 3 days           | Event visible in dashboard, webhook disabled |
| GitHub     | 3 retries           | ~30 minutes      | Event visible in recent deliveries           |
| Shopify    | 19 retries          | 48 hours         | Webhook registration removed                 |
| Twilio     | Variable            | Hours            | Event logged but not retried further         |

Key takeaway: The services above give you anywhere from about 30 minutes to 3 days to recover. If your downtime exceeds the sender's window, events vanish unless you have your own recovery mechanism.

Note: Shopify removes your webhook registration after too many failures. That means you stop receiving all future events, not only the ones that failed during the outage.

Step 3: Use a Queue as a Buffer to Survive Downtime

The most effective strategy places a task queue between your webhook endpoint and your processing logic. The queue runs as a separate, highly available service that stores events durably.

Architecture without a buffer (fragile):

Payment Processor -> Your Server -> Your Database
                     (single point of failure)

Architecture with a buffer (resilient):

Payment Processor -> Your Webhook Endpoint -> Task Queue -> Your Processing Logic
                     (fast, minimal)          (durable)     (can be down temporarily)
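
// Your webhook endpoint - minimal, always up.
// Assumes `app` is an Express app, `aq` is the task queue client, and
// verifyStripeSignature() is your own helper. Stripe signs the raw request
// body, so that helper needs access to the unparsed payload.
app.post('/api/stripe-webhook', async (req, res) => {
  if (!verifyStripeSignature(req)) {
    return res.status(401).end();
  }
  try {
    // Store in the task queue - this is the durability layer
    await aq.tasks.create({
      targetUrl: 'https://your-app.com/api/process-stripe-event',
      payload: {
        eventId: req.body.id,
        type: req.body.type,
        data: req.body.data.object,
      },
      maxRetries: 10,
      retryBackoff: 'exponential',
    });
  } catch (err) {
    // Queueing failed: answer with a 5xx so the sender retries the delivery
    console.error('Failed to queue webhook event', err);
    return res.status(500).end();
  }
  // Acknowledge only after the event is safely queued
  res.json({ received: true });
});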

If your processing endpoint is down when the task runs, the queue retries. Your webhook receiver only needs to stay up long enough to accept the event and queue the work, not long enough to process the full payload.
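
Because both the sender and the queue retry, the processing endpoint can receive the same event more than once. One common way to handle this is to make processing idempotent, keyed on the event ID. The sketch below is a minimal illustration under stated assumptions: `createIdempotentProcessor` is a hypothetical name, and the in-memory `Set` stands in for a durable store.

```javascript
// Hypothetical sketch: wrap a handler so each eventId is processed at most once.
// The in-memory Set stands in for a durable store (for example a database table
// with a unique index on eventId); it would not survive a restart and does not
// guard against two concurrent deliveries of the same event.
function createIdempotentProcessor(handler) {
  const processed = new Set();

  return async function processEvent(event) {
    if (processed.has(event.eventId)) {
      // Already handled: acknowledge without repeating side effects.
      return { status: 'duplicate' };
    }
    await handler(event);
    // Mark as processed only after success, so a failed attempt is retried.
    processed.add(event.eventId);
    return { status: 'processed' };
  };
}

// Example usage with a placeholder handler.
const processEvent = createIdempotentProcessor(async (event) => {
  console.log(`Handling ${event.type} for ${event.eventId}`);
});

processEvent({ eventId: 'evt_123', type: 'payment.succeeded' })
  .then(() => processEvent({ eventId: 'evt_123', type: 'payment.succeeded' }))
  .then((result) => console.log(result.status)); // second delivery is a duplicate
```

Marking the event only after the handler succeeds means a crash mid-processing leads to a retry rather than a silently dropped event, at the cost of requiring the handler itself to tolerate a partial earlier run.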

Step 4: Design for Zero-Downtime Deployments

Much webhook endpoint downtime comes from deployments rather than crashes. A typical deploy leaves a gap of 5-30 seconds while the old server shuts down and the new one starts.

Strategies for zero-downtime deployment:

Rolling deployments: Run multiple instances of your server. Update one at a time so at least one instance always accepts requests.

Blue-green deployments: Start the new version alongside the old one. Switch traffic only after the new version passes health checks.

Graceful shutdown: When your server receives a shutdown signal, stop accepting new connections but finish processing in-flight requests.

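// Graceful shutdown handler
// Assumes `server` is the http.Server returned by app.listen()
process.on('SIGTERM', () => {
  console.log('Shutting down gracefully...');
  // Stop accepting new connections; the callback runs once open connections finish
  server.close(() => {
    console.log('All connections closed');
    process.exit(0);
  });
  // Force shutdown after 30 seconds
  setTimeout(() => {
    console.log('Forcing shutdown');
    process.exit(1);
  }, 30000).unref(); // unref() so this timer alone does not keep the process alive
});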
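
Rolling and blue-green deployments both depend on a health check that tells the load balancer which instances may receive traffic. A minimal sketch of such a readiness flag follows. The names `createReadiness` and `markShuttingDown` and the `/healthz` path are assumptions for illustration, and the status codes your load balancer expects may differ.

```javascript
// Hypothetical sketch: a readiness flag that a load balancer can poll.
function createReadiness() {
  let ready = true;
  return {
    // Call this at the start of the SIGTERM handler, before server.close(),
    // so the load balancer stops routing new webhooks to this instance.
    markShuttingDown() {
      ready = false;
    },
    // Works as a plain Node http handler or as an Express route handler.
    handler(req, res) {
      res.statusCode = ready ? 200 : 503;
      res.setHeader('Content-Type', 'application/json');
      res.end(JSON.stringify({ ready }));
    },
  };
}

// Example wiring (assumes an Express `app`, as in the endpoint code earlier):
//   const readiness = createReadiness();
//   app.get('/healthz', readiness.handler);
//   process.on('SIGTERM', () => readiness.markShuttingDown());
```

Failing the health check first and closing the server a moment later gives the load balancer time to drain traffic, so fewer deliveries hit a closing instance.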

Queue-based processing shrinks the deployment problem:

If your webhook endpoint only queues events, a few seconds of downtime during deployment means at most a handful of events get retried by the sender. Your processing logic (the task target URL) can deploy independently because the queue holds events until your processor becomes available.

Step 5: Build Recovery Procedures for Extended Outages

For outages longer than your sender’s retry window, you need a way to recover missed events.

Strategy 1: Sender event log

Most webhook senders keep a log of dispatched events. After recovering from an outage, query their API to find missed events:

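// After recovering from downtime, backfill missed Stripe events.
// downStart and downEnd are millisecond timestamps (as from Date.now());
// Stripe's `created` filter expects Unix seconds.
// Note: Stripe's events API only goes back 30 days.
async function backfillStripeEvents(downStart, downEnd) {
  const events = stripe.events.list({
    created: { gte: Math.floor(downStart / 1000), lte: Math.floor(downEnd / 1000) },
    limit: 100,
  });
  // `for await` uses the Stripe Node library's auto-pagination, so outages
  // that produced more than 100 events are covered in full
  for await (const event of events) {
    // Skip events the normal webhook path already handled
    const alreadyProcessed = await db.processedEvents.findOne({
      eventId: event.id,
    });
    if (!alreadyProcessed) {
      await aq.tasks.create({
        targetUrl: 'https://your-app.com/api/process-stripe-event',
        payload: { eventId: event.id, type: event.type, data: event.data.object },
        maxRetries: 3,
      });
      console.log(`Backfilled event: ${event.id}`);
    }
  }
}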

Strategy 2: Reconciliation jobs

Run periodic reconciliation that compares your data against the source of truth:

// Daily reconciliation - find orders that were paid but not fulfilled.
// paymentApi, db, aq, and alertTeam stand in for your own clients and helpers.
async function reconcileOrders() {
  // Start of the lookback window: 24 hours ago, in Unix seconds
  // (adjust the unit to whatever your payment API expects)
  const yesterday = Math.floor((Date.now() - 24 * 60 * 60 * 1000) / 1000);
  const paidOrders = await paymentApi.list({
    status: 'succeeded',
    created: { gte: yesterday },
  });
  for (const payment of paidOrders) {
    const order = await db.orders.findOne({ paymentId: payment.id });
    if (order && order.status === 'pending') {
      // This order was paid but our webhook handler never processed the confirmation
      await aq.tasks.create({
        targetUrl: 'https://your-app.com/api/fulfill-order',
        payload: { orderId: order.id, paymentId: payment.id },
      });
      await alertTeam(`Reconciliation: fulfilling order ${order.id}`);
    }
  }
}
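
The loop above skips any payment that has no matching order at all, which is a case worth surfacing. One option is to separate the decision from the side effects so it can be tested on its own. The sketch below assumes the object shapes shown in the comments; `planReconciliation` is a hypothetical name.

```javascript
// Hypothetical sketch: decide what reconciliation should do for each payment.
// payments: [{ id }]                      succeeded payments from the provider
// orders:   [{ id, paymentId, status }]   rows from your own database
function planReconciliation(payments, orders) {
  const orderByPayment = new Map(orders.map((o) => [o.paymentId, o]));
  const toFulfill = [];
  const missingOrders = [];

  for (const payment of payments) {
    const order = orderByPayment.get(payment.id);
    if (!order) {
      // Paid, but no order on record: flag for manual review.
      missingOrders.push(payment.id);
    } else if (order.status === 'pending') {
      // Paid, but the confirmation webhook was never processed.
      toFulfill.push(order.id);
    }
  }
  return { toFulfill, missingOrders };
}

const plan = planReconciliation(
  [{ id: 'pay_1' }, { id: 'pay_2' }, { id: 'pay_3' }],
  [
    { id: 'ord_1', paymentId: 'pay_1', status: 'fulfilled' },
    { id: 'ord_2', paymentId: 'pay_2', status: 'pending' },
  ]
);
console.log(plan); // { toFulfill: [ 'ord_2' ], missingOrders: [ 'pay_3' ] }
```

The caller can then queue a fulfillment task for each entry in `toFulfill` and alert the team about `missingOrders`.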

Strategy 3: Health check and auto-recovery

Monitor your webhook endpoint health and trigger backfill on recovery from downtime:

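// Track uptime windows.
// Run this monitor outside the webhook server process so it survives the outage.
// checkEndpointHealth() is your own probe; backfillStripeEvents() is defined above.
let lastHealthy = Date.now();
setInterval(async () => {
  try {
    const healthy = await checkEndpointHealth();
    if (!healthy) return;
    const now = Date.now();
    const downStart = lastHealthy;
    // Advance the marker before backfilling so an overlapping tick does not repeat the work
    lastHealthy = now;
    const downtime = now - downStart;
    if (downtime > 60000) {
      // Was down for more than 1 minute - trigger backfill
      console.log(`Recovered after ${downtime}ms downtime, backfilling...`);
      await backfillStripeEvents(downStart, now);
    }
  } catch (err) {
    // Log instead of crashing the monitor; rerun the backfill manually if needed
    console.error('Health check or backfill failed', err);
  }
}, 30000);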
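
The window-tracking logic is easier to verify when it takes timestamps as arguments instead of reading the clock. A minimal sketch under that assumption follows; `createDowntimeTracker` is a hypothetical name, and the threshold matches the 1-minute cutoff used above.

```javascript
// Hypothetical sketch: record health check results and report outage windows.
// thresholdMs: gaps between healthy checks longer than this count as an outage.
// startTime:   timestamp (ms) at which the endpoint is first assumed healthy.
function createDowntimeTracker(thresholdMs, startTime) {
  let lastHealthy = startTime;
  return {
    // Returns { start, end } when a healthy check follows a gap longer than
    // thresholdMs, otherwise null. Unhealthy checks never report a window.
    record(healthy, now) {
      if (!healthy) return null;
      const gap = now - lastHealthy;
      const outage = gap > thresholdMs ? { start: lastHealthy, end: now } : null;
      lastHealthy = now;
      return outage;
    },
  };
}

const tracker = createDowntimeTracker(60000, 0);
console.log(tracker.record(true, 30000));   // null
console.log(tracker.record(false, 60000));  // null
console.log(tracker.record(false, 90000));  // null
console.log(tracker.record(true, 120000));  // { start: 30000, end: 120000 }
```

The returned window can be passed straight to a backfill function, and injecting the timestamps lets the recovery logic run in tests without waiting on real timers.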