
Managing State in API Workflows

Multi-step API workflows produce state at every step: intermediate results, progress markers, error context, and data that downstream steps require. Losing this state destroys your ability to resume, debug, or compensate a failed workflow.

This guide covers practical patterns for tracking workflow state across async steps. The goal is straightforward: at any point, you should be able to answer “what happened, what is happening, and what happens next” for every active workflow.

Prerequisites: This guide builds on How API Orchestration Works and How to Chain Multiple API Calls. Read those first if you are new to async workflows.

Step 1: Design Your Workflow State Schema

A workflow state record needs six components:

const workflowState = {
  // Identity
  id: 'wf-abc-123',
  type: 'order-processing',

  // Status
  status: 'running', // pending | running | completed | failed | cancelled

  // Step tracking (step status: pending | dispatched | running | completed)
  steps: {
    validate: { status: 'completed', startedAt: '...', completedAt: '...' },
    charge: { status: 'completed', startedAt: '...', completedAt: '...' },
    ship: { status: 'running', startedAt: '...' },
    notify: { status: 'pending' },
  },

  // Results from completed steps
  results: {
    validate: { valid: true, riskScore: 12 },
    charge: { transactionId: 'tx-456', amount: 99.00 },
  },

  // Original input (needed for retries and compensation)
  context: {
    userId: 'user-789',
    items: [{ sku: 'WIDGET-1', qty: 2 }],
    address: { street: '123 Main St', city: 'Austin' },
  },

  // Metadata
  createdAt: '2026-03-19T10:00:00Z',
  updatedAt: '2026-03-19T10:00:12Z',
  error: null,
};

Key design decisions:

  • Store results per step, not in a flat object. This makes it clear which step produced which data.
  • Keep the original context. You need it for retries, compensation, and debugging.
  • Track timing per step. This reveals bottlenecks without external monitoring tools.
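
Per-step durations fall straight out of those timestamps. A minimal sketch against the schema above (the function name and the example output are illustrative):

// Compute each completed step's duration in milliseconds from its stored timestamps
function stepDurations(workflow) {
  return Object.fromEntries(
    Object.entries(workflow.steps)
      .filter(([, s]) => s.startedAt && s.completedAt)
      .map(([name, s]) => [name, new Date(s.completedAt) - new Date(s.startedAt)])
  );
}

// e.g. stepDurations(workflowState) -> { validate: 2100, charge: 4800 }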

Choosing a storage backend

Backend            Good for                                   Trade-off
Redis              High throughput, short-lived workflows     Data loss risk without persistence config
PostgreSQL/MySQL   Durability, complex queries, reporting     Higher latency per write
MongoDB            Flexible schema, moderate throughput       Less strict consistency
DynamoDB           Serverless, auto-scaling                   Cost at high write volume

For most teams, a relational database with a JSON column for results and context offers the simplest starting point. Reserve Redis for high-volume workflows that complete within a few minutes.
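
As a concrete sketch of that starting point, a single Postgres table with JSONB columns covers the whole schema from Step 1. This is a hypothetical layout using the node-postgres (pg) client; the table and column names, and the DATABASE_URL environment variable, are illustrative:

// Minimal sketch: one row per workflow, JSONB for the flexible parts
import pg from 'pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

await pool.query(`
  CREATE TABLE IF NOT EXISTS workflows (
    id         TEXT PRIMARY KEY,
    type       TEXT NOT NULL,
    status     TEXT NOT NULL,  -- pending | running | completed | failed | cancelled
    steps      JSONB NOT NULL DEFAULT '{}',
    results    JSONB NOT NULL DEFAULT '{}',
    context    JSONB NOT NULL DEFAULT '{}',
    error      JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );
`);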

Step 2: Persist State at Every Transition

The critical rule: write state before dispatching the next task. If you dispatch first and crash before writing, you lose all record of what was dispatched.

app.post('/api/orchestrator', async (req, res) => {
  const { workflowId, step: completedStep } = req.body.payload;
  const result = req.body.result;

  // 1. Write the completed step's result FIRST
  await db.workflows.update(workflowId, {
    [`steps.${completedStep}.status`]: 'completed',
    [`steps.${completedStep}.completedAt`]: new Date().toISOString(),
    [`results.${completedStep}`]: result,
    updatedAt: new Date().toISOString(),
  });

  // 2. Determine the next step
  const nextStep = getNextStep(completedStep);

  if (nextStep) {
    // 3. Mark the next step as dispatched before sending it
    await db.workflows.update(workflowId, {
      [`steps.${nextStep}.status`]: 'dispatched',
      [`steps.${nextStep}.startedAt`]: new Date().toISOString(),
    });

    // 4. Now dispatch
    await aq.tasks.create({
      targetUrl: STEP_CONFIG[nextStep].targetUrl,
      payload: { workflowId, step: nextStep, ...result },
      webhookUrl: 'https://your-app.com/api/orchestrator',
      retries: STEP_CONFIG[nextStep].retries,
      timeout: STEP_CONFIG[nextStep].timeout,
    });
  } else {
    await db.workflows.update(workflowId, {
      status: 'completed',
      updatedAt: new Date().toISOString(),
    });
  }

  res.status(200).json({ received: true });
});

If the process crashes after step 3 but before step 4, the recovery sweep (Step 4 below) detects any step stuck in dispatched status and re-dispatches the stalled task.
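
The handler above leans on a getNextStep helper and a STEP_CONFIG map that are assumed rather than shown. A minimal sketch for a linear four-step workflow, with a dispatchStep helper in the same shape (used again later in this guide); the URLs and retry/timeout values are illustrative:

// Assumed helpers for a linear workflow: validate -> charge -> ship -> notify
const STEP_ORDER = ['validate', 'charge', 'ship', 'notify'];

const STEP_CONFIG = {
  validate: { targetUrl: 'https://your-app.com/api/steps/validate', retries: 3, timeout: 30 },
  charge:   { targetUrl: 'https://your-app.com/api/steps/charge',   retries: 3, timeout: 60 },
  ship:     { targetUrl: 'https://your-app.com/api/steps/ship',     retries: 5, timeout: 120 },
  notify:   { targetUrl: 'https://your-app.com/api/steps/notify',   retries: 2, timeout: 30 },
};

// Returns the step after completedStep, or null at the end of the workflow
function getNextStep(completedStep) {
  const i = STEP_ORDER.indexOf(completedStep);
  return i >= 0 && i < STEP_ORDER.length - 1 ? STEP_ORDER[i + 1] : null;
}

// Marks a step dispatched, then enqueues it - the same write-before-dispatch
// pattern as steps 3-4 in the handler above
async function dispatchStep(workflowId, step, payload) {
  await db.workflows.update(workflowId, {
    [`steps.${step}.status`]: 'dispatched',
    [`steps.${step}.startedAt`]: new Date().toISOString(),
  });
  await aq.tasks.create({
    targetUrl: STEP_CONFIG[step].targetUrl,
    payload: { workflowId, step, ...payload },
    webhookUrl: 'https://your-app.com/api/orchestrator',
    retries: STEP_CONFIG[step].retries,
    timeout: STEP_CONFIG[step].timeout,
  });
}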

Step 3: Handle Concurrent State Updates

When parallel steps finish at the same time, two webhook handlers may try to update the same workflow record. Without protection, one update can overwrite the other.

Option A: Atomic field updates

If your database supports partial updates (MongoDB, Redis, DynamoDB), modify only the specific fields each step touches:

// Step "charge" completes
await db.workflows.update(workflowId, {
'steps.charge.status': 'completed',
'steps.charge.completedAt': new Date().toISOString(),
'results.charge': { transactionId: 'tx-456' },
});
// Step "reserve" completes at the same time - no conflict
await db.workflows.update(workflowId, {
'steps.reserve.status': 'completed',
'steps.reserve.completedAt': new Date().toISOString(),
'results.reserve': { reservationId: 'res-789' },
});

Each update targets different fields, so they never conflict.

Option B: Optimistic locking

For relational databases, add a version column and reject outdated writes:

async function updateWorkflow(workflowId, updates, attempt = 0) {
  const workflow = await db.workflows.findById(workflowId);

  // The write only succeeds if the version has not changed since the read
  const result = await db.workflows.updateOne(
    { id: workflowId, version: workflow.version },
    { ...updates, version: workflow.version + 1 }
  );

  if (result.modifiedCount === 0) {
    // Another process updated first - reload and retry, bounded so heavy
    // contention cannot spin forever
    if (attempt >= 5) {
      throw new Error(`Persistent version conflict on workflow ${workflowId}`);
    }
    return updateWorkflow(workflowId, updates, attempt + 1);
  }
}

Fan-in synchronization

When parallel branches must converge before the next step, verify completion atomically:

// After updating a parallel step, check if all parallel steps are done
const workflow = await db.workflows.findById(workflowId);
const parallelSteps = ['charge', 'reserve', 'fraud-check'];
const allComplete = parallelSteps.every(
s => workflow.steps[s]?.status === 'completed'
);
if (allComplete) {
await dispatchStep(workflowId, 'fulfill', {
...workflow.results.charge,
...workflow.results.reserve,
});
}

To prevent duplicate dispatches when concurrent handlers both see “all complete”, use an atomic compare-and-set:

const result = await db.workflows.updateOne(
  { id: workflowId, 'steps.fulfill.status': 'pending' },
  { 'steps.fulfill.status': 'dispatched' }
);

// Only the handler that wins the update dispatches the next step
if (result.modifiedCount === 1) {
  await dispatchStep(workflowId, 'fulfill', payload);
}

Step 4: Build a Recovery Mechanism

Tasks can be dispatched but never report back because of network failures, process crashes, or bugs. A sweep process detects and recovers stalled workflows:

async function recoverStalledWorkflows() {
  const staleThreshold = new Date(Date.now() - 10 * 60 * 1000); // 10 minutes

  const stalled = await db.workflows.find({
    status: 'running',
    updatedAt: { $lt: staleThreshold },
  });

  for (const workflow of stalled) {
    // Find the step that is stuck
    const stuckStep = Object.entries(workflow.steps).find(
      ([, step]) => step.status === 'dispatched' || step.status === 'running'
    );
    if (!stuckStep) continue;

    const [stepName] = stuckStep;
    console.log(`Recovering workflow ${workflow.id}, re-dispatching step: ${stepName}`);

    // Build payload from stored context and previous results
    const payload = {
      workflowId: workflow.id,
      step: stepName,
      ...workflow.context,
      previousResults: workflow.results,
    };

    await aq.tasks.create({
      targetUrl: STEP_CONFIG[stepName].targetUrl,
      payload,
      webhookUrl: 'https://your-app.com/api/orchestrator',
      retries: STEP_CONFIG[stepName].retries,
      timeout: STEP_CONFIG[stepName].timeout,
    });

    await db.workflows.update(workflow.id, {
      updatedAt: new Date().toISOString(),
    });
  }
}

// Run every 5 minutes
setInterval(recoverStalledWorkflows, 5 * 60 * 1000);

Recovery only works because you stored context and results in Step 1. Without those fields, you cannot rebuild the payload for a re-dispatched step.

Step 5: Expire and Archive Old Workflows

Completed workflows accumulate over time. Without cleanup, your state store grows without bound, degrading query speed and raising storage costs.

async function archiveCompletedWorkflows() {
  const archiveThreshold = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000); // 7 days

  const old = await db.workflows.find({
    status: { $in: ['completed', 'failed', 'cancelled'] },
    updatedAt: { $lt: archiveThreshold },
  });

  if (old.length === 0) return;

  // Move to cold storage
  await db.workflowArchive.insertMany(old);

  // Remove from active store
  const ids = old.map(w => w.id);
  await db.workflows.deleteMany({ id: { $in: ids } });

  console.log(`Archived ${old.length} workflows`);
}

Guidelines:

  • Keep active workflows in a fast store (Redis, primary database table)
  • Archive to cold storage (S3, separate table, data warehouse) after 7-30 days
  • Retain failed workflows longer for debugging - 30-90 days is reasonable
  • Index by status and updatedAt so cleanup queries are efficient
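
For the last point, a compound index on those two fields keeps both the recovery sweep and the archive query off collection scans. A sketch using the official MongoDB Node.js driver (the database name, collection name, and MONGO_URL environment variable are illustrative; a relational store would use an equivalent two-column index):

// Compound index so queries filtering on status + updatedAt stay fast
import { MongoClient } from 'mongodb';

const client = new MongoClient(process.env.MONGO_URL);
await client.connect();
await client.db('app').collection('workflows').createIndex({ status: 1, updatedAt: 1 });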

Checklist

Before shipping a stateful workflow to production, confirm the following:

  • State is written before every task dispatch
  • Parallel step completions do not overwrite each other
  • A recovery sweep detects and re-dispatches stuck steps
  • The original context is stored so steps can be retried with correct input
  • Completed workflows are archived on a schedule
  • You can query workflow status by ID for debugging and user-facing progress
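
For the last item, a minimal read endpoint is usually enough. A sketch assuming the same Express app and db layer used throughout this guide:

// Returns workflow status plus a per-step summary for debugging or UI progress
app.get('/api/workflows/:id', async (req, res) => {
  const workflow = await db.workflows.findById(req.params.id);
  if (!workflow) return res.status(404).json({ error: 'not found' });

  res.json({
    id: workflow.id,
    status: workflow.status,
    steps: Object.fromEntries(
      Object.entries(workflow.steps).map(([name, s]) => [name, s.status])
    ),
    error: workflow.error,
    updatedAt: workflow.updatedAt,
  });
});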