
Saga Pattern

Distributed transactions without two-phase commit

In a nutshell

When a business operation spans multiple services -- like placing an order that involves creating the order, reserving inventory, and charging payment -- you can't wrap it all in a single database transaction. A saga breaks the operation into a chain of smaller steps, each in its own service, with a defined "undo" action for every step. If step three fails, the undo actions run in reverse to clean up steps one and two.

The situation

A customer places an order on your e-commerce platform. Three things need to happen:

Order Service creates the order
Inventory Service reserves the items
Payment Service charges the customer's card

In a monolith, you'd wrap all three in a database transaction. If payment fails, everything rolls back. Clean, atomic, simple.

But you have microservices. Each service has its own database. There is no shared transaction. If payment fails after inventory is reserved, you're stuck with ghost reservations. If inventory fails after the order is created, you have an order with no items.

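Two-phase commit is a poor fit here. Every participant must support the protocol, each one holds locks while it waits for the coordinator's decision, and a coordinator failure can leave participants blocked. Those costs grow with every service you add. You need a saga.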

What a saga is

A saga is a sequence of local transactions, each executed by a different service. Each step has a compensating action — a way to undo its effect if a later step fails. Instead of one atomic transaction, you get a chain of smaller ones with explicit rollback logic.

Each compensation undoes exactly one step. They run in reverse order. The system ends up in a consistent state — not the committed state, but a cleanly rolled-back one.
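
The reverse-order rule can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production executor: `SagaStep` and `run_saga` are hypothetical names, and the sketch assumes a failed step leaves no partial effect, so only earlier steps need compensating.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # the local transaction
    compensate: Callable[[], None]  # semantically negates the action


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Assumption: the failed step changed nothing, so skip its compensation.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def decline() -> None:
    raise RuntimeError("card_declined")


ok = run_saga([
    SagaStep("create_order", lambda: log.append("order created"),
             lambda: log.append("order cancelled")),
    SagaStep("reserve_inventory", lambda: log.append("inventory reserved"),
             lambda: log.append("inventory released")),
    SagaStep("charge_payment", decline, lambda: log.append("payment refunded")),
])
```

After this run, `ok` is `False` and `log` shows the inventory release happening before the order cancellation, mirroring the order used in the failure walkthrough.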

The happy path

Step 1: Create order

curl -X POST https://orders.internal/api/orders \
  -H "Content-Type: application/json" \
  -H "X-Saga-Id: saga_q8r3t5" \
  -d '{
    "customer_id": "cust_8a3f",
    "items": [
      { "sku": "WIDGET-42", "quantity": 2, "price": 24.99 },
      { "sku": "GADGET-17", "quantity": 1, "price": 89.00 }
    ]
  }'

HTTP/1.1 201 Created

{
  "order_id": "ord_x7k9",
  "saga_id": "saga_q8r3t5",
  "status": "pending",
  "total": 138.98,
  "items": [
    { "sku": "WIDGET-42", "quantity": 2, "price": 24.99 },
    { "sku": "GADGET-17", "quantity": 1, "price": 89.00 }
  ],
  "created_at": "2026-04-13T14:32:00Z"
}

Step 2: Reserve inventory

curl -X POST https://inventory.internal/api/reservations \
  -H "Content-Type: application/json" \
  -H "X-Saga-Id: saga_q8r3t5" \
  -d '{
    "order_id": "ord_x7k9",
    "items": [
      { "sku": "WIDGET-42", "quantity": 2 },
      { "sku": "GADGET-17", "quantity": 1 }
    ]
  }'

HTTP/1.1 201 Created

{
  "reservation_id": "res_m4n7p2",
  "saga_id": "saga_q8r3t5",
  "order_id": "ord_x7k9",
  "status": "reserved",
  "items": [
    { "sku": "WIDGET-42", "quantity": 2, "available_before": 150 },
    { "sku": "GADGET-17", "quantity": 1, "available_before": 43 }
  ],
  "expires_at": "2026-04-13T14:47:00Z"
}

Notice the expires_at field. Reservations are time-limited. If the saga doesn't complete within 15 minutes, the reservation auto-releases. This is a safety net against sagas that get stuck.
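
A sketch of how a background job in the Inventory Service might decide that a reservation has lapsed. The function names and the idea of a periodic sweep are assumptions; only the `expires_at` field and the 15-minute window come from the example above.

```python
from datetime import datetime, timedelta, timezone

RESERVATION_TTL = timedelta(minutes=15)  # the window used in the example response


def expires_at(created_at: datetime) -> datetime:
    """Deadline after which an unconfirmed reservation is released."""
    return created_at + RESERVATION_TTL


def is_expired(reservation: dict, now: datetime) -> bool:
    """True when the reservation's expires_at has passed."""
    # Swap the trailing "Z" for an explicit offset so older Pythons can parse it.
    deadline = datetime.fromisoformat(reservation["expires_at"].replace("Z", "+00:00"))
    return now >= deadline


reservation = {"reservation_id": "res_m4n7p2", "expires_at": "2026-04-13T14:47:00Z"}
now = datetime(2026, 4, 13, 14, 50, tzinfo=timezone.utc)
expired = is_expired(reservation, now)  # True: 14:50 is past the 14:47 deadline
```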

Step 3: Charge payment

curl -X POST https://payments.internal/api/charges \
  -H "Content-Type: application/json" \
  -H "X-Saga-Id: saga_q8r3t5" \
  -d '{
    "order_id": "ord_x7k9",
    "customer_id": "cust_8a3f",
    "amount": 138.98,
    "currency": "usd",
    "idempotency_key": "saga_q8r3t5_charge"
  }'

HTTP/1.1 201 Created

{
  "charge_id": "chg_k9j2m5",
  "saga_id": "saga_q8r3t5",
  "order_id": "ord_x7k9",
  "amount": 138.98,
  "currency": "usd",
  "status": "succeeded",
  "charged_at": "2026-04-13T14:32:03Z"
}

All three steps succeeded. The saga is complete. The order is confirmed.
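
The `idempotency_key` in the Step 3 request is what makes the charge safe to retry. One way a payment service could honor it is sketched below. `ChargeStore` is a hypothetical in-memory stand-in for a real charge table, and the generated charge ID format is made up for the example.

```python
from typing import Dict


class ChargeStore:
    """In-memory stand-in for the Payment Service's charge table (hypothetical)."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, order_id: str, amount: float) -> dict:
        # A retry with the same key returns the stored result instead of charging again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        self.charges_made += 1
        result = {
            "charge_id": f"chg_{self.charges_made:04d}",  # made-up ID format
            "order_id": order_id,
            "amount": amount,
            "status": "succeeded",
        }
        self._by_key[idempotency_key] = result
        return result


store = ChargeStore()
first = store.charge("saga_q8r3t5_charge", "ord_x7k9", 138.98)
retry = store.charge("saga_q8r3t5_charge", "ord_x7k9", 138.98)
```

Here `retry` is the same record as `first`, and the card is charged once even though the request arrived twice.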

When step 3 fails

The card is declined. Now you need to undo steps 1 and 2.

HTTP/1.1 402 Payment Required

{
  "error": "payment_failed",
  "charge_id": "chg_k9j2m5",
  "reason": "card_declined",
  "decline_code": "insufficient_funds"
}

Compensate step 2: Release inventory

curl -X DELETE https://inventory.internal/api/reservations/res_m4n7p2 \
  -H "X-Saga-Id: saga_q8r3t5"

HTTP/1.1 200 OK

{
  "reservation_id": "res_m4n7p2",
  "status": "released",
  "items": [
    { "sku": "WIDGET-42", "quantity": 2, "available_after": 150 },
    { "sku": "GADGET-17", "quantity": 1, "available_after": 43 }
  ],
  "released_at": "2026-04-13T14:32:04Z"
}

Compensate step 1: Cancel order

curl -X PATCH https://orders.internal/api/orders/ord_x7k9 \
  -H "Content-Type: application/json" \
  -H "X-Saga-Id: saga_q8r3t5" \
  -d '{
    "status": "cancelled",
    "reason": "payment_declined"
  }'

HTTP/1.1 200 OK

{
  "order_id": "ord_x7k9",
  "status": "cancelled",
  "reason": "payment_declined",
  "cancelled_at": "2026-04-13T14:32:04Z"
}

The system is back to a consistent state. No ghost reservations. No orphaned orders.

Compensation is not undo

A compensating action doesn't reverse time. It creates a new transaction that semantically negates the previous one. A payment refund is not "un-charging" the card — it's a new credit transaction. An inventory release is not "un-reserving" — it's a new stock adjustment. This distinction matters because compensations are visible in audit logs, accounting records, and event streams.
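
To make the distinction concrete, here is a small append-only ledger sketch. The `Ledger` class is hypothetical, and amounts are held in integer cents (an assumption made for the example to avoid floating-point rounding). The point is that a refund adds a second entry rather than deleting the first.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LedgerEntry:
    kind: str          # "charge" or "refund"
    order_id: str
    amount_cents: int  # positive for charges, negative for refunds


class Ledger:
    def __init__(self) -> None:
        self.entries: List[LedgerEntry] = []

    def charge(self, order_id: str, amount_cents: int) -> None:
        self.entries.append(LedgerEntry("charge", order_id, amount_cents))

    def refund(self, order_id: str, amount_cents: int) -> None:
        # Compensation appends a new entry; the original charge stays in the record.
        self.entries.append(LedgerEntry("refund", order_id, -amount_cents))

    def balance(self, order_id: str) -> int:
        return sum(e.amount_cents for e in self.entries if e.order_id == order_id)


ledger = Ledger()
ledger.charge("ord_x7k9", 13898)
ledger.refund("ord_x7k9", 13898)
```

The balance for `ord_x7k9` is back to zero, but the ledger holds two entries, which is what an auditor or a downstream event consumer would see.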

Choreography vs orchestration

There are two ways to coordinate a saga: let services talk to each other (choreography) or have a central coordinator manage the flow (orchestration).

Choreography: event-driven coordination

Each service publishes events. Other services subscribe and react. No central coordinator.

Event payloads for the choreography flow:

// Published by Order Service
{
  "type": "order.created",
  "saga_id": "saga_q8r3t5",
  "data": {
    "order_id": "ord_x7k9",
    "customer_id": "cust_8a3f",
    "items": [
      { "sku": "WIDGET-42", "quantity": 2 },
      { "sku": "GADGET-17", "quantity": 1 }
    ],
    "total": 138.98
  }
}

// Published by Inventory Service (after reserving)
{
  "type": "inventory.reserved",
  "saga_id": "saga_q8r3t5",
  "data": {
    "order_id": "ord_x7k9",
    "reservation_id": "res_m4n7p2",
    "items": [
      { "sku": "WIDGET-42", "quantity": 2 },
      { "sku": "GADGET-17", "quantity": 1 }
    ]
  }
}

// Published by Payment Service (after charging)
{
  "type": "payment.charged",
  "saga_id": "saga_q8r3t5",
  "data": {
    "order_id": "ord_x7k9",
    "charge_id": "chg_k9j2m5",
    "amount": 138.98
  }
}

If payment fails, the Payment Service publishes payment.failed. Inventory Service subscribes to that event and releases its reservation. Order Service subscribes and cancels the order.

// Published by Payment Service (on failure)
{
  "type": "payment.failed",
  "saga_id": "saga_q8r3t5",
  "data": {
    "order_id": "ord_x7k9",
    "reason": "card_declined",
    "decline_code": "insufficient_funds"
  }
}
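
The failure path can be sketched with a tiny synchronous event bus. This is an in-memory stand-in for a real message broker. `EventBus` and the handler wiring are assumptions for illustration, and a real broker would add delivery guarantees, ordering rules, and retries that this sketch ignores.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: dict) -> None:
        for handler in self._handlers.get(event["type"], []):
            handler(event)


bus = EventBus()
state = {"order": "pending", "reservation": "reserved"}

# Inventory Service and Order Service each react to payment.failed on their own;
# neither knows about the other, and there is no coordinator.
bus.subscribe("payment.failed", lambda e: state.update(reservation="released"))
bus.subscribe("payment.failed", lambda e: state.update(order="cancelled"))

bus.publish({
    "type": "payment.failed",
    "saga_id": "saga_q8r3t5",
    "data": {"order_id": "ord_x7k9", "reason": "card_declined"},
})
```

After the publish, `state` shows the reservation released and the order cancelled.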

Orchestration: central coordinator

A saga orchestrator tells each service what to do and handles failures centrally.

The orchestrator maintains the saga state:

{
  "saga_id": "saga_q8r3t5",
  "type": "place_order",
  "status": "compensating",
  "started_at": "2026-04-13T14:32:00Z",
  "current_step": "compensate_order",
  "steps": [
    {
      "name": "create_order",
      "service": "order-service",
      "status": "compensating",
      "result": { "order_id": "ord_x7k9" }
    },
    {
      "name": "reserve_inventory",
      "service": "inventory-service",
      "status": "compensated",
      "result": { "reservation_id": "res_m4n7p2" },
      "compensated_at": "2026-04-13T14:32:04Z"
    },
    {
      "name": "charge_payment",
      "service": "payment-service",
      "status": "failed",
      "error": { "reason": "card_declined" },
      "failed_at": "2026-04-13T14:32:03Z"
    }
  ]
}
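
A rough sketch of an orchestrator that builds a state record shaped like the one above as it goes. The `orchestrate` function is hypothetical, it keeps state in memory where a real orchestrator would persist it, and the `running` and `completed` status names are assumptions added to the statuses shown in the example.

```python
from typing import Callable, Dict, List, Tuple

# (name, action, compensation); step names are assumed to be unique.
Step = Tuple[str, Callable[[], dict], Callable[[], None]]


def orchestrate(saga_id: str, steps: List[Step]) -> dict:
    """Run steps in order and return a saga state record."""
    state: Dict = {"saga_id": saga_id, "status": "running", "steps": []}
    for name, action, _ in steps:
        record = {"name": name, "status": "running"}
        state["steps"].append(record)
        try:
            record["result"] = action()
            record["status"] = "completed"
        except Exception as exc:
            record["status"] = "failed"
            record["error"] = {"reason": str(exc)}
            state["status"] = "compensating"
            break
    else:
        # No step failed, so there is nothing to compensate.
        state["status"] = "completed"
        return state

    # Compensate completed steps in reverse order.
    compensations = {name: comp for name, _, comp in steps}
    for record in reversed(state["steps"]):
        if record["status"] == "completed":
            compensations[record["name"]]()
            record["status"] = "compensated"
    state["status"] = "compensated"
    return state
```

Because the record is updated after every step, a crash mid-saga would leave enough information to resume, provided the record is written to durable storage, which this sketch does not do.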

When to use which

As a rule of thumb, choreography works for simple sagas with 2-3 steps where the flow is linear. Orchestration is better when you have 4+ steps, conditional branching, or a need for visibility into the saga's state. A common path is to start with choreography and switch to orchestration once debugging event chains becomes painful.

Comparing the two approaches

Factor	Choreography	Orchestration
Coupling	Loose — services only know about events	Tighter — orchestrator knows all services
Visibility	Hard to see the full flow	Saga state is centralized and inspectable
Complexity	Distributed — each service has partial logic	Centralized — one place to understand the flow
Adding steps	Touch multiple services	Touch the orchestrator only
Debugging	Requires distributed tracing across events	Check the orchestrator's saga log
Single point of failure	None	The orchestrator (must be highly available)
Best for	Simple, linear flows	Complex flows with branching or many steps

Common pitfalls

Compensation can fail too

What if the inventory release fails during compensation? You need retries on compensations, and eventually a manual intervention queue:

{
  "saga_id": "saga_q8r3t5",
  "status": "compensation_failed",
  "stuck_step": "compensate_inventory",
  "attempts": 5,
  "last_error": "inventory-service unavailable",
  "requires_manual_intervention": true,
  "alert_sent_to": "ops-team@example.com"
}
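
One possible shape for the retry-then-escalate logic, as a sketch. `compensate_with_retry` is a hypothetical helper, the backoff schedule is an arbitrary choice, and the returned record copies the field names from the example above. It assumes the compensation itself is idempotent, since it may run more than once.

```python
import time
from typing import Callable, Optional


def compensate_with_retry(
    saga_id: str,
    step: str,
    compensate: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Return None on success, or a manual-intervention record after max_attempts."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate()
            return None
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Exponential backoff: 0.5s, 1s, 2s, 4s with the defaults.
                sleep(base_delay * 2 ** (attempt - 1))
    return {
        "saga_id": saga_id,
        "status": "compensation_failed",
        "stuck_step": step,
        "attempts": max_attempts,
        "last_error": last_error,
        "requires_manual_intervention": True,
    }
```

The caller would push any returned record onto whatever queue or alerting channel the operations team watches. The `sleep` parameter is injectable so the function can be tested without real delays.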

Observability is critical

Every saga step should emit structured logs with the saga_id:

{
  "timestamp": "2026-04-13T14:32:03Z",
  "level": "error",
  "service": "payment-service",
  "saga_id": "saga_q8r3t5",
  "step": "charge_payment",
  "action": "failed",
  "order_id": "ord_x7k9",
  "reason": "card_declined",
  "next_action": "trigger_compensation"
}

Without consistent saga_id correlation, debugging a failed saga across three services and a message broker is an exercise in frustration.
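
A small helper can make the `saga_id` impossible to forget by requiring it as an argument. This is a sketch using only the standard library. `saga_log_line` is a hypothetical name, and the field set follows the log example above.

```python
import json
from datetime import datetime, timezone


def saga_log_line(level: str, service: str, saga_id: str,
                  step: str, action: str, **fields) -> str:
    """Build one JSON log line; saga_id is required so every line can be correlated."""
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "service": service,
        "saga_id": saga_id,
        "step": step,
        "action": action,
        **fields,  # extra context such as order_id or reason
    }
    return json.dumps(entry)


line = saga_log_line(
    "error", "payment-service", "saga_q8r3t5", "charge_payment", "failed",
    order_id="ord_x7k9", reason="card_declined", next_action="trigger_compensation",
)
```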

Sagas don't give you isolation

A database transaction gives you isolation — other transactions can't see intermediate states. A saga has no isolation. Between Step 1 and Step 3, the order exists but isn't paid. Other parts of the system can see this intermediate state. Design your read paths to handle it — show "processing" statuses, filter out unpaid orders from reports, and don't send confirmation emails until the saga completes.
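
A sketch of read-path helpers that account for the intermediate state. The `pending` and `cancelled` statuses come from the examples above. `confirmed` is an assumed name for an order whose saga has completed, and the function names are hypothetical.

```python
from typing import Iterable, List

# "confirmed" is an assumed status name for a completed saga.
DISPLAY_STATUS = {
    "pending": "processing",
    "confirmed": "confirmed",
    "cancelled": "cancelled",
}


def orders_for_report(orders: Iterable[dict]) -> List[dict]:
    """Only orders whose saga finished successfully count toward reports."""
    return [o for o in orders if o["status"] == "confirmed"]


def status_for_ui(order: dict) -> str:
    """Map the internal status to what the customer sees while the saga is in flight."""
    return DISPLAY_STATUS.get(order["status"], "processing")


def should_send_confirmation(order: dict) -> bool:
    """Hold the confirmation email until the saga has completed."""
    return order["status"] == "confirmed"


orders = [
    {"order_id": "ord_x7k9", "status": "pending"},
    {"order_id": "ord_a1b2", "status": "confirmed"},
]
reportable = orders_for_report(orders)
```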

Checklist: implementing a saga

Can you define a compensating action for every step?
Are all your saga steps idempotent (safe to retry)?
Do you have a strategy for compensation failures?
Is every event/request tagged with a saga_id for tracing?
Have you designed your UI for intermediate states (pending, processing)?
Do you have timeouts for stuck sagas?
Have you decided between choreography and orchestration?
