Event-driven microservices are having a moment. Every architecture blog, conference talk, and system design interview revolves around Kafka, event sourcing, CQRS, and saga patterns. The appeal is real: loose coupling between services, natural scalability, and a clean separation of concerns that makes independent deployments possible.
But there's a gap between event-driven architecture as described in blog posts and event-driven architecture as experienced in production at 3 AM when messages are being processed out of order, a consumer is stuck in an infinite retry loop, and your saga has left three services in an inconsistent state.
This post covers the patterns, tradeoffs, and failure modes of event-driven microservices — with a focus on what actually goes wrong and how to prevent it.
Event-Driven vs. Event Sourcing: They're Different
Before going further, let's clear up a common point of confusion. Event-driven architecture and event sourcing are related but distinct concepts, and conflating them leads to poor design decisions.
- Event-driven architecture: Services communicate by producing and consuming events through a message broker (Kafka, RabbitMQ, NATS). Each service reacts to events it cares about. The events are messages — they flow through the system and may or may not be stored long-term
- Event sourcing: The state of an entity is stored as a sequence of events. Instead of storing 'account balance = $500', you store 'deposited $300', 'deposited $400', 'withdrew $200'. The current state is derived by replaying the event log. Event sourcing is a data storage pattern, not a communication pattern
- You can use event-driven architecture without event sourcing (the common case)
- You can use event sourcing without event-driven architecture (rare but possible)
- Combining both adds significant complexity — only do it when you genuinely need the audit trail and temporal query capabilities that event sourcing provides
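The distinction is easiest to see in code. Here's a minimal event-sourcing sketch in Python (the event types and amounts mirror the balance example above; everything else is illustrative): current state is never stored directly, only derived by folding over the log.

```python
from dataclasses import dataclass

# Hypothetical event types for an account, stored as an append-only log.
@dataclass
class Deposited:
    amount: int  # cents

@dataclass
class Withdrew:
    amount: int  # cents

def replay_balance(events) -> int:
    """Derive the current balance by folding over the event log."""
    balance = 0
    for event in events:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrew):
            balance -= event.amount
    return balance

log = [Deposited(30000), Deposited(40000), Withdrew(20000)]
print(replay_balance(log))  # 50000 cents = $500, matching the example above
```

Nothing here requires a message broker, which is exactly the point: the log is a storage concern, not a communication one.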
CQRS: Separating Reads from Writes
Command Query Responsibility Segregation (CQRS) splits your data model into two: a write model optimized for processing commands (creating orders, updating profiles, processing payments) and a read model optimized for queries (listing orders, searching products, generating reports).
The write model handles business logic and maintains consistency. The read model is denormalized, fast, and optimized for specific query patterns. Events flow from the write side to the read side, keeping the read models eventually consistent.
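A minimal in-memory sketch of that flow (all names and stores are illustrative stand-ins; a real system would use a database on the write side, a broker in between, and something like a search index or cache on the read side):

```python
# CQRS sketch: the write side validates commands and emits events;
# a projector consumes events and maintains a denormalized read model.

events = []          # stand-in for the event stream / broker
read_model = {}      # stand-in for a denormalized read store

def create_order(order_id: str, customer: str, total: int) -> None:
    """Write side: enforce business rules, then emit an event."""
    if total <= 0:
        raise ValueError("total must be positive")
    events.append({"type": "OrderCreated", "order_id": order_id,
                   "customer": customer, "total": total})

def project(event: dict) -> None:
    """Read side: build a query-optimized view (orders by customer)."""
    if event["type"] == "OrderCreated":
        read_model.setdefault(event["customer"], []).append(
            {"order_id": event["order_id"], "total": event["total"]})

create_order("o-1", "alice", 4200)
for e in events:          # in production this happens asynchronously
    project(e)
print(read_model["alice"])  # fast per-customer lookup, no joins needed
```

The asynchronous projection step is where the eventual consistency comes from: between the write and the projection, the read model is stale.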
When CQRS Makes Sense
- Read and write patterns are dramatically different (e.g., high read volume with complex queries, low write volume with complex business rules)
- You need multiple read representations of the same data (e.g., a search index, an analytics data warehouse, and a customer-facing API — all from the same event stream)
- Write operations involve complex domain logic that shouldn't be contaminated by read optimization concerns
- You're building a system where audit trail and event history are first-class requirements
When CQRS Is Overkill
- Simple CRUD applications where reads and writes follow the same patterns
- Small teams that can't afford the operational overhead of maintaining two data models
- Systems where strong consistency between reads and writes is required (CQRS introduces eventual consistency, which is a fundamental tradeoff, not a bug to be fixed)
- Early-stage products where requirements are changing rapidly — CQRS adds structural rigidity
The biggest mistake teams make with CQRS is applying it globally to their entire system. CQRS should be applied to specific bounded contexts where the read/write asymmetry justifies the complexity. Most services in your architecture should remain simple CRUD.
The Saga Pattern: Distributed Transactions That Work
In a monolith, a business operation that spans multiple entities wraps everything in a database transaction. Either all changes commit or all roll back. In microservices, this isn't possible — each service has its own database, and distributed transactions via two-phase commit (2PC) scale poorly and couple services tightly.
The Saga pattern replaces a single distributed transaction with a sequence of local transactions, each in its own service. If a step fails, the saga executes compensating transactions to undo the previous steps.
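The core mechanics can be sketched as a small saga runner (purely illustrative; a production implementation would persist saga state and handle failures of the compensations themselves):

```python
# Saga sketch: each step pairs an action with a compensating action.
# On failure, previously completed steps are undone in reverse order.

def run_saga(steps) -> bool:
    """steps: list of (action, compensation) callables. True on success."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):  # compensate newest-first
                undo()
            return False
    return True

log = []
def create_order():   log.append("order created")
def cancel_order():   log.append("order cancelled")
def charge_payment(): log.append("payment charged")
def refund_payment(): log.append("payment refunded")
def reserve_stock():  raise RuntimeError("out of stock")  # simulated failure

ok = run_saga([
    (create_order,   cancel_order),
    (charge_payment, refund_payment),
    (reserve_stock,  lambda: None),
])
print(ok, log)
```

When the third step fails, the runner refunds the payment and then cancels the order, leaving no service holding a half-finished transaction.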
Choreography: Event-Based Coordination
In choreography, each service listens for events and decides autonomously what to do next. There's no central coordinator. For example, the Order Service publishes 'OrderCreated', the Payment Service hears it and processes payment, publishing 'PaymentCompleted', the Inventory Service hears that and reserves stock.
- Pros: Loose coupling, no single point of failure, each service is autonomous
- Cons: Hard to understand the overall flow by reading any single service's code. The saga logic is distributed and implicit. Debugging failures requires correlating events across multiple services and logs
- Best for: Simple sagas with 3-4 steps and clear, linear flows
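A toy in-process version of the order flow above (the tiny bus stands in for a real broker like Kafka; topic names and payloads are illustrative, and each service is reduced to a one-liner):

```python
from collections import defaultdict

# Choreography sketch: services register handlers for the events they care
# about. There is no central coordinator — the flow emerges from the handlers.

handlers = defaultdict(list)
trace = []  # records the event cascade for illustration

def subscribe(event_type, handler):
    handlers[event_type].append(handler)

def publish(event_type, payload):
    trace.append(event_type)
    for handler in handlers[event_type]:
        handler(payload)

# Payment Service: reacts to OrderCreated, publishes PaymentCompleted.
subscribe("OrderCreated", lambda p: publish("PaymentCompleted", p))
# Inventory Service: reacts to PaymentCompleted, reserves stock.
subscribe("PaymentCompleted",
          lambda p: trace.append(f"stock reserved for {p['order_id']}"))

publish("OrderCreated", {"order_id": "o-1"})
print(trace)  # the full cascade, which no single service knows about
```

Note that no component in this sketch contains the whole flow — that's both the strength (autonomy) and the weakness (the saga logic is implicit) listed above.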
Orchestration: Centralized Coordination
In orchestration, a central Saga Orchestrator service explicitly manages the saga flow. It sends commands to each service and handles their responses. The orchestrator knows the full saga definition — which steps to execute, in what order, and what compensating actions to take on failure.
- Pros: The saga logic is in one place, making it readable, testable, and debuggable. Easy to add monitoring and retry logic. Clear ownership of the business process
- Cons: The orchestrator is a single point of failure (must be highly available). Services become coupled to the orchestrator's command interface. Risk of the orchestrator becoming a 'god service' that knows too much
- Best for: Complex sagas with many steps, branching logic, or where visibility and debuggability are critical
In practice, most production systems use orchestration for complex business flows and choreography for simpler, more decoupled interactions. It's not either/or — you'll use both patterns in the same system.
The Exactly-Once Myth
Every discussion of event-driven systems eventually hits the delivery guarantee question: at-most-once, at-least-once, or exactly-once? Teams naturally want exactly-once delivery because it eliminates the need to think about duplicates. Here's the uncomfortable truth: exactly-once delivery is impossible in distributed systems.
This isn't a limitation of current technology — it's a fundamental constraint. The Two Generals Problem shows that two parties cannot reach guaranteed agreement over an unreliable network, which rules out guaranteed exactly-once delivery. What systems like Kafka offer as 'exactly-once' is actually 'effectively-once within the Kafka ecosystem' — they deduplicate within Kafka's own processing but cannot guarantee exactly-once delivery to external systems.
The Real Solution: Idempotency
Instead of trying to prevent duplicate messages (impossible), design your consumers to handle them safely. An idempotent operation produces the same result whether it's executed once or many times.
- Idempotency keys: Every message includes a unique ID. Before processing, the consumer checks if it has already processed a message with that ID. If yes, skip. Store processed IDs in a database with a TTL
- Natural idempotency: Design operations to be naturally idempotent. 'Set balance to $500' is idempotent. 'Add $100 to balance' is not. Prefer absolute state updates over relative ones
- Database constraints: Use unique constraints to prevent duplicate inserts. If processing a message would create a duplicate database record, the constraint catches it
- Conditional updates: Use optimistic concurrency (version numbers or ETags) so that processing the same message twice results in the second attempt being a no-op
Accept at-least-once delivery and build idempotent consumers. This is simpler, more reliable, and more honest than chasing the exactly-once illusion.
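The idempotency-key approach from the list above, as a minimal sketch (the in-memory set stands in for a durable store with a unique constraint and TTL; in production, the dedupe check and the side effect should commit in a single transaction):

```python
# Idempotent consumer sketch: dedupe on message ID before applying effects.

seen = set()                 # stand-in for a durable processed-IDs table
balances = {"acct-1": 0}     # stand-in for the consumer's own state

def handle(message: dict) -> bool:
    """Safe under at-least-once delivery. Returns True if applied."""
    msg_id = message["id"]
    if msg_id in seen:       # duplicate delivery: skip, don't re-apply
        return False
    balances[message["account"]] += message["amount"]
    seen.add(msg_id)         # record only after the effect succeeds
    return True

msg = {"id": "m-42", "account": "acct-1", "amount": 100}
handle(msg)   # first delivery: applied
handle(msg)   # redelivered duplicate: no-op
print(balances["acct-1"])  # 100, not 200
```

Note that 'add $100' — a non-idempotent operation on its own — becomes safe once the dedupe check guards it.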
Dead Letter Queues and Poison Messages
A poison message is a message that consistently fails to process — maybe it contains invalid data, triggers a bug, or depends on a resource that's permanently unavailable. Without a dead letter queue (DLQ), a poison message blocks the entire consumer, causing it to retry forever.
- Configure a maximum retry count (typically 3-5 retries with exponential backoff)
- After max retries, route the message to a dead letter queue for manual inspection
- Monitor DLQ depth as a critical operational metric — a growing DLQ means something is systematically wrong
- Build tooling to inspect, replay, and manually resolve DLQ messages
- Alert on first DLQ entry, not just queue depth — a single poison message may indicate a systemic issue
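Putting the first two points together, a retry loop with exponential backoff and a DLQ fallback might look like this (illustrative; most brokers and client libraries provide equivalents via configuration):

```python
import time

MAX_RETRIES = 3

def consume(message, process, dead_letter, base_delay=0.01) -> bool:
    """Retry with exponential backoff; route poison messages to a DLQ."""
    for attempt in range(MAX_RETRIES):
        try:
            process(message)
            return True
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s
    dead_letter.append(message)  # give up: park it for manual inspection
    return False

dlq = []

def always_fails(msg):
    raise ValueError("poison message")

consume({"id": "m-1"}, always_fails, dlq)
print(len(dlq))  # 1 — the consumer moves on instead of retrying forever
```

The key property is the bounded loop: a poison message costs three attempts and then gets out of the way of healthy traffic.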
Schema Evolution: The Silent Killer
Event schemas change over time. New fields are added, old fields are deprecated, data types evolve. In a microservices architecture where multiple teams produce and consume events, uncoordinated schema changes are the most common source of production incidents.
- Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) to enforce schema compatibility
- Adopt backward-compatible evolution: new fields must be optional with defaults, existing fields cannot be removed or change type
- Version your events (OrderCreatedV1, OrderCreatedV2) when backward-incompatible changes are necessary
- Consumer contracts: consumers should ignore unknown fields and tolerate missing optional fields
- Test schema compatibility in CI/CD — reject deployments that break schema compatibility
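The 'tolerant reader' consumer contract from the list above can be sketched like this (field names and defaults are hypothetical):

```python
# Tolerant reader: ignore unknown fields, default missing optional ones,
# and fail loudly only when a required field is absent.

REQUIRED = {"order_id"}
DEFAULTS = {"currency": "USD"}  # optional field added in a newer schema

def parse_order_created(raw: dict) -> dict:
    missing = REQUIRED - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    known = REQUIRED | DEFAULTS.keys()
    # Unknown fields from newer producer schemas are silently dropped.
    return {**DEFAULTS, **{k: v for k, v in raw.items() if k in known}}

old = parse_order_created({"order_id": "o-1"})                      # V1 payload
new = parse_order_created({"order_id": "o-2", "currency": "EUR",
                           "loyalty_tier": "gold"})                 # V2 payload
print(old["currency"], new["currency"])  # USD EUR
```

With this contract, a producer can add `loyalty_tier` without coordinating a consumer release — the consumer simply doesn't see it until it opts in.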
Observability in Event-Driven Systems
Event-driven systems are inherently harder to observe than synchronous request-response systems. A request doesn't follow a linear path — it triggers a cascade of events across multiple services with no deterministic ordering.
- Correlation IDs: Every event must carry a correlation ID that traces the entire business transaction across all services. Without this, debugging is effectively impossible
- Distributed tracing: Use OpenTelemetry to propagate trace context through event headers. This lets you visualize the full event cascade in tools like Jaeger or Grafana Tempo
- Consumer lag monitoring: Track how far behind each consumer is from the latest event. Growing lag means a consumer can't keep up with production rate — a capacity problem that becomes a data freshness problem
- End-to-end latency tracking: Measure the time from event production to final side effect. In event-driven systems, this can be surprisingly long due to queue depth, consumer processing time, and downstream propagation
- Business event monitoring: Track business-level outcomes (orders completed, payments processed) alongside technical metrics. Technical health doesn't guarantee business health
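Correlation-ID propagation is simple to sketch (the header names here are illustrative; OpenTelemetry standardizes the same idea via the W3C `traceparent` header):

```python
import uuid

# Correlation-ID sketch: the entry-point service mints an ID; every
# downstream event copies it, so logs across services join on one key.

def new_event(event_type: str, payload: dict, parent: dict = None) -> dict:
    headers = {
        # Inherit the correlation ID from the parent event, or mint one.
        "correlation_id": parent["headers"]["correlation_id"]
                          if parent else str(uuid.uuid4()),
        "event_id": str(uuid.uuid4()),  # unique per individual event
    }
    return {"type": event_type, "headers": headers, "payload": payload}

order = new_event("OrderCreated", {"order_id": "o-1"})
payment = new_event("PaymentCompleted", {"order_id": "o-1"}, parent=order)

# One ID spans the whole business transaction:
print(order["headers"]["correlation_id"] ==
      payment["headers"]["correlation_id"])  # True
```

The discipline that matters is the `parent=` handoff: every consumer must copy the ID forward, because a single service that drops it breaks the trace for everything downstream.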
When NOT to Use Event-Driven Architecture
Event-driven architecture is powerful but not universal. Using it where it doesn't fit creates accidental complexity that makes systems harder to build, operate, and debug.
- When strong consistency is required: If the business logic requires that operation A and operation B are always consistent (no eventual consistency window), a synchronous approach is simpler and correct
- When the system is small: A monolith with function calls is dramatically simpler than microservices with event buses. Don't add distributed systems complexity to a problem that doesn't require distributed systems
- When the team is small: Event-driven systems require operational maturity — monitoring, alerting, DLQ management, schema evolution. Small teams are better served by simpler architectures
- For simple CRUD operations: Creating a user profile doesn't need to go through an event bus. Direct API calls with appropriate error handling are simpler and faster
- When latency matters more than throughput: Events add latency (serialization, network hop, deserialization, queuing). For real-time user-facing operations, synchronous calls are typically faster
The best architecture is the simplest one that meets your requirements. Event-driven patterns are tools — use them where they provide genuine value, not as a default because they sound sophisticated.
Architect Resilient Systems with Accelar
Accelar designs and builds distributed systems that scale without falling apart. From event-driven architectures and microservices to data pipelines and real-time processing — we engineer the infrastructure that keeps your business running. Let's discuss your architecture challenges.
