Improving Processing Speed While Ensuring Ordering
The Core Tensionβ
Kafka guarantees message ordering within a single partition. The default consumer model uses one thread per partition β simple and correct, but limited in throughput.
Standard Consumer (single-threaded per partition):
Partition 0: [m0] [m1] [m2] [m3] [m4] [m5] ...
β
βΌ
ββββββββββββ
β Thread 0 β β processes m0, then m1, then m2...
ββββββββββββ (100ms each = 10 msg/sec MAX)
If processing takes 100ms per message, your ceiling is 10 messages/second per partition. To go faster, you need concurrency β but naΓ―ve multithreading breaks ordering.
πΆ For Beginners: The "Bank Teller" Analogyβ
Imagine a bank with one massive line of customers:
Single Line (= 1 Kafka Partition):
[Alice-Acct123] [Bob-Acct123] [Carol-Acct456] [Dave-Acct789]
β β β β
βΌ βΌ βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 1 Teller (1 Thread) β
β Processes Alice β Bob β Carol β Dave (strictly) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Correct ordering β
but SLOW β
You hire 3 tellers to speed things up:
NaΓ―ve Parallel (BROKEN):
[Alice-Acct123] [Bob-Acct123] [Carol-Acct456]
β β β
ββββββΌβββββ ββββββΌβββββ ββββββΌβββββ
β Teller1 β β Teller2 β β Teller3 β
β (slow) β β (fast!) β β β
βββββββββββ βββββββββββ βββββββββββ
Teller2 finishes Bob's withdrawal BEFORE
Teller1 finishes Alice's deposit β same account!
Account overdrafts! β
The fix: route by Account Number:
Key-Based Routing (CORRECT):
Manager routes by Account Number:
Acct123 β always Teller 1
Acct456 β always Teller 2
Acct789 β always Teller 3
[Alice-Acct123] [Bob-Acct123] β Teller 1 (sequential β
)
[Carol-Acct456] β Teller 2 (parallel β
)
[Dave-Acct789] β Teller 3 (parallel β
)
3x throughput β
Per-account ordering preserved β
π§ Deep Dive: Four Patterns for Throughput + Orderingβ
Pattern 1: Key-Level Parallel Consumerβ
The most powerful and general-purpose solution. Separate the network I/O from processing, and route by key.
Architecture:
Kafka Partition 0
β
ββββββΌβββββββββββββββββββββββ
β Poller Thread β β single thread: poll() + route
β (fetches batches) β
ββββββ¬βββββββ¬βββββββ¬βββββββββ
β β β
ββββββΌβββββββΌβββββββΌβββββ
β Queue ββQueue ββQueue β β per-key (or per-key-hash) queues
β Key=A ββKey=B ββKey=C β
βββββ¬ββββββββ¬ββββββββ¬ββββ
βββββΌββββββββΌββββββββΌββββ
βWorker ββWorkerββWorkerβ β each worker processes its queue
β T1 ββ T2 ββ T3 β sequentially
βββββββββββββββββββββββββ
Key A: m1βm4βm7 processed IN ORDER by T1
Key B: m2βm5βm8 processed IN ORDER by T2
Key C: m3βm6βm9 processed IN ORDER by T3
But T1, T2, T3 run IN PARALLEL
Implementation with Confluent Parallel Consumerβ
Don't build this yourself β use Confluent's open-source parallel-consumer:
<!-- pom.xml -->
<dependency>
<groupId>io.confluent.parallelconsumer</groupId>
<artifactId>parallel-consumer-core</artifactId>
<version>0.5.3.0</version>
</dependency>
var options = ParallelConsumerOptions.<String, PaymentEvent>builder()
.consumer(kafkaConsumer)
.ordering(ParallelConsumerOptions.ProcessingOrder.KEY) // β KEY ordering
.maxConcurrency(100) // up to 100 keys processed in parallel
.commitMode(ParallelConsumerOptions.CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS)
.build();
var parallelConsumer = new ParallelStreamProcessor<>(options);
parallelConsumer.poll(record -> {
// This lambda runs in parallel for DIFFERENT keys
// but SEQUENTIALLY for the SAME key
processPayment(record.key(), record.value());
});
parallel-consumer library handles the hardest part β safe, out-of-order offset committing β using an internal bitmap to track which offsets have completed, only committing the highest contiguous offset.Pattern 2: Batch Processingβ
If your bottleneck is I/O (database writes, API calls), batching can deliver 10β100x improvement without any threading complexity.
Without batching:
[m0] β INSERT β [m1] β INSERT β [m2] β INSERT (3 round-trips)
With batching:
[m0, m1, m2] β BULK INSERT (1 round-trip)
@KafkaListener(topics = "orders", batch = "true")
public void processBatch(List<ConsumerRecord<String, OrderEvent>> records) {
// Group by key to preserve per-key ordering within the batch
Map<String, List<OrderEvent>> byKey = records.stream()
.collect(Collectors.groupingBy(
ConsumerRecord::key,
LinkedHashMap::new, // preserve insertion order
Collectors.mapping(ConsumerRecord::value, Collectors.toList())
));
// Process each key's events in order, but different keys can be parallelized
byKey.forEach((key, events) -> {
List<OrderEntity> entities = events.stream()
.map(this::toEntity)
.collect(Collectors.toList());
orderRepository.saveAll(entities); // single bulk DB call per key
});
}
| Approach | Throughput | Ordering | Complexity |
|---|---|---|---|
| Individual processing | Low | β Guaranteed | Low |
| Batch processing | High | β Guaranteed (with grouping) | Medium |
| Batch without grouping | Highest | β οΈ Only within batch | Low |
Pattern 3: Router Topic (Fan-Out)β
If a single topic has extremely high throughput, write a lightweight "router" consumer that splits messages into smaller sub-topics by key:
Main Topic (high volume):
ββββββββββββββββββββββββββββββββββββββββββββ
β [A][B][C][A][D][B][A][C][D][A][B][C]... β
βββββββββββββββββββ¬βββββββββββββββββββββββββ
β
βββββββββΌβββββββββ
β Router Consumerβ β ultra-fast, no business logic
β (fan-out) β just re-produces by key
ββββ¬βββββ¬βββββ¬ββββ
β β β
ββββββΌβββββΌβββββΌβββββ
βSub-AββSub-BββSub-Cβ β smaller sub-topics
ββββ¬βββββββ¬βββββββ¬βββ
ββββΌβββββββΌβββββββΌβββ
β C1 ββ C2 ββ C3 β β independent consumers
βββββββββββββββββββββ with their own groups
@KafkaListener(topics = "main-events", groupId = "router")
public void route(ConsumerRecord<String, Event> record) {
// Route to sub-topic based on key hash
int bucket = Math.abs(record.key().hashCode()) % NUM_SUB_TOPICS;
String subTopic = "events-sub-" + bucket;
kafkaTemplate.send(subTopic, record.key(), record.value());
}
Pattern 4: Partition-Level Pause/Resumeβ
If one partition is "stuck" on a slow message, you can pause it and continue processing other partitions:
@Override
public void onMessage(ConsumerRecord<String, Event> record,
Consumer<?, ?> consumer) {
try {
processEvent(record);
} catch (TemporaryFailureException e) {
// Pause this partition β other partitions continue normally
TopicPartition tp = new TopicPartition(record.topic(), record.partition());
consumer.pause(Collections.singleton(tp));
// Schedule a retry
scheduler.schedule(() -> {
consumer.resume(Collections.singleton(tp));
}, 30, TimeUnit.SECONDS);
}
}
This preserves ordering within the paused partition (nothing skips ahead) while allowing the rest of the consumer group to continue making progress.
β οΈ The Offset Committing Challengeβ
When processing messages out of order (parallel consumer, async workers), offset management becomes the hardest problem.
Why You Can't Just Commit the Latest Offsetβ
Fetched: [offset 1] [offset 2] [offset 3] [offset 4] [offset 5]
β β β β β
Thread A Thread B Thread A Thread B Thread A
β β β β β
DONE β³ SLOW DONE DONE DONE
If you commit offset 5 now:
β App crashes
β Kafka resumes from offset 5
β Offset 2 is LOST forever β
The High-Water Mark Solutionβ
Only commit the highest contiguous completed offset:
Completion bitmap: [1=β
] [2=β³] [3=β
] [4=β
] [5=β
]
β²
Still processing!
High-water mark = 1 β only safe to commit offset 1
Later: [1=β
] [2=β
] [3=β
] [4=β
] [5=β
]
High-water mark = 5 β now safe to commit offset 5
The Confluent Parallel Consumer handles this automatically using an internal offset bitmap.
β Best Practices Summaryβ
| Scenario | Recommended Pattern | Why |
|---|---|---|
| Slow DB writes, ordering required | Batch Processing | Simple, no threading, huge I/O improvement |
| Need 10β100x throughput, ordering per key | Parallel Consumer (Key-level) | Best throughput/ordering tradeoff |
| Extreme volume, single fat topic | Router Topic (Fan-Out) | Distributes load to independent consumer groups |
| One bad message blocks processing | Partition Pause/Resume | Isolates slow partition without losing ordering |
| No ordering requirement at all | Add more partitions + consumers | Simplest scaling path |
π‘οΈ Always Add Idempotencyβ
Regardless of which pattern you choose, your downstream systems must be idempotent. Crashes, retries, and rebalances can cause duplicate processing:
// β
Idempotent DB write β uses unique constraint on eventId
@Transactional
public void processPayment(PaymentEvent event) {
if (paymentRepository.existsByEventId(event.getEventId())) {
log.info("Duplicate event {}, skipping", event.getEventId());
return;
}
paymentRepository.save(toEntity(event));
}
Interview Questions β Processing & Orderingβ
Q: How do you increase Kafka consumer throughput without breaking ordering?
Use Key-Level Parallel Consumer: a single poller thread fetches messages, then routes them to worker threads by message key. Messages with the same key are processed sequentially by the same thread, while different keys are processed in parallel. Libraries like Confluent's
parallel-consumerhandle this pattern and the complex offset tracking.
Q: What is the High-Water Mark problem in parallel consumption?
When messages are processed out of order by worker threads, you cannot commit the latest completed offset β earlier offsets may still be in-progress. If you commit prematurely and crash, those in-progress messages are lost. The solution is to track completion with a bitmap and only commit the highest offset where all preceding offsets are complete.
Q: When would you use batch processing over parallel consumer?
Batch processing is simpler and preferred when the bottleneck is I/O (e.g., many small DB inserts that can be replaced with bulk inserts). It avoids threading complexity entirely. Use parallel consumer when the bottleneck is CPU-bound processing or when individual messages require independent async I/O.
Q: What is the Router Topic (Fan-Out) pattern?
A fast, stateless consumer reads a high-volume topic and immediately re-produces each message to one of several smaller sub-topics based on the message key. This allows independent consumer groups to process sub-topics in parallel while preserving per-key ordering within each sub-topic. The tradeoff is added operational complexity and one extra hop of latency.
Q: Why is idempotency critical when using parallel or async processing?
With parallel processing, crashes during a rebalance or between processing and committing can cause messages to be replayed. Without idempotent downstream systems, this leads to duplicate state (double charges, duplicate records). Always use unique constraints, upserts, or deduplication checks.