Skip to main content

Improving Processing Speed While Ensuring Ordering

The Core Tension​

Kafka guarantees message ordering within a single partition. The default consumer model uses one thread per partition β€” simple and correct, but limited in throughput.

Standard Consumer (single-threaded per partition):

Partition 0: [m0] [m1] [m2] [m3] [m4] [m5] ...
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Thread 0 β”‚ ← processes m0, then m1, then m2...
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ (100ms each = 10 msg/sec MAX)

If processing takes 100ms per message, your ceiling is 10 messages/second per partition. To go faster, you need concurrency β€” but naΓ―ve multithreading breaks ordering.


πŸ‘Ά For Beginners: The "Bank Teller" Analogy​

Imagine a bank with one massive line of customers:

Single Line (= 1 Kafka Partition):

[Alice-Acct123] [Bob-Acct123] [Carol-Acct456] [Dave-Acct789]
β”‚ β”‚ β”‚ β”‚
β–Ό β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1 Teller (1 Thread) β”‚
β”‚ Processes Alice β†’ Bob β†’ Carol β†’ Dave (strictly) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Correct ordering βœ… but SLOW ❌

You hire 3 tellers to speed things up:

NaΓ―ve Parallel (BROKEN):

[Alice-Acct123] [Bob-Acct123] [Carol-Acct456]
β”‚ β”‚ β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
β”‚ Teller1 β”‚ β”‚ Teller2 β”‚ β”‚ Teller3 β”‚
β”‚ (slow) β”‚ β”‚ (fast!) β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Teller2 finishes Bob's withdrawal BEFORE
Teller1 finishes Alice's deposit β€” same account!
Account overdrafts! ❌

The fix: route by Account Number:

Key-Based Routing (CORRECT):

Manager routes by Account Number:
Acct123 β†’ always Teller 1
Acct456 β†’ always Teller 2
Acct789 β†’ always Teller 3

[Alice-Acct123] [Bob-Acct123] β†’ Teller 1 (sequential βœ…)
[Carol-Acct456] β†’ Teller 2 (parallel βœ…)
[Dave-Acct789] β†’ Teller 3 (parallel βœ…)

3x throughput βœ… Per-account ordering preserved βœ…

🧠 Deep Dive: Four Patterns for Throughput + Ordering​

Pattern 1: Key-Level Parallel Consumer​

The most powerful and general-purpose solution. Separate the network I/O from processing, and route by key.

Architecture:

Kafka Partition 0
β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Poller Thread β”‚ ← single thread: poll() + route
β”‚ (fetches batches) β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”β”Œβ”€β”€β–Όβ”€β”€β”€β”β”Œβ”€β–Όβ”€β”€β”€β”€β”
β”‚ Queue β”‚β”‚Queue β”‚β”‚Queue β”‚ ← per-key (or per-key-hash) queues
β”‚ Key=A β”‚β”‚Key=B β”‚β”‚Key=C β”‚
β””β”€β”€β”€β”¬β”€β”€β”€β”˜β””β”€β”€β”¬β”€β”€β”€β”˜β””β”€β”€β”¬β”€β”€β”€β”˜
β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”β”Œβ”€β”€β–Όβ”€β”€β”€β”β”Œβ”€β”€β–Όβ”€β”€β”€β”
β”‚Worker β”‚β”‚Workerβ”‚β”‚Workerβ”‚ ← each worker processes its queue
β”‚ T1 β”‚β”‚ T2 β”‚β”‚ T3 β”‚ sequentially
β””β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜

Key A: m1β†’m4β†’m7 processed IN ORDER by T1
Key B: m2β†’m5β†’m8 processed IN ORDER by T2
Key C: m3β†’m6β†’m9 processed IN ORDER by T3
But T1, T2, T3 run IN PARALLEL

Implementation with Confluent Parallel Consumer​

Don't build this yourself β€” use Confluent's open-source parallel-consumer:

<!-- pom.xml -->
<dependency>
<groupId>io.confluent.parallelconsumer</groupId>
<artifactId>parallel-consumer-core</artifactId>
<version>0.5.3.0</version>
</dependency>
var options = ParallelConsumerOptions.<String, PaymentEvent>builder()
.consumer(kafkaConsumer)
.ordering(ParallelConsumerOptions.ProcessingOrder.KEY) // ← KEY ordering
.maxConcurrency(100) // up to 100 keys processed in parallel
.commitMode(ParallelConsumerOptions.CommitMode.PERIODIC_CONSUMER_ASYNCHRONOUS)
.build();

var parallelConsumer = new ParallelStreamProcessor<>(options);

parallelConsumer.poll(record -> {
// This lambda runs in parallel for DIFFERENT keys
// but SEQUENTIALLY for the SAME key
processPayment(record.key(), record.value());
});
The parallel-consumer library handles the hardest part β€” safe, out-of-order offset committing β€” using an internal bitmap to track which offsets have completed, only committing the highest contiguous offset.

Pattern 2: Batch Processing​

If your bottleneck is I/O (database writes, API calls), batching can deliver 10–100x improvement without any threading complexity.

Without batching:
[m0] β†’ INSERT β†’ [m1] β†’ INSERT β†’ [m2] β†’ INSERT (3 round-trips)

With batching:
[m0, m1, m2] β†’ BULK INSERT (1 round-trip)
@KafkaListener(topics = "orders", batch = "true")
public void processBatch(List<ConsumerRecord<String, OrderEvent>> records) {

// Group by key to preserve per-key ordering within the batch
Map<String, List<OrderEvent>> byKey = records.stream()
.collect(Collectors.groupingBy(
ConsumerRecord::key,
LinkedHashMap::new, // preserve insertion order
Collectors.mapping(ConsumerRecord::value, Collectors.toList())
));

// Process each key's events in order, but different keys can be parallelized
byKey.forEach((key, events) -> {
List<OrderEntity> entities = events.stream()
.map(this::toEntity)
.collect(Collectors.toList());
orderRepository.saveAll(entities); // single bulk DB call per key
});
}
ApproachThroughputOrderingComplexity
Individual processingLowβœ… GuaranteedLow
Batch processingHighβœ… Guaranteed (with grouping)Medium
Batch without groupingHighest⚠️ Only within batchLow

Pattern 3: Router Topic (Fan-Out)​

If a single topic has extremely high throughput, write a lightweight "router" consumer that splits messages into smaller sub-topics by key:

Main Topic (high volume):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ [A][B][C][A][D][B][A][C][D][A][B][C]... β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Router Consumerβ”‚ ← ultra-fast, no business logic
β”‚ (fan-out) β”‚ just re-produces by key
β””β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”˜
β”‚ β”‚ β”‚
β”Œβ”€β”€β”€β”€β–Όβ”β”Œβ”€β”€β–Όβ”€β”€β”β”Œβ–Όβ”€β”€β”€β”€β”
β”‚Sub-Aβ”‚β”‚Sub-Bβ”‚β”‚Sub-Cβ”‚ ← smaller sub-topics
β””β”€β”€β”¬β”€β”€β”˜β””β”€β”€β”¬β”€β”€β”˜β””β”€β”€β”¬β”€β”€β”˜
β”Œβ”€β”€β–Όβ”€β”€β”β”Œβ”€β”€β–Όβ”€β”€β”β”Œβ”€β”€β–Όβ”€β”€β”
β”‚ C1 β”‚β”‚ C2 β”‚β”‚ C3 β”‚ ← independent consumers
β””β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”˜ with their own groups
@KafkaListener(topics = "main-events", groupId = "router")
public void route(ConsumerRecord<String, Event> record) {
// Route to sub-topic based on key hash
int bucket = Math.abs(record.key().hashCode()) % NUM_SUB_TOPICS;
String subTopic = "events-sub-" + bucket;
kafkaTemplate.send(subTopic, record.key(), record.value());
}
This adds operational complexity (more topics to manage) and an extra hop of latency. Use it only when you've exhausted simpler options.

Pattern 4: Partition-Level Pause/Resume​

If one partition is "stuck" on a slow message, you can pause it and continue processing other partitions:

@Override
public void onMessage(ConsumerRecord<String, Event> record,
Consumer<?, ?> consumer) {
try {
processEvent(record);
} catch (TemporaryFailureException e) {
// Pause this partition β€” other partitions continue normally
TopicPartition tp = new TopicPartition(record.topic(), record.partition());
consumer.pause(Collections.singleton(tp));

// Schedule a retry
scheduler.schedule(() -> {
consumer.resume(Collections.singleton(tp));
}, 30, TimeUnit.SECONDS);
}
}

This preserves ordering within the paused partition (nothing skips ahead) while allowing the rest of the consumer group to continue making progress.


⚠️ The Offset Committing Challenge​

When processing messages out of order (parallel consumer, async workers), offset management becomes the hardest problem.

Why You Can't Just Commit the Latest Offset​

Fetched: [offset 1] [offset 2] [offset 3] [offset 4] [offset 5]
β”‚ β”‚ β”‚ β”‚ β”‚
Thread A Thread B Thread A Thread B Thread A
β”‚ β”‚ β”‚ β”‚ β”‚
DONE ⏳ SLOW DONE DONE DONE

If you commit offset 5 now:
β†’ App crashes
β†’ Kafka resumes from offset 5
β†’ Offset 2 is LOST forever ❌

The High-Water Mark Solution​

Only commit the highest contiguous completed offset:

Completion bitmap: [1=βœ…] [2=⏳] [3=βœ…] [4=βœ…] [5=βœ…]
β–²
Still processing!

High-water mark = 1 ← only safe to commit offset 1

Later: [1=βœ…] [2=βœ…] [3=βœ…] [4=βœ…] [5=βœ…]
High-water mark = 5 ← now safe to commit offset 5

The Confluent Parallel Consumer handles this automatically using an internal offset bitmap.


βœ… Best Practices Summary​

ScenarioRecommended PatternWhy
Slow DB writes, ordering requiredBatch ProcessingSimple, no threading, huge I/O improvement
Need 10–100x throughput, ordering per keyParallel Consumer (Key-level)Best throughput/ordering tradeoff
Extreme volume, single fat topicRouter Topic (Fan-Out)Distributes load to independent consumer groups
One bad message blocks processingPartition Pause/ResumeIsolates slow partition without losing ordering
No ordering requirement at allAdd more partitions + consumersSimplest scaling path

πŸ›‘οΈ Always Add Idempotency​

Regardless of which pattern you choose, your downstream systems must be idempotent. Crashes, retries, and rebalances can cause duplicate processing:

// βœ… Idempotent DB write β€” uses unique constraint on eventId
@Transactional
public void processPayment(PaymentEvent event) {
if (paymentRepository.existsByEventId(event.getEventId())) {
log.info("Duplicate event {}, skipping", event.getEventId());
return;
}
paymentRepository.save(toEntity(event));
}

Interview Questions β€” Processing & Ordering​

Q: How do you increase Kafka consumer throughput without breaking ordering?

Use Key-Level Parallel Consumer: a single poller thread fetches messages, then routes them to worker threads by message key. Messages with the same key are processed sequentially by the same thread, while different keys are processed in parallel. Libraries like Confluent's parallel-consumer handle this pattern and the complex offset tracking.

Q: What is the High-Water Mark problem in parallel consumption?

When messages are processed out of order by worker threads, you cannot commit the latest completed offset β€” earlier offsets may still be in-progress. If you commit prematurely and crash, those in-progress messages are lost. The solution is to track completion with a bitmap and only commit the highest offset where all preceding offsets are complete.

Q: When would you use batch processing over parallel consumer?

Batch processing is simpler and preferred when the bottleneck is I/O (e.g., many small DB inserts that can be replaced with bulk inserts). It avoids threading complexity entirely. Use parallel consumer when the bottleneck is CPU-bound processing or when individual messages require independent async I/O.

Q: What is the Router Topic (Fan-Out) pattern?

A fast, stateless consumer reads a high-volume topic and immediately re-produces each message to one of several smaller sub-topics based on the message key. This allows independent consumer groups to process sub-topics in parallel while preserving per-key ordering within each sub-topic. The tradeoff is added operational complexity and one extra hop of latency.

Q: Why is idempotency critical when using parallel or async processing?

With parallel processing, crashes during a rebalance or between processing and committing can cause messages to be replayed. Without idempotent downstream systems, this leads to duplicate state (double charges, duplicate records). Always use unique constraints, upserts, or deduplication checks.