Skip to main content

Kafka Broker โ€” Complete Guide

:::info Who this guide is for


What is Kafka?โ€‹

Apache Kafka is a distributed event streaming platform โ€” a persistent, ordered, replayable log of events that producers write to and consumers read from. Unlike traditional message queues that delete messages after delivery, Kafka retains messages for a configurable period, allowing any consumer to re-read the history.

Traditional message queue (RabbitMQ, ActiveMQ):
Producer โ†’ Queue โ†’ Consumer
โ†‘ deleted after delivery โ€” no replay

Kafka:
Producer โ†’ Topic (log) โ†’ Consumer A
โ†’ Consumer B (independent offset)
โ†’ Consumer C (replays from beginning)
โ†‘ retained for days/weeks โ€” any consumer, any time

When to use Kafkaโ€‹

Use caseWhy Kafka fits
Event streamingImmutable, ordered log of events (orders, clicks, transactions)
Async microservice decouplingProducer and consumer don't need to be online simultaneously
Change data capture (CDC)Database changes streamed as events to other systems
Log aggregationCentralise logs from many services; multiple consumers (Elasticsearch, S3)
Real-time analyticsFlink/Spark Streaming reads from Kafka topics
Event sourcingRebuild state by replaying all events from the beginning

When Kafka is overkillโ€‹

ScenarioBetter choice
Simple task queue (fire and forget)RabbitMQ, SQS
RPC-style request-responseREST, gRPC
Very low message volume (< 1,000/day)Database table with polling
Needs per-message acknowledgementRabbitMQ with manual ACK

What is a Broker?โ€‹

A Kafka broker is a single Kafka server process. It is the fundamental storage and networking unit of Kafka. Brokers:

  • Receive messages from producers and persist them to disk
  • Serve messages to consumers on request
  • Replicate data to/from other brokers for fault tolerance
  • Participate in cluster coordination (leader election, metadata)

A Kafka cluster is a group of brokers working together. Each broker has a unique integer ID (broker.id). Producers and consumers connect to the cluster through any broker โ€” they all share metadata about who owns what data.

Kafka Cluster (3 brokers)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Broker 1 โ”‚ โ”‚ Broker 2 โ”‚ โ”‚ Broker 3 โ”‚
โ”‚ ID: 1 โ”‚ โ”‚ ID: 2 โ”‚ โ”‚ ID: 3 โ”‚
โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ Leader: orders-P0 โ”‚ โ”‚ Leader: orders-P1 โ”‚ โ”‚ Leader: orders-P2 โ”‚
โ”‚ Leader: users-P1 โ”‚ โ”‚ Leader: users-P0 โ”‚ โ”‚ Leader: users-P2 โ”‚
โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ Follower: orders-P1 โ”‚ โ”‚ Follower: orders-P0 โ”‚ โ”‚ Follower: orders-P0 โ”‚
โ”‚ Follower: orders-P2 โ”‚ โ”‚ Follower: orders-P2 โ”‚ โ”‚ Follower: orders-P1 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ†‘ each broker is both a leader and a follower for different partitions

Broker responsibilitiesโ€‹

ResponsibilityWhat it does
Message storageAppends incoming messages to segment files on disk โ€” sequential writes
Partition leadershipHandles all producer writes and consumer reads for its leader partitions
ReplicationAs a follower, fetches data from the leader to stay in sync
Consumer offset trackingStores committed consumer offsets in the __consumer_offsets internal topic
Cluster metadataRegisters itself and reports partition state to the cluster controller
Client request handlingProcesses Produce, Fetch, Metadata, and Admin API requests from clients

Core Kafka Conceptsโ€‹

Before going deeper into brokers, these terms must be clear:

Topics and partitionsโ€‹

Topic: "orders" (replication factor = 3, partitions = 3)

Partition 0 (leader on Broker 1):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ msg0 โ”‚ msg1 โ”‚ msg2 โ”‚ msg3 โ”‚ msg4 โ”‚ โ”€โ”€โ–บ append only, offset 0โ†’4
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Partition 1 (leader on Broker 2):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ msg0 โ”‚ msg1 โ”‚ msg2 โ”‚ โ”€โ”€โ–บ independent offset sequence
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Partition 2 (leader on Broker 3):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ msg0 โ”‚ msg1 โ”‚ msg2 โ”‚ msg3 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜
ConceptDescription
TopicA named, ordered log. Like a table name, but for event streams.
PartitionA topic is split into N partitions โ€” each is an independent ordered log. Partitions enable parallelism.
OffsetA monotonically increasing integer per partition. Every message has a unique offset within its partition.
Message keyOptional. Messages with the same key always go to the same partition (consistent hashing).
Replication factorHow many copies of each partition exist across brokers. RF=3 โ†’ 3 copies.

Producer routingโ€‹

# No key โ†’ round-robin across partitions (or sticky partitioning)
producer.send("orders", value="order_data")

# With key โ†’ consistent hashing to the same partition always
# partition = hash(key) % num_partitions
producer.send("orders", key="user-42", value="order_data")
# user-42 always โ†’ Partition 1 regardless of which broker you send to

Consumer groupsโ€‹

Consumer Group "fulfillment":
Consumer A reads Partition 0
Consumer B reads Partition 1
Consumer C reads Partition 2
โ†’ Each partition consumed by exactly one consumer in the group
โ†’ 3 partitions = max 3 consumers in parallel

Consumer Group "analytics" (independent):
Consumer D reads Partition 0, 1, 2 independently
โ†’ Different offset, doesn't interfere with "fulfillment" group

Broker Storage Layoutโ€‹

Kafka stores data as segment files โ€” append-only log files on disk. This is one of the key reasons Kafka is fast: sequential disk writes are orders of magnitude faster than random writes.

/var/kafka/logs/
orders-0/ โ† Topic "orders", Partition 0
00000000000000000000.log โ† Segment 1: messages at offsets 0 โ†’ N
00000000000000000000.index โ† Offset index for Segment 1
00000000000000000000.timeindex โ† Timestamp index for Segment 1
00000000000001048576.log โ† Segment 2: messages at offsets 1048576 โ†’ M
00000000000001048576.index โ† Offset index for Segment 2
00000000000001048576.timeindex
leader-epoch-checkpoint โ† Leader epoch history
orders-1/
00000000000000000000.log
...
__consumer_offsets-0/ โ† Internal topic โ€” stores consumer positions
...

Log files explainedโ€‹

Each segment consists of three files:

FilePurposeHow Kafka uses it
.logThe actual message data (binary, batch-compressed)Sequential append on produce; sequential read on fetch
.indexSparse offset โ†’ byte offset mappingBinary search to find the byte position of a given offset
.timeindexTimestamp โ†’ offset mappingFind messages by timestamp for --from-time consumers

Finding a message by offset โ€” why it's O(log n)โ€‹

Consumer requests: offset 1,048,640

Step 1: Identify the correct segment file
Files: 000...0000.log (offsets 0โ€“1M), 000...1048576.log (offsets 1M+)
โ†’ The 1,048,640 is in the second segment

Step 2: Binary search the .index file
.index contains: [offset 1048576 โ†’ byte 0], [offset 1048656 โ†’ byte 4096], ...
Binary search finds: 1,048,640 is between entries โ†’ start at byte 0

Step 3: Scan forward in the .log file
Read from byte 0 โ†’ scan until offset 1,048,640 found (at most a few KB)

Total: O(log segments) + O(log index entries) + O(small linear scan)

Segment rollingโ€‹

When a segment grows to log.segment.bytes (default 1 GB) or ages past log.roll.hours (default 168h), Kafka rolls to a new segment file. Only the newest segment is "active" (being written to). Older segments are candidates for deletion or compaction.


Replication & ISRโ€‹

Replication is what makes Kafka fault-tolerant. Every partition has one leader and zero or more followers (replicas on other brokers).

How replication worksโ€‹

Topic "orders", Partition 0 โ€” Replication Factor 3

Broker 1 (Leader P0):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ offset: 0 1 2 3 4 โ† HEAD โ”‚ โ† Producer writes here
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ†“ fetch (followers pull from leader)
Broker 2 (Follower P0):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ offset: 0 1 2 3 4 โ”‚ โ† in sync (ISR)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ†“
Broker 3 (Follower P0):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ offset: 0 1 2 โ”‚ โ† lagging โ€” may leave ISR
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Followers continuously fetch from the leader (just like a consumer). The leader tracks how far each follower has replicated.

In-Sync Replicas (ISR)โ€‹

The ISR (In-Sync Replica set) is the list of replicas that are caught up with the leader within replica.lag.time.max.ms (default 30 seconds). Only ISR members are eligible to become the new leader in a failover.

ISR tracking:
Leader knows Broker 2 last fetched offset 4 โ†’ in ISR โœ…
Leader knows Broker 3 last fetched offset 2 โ†’ 30s ago โ†’ removed from ISR โŒ

ISR = [Broker 1 (leader), Broker 2]

Producer acknowledgement levels โ€” the durability contractโ€‹

The acks setting on the producer controls when Kafka considers a write successful:

Producer sends message โ†’ Leader writes to disk
โ†’ Waits for ALL ISR members to acknowledge
โ†’ Only then responds "success" to producer

Broker 1 (Leader): โœ… written
Broker 2 (Follower): โœ… fetched and written
โ†’ Leader responds: "acknowledged"

If leader crashes immediately after:
Broker 2 has the message โ†’ elected as new leader โ†’ no data loss โœ…

Risk: none for data loss (given sufficient ISR)
Cost: highest latency (waits for slowest replica)

min.insync.replicas โ€” the safety floorโ€‹

min.insync.replicas (default 1) defines the minimum number of ISR replicas that must acknowledge a write when acks=all. If the ISR shrinks below this threshold, the broker rejects new produces with NotEnoughReplicasException.

Setup: replication.factor=3, min.insync.replicas=2, acks=all

Scenario A โ€” 3 brokers healthy:
ISR = [Broker1, Broker2, Broker3]
ISR size (3) >= min.insync.replicas (2) โ†’ writes succeed โœ…

Scenario B โ€” Broker3 falls behind / crashes:
ISR = [Broker1, Broker2]
ISR size (2) >= min.insync.replicas (2) โ†’ writes still succeed โœ…

Scenario C โ€” Broker2 also crashes:
ISR = [Broker1]
ISR size (1) < min.insync.replicas (2) โ†’ writes REJECTED โŒ
โ†’ Cluster sacrifices availability for consistency (no data loss)

The "safe" production formula:
replication.factor = 3
min.insync.replicas = 2 (RF - 1)
acks = all
โ†’ Can tolerate 1 broker failure with zero data loss

Leader election on broker failureโ€‹

Normal state:
Partition 0 leader = Broker 1 (in ISR)
ISR = [Broker1, Broker2, Broker3]

Broker 1 crashes:
ZooKeeper/KRaft detects session expiry โ†’ notifies Controller
Controller selects new leader from ISR (e.g. Broker 2)
Controller updates partition metadata
Brokers and clients receive metadata update

Recovery:
Broker 2 becomes leader at offset 4 (the last committed offset)
Producers retry and reconnect to new leader
Consumers receive metadata update and resume from last committed offset

Duration: typically 5โ€“30 seconds (depends on session timeout)

If Broker 1 was the ONLY ISR member:
โ†’ No eligible leader โ†’ partition goes OFFLINE
โ†’ Reads and writes to this partition fail until Broker 1 recovers
โ†’ Alternatively: enable unclean.leader.election.enable=true to elect a
lagging follower (possible data loss)

Controller Brokerโ€‹

Among all brokers, one is elected as the Controller. It is responsible for all cluster-wide administrative decisions:

ResponsibilityDetail
Leader electionWhen a broker fails, elects new partition leaders from the ISR
Broker join/leaveProcesses broker registration and deregistration events
Partition reassignmentMoves partition leadership when you add/remove brokers
Topic managementCreates and deletes topics (partition + replica assignment)
ISR managementPropagates ISR change notifications to all brokers
Cluster with 3 brokers โ€” Broker 1 is the Controller:

Broker 2 crashes
โ”‚
โ–ผ
ZooKeeper/KRaft detects โ†’ notifies Controller (Broker 1)
โ”‚
โ–ผ
Controller:
1. Identifies partitions where Broker 2 was leader
2. For each: selects next ISR member as new leader
3. Sends LeaderAndIsr request to affected brokers
4. Updates cluster metadata
โ”‚
โ–ผ
Brokers and clients refresh metadata โ†’ resume operations

There is always exactly one active Controller. ActiveControllerCount JMX metric must always equal 1 โ€” if it is 0, the cluster cannot manage itself; if it is 2, there is a split-brain condition.


KRaft vs ZooKeeperโ€‹

Historically, Kafka used Apache ZooKeeper as its external coordination service for storing cluster metadata and running Controller elections. Since Kafka 3.3, KRaft (Kafka Raft) replaces ZooKeeper entirely.

The ZooKeeper architecture problemโ€‹

ZooKeeper-based Kafka (legacy):

Producer/Consumer โ†’ Kafka Brokers โ†’ ZooKeeper (separate cluster)
โ†‘
Stores: broker list, topic config,
ISR, controller election
consumer offsets (old)

Problems:
โœ— Two separate systems to deploy, monitor, and upgrade
โœ— ZooKeeper becomes a bottleneck with 100,000+ partitions
โœ— Controller failover requires ZooKeeper session expiry (slow)
โœ— Metadata stored in ZooKeeper limited scalability
โœ— Operational complexity: ZooKeeper quorum, znodes, ACLs

KRaft โ€” Kafka without ZooKeeperโ€‹

KRaft-based Kafka (modern):

Producer/Consumer โ†’ Kafka Brokers (also store their own metadata)
โ†‘
Internal Raft log manages:
broker registration, topic config,
ISR, leader election, partition assignment

Benefits:
โœ… One system instead of two
โœ… Supports millions of partitions (10ร— more than ZooKeeper mode)
โœ… Faster controller failover (milliseconds vs seconds)
โœ… Simpler operations and deployment
โœ… Metadata is stored as a Kafka topic (__cluster_metadata)

KRaft deployment modesโ€‹

# server.properties โ€” broker also acts as controller
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@broker1:9093,2@broker2:9093,3@broker3:9093

listeners=PLAINTEXT://:9092,CONTROLLER://:9093
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT

Each node is both a broker (handles client requests) and a controller (manages metadata via Raft). Simpler to run but controller and broker contend for resources.

ZooKeeper modeKRaft mode
External dependencyZooKeeper cluster requiredNone โ€” self-contained
Max partitions~200,000 practical limitMillions
Controller failover15โ€“30 secondsMilliseconds
Metadata storageZooKeeper znodes__cluster_metadata topic (Raft log)
Supported sinceKafka 0.xKafka 2.8 (preview), 3.3 (production)
StatusDeprecated โ€” removed in Kafka 4.0Default from Kafka 3.3+

:::warning ZooKeeper removed in Kafka 4.0 If you are running Kafka 3.x with ZooKeeper, migrate to KRaft before upgrading to 4.0. Use kafka-storage.sh to format the metadata log in KRaft mode. :::


Log Retention and Compactionโ€‹

Every partition's log is managed by one of two cleanup policies:

Delete policy (default)โ€‹

Segments older than log.retention.hours or larger than log.retention.bytes are deleted:

log.retention.hours=168 โ† delete segments older than 7 days
log.retention.bytes=10737418240 โ† delete old segments if partition > 10 GB

When both are set, whichever limit is hit first triggers deletion. Segments are deleted as whole units โ€” Kafka never deletes individual messages.

Segment timeline for orders-0:
[Seg 1: Jan 1โ€“3] [Seg 2: Jan 3โ€“5] [Seg 3: Jan 5โ€“7] [Seg 4: Active]
โ†‘ Today = Jan 8
7-day retention: Seg 1 is > 7 days old โ†’ deleted
Segs 2, 3, 4 kept

Compact policy โ€” the changelog patternโ€‹

cleanup.policy=compact keeps only the latest value per message key, permanently โ€” no time-based deletion.

Topic "user-profiles" (compacted):
offset 0: key=user-42, value={"name":"Alice","city":"HCM"}
offset 1: key=user-99, value={"name":"Bob","city":"HN"}
offset 2: key=user-42, value={"name":"Alice","city":"Hanoi"} โ† update
offset 3: key=user-42, value=null โ† tombstone (delete)

After compaction:
offset 1: key=user-99, value={"name":"Bob","city":"HN"} โ† kept
offset 3: key=user-42, value=null โ† kept temporarily (tombstone)
(offset 0 and 2 are superseded โ†’ deleted)

Tombstone messages (null value with a key) signal deletion. The compaction process removes the tombstone after delete.retention.ms (default 24h), allowing downstream consumers time to process the delete.

When to use each policyโ€‹

PolicyUse forExample topics
DeleteTime-series events, logs, analytics โ€” historical data ages outclickstream, application-logs, order-events
CompactLatest state per entity โ€” changelog, cache invalidation, event sourcing stateuser-profiles, product-inventory, feature-flags
Compact + DeleteKeep latest state but also enforce max agesession-state (compact, but expire after 30 days)

Bootstrap Servers and Client Discoveryโ€‹

bootstrap.servers is the entry point for all Kafka clients. You only need to list 2โ€“3 brokers โ€” the client fetches the full cluster metadata from them:

bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
Client startup sequence:
1. Connect to any bootstrap broker (e.g. broker1:9092)
2. Send Metadata request: "what brokers and partitions exist?"
3. Receive: full cluster topology (all brokers, all topics, partitionโ†’broker mapping)
4. Cache metadata locally
5. Connect directly to the correct leader broker for each partition
6. On error (broker down) โ†’ refresh metadata โ†’ reconnect to new leader

:::tip List only 2โ€“3 bootstrap servers You don't need to list every broker. List 2โ€“3 for redundancy in case one is down during startup. The client discovers the rest automatically from the metadata response. :::


Key Broker Configurationsโ€‹

# โ”€โ”€ Identity โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
broker.id=1 # Unique integer per broker in the cluster
node.id=1 # KRaft mode uses this (same as broker.id)

# โ”€โ”€ Storage โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
log.dirs=/var/kafka/logs,/data/kafka # Multiple dirs = JBOD (striped across disks)
log.segment.bytes=1073741824 # Roll to a new segment at 1 GB
log.roll.hours=168 # Or roll after 7 days even if < 1 GB

# โ”€โ”€ Retention โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
log.retention.hours=168 # Delete segments older than 7 days
log.retention.bytes=-1 # -1 = unlimited size-based retention
log.cleanup.policy=delete # or compact, or delete,compact

# โ”€โ”€ Replication โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
default.replication.factor=3 # New topics get RF=3 by default
min.insync.replicas=2 # Require 2 ISR acks for acks=all
replica.lag.time.max.ms=30000 # Remove replica from ISR after 30s lag
unclean.leader.election.enable=false # Never elect an out-of-sync replica as leader

# โ”€โ”€ Performance โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
num.network.threads=8 # Threads handling network I/O
num.io.threads=16 # Threads executing disk I/O requests
num.replica.fetchers=4 # Threads fetching from leader (follower side)
socket.send.buffer.bytes=102400 # OS TCP send buffer
socket.receive.buffer.bytes=102400 # OS TCP receive buffer
socket.request.max.bytes=104857600 # Max request size (100 MB)

# โ”€โ”€ Producer limits โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
message.max.bytes=1048576 # Max single message size (1 MB default)

# โ”€โ”€ Topic defaults โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
num.partitions=3 # Default partitions for auto-created topics
auto.create.topics.enable=false # Disable in production โ€” create topics explicitly

# โ”€โ”€ Compression โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
compression.type=producer # Accept whatever compression the producer chose

Broker Internalsโ€‹

๐Ÿ”ฌ Senior deep-dive: why Kafka is so fast โ€” page cache and zero-copy

Kafka achieves high throughput through three OS-level optimisations that most message brokers don't exploit:

Sequential I/Oโ€‹

All writes to a partition are append-only sequential writes to the end of the active segment file. Sequential disk I/O is 50โ€“100ร— faster than random I/O because:

Random I/O (traditional DB):
Write record A โ†’ seek to address 0x4A2F โ†’ write 200 bytes
Write record B โ†’ seek to address 0x9C10 โ†’ write 200 bytes
โ†’ Disk head physically moves (HDD) or wear-levels across blocks (SSD)

Sequential I/O (Kafka):
Write record A โ†’ append to end of file
Write record B โ†’ append right after A (already positioned)
โ†’ No seeking โ€” 100% sequential, saturates disk bandwidth

OS Page Cacheโ€‹

Kafka does not manage its own memory buffer for recent writes. Instead, it writes to the OS page cache (kernel memory that maps file contents into RAM). The OS flushes to disk asynchronously:

Producer sends message:
1. Kafka writes to page cache (RAM) โ†’ returns ACK to producer
2. OS flush daemon writes page cache โ†’ disk (async, every ~30s or when memory pressure)

Consumer reads message:
1. Kafka checks: is this page still in cache? โ†’ YES (for recent messages)
2. Returns from RAM โ†’ no disk I/O at all โœ…

Result: recent messages are served from RAM at memory bandwidth speed,
not disk bandwidth speed.

This is why Kafka recommends giving the OS most of the machine RAM rather than configuring a large JVM heap โ€” the OS page cache IS Kafka's buffer.

Recommended JVM heap: 6 GB (Kafka broker doesn't need much)
Remaining RAM: 26 GB on a 32 GB machine
โ†’ OS page cache holds ~26 GB of recent Kafka data in memory
โ†’ Most reads are served from RAM, not disk

Zero-Copy (sendfile)โ€‹

When a consumer fetches messages, naive transfer is:

Naive (4 copies):
Disk โ†’ Kernel buffer โ†’ User space (Kafka JVM) โ†’ Kernel socket buffer โ†’ Network
โ†‘ copy 1 โ†‘ copy 2 โ†‘ copy 3+4

Zero-copy (sendfile syscall, 2 copies):
Disk โ†’ Kernel buffer โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ Network
โ†‘ OS handles directly via DMA โ€” JVM never touches the data

Kafka uses Java's FileChannel.transferTo() which calls the OS sendfile() syscall. This reduces copies from 4 to 2 and eliminates context switches between user space and kernel space. For read-heavy workloads, this is the primary throughput driver.

๐Ÿ”ฌ Senior deep-dive: batch compression pipeline

Kafka compresses at the batch level, not per-message. The producer accumulates a batch of messages and compresses the entire batch:

Producer batch (before compression):
[msg1: 500 bytes] [msg2: 500 bytes] ... [msg100: 500 bytes] = 50 KB

After LZ4 compression:
[compressed batch: 8 KB] โ† 83% reduction for structured/repeated data

Single batch write to broker:
8 KB written (one sequential write)

Consumer receives batch:
8 KB transferred over network
Consumer decompresses locally

Result: network and disk I/O 83% lower than uncompressed

Compression algorithm comparison:

AlgorithmCompression ratioSpeedCPU costBest for
none1ร—FastestNoneAlready-compressed data (JPEG, video)
lz42โ€“4ร—Very fastLowDefault choice โ€” best speed/ratio tradeoff
snappy2โ€“3ร—FastLowGood alternative to LZ4
gzip3โ€“6ร—SlowHighArchival topics with low throughput
zstd3โ€“7ร—FastMediumBest ratio with acceptable speed โ€” modern choice

Set compression.type=zstd on the producer for best overall efficiency in 2024+.

๐Ÿ”ฌ Senior deep-dive: exactly-once semantics (EOS)

Kafka's default delivery guarantee is at-least-once โ€” retries can cause duplicates. Exactly-once semantics (EOS) guarantees each message is processed precisely once end-to-end.

Idempotent producerโ€‹

# Producer config
enable.idempotence=true
acks=all # required for idempotence
max.in.flight.requests.per.connection=5 # max 5 for idempotent

# Broker assigns each producer a PID (Producer ID) and tracks sequence numbers.
# Duplicate sends (retries) are detected and deduplicated by the broker.
Producer sends batch with sequence=42:
Broker receives and stores it โ†’ responds ACK
Network drops โ†’ producer doesn't get ACK โ†’ retries
Broker receives again with sequence=42 โ†’ "I already have this" โ†’ drops duplicate โœ…

Transactional producer (cross-partition atomicity)โ€‹

producer.initTransactions();

try {
producer.beginTransaction();
producer.send(new ProducerRecord<>("orders", "order-data"));
producer.send(new ProducerRecord<>("inventory", "reserve-item"));
producer.send(new ProducerRecord<>("payments", "charge-card"));
// All three writes are atomic โ€” either all succeed or all are rolled back
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction(); // rolls back all three writes
}
# Consumer must also set isolation level for EOS reads
isolation.level=read_committed # only see committed transaction data
# (default is read_uncommitted โ€” sees in-progress transaction data too)

EOS cost: ~20โ€“30% throughput reduction due to transaction coordination overhead. Use only when you genuinely need exactly-once โ€” most use cases are fine with at-least-once + idempotent consumer logic.


Performance Tuningโ€‹

Broker-side tuningโ€‹

# โ”€โ”€ Throughput optimisation โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
num.network.threads=8 # = number of CPU cores (handles network I/O)
num.io.threads=16 # = 2ร— CPU cores (handles disk I/O)
num.replica.fetchers=4 # More = faster replication catch-up after restarts

# โ”€โ”€ Network buffers โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
socket.send.buffer.bytes=1048576 # 1 MB โ€” increase for high throughput
socket.receive.buffer.bytes=1048576 # 1 MB
socket.request.max.bytes=104857600 # 100 MB โ€” max request size

# โ”€โ”€ OS-level (not in server.properties) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# Set in /etc/sysctl.conf:
# net.core.rmem_max = 134217728 # 128 MB TCP receive buffer
# net.core.wmem_max = 134217728 # 128 MB TCP send buffer
# vm.swappiness = 1 # Never swap โ€” protect page cache

JVM tuningโ€‹

# KAFKA_HEAP_OPTS โ€” set in kafka-env.sh or systemd unit
KAFKA_HEAP_OPTS="-Xms6g -Xmx6g" # 6 GB fixed heap (no resize overhead)

# G1GC is the recommended collector for Kafka brokers
KAFKA_JVM_PERFORMANCE_OPTS="
-server
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20 # Target max GC pause: 20ms
-XX:InitiatingHeapOccupancyPercent=35
-XX:+ExplicitGCInvokesConcurrent
-Djava.awt.headless=true"

Disk and filesystem recommendationsโ€‹

โœ… Use dedicated disks for Kafka data (separate from OS, application logs)
โœ… XFS filesystem โ€” better performance for Kafka's large sequential writes
โœ… Mount with noatime โ€” don't update access time on every read
โœ… For multi-disk setups: use log.dirs with one path per disk (JBOD)
โ†’ Kafka stripes partitions across disks automatically

log.dirs=/disk1/kafka,/disk2/kafka,/disk3/kafka

โœ… RAID: avoid RAID5/6 (write penalty) โ€” use RAID10 or JBOD
โœ… SSD: use for controller nodes in KRaft mode (metadata I/O)

Partition count guidelinesโ€‹

Partitions per topic = max(throughput / per-partition-throughput, consumer-parallelism)

Rule of thumb:
Per-partition throughput: ~10 MB/s produce, ~50 MB/s consume
Target topic throughput: 100 MB/s
โ†’ Minimum partitions = 100 / 10 = 10 partitions

Consumer parallelism: 10 consumers in the group
โ†’ At least 10 partitions (each consumer handles one)

Choose: max(10, 10) = 10 partitions

Caution:
โœ— Over-partitioning (10,000+ partitions per broker) increases:
- Leader election time on restart
- Memory usage (open file descriptors per segment)
- End-to-end latency (more files to fsync on flush)

General guideline:
< 100 MB/s throughput: start with 3โ€“6 partitions
100 MB/s โ€“ 1 GB/s: 10โ€“50 partitions
> 1 GB/s: profile carefully; consider multiple topics

Monitoring & Observabilityโ€‹

Critical JMX metricsโ€‹

MetricHealthy valueAlert conditionWhat it means
UnderReplicatedPartitions0> 0Some partitions have fewer replicas than RF โ€” data at risk
ActiveControllerCount1โ‰  1No controller or split-brain โ€” cluster cannot manage itself
OfflinePartitionsCount0> 0Partitions with no leader โ€” producers and consumers failing
LeaderCountBalanced across brokersImbalancedUneven load โ€” one broker handling all traffic
BytesInPerSecPer capacity planApproaching disk bandwidthDisk saturation
BytesOutPerSecPer capacity planApproaching NIC bandwidthNetwork saturation
RequestHandlerAvgIdlePercent> 50%< 30%Broker CPU saturated โ€” add brokers or reduce load
ProducerRequestQueueTimeMs< 10ms> 100msNetwork thread backlog โ€” increase num.network.threads
FetchConsumerRequestQueueTimeMs< 10ms> 100msConsumer fetch backlog
ISRShrinkRateNear 0RisingReplicas frequently leaving ISR โ€” check broker GC, disk I/O

Consumer lag monitoringโ€‹

Consumer lag is the most operationally important Kafka metric โ€” it tells you how far behind consumers are from the latest messages:

# Check consumer group lag via CLI
kafka-consumer-groups.sh \
--bootstrap-server broker1:9092 \
--describe --group fulfillment-service

GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
fulfillment-service orders 0 1048600 1048610 10 โœ…
fulfillment-service orders 1 524200 524500 300 โš ๏ธ
fulfillment-service orders 2 786000 786000 0 โœ…
# Burrow (LinkedIn) or Kafka Exporter โ€” expose lag as Prometheus metrics
# Alert when lag exceeds SLA threshold
- alert: KafkaConsumerLagHigh
expr: kafka_consumer_group_lag > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.group }} lag > 10,000 on {{ $labels.topic }}"

Prometheus + Grafana setupโ€‹

# docker-compose.yml โ€” Kafka with JMX exporter
services:
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_JMX_PORT: 9999
EXTRA_ARGS: "-javaagent:/opt/jmx-exporter.jar=9998:/etc/jmx-exporter.yml"

kafka-exporter:
image: danielqsj/kafka-exporter:latest
command:
- "--kafka.server=kafka:9092"
- "--topic.filter=.*"
- "--group.filter=.*"
ports:
- "9308:9308"
# Key PromQL queries for a Kafka Grafana dashboard

# Under-replicated partitions โ€” must be 0
kafka_server_replicamanager_underreplicatedpartitions

# Consumer lag per group/topic/partition
kafka_consumer_group_lag{group="fulfillment-service", topic="orders"}

# Broker throughput in MB/s
rate(kafka_server_brokertopicmetrics_bytesin_total[1m]) / 1024 / 1024

# Request handler idle % โ€” alert if < 30%
kafka_server_requesthandlerpool_idlepercent

Common Failure Scenariosโ€‹

ScenarioSymptomsRoot causeFix
Broker OOM crashOfflinePartitionsCount spikes, consumers fall behindJVM heap too large (eats page cache)Reduce heap to 6 GB; give remaining RAM to OS
ISR constantly shrinkingISRShrinkRate rising, UnderReplicatedPartitions > 0Slow disk I/O or GC pauses on followerCheck iostat, GC logs; add disks or upgrade instance
Consumer lag growingLag increases indefinitelyConsumer processing too slowScale consumer group (add instances up to partition count)
Unclean leader electionPossible data loss, offsets jumpunclean.leader.election.enable=true and leader lostSet unclean.leader.election.enable=false in production
Rebalance stormConsumer group repeatedly rebalancingSession timeout too short or slow poll()Increase session.timeout.ms; fix slow processing
Large message rejectionRecordTooLargeExceptionMessage exceeds message.max.bytesIncrease limit or split large messages
Disk fullBroker crashes, OfflinePartitionsCountRetention policy too looseReduce log.retention.hours or add storage
Split brain (ZooKeeper)ActiveControllerCount = 2ZooKeeper network partitionFix ZooKeeper quorum; prefer KRaft mode

Common Mistakesโ€‹

MistakeProblemFix
acks=1 for financial transactionsLeader acknowledges, crashes before replication โ†’ data lossUse acks=all + min.insync.replicas=2 for critical topics
unclean.leader.election.enable=trueOut-of-sync replica becomes leader โ†’ silent data lossSet to false in production โ€” accept brief unavailability over data loss
Single partition per topicNo consumer parallelism; one broker handles all writesPartition count โ‰ฅ max consumer group size
Huge JVM heap (> 8 GB)Page cache starved โ€” Kafka reads from disk instead of RAMKeep heap at 6 GB; give remaining RAM to OS for page cache
No auto.create.topics.enable=falseTypos in topic names silently create empty topicsDisable auto-creation; manage topics explicitly
replication.factor=1Single copy โ€” one broker failure = permanent data lossAlways use RF โ‰ฅ 2 in production; RF=3 for critical topics
Not monitoring consumer lagConsumer falls hours behind before anyone noticesAlert on lag > threshold per topic SLA
Large number of small partitionsController election time and memory overhead growKeep partitions per broker under 4,000; right-size partition count
Blocking poll() loopSession timeout โ†’ consumer appears dead โ†’ rebalance stormProcess messages asynchronously; keep poll() interval < max.poll.interval.ms
log.retention.bytes without log.retention.hoursOld data never deleted if size stays under limitSet both; test retention with realistic data volume

๐ŸŽฏ Interview Questionsโ€‹

Q1. What is a Kafka broker and what does it do?

A Kafka broker is a single Kafka server process. It receives messages from producers and appends them to partition segment files on disk, serves messages to consumers on fetch requests, replicates data to and from other brokers to maintain the configured replication factor, and participates in cluster coordination (leader election, metadata management). A Kafka cluster is simply multiple brokers working together โ€” each partition is assigned one leader broker (handles all reads and writes) and zero or more follower brokers (replicate the data for redundancy).

Q2. What is the ISR and why does it matter?

The ISR (In-Sync Replica set) is the list of partition replicas that are fully caught up with the leader within replica.lag.time.max.ms. Only ISR members are eligible to become the new leader on failover โ€” this prevents data loss from electing an out-of-date replica. With acks=all, the leader waits for all ISR members to acknowledge a write before returning success to the producer. If the ISR shrinks below min.insync.replicas, writes are rejected entirely โ€” the cluster sacrifices availability to preserve consistency.

Q3. What is the difference between acks=1, acks=0, and acks=all?

acks=0: fire-and-forget โ€” producer doesn't wait for any acknowledgement; highest throughput, highest data loss risk. acks=1: leader acknowledges after writing to its local log โ€” good balance of performance and safety, but data can be lost if the leader crashes before followers replicate. acks=all (or -1): leader waits for all ISR members to write the record before acknowledging โ€” zero data loss given at least one ISR replica survives. For financial or critical data, always use acks=all combined with min.insync.replicas=2.

Q4. What happens when a broker fails?

The Controller detects the failure via ZooKeeper session expiry (or KRaft heartbeat timeout). For each partition where the failed broker was the leader, the Controller selects a new leader from the ISR. It sends a LeaderAndIsr request to all affected brokers with the new leader assignment, then updates cluster metadata. Producers and consumers receive a metadata error on the next request, refresh their metadata, and reconnect to the new leader โ€” typically recovering within 5โ€“30 seconds. If the failed broker was the only ISR member for any partition, those partitions go offline until the broker recovers or unclean.leader.election.enable=true is set (which risks data loss).

Q5. What is the difference between log compaction and log deletion?

Log deletion (cleanup.policy=delete) removes entire segments after they exceed log.retention.hours or log.retention.bytes. Useful for time-series or event data where older history is no longer needed. Log compaction (cleanup.policy=compact) retains the latest value per message key indefinitely โ€” older records with the same key are deleted, but the most recent record for each key is kept forever. Useful for changelog topics or event-sourced state where you need to reconstruct current entity state by replaying from the beginning. Compaction is ideal for user-profiles, product-inventory, or any "latest state per entity" use case.

Q6. Why is Kafka fast? What are the key architectural choices behind its throughput?

Three OS-level optimisations: (1) Sequential I/O โ€” all partition writes are append-only sequential disk writes, which are 50โ€“100ร— faster than random I/O (no seeking). (2) OS page cache โ€” Kafka writes to the kernel's page cache, not its own buffer. The OS keeps recent data in RAM; consumer reads for recent messages are served from memory at RAM bandwidth speeds, not disk speeds. This is why Kafka recommends a small JVM heap (6 GB) and a large OS page cache. (3) Zero-copy via sendfile โ€” when consumers fetch data, Kafka uses the OS sendfile() syscall to transfer data directly from kernel buffer to the network socket without copying through the JVM. This halves the number of memory copies and eliminates user-kernel context switches.

Q7. (Senior) What is the difference between KRaft and ZooKeeper mode, and why was ZooKeeper removed?

ZooKeeper mode used an external Apache ZooKeeper cluster to store cluster metadata (broker list, partition leaders, ISR, topic configs) and run Controller elections. Problems: (1) operational complexity of two separate systems; (2) practical partition limit of ~200,000 before ZooKeeper became a bottleneck; (3) Controller failover took 15โ€“30 seconds (ZooKeeper session expiry); (4) metadata scalability limits. KRaft replaces ZooKeeper with an internal Raft consensus protocol โ€” Kafka stores its own metadata in a special __cluster_metadata topic replicated via Raft. Benefits: one system to manage, millisecond Controller failover, support for millions of partitions, simpler operations. ZooKeeper was deprecated in Kafka 3.x and removed entirely in Kafka 4.0.

Q8. (Senior) How does Kafka achieve exactly-once semantics?

Exactly-once requires two components: (1) Idempotent producer (enable.idempotence=true) โ€” the broker assigns each producer a PID and tracks sequence numbers per partition. Duplicate sends (from retries) are detected and deduplicated at the broker. (2) Transactions โ€” the transactional producer groups writes across multiple partitions into an atomic unit; either all writes commit or all roll back. Consumers set isolation.level=read_committed to see only committed data. The cost is ~20โ€“30% throughput overhead from transaction coordination. For most use cases, idempotent consumers (de-duplicate at the application level using unique message IDs) achieve the same result with less overhead.


See Alsoโ€‹