Skip to main content

KRaft vs ZooKeeper: Kafka Metadata Architecture

Apache Kafka relied on Apache ZooKeeper for over a decade to manage cluster metadata, broker registration, and leader election. Starting with KIP-500, Kafka introduced KRaft (Kafka Raft Metadata mode) โ€” replacing ZooKeeper with an internal Raft-based consensus protocol. As of Kafka 3.3, KRaft is production-ready. ZooKeeper mode is officially deprecated and will be removed in Kafka 4.0.

Understanding this architectural shift is not just a migration concern โ€” it is a window into fundamental distributed systems concepts: consensus protocols, split-brain prevention, event-sourced metadata, and the trade-offs of coordinating a high-throughput distributed log at scale.

Who this guide is for

๐Ÿ‘ถ The City Hall Analogyโ€‹

Before any diagrams, a concrete mental model for both architectures.

ZooKeeper Mode โ€” The External Registry: Imagine a city (your Kafka cluster) where all official records โ€” property deeds, business licenses, population data โ€” are stored in a separate government building across town (ZooKeeper). The city mayor (the Active Controller broker) runs the city, but every time they need to know who owns what property, they must physically travel to the registry, retrieve the records, and come back. When the mayor leaves office, the incoming mayor must travel to the registry, read all the city records from scratch before they can do anything. If the road between the city hall and the registry is congested or the registry building has issues, the entire city grinds to a halt.

KRaft Mode โ€” The Internal Record Office: The city now keeps its own internal archive room (the __cluster_metadata log). Every decision ever made is recorded as an entry in a logbook that all council members (controller quorum) replicate among themselves. When the mayor leaves, any council member already has an up-to-date copy of every record in their own office โ€” they can step in as mayor in seconds. There is no separate building to maintain, no dependency on road conditions, and no risk of the archive being in a different state than what the mayor believes.


1. The Legacy Architecture: Kafka with ZooKeeperโ€‹

What ZooKeeper Isโ€‹

ZooKeeper is an independent distributed coordination service โ€” a separate cluster, a separate JVM, a separate operational concern โ€” built around a hierarchical namespace of znodes (analogous to a filesystem). Kafka used it as an external distributed database for cluster state.

What ZooKeeper Storedโ€‹

Every piece of Kafka cluster state lived in ZooKeeper znodes:

ZooKeeper namespace (znodes):
/kafka
/brokers
/ids
/1 โ†’ {"host":"broker1","port":9092,"jmx_port":9999}
/2 โ†’ {"host":"broker2","port":9092,"jmx_port":9999}
/topics
/orders
/partitions
/0 โ†’ {"leader":1,"replicas":[1,2,3],"isr":[1,2,3]}
/1 โ†’ {"leader":2,"replicas":[2,3,1],"isr":[2,3,1]}
/controller โ†’ {"brokerid":1,"timestamp":"1718000000"} โ† ephemeral lock
/admin
/delete_topics โ†’ [...]
/config
/topics
/orders โ†’ {"retention.ms":"604800000"}
/isr_change_notification โ†’ [...]
/consumers
/my-group โ†’ [...]

How a Topic Creation Works in ZooKeeper Modeโ€‹

The Four Critical Limitationsโ€‹

Limitation 1 โ€” Two Systems to Operate:

For every Kafka cluster, you need:
Kafka cluster: N brokers ร— (JVM, config, TLS certs, SASL config, JMX, monitoring)
ZooKeeper cluster: M nodes ร— (JVM, config, TLS certs, SASL config, JMX, monitoring)

Typical production setup: 5 Kafka brokers + 3โ€“5 ZK nodes = 8โ€“10 JVM processes
Two separate security models to maintain (ZK has its own SASL, separate from Kafka)
Two separate monitoring dashboards
Two separate incident runbooks

Limitation 2 โ€” Controller Failover Bottleneck:

Controller (Broker 1) dies unexpectedly

New controller election process:
Step 1: ZK detects session timeout (zookeeper.session.timeout.ms = 6s by default)
Step 2: ZK ephemeral /controller znode is deleted โ€” triggers watch
Step 3: All eligible brokers race to create /controller (one wins)
Step 4: New controller reads ENTIRE cluster state from ZK:
- Read all /brokers/ids znodes
- Read all /brokers/topics/*/partitions/* znodes
- For a cluster with 200k partitions: 200k+ ZK reads
Step 5: New controller pushes metadata to all brokers via RPC

Total time: seconds to MINUTES for large clusters
During this window: clients cannot get metadata updates โ€” "metadata freeze"

Limitation 3 โ€” Split-Brain Risk:

Scenario: Network partition between Controller and ZooKeeper

Controller thinks: "I am still the active controller"
ZooKeeper thinks: "Controller session timed out, elected new controller"
Brokers think: "I received a LeaderAndIsr from the old controller"

Two controllers issue conflicting commands:
Old Controller: "Partition 0 leader = Broker 1"
New Controller: "Partition 0 leader = Broker 2"

โ†’ Split-brain: two brokers believe they are the leader for the same partition
โ†’ Potential: duplicate writes, data loss, consumer confusion

Limitation 4 โ€” Partition Scalability Ceiling:

ZK-based cluster practical limit: ~200,000 partitions

Why?
- Each partition = multiple ZK znodes (leader, replicas, ISR)
- Controller must push LeaderAndIsr to all brokers on any ISR change
- At 200k partitions: even a routine ISR change triggers massive RPC fan-out
- ZK's watch notification system saturates under heavy partition churn

Result: Large LinkedIn/Uber-scale Kafka deployments required multiple smaller clusters
instead of one large cluster โ€” increasing operational complexity

2. The Modern Architecture: KRaft (Kafka Raft)โ€‹

The Core Insightโ€‹

KRaft's fundamental insight: Kafka is already excellent at replicating sequential logs reliably and efficiently. Why use an external system (ZooKeeper) for metadata when Kafka's own log replication can manage it? KRaft applies Kafka's core competency โ€” the replicated, append-only log โ€” to metadata itself.

What KRaft Isโ€‹

KRaft stores all cluster metadata as records in an internal Kafka topic: __cluster_metadata. Controllers form a Raft consensus group โ€” they elect a leader, replicate the metadata log, and commit changes using the same Raft protocol that underpins databases like etcd and CockroachDB.

How a Topic Creation Works in KRaft Modeโ€‹

KRaft Deployment Modesโ€‹

Combined Mode: The same JVM process acts as both Broker and Controller. Suitable for development and small clusters (< 10 brokers, < 10,000 partitions).

Isolated Mode: Dedicated controller nodes run no client traffic. Recommended for production clusters where metadata operations should not compete with producer/consumer I/O.

The Metadata Log: What Records Look Likeโ€‹

__cluster_metadata topic records:

Offset 0: RegisterBrokerRecord { brokerId: 1, host: "broker1", port: 9092 }
Offset 1: RegisterBrokerRecord { brokerId: 2, host: "broker2", port: 9092 }
Offset 2: RegisterBrokerRecord { brokerId: 3, host: "broker3", port: 9092 }
Offset 3: TopicRecord { topicId: "uuid-abc", name: "orders" }
Offset 4: PartitionRecord { topicId: "uuid-abc", partitionId: 0,
leader: 1, replicas: [1,2,3], isr: [1,2,3] }
Offset 5: PartitionRecord { topicId: "uuid-abc", partitionId: 1,
leader: 2, replicas: [2,3,1], isr: [2,3,1] }
Offset 6: PartitionChangeRecord { topicId: "uuid-abc", partitionId: 0,
leader: 2, isr: [2,3] } โ† Broker 1 left ISR
Offset 7: ConfigRecord { resource: "orders", key: "retention.ms",
value: "604800000" }
Offset 8: ClientQuotaRecord { ...quotas... }
...
Offset 10,000: MetadataVersionRecord { version: 7 } โ† Snapshot trigger point

Any broker joining the cluster replays this log from the latest snapshot to reconstruct its complete in-memory metadata state โ€” without contacting any external system.


3. The Raft Consensus Protocol: How KRaft Achieves Agreementโ€‹

Understanding Raft is essential for understanding KRaft's guarantees. Raft is the consensus algorithm that ensures all controllers agree on the metadata log โ€” even in the presence of failures.

Raft Leader Electionโ€‹

Key Raft guarantees KRaft inherits:

  • Election Safety: At most one leader per term. No two controllers believe they are both active simultaneously โ€” eliminating split-brain.
  • Log Matching: If two logs have the same offset and term for an entry, they are identical up to that point โ€” ensuring all controllers agree on history.
  • Leader Completeness: A new leader always has all committed entries from previous terms โ€” no data loss on failover.

Why Raft Beats ZooKeeper's ZAB Protocol for This Use Caseโ€‹

ZooKeeper uses its own consensus protocol (ZAB โ€” ZooKeeper Atomic Broadcast). While robust, it was designed for a general-purpose coordination service. KRaft uses Raft specifically tuned for Kafka's metadata log:

AspectZAB (ZooKeeper)Raft (KRaft)
Designed forGeneral key-value coordinationAppend-only log replication
State modelHierarchical znodes (like a filesystem)Sequential log entries (like a Kafka topic)
Leader transferComplex, slowClean, fast (explicit leader transfer command)
Log compactionNot built-in (ZK znodes are durable state)Snapshot-based (Kafka-native)
IntegrationExternal system โ€” separate protocol, separate JVMInternal โ€” same codebase, same tooling

4. Failure Scenario Walkthroughโ€‹

The most revealing way to compare the architectures is through concrete failure scenarios.

Scenario 1: Active Controller Failsโ€‹

Why KRaft failover is sub-second: Standby controllers are continuously streaming the __cluster_metadata log. Their in-memory state is always within a few milliseconds of the active controller. When elected, they need zero catch-up reads โ€” they are already current.


Scenario 2: Network Partition (Split-Brain Risk)โ€‹

ZooKeeper Mode โ€” split-brain scenario:

Network partitions: [Controller + Brokers 1,2] | [ZooKeeper + Brokers 3,4,5]

From Controller's perspective:
"I cannot reach ZooKeeper, but I still have my in-memory metadata.
I'll keep serving metadata to Brokers 1 and 2."

From ZooKeeper's perspective:
"Controller session timed out. Electing Broker 3 as new controller."

Result:
Broker 1,2: receiving leadership commands from OLD controller
Broker 3,4,5: receiving leadership commands from NEW controller
โ†’ Two leaders for the same partition = SPLIT-BRAIN
โ†’ Potential data loss when partition heals and logs must be reconciled
KRaft Mode โ€” same network partition:

[Controller 1 (Active) + Brokers 1,2] | [Controller 2,3 + Brokers 3,4,5]

Controllers 2 and 3 form a majority quorum (2 of 3 controllers):
โ†’ Elect Controller 2 as new Active Controller
โ†’ Controller 2 knows its log is up-to-date (Raft log matching guarantee)
โ†’ Controller 1 cannot commit any new entries (no majority)
โ†’ Controller 1's epoch/term is outdated โ€” any broker receiving
a request from it will REJECT it (term check)

Result:
Clean leadership transfer. No split-brain.
Controller 1 is automatically fenced โ€” its writes are rejected.
When partition heals, Controller 1 discovers higher term and steps down.

The Raft epoch/term mechanism prevents split-brain: Every metadata record carries the controller's epoch (term). Brokers reject any metadata from a controller with an outdated term. This is the Raft equivalent of a database's fencing token โ€” a monotonically increasing number that makes old leaders' writes invalid.


Scenario 3: Broker Restart After Long Absenceโ€‹

ZooKeeper Mode:
Broker 2 was down for 2 hours (maintenance, crash)
Restarts:
1. Registers with ZooKeeper (/brokers/ids/2)
2. Controller detects registration via ZK watch
3. Controller pushes current state to Broker 2 via RPC
โ†’ Time to operational: seconds to tens of seconds
(Controller must push full metadata for all its partitions)

KRaft Mode:
Broker 2 was down for 2 hours
Restarts:
1. Loads latest metadata snapshot from local disk
2. Starts consuming __cluster_metadata from snapshot offset
3. Streams log entries until caught up to current offset
4. Registers with Active Controller (fetch only missing records)
โ†’ Time to operational: depends only on how many records
accumulated during downtime โ€” typically seconds
No RPC push from Controller needed โ€” broker self-heals

5. Deep Comparison Matrixโ€‹

DimensionZooKeeper ArchitectureKRaft Architecture
External dependencyโŒ Requires separate ZK clusterโœ… Self-contained
JVM processes per clusterN Kafka + M ZK (typically N+3 to N+5)N Kafka nodes only
Metadata storage modelHierarchical znodesAppend-only event log
Consensus protocolZAB (ZooKeeper Atomic Broadcast)Raft
Controller failover timeSeconds to minutes (full ZK read + RPC push)Sub-second (log already in memory)
Split-brain preventionWeak (ZK session timeout โ‰  broker fencing)โœ… Strong (Raft epoch/term fencing)
Metadata propagationPush via RPC (Controller โ†’ Brokers)Pull via log consumption (Brokers โ†’ metadata log)
Max practical partitions~200,0001,000,000+ (tested)
Security modelDual: ZK SASL + Kafka SASLSingle: Kafka SASL/TLS only
Startup time (broker)Wait for ZK + metadata propagationLoad snapshot + stream log tail
Operational toolingTwo systems: kafka-* CLI + zkCliOne system: kafka-* CLI only
Snapshot supportโŒ (ZK state is always full)โœ… Periodic snapshots for fast startup
Production readinessDeprecated (removed in Kafka 4.0)โœ… GA since Kafka 3.3

6. When to Use Each (Migration Context)โ€‹

Since ZooKeeper is deprecated and removed in Kafka 4.0, this is now a migration timeline decision, not an architectural choice between equals.

Migration Urgency Matrixโ€‹

Kafka VersionZooKeeper SupportKRaft StatusAction
< 2.8Only optionNot availablePlan upgrade path
2.8 โ€“ 3.2AvailableEarly access / previewBegin migration planning
3.3 โ€“ 3.7Available (deprecated)โœ… Production-readyMigrate actively
4.0+โŒ Removedโœ… Only optionMust be on KRaft

Migration Path Decisionโ€‹

Are you on Kafka < 3.3?
โ†’ Upgrade to 3.5+ first (ZK mode still supported)
โ†’ Then migrate to KRaft in-place (rolling migration supported since 3.4)
โ†’ Target: KRaft before your team is forced by a 4.0 upgrade

Is your cluster small (< 5 brokers, dev/staging)?
โ†’ Use Combined Mode: brokers also act as controllers
โ†’ Simplest possible setup, no dedicated controller nodes

Is your cluster production with > 10 brokers or > 10,000 partitions?
โ†’ Use Isolated Mode: dedicated controller nodes (3 or 5)
โ†’ Prevents metadata I/O from competing with producer/consumer throughput

Do you need > 200,000 partitions?
โ†’ KRaft is the only option โ€” ZK cannot handle this scale

Production Configuration & Tuningโ€‹

KRaft Broker Configuration (server.properties)โ€‹

# โ”€โ”€ Node Identity โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# Unique ID for this node (must be unique across the entire cluster)
node.id=1

# Role: 'broker', 'controller', or 'broker,controller' (combined mode)
process.roles=broker,controller # Combined mode
# process.roles=broker # Isolated mode โ€” broker only
# process.roles=controller # Isolated mode โ€” controller only

# โ”€โ”€ KRaft Cluster Identity โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# Must be generated once per cluster: kafka-storage.sh random-uuid
cluster.id=MkU3OEVBNTcwNTJENDM2Qg

# Controller quorum โ€” all controller node.id:host:port pairs
# Must be the same on ALL nodes in the cluster
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093

# โ”€โ”€ Listeners โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# PLAINTEXT: client traffic (brokers only)
# CONTROLLER: controller-to-controller and broker-to-controller (controllers)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://broker1.internal:9092

# Which listener is used for controller communication
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT

# โ”€โ”€ Log and Metadata Storage โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
log.dirs=/data/kafka/logs
metadata.log.dir=/data/kafka/metadata # Separate disk recommended for metadata

# โ”€โ”€ KRaft Tuning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# How often to take a metadata snapshot (in number of records)
# Lower = faster broker startup, more snapshot I/O
metadata.log.max.record.bytes.between.snapshots=20971520 # 20 MB

# How long to retain old snapshots (for lagging brokers to catch up)
metadata.max.retention.bytes=104857600 # 100 MB

# Raft election timeout โ€” how long to wait before triggering election
# Default 1000ms โ€” reduce only if you need faster failover
# (lower values increase false-positive elections under load)
controller.quorum.election.timeout.ms=1000
controller.quorum.fetch.timeout.ms=2000

Format and Bootstrap a New KRaft Clusterโ€‹

# Step 1: Generate a unique cluster ID (do this ONCE โ€” same ID for all nodes)
CLUSTER_ID=$(kafka-storage.sh random-uuid)
echo "Cluster ID: $CLUSTER_ID"
# Output: MkU3OEVBNTcwNTJENDM2Qg

# Step 2: Format storage on EVERY node (uses the same CLUSTER_ID)
kafka-storage.sh format \
--config /etc/kafka/server.properties \
--cluster-id $CLUSTER_ID
# Output: Formatting /data/kafka/logs with metadata.version 3.7-IV4

# Step 3: Start brokers (they self-discover via controller.quorum.voters)
kafka-server-start.sh /etc/kafka/server.properties

Spring Boot Producer/Consumer โ€” No ZooKeeper Config Neededโ€‹

One of the most visible operational improvements: Spring Boot Kafka configuration becomes simpler with KRaft โ€” no ZooKeeper connection string required anywhere.

# application.yml โ€” KRaft cluster (no zookeeper.connect property)
spring:
kafka:
bootstrap-servers: broker1:9092,broker2:9092,broker3:9092
# That's it โ€” ZK is gone from client configuration entirely

producer:
key-serializer: org.apache.kafka.common.serialization.StringSerializer
value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
acks: all # Wait for all ISR replicas to confirm
retries: 3
properties:
enable.idempotence: true
max.in.flight.requests.per.connection: 5

consumer:
group-id: order-processor
auto-offset-reset: earliest
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
properties:
spring.json.trusted.packages: "com.example.events"
isolation.level: read_committed # Only read committed transactions
// Checking cluster metadata โ€” works identically on ZK and KRaft
@Service
@RequiredArgsConstructor
public class KafkaClusterInspector {

private final KafkaAdmin kafkaAdmin;

public ClusterInfo inspectCluster() throws Exception {
try (AdminClient adminClient = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
// Describe the cluster
DescribeClusterResult cluster = adminClient.describeCluster();
String clusterId = cluster.clusterId().get();
Node controller = cluster.controller().get();
Collection<Node> nodes = cluster.nodes().get();

log.info("Cluster ID: {}", clusterId);
log.info("Active Controller: node {} at {}:{}",
controller.id(), controller.host(), controller.port());
log.info("Cluster nodes: {}", nodes.size());

// Describe a specific topic
DescribeTopicsResult topics = adminClient.describeTopics(List.of("orders"));
TopicDescription orders = topics.topicNameValues().get("orders").get();
orders.partitions().forEach(p ->
log.info("Partition {}: leader={}, isr={}",
p.partition(), p.leader().id(),
p.isr().stream().map(Node::id).toList())
);

return new ClusterInfo(clusterId, controller.id(), nodes.size());
}
}

// Check if we're running KRaft (useful during migration period)
public boolean isKRaftMode() throws Exception {
try (AdminClient client = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
// KRaft clusters have a non-null cluster ID from kafka-storage format
// ZK clusters use a ZK-generated cluster ID that may be absent in older versions
String clusterId = client.describeCluster().clusterId().get();
// KRaft cluster IDs are base64-encoded UUIDs (22 chars)
return clusterId != null && clusterId.length() == 22;
}
}
}

Programmatic Topic Managementโ€‹

@Configuration
public class KafkaTopicConfig {

// With KRaft, topic creation is faster and more consistent
// No ZK round-trips โ€” changes commit via Raft immediately
@Bean
public NewTopic ordersTopic() {
return TopicBuilder.name("orders")
.partitions(12)
.replicas(3)
.config(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(Duration.ofDays(7).toMillis()))
.config(TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4")
.config(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")
.build();
}

@Bean
public NewTopic ordersDlqTopic() {
return TopicBuilder.name("orders.DLQ")
.partitions(12)
.replicas(3)
.config(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(Duration.ofDays(14).toMillis()))
.build();
}
}

Migrating from ZooKeeper to KRaftโ€‹

Migration Overview (Kafka 3.4+ Rolling Migration)โ€‹

Since Kafka 3.4, in-place migration is supported: you can migrate a live ZooKeeper-backed cluster to KRaft without downtime โ€” brokers keep serving traffic throughout.

Migration Checklistโ€‹

# โ”€โ”€ Pre-migration validation โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

# 1. Confirm Kafka version is 3.4+ (rolling migration requires this)
kafka-broker-api-versions.sh --bootstrap-server broker1:9092 | grep "ApiVersions"

# 2. Check current ZK-mode cluster health
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status
# Should show: MetadataVersion, ClusterId, ActiveControllerId

# 3. Verify ZK is healthy
echo "srvr" | nc zookeeper1 2181
# Should return Mode: leader or Mode: follower

# 4. Document current partition count and topic list
kafka-topics.sh --bootstrap-server broker1:9092 --list | wc -l

# โ”€โ”€ Phase 1: Deploy KRaft controllers alongside ZK โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

# 5. Format storage on NEW controller nodes (NOT existing brokers yet)
kafka-storage.sh format --config controller.properties --cluster-id $(cat /etc/kafka/cluster.id)

# 6. Start KRaft controllers with migration mode enabled
# controller.properties additions:
# zookeeper.connect=zk1:2181,zk2:2181,zk3:2181 โ† still reading from ZK
# zookeeper.metadata.migration.enable=true

# โ”€โ”€ Phase 2: Migrate metadata โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

# 7. Trigger metadata migration (atomic โ€” ZK state โ†’ KRaft log)
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status
# Watch for: MetadataMigrationState: MIGRATION_COMPLETE

# โ”€โ”€ Phase 3: Rolling broker migration โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

# 8. Rolling restart brokers with KRaft config
# Add to broker server.properties:
# process.roles=broker
# controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
# Remove: zookeeper.connect

# Restart one broker at a time, verify health between each:
kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic orders
# Confirm: all partitions have leaders, ISR is full

# โ”€โ”€ Phase 4: Decommission ZooKeeper โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

# 9. Verify no broker is connected to ZK
echo "dump" | nc zookeeper1 2181 | grep -c "kafka"
# Should return 0

# 10. Stop ZooKeeper nodes โ€” you are done

๐Ÿง  Senior Deep Diveโ€‹

1. KRaft Metadata Snapshots: Preventing Log Growthโ€‹

Without snapshots, the __cluster_metadata log grows indefinitely. A broker that restarts after being down for weeks would need to replay years of log history to reconstruct current state.

KRaft snapshot lifecycle:

1. Active Controller writes records to __cluster_metadata
Offset 0 ... 50,000 records accumulated

2. Snapshot trigger: records since last snapshot > metadata.log.max.record.bytes.between.snapshots
โ†’ Controller serializes full in-memory state to snapshot file
โ†’ Snapshot file: /data/kafka/metadata/__cluster_metadata-0/00000000000050000-0000001234.checkpoint

3. New broker starts:
a. Finds latest snapshot: 00000000000050000-0000001234.checkpoint
b. Loads snapshot into memory (fast: one file read)
c. Starts consuming __cluster_metadata from offset 50,001
d. Applies all records since snapshot โ†’ fully up-to-date
e. Joins the cluster

4. Old snapshots cleaned up based on metadata.max.retention.bytes
(keep enough for lagging brokers to catch up)
# Inspect snapshots on a running controller node
ls -la /data/kafka/metadata/__cluster_metadata-0/
# Output:
# 00000000000000000-0000000001.checkpoint โ† initial snapshot
# 00000000000050000-0000001234.checkpoint โ† after 50k records
# 00000000000100000-0000002468.checkpoint โ† after 100k records
# ...

# View snapshot contents (diagnostic)
kafka-metadata-shell.sh \
--snapshot /data/kafka/metadata/__cluster_metadata-0/00000000000100000-0000002468.checkpoint
# Interactive shell: ls, cat, describe on metadata state at that snapshot

2. The kafka-metadata-quorum.sh Operational Toolkitโ€‹

KRaft ships with dedicated tooling for inspecting the Raft quorum โ€” something that had no equivalent in ZooKeeper mode.

# Describe quorum status โ€” who is the leader, what is each node's lag?
kafka-metadata-quorum.sh \
--bootstrap-server broker1:9092 \
describe --status

# Output:
# ClusterId: MkU3OEVBNTcwNTJENDM2Qg
# LeaderId: 1
# LeaderEpoch: 5
# HighWatermark: 102450
# MaxFollowerLag: 3 โ† Followers are 3 records behind
# MaxFollowerLagTimeMs: 12 โ† 12ms behind the leader
# CurrentVoters: [1,2,3]
# CurrentObservers: [4,5,6] โ† Brokers consuming metadata log but not voting

# Describe individual replica states
kafka-metadata-quorum.sh \
--bootstrap-server broker1:9092 \
describe --replication

# Output:
# NodeId LogEndOffset Lag LastFetchTimestamp LastCaughtUpTimestamp Status
# 1 102450 0 1718000100000 1718000100000 Leader
# 2 102447 3 1718000099988 1718000099988 Follower
# 3 102450 0 1718000100000 1718000100000 Follower
# 4 102448 2 1718000099991 1718000099991 Observer (broker)

3. Observability: What to Monitor in KRaftโ€‹

KRaft exposes new JMX metrics that replace ZooKeeper-specific monitoring.

// Micrometer / Spring Boot Actuator โ€” KRaft-specific metrics to watch

// 1. Controller active status โ€” alert if 0 (no active controller)
// kafka.controller:type=KafkaController,name=ActiveControllerCount
// Expected: exactly 1 across the cluster

// 2. Metadata log offset lag per broker
// kafka.server:type=MetadataManager,name=MetadataMirrorer$CurrentControllerId
// kafka.server:type=KafkaBroker,name=metadataCurrentOffset
// kafka.server:type=KafkaBroker,name=metadataAppliedOffset

// 3. Controller failover rate โ€” alert if > 0 in last 5 minutes
// kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

// 4. Raft quorum lag
// kafka.raft:type=raft-metrics,name=current-leader
// kafka.raft:type=raft-metrics,name=high-watermark
# Prometheus alerting rules for KRaft
groups:
- name: kraft_alerts
rules:
# CRITICAL: No active controller
- alert: KafkaNoActiveController
expr: kafka_controller_kafkacontroller_activecontrollercount != 1
for: 30s
labels:
severity: critical
annotations:
summary: "No active Kafka controller โ€” cluster cannot accept metadata changes"

# WARNING: Controller failover occurred
- alert: KafkaControllerFailover
expr: increase(kafka_controller_controllerstats_leaderelectionrateanddtimems_count[5m]) > 0
labels:
severity: warning
annotations:
summary: "Kafka controller failover detected in the last 5 minutes"

# WARNING: Broker significantly behind on metadata log
- alert: KafkaBrokerMetadataLag
expr: kafka_server_metadatamanager_metadatacurrentoffset - on(instance)
kafka_server_metadatamanager_metadataappliedoffset > 1000
for: 1m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} is {{ $value }} records behind on metadata log"

4. KRaft Limitations and Known Constraintsโ€‹

KRaft is production-ready but has some known limitations as of Kafka 3.x:

LimitationStatusWorkaround
JBOD (multiple data dirs per broker)Not supported in KRaft (KIP-858 pending)Use single log.dirs per broker or RAID
Delegation token migrationManual re-creation needed during ZK โ†’ KRaft migrationRe-issue delegation tokens post-migration
Some legacy toolskafka-preferred-replica-election.sh deprecatedUse kafka-leader-election.sh instead
Cluster linking (Confluent)Verify version compatibilityCheck Confluent Platform KRaft support matrix
Mirror Maker 2 with KRaft sourceSupported from MM2 3.4+Upgrade MM2 before migrating source cluster

5. KRaft Isolated Mode: Sizing the Controller Quorumโ€‹

For production clusters, how many dedicated controller nodes should you run?

Raft quorum fault tolerance formula:
Can tolerate F failures with 2F+1 nodes

3 controllers โ†’ tolerate 1 failure (minimum for production)
5 controllers โ†’ tolerate 2 simultaneous failures (for critical clusters)
7 controllers โ†’ tolerate 3 simultaneous failures (rare; adds latency)

Controller write latency:
Write commits when majority (F+1) nodes acknowledge
3-node quorum: need 2 ACKs (1 network hop)
5-node quorum: need 3 ACKs (potentially 2 network hops to the slowest)
โ†’ 5-node quorum adds ~20-30% more latency on metadata writes
โ†’ Only use 5 nodes if the 2-failure tolerance is genuinely required

Controller node sizing:
CPU: 4 cores (metadata operations are not CPU-intensive)
RAM: 16โ€“32 GB (metadata snapshot + log in memory)
Disk: Fast SSD for metadata.log.dir (separate from broker data disks)
IOPS: 3,000+ IOPS for metadata disk (snapshot writes are periodic bursts)

๐ŸŽฏ Interview Decision Matrixโ€‹

QuestionAnswer
Why did Kafka deprecate ZooKeeper?Scalability ceiling (~200k partitions), minutes-long controller failovers, split-brain risk, operational complexity of running two distributed systems
How does KRaft prevent split-brain?Raft epoch/term fencing: any controller with an outdated term is rejected by brokers. Only the current Raft leader can commit metadata changes
How do brokers get metadata updates in KRaft?They consume __cluster_metadata topic as standard Kafka consumers โ€” pull-based, not push-based
Why is KRaft failover sub-second?Standby controllers continuously consume the metadata log; their in-memory state is already current. No catch-up reads needed on election
What is a KRaft snapshot?A point-in-time serialization of the full metadata state, enabling new/restarting brokers to bootstrap without replaying the full log history
Combined vs. Isolated mode?Combined for dev/small clusters (simpler). Isolated for production (isolates metadata I/O from client traffic, independent scaling)
Can I migrate live without downtime?Yes, since Kafka 3.4 โ€” rolling migration keeps ZK active until all brokers have switched to the KRaft controllers
Interview Phrasing โ€” Why KRaft?

"Kafka moved away from ZooKeeper to solve three compounding problems. First, scalability: ZK-backed clusters are limited to ~200k partitions because the controller must push state changes to every broker via RPC โ€” this saturates at scale. Second, failover latency: when a ZK-mode controller fails, the new controller must read the entire cluster state from ZooKeeper before it can function โ€” minutes of metadata freeze on large clusters. Third, operational complexity: two separate distributed systems, two JVM clusters, two security models, two monitoring stacks. KRaft solves all three: metadata is an internal Kafka log, standbys are always current, and there's only one system to operate."

Interview Phrasing โ€” How KRaft Works

"KRaft stores cluster metadata as records in an internal topic called __cluster_metadata, replicated using the Raft consensus protocol across a quorum of controller nodes. Brokers are consumers of this topic โ€” they continuously stream metadata changes and update their in-memory state. When the active controller fails, a standby is elected in sub-second time because it already has the full state in memory. There's no external system to consult. The Raft epoch/term mechanism ensures that any controller with a stale term is immediately fenced โ€” eliminating the split-brain risk that existed in ZooKeeper mode where the ZK state and in-broker state could diverge."


๐Ÿ“š Further Readingโ€‹