KRaft vs ZooKeeper: Kafka Metadata Architecture
Apache Kafka relied on Apache ZooKeeper for over a decade to manage cluster metadata, broker registration, and leader election. Starting with KIP-500, Kafka introduced KRaft (Kafka Raft Metadata mode) โ replacing ZooKeeper with an internal Raft-based consensus protocol. As of Kafka 3.3, KRaft is production-ready. ZooKeeper mode is officially deprecated and will be removed in Kafka 4.0.
Understanding this architectural shift is not just a migration concern โ it is a window into fundamental distributed systems concepts: consensus protocols, split-brain prevention, event-sourced metadata, and the trade-offs of coordinating a high-throughput distributed log at scale.
- New learners โ start at The City Hall Analogy and How ZooKeeper Works.
- Senior engineers โ jump to Raft Consensus Deep Dive, Failure Scenario Comparison, Migration Strategy, or Production Configuration.
๐ถ The City Hall Analogyโ
Before any diagrams, a concrete mental model for both architectures.
ZooKeeper Mode โ The External Registry: Imagine a city (your Kafka cluster) where all official records โ property deeds, business licenses, population data โ are stored in a separate government building across town (ZooKeeper). The city mayor (the Active Controller broker) runs the city, but every time they need to know who owns what property, they must physically travel to the registry, retrieve the records, and come back. When the mayor leaves office, the incoming mayor must travel to the registry, read all the city records from scratch before they can do anything. If the road between the city hall and the registry is congested or the registry building has issues, the entire city grinds to a halt.
KRaft Mode โ The Internal Record Office:
The city now keeps its own internal archive room (the __cluster_metadata log). Every decision ever made is recorded as an entry in a logbook that all council members (controller quorum) replicate among themselves. When the mayor leaves, any council member already has an up-to-date copy of every record in their own office โ they can step in as mayor in seconds. There is no separate building to maintain, no dependency on road conditions, and no risk of the archive being in a different state than what the mayor believes.
1. The Legacy Architecture: Kafka with ZooKeeperโ
What ZooKeeper Isโ
ZooKeeper is an independent distributed coordination service โ a separate cluster, a separate JVM, a separate operational concern โ built around a hierarchical namespace of znodes (analogous to a filesystem). Kafka used it as an external distributed database for cluster state.
What ZooKeeper Storedโ
Every piece of Kafka cluster state lived in ZooKeeper znodes:
ZooKeeper namespace (znodes):
/kafka
/brokers
/ids
/1 โ {"host":"broker1","port":9092,"jmx_port":9999}
/2 โ {"host":"broker2","port":9092,"jmx_port":9999}
/topics
/orders
/partitions
/0 โ {"leader":1,"replicas":[1,2,3],"isr":[1,2,3]}
/1 โ {"leader":2,"replicas":[2,3,1],"isr":[2,3,1]}
/controller โ {"brokerid":1,"timestamp":"1718000000"} โ ephemeral lock
/admin
/delete_topics โ [...]
/config
/topics
/orders โ {"retention.ms":"604800000"}
/isr_change_notification โ [...]
/consumers
/my-group โ [...]
How a Topic Creation Works in ZooKeeper Modeโ
The Four Critical Limitationsโ
Limitation 1 โ Two Systems to Operate:
For every Kafka cluster, you need:
Kafka cluster: N brokers ร (JVM, config, TLS certs, SASL config, JMX, monitoring)
ZooKeeper cluster: M nodes ร (JVM, config, TLS certs, SASL config, JMX, monitoring)
Typical production setup: 5 Kafka brokers + 3โ5 ZK nodes = 8โ10 JVM processes
Two separate security models to maintain (ZK has its own SASL, separate from Kafka)
Two separate monitoring dashboards
Two separate incident runbooks
Limitation 2 โ Controller Failover Bottleneck:
Controller (Broker 1) dies unexpectedly
New controller election process:
Step 1: ZK detects session timeout (zookeeper.session.timeout.ms = 6s by default)
Step 2: ZK ephemeral /controller znode is deleted โ triggers watch
Step 3: All eligible brokers race to create /controller (one wins)
Step 4: New controller reads ENTIRE cluster state from ZK:
- Read all /brokers/ids znodes
- Read all /brokers/topics/*/partitions/* znodes
- For a cluster with 200k partitions: 200k+ ZK reads
Step 5: New controller pushes metadata to all brokers via RPC
Total time: seconds to MINUTES for large clusters
During this window: clients cannot get metadata updates โ "metadata freeze"
Limitation 3 โ Split-Brain Risk:
Scenario: Network partition between Controller and ZooKeeper
Controller thinks: "I am still the active controller"
ZooKeeper thinks: "Controller session timed out, elected new controller"
Brokers think: "I received a LeaderAndIsr from the old controller"
Two controllers issue conflicting commands:
Old Controller: "Partition 0 leader = Broker 1"
New Controller: "Partition 0 leader = Broker 2"
โ Split-brain: two brokers believe they are the leader for the same partition
โ Potential: duplicate writes, data loss, consumer confusion
Limitation 4 โ Partition Scalability Ceiling:
ZK-based cluster practical limit: ~200,000 partitions
Why?
- Each partition = multiple ZK znodes (leader, replicas, ISR)
- Controller must push LeaderAndIsr to all brokers on any ISR change
- At 200k partitions: even a routine ISR change triggers massive RPC fan-out
- ZK's watch notification system saturates under heavy partition churn
Result: Large LinkedIn/Uber-scale Kafka deployments required multiple smaller clusters
instead of one large cluster โ increasing operational complexity
2. The Modern Architecture: KRaft (Kafka Raft)โ
The Core Insightโ
KRaft's fundamental insight: Kafka is already excellent at replicating sequential logs reliably and efficiently. Why use an external system (ZooKeeper) for metadata when Kafka's own log replication can manage it? KRaft applies Kafka's core competency โ the replicated, append-only log โ to metadata itself.
What KRaft Isโ
KRaft stores all cluster metadata as records in an internal Kafka topic: __cluster_metadata. Controllers form a Raft consensus group โ they elect a leader, replicate the metadata log, and commit changes using the same Raft protocol that underpins databases like etcd and CockroachDB.
How a Topic Creation Works in KRaft Modeโ
KRaft Deployment Modesโ
Combined Mode: The same JVM process acts as both Broker and Controller. Suitable for development and small clusters (< 10 brokers, < 10,000 partitions).
Isolated Mode: Dedicated controller nodes run no client traffic. Recommended for production clusters where metadata operations should not compete with producer/consumer I/O.
The Metadata Log: What Records Look Likeโ
__cluster_metadata topic records:
Offset 0: RegisterBrokerRecord { brokerId: 1, host: "broker1", port: 9092 }
Offset 1: RegisterBrokerRecord { brokerId: 2, host: "broker2", port: 9092 }
Offset 2: RegisterBrokerRecord { brokerId: 3, host: "broker3", port: 9092 }
Offset 3: TopicRecord { topicId: "uuid-abc", name: "orders" }
Offset 4: PartitionRecord { topicId: "uuid-abc", partitionId: 0,
leader: 1, replicas: [1,2,3], isr: [1,2,3] }
Offset 5: PartitionRecord { topicId: "uuid-abc", partitionId: 1,
leader: 2, replicas: [2,3,1], isr: [2,3,1] }
Offset 6: PartitionChangeRecord { topicId: "uuid-abc", partitionId: 0,
leader: 2, isr: [2,3] } โ Broker 1 left ISR
Offset 7: ConfigRecord { resource: "orders", key: "retention.ms",
value: "604800000" }
Offset 8: ClientQuotaRecord { ...quotas... }
...
Offset 10,000: MetadataVersionRecord { version: 7 } โ Snapshot trigger point
Any broker joining the cluster replays this log from the latest snapshot to reconstruct its complete in-memory metadata state โ without contacting any external system.
3. The Raft Consensus Protocol: How KRaft Achieves Agreementโ
Understanding Raft is essential for understanding KRaft's guarantees. Raft is the consensus algorithm that ensures all controllers agree on the metadata log โ even in the presence of failures.
Raft Leader Electionโ
Key Raft guarantees KRaft inherits:
- Election Safety: At most one leader per term. No two controllers believe they are both active simultaneously โ eliminating split-brain.
- Log Matching: If two logs have the same offset and term for an entry, they are identical up to that point โ ensuring all controllers agree on history.
- Leader Completeness: A new leader always has all committed entries from previous terms โ no data loss on failover.
Why Raft Beats ZooKeeper's ZAB Protocol for This Use Caseโ
ZooKeeper uses its own consensus protocol (ZAB โ ZooKeeper Atomic Broadcast). While robust, it was designed for a general-purpose coordination service. KRaft uses Raft specifically tuned for Kafka's metadata log:
| Aspect | ZAB (ZooKeeper) | Raft (KRaft) |
|---|---|---|
| Designed for | General key-value coordination | Append-only log replication |
| State model | Hierarchical znodes (like a filesystem) | Sequential log entries (like a Kafka topic) |
| Leader transfer | Complex, slow | Clean, fast (explicit leader transfer command) |
| Log compaction | Not built-in (ZK znodes are durable state) | Snapshot-based (Kafka-native) |
| Integration | External system โ separate protocol, separate JVM | Internal โ same codebase, same tooling |
4. Failure Scenario Walkthroughโ
The most revealing way to compare the architectures is through concrete failure scenarios.
Scenario 1: Active Controller Failsโ
Why KRaft failover is sub-second: Standby controllers are continuously streaming the __cluster_metadata log. Their in-memory state is always within a few milliseconds of the active controller. When elected, they need zero catch-up reads โ they are already current.
Scenario 2: Network Partition (Split-Brain Risk)โ
ZooKeeper Mode โ split-brain scenario:
Network partitions: [Controller + Brokers 1,2] | [ZooKeeper + Brokers 3,4,5]
From Controller's perspective:
"I cannot reach ZooKeeper, but I still have my in-memory metadata.
I'll keep serving metadata to Brokers 1 and 2."
From ZooKeeper's perspective:
"Controller session timed out. Electing Broker 3 as new controller."
Result:
Broker 1,2: receiving leadership commands from OLD controller
Broker 3,4,5: receiving leadership commands from NEW controller
โ Two leaders for the same partition = SPLIT-BRAIN
โ Potential data loss when partition heals and logs must be reconciled
KRaft Mode โ same network partition:
[Controller 1 (Active) + Brokers 1,2] | [Controller 2,3 + Brokers 3,4,5]
Controllers 2 and 3 form a majority quorum (2 of 3 controllers):
โ Elect Controller 2 as new Active Controller
โ Controller 2 knows its log is up-to-date (Raft log matching guarantee)
โ Controller 1 cannot commit any new entries (no majority)
โ Controller 1's epoch/term is outdated โ any broker receiving
a request from it will REJECT it (term check)
Result:
Clean leadership transfer. No split-brain.
Controller 1 is automatically fenced โ its writes are rejected.
When partition heals, Controller 1 discovers higher term and steps down.
The Raft epoch/term mechanism prevents split-brain: Every metadata record carries the controller's epoch (term). Brokers reject any metadata from a controller with an outdated term. This is the Raft equivalent of a database's fencing token โ a monotonically increasing number that makes old leaders' writes invalid.
Scenario 3: Broker Restart After Long Absenceโ
ZooKeeper Mode:
Broker 2 was down for 2 hours (maintenance, crash)
Restarts:
1. Registers with ZooKeeper (/brokers/ids/2)
2. Controller detects registration via ZK watch
3. Controller pushes current state to Broker 2 via RPC
โ Time to operational: seconds to tens of seconds
(Controller must push full metadata for all its partitions)
KRaft Mode:
Broker 2 was down for 2 hours
Restarts:
1. Loads latest metadata snapshot from local disk
2. Starts consuming __cluster_metadata from snapshot offset
3. Streams log entries until caught up to current offset
4. Registers with Active Controller (fetch only missing records)
โ Time to operational: depends only on how many records
accumulated during downtime โ typically seconds
No RPC push from Controller needed โ broker self-heals
5. Deep Comparison Matrixโ
| Dimension | ZooKeeper Architecture | KRaft Architecture |
|---|---|---|
| External dependency | โ Requires separate ZK cluster | โ Self-contained |
| JVM processes per cluster | N Kafka + M ZK (typically N+3 to N+5) | N Kafka nodes only |
| Metadata storage model | Hierarchical znodes | Append-only event log |
| Consensus protocol | ZAB (ZooKeeper Atomic Broadcast) | Raft |
| Controller failover time | Seconds to minutes (full ZK read + RPC push) | Sub-second (log already in memory) |
| Split-brain prevention | Weak (ZK session timeout โ broker fencing) | โ Strong (Raft epoch/term fencing) |
| Metadata propagation | Push via RPC (Controller โ Brokers) | Pull via log consumption (Brokers โ metadata log) |
| Max practical partitions | ~200,000 | 1,000,000+ (tested) |
| Security model | Dual: ZK SASL + Kafka SASL | Single: Kafka SASL/TLS only |
| Startup time (broker) | Wait for ZK + metadata propagation | Load snapshot + stream log tail |
| Operational tooling | Two systems: kafka-* CLI + zkCli | One system: kafka-* CLI only |
| Snapshot support | โ (ZK state is always full) | โ Periodic snapshots for fast startup |
| Production readiness | Deprecated (removed in Kafka 4.0) | โ GA since Kafka 3.3 |
6. When to Use Each (Migration Context)โ
Since ZooKeeper is deprecated and removed in Kafka 4.0, this is now a migration timeline decision, not an architectural choice between equals.
Migration Urgency Matrixโ
| Kafka Version | ZooKeeper Support | KRaft Status | Action |
|---|---|---|---|
| < 2.8 | Only option | Not available | Plan upgrade path |
| 2.8 โ 3.2 | Available | Early access / preview | Begin migration planning |
| 3.3 โ 3.7 | Available (deprecated) | โ Production-ready | Migrate actively |
| 4.0+ | โ Removed | โ Only option | Must be on KRaft |
Migration Path Decisionโ
Are you on Kafka < 3.3?
โ Upgrade to 3.5+ first (ZK mode still supported)
โ Then migrate to KRaft in-place (rolling migration supported since 3.4)
โ Target: KRaft before your team is forced by a 4.0 upgrade
Is your cluster small (< 5 brokers, dev/staging)?
โ Use Combined Mode: brokers also act as controllers
โ Simplest possible setup, no dedicated controller nodes
Is your cluster production with > 10 brokers or > 10,000 partitions?
โ Use Isolated Mode: dedicated controller nodes (3 or 5)
โ Prevents metadata I/O from competing with producer/consumer throughput
Do you need > 200,000 partitions?
โ KRaft is the only option โ ZK cannot handle this scale
Production Configuration & Tuningโ
KRaft Broker Configuration (server.properties)โ
# โโ Node Identity โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# Unique ID for this node (must be unique across the entire cluster)
node.id=1
# Role: 'broker', 'controller', or 'broker,controller' (combined mode)
process.roles=broker,controller # Combined mode
# process.roles=broker # Isolated mode โ broker only
# process.roles=controller # Isolated mode โ controller only
# โโ KRaft Cluster Identity โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# Must be generated once per cluster: kafka-storage.sh random-uuid
cluster.id=MkU3OEVBNTcwNTJENDM2Qg
# Controller quorum โ all controller node.id:host:port pairs
# Must be the same on ALL nodes in the cluster
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
# โโ Listeners โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# PLAINTEXT: client traffic (brokers only)
# CONTROLLER: controller-to-controller and broker-to-controller (controllers)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://broker1.internal:9092
# Which listener is used for controller communication
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
# โโ Log and Metadata Storage โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
log.dirs=/data/kafka/logs
metadata.log.dir=/data/kafka/metadata # Separate disk recommended for metadata
# โโ KRaft Tuning โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# How often to take a metadata snapshot (in number of records)
# Lower = faster broker startup, more snapshot I/O
metadata.log.max.record.bytes.between.snapshots=20971520 # 20 MB
# How long to retain old snapshots (for lagging brokers to catch up)
metadata.max.retention.bytes=104857600 # 100 MB
# Raft election timeout โ how long to wait before triggering election
# Default 1000ms โ reduce only if you need faster failover
# (lower values increase false-positive elections under load)
controller.quorum.election.timeout.ms=1000
controller.quorum.fetch.timeout.ms=2000
Format and Bootstrap a New KRaft Clusterโ
# Step 1: Generate a unique cluster ID (do this ONCE โ same ID for all nodes)
CLUSTER_ID=$(kafka-storage.sh random-uuid)
echo "Cluster ID: $CLUSTER_ID"
# Output: MkU3OEVBNTcwNTJENDM2Qg
# Step 2: Format storage on EVERY node (uses the same CLUSTER_ID)
kafka-storage.sh format \
--config /etc/kafka/server.properties \
--cluster-id $CLUSTER_ID
# Output: Formatting /data/kafka/logs with metadata.version 3.7-IV4
# Step 3: Start brokers (they self-discover via controller.quorum.voters)
kafka-server-start.sh /etc/kafka/server.properties
Spring Boot Producer/Consumer โ No ZooKeeper Config Neededโ
One of the most visible operational improvements: Spring Boot Kafka configuration becomes simpler with KRaft โ no ZooKeeper connection string required anywhere.
# application.yml โ KRaft cluster (no zookeeper.connect property)
spring:
kafka:
bootstrap-servers: broker1:9092,broker2:9092,broker3:9092
# That's it โ ZK is gone from client configuration entirely
producer:
key-serializer: org.apache.kafka.common.serialization.StringSerializer
value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
acks: all # Wait for all ISR replicas to confirm
retries: 3
properties:
enable.idempotence: true
max.in.flight.requests.per.connection: 5
consumer:
group-id: order-processor
auto-offset-reset: earliest
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
properties:
spring.json.trusted.packages: "com.example.events"
isolation.level: read_committed # Only read committed transactions
// Checking cluster metadata โ works identically on ZK and KRaft
@Service
@RequiredArgsConstructor
public class KafkaClusterInspector {
private final KafkaAdmin kafkaAdmin;
public ClusterInfo inspectCluster() throws Exception {
try (AdminClient adminClient = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
// Describe the cluster
DescribeClusterResult cluster = adminClient.describeCluster();
String clusterId = cluster.clusterId().get();
Node controller = cluster.controller().get();
Collection<Node> nodes = cluster.nodes().get();
log.info("Cluster ID: {}", clusterId);
log.info("Active Controller: node {} at {}:{}",
controller.id(), controller.host(), controller.port());
log.info("Cluster nodes: {}", nodes.size());
// Describe a specific topic
DescribeTopicsResult topics = adminClient.describeTopics(List.of("orders"));
TopicDescription orders = topics.topicNameValues().get("orders").get();
orders.partitions().forEach(p ->
log.info("Partition {}: leader={}, isr={}",
p.partition(), p.leader().id(),
p.isr().stream().map(Node::id).toList())
);
return new ClusterInfo(clusterId, controller.id(), nodes.size());
}
}
// Check if we're running KRaft (useful during migration period)
public boolean isKRaftMode() throws Exception {
try (AdminClient client = AdminClient.create(kafkaAdmin.getConfigurationProperties())) {
// KRaft clusters have a non-null cluster ID from kafka-storage format
// ZK clusters use a ZK-generated cluster ID that may be absent in older versions
String clusterId = client.describeCluster().clusterId().get();
// KRaft cluster IDs are base64-encoded UUIDs (22 chars)
return clusterId != null && clusterId.length() == 22;
}
}
}
Programmatic Topic Managementโ
@Configuration
public class KafkaTopicConfig {
// With KRaft, topic creation is faster and more consistent
// No ZK round-trips โ changes commit via Raft immediately
@Bean
public NewTopic ordersTopic() {
return TopicBuilder.name("orders")
.partitions(12)
.replicas(3)
.config(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(Duration.ofDays(7).toMillis()))
.config(TopicConfig.COMPRESSION_TYPE_CONFIG, "lz4")
.config(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")
.build();
}
@Bean
public NewTopic ordersDlqTopic() {
return TopicBuilder.name("orders.DLQ")
.partitions(12)
.replicas(3)
.config(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(Duration.ofDays(14).toMillis()))
.build();
}
}
Migrating from ZooKeeper to KRaftโ
Migration Overview (Kafka 3.4+ Rolling Migration)โ
Since Kafka 3.4, in-place migration is supported: you can migrate a live ZooKeeper-backed cluster to KRaft without downtime โ brokers keep serving traffic throughout.
Migration Checklistโ
# โโ Pre-migration validation โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# 1. Confirm Kafka version is 3.4+ (rolling migration requires this)
kafka-broker-api-versions.sh --bootstrap-server broker1:9092 | grep "ApiVersions"
# 2. Check current ZK-mode cluster health
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status
# Should show: MetadataVersion, ClusterId, ActiveControllerId
# 3. Verify ZK is healthy
echo "srvr" | nc zookeeper1 2181
# Should return Mode: leader or Mode: follower
# 4. Document current partition count and topic list
kafka-topics.sh --bootstrap-server broker1:9092 --list | wc -l
# โโ Phase 1: Deploy KRaft controllers alongside ZK โโโโโโโโโโโโโโโโโโโโโโโโ
# 5. Format storage on NEW controller nodes (NOT existing brokers yet)
kafka-storage.sh format --config controller.properties --cluster-id $(cat /etc/kafka/cluster.id)
# 6. Start KRaft controllers with migration mode enabled
# controller.properties additions:
# zookeeper.connect=zk1:2181,zk2:2181,zk3:2181 โ still reading from ZK
# zookeeper.metadata.migration.enable=true
# โโ Phase 2: Migrate metadata โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# 7. Trigger metadata migration (atomic โ ZK state โ KRaft log)
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status
# Watch for: MetadataMigrationState: MIGRATION_COMPLETE
# โโ Phase 3: Rolling broker migration โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# 8. Rolling restart brokers with KRaft config
# Add to broker server.properties:
# process.roles=broker
# controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
# Remove: zookeeper.connect
# Restart one broker at a time, verify health between each:
kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic orders
# Confirm: all partitions have leaders, ISR is full
# โโ Phase 4: Decommission ZooKeeper โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# 9. Verify no broker is connected to ZK
echo "dump" | nc zookeeper1 2181 | grep -c "kafka"
# Should return 0
# 10. Stop ZooKeeper nodes โ you are done
๐ง Senior Deep Diveโ
1. KRaft Metadata Snapshots: Preventing Log Growthโ
Without snapshots, the __cluster_metadata log grows indefinitely. A broker that restarts after being down for weeks would need to replay years of log history to reconstruct current state.
KRaft snapshot lifecycle:
1. Active Controller writes records to __cluster_metadata
Offset 0 ... 50,000 records accumulated
2. Snapshot trigger: records since last snapshot > metadata.log.max.record.bytes.between.snapshots
โ Controller serializes full in-memory state to snapshot file
โ Snapshot file: /data/kafka/metadata/__cluster_metadata-0/00000000000050000-0000001234.checkpoint
3. New broker starts:
a. Finds latest snapshot: 00000000000050000-0000001234.checkpoint
b. Loads snapshot into memory (fast: one file read)
c. Starts consuming __cluster_metadata from offset 50,001
d. Applies all records since snapshot โ fully up-to-date
e. Joins the cluster
4. Old snapshots cleaned up based on metadata.max.retention.bytes
(keep enough for lagging brokers to catch up)
# Inspect snapshots on a running controller node
ls -la /data/kafka/metadata/__cluster_metadata-0/
# Output:
# 00000000000000000-0000000001.checkpoint โ initial snapshot
# 00000000000050000-0000001234.checkpoint โ after 50k records
# 00000000000100000-0000002468.checkpoint โ after 100k records
# ...
# View snapshot contents (diagnostic)
kafka-metadata-shell.sh \
--snapshot /data/kafka/metadata/__cluster_metadata-0/00000000000100000-0000002468.checkpoint
# Interactive shell: ls, cat, describe on metadata state at that snapshot
2. The kafka-metadata-quorum.sh Operational Toolkitโ
KRaft ships with dedicated tooling for inspecting the Raft quorum โ something that had no equivalent in ZooKeeper mode.
# Describe quorum status โ who is the leader, what is each node's lag?
kafka-metadata-quorum.sh \
--bootstrap-server broker1:9092 \
describe --status
# Output:
# ClusterId: MkU3OEVBNTcwNTJENDM2Qg
# LeaderId: 1
# LeaderEpoch: 5
# HighWatermark: 102450
# MaxFollowerLag: 3 โ Followers are 3 records behind
# MaxFollowerLagTimeMs: 12 โ 12ms behind the leader
# CurrentVoters: [1,2,3]
# CurrentObservers: [4,5,6] โ Brokers consuming metadata log but not voting
# Describe individual replica states
kafka-metadata-quorum.sh \
--bootstrap-server broker1:9092 \
describe --replication
# Output:
# NodeId LogEndOffset Lag LastFetchTimestamp LastCaughtUpTimestamp Status
# 1 102450 0 1718000100000 1718000100000 Leader
# 2 102447 3 1718000099988 1718000099988 Follower
# 3 102450 0 1718000100000 1718000100000 Follower
# 4 102448 2 1718000099991 1718000099991 Observer (broker)
3. Observability: What to Monitor in KRaftโ
KRaft exposes new JMX metrics that replace ZooKeeper-specific monitoring.
// Micrometer / Spring Boot Actuator โ KRaft-specific metrics to watch
// 1. Controller active status โ alert if 0 (no active controller)
// kafka.controller:type=KafkaController,name=ActiveControllerCount
// Expected: exactly 1 across the cluster
// 2. Metadata log offset lag per broker
// kafka.server:type=MetadataManager,name=MetadataMirrorer$CurrentControllerId
// kafka.server:type=KafkaBroker,name=metadataCurrentOffset
// kafka.server:type=KafkaBroker,name=metadataAppliedOffset
// 3. Controller failover rate โ alert if > 0 in last 5 minutes
// kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
// 4. Raft quorum lag
// kafka.raft:type=raft-metrics,name=current-leader
// kafka.raft:type=raft-metrics,name=high-watermark
# Prometheus alerting rules for KRaft
groups:
- name: kraft_alerts
rules:
# CRITICAL: No active controller
- alert: KafkaNoActiveController
expr: kafka_controller_kafkacontroller_activecontrollercount != 1
for: 30s
labels:
severity: critical
annotations:
summary: "No active Kafka controller โ cluster cannot accept metadata changes"
# WARNING: Controller failover occurred
- alert: KafkaControllerFailover
expr: increase(kafka_controller_controllerstats_leaderelectionrateanddtimems_count[5m]) > 0
labels:
severity: warning
annotations:
summary: "Kafka controller failover detected in the last 5 minutes"
# WARNING: Broker significantly behind on metadata log
- alert: KafkaBrokerMetadataLag
expr: kafka_server_metadatamanager_metadatacurrentoffset - on(instance)
kafka_server_metadatamanager_metadataappliedoffset > 1000
for: 1m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} is {{ $value }} records behind on metadata log"
4. KRaft Limitations and Known Constraintsโ
KRaft is production-ready but has some known limitations as of Kafka 3.x:
| Limitation | Status | Workaround |
|---|---|---|
| JBOD (multiple data dirs per broker) | Not supported in KRaft (KIP-858 pending) | Use single log.dirs per broker or RAID |
| Delegation token migration | Manual re-creation needed during ZK โ KRaft migration | Re-issue delegation tokens post-migration |
| Some legacy tools | kafka-preferred-replica-election.sh deprecated | Use kafka-leader-election.sh instead |
| Cluster linking (Confluent) | Verify version compatibility | Check Confluent Platform KRaft support matrix |
| Mirror Maker 2 with KRaft source | Supported from MM2 3.4+ | Upgrade MM2 before migrating source cluster |
5. KRaft Isolated Mode: Sizing the Controller Quorumโ
For production clusters, how many dedicated controller nodes should you run?
Raft quorum fault tolerance formula:
Can tolerate F failures with 2F+1 nodes
3 controllers โ tolerate 1 failure (minimum for production)
5 controllers โ tolerate 2 simultaneous failures (for critical clusters)
7 controllers โ tolerate 3 simultaneous failures (rare; adds latency)
Controller write latency:
Write commits when majority (F+1) nodes acknowledge
3-node quorum: need 2 ACKs (1 network hop)
5-node quorum: need 3 ACKs (potentially 2 network hops to the slowest)
โ 5-node quorum adds ~20-30% more latency on metadata writes
โ Only use 5 nodes if the 2-failure tolerance is genuinely required
Controller node sizing:
CPU: 4 cores (metadata operations are not CPU-intensive)
RAM: 16โ32 GB (metadata snapshot + log in memory)
Disk: Fast SSD for metadata.log.dir (separate from broker data disks)
IOPS: 3,000+ IOPS for metadata disk (snapshot writes are periodic bursts)
๐ฏ Interview Decision Matrixโ
| Question | Answer |
|---|---|
| Why did Kafka deprecate ZooKeeper? | Scalability ceiling (~200k partitions), minutes-long controller failovers, split-brain risk, operational complexity of running two distributed systems |
| How does KRaft prevent split-brain? | Raft epoch/term fencing: any controller with an outdated term is rejected by brokers. Only the current Raft leader can commit metadata changes |
| How do brokers get metadata updates in KRaft? | They consume __cluster_metadata topic as standard Kafka consumers โ pull-based, not push-based |
| Why is KRaft failover sub-second? | Standby controllers continuously consume the metadata log; their in-memory state is already current. No catch-up reads needed on election |
| What is a KRaft snapshot? | A point-in-time serialization of the full metadata state, enabling new/restarting brokers to bootstrap without replaying the full log history |
| Combined vs. Isolated mode? | Combined for dev/small clusters (simpler). Isolated for production (isolates metadata I/O from client traffic, independent scaling) |
| Can I migrate live without downtime? | Yes, since Kafka 3.4 โ rolling migration keeps ZK active until all brokers have switched to the KRaft controllers |
"Kafka moved away from ZooKeeper to solve three compounding problems. First, scalability: ZK-backed clusters are limited to ~200k partitions because the controller must push state changes to every broker via RPC โ this saturates at scale. Second, failover latency: when a ZK-mode controller fails, the new controller must read the entire cluster state from ZooKeeper before it can function โ minutes of metadata freeze on large clusters. Third, operational complexity: two separate distributed systems, two JVM clusters, two security models, two monitoring stacks. KRaft solves all three: metadata is an internal Kafka log, standbys are always current, and there's only one system to operate."
"KRaft stores cluster metadata as records in an internal topic called __cluster_metadata, replicated using the Raft consensus protocol across a quorum of controller nodes. Brokers are consumers of this topic โ they continuously stream metadata changes and update their in-memory state. When the active controller fails, a standby is elected in sub-second time because it already has the full state in memory. There's no external system to consult. The Raft epoch/term mechanism ensures that any controller with a stale term is immediately fenced โ eliminating the split-brain risk that existed in ZooKeeper mode where the ZK state and in-broker state could diverge."
๐ Further Readingโ
- KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum โ The original design proposal; explains every motivation and design decision from the Kafka team.
- Kafka 3.7 KRaft Migration Guide โ Official step-by-step migration documentation including rollback procedures.
- In Search of an Understandable Consensus Algorithm (Raft Paper) โ Diego Ongaro's original Raft paper; the algorithm underpinning KRaft's controller quorum.
- The Raft Consensus Algorithm โ Interactive Visualization โ Visual walkthrough of leader election and log replication; build intuition before reading the paper.
- KIP-630: Kafka Raft Snapshot โ The design proposal for metadata snapshots; explains why they exist and how they interact with log retention.
- Confluent โ Running Kafka in KRaft Mode โ Confluent's production deployment guide with Confluent Platform-specific KRaft configuration.
- Apache Kafka Documentation โ KRaft Mode โ Official Kafka docs; covers
server.propertiesconfiguration, storage format, and operational commands. - kafka-metadata-shell.sh โ The interactive metadata snapshot inspector tool; invaluable for debugging KRaft state.