Amazon Kinesis
Core concept: Kinesis handles real-time streaming data such as logs, metrics, IoT events, and clickstreams.
Kinesis Services Comparison
| Service | Purpose | Retention | Consumers |
|---|---|---|---|
| Data Streams | Real-time stream processing | 1–365 days | Custom (Lambda, KCL, SDK) |
| Data Firehose | Load streaming data to destinations | No retention | Managed destinations only |
| Data Analytics | SQL / Apache Flink on streams | N/A | Output to streams/destinations |
Kinesis Data Streams
Shards
- 1 shard = 1 MB/s or 1,000 records/s write, 2 MB/s read
- Add shards to scale (Shard Splitting)
- Remove shards to reduce cost (Shard Merging)
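The per-shard limits above translate directly into a sizing calculation. The following is a minimal sketch, assuming a provisioned stream and a hypothetical helper class `ShardSizing` (not part of any AWS SDK); `readMBps` is taken to be the aggregate read rate of standard consumers.

```java
// Hypothetical helper: estimates how many shards a stream needs
// from the per-shard limits (1 MB/s or 1,000 records/s write, 2 MB/s read).
final class ShardSizing {
    static final double WRITE_MB_PER_SHARD = 1.0;
    static final double READ_MB_PER_SHARD = 2.0;
    static final double RECORDS_PER_SHARD = 1_000.0;

    // Returns the smallest shard count that satisfies all three limits.
    static int requiredShards(double writeMBps, double recordsPerSec, double readMBps) {
        int byWriteBytes = (int) Math.ceil(writeMBps / WRITE_MB_PER_SHARD);
        int byRecords = (int) Math.ceil(recordsPerSec / RECORDS_PER_SHARD);
        int byReadBytes = (int) Math.ceil(readMBps / READ_MB_PER_SHARD);
        // A stream always has at least one shard.
        return Math.max(1, Math.max(byWriteBytes, Math.max(byRecords, byReadBytes)));
    }

    public static void main(String[] args) {
        // 5 MB/s written, 3,000 records/s, 8 MB/s read in total
        System.out.println(requiredShards(5.0, 3_000, 8.0)); // prints 5
    }
}
```

In this example the write bandwidth is the binding constraint; a workload of many tiny records would instead be limited by the records-per-second figure.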
Partition Keys
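import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

// Records with the same partition key go to the same shard (ordered within that shard)
PutRecordRequest request = PutRecordRequest.builder()
    .streamName("clickstream")
    .data(SdkBytes.fromUtf8String(jsonData))
    .partitionKey(userId) // Same userId -> same shard -> ordered
    .build();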
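Why the same key always lands on the same shard: Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a contiguous range of that hash space. The sketch below illustrates the idea under a simplifying assumption, namely that the hash space is split evenly across `shardCount` shards; a real stream's ranges depend on its split and merge history. The class `PartitionKeyRouting` is hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: maps a partition key to a shard index,
// assuming evenly sized hash key ranges.
final class PartitionKeyRouting {
    static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128); // 2^128

    // MD5 of the key, interpreted as an unsigned 128-bit integer.
    static BigInteger hashKey(String partitionKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
    }

    // Index of the (assumed evenly sized) range that contains the key's hash.
    static int shardIndexFor(String partitionKey, int shardCount) {
        BigInteger rangeSize = HASH_SPACE.divide(BigInteger.valueOf(shardCount));
        int index = hashKey(partitionKey).divide(rangeSize).intValue();
        // The last shard absorbs the remainder when 2^128 is not evenly divisible.
        return Math.min(index, shardCount - 1);
    }

    public static void main(String[] args) {
        // The same key always yields the same index.
        System.out.println(shardIndexFor("user-42", 4) == shardIndexFor("user-42", 4)); // prints true
    }
}
```

A practical consequence: a low-cardinality partition key (for example, a handful of tenant IDs) concentrates traffic on a few shards, so adding shards does not help until the key is more evenly distributed.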
Consumers
| Type | Read throughput | Description |
|---|---|---|
| Standard | 2 MB/s per shard, shared across all consumers | Pull-based, cheaper |
| Enhanced Fan-Out | 2 MB/s per consumer per shard | Push-based, dedicated throughput |
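The difference between the two consumer types is easiest to see as arithmetic. This is a rough sketch with a hypothetical class `ConsumerThroughput`; it assumes the standard consumers split the shard's read capacity evenly, which real polling patterns only approximate.

```java
// Illustrative arithmetic for per-consumer read throughput on one shard.
final class ConsumerThroughput {
    static final double SHARD_READ_MB_PER_SEC = 2.0;

    // Standard consumers share the shard's 2 MB/s (assumed to be split evenly).
    static double standardPerConsumer(int consumers) {
        requireAtLeastOne(consumers);
        return SHARD_READ_MB_PER_SEC / consumers;
    }

    // Enhanced fan-out gives every registered consumer its own 2 MB/s per shard.
    static double enhancedFanOutPerConsumer(int consumers) {
        requireAtLeastOne(consumers);
        return SHARD_READ_MB_PER_SEC;
    }

    private static void requireAtLeastOne(int consumers) {
        if (consumers < 1) {
            throw new IllegalArgumentException("consumers must be >= 1");
        }
    }

    public static void main(String[] args) {
        System.out.println(standardPerConsumer(4));       // prints 0.5
        System.out.println(enhancedFanOutPerConsumer(4)); // prints 2.0
    }
}
```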
Lambda Event Source Mapping (ESM) for Kinesis
- Lambda polls shards automatically
- Bisect on error: splits a failed batch in two and retries each half, isolating the bad record
- Tumbling windows: aggregate records over a time window
- Iterator position: TRIM_HORIZON (from the beginning) or LATEST (new records only)
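The bisect-on-error behavior can be pictured as a recursive halving of the failed batch. The sketch below is a local simulation, not Lambda's implementation: the class `BisectOnError` and the `handler` predicate (returns `true` when a batch processes successfully) are assumptions for illustration, and retry limits and record-age limits are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative simulation of "bisect batch on error".
final class BisectOnError {
    // Returns the records that still fail when processed on their own.
    static <T> List<T> process(List<T> batch, Predicate<List<T>> handler) {
        List<T> isolated = new ArrayList<>();
        bisect(batch, handler, isolated);
        return isolated;
    }

    private static <T> void bisect(List<T> batch, Predicate<List<T>> handler, List<T> isolated) {
        if (batch.isEmpty() || handler.test(batch)) {
            return; // nothing to do, or the whole batch succeeded
        }
        if (batch.size() == 1) {
            isolated.add(batch.get(0)); // a single record that keeps failing
            return;
        }
        int mid = batch.size() / 2;
        bisect(batch.subList(0, mid), handler, isolated);            // retry first half
        bisect(batch.subList(mid, batch.size()), handler, isolated); // retry second half
    }

    public static void main(String[] args) {
        List<String> batch = List.of("a", "b", "BAD", "d", "e");
        // The hypothetical handler fails any batch that contains "BAD".
        System.out.println(process(batch, b -> !b.contains("BAD"))); // prints [BAD]
    }
}
```

The good records around the poison record are processed in the retried halves, so one malformed event no longer blocks the entire batch.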
Kinesis vs SQS
| Aspect | Kinesis Data Streams | SQS |
|---|---|---|
| Order | Per-shard ordering | FIFO queues only |
| Multiple consumers | ✅ All consumers read the same stream | ❌ One consumer per message |
| Replay | ✅ Up to retention period | ❌ Message deleted after processing |
| Real-time | ✅ Sub-second | Near real-time |
| Provisioning | Manual (shards) | Automatic |
| Use case | Analytics, metrics, logs | Task queues, job processing |
- "Multiple applications consume same data" → Kinesis
- "Replay data from the past" → Kinesis
- "Job processing, decoupling" → SQS
- "Exactly-once, ordering required, no replay" → SQS FIFO
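The keyword rules above can be read as a small decision procedure. The following sketch encodes them literally; the class `StreamOrQueue` and its parameter names are hypothetical, and real architecture choices involve more factors (cost, throughput, latency) than these three flags.

```java
// Illustrative encoding of the rules of thumb for exam-style scenarios.
final class StreamOrQueue {
    enum Service { KINESIS_DATA_STREAMS, SQS_STANDARD, SQS_FIFO }

    static Service choose(boolean multipleAppsReadSameData,
                          boolean needsReplay,
                          boolean needsOrderingOrExactlyOnce) {
        // Shared reads or replay both point to a stream with retention.
        if (multipleAppsReadSameData || needsReplay) {
            return Service.KINESIS_DATA_STREAMS;
        }
        // Ordering or exactly-once processing without replay points to a FIFO queue.
        if (needsOrderingOrExactlyOnce) {
            return Service.SQS_FIFO;
        }
        // Plain job processing and decoupling.
        return Service.SQS_STANDARD;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false));   // prints KINESIS_DATA_STREAMS
        System.out.println(choose(false, false, true));  // prints SQS_FIFO
        System.out.println(choose(false, false, false)); // prints SQS_STANDARD
    }
}
```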
Kinesis Data Firehose
- Fully managed: no shards to manage
- Destinations: S3, Redshift, OpenSearch, Splunk, HTTP endpoints
- Buffering: by size (1–128 MB) or time (60–900 seconds)
- Can transform data with Lambda before delivery
- Near-real-time, not real-time: buffering adds delivery delay
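The buffering rule (deliver when either the size or the time threshold is reached, whichever comes first) explains the delivery delay. The sketch below simulates that rule locally; it is not Firehose's implementation. The class `BufferedDelivery` is hypothetical, the clock is passed in explicitly to keep the example deterministic, and the timer is assumed to start at the first buffered record.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: size-or-time buffering, whichever threshold is hit first.
final class BufferedDelivery {
    private final long maxBytes;
    private final long maxAgeMillis;
    private final List<Integer> flushedBatchSizes = new ArrayList<>();
    private int bufferedRecords = 0;
    private long bufferedBytes = 0;
    private long oldestRecordAtMillis = 0;

    BufferedDelivery(long maxBytes, long maxAgeMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Buffers one record; flushes immediately if the size threshold is reached.
    void add(int recordBytes, long nowMillis) {
        if (bufferedRecords == 0) {
            oldestRecordAtMillis = nowMillis; // the time window starts with the first record
        }
        bufferedRecords++;
        bufferedBytes += recordBytes;
        if (bufferedBytes >= maxBytes) {
            flush();
        }
    }

    // Called periodically; flushes if the oldest buffered record has waited long enough.
    void tick(long nowMillis) {
        if (bufferedRecords > 0 && nowMillis - oldestRecordAtMillis >= maxAgeMillis) {
            flush();
        }
    }

    private void flush() {
        flushedBatchSizes.add(bufferedRecords); // stand-in for one delivery to the destination
        bufferedRecords = 0;
        bufferedBytes = 0;
    }

    // Number of records in each delivered batch, in delivery order.
    List<Integer> flushedBatchSizes() {
        return List.copyOf(flushedBatchSizes);
    }

    public static void main(String[] args) {
        BufferedDelivery delivery = new BufferedDelivery(10, 1_000); // 10 bytes or 1 second
        delivery.add(4, 0);
        delivery.add(4, 100);
        delivery.add(4, 200);  // 12 bytes buffered: size threshold triggers a flush
        delivery.add(1, 300);
        delivery.tick(1_300);  // 1 second after the record at t=300: time threshold triggers a flush
        System.out.println(delivery.flushedBatchSizes()); // prints [3, 1]
    }
}
```

With low traffic the time threshold dominates, so a record can wait for the full buffer interval before it is delivered; this is why the service is described as near-real-time.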
🧪 Practice Questions
Q1. An IoT platform generates 5 MB/s of sensor data. Multiple analytics applications need to read the same data simultaneously and replay data from the last 7 days. What service fits best?
A) SQS Standard
B) SQS FIFO
C) Kinesis Data Streams
D) SNS
✅ Answer & Explanation
C. Kinesis Data Streams lets multiple consumers read the same data independently, and its retention (configurable up to 365 days) covers the 7-day replay requirement. SQS delivers each message to one consumer and deletes it after processing, so it offers neither shared reads nor replay.
Q2. You need to load streaming clickstream data into S3 every 5 minutes for batch analytics. Which service requires the least operational overhead?
A) Kinesis Data Streams + custom Lambda
B) Kinesis Data Firehose
C) SQS + Lambda
D) Kinesis Data Analytics
✅ Answer & Explanation
B. Kinesis Data Firehose is fully managed: it buffers records and delivers them to S3 natively, with no shards or consumers to manage. A 5-minute delivery cadence corresponds to a 300-second buffer interval, which falls within the 60–900 second range.