S3 Advanced
Senior-level topics: Replication architecture, Object Lambda for on-the-fly transformations, S3 Batch Operations, and event-driven architectures.
Replication
Cross-Region Replication (CRR) vs Same-Region Replication (SRR)
| Feature | CRR | SRR |
|---|---|---|
| Regions | Source → different Region | Source → same Region |
| Use case | DR, compliance, latency reduction | Log aggregation, dev/prod copies |
| Versioning | Required on both buckets | Required on both buckets |
| Existing objects | Not replicated (use S3 Batch Replication) | Same |
| Delete markers | Not replicated by default (opt-in) | Not replicated by default (opt-in) |
| Chaining | No: A→B→C is not supported (configure A→B and A→C separately) | Same |
Replication Configuration
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "ReplicateAll",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": { "Prefix": "" },
    "SourceSelectionCriteria": {
      "SseKmsEncryptedObjects": { "Status": "Enabled" }
    },
    "Destination": {
      "Bucket": "arn:aws:s3:::dest-bucket",
      "StorageClass": "STANDARD_IA",
      "EncryptionConfiguration": {
        "ReplicaKmsKeyID": "arn:aws:kms:eu-west-1:123456789012:key/dest-key-id"
      },
      "ReplicationTime": {
        "Status": "Enabled",
        "Time": { "Minutes": 15 }
      },
      "Metrics": { "Status": "Enabled" }
    },
    "DeleteMarkerReplication": { "Status": "Enabled" }
  }]
}
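A configuration like the one above can also be assembled in code, which makes the interdependent fields easier to keep consistent. The following is a minimal sketch, not a definitive implementation: `build_replication_config` is a hypothetical helper, and the ARNs are placeholders. It assumes the filter-based rule schema, where `Priority` and `DeleteMarkerReplication` accompany `Filter`, and where replicating SSE-KMS objects needs both `SourceSelectionCriteria` and a destination key.

```python
import json


def build_replication_config(role_arn, dest_bucket_arn, kms_key_arn=None, rtc=False):
    """Return a replication configuration as a plain dict (hypothetical helper)."""
    destination = {"Bucket": dest_bucket_arn}
    rule = {
        "ID": "ReplicateAll",
        "Status": "Enabled",
        "Priority": 1,                                   # accompanies Filter in this schema
        "Filter": {"Prefix": ""},                        # empty prefix = every object
        "DeleteMarkerReplication": {"Status": "Disabled"},  # the default behaviour
        "Destination": destination,
    }
    if kms_key_arn:
        # Opt in to replicating SSE-KMS objects and name the key used for replicas.
        rule["SourceSelectionCriteria"] = {
            "SseKmsEncryptedObjects": {"Status": "Enabled"}
        }
        destination["EncryptionConfiguration"] = {"ReplicaKmsKeyID": kms_key_arn}
    if rtc:
        # Replication Time Control: the 15-minute target plus replication metrics.
        destination["ReplicationTime"] = {"Status": "Enabled", "Time": {"Minutes": 15}}
        destination["Metrics"] = {"Status": "Enabled"}
    return {"Role": role_arn, "Rules": [rule]}


config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/s3-replication-role",
    dest_bucket_arn="arn:aws:s3:::dest-bucket",
    kms_key_arn="arn:aws:kms:eu-west-1:123456789012:key/dest-key-id",
    rtc=True,
)
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, the resulting dict could be passed as `ReplicationConfiguration` to `put_bucket_replication` on an S3 client.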
S3 Replication Time Control (RTC)
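- Designed to replicate 99.99% of new objects within 15 minutes, backed by an SLA (most objects replicate within seconds)
- Provides CloudWatch metrics and S3 events for monitoring
- Costs extra, but matters for compliance/DR requirements

General replication caveats (these apply with or without RTC):
- No chaining: if Bucket A → B and B → C, changes from A do NOT automatically reach C
- Delete markers are not replicated by default; enable explicitly
- Existing objects are not replicated; use S3 Batch Replication
- SSE-KMS encrypted objects are replicated only if you opt in explicitly; SSE-C objects were historically NOT replicated, though current S3 documentation lists them as supported
- Lifecycle rules are NOT replicated (must configure separately)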
S3 Transfer Acceleration
Without acceleration:
Client (Australia) ──── public internet ────→ S3 (us-east-1)
Latency: ~200ms
With acceleration:
Client (Australia) → Edge (Sydney) ──── AWS backbone ────→ S3 (us-east-1)
Latency: ~80ms
| Property | Details |
|---|---|
| How it works | Uses CloudFront edge network as entry point |
| Best for | Long-distance uploads (cross-continent) |
| Endpoint | bucket.s3-accelerate.amazonaws.com |
| Cost | Additional per-GB transfer fee |
| NOT useful | Same-region clients, small files |
# Enable Transfer Acceleration on the bucket (required before the accelerate endpoint works)
aws s3api put-bucket-accelerate-configuration \
--bucket my-bucket \
--accelerate-configuration Status=Enabled
# Speed comparison tool
# https://s3-accelerate-speedtest.s3-accelerate.amazonaws.com
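Once acceleration is enabled, clients have to target the accelerate endpoint from the table above. As a rough sketch, the hypothetical helper below derives that endpoint from a bucket name. It assumes the usual constraint that accelerated buckets need DNS-compliant names without periods, and the length check is a simplification of the full bucket naming rules.

```python
import re

# Simplified bucket-name check: 3-63 chars, lowercase letters, digits, hyphens.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def accelerate_endpoint(bucket: str) -> str:
    """Return the Transfer Acceleration endpoint URL for a bucket (hypothetical helper)."""
    if "." in bucket:
        raise ValueError("Transfer Acceleration does not support bucket names with periods")
    if not _NAME_RE.fullmatch(bucket):
        raise ValueError(f"not a DNS-compliant bucket name: {bucket!r}")
    return f"https://{bucket}.s3-accelerate.amazonaws.com"


print(accelerate_endpoint("my-bucket"))
```

In boto3, a client can be pointed at this endpoint through the botocore setting `Config(s3={"use_accelerate_endpoint": True})` rather than by building URLs by hand.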
S3 Select & Glacier Select
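Query data inside objects using SQL without downloading the entire file. Note that AWS closed S3 Select to new customers in 2024; existing customers can continue to use it.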
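// Requires AWS SDK for Java v2: import software.amazon.awssdk.services.s3.model.*;
// Filter a CSV file server-side; only matching rows are returned
SelectObjectContentRequest request = SelectObjectContentRequest.builder()
    .bucket("data-lake")
    .key("orders/2024-01.csv.gz")
    .expressionType(ExpressionType.SQL)
    // Columns are qualified with the alias "s"; CSV values are strings, so CAST before comparing numerically
    .expression("SELECT s.orderId, s.amount FROM S3Object s WHERE s.status = 'FAILED' AND CAST(s.amount AS DECIMAL) > 1000")
    .inputSerialization(InputSerialization.builder()
        .csv(CSVInput.builder().fileHeaderInfo(FileHeaderInfo.USE).build()) // first row holds column names
        .compressionType(CompressionType.GZIP) // GZIP and BZIP2 are supported for CSV and JSON
        .build())
    .outputSerialization(OutputSerialization.builder()
        .json(JSONOutput.builder().build()) // emit each matching row as a JSON record
        .build())
    .build();
// Send with S3AsyncClient.selectObjectContent(request, responseHandler)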
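The result of a select request arrives as an event stream, not a single body, and a JSON record may be split across two `Records` events. The sketch below shows one way to reassemble rows in Python. It assumes the event shape that boto3 exposes for `select_object_content` (dicts keyed by `Records`, `Stats`, `End`), and `fake_stream` is made-up data standing in for a real response payload.

```python
import json


def collect_select_rows(event_stream):
    """Join all 'Records' payload chunks, then parse newline-delimited JSON rows."""
    # Concatenate first: a single JSON record can straddle two chunks.
    buffer = b"".join(
        event["Records"]["Payload"] for event in event_stream if "Records" in event
    )
    return [json.loads(line) for line in buffer.decode("utf-8").splitlines() if line.strip()]


# Simulated stream (illustrative data only); note the record split across chunks.
fake_stream = [
    {"Records": {"Payload": b'{"orderId":"A1","amo'}},
    {"Records": {"Payload": b'unt":"1500"}\n{"orderId":"B2","amount":"2000"}\n'}},
    {"Stats": {"Details": {"BytesScanned": 100, "BytesProcessed": 100, "BytesReturned": 60}}},
    {"End": {}},
]

rows = collect_select_rows(fake_stream)
print(rows)
```

Values come back as strings because the source is CSV; any numeric handling on the client side needs its own conversion.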
S3 Select vs Athena
| Feature | S3 Select | Athena |
|---|---|---|
| Scope | Single object | Multiple objects, partitioned data |
| Query | Simple SQL (SELECT, WHERE) | Full SQL (JOINs, GROUP BY, window functions) |
| Format | CSV, JSON, Parquet | CSV, JSON, Parquet, ORC, Avro |
| Use case | Quick filter on one file | Data lake analytics |
| Cost | Per data scanned/returned | Per data scanned |
S3 Object Lambda
Transform objects on the fly during GET requests:
Client GET → S3 Object Lambda Access Point → Lambda function → Transformed response
                                                  ↓
                                   S3 Supporting Access Point (original object)
Use Cases
- Redact PII: remove SSNs and email addresses from CSV/JSON before returning
- Resize images: return thumbnails without storing them
- Convert formats: XML → JSON on the fly
- Add watermarks: overlay a watermark on images
- Decompress: return decompressed data
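To make the PII-redaction use case concrete, here is a minimal sketch of an Object Lambda function in Python. The regular expressions are deliberately simple illustrations, not production-grade PII detection, and `redact_csv` and `handler` are hypothetical names. The handler relies on the `getObjectContext` fields (`inputS3Url`, `outputRoute`, `outputToken`) that S3 Object Lambda passes in the event, and on `write_get_object_response` to return the transformed body.

```python
import csv
import io
import re

# Illustrative patterns only; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[^@\s,]+@[^@\s,]+\.[^@\s,]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_csv(text: str) -> str:
    """Mask email addresses and US-style SSNs in every cell of a CSV document."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow(
            [SSN_RE.sub("***-**-****", EMAIL_RE.sub("[redacted]", cell)) for cell in row]
        )
    return out.getvalue()


def handler(event, context):
    """Lambda entry point: fetch the original object, redact it, return it to the caller."""
    import urllib.request
    import boto3  # provided by the Lambda Python runtime

    ctx = event["getObjectContext"]
    # inputS3Url is a presigned URL for the original object behind the supporting access point.
    with urllib.request.urlopen(ctx["inputS3Url"]) as resp:
        original = resp.read().decode("utf-8")
    boto3.client("s3").write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body=redact_csv(original).encode("utf-8"),
    )
    return {"status_code": 200}


print(redact_csv("name,email,ssn\nAna,ana@example.com,123-45-6789\n"))
```

Keeping the transformation in a pure function separate from the handler makes it testable without any AWS access.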
Setup with CloudFormation
Resources:
  SupportingAccessPoint:
    Type: AWS::S3::AccessPoint
    Properties:
      Bucket: !Ref DataBucket
      Name: supporting-ap
  ObjectLambdaAccessPoint:
    Type: AWS::S3ObjectLambda::AccessPoint
    Properties:
      Name: pii-redaction-ap
      ObjectLambdaConfiguration:
        SupportingAccessPoint: !GetAtt SupportingAccessPoint.Arn
        TransformationConfigurations:
          - Actions: [GetObject]
            ContentTransformation:
              AwsLambda:
                FunctionArn: !GetAtt RedactFunction.Arn
S3 Batch Operations
Run large-scale operations on billions of objects:
S3 Inventory Report (source list)
        ↓
   S3 Batch Job
        ↓
Operations: Copy, Invoke Lambda, Restore from Glacier,
            Replace tags, Replace ACLs, Object Lock
| Feature | Details |
|---|---|
| Input | S3 Inventory report or CSV manifest |
| Retry | Automatic retry of failed operations |
| Tracking | Job progress, completion reports |
| Use cases | Batch replication, bulk encryption, mass tagging |
# Create batch job to copy objects to another bucket
aws s3control create-job \
--account-id 123456789012 \
--operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::dest-bucket"}}' \
--manifest '{"Spec": {"Format": "S3InventoryReport_CSV_20161130"}, "Location": {"ObjectArn": "arn:aws:s3:::source-bucket/inventory/manifest.json", "ETag": "abc123"}}' \
--report '{"Bucket": "arn:aws:s3:::report-bucket", "Prefix": "batch-reports/", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
--role-arn arn:aws:iam::123456789012:role/batch-operations-role \
--priority 10
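When no inventory report is available, the job can take a plain CSV manifest instead (manifest format `S3BatchOperations_CSV_20180820` with fields `Bucket,Key`). The sketch below generates such a manifest. `build_manifest` is a hypothetical helper, the bucket and keys are made up, and it assumes object keys should be URL-encoded in the manifest.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(bucket, keys):
    """Return CSV manifest text with one 'bucket,key' line per object."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for key in keys:
        # URL-encode the key but keep "/" so prefixes stay readable.
        writer.writerow([bucket, quote(key, safe="/")])
    return out.getvalue()


manifest = build_manifest(
    "source-bucket",
    ["logs/2024/a.log", "reports/q1 summary.csv"],
)
print(manifest)
```

The resulting file would be uploaded to S3 and referenced from `--manifest` with `"Fields": ["Bucket", "Key"]` in the spec, along with the uploaded object's ETag.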
MFA Delete
Adds an extra layer of protection for versioned buckets:
| Action | MFA Required? |
|---|---|
| Permanently delete a specific version | Yes |
| Suspend versioning | Yes |
| Enable versioning | No |
| List versions | No |
| Add delete marker | No |
- Only the root account can enable/disable MFA Delete
- Must use CLI or API (cannot configure via console)
- Requires versioning to be enabled
Lifecycle Rules
Example: Cost-Optimized Log Retention
{
  "Rules": [{
    "ID": "LogRetentionPolicy",
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER_IR" },
      { "Days": 180, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "NoncurrentVersionTransitions": [
      { "NoncurrentDays": 30, "StorageClass": "GLACIER" }
    ],
    "NoncurrentVersionExpiration": { "NoncurrentDays": 90 },
    "Expiration": { "Days": 2555 },
    "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
  }]
}
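To see what the rule above means for a single current-version object, the following sketch maps an object's age to the storage class the policy targets. It is a simplified model: `storage_class_at` is a hypothetical helper, only the `Transitions` and `Expiration` entries from the example are modelled, and real lifecycle actions run asynchronously and may lag the configured day.

```python
# (days, storage class) pairs copied from the example rule above.
TRANSITIONS = [
    (30, "STANDARD_IA"),
    (90, "GLACIER_IR"),
    (180, "GLACIER"),
    (365, "DEEP_ARCHIVE"),
]
EXPIRATION_DAYS = 2555  # roughly 7 years


def storage_class_at(age_days, transitions=TRANSITIONS, expiration_days=EXPIRATION_DAYS):
    """Return the targeted storage class at a given age, or None once expired."""
    if age_days >= expiration_days:
        return None
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


for age in (10, 45, 200, 400, 3000):
    print(age, storage_class_at(age))
```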
Transition Constraints
Standard → Standard-IA (object must be at least 30 days old)
Standard → Glacier Instant Retrieval (no minimum age; 90-day minimum storage charge after the transition)
Standard-IA → Glacier (allowed)
One Zone-IA → Glacier (allowed)
Any → Deep Archive (allowed)
Not allowed: "backwards" transitions (e.g. Deep Archive → Standard)
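These constraints can be checked before a lifecycle configuration is submitted. The sketch below is a simplified validator under stated assumptions: `validate_transitions` is a hypothetical helper, the ordering covers only the storage classes used in this section (Intelligent-Tiering and One Zone-IA have extra rules and are left out), and it checks just two things, forward-only movement and the 30-day minimum age before Standard-IA.

```python
# Simplified "waterfall" order for the classes used in this section only.
ORDER = ["STANDARD", "STANDARD_IA", "GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"]


def validate_transitions(transitions):
    """Check a list of (days, storage_class) pairs; return a list of problem strings."""
    problems = []
    prev_rank = 0  # objects start in STANDARD
    for days, storage_class in sorted(transitions):
        rank = ORDER.index(storage_class)  # raises ValueError for classes not modelled here
        if rank <= prev_rank:
            problems.append(f"day {days}: {storage_class} is not a forward transition")
        if storage_class == "STANDARD_IA" and days < 30:
            problems.append(f"day {days}: STANDARD_IA requires an age of at least 30 days")
        prev_rank = rank
    return problems


print(validate_transitions([(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]))
print(validate_transitions([(30, "GLACIER"), (60, "STANDARD_IA")]))
```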
S3 Object Lock & Legal Hold
Retention Modes
| Mode | Description |
|---|---|
| Governance | Users with s3:BypassGovernanceRetention can override/delete |
| Compliance | NO ONE can delete or override, not even the root account |
Legal Hold
- Independent of any retention period; applied and removed manually
- When active, the object version cannot be deleted regardless of retention settings
- Requires the s3:PutObjectLegalHold permission
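The interaction between retention mode, retention date, and legal hold is easy to get wrong, so here is a small decision model of the rules described above. It is an illustrative sketch, not S3's actual implementation: `LockState` and `can_delete_version` are hypothetical names, and `has_bypass_permission` stands in for a caller that both holds `s3:BypassGovernanceRetention` and explicitly requests the bypass.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LockState:
    mode: Optional[str] = None               # "GOVERNANCE", "COMPLIANCE", or None
    retain_until: Optional[datetime] = None  # end of the retention period
    legal_hold: bool = False


def can_delete_version(state: LockState, now: datetime, has_bypass_permission: bool = False) -> bool:
    """Return True if permanently deleting this object version would be allowed."""
    if state.legal_hold:
        return False  # a legal hold blocks deletion regardless of retention
    if state.mode is None or state.retain_until is None or now >= state.retain_until:
        return True   # no retention set, or the period has ended
    if state.mode == "GOVERNANCE":
        return has_bypass_permission
    return False      # COMPLIANCE: nobody can delete before retain_until


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
until = datetime(2030, 1, 1, tzinfo=timezone.utc)
print(can_delete_version(LockState("GOVERNANCE", until), now, has_bypass_permission=True))
print(can_delete_version(LockState("COMPLIANCE", until), now, has_bypass_permission=True))
```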
Requester Pays
- The requester pays for data transfer and request costs (not the bucket owner)
- Use case: Sharing large public datasets (genomics, satellite imagery)
- Requester must be an authenticated AWS user (no anonymous access)
Event-Driven Architecture Patterns
Pattern 1: Image Processing Pipeline
User uploads → S3 (PutObject) → Lambda (resize + generate thumbnails)
                                   → S3 (thumbnails/)
                                   → DynamoDB (metadata)
Pattern 2: Fan-Out with SNS
S3 (PutObject) → SNS Topic → SQS Queue 1 (process A)
                           → SQS Queue 2 (process B)
                           → Lambda (process C)
Pattern 3: EventBridge for Complex Routing
S3 (PutObject) → EventBridge → Rule 1: if prefix="orders/" → Lambda A
                             → Rule 2: if suffix=".pdf" → Step Functions
                             → Rule 3: if size > 100MB → SQS
                             → Archive (replay later; retention period is configurable)
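On the consumer side, all three patterns start by pulling the bucket and key out of the event. The sketch below parses a classic S3 event notification (the `Records` shape delivered to Lambda) and then mimics the Pattern 3 routing rules locally. Two points are assumptions to note: object keys in notifications arrive URL-encoded, hence `unquote_plus`, and the target names returned by `matching_targets` are hypothetical labels, not real resources.

```python
from urllib.parse import unquote_plus

HUNDRED_MB = 100 * 1024 * 1024


def extract_objects(event):
    """Return (bucket, key, size) for each ObjectCreated record in an S3 notification."""
    objects = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        key = unquote_plus(s3["object"]["key"])  # keys are URL-encoded in notifications
        objects.append((s3["bucket"]["name"], key, s3["object"].get("size", 0)))
    return objects


def matching_targets(key, size):
    """Mimic the Pattern 3 rules; every matching rule fires, as with independent rules."""
    targets = []
    if key.startswith("orders/"):
        targets.append("lambda-a")
    if key.endswith(".pdf"):
        targets.append("step-functions")
    if size > HUNDRED_MB:
        targets.append("sqs")
    return targets


sample_event = {
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "uploads"},
            "object": {"key": "orders/2024/my+file%281%29.pdf", "size": 2048},
        },
    }]
}

for bucket, key, size in extract_objects(sample_event):
    print(bucket, key, matching_targets(key, size))
```

In a real EventBridge setup the filtering would live in the rule's event pattern rather than in code; the local function only illustrates which objects each rule would select.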
Best Practices
Cost
- Use lifecycle rules aggressively: transition old data to cheaper tiers
- Abort incomplete multipart uploads: orphaned parts cost money
- S3 Select for filtering: avoid downloading entire objects
- Requester Pays for shared datasets
Security
- Enable S3 Block Public Access at account level
- Use VPC Gateway Endpoints for private S3 access from VPC
- Enable access logging to track bucket access
- Object Lock for compliance/regulatory requirements
Reliability
- Cross-Region Replication for DR (enable RTC for SLA)
- Versioning for accidental delete protection
- MFA Delete for critical buckets
DVA-C02 Exam Tips
- Replication requires versioning on both buckets
- Delete markers NOT replicated by default
- No chaining: A→B, B→C does NOT replicate A→C
- Existing objects NOT replicated; use S3 Batch Replication
- S3 Select = single object SQL filter. Athena = data lake SQL
- Object Lambda = transform on GET (redact PII, resize images)
- Transfer Acceleration = CloudFront edge for fast uploads
- MFA Delete = root account only, CLI/API only
- Lifecycle cannot transition backwards (Deep Archive → Standard)
- Object Lock Compliance mode = NOBODY can delete, not even root
Practice Questions
Q1. A company replicates an S3 bucket from us-east-1 to eu-west-1. A user deletes an object in us-east-1. Is it also deleted in eu-west-1?
A) Yes, always replicated
B) No, delete markers are not replicated by default
C) Yes, if the bucket policy allows it
D) No, only new objects replicate
Answer & Explanation
B. Delete markers are NOT replicated by default, which protects the replica from accidental deletes in the source. Enable Delete Marker Replication only if you want deletes to propagate.
Q2. 10GB CSV in S3, need only rows where status = 'ERROR'. Most cost-effective?
A) Download and filter locally
B) Lambda stream processing
C) S3 Select
D) Athena
Answer & Explanation
C. S3 Select runs SQL server-side on a single object, returning only matching rows. Much cheaper than downloading 10 GB.
Q3. API returns user data from S3 CSV. For GDPR, PII must be redacted before delivery. Best approach without storing duplicate files?
A) Pre-process and store redacted copies
B) S3 Object Lambda to redact on GET
C) CloudFront function to redact
D) API Gateway response mapping
Answer & Explanation
B. S3 Object Lambda transforms data on the fly during GET requests. No need to store separate redacted copies.
Q4. Bucket A replicates to B, B replicates to C. Does data from A reach C?
A) Yes, replication chains automatically
B) No, replication does not chain
C) Yes, if all buckets have versioning
D) Only for SSE-S3 encrypted objects
Answer & Explanation
B. Replication does NOT chain. Objects replicated from A→B are not re-replicated B→C. Configure A→B and A→C separately.
Q5. Legal compliance requires that objects in a bucket CANNOT be deleted by anyone, including the root account, for 7 years. What should you use?
A) MFA Delete
B) Object Lock, Governance Mode
C) Object Lock, Compliance Mode
D) Bucket policy with explicit Deny
Answer & Explanation
C. Compliance Mode prevents deletion by ALL users, including root. Governance Mode can be bypassed with special permissions.
Interview Questions (Senior Level)
- How do you choose between CRR and SRR for compliance, latency, and operational recovery?
- When is S3 Object Lambda superior to preprocessing pipelines, and when is it a bad fit?
- How would you design lifecycle and retention to control cost without violating legal hold requirements?
- Transfer Acceleration is enabled but performance gains are inconsistent. What do you investigate?