S3 Advanced

Senior-level topics: Replication architecture, Object Lambda for on-the-fly transformations, S3 Batch Operations, and event-driven architectures.


Replication

Cross-Region Replication (CRR) vs Same-Region Replication (SRR)

| Feature | CRR | SRR |
|---|---|---|
| Regions | Source → different region | Source → same region |
| Use case | DR, compliance, latency reduction | Log aggregation, dev/prod copies |
| Versioning | Required on both buckets | Required on both buckets |
| Existing objects | ❌ Not replicated (use S3 Batch Replication) | ❌ Same |
| Delete markers | Not replicated by default (opt-in) | Not replicated by default (opt-in) |
| Chaining | ❌ No: A→B→C not supported (configure A→B and A→C separately) | ❌ Same |

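Replication Configuration

A rule that uses Filter must also set Priority and DeleteMarkerReplication, and SourceSelectionCriteria opts SSE-KMS encrypted objects in to replication:

{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "ReplicateAll",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": { "Prefix": "" },
    "SourceSelectionCriteria": {
      "SseKmsEncryptedObjects": { "Status": "Enabled" }
    },
    "Destination": {
      "Bucket": "arn:aws:s3:::dest-bucket",
      "StorageClass": "STANDARD_IA",
      "EncryptionConfiguration": {
        "ReplicaKmsKeyID": "arn:aws:kms:eu-west-1:123:key/dest-key-id"
      },
      "ReplicationTime": {
        "Status": "Enabled",
        "Time": { "Minutes": 15 }
      },
      "Metrics": {
        "Status": "Enabled",
        "EventThreshold": { "Minutes": 15 }
      }
    },
    "DeleteMarkerReplication": { "Status": "Enabled" }
  }]
}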

S3 Replication Time Control (RTC)

  • Designed to replicate 99.99% of new objects within 15 minutes, backed by an SLA
  • Provides CloudWatch metrics and S3 events for monitoring
  • Costs extra, but matters for compliance and DR requirements

Replication Gotchas
  1. No chaining: if Bucket A → B and B → C, objects replicated from A into B are NOT re-replicated to C
  2. Delete markers are not replicated by default; enable this explicitly
  3. Existing objects are not replicated; use S3 Batch Replication
  4. SSE-KMS encrypted objects are replicated only if you opt in (SourceSelectionCriteria). Older material also lists SSE-C objects as never replicated; S3 now supports replicating them
  5. Lifecycle rules are NOT replicated (configure them separately on the destination)
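
The first three gotchas can be read as a small decision procedure. The sketch below models them in plain Java with no AWS SDK. The names `ObjectVersion`, `ReplicationSketch`, and `shouldReplicate` are hypothetical and exist only for illustration; the real service evaluates more conditions (filters, encryption, permissions) than this.

```java
import java.time.Instant;

// Hypothetical model of one object version sitting in the source bucket.
final class ObjectVersion {
    final boolean isReplica;       // true if it arrived here via replication
    final boolean isDeleteMarker;  // true if this version is a delete marker
    final Instant createdAt;

    ObjectVersion(boolean isReplica, boolean isDeleteMarker, Instant createdAt) {
        this.isReplica = isReplica;
        this.isDeleteMarker = isDeleteMarker;
        this.createdAt = createdAt;
    }
}

final class ReplicationSketch {
    // Simplified view of whether live replication would pick up this version.
    static boolean shouldReplicate(ObjectVersion v, Instant ruleEnabledAt,
                                   boolean deleteMarkerReplication) {
        if (v.isReplica) {
            return false; // no chaining: replicas are not replicated again
        }
        if (v.createdAt.isBefore(ruleEnabledAt)) {
            return false; // existing objects need S3 Batch Replication
        }
        if (v.isDeleteMarker && !deleteMarkerReplication) {
            return false; // delete markers are opt-in
        }
        return true;
    }
}
```

For example, a delete marker created after the rule was enabled is replicated only when the third argument is `true`.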

S3 Transfer Acceleration

Without acceleration:
Client (Australia) ──── public internet ────→ S3 (us-east-1)
Latency: ~200ms (illustrative)

With acceleration:
Client (Australia) → Edge (Sydney) ──── AWS backbone ────→ S3 (us-east-1)
Latency: ~80ms (illustrative)

| Property | Details |
|---|---|
| How it works | Uses the CloudFront edge network as the entry point |
| Best for | Long-distance uploads (cross-continent) |
| Endpoint | bucket.s3-accelerate.amazonaws.com |
| Cost | Additional per-GB transfer fee |
| NOT useful | Same-region clients, small files |

# Enable Transfer Acceleration on the bucket
aws s3api put-bucket-accelerate-configuration \
  --bucket my-bucket \
  --accelerate-configuration Status=Enabled

# Speed comparison tool: check whether acceleration helps your use case
# https://s3-accelerate-speedtest.s3-accelerate.amazonaws.com
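
Once acceleration is enabled, clients have to address the bucket through the accelerate endpoint shown in the table. A minimal sketch of building that hostname follows; the class and method names are hypothetical. It also rejects bucket names containing periods, since Transfer Acceleration requires a DNS-compliant bucket name without them.

```java
// Hypothetical helper: builds the Transfer Acceleration endpoint for a bucket.
final class AccelerateEndpoint {
    static String forBucket(String bucket) {
        if (bucket == null || bucket.isEmpty()) {
            throw new IllegalArgumentException("bucket name is required");
        }
        if (bucket.contains(".")) {
            // Bucket names with periods cannot use Transfer Acceleration.
            throw new IllegalArgumentException("bucket name must not contain periods: " + bucket);
        }
        return "https://" + bucket + ".s3-accelerate.amazonaws.com";
    }
}
```

In practice the AWS SDKs and CLI can be configured to use the accelerate endpoint for you, so a helper like this is mainly useful for illustrating the naming rule.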

S3 Select & Glacier Select

Query data inside a single object using SQL, so that only the matching rows are returned instead of the entire file:

// Requires the AWS SDK for Java v2 (software.amazon.awssdk:s3)
import software.amazon.awssdk.services.s3.model.*;

// Filter a CSV file server-side: only matching rows are returned
SelectObjectContentRequest request = SelectObjectContentRequest.builder()
    .bucket("data-lake")
    .key("orders/2024-01.csv.gz")
    .expressionType(ExpressionType.SQL)
    .expression("SELECT s.orderId, s.amount FROM S3Object s WHERE s.status = 'FAILED' AND CAST(s.amount AS DECIMAL) > 1000")
    .inputSerialization(InputSerialization.builder()
        // USE treats the first line as a header, so columns can be referenced by name
        .csv(CSVInput.builder().fileHeaderInfo(FileHeaderInfo.USE).build())
        .compressionType(CompressionType.GZIP) // Supports GZIP, BZIP2
        .build())
    .outputSerialization(OutputSerialization.builder()
        .json(JSONOutput.builder().build()) // Output as JSON
        .build())
    .build();
// Send it with S3AsyncClient.selectObjectContent(request, responseHandler)
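
To make the effect of that SQL expression concrete, here is a local emulation of the same filter and projection over in-memory CSV lines. It is a sketch only and does not call S3. The class name `SelectEmulation` is hypothetical, the naive `split(",")` ignores quoted fields, and the JSON shape (string values, one record per line) is an assumption about how the output would look.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local stand-in for: SELECT orderId, amount ... WHERE status = 'FAILED' AND amount > threshold
final class SelectEmulation {
    static List<String> failedOver(List<String> csvLines, BigDecimal threshold) {
        // The first line is the header, mirroring FileHeaderInfo.USE.
        List<String> header = Arrays.asList(csvLines.get(0).split(","));
        int idCol = header.indexOf("orderId");
        int amountCol = header.indexOf("amount");
        int statusCol = header.indexOf("status");

        List<String> out = new ArrayList<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] f = line.split(",");
            boolean failed = f[statusCol].equals("FAILED");
            boolean large = new BigDecimal(f[amountCol]).compareTo(threshold) > 0;
            if (failed && large) {
                // Only the two projected columns are returned, not the whole row.
                out.add("{\"orderId\":\"" + f[idCol] + "\",\"amount\":\"" + f[amountCol] + "\"}");
            }
        }
        return out;
    }
}
```

The point of the comparison is that the caller receives only the small filtered result, while the scan over the full object happens on the service side.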

S3 Select vs Athena

| Feature | S3 Select | Athena |
|---|---|---|
| Scope | Single object | Multiple objects, partitioned data |
| Query | Simple SQL (SELECT, WHERE) | Full SQL (JOINs, GROUP BY, window functions) |
| Format | CSV, JSON, Parquet | CSV, JSON, Parquet, ORC, Avro |
| Use case | Quick filter on one file | Data lake analytics |
| Cost | Per data scanned/returned | Per data scanned |

S3 Object Lambda

Transform objects on the fly during GET requests:

Client GET → S3 Object Lambda Access Point → Lambda function → Transformed response
                                                   ↓
                                 S3 Supporting Access Point (original object)

Use Cases

  • Redact PII: remove SSNs and email addresses from CSV/JSON before returning
  • Resize images: return thumbnails without storing them
  • Convert formats: XML → JSON on the fly
  • Add watermarks: overlay a watermark on images
  • Decompress: return decompressed data
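
As an illustration of the first use case, the sketch below shows only the transformation step a redaction function might apply to the object body. The class name `PiiRedactor` and both regular expressions are assumptions made for this example, and the patterns are deliberately simple (US-style SSNs, common email shapes). A real Object Lambda handler would also fetch the original object and send the result back to S3, which is omitted here.

```java
import java.util.regex.Pattern;

// Hypothetical transformation step for a PII-redacting Object Lambda function.
final class PiiRedactor {
    // Matches SSNs written as 123-45-6789.
    private static final Pattern SSN =
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    // Matches common email address shapes; not a full RFC 5322 parser.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static String redact(String text) {
        String withoutSsn = SSN.matcher(text).replaceAll("***-**-****");
        return EMAIL.matcher(withoutSsn).replaceAll("[REDACTED_EMAIL]");
    }
}
```

Because the redaction runs per request, the bucket keeps a single unredacted copy and no derived files need to be stored.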

Setup with CloudFormation

Resources:
  # DataBucket (AWS::S3::Bucket) and RedactFunction (AWS::Lambda::Function)
  # are assumed to be defined elsewhere in this template.
  SupportingAccessPoint:
    Type: AWS::S3::AccessPoint
    Properties:
      Bucket: !Ref DataBucket
      Name: supporting-ap

  ObjectLambdaAccessPoint:
    Type: AWS::S3ObjectLambda::AccessPoint
    Properties:
      Name: pii-redaction-ap
      ObjectLambdaConfiguration:
        SupportingAccessPoint: !GetAtt SupportingAccessPoint.Arn
        TransformationConfigurations:
          - Actions: [GetObject]
            ContentTransformation:
              AwsLambda:
                FunctionArn: !GetAtt RedactFunction.Arn
S3 Batch Operations

Run large-scale operations on billions of objects:

S3 Inventory Report (source list)
        ↓
   S3 Batch Job
        ↓
Operations: Copy, Invoke Lambda, Restore from Glacier,
            Replace tags, Replace ACLs, Object Lock

| Feature | Details |
|---|---|
| Input | S3 Inventory report or CSV manifest |
| Retry | Automatic retry of failed operations |
| Tracking | Job progress, completion reports |
| Use cases | Batch replication, bulk encryption, mass tagging |

# Create batch job to copy objects to another bucket
# (the inventory manifest format identifier is S3InventoryReport_CSV_20161130)
aws s3control create-job \
  --account-id 123456789012 \
  --operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::dest-bucket"}}' \
  --manifest '{"Spec": {"Format": "S3InventoryReport_CSV_20161130"}, "Location": {"ObjectArn": "arn:aws:s3:::source-bucket/inventory/manifest.json", "ETag": "abc123"}}' \
  --report '{"Bucket": "arn:aws:s3:::report-bucket", "Prefix": "batch-reports/", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
  --role-arn arn:aws:iam::123456789012:role/batch-operations-role \
  --priority 10

MFA Delete

Adds an extra layer of protection for versioned buckets:

| Action | MFA Required? |
|---|---|
| Permanently delete a specific version | ✅ Yes |
| Suspend versioning | ✅ Yes |
| Enable versioning | ❌ No |
| List versions | ❌ No |
| Add delete marker | ❌ No |

  • Only the root account can enable/disable MFA Delete
  • Must use CLI or API (cannot configure via console)
  • Requires versioning to be enabled

Lifecycle Rules

Example: Cost-Optimized Log Retention

{
  "Rules": [{
    "ID": "LogRetentionPolicy",
    "Status": "Enabled",
    "Filter": { "Prefix": "logs/" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER_IR" },
      { "Days": 180, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "NoncurrentVersionTransitions": [
      { "NoncurrentDays": 30, "StorageClass": "GLACIER" }
    ],
    "NoncurrentVersionExpiration": { "NoncurrentDays": 90 },
    "Expiration": { "Days": 2555 },
    "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
  }]
}

Transition Constraints

Standard → Standard-IA (min 30 days in Standard)
Standard → Glacier Instant Retrieval (allowed at any age; a 90-day minimum storage charge applies)
Standard-IA → Glacier (allowed)
One Zone-IA → Glacier (allowed)
Any → Deep Archive (allowed)

❌ Cannot transition "backwards" (e.g. Deep Archive → Standard)
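
The forward-only rule can be expressed as a lookup of permitted targets per storage class. The sketch below is a simplification under stated assumptions: the class `LifecycleTransitions` is hypothetical, the table covers only the storage classes named in this section (Intelligent-Tiering is left out), the entries reflect my reading of the documented transition waterfall, and only the 30-day rule for the Infrequent Access classes is modeled. Check the S3 lifecycle documentation before relying on it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical validator for lifecycle transitions between storage classes.
final class LifecycleTransitions {
    private static final Map<String, Set<String>> ALLOWED = new HashMap<>();

    static {
        ALLOWED.put("STANDARD", setOf("STANDARD_IA", "ONEZONE_IA", "GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"));
        ALLOWED.put("STANDARD_IA", setOf("ONEZONE_IA", "GLACIER_IR", "GLACIER", "DEEP_ARCHIVE"));
        ALLOWED.put("ONEZONE_IA", setOf("GLACIER", "DEEP_ARCHIVE"));
        ALLOWED.put("GLACIER_IR", setOf("GLACIER", "DEEP_ARCHIVE"));
        ALLOWED.put("GLACIER", setOf("DEEP_ARCHIVE"));
        ALLOWED.put("DEEP_ARCHIVE", setOf()); // nothing colder to move to
    }

    private static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    static boolean isAllowed(String from, String to, int objectAgeDays) {
        Set<String> targets = ALLOWED.get(from);
        if (targets == null || !targets.contains(to)) {
            return false; // unknown class or a "backwards" transition
        }
        boolean toInfrequentAccess = to.equals("STANDARD_IA") || to.equals("ONEZONE_IA");
        // Objects must be at least 30 days old before moving to an IA class.
        return !toInfrequentAccess || objectAgeDays >= 30;
    }
}
```

With this table, `isAllowed("DEEP_ARCHIVE", "STANDARD", 400)` is false regardless of age, which is the "no backwards transitions" rule.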

Object Lock Retention Modes

| Mode | Description |
|---|---|
| Governance | Users with s3:BypassGovernanceRetention can override or delete |
| Compliance | NO ONE can delete or override until retention expires, not even the root account |

Legal Hold
  • Independent of the retention period: applied and removed manually
  • While active, the object version cannot be deleted regardless of retention settings
  • Requires the s3:PutObjectLegalHold permission
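
How the two retention modes and a legal hold combine can be sketched as a single check. The class `ObjectLockSketch` and its parameters are hypothetical, and the logic is a simplified reading of the rules above rather than the service's actual authorization path.

```java
import java.time.Instant;

// Hypothetical model of whether a protected object version may be deleted.
final class ObjectLockSketch {
    enum Mode { NONE, GOVERNANCE, COMPLIANCE }

    static boolean canDeleteVersion(Mode mode, Instant retainUntil, boolean legalHold,
                                    boolean hasBypassGovernance, Instant now) {
        if (legalHold) {
            return false; // a legal hold blocks deletion regardless of retention
        }
        boolean retentionActive =
            mode != Mode.NONE && retainUntil != null && now.isBefore(retainUntil);
        if (!retentionActive) {
            return true; // no lock, or the retention period has expired
        }
        if (mode == Mode.GOVERNANCE) {
            // Only callers with s3:BypassGovernanceRetention may override.
            return hasBypassGovernance;
        }
        return false; // COMPLIANCE: nobody can delete before retainUntil
    }
}
```

Note that the bypass flag has no effect in Compliance mode, which is the property the exam questions usually probe.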

Requester Pays

  • The requester pays for data transfer and request costs (not the bucket owner)
  • Use case: Sharing large public datasets (genomics, satellite imagery)
  • Requester must be an authenticated AWS user (no anonymous access)

Event-Driven Architecture Patterns

Pattern 1: Image Processing Pipeline

User uploads → S3 (PutObject) → Lambda (resize + generate thumbnails)
                                    → S3 (thumbnails/)
                                    → DynamoDB (metadata)

Pattern 2: Fan-Out with SNS

S3 (PutObject) → SNS Topic → SQS Queue 1 (process A)
                           → SQS Queue 2 (process B)
                           → Lambda (process C)

Pattern 3: EventBridge for Complex Routing

S3 (PutObject) → EventBridge → Rule 1: if prefix="orders/" → Lambda A
                             → Rule 2: if suffix=".pdf" → Step Functions
                             → Rule 3: if size > 100MB → SQS
                             → Archive (replay later; retention period is configurable)
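
The routing in Pattern 3 can be mimicked locally to show one property that is easy to miss: every rule is evaluated independently, so a single event can reach several targets. The class `S3EventRouter` and the target labels are hypothetical, and real EventBridge rules are declared as event patterns rather than written as code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical local model of the three routing rules in Pattern 3.
final class S3EventRouter {
    static final long HUNDRED_MB = 100L * 1024 * 1024;

    static List<String> targetsFor(String key, long sizeBytes) {
        List<String> targets = new ArrayList<>();
        // Each rule is checked on its own; matches are not mutually exclusive.
        if (key.startsWith("orders/")) {
            targets.add("LambdaA");
        }
        if (key.endsWith(".pdf")) {
            targets.add("StepFunctions");
        }
        if (sizeBytes > HUNDRED_MB) {
            targets.add("SQS");
        }
        return targets;
    }
}
```

For instance, a small object with key `orders/invoice.pdf` matches both the prefix rule and the suffix rule, so it is delivered to two targets.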

🏆 Best Practices

Cost

  1. Use lifecycle rules aggressively: transition old data to cheaper tiers
  2. Abort incomplete multipart uploads: orphaned parts cost money
  3. Use S3 Select for filtering to avoid downloading entire objects
  4. Use Requester Pays for shared datasets

Security

  1. Enable S3 Block Public Access at account level
  2. Use VPC Gateway Endpoints for private S3 access from VPC
  3. Enable access logging to track bucket access
  4. Object Lock for compliance/regulatory requirements

Reliability

  1. Cross-Region Replication for DR (enable RTC for SLA)
  2. Versioning for accidental delete protection
  3. MFA Delete for critical buckets

🎯 DVA-C02 Exam Tips

S3 Advanced Exam Cheat Sheet
  1. Replication requires versioning on both buckets
  2. Delete markers NOT replicated by default
  3. No chaining: A→B, B→C does NOT replicate A→C
  4. Existing objects NOT replicated; use S3 Batch Replication
  5. S3 Select = single object SQL filter. Athena = data lake SQL
  6. Object Lambda = transform on GET (redact PII, resize images)
  7. Transfer Acceleration = CloudFront edge for fast uploads
  8. MFA Delete = root account only, CLI/API only
  9. Lifecycle cannot transition backwards (Deep Archive → Standard)
  10. Object Lock Compliance mode = NOBODY can delete, not even root

🧪 Practice Questions

Q1. A company replicates an S3 bucket from us-east-1 to eu-west-1. A user deletes an object in us-east-1. Is it also deleted in eu-west-1?

A) Yes, always replicated
B) No, delete markers are not replicated by default
C) Yes, if the bucket policy allows
D) No, only new objects replicate

✅ Answer & Explanation

B. Delete markers are NOT replicated by default, so the object remains visible in eu-west-1. This default protects the replica against accidental deletes; enable Delete Marker Replication only if deletes should propagate.


Q2. 10GB CSV in S3, need only rows where status = 'ERROR'. Most cost-effective?

A) Download and filter locally
B) Lambda stream processing
C) S3 Select
D) Athena

✅ Answer & Explanation

C. S3 Select runs SQL server-side on a single object, returning only matching rows. Much cheaper than downloading 10GB.


Q3. API returns user data from S3 CSV. For GDPR, PII must be redacted before delivery. Best approach without storing duplicate files?

A) Pre-process and store redacted copies
B) S3 Object Lambda to redact on GET
C) CloudFront function to redact
D) API Gateway response mapping

✅ Answer & Explanation

B. S3 Object Lambda transforms data on the fly during GET requests. No need to store separate redacted copies.


Q4. Bucket A replicates to B, B replicates to C. Does data from A reach C?

A) Yes, replication chains automatically
B) No, replication does not chain
C) Yes, if all buckets have versioning
D) Only for SSE-S3 encrypted objects

✅ Answer & Explanation

B. Replication does NOT chain. Objects replicated from A→B are not re-replicated B→C. Configure A→B and A→C separately.


Q5. Legal compliance requires that objects in a bucket CANNOT be deleted by anyone, including root account, for 7 years. What to use?

A) MFA Delete
B) Object Lock, Governance Mode
C) Object Lock, Compliance Mode
D) Bucket policy with explicit Deny

✅ Answer & Explanation

C. Compliance Mode prevents deletion by ALL users, including root. Governance Mode can be bypassed with special permissions.


Interview Questions (Senior Level)

  1. How do you choose between CRR and SRR for compliance, latency, and operational recovery?
  2. When is S3 Object Lambda superior to preprocessing pipelines, and when is it a bad fit?
  3. How would you design lifecycle and retention to control cost without violating legal hold requirements?
  4. Transfer Acceleration is enabled but performance gains are inconsistent. What do you investigate?
