AWS Step Functions

Core concept: Step Functions orchestrates multi-step workflows as state machines, coordinating Lambda, SQS, DynamoDB, ECS, and 200+ other AWS services without custom glue code.


🔰 What Are Step Functions?

Step Functions is a visual workflow orchestration service. Instead of chaining Lambda functions together with custom code, you define a workflow as a state machine in JSON, using the Amazon States Language (ASL).

Analogy: Think of Step Functions like a flowchart that actually runs. Each box is a state that does something (call Lambda, write to DynamoDB, wait for approval), and arrows define the flow based on conditions and outcomes.

When to Use Step Functions vs Alternatives

| Use Case | Step Functions | SQS | EventBridge |
|---|---|---|---|
| Orchestration (do A, then B, then C) | ✅ Best fit | ❌ | ❌ |
| Decoupling (A doesn't care when B runs) | ❌ | ✅ Best fit | ✅ |
| Event routing (route events to targets) | ❌ | ❌ | ✅ Best fit |
| Human approval workflows | ✅ (waitForTaskToken) | ❌ | ❌ |
| Parallel fan-out | ✅ (Map/Parallel) | ✅ (with Lambda) | ✅ |
| Long-running processes | ✅ (up to 1 year) | ❌ | ❌ |
| Error handling with retries | ✅ Built-in | Custom code | Custom code |

Standard vs Express Workflows

| Feature | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once | At-least-once (async) / At-most-once (sync) |
| Execution history | Full history in console (90 days) | CloudWatch Logs only |
| Max execution rate | 2,000/sec (soft limit) | 100,000/sec |
| Pricing | Per state transition ($0.025 per 1,000) | Per execution + duration |
| Use case | Long business processes, human approval | High-volume event processing, IoT, streaming |

Express Workflow Types

| Type | Invocation | Guarantee | Use Case |
|---|---|---|---|
| Async Express | StartExecution → immediate return | At-least-once | Fire-and-forget event processing |
| Sync Express | StartSyncExecution → wait for result | At-most-once | API Gateway backend, request/response |

Exam Decision
  • Need exactly-once processing? → Standard
  • Need high throughput (>2000/sec)? → Express
  • Need to run for >5 minutes? → Standard
  • Need execution history in console? → Standard
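
The four rules above can be read as a small decision function. The sketch below is only an illustration: the function name, its parameters, and the "EITHER" fallback are assumptions made for this example, and when requirements conflict it simply lets the Standard-only requirements win.

```python
def choose_workflow_type(needs_exactly_once: bool,
                         needs_console_history: bool,
                         max_duration_seconds: int,
                         starts_per_second: int) -> str:
    """Apply the exam rules; returns "STANDARD", "EXPRESS", or "EITHER"."""
    express_max_seconds = 5 * 60  # Express limit from the comparison table
    if (needs_exactly_once or needs_console_history
            or max_duration_seconds > express_max_seconds):
        return "STANDARD"
    if starts_per_second > 2000:
        return "EXPRESS"
    # Assumption: no rule forces a choice, so the decision falls to cost.
    return "EITHER"


print(choose_workflow_type(False, False, 30, 50_000))  # EXPRESS
print(choose_workflow_type(True, False, 30, 100))      # STANDARD
print(choose_workflow_type(False, False, 3600, 100))   # STANDARD
```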

State Types

| State | Purpose | Key Properties |
|---|---|---|
| Task | Do work (invoke Lambda, call API, run ECS task) | Resource, Parameters, Retry, Catch |
| Choice | Branch based on conditions (if/else) | Choices, Default |
| Wait | Pause for duration or until timestamp | Seconds, Timestamp, SecondsPath |
| Parallel | Execute branches simultaneously | Branches (array of sub-state machines) |
| Map | Iterate over an array, processing each item | ItemsPath, MaxConcurrency, ItemProcessor (formerly Iterator) |
| Pass | Pass input to output (testing/transformation) | Result, ResultPath |
| Succeed | End workflow successfully | Terminal state |
| Fail | End workflow with error | Error, Cause |

State Machine Definition (ASL)

Complete Order Processing Example

{
  "Comment": "E-Commerce Order Processing Workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:ValidateOrder",
      "Next": "CheckInventory",
      "Catch": [{
        "ErrorEquals": ["ValidationError"],
        "ResultPath": "$.error",
        "Next": "SendFailureNotification"
      }]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "Inventory",
        "Key": { "productId": { "S.$": "$.productId" } }
      },
      "ResultSelector": {
        "quantity.$": "States.StringToJson($.Item.quantity.N)"
      },
      "ResultPath": "$.inventory",
      "Next": "IsInStock"
    },
    "IsInStock": {
      "Type": "Choice",
      "Choices": [{
        "Variable": "$.inventory.quantity",
        "NumericGreaterThan": 0,
        "Next": "ProcessPayment"
      }],
      "Default": "NotifyOutOfStock"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:ProcessPayment",
      "Retry": [{
        "ErrorEquals": ["PaymentGatewayTimeout"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }],
      "Catch": [{
        "ErrorEquals": ["PaymentDeclined"],
        "ResultPath": "$.error",
        "Next": "HandlePaymentError"
      }],
      "Next": "FulfillOrder"
    },
    "FulfillOrder": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "UpdateInventory",
          "States": {
            "UpdateInventory": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123:function:UpdateInventory",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmationEmail",
          "States": {
            "SendConfirmationEmail": {
              "Type": "Task",
              "Resource": "arn:aws:states:::sns:publish",
              "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123:OrderConfirmations",
                "Message.$": "States.Format('Order {} confirmed', $.orderId)"
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "CreateShipment",
          "States": {
            "CreateShipment": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123:function:CreateShipment",
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    "OrderComplete": { "Type": "Succeed" },
    "NotifyOutOfStock": { "Type": "Fail", "Error": "OutOfStock", "Cause": "Product not available" },
    "HandlePaymentError": { "Type": "Fail", "Error": "PaymentFailed", "Cause": "Payment was declined" },
    "SendFailureNotification": { "Type": "Fail", "Error": "ValidationFailed", "Cause": "Order validation failed" }
  }
}

Note: DynamoDB returns number attributes as strings (the N field), so CheckInventory converts quantity with States.StringToJson before IsInStock compares it numerically. A NumericGreaterThan rule never matches a string value.
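
A definition this size is easy to break with a mistyped state name. As a rough illustration, the Python sketch below walks a definition and reports every Next, Default, Choices, or Catch target that does not name a defined state. The function name and the trimmed sample definition are made up for this example, and the check covers only the transition fields used in this guide.

```python
def find_dangling_targets(definition: dict) -> list[tuple[str, str]]:
    """Return (state name, target) pairs whose target state is not defined."""
    states = definition["States"]
    problems = []
    if definition["StartAt"] not in states:
        problems.append(("StartAt", definition["StartAt"]))
    for name, state in states.items():
        targets = [state[key] for key in ("Next", "Default") if key in state]
        # Top-level Choice rules and Catch entries each carry their own "Next".
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.append(rule["Next"])
        problems += [(name, t) for t in targets if t not in states]
        # Parallel branches and Map item processors are nested state machines.
        nested = state.get("Branches", [])
        if "ItemProcessor" in state:
            nested = nested + [state["ItemProcessor"]]
        for sub in nested:
            problems += find_dangling_targets(sub)
    return problems


# Trimmed, deliberately broken definition: "CheckInventory" is never defined.
broken = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123:function:ValidateOrder",
            "Next": "CheckInventory",
            "Catch": [{"ErrorEquals": ["ValidationError"],
                       "Next": "SendFailureNotification"}],
        },
        "SendFailureNotification": {"Type": "Fail", "Error": "ValidationFailed"},
    },
}

print(find_dangling_targets(broken))  # [('ValidateOrder', 'CheckInventory')]
```

To check the full order workflow, load its JSON with json.load and pass the resulting dict to the same function.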

Error Handling

Retry

Automatically retry failed states with exponential backoff:

"Retry": [
  {
    "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
    "IntervalSeconds": 1,
    "MaxAttempts": 3,
    "BackoffRate": 2.0
  },
  {
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 5,
    "MaxAttempts": 2,
    "BackoffRate": 3.0
  }
]

Retry timeline for the first retrier: 1s → 2s → 4s (wait before retry n = IntervalSeconds × BackoffRate^(n-1), with n starting at 1)
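
The backoff arithmetic is easy to check with a few lines of Python. This is a minimal sketch: the function name is made up for illustration, and it ignores any optional cap or jitter settings a retrier might also carry.

```python
def retry_delays(interval_seconds: float, max_attempts: int,
                 backoff_rate: float) -> list[float]:
    """Wait before each retry: IntervalSeconds * BackoffRate ** (n - 1)."""
    return [interval_seconds * backoff_rate ** (n - 1)
            for n in range(1, max_attempts + 1)]


print(retry_delays(1, 3, 2.0))  # first retrier above:  [1.0, 2.0, 4.0]
print(retry_delays(5, 2, 3.0))  # second retrier above: [5.0, 15.0]
```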

Catch

Handle specific errors gracefully:

"Catch": [
  {
    "ErrorEquals": ["PaymentDeclined"],
    "ResultPath": "$.error",
    "Next": "HandlePaymentError"
  },
  {
    "ErrorEquals": ["States.ALL"],
    "ResultPath": "$.error",
    "Next": "GenericErrorHandler"
  }
]

Built-in Error Types

| Error | When It Occurs |
|---|---|
| States.ALL | Matches any known error name (catch-all); does not catch States.DataLimitExceeded or States.Runtime |
| States.TaskFailed | Task state failed |
| States.Timeout | State or execution timed out |
| States.Permissions | Insufficient IAM permissions |
| States.ResultPathMatchFailure | ResultPath can't be applied to input |
| States.HeartbeatTimeout | Task didn't send heartbeat in time |

Error Handling Order

Retry is evaluated before Catch. Only when the matching retrier has used all its attempts is Catch evaluated. Within each list, entries are checked in order and the first match wins, so put specific errors first and States.ALL last.
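
The ordering rule can be sketched as a small simulation. The code below is a simplified model, not the service's implementation: the function names are hypothetical, matching is reduced to exact names plus States.ALL, and attempt counts are tracked per retrier in a plain dict.

```python
def matches(error_equals: list[str], error_name: str) -> bool:
    return error_name in error_equals or "States.ALL" in error_equals


def handle_error(error_name, retry_rules, catch_rules, attempts_used):
    """Return ("retry", rule index), ("catch", next state), or ("fail", None)."""
    for index, rule in enumerate(retry_rules):
        if matches(rule["ErrorEquals"], error_name):
            used = attempts_used.get(index, 0)
            if used < rule.get("MaxAttempts", 3):  # ASL default is 3
                attempts_used[index] = used + 1
                return ("retry", index)
            break  # retries exhausted: fall through to Catch
    for rule in catch_rules:
        if matches(rule["ErrorEquals"], error_name):
            return ("catch", rule["Next"])
    return ("fail", None)


retry_rules = [{"ErrorEquals": ["PaymentGatewayTimeout"], "MaxAttempts": 2}]
catch_rules = [
    {"ErrorEquals": ["PaymentDeclined"], "Next": "HandlePaymentError"},
    {"ErrorEquals": ["States.ALL"], "Next": "GenericErrorHandler"},
]
attempts = {}
timeline = [handle_error("PaymentGatewayTimeout", retry_rules, catch_rules, attempts)
            for _ in range(3)]
print(timeline)
# [('retry', 0), ('retry', 0), ('catch', 'GenericErrorHandler')]
print(handle_error("PaymentDeclined", retry_rules, catch_rules, attempts))
# ('catch', 'HandlePaymentError')
```

If the States.ALL entry were listed first in catch_rules, the PaymentDeclined handler would never be reached, which is why the catch-all goes last.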


Integration Patterns

1. Request-Response (Default)

Step Functions calls the service and moves to the next state as soon as it receives the API response; it does not wait for any job that the call started to finish:

"Resource": "arn:aws:states:::lambda:invoke"

2. Run a Job (.sync)

Step Functions calls the service and waits for the job to complete before moving on:

"Resource": "arn:aws:states:::ecs:runTask.sync"

Supported with: ECS, Glue, Batch, CodeBuild, SageMaker, EMR, Step Functions (nested)

3. Wait for Callback (.waitForTaskToken)

Step Functions pauses and waits for an external system to call SendTaskSuccess or SendTaskFailure with the task token:

"WaitForHumanApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approval-queue",
    "MessageBody": {
      "taskToken.$": "$$.Task.Token",
      "orderId.$": "$.orderId",
      "approvalUrl.$": "States.Format('https://api.example.com/approve?token={}', $$.Task.Token)"
    }
  },
  "TimeoutSeconds": 86400,
  "Next": "ProcessApproval"
}

The external system (human, webhook, etc.) resumes the workflow:

# Approve
aws stepfunctions send-task-success \
  --task-token "AAAAKgAAAAIAA..." \
  --task-output '{"approved": true, "approver": "[email protected]"}'

# Reject
aws stepfunctions send-task-failure \
  --task-token "AAAAKgAAAAIAA..." \
  --error "Rejected" \
  --cause "Budget exceeded"
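
The same two calls can be made from code that consumes the approval queue. The sketch below assumes a boto3-style Step Functions client, whose send_task_success and send_task_failure methods take taskToken plus output or error/cause. RecordingClient and complete_approval are hypothetical names introduced here so the example runs without AWS credentials.

```python
import json


class RecordingClient:
    """Hypothetical stand-in for boto3.client("stepfunctions"); records calls."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("send_task_success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("send_task_failure", kwargs))


def complete_approval(sfn_client, message_body, approved, approver, reason=""):
    """Resume the paused execution using the token carried in the SQS message."""
    token = json.loads(message_body)["taskToken"]
    if approved:
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True, "approver": approver}),
        )
    else:
        sfn_client.send_task_failure(taskToken=token, error="Rejected", cause=reason)


client = RecordingClient()
body = json.dumps({"taskToken": "example-token", "orderId": "o-1"})
complete_approval(client, body, approved=False, approver="alice",
                  reason="Budget exceeded")
print(client.calls)
```

In a real worker, the stand-in would be replaced by an actual Step Functions client and message_body would come from the received SQS message.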

Map State (Parallel Processing)

Inline Map (small arrays)

"ProcessOrders": {
  "Type": "Map",
  "ItemsPath": "$.orders",
  "MaxConcurrency": 10,
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "INLINE" },
    "StartAt": "ProcessSingleOrder",
    "States": {
      "ProcessSingleOrder": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:...:ProcessOrder",
        "End": true
      }
    }
  },
  "Next": "SendSummary"
}
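
Conceptually, an inline Map applies the same processor to every array element with a cap on how many run at once, and returns the results as an array in input order. The Python sketch below imitates that with a thread pool. run_map and process_single_order are illustrative names, and unlike the real MaxConcurrency setting this sketch requires a positive limit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map(items, process_item, max_concurrency):
    """Run process_item on every item, at most max_concurrency at once.

    Results come back in input order, like the Map state's output array.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(process_item, items))


def process_single_order(order):
    # Stand-in for the ProcessSingleOrder Lambda (hypothetical logic).
    return {"orderId": order["orderId"], "status": "PROCESSED"}


orders = [{"orderId": f"o-{n}"} for n in range(25)]
results = run_map(orders, process_single_order, max_concurrency=10)
print(len(results), results[0], results[-1])
```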

Distributed Map (millions of items)

For processing massive datasets (S3 inventory, large CSV files):

"ProcessLargeDataset": {
  "Type": "Map",
  "ItemProcessor": {
    "ProcessorConfig": {
      "Mode": "DISTRIBUTED",
      "ExecutionType": "EXPRESS"
    },
    "StartAt": "ProcessChunk",
    "States": {
      "ProcessChunk": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:...:ProcessChunk",
        "End": true
      }
    }
  },
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:getObject",
    "ReaderConfig": {
      "InputType": "CSV",
      "CSVHeaderLocation": "FIRST_ROW"
    },
    "Parameters": {
      "Bucket": "my-data-bucket",
      "Key": "input/large-dataset.csv"
    }
  },
  "MaxConcurrency": 1000,
  "Next": "Done"
}
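
To picture what each child execution receives, it can help to see how a CSV with its header in the first row turns into one item per data row. The Python sketch below is an approximation of that idea using the standard csv module; the sample columns are invented, and the sketch leaves every value as a string rather than guessing types.

```python
import csv
import io


def read_csv_items(csv_text: str) -> list[dict]:
    """Turn CSV text into one dict per data row, keyed by the first-row headers."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


sample = "orderId,amount\no-1,10\no-2,25\n"
items = read_csv_items(sample)
print(items)
# [{'orderId': 'o-1', 'amount': '10'}, {'orderId': 'o-2', 'amount': '25'}]
```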

Direct Service Integrations (No Lambda Needed!)

Step Functions can call 200+ AWS services directly:

"WriteToDatabase": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "Orders",
    "Item": {
      "orderId": { "S.$": "$.orderId" },
      "status": { "S": "PROCESSING" },
      "createdAt": { "S.$": "$$.State.EnteredTime" }
    }
  },
  "Next": "SendNotification"
}

Common direct integrations:

| Service | Resource ARN |
|---|---|
| DynamoDB | arn:aws:states:::dynamodb:putItem/getItem/updateItem |
| SNS | arn:aws:states:::sns:publish |
| SQS | arn:aws:states:::sqs:sendMessage |
| ECS | arn:aws:states:::ecs:runTask.sync |
| Glue | arn:aws:states:::glue:startJobRun.sync |
| EventBridge | arn:aws:states:::events:putEvents |

Cost Optimization

Using direct service integrations avoids Lambda invocation costs. If Step Functions just needs to write to DynamoDB or send an SNS message, skip Lambda entirely.


Input/Output Processing

Key Fields

| Field | Purpose | Example |
|---|---|---|
| InputPath | Filter input before processing | "$.order": only pass the order field |
| Parameters | Build the input payload | Construct new JSON from input |
| ResultSelector | Filter the task result | Extract specific fields from API response |
| ResultPath | Where to place result in original input | "$.taskResult": merge result into input |
| OutputPath | Filter final output | "$.taskResult": pass only result downstream |

Data Flow

Original Input → InputPath → Parameters → TASK → ResultSelector → ResultPath → OutputPath → Next State

{
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:getItem",
  "InputPath": "$.orderDetails",
  "Parameters": {
    "TableName": "Products",
    "Key": { "productId": { "S.$": "$.productId" } }
  },
  "ResultSelector": {
    "productName.$": "$.Item.name.S",
    "price.$": "$.Item.price.N"
  },
  "ResultPath": "$.product",
  "OutputPath": "$",
  "Next": "CalculateTotal"
}
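
Tracing the five fields by hand is error-prone, so here is a small Python model of the pipeline applied to the state above. It is a sketch under simplifying assumptions: paths are limited to the "$.a.b" form, intrinsic functions and the "$$" context object are not handled, and all function names and the fake DynamoDB response are invented for illustration.

```python
import copy


def get_path(doc, path):
    """Resolve a simple path such as "$" or "$.a.b" (no wildcards or filters)."""
    node = doc
    for key in path.split(".")[1:]:
        node = node[key]
    return node


def set_path(doc, path, value):
    """Return a copy of doc with value stored at a simple path such as "$.a"."""
    if path == "$":
        return value
    result = copy.deepcopy(doc)
    keys = path.split(".")[1:]
    node = result
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


def apply_template(template, source):
    """Build a payload: keys ending in ".$" are read from source by path."""
    built = {}
    for key, value in template.items():
        if key.endswith(".$"):
            built[key[:-2]] = get_path(source, value)
        elif isinstance(value, dict):
            built[key] = apply_template(value, source)
        else:
            built[key] = value
    return built


def run_task_state(state, state_input, task):
    filtered = get_path(state_input, state.get("InputPath", "$"))
    payload = (apply_template(state["Parameters"], filtered)
               if "Parameters" in state else filtered)
    result = task(payload)
    if "ResultSelector" in state:
        result = apply_template(state["ResultSelector"], result)
    # ResultPath merges the result into the original (unfiltered) state input.
    merged = set_path(state_input, state.get("ResultPath", "$"), result)
    return get_path(merged, state.get("OutputPath", "$"))


state = {
    "InputPath": "$.orderDetails",
    "Parameters": {"TableName": "Products",
                   "Key": {"productId": {"S.$": "$.productId"}}},
    "ResultSelector": {"productName.$": "$.Item.name.S",
                       "price.$": "$.Item.price.N"},
    "ResultPath": "$.product",
    "OutputPath": "$",
}


def fake_get_item(payload):
    # Hypothetical DynamoDB response for the requested key.
    return {"Item": {"name": {"S": "Widget"}, "price": {"N": "9.99"}}}


state_input = {"orderId": "o-1", "orderDetails": {"productId": "p-42"}}
output = run_task_state(state, state_input, fake_get_item)
print(output)
```

The printed output keeps orderId and orderDetails and adds a product object holding productName and price, which is the effect of combining ResultPath "$.product" with OutputPath "$".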

🏆 Best Practices

Design

  1. Use direct service integrations where possible; avoid Lambda for simple CRUD
  2. Keep Lambda functions small: do one thing per function
  3. Use ResultPath to preserve original input alongside task results
  4. Design for idempotency: Standard workflows guarantee exactly-once workflow execution, but individual tasks may still be retried

Error Handling

  1. Always add Retry for transient errors on every Task state
  2. Use States.ALL as a catch-all in the last Catch entry
  3. Set TimeoutSeconds on every Task to prevent stuck executions
  4. Use HeartbeatSeconds for long-running tasks (ECS, Batch)

Cost

  1. Use Express workflows for high-volume, short-duration workflows
  2. Minimize state transitions: each transition costs $0.025 per 1,000
  3. Use Map state instead of spawning child executions where possible
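
To see why trimming transitions matters, the per-transition rate can be turned into a quick estimate. The Python sketch below uses only the $0.025 per 1,000 figure quoted in this guide, ignores any free tier, and uses made-up workflow sizes (12 versus 7 transitions) purely as an example.

```python
PRICE_PER_1000_TRANSITIONS_USD = 0.025  # Standard workflow rate quoted above


def standard_workflow_cost(transitions_per_execution: int, executions: int) -> float:
    """Estimated Standard workflow charge in USD for the given volume."""
    total_transitions = transitions_per_execution * executions
    return total_transitions / 1000 * PRICE_PER_1000_TRANSITIONS_USD


before = standard_workflow_cost(12, 1_000_000)
after = standard_workflow_cost(7, 1_000_000)
print(f"12 transitions: ${before:,.2f}  7 transitions: ${after:,.2f}")
# 12 transitions: $300.00  7 transitions: $175.00
```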

🎯 DVA-C02 Exam Tips

Step Functions Exam Cheat Sheet
  1. Standard vs Express: Express = 5 min max, at-least-once, high throughput. Standard = 1 year max, exactly-once
  2. Wait for Callback = .waitForTaskToken: use for human approval, external webhooks
  3. Map state = iterate over array (same logic per item). Parallel = different branches concurrently
  4. Distributed Map = process millions of items from S3
  5. Retry before Catch: retries are attempted first, then catch handlers
  6. ResultPath: "$.error" in Catch preserves original input and adds error info
  7. Direct integrations save Lambda cost: DynamoDB, SNS, SQS, ECS don't need Lambda
  8. Express + API Gateway = sync request/response for high-throughput APIs
  9. Error types: States.ALL is the catch-all, States.Timeout is for timeouts
  10. Step Functions vs SQS: Step Functions = orchestration (order matters). SQS = decoupling (order doesn't matter)

🧪 Practice Questions

Q1. A workflow needs to process each item in a list in parallel, up to 5 items at a time. Which state type achieves this?

A) Parallel state
B) Choice state with conditions
C) Map state with MaxConcurrency: 5
D) Multiple Task states

✅ Answer & Explanation

C: Map iterates over an array, applying the same logic per item. MaxConcurrency controls parallelism. Parallel runs different branches, not the same one per item.


Q2. A payment workflow needs to pause for manual approval that may come hours later. Which pattern?

A) Wait state with a fixed duration
B) Poll DynamoDB every minute
C) Task state with .waitForTaskToken
D) Choice state polling SQS

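✅ Answer & Explanation

C: .waitForTaskToken pauses the workflow until the token is returned, bounded by the state's TimeoutSeconds if set and by the one-year Standard workflow limit. An external system calls SendTaskSuccess or SendTaskFailure with the token to resume. No polling needed.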


Q3. A Step Functions task retries 3 times with IntervalSeconds: 2 and BackoffRate: 2.0. What are the wait times?

A) 2s, 2s, 2s
B) 2s, 4s, 8s
C) 2s, 6s, 14s
D) 1s, 2s, 4s

✅ Answer & Explanation

B: Each retry waits IntervalSeconds × BackoffRate^(attempt-1): 2×1=2s, 2×2=4s, 2×4=8s.


Q4. A workflow needs to write an order to DynamoDB. The team wants minimum cost. What approach?

A) Lambda function that calls DynamoDB SDK
B) Direct DynamoDB integration from Step Functions
C) API Gateway calling Lambda calling DynamoDB
D) EventBridge rule triggering Lambda

✅ Answer & Explanation

B: Step Functions direct service integration calls DynamoDB without needing a Lambda function, saving Lambda invocation costs.


Q5. An Express workflow processes 50,000 events per second. What execution guarantee does it provide?

A) Exactly-once
B) At-least-once (async) / At-most-once (sync)
C) At-most-once
D) Best-effort

✅ Answer & Explanation

B: Express workflows are async (at-least-once) by default. Sync Express (StartSyncExecution) provides at-most-once. Only Standard provides exactly-once.


Q6. Where should States.ALL be placed in a Catch configuration?

A) Last, as the catch-all after specific errors
B) First, to handle all errors immediately
C) It doesn't matter; order is not important
D) States.ALL cannot be used in Catch

✅ Answer & Explanation

A: Catch entries are evaluated in order. States.ALL should be last so specific errors are caught by their dedicated handlers first.


🔗 Resources