I'm designing a state machine for a contact center. One specific flow has two possible paths to resolution: as soon as one of them completes, there is no need (or desire) to wait for the other. For the sake of simplicity, the following example follows the same pattern:
- A potential client creates a new user on the platform.
- The platform sends a welcoming email, asking the client to complete the sign-in process by logging in and providing their billing information.
- Simultaneously, an internal ticket for contacting this user is created (SQS).
- Customer support associates will actively poll for tickets, calling the given user to ask for their billing information.
The expected behavior: when either the client or the associate provides the missing information, the process is completed, and the state machine should mark the execution as complete. When a client actively logs in and fills out their information, **a trigger is sent to the state machine**, removing the ticket from the queue and marking the execution as complete.
I haven't found a similar use case in the documentation. The closest one is the manual approval process, but in this case, the approval is "optional" (?). For the same reason, I've tried 4 approaches with no luck:
A) Straight flow, asynchronous call to StartExecution, synchronous call to SendMessage (wait for callback). Problem: there is no way to interrupt the wait if the client responds to the email by logging in and finalizing the process on their own.
B) Parallel execution, calling StartExecution and SendMessage asynchronously. If either task finishes, the parallel execution should be considered finished and marked as complete. Problem: a Parallel state requires all branches to finish before it is marked as finished. The image shows the right side still waiting, before a manual stop of the execution.
C) Fail states to signal successful execution, interrupting the parallel task. I found this option on Stack Overflow (https://stackoverflow.com/a/75945146/19666650). Problem: after implementing it, I found out that the output of the Parallel state was the error itself, and I couldn't pass along the output of either the left or the right branch.
D) Cross signaling using DeleteMessage/SendTaskSuccess between branches. I stored the ReceiptHandle and TaskToken in a temp table, allowing each branch to "mark as done" its sibling branch. For example, if the "left" branch finished, it would remove the SQS message and send SendTaskSuccess to the "right" branch. This way, both branches would be considered done, and the Parallel state would have both outputs. Problem: proxying the queues in order to store the ReceiptHandle creates data races between Lambdas running in parallel.
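For reference, the cross-signaling step in D boils down to roughly the following. This is only a sketch: `complete_sibling` and its parameters are names made up for illustration, and `sfn` and `sqs` are assumed to be boto3 clients for Step Functions and SQS (anything exposing `send_task_success` and `delete_message` would do).

```python
import json


def complete_sibling(sfn, sqs, queue_url, receipt_handle, task_token, output):
    """Mark the sibling branch as done once this branch has finished.

    receipt_handle and task_token are assumed to have been stored earlier
    (the temp table in approach D); how they are looked up is not shown.
    """
    # Remove the pending ticket so that no associate picks it up.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    # Resume the sibling's wait-for-callback task with this branch's output.
    sfn.send_task_success(taskToken=task_token, output=json.dumps(output))
```

The sketch does nothing to prevent the data race: two Lambdas can still read and write the stored handle and token at the same time.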
I've tried every trick I could think of, but I'm running out of options. I'm not even sure whether there's a better AWS service for what I need, like EventBridge. Maybe I should explore modeling the flow in EventBridge instead of Step Functions. I'm fairly new to the AWS environment, so if you have any suggestions or ideas, I'd love to hear them!
Thanks for taking the time to read this.
Thanks for your input! An AWS Architect confirmed to me that there is no means of interrupting an SFN. I'll be implementing a solution similar to what you've described. Once I've confirmed it works, I'll share my findings here.
We just released a new blog post that can help with your use case. The idea is to use a Parallel state, with each branch performing one of the operations. When a branch finishes, it raises an exception, which you can then catch and inspect after the Parallel state.
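As a rough illustration of that idea (not a reproduction of the blog post), the sketch below builds an Amazon States Language definition in Python. Each branch ends in a `Fail` state that carries the branch output in its `Cause` as a JSON string, and a `Catch` on the Parallel state routes to a `Pass` state that decodes it again. The helper name `early_exit_parallel`, the state names, and the `BranchCompleted` error name are assumptions made for illustration. It also relies on `CausePath` accepting the `States.JsonToString` intrinsic, which is worth checking against the current Step Functions documentation.

```python
import json


def early_exit_parallel(branches):
    """Build an ASL definition whose Parallel state ends when any branch ends.

    `branches` maps a state name to an ASL Task state (a dict). The Task
    states themselves (resources, parameters) are supplied by the caller.
    """
    asl_branches = []
    for name, task_state in branches.items():
        task = dict(task_state)      # copy so the caller's dict is untouched
        task.pop("End", None)        # the task must flow into the Fail state
        task["Next"] = f"{name}Done"
        asl_branches.append({
            "StartAt": name,
            "States": {
                name: task,
                # Failing this branch aborts the sibling branches; the branch
                # output travels in the error's Cause as a JSON string.
                f"{name}Done": {
                    "Type": "Fail",
                    "Error": "BranchCompleted",
                    "CausePath": "States.JsonToString($)",
                },
            },
        })
    return {
        "StartAt": "Race",
        "States": {
            "Race": {
                "Type": "Parallel",
                "Branches": asl_branches,
                # Only the sentinel error is caught; real failures still fail.
                "Catch": [{
                    "ErrorEquals": ["BranchCompleted"],
                    "Next": "RecoverOutput",
                }],
                "End": True,
            },
            # Turn the JSON string in Cause back into structured output.
            "RecoverOutput": {
                "Type": "Pass",
                "Parameters": {"result.$": "States.StringToJson($.Cause)"},
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder task; a real one would use a waitForTaskToken resource.
    demo = {"WaitForClient": {"Type": "Task", "Resource": "<task-arn>", "End": True}}
    print(json.dumps(early_exit_parallel(demo), indent=2))
```

Aborting the losing branch does not clean up after it, so in the original scenario the SQS ticket would still have to be removed in a later step.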