Orchestrating Resilience Building Modern Asynchronous Systems

In this article, we discuss the problems we had to solve at Twilio to build a resilient and scalable asynchronous system for a complex workflow, and the advantages we gained from adopting a Workflow Orchestration solution, including abstracting away state management and out-of-the-box support for retries, observability, and auditability.
Twilio is a customer engagement platform that lets you engage with your customers from your application over different channels such as Voice, Messaging, WhatsApp, email, and video.
One of the biggest problems with SMS is spam. There is also a lot of phishing over SMS (smishing).
Due to the increase in spam messages, many consumers have lost trust in SMS as a form of communication. In the US, A2P 10DLC is the standard that carriers have implemented to regulate this communication pathway.
A2P 10DLC improves the end-user experience by making sure consumers can opt in and out of messaging and also know who is sending them messages. It also benefits businesses, offering them higher messaging throughput, brand awareness, and accountability.
To be compliant with A2P messaging, you need to register your application at three different levels. First, you register your brand or business, which is manually reviewed and approved. Next, you register your campaign, detailing what messages you are going to send, for example, two-factor authentication codes for account setup and login, as well as notifications. Here, again, your use case is vetted by a reviewer, and you may be required to provide additional detail about it. Finally, you register the set of phone numbers you will use to send messages.
In this article, we will discuss what problems we had to solve at Twilio to efficiently build a resilient and scalable asynchronous system to handle a complex workflow implementing A2P Messaging Compliance and the advantages we got from adopting a Workflow Orchestration solution.
How Twilio implemented A2P compliance platform
We originally built the A2P compliance platform using a state machine to orchestrate the different registration processes described above. The system used an event-driven architecture: its components communicated through queues, and events were processed at a controlled rate to respect downstream rate limits. Over time, the platform evolved into a complex system that became hard to maintain (state machines, queues, rate limiting, error handling, auditing, etc.). We then started seeing challenges in scaling both the systems and the teams.
Challenges building a resilient asynchronous workflow
These are the challenges we faced building an event-driven system with state machines:
State Management
The first one is state management. The problem here is that you need to account for many possible combinations of states and events. For example, the “review received” message could come in while the campaign is in the pending state instead of the relevant waiting state, or an out-of-sequence event could come in from somewhere, and so on. All of those cases need to be handled, even though they are not the most likely sequence of events and states.
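The combinations described above can be sketched as a lookup from (state, event) pairs to the next state. This is a minimal illustration, not Twilio's implementation: the state names, event names, and the `handle` and `unhandled_pairs` helpers are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical campaign states, for illustration only.
    PENDING = auto()
    WAITING_FOR_REVIEW = auto()
    APPROVED = auto()
    REJECTED = auto()


class Event(Enum):
    # Hypothetical events, for illustration only.
    SUBMITTED = auto()
    REVIEW_RECEIVED = auto()


# Transition table: (current state, event) -> next state.
# Only the expected "happy path" pairs are listed.
TRANSITIONS = {
    (State.PENDING, Event.SUBMITTED): State.WAITING_FOR_REVIEW,
    (State.WAITING_FOR_REVIEW, Event.REVIEW_RECEIVED): State.APPROVED,
}


def handle(state, event):
    """Return the next state, or raise if the pair was never planned for."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"unhandled event {event.name} in state {state.name}"
        ) from None


def unhandled_pairs():
    """List every (state, event) combination the table does not cover."""
    return [(s, e) for s in State for e in Event if (s, e) not in TRANSITIONS]
```

With four states and two events there are eight possible pairs, and this table covers only two of them. Each new state or event adds more pairs that have to be either handled or explicitly rejected.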
As the number of states and messages grows, so does the complexity of ensuring you handle messages accurately in every state. You may want to handle a message differently depending on the state, or not at all. If you want to add a new intermediate step, you have to review all your states and add code to ensure that its message is reliably handled in each of them. The fundamental problem with the state machine is that you cannot configure it as the sequence of steps required to carry through a registration; you can only configure it in terms of events, states, and actions: “I am getting event X, my database is in state Y, so I’m going to perform action Z.” Thinking in terms of state machines and state transitions became complex and was often very error-prone.
Retry mechanism
Handling retries becomes a task almost as complex as implementing primary logic, sometimes even more so.
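To give a sense of what hand-written retry logic involves, here is a minimal sketch of retrying an operation with exponential backoff and jitter. It is an illustration under stated assumptions, not the platform's code: `TransientError`, `call_with_retries`, and all parameter names and defaults are hypothetical.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with backoff.

    The delay cap doubles after each failed attempt (up to max_delay),
    and the actual wait is a random value between 0 and that cap.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

Even this small helper keeps its attempt count only in memory, so a process restart loses it, and it records nothing about which attempts failed. Covering those gaps is additional code on top of the primary logic.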
