Yellow.ai's Multi-Region Expansion - Eliminating Single Points of Failure (Part Two)
In the world of technology and infrastructure management, ensuring robustness and resilience is paramount. One crucial aspect of achieving this is eliminating Single Points of Failure (SPOFs). These vulnerabilities can disrupt operations, lead to downtime, and pose significant risks to our business. This blog provides an insider's perspective on how we meticulously identified and eradicated SPOFs, sharing the strategies, insights, and technologies that have strengthened our platform's resilience.
This blog is the second part of the Multi-Region series from Yellow.ai. You can read the first part here.
Understanding Primary and Secondary Regions
Our primary region functioned as the core metadata hub and was located in India. It included the authentication system with access to essential MySQL databases containing chatbot metadata, user account information, and user roles. The primary region formed the backbone of our operations and was central to our multi-region deployment strategy. Secondary regions complemented the primary region by extending the application's reach to other regions. The secondary regions were geographically dispersed to cater to users outside the primary region. They relied on the primary region for access to the metadata and proxied most of their calls to the primary.
In the secondary region, not every service proxied requests to the primary region or relied on data from the metadata database. This was limited to a select few core services, the authentication service being a prominent example. The authentication service worked in a primary-secondary model: it handled most requests by proxying them to the primary region, but it also processed certain requests on its own before seeking assistance from the primary.
One such case was user sessions on our platform. When a user authenticated against the primary, it used the metadata database to compute the session object and cached it in Redis. Without this cache entry, the user was deemed unauthorized. When the secondary attempted to authenticate the same user, it first checked its own cache. However, it did not flag the user as unauthorized if the cache entry was absent. Instead, it requested the session data from the primary and responded to the user based on whether the primary considered them authenticated. If the primary deemed the user authenticated, the secondary cached the session for a brief duration. Whenever the primary purged a session from its Redis instance, it issued commands to the secondaries to invalidate their caches as well. This established a clear distinction between primary and secondary regions.
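To make the flow concrete, here is a minimal sketch of how a secondary region's session lookup could behave, assuming an ioredis-backed local cache and a hypothetical internal session endpoint on the primary (the endpoint path, hostnames, key names, and TTL below are illustrative, not our exact implementation):

```typescript
// Secondary-region session lookup: serve from the local cache when possible,
// otherwise fall back to the primary region instead of rejecting the user.
import Redis from "ioredis";
import axios from "axios";

const localCache = new Redis({ host: "redis.secondary.internal" }); // illustrative host
const PRIMARY_BASE_URL = "https://cloud.yellow.ai";  // generic, load-balanced primary domain
const SECONDARY_SESSION_TTL_SECONDS = 60;            // brief local cache duration (illustrative)

export async function resolveSession(sessionId: string): Promise<unknown> {
  // 1. Check the local Redis cache first.
  const cached = await localCache.get(`session:${sessionId}`);
  if (cached) return JSON.parse(cached);

  // 2. On a cache miss, ask the primary region rather than flagging the user
  //    as unauthorized (the endpoint path is hypothetical).
  const response = await axios.get(
    `${PRIMARY_BASE_URL}/internal/session/${sessionId}`,
    { validateStatus: () => true }
  );
  if (response.status !== 200) return null; // the primary says unauthorized

  // 3. Cache the session locally for a short duration; the primary issues
  //    invalidation commands to secondaries when it purges a session.
  await localCache.set(
    `session:${sessionId}`,
    JSON.stringify(response.data),
    "EX",
    SECONDARY_SESSION_TTL_SECONDS
  );
  return response.data;
}
```

The important property is that a cache miss in the secondary never rejects the user outright: the primary remains the source of truth, while the short local TTL and explicit invalidations keep the copies from drifting.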
A New Foe: Single Points Of Failure
Our first action item after enabling multiple regions was to increase the resilience of the system. We audited every component in the system to find and eliminate single points of failure. Since we had primary and secondary regions, we had to ensure that even if the primary systems went down, none of the secondary systems were affected. We wanted to minimize the blast radius of any unforeseen problems we could encounter in the primary region. Imagine explaining to a customer in the USA that they cannot use the platform because a remote server in India is temporarily unavailable. Our motivation to eliminate single points of failure boiled down to building redundancy and resilience across regions.
Seamless Failovers
Given that the microservices themselves maintain a stateless nature, the key approach was to ensure availability of the shared metadata in more than one region. In the event of an outage in the primary region, the end users would need a seamless switch to a different region that could handle the authentication requests.
The existing authentication sessions were cached in Redis and new sessions were computed from a MySQL metadata database. If the other region could not access the same data, it would deem the calls unauthorized and users would get logged out in the event of a failover. In essence, two microservices in different regions, both running the same image and connected to identical data sources, were expected to exhibit consistent behavior. With a clear problem statement in place, the next logical step was to break it down into smaller concerns to be addressed at the network level, the database level, and the application level. For this, we selected a second region capable of acting as an alternative primary region.
Configuring the Cloudflare Load Balancer
When the other regions communicated with the primary region, they used the generic domain “https://cloud.yellow.ai” instead of a region-specific domain such as “https://z1.cloud.yellow.ai”. This allowed us to load balance the requests to any of the primary regions depending on their health. Cloudflare supports service health monitoring, so we could expose the same health check endpoints used by Kubernetes to Cloudflare and ensure seamless failover. In the event of a service failure, we opted for a simple failover switch, since we wanted to avoid introducing too many changes to the system at the same time. We ensured that the load balancing rules were extensible enough to onboard additional regions as primary.
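As a rough illustration, the health endpoint shared between the Kubernetes probes and the Cloudflare monitors could look like the sketch below, assuming an Express-based service (the route, port, and dependency checks are illustrative placeholders):

```typescript
// One health endpoint, two consumers: Kubernetes readiness probes keep the pod
// in rotation, while the Cloudflare Load Balancer monitor decides whether the
// whole region should receive traffic.
import express from "express";

const app = express();

app.get("/healthz", async (_req, res) => {
  try {
    // Verify the dependencies this region needs in order to serve traffic.
    await Promise.all([checkMySql(), checkRedis()]);
    res.status(200).json({ status: "healthy" });
  } catch {
    // A non-2xx response marks this origin unhealthy, so traffic is steered
    // to the other primary-capable region.
    res.status(503).json({ status: "unhealthy" });
  }
});

// Placeholder checks; real implementations would ping the actual data stores.
async function checkMySql(): Promise<void> { /* ... */ }
async function checkRedis(): Promise<void> { /* ... */ }

app.listen(8080);
```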
One side benefit of this setup was that it allowed us to experiment with different load balancing strategies, such as geo-steering or round-robin, to determine the configuration that gave the lowest latencies across all our regions.
Region-to-Region Database Connectivity
Once the decision was made to replicate the database between regions, we needed to provide network connectivity. This was straightforward, as we had always ensured the CIDR blocks did not overlap between regions. The database subnets were connected using readily available connectivity options from our cloud providers, allowing them to communicate with each other and, by extension, replicate the data.
Extending the SQL Database
The metadata database existed in MySQL, a popular relational database. When we wanted to replicate this database, we considered other relational databases, focusing mainly on high availability and fault tolerance. We evaluated MySQL Group Replication, Amazon RDS, and Amazon Aurora. We ruled out RDS immediately because it was not cost-effective. We found group replication to be a great mechanism and a good fit for us, as it worked with the existing setup, but it was a close call. What tipped the scales in favor of MySQL GR was that Aurora could not withstand region-level failures and was limited to withstanding the loss of Availability Zones within a region. That said, Amazon Aurora is a great distributed database and we would consider it for future use cases.
The main objective of MySQL replication was to make the failovers between regions seamless. This meant the data replication had to be synchronous, and hence, we set the consistency level of the group replication to “AFTER”. From the group replication docs:
“If you have a group that has predominantly read-only (RO) data, you want your read-write (RW) transactions to be applied everywhere once they commit, so that subsequent reads are done on up-to-date data that includes your latest writes, and you do not pay the synchronization on every RO transaction, but only on RW ones. In this case, you should choose AFTER.”
Since the system is read-heavy, we could afford slightly slower updates as a trade-off for high availability. In practice, the write latency did not change noticeably, which validated that we had made the right decision.
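For reference, the consistency level is controlled by a single MySQL 8.0 system variable; a minimal illustration of applying the setting we chose (the exact rollout procedure across the group members is omitted):

```sql
-- Illustrative only: enforce the "AFTER" consistency level for Group
-- Replication (available from MySQL 8.0.14), so a RW transaction is applied
-- on the other members before its commit returns.
SET PERSIST group_replication_consistency = 'AFTER';
```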
ProxySQL
Next up, we wanted the applications to be aware of any topology changes that could happen in MySQL. We evaluated ProxySQL, MySQL Router, and MariaDB MaxScale. Based on popular benchmarks, MySQL Router’s performance was significantly lower than that of other proxy tools such as MariaDB MaxScale and ProxySQL. This was a deal breaker, as we wanted to build a future-proof system; it was clear that MySQL Router required improvements in both usability and performance. We eliminated MariaDB MaxScale because it needed a commercial license.
Finally, the community around ProxySQL and its open-source license sealed the decision. ProxySQL ensured the traffic always went to the primary and switched it over in the event of any topology change. This meant every application could connect to ProxySQL, which would then multiplex the connections to MySQL. It supports read/write splitting as well, should the need arise. Thank you, Rene Cannao.
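From the application's point of view, the change is small: services point their connection pools at ProxySQL instead of at a specific MySQL node. A minimal sketch, assuming the mysql2 library, ProxySQL's default traffic port 6033, and illustrative host, database, and table names:

```typescript
// Services connect to ProxySQL; ProxySQL multiplexes these connections to
// whichever MySQL node is currently the group replication primary.
import mysql from "mysql2/promise";

const pool = mysql.createPool({
  host: "proxysql.internal",          // illustrative ProxySQL host
  port: 6033,                         // ProxySQL's default MySQL traffic port
  user: "app_user",
  password: process.env.DB_PASSWORD,
  database: "metadata",               // illustrative schema name
  connectionLimit: 10,
});

export async function getBotMetadata(botId: string) {
  // If the MySQL topology changes, ProxySQL reroutes the traffic; the
  // application never needs to know which node is the primary.
  const [rows] = await pool.query("SELECT * FROM bots WHERE id = ?", [botId]);
  return rows;
}
```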
If you are interested in learning more about proxies, this is a great video that covers most of the relevant MySQL proxies.
Redis Sentinel
With MySQL replication out of the way, the next step was to make the same Redis cache available in the alternative primary region. A simple Redis Sentinel setup was adequate for sharing the authentication data across regions and providing failovers. The sentinels ran in the primary regions as Kubernetes Deployments and communicated with the Redis servers in the two regions, forming the shared cache layer. The microservices used ioredis, a Redis client capable of understanding sentinels and reconnecting to the current Redis primary. If the sentinels switched the traffic, the use of this library ensured there was minimal disruption to the end users.
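For illustration, connecting through Sentinel with ioredis looks roughly like the sketch below (the sentinel addresses, master group name, and key are placeholders, not our actual topology):

```typescript
// Connect through Redis Sentinel so the client follows the current master,
// even when the sentinels promote the Redis instance in the other region.
import Redis from "ioredis";

const redis = new Redis({
  sentinels: [
    { host: "sentinel-region-a.internal", port: 26379 }, // illustrative addresses
    { host: "sentinel-region-b.internal", port: 26379 },
  ],
  name: "auth-cache", // the monitored master group name (placeholder)
  sentinelRetryStrategy: (times) => Math.min(times * 100, 3000),
});

export async function cacheSession(sessionId: string, session: object): Promise<void> {
  // Writes land on whichever Redis the sentinels currently consider the master;
  // ioredis reconnects automatically after a failover.
  await redis.set(`session:${sessionId}`, JSON.stringify(session), "EX", 300);
}
```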
Tweaking the Application Layer
Each service consuming the metadata database was stateless because its state was persisted in Redis and MySQL. This meant any of the regions could become the primary region as long as it had access to the same state. Depending on the configuration loaded, the application could decide whether to act as the primary. As part of the initial setup to enable multiple regions, we had also added a proxy middleware in our applications to automatically intercept calls in secondary microservices and send them to the primary service. The proxy middleware also supported serving certain API calls from the local microservice, achieving the best of both worlds.
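A simplified sketch of the proxy middleware idea, assuming Express and the http-proxy-middleware package (the route list, environment variable, and role check are illustrative):

```typescript
// In a secondary region, intercept API calls: serve a small allow-list locally
// and proxy everything else to the primary region.
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();
const IS_PRIMARY = process.env.REGION_ROLE === "primary"; // role comes from the loaded configuration

// Calls the secondary region can answer from its own services (illustrative list).
const LOCAL_ROUTES = ["/api/health", "/api/session/refresh"];

const proxyToPrimary = createProxyMiddleware({
  target: "https://cloud.yellow.ai", // generic, load-balanced primary domain
  changeOrigin: true,
});

if (!IS_PRIMARY) {
  app.use((req, res, next) => {
    if (LOCAL_ROUTES.some((route) => req.path.startsWith(route))) {
      return next(); // handled by the local microservice
    }
    return proxyToPrimary(req, res, next); // everything else goes to the primary
  });
}

app.listen(3000);
```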
In this activity, we introduced support for Sentinel-based Redis and combed through the applications reading from the metadata database to remove any logic written with a single-primary-region assumption. Such patterns would cause unexpected failures in the event of a failover. Once they were refactored to ensure consistent behavior during failover, we subjected the application to rigorous testing.
The Promised Land of No SPOF
We did it. We dismantled SPOFs in our infrastructure, layer by layer. The result is a resilient system capable of weathering unforeseen challenges and ensuring uninterrupted service to our users across the globe. As always, we remain cautious about any points of failure we have not yet identified and continue to be on the lookout for them.
Final Notes
At Yellow.ai, we firmly believe that resilience is not just about withstanding failures but also about thriving in the face of change and uncertainty. Our global commitment to delivering seamless conversational AI experiences drives us to continuously innovate and adapt. As we look to the future, we remain steadfast in our dedication to fortifying our infrastructure, ensuring uninterrupted service, and embracing the opportunities presented by a multi-region, multi-cloud platform.