Failover and Flawless Execution of High Availability

Sep 14, 2022
by frz
in DevOps

Your website should run perfectly 100% of the time, but what happens if it doesn't?

Planning for this kind of disaster means you're thinking about problems before they happen, which is excellent. The next thing to realize is that there's quite a bit of nuance in how you solve this problem.

Three basic questions need to be addressed:

  1. How will we detect an issue?
  2. How will we redirect traffic to the failover solution?
  3. What is the failover destination?

How to monitor and detect an issue?

Detecting an issue is the simplest part to solve. Any number of stand-alone services will check the health of your website. Services like Pingdom and Uptime hit a URL at frequent intervals and look for a consistent response that you define. They typically have servers worldwide, and if several of them start reporting unexpected results, you can assume something is wrong with your website. These services can send emails and SMS messages, or you can integrate them with incident-response tools like PagerDuty. Additionally, services like New Relic integrate with your production backend systems and can send alerts when certain usage thresholds are exceeded. In this way, a well-monitored system can have engineers debugging a problem before it brings the live website down.
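As a rough illustration of what such a check does, here is a minimal Python sketch. It is not how Pingdom or any other service is implemented. The function names, the region labels, and the "at least two regions must fail" rule are all assumptions made up for this example.

```python
from typing import Callable, Dict, Tuple
from urllib.request import urlopen

# A fetcher takes a URL and returns (HTTP status code, response body text).
Fetcher = Callable[[str], Tuple[int, str]]


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a URL with the standard library.

    Connection errors, timeouts, and HTTP error statuses (4xx/5xx) all
    surface as OSError subclasses here, so they are reported as status 0.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except OSError:
        return 0, ""


def probe(url: str, expected_text: str, fetch: Fetcher = http_fetch) -> bool:
    """Return True when the page answers 200 and contains the expected text."""
    status, body = fetch(url)
    return status == 200 and expected_text in body


def site_looks_down(region_results: Dict[str, bool], min_failures: int = 2) -> bool:
    """Treat the site as down only when several regions report a failed probe.

    region_results maps a (hypothetical) region label to that region's
    probe result; min_failures is an assumed threshold, not a standard.
    """
    failures = sum(1 for healthy in region_results.values() if not healthy)
    return failures >= min_failures
```

Requiring more than one failing region is what keeps a single flaky network path from waking someone up at 3 a.m. The `fetch` parameter exists only so the logic can be exercised without touching the network.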

How to redirect traffic?

So the site is broken, and it's not magically coming back up on its own. While fixing the site is the top priority, you'd like to show something better than an error page while the engineers figure out what is wrong. Manual is the default answer: an engineer does their best to respond to the situation as it emerges. Depending on where the issue sits in your technology stack or hosting, that engineer can redirect traffic to some type of failover solution.

Failover Automation

Automatic is a better answer. Some (not all) DNS providers offer an automatic failover solution. The DNS provider performs some basic monitoring, and when it looks like the server is down, it routes the public to a failover destination. DNS Made Easy's DNS Failover does this, and Amazon's Route 53 health checks can be used to set up active-active and active-passive failover configurations. Your IT department likely wants to manage DNS themselves (instead of having your web agency in charge of it), so it's essential to make sure their chosen DNS provider has a failover option if you want this to happen automatically.
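To make the active-passive idea concrete, here is a hedged Python sketch that builds the kind of change batch Route 53 accepts for a pair of failover records. The helper names, the domain, the documentation-range IP addresses, the health check ID, and the 60-second TTL are placeholders chosen for illustration. The code only builds a dictionary and does not call AWS.

```python
from typing import Any, Dict, Optional


def failover_record(
    name: str,
    role: str,
    ip_address: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one failover A record; role must be PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,
        "Failover": role,
        "TTL": ttl,  # a short TTL lets resolvers notice a switch sooner
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        # The health check decides when this record stops being served.
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(
    name: str, primary_ip: str, secondary_ip: str, primary_health_check_id: str
) -> Dict[str, Any]:
    """Pair a health-checked primary with a passive secondary."""
    primary = failover_record(
        name, "PRIMARY", primary_ip, "primary", primary_health_check_id
    )
    secondary = failover_record(name, "SECONDARY", secondary_ip, "secondary")
    return {
        "Comment": "Active-passive failover for " + name,
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": primary},
            {"Action": "UPSERT", "ResourceRecordSet": secondary},
        ],
    }


batch = failover_change_batch(
    "www.example.com.", "192.0.2.10", "198.51.100.20", "hc-placeholder"
)
```

With boto3 installed and credentials configured, a batch like this could be submitted with `boto3.client("route53").change_resource_record_sets(HostedZoneId="YOUR_ZONE_ID", ChangeBatch=batch)`. The zone ID and health check ID would have to come from your own account. Other DNS providers expose the same idea through their own dashboards or APIs.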

Load Balancing

Your site's infrastructure may have a load balancer set up as well. In this scenario, the load balancer can redirect traffic elsewhere in an emergency, assuming the load balancer itself is working. The challenge is that every one of these systems can be a point of failure, so the best place to handle failover is the outermost point that all traffic passes through first, and that's your DNS.
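The load balancer behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular load balancer works. The `Backend` and `FallbackBalancer` names, the addresses, and the round-robin policy are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    address: str
    healthy: bool = True  # in practice this would be set by health checks


class FallbackBalancer:
    """Round-robin over healthy backends, else send traffic to the failover."""

    def __init__(self, backends: List[Backend], failover_address: str) -> None:
        self.backends = backends
        self.failover_address = failover_address
        self._next = 0  # counter used to rotate through healthy backends

    def pick(self) -> str:
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            # Nothing in the pool can serve traffic: use the failover destination.
            return self.failover_address
        chosen = healthy[self._next % len(healthy)]
        self._next += 1
        return chosen.address
```

Note what the sketch cannot do: if the process running `pick()` is itself down, nothing redirects anything. That is the reason the outermost layer matters.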

What is the failover destination?

Websites are generally much more than the simple HTML pages of the 1990s. Many function more like fully interactive applications that require a working database and application code. Due to the complexity of digital marketing tools and content management systems, saying "just take a copy of the website" isn't as straightforward as it may seem. There are a few options for how complex your failover destination is:

A single static HTML page with a few images. Twitter famously had a "fail whale" page they used in their early days when performance was a constant concern. This single page went up when there was an issue and let the public know "we're here, we're working on it." Today we'd suggest hosting a page like this in Amazon S3 as that service is extremely dependable.

A static HTML microsite. Similar to the single "fail whale" page, but perhaps you design a few pages as a lightweight alternative to your marketing site. This contains the critical information people are looking for, but it doesn't include any of the more complex functionality like search or forms. We'd also suggest keeping this site small and filling it with timeless content, since a static site isn't easy to update and ideally you'll rarely see it.

A backup copy. You can maintain a whole copy of your website and the infrastructure it runs on in a different location, with the idea of swapping to it in an emergency. Your site would have the same functionality as the normal live site, but your data would come from a snapshot, typically a day or less old. This backup copy sounds great on paper, but it raises its own challenges. The systems used to copy data to the backup can be difficult to keep running properly, and a failure there is easy to miss if you only ever see this site in an emergency (which hopefully never happens). If you do start using the backup and it has a database collecting information, how do you then swap back to the main production site later without losing content?

A high availability, fault tolerant, load balanced, distributed, multi-zone setup. While those buzzwords all sound similar and look great, each one actually means something distinct. The safest possible way to build an application is to distribute traffic across multiple copies of everything, all of which continually serve the public. In this fashion you know your backup is running properly because your backup is just part of the network of services powering production every day. This type of solution is great if you're building a mission-critical business application, but it will come with a price tag to match the impressiveness of the buzzwords involved.
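For the backup copy option, the risk that the copy process silently stops can be reduced by checking the age of the latest snapshot on a schedule and alerting when it gets too old. The Python sketch below shows only that check. The function name and the one-day limit (taken from the "a day or less old" figure above) are assumptions, and how you obtain the snapshot timestamp depends entirely on your backup tooling.

```python
from datetime import datetime, timedelta, timezone


def snapshot_is_stale(
    last_snapshot_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=1),
) -> bool:
    """Return True when the newest snapshot is older than max_age.

    Both datetimes should be timezone-aware so the subtraction is unambiguous.
    """
    return now - last_snapshot_at > max_age


# Example: a snapshot from 30 hours ago exceeds the one-day limit.
now = datetime(2022, 9, 14, 12, 0, tzinfo=timezone.utc)
stale = snapshot_is_stale(now - timedelta(hours=30), now)
```

Wiring a check like this into the same alerting used for the live site means a broken backup is noticed on an ordinary day instead of during an outage.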

Conclusion

Keep it simple. The best bang for your buck is to learn about your DNS provider's failover services and build a single static web page to send traffic to in an emergency. Once you start trying to fake a more robust solution with clever workarounds, you invariably build yourself something challenging to maintain that might not deliver on the promises your CISO and CEO are expecting when they ask you “what happens when the site goes down?”