High Availability with DNS Round Robin Load-Balancing across Multiple Data Centers
Ensuring 24/7 availability is essential to providing high-quality web services. Different solutions can be applied to achieve this, depending on the particular application: for example, horizontal scaling, software clustering and session replication.
However, even a properly configured cluster can occasionally suffer downtime if the whole data center fails. So today we’d like to showcase a simple, cheap and effective HA solution: distributing your workloads across several data centers. This recently became possible with the newly added Jelastic feature of Multiple Hardware Regions.
To maximize availability, we suggest configuring two application clusters with the same content in different data centers, so that if one of them is unavailable, all incoming requests go to the other region. To accomplish this, we’ll take advantage of DNS round robin load balancing as the most affordable option for such an HA implementation.
Note that this approach is best suited to websites and web applications that work over HTTP(S). Apps that use other TCP-based protocols require some additional configuration to operate properly with several backends.
1. To start with, you need at least two environments located in different hardware regions (the easiest way to get them is to clone the environment with your application and migrate it to another region).
As an example, we’ll use the following topology: a pair of highly available Tomcat 7 application servers behind an NGINX load balancer with the Public IP option enabled.
Note that this solution requires incoming requests to arrive via an external IP address attached to the appropriate node (i.e. either the balancer or the application server).
2. Next, we’ll take advantage of the DNS servers’ ability to bind multiple IP addresses (i.e. entry points) to a single domain name. To do so, add the external IPs of both of your balancers as separate A records for your custom domain, following the corresponding guide.
To make sure everything is configured properly, you can check the DNS settings by querying your domain from a terminal with a DNS lookup utility such as dig (for example, dig example.com).
Both A records will be listed in the ANSWER SECTION of the output in the following format (the exact values for your domain will, of course, differ):
example.com 9999 IN A first_IP
example.com 9999 IN A second_IP
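If you would rather check the records from a script, the sketch below may help. It is only an illustration: the helper names (ARecord, parse_a_records, resolve_ipv4) and the sample addresses from documentation ranges are our own assumptions, not part of any Jelastic or DNS tooling. The first helper parses lines in the format shown above, and the second asks the system resolver for every IPv4 address of a domain.

```python
import socket
from typing import List, NamedTuple


class ARecord(NamedTuple):
    name: str
    ttl: int
    address: str


def parse_a_records(answer_section: str) -> List[ARecord]:
    """Extract A records from lines shaped like 'name TTL IN A address'."""
    records = []
    for line in answer_section.splitlines():
        fields = line.split()
        # Expect exactly five fields: name, TTL, class "IN", type "A", address.
        if len(fields) == 5 and fields[2] == "IN" and fields[3] == "A":
            records.append(ARecord(fields[0], int(fields[1]), fields[4]))
    return records


def resolve_ipv4(hostname: str) -> List[str]:
    """Ask the system resolver for every IPv4 address of hostname."""
    infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hypothetical sample values; substitute the output for your own domain.
    sample = "example.com 9999 IN A 192.0.2.10\nexample.com 9999 IN A 198.51.100.20"
    for record in parse_a_records(sample):
        print(record.name, record.ttl, record.address)
```

With both balancers registered, resolve_ipv4 for your domain should return two addresses, although a caching resolver between you and the authoritative server may influence what you see.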
The Implied Workflow
As a result, the DNS server returns the whole list of available addresses when it receives a request for the domain. The web browser tries them one by one and uses the first one that responds to establish the connection. Usually, that is the first Public IP address in the list. If the corresponding data center is unavailable, the next address is tried.
Note that a standard DNS server does not track the availability of the addresses it provides; it hands out the whole list regardless of their accessibility. So if one of them is unavailable, expect some delay in request processing, as users’ web browsers will wait for a response from that IP for some time (roughly 10-30 seconds).
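The client-side behavior described above can be approximated with a short sketch. This is not how any particular browser is implemented; connect_first_available is a hypothetical helper of ours, and the connect parameter exists only so the network call can be swapped out when experimenting.

```python
import socket
from typing import Callable, Iterable, Tuple


def connect_first_available(
    addresses: Iterable[str],
    port: int = 80,
    timeout: float = 10.0,
    connect: Callable = socket.create_connection,
) -> Tuple[str, object]:
    """Try each address in order; return (address, connection) for the first that answers."""
    last_error = None
    for address in addresses:
        try:
            # An unreachable data center costs up to `timeout` seconds here
            # before the next address in the list is attempted.
            return address, connect((address, port), timeout=timeout)
        except OSError as error:
            last_error = error
    raise ConnectionError("no address in the list responded") from last_error
```

The timeout argument is where the waiting time comes from: each dead address at the head of the list adds up to that many seconds before a live one is reached.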
Also be aware that after an unresponsive address is manually removed from the pool, DNS servers will keep returning it for the time stated in the record’s TTL caching setting. Changing the TTL afterwards does not affect copies that are already cached, so we recommend setting it to the minimum beforehand.
In addition, DNS servers automatically provide round robin distribution for domain names with multiple IP records: after each request is processed, the order of addresses is cyclically shifted, moving the first IP to the end of the list, which results in an even workload distribution among your environments in different regions.
Since it is very unlikely that both data centers hosting your environments will go down at the same time, this already gives you a sufficient level of redundancy. Nevertheless, the number of regions can easily be increased further by applying a similar configuration.
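To see why cyclic shifting evens out the load, here is a small simulation. It is a simplified model under our own assumptions (every client takes the first address of each answer and nothing is cached along the way); the function names are illustrative only.

```python
from collections import Counter, deque
from typing import Iterator, List, Sequence


def round_robin_answers(addresses: Sequence[str]) -> Iterator[List[str]]:
    """Yield the address list for successive queries, rotating it by one each time."""
    pool = deque(addresses)
    while True:
        yield list(pool)
        pool.rotate(-1)  # move the first address to the end of the list


def first_choice_counts(addresses: Sequence[str], queries: int) -> Counter:
    """Count how often each address comes first over a number of queries."""
    answers = round_robin_answers(addresses)
    return Counter(next(answers)[0] for _ in range(queries))
```

With two addresses and an even number of queries, each address heads the list in exactly half of the answers. Real traffic will be less tidy, since intermediate resolvers cache answers and clients differ in how they pick from the list.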
Tip: Some DNS providers (e.g. Amazon, DNS Made Easy, Dyn, etc.) offer an additional dynamic DNS failover mechanism, in which each A record is paired with an alternative one. The original IPs are monitored by a specially configured DNS server (such as lbnamed), and if one of them stops responding, it is automatically removed from the pool and replaced with its backup address (only for the duration of the outage). This way, you can avoid the delays and get even more out of the described HA approach.
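The substitution logic from the tip can be sketched roughly as follows. This does not reproduce the mechanism of any of the providers named above or of lbnamed; active_pool and tcp_check are hypothetical helpers, and a plain TCP connection attempt is assumed to be an adequate health probe.

```python
import socket
from typing import Callable, Dict, List


def tcp_check(address: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Treat an address as healthy if a TCP connection can be opened to it."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def active_pool(backups: Dict[str, str], is_healthy: Callable[[str], bool] = tcp_check) -> List[str]:
    """Return the addresses to publish: each healthy primary, or its backup otherwise."""
    pool = []
    for primary, backup in backups.items():
        # A failed primary is swapped out only while it stays unhealthy;
        # the next evaluation restores it once it responds again.
        pool.append(primary if is_healthy(primary) else backup)
    return pool
```

A monitoring service would re-evaluate such a pool on a schedule and publish the result as the current set of A records, which is why a short TTL matters for this approach as well.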
Performance & Failover Testing
To check the effectiveness of this solution, we ran a dedicated test combined with a failover check. A dedicated domain was put under steady load using Apache JMeter, while changes on both instances were tracked with the embedded Jelastic statistics module.
To start with, we made sure that all incoming requests were processed without errors under a continuous, steady load. After about 10 minutes without a single failure, one of the clusters was manually shut down to simulate an outage.
Below you can see the CPU & Network statistics for both of our load balancers while running the described scenario:
As you can see, both environments in different regions handle a steady and evenly divided load until the point where the first cluster receives the shutdown command (at approximately 12:10; this moment is marked with the red dotted line). After that, its activity drops to zero (just as expected for an unavailable instance), while resource consumption in the second environment starts to rise, as it has become the only entry point and now handles all the received traffic.
In the image above, the Network load for Region 2 appears to increase only slightly after the first region was shut down. In reality, this small step corresponds to doubled consumption, because the vertical axis of the graph uses a logarithmic scale.
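A quick calculation shows why a doubling looks so modest on such a graph. The sketch assumes a base-10 logarithmic axis, and the traffic values are made-up numbers, not readings from the chart.

```python
import math


def log_axis_step(before: float, after: float) -> float:
    """Distance between two values on a log10 axis, measured in decades."""
    return math.log10(after) - math.log10(before)


# Hypothetical traffic levels, used only to illustrate the scale.
doubling = log_axis_step(100.0, 200.0)   # roughly 0.3 of a decade
tenfold = log_axis_step(100.0, 1000.0)   # one full decade
print(f"doubling: {doubling:.3f} decades, tenfold: {tenfold:.3f} decades")
```

In other words, a doubling moves the line by less than a third of the distance between two adjacent powers of ten on the axis, whatever the starting value.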
After the first cluster was halted, all newly incoming requests were directed to the active instance, so none of them was dropped. This can be seen in the automatically generated JMeter report below, which shows the processing results around the moment of the simulated failure and the overall test summary:
You can check the efficiency (i.e. response time & the number of errors) of the described HA approach yourself in a similar way, using any other load-generating tool (like Load Impact, Load UI, WebLoad, etc.).
The presented method is extremely simple and affordable, as it doesn’t require any additional software or hardware integration to deal with a disaster as major as the failure of an entire data center. So don’t waste time: increase the availability and uptime of your services with Jelastic right now! Give it a try and experience all the benefits of the Jelastic DevOps PaaS yourself.
Stay tuned for upcoming publications on our blog to discover another, more advanced HA solution, which ensures automatic management of DNS records by tracking the availability of hardware regions and provides smart geo-distribution of incoming traffic by means of Azure Traffic Manager. And remember that taking preventive measures beforehand always costs less effort and money than recovering lost data and restoring your customers’ trust after trouble strikes.