As a Disaster Recovery Service Provider, we have a unique viewpoint into the most important items to consider when purchasing or implementing a DR solution. When talking to customers, we often find that items we believe to be important are not considered in enough detail, which can lead to a less than ideal DR solution.
A Disaster Recovery solution is your insurance policy: hopefully you never need it, but when you do, you really need it. Your DR solution shouldn’t be an afterthought or a “tick in a box” exercise; it is a critical element of your overall IT strategy. Being able to recover quickly in the event of a disaster could prevent the loss of money, reputation and customers.
In this article I’m going to cover the top 5 things that I think you should consider when implementing, designing or purchasing a DR solution.
The first key question to answer is what actually represents a DR situation? It could be any of the below:
You need to be clear what you are trying to protect against when you start designing your DR solution or talking to a service provider. Once it is clear what requirements must be met it becomes easier to design an appropriate solution.
Understanding the answer to this question early on also lets you clarify with the business exactly what the final DR solution will provide. This avoids confusion later if the solution doesn’t do something the business assumed it would.
Some of the above-mentioned examples can have a significant impact on how a solution is designed. One distinction we try to make early on is whether the solution represents an all-or-nothing failover (i.e. a full site outage) or whether it should provide for partial failover in the event that a single application or a group of applications fails (but not everything). This is a key decision, as designing a solution for partial failover has specific requirements, especially around things like connectivity. If this requirement is not made clear from the outset, there is a good chance the solution won’t meet it when it is raised at a later date.
Be clear from the outset what disaster scenarios your DR solution will protect against and ensure this is aligned technically with the business requirement.
What are RPOs and RTOs?
Your Recovery Point Objective (RPO) is the maximum amount of data loss, measured in time, that you can tolerate; your Recovery Time Objective (RTO) is the maximum amount of time an application or service can be down before it must be restored.
Different applications and services may have different RPOs and RTOs depending on their criticality or the data they hold. You should evaluate what these are for each of your applications and servers based on the targets the business sets. Don’t be afraid to push back if the business sets unachievable targets.
When setting RPOs and RTOs you need to evaluate what is realistic. For example, in order to achieve a low RPO (seconds), you need a product that can support it with near-continuous replication, such as Zerto. You may also need sufficient bandwidth to your DR site to support the volume of data being replicated. The rate of data change on a particular application can have a big impact on the achievable RPO, as you’ll be trying to push more data across to the DR site quickly. Databases, indexing servers and file servers (particularly those using DFS replication) are workloads we see with particularly high data change rates.
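To make the bandwidth point concrete, here is a rough back-of-envelope sketch. The 20 GB/hour change rate is purely illustrative, not a figure from any specific product or workload:

```python
def min_bandwidth_mbps(change_rate_gb_per_hour: float) -> float:
    """Minimum sustained link speed (Mbit/s) needed just to keep pace
    with the data change rate, ignoring protocol overhead and bursts."""
    gb_per_second = change_rate_gb_per_hour / 3600
    return gb_per_second * 8 * 1000  # GB/s -> Gbit/s -> Mbit/s

# Illustrative example: a database churning 20 GB of changes per hour
# needs roughly 44 Mbit/s of replication bandwidth just as a floor.
print(round(min_bandwidth_mbps(20), 1))
```

If the link can’t sustain the change rate, replication falls behind and the real RPO grows, regardless of what the product promises on paper.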
For your RTOs you need to consider not only how long it will take for a machine to fail over; it will also need to boot up and may require manual changes before the application is fully available. Think about whether your advertised RTO to the business represents the point when a machine has failed over, or when it is fully functional again.
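A realistic RTO is the sum of all the steps, not just the failover itself. A minimal sketch, where every duration is an assumption for illustration only:

```python
# Hypothetical RTO breakdown for one application (all figures are
# illustrative assumptions, not vendor numbers or measured values).
rto_components_minutes = {
    "initiate failover": 5,
    "VM boot and service start": 10,
    "manual application changes": 20,
    "smoke testing before go-live": 15,
}

total_rto = sum(rto_components_minutes.values())
print(f"Realistic RTO: {total_rto} minutes")  # 50 minutes, not 5
```

Advertising only the first line to the business is how a “5 minute RTO” quietly becomes an hour on the day.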
Remember these are targets and they should be achievable. If you know you can never hit them but the application requires a low RPO, then perhaps the DR solution isn’t the problem and the primary solution needs to be looked at.
Another key point when discussing RPOs is recovery points. How much granularity do you need when failing over? That is, how many different points in time do you need to have available? Our primary DR solution of choice, Zerto, uses a journal which records recovery points every few seconds, giving many different options when choosing a point in time to recover to.
To put some context around this, imagine you have been hit by ransomware but aren’t sure when you became infected. Using granular recovery points, you could fail over to a point an hour ago and review the server for infected files. If the infection is still present, you can roll back the failover, then fail over again to one hour and five minutes ago, repeating this process until you find the point at which you were infected, thus minimising data loss.
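The process above is essentially a walk backwards through the journal. A minimal sketch of the idea; the checkpoint list and the `is_infected` check are hypothetical placeholders, since a real journal-based product exposes its own interface for this:

```python
from datetime import datetime, timedelta

def find_clean_checkpoint(checkpoints, is_infected):
    """Walk backwards through journal checkpoints (newest first) and
    return the most recent one that shows no sign of infection."""
    for point in checkpoints:
        # In practice: fail over to `point` in an isolated test
        # environment, inspect it, and roll back if still infected.
        if not is_infected(point):
            return point
    return None  # infection predates the journal - fall back to backups

# Hypothetical journal: checkpoints every 5 minutes over the last hour.
now = datetime(2024, 1, 1, 12, 0)
checkpoints = [now - timedelta(minutes=5 * i) for i in range(12)]
infected_since = now - timedelta(minutes=27)

clean = find_clean_checkpoint(checkpoints, lambda p: p >= infected_since)
print(clean)  # the newest checkpoint taken before the infection
```

The trade-off in the next paragraph follows directly: the further back the clean point turns out to be, the more data you lose by recovering to it.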
We usually find that whilst it is useful to have a few days of granular recovery points available, beyond that you really have to consider the amount of data you’d be losing by rolling back that far. For many businesses, this would be unacceptable.
Look at each of your applications or services and understand what its RPO and RTO should be, whether those targets are realistic, and whether they align with the business requirement.
Let’s be honest, no IT professional enjoys doing DR tests. Traditionally they take a lot of time, have to be done out of hours and rarely go to plan.
It goes without saying that testing failover is a really important part of your DR plan. A well-tested DR solution instils confidence in the IT team and the business as a whole. If you know you can recover your applications and services, then a real DR event, should one ever happen, will be considerably less stressful for everyone involved.
Testing is a lot more than just failing over a machine to make sure it boots. We usually recommend our customers break their testing activities into phases, working through each phase and gaining more confidence at each stage.
In order to ensure successful testing, you need to put your design hat on and think about how each application or service works. What are its dependencies, both internal and external? Are any manual changes required when failing over? These are all things that take time and should be clearly documented in your business continuity plan. Generally, these are not things your DR service provider will be able to do for you; you will need to ensure adequate time is allocated to IT staff within your organisation to carry out these activities. We see a lot of customers underestimate the amount of work involved in testing a DR solution after the initial replication has been completed.
If you put in the initial legwork to get your testing ironed out and your processes documented, subsequent tests will be considerably easier. You can start to stagger your tests, testing individual applications and services at different times, and these tests should get quicker each time. Our most successful customers test their DR on a monthly or quarterly basis and have their testing wrapped up within a few hours.
Another great thing about modern DR applications, particularly Zerto, is that testing can be performed with no impact to production services, spinning up copies of machines in isolated environments. This enables IT to complete testing during normal business hours, without requiring staff to work overtime or the business to stop using services for a weekend.
Testing requires time investment. If you want a reliable DR solution that your IT staff have confidence in, you have to put the work in up front to get your testing ironed out. You won’t regret this.
This is probably one of the most under-considered points and yet one of the most important we see when working with new customers.
Replicating your machines from A to B is one thing, but you need to put proper thought into how your users, whether internal or external, will actually use those systems now that they are in a different location.
If you have a WAN network it might be that you put a circuit into your DR site and rely on your WAN provider to assist with rerouting traffic in a DR event. For larger organisations or those with multiple sites, this is a good solution.
There are other options; maybe you bring up VPNs between sites instead of physical circuits. A key consideration with all connectivity options is IP ranges: is your DR site on the same IP range as your production site, or are you re-IPing? Re-IPing can make things easier from a connectivity standpoint, as you don’t have to worry about clashes across sites; however, it introduces other potential problems with applications that either don’t like having their IP changed or have IPs hard-coded into them.
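To illustrate what re-IPing involves, a tiny sketch using Python’s standard `ipaddress` module. The subnets are hypothetical; many replication products can apply a mapping like this automatically on failover, but IPs hard-coded inside applications still have to be found and fixed by hand:

```python
import ipaddress

def re_ip(addr: str, prod_net: str, dr_net: str) -> str:
    """Translate a production address to the same host offset
    within the equivalent DR subnet."""
    prod_base = ipaddress.ip_network(prod_net).network_address
    dr_base = ipaddress.ip_network(dr_net).network_address
    host_offset = int(ipaddress.ip_address(addr)) - int(prod_base)
    return str(dr_base + host_offset)

# Hypothetical mapping: production 10.0.1.0/24 -> DR site 10.9.1.0/24
print(re_ip("10.0.1.25", "10.0.1.0/24", "10.9.1.0/24"))  # 10.9.1.25
```

Keeping the host portion identical across sites, as above, makes the mapping predictable and easy to document in your business continuity plan.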
Remote VPN or SSL VPN is another connectivity option, and is particularly useful if your users are likely to be remote. If your IT equipment is hosted on the same site as your users, a DR event affecting the whole site may mean that users need to be relocated. In this scenario a circuit-based connectivity method might not be appropriate, whereas a remote VPN solution will allow a user to work from anywhere.
When we start to look at the different connectivity options, it becomes clear that multiple solutions might be necessary. Again, you need to put your design hat on and think about where users could be located in each disaster scenario.
Think about your different disaster scenarios and where users might be located during each of them. Have you identified an appropriate connectivity method for each one?
I’ve mentioned it already in this article, but it is such a key point that I feel it needs its own section. Implementing a DR solution is not a simple job; there are many different things that need considering. An effective DR solution is much more than just a product that can replicate from A to B.
Think about whether you have allocated enough time to cover off all aspects of a proper DR solution. A few things to think about are design, initial testing of each application or service, ongoing testing, connectivity design, business continuity planning and process documentation.
Unfortunately, we often see customers underestimate the amount of time required to get their DR solution to a point where they feel genuinely confident they can rely upon it. Even if you are purchasing a DR service, remember that a whole host of responsibilities will still sit with you as an organisation. You can of course lean on your provider for assistance and advice, but no matter how good your provider is, there will be things they just can’t assist with.
Often the time investment can initially seem like a lot, but it will mean that ongoing management is so much simpler. You’ll have a DR solution and strategy that you can completely rely on. The confidence your IT staff will have in the solution will make any disaster event much more manageable.
Implementing a full DR solution that you can rely on and have confidence in takes time. You have to invest the time upfront and on an ongoing basis in order to achieve this.