Rewriting your disaster recovery plan might just save your company…and could transform it

December 1, 2021 TH Author

Paid Feature Disaster recovery (DR) used to be thought of as a form of corporate hygiene, but it’s becoming increasingly clear it has to be considered a matter of corporate survival.

Downtime cost companies $84,650 per hour in 2020, on average, research by Veeam shows. If we’re talking unplanned downtime, due to ransomware for example, you can throw in the costs of corroded customer confidence, broader reputational damage, remedial work, and unbearable stress for ops teams.
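To put that hourly figure in context, here is a minimal sketch of the arithmetic. It assumes the Veeam average applies evenly across the outage and counts only direct cost; the function name and the outage lengths are hypothetical, chosen purely for illustration.

```python
# Average hourly downtime cost cited in the article (Veeam, 2020).
AVERAGE_COST_PER_HOUR_USD = 84_650


def direct_downtime_cost(outage_hours: float,
                         cost_per_hour: float = AVERAGE_COST_PER_HOUR_USD) -> float:
    """Return the direct cost of an outage in US dollars.

    Excludes reputational damage, remedial work and other indirect costs.
    """
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    return outage_hours * cost_per_hour


# Hypothetical outage lengths, for illustration only.
for hours in (1, 8, 24):
    print(f"{hours:>2} h outage: ${direct_downtime_cost(hours):,.0f}")
```

Even on these simple assumptions, a single working day offline runs well into six figures before any of the indirect costs are counted.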

How bad can it be? Just ask Colonial Pipeline, which paid a ransomware gang $4.4m when its systems were locked in May – and then shouldered the blame for spiraling fuel prices and queues at petrol pumps. It doesn’t help that, depending on your point of view, we are all still in the midst of a major disaster that has been running for almost two years.

As Robin Gardner, Sales and Strategic Services Director of Xtravirt, the UK cloud consulting and managed services firm, explains, “For many firms remote access and working from home used to be the DR solution for an office incident. But since March 2020, home working IS business as usual for many businesses.”

At the same time, companies are restructuring their bricks and mortar estates, meaning those facilities can’t actually support the entire workforce. “And that now means, as an example, your digital workspace solution needs DR, because it’s now a business-critical service.”

And this ties into broader questions about how other on-prem assets are being transformed, and the knock-on effect for the VMware infrastructure on which the vast majority of enterprises depend – whether on-prem or in the cloud.

“It’s very easy to close down a facility with no data center in there,” says Gardner. “If you want to close a building that’s hosting the DR data center or the secondary data center, that becomes a challenge.”

Looked at this way, he says, disaster recovery itself can be an enabler for new working practices, even digital transformation. “Changing your DR strategy may enable a greater flexibility within your business or your organisation.” And, he points out, if you have been following an on-prem recovery strategy, what he euphemistically describes as “the current lead time to silicon and hardware” might also make the cloud seem more relevant.

This is made all the more urgent by two other factors. One is the growing threat of ransomware. The UK government’s own Cyber Security Breaches Survey 2021 showed that among organisations that had experienced breaches or attacks in the previous 12 months, 7 per cent had identified ransomware as the culprit.

However, the research also found that virus attacks, including ransomware, were considered among the most disruptive. IDC research released in August showed that 37 per cent of companies worldwide were hit by ransomware in the previous 12 months. The average ransom paid was $250,000, though some were more than $1m.

This has been exacerbated by the pandemic and the rush to home working. Systems were lashed together during the hurried exit from the office in early 2020, and managers are acutely aware that these are now a ticking time bomb, particularly where users are relying on home networks and VPNs for access.

Which disaster are we talking about?

“Ransomware means that the likelihood of the need to use your DR environment has gone up,” says Gardner. Yet well over two thirds of IT decision makers are not confident they could recover from a cyberattack, research by Gartner shows.

These triggers – particularly ransomware – are far less abstract than the scenarios that might have informed DR planning in the past. As Gardner puts it, “It’s no longer all about the risk of an airplane landing on a data center.”

And this has contributed to DR and business continuity moving up the agenda for compliance and audit teams. “As a result of the pandemic,” says Gardner, “Xtravirt has seen that organisations are not just more willing to discuss business continuity and their recovery strategy, but are proactively seeking our support and guidance in this area. Indicators are that attitudes are shifting fast, with DR moving firmly from hygiene activity, to business necessity.”

So what does a “disaster” actually look like today? And how can you plan for it?

“The traditional assumption was that everything in your monitoring systems goes red,” says Gardner. “Which was either a network failure, a data center failure, or a fire.”

At that point, the objective was to return to the steady state the organisation was in minutes before the dials lit up. But the rise of ransomware and the borderless nature of modern networks mean the signs of impending or actual doom may be much more subtle today.

As Gardner explains, the trigger may be an information security event that prompts an organisation to “lockdown and recover elsewhere” – if you were relying on active-active replication to a secondary data center, then ransomware can just follow the replication.

But, he continues, “The fact that you can no longer trust your workloads, or you don’t know where your trust point is means your recovery point objective is no longer necessarily the five minutes before the failure event occurred.” And of course, the first target of many ransomware or cyber attacks is going to be backups or DR systems, precisely because this will force the victim to pay up.
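Gardner's point about the recovery point can be sketched as a selection rule: rather than taking the newest snapshot before the outage, take the newest one captured before the earliest suspected compromise. The sketch below assumes you already have a list of recovery points and an estimate of when trust was lost; the `RecoveryPoint` type, the function name and the timestamps are all hypothetical and do not correspond to any vendor API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryPoint:
    """A hypothetical snapshot or backup that workloads can be restored from."""
    taken_at: datetime
    label: str


def latest_trusted_point(points: Sequence[RecoveryPoint],
                         trusted_before: datetime) -> Optional[RecoveryPoint]:
    """Return the newest recovery point taken strictly before `trusted_before`.

    `trusted_before` is the earliest time the environment is suspected to have
    been compromised, which may be long before the visible failure.
    """
    candidates = [p for p in points if p.taken_at < trusted_before]
    return max(candidates, key=lambda p: p.taken_at, default=None)


# Illustrative timeline: the outage is noticed at 09:00 on 3 March, but the
# intrusion is traced back to the afternoon of 1 March.
points = [
    RecoveryPoint(datetime(2021, 3, 1, 2, 0), "nightly-01"),
    RecoveryPoint(datetime(2021, 3, 2, 2, 0), "nightly-02"),
    RecoveryPoint(datetime(2021, 3, 3, 8, 55), "replica-0855"),
]
suspected_compromise = datetime(2021, 3, 1, 15, 0)

chosen = latest_trusted_point(points, suspected_compromise)
print(chosen.label if chosen else "no trusted recovery point available")
```

In this example the snapshot taken five minutes before the failure is rejected, because it postdates the suspected compromise. The function returns `None` when no snapshot predates the compromise, which is the case an attack on the backups themselves is designed to create.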

At the same time, the need for organisations to be always online means that traditional recovery strategies no longer suffice. Retrieving tapes from a remote facility is not a viable option when being offline can cost you hundreds of thousands or even millions of dollars per hour in lost business. Beyond the revenue hit, Veeam’s research showed that organisations are acutely aware of the effect of extended downtime on both customer and employee confidence. Other costs are likely to come in the form of stock price hits, potential legal action, and the loss of certifications and accreditations.

But if the nature of disasters has changed, the options for preparing for and recovering from them have changed drastically too.

When it comes to the VMware stack, which remains the underlying basis for the majority of enterprise workloads, Xtravirt’s Head of Technical Pre-Sales Andy Hine says there are multiple options.

What exactly is the recovery position?

Traditional full replication to a second data center might not be affordable due to the capex involved in creating that second facility. “VMware Site Recovery has been around a little while but has been adapted to be cloud compatible, so replicating your primary data center to a secondary data center that might be in the cloud, something like VMware Cloud on AWS, for example,” says Hine.

Alternatively, “what VMware Cloud Disaster Recovery (VCDR) gives you is the ability to replicate your critical workloads and data almost in a pre-staging way, in VMware Cloud Storage. It’s more of an on demand offering. So at the moment that the disaster occurs, the second infrastructure is spun up.”

This has the advantage that fees kick in only when the secondary site is spun up or you’re running a failover test. The latter is particularly important for providing a sandbox in which to check not only that the DR plan ticks all the right boxes, but that it actually works and that workloads will continue in the event of a real disaster.

VMware Cloud Disaster Recovery also offers new ways to look at disaster recovery, he explains, with the prospect of instant recovery.

“With VCDR, there’s the concept of a pilot light environment, which is a minimum viable environment that can take on your critical workloads instantly, but then also allow you to scale out from that point during your disaster to the point that it’s resolved.”

If you’re already using the cloud, you might consider what options your cloud provider offers – or doesn’t. Many organisations might mistakenly think their SLA (service level agreement) ensures they will be able to recover their data in the event of a problem. But this may well not be the case.

Where the provider does offer a recovery option, this may rely on its own tooling, which involves converting your VM workloads into provider-native workloads. But, Gardner explains, recovering natively may mean “you lose the VMware integrity of your original workloads, which increases the complexity of your roll back.”

“That’s one of the advantages of a VMware Cloud solution – your ability to move backwards and forwards between recovery points is much more flexible. And then once you’ve chosen your recovery point, you can either execute from storage, so that the virtual machine can either run instantly from the backup storage, or you can restore it to full production.”

There’s no one-size-fits-all option for something so fundamental to a company’s survival. So a large part of Xtravirt’s role is talking through the customer’s setup, according to Hine: “Helping customers understand what their objectives are, and if they’ve got a good understanding of what their application landscape looks like.”

As Hine explains, the high value or most critical workloads might be classed as the “gold tier” in the disaster recovery plan and merit the kind of instant recovery services offered by VCDR.

Other elements might be considered less critical, for example test dev, he says. “And then the typical kind of backup and recovery routes are still applicable, and still viable.”
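The tiering Hine describes can be sketched as a simple mapping from workload to recovery method. Only the “gold tier” label comes from the article; the “standard” tier name, the workload names and the method descriptions below are hypothetical, and a real plan would come out of the application-landscape review described above.

```python
from enum import Enum
from typing import Dict


class Tier(Enum):
    GOLD = "gold"          # most critical workloads, per the article
    STANDARD = "standard"  # hypothetical name for less critical workloads


# Assumed mapping from tier to recovery approach, for illustration only.
RECOVERY_METHOD: Dict[Tier, str] = {
    Tier.GOLD: "on-demand instant recovery",
    Tier.STANDARD: "conventional backup and restore",
}

# Hypothetical workload inventory.
workload_tiers: Dict[str, Tier] = {
    "order-processing": Tier.GOLD,
    "digital-workspace": Tier.GOLD,
    "test-dev": Tier.STANDARD,
}


def recovery_plan(tiers: Dict[str, Tier]) -> Dict[str, str]:
    """Return the recovery method assigned to each workload by its tier."""
    return {name: RECOVERY_METHOD[tier] for name, tier in tiers.items()}


for name, method in recovery_plan(workload_tiers).items():
    print(f"{name}: {method}")
```

Keeping the tier-to-method mapping separate from the workload inventory means a workload can be promoted or demoted without touching the recovery definitions.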

Ultimately though, designing a disaster plan and putting it in place is simply the first step. As Hine explains, many organisations have had excellent plans in place, but when it came to the crunch they found the plans were out of date, had never been tested, or simply didn’t work.

This could be down to something as simple as “what they previously assumed was an orchestrated failover actually has a manual element,” he says.

“It’s not enough to just have a DR plan,” says Hine. “You need to test it and evidence it.” And that needs to be done on an ongoing basis. After all, as the pandemic has proved, some disasters can rumble on for years.

Sponsored by Xtravirt.