Inside a Data Center Outage: Lessons About Resilience

April 14, 2022 TH Author

Volatile global dynamics like the Russia-Ukraine war, climate change driving an unprecedented surge in natural hazards, sharp peaks and rapid declines of a pandemic, out-of-control organized cybercrime syndicates … 2022 seems like a practical demonstration of Murphy’s Law for enterprise networks. In the face of looming uncertainty and nation-state attacks that may result in collateral damage to facilities, remember: No matter how advanced your network infrastructure is, its resilience determines its true worth.

The strength of any network can be measured by its points of presence (PoPs), where two or more networks or communication devices share a connection. (Editor’s note: The author’s company is one of many cloud network providers that maintain PoPs around the world.) PoPs connect people to different networks, as well as to the greater Internet. They usually contain multiple servers and routers, and Internet service providers typically have multiple PoPs distributed over a geographic area. These physical locations allow people to be interconnected to others around the world. Having PoPs near your offices and remote and mobile users, wherever they may be, is critical for receiving services.

But just counting PoPs doesn’t tell you whether the network is robust. What happens if a PoP fails? And PoPs do fail, even with good design and data center selection. A case in point is the recent Interxion outage.

Interxion’s Unexpected Power Outage
Earlier this year, Interxion, a leading provider of data center services across EMEA, faced an hours-long power outage at its central London campus. Because the campus sits at the center of the European trading landscape, with billions in trade at stake, the facility needed built-in redundancy and backup plans.

Unfortunately, even the equipment designed to switch power to an on-site backup generator failed, alongside the multiple power feeds going into the building.

My company’s own PoP in the Interxion data center was unavailable for several hours. Thankfully, our customers were automatically redirected to another nearby Cato PoP for operational continuity. But the incident caused service outages for numerous customers whose networks depended solely on this data center.

Alternate Scenario: How It Should Have Been
The hours-long outage could have been averted if customers had had access to a global private backbone of PoPs and intelligent traffic routing. The traffic would have been automatically redirected to another nearby PoP, and customers would have barely noticed the few seconds of outage. This alternate scenario sums up how a consolidated, cloud-native secure access service edge (SASE) service maintains resiliency and business continuity, even in an event as unlikely as this one.

Achieving Multidimensional Network Resiliency
In a situation like Interxion’s, the strategic distribution of PoPs matters more than sheer numbers. Beyond placement, the PoPs also need surplus capacity to handle the emergency traffic. Taking into account all such aspects, modern enterprise networks need to achieve multidimensional resiliency.
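To make the point about placement and surplus capacity concrete, here is a minimal sketch of how a failover target might be chosen: the nearest PoP that is both healthy and has enough spare capacity for the redirected traffic. This is an illustration only, not a description of any provider’s actual routing logic. The PoP class, the pick_failover function, and all capacity and load figures are hypothetical, and straight-line distance stands in for the latency measurements a real backbone would use.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Optional


@dataclass(frozen=True)
class PoP:
    """Hypothetical PoP record; all fields are illustrative assumptions."""
    name: str
    lat: float
    lon: float
    healthy: bool
    capacity_gbps: float
    load_gbps: float

    @property
    def spare_gbps(self) -> float:
        # Surplus capacity available for emergency traffic.
        return self.capacity_gbps - self.load_gbps


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def pick_failover(pops: Iterable[PoP], site_lat: float, site_lon: float,
                  demand_gbps: float) -> Optional[PoP]:
    """Return the nearest healthy PoP that can absorb demand_gbps, or None."""
    candidates = [p for p in pops
                  if p.healthy and p.spare_gbps >= demand_gbps]
    if not candidates:
        return None
    return min(candidates,
               key=lambda p: distance_km(site_lat, site_lon, p.lat, p.lon))


# Made-up example: the London PoP is down, two others are up.
EXAMPLE_POPS = [
    PoP("London", 51.5074, -0.1278, healthy=False, capacity_gbps=20, load_gbps=0),
    PoP("Manchester", 53.4808, -2.2426, healthy=True, capacity_gbps=10, load_gbps=6),
    PoP("Amsterdam", 52.3676, 4.9041, healthy=True, capacity_gbps=20, load_gbps=8),
]

if __name__ == "__main__":
    choice = pick_failover(EXAMPLE_POPS, 51.5074, -0.1278, demand_gbps=2)
    print(choice.name if choice else "no PoP can absorb the traffic")
```

With these invented numbers, a London office needing 2 Gbps lands on Manchester, the closest healthy PoP. Raise the demand above Manchester’s spare 4 Gbps and the choice shifts to Amsterdam, which is why regional density and headroom both matter in a failover plan.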

Such incidents have taught us what multidimensional resiliency really looks like:

• Availability: A globally distributed backbone of PoPs is essential for connectivity and service delivery. However, it’s the density of PoPs in a region that matters in a failover plan, not the number of countries covered. If a PoP fails, you’ll need another PoP nearby to reroute the traffic through.

• Security: This must always remain a priority, even during a crisis. That’s why the full security stack must be built into the underlying network infrastructure. When all PoPs have built-in encryption and security, all endpoints (on-site, remote, and mobile users and devices) stay within the security perimeter even when redirected through backup PoPs.

• Fault tolerance: Modern enterprise networks need self-healing capabilities to handle disruptions, using different techniques like switching over to another PoP. Every second of downtime counts, and only intelligent, automated traffic routing can ensure rapid detection and quick transition.

• Adaptability: It’s impossible to anticipate and prepare for all possible disruptions. Enterprise networks need the flexibility and agility to adapt to unexpected stressors. For instance, some of our customers couldn’t benefit from our quick transition away from the London PoP because their firewalls routed traffic over main and failover links that both terminated at the London site. When the London PoP failed, we had to quickly adapt and configure a new link going to a PoP in Manchester.

• Dependability: IT and service providers need to have full insight into the network to see how a failure impacts traffic across the network. In this situation, it would have been impossible to see exactly which customers had their failover link routed to the London PoP without complete visibility into all customer traffic. There’s no way to ensure network dependability without having a complete view of the network at all times.
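The Adaptability and Dependability points describe a configuration risk that is easy to check for once link data is visible: a customer whose main and failover links both terminate at the same PoP has no real failover. The sketch below shows one way such an audit might look. The CustomerLinks record and its field names are hypothetical, as are the sample customers; a real inventory would come from the provider’s own management data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CustomerLinks:
    """Hypothetical record of where a customer's two links terminate."""
    customer: str
    primary_pop: str
    failover_pop: str


def single_pop_customers(links: Iterable[CustomerLinks], pop: str) -> List[str]:
    """Customers whose primary AND failover links both end at the given PoP."""
    return sorted(
        entry.customer
        for entry in links
        if entry.primary_pop == pop and entry.failover_pop == pop
    )


# Invented sample inventory for illustration.
SAMPLE_LINKS = [
    CustomerLinks("acme", primary_pop="London", failover_pop="London"),
    CustomerLinks("globex", primary_pop="London", failover_pop="Manchester"),
    CustomerLinks("initech", primary_pop="Manchester", failover_pop="London"),
]

if __name__ == "__main__":
    for name in single_pop_customers(SAMPLE_LINKS, "London"):
        print(f"{name}: both links depend on London")
```

Running a check like this before an outage, rather than during one, turns the visibility requirement into something actionable: the customers it flags are the ones who would need a new link configured in a hurry.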

Final Thoughts
The true strength of a network lies in its resilience in the face of unanticipated, rare-case scenarios. Redundancy by itself is simply not enough, since it can fail, as it did during the Interxion outage. Multidimensional resilience ensures the network stays operational even if multiple backup plans fail.

Although a global, cloud-based SASE architecture with a unified management console checks all the resiliency boxes, you need ongoing investments to determine how things can go wrong and build the capabilities to continue operations and services during crisis situations.
