Akamai's Global Content Distribution Network: Replacing Recovery with Resilience

Many customers ask Akamai about Disaster Recovery testing and Business Continuity planning as part of their due diligence or risk management process. They typically expect to see a governance document maintained by a central authority, a list of systems with Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO), and a documented testing plan that is enacted quarterly or annually. Akamai reframes these questions to better match our approach to continuity and recovery, all of which we include under the umbrella of "resilience."

The Deployed Network

Akamai is proud of its resilient and autonomous network design, and of its ability to withstand the threats of malicious actors and natural disasters. We use real-world events, real attacks, and live traffic to understand our network's capacity for withstanding and absorbing threats, and we use capacity modeling and planning to forecast our future needs and performance levels.

Akamai's long-standing 100% uptime Service Level Agreement is a testament to that process. Some of the systems that help support Akamai's network resilience are:

  • Telemetry: Akamai's network operates with a model of self-reporting that aggregates data from individual machines to feed an ongoing snapshot of the network's traffic and health. This data drives dynamic load allocation and, when combined with data Akamai gathers about congestion on routes around the Internet, helps Akamai understand the shape of the Internet. Because of this model, our network can quickly divert traffic around most outages. The telemetry system was designed to sustain the network for a period of days without human intervention.

  • Phased Releases: Akamai's deployed network operates with continuity testing built into its maintenance cycle. All of our maintenance processes are based on dynamically configured changes that roll out region by region across the network, allowing us to install new software or upgrade hardware in phases without any service disruptions.

  • Alert Management: Akamai's Network Operations Command Center (NOCC) requires business owners to configure alerts before new releases, so any alert triggered by the network has an approved process for investigation and remediation that the NOCC can act on as soon as the alert triggers. With this rapid-response, highly coordinated operations model, high-priority alerts are handled across Akamai's global NOCCs in seconds and minutes rather than hours or days.
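
The region-by-region release model described above can be sketched in a few lines. This is an illustration only, not Akamai's actual tooling: the function names (phased_rollout, deploy, is_healthy, rollback), the health-check gate between regions, and the rollback step are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    completed: List[str] = field(default_factory=list)    # regions now on the new release
    rolled_back: List[str] = field(default_factory=list)  # regions reverted after a failed check
    halted_at: Optional[str] = None                       # region that stopped the rollout, if any


def phased_rollout(
    regions: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Apply a change one region at a time, stopping at the first unhealthy region.

    Hypothetical sketch: the health gate and rollback are assumptions,
    not a documented Akamai process.
    """
    result = RolloutResult()
    for region in regions:
        deploy(region)
        if is_healthy(region):
            result.completed.append(region)
            continue
        # The new release misbehaved here: revert this region and stop,
        # so regions later in the list never receive the change.
        rollback(region)
        result.rolled_back.append(region)
        result.halted_at = region
        break
    return result


# Example: three hypothetical regions, where the second fails its health check.
versions = {"region-a": "v1", "region-b": "v1", "region-c": "v1"}

result = phased_rollout(
    regions=list(versions),
    deploy=lambda r: versions.update({r: "v2"}),
    is_healthy=lambda r: r != "region-b",
    rollback=lambda r: versions.update({r: "v1"}),
)
```

In this example the rollout halts at region-b, so region-c keeps serving traffic on the old release; limiting the reach of a bad change in this way is the point of releasing in phases.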

Taken together, these systems are designed to ensure that our global deployed network maintains performance even under extreme circumstances, such as regional natural disasters and cable cuts. In fact, Akamai's resilience model allows for a certain percentage of servers to be down at any given time without affecting our ability to deliver customers' traffic.
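
To make the idea of tolerating down servers concrete, the following is a minimal sketch of load allocation driven by self-reported telemetry. It is a simplified model under stated assumptions, not Akamai's mapping system: the ServerReport fields, the allocate_traffic function, and the rule of splitting demand in proportion to spare capacity are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServerReport:
    """One self-reported telemetry sample (hypothetical fields)."""
    name: str
    healthy: bool
    capacity: float  # maximum requests per second the server can handle
    load: float      # requests per second it is currently serving


def allocate_traffic(reports: List[ServerReport], demand: float) -> Dict[str, float]:
    """Split new demand across healthy servers in proportion to spare capacity.

    Unhealthy servers receive nothing, so traffic is diverted around them.
    """
    spare = {r.name: max(r.capacity - r.load, 0.0) for r in reports if r.healthy}
    total_spare = sum(spare.values())
    if demand > total_spare:
        raise ValueError("demand exceeds the spare capacity of healthy servers")
    if total_spare == 0:
        return {name: 0.0 for name in spare}
    return {name: demand * s / total_spare for name, s in spare.items()}


# Example: server "b" is down, and the remaining servers absorb the demand.
reports = [
    ServerReport("a", healthy=True, capacity=100.0, load=40.0),
    ServerReport("b", healthy=False, capacity=100.0, load=0.0),
    ServerReport("c", healthy=True, capacity=100.0, load=60.0),
]
allocation = allocate_traffic(reports, demand=50.0)
# allocation == {"a": 30.0, "c": 20.0}
```

As long as the healthy servers hold enough spare capacity, losing a fraction of the fleet changes where traffic goes but not whether it is delivered.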

The Enterprise

Many customers care about more than just Akamai's Intelligent Platform when they ask about business continuity planning (BCP) and disaster recovery planning (DRP). Often, they care about our employee pandemic plans, the failover capacity for internal applications, or the provisioning and testing of critical facilities, such as the NOCC locations. Each of these is handled by a separate team within the company, rather than through a single governance program.

  • Corporate Information Technology (IT): For all corporate hardware, applications, and services, an annual charter is created to specify business continuity, crisis management, and disaster recovery goals for the year. Full-time staff are dedicated to this work and manage business impact analyses for these tools, hardware, and systems. In the past few years, IT has instituted a rating system for each scoped system based on the system's failover RTOs and RPOs, number of failover sites, testing schedule, and other factors. This rating system allows the IT and Platform Security teams to rapidly understand how well each system used by Akamai operations can withstand disaster or crisis.

  • Network Operations Command Center (NOCC): The primary NOCC is headquartered in Cambridge, Massachusetts, and operates 24x7x365. Other NOCC locations include Bangalore, India; Krakow, Poland; and Santa Clara, California. The NOCC prepares for the risk of disruptions to critical business operations by performing an annual business impact analysis of adverse situations and regularly testing disaster recovery plans. Depending on the hazard type, the Disaster Recovery Plan includes the following possibilities: shelter in place, move to a cold site (alternate facility), or increase capacity at other NOCCs (allowing the NOCCs to load balance demand globally). These three possibilities are tested regularly and allow the NOCC to respond adaptively to different types of disaster scenarios and continue operating effectively.

  • Global Real Estate & Workplace Productivity (GREWP) & Human Resources (HR): Akamai's employees are only loosely tied to their corporate office spaces; VPN concentrators, distributed teams, and matrixed management allow many employees to work remotely as needed. Akamai maintains an automatic alerting system to inform employees of any threats to their workspace or region, allowing them to shelter in place and work from home in the case of pandemics, weather events, or other crises. Additional policies lay out other continuity-specific goals, such as limiting the number of employees on any single flight.
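
The rating system that Corporate IT applies to each scoped system, described above, could take a form like the sketch below. The scoring rule and every threshold are hypothetical: the factors (failover RTO and RPO, number of failover sites, testing schedule) come from the description above, but the SystemProfile fields, the resilience_rating function, and the one-point-per-criterion scale are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SystemProfile:
    """Continuity attributes for one scoped system (hypothetical fields)."""
    name: str
    rto_hours: float      # failover Recovery Time Objective
    rpo_hours: float      # failover Recovery Point Objective
    failover_sites: int   # number of sites the system can fail over to
    last_tested: date     # date of the most recent failover test


def resilience_rating(
    profile: SystemProfile,
    today: date,
    max_rto_hours: float = 4.0,                     # assumed threshold
    max_rpo_hours: float = 1.0,                     # assumed threshold
    min_failover_sites: int = 2,                    # assumed threshold
    max_test_age: timedelta = timedelta(days=365),  # assumed annual testing
) -> int:
    """Return a 0-4 rating: one point for each continuity criterion met."""
    checks = [
        profile.rto_hours <= max_rto_hours,
        profile.rpo_hours <= max_rpo_hours,
        profile.failover_sites >= min_failover_sites,
        today - profile.last_tested <= max_test_age,
    ]
    return sum(checks)


# Example: a system that meets every criterion except the failover-site count.
payroll = SystemProfile(
    name="payroll",
    rto_hours=2.0,
    rpo_hours=0.5,
    failover_sites=1,
    last_tested=date(2023, 1, 1),
)
rating = resilience_rating(payroll, today=date(2023, 6, 1))
# rating == 3
```

A single comparable number per system is what lets separate teams see at a glance which systems need attention, without requiring a central governance document.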

Putting It All Together

Akamai's Information Security Program requires that Akamai engage in comprehensive business continuity and disaster recovery planning. Akamai meets that need in a decentralized and customizable way, so that its operations and employee planning match the Platform's highly distributed and semi-autonomous nature.

While this is not the most traditional governance model, most customers come to understand that the real-world events Akamai has weathered, both adversarial and natural, adequately demonstrate Akamai's ability to meet its high availability, uninterrupted service, and business continuity goals in the face of all kinds of disasters and crises.

