On the drive to work this morning, I listened to a report about this being the 10th anniversary of the massive blackout that plunged an area from New York City to Toronto into darkness. I immediately thought of a post Akamai CSO Andy Ellis wrote recently called "Environmental Controls at Planetary Scale."
It might be overreaching to say the 2003 blackout was an early case study in the successes and failures of controls at planetary scale. Andy was talking about the environmental controls in data centers around the world. The blackout wasn't something individual data centers had much control over, and the power failure was geographically limited to a section of the U.S. and Canada. The blackout's root cause was a software glitch in an alarm system inside one of FirstEnergy Corp.'s control rooms in Ohio. Workers apparently didn't realize they needed to redistribute power after overburdened transmission lines sagged into overgrown trees. A manageable local blackout thus snowballed into widespread electric grid failure.
Still, I can't help but think of the parallels. Andy's blog post examined the pros and cons of investing large sums of money in data center environmental controls. He wrote:
Is the cost worth the hassle? If you run one data center, then the costs might be worthwhile - after all, it's only a few capital systems, and a few basis point improvements in MTBCF will likely be worth that hassle (both in operational false positives as well as deployment cost). But what if you operate in thousands of data centers, most of them someone else's? The cost multiplies significantly, but the marginal benefit significantly decreases - as any given data center improvement only affects such a small portion of your systems. Each data center in a planetary scale environment is now as critical to availability as a power strip is to a single data center location. Mustering an argument to monitor every power strip would be challenging; a better approach is to have a drawer full of power strips, and replace ones that fail.
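To put rough numbers on Andy's argument, here is a minimal sketch of my own, not anything from his post. Every figure in it is made up for illustration, and `controls_tradeoff` is a hypothetical helper. It assumes the same total workload is spread evenly across sites, so a failure at one site touches only 1/N of the systems.

```python
def controls_tradeoff(num_sites, cost_per_site, fleet_outage_loss,
                      failure_prob, risk_reduction):
    """Rough yearly cost/benefit of adding environmental controls at every site.

    All inputs are hypothetical. Assumes the workload is spread evenly,
    so one site failing affects 1/num_sites of the fleet.
    """
    # Cost scales linearly with the number of sites.
    total_cost = num_sites * cost_per_site
    # Expected loss avoided at one site: fewer failures, each hitting a smaller share.
    benefit_per_site = (risk_reduction * failure_prob * fleet_outage_loss
                        / num_sites)
    total_benefit = benefit_per_site * num_sites
    return {
        "total_cost": total_cost,
        "benefit_per_site": benefit_per_site,
        "benefit_to_cost": total_benefit / total_cost,
    }


if __name__ == "__main__":
    # Illustrative values only: $100k of controls per site, $10M loss if
    # everything is down, 5% yearly failure chance, controls halve that chance.
    for sites in (1, 10, 1000):
        print(sites, controls_tradeoff(sites, 100_000, 10_000_000, 0.05, 0.5))
```

Under these assumptions the total bill grows with every site added, while the benefit any one site's controls can deliver shrinks, which is the drawer-full-of-power-strips point in arithmetic form.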
I see lessons here in how we manage interconnected electrical systems where a failure in one place can spill over to many other places the world over. Security experts have said and written much in recent years about the threat to global power grids. Among other things, they've warned that a hacker who compromises SCADA controls in one power station could maximize the damage if that station is the weak link in a much bigger chain of power distribution centers.
The ways in which we manage the threat carry pros and cons similar to those of the environmental control management Andy wrote about.
On this particular anniversary, I throw it out there as food for thought.