Cloud file storage service Dropbox has explained how it took its key data center completely offline to test its disaster readiness capabilities.
As Dropbox explains, after migrating its computing infrastructure from Amazon Web Services in 2015 and then launching its Magic Pocket file content storage system, the company became “highly centralized” at its San Jose data center (SJC) — located not far from the San Andreas Fault.
Given the criticality of the San Jose data center, Dropbox wanted to know what would happen to global availability if that region or “metro” went down. So in November 2021, the company tested its resilience by physically unplugging the fiber network to its SJC data centers.
“In a world where natural disasters are more and more prevalent, it’s important that we consider the potential impact of such events on our data centers,” the team who ran the project explained in a detailed blog post.
The company stores file content and metadata about files and users. Magic Pocket splits file content into blocks and replicates them across its infrastructure in different regions. The system is designed to serve block data independently from different data centers concurrently in the event that a data center goes down, making it a so-called ‘active-active’ system.
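The active-active idea can be illustrated with a minimal sketch. This is not Dropbox's actual Magic Pocket code; the block size, class names, and replication strategy here are all assumptions chosen for illustration. The point is simply that once every block is replicated to multiple regions, reads can be served from any region that is still up:

```python
# Illustrative sketch only -- not Dropbox's Magic Pocket implementation.
# Idea: split file content into fixed-size blocks, replicate each block
# to several regions, and serve reads from any live region.

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB block size

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split raw file content into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class BlockStore:
    """Toy active-active store: every block lives in multiple regions."""

    def __init__(self, regions: list[str]):
        self.regions = {r: {} for r in regions}   # region -> {block_id: block}
        self.down: set[str] = set()               # regions currently offline

    def put(self, block_id: str, block: bytes) -> None:
        for region in self.regions:               # replicate to every region
            self.regions[region][block_id] = block

    def get(self, block_id: str) -> bytes:
        # Any live region can serve the read -- losing one region is fine.
        for region, blocks in self.regions.items():
            if region not in self.down and block_id in blocks:
                return blocks[block_id]
        raise LookupError(f"no live replica for block {block_id}")

store = BlockStore(["SJC", "DFW"])
for i, block in enumerate(split_into_blocks(b"x" * (9 * 1024 * 1024))):
    store.put(f"blk-{i}", block)

store.down.add("SJC")      # simulate blacking out a metro
assert store.get("blk-0")  # reads still succeed from DFW
```

Because no single region is special for reads or writes, taking one offline leaves the others serving traffic, which is what made unplugging a metro survivable for block data.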
Dropbox wanted the same active-active architecture for its metadata stack. At the time, however, its main MySQL metadata database lived in SJC, and the company had never properly tested its failover, or active-passive, capability: whether the database in SJC would fail over to a replicated MySQL database at its passive data center in Idaho. A failover test in 2015 succeeded, but engineers realized that an active-active architecture for metadata would be harder to build than it had been for block storage.
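The contrast with block storage can be sketched in a few lines. The following toy model is an assumption-laden illustration of active-passive failover in general, not Dropbox's tooling or its MySQL setup: all writes go to a single primary, a replica trails behind, and failing over means promoting the replica, with the switch time feeding into the Recovery Time Objective:

```python
# Illustrative sketch only -- a toy model of active-passive failover,
# not Dropbox's tooling. A dict stands in for MySQL replication.
import time

class Database:
    """Toy stand-in for a MySQL instance in one metro."""
    def __init__(self, region: str):
        self.region = region
        self.healthy = True
        self.rows: dict[str, str] = {}

class MetadataCluster:
    """Active-passive: all writes hit one primary; a replica stands by."""
    def __init__(self, primary: Database, replica: Database):
        self.primary = primary
        self.replica = replica

    def write(self, key: str, value: str) -> None:
        if not self.primary.healthy:
            raise ConnectionError(f"primary in {self.primary.region} is down")
        self.primary.rows[key] = value
        self.replica.rows[key] = value   # stand-in for async replication

    def fail_over(self) -> float:
        """Promote the replica; return the switch time (part of the RTO)."""
        start = time.monotonic()
        self.primary, self.replica = self.replica, self.primary
        return time.monotonic() - start

sjc, idaho = Database("SJC"), Database("Idaho")
cluster = MetadataCluster(primary=sjc, replica=idaho)
cluster.write("file:42", "owner=alice")

sjc.healthy = False   # simulate losing the SJC metro
cluster.fail_over()   # promote Idaho; writes succeed again
cluster.write("file:43", "owner=bob")
assert cluster.primary.region == "Idaho"
```

Unlike the active-active block store, writes here stall until the promotion completes, which is why the time that failover takes, and the tooling that performs it, matters so much in this design.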
The company’s engineers settled on active-passive for metadata and in 2019 began running many failover tests.
But then in May 2020, a “critical failure” in Dropbox’s failover tooling “caused a major outage, costing us 47 minutes of downtime.” The company kicked off an emergency audit of its failover tooling and processes, and created a dedicated seven-person disaster recovery team whose goal was to slash the Recovery Time Objective (RTO) by the end of 2021.
“We realized the best way to ensure we did not have any dependency on the active metro was to perform a disaster recovery test where we physically unplugged SJC from the rest of the Dropbox network,” the company explains.
“If unplugging SJC proved to have minimal impact on our operations, this would prove that in the event of a disaster affecting SJC, Dropbox could be operating normally within a matter of hours. We called this project the SJC blackhole.”
After ensuring that critical services running in SJC were multi-homed, meaning they could run from a metro other than SJC, the team decided how to simulate the complete loss of the metro.
Initially, Dropbox planned to isolate SJC from the network by draining the metro’s network routers.
“While this would have gotten the job done, we ultimately landed on a physical approach that we felt better simulated a true disaster scenario: unplugging the network fiber!”
It carried out two test runs in its Dallas Fort Worth (DFW) metro, which has two data centers (DFW4 and DFW5). The first test, which unplugged DFW4, was deemed a failure: it impacted global availability and was ended early. Dropbox had incorrectly assumed that DFW4 and DFW5 were roughly equivalent and hadn’t accounted for cross-facility dependencies.
A few weeks later, engineers ran a new test that would blackhole the entire DFW metro. Engineers at each of the two facilities unplugged the fiber on command.
Dropbox observed no impact to availability, maintained the blackhole for the full 30 minutes, and deemed the test a success.
At 5pm PT on Thursday, November 18, 2021, Dropbox finally ran the major test at SJC, where engineers unplugged each of the metro’s three data centers one by one. Dropbox passed the 30-minute blackhole threshold without observing any impact to global availability, although some internal services were affected.
“Yeah, we know, this probably sounds a bit anti-climactic. But that’s exactly the point! Our detail-oriented approach to preparing for this event is why the big day went so smoothly,” the company explained.
Dropbox’s metadata stack is still not active-active, but the company is now confident it could survive a major outage in the SJC metro. The exercise, it says, proved that it “now had the people and processes in place to offer a significantly reduced RTO – and that Dropbox could run indefinitely from another region without issue. And most importantly, our blackhole exercise proved that, without SJC, Dropbox could still survive.”