Facebook’s biggest outage in history was caused by a faulty command, resulting in what the social media giant called “our mistake.”
“We have done a lot of work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but by an error of our own making,” said the post published on Tuesday.
Santosh Janardhan, Facebook’s vice president of engineering and infrastructure, explained in the post why and how the six-hour shutdown happened and the technical, physical and security challenges the company’s engineers faced in restoring services.
The main cause of the failure was a faulty command issued during routine maintenance work, according to Janardhan.
Facebook engineers were forced to physically access the data centers that form the company’s “global backbone” and overcome several hurdles to correct the error the command had caused.
Once these errors were corrected, however, engineers faced yet another challenge: handling the surge in traffic that restoring services would bring.
Mr Janardhan, in the post, explained how the error was triggered “by the system that manages the capacity of our global backbone”.
“The backbone is the network that Facebook has built to connect all of our computing facilities to one another, consisting of tens of thousands of kilometers of fiber-optic cable crossing the globe and linking all of our data centers,” the post said.
Every request from a Facebook user, whether loading a news feed or opening messages, travels over this network, which carries traffic between the company’s smaller facilities and its larger data centers.
To manage these facilities effectively, engineers perform day-to-day infrastructure maintenance, which can include taking part of the backbone offline, adding capacity, or updating software on the routers that carry data traffic.
“That was the source of yesterday’s blackout,” Janardhan said.
“During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers around the world,” he added.
What complicated matters was that the erroneous command slipped through unchecked: a bug in the company’s audit tool prevented it from catching and stopping the command, the post said.
A “complete disconnect” between Facebook’s data centers and the Internet then occurred, something which “caused a second problem that made matters worse.”
With the entirety of Facebook’s backbone out of service, the company’s data center locations marked themselves as “unhealthy.”
“The end result was that our DNS servers became inaccessible even though they were still operational,” the post said.
The Domain Name System (DNS) translates the web addresses users type into the machine-readable Internet Protocol (IP) addresses that computers use to reach one another.
“It made it impossible for the rest of the Internet to find our servers.”
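This failure mode can be illustrated with a short Python sketch (a simplified illustration, not Facebook’s infrastructure): when a hostname no longer resolves, a client never learns an address to connect to, even if the servers behind that name are still up and running.

```python
import socket

def resolve(hostname: str):
    """Translate a hostname into an IP address, as DNS does.

    If resolution fails, the caller never obtains an address to
    connect to -- even if the target servers are still operational,
    which is exactly the situation Facebook described.
    """
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# A resolvable name yields an IP address string.
print(resolve("localhost"))

# The ".invalid" top-level domain is reserved and never resolves,
# so no connection to it can even be attempted.
print(resolve("no-such-host.invalid"))  # → None
```

During the outage, Facebook’s domain names were in effect in the second state for the entire Internet: the servers were running, but nothing could find them.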
Mr Janardhan said this posed two challenges. The first was that Facebook engineers could not reach the data centers through normal means because of the network disruption.
The second was that the internal tools the company normally uses to investigate such problems were themselves broken.
Engineers were forced to go on-site to these data centers, where they had to “debug the problem and restart the systems.”
However, this did not turn out to be an easy task, as Facebook’s data centers sit behind significant physical and security protections designed to make them “hard to get into.”
Mr Janardhan outlined how the company’s routers and hardware are designed to be difficult to modify even with physical access.
“So it took extra time to activate the secure access protocols needed to get people on site and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” he said.
Engineers then faced one final hurdle: they could not simply restore access for all users worldwide at once, as the resulting surge in traffic could cause further crashes. Reversing the sharp dip in data center power usage could also put “everything from electrical systems to caches” at risk.
“Storm drills” the company had previously conducted meant engineers knew how to bring systems back online slowly and safely, the post said.
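The gradual restoration described above can be sketched as a staged ramp-up: traffic is admitted in small increments, with a health check at each step before more is allowed. The percentages and the health-check callback below are illustrative assumptions, not Facebook’s actual procedure.

```python
def staged_restore(ramp_steps, is_healthy):
    """Restore traffic in increments, advancing only while the
    system reports healthy. Returns the fraction of traffic
    ultimately restored.

    ramp_steps: increasing fractions of traffic to admit,
                e.g. [0.1, 0.25, 0.5, 1.0] -- illustrative values.
    is_healthy: callable taking the candidate fraction and returning
                True if systems (power, caches, ...) remain stable.
    """
    restored = 0.0
    for fraction in ramp_steps:
        if not is_healthy(fraction):
            # Hold at the last safe level rather than risk a new
            # round of crashes from a sudden traffic surge.
            break
        restored = fraction
    return restored

# Everything stays healthy: traffic is fully restored.
print(staged_restore([0.1, 0.25, 0.5, 1.0], lambda f: True))  # → 1.0

# Systems degrade above half load: restoration holds at 50%.
print(staged_restore([0.1, 0.25, 0.5, 1.0], lambda f: f <= 0.5))  # → 0.5
```

The design choice here mirrors the trade-off Janardhan describes: a slower, checkpointed recovery in exchange for not re-triggering the failure.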
“I believe a trade-off like this is worth it – greatly increased day-to-day security versus a slower recovery from a hopefully rare event like this,” Janardhan concluded.
The Facebook outage – which affected all of its services, including WhatsApp and Instagram – wiped around $7 billion from the personal fortune of chief executive Mark Zuckerberg as the company’s share price plummeted. Mr Zuckerberg apologized to users for any inconvenience caused by the disruption in service.