Posted by Alan Warren, Engineering Director
(Cross-posted from the Google Docs Blog.)
Not our best week. On Wednesday we had an outage that lasted one hour and meant that document lists, documents, drawings and Apps Scripts were inaccessible for the majority of our users. We use Google Docs ourselves every day, so we feel your pain and are very sorry.
So what happened? The outage was caused by a change designed to improve real-time collaboration within the document list. Unfortunately, this change exposed a memory management bug which was only evident under heavy usage.
Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines, making them run out of memory even faster. Eventually the servers couldn’t properly process a large fraction of the requests to access document lists, documents, drawings, and scripts, which led to the outage you saw on Wednesday.
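To make the "even faster" part concrete, here is a minimal sketch of the arithmetic behind that kind of cascade. It is an illustration only, not our actual lookup service: the function name, the even split of load across machines, and every number (lookups per tick, leak size, free memory) are assumptions chosen for the example.

```python
def ticks_until_out_of_memory(total_lookups_per_tick, live_machines,
                              leak_per_lookup_mb, free_memory_mb):
    """Estimate how long one machine lasts before leaked memory fills it.

    Assumes (hypothetically) that load is split evenly across live machines
    and that every lookup leaks the same fixed amount of memory.
    """
    lookups_per_machine = total_lookups_per_tick / live_machines
    leaked_per_tick_mb = lookups_per_machine * leak_per_lookup_mb
    return free_memory_mb / leaked_per_tick_mb


# Same total load, fewer live machines: each survivor fills up sooner.
# The last row shows the effect of doubling the fleet instead.
for live in (10, 8, 5, 20):
    ticks = ticks_until_out_of_memory(
        total_lookups_per_tick=10_000,  # illustrative value
        live_machines=live,
        leak_per_lookup_mb=0.01,        # illustrative value
        free_memory_mb=1_000,           # illustrative value
    )
    print(f"{live:>2} live machines: out of memory after ~{ticks:.0f} ticks")
```

Under these assumptions, the time a machine survives is proportional to the number of machines sharing the load, so every restart shortens the life of the machines that remain, while adding capacity lengthens it.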
Our automated monitoring noticed that attempts to access documents were failing at an increased rate, and alerted us 60 seconds after the failure rate increased sharply. The engineering teams diagnosed the problem, determined that it was correlated with the feature change, and started rolling it back 23 minutes after the first alert. In parallel, we doubled the capacity of the lookup service to mitigate the impact of the memory management bug. The rollback completed 24 minutes later, and 5 minutes after that the outage was effectively over as the additional capacity restored normal function.
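For readers who want to add up the timeline, the intervals above can be summed as in the short sketch below. It assumes the four intervals run back to back, starting from the sharp increase in the failure rate; the variable names are ours for illustration and the full incident report remains the authoritative timeline.

```python
from datetime import timedelta

# Intervals as stated in the post, assumed here to be contiguous.
timeline = [
    ("failure rate rises sharply -> first alert", timedelta(seconds=60)),
    ("first alert -> rollback started", timedelta(minutes=23)),
    ("rollback started -> rollback completed", timedelta(minutes=24)),
    ("rollback completed -> outage effectively over", timedelta(minutes=5)),
]

elapsed = timedelta()
for step, duration in timeline:
    elapsed += duration
    minutes = int(elapsed.total_seconds() // 60)
    print(f"+{minutes:>2} min  {step}")

total_minutes = elapsed.total_seconds() / 60
print(f"Total: {total_minutes:.0f} minutes")
```

Read this way, the stated intervals add up to a little under the one hour quoted for the outage as a whole.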
Since resolution, we have been assembling and scrutinizing the timeline of this event, and have put together a list of steps that will reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope that any single problem can affect. We intend to take all of these steps; some are not easy, but we're committed to keeping Google's services exceptionally reliable. In the meantime, rest assured that we take every outage very seriously, and as always we'll post a full incident report of what happened to the Apps Dashboard once our investigation is complete. Again, we apologize for the inconvenience and frustration that the outage has caused.