What Happened to Google Docs on Wednesday

Friday, September 9, 2011

Posted by Alan Warren, Engineering Director

(Cross-posted from the Google Docs Blog.)

Not our best week. On Wednesday we had an outage that lasted one hour and meant that document lists, documents, drawings and Apps Scripts were inaccessible for the majority of our users. We use Google Docs ourselves every day, so we feel your pain and are very sorry.

So what happened? The outage was caused by a change designed to improve real-time collaboration within the document list. Unfortunately, this change exposed a memory management bug that was only evident under heavy usage.

Every time a Google Doc is modified, a machine looks up the servers that need to be updated. Due to the memory management bug, the lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart. While they restarted, their load was picked up by the remaining lookup machines, making them run out of memory even faster. This meant that eventually the servers couldn’t properly process a large fraction of the requests to access document lists, documents, drawings, and scripts, which led to the outage you saw on Wednesday.
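To make the cascade concrete, here is a minimal sketch of the dynamic described above. It is a toy model, not the actual lookup service: the function name `cascade`, the machine count, the memory limit, the load, and the leak size are all made-up assumptions. It also assumes that a restarting machine does not come back before the remaining machines fail.

```python
def cascade(initial_used, memory_limit, total_load, leak_per_lookup):
    """Return (time, machine) pairs for each machine running out of memory.

    Toy model with hypothetical numbers: every lookup leaks a fixed amount
    of memory, total load is constant, and the load of a restarting machine
    is shared equally by the machines that are still up.
    """
    used = list(initial_used)
    live = set(range(len(used)))
    now = 0.0
    events = []
    while live:
        # Each live machine leaks at a rate proportional to its share of load.
        rate = total_load * leak_per_lookup / len(live)
        # The machine with the most leaked memory hits the limit first.
        first = max(live, key=lambda m: used[m])
        elapsed = (memory_limit - used[first]) / rate
        now += elapsed
        for m in live:
            used[m] += rate * elapsed
        live.remove(first)  # restarting; its load moves to the survivors
        events.append((round(now, 2), first))
    return events


if __name__ == "__main__":
    # Four machines that have already leaked different amounts of memory.
    for when, machine in cascade([600, 400, 200, 0], 1200, 400, 1):
        print(f"t={when}: machine {machine} runs out of memory")
```

With these made-up numbers the failures land at t=6, 7.5, 8.5 and 9, so the gaps shrink from 6 ticks to 0.5. Without the extra load from its neighbors, the last machine would have lasted until t=12.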

Our automated monitoring noticed that attempts to access documents were failing at an increased rate and alerted us 60 seconds after the failure rate rose sharply. The engineering teams diagnosed the problem, determined that it was correlated with the feature change, and started rolling it back 23 minutes after the first alert. In parallel, we doubled the capacity of the lookup service to mitigate the impact of the memory management bug. The rollback completed 24 minutes later, and 5 minutes after that the outage was effectively over as the additional capacity restored normal function.
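As a quick check, the intervals quoted above can be added up into a single timeline. This sketch uses only the durations stated in this post; the event labels and the `build_timeline` helper are illustrative, and minute 0 is assumed to be the moment the failure rate rose sharply.

```python
# Intervals in minutes, each measured from the previous event (from the post).
INTERVALS = [
    ("failure rate rises sharply", 0),
    ("automated alert fires", 1),
    ("rollback starts", 23),
    ("rollback completes", 24),
    ("outage effectively over", 5),
]


def build_timeline(intervals):
    """Turn per-step intervals into cumulative minutes since the first event."""
    timeline = []
    elapsed = 0
    for label, minutes in intervals:
        elapsed += minutes
        timeline.append((label, elapsed))
    return timeline


if __name__ == "__main__":
    for label, minute in build_timeline(INTERVALS):
        print(f"+{minute:>2} min  {label}")
```

The steps add up to about 53 minutes from the sharp rise in failures to recovery, which is in line with the roughly one-hour outage described at the top of this post.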

Since resolution, we have been assembling and scrutinizing the timeline of this event, and have put together a list of steps that will reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope that any single problem can affect. We intend to take all these steps; some are not easy, but we're committed to keeping Google's services exceptionally reliable. In the meantime, rest assured that we take every outage very, very seriously, and as always we'll post a full incident report of what happened to the Apps Dashboard once our investigation is complete. Again, we apologize for the inconvenience and frustration that the outage has caused.

9 comments:

SidCool said...

Thanks Google, for explaining what happened. It feels like we are the part of the solution when such transparency is maintained.

PotteryGuy said...

We all experience system outages from time to time. That's technology. The key is to not waste time pointing the finger or denying the issues but instead drive to root cause and put the right process or technology in place to make sure you never hit the same problem twice.

Thank you for the transparency into the issue. It's refreshing.

shinji257 said...

Why can't you just find the bug and squash it?

Jud said...

Nice transparency. Bummer though. No-one's immune to issues like this. The best you can hope for is a rapid rollback scenario after a bad deploy. Sounds like you've got that in place too.

Jeff Alhadeff said...

Thanks for this update! The 30 minutes of down time is nothing when compared to the hours and weeks of time that I've saved by Going Google.

SWASY said...

All i can say is thank u for being transparent, it feels good.

SWASY said...

All i can say is thank u Google. It feels good when an organization we rely on is this transparent.

CarlB said...

I appreciate the openness; however, this illuminates the Achilles' heel of services such as this. When they fail, they fail for everyone, globally. While some may be able to get on without proper document building tools or the ability to research something held in a Google Doc for an hour, email is akin to dialtone and users expect it to work, period. Any mechanism to segment users and implement some A/B testing would go a long way towards isolating and reducing the impact of such changes, as well as reducing the rollback time needed should that be required.
If the SAR team that I worked with had been on a mission where they relied on Google Docs, the outage would have impaired their ability to communicate with the teams and reduced the effectiveness of the search effort. In a live search scenario, that could make the difference between someone surviving or not.

Nathan said...

Loving the transparency, keep up the good reporting!
