PHP Fog Blog

Post-mortem on AWS Outage

As many of you know already, Amazon’s AWS East data center in Northern Virginia was affected by a major power outage last night. The company reported at 8:50 pm: “We are investigating degraded performance for some volumes in a single AZ [availability zone] in the us-east-1 region.” It was not until 3:26 AM PT that Amazon reported that “the service is now fully recovered and is operating normally.”

Because PHP Fog relies on AWS East as its infrastructure vendor, some apps hosted on PHP Fog experienced a disruption last night.

More specifically, PHP Fog relies heavily on AWS’ Elastic Block Store (EBS) for things like database persistence, and this piece of AWS’ infrastructure was heavily affected by the outage. Amazon marked one of our databases (db01) as failed, its disk became corrupted, and we were forced to restore it to a new virtual machine. Once the restore was complete, we verified the binlogs and updated our infrastructure to communicate with the new host.

Our hosts had been communicating with db01 via an internal IP address, which is ephemeral and so changed when the database moved to the new virtual machine. This has been corrected. The entire process, from dealing with the failed hardware to verifying the possible corruption, took a couple of hours. All told, we had the database itself and database connections from dedicated hosts restored by 4am PST.

We would strongly recommend that legacy users point their database connections to db01.phpfog.com rather than db01-share.
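For anyone updating their own configuration, the idea behind this recommendation can be sketched in a few lines. This is only an illustration, written in Python for brevity: the hostname comes from this post, while the port (MySQL's default, 3306) and the function names are our assumptions, not part of any PHP Fog API. The point is that connection settings should carry a DNS name, which can follow the database to a new machine, rather than a raw IP address, which cannot.

```python
import ipaddress

# Hostname taken from the post; the port and function names are assumptions.
DB_HOST = "db01.phpfog.com"
DB_PORT = 3306  # MySQL's default port, assumed here


def is_ip_literal(host):
    """Return True if host is a raw IPv4/IPv6 address rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def build_db_settings(host=DB_HOST, port=DB_PORT):
    """Build connection settings, refusing hard-coded IP addresses."""
    if is_ip_literal(host):
        raise ValueError("use a DNS hostname, not an IP address: " + host)
    return {"host": host, "port": port}
```

The same check applies in any language: if the value in your database configuration parses as an IP address, it is worth replacing with the hostname.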

At this time, PHP Fog is back up and running and has not experienced any significant disruptions for the last several hours. If anything changes over the course of the day, we will make announcements in this space and on Twitter.

While it would be easy to simply blame this outage on AWS, it’s also incumbent upon us to build in a way that assumes that infrastructure failures happen. This outage has exposed problems in our own crisis-management capabilities from which we’ve already learned a great deal. Improving our high-availability/failover architecture is a top priority for us.

This incident is a reminder of why we’re working tirelessly to make AppFog the first infrastructure-neutral public PaaS out there.

Again, our deepest apologies for the disruption. We sincerely thank you all for your patience and for the many kind words we’ve received from some of you already!
