PHP Fog DB Outage Postmortem
Over the past month the team at AppFog has been working on migrating all of our users to a new, more robust and stable shared database service. This migration did not happen fast enough. As a consequence, one of our shared database servers became unstable on Monday, leaving many apps unavailable. We cannot apologize enough for this and are working to ensure it never happens again.
The Details
At 11:34am CST on Oct 8th, we began to see unresponsiveness from the affected shared database server. A very large number of connections stacked up, and once this happened the database server became completely unresponsive and we were forced to restart it.
After the restart, the database server took 30 minutes to come back up and was then stable for approximately two hours. Unfortunately, connections stacked up as before and the database server again became unresponsive. Once again, we had to restart. This second restart went past the 30-minute mark, at which point we began spinning up a new database server instance and restoring from a snapshot. Unfortunately, both of these processes took the better part of eight hours within Amazon’s environment. As of 10:19pm CST on Oct 8th, the original server was back online.
There are no excuses for this outage. We know that you count on us to be reliable, stable, and performant, and we let you down.
How We Intend To Fix It
Here is what we are doing to make it right: starting immediately, we are greatly accelerating our migration of users off Amazon RDS this week. In addition, all users who were affected by the outage will automatically receive an account credit.
Please accept our apologies for this outage. We will do better for you.
If you have any questions or concerns, please don’t hesitate to contact us.