UPDATE: Shortly after 5 p.m. ET on Tuesday, Amazon said that the affected services were fully recovered and operational. =====Original Story ==== Sure, Amazon is a huge retailer and a giant media company, but its biggest presence is hidden from a lot of folks: It’s a massive internet host. Thousands of sites and companies rely…
In the tweet storm that followed the outage of Amazon’s S3 service at US-EAST-1, I learned a few things.
- Amazon claims eleven 9s (99.999999999%) of “Durability”, which by most accounts means protection against loss of data. So there is a very low chance Amazon S3 will lose your data in any incident.
- Depending on your level of buy-in and your SLA level, you may get as high as 4-9s (99.99%) of “Availability”, which allows about 52 minutes of downtime per year.
- Amazon S3 users may get less than 4-9s of availability, with the lowest being one 9, or 90% availability. That would allow about 36 days of downtime per year.
- 52 minutes vs. 36 days of downtime on Amazon S3 is one hell of a range. Hopefully paying Amazon “some” amount of money would boost you into something higher than 90% availability.
- Failover is something you as the app developer and DevOps crew need to build and test in your own app/testing environment. You choose what level of failover/spare infrastructure you want to pay Amazon for, then YOU fail over the application. You can fail over across Availability Zones or across Regions (multi-region), paying more as you go up the ladder of diverse routes and geographic dispersion. That’s on you.
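To check the downtime numbers in the list above for myself, here is a minimal Python sketch of the arithmetic. The function name and the 365-day year are my own assumptions for illustration; nothing here comes from Amazon's SLA documents.

```python
# Convert an availability percentage into the downtime it allows per year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service can be down and still meet the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Walk from one 9 up to four 9s and print the allowed downtime for each.
for label, pct in [("one 9", 90.0), ("two 9s", 99.0), ("three 9s", 99.9), ("four 9s", 99.99)]:
    hours = downtime_hours_per_year(pct)
    print(f"{label:>8} ({pct}%): {hours / 24:7.2f} days = {hours * 60:9.1f} minutes")
```

Four 9s works out to roughly 52.6 minutes a year, and one 9 to 36.5 days, which matches the range quoted above.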
So, learning this as I go along, I realize I’m very much at the mercy of my upstream “provider”. In this case it’s a packaged web app that the provider has hosted on Amazon S3. The people I work for sign a contract with the web app provider, but I am never privy to the details of the contract (Service Level Agreement) the provider has with Amazon. I know nothing about their architecture, design or disaster recovery plan. But given what we pay that provider, and knowing they handle many other accounts in addition to mine, I’m thinking they are making a wise choice hosting on Amazon S3. They must know something I don’t, and they MUST have architected their web app to work gracefully within the Amazon DevOps platform, designed to fail over with no boost/assist from Amazon other than keeping its Zones and Regions running as much as possible.
All that would be naïve magical thinking in the Universe we inhabit now, I fear. Our web app was out from 12:45 p.m. to 5:00 p.m. EST. I worked another 2 hours after that to ensure all the queues and submitted work completed and that the service is ready for tomorrow (Wednesday) at 9 a.m., when it starts the day doing work again. I’m so much more thankful for the last 3.5 years, in which this web app has had little to no downtime. I guess we got lucky. That’s something at least.
Whatever the root cause, I hope Amazon comes clean, lets the cat out of the bag and really begs everyone’s forgiveness. The trust in not just the service but the expertise, all the articles written about Amazon’s engineers, patents, research and papers on data center design/ops/architecture, is now flushed down the toilet. I don’t care now how much bigger Amazon is than Google/Facebook/MS Azure.
There is a golden opportunity now for any entrepreneur out there to make web apps that can turn on a dime and move effortlessly between Google’s data centers, Microsoft’s and, yes, Amazon’s. Or to provide the glue and expertise to make that happen, and then make the big 3 bid on hosting your damned service on their infrastructure. Then outages like today’s Amazon S3 downtime would mean something. It would mean breach of contract, and you would pick up your toys, your DNS entries, your databases, your whole stack and move it to a competitor in the blink of an eye. That’s what I want: a cell phone service-like compute/storage infrastructure I can turn over/cut over when I get dissed, upset or disappointed by the performance of the hosting provider. No Vendor Lock-in, Free as in Freedom.
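For what it’s worth, the “glue” I have in mind does not have to start out complicated. Below is a rough Python sketch of the idea, assuming each hosting provider or region is wrapped in a plain function that either returns the data or raises an error. The names `fetch_with_failover`, `primary_region` and `secondary_region` are hypothetical stand-ins I made up, not anybody’s real API, and a real setup would also have to keep the data in sync across providers, which this ignores.

```python
from typing import Callable, List, Tuple

# Hypothetical fetcher type: takes an object key, returns its bytes or raises.
Fetcher = Callable[[str], bytes]


def fetch_with_failover(key: str, backends: List[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each backend in order; return (backend_name, data) from the first that answers."""
    errors = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real app would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stand-ins for real storage clients, for illustration only.
def primary_region(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def secondary_region(key: str) -> bytes:
    return b"contents of " + key.encode()


used, data = fetch_with_failover(
    "report.csv",
    [("primary", primary_region), ("secondary", secondary_region)],
)
print(used, data)
```

The point of the ordered list is that the application, not the hosting provider, decides where to go next when the first choice is down.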