“Disaster Porn”: Infrastructure & Hosting Nightmares That Could Be Coming To A DAM System Near You?


The ‘Uptime’ section of Arstechnica.com features an article reporting from the OmniTI Surge Conference (held last week), where some of the greatest IT failures were discussed.  The keynote was given by Ben Fried, CIO of Google, who described a previous position he held at Morgan Stanley where a critical application failed, costing the firm millions of dollars.  The article is quite tech-heavy but still contains some good general IT management best-practice principles, including this analysis:

“The real root of the problem, Fried said, was the way the organization around the system had been built. ‘Without even thinking about it, the way we scaled up was through specialization,’ Fried explained. ‘We added people to specialized teams, each operating within a functional boundary. We never said understanding how everything works is important.’ Because none of them had knowledge of how the application worked beyond their area of expertise, the teams made decisions that led to a ‘hard failure’ of the application.  As companies strive to scale up applications to handle larger tasks, Fried said, it’s increasingly important to have IT ‘generalists on the team who can look cross-functionally at systems’. ‘Scalability is pushing the boundaries of the possible,’ he said. ‘We operate at the interface of the known and unknown. Normal industrial style thinking doesn’t work, because specialists’ expertise is not good at dealing with the unknown.’” [Read More]

For anyone who may be feeling smug and self-assured that their use of a Cloud hosting facility will protect them from these kinds of monolithic corporate failures, there is a section at the end with its fair share of horror stories about Amazon EC2 too, including this presentation led by Mark Imbriaco, Director of Heroku:

“Some of the other ‘disaster porn’ at Surge yielded practical advice that Google’s CIO couldn’t give, particularly about the dark arts of dealing with Amazon’s EC2 cloud infrastructure services. EC2 was the platform of choice for most of the cloud service players at the conference; Heroku, for example, runs completely on EC2. But that’s a choice that doesn’t come without pain.  Imbriaco said that Heroku has seen ‘so many different errors from Amazon’ that they have gotten to be experts at diagnosing them, and Heroku’s own monitoring usually beats Amazon’s by 15 minutes in diagnosing problems.  And when the problems are related to ephemeral disk failures, Amazon does little to deal with them other than occasionally sending a message. ‘We will get an email saying, “Your host is in a degraded state and you need to move your stuff,”’ Imbriaco noted.” [Read More]

In the interests of balance, it is fair to say that the article quotes other presenters who, despite these issues, rate Amazon as the best Cloud provider and a ‘fantastic’ option.

I think the key point to take away from this discussion is that it is unwise to rely exclusively on the assurances of the vendor (or their resellers, in the case of DAM systems).  End users of DAMs should not get too blasé about hosting, nor assume it is a problem they can ignore as part of their wider operational risk management strategy.
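To make that advice concrete: the kind of vendor-independent monitoring Heroku describes above can start very small. The sketch below is a minimal, illustrative Python example (the probe URL, thresholds and function names are my own assumptions, not any vendor’s actual API) that checks a hosted DAM endpoint yourself and flags a degraded state after repeated failures, rather than waiting for the hosting provider’s email.

```python
# Minimal sketch of vendor-independent health monitoring for a hosted system.
# The endpoint, timeout and failure threshold are illustrative assumptions.
import time
from urllib.request import urlopen
from urllib.error import URLError


def probe(url, timeout=5.0):
    """Run one HTTP check; return (healthy, latency_in_seconds)."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as resp:
            healthy = 200 <= resp.status < 300
    except (URLError, OSError):
        healthy = False
    return healthy, time.monotonic() - start


def evaluate(results, max_failures=3):
    """Flag 'degraded' once consecutive failed probes reach a threshold.

    `results` is a sequence of (healthy, latency) tuples, e.g. from probe().
    Requiring several consecutive failures avoids alerting on one blip.
    """
    consecutive = 0
    for healthy, _latency in results:
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= max_failures:
            return "degraded"
    return "ok"
```

In practice you would run `probe()` on a schedule against your DAM’s health endpoint and feed the recent results into `evaluate()`; the point is simply that the check runs on infrastructure you control, independent of the hosting vendor’s own alerting.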
