Update: As of today (26th March) Razuna report that data recovery has been 100% successful.
Open source DAM vendor Razuna announced last week that they had sustained a 39TB data loss due to the failure of multiple RAID drives in their hosting environment. Data recovery experts were required to retrieve customer data. The hardware failure occurred on 13th March and led to a 5-day outage of their core hosting service, with data recovery completing yesterday (24th March). At the time of writing, Razuna assess that 1% of customer data is still to be retrieved.
Last week, Razuna founder and Technical Director, Nitai Aventaggiato, provided some explanation for the failure:
“This past Thursday we had a major hardware failure that affected not only our main storage server, but our backup storage as well. We’ve spent the weekend trying to recover the data, but unfortunately have not been able to do so as of yet. We have now sent the disks to a data recovery service, that will diagnose and try to repair the hard drives…We have managed to restore all data, meaning users, keywords, descriptions, folders etc. that you have created in Razuna, but the actual asset files have still not been recovered.” [Read More]
Firstly, I will say that I fully sympathise with Razuna’s plight. Server hardware failures are an utterly miserable experience for everyone involved, and they seem to strike at the worst possible times. Some years ago I was involved in a consulting assignment with a hosting company client that sustained a major outage at 7pm on Christmas Eve. The service they provided was consumer-oriented, and the company faced legal action from their customers for lost revenue if it was not operational by 26th December. All engineers had to be called in over the holiday period, and they shed a significant number of customers anyway, despite their best efforts to restore the service rapidly.
While I could sympathise with the firm during the incident, there were clear warning signals which, had the company taken them seriously, would have prevented the outage. Furthermore, a question some of their customers would need to answer is why they placed all their hosting eggs in one basket if the service was so business-critical.
Reading the description of the Razuna incident, similar issues emerge. There are both positive and negative points which are worth examining.
On the positive side, Razuna have been up-front and open about the incident. They let their customers know immediately and took steps to resolve the situation. They also communicated with users and engaged with their questions on Twitter and other social media sites. This is absolutely the right thing to do. Also, as they offer a free (and open source) edition that users can host themselves, customers have the option to handle this themselves if they prefer, which some appear to have done, judging by the Twitter comments. I am dubious about SaaS vendors that lock you into their hosting platform, so Razuna get maximum flexibility points here.
On the negative side, there were some warning signs which should have been clear. According to the blog article, they operated two mirrored RAID-6 drive arrays. RAID-6 tolerates up to two simultaneous drive failures per array. They say they had four drives fail:
“We use RAID 6 servers, so a failing harddisk is normally no problem. In our case, four disks failed at the same time, which means we can no longer run a repair to get the files back…We have a mirrored RAID 6 backup server, which had exactly the same failure at the same time. We use a mirror, so that if the main server fails, we can immediately point to the other and customers will then normally not see any downtime…Unfortunately, we were hit with a failure of the main storage and the backup server at the same time.” [Read More]
If they operated two separate mirrored RAID-6 storage facilities, and assuming the four failed drives were split 2+2 across the mirror, they would still have been operational, so I don’t understand how four failures led to a complete loss, unless it was actually eight (4+4). Nitai says that this is very rare. In my experience, it is not unheard of and happens far more often than is generally reported. The reason is that hard drives are mass-manufactured commodity devices: if there is a manufacturing fault, it can affect a whole series of drives, which can end up being purchased by the same customer for use in multiple RAID arrays. Depending on the quality control measures used by the drive manufacturer, a bad batch can lead to multiple drives failing in rapid succession at the same site. One way to mitigate this risk is to operate a mirrored storage array built from drives supplied by a completely different manufacturer.
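To make the arithmetic concrete, here is a minimal sketch of the failure tolerance of a mirrored pair of RAID-6 arrays. This is an illustrative model only, not Razuna’s actual configuration, and it ignores rebuild windows and unrecoverable read errors:

```python
# Illustrative model (not Razuna's actual setup): two mirrored RAID-6
# arrays, each of which tolerates up to 2 failed drives before data loss.
def array_survives(failed_drives, parity_drives=2):
    """A RAID-6 array remains readable while failures <= parity count."""
    return failed_drives <= parity_drives

def mirror_survives(failures_side_a, failures_side_b):
    """Data survives as long as at least one side of the mirror is intact."""
    return array_survives(failures_side_a) or array_survives(failures_side_b)

# Four failures split 2+2 across the mirror: both arrays still readable.
print(mirror_survives(2, 2))  # True -> no data loss
# Complete loss requires more than 2 failures on BOTH sides, e.g. 3+3.
print(mirror_survives(3, 3))  # False -> data loss
```

As the model shows, a total loss of both sides implies at least three failures per array, which is why four failures (2+2) alone would not explain the outcome.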
Later in the explanation, Nitai describes how they have now introduced Amazon S3 backup (after the disk failure):
“For performance reasons, we are still using a mirror with RAID 6, so a similar breakdown in one server, would not be noticeable. If two servers break down, we have a cloud backup via Amazon S3. So a similar, but still theoretically close to impossible, failure would mean that we would have down-time but that files would be recoverable.” [Read More]
I am not sure where both of their storage arrays are located, but this suggests that if they had lost their main data centre in a fire, all their customer data would have been either completely lost or would have had to be retrieved from fireproof-stored tapes (assuming they were backing up 39TB to tape or some other offline local storage device). Why was backup to Amazon S3, or something like it, not being done beforehand? S3 Reduced Redundancy Storage (which is good enough for backups) for 39TB of data costs only around $3,000 per month, a relatively low-cost form of off-site data insurance when spread across all their customers. Glacier backup is cheaper still (but slower to restore) at around $400 per month for the same amount of data. Google have recently announced even cheaper storage pricing than Amazon, so there are plenty of low-cost options available. I don’t know how much they paid for professional data recovery, but it was probably a substantial sum (and a few nervy days for Razuna worrying about whether it would be successful or not). Data recovery really should be the last resort.
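The cost comparison above is simple back-of-envelope arithmetic. The per-GB prices below are assumptions based on approximate list prices at the time of writing (check current vendor pricing before relying on them):

```python
# Back-of-envelope monthly off-site backup cost for ~39TB of customer data.
# Per-GB monthly prices are ASSUMED approximate list prices, not quotes.
DATA_TB = 39
GB_PER_TB = 1000

S3_RRS_PER_GB = 0.076   # S3 Reduced Redundancy Storage, $/GB-month (assumed)
GLACIER_PER_GB = 0.01   # Amazon Glacier, $/GB-month (assumed)

data_gb = DATA_TB * GB_PER_TB
print(f"S3 RRS:  ${data_gb * S3_RRS_PER_GB:,.0f}/month")   # about $2,964
print(f"Glacier: ${data_gb * GLACIER_PER_GB:,.0f}/month")  # about $390
```

Either figure is small compared with the cost of a multi-day outage, emergency data recovery fees and customer attrition.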
To operate safely, I recommend that clients replicate to three different geographical locations (ideally on hardware built by separate manufacturers). On Nitai’s point that few firms keep two backups: in my experience, the larger ones certainly do (often more than that). The description gives the impression that there was either aggressive cost-saving or complacency in their data safety policies. Although, I must point out again, the same is true of numerous other SaaS vendors; they just don’t tell you, and we only know about this incident because Razuna have been more transparent and honest than others.
The final point is frequent verification of backups. It’s all very well having multiple data centres, but if the backup software stops working then they are as good as useless. The best systems administrators I have worked with worry constantly about backups and data safety and know that it is very easy to make a mistake. When I meet any who claim they “sleep easy” because of their backup policy, that is when I get nervous. Doing IT properly is hard; there are many possible points of failure, and service providers should be relentless about checking for risks and making sure they know exactly what is going on in every part of their systems. If you can’t handle that kind of pressure, don’t take up systems administration as a career.
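In practice, backup verification means regularly restoring data and proving it matches the original, not just confirming the backup job reported success. A minimal sketch of that idea, assuming checksums are recorded at backup time (the paths and workflow here are hypothetical):

```python
# Minimal sketch of backup verification: periodically restore a sample
# file and compare its checksum with the original. Paths are hypothetical.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in 1MB chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(original_path, restored_path):
    """A backup is only proven good when a test restore matches the source."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Running checks like this on a schedule, against randomly sampled assets, catches silently broken backup systems before a disaster does.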
A closing point relates to the customers of Cloud or SaaS DAM. To re-iterate advice we have given many times before: you need to take direct responsibility for your own backups. Vendors that will not allow you to make your own should be avoided. If the vendor sustains data loss and you have your own backups, you can recover, with or without their help.
Last year we had some arguments on DAM News with SaaS vendors, who made the point that you do not expect to “back up” your money when it is held at a bank. Personally, if I had that option, I would take it, and many governments provide depositor protection schemes which effectively do it for you. No such state aid is available for retrieving your lost data. Your digital assets represent years of your organisation’s staff time (and therefore money) and could cost a considerable sum to recreate later.
For those interested in learning more about DAM hosting best practices, our whitepaper, Digital Asset Management Hosting: Making The Right Decision For Your Organisation, covers this subject in a lot more detail.