The Great TGTAP Downtime 2013
As most of you have probably noticed, this site, along with others hosted by my friend Chris at TGTAP, was down for about a week because of hard drive failures. Here's what he has to say about it:

Quote: We're back!

First off, we're sorry about the lengthy downtime we suffered during the past week. For the most part it was completely beyond our control. We did unfortunately suffer some data loss, some of which was simply due to corruption. Now if you continue reading I'll explain what happened, or just skip to the end if you don't care and just want to know wtf is up with the site right now.

Back in February we suffered some downtime, over 24 hours in fact. Scanning the system logs, we were able to pinpoint this to an issue with one of the hard drives, but it was difficult to determine which one (we have 4 in a RAID 10 array) when the tools we had available were reporting them all as healthy! Fast-forward to last week: one of the hard drives decided to completely fail on us. Those of you who know about RAID will know that a single hard drive failing in the array is not a problem, so we simply asked the server techs to replace it for us. No big deal.

As I was submitting the support ticket, I was running tests on the other 3 hard drives to check they were OK. Turns out they weren't. A drive in the second pair was also failing, and as the test finished running, it did fail. This brought the server into a read-only state, so we rebooted to allow the techs to replace the first bad disk. This was done, but we had to wait practically a whole day for it to finish filesystem checks and rebuild itself into the array.
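
For anyone unfamiliar with RAID 10, here is a minimal sketch of why one dead drive is "no big deal" while a second failure leaves no margin. It is only an illustration: the drive numbering and the pairing (drives 0 and 1 mirrored, drives 2 and 3 mirrored) are assumptions, since the actual layout of the server isn't stated.

```python
from itertools import combinations

# Assumed layout (hypothetical): two mirrored pairs striped together.
PAIRS = [(0, 1), (2, 3)]


def raid10_survives(failed, mirror_pairs):
    """Return True if every mirror pair still has at least one healthy drive."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in mirror_pairs)


# A single failed drive: its mirror partner still holds the data.
print(raid10_survives({0}, PAIRS))      # True

# One failed drive in each pair: still readable, but no redundancy left.
print(raid10_survives({0, 2}, PAIRS))   # True

# Both drives of the same pair: that half of the stripe is gone.
print(raid10_survives({0, 1}, PAIRS))   # False

# Of the 6 possible two-drive failures, count how many the array survives.
survivable = sum(
    raid10_survives(combo, PAIRS) for combo in combinations(range(4), 2)
)
print(survivable)                       # 4
```

Note that this only models whole-drive failure. A drive that is failing slowly can still hand back bad data, and RAID does nothing to protect against filesystem corruption, which is why backups matter regardless.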

The bad news got worse after that and long story short, the OS had become corrupted and we were unable to get the server to boot up. Combine this problem with incredibly slow support staff and you see where this is heading...
In the days that followed, they eventually replaced the second failed disk for us, and within another couple of days they finally got it into rescue mode and were able to get the server back online (today).

Unfortunately this was when I discovered quite a lot of file integrity loss, corrupted files everywhere. After working all night to get the server back into a workable state, we realised that unfortunately our database system was completely fucked, for lack of a better phrase. Worse still, our on-site nightly backups were mostly lost. The most recent off-site backups we had were over 2 weeks old, but these had to do.

What I've done is restore a database backup from 13th March. Anything that happened since then has been lost. As for files and uploads, we believe most of them are OK, but chances are there are some missing files we aren't aware of. Please let me know (in this topic) if you are experiencing errors or other weirdness on the forums or anywhere else on the website.

While no one is to blame for what happened (except myself, for not having more recent backups available), we feel the support staff made us wait an unreasonably long time both to replace the failed hardware and to recover the system for us. Downtime was almost inevitable with these kinds of failures, but it certainly should not have been this long. For this reason, we will be transferring TGTAP (and our other sites) to a new server within the next couple of weeks. We don't expect there to be any downtime while this happens, though the forums will be taken offline for approximately 15 minutes to ensure we have successfully migrated the data.

That's all.