Disk failure! Wow, I love Linux.

November 21, 2006 at 2:47 pm

Hmmm, yesterday one of our users complained that they were having trouble with email. A little while later, another user complained that they couldn’t connect to the Internet. Then a whole lot of things started to go wrong. So I hopped onto the server in question and started to tail the logs to try to determine what was going on… Oh oh…

Nov 19 06:33:15 horatio kernel: end_request: I/O error, dev 03:01 (hda), sector 140290664
Nov 19 06:33:38 horatio kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 19 06:33:38 horatio kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=140290727, sector=140290664
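For anyone who would rather not wait for users to complain, the log check above can be automated. Here is a minimal sketch in Python that picks the "end_request: I/O error" lines out of kernel log text. The regular expression is modelled only on the lines quoted above, and the function name find_io_errors is made up for illustration; other kernels may word these messages differently.

```python
import re

# Pattern modelled on the "end_request: I/O error" line quoted in this post.
# Assumption: other kernel versions may format this message differently.
IO_ERROR = re.compile(
    r"kernel: end_request: I/O error, dev (?P<dev>\S+) "
    r"\((?P<name>\w+)\), sector (?P<sector>\d+)"
)


def find_io_errors(lines):
    """Return a (device name, sector) pair for every I/O error line found."""
    hits = []
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            hits.append((match.group("name"), int(match.group("sector"))))
    return hits


sample = [
    "Nov 19 06:33:15 horatio kernel: end_request: I/O error, dev 03:01 (hda), sector 140290664",
    "Nov 19 06:33:38 horatio kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
]
print(find_io_errors(sample))  # [('hda', 140290664)]
```

In practice you would feed it the lines of your kernel log file rather than a hard-coded sample, and send yourself an email if the list comes back non-empty.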

Yaaaargh. Looks like that’s one hard disk failure taking place, and it’s on our primary master disk. Time to stop a whole bunch of services and see what we can salvage. I ran up to one of the high-street computer shops and quickly bought a new hard disk (not bad, you can pick up a 300GB disk for £80 at the moment). We shut down the server, installed the blank disk and did a Debian Linux base install on it. Then I set the new disk to slave and rebooted on the old disk to see what we could recover.

Fortunately, the disk was not too far gone and the server came up with little trouble. We quickly installed smartctl (part of the smartmontools package), which is a nifty tool for doing SMART analysis on your disks. Although the disk was reporting healthy in SMART, there were I/O errors, which suggested that its lifetime was well nigh at an end. Time to start building the replacement system.
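A rough sketch of how the SMART health check could be scripted in Python follows. It assumes smartmontools is installed, that the script runs as root, and that the disk is /dev/hda as in the logs above. The helper names health_passed and check_disk are invented for this example. It relies on the "overall-health self-assessment test result" line that smartctl -H prints for ATA disks.

```python
import subprocess


def health_passed(smartctl_output):
    """Interpret the overall-health line of `smartctl -H` output.

    Returns True for PASSED, False for any other verdict, and None when
    no health line is present (for example, SMART is unsupported).
    """
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
    return None


def check_disk(device="/dev/hda"):
    """Run smartctl against a device (assumes root and smartmontools)."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return health_passed(result.stdout)


# Parsing a sample health line, so the sketch runs without a real disk.
print(health_passed("SMART overall-health self-assessment test result: PASSED"))  # True
```

As this incident shows, a PASSED verdict is not the whole story: the disk here reported healthy while the kernel was logging uncorrectable errors. Treat the result as one signal alongside the logs, not as proof that the disk is fine.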

I mounted the new disk and chrooted to the mountpoint. In a separate terminal window, I could access the current version of the server and view a list of all of the applications that had been installed. On the new chrooted disk, I could install the equivalent applications and start to configure the new server. I copied across configuration files from the old server disk, including password databases and email aliases, and finally copied across each user’s home folder with all of their email.
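Comparing the two package lists by eye works, but it is easy to miss something. One way to do the comparison mechanically is sketched below in Python, assuming you have saved the output of dpkg --get-selections from both the old system and the new chroot. This is not necessarily how the rebuild above was done. The function names and the package names in the sample data are purely illustrative.

```python
def installed_packages(selections_text):
    """Parse `dpkg --get-selections` output into a set of installed packages.

    Each line is expected to be a package name, whitespace, then a state
    such as "install" or "deinstall"; only "install" entries are kept.
    """
    packages = set()
    for line in selections_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "install":
            packages.add(parts[0])
    return packages


def missing_on_new(old_text, new_text):
    """Sorted list of packages installed on the old system but not the new."""
    return sorted(installed_packages(old_text) - installed_packages(new_text))


# Hypothetical sample data; the real lists come from the two systems.
old = "exim4\t\t\tinstall\nsamba\t\t\tinstall\nbash\t\t\tinstall\n"
new = "bash\t\t\tinstall\n"
print(missing_on_new(old, new))  # ['exim4', 'samba']
```

The resulting list is what still needs installing inside the chroot before the configuration files are copied across.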

Once I was confident that I had everything I needed, I rebooted the server for a second time. I removed the original disk and watched the new install boot up. In no time, I was able to confirm that all of the services were working and that our users were able to access the fileshares and their email. Nothing lost.

The entire downtime during the disk-switching operation was minimal. And we literally rebuilt the server while it was running. Our users were hardly aware of the switchover and can now be confident that their data is safely on a new disk.

A couple of things came to light in this experience:

  • It’s useful to run smartctl on your mission-critical boxes to help catch disk failure before it becomes serious
  • We would never have got the level of detail in our logs to help us solve the problem if we had been running Windows servers
  • It would have been near impossible to rebuild your server on a new disk from within a live environment on a Windows server
  • Linux rocks!
