
« When Disks Die: A ZFS Recovery Post-Mortem »

  • 12 March, 2018
  • 1,400 words
  • seven minutes read time

I read a lot of tech success stories, but most of them revolve around building out or creating cool stuff. Last week, I had a catastrophic disk failure, and all I wanted was to find some recorded notes about disk recovery in Linux with ZFS. This is a record of my experience to illustrate the strength and maturity of ZFS on Linux and potentially help anyone in a similar situation in the future.

For some context so you know what I’m working with:

  • ZFS on Linux on my home fileserver/NAS
  • A 4-disk RAIDZ2 pool (tank) housed in an old HP ProLiant N40L microserver
  • ECC memory

As a very minor side note, I’m trying to write more technically-inclined articles, but usually try way too hard to make them formal and perfect, so this is going to be a little stream-of-consciousness. Apologies if it’s hard to follow!

The Failure

A couple of days before I was set to fly to San Francisco for a conference, accessing my NAS got sluggish. I centralize any state I care about on my fileserver, from my private gitolite repositories to my wedding photos, so this data is kind of important. The first signs of slowdown came when using my various Kodi setups in my house: reads and writes to their shared MySQL storage backend (hosted on my NAS) started to hang.

Side note: I run zfs scrub on a systemd timer, but the automation that should have alerted me to failed scrubs (which were indeed failing) never did. There’s a lesson here about validating your automation, but that’s a task for another time.
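For the curious, the scrub timer looks roughly like this. These aren’t my exact unit files – the names, paths, and schedule here are illustrative – but it’s the general shape of driving zpool scrub from systemd:

# /etc/systemd/system/zfs-scrub.service (illustrative)
[Unit]
Description=Scrub the tank zpool

[Service]
Type=oneshot
ExecStart=/usr/sbin/zpool scrub tank

# /etc/systemd/system/zfs-scrub.timer (illustrative)
[Unit]
Description=Periodic scrub of the tank zpool

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target

Enable it with systemctl enable --now zfs-scrub.timer.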

The first step is always to check the storage pool with zpool status tank. Typically, a healthy pool looks like this:

NAME                                           STATE     READ WRITE CKSUM
tank                                           ONLINE       0     0     0
  raidz2-0                                     ONLINE       0     0     0
    ata-<disk id>                              ONLINE       0     0     0
    ...more disks...                           ONLINE       0     0     0

Without any checksum errors. This time, I had:

NAME                                       STATE     READ WRITE      CKSUM
ata-<disk id>                              ONLINE       0     0 <not-zero>

No bueno. I wanted to verify that something was genuinely awry with the disk, so I started by clearing any existing errors:

$ sudo zpool clear tank

Then I initiated a new scrub:

$ sudo zpool scrub tank

That kicked off a scrub, but I (naturally) wanted to see how it was progressing. It turns out that any subsequent invocation of zpool status tank hung forever. And when I say hung, I mean that even a kill -9 on the related processes got stuck – it was bad.

I ended up rebooting, and at that point, any attempts to zpool import tank also hung at the terminal. I’ve got pool problems. After ordering another disk off Amazon (priorities), I started to dig into what could be happening.

The Debugging

Unfortunately, failures like this one – frozen ZFS utility commands – didn’t turn up much on any search engine. Most ZFS tutorials are concerned with pool setup or routine disk replacement, so coming up with a path forward that didn’t involve blindly carving up my data made me a little nervous.

I ended up heading to #zfsonlinux on Freenode and received prompt and helpful guidance from the members there. Score a point for ZFS on Linux; the IRC community is available and friendly.

I wish I had remembered to keep a transcript of the chat log, but the tl;dr was somebody suggesting I disconnect the disk wholesale. Funnily enough, someone also taught me something cool I didn’t know: you can simulate pulling the plug on a disk by manipulating the virtual filesystem:

# echo 1 > /sys/block/sda/device/delete
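For what it’s worth, the inverse operation is rescanning the disk’s SCSI host, which tells the kernel to re-detect devices dropped this way – the host number below is just a placeholder for whichever one your disk hangs off of:

# echo "- - -" > /sys/class/scsi_host/host0/scan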

Unfortunately for me, even trying to traverse anywhere in that device’s directory in my /sys caused my shell to lock up, so I really had problems. Time to go caveman on my disk.

THE CULPRIT. Does seeing a disk's platters exposed hurt you?

The process wasn’t difficult – I’ve got my 4-disk RAIDZ2 array housed in an old HP ProLiant N40L microserver, so it was just a matter of shutting it down, opening the bay door, and pulling out the caddy – but it’s still a little surreal to cannibalize your own hardware.

After closing up my server and starting it up again, my zpool commands work again – hooray! – but I am, of course, missing a disk according to zpool status. At the very least, I’ve determined that yes, my disk failed so hard that even I/O calls from userspace caused bad lockups; it is really, truly dead.
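I didn’t keep the exact output, but at this point zpool status reports something roughly like the following – note that with the device node gone, the missing disk shows up only as a numeric guid:

NAME                                           STATE     READ WRITE CKSUM
tank                                           DEGRADED     0     0     0
  raidz2-0                                     DEGRADED     0     0     0
    <numeric id>                               UNAVAIL      0     0     0
    ...more disks...                           ONLINE       0     0     0

That numeric id is what I’ll feed to zpool replace shortly.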

However, that’s good news. While I do have a dead member of my zpool, I can still read and write data normally, since RAIDZ2 operates with two parity drives – I’m just running in a degraded state. I could have bigger problems, like bad metadata due to faulty RAM, but I’m running ECC memory, and since the pool behaves normally, this is a typical case of device-needs-to-be-replaced rather than pool-is-corrupted-somehow – just with the minor complication of having to physically pull the disk to get my system healthy again.

So: removed disk, need to fix it. This is the hard part, right?

The Fix

Shut off the server, insert the new disk, boot up, and:

$ sudo zpool replace tank <numeric id> /dev/disk/by-id/<new disk id>

And that was it. All of the data remained usable while resilvering happened in the background. As an ops person, a) removing my old disk, b) adding a new one, c) replicating the data, and d) doing all of it without downtime with a single command is kind of surreal, but there it is. The resilver took a while (days), but I’m back at 100% without any data loss.
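If you want to keep an eye on the resilver, its progress shows up in the scan line of the usual status output, so monitoring it is just:

$ sudo zpool status tank

or, if you’re impatient like me, something like watch -n60 zpool status tank.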

Lessons Learned

You thought you could get away without this part, huh?