One of the two hard drives failed the day before yesterday. It was brand new, but it reported a
Code: Select all
Failed SMART usage Attribute: 184 End-to-End_Error.
I have contacted Seagate and they told me to replace the drive.
But this is not about my personal experiences so much as about a strange error. I used ZFS and created a mirror configuration. Before I knew that this is silly, I used an SSD as ZIL device with it (do not do that with a non-redundant device, or you lose one of ZFS's major features: resilience). I removed the failed drive before informing ZFS about it and shredded it so I could send it back to Seagate for a replacement. But when I powered my file server back up, I found the ZFS pool in a 'degraded' state, which is not unusual while one drive is missing and the pool is incomplete. But for some reason no drive was left in that pool any more. I started reading the ZFS documentation from Oracle, which is horrible, because it lists commands but omits the reasoning. That documentation has a chapter about
Repairing a Damaged ZFS Configuration, which suggests exporting and re-importing the pool. In one order or the other I was able to export the configuration before or after I set the hard drive BBS priorities in the BIOS, because it finally came to my mind that my OS, together with that ZIL partition, was not on /dev/ada0 but on /dev/ada2. After removing the failed drive that had changed, so that /dev/ada1 was now the SSD, on which ZFS could even find a ZFS partition (one that did not fit together with the other). That seems to have confused ZFS a bit. I believe that nas4free uses the device nodes when creating a ZFS configuration. To prevent such situations it would have been better to use drive labels or disk IDs instead; you never know how bad your BIOS really is until you notice it ...
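As a sketch, creating the mirror from GEOM labels instead of raw device nodes could look like this on FreeBSD (the label and device names here are made up; adjust them to your system):

```shell
# Label the raw disks once; the label nodes survive BIOS-induced renumbering.
glabel label mirror01 /dev/ada0
glabel label mirror02 /dev/ada1
# Build the pool from the stable label nodes instead of ada0/ada1.
zpool create zfs_pool mirror label/mirror01 label/mirror02
```

After that the pool members show up as label/mirror01 and label/mirror02 no matter which adaX numbers the BIOS hands out.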
But that story is not over. The next problem was that after restoring the previous drive order I could not re-import the zpool configuration, because one device was OFFLINE and the other had 'CORRUPTED DATA' on it. Reading forums took me hours, but I found somebody mentioning that in his case a reboot had done the job. This did not work with nas4free, so I put the drive into another computer for further investigation. There I ran FreeBSD 10 in a QEMU environment together with that 'corrupted data' drive and was for some reason able to re-import the ZFS pool with it. If you try this as well, be aware that the FreeBSD *-memstick image mounts / read-only, so that ZFS cannot create a directory under /mnt/, and you have to run
Code: Select all
mount -u -o rw /  # remount the read-only root read-write
zpool import      # without arguments it will hopefully show you the pool's name
zpool import -fF zpool_name
before re-importing or after the import (the import itself works either way, because the pool configuration gets written to the disks and is not part of your local /etc tree or such). After that I was able to reattach the disk to the nas4free server and it was recognized again, like magic ...
edit: I have received a repaired drive directly from Seagate.
Replacing a failed disk in a ZFS mirror configuration
Normally you would not do what I did above, but this:
Code: Select all
zpool status # to find the name of the corrupt device
zpool detach <zpool-name> <device-name> # to remove the device from the zpool
To rebuild the ZFS mirror, these steps seem sane:
- tell ZFS that you have removed a drive (see above)
- after receiving the new hard disk, do not turn it on immediately; give it a few hours to acclimatize, and do not cover the small hole in the casing, which compensates for differences in air pressure
- install the drive and inspect it:
Code: Select all
smartctl -a /dev/adaX # on a new drive you should see Power_On_Hours = 0
smartctl -t conveyance /dev/adaX # it will tell you how long the test takes to finish
Re-inspect the drive with the first command to view the test result.
- give the drive a label, for example:
Code: Select all
glabel label mirror02 /dev/adaX
this makes the drive available as /dev/label/mirror02
- to attach the drive to your zfs pool and make it a mirror of the remaining disk do:
Code: Select all
zpool list # to look up your pool's name
zpool status # to find the name of the disk that the new disk will mirror in the next step
zpool attach <zpool-name> <existing-disk-in-pool> /dev/label/mirror02
zpool status # will now show you something similar to this:
NAME                STATE   READ WRITE CKSUM
zfs_pool            ONLINE     0     0     0
  mirror-0          ONLINE     0     0     0
    ada0            ONLINE     0     0     0
    label/mirror02  ONLINE     0     0     0  (resilvering)
logs
  ada2s2            ONLINE     0     0     0
After these steps you can go for a coffee, because resilvering means that all data gets copied, which usually takes a while. After that your RAID configuration is up and running redundantly again.
man zpool says that the zpool attach command can also be used to attach further drives, and
Wikipedia says that three disks have a significantly better failure rate than two (0.7% instead of 5% over 3 years). But using more than three drives is not economical any more.
remark: zpool attach only works with disks that have no data on them (unless you specify -f, which is not recommended).
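Turning the two-way mirror into a three-way mirror is then the same attach command again, pointed at one of the existing members (pool and label names assumed from the example above; mirror03 is a made-up label for the third disk):

```shell
# zfs_pool and label/mirror02 are the names used above.
zpool attach zfs_pool label/mirror02 label/mirror03
zpool status   # mirror-0 should now list three devices, the new one resilvering
```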
Clear a disk drive
Code: Select all
dd if=/dev/zero of=/dev/adaX bs=1M
and check its progress by sending dd a signal from another console. But be careful to get the device right: dd will overwrite it without asking!
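On FreeBSD that signal is SIGINFO (pressing Ctrl+T in dd's terminal does the same); GNU dd on Linux listens for SIGUSR1 instead. A safe sketch against a scratch file rather than a real device:

```shell
# Demonstration against a throwaway file, NOT a real disk; the signal
# behaviour is the same either way.
dd if=/dev/zero of=/tmp/dd-demo.img bs=1M count=64 &
DD_PID=$!
sleep 1
# Ask dd to print a status line; harmless if dd has already finished.
kill -USR1 "$DD_PID" 2>/dev/null || true   # on FreeBSD: kill -INFO
wait "$DD_PID"
```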
Securely erase a disk
if you send your drive back to your supplier, you might want to destroy its contents first, which can also be done with dd, or with shred, which does essentially the same:
Code: Select all
dd if=/dev/urandom of=/dev/adaX bs=1M
This will overwrite your drive with random data. Do that 3 to 5 times to be on the safe side. Notice that I have used bs=1M, which is not needed but speeds things up a bit.
Also notice that this does not make sense when you are using an SSD (and I do not know of any solution for those).
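The multiple passes can be put into a small loop. This sketch uses a scratch file of mine as the target for demonstration; point TARGET at the real device (e.g. /dev/adaX) only when you are absolutely sure you have the right one:

```shell
# Demonstration target: a small scratch file standing in for the disk.
TARGET=/tmp/wipe-demo.img
dd if=/dev/zero of="$TARGET" bs=1M count=8    # stand-in for the disk's contents
# Three passes of random data over the whole target:
for pass in 1 2 3; do
    dd if=/dev/urandom of="$TARGET" bs=1M count=8 conv=notrunc
done
```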