Does resilvering a disk not "verify" all data on said disk?

Posted: 29 Nov 2015 12:26
by m_seitz
Some days ago, my raidZ2 (6x 4 TB) failed on me, but things turned out alright without data loss (other thread).

What puzzled me was the "incomplete resilvering" of a failed disk that I took offline and online again (after I checked it for bad blocks in another PC).

scan: resilvered 16.6G in 0h56m with 0 errors
My pool contains 17.1 TB of data (78 % capacity used), so I thought that resilvering would rewrite about 78 % of the failed disk's capacity after I set it online again. That would amount to ~3 TB of data, instead of 16.6 GB.
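To make the mismatch concrete, here is a rough back-of-the-envelope sketch of that estimate. It assumes the 78 % fill level applies evenly to each 4 TB disk and uses decimal units (1 TB = 1000 GB); the function and variable names are only illustrative.

```python
# Rough estimate only: assumes each disk is filled to the same
# percentage as the pool and uses decimal units (1 TB = 1000 GB).
DISK_SIZE_TB = 4.0          # size of one member disk (from the post)
POOL_USED_FRACTION = 0.78   # pool capacity in use (from the post)
RESILVERED_GB = 16.6        # value reported in the "scan:" line


def expected_full_resilver_gb(disk_size_tb, used_fraction):
    """Data a full rebuild of one disk would have to write, in GB."""
    return disk_size_tb * 1000 * used_fraction


expected_gb = expected_full_resilver_gb(DISK_SIZE_TB, POOL_USED_FRACTION)
fraction = RESILVERED_GB / expected_gb

print(f"expected for a full rebuild: ~{expected_gb:.0f} GB")
print(f"actually resilvered: {RESILVERED_GB} GB ({fraction:.2%} of that)")
```

So the resilver touched well under 1 % of what a full rebuild of the disk would have written.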

After a Google search, the only hint I could find was a mention of "dirty time logging" in a Nexenta white paper (PDF). Did ZFS notice that the disk I set online was the old disk that still contained data, and not a different replacement disk? Did ZFS only resilver the data that was "touched" while the failed disk was offline?

Re: Does resilvering a disk not "verify" all data on said disk?

Posted: 29 Nov 2015 13:12
by ^nighthawk^
I think a resilver only changes what it needs to. Admittedly I have not been in your situation, but I imagine that it has read your old data and just made the changes it feels are required. It isn't a full scrub.

This link appears to verify this behaviour. http://firstboot.blogspot.co.uk/2012/10 ... at-is.html

You are right, though: I would have expected this to take longer, so that is strange.

It would take a long time with a fresh disk and the 'replace' command, as I have done that myself a few times. Interesting to hear others' thoughts.

Re: Does resilvering a disk not "verify" all data on said disk?

Posted: 29 Nov 2015 16:10
by m_seitz
Hm, if it really works that way, could cloning a failed HDD be a safer alternative to replacing it with ZFS?

I have never had a disk fail on me without most of its data still being readable. If I take a failed disk offline, clone it and bring the clone online, ZFS would only resilver a little data. The other disks would only be stressed during the subsequent scrub. During that scrub, a raidZ2 could have 2 additional HDDs fail. In that case, I would not lose my pool but only a few files that resided in the first failed HDD's unreadable sectors. To top it off, cloning the additionally failed disks could actually rescue these files. The probability of unreadable sectors in the same files on 3 disks is very low.
Or did I miss something here?

I hope this does not sound too paranoid :-)

Re: Does resilvering a disk not "verify" all data on said disk?

Posted: 01 Dec 2015 04:01
by Onichan
I have noticed the same thing: when a disk has been offline for a bit and is brought back online, ZFS only resilvers the data it missed.

It seems to keep track, for each disk, of where it left off, but I have no idea how far back that goes. I do know that a scrub will read all data on all the disks in the pool and verify it against the stored checksums to make sure it's good. So you should scrub the pool to make sure it's good.
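A toy model of that "keep track of what the disk missed" idea, just to illustrate why the resilver can be so small. This is not ZFS's actual code: it assumes, as a simplification, that the pool records the range of transaction groups (txgs) written while a disk was away and that only blocks born in such a range need repair. All class and function names here are made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    birth_txg: int  # transaction group in which the block was written


@dataclass
class ToyDisk:
    # (first_txg, last_txg) ranges that were written while the disk was away
    dirty_ranges: list = field(default_factory=list)

    def mark_missed(self, first_txg, last_txg):
        """Record that the disk missed all writes in this txg range."""
        self.dirty_ranges.append((first_txg, last_txg))

    def needs_resilver(self, block):
        """A block needs repair only if it was born while the disk was away."""
        return any(lo <= block.birth_txg <= hi for lo, hi in self.dirty_ranges)


def blocks_to_resilver(disk, blocks):
    """Select only the blocks the returning disk could not have received."""
    return [b for b in blocks if disk.needs_resilver(b)]


pool_blocks = [
    Block("a", 10),    # written long before the outage
    Block("b", 50),
    Block("c", 120),   # written while the disk was offline
    Block("d", 130),   # written after the disk came back
]

disk = ToyDisk()
disk.mark_missed(100, 125)  # disk was offline for txgs 100..125

resilver_names = [b.name for b in blocks_to_resilver(disk, pool_blocks)]
print(resilver_names)
```

In this model the old blocks are never read back or checked during the resilver, which is why a scrub afterwards is still needed to verify everything else on the disk.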

Re: Does resilvering a disk not "verify" all data on said disk?

Posted: 05 Dec 2015 09:29
by m_seitz
My raidZ2 suffered another failure. Knowing that it was not critical, I simply rebooted. ZFS resilvered 1 MB, without me taking the failed disk offline and online again. Apparently, ZFS heals itself very efficiently :)
I wouldn't trust my raidZ2 without a serious scrub afterwards, but it makes sense to get the pool online without stressing the disks too much.

(Yes, I am trying to find out what's wrong, but that's not the point of this thread ;) )

Re: Does resilvering a disk not "verify" all data on said disk?

Posted: 07 Dec 2015 09:00
by b0ssman
Next time this happens, look at the dmesg output.

Also post the SMART values of all your drives.