Fix data corruption issue via file backups or another method
Posted: 29 Jan 2014 16:23
by cchayre
I just finished migrating a zpool from one NAS to a fresh build, and a scrub on the new system reported some data corruption. My suspicion is that the corruption existed before the move; shame on me for not running a scrub on the old NAS first. Does anyone have a recommendation for resolving this in a safe, reliable, long-term way?
I have at least 2-3 good copies of all files in question (those with permanent errors, as shown in the output below). Would it simply be enough to do an rsync w/checksum to overwrite the files in question or should I be looking to do something a bit more drastic, e.g. blowing away the zpool and starting from scratch?
  pool: grandcentral
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 92.4M in 4h26m with 239 errors on Wed Jan 29 03:33:19 2014
config:

        NAME            STATE     READ WRITE CKSUM
        grandcentral    ONLINE       0     0   241
          raidz1-0      ONLINE       0     0   482
            ada0        ONLINE       0     0   693
            ada1        ONLINE       0     0   410
            ada3        ONLINE       0     0   713
            ada2        ONLINE       0     0   403

errors: Permanent errors have been detected in the following files:
/mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00002.MTS
/mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00017.MTS
/mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00030.MTS
/mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00032.MTS
/mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00034.MTS
<list truncated>
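Since the plan is to restore only the files named in the status output, it can help to pull that list out programmatically. Below is a minimal Python sketch, assuming the output of `zpool status -v` has been captured as text; the function name `permanent_error_files` is made up for illustration, and entries that are not plain paths (such as metadata references) are simply skipped.

```python
def permanent_error_files(status_text):
    """Return the file paths listed after the 'Permanent errors' line."""
    files = []
    in_list = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors: Permanent errors"):
            in_list = True
            continue
        # Only keep entries that look like absolute paths.
        if in_list and stripped.startswith("/"):
            files.append(stripped)
    return files


# Two entries copied from the status output in this thread.
sample = """\
errors: Permanent errors have been detected in the following files:
        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00002.MTS
        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00017.MTS
"""

for path in permanent_error_files(sample):
    print(path)
```

The paths contain spaces, so the sketch keeps each line whole instead of splitting on whitespace.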
Re: Fix data corruption issue via file backups or another me
Posted: 29 Jan 2014 16:51
by raulfg3
cchayre wrote:Would it simply be enough to do an rsync w/checksum to overwrite the files in question
Overwriting them with good copies is enough.
cchayre wrote:Does anyone have a recommendation for resolving this in a safe, reliable, long-term way?
Set up a cron job to scrub the ZFS pool; once a month, or once every two months, can be enough.
PS: Check the SMART values for your disks; sometimes corruption comes from a defective or worn-out disk.
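As a rough illustration of the scheduled-scrub advice, here is a Python sketch that reads the date from a `scan:` line like the one posted above and decides whether a new scrub is due. The helper names and the 30-day default are assumptions for the example, and the `scan:` line format differs between ZFS versions, so the pattern only targets the format shown in this thread.

```python
import re
from datetime import datetime, timedelta

# Matches the trailing "on Wed Jan 29 03:33:19 2014" part of the scan line.
SCAN_DATE = re.compile(r"\bon (\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*$")


def last_scrub_time(scan_line):
    """Return the scrub completion time, or None if no date is present."""
    match = SCAN_DATE.search(scan_line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")


def scrub_due(scan_line, now, max_age_days=30):
    """True when no scrub date is found or the last scrub is too old."""
    last = last_scrub_time(scan_line)
    if last is None:
        return True
    return now - last > timedelta(days=max_age_days)


line = "scan: scrub repaired 92.4M in 4h26m with 239 errors on Wed Jan 29 03:33:19 2014"
print(scrub_due(line, datetime(2014, 2, 10)))  # False: 12 days since the scrub
print(scrub_due(line, datetime(2014, 3, 15)))  # True: 45 days since the scrub
```

A cron job could run a check like this and start `zpool scrub` on the pool when it returns True. Scheduling the scrub directly from cron once a month is the simpler option.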
Re: Fix data corruption issue via file backups or another me
Posted: 29 Jan 2014 18:47
by cchayre
Thanks for the reply and feedback! It is definitely my intent to get the new build on a scheduled scrub.
I wasn't sure in regards to the errors whether it was something with the files themselves or something underlying (structurally with the pool itself). I will run an rsync with checksum to overwrite (hopefully). I tried manually checking the md5 hash against known good copies and it (md5) threw an IO error when calculating.
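The manual check described here (hash the pool copy, compare it with a known-good copy, overwrite on mismatch) can be sketched in Python as follows. This is only an illustration, not a replacement for rsync; `file_md5` and `restore_if_damaged` are hypothetical helper names. An unreadable file, such as one that throws an IO error while being hashed, is treated as damaged.

```python
import hashlib
import shutil


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, or None if it cannot be read."""
    digest = hashlib.md5()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    except OSError:
        # Covers missing files and read errors such as the IO error above.
        return None
    return digest.hexdigest()


def restore_if_damaged(good_copy, pool_copy):
    """Overwrite pool_copy from good_copy if it is unreadable or differs.

    Returns False when the pool copy already matches, otherwise True
    if the freshly written copy hashes the same as the good copy.
    """
    good_hash = file_md5(good_copy)
    if good_hash is None:
        raise OSError(f"cannot read known-good copy: {good_copy}")
    if file_md5(pool_copy) == good_hash:
        return False
    shutil.copy2(good_copy, pool_copy)
    return file_md5(pool_copy) == good_hash
```

Reading a file back right after writing it may be served from cache, so a matching hash here does not prove the on-disk blocks are good; a later scrub is still the real check.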
After I overwrite the files and what not, will the zfs status change and/or look more promising or will I need to do a zpool clear?
I will recheck the SMART data for the drives, but I did glance over them and everything looked to be in working order. The drives are WD REDs and only have ~40 hours of uptime on them. My guess is that the errors either stem from the previous NAS system or a bad rsync or two. I did a lot of starting/stopping the rsync when I initially replicated data to these drives.
Re: Fix data corruption issue via file backups or another me
Posted: 29 Jan 2014 19:19
by raulfg3
cchayre wrote:After I overwrite the files and what not, will the zfs status change and/or look more promising or will I need to do a zpool clear?
You need to run a scrub first to be sure there are no more bad files, and then a zpool clear to clear the last ZFS error message.
cchayre wrote: I did a lot of starting/stopping the rsync when I initially replicated data to these drives.
The files were probably damaged at this point.
Re: Fix data corruption issue via file backups or another me
Posted: 29 Jan 2014 21:11
by cchayre
I know this is somewhat out of context with the original question, but do you have any good reading material on why stopping/restarting rsync transfers would potentially cause corruption? My presumption or hope would be that the size/time comparison would be enough in this situation, though I understand it's not quite as comprehensive as cross-checking checksums.
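To make the size/time versus checksum distinction concrete, here is a small Python sketch that imitates the two kinds of comparison. It is not rsync's actual code; `quick_check_same` and `content_same` are illustrative names, and the timestamp is an arbitrary value. Two files with the same size and modification time pass the quick check even though one byte differs, while a content hash catches the difference.

```python
import hashlib
import os
import tempfile


def quick_check_same(a, b):
    """Compare size and modification time only (whole seconds)."""
    stat_a, stat_b = os.stat(a), os.stat(b)
    return (stat_a.st_size == stat_b.st_size
            and int(stat_a.st_mtime) == int(stat_b.st_mtime))


def content_same(a, b):
    """Compare full file contents via MD5."""
    def digest(path):
        with open(path, "rb") as handle:
            return hashlib.md5(handle.read()).hexdigest()
    return digest(a) == digest(b)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    with open(src, "wb") as handle:
        handle.write(b"A" * 4096)
    with open(dst, "wb") as handle:
        handle.write(b"A" * 4095 + b"B")  # same size, last byte differs
    # Give both files the same (arbitrary) timestamp.
    os.utime(src, (1391000000, 1391000000))
    os.utime(dst, (1391000000, 1391000000))
    quick = quick_check_same(src, dst)
    deep = content_same(src, dst)

print(quick, deep)  # True False
```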
Re: Fix data corruption issue via file backups or another me
Posted: 29 Jan 2014 22:07
by substr
You need to immediately stop using the new system. Like NOW.
These checksum errors did not come from rsync. Your new system has a major defect.
My reason is thus: If the damage was on the original pool, rsync would have failed with read errors. Since it completed, the original pool read fine and was not corrupt. The writes to the new pool were corrupt, which means bad hardware.
Re: Fix data corruption issue via file backups or another me
Posted: 30 Jan 2014 02:19
by cchayre
What makes you think that? No data was written to the drives after import. I merely imported the zpool and ran a scrub. The data was already there. Likewise, I am skeptical because I previously had another zpool running on this very hardware that has thus far been rock-solid (knock-on-wood).
Re: Fix data corruption issue via file backups or another me
Posted: 30 Jan 2014 04:12
by substr
Ah. Whatever system wrote to that pool is defective. So if the old system wrote the pool, stop using the old hardware. If the old pool on old hardware doesn't seem to be showing errors (WARNING: do not run a scrub on suspected bad hardware, as that can increase corruption), then the problem could be in the controller/cables/ports used for the new pool during the transfer. Otherwise, bad mobo/CPU/RAM.
Re: Fix data corruption issue via file backups or another me
Posted: 30 Jan 2014 04:36
by cchayre
Is it plausible, though, that my data corruption was more the result of a bad rsync (or several) to begin with?
Re: Fix data corruption issue via file backups or another me
Posted: 30 Jan 2014 17:33
by cchayre
I definitely think the issue must be something with the old NAS. I moved the pool back to the old system and tried replicating good data to the drives (rsync archive w/checksum), and I'm seeing an occasional checksum error increment on the pool/raidz. I guess I won't be using the old NAS anymore. It's a shame too; I was hoping to leverage it as an off-site backup. Unless, of course, the hardware would be suitable for a different NAS system, e.g. a Linux-based OS w/software RAID?
        NAME            STATE     READ WRITE CKSUM
        grandcentral    ONLINE       0     0   355
          raidz1-0      ONLINE       0     0   710
            ada1        ONLINE       0     0     0
            ada0        ONLINE       0     0     0
            ada3        ONLINE       0     0     0
            ada2        ONLINE       0     0     0
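When watching counters increment like this, a small parser makes it easy to see at a glance which rows carry errors. The Python sketch below reads the config table from the listing above; `parse_config_rows` is a hypothetical helper, and it assumes the counters are plain integers as they are in this thread.

```python
def parse_config_rows(text):
    """Map each NAME in a zpool status config table to its counters."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect NAME STATE READ WRITE CKSUM with numeric counters.
        if len(parts) == 5 and all(p.isdigit() for p in parts[2:]):
            name, state, read, write, cksum = parts
            rows[name] = {
                "state": state,
                "read": int(read),
                "write": int(write),
                "cksum": int(cksum),
            }
    return rows


# The counters posted after moving the pool back to the old NAS.
sample = """\
NAME            STATE     READ WRITE CKSUM
grandcentral    ONLINE       0     0   355
  raidz1-0      ONLINE       0     0   710
    ada1        ONLINE       0     0     0
    ada0        ONLINE       0     0     0
    ada3        ONLINE       0     0     0
    ada2        ONLINE       0     0     0
"""

rows = parse_config_rows(sample)
nonzero = {name: row["cksum"] for name, row in rows.items() if row["cksum"] > 0}
print(nonzero)  # {'grandcentral': 355, 'raidz1-0': 710}
```

Here only the pool and raidz rows show checksum errors; none of the four disks does.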
Re: Fix data corruption issue via file backups or another me
Posted: 30 Jan 2014 18:41
by substr
No, rsync can't cause ZFS checksum errors. If rsync wrote bad data, ZFS wouldn't know. Only your md5 comparisons would show it. When ZFS has rampant checksum errors like you saw, it means something is going wrong in the hardware, and that ZFS and/or the hardware are writing corrupt blocks/checksums/parity to the drives.
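The distinction above can be illustrated with a toy model. The Python sketch below is not how ZFS is implemented; `ToyBlockStore` is an invented class that just stores a checksum alongside each block. If the application hands over bad data, the checksum is computed over that bad data and the read succeeds. If the data is altered after the checksum was computed, as faulty hardware might do, the read fails.

```python
import hashlib


class ToyBlockStore:
    """Toy model only: keeps a checksum next to each stored block."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, data, corrupt_in_flight=False):
        # Checksum is computed over whatever the application handed in.
        checksum = hashlib.sha256(data).hexdigest()
        stored = data
        if corrupt_in_flight:
            # Simulate hardware flipping bits after the checksum was taken.
            stored = bytes([data[0] ^ 0xFF]) + data[1:]
        self.blocks[key] = (stored, checksum)

    def read(self, key):
        stored, checksum = self.blocks[key]
        if hashlib.sha256(stored).hexdigest() != checksum:
            raise OSError("checksum mismatch")
        return stored


store = ToyBlockStore()
original = b"video frame data"
mangled_by_copy_tool = b"video frame dat?"

# Case 1: the copy tool delivered wrong bytes. The store cannot tell.
store.write("a", mangled_by_copy_tool)
print(store.read("a") == original)  # False, yet no checksum error is raised

# Case 2: the bytes changed after checksumming. The read fails.
store.write("b", original, corrupt_in_flight=True)
try:
    store.read("b")
except OSError as err:
    print("read failed:", err)
```

Case 1 is why only an external comparison, like the md5 check against a known-good copy, would reveal a bad copy.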
When you have hardware that causes ZFS problems, it should definitely not be used for ZFS. Depending on the exact nature of the problem (which is unknown right now), that same hardware might operate acceptably with different software. Or, it might do the same thing, and you just wouldn't notice until it was too late.
There is also one other possibility to consider: a marginal power supply that only causes problems when you have it loaded up with drives. In that case, the system might be safe most of the time. If the long-time configuration has never given you ZFS problems before, then it is very likely it will continue to give no problems as long as you leave it in that configuration. I would at least run something like Memtest86+ (memtest.org) on both systems. It is built into the Ubuntu ISO, if you happen to have one, and can be selected from the boot menu. I believe Windows 7 and possibly 8 DVDs also have a memory test, possibly not as thorough.
Re: Fix data corruption issue via file backups or another me
Posted: 30 Jan 2014 18:53
by cchayre
Thanks for the feedback! I'm already thinking of different hardware (to replace the old NAS, from which I pulled the HDs in the first place). Your mention of a marginal power supply may actually be something I need to look into. The case I was using was a NAS-specific case with an integrated power supply (only 120 W). Without knowing or looking into it too much, that seems a bit small for 4x 3TB WD Red drives.
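A rough way to sanity-check the power supply concern is a simple budget: PSU rating minus the peak draw of all drives and the rest of the system. In the Python sketch below, only the 120 W rating and the four drives come from this thread. The per-drive and base-system wattages are placeholders, not datasheet figures, and need to be replaced with real numbers before drawing any conclusion.

```python
def psu_headroom_watts(psu_watts, drive_count, drive_peak_watts, base_system_watts):
    """Return the spare capacity in watts (negative means over budget)."""
    return psu_watts - (drive_count * drive_peak_watts + base_system_watts)


# PLACEHOLDER values: substitute figures from the drive datasheet
# (spin-up peak) and a measurement or estimate for board, CPU, and RAM.
PLACEHOLDER_DRIVE_PEAK_W = 20.0
PLACEHOLDER_BASE_W = 35.0

headroom = psu_headroom_watts(120, 4, PLACEHOLDER_DRIVE_PEAK_W, PLACEHOLDER_BASE_W)
print(headroom)  # 5.0 with the placeholder numbers above
```

Drives generally draw the most power while spinning up, so the peak figure matters more than the idle one.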
I will also run a memtest to cross-verify.