This is the old XigmaNAS forum in read-only mode;
it will be taken offline by the end of March 2021!



I would like to ask users and admins to copy important posts from here over to the fresh new main forum!
It is not possible for us to export from here and import into the main forum!

Fix data corruption issue via file backups or another method

cchayre
Status: Offline

Fix data corruption issue via file backups or another method

Post by cchayre »

I just finished migrating a zpool from one NAS to a fresh build and it came back with some data corruption after running a scrub. My inclination is that the data corruption existed prior to the move---shame on me for not running a scrub on the old NAS prior. Does anyone have a recommendation for resolving this in a safe, reliable, long-term way?

I have at least 2-3 good copies of all files in question (those with permanent errors as shown in the output below). Would it simply be enough to do an rsync w/checksum to overwrite the files in question, or should I be looking at something a bit more drastic, e.g., blowing away the zpool and starting from scratch?
pool: grandcentral
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 92.4M in 4h26m with 239 errors on Wed Jan 29 03:33:19 2014
config:

        NAME        STATE     READ WRITE CKSUM
        grandcentral  ONLINE       0     0   241
          raidz1-0  ONLINE       0     0   482
            ada0    ONLINE       0     0   693
            ada1    ONLINE       0     0   410
            ada3    ONLINE       0     0   713
            ada2    ONLINE       0     0   403

errors: Permanent errors have been detected in the following files:

        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00002.MTS
        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00017.MTS
        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00030.MTS
        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00032.MTS
        /mnt/grandcentral/Old Holding Tank/Cutting Room Floor/New Camera pt 2/00034.MTS
<list truncated>

raulfg3
Site Admin
Posts: 4865
Joined: 22 Jun 2012 22:13
Location: Madrid (ESPAÑA)
Contact:
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by raulfg3 »

cchayre wrote:Would it simply be enough to do an rsync w/checksum to overwrite the files in question
Overwriting them with good copies is enough.
cchayre wrote:Does anyone have a recommendation for resolving this in a safe, reliable, long-term way?
Set up a cron job to scrub the ZFS pool; once a month, or once every two months, can be enough.



PS: Check the SMART values of your disks; sometimes corruption comes from a defective or worn-out disk.
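A scheduled scrub can be expressed as an ordinary crontab entry (in XigmaNAS this would normally be configured through the WebGUI cron page rather than edited by hand; the schedule and pool name below are just examples from this thread):

```shell
# m  h  dom mon dow  command
# scrub the pool at 02:00 on the 1st of every month
0    2  1   *   *    /sbin/zpool scrub grandcentral
```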
12.1.0.4 - Ingva (revision 7743) on SUPERMICRO X8SIL-F 8GB of ECC RAM, 11x3TB disk in 1 vdev = Vpool = 32TB Raw size , so 29TB usable size (I Have other NAS as Backup)

Wiki
Last changes

HP T510

cchayre
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by cchayre »

Thanks for the reply and feedback! It is definitely my intent to get the new build on a scheduled scrub.

I wasn't sure whether the errors were something with the files themselves or something underlying (structural to the pool itself). I will run an rsync with checksums to overwrite them (hopefully). I tried manually checking the md5 hash against known good copies, and md5 threw an I/O error while calculating.

After I overwrite the files and whatnot, will the zpool status change and/or look more promising, or will I need to do a zpool clear?

I will recheck the SMART data for the drives, but I did glance over them and everything looked to be in working order. The drives are WD REDs and only have ~40 hours of uptime on them. My guess is that the errors either stem from the previous NAS system or a bad rsync or two. I did a lot of starting/stopping the rsync when I initially replicated data to these drives.
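Verifying restored files against known-good copies by hash is straightforward to script; the sketch below demonstrates the comparison on temporary files (in a real run the two paths would be the pool copy and the backup copy). Note this uses GNU `md5sum`; on FreeBSD-based systems like XigmaNAS the equivalent is `md5 -q`. The I/O error you saw from md5 is itself a red flag: it means the disk could not even read the blocks.

```shell
# Compare md5 hashes of a restored file and a known-good copy.
# Demonstrated on temp files; real paths would be pool vs. backup.
tmp=$(mktemp -d)
printf 'frame data' > "$tmp/known_good.MTS"
printf 'frame data' > "$tmp/restored.MTS"
# md5sum prints "<hash>  <path>"; keep only the hash field for comparison
good=$(md5sum "$tmp/known_good.MTS" | cut -d' ' -f1)
rest=$(md5sum "$tmp/restored.MTS"   | cut -d' ' -f1)
if [ "$good" = "$rest" ]; then
    verdict="match"
else
    verdict="MISMATCH"
fi
echo "$verdict"
rm -rf "$tmp"
```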

raulfg3
Site Admin
Posts: 4865
Joined: 22 Jun 2012 22:13
Location: Madrid (ESPAÑA)
Contact:
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by raulfg3 »

cchayre wrote:After I overwrite the files and whatnot, will the zpool status change and/or look more promising, or will I need to do a zpool clear?
You need to run a scrub first to make sure there are no more bad files, then a zpool clear to clear the last ZFS error message.
cchayre wrote: I did a lot of starting/stopping the rsync when I initially replicated data to these drives.
The files were probably damaged at that point.
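The scrub-then-clear sequence on the real pool would be (not runnable here, since it needs the actual hardware; pool name from this thread):

```shell
zpool scrub grandcentral        # re-check every block after the overwrites
zpool status -v grandcentral    # confirm the permanent-error list is empty
zpool clear grandcentral        # then reset the error counters and message
```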

cchayre
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by cchayre »

I know this is somewhat out of context with the original question, but do you have any good reading material on why stopping/restarting rsync transfers could potentially cause corruption? My presumption (or hope) is that the size/time comparison would be enough in this situation, though I understand it's not quite as comprehensive as cross-checking checksums.

substr
experienced User
Posts: 113
Joined: 04 Aug 2013 20:21
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by substr »

You need to immediately stop using the new system. Like NOW.

These checksum errors did not come from rsync. Your new system has a major defect.

My reason is thus: If the damage was on the original pool, rsync would have failed with read errors. Since it completed, the original pool read fine and was not corrupt. The writes to the new pool were corrupt, which means bad hardware.

cchayre
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by cchayre »

substr wrote:You need to immediately stop using the new system. Like NOW.

These checksum errors did not come from rsync. Your new system has a major defect.

My reason is thus: If the damage was on the original pool, rsync would have failed with read errors. Since it completed, the original pool read fine and was not corrupt. The writes to the new pool were corrupt, which means bad hardware.
What makes you think that? No data was written to the drives after import. I merely imported the zpool and ran a scrub. The data was already there. Likewise, I am skeptical because I previously had another zpool running on this very hardware that has thus far been rock-solid (knock-on-wood).

substr
experienced User
Posts: 113
Joined: 04 Aug 2013 20:21
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by substr »

Ah. Whatever system wrote to that pool is defective. So if the old system wrote the pool, stop using the old hardware. If the old pool on old hardware doesn't seem to be showing errors (WARNING: do not run a scrub on suspected bad hardware, as that can increase corruption), then the problem could be in the controller/cables/ports used for the new pool during the transfer. Otherwise, bad mobo/CPU/RAM.

cchayre
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by cchayre »

Is it plausible, though, that my data corruption was more a result of a bad rsync (or several) to begin with?

cchayre
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by cchayre »

I definitely think the issue must be something with the old NAS. I moved the pool back to the old system and tried replicating good data to the drives (rsync archive w/checksum)---I'm seeing an occasional checksum error increment on the pool/raidz. I guess I won't be using the old NAS anymore. It's a shame too...I was hoping to leverage it as an off-site backup. Unless, of course, it'd be suitable for a different NAS system? E.g., perhaps a Linux-based OS w/software RAID.
        NAME          STATE     READ WRITE CKSUM
        grandcentral  ONLINE       0     0   355
          raidz1-0    ONLINE       0     0   710
            ada1      ONLINE       0     0     0
            ada0      ONLINE       0     0     0
            ada3      ONLINE       0     0     0
            ada2      ONLINE       0     0     0

substr
experienced User
Posts: 113
Joined: 04 Aug 2013 20:21
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by substr »

No, rsync can't cause ZFS checksum errors. If rsync wrote bad data, ZFS wouldn't know. Only your md5 comparisons would show it. When ZFS has rampant checksum errors like you saw, it means something is going wrong in the hardware, and that ZFS and/or the hardware are writing corrupt blocks/checksums/parity to the drives.

When you have hardware that causes ZFS problems, it should definitely not be used for ZFS. Depending on the exact nature of the problem (which is unknown right now), that same hardware might operate acceptably with different software. Or, it might do the same thing, and you just wouldn't notice until it was too late.

There is also one other possibility to consider: a marginal power supply that only causes problems when you have it loaded up with drives. In that case, the system might be safe most of the time. If the long-time configuration has never given you ZFS problems before, then it is very likely it will continue to give no problems as long as you leave it in that configuration. I would at the least run something like memtest.org on both systems. It is built in to Ubuntu ISO, if you happen to have one, and can be selected from the boot menu. I believe Windows 7 and possibly 8 DVDs also have a memory test, possibly not as thorough.

cchayre
Status: Offline

Re: Fix data corruption issue via file backups or another method

Post by cchayre »

substr wrote:No, rsync can't cause ZFS checksum errors. If rsync wrote bad data, ZFS wouldn't know. Only your md5 comparisons would show it. When ZFS has rampant checksum errors like you saw, it means something is going wrong in the hardware, and that ZFS and/or the hardware are writing corrupt blocks/checksums/parity to the drives.

When you have hardware that causes ZFS problems, it should definitely not be used for ZFS. Depending on the exact nature of the problem (which is unknown right now), that same hardware might operate acceptably with different software. Or, it might do the same thing, and you just wouldn't notice until it was too late.

There is also one other possibility to consider: a marginal power supply that only causes problems when you have it loaded up with drives. In that case, the system might be safe most of the time. If the long-time configuration has never given you ZFS problems before, then it is very likely it will continue to give no problems as long as you leave it in that configuration. I would at the least run something like memtest.org on both systems. It is built in to Ubuntu ISO, if you happen to have one, and can be selected from the boot menu. I believe Windows 7 and possibly 8 DVDs also have a memory test, possibly not as thorough.
Thanks for the feedback! I'm already thinking of different hardware (to replace the old NAS, from which I pulled the HDs in the first place). Your mention of a marginal power supply may actually be something I need to look into. The case I was using was a NAS-specific case with integrated PS (only 120 W). Without knowing or looking into it too much, that seems a bit small for 4x 3TB WD RED drives.

I will also run a memtest to cross-verify.
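For a quick sanity check on the PSU theory, a back-of-the-envelope spin-up budget is easy to compute. The per-drive figure below is an assumption, not a measured value (roughly 21 W peak spin-up for a 3.5" WD Red, about 1.75 A on the 12 V rail), as is the board estimate; check the actual drive datasheet:

```shell
# Rough peak power budget for the 120 W integrated PSU.
# Per-drive spin-up wattage and board draw are assumed figures.
drives=4
spinup_w=21          # assumed peak W per drive during spin-up
board_w=35           # assumed board + CPU + fans
total=$(( drives * spinup_w + board_w ))
echo "estimated peak draw: ${total} W"
```

That lands right at the 120 W rating with no headroom, so a marginal PSU sagging during simultaneous spin-up is at least plausible; staggered spin-up (if the drives and controller support it) would ease the peak.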
