an odd corruption

Post by substr »

First, this is straight FreeBSD 9.2, not NAS4Free 9.2. The hardware is a Xeon with ECC RAM. The pool is a single mirror vdev plus a single SSD with an L2ARC partition. All three are layered over GELI, with the two mirror halves set to 4096-byte sectors in GELI to match the AF 4K SATA discs. One half of the mirror is on an Intel AHCI controller, the other half is on a 3Ware controller.

I had a kernel panic caused by (I think, since I was able to repeat it) a buggy userspace FUSE module, and when the system came back up, I had one file with permanent errors. What makes it mysterious is that this file, about 33GB, had been created earlier that day and had last been accessed an hour earlier. No read, write, or checksum errors were shown on the pool. The file itself seemed to be completely fine, with no errors on a full read. Was this file's metadata damaged by the kernel panic? If so, how? Was this file damaged earlier in the day? If so, how? Where and how did ZFS determine this file was damaged?

I haven't destroyed the pool (and probably won't, although it is backed up), but I tried removing the file, rolling back to an earlier snapshot of the dataset, and then destroying the dataset, yet zpool status -v pool still shows the hexadecimal remnant of this permanently damaged file.
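
For context on the hexadecimal remnant: when zpool status -v can no longer resolve an entry to a path, it falls back to printing object numbers in angle brackets, and such entries typically linger until a later scrub refreshes the list. A minimal Python sketch for telling the entry forms apart follows. The helper name and the returned field names are hypothetical, chosen for illustration only, and the sketch does not cover every form the tool can print.

```python
import re

# Entry forms handled here:
#   /path/to/file       path still resolves
#   dataset:<0xOBJ>     only the dataset name resolves
#   <0xDS>:<0xOBJ>      neither resolves (e.g. file or dataset destroyed)
#   <metadata>:<0xOBJ>  object in the pool's own metadata
_HEX = r"<0x([0-9a-fA-F]+)>"
_BOTH_HEX = re.compile(rf"^{_HEX}:{_HEX}$")
_META = re.compile(rf"^<metadata>:{_HEX}$")
_OBJ_HEX = re.compile(rf"^(?P<dataset>[^<>]+):{_HEX}$")


def classify_error_entry(entry: str) -> dict:
    """Classify one line of the permanent-error list (hypothetical helper)."""
    entry = entry.strip()
    m = _BOTH_HEX.match(entry)
    if m:
        return {"kind": "unresolved",
                "dataset_id": int(m.group(1), 16),
                "object_id": int(m.group(2), 16)}
    m = _META.match(entry)
    if m:
        return {"kind": "pool_metadata", "object_id": int(m.group(1), 16)}
    m = _OBJ_HEX.match(entry)
    if m:
        # Group 1 is the named dataset group, group 2 is the hex object number.
        return {"kind": "object_only",
                "dataset": m.group("dataset"),
                "object_id": int(m.group(2), 16)}
    # Anything else is treated as a plain path.
    return {"kind": "path", "path": entry}
```

An entry of the "unresolved" kind only says that the dataset and object numbers no longer map to a name; it does not by itself say whether the damage was in file data or in metadata.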

My concern is that there is an underlying cause that I need to find. The only explanation I can come up with is that somehow, prior to the kernel panic, ZFS data structures in RAM were damaged and made it to disk, and then the kernel panicked. This is not very convincing, though, as there was really no write load at the time, and especially not to that file or even the dataset it was in.

We've seen others report pool metadata corruption that prevents import, which usually comes down to lack of ECC or other such things. Could there be something else happening? A kernel panic, by itself, shouldn't be able to cause pool corruption.
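
On the question of where ZFS decides a block is damaged: the checksum of each block is stored in the block pointer that references it, and it is verified when the block is read; on a mirror, a copy that fails verification can be retried from the other side. Below is a simplified Python sketch of that idea. It assumes SHA-256 in place of whatever checksum the pool actually uses, all names are hypothetical, and it is not ZFS code.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class BlockPointer:
    """Hypothetical stand-in: the parent holds the child's checksum."""
    checksum: bytes


class PermanentError(Exception):
    """Raised when no mirror copy matches the expected checksum."""


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 for illustration; ZFS defaults to fletcher4.
    return hashlib.sha256(data).digest()


def read_block(bp: BlockPointer, copies: list[bytes]) -> tuple[bytes, list[int]]:
    """Return the first copy that verifies, plus indexes of copies that failed."""
    bad = []
    for i, data in enumerate(copies):
        if checksum(data) == bp.checksum:
            return data, bad
        bad.append(i)
    # Every copy disagreed with the checksum held by the parent.
    raise PermanentError(f"all {len(copies)} copies failed verification")
```

In this model, a permanent error means that no available copy matched the checksum held by the parent, which is a statement about the data and the pointer together rather than about any one disk.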

Post by kenZ71 »

Maybe a memory error? Run a memtest?

Post by substr »

It has ECC and has been rock solid, but I'll run a memtest86 on it and report back.

I'm wondering if the kernel panic itself should NOT have been possible from a FUSE userspace module (in other words, is this a bug in the FUSE kernel module?). I've also gotten panics from NTFS-3G, but I haven't updated it; I just avoid it.

Update: memory tests fine.

Post by crowi »

substr wrote: "What makes it mysterious is that this file had been created earlier that day, and had last been accessed an hour earlier as well and was about 33GB. No read, write, or checksum errors were shown on the pool."
Hard to judge the whole situation; maybe the file was already corrupted on the client side or during transfer.
Which type of file is it? Which clients do you use to access it?

Post by substr »

crowi wrote: "Hard to judge the whole situation, maybe the file was already corrupted on client side or during transfer. Which type of file is it? Which clients do you have to access it?"
No, that is what is odd. The file was fine. It had been created by a FreeBSD NFS client earlier in the day. A scrub found zero errors, but it did clear the residual permanent error that zpool status -v was showing for the now-deleted file/dataset.

I think I will try reproducing the panic with only a sacrificial pool, and see if I can get a coredump that helps.

Update: hmm... I got a screen cap of the panic, but it doesn't produce a coredump. It seems to be hard locking, although it auto-rebooted during the crash that corrupted ZFS.

Post by substr »

The panic was a page fault, so on any given crash there is a small chance that something anywhere in the kernel was corrupted.

Lesson: Don't have kernel panics while ZFS pools are active.

Post by hellokevin11 »

How old are the mainboard and PSU? Perhaps you are getting voltage sag.

Post by substr »

About 3 yrs, dual. I'll keep it in mind if I start seeing problems unrelated to FUSE.

Post by hellokevin11 »

I would run the full prime95 test for 24 hours and see if you get any failures.

24 hours of full CPU load should make anything bad surface, imho.

Possibly a cosmic ray impact gave you a double or triple bit flip that was beyond ECC's capability to recover.

Please post updates as you progress, and hopefully someone more knowledgeable will comment.

Perhaps post the failure info and pic on the FreeBSD forums, as they have a lot of experts there.
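
On multi-bit flips and what ECC can recover: ECC memory typically uses a SECDED code, which corrects a single flipped bit per word and detects (but cannot correct) two. Three flips can be mistaken for one and "corrected" into wrong data. The Python sketch below shows this with a scaled-down extended Hamming (8,4) code; real DIMMs typically protect 64 data bits with 8 check bits, so the 4-bit word size here is purely an assumption for illustration.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits sit at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 is an overall parity bit that makes the total parity even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list[int]) -> tuple:
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assumed to be exactly one, at index `syndrome`
        # (index 0 is the overall parity bit itself).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two flips detected.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

With this decoder, flipping one bit of encode(0b1011) decodes back to 0b1011 with status "corrected", flipping two bits gives "uncorrectable", and flipping positions 1, 2 and 4 together gives status "corrected" with the wrong value. So under this model a double flip would normally be noticed, while a triple flip is the case that could slip through silently.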
