an odd corruption

Posted: 27 Jun 2014 03:34
by substr
Firstly, this is using straight FreeBSD 9.2, not NAS4Free 9.2. The hardware is a Xeon with ECC RAM. The pool is a single mirror vdev plus a single SSD with an L2ARC partition. All three devices are layered over GELI, with the two mirror halves set to 4096-byte sectors in GELI to match the AF 4K SATA disks. One half of the mirror is on an Intel AHCI controller, the other half is on a 3Ware controller.

I had a kernel panic caused by (I think, since I repeated it) a buggy userspace FUSE module, and when the system came back up, I had one file with permanent errors. What makes it mysterious is that this file had been created earlier that day, and had last been accessed an hour earlier as well and was about 33GB. No read, write, or checksum errors were shown on the pool. The file itself seemed to be completely fine, with no errors on a full read. Was this file metadata damaged by the kernel panic? If so, how? Was this file damaged earlier in the day? If so, how? Where and how did ZFS determine this file was damaged?
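
For anyone who wants to repeat the "full read" check on a suspect file, here is a minimal sketch in Python. It assumes that an unreadable block surfaces to userspace as an I/O error on read (an OSError in Python); the function name full_read and the chunk size are made up for illustration and are not part of any ZFS tooling.

```python
import hashlib


def full_read(path, chunk_size=1 << 20):
    """Read every byte of `path` in chunks.

    Returns (bytes_read, sha256_hex, error). On success error is None;
    if the read fails partway, sha256_hex is None and error holds the
    OSError, with bytes_read showing how far the read got.
    """
    digest = hashlib.sha256()
    total = 0
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:  # empty read means end of file
                    break
                digest.update(chunk)
                total += len(chunk)
    except OSError as exc:
        # Assumption: a damaged, unrecoverable block shows up here.
        return total, None, exc
    return total, digest.hexdigest(), None
```

The returned SHA-256 can be compared against the same file in a backup. A clean pass only shows that the file's data could be read end to end; it does not necessarily exercise every piece of metadata, so it cannot by itself explain why zpool status -v lists the file.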

I haven't destroyed the pool (and probably won't, although it is backed up), but I tried removing the file, rolling back to an earlier snapshot of the dataset, and then destroying the dataset, yet zpool status -v pool still shows the hexadecimal remnant of this permanently damaged file.

My concern is that there is an underlying cause that I need to find. The only explanation I can invent is that, somehow, ZFS data structures in RAM were damaged before the kernel panic and made it to disk, and then the kernel panicked. This is not very convincing, though, as there was essentially no write load at the time, and certainly none to that file or even to the dataset it was in.

We've seen others report pool metadata corruption that prevents import, which usually comes down to a lack of ECC or other such things. Could there be something else happening? A kernel panic, by itself, shouldn't be able to cause pool corruption.

Re: an odd corruption

Posted: 27 Jun 2014 04:16
by kenZ71
Maybe a memory error? Run a memtest?

Re: an odd corruption

Posted: 27 Jun 2014 04:34
by substr
It has ECC and has been rock solid, but I'll run a memtest86 on it and report back.

I'm wondering whether the kernel panic itself should NOT have been possible from a userspace FUSE module (in other words, is this a bug in the FUSE kernel module?). I've also gotten panics from NTFS-3G, but I haven't updated it; I just avoid it.

Update: memory tests fine.

Re: an odd corruption

Posted: 27 Jun 2014 12:52
by crowi
substr wrote: "What makes it mysterious is that this file had been created earlier that day, and had last been accessed an hour earlier as well and was about 33GB. No read, write, or checksum errors were shown on the pool."
Hard to judge the whole situation; maybe the file was already corrupted on the client side or during transfer.
What type of file is it? Which clients do you use to access it?

Re: an odd corruption

Posted: 27 Jun 2014 18:13
by substr
crowi wrote: "Hard to judge the whole situation, maybe the file was already corrupted on client side or during transfer.
Which type of file is it? Which clients do you have to access it?"
No, that is what is odd. The file was fine. It had been created by a FreeBSD NFS client earlier in the day. A scrub found zero errors, but it did clear the residual permanent error that zpool status -v listed for the now-deleted file/dataset.

I think I will try reproducing the panic with only a sacrificial pool, and see if I can get a coredump that helps.

Update: hmm... I got a screen capture of the panic, but it doesn't produce a coredump. The machine seems to hard-lock, although it auto-rebooted during the crash that corrupted ZFS.

Re: an odd corruption

Posted: 28 Jun 2014 05:24
by substr
The panic was a page fault, so any given crash carries a small chance of having corrupted memory anywhere in the kernel.

Lesson: Don't have kernel panics while ZFS pools are active.

Re: an odd corruption

Posted: 01 Jul 2014 23:19
by hellokevin11
How old are the mainboard and PSU? Perhaps you are getting voltage sag.

Re: an odd corruption

Posted: 02 Jul 2014 01:31
by substr
About 3 years, dual. I'll keep it in mind if I start seeing problems unrelated to FUSE.

Re: an odd corruption

Posted: 04 Jul 2014 11:35
by hellokevin11
I would run the full Prime95 test for 24 hours and see if you get any failures.

A 24-hour full CPU load should make anything bad surface, IMHO.

Possibly, due to a cosmic ray impact, you had a double or triple bit flip that was beyond what ECC could correct.

Please post updates as you progress, and hopefully someone more knowledgeable will comment.

Perhaps post the failure info and picture on the FreeBSD forums, as they have a lot of experts there.