Firstly, this is straight FreeBSD 9.2, not NAS4Free 9.2. The hardware is a Xeon with ECC RAM. The pool is a single mirror vdev plus a single SSD with an L2ARC partition. All three devices are layered over GELI, with the two mirror halves set to 4096-byte sectors in GELI to match the AF 4K SATA discs. One half of the mirror is on an Intel AHCI controller, the other half is on a 3Ware controller.
I had a kernel panic caused by (I think, since I repeated it) a buggy userspace FUSE module, and when the system came back up, I had one file with permanent errors. What makes it mysterious is that this file had been created earlier that day, and had last been accessed an hour earlier as well and was about 33GB. No read, write, or checksum errors were shown on the pool. The file itself seemed to be completely fine, with no errors on a full read. Was this file metadata damaged by the kernel panic? If so, how? Was this file damaged earlier in the day? If so, how? Where and how did ZFS determine this file was damaged?
I haven't destroyed the pool (and probably won't, although it is backed up), but I tried removing the file, rolling back to an earlier snapshot of the dataset, and then destroying the dataset, yet zpool status -v pool still shows the remnant hexadecimal of this permanently damaged file.
My concern is that there is an underlying cause that I need to find. The only explanation I can invent is that somehow, prior to the kernel panic, ZFS data structures in RAM were damaged and made it to disk, and then the kernel panicked. But this really is not very convincing, as there was really no write load at the time, and especially not to that file or even the dataset it was in.
We've seen others report pool metadata corruption that prevents import, which usually comes down to lack of ECC or other such things. Could there be something else happening? A kernel panic, by itself, shouldn't be able to cause pool corruption.
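For anyone trying to read the leftover hexadecimal entry: when the file or dataset an error record points to no longer exists, zpool status -v cannot resolve a path and prints the dataset and object number instead, as dataset:<0x...> or <0x...>:<0x...>. Below is a minimal Python sketch that sorts the listed entries into those cases. It assumes that output layout; the sample text, the pool and file names, and the function name parse_permanent_errors are invented for illustration and are not taken from the affected system.

```python
import re

# Hypothetical, trimmed sample of `zpool status -v` output.
# Pool, dataset, and file names here are invented for illustration.
SAMPLE = """\
  pool: pool
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        /pool/data/bigfile.img
        pool/data:<0x1b>
        <0x45>:<0x1b>
"""

# Matches "<dataset>:<0xNN>", where <dataset> is a name or itself "<0xNN>".
HEX_REF = re.compile(r"^(?P<dataset>.+):<0x(?P<obj>[0-9a-fA-F]+)>$")


def parse_permanent_errors(text):
    """Return (kind, dataset, value) tuples for each listed error entry."""
    entries = []
    in_list = False
    for line in text.splitlines():
        if line.startswith("errors: Permanent errors"):
            in_list = True
            continue
        item = line.strip()
        if not in_list or not item:
            continue
        match = HEX_REF.match(item)
        if match is None:
            # Still resolvable: ZFS printed an ordinary path.
            entries.append(("path", None, item))
            continue
        dataset = match.group("dataset")
        obj = int(match.group("obj"), 16)
        # "<0x..>" in the dataset slot means the dataset itself is gone.
        kind = "orphan" if dataset.startswith("<0x") else "object"
        entries.append((kind, dataset, obj))
    return entries


if __name__ == "__main__":
    for kind, dataset, value in parse_permanent_errors(SAMPLE):
        print(kind, dataset, value)
```

An "orphan" entry of this kind is what remains after the dataset has been destroyed. It is a stale record in the pool's error log, not necessarily a sign of new damage.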
This is the old XigmaNAS forum in read only mode,
it will be taken offline by the end of March 2021!
I would like to ask users and admins to rewrite/take over important posts from here into the new main forum!
It's not possible for us to export from here and import into the main forum!
an odd corruption
-
kenZ71
- Advanced User

- Posts: 379
- Joined: 27 Jun 2012 20:18
- Location: Northeast, USA
- Status: Offline
Re: an odd corruption
Maybe memory error? run a memtest?
11.2-RELEASE-p3 | ZFS Mirror - 2 x 8TB WD Red | 28GB ECC Ram
HP ML10v2 x64-embedded on Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
Extra memory so I can host a couple VMs
1) Unifi Controller on Ubuntu
2) Librenms on Ubuntu
-
substr
- experienced User

- Posts: 113
- Joined: 04 Aug 2013 20:21
- Status: Offline
Re: an odd corruption
It has ECC and has been rock solid, but I'll run a memtest86 on it and report back.
I'm wondering if the kernel panic itself should NOT have been possible from a FUSE userspace module (in other words, is this a bug in the FUSE kernel module?). I've also gotten panics from NTFS-3G, but I haven't updated it, I just avoid it.
update: memory tests fine.
- crowi
- Forum Moderator

- Posts: 1176
- Joined: 21 Feb 2013 16:18
- Location: Munich, Germany
- Status: Offline
Re: an odd corruption
Quote: What makes it mysterious is that this file had been created earlier that day, and had last been accessed an hour earlier as well and was about 33GB. No read, write, or checksum errors were shown on the pool.
Hard to judge the whole situation; maybe the file was already corrupted on the client side or during transfer.
Which type of file is it? Which clients do you have to access it?
NAS 1: Milchkuh: Asrock C2550D4I, Intel Avoton C2550 Quad-Core, 16GB DDR3 ECC, 5x3TB WD Red RaidZ1 +60 GB SSD for ZIL/L2ARC, APC-Back UPS 350 CS, NAS4Free 11.0.0.4.3460 embedded
NAS 2: Backup: HP N54L, 8 GB ECC RAM, 4x4 TB WD Red, RaidZ1, NAS4Free 11.0.0.4.3460 embedded
NAS 3: Office: HP N54L, 8 GB ECC RAM, 2x3 TB WD Red, ZFS Mirror, APC-Back UPS 350 CS NAS4Free 11.0.0.4.3460 embedded
-
substr
- experienced User

- Posts: 113
- Joined: 04 Aug 2013 20:21
- Status: Offline
Re: an odd corruption
Quote: Hard to judge the whole situation, maybe the file was already corrupted on client side or during transfer. Which type of file is it? Which clients do you have to access it?
No, that is what is odd. The file was fine. It had been created by a FreeBSD NFS client earlier in the day. Scrub found zero errors, but did clear the residual -v permanent error for the now-deleted file/dataset.
I think I will try reproducing the panic with only a sacrificial pool, and see if I can get a coredump that helps.
update: hmm.. I got a screen cap of the panic, but it doesn't produce a coredump. Seems to be hard locking, although it auto-rebooted during the crash that corrupted ZFS.
-
substr
- experienced User

- Posts: 113
- Joined: 04 Aug 2013 20:21
- Status: Offline
Re: an odd corruption
The panic was a page fault, so it could have a small chance of corruption anywhere in the kernel on any given crash.
Lesson: Don't have kernel panics while ZFS pools are active.
-
hellokevin11
- Starter

- Posts: 45
- Joined: 04 Apr 2014 04:16
- Status: Offline
Re: an odd corruption
How old is the mainboard and psu? Perhaps you are getting voltage sag.
-
substr
- experienced User

- Posts: 113
- Joined: 04 Aug 2013 20:21
- Status: Offline
Re: an odd corruption
about 3 yrs, dual. I'll keep it in mind if I start seeing problems unrelated to FUSE.
-
hellokevin11
- Starter

- Posts: 45
- Joined: 04 Apr 2014 04:16
- Status: Offline
Re: an odd corruption
I would run prime95 full test, run for 24H and see if you have any failures.
24H full cpu load should make anything bad surface imho
Possibly due to cosmic ray impact you had double or triple flipped bit that was beyond ECC capability to recover.
Please post updates as you progress, and hopefully someone more knowledgeable will comment.
Perhaps post the failure info and pic on the freebsd forums as they have a lot of experts there.