
zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 01 Jan 2016 19:06
by exparrot
Hi

I recently installed NAS4Free 10.2.0.2 - Prester (revision 2235). I set up 5x4TB drives in RAIDZ2 and transferred data from an existing FreeBSD 8 system over NFS. The copy completed without issues. While I was getting the last settings ready to migrate to the new system, I noticed that ZFS had detected corruption in one directory.

Code: Select all

# zpool status -v
  pool: pool0
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /pool0/media/music/flac/Assemblage 23 - 2004 - Ground
So I deleted the folder and copied it from the source again, but the status did not change, so I decided to delete it once more and run a scrub. Before running the scrub, the status changed to:

Code: Select all

errors: Permanent errors have been detected in the following files:

        pool0:<0x2da8a>
The scrub ran for about 30 min, then the system rebooted and went into a boot loop. Unfortunately I could not capture the exact point at which it would reboot.

I then did a fresh install of NAS4Free and tried to import the pool, but each time I did this the system would reboot. Out of interest I tried the import from a FreeBSD 10.2 live CD, and it also reboots, so the behaviour is the same.

zpool import reports pool0 as available:

Code: Select all

# zpool import
   pool: pool0
     id: 17274685908530395963
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

        pool0       ONLINE
          raidz2-0  ONLINE
            ada0    ONLINE
            ada1    ONLINE
            ada2    ONLINE
            ada3    ONLINE
            ada4    ONLINE
I enabled persistent logging and found the following in the system.log:

Code: Select all

Jan  1 16:21:28 nas4free syslogd: kernel boot file is /boot/kernel/kernel
Jan  1 16:21:28 nas4free kernel: Solaris: WARNING: blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384
Jan  1 16:21:28 nas4free kernel:
Jan  1 16:21:28 nas4free kernel:
Jan  1 16:21:28 nas4free kernel: Fatal trap 12: page fault while in kernel mode
Jan  1 16:21:28 nas4free kernel: cpuid = 1; apic id = 01
Jan  1 16:21:28 nas4free kernel: fault virtual address  = 0x50
Jan  1 16:21:28 nas4free kernel: fault code             = supervisor read data, page not present
Jan  1 16:21:28 nas4free kernel: instruction pointer    = 0x20:0xffffffff81e79f94
Jan  1 16:21:28 nas4free kernel: stack pointer          = 0x28:0xfffffe0169ef5740
Jan  1 16:21:28 nas4free kernel: frame pointer          = 0x28:0xfffffe0169ef5750
Jan  1 16:21:28 nas4free kernel: code segment           = base 0x0, limit 0xfffff, type 0x1b
Jan  1 16:21:28 nas4free kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Jan  1 16:21:28 nas4free kernel: processor eflags       = interrupt enabled, resume, IOPL = 0
Jan  1 16:21:28 nas4free kernel: current process                = 6 (txg_thread_enter)
Jan  1 16:21:28 nas4free kernel: trap number            = 12
Jan  1 16:21:28 nas4free kernel: panic: page fault
Jan  1 16:21:28 nas4free kernel: cpuid = 1
Jan  1 16:21:28 nas4free kernel: KDB: stack backtrace:
Jan  1 16:21:28 nas4free kernel: #0 0xffffffff80a86a70 at kdb_backtrace+0x60
Jan  1 16:21:28 nas4free kernel: #1 0xffffffff80a4a1d6 at vpanic+0x126
Jan  1 16:21:28 nas4free kernel: #2 0xffffffff80a4a0a3 at panic+0x43
Jan  1 16:21:28 nas4free kernel: #3 0xffffffff80ecaedb at trap_fatal+0x36b
Jan  1 16:21:28 nas4free kernel: #4 0xffffffff80ecb1dd at trap_pfault+0x2ed
Jan  1 16:21:28 nas4free kernel: #5 0xffffffff80eca87a at trap+0x47a
Jan  1 16:21:28 nas4free kernel: #6 0xffffffff80eb0c72 at calltrap+0x8
Jan  1 16:21:28 nas4free kernel: #7 0xffffffff81e8071f at vdev_mirror_child_select+0x6f
Jan  1 16:21:28 nas4free kernel: #8 0xffffffff81e802d0 at vdev_mirror_io_start+0x270
Jan  1 16:21:28 nas4free kernel: #9 0xffffffff81e9cd86 at zio_vdev_io_start+0x1d6
Jan  1 16:21:28 nas4free kernel: #10 0xffffffff81e998b2 at zio_execute+0x162
Jan  1 16:21:28 nas4free kernel: #11 0xffffffff81e991b9 at zio_nowait+0x49
Jan  1 16:21:28 nas4free kernel: #12 0xffffffff81e1c91e at arc_read+0x8fe
Jan  1 16:21:28 nas4free kernel: #13 0xffffffff81e577b2 at dsl_scan_prefetch+0xc2
Jan  1 16:21:28 nas4free kernel: #14 0xffffffff81e574a3 at dsl_scan_visitbp+0x583
Jan  1 16:21:28 nas4free kernel: #15 0xffffffff81e5722f at dsl_scan_visitbp+0x30f
Jan  1 16:21:28 nas4free kernel: #16 0xffffffff81e5722f at dsl_scan_visitbp+0x30f
Jan  1 16:21:28 nas4free kernel: Copyright (c) 1992-2015 The FreeBSD Project.
I found some steps online to import the pool read-only, which worked.

Code: Select all

zpool import -F -f -o readonly=on -R /pool0 pool0
zpool status
  pool: pool0
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Dec 30 13:34:03 2015
        1.06T scanned out of 8.53T at 1/s, (scan is slow, no estimated time)
        0 repaired, 12.45% done
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list
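Since the read-only import holds, the data should be reachable under the altroot. A minimal sketch of pulling it off at this point (the altroot /pool0 comes from the import command above; "backuphost" and the destination path are placeholders I'm assuming, not from the thread):

```shell
# Pool was imported with: zpool import -F -f -o readonly=on -R /pool0 pool0
# so the dataset's mountpoint sits below the /pool0 altroot.
# Copy the data to another machine before doing anything destructive:
rsync -aHv /pool0/pool0/ backuphost:/backup/pool0/
```

Nothing here writes to the damaged pool, so it should be safe to try before any repair attempts.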
I then tried to run a check with zdb as per http://sigtar.com/2009/10/19/opensolari ... nel-panic/, but it runs for a while and then segfaults.

Code: Select all

 zdb -e -bcsvL pool0

Traversing all blocks to verify checksums ...

22.1G completed (  59MB/s) estimated time remaining: 41hr 19min 07sec        Segmentation fault
From the system.log:

Code: Select all

nas4free kernel: pid 2264 (zdb), uid 0: exited on signal 11
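For what it's worth, zdb has flags to press on past internal failures; a variant like the following (flags per the zdb man page; whether it survives this particular corruption is not guaranteed) might get further than the plain run above:

```shell
# -e  operate on an exported pool
# -A/-AA/-AAA  progressively relax assertion handling so zdb
#              tries to continue instead of aborting
# -b  traverse and count block pointers
# -s  report I/O statistics, -v verbose
# -L  skip leak detection, which avoids loading spacemaps
zdb -e -AAA -bsvL pool0
```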
I could easily destroy the pool and start over, since I still have the source system, but it looks like there may be a bug in how zpool import handles whatever issue my pool has, and of course I'm curious what the problem is and how to fix it without starting over.

I did test all the drives with SeaTools before using them and they all passed. I am currently waiting for the smartctl long tests to complete on the drives to see if there are any other issues.
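For reference, the long tests can be kicked off and checked roughly like this (device names are the ada0-ada4 from the pool config; adjust to taste):

```shell
# Start an extended self-test on each pool disk; it runs inside the
# drive firmware in the background, and a 4TB disk takes several hours.
for d in ada0 ada1 ada2 ada3 ada4; do
    smartctl -t long /dev/$d
done

# Later, check the self-test log and the health attributes:
smartctl -l selftest /dev/ada0
smartctl -A /dev/ada0
```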

If anyone has any suggestions or wants further info please let me know.

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 02 Jan 2016 17:11
by b0ssman
First, what hardware are you using?
Does the system pass a memtest?
Post the SMART values of all your drives.

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 03 Jan 2016 13:36
by exparrot
Hardware: HPMS N40L, custom BIOS 041. 2 x 4GB non-ECC RAM (need to check model). 5 x 4TB Seagate NAS ST4000VN000.
Memtest: Will run that next, now that the smartctl long test is done.
Smartctl output: http://pastebin.com/7wj6fz5V

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 03 Jan 2016 13:53
by b0ssman
SMART values look fine.
To get ZFS corruption like yours, either the machine crashed and did not shut down properly, or you had memory corruption.

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 03 Jan 2016 14:02
by exparrot
OK thanks.

Will see what memtest says. Are the default tests fine, and how long should I run it for?

What are your thoughts on the import causing a panic and reboot? Should I create a bug report for NAS4Free or FreeBSD?

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 03 Jan 2016 15:01
by b0ssman
No, this seems like hardware failure.

Run memtest for 24h.

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 05 Jan 2016 09:20
by exparrot
I ran the memtest for 29 hrs. It picked up some errors during test 10 on the 1st pass, but nothing further in the subsequent passes.

[Image: memtest results screenshot]

I then ran just test 10 for 2 passes and there were no errors, so I'm not sure if I have intermittently faulty RAM or if it was a memtest issue. I would expect faulty RAM to fail consistently. Either way, I am contemplating replacing the RAM with ECC RAM.

So back to the panic: I understand the faulty RAM caused the corruption in the pool, but if I replace the RAM and the import still causes a reboot, should zpool not be able to catch the problem and avoid the panic?

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 05 Jan 2016 09:37
by noclaf
Errors have probability. Even a broken clock is right twice a day. Throw the memory out.

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 05 Jan 2016 10:12
by b0ssman
The pool is now damaged because of the faulty RAM.
There is no file-system repair tool for ZFS because it's enterprise software, and the enterprise motto is: if it's broken, restore from backup.
Copy the data off the pool and rebuild the array.
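A minimal sketch of that rescue path, assuming the read-only import shown earlier in the thread and a second machine reachable as "backuphost" (a placeholder, along with the paths):

```shell
# 1. Import read-only under an altroot so nothing writes to the pool:
zpool import -F -f -o readonly=on -R /mnt/rescue pool0

# 2. Copy everything off with rsync (destination is a placeholder):
rsync -aHv /mnt/rescue/pool0/ backuphost:/backup/pool0/

# 3. Only after verifying the copy: export, recreate the pool from
#    scratch (-f overwrites the old labels), and restore the data.
zpool export pool0
zpool create -f pool0 raidz2 ada0 ada1 ada2 ada3 ada4
```

Step 3 is irreversible, so it's worth spot-checking file checksums against the source system before running it.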

Re: zpool import causes system to reboot "blkptr at 0xfffffe0003a5fa40 DVA 1 has invalid VDEV 16384"

Posted: 05 Jan 2016 10:19
by exparrot
Fair enough. I will replace the RAM and start again.

Thanks for all the assistance :)