
Huh - pool in ONLINE state with one disk missing?

Posted: 05 Oct 2016 23:33
by naser
The current status of my pool (below) puzzles me. How can it be ONLINE with one disk missing (UNAVAIL)? Shouldn't it be DEGRADED? This happened after I accidentally knocked two disks offline while trying to rearrange SATA cables in a powered-up NAS. Yes, it was stupid. One disk came back online, the other has not; the pool scrubbed itself and, after a couple of reboots, came back in this state. Regardless of why it happened, though, this state (ONLINE without ALL disks present) just doesn't make sense. Anyone?

Code:

pool: z2
 state: ONLINE
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 11h46m with 0 errors on Wed Oct  5 07:14:09 2016
config:

	NAME                     STATE     READ WRITE CKSUM
	z2                       ONLINE       0     0     0
	  raidz2-0               ONLINE       0     0     0
	    ada1                 ONLINE       0     0     0
	    ada2                 ONLINE       0     0     0
	    ada3                 ONLINE       0     0     0
	    ada4                 ONLINE       0     0     0
	    2103505231344362322  UNAVAIL      0     0     0  was /dev/ada5
	    ada0                 ONLINE       0     0     0

errors: No known data errors

Re: Huh - pool in ONLINE state with one disk missing?

Posted: 06 Oct 2016 06:21
by apollo567
Hello,

The more important question:
Do you have a backup of the data, or can you at least access the data to make a backup now?

Regards
apollo

Re: Huh - pool in ONLINE state with one disk missing?

Posted: 06 Oct 2016 14:06
by naser
Yes and yes. The MOST important question, though: can I trust ZFS if it can show something that is not supposed to exist (a supposedly healthy pool (ONLINE) while a disk is missing)?

I was planning to just resilver ada5; however, when a supposedly mature tool shows a combination that is supposedly impossible, it really ruins your trust.

Any other suggestions? As I mentioned, I've tried rebooting the box (10.2.0.2.2235), hoping that a full restart of the ZFS subsystem and a fresh import of the pool might produce a meaningful status, but it didn't help.

Added: the NAS runs from a USB drive with the config stored on another USB drive, which gives me a bit more hope that if any corruption of the system itself happened, it got fixed when the box rebooted. Regardless, I'd think that if any ZFS components (other than the disks themselves) got corrupted in memory, it just wouldn't work at all. ZFS getting tricked into miscalculating the pool health status while otherwise working sounds like quite an improbable glitch.

Re: Huh - pool in ONLINE state with one disk missing?

Posted: 07 Oct 2016 00:16
by substr
It must be an uncommon error state. A label missing or invalid? That sounds new to me. And maybe the response is to not degrade the pool for some reason, but I would do a replace on the disk to get it truly back to normal.

Re: Huh - pool in ONLINE state with one disk missing?

Posted: 07 Oct 2016 03:53
by naser
After a few hours of googling, I found that this is apparently a known bug that has already been reported (at least to the ZFS on Linux project, in May 2016).
Here: https://github.com/zfsonlinux/zfs/issues/4653
I wasn't able to find a similar bug report for the BSD ZFS.

TL;DR: When a device is UNAVAIL because of physical problems, the pool state is shown as DEGRADED; however, when the device is physically fine but its ZFS metadata (such as the disk label) is corrupt or missing, the pool state shows ONLINE even though the device is UNAVAIL.
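
Since the pool-level "state:" line cannot be relied on in this situation, one workaround is to look at the per-device STATE column instead. Below is a minimal sketch, assuming the `zpool status` layout shown in the first post (a NAME/STATE/READ/WRITE/CKSUM table terminated by a blank line). `list_unhealthy_vdevs` is a made-up helper name, not part of ZFS, and it ignores rows without error counters (such as spares).

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints the
# name and state of every config row whose STATE column is not ONLINE,
# independent of what the pool-level "state:" line claims.
list_unhealthy_vdevs() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The first blank line after the header ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Rows with error counters have at least five fields.
        in_config && NF >= 5 && $2 != "ONLINE" { print $1, $2 }
    '
}
```

Usage would be something like `zpool status z2 | list_unhealthy_vdevs`; empty output means every listed device reports ONLINE.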

This was exactly my case: data got corrupted while I was messing with the SATA cables, but the device and the connection are now physically fine.

Will report back after doing zpool replace.

Re: Huh - pool in ONLINE state with one disk missing?

Posted: 08 Oct 2016 13:59
by naser
The pool is back and healthy. It wasn't exactly straightforward, though:

An attempt at a straight replace didn't work, as ada5 was still considered part of the pool. The override didn't work either (not sure why the error message from the first command implies it would):

Code:

# zpool replace z2 2103505231344362322 /dev/ada5
invalid vdev specification
use '-f' to override the following errors:
/dev/ada5 is part of active pool 'z2'
# zpool replace -f z2 2103505231344362322 /dev/ada5
invalid vdev specification
the following errors must be manually repaired:
/dev/ada5 is part of active pool 'z2'

I had to export the pool, then try to clear any remaining labels on the affected disk.

Code:

# zpool export z2
# zpool labelclear /dev/ada5
labelclear operation failed.
Vdev /dev/ada5 is a member of the exported pool "z2".
Use "zpool labelclear -f /dev/ada5" to force the removal of label information.

Damn, so the labels are corrupted just enough for the disk to have fallen out of the pool, while still being recognised as part of the pool :)
The override worked here, though:

Code:

# zpool labelclear -f /dev/ada5
# zpool import z2
# zpool replace z2 2103505231344362322 /dev/ada5
# 

Thanks everyone for the moral support during this.
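
For anyone landing here with the same problem, the sequence that worked above can be condensed into a small dry-run helper. This is only a sketch: `print_relabel_steps` is a hypothetical name, and it prints the commands for review rather than running them, because `zpool labelclear -f` wipes the ZFS labels on whatever device it is given. Check that the device name still points at the right disk (names like ada5 can change between boots) before running anything.

```shell
# Hypothetical helper: prints (does NOT execute) the four commands used in
# this thread. Arguments: pool name, GUID of the UNAVAIL vdev, device path.
print_relabel_steps() {
    pool=$1
    guid=$2
    dev=$3
    printf 'zpool export %s\n' "$pool"              # take the pool offline first
    printf 'zpool labelclear -f %s\n' "$dev"        # destructive: wipes stale labels
    printf 'zpool import %s\n' "$pool"              # bring the pool back
    printf 'zpool replace %s %s %s\n' "$pool" "$guid" "$dev"   # resilver onto the disk
}
```

For the pool in this thread, `print_relabel_steps z2 2103505231344362322 /dev/ada5` prints the same four commands shown in the last code block.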