System fails when disk is lost

Posted: 27 Oct 2015 06:27
by tuaris
Why does the entire system fail when I lose one disk in the ZFS RAID?

Code:

Oct 27 01:16:41 <user.crit> storage kernel: arcsas: Completion Q Entry=0x300c0, Slot No.=0xc0, Status_Buff.Err_Info=0x00000000,01000000, INT status=0x1
Oct 27 01:16:41 <user.crit> storage kernel: Device 0x5 Task file error, Status Reg=0x51, Error Reg=0x40.
Oct 27 01:16:41 <user.crit> storage kernel: AbortReq reset command 0xffffff8141eae9c0: Reset pPort(0x1) pCCB->EntryIndex(0x5) Slot(0xc8)
Oct 27 01:16:41 <user.crit> storage kernel: arcsas_cmd_done: target=0x5, lun=0x0, SCSI Command=0x28,0x0,0x6a,0xc5,0x70,0x9a,0x0,0x0,0x7,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Oct 27 01:16:41 <user.crit> storage kernel: AbortReq reset command 0xffffff8141e781e0: Reset pPort(0x1) pCCB->EntryIndex(0x5) Slot(0xca)
Oct 27 01:16:41 <user.crit> storage kernel: arcsas_cmd_done: target=0x5, lun=0x0, SCSI Command=0x2a,0x0,0x32,0x0,0xd9,0xc1,0x0,0x0,0x3,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Oct 27 01:16:41 <user.crit> storage kernel: arcsas: Target=0x 5, lun=0, GONE!!!
Oct 27 01:16:41 <daemon.info> storage istgt[2177]: ABORT_TASK
Oct 27 01:16:42 <user.crit> storage kernel: da5 at arcsas0 bus 0 scbus0 target 5 lun 0
Oct 27 01:16:42 <user.crit> storage kernel: da5: <WDC WD1003FBYX-01Y7B 01.0> s/n WD-WCAW30740700 detached
Oct 27 01:16:42 <user.crit> storage kernel: (da5:arcsas0:0:5:0): READ(10). CDB: 28 00 53 df ff d1 00 00 10 00 
Oct 27 01:16:42 <user.crit> storage kernel: (da5:arcsas0:0:5:0): CAM status: SCSI Status Error
Oct 27 01:16:42 <user.crit> storage kernel: (da5:arcsas0:0:5:0): SCSI status: Check Condition
Oct 27 01:16:42 <user.crit> storage kernel: (da5:arcsas0:0:5:0): SCSI sense: RECOVERED ERROR asc:0,0 (No additional sense information)
Oct 27 01:16:42 <user.crit> storage kernel: (da5:arcsas0:0:5:0): Info: 0x53dfffdd
Oct 27 01:17:01 <user.debug> storage kernel: sonewconn: pcb 0xfffffe015fa447a8: Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences)
Oct 27 01:16:43 <daemon.info> storage last message repeated 3 times
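
For anyone skimming a longer log for the same event, here is a minimal sketch that pulls the "detached" notices out of syslog lines. The pattern is modelled only on the da5 line above; other drivers or FreeBSD versions may word the message differently, and the names `DETACHED` and `find_detached` are made up for illustration.

```python
import re

# Matches the detach notice as it appears in the log above, e.g.
# "... kernel: da5: <WDC WD1003FBYX-01Y7B 01.0> s/n WD-WCAW30740700 detached"
# Assumption: the wording is the same for other devices on this system.
DETACHED = re.compile(r"kernel: (\w+): <([^>]*)> s/n (\S+) detached")


def find_detached(lines):
    """Return a (device, model, serial) tuple for each detach notice found."""
    found = []
    for line in lines:
        m = DETACHED.search(line)
        if m:
            found.append(m.groups())
    return found
```

Feeding it the lines of the system log (for example, a file opened with `open(path)`) would list every device the kernel reported as detached, which makes it easier to match the failure time against the other messages.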

Re: System fails when disk is lost

Posted: 27 Oct 2015 06:45
by Parkcomm
Not enough info to tell:

Is the NAS4Free OS hosted on the pool with the failing disk?

It could mean that one disk fails -> the /dev/ device numbers change -> ZFS sees the wrong vdevs -> pool goes down -> OS goes down?

If so, boot from a NAS4Free USB stick and report the output of zpool status, zpool list, etc.

Is this the same problem you had before?

Re: System fails when disk is lost

Posted: 27 Oct 2015 07:09
by tuaris
The NAS4Free system is hosted as an embedded image on a bootable flash drive.
There are 6 drives attached to an HBA: 4 external (da0-da3) and 2 internal (da4 and da5).

There are two ZFS pools

Code:

  pool: external1
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on software that does not support feature
	flags.
  scan: resilvered 879G in 4h7m with 0 errors on Tue Aug 19 18:04:40 2014
config:

	NAME        STATE     READ WRITE CKSUM
	external1   ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da3     ONLINE       0     0     0

errors: No known data errors

  pool: internal
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: resilvered 5.98M in 0h0m with 0 errors on Tue Oct 27 01:34:50 2015
config:

	NAME        STATE     READ WRITE CKSUM
	internal    ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0

errors: No known data errors

It appears that da5 has failed. Shouldn't the system continue to function?
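
To check output like the above for trouble without reading every row, here is a rough sketch that scans `zpool status` text and lists any pool, vdev, or disk that is not ONLINE or that shows non-zero error counters. It assumes the column layout shown in this post (NAME, STATE, READ, WRITE, CKSUM); `unhealthy_vdevs` and `ROW` are illustrative names, not part of any ZFS tooling.

```python
import re

# A config row looks like "    da5     ONLINE       0     0     0".
# Assumption: the state is a single upper-case word followed by three counters;
# any trailing note after the counters is ignored.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+.*)?$")


def unhealthy_vdevs(status_text):
    """Return (pool, name, state) for rows that are not ONLINE with zero errors."""
    problems = []
    pool = None
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            # Remember which pool the following config rows belong to.
            pool = stripped.split(":", 1)[1].strip()
            continue
        m = ROW.match(line)
        if not m:
            continue
        name, state, read, write, cksum = m.groups()
        if name == "NAME":
            continue  # column header row
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append((pool, name, state))
    return problems
```

Run against the output pasted above, it should report nothing, since both pools are back ONLINE after the resilver. The text could be captured by redirecting `zpool status` to a file and reading that file in.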

Re: System fails when disk is lost

Posted: 27 Oct 2015 07:19
by b0ssman
What controller is that drive on?

Re: System fails when disk is lost

Posted: 27 Oct 2015 07:27
by tuaris
ARECA ARC-1320-4i4X

Re: System fails when disk is lost

Posted: 27 Oct 2015 12:54
by b0ssman
A lot depends on how the card and the driver handle failed drives.

If, for example, the drive just dies and the controller/driver does not correctly support hotplugging, then the system can crash.