ZFS goes offline
Posted: 28 Jul 2014 23:44
I don't know what's causing it, and it has happened twice so far. I have:
- Lenovo RS110 with 8GB RAM, Dual Core Xeon
- ARC-1320-4i4X controller
- SansDigital TR4X+ Enclosure
- 4x Western Digital WD30EFRX drives
- 2x Western Digital WD1003FBYX drives
EXTERNAL is using RAID-Z2.
INTERNAL is using a ZFS mirror.
I have several ZFS volumes and datasets on each.
Without warning, the volumes and datasets on EXTERNAL stop working. Any attempt to access them hangs, including through the web interface. INTERNAL is unaffected and continues to work fine. The last items in the log are:
Jul 28 15:55:49 <user.crit> storage kernel: arcsas: Completion Q Entry=0x30177, Slot No.=0x177, Status_Buff.Err_Info=0x00000000,01000000, INT status=0x1
Jul 28 15:55:49 <user.crit> storage kernel: Device 0x1 Task file error, Status Reg=0x51, Error Reg=0x40.
Jul 28 15:55:49 <user.crit> storage kernel: AbortReq reset command 0xffffff8139115720: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x179)
Jul 28 15:55:49 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x28,0x0,0x15,0x35,0xfd,0x10,0x0,0x0,0x38,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 28 15:55:49 <user.crit> storage kernel: AbortReq reset command 0xffffff8139191aa0: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x17b)
Jul 28 15:55:49 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x2a,0x0,0x7c,0x90,0x91,0xe0,0x0,0x0,0x8,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 28 15:55:49 <user.crit> storage kernel: AbortReq reset command 0xffffff81391aadc0: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x180)
Jul 28 15:55:49 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x2a,0x0,0x7c,0x90,0x91,0xd8,0x0,0x0,0x8,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 28 15:55:49 <user.crit> storage kernel: arcsas: Target=0x 1, lun=0, GONE!!!
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): lost device - 4 outstanding, 3 refs
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 3
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 2
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 1
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): READ(10). CDB: 28 00 7c 8c 38 d0 00 00 28 00
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): CAM status: SCSI Status Error
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI status: Check Condition
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI sense: RECOVERED ERROR asc:0,0 (No additional sense information)
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): Info: 0x7c8c38d0
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 0
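In case it helps with diagnosis, here is a minimal Python sketch of how I read the values in the log above. It assumes the standard legacy ATA status/error register bit layout and the standard 10-byte READ(10)/WRITE(10) CDB layout; the function and table names are my own (hypothetical), not anything from the arcsas driver. Under those assumptions, Status Reg=0x51 with Error Reg=0x40 would mean the drive itself reported an uncorrectable data error, and the failed READ(10) starts at the same LBA as the Info field.

```python
# Sketch only: names are hypothetical, bit meanings assume the legacy ATA layout.
ATA_STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}
ATA_ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "TRK0NF", 0x01: "AMNF",
}


def decode_bits(value, names):
    """List the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True) if value & bit]


def decode_rw10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as a hex string such as '28 00 ...'."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    op = "READ(10)" if cdb[0] == 0x28 else "WRITE(10)"
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return op, lba, blocks


if __name__ == "__main__":
    print(decode_bits(0x51, ATA_STATUS_BITS))  # Status Reg from the log
    print(decode_bits(0x40, ATA_ERROR_BITS))   # Error Reg from the log
    op, lba, blocks = decode_rw10("28 00 7c 8c 38 d0 00 00 28 00")
    print(op, hex(lba), blocks)
```

If that reading is right, the errors originate on the disk behind target 0x1 (da1) rather than in ZFS itself.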
When I do a zpool status it shows:
storage: ~ # zpool status
  pool: external1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        external1   ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       3    19     0
            da3     ONLINE       0     0     0

errors: No known data errors
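To make the output above easier to compare between incidents, this is a rough Python sketch that picks out devices with non-zero error counters from saved zpool status text. The function name and the sample constant are hypothetical, and it assumes the counters are plain integers (rows with abbreviated counts such as 1.2K are skipped). It works on text that has already been captured, so it does not call zpool itself.

```python
# Sketch only: parses previously saved `zpool status` text, never runs zpool.
SAMPLE_STATUS = """\
pool: external1
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
external1 ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
da2 ONLINE 0 0 0
da0 ONLINE 0 0 0
da1 ONLINE 3 19 0
da3 ONLINE 0 0 0
errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:1] == ["config:"]:
            in_config = True
            continue
        if fields[:1] == ["errors:"]:
            in_config = False
        # Only rows shaped like "NAME STATE READ WRITE CKSUM" are of interest.
        if not in_config or len(fields) < 5 or fields[0] == "NAME":
            continue
        try:
            counters = tuple(int(count) for count in fields[2:5])
        except ValueError:
            continue  # assumption: abbreviated counts (e.g. 1.2K) are ignored
        if any(counters):
            flagged[fields[0]] = counters
    return flagged


if __name__ == "__main__":
    print(devices_with_errors(SAMPLE_STATUS))
```

On the output above this flags only da1, the same device the kernel log reports as lost.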
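Attempting to run zfs list causes the command to hang. Rebooting resolves the problem, but I can't keep doing that. I've ruled out a power issue because the other devices connected to the UPS are not going offline. It can't be the controller, since the other volumes are okay.
What could be happening?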
How can I fix this without rebooting?