ZFS goes offline

tuaris
experienced User
Posts: 85
Joined: 19 Jul 2012 21:31

Post by tuaris »

I don't know what's causing it, and it has happened twice so far. My setup:
  • Lenovo RS110 with 8GB RAM, Dual Core Xeon
  • ARC-1320-4i4X controller
  • SansDigital TR4X+ Enclosure
  • 4x Western Digital WD30EFRX drives
  • 2x Western Digital WD1003FBYX drives
The 4 Red drives are in the enclosure (EXTERNAL) and the 2 blacks are in the server's drive bays (INTERNAL).
EXTERNAL is using RAID-Z2.
INTERNAL is using a ZFS mirror.
I have several ZFS volumes and datasets on each.

What happens is that all of a sudden the volumes and datasets on the EXTERNAL will stop working. Any attempt to access them will result in the command hanging, including the web interface. The INTERNAL is unaffected and continues to work fine. The last items in the log are:

Code: Select all

Jul 28 15:55:49 <user.crit> storage kernel: arcsas: Completion Q Entry=0x30177, Slot No.=0x177, Status_Buff.Err_Info=0x00000000,01000000, INT status=0x1
Jul 28 15:55:49 <user.crit> storage kernel: Device 0x1 Task file error, Status Reg=0x51, Error Reg=0x40.
Jul 28 15:55:49 <user.crit> storage kernel: AbortReq reset command 0xffffff8139115720: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x179)
Jul 28 15:55:49 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x28,0x0,0x15,0x35,0xfd,0x10,0x0,0x0,0x38,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 28 15:55:49 <user.crit> storage kernel: AbortReq reset command 0xffffff8139191aa0: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x17b)
Jul 28 15:55:49 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x2a,0x0,0x7c,0x90,0x91,0xe0,0x0,0x0,0x8,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 28 15:55:49 <user.crit> storage kernel: AbortReq reset command 0xffffff81391aadc0: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x180)
Jul 28 15:55:49 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x2a,0x0,0x7c,0x90,0x91,0xd8,0x0,0x0,0x8,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 28 15:55:49 <user.crit> storage kernel: arcsas: Target=0x 1, lun=0, GONE!!!
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): lost device - 4 outstanding, 3 refs
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 3
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 2
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 1
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): READ(10). CDB: 28 00 7c 8c 38 d0 00 00 28 00 
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): CAM status: SCSI Status Error
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI status: Check Condition
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI sense: RECOVERED ERROR asc:0,0 (No additional sense information)
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): Info: 0x7c8c38d0
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 0
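For anyone reading the log above, here is a minimal sketch of how the raw values can be decoded. It assumes the "Status Reg" and "Error Reg" fields are the standard ATA task file registers and that the CDB line is a plain READ(10); the function names are made up for illustration and are not part of any driver or tool.

```python
# Standard ATA status/error register bit names. Older specs call bit 0x80 of
# the error register BBK instead of ICRC, and bit 0x02 TK0NF instead of NM.
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
                   0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ATA_ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
                  0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF"}


def decode_bits(value, names):
    """List the names of all bits set in an 8-bit register value."""
    return [name for bit, name in names.items() if value & bit]


def decode_rw10(cdb_hex):
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if len(cdb) != 10 or cdb[0] not in opcodes:
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    return {
        "op": opcodes[cdb[0]],
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }


if __name__ == "__main__":
    print(decode_bits(0x51, ATA_STATUS_BITS))
    print(decode_bits(0x40, ATA_ERROR_BITS))
    print(decode_rw10("28 00 7c 8c 38 d0 00 00 28 00"))
```

Status 0x51 decodes to DRDY, DSC and ERR, and error 0x40 to UNC (uncorrectable data). The decoded LBA is 0x7c8c38d0, which matches the "Info:" line in the log. If the assumption about the registers holds, that reading would point at the drive or its link reporting a failed read rather than at ZFS itself, but it is not conclusive on its own.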
When I do a zpool status it shows:

Code: Select all

storage: ~ # zpool status
  pool: external1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        external1   ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       3    19     0
            da3     ONLINE       0     0     0

errors: No known data errors
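The status output above can also be checked automatically. This is only a sketch: it assumes the counters are plain integers (zpool abbreviates large values, for example "1.2K", which this regex would skip), and the function name is hypothetical.

```python
import re

STATUS_SAMPLE = """\
  pool: external1
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        external1   ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       3    19     0
            da3     ONLINE       0     0     0

errors: No known data errors
"""

# An indented config row: NAME STATE READ WRITE CKSUM, counters as integers.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    bad = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            bad[name] = counts
    return bad


if __name__ == "__main__":
    print(devices_with_errors(STATUS_SAMPLE))
```

On the sample it reports only da1, with 3 read and 19 write errors, even though the pool state still says ONLINE.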
Attempting to run zfs list causes the command to hang. Rebooting resolves the problem, but I can't keep doing that. I've ruled out a power issue because the other devices connected to the UPS are not going offline. It can't be the controller, since the other volumes are okay.

What could be happening?
How can I fix this without rebooting?

apollo567
Site Admin
Posts: 675
Joined: 23 Jun 2012 06:37
Location: Ludwigshafen, Germany

Post by apollo567 »

Did it happen after you loaded the driver, as you mentioned here:
viewtopic.php?f=78&t=6505&p=40536#p40536 ?
my NAS and its development until today: viewtopic.php?f=63&t=39&sid=039fed830cf ... 4d0abe4a04

Post by tuaris »

apollo567 wrote: Did it happen after you loaded the driver, as you mentioned here:
viewtopic.php?f=78&t=6505&p=40536#p40536 ?
Yes, and it just happened again.

Post by tuaris »

I can accept that the drive may be going bad or is already bad, but the ZFS pool should not fail or cause things to hang like this, should it?
Here is the output of ps aux:

Code: Select all

storage: ~ # ps aux
USER    PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED      TIME COMMAND
root     11 200.0  0.0      0    32 ??  RL    5:24PM 113:16.74 [idle]
root      6   0.2  0.0      0   192 ??  DL    5:24PM   0:39.21 [zfskern]
root      0   0.0  0.1      0  5952 ??  DLs   5:24PM   1:55.11 [kernel]
root      1   0.0  0.0   6276   632 ??  ILs   5:24PM   0:00.00 /sbin/init --
root      2   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [crypto]
root      3   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [crypto returns]
root      4   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [ctl_thrd]
root      5   0.0  0.0      0    16 ??  DL    5:24PM   0:00.01 [fdc0]
root      7   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [sctp_iterator]
root      8   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [xpt_thrd]
root      9   0.0  0.0      0    16 ??  DL    5:24PM   0:00.04 [md0]
root     10   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [audit]
root     12   0.0  0.0      0   224 ??  WL    5:24PM   0:49.19 [intr]
root     13   0.0  0.0      0    48 ??  DL    5:24PM   0:25.10 [geom]
root     14   0.0  0.0      0    16 ??  DL    5:24PM   0:05.08 [yarrow]
root     15   0.0  0.0      0   448 ??  DL    5:24PM   0:06.47 [usb]
root     16   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [pagedaemon]
root     17   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [vmdaemon]
root     18   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [pagezero]
root     19   0.0  0.0      0    16 ??  DL    5:24PM   0:00.01 [bufdaemon]
root     20   0.0  0.0      0    16 ??  DL    5:24PM   0:00.01 [vnlru]
root     21   0.0  0.0      0    16 ??  DL    5:24PM   0:00.70 [syncer]
root     22   0.0  0.0      0    16 ??  DL    5:24PM   0:00.45 [softdepflush]
root   1030   0.0  0.0      0    16 ??  DL    5:24PM   0:00.00 [g_mirror Personal]
root   1179   0.0  0.0      0    16 ??  DL    5:24PM   0:00.02 [md1]
root   2586   0.0  0.0   6280   740 ??  Is    5:24PM   0:00.00 /sbin/devd
root   2766   0.0  0.0  14216  1876 ??  Is    5:24PM   0:00.03 /usr/sbin/syslogd -8 -s -f /var/etc/syslog.conf
root   2782   0.0  0.0  16268  2036 ??  Ss    5:24PM   0:00.01 /usr/sbin/rpcbind
root   2947   0.0  0.0  14184  2572 ??  Is    5:24PM   0:00.00 /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
root   2958   0.0  0.0  12048  2084 ??  Is    5:24PM   0:00.02 nfsd: master (nfsd)
root   2959   0.0  0.0   9944  1616 ??  D     5:24PM   0:05.84 nfsd: server (nfsd)
root   2962   0.0  0.0 274188  2024 ??  Ss    5:24PM   0:00.00 /usr/sbin/rpc.statd
root   2965   0.0  0.0  14168  2068 ??  Ss    5:24PM   0:00.00 /usr/sbin/rpc.lockd
root   3024   0.0  0.0  26012  2672 ??  I     5:24PM   0:00.01 /usr/local/sbin/afpd -F /var/etc/afpd.conf
root   3027   0.0  0.0  15136  1916 ??  I     5:24PM   0:00.01 /usr/local/sbin/cnid_metad
root   3098   0.0  0.9 251580 76648 ??  SLs   5:24PM   1:08.02 /usr/local/bin/istgt -c /var/etc/iscsi/istgt.conf
nobody 3178   0.0  0.1  31864  5212 ??  Ss    5:24PM   0:00.04 proftpd: (accepting connections) (proftpd)
root   3237   0.0  0.1  68428  7540 ??  Ss    5:24PM   0:00.06 /usr/local/sbin/nmbd -D -s /var/etc/smb.conf
root   3240   0.0  0.1  79348 11752 ??  Is    5:24PM   0:00.15 /usr/local/sbin/smbd -D -s /var/etc/smb.conf
root   3243   0.0  0.1  78200  9392 ??  Ss    5:24PM   0:00.04 /usr/local/sbin/winbindd -s /var/etc/smb.conf
root   3254   0.0  0.1  78200 11008 ??  I     5:24PM   0:00.04 /usr/local/sbin/winbindd -s /var/etc/smb.conf
root   3277   0.0  0.1  79348 11816 ??  S     5:24PM   0:00.00 /usr/local/sbin/smbd -D -s /var/etc/smb.conf
root   3280   0.0  0.1  31012  4372 ??  Is    5:24PM   0:00.00 /usr/sbin/sshd -f /var/etc/ssh/sshd_config -h /var/etc/ssh/ssh_host_dsa_key
root   3326   0.0  0.0  16280  1912 ??  Is    5:24PM   0:00.00 /usr/sbin/cron -s
root   3407   0.0  0.1  77468  9504 ??  I     5:24PM   0:00.00 /usr/local/sbin/winbindd -s /var/etc/smb.conf
root   3414   0.0  0.0   9944  1908 ??  Is    5:24PM   0:00.00 /usr/local/bin/mDNSResponderPosix -b -f /var/etc/mdnsresponder.conf
root   3423   0.0  0.1  80608  9376 ??  I     5:24PM   0:00.00 /usr/local/sbin/winbindd -s /var/etc/smb.conf
root   3476   0.0  0.1  35320  5104 ??  S     5:24PM   0:00.08 /usr/local/sbin/lighttpd -f /var/etc/lighttpd.conf -m /usr/local/lib/lighttpd
root   3675   0.0  0.1  71000  5512 ??  Is    5:24PM   0:00.01 sshd: root@notty (sshd)
root   3677   0.0  0.0  16612  2784 ??  Is    5:24PM   0:00.00 tcsh -c /usr/libexec/sftp-server (csh)
root   3679   0.0  0.1  26856  4076 ??  I     5:24PM   0:00.00 /usr/libexec/sftp-server
root   4360   0.0  0.0  37624  2820 ??  D     5:54PM   0:00.00 zfs list -H -o used,available external1
root   4365   0.0  0.1  71000  5508 ??  Is    5:54PM   0:00.01 sshd: root@pts/0 (sshd)
root   5264   0.0  0.2  91660 19348 ??  I     6:02PM   0:00.03 /usr/local/bin/php-cgi /usr/local/www/disks_zfs_zpool.php
root   5266   0.0  0.0  37624  2820 ??  D     6:02PM   0:00.00 zfs list -H -o used,available external1
root   5271   0.0  0.2  91660 16984 ??  I     6:02PM   0:00.01 /usr/local/bin/php-cgi /usr/local/www/index.php
root   5289   0.0  0.1  71000  5512 ??  Ss    6:07PM   0:00.02 sshd: root@pts/1 (sshd)
root   5297   0.0  0.1  71000  5512 ??  Ss    6:08PM   0:00.04 sshd: root@pts/2 (sshd)
root   3405   0.0  0.1  51848  9392 v0- I     5:24PM   0:05.11 /usr/local/sbin/mt-daapd -m -c /var/etc/mt-daapd.conf
root   3582   0.0  0.3 195248 21292 v0- S     5:24PM   0:12.78 /usr/local/bin/fuppesd --config-file /var/etc/fuppes.cfg --log-level 3 --log-file /var/log/fuppes.log --plugin-dir /usr/local/lib/fuppes --friendly-name NAS4Free (%h) --database-file /mnt/external1/Music//fuppes.db
root   3655   0.0  0.0  60136  2852 v0  Is    5:24PM   0:00.01 login [pam] (login)
root   3658   0.0  0.0  16612  3168 v0  I     5:24PM   0:00.01 -tcsh (csh)
root   3666   0.0  0.0  14536  2688 v0  I+    5:24PM   0:00.00 /bin/sh /etc/rc.initial
root   3656   0.0  0.0  12084  1676 v1  Is+   5:24PM   0:00.00 /usr/libexec/getty Pc ttyv1
root   3657   0.0  0.0  12084  1676 v2  Is+   5:24PM   0:00.00 /usr/libexec/getty Pc ttyv2
root   4367   0.0  0.0  16612  3556  0  Is    5:54PM   0:00.01 -tcsh (csh)
root   5280   0.0  0.0  37612  3120  0  D+    6:04PM   0:00.00 zpool clear external1 da1
root   5291   0.0  0.0  16612  3588  1  Is    6:07PM   0:00.01 -tcsh (csh)
root   5334   0.0  0.0  37624  3048  1  D+    6:15PM   0:00.00 zfs list
root   5299   0.0  0.0  16612  3752  2  Ss    6:08PM   0:00.01 -tcsh (csh)
root   5369   0.0  0.0  14220  2116  2  R+    6:23PM   0:00.00 ps aux
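To pick the stuck commands out of a long listing like the one above, one can filter on the STAT column. A rough sketch, assuming the FreeBSD `ps aux` column layout shown here (STAT is the 8th column and STARTED contains no spaces); the helper name is made up. Kernel threads in square brackets are skipped because they normally sit in state D.

```python
PS_SAMPLE = """\
USER    PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED      TIME COMMAND
root      6   0.2  0.0      0   192 ??  DL    5:24PM   0:39.21 [zfskern]
root   2959   0.0  0.0   9944  1616 ??  D     5:24PM   0:05.84 nfsd: server (nfsd)
root   4360   0.0  0.0  37624  2820 ??  D     5:54PM   0:00.00 zfs list -H -o used,available external1
root   5280   0.0  0.0  37612  3120  0  D+    6:04PM   0:00.00 zpool clear external1 da1
root   5369   0.0  0.0  14220  2116  2  R+    6:23PM   0:00.00 ps aux
"""


def blocked_user_processes(ps_text):
    """Return (pid, command) for non-kernel processes whose state starts with D."""
    hits = []
    for line in ps_text.splitlines():
        fields = line.split(None, 10)  # the 11th field keeps the full command line
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed line
        stat, command = fields[7], fields[10]
        if stat.startswith("D") and not command.startswith("["):
            hits.append((int(fields[1]), command))
    return hits


if __name__ == "__main__":
    for pid, command in blocked_user_processes(PS_SAMPLE):
        print(pid, command)
```

State D means the process is in an uninterruptible wait, usually on disk I/O. Not every D entry is a hang (the nfsd server can legitimately sit there), but the `zfs list` and `zpool clear` commands stuck in D for many minutes fit the symptom described.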
Looks like /dev/da1 is indeed gone:

Code: Select all

storage: ~ # ls /dev/da*
/dev/da0     /dev/da2     /dev/da3     /dev/da4     /dev/da4.nop /dev/da5     /dev/da5.nop /dev/da6     /dev/da6p1   /dev/da7     /dev/da7p1   /dev/da8     /dev/da9

Code: Select all

storage: ~ # camcontrol devlist
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 0 lun 0 (da0,pass0)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 2 lun 0 (da2,pass2)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 3 lun 0 (da3,pass3)
<WDC WD1003FBYX-01Y7B 01.0>        at scbus0 target 4 lun 0 (da4,pass4)
<WDC WD1003FBYX-01Y7B 01.0>        at scbus0 target 5 lun 0 (da5,pass5)
<HL-DT-ST DVDRAM GSA-T50N RY05>    at scbus3 target 0 lun 0 (pass6,cd0)
<TRANSCEND 20110519>               at scbus4 target 0 lun 0 (ada0,pass7)
<WDC WD12 00BB-53CAA1 17.0>        at scbus6 target 0 lun 0 (da6,pass8)
<MicroNet Volume Set # 00 0100>    at scbus7 target 0 lun 0 (da7,pass9)
<WDC WD50 00AAKS-007AA0 >          at scbus8 target 0 lun 0 (da8,pass10)
<WDC WD50 00AAKS-007AA0 >          at scbus8 target 0 lun 1 (da9,pass11)
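The same check (which pool member is missing from the bus) can be scripted against the `camcontrol devlist` text. A small sketch, assuming the line format shown above; the function name and the list of expected members are illustrative only.

```python
import re

DEVLIST_SAMPLE = """\
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 0 lun 0 (da0,pass0)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 2 lun 0 (da2,pass2)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 3 lun 0 (da3,pass3)
<WDC WD1003FBYX-01Y7B 01.0>        at scbus0 target 4 lun 0 (da4,pass4)
"""

LINE = re.compile(
    r"<[^>]*>\s+at scbus(\d+) target (\d+) lun (\d+) \(([^)]*)\)"
)


def parse_devlist(text):
    """Map each non-pass device name to its (bus, target, lun) tuple."""
    found = {}
    for line in text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        bus, target, lun, names = match.groups()
        for name in names.split(","):
            if not name.startswith("pass"):  # skip the passthrough node
                found[name] = (int(bus), int(target), int(lun))
    return found


def missing_members(text, expected):
    """Return the expected device names that do not appear in the devlist."""
    return sorted(set(expected) - set(parse_devlist(text)))


if __name__ == "__main__":
    print(missing_members(DEVLIST_SAMPLE, ["da0", "da1", "da2", "da3"]))
```

With the four raidz2 members as the expected list, only da1 comes back as missing, which agrees with the `ls /dev/da*` output.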

Post by tuaris »

Doing a power cycle on the enclosure and then running

Code: Select all

storage: ~ # camcontrol rescan 0
Re-scan of bus 0 was successful
That brought da1 back:

Code: Select all

storage: ~ # camcontrol devlist
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 0 lun 0 (pass0,da0)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 1 lun 0 (pass3,da1)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 2 lun 0 (pass1,da2)
<WDC WD30EFRX-68AX9N0 80.0>        at scbus0 target 3 lun 0 (pass2,da3)
But in the process I caused da5 to disappear and come back:

Code: Select all

(da5:arcsas0:0:5:0): lost device - 1 outstanding, 3 refs
(da5:arcsas0:0:5:0): oustanding 0
GEOM_NOP: Device da5.nop is still open, so it can't be definitely removed.
(da5:arcsas0:0:5:0): removing device entry
da5 at arcsas0 bus 0 scbus0 target 5 lun 0
da5: <WDC WD1003FBYX-01Y7B 01.0> Fixed Direct Access SCSI-5 device
da5: 300.000MB/s transfers
da5: Command Queueing enabled
da5: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
da5 is an internal drive. I am starting to think this is a driver issue.

raulfg3
Site Admin
Posts: 4865
Joined: 22 Jun 2012 22:13
Location: Madrid (ESPAÑA)

Post by raulfg3 »

To me it looks like your ARC-1320-4i4X controller and SansDigital TR4X+ enclosure are not happy living together.

But if it is a driver issue, read the latest info on the Areca website: http://www.areca.com.tw/support/s_freeb ... reebsd.htm

http://www.areca.us/support/s_freebsd/n ... Free91.zip
12.1.0.4 - Ingva (revision 7743) on SUPERMICRO X8SIL-F 8GB of ECC RAM, 11x3TB disk in 1 vdev = Vpool = 32TB Raw size , so 29TB usable size (I Have other NAS as Backup)


Post by tuaris »

raulfg3 wrote:for me your ARC-1320-4i4X controller & SansDigital TR4X+ Enclosure are not happy to live together.

but if is a driver issue, read latest info on areca website: http://www.areca.com.tw/support/s_freeb ... reebsd.htm

http://www.areca.us/support/s_freebsd/n ... Free91.zip
I have another controller, a RocketRAID 2684LF card, that I can put in. Do you think it might work better with that?
Can I put both of them in?

b0ssman
Forum Moderator
Posts: 2438
Joined: 14 Feb 2013 08:34
Location: Munich, Germany

Post by b0ssman »

That card is not supported by FreeBSD out of the box, and the driver on the website is too old.
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.

Post by tuaris »

raulfg3 wrote:for me your ARC-1320-4i4X controller & SansDigital TR4X+ Enclosure are not happy to live together.
They have been together for a year and a half. I guess they started having troubles. ;)

Post by raulfg3 »

tuaris wrote:
raulfg3 wrote:for me your ARC-1320-4i4X controller & SansDigital TR4X+ Enclosure are not happy to live together.
They have been together for a year and a half. I guess they started having troubles. ;)
It's only an opinion, not based on data, just a suspicion from this log:

Code: Select all

Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): READ(10). CDB: 28 00 7c 8c 38 d0 00 00 28 00
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): CAM status: SCSI Status Error
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI status: Check Condition
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI sense: RECOVERED ERROR asc:0,0 (No additional sense information)
You need to google a bit more to see if this error is the cause of your troubles and what's the ultimate cause.

Post by tuaris »

raulfg3 wrote: It's only an opinion, not based on data, just a suspicion from this log:

Code: Select all

Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): READ(10). CDB: 28 00 7c 8c 38 d0 00 00 28 00
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): CAM status: SCSI Status Error
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI status: Check Condition
Jul 28 15:55:49 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI sense: RECOVERED ERROR asc:0,0 (No additional sense information)
You need to google a bit more to see if this error is the cause of your troubles and what's the ultimate cause.
Most of what I find suggests it's a cabling issue, and I think that is the case. I moved the enclosure shortly before this started happening. I've powered down and reseated the cables. Let's hope that fixes it.

Code: Select all

storage: ~ # zpool status
  pool: external1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 29 08:54:51 2014
        202G scanned out of 3.59T at 260M/s, 3h48m to go
        48.9G resilvered, 5.48% done
config:

        NAME        STATE     READ WRITE CKSUM
        external1   ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     1  (resilvering)
            da3     ONLINE       0     0     0

errors: No known data errors
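The "3h48m to go" figure in the status above can be reproduced from the other numbers. A back-of-the-envelope sketch, assuming zpool's T, G and M suffixes are binary units (powers of 1024) and keeping in mind that the printed values are rounded; the function names are made up.

```python
def resilver_eta_seconds(total_tib, scanned_gib, rate_mib_per_s):
    """Estimate the remaining time from total size, amount scanned and scan rate."""
    remaining_mib = (total_tib * 1024 - scanned_gib) * 1024
    return remaining_mib / rate_mib_per_s


def as_hours_minutes(seconds):
    """Split a duration in seconds into whole hours and whole minutes."""
    hours, rest = divmod(int(seconds), 3600)
    return hours, rest // 60


if __name__ == "__main__":
    eta = resilver_eta_seconds(total_tib=3.59, scanned_gib=202, rate_mib_per_s=260)
    print("%dh%02dm to go" % as_hours_minutes(eta))
```

With 3.59T total, 202G scanned and 260M/s, the estimate comes out at 3 hours 48 minutes, the same as zpool reports. The estimate only holds if the scan rate stays constant.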
For now it looks like it's rebuilding correctly, but I still see this in the logs:

Code: Select all

sonewconn: pcb 0xfffffe0124927000: Listen queue overflow: 2 already in queue awaiting acceptance
sonewconn: pcb 0xfffffe0124927000: Listen queue overflow: 2 already in queue awaiting acceptance
sonewconn: pcb 0xfffffe0124927000: Listen queue overflow: 2 already in queue awaiting acceptance
sonewconn: pcb 0xfffffe0124927000: Listen queue overflow: 2 already in queue awaiting acceptance
arcsas_category_cdb_for_sata: Unknown request: 0xb7.
 arcsas_cmd_done: target=0x0, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4
arcsas_category_cdb_for_sata: Unknown request: 0xb7.
 arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4
arcsas_category_cdb_for_sata: Unknown request: 0xb7.
 arcsas_cmd_done: target=0x2, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4
arcsas_category_cdb_for_sata: Unknown request: 0xb7.
 arcsas_cmd_done: target=0x3, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4
arcsas_category_cdb_for_sata: Unknown request: 0xb7.
 arcsas_cmd_done: target=0x4, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4
arcsas_category_cdb_for_sata: Unknown request: 0xb7.
 arcsas_cmd_done: target=0x5, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4
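As a side note on the "Unknown request: 0xb7" lines: in the SCSI command set, opcode 0xB7 is READ DEFECT DATA(12). The sketch below pulls the command bytes out of an arcsas_cmd_done line and decodes the flag byte. It assumes the driver prints the CDB bytes in order, starting with the opcode; what issued the command is not stated in the thread, and the function name is hypothetical.

```python
import re

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0xB7: "READ DEFECT DATA(12)"}

CMD_LINE = (
    " arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0xb7,0xc,0x0,0x0,"
    "0x0,0x0,0x0,0x0,0x0,0x8,cmd_status=0x206, scsi_status=0x0, ccb_status=0x4"
)


def decode_cmd_done(line):
    """Extract the logged CDB bytes and name the opcode."""
    match = re.search(r"SCSI Command=((?:0x[0-9a-fA-F]+,)+)", line)
    if not match:
        raise ValueError("no SCSI Command field found")
    cdb = [int(tok, 16) for tok in match.group(1).rstrip(",").split(",")]
    info = {"opcode": cdb[0], "name": OPCODES.get(cdb[0], "unknown")}
    if cdb[0] == 0xB7:
        # Byte 1 of READ DEFECT DATA(12): PLIST, GLIST, 3-bit list format.
        info["plist"] = bool(cdb[1] & 0x10)
        info["glist"] = bool(cdb[1] & 0x08)
        info["format"] = cdb[1] & 0x07
    return info


if __name__ == "__main__":
    print(decode_cmd_done(CMD_LINE))
```

Byte 1 is 0x0c, so the request asks for the grown defect list (GLIST set, PLIST clear). Going by the message name, the driver does not seem to translate this command for SATA disks, so these lines may be harmless noise, but that is a guess from the log text alone.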

Post by tuaris »

The same disk went offline:

Code: Select all

Jul 31 22:03:28 <user.crit> storage kernel: arcsas: Completion Q Entry=0x30172, Slot No.=0x172, Status_Buff.Err_Info=0x00000000,01000000, INT status=0x1
Jul 31 22:03:28 <user.crit> storage kernel: Device 0x1 Task file error, Status Reg=0x51, Error Reg=0x40.
Jul 31 22:03:28 <user.crit> storage kernel: AbortReq reset command 0xffffff813910dda0: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x179)
Jul 31 22:03:28 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x28,0x0,0x1d,0xd3,0xeb,0xd0,0x0,0x0,0x80,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 31 22:03:28 <user.crit> storage kernel: AbortReq reset command 0xffffff81391a7f40: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x17b)
Jul 31 22:03:28 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x28,0x0,0x1d,0xd3,0xed,0xb0,0x0,0x0,0x80,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 31 22:03:28 <user.crit> storage kernel: AbortReq reset command 0xffffff81391af920: Reset pPort(0x1) pCCB->EntryIndex(0x1) Slot(0x184)
Jul 31 22:03:28 <user.crit> storage kernel: arcsas_cmd_done: target=0x1, lun=0x0, SCSI Command=0x28,0x0,0x1d,0xd3,0xfd,0xc8,0x0,0x0,0x80,0x0,cmd_status=0x208, scsi_status=0x0, ccb_status=0x6
Jul 31 22:03:28 <user.crit> storage kernel: arcsas: Target=0x 1, lun=0, GONE!!!
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): lost device - 4 outstanding, 3 refs
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 3
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 2
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 1
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): READ(10). CDB: 28 00 1d d3 d2 00 00 00 58 00 
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): CAM status: SCSI Status Error
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI status: Check Condition
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): SCSI sense: RECOVERED ERROR asc:0,0 (No additional sense information)
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): Info: 0x1dd3d200
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): oustanding 0
Jul 31 22:03:28 <user.crit> storage kernel: (da1:arcsas0:0:1:0): removing device entry
But this time the system was able to continue working in DEGRADED mode:

Code: Select all

 pool: external1
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: scrub repaired 0 in 3h29m with 0 errors on Tue Jul 29 12:55:06 2014
config:

	NAME                     STATE     READ WRITE CKSUM
	external1                DEGRADED     0     0     0
	  raidz2-0               DEGRADED     0     0     0
	    da2                  ONLINE       0     0     0
	    da0                  ONLINE       0     0     0
	    9976121821713388189  REMOVED      0     0     0  was /dev/da1
	    da3                  ONLINE       0     0     0

errors: No known data errors
So it's possible the disk may be going bad. How can I check for that?

Post by b0ssman »

Post the SMART values.

Post by tuaris »

b0ssman wrote:post the smart values
The problem is that the disk has completely disappeared from the system. I would need to reboot to get it back. Or is there a way to bring it back without rebooting?

Post by b0ssman »

Probably not.

Post by tuaris »

b0ssman wrote:post the smart values
I have since replaced the disk and the problem has not resurfaced. However, the SAS controller (ARC-1320) I am using does not seem to support S.M.A.R.T.

Code: Select all

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WD30EFRX-68AX9N0
Revision:             80.0
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Serial number:        WD-WCC1T0291127
Device type:          disk
Local Time is:        Sat Oct  4 08:31:33 2014 EDT
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===

Current Drive Temperature:     <not available>

Manufactured in week  of year
Specified cycle count over device lifetime:  10
Accumulated start-stop cycles:  0
Read defect list: asked for grown list but didn't get it
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.000           0
write:         0        0         0         0          0          0.000           0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 2  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 3  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 4  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 5  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 6  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 7  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 8  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
# 9  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
#10  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
#11  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
#12  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
#13  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
#14  Abort background  Aborted (by user command)   8       0                 0 [-   -    -]
Long (extended) Self Test duration: 30 seconds [0.5 minutes]
