
I almost had a heart attack...

Posted: 05 Jul 2014 18:00
by ChriZathens
Hello to all...
A few days ago I ripped a movie (1080p, about 16GB in size) and yesterday I wanted to transfer it to my NAS.
My second pool, consisting of 4x2TB disks, has a total of 5.2TB of space and contains all my movies. It has about 500GB free now.
Anyway, I noticed that while the transfer was in progress, the webui showed some hangs. I didn't think it could be caused by high CPU usage, since the machine I was transferring from is an Atom machine with a not-so-great NIC, capable of sending at about 25MB/sec. So I found it odd to have webui hangs due to high CPU load.
Indeed, when I managed to see the status page again, I saw to my surprise that the pool appeared as degraded and device da7 was shown as faulted. Damn, I said, a disk failed on me..
Luckily, I have a new spare 2TB WD Red, so I figured I would replace the faulty disk as soon as I returned from work (I was doing the job remotely from my office).
I should also mention that the 4 Samsung 2TB disks were all bought together, but after two months one of them started having unrecoverable errors and was RMA'd. The disk which now appeared as faulted was that RMA replacement...
In the meanwhile the transfer was continuing. The point where I almost had a heart attack was when I refreshed the status page (which had hung again..) after about 5 minutes, and now the pool appeared as unavailable. zpool status now showed 2 faulted disks :shock: :shock: :shock:
After recovering without (I hope) permanent brain damage, I told myself it couldn't be that 2 disks failed on me at the same time (especially since I remembered that the first one was not from the same batch).
I thought that maybe it was a cabling error, and I shut down the NAS in order to look into it in the afternoon.
I realized it was not shutting down, though, so I instructed the babysitter (my mother-in-law, lollll) to press and hold the power button. After a while I told her to turn it on again, because I was curious.
Indeed, after it turned on, the pool was online again, with a notice that it had resilvered 3GB in 1m without finding errors.
It shut down successfully this time, and today I decided to check it again and possibly change the cable which connects the four 2TB drives (it is an SFF-8087 to 4x SATA cable).
But after turning it on, I transferred the movie again and this time all was fine... :?
So I decided to run a "stress test" by scrubbing the pool (I will probably not let it finish).
As of now the outcome is:

Code:

pool: Media2
 state: ONLINE
  scan: scrub in progress since Sat Jul  5 18:13:08 2014
        460G scanned out of 6.42T at 204M/s, 8h31m to go
        0 repaired, 7.00% done
config:

	NAME        STATE     READ WRITE CKSUM
	Media2      ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da4     ONLINE       0     0     0

errors: No known data errors
I am still not sure whether the cable is faulty or not.. (any ideas welcome...)

Just wanted to share this little adventure with you and possibly get some ideas from you about what the problem might be.
And just for the record, half of the data in that pool (the half that I care about) is also in a backup. But losing so much data, even with a backup available, is always frustrating...

P.S.: Additional info:
  1. The S.M.A.R.T. values are not telling me anything. In particular, for the disks that appeared as faulted, UDMA_CRC_Error_Count is 0 (whereas in the past, when I had a failed cable, the number increased to 37 for the hard disk attached to the faulty cable).
  2. The NAS is inside a wall-mounted cabinet, and the room's air conditioning unit sends cool air directly at the cabinet. At the time of the problem, however, the air conditioning was turned off, and temperatures in Athens this time of year are above 30 Celsius. So another idea is that maybe the whole SAS controller got overheated and started to throw errors (????). Right now the air conditioning is on, but I will turn it off and see what happens... Now, after an hour of scrubbing, the pool's disks are at 32 Celsius, but without the air conditioning they reach almost 50 while scrubbing (in normal usage they are about 42-44).
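A quick way to keep watching for the cabling signature mentioned in point 1 is to pull the raw UDMA_CRC_Error_Count value out of the smartctl attribute table and see whether it grows during transfers. A minimal sketch - the helper name and the sample attribute line below are my own illustration, not output from this NAS:

```shell
#!/bin/sh
# Extract the raw UDMA_CRC_Error_Count value from `smartctl -A` output.
# A count that rises over time points at cabling, not the disk itself.
# crc_count() reads the smartctl attribute table on stdin.
crc_count() {
    awk '/UDMA_CRC_Error_Count/ {print $10}'
}

# Assumed sample line from a smartctl attribute table (the value 37 echoes
# the earlier bad-cable incident; it is not real output from this machine):
sample='199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       37'
echo "$sample" | crc_count
```

On a live system you would pipe real output through it instead, e.g. `/usr/local/sbin/smartctl -A /dev/da7 | crc_count`, once before a transfer and once after, and compare.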

Re: I almost had a heart attack...

Posted: 05 Jul 2014 18:28
by b0ssman
Did you save the dmesg output?

Re: I almost had a heart attack...

Posted: 05 Jul 2014 18:39
by ChriZathens
Nope (I forgot to...)
I am monitoring it now and hopefully I will see something helpful (although it never happens when you expect it...)
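For next time, a tiny sketch that snapshots the kernel message buffer to a file so the errors are not lost across a reboot. The log path is an assumption - point it at a persistent dataset on your system, and run it from cron or right after a hang:

```shell
#!/bin/sh
# Append a timestamped copy of the kernel message buffer to a log file.
# LOGFILE is an assumed path; it should live on persistent storage.
LOGFILE=/tmp/dmesg-history.txt
{
    echo "===== $(date) ====="
    dmesg 2>&1
} >> "$LOGFILE"
```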

Re: I almost had a heart attack...

Posted: 05 Jul 2014 19:00
by crowi
Are you using the normal webgui or the extended gui?
Using the extended gui gives me ahci errors when accessing the webgui during hard disk load, e.g.:
Jul 4 07:33:15 nas4free kernel: ahcich2: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 40 serr 00000000 cmd 10008517
Jul 4 07:33:15 nas4free kernel: ahcich2: Timeout on slot 5 port 0

Re: I almost had a heart attack...

Posted: 05 Jul 2014 19:32
by ChriZathens
I am using the extended gui...
In the meanwhile:

Code:

 pool: Media2
 state: ONLINE
  scan: scrub in progress since Sat Jul  5 18:13:08 2014
        1.65T scanned out of 6.42T at 206M/s, 6h43m to go
        0 repaired, 25.67% done
config:

	NAME        STATE     READ WRITE CKSUM
	Media2      ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da4     ONLINE       0     0     0

errors: No known data errors
And temps:
CPU: 50
Disks: from 44 to 46 (air conditioning off - outside temperature 28 degrees at 20:30 in the evening)

Re: I almost had a heart attack...

Posted: 05 Jul 2014 20:56
by crowi
Disks from 44 to 46
That's not really critical but I wouldn't feel comfortable above 40°C.
Can you put a fan near the disks?

Re: I almost had a heart attack...

Posted: 05 Jul 2014 22:08
by ChriZathens
Now, after almost 4 hours of scrubbing without A/C, temperatures have gone up to 50-52 Celsius.
Under normal conditions the A/C is always on while scrubbing - but today I wanted to perform the stress test...
BTW, the case has 8 trays for HDDs, and in front of them there are already two 120mm fans for HDD cooling.
And I am sure that the normal temperatures for the disks while scrubbing are lower; it is just that in my case the NAS is inside a closed cabinet, as I mentioned in my first post.
Imagine that, in the summer and without A/C, back when the NAS was on 24/7, the CPU temperature would drop about 5 degrees just by opening the cabinet door.


Anyway, the thing is that after all this time everything is still normal...
I will let it scrub a little longer and then cancel the scrub and shut it down... If it happens again in the following days I will see what to do..

Re: I almost had a heart attack...

Posted: 07 Jul 2014 10:21
by ChriZathens
OK, an update on my issue:
Timeouts occurred again today, and now it is the first pool which appears to have problems....
This is the relevant output of dmesg:

Code:

mpt0: request 0xffffff800091fdd0:12656 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091fdd0:12656 function 0
mpt0: completing timedout/aborted req 0xffffff800091fdd0:12656
mpt0: abort of req 0xffffff800091fdd0:0 completed
mpt0: request 0xffffff800091f0e0:12658 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091f0e0:12658 function 0
mpt0: completing timedout/aborted req 0xffffff800091f0e0:12658
mpt0: abort of req 0xffffff800091f0e0:0 completed
mpt0: request 0xffffff8000925590:12671 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000925590:12671 function 0
mpt0: completing timedout/aborted req 0xffffff8000925590:12671
(da0:mpt0:0:25:0): READ(10). CDB: 28 0 94 68 21 90 0 0 8 0 
(da0:mpt0:0:25:0): CAM status: CCB request terminated by the host
(da0:mpt0:0:25:0): Retrying command
mpt0: abort of req 0xffffff8000925590:0 completed
mpt0: request 0xffffff80009222f0:12675 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff80009222f0:12675 function 0
mpt0: completing timedout/aborted req 0xffffff80009222f0:12675
(da0:mpt0:0:25:0): READ(10). CDB: 28 0 94 68 21 90 0 0 8 0 
(da0:mpt0:0:25:0): CAM status: CCB request terminated by the host
(da0:mpt0:0:25:0): Retrying command
mpt0: abort of req 0xffffff80009222f0:0 completed
(da0:mpt0:0:25:0): READ(10). CDB: 28 0 94 68 21 90 0 0 8 0 
(da0:mpt0:0:25:0): CAM status: SCSI Status Error
(da0:mpt0:0:25:0): SCSI status: Check Condition
(da0:mpt0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpt0:0:25:0): Retrying command (per sense data)
(da1:mpt0:0:26:0): READ(10). CDB: 28 0 92 8e 9c 20 0 0 8 0 
(da1:mpt0:0:26:0): CAM status: SCSI Status Error
(da1:mpt0:0:26:0): SCSI status: Check Condition
(da1:mpt0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpt0:0:26:0): Retrying command (per sense data)
mpt0: request 0xffffff80009160e0:12700 timed out for ccb 0xfffffe00170e0000 (req->ccb 0xfffffe00170e0000)
mpt0: attempting to abort req 0xffffff80009160e0:12700 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff80009160e0:12700
mpt0: request 0xffffff800091e5a0:15986 timed out for ccb 0xfffffe00170e0000 (req->ccb 0xfffffe00170e0000)
mpt0: attempting to abort req 0xffffff800091e5a0:15986 function 0
mpt0: completing timedout/aborted req 0xffffff800091e5a0:15986
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 93 40 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 e8 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 38 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 90 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
mpt0: abort of req 0xffffff800091e5a0:0 completed
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 93 40 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: SCSI Status Error
(da2:mpt0:0:27:0): SCSI status: Check Condition
(da2:mpt0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mpt0:0:27:0): Retrying command (per sense data)
mpt0: request 0xffffff8000922c80:16004 timed out for ccb 0xfffffe00170e0000 (req->ccb 0xfffffe00170e0000)
mpt0: attempting to abort req 0xffffff8000922c80:16004 function 0
mpt0: completing timedout/aborted req 0xffffff8000922c80:16004
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 93 40 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 90 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 e8 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 38 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
mpt0: abort of req 0xffffff8000922c80:0 completed
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 38 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: SCSI Status Error
(da2:mpt0:0:27:0): SCSI status: Check Condition
(da2:mpt0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mpt0:0:27:0): Retrying command (per sense data)
mpt0: request 0xffffff8000915fc0:16165 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000915fc0:16165 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff8000915fc0:16165
mpt0: request 0xffffff8000919f50:17083 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000919f50:17083 function 0
mpt0: completing timedout/aborted req 0xffffff8000919f50:17083
(da3:mpt0:0:28:0): READ(10). CDB: 28 0 55 59 c0 48 0 0 8 0 
(da3:mpt0:0:28:0): CAM status: CCB request terminated by the host
(da3:mpt0:0:28:0): Retrying command
(da3:mpt0:0:28:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: CCB request terminated by the host
(da3:mpt0:0:28:0): Retrying command
mpt0: abort of req 0xffffff8000919f50:0 completed
(da3:mpt0:0:28:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: SCSI Status Error
(da3:mpt0:0:28:0): SCSI status: Check Condition
(da3:mpt0:0:28:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpt0:0:28:0): Error 6, Retries exhausted
(da3:mpt0:0:28:0): Invalidating pack
(da3:mpt0:0:28:0): oustanding 1
(da3:mpt0:0:28:0): oustanding 0
mpt0: request 0xffffff8000921960:17108 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000921960:17108 function 0
mpt0: completing timedout/aborted req 0xffffff8000921960:17108
mpt0: abort of req 0xffffff8000921960:0 completed
(da3:mpt0:0:28:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: CCB request terminated by the host
(da3:mpt0:0:28:0): Retrying command
(da3:mpt0:0:28:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: SCSI Status Error
(da3:mpt0:0:28:0): SCSI status: Check Condition
(da3:mpt0:0:28:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpt0:0:28:0): Retrying command (per sense data)
mpt0: request 0xffffff800091c6b0:23050 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091c6b0:23050 function 0
mpt0: completing timedout/aborted req 0xffffff800091c6b0:23050
mpt0: abort of req 0xffffff800091c6b0:0 completed
mpt0: request 0xffffff800091d040:23073 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091d040:23073 function 0
mpt0: completing timedout/aborted req 0xffffff800091d040:23073
mpt0: abort of req 0xffffff800091d040:0 completed
mpt0: request 0xffffff800091fe60:23086 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091fe60:23086 function 0
mpt0: completing timedout/aborted req 0xffffff800091fe60:23086
mpt0: abort of req 0xffffff800091fe60:0 completed
mpt0: request 0xffffff8000923f10:23088 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000923f10:23088 function 0
mpt0: completing timedout/aborted req 0xffffff8000923f10:23088
mpt0: abort of req 0xffffff8000923f10:0 completed
mpt0: request 0xffffff800091db80:23101 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091db80:23101 function 0
mpt0: completing timedout/aborted req 0xffffff800091db80:23101
mpt0: abort of req 0xffffff800091db80:0 completed
mpt0: request 0xffffff800091f0e0:23278 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091f0e0:23278 function 0
mpt0: completing timedout/aborted req 0xffffff800091f0e0:23278
mpt0: abort of req 0xffffff800091f0e0:0 completed
mpt0: request 0xffffff80009246f0:23291 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff80009246f0:23291 function 0
mpt0: completing timedout/aborted req 0xffffff80009246f0:23291
mpt0: abort of req 0xffffff80009246f0:0 completed
mpt0: request 0xffffff80009256b0:23297 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff80009256b0:23297 function 0
mpt0: completing timedout/aborted req 0xffffff80009256b0:23297
mpt0: abort of req 0xffffff80009256b0:0 completed
Now the last hdd from the first pool (da3) appears as faulted.
After a zpool clear everything is normal again, though...
What the hell is going on....????

Re: I almost had a heart attack...

Posted: 07 Jul 2014 10:59
by crowi
Did you switch from the extended gui to the normal gui?
If not, disable the extended gui and test again.
Just to give it a try...

Re: I almost had a heart attack...

Posted: 07 Jul 2014 11:30
by ChriZathens
Yes, I took your previous post into account, and after rebooting the NAS I have now disabled the extension and will see how it goes from there...
Perhaps the extended gui is querying the disks, and in combination with the writes to the disks this leads to cam timeouts???

Re: I almost had a heart attack...

Posted: 07 Jul 2014 11:34
by crowi
Perhaps the extended gui is querying the disks and in combination with the writes to the disks it leads to cam timeouts?
Yes, that is my assumption, at least on my system :?

Re: I almost had a heart attack...

Posted: 07 Jul 2014 11:57
by ChriZathens
If it turns out to be true, it will be a real shame..
I really like the Extended GUI...

Re: I almost had a heart attack...

Posted: 07 Jul 2014 12:16
by b0ssman
BTW, is your card running with the IT firmware?

http://lime-technology.com/forum/index. ... #msg121129

Re: I almost had a heart attack...

Posted: 07 Jul 2014 13:36
by ChriZathens
Yes, latest IT firmware from the LSI website.

Re: I almost had a heart attack...

Posted: 08 Jul 2014 09:33
by ChriZathens
OK, the extended gui is disabled today...
Transferring a few files showed no issues..
However, I tried to run a script which shows the status of the disks:

Code:

#!/bin/sh
# Show spin state and temperature for every disk of every pool
ZPOOL=/sbin/zpool
SMARTCTL=/usr/local/sbin/smartctl
pools=$($ZPOOL list -H -o name)
for pool in $pools; do
    echo ""
    echo "------------ ZFS Pool: $pool ------------"
    devices=$($ZPOOL iostat -v $pool | grep da | awk '{print $1}')
    for disk in $devices; do
        $SMARTCTL -n standby -q silent -A /dev/$disk
        case $? in
            2) MSG=" Standby"; MSG_TEMP="n/a";;
            0) MSG="Spinning"; MSG_TEMP=$($SMARTCTL -n standby -A /dev/$disk | awk '/Temperature_Celsius/ {print $10}');;
            *) MSG=" Unknown"; MSG_TEMP="n/a";;
        esac
        echo "Disk: $disk -> $MSG | Temperature: $MSG_TEMP"
    done
done
The result was a temporary hang, and dmesg reported the following:

Code:

mpt0: request 0xffffff800091ce00:39910 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff800091ce00:39910 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff800091ce00:39910
mpt0: request 0xffffff8000926f70:39960 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000926f70:39960 function 0
mpt0: completing timedout/aborted req 0xffffff8000926f70:39960
mpt0: abort of req 0xffffff8000926f70:0 completed
mpt0: request 0xffffff8000925c50:39980 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000925c50:39980 function 0
mpt0: completing timedout/aborted req 0xffffff8000925c50:39980
mpt0: abort of req 0xffffff8000925c50:0 completed
mpt0: request 0xffffff800091a220:39993 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff800091a220:39993 function 0
mpt0: completing timedout/aborted req 0xffffff800091a220:39993
mpt0: abort of req 0xffffff800091a220:0 completed
mpt0: request 0xffffff800091cd70:39995 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff800091cd70:39995 function 0
mpt0: completing timedout/aborted req 0xffffff800091cd70:39995
mpt0: abort of req 0xffffff800091cd70:0 completed
mpt0: request 0xffffff8000926940:40008 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000926940:40008 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff8000926940:40008
mpt0: request 0xffffff8000921330:40061 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000921330:40061 function 0
mpt0: completing timedout/aborted req 0xffffff8000921330:40061
mpt0: abort of req 0xffffff8000921330:0 completed
Ideas?
EDIT
P.S.: I noticed that I ran the script at the same time as a 30-minute cron job (a job that stops the disks):

Code:

#!/bin/sh
# Simple script to put the disks in standby mode
# in case they cannot be stopped via Nas4Free settings.
# It can be used as a cronjob.
# It checks if the disks are part of a pool with a scrub
# in progress, or if they are already in standby mode.
ZPOOL=/sbin/zpool
SMARTCTL=/usr/local/sbin/smartctl
LOG=/mnt/stoplog.txt
# Get the names of our pools
pools=$($ZPOOL list -H -o name)
echo "===================================" >> $LOG
date >> $LOG
echo "===================================" >> $LOG
for pool in $pools; do
    echo "*------------- $pool -------------*" >> $LOG
    # Get device names for this pool
    pool_devices=$($ZPOOL iostat -v $pool | grep da | awk '{print $1}')
    # We don't want to stop a disk while the pool is scrubbing, so let's check
    if $ZPOOL status -v $pool | grep -q "scrub in progress"; then
        echo " scrub in progress... skipping" >> $LOG
        echo "" >> $LOG
    else
        for device in $pool_devices; do
            # If a disk is already stopped, there is no need to stop it again
            if $SMARTCTL --nocheck standby -i /dev/$device | grep -q "STANDBY"; then
                echo "Disk $device is already in standby mode" >> $LOG
            else
                echo -n "Stop $device: " >> $LOG
                # Stop the disk
                camcontrol stop $device >> $LOG
            fi
        done
        echo "done.." >> $LOG
        sleep 5
    fi
done
It seems that those two scripts together are overwhelming the controller???? Is this normal?
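If the two scripts really do collide, one low-tech way to test that theory is to serialize them with a mkdir-based lock, so the SMART queries and the camcontrol standby job never hit the controller at the same time. A sketch - the lock path and function names are my own, not part of NAS4Free:

```shell
#!/bin/sh
# Crude mutual exclusion between disk-touching scripts: mkdir is atomic,
# so only one script at a time can hold the lock directory.
LOCK=/tmp/disk-query.lock

acquire_lock() {
    while ! mkdir "$LOCK" 2>/dev/null; do
        sleep 1   # another script is talking to the disks; wait
    done
}

release_lock() {
    rmdir "$LOCK"
}

# Wrap the body of each script like this:
acquire_lock
echo "lock held - safe to run smartctl / camcontrol here"
release_lock
```

Both the status script and the standby cron job would source or copy these two functions and wrap their disk loops in acquire_lock/release_lock.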

Re: I almost had a heart attack...

Posted: 08 Jul 2014 09:53
by crowi
Is this normal?
I don't know if this is normal, but my Marvell chip also does not like the queries (caused by the ex-gui) during transfers.
So it might be a BSD-internal issue, but I have also noticed system hangs on Windows when trying to run SMART reporting tools (e.g. CrystalDiskInfo) during transfers. :-/

Re: I almost had a heart attack...

Posted: 08 Jul 2014 09:55
by b0ssman
I just noticed that you have 4 Samsung 2TB drives. Are they Samsung EcoGreen F4?

If so, did you update the firmware? Without that update, they do not like being SMART-queried at the same time as a data transfer.

Re: I almost had a heart attack...

Posted: 08 Jul 2014 10:12
by ChriZathens
There was an issue back when I bought them, about a defective firmware.
So when I bought them, the first thing I did was upgrade the firmware to the supposedly fixed one. (I also remember that the defective and the fixed firmware had the same version number, and people were accusing Samsung of wanting to cover up the issue.)

But the issue also occurred in the first pool; specifically, disks da0 and da2 showed those timeout errors, and they are Seagates...
Thanks a lot for your ideas, guys! I hope to finally find out what is going on (I am starting to blame the LSI controller... After all, mpt0 IS the LSI controller which throws the errors, although it might be a driver issue rather than a controller issue).

P.S.: For now I have also changed the cron job which stops the disks; it now runs every hour instead of every 30 minutes.
But the real fix would be to get standby working on LSI controllers (come on, FreeBSD guys, fix the code please!!!)

Re: I almost had a heart attack...

Posted: 15 Jul 2014 10:01
by crest
Hello again ChriZathens!
ChriZathens wrote: Yes, I considered your previous post and after rebooting the NAS, I have now disabled the Extension and will see how it goes from there...
Perhaps the extended gui is querying the disks and in combination with the writes to the disks it leads to cam timeouts???
Back from vacation, I found your thread, and yes, eGUI queries the disks frequently (about every 6-9 seconds, along with the refresh of the status page). I have already thought about another solution for these queries which would lower the number of disk accesses at the cost of slightly reduced accuracy, via the "System calls service delay time" parameter in eGUI - in my opinion this is not a big drawback if disk states and temperatures are shown a few seconds late...

I also discussed a performance issue during a running scrub on ZFS with crowi, and I came to the conclusion that I should do a complete redesign of the disk queries to lower the load caused by eGUI, but this will take a lot of time which I am currently not able to spend... :(

regards
crest

Re: I almost had a heart attack...

Posted: 15 Jul 2014 12:42
by ChriZathens
Hello, Crest!
Nice to have you back!
No worries, man. Take your time. We are already much obliged for this great extension you have offered us!
Cheers!!!!