This is the old XigmaNAS forum in read-only mode; it will be taken offline by the end of March 2021!

We would like to ask users and admins to recreate important posts from here in the new main forum! It is not possible for us to export posts from here and import them into the main forum!

I almost had a heart attack...

Hard disks, HDD, RAID Hardware, disk controllers, SATA, PATA, SCSI, IDE, On Board, USB, Firewire, CF (Compact Flash)
ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

I almost had a heart attack...

Post by ChriZathens »

Hello to all...
A few days ago I had ripped a movie (1080p, about 16GB in size) and yesterday I wanted to transfer it to my NAS.
My second pool, consisting of 4x2TB disks, has a total of 5.2TB of space and contains all my movies. It has about 500GB free now.
Anyway, I noticed that while the transfer was in progress, the WebGUI showed some hangs. I figured it could not be caused by high CPU usage, since the machine I was transferring from is an Atom machine with a not-so-great NIC, capable of sending at about 25MB/sec. So I found it odd to see WebGUI hangs due to high CPU usage.
Indeed, when I managed to see the status page again, I saw to my surprise that the pool appeared as degraded and device da7 was shown as faulted. Damn, I thought, a disk has failed on me...
Luckily, I have a new spare 2TB WD Red, so I planned to replace the faulty disk as soon as I returned from work (I was watching this remotely from my office).
I should also mention that the four Samsung 2TB disks were all bought together, but after two months one of them started having unrecoverable errors and was RMA'd. The disk that now appeared as faulted was that replacement one...
In the meantime the transfer was still running. The point where I almost had a heart attack was when I refreshed the status page after about 5 minutes (it had hung again...) and the pool now appeared as unavailable. zpool status showed 2 faulted disks :shock: :shock: :shock:
After recovering without (I hope) any lasting brain damage, I told myself that it can't be that 2 disks failed on me at the same time (especially since I remembered the first one was not from the same batch).
I thought that maybe it was a cabling error, and I shut down the NAS so I could look into it in the afternoon.
I then realized it was not shutting down, though, so I instructed the babysitter (my mother-in-law, lol) to press and hold the power button. But after a while I told her to turn it on again, because I was curious.
Indeed, after it turned on, the pool was online again with a notice that it had resilvered 3GB in 1m without finding errors.
It shut down successfully this time, and today I decided to check it again and possibly replace the cable that connects the four 2TB drives (it is an SFF-8087 to 4x SATA breakout cable).
But after turning it on, I transferred the movie again and this time everything was fine... :?
So I decided to run a "stress test" by scrubbing the pool (I will probably not let it finish).
As of now the outcome is:

Code:

pool: Media2
 state: ONLINE
  scan: scrub in progress since Sat Jul  5 18:13:08 2014
        460G scanned out of 6.42T at 204M/s, 8h31m to go
        0 repaired, 7.00% done
config:

	NAME        STATE     READ WRITE CKSUM
	Media2      ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da4     ONLINE       0     0     0

errors: No known data errors
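As a quick aside, the ETA that zpool reports is just remaining data divided by scan rate; the one-liner below (a sketch, with the figures taken from the status output above) reproduces it:

```shell
# Sanity-check the scrub ETA from zpool status: remaining data / scan rate.
# Figures (6.42T total, 460G scanned, 204M/s) come from the output above.
awk 'BEGIN {
    remaining_gib = 6.42 * 1024 - 460      # TiB -> GiB, minus what is already scanned
    secs = remaining_gib * 1024 / 204      # GiB -> MiB, divided by MiB/s
    printf "%dh%dm to go\n", secs / 3600, (secs % 3600) / 60
}'
```

It prints "8h31m to go", matching the status line.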
I am still not sure whether or not the cable is faulty... (any ideas welcome...)

Just wanted to share this little adventure with you and possibly get some ideas about what the problem might be.
And just for the record, half of the data in that pool (the half I care about) is also backed up. But losing so much data, even with a backup available, is always frustrating...

P.S.: Additional info:
  1. The S.M.A.R.T. values are not telling me anything. In particular, for the disks that appeared as faulted, UDMA_CRC_Error_Count is 0 (whereas in the past, when I had a failed cable, that number climbed to 37 for the disk attached to the faulty cable).
  2. The NAS is inside a wall-mounted cabinet, and the room's air-conditioning unit sends cool air directly at the cabinet. At the time of the problem, however, the air conditioning was turned off, and temperatures in Athens at this time of year are above 30°C. So another idea is that maybe the whole SAS controller overheated and started throwing errors (?). Right now the air conditioning is on, but I will turn it off and see what happens... Now, after an hour of scrubbing, the pool's disks are at 32°C, but without the air conditioning they reach almost 50°C while scrubbing (in normal usage they are around 42-44°C).
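For reference, the UDMA_CRC_Error_Count check in point 1 is easy to script. This is just a sketch: the sample line below is a mocked-up smartctl attribute row (the exact table layout varies per drive); on the live system you would pipe `smartctl -A /dev/da7` instead of the echo.

```shell
# Pull the raw UDMA_CRC_Error_Count value out of a `smartctl -A` attribute table.
# The sample row is hypothetical; replace the echo with `smartctl -A /dev/da7`
# on a real system. The raw value is the last field of the attribute row.
sample='199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 37'
echo "$sample" | awk '/UDMA_CRC_Error_Count/ {print $NF}'
```

A nonzero (and growing) value here typically points at the cable or connector rather than the platters.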
My Nas
  1. Case: Fractal Design Define R2
  2. M/B: Supermicro x9scl-f
  3. CPU: Intel Celeron G1620
  4. RAM: 16GB DDR3 ECC (2 x Kingston KVR1333D3E9S/8G)
  5. PSU: Chieftec 850w 80+ modular
  6. Storage: 8x2TB HDDs in a RaidZ2 array ~ 10.1 TB usable disk space
  7. O/S: XigmaNAS 11.2.0.4.6625 -amd64 embedded
  8. Extra H/W: Dell Perc H310 SAS controller, crossflashed to LSI 9211-8i IT mode, 8GB Innodisk D150SV SATADOM for O/S

Backup Nas: U-NAS NSC-400, Gigabyte MB10-DS4 (4x4TB Seagate Exos disks in RaidZ configuration - 32GB RAM)

b0ssman
Forum Moderator
Posts: 2438
Joined: 14 Feb 2013 08:34
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by b0ssman »

Did you save the dmesg output?
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

Nope (I forgot to...)
I am monitoring it now and hopefully I will catch something helpful (although it never happens when you expect it...)

crowi
Forum Moderator
Posts: 1176
Joined: 21 Feb 2013 16:18
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by crowi »

Are you using the normal WebGUI or the extended GUI?
Using the extended GUI gives me AHCI errors when accessing the WebGUI during hard-disk load, e.g.:
Jul 4 07:33:15 nas4free kernel: ahcich2: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 40 serr 00000000 cmd 10008517
Jul 4 07:33:15 nas4free kernel: ahcich2: Timeout on slot 5 port 0
NAS 1: Milchkuh: Asrock C2550D4I, Intel Avoton C2550 Quad-Core, 16GB DDR3 ECC, 5x3TB WD Red RaidZ1 +60 GB SSD for ZIL/L2ARC, APC-Back UPS 350 CS, NAS4Free 11.0.0.4.3460 embedded
NAS 2: Backup: HP N54L, 8 GB ECC RAM, 4x4 TB WD Red, RaidZ1, NAS4Free 11.0.0.4.3460 embedded
NAS 3: Office: HP N54L, 8 GB ECC RAM, 2x3 TB WD Red, ZFS Mirror, APC-Back UPS 350 CS NAS4Free 11.0.0.4.3460 embedded

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

I am using the extended GUI...
In the meantime:

Code:

 pool: Media2
 state: ONLINE
  scan: scrub in progress since Sat Jul  5 18:13:08 2014
        1.65T scanned out of 6.42T at 206M/s, 6h43m to go
        0 repaired, 25.67% done
config:

	NAME        STATE     READ WRITE CKSUM
	Media2      ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da4     ONLINE       0     0     0

errors: No known data errors
And temps:
CPU: 50°C
Disks from 44 to 46°C (air conditioning off - outside temperature 28°C at 20:30 in the evening)

crowi
Forum Moderator
Posts: 1176
Joined: 21 Feb 2013 16:18
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by crowi »

Disks from 44 to 46
That's not really critical but I wouldn't feel comfortable above 40°C.
Can you put a fan near the disks?

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

Now, after almost 4 hours of scrubbing without A/C, temperatures have gone up to 50-52°C.
Under normal conditions the A/C is always on while scrubbing - but today I wanted to perform the stress test...
And BTW, the case has 8 trays for HDDs, and in front of them there are already two 120mm fans for HDD cooling.
And I am sure that normal temperatures for the disks while scrubbing are lower; it is just that in my case the NAS sits inside a closed cabinet, as I mentioned in my first post.
Imagine that, in the summer and without A/C, back when the NAS was on 24/7, the CPU temperature would drop about 5 degrees just by opening the cabinet door.

Anyway, the thing is that after all this time everything is still normal...
I will let it scrub a little longer, then cancel the scrub and shut it down... If it happens again in the following days, I will decide what to do...

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

OK, an update on my issue:
Timeouts occurred again today, and now it is the first pool that appears to have problems...
This is the relevant output of dmesg:

Code:

mpt0: request 0xffffff800091fdd0:12656 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091fdd0:12656 function 0
mpt0: completing timedout/aborted req 0xffffff800091fdd0:12656
mpt0: abort of req 0xffffff800091fdd0:0 completed
mpt0: request 0xffffff800091f0e0:12658 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091f0e0:12658 function 0
mpt0: completing timedout/aborted req 0xffffff800091f0e0:12658
mpt0: abort of req 0xffffff800091f0e0:0 completed
mpt0: request 0xffffff8000925590:12671 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000925590:12671 function 0
mpt0: completing timedout/aborted req 0xffffff8000925590:12671
(da0:mpt0:0:25:0): READ(10). CDB: 28 0 94 68 21 90 0 0 8 0 
(da0:mpt0:0:25:0): CAM status: CCB request terminated by the host
(da0:mpt0:0:25:0): Retrying command
mpt0: abort of req 0xffffff8000925590:0 completed
mpt0: request 0xffffff80009222f0:12675 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff80009222f0:12675 function 0
mpt0: completing timedout/aborted req 0xffffff80009222f0:12675
(da0:mpt0:0:25:0): READ(10). CDB: 28 0 94 68 21 90 0 0 8 0 
(da0:mpt0:0:25:0): CAM status: CCB request terminated by the host
(da0:mpt0:0:25:0): Retrying command
mpt0: abort of req 0xffffff80009222f0:0 completed
(da0:mpt0:0:25:0): READ(10). CDB: 28 0 94 68 21 90 0 0 8 0 
(da0:mpt0:0:25:0): CAM status: SCSI Status Error
(da0:mpt0:0:25:0): SCSI status: Check Condition
(da0:mpt0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpt0:0:25:0): Retrying command (per sense data)
(da1:mpt0:0:26:0): READ(10). CDB: 28 0 92 8e 9c 20 0 0 8 0 
(da1:mpt0:0:26:0): CAM status: SCSI Status Error
(da1:mpt0:0:26:0): SCSI status: Check Condition
(da1:mpt0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da1:mpt0:0:26:0): Retrying command (per sense data)
mpt0: request 0xffffff80009160e0:12700 timed out for ccb 0xfffffe00170e0000 (req->ccb 0xfffffe00170e0000)
mpt0: attempting to abort req 0xffffff80009160e0:12700 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff80009160e0:12700
mpt0: request 0xffffff800091e5a0:15986 timed out for ccb 0xfffffe00170e0000 (req->ccb 0xfffffe00170e0000)
mpt0: attempting to abort req 0xffffff800091e5a0:15986 function 0
mpt0: completing timedout/aborted req 0xffffff800091e5a0:15986
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 93 40 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 e8 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 38 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 90 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
mpt0: abort of req 0xffffff800091e5a0:0 completed
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 93 40 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: SCSI Status Error
(da2:mpt0:0:27:0): SCSI status: Check Condition
(da2:mpt0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mpt0:0:27:0): Retrying command (per sense data)
mpt0: request 0xffffff8000922c80:16004 timed out for ccb 0xfffffe00170e0000 (req->ccb 0xfffffe00170e0000)
mpt0: attempting to abort req 0xffffff8000922c80:16004 function 0
mpt0: completing timedout/aborted req 0xffffff8000922c80:16004
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 93 40 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 90 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 e8 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 38 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: CCB request terminated by the host
(da2:mpt0:0:27:0): Retrying command
mpt0: abort of req 0xffffff8000922c80:0 completed
(da2:mpt0:0:27:0): WRITE(10). CDB: 2a 0 4c 25 92 38 0 0 58 0 
(da2:mpt0:0:27:0): CAM status: SCSI Status Error
(da2:mpt0:0:27:0): SCSI status: Check Condition
(da2:mpt0:0:27:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da2:mpt0:0:27:0): Retrying command (per sense data)
mpt0: request 0xffffff8000915fc0:16165 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000915fc0:16165 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff8000915fc0:16165
mpt0: request 0xffffff8000919f50:17083 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000919f50:17083 function 0
mpt0: completing timedout/aborted req 0xffffff8000919f50:17083
(da3:mpt0:0:28:0): READ(10). CDB: 28 0 55 59 c0 48 0 0 8 0 
(da3:mpt0:0:28:0): CAM status: CCB request terminated by the host
(da3:mpt0:0:28:0): Retrying command
(da3:mpt0:0:28:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: CCB request terminated by the host
(da3:mpt0:0:28:0): Retrying command
mpt0: abort of req 0xffffff8000919f50:0 completed
(da3:mpt0:0:28:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: SCSI Status Error
(da3:mpt0:0:28:0): SCSI status: Check Condition
(da3:mpt0:0:28:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpt0:0:28:0): Error 6, Retries exhausted
(da3:mpt0:0:28:0): Invalidating pack
(da3:mpt0:0:28:0): oustanding 1
(da3:mpt0:0:28:0): oustanding 0
mpt0: request 0xffffff8000921960:17108 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000921960:17108 function 0
mpt0: completing timedout/aborted req 0xffffff8000921960:17108
mpt0: abort of req 0xffffff8000921960:0 completed
(da3:mpt0:0:28:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: CCB request terminated by the host
(da3:mpt0:0:28:0): Retrying command
(da3:mpt0:0:28:0): READ CAPACITY(10). CDB: 25 0 0 0 0 0 0 0 0 0 
(da3:mpt0:0:28:0): CAM status: SCSI Status Error
(da3:mpt0:0:28:0): SCSI status: Check Condition
(da3:mpt0:0:28:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpt0:0:28:0): Retrying command (per sense data)
mpt0: request 0xffffff800091c6b0:23050 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091c6b0:23050 function 0
mpt0: completing timedout/aborted req 0xffffff800091c6b0:23050
mpt0: abort of req 0xffffff800091c6b0:0 completed
mpt0: request 0xffffff800091d040:23073 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091d040:23073 function 0
mpt0: completing timedout/aborted req 0xffffff800091d040:23073
mpt0: abort of req 0xffffff800091d040:0 completed
mpt0: request 0xffffff800091fe60:23086 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091fe60:23086 function 0
mpt0: completing timedout/aborted req 0xffffff800091fe60:23086
mpt0: abort of req 0xffffff800091fe60:0 completed
mpt0: request 0xffffff8000923f10:23088 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff8000923f10:23088 function 0
mpt0: completing timedout/aborted req 0xffffff8000923f10:23088
mpt0: abort of req 0xffffff8000923f10:0 completed
mpt0: request 0xffffff800091db80:23101 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091db80:23101 function 0
mpt0: completing timedout/aborted req 0xffffff800091db80:23101
mpt0: abort of req 0xffffff800091db80:0 completed
mpt0: request 0xffffff800091f0e0:23278 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff800091f0e0:23278 function 0
mpt0: completing timedout/aborted req 0xffffff800091f0e0:23278
mpt0: abort of req 0xffffff800091f0e0:0 completed
mpt0: request 0xffffff80009246f0:23291 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff80009246f0:23291 function 0
mpt0: completing timedout/aborted req 0xffffff80009246f0:23291
mpt0: abort of req 0xffffff80009246f0:0 completed
mpt0: request 0xffffff80009256b0:23297 timed out for ccb 0xfffffe0013f18800 (req->ccb 0xfffffe0013f18800)
mpt0: attempting to abort req 0xffffff80009256b0:23297 function 0
mpt0: completing timedout/aborted req 0xffffff80009256b0:23297
mpt0: abort of req 0xffffff80009256b0:0 completed
Now the last disk of the first pool (da3) appears as faulted.
After a zpool clear everything is normal again, though...
What the hell is going on...?

crowi
Forum Moderator
Posts: 1176
Joined: 21 Feb 2013 16:18
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by crowi »

Did you switch from the extended GUI to the normal GUI?
If not, disable the extended GUI and test again.
Just to give it a try...

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

Yes, I took your previous post into account; after rebooting the NAS I have now disabled the extension and will see how it goes from there...
Perhaps the extended GUI is querying the disks, and in combination with the writes to the disks this leads to CAM timeouts?

crowi
Forum Moderator
Posts: 1176
Joined: 21 Feb 2013 16:18
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by crowi »

Perhaps the extended GUI is querying the disks, and in combination with the writes to the disks this leads to CAM timeouts?
Yes, that is my assumption, at least on my system :?

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

If it turns out to be true, that will be a real shame...
I really like the extended GUI...

b0ssman
Forum Moderator
Posts: 2438
Joined: 14 Feb 2013 08:34
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by b0ssman »

BTW, is your card running the IT firmware?

http://lime-technology.com/forum/index. ... #msg121129

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

Yes, the latest IT firmware from the LSI website.

ChriZathens
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

OK, the extended GUI is disabled today...
Transferring a few files showed no issues...
However, I then ran a script which shows the status of the disks:

Code:

#!/bin/sh
# Report the spin state and temperature of every disk in every pool
ZPOOL=/sbin/zpool
SMARTCTL=/usr/local/sbin/smartctl

pools=$($ZPOOL list -H -o name)
for pool in $pools; do
    echo ""
    echo "------------ ZFS Pool: $pool ------------"
    devices=$($ZPOOL iostat -v $pool | grep da | awk '{print $1}')
    for disk in $devices; do
        # -n standby: exit code 2 means the disk is spun down
        $SMARTCTL -n standby -q silent -A /dev/$disk
        case $? in
            2) MSG=" Standby"; MSG_TEMP="n/a";;
            0) MSG="Spinning"; MSG_TEMP=$($SMARTCTL -n standby -A /dev/$disk | awk '/Temperature_Celsius/ {print $10}');;
        esac

        echo "Disk: $disk -> $MSG | Temperature: $MSG_TEMP"
    done
done
The result was a temporary hang and dmesg reported the following:

Code:

mpt0: request 0xffffff800091ce00:39910 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff800091ce00:39910 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff800091ce00:39910
mpt0: request 0xffffff8000926f70:39960 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000926f70:39960 function 0
mpt0: completing timedout/aborted req 0xffffff8000926f70:39960
mpt0: abort of req 0xffffff8000926f70:0 completed
mpt0: request 0xffffff8000925c50:39980 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000925c50:39980 function 0
mpt0: completing timedout/aborted req 0xffffff8000925c50:39980
mpt0: abort of req 0xffffff8000925c50:0 completed
mpt0: request 0xffffff800091a220:39993 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff800091a220:39993 function 0
mpt0: completing timedout/aborted req 0xffffff800091a220:39993
mpt0: abort of req 0xffffff800091a220:0 completed
mpt0: request 0xffffff800091cd70:39995 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff800091cd70:39995 function 0
mpt0: completing timedout/aborted req 0xffffff800091cd70:39995
mpt0: abort of req 0xffffff800091cd70:0 completed
mpt0: request 0xffffff8000926940:40008 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000926940:40008 function 0
mpt0: mpt_wait_req(1) timed out
mpt0: mpt_recover_commands: abort timed-out. Resetting controller
mpt0: mpt_cam_event: 0x80
mpt0: completing timedout/aborted req 0xffffff8000926940:40008
mpt0: request 0xffffff8000921330:40061 timed out for ccb 0xfffffe001708f800 (req->ccb 0xfffffe001708f800)
mpt0: attempting to abort req 0xffffff8000921330:40061 function 0
mpt0: completing timedout/aborted req 0xffffff8000921330:40061
mpt0: abort of req 0xffffff8000921330:0 completed
Ideas?
EDIT
P.S.: I noticed that I ran the script at the same time as a 30-minute cron job (a job that stops the disks):

Code:

#!/bin/sh
# Simple script to put the disks into standby mode
# in case they cannot be stopped via Nas4Free settings.
# It can be used as a cron job.
# It checks whether the disks are part of a pool with a scrub
# in progress, or whether they are already in standby mode.
ZPOOL=/sbin/zpool
SMARTCTL=/usr/local/sbin/smartctl
LOG=/mnt/stoplog.txt

# Get the names of our pools
pools=$($ZPOOL list -H -o name)
echo "===================================" >> $LOG
date >> $LOG
echo "===================================" >> $LOG
for pool in $pools; do
    echo "*------------- $pool -------------*" >> $LOG
    # Get the device names for this pool
    pool_devices=$($ZPOOL iostat -v $pool | grep da | awk '{print $1}')
    # We don't want to stop a disk while the pool is scrubbing, so let's check
    if $ZPOOL status -v $pool | grep -q "scrub in progress"; then
        echo " scrub in progress... skipping" >> $LOG
        echo "" >> $LOG
    else
        for device in $pool_devices; do
            # If a disk is already stopped, there is no need to stop it again
            if $SMARTCTL --nocheck standby -i /dev/$device | grep -q "STANDBY"; then
                echo "Disk $device is already in standby mode" >> $LOG
            else
                echo -n "Stop $device: " >> $LOG
                # Stop the disk
                camcontrol stop $device >> $LOG
            fi
        done
        echo "done.." >> $LOG
        sleep 5
    fi
done
It seems that those two scripts together are exhausting the controller? Is this normal?
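If the collision between the two scripts really is the trigger, one low-tech mitigation (just a sketch; the lock path is an arbitrary, hypothetical choice) is to make both scripts share a mutual-exclusion lock, so SMART queries and camcontrol stops never hit the controller at the same time. mkdir is atomic, so it works as a portable lock primitive in sh:

```shell
#!/bin/sh
# Mutual exclusion between disk-maintenance scripts via an atomic mkdir.
# /tmp/diskjob.lock is an arbitrary, hypothetical path - any shared
# location both scripts can see will do.
LOCKDIR=/tmp/diskjob.lock
if mkdir "$LOCKDIR" 2>/dev/null; then
    # We own the lock; make sure it is released even if the script dies.
    trap 'rmdir "$LOCKDIR"' EXIT INT TERM
    echo "lock acquired"
    # ... smartctl / camcontrol work goes here ...
else
    echo "another disk job is running, skipping this run"
    exit 0
fi
```

Putting this guard at the top of both the status script and the standby cron job would serialize them; the skipped run simply retries at its next scheduled slot.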

User avatar
crowi
Forum Moderator
Forum Moderator
Posts: 1176
Joined: 21 Feb 2013 16:18
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by crowi »

ChriZathens wrote: Is this normal?
I don't know if it is normal, but my Marvell chip also does not like the queries (caused by the eGUI) during transfers.
So it might be a BSD-internal issue, but I have also noticed system hangs on Windows when trying to run SMART reporting tools (e.g. CrystalDiskInfo) during transfers. :-/
NAS 1: Milchkuh: Asrock C2550D4I, Intel Avoton C2550 Quad-Core, 16GB DDR3 ECC, 5x3TB WD Red RaidZ1 +60 GB SSD for ZIL/L2ARC, APC-Back UPS 350 CS, NAS4Free 11.0.0.4.3460 embedded
NAS 2: Backup: HP N54L, 8 GB ECC RAM, 4x4 TB WD Red, RaidZ1, NAS4Free 11.0.0.4.3460 embedded
NAS 3: Office: HP N54L, 8 GB ECC RAM, 2x3 TB WD Red, ZFS Mirror, APC-Back UPS 350 CS NAS4Free 11.0.0.4.3460 embedded

User avatar
b0ssman
Forum Moderator
Forum Moderator
Posts: 2438
Joined: 14 Feb 2013 08:34
Location: Munich, Germany
Status: Offline

Re: I almost had a heart attack...

Post by b0ssman »

I just noticed that you have four Samsung 2TB drives. Are they Samsung EcoGreen F4?

If so, did you update the firmware? Without the update, they do not like being SMART-queried at the same time as a data transfer.
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.

User avatar
ChriZathens
Forum Moderator
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

There was an issue back in the day when I bought them, about defective firmware.
So the first thing I did after buying them was to upgrade the firmware to the supposedly fixed one. (I also remember that the defective and the fixed firmware carried the same version number, and people were accusing Samsung of trying to cover up the issue.)

But the issue also occurred on the first pool; specifically, disks da0 and da2 showed the exhaustion errors, and those are Seagates...
Thanks a lot for your ideas, guys! I hope to finally find out what is going on (I am starting to blame the LSI controller... After all, mpt0 IS the LSI controller which throws the errors, although it might be a driver issue rather than a controller issue).

P.S.: For now I have also changed the cron job which stops the disks; it now runs every hour instead of every 30 minutes.
But the real fix would be to get standby working on LSI controllers (come on, FreeBSD guys, fix the code please!!!)
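For reference, moving a spin-down script like the one above from a 30-minute schedule to an hourly one is a one-line crontab change (the script path here is a placeholder, not the actual one used):

```
# Before: run every 30 minutes
# */30 * * * * /mnt/scripts/stop_disks.sh
# After: run at minute 0 of every hour
0 * * * * /mnt/scripts/stop_disks.sh
```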

User avatar
crest
Hardware & Software Guru
Hardware & Software Guru
Posts: 538
Joined: 02 Jul 2012 22:25
Location: Vienna, Austria - GMT+1
Status: Offline

Re: I almost had a heart attack...

Post by crest »

Hello again ChriZathens!
ChriZathens wrote: Yes, I considered your previous post, and after rebooting the NAS I have now disabled the extension and will see how it goes from there...
Perhaps the extended GUI is querying the disks, and in combination with the writes to the disks this leads to CAM timeouts?
Back from vacation I found your thread, and yes, eGUI does query the disks frequently (about every 6 to 9 seconds, along with the refresh of the status page). I have already thought about another solution for these queries, which would lower the number of disk accesses at slightly reduced accuracy, via the "System calls service delay time" parameter in eGUI. In my opinion this is not a big drawback if disk states and temperatures are shown with a delay of a few seconds...

I also discussed a performance issue while running a scrub on ZFS with crowi, and I came to the conclusion that I should do a complete redesign of the disk queries to lower the load caused by eGUI, but this will take a lot of time which I am currently not able to spend ... :(
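One low-effort way such a redesign could lower the load, sketched here purely as an assumption (the function name and cache path are made up, not part of eGUI), is to cache each query result for a short time, so repeated status-page refreshes reuse the last answer instead of hitting the disks again:

```shell
#!/bin/sh
# Cache a command's output for a number of seconds.
# Usage: cached <ttl_seconds> <command> [args...]
# A second call within the TTL returns the stored output
# without re-running the command (i.e. without touching the disk).
cached() {
    ttl=$1; shift
    # Build a filesystem-safe cache key from the command line
    key=$(printf '%s' "$*" | tr -c 'A-Za-z0-9' '_')
    file="/tmp/egui_cache_${key}"
    stamp="${file}.t"
    now=$(date +%s)
    if [ -f "$file" ] && [ -f "$stamp" ]; then
        ts=$(cat "$stamp")
        if [ $(( now - ts )) -lt "$ttl" ]; then
            cat "$file"
            return 0
        fi
    fi
    # Cache miss or stale: run the command and record the result
    "$@" > "$file"
    echo "$now" > "$stamp"
    cat "$file"
}

# Hypothetical use: at most one real SMART query per disk per minute
# cached 60 smartctl --nocheck standby -A /dev/da0
```

The commented-out `smartctl` line is only a hypothetical use; the point is that the status page would call `cached ...` instead of the raw query.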

regards
crest
NAS1: 11.2.0.4 - Omnius (Revision 6766) x64-embedded; MSI 760GM-P23; AMD Athlon(tm) II X2 250 7.58GiB RAM
NAS2: 11.2.0.4 - Omnius (Revision 6766) x64-embedded; MSI MS-7369; AMD Sempron(tm) LE-1250 8022MiB RAM
UPS: APC Back-UPS ES 550G
Extensions: OneButtonInstaller, Extended GUI, NextOwnCloud, BitTorrent Sync, Syncthing, Downloady, Midnight Commander, NCDU, MySQL, Rclone, Themes:

User avatar
ChriZathens
Forum Moderator
Forum Moderator
Posts: 758
Joined: 23 Jun 2012 09:14
Location: Athens, Greece
Contact:
Status: Offline

Re: I almost had a heart attack...

Post by ChriZathens »

Hello, Crest!
Nice to have you back!
No worries, man. Take your time. We are already much obliged to you for this great extension you have offered us!
Cheers!!!!
