I almost had a heart attack...
Posted: 05 Jul 2014 18:00
Hello to all...
A few days ago I ripped a movie (1080p, about 16GB in size) and yesterday I wanted to transfer it to my NAS.
My second pool, consisting of 4x2TB disks, has a total of 5.2TB of space and contains all my movies. It has about 500GB of free space now.
Anyway, I noticed that while the transfer was in progress, the webui showed some hangs. I thought they could not be caused by high CPU usage, since the machine I was transferring from is an Atom machine with a not-so-great NIC, which is capable of sending at only about 25MB/sec. So I found it odd to have webui hangs due to high CPU usage.
Indeed, after I managed to see the status page again, I saw to my surprise that the pool appeared as degraded and device da7 was shown as faulted. Damn, I said, a disk failed me..
Luckily, I have a new spare 2TB WD Red, so I thought I should replace the faulty disk as soon as I returned from work (I was doing the job remotely from my office).
I should also mention that the 4 Samsung 2TB disks were all bought together, but after two months one of them started having unrecoverable errors and it was RMA'd. The disk which now appeared as faulted was that one (the RMA replacement)...
In the meantime the transfer was continuing. The point where I almost had a heart attack was when I refreshed the status page after about 5 minutes (it had hung again..) and the pool now appeared as unavailable. zpool status now showed 2 faulted disks.
After recovering without any brain damage (I hope), I thought to myself that it can't be that 2 disks failed on me at the same time (especially because I remembered that the first one was not from the same batch).
I thought that maybe it was a cabling problem, so I shut down the NAS in order to look into it in the afternoon.
I realized that it was not shutting down, though, so I instructed the babysitter (my mother-in-law, lollll) to press and hold the power button. But after a while I told her to turn it on again, because I was curious.
Indeed, after it turned on, the pool was online again, with a notice that it had resilvered 3GB in 1m without finding errors.
It shut down successfully this time, and today I decided to check it again and possibly change the cable which connects the four 2TB drives (it is an SFF-8087 to 4xSATA cable).
But after turning it on, I transferred the movie again and this time all was fine...
So I decided to run a "stress test" by scrubbing the pool (I will probably not let it finish).
As of now the outcome is:
Code:
  pool: Media2
 state: ONLINE
  scan: scrub in progress since Sat Jul 5 18:13:08 2014
        460G scanned out of 6.42T at 204M/s, 8h31m to go
        0 repaired, 7.00% done
config:
        NAME        STATE     READ WRITE CKSUM
        Media2      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da4     ONLINE       0     0     0
errors: No known data errors
I am still not sure whether the cable is faulty or not (any ideas welcome...)
Just wanted to share this little adventure with you and possibly get some ideas about what the problem might be.
And just for the record, half of the data in that pool (the half that I care about) is also backed up. But losing so much data, even with a backup available, is always frustrating...
P.S.: Additional info:
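Since the first sign of trouble here was a device flipping to faulted while nobody was looking at the status page, a small script that scans zpool status output for unhealthy rows might be handy. The following is only a minimal sketch: the function name unhealthy_rows is made up for illustration, the sample text is the output pasted above, and the parsing assumes the usual five-column NAME/STATE/READ/WRITE/CKSUM table.

```python
# Sample input: the zpool status output quoted in this post.
STATUS = """\
  pool: Media2
 state: ONLINE
  scan: scrub in progress since Sat Jul 5 18:13:08 2014
        460G scanned out of 6.42T at 204M/s, 8h31m to go
        0 repaired, 7.00% done
config:

        NAME        STATE     READ WRITE CKSUM
        Media2      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da4     ONLINE       0     0     0

errors: No known data errors
"""


def unhealthy_rows(status_text):
    """Return (name, state, read, write, cksum) for every row of the config
    table that is not ONLINE or has a non-zero error counter.

    Hypothetical helper, not part of any ZFS tooling.
    """
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            # Header row: the device table starts on the next line.
            in_table = True
        elif in_table and fields and fields[0] == "errors:":
            # The "errors:" summary line marks the end of the table.
            break
        elif in_table and len(fields) >= 5:
            name, state, read, write, cksum = fields[:5]
            # Counters are compared as strings, so abbreviated values
            # are flagged as well without needing to be parsed.
            if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
                rows.append((name, state, read, write, cksum))
    return rows


print(unhealthy_rows(STATUS))  # prints [] for the healthy output above
```

Fed with the text of a live zpool status call (for example from a cron job), a non-empty result would be the cue to send a mail or log a warning.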
- The S.M.A.R.T. values are not telling me anything. In particular, for the disks that appeared as faulted, UDMA_CRC_Error_Count is 0 (while in the past, when I had a failed cable, the number increased to 37 for the hard disk attached to the faulty cable).
- The NAS is inside a wall-mounted cabinet and the room's air conditioning unit sends cool air directly at the cabinet. At the time of the problem, however, the air conditioning was turned off, and temperatures in Athens at this time of year are above 30 Celsius. So another idea is that maybe the whole SAS controller overheated and started to throw errors (????). Right now the air conditioning is on, but I will turn it off and see what happens... Now, after an hour of scrubbing, the pool's disks are at 32 Celsius, but without the air conditioning they reach almost 50 (while scrubbing; in normal usage they are about 42-44).
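To keep an eye on the cable theory, the UDMA_CRC_Error_Count check mentioned in the first point could be scripted along these lines. This is a rough sketch under a few assumptions: smartmontools is installed, the script runs as root, the attribute table has the usual ten columns with RAW_VALUE last, and the raw value is a plain integer. The function names raw_value and read_attributes are invented for this example.

```python
import subprocess


def raw_value(smartctl_output, attribute="UDMA_CRC_Error_Count"):
    """Return the RAW_VALUE of one attribute from `smartctl -A` text,
    or None if the attribute is not listed.

    Assumes rows of the form:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == attribute:
            # Assumption: the raw value of this attribute is a plain integer.
            return int(fields[9])
    return None


def read_attributes(device):
    """Run `smartctl -A` on a device such as /dev/da7 and return its text.

    smartctl uses its exit status as a bit mask, so a non-zero status is
    not treated as a failure here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Possible use for the four disks of the pool (device names from this post):
# for dev in ("da4", "da5", "da6", "da7"):
#     print(dev, raw_value(read_attributes("/dev/" + dev)))
```

A counter that climbs between two runs would point back at the cable, while a counter that stays at 0 during the errors would make the overheating controller idea more plausible.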