
Help understanding SMART message & replace drive procedure

Posted: 27 Mar 2014 23:31
by bunk3m
I started getting errors from one of my two drives. Both are 3TB, mirrored in a ZFS pool. The error is:

Code: Select all

smartd[10762]: Device: /dev/da2, SMART Failure: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE
My NAS4Free version is 9.1.0.1 - Sandstorm.

I keep a backup of the drive, but now I'm trying to understand what the problem is and how soon the drive will fail. I won't be able to replace the drive before the weekend at the earliest. In addition, I've never had to replace a drive in a ZFS pool, so I'd appreciate it if someone could suggest some search terms so I can find and learn how to do this. If I get confused I'll ask, but I try to learn myself first.

The other drive is the same brand and shows PASSED, but it has this read failure in its self-test log, which has me worried too.

Code: Select all

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%      8431         -
The failing drive (below) shows 90% remaining.

The SMART info for the failing drive is below. There are two errors, and I don't know what they mean or whether this is fixable or the drive is trashed.

Thanks in advance for any help!

Code: Select all

Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
LU WWN Device Id: 5 000c50 0505b894a
Firmware Version: CC26
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Mar 27 18:16:46 2014 EDT

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (  73)	The previous self-test completed having
					a test element that failed and the test
					element that failed is not known.
Total time to complete Offline
data collection: 		(  584) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 333) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       111708992
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   031   030   030    Pre-fail  Always   In_the_past 20439754810287
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5589
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       10
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   099   099   000    Old_age   Always       -       1
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   062   045    Old_age   Always       -       34 (Min/Max 28/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   096   096   000    Old_age   Always       -       9641
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4649h+58m+50.145s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3458260451
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1405151903

SMART Error Log Version: 1
ATA Error Count: 2
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 5567 hours (231 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  29d+20:12:32.662  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.661  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.634  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.634  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.607  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 5567 hours (231 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  29d+20:12:32.662  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.661  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.634  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.634  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  29d+20:12:32.607  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: unknown failure    90%      5573         0

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
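
One quick way to see which attributes in a table like the one above have reached their threshold is to compare the WORST column against THRESH. The following is only a minimal sketch, assuming the column layout shown in this post; the function name flag_attrs is made up for illustration.

```shell
#!/bin/sh
# Flag SMART attributes whose normalized WORST value has reached THRESH.
# Input: an attribute table in the layout printed by `smartctl -A`.
# flag_attrs is a hypothetical helper name, not part of smartmontools.
flag_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have a hex FLAG in column 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            # A threshold of 0 can never be reached, so those rows are skipped.
            if (thresh > 0 && worst <= thresh)
                printf "%s %s value=%d worst=%d thresh=%d when_failed=%s\n", $1, $2, value, worst, thresh, $9
        }
    '
}

# Three rows copied from the table in this post.
# On a live system the input would come from: smartctl -A /dev/da2 | flag_attrs
flag_attrs <<'EOF'
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       111708992
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   031   030   030    Pre-fail  Always   In_the_past 20439754810287
EOF
# prints: 7 Seek_Error_Rate value=31 worst=30 thresh=30 when_failed=In_the_past
```

Run against the full table, this prints only the Seek_Error_Rate row, which is the same attribute smartctl marks as In_the_past.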

Re: Help understanding SMART message & replace drive procedu

Posted: 28 Mar 2014 02:20
by kenZ71
It is a risk, but you should be ok to wait.

For the replacement procedure it is this easy:

Power down, remove the failed drive, add the replacement. Boot up & enter this command:
zpool replace tank c0t3d0


Where tank is the name of your pool & c0t3d0 is the name of the failed drive.
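
Note that c0t3d0 is a Solaris-style device name. On NAS4Free (FreeBSD-based) the names look like the da2 in the smartd message from the first post. Below is a minimal sketch of the same steps with FreeBSD-style names. The pool name tank and the disk da2 are assumptions, so check `zpool status` for the real ones. The run wrapper and the DRY_RUN variable are made up for illustration, so the commands can be reviewed before anything is executed.

```shell
#!/bin/sh
# Hypothetical names: adjust POOL and DISK to match `zpool status` on your system.
POOL="${POOL:-tank}"
DISK="${DISK:-da2}"
# With DRY_RUN=1 (the default here) commands are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run zpool status "$POOL"            # confirm which device is failing before shutdown
# ... power down, swap the disk in the same slot, boot ...
run zpool replace "$POOL" "$DISK"   # one-device form: new disk sits where the old one was
run zpool status "$POOL"            # shows resilver progress
```

Setting DRY_RUN=0 runs the commands for real. The one-argument form of zpool replace is meant for a disk that has been physically swapped in the same location.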

This link gives a more in-depth write-up:
http://mattwilson.org/blog/solaris/repl ... -with-zfs/

Re: Help understanding SMART message & replace drive procedu

Posted: 30 Mar 2014 16:33
by bunk3m
Thanks kenZ71!

Worked like a charm! So easy it actually looks too easy.

Thanks so much.

I read the link but was concerned that I would have to reformat the drive, add or join it to the pool, etc. After reading a few additional posts I came across one from Oracle that mentioned that the command

Code: Select all

zpool replace tank c0t3d0
will take the raw drive and set up everything, including resilvering the drive.** Very cool! Love that ZFS!

**N.B. The assumption is that autoreplace = on is set. I suspect it was set by default, as I don't remember setting it.
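
The autoreplace setting can be checked rather than assumed. As far as the zpool documentation goes, the property defaults to off, and an explicit zpool replace does not depend on it. It only controls whether a new disk found in the same physical location is replaced automatically. Here is a minimal sketch for reading the value. The pool name tank is a placeholder and autoreplace_value is a made-up helper name.

```shell
#!/bin/sh
# Pull the VALUE column for autoreplace out of `zpool get autoreplace <pool>` output.
# autoreplace_value is a hypothetical helper, not a ZFS command.
autoreplace_value() {
    awk '$2 == "autoreplace" { print $3 }'
}

# On the NAS itself (pool name is a placeholder):
#   zpool get autoreplace tank | autoreplace_value
# Illustrative input in the usual `zpool get` column layout, not captured from a real system:
printf 'NAME  PROPERTY     VALUE  SOURCE\ntank  autoreplace  off    default\n' | autoreplace_value
```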

Now I have to run the tests that Seagate expects on the dying drive, then RMA, package and ship it. I hear that Seagate is a real stickler for shipping packaging.

Thanks again for your help!