This is the old XigmaNAS forum in read-only mode.
It will be taken offline by the end of March 2021!
I would like to ask users and admins to rewrite or take over important posts from here into the new main forum!
It is not possible for us to export from here and import into the main forum!
[SOLVED] Some checksum errors on random disks after scrub
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
[SOLVED] Some checksum errors on random disks after scrub
I'm running six 2 TB drives in RAID-Z2.
The system is stable and can saturate a Gigabit link over Samba.
The CPU is a 64-bit AMD Athlon, 3 cores, low-power version.
12 GByte of memory (non-ECC).
Power consumption is less than 100 Watt and the PSU is rated at 300 Watt.
Every weekend I run a scrub.
After every scrub there is a small number (<10) of CHECKSUM errors on random disks.
1: Is this a problem?
2: What could be the cause?
I recently moved all components to a different case (with different cable routing) and that made no difference.
---------------------------- Edit: Summary so you do not have to read the whole thread -----------------------------------------------------------
Faulty memory was suspected.
Running memtest86+ in SMP mode did reveal a faulty memory module.
After removing that module, all memory configurations with more than 4 GByte caused memtest86+ to hang.
All other computers I tested had the same problem, so I assumed a bug in memtest86+ in SMP mode.
The solution was to switch to Round Robin testing.
Once the memory test passed, I did a binary comparison of all files on the ZFS pool against a backup made before the data was moved to the ZFS system (most of the stored files are read-only).
This revealed that NO files were corrupted by running ZFS scrubs with unreliable memory.
I guess I have been lucky.
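For anyone who wants to repeat the binary comparison, here is a minimal Python sketch of one way to do it. This is not the tool used in this thread (that is not stated); the function name `find_mismatches` and the example paths are made up for illustration. It walks the backup tree and compares every file byte for byte with the file at the same relative path on the pool.

```python
import filecmp
import os
from typing import List


def find_mismatches(backup_root: str, pool_root: str) -> List[str]:
    """Return relative paths of backup files that are missing on the pool
    or whose contents differ byte for byte."""
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, backup_root)
            dst = os.path.join(pool_root, rel)
            # shallow=False forces a real content comparison instead of
            # trusting size and modification time.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                mismatches.append(rel)
    return sorted(mismatches)
```

Calling something like `find_mismatches("/mnt/backup", "/mnt/pool")` (hypothetical paths) returns an empty list when every backup file has an identical copy on the pool. Files that exist only on the pool are not reported, which fits the case where the backup is the older reference copy.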
So if you have scrub checksum errors and your SMART data does not show any sector read errors, you have either bad SATA cables, a bad SATA controller, or bad memory.
Last edited by erik on 20 Aug 2014 08:57, edited 3 times in total.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
- b0ssman
- Forum Moderator

- Posts: 2438
- Joined: 14 Feb 2013 08:34
- Location: Munich, Germany
- Status: Offline
Re: Small amount of checksum errors on random drives after s
possible bad memory.
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on random drives after s
Should an extensive memory test be able to detect bad memory?
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
- b0ssman
- Forum Moderator

- Posts: 2438
- Joined: 14 Feb 2013 08:34
- Location: Munich, Germany
- Status: Offline
Re: Small amount of checksum errors on random drives after s
also post all smart values of your drives
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on random drives after s
Smart output of the 6 drives in the pool
S.M.A.R.T. [/dev/ada1]:
-----------------------
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164
Serial Number: W3404JYS
LU WWN Device Id: 5 000c50 06a74b2b3
Firmware Version: CC27
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Aug 10 02:10:03 2014 CEST
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 223) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 180901944
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 228
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 052 042 030 Pre-fail Always - 2353727804374
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5916
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 63
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 069 047 045 Old_age Always - 31 (0 1 33 25 0)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1182
194 Temperature_Celsius 0x0022 031 053 000 Old_age Always - 31 (128 0 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 5838h+56m+58.577s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 9057344195
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 71013783423
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
S.M.A.R.T. [/dev/ada2]:
-----------------------
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WMAZA8686773
LU WWN Device Id: 5 0014ee 159fed85b
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Aug 10 02:10:03 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (37800) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 364) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0027 171 164 021 Pre-fail Always - 6416
4 Start_Stop_Count 0x0032 097 097 000 Old_age Always - 3857
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15397
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 113
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 54
193 Load_Cycle_Count 0x0032 081 081 000 Old_age Always - 357578
194 Temperature_Celsius 0x0022 121 102 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 14090 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
S.M.A.R.T. [/dev/ada3]:
-----------------------
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF)
Device Model: WDC WD20EARS-00MVWB0
Serial Number: WD-WMAZA3284834
LU WWN Device Id: 5 0014ee 6ab6bf962
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Sun Aug 10 02:10:04 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (37500) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 361) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0027 173 166 021 Pre-fail Always - 6341
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 173
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 5904
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 63
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 093 093 000 Old_age Always - 322918
194 Temperature_Celsius 0x0022 121 102 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 4595 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
S.M.A.R.T. [/dev/ada4]:
-----------------------
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164
Serial Number: W3404JZH
LU WWN Device Id: 5 000c50 06a74b31d
Firmware Version: CC27
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Aug 10 02:10:04 2014 CEST
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 227) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 148746976
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 226
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 068 053 030 Pre-fail Always - 55921103494
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5918
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 63
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 0 0 1
189 High_Fly_Writes 0x003a 099 099 000 Old_age Always - 1
190 Airflow_Temperature_Cel 0x0022 069 049 045 Old_age Always - 31 (0 1 32 24 0)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1174
194 Temperature_Celsius 0x0022 031 051 000 Old_age Always - 31 (128 0 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 5835h+05m+56.756s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 9066308355
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 70377532013
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 4603 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
S.M.A.R.T. [/dev/ada5]:
-----------------------
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164
Serial Number: W1E11PVX
LU WWN Device Id: 5 000c50 051eb5eec
Firmware Version: CC43
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Aug 10 02:10:04 2014 CEST
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 218) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 127160368
3 Spin_Up_Time 0x0003 095 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 097 097 020 Old_age Always - 4062
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 082 060 030 Pre-fail Always - 4467542702
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 15849
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 97
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1 2 2
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 070 049 045 Old_age Always - 30 (0 2 31 24 0)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 36
193 Load_Cycle_Count 0x0032 056 056 000 Old_age Always - 88468
194 Temperature_Celsius 0x0022 030 051 000 Old_age Always - 30 (128 0 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 11016h+58m+31.318s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 21010366576
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 150569571429
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
S.M.A.R.T. [/dev/ada6]:
-----------------------
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164
Serial Number: W3404JTK
LU WWN Device Id: 5 000c50 06a74b9b3
Firmware Version: CC27
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Aug 10 02:10:04 2014 CEST
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 219) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 114 099 006 Pre-fail Always - 83157728
3 Spin_Up_Time 0x0003 094 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 221
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 062 051 030 Pre-fail Always - 210545958622
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5932
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 65
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 049 045 Old_age Always - 31 (Min/Max 25/32)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 21
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1171
194 Temperature_Celsius 0x0022 031 051 000 Old_age Always - 31 (0 15 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 5850h+15m+14.592s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 9247831329
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 71073539332
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
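To check dumps like the ones above without reading every row, the attribute table can be parsed with a few lines of Python. This is only a sketch: it assumes the ten-column layout that smartctl 6.2 printed here, and the set of watched IDs (5, 187, 197, 198, 199) is my own choice of attributes that usually point at bad sectors or a bad link, not an official list. The sample rows are copied from the ada1 output.

```python
from typing import List, NamedTuple, Optional


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


def parse_attr_line(line: str) -> Optional[SmartAttr]:
    """Parse one row of the 'Vendor Specific SMART Attributes' table.
    Returns None for the header row and for any non-attribute line."""
    parts = line.split(None, 9)  # the 10th field keeps the whole raw value
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    return SmartAttr(int(parts[0]), parts[1], int(parts[3]),
                     int(parts[4]), int(parts[5]), parts[9].strip())


# Assumption: attributes whose nonzero raw value hints at sector or link trouble.
WATCHED_IDS = {5, 187, 197, 198, 199}


def suspicious(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Return watched attributes whose leading raw number is not zero."""
    hits = []
    for attr in attrs:
        first = attr.raw.split()[0]
        if attr.attr_id in WATCHED_IDS and first.isdigit() and int(first) != 0:
            hits.append(attr)
    return hits


SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 069 047 045 Old_age Always - 31 (0 1 33 25 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

parsed = [a for a in map(parse_attr_line, SAMPLE.splitlines()) if a is not None]
print("suspicious attributes:", [a.name for a in suspicious(parsed)])
```

An empty result only says the disks themselves report no sector or CRC problems; it does not rule out cabling, controller, or memory faults.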
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
- b0ssman
- Forum Moderator

- Posts: 2438
- Joined: 14 Feb 2013 08:34
- Location: Munich, Germany
- Status: Offline
Re: Small amount of checksum errors on random drives after s
some command timeouts.
did you run wdidle on your greens?
it could be that one cpu core has a problem. did you run the memtest in smt mode?
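The wdidle question presumably refers to the head-parking idle timer on the WD Greens, which shows up in SMART as a high Load_Cycle_Count. A rough way to see it is to divide Load_Cycle_Count by Power_On_Hours. The numbers below are taken from the SMART output posted above; the helper name `cycles_per_hour` is just for illustration.

```python
def cycles_per_hour(load_cycle_count: int, power_on_hours: int) -> float:
    """Average number of head load/unload cycles per power-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycle_count / power_on_hours


# (Load_Cycle_Count, Power_On_Hours) raw values from the posted SMART dumps.
drives = {
    "ada2 WD20EARX": (357578, 15397),
    "ada3 WD20EARS": (322918, 5904),
    "ada1 ST2000DM001": (1182, 5916),
    "ada5 ST2000DM001": (88468, 15849),
}

for name, (lcc, poh) in drives.items():
    print(f"{name}: {cycles_per_hour(lcc, poh):.1f} load cycles per power-on hour")
```

The two Greens come out at roughly 23 and 55 cycles per hour, far above the Seagate on ada1. That is a wear concern for the heads and is not by itself an explanation for checksum errors.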
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on random drives after s
How do I run memtest in SMT mode?
How do I run wdidle?
The errors also appear on the Seagate drives.
I run VirtualBox with a Windows machine and a continuous backup, so I never see any disk going into standby.
Memory usage always shows at least 3 GByte free.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
- b0ssman
- Forum Moderator

- Posts: 2438
- Joined: 14 Feb 2013 08:34
- Location: Munich, Germany
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
read this
http://www.sgvulcan.com/load-cycle-coun ... rom-linux/
howto
https://www.youtube.com/watch?v=J2eYyRI_F98
memtest smt
http://www.memtest.org/
press f2 at the start
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
Thanks very much for your advice.
All BIOS settings were on default/safe and I have never had a system crash, but I ran memtest for the first time and it reported errors and then crashed.
Very bad, but good to have detected.
I'm now experimenting with various memory configurations to see whether it is a single defective memory module or something else.
I will let you know whether I can get memtest to run cleanly and, if so, whether the ZFS errors disappear.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
- b0ssman
- Forum Moderator

- Posts: 2438
- Joined: 14 Feb 2013 08:34
- Location: Munich, Germany
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
you will most likely have destroyed a lot of your data
please read
http://forums.freenas.org/index.php?thr ... zfs.15449/
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.
-
substr
- experienced User

- Posts: 113
- Joined: 04 Aug 2013 20:21
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
I think he got lucky. If the checksum had been invalid, ZFS would have been unable to find a correction that matched the checksum, and would have registered a permanent error. Since it sounds like each error was in the data blocks (or only one set of parity.. it is RAIDZ2, after all), but with a valid checksum, ZFS was able to make the repairs, and they were not 'false.'
Would be interested in hearing back if this is the case. But if so, very lucky.
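The repair logic described above (try a reconstruction, accept it only if it matches the stored checksum) can be sketched as a toy model. This is only an illustration under simplifying assumptions: it uses a single XOR parity block and SHA-256, whereas RAID-Z2 keeps two parity blocks and ZFS has its own checksum and reconstruction code. The names `xor_blocks`, `checksum` and `repair` are hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(blocks):
    """Checksum over the whole stripe (SHA-256 chosen only for the sketch)."""
    return hashlib.sha256(b"".join(blocks)).hexdigest()


def repair(data, parity, expected):
    """Return a stripe whose checksum equals `expected`, or None.

    Toy model with one XOR parity block: rebuild each data block in turn
    from the other blocks plus parity and keep the first candidate that
    matches the stored checksum.
    """
    data = list(data)
    if checksum(data) == expected:
        return data  # nothing to repair
    for i in range(len(data)):
        others = data[:i] + data[i + 1:]
        rebuilt = xor_blocks(others + [parity])
        candidate = data[:i] + [rebuilt] + data[i + 1:]
        if checksum(candidate) == expected:
            return candidate
    # No candidate matches: this is the "permanent error" case, which is
    # also what happens when the stored checksum itself is wrong.
    return None
```

In this model a damaged data block with a good stored checksum is repaired, while a bad stored checksum makes every candidate fail, no matter how much redundancy is available.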
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
Interesting HW problem.
Any memory module or module combination beyond 4 GByte fails the memory test when multiple cores are active.
I tried more conservative timings; that does not help.
Higher CPU voltage (it's a low-power AMD) does not help.
A BIOS update does not help.
So now I am back to 4 GByte of memory. Sad, but reading speed over Samba is still 100kByte/s; writing is down to 50kByte/s.
Most of the data on the server is static and I have a complete backup from before I moved everything to the ZFS NAS (I also have a full backup of everything and every file version in the cloud, just in case....), so I plan to do a full diff of all static files to see how much damage was done. If needed I can restore from the backup disks or from the cloud (so glad I bought more disks when building the NAS instead of reusing the ones from the old server).
Will let you know.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
- b0ssman
- Forum Moderator

- Posts: 2438
- Joined: 14 Feb 2013 08:34
- Location: Munich, Germany
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
i would not use that hardware anymore.
Nas4Free 11.1.0.4.4517. Supermicro X10SLL-F, 16gb ECC, i3 4130, IBM M1015 with IT firmware. 4x 3tb WD Red, 4x 2TB Samsung F4, both GEOM AES 256 encrypted.
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
Well......
I started testing my other computers and all of them hang in test 7 with more than 4 GByte.
It seems that memtest86+ has a bug in SMP testing that causes the test to hang during block move above 4 GByte.
I will have to find another memtest to recheck what works and what does not. It seems round-robin testing could work as well.
The memtest86+ website says SMP testing is experimental. Indeed it is.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
Round-robin (RRB) testing is rock solid with 8 GByte (I removed the memory module that caused errors in memtest).
Now restarting NAS4Free and doing a binary compare; this will take some time.
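For anyone wanting to repeat such a binary compare, here is a minimal sketch in Python. It is not the tool used in this thread (that was not specified); the function name `compare_trees` and the idea of passing a backup root and a pool root are assumptions made for the example.

```python
import filecmp
import os


def compare_trees(backup_root, pool_root):
    """Compare every file under backup_root byte for byte with the file
    at the same relative path under pool_root.

    Returns two sorted lists of relative paths: files whose contents
    differ, and files that are missing on the pool side.
    """
    mismatched, missing = [], []
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, backup_root)
            dst = os.path.join(pool_root, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            # shallow=False forces a content comparison instead of
            # trusting size and modification time alone.
            elif not filecmp.cmp(src, dst, shallow=False):
                mismatched.append(rel)
    return sorted(mismatched), sorted(missing)
```

Calling `compare_trees("/mnt/backup", "/mnt/pool/data")` (both paths are placeholders) returns the two lists; two empty lists mean every backed-up file was found on the pool with identical contents. Files that exist only on the pool are not reported.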
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
First binary compare of 20000 files did not show any problem.
Now comparing 200000 files (130 GByte).
If that is all OK I will compare 400 GByte.
This could be a nice experiment in testing how robust a scrub on an actual 6-disk RAIDZ2 system is against memory errors.
I guess only double (or triple?) memory errors would propagate?
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
200GByte compare, no errors.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
-
erik
- experienced User

- Posts: 83
- Joined: 14 Jul 2014 09:45
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
All files checked, no corruption.
Scrub done, zero checksum errors.
ZFS is more robust than expected.
primary NAS: 2*8Tb raidz1, backup NAS: 6*2TB raidz2, remote backup NAS: 3*2TB raidz1 : All NAS4Free 11.0
-
substr
- experienced User

- Posts: 113
- Joined: 04 Aug 2013 20:21
- Status: Offline
Re: Small amount of checksum errors on ramdom drives after s
Yes, the extra protection helps you as long as the memory corruption did not cause the checksum to be miscalculated or corrupted. If the checksum is corrupted, that block is gone, no matter what level of redundancy you have. That is why memory corruption (and therefore using non-ECC memory) is considered a bad idea with ZFS.
If you can't trust the computational integrity of the CPU, memory, etc., you've got a disaster.
If your problem is actually bad memory, you must be the luckiest case I've ever seen. (Edit:) So lucky that you might want to keep an eye out for the problem continuing, and question whether it is something like the disk controller instead.