

norwegian
Premium
join:2005-02-15
Outback

1 edit

[hard drive] HDD smart query...

Any chance of someone pointing out to me what is happening here? These are 2 identical drives bought at the same time. Could the cable be at fault?

They are in RAID 0, and while it is no problem if the array breaks, I'd prefer to be ahead of the game, not behind.
Logs below.
The first does not show any errors, but the second has DMA errors, and they only just started, if I read it correctly?

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3160815AS
Serial Number:    5RX45AMA
Firmware Version: 3.CHF
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Mar  2 17:11:53 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82)Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:  ( 433) seconds.
Offline data collection
capabilities:  (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:  (   2) minutes.
Extended self-test routine
recommended polling time:  (  52) minutes.
SCT capabilities:        (0x0035)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0002   097   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   098   098   020    Pre-fail  Always       -       2128
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       426488865
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7524
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   098   098   020    Pre-fail  Always       -       2120
184 End-to-End_Error        0x0033   100   253   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x003a   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0022   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x001a   070   051   000    Old_age   Always       -       30 (Lifetime Min/Max 24/30)
194 Temperature_Celsius     0x0000   030   049   000    Old_age   Offline      -       30 (0 14 0 0)
195 Hardware_ECC_Recovered  0x0032   078   073   000    Old_age   Always       -       15534028
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      6716         -
# 2  Short offline       Interrupted (host reset)      70%      6146         -
# 3  Short offline       Completed without error       00%         0         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

-----------------------------------------------------

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
 
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3160815AS
Serial Number:    5RX4060D
Firmware Version: 3.CHF
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Mar  2 17:38:53 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82)Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection:  ( 433) seconds.
Offline data collection
capabilities:  (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003)Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01)Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:  (   2) minutes.
Extended self-test routine
recommended polling time:  (  52) minutes.
SCT capabilities:        (0x0035)SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0002   097   097   000    Old_age   Always       -       0
  4 Start_Stop_Count        0x0033   099   099   020    Pre-fail  Always       -       1876
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       593096534
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       15209
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0033   099   099   020    Pre-fail  Always       -       1858
184 End-to-End_Error        0x0033   100   253   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x003a   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0022   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x001a   069   051   000    Old_age   Always       -       31 (Lifetime Min/Max 24/31)
194 Temperature_Celsius     0x0000   031   049   000    Old_age   Offline      -       31 (0 13 0 0)
195 Hardware_ECC_Recovered  0x0032   078   070   000    Old_age   Always       -       81301623
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       451
 
SMART Error Log Version: 1
ATA Error Count: 507 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 
Error 507 occurred at disk power-on lifetime: 13787 hours (574 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 37 98 c3 42 40  Error: ICRC, ABRT 55 sectors at LBA = 0x0042c398 = 4375448
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff b8 17 c3 42 40 00      00:20:02.829  READ DMA EXT
  25 ff b8 57 c2 42 40 00      00:20:02.827  READ DMA EXT
  25 ff b8 97 c1 42 40 00      00:20:02.826  READ DMA EXT
  25 ff b8 d7 c0 42 40 00      00:20:02.825  READ DMA EXT
  25 ff a8 27 c0 42 40 00      00:20:02.824  READ DMA EXT
 
Error 506 occurred at disk power-on lifetime: 13787 hours (574 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 97 80 b3 42 40  Error: ICRC, ABRT 151 sectors at LBA = 0x0042b380 = 4371328
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff b8 5f b3 42 40 00      00:20:01.320  READ DMA EXT
  25 ff b8 9f b2 42 40 00      00:20:01.319  READ DMA EXT
  25 ff b8 df b1 42 40 00      00:20:01.336  READ DMA EXT
  25 ff b8 1f b1 42 40 00      00:20:01.335  READ DMA EXT
  25 ff 80 97 b0 42 40 00      00:20:01.334  READ DMA EXT
 
Error 505 occurred at disk power-on lifetime: 13787 hours (574 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 2e 5b 42 40  Error: ICRC, ABRT at LBA = 0x00425b2e = 4348718
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff 80 af 5a 42 40 00      00:19:59.707  READ DMA EXT
  25 ff 90 1f 5a 42 40 00      00:19:59.706  READ DMA EXT
  25 ff b8 5f 59 42 40 00      00:19:59.704  READ DMA EXT
  25 ff b8 9f 58 42 40 00      00:19:59.703  READ DMA EXT
  25 ff a0 f7 57 42 40 00      00:19:59.702  READ DMA EXT
 
Error 504 occurred at disk power-on lifetime: 13787 hours (574 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 3f 28 bf 08 40  Error: ICRC, ABRT 63 sectors at LBA = 0x0008bf28 = 573224
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff 80 e7 be 08 40 00      00:19:57.273  READ DMA EXT
  25 ff 80 67 be 08 40 00      00:19:57.272  READ DMA EXT
  25 ff 80 e7 bd 08 40 00      00:19:57.271  READ DMA EXT
  25 ff 80 67 bd 08 40 00      00:19:57.271  READ DMA EXT
  25 ff 80 e7 bc 08 40 00      00:19:57.270  READ DMA EXT
 
Error 503 occurred at disk power-on lifetime: 13787 hours (574 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 67 b0 83 9a 40  Error: ICRC, ABRT 103 sectors at LBA = 0x009a83b0 = 10126256
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff 78 9f 83 9a 40 00      00:11:58.980  READ DMA EXT
  25 ff 00 9f 82 9a 40 00      00:11:58.980  READ DMA EXT
  25 ff 90 07 82 9a 40 00      00:11:58.977  READ DMA EXT
  25 ff 00 07 81 9a 40 00      00:11:58.975  READ DMA EXT
  25 ff 00 07 80 9a 40 00      00:11:58.975  READ DMA EXT
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15209         -
# 2  Short offline       Completed without error       00%     14401         -
# 3  Extended offline    Aborted by host               20%     14401         -
# 4  Short offline       Completed without error       00%     13803         -
# 5  Short offline       Completed without error       00%     13783         -
# 6  Short offline       Completed without error       00%     13773         -
# 7  Short offline       Completed without error       00%         0         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

--
The only thing necessary for the triumph of evil is for good men to do nothing - Edmund Burke



koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23

1 recommendation

Speaking strictly about the drive with serial number 5RX4060D --

The only anomaly shown here is a high number of CRC errors: attribute 199 (UDMA_CRC_Error_Count) has a raw value of 451, accumulated over 15209 power-on hours and 1858 power cycles.
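If you would rather pull these figures out of the attribute table than read them by eye, here is a minimal sketch. It assumes the 10-column layout of the smartctl output pasted above; the function name parse_attributes is made up for this example, and the sample rows are copied from the second drive's log.

```python
def parse_attributes(text):
    """Map attribute name -> integer raw value from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 10 (index 9) is RAW_VALUE in the layout shown in this thread.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


# Rows copied from the log of the drive with serial number 5RX4060D.
SAMPLE = """\
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       15209
 12 Power_Cycle_Count       0x0033   099   099   020    Pre-fail  Always       -       1858
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       451
"""

attrs = parse_attributes(SAMPLE)
print(attrs["UDMA_CRC_Error_Count"], attrs["Power_On_Hours"], attrs["Power_Cycle_Count"])
```

The same function can be fed the full text of the attribute section, since header and blank lines are skipped by the numeric-ID check.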

The SMART error log contains a count of 507 errors, but only has space to store the 5 most recent errors, so how long this issue has been going on is unknown. The most recent error occurred at 13787 power-on hours (which was roughly 1422 hours in the past from the time the SMART attribute snapshot was taken). Example:

Error 507 occurred at disk power-on lifetime: 13787 hours (574 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
  
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 37 98 c3 42 40  Error: ICRC, ABRT 55 sectors at LBA = 0x0042c398 = 4375448
  
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff b8 17 c3 42 40 00      00:20:02.829  READ DMA EXT
  25 ff b8 57 c2 42 40 00      00:20:02.827  READ DMA EXT
  25 ff b8 97 c1 42 40 00      00:20:02.826  READ DMA EXT
  25 ff b8 d7 c0 42 40 00      00:20:02.825  READ DMA EXT
  25 ff a8 27 c0 42 40 00      00:20:02.824  READ DMA EXT
 

This log entry indicates that the drive was handling 48-bit read I/O requests (READ DMA EXT) for a sequential run of LBAs when the most recent request (at Powered_Up_Time 00:20:02.829, which is time since that power-on rather than wall-clock time) resulted in a protocol-level CRC error and returned ABRT status (indicated by the Error: ICRC, ABRT line). Your controller driver and/or the OS should have noticed this condition, as the error was reported all the way back to the OS. The OS may have retried the read request successfully.
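For anyone curious how smartctl gets from the raw register bytes to "ICRC, ABRT 55 sectors at LBA = 4375448", here is a rough sketch using the values from Error 507. The error-register bit positions (ICRC = bit 7, ABRT = bit 2) match what smartctl printed for 0x84; the status-register names are the legacy ATA ones and are included only as an assumption for illustration. The function names are made up.

```python
# Bit positions in the ATA error and status registers (legacy names).
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC", 3: "DRQ", 0: "ERR"}


def decode_bits(value, names):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & (1 << bit)]


def decode_entry(er, st, sc, sn, cl, ch, dh):
    """Decode one 'registers were' row from the SMART error log."""
    # The log row only carries 28 bits of LBA: low nibble of DH, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error": decode_bits(er, ERROR_BITS),
        "status": decode_bits(st, STATUS_BITS),
        "sectors": sc,
        "lba": lba,
    }


# ER ST SC SN CL CH DH = 84 51 37 98 c3 42 40 (Error 507)
entry = decode_entry(0x84, 0x51, 0x37, 0x98, 0xC3, 0x42, 0x40)
print(entry)
```

The decoded LBA and sector count agree with the text smartctl printed next to the register row.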

The other 4 errors shown are the same type, but for different LBAs, which makes perfect sense given what CRC errors indicate: a problem on the link between the controller and the drive, not with any particular sector.

Protocol CRC errors are the most difficult type of error to track down because there are many possibilities that could explain the issue. Some examples:

* Physical cabling issues (e.g. bad SATA cable), including cables with crappy shielding
* Dust or other such things within the SATA data connector (on the motherboard or on the drive itself), including a loose connection
* Physical damage to the SATA data connector (on the motherboard or on the disk PCB)
* Physical damage to the disk PCB, particularly near/around the SATA data connector, or traces between the data connector and the PCB's controller; this may also be the result of faulty manufacturing
* Physical damage to the motherboard, particularly near/around the SATA data connector, or traces between the data connector and the motherboard's SATA controller; this may also be the result of faulty manufacturing
* A system which is emitting excess interference/EMI, compounded by one of the above issues

This type of damage is often invisible to the naked eye. Usually what I recommend people do, and in this order, is:

1. Unplug the SATA cable from the motherboard and blow air into the SATA port on the mainboard, as well as around/at the end of the cable. Re-plug the cable and continue to use the system + watch for recurring errors.

2. If errors continue: unplug the SATA cable from the disk and blow air into the SATA port on the drive PCB, as well as around/at the end of the cable. Re-plug the cable and continue to use the system + watch for recurring errors.

3. If errors continue: replace the SATA cable entirely.

4. If errors continue: replace either the motherboard or the disk. (If you have a replacement disk PCB for the exact model and revision and firmware of disk, you can try swapping that out instead).

5. If errors continue: same as #4 but replace whatever the opposing part is (e.g. if in #4 you replaced the disk, now try replacing the motherboard).

6. If errors continue: issue is almost certainly EMI-related, in which case I have no advice on how to troubleshoot this kind of issue.

The reason I recommend this fairly long and drawn out procedure is that it allows the person to figure out where the actual problem was. Most people I encounter just "replace the SATA cable" and report "the issue is gone! It must have been a bad cable!" which is incorrect/inconclusive -- it could have been dust in the port or a loose connection which could have been relieved through air or re-tightening. So which methodology you choose to follow is up to you, but keep an open and logical mind.
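The "watch for recurring errors" part of each step comes down to comparing the raw value of attribute 199 before and after a stretch of normal use. A minimal sketch of that bookkeeping follows. Only the starting value of 451 comes from the log above; the later readings, the step labels, and the function names are made up for illustration.

```python
def crc_delta(before, after):
    """Number of new CRC errors between two readings of attribute 199."""
    if after < before:
        # The raw counter should only ever grow on the same drive.
        raise ValueError("counter went backwards; readings from different drives?")
    return after - before


def first_quiet_step(readings):
    """readings: list of (label, crc_count), one per step, each taken after the
    system was used for a while following that step. The first entry is the
    baseline. Returns the label of the first step after which the count
    stopped rising, or None if it never did."""
    for (_, prev), (label, cur) in zip(readings, readings[1:]):
        if crc_delta(prev, cur) == 0:
            return label
    return None


# Hypothetical readings; only the baseline of 451 is from the real log.
readings = [
    ("baseline", 451),
    ("1: clean motherboard port", 460),
    ("2: clean drive port", 472),
    ("3: replace cable", 472),
]
print(first_quiet_step(readings))
```

A single quiet period does not prove much if the errors were intermittent to begin with, so give each step a fair amount of I/O before taking the next reading.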

Understand that these are not sector-level ECC errors (which some people like to erroneously call "CRC errors"), these are ATA protocol-level CRC errors. Think of it like an IP or TCP or UDP packet: if the packet checksum included in the packet does not match the calculated checksum when received, then that means data integrity can't be verified, hence error. This happens between the two SATA controllers (e.g. motherboard and disk, or HBA and disk).
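To make the packet-checksum analogy concrete, here is a small sketch of a sender appending a checksum and a receiver verifying it. It uses zlib's CRC-32 purely as a stand-in: the SATA link layer computes its own 32-bit CRC over each frame, and the details differ from what is shown here. The frame layout and function names are assumptions for illustration.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def check_frame(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the one received."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "little")
    return zlib.crc32(payload) == received


frame = make_frame(b"sector data on the wire")
# Flip a single bit in transit, as a marginal cable or connector might.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(check_frame(frame), check_frame(corrupted))
```

The data on the platter can be perfectly fine in this situation; the mismatch only says that what arrived is not what was sent.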

--
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.


norwegian
Premium
join:2005-02-15
Outback

Ah, thank you.

The motherboard is known for treating HDDs as USB-type drives rather than IDE or SATA; nothing I can do about that for now. It is an ASUS M2N32-SLI, from before the Vista version. Great motherboard, except it was sold too early in its development, just a money grabber for ASUS... we all get caught on a motherboard with faults or quirks, but there is no RMA offer for it.

This same motherboard actually had 2 x Raptors (150GB) in RAID from new, before the VelociRaptors came out. You mentioned poor firmware/product in another post; 1 of them failed within 3 years and I never used the RMA process. It's been a costly learning curve, this build, but 7 years later it still works.

I might have to start checking the finances for a new build soon.

Appreciate your knowledge and comments.
A little troubleshooting to see if it can be rectified short term, but long term I need a new build realistically.