rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

Bad Hard Drive(s) / Raid Array

I've used koitsu's excellent instructions (here: »Re: Multiple Hard Drives failing) over the past several years to check on hard drives in RAID arrays, and they have always worked perfectly.

Fast forward to 2012 / different computer / Another RAID 10 array which keeps degrading and kicking out disks. Probably something is wrong with one of the HDs but, as usual, the issue is figuring out which one.

This time, however, the instructions referenced above aren't working for me. I am getting errors trying to download the packages to the KNOPPIX Live disk so I'm stalled @ that point. I'm using that same KNOPPIX live disk - 6.2, I believe. Is that the issue? Do I need to burn a newer KNOPPIX build?

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

1. What OS, version, and 32-bit or 64-bit?

2. What is responsible for the RAID? Intel RST, i.e. BIOS-level RAID? If so, what exact driver version? If not, what card?

3. Can you please provide some details about the array degradation and what you mean by "kicking disks out"? Screenshots of some of the Event Log entries for these issues would be helpful.

4. If Intel RST-based, present-day smartmontools (meaning since 5.43, but please use 6.0) on Windows offers the ability to get SMART attributes from drives behind Intel RST. Please run a Command Prompt as Administrator (not the same as just launching a normal Command Prompt). I would start with the command smartctl --scan to see if you can get a list of the actual disks/devices. Look for entries with /dev/csmi in their name -- those are ones under RAID. From there, query each of the results using smartctl -x (e.g. smartctl -x /dev/csmi0,1). You can redirect the output to a file using the > C:\driveX.txt method in Command Prompt. Please provide output for all drives either as attachments or pasted here within "Block Codes" (not "Inline Code").
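
As a rough illustration of step 4, the Python sketch below picks the /dev/csmi entries out of `smartctl --scan` output and builds one `smartctl -x` command per RAID member, redirected to a file. The sample scan text and the `driveN.txt` numbering are assumptions for illustration only; real output depends on the controller and driver.

```python
# Sample text in the format `smartctl --scan` prints; illustrative only.
SCAN_OUTPUT = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/csmi2,0 -d ata # /dev/csmi2,0, ATA device
/dev/csmi2,1 -d ata # /dev/csmi2,1, ATA device
"""


def csmi_devices(scan_output):
    """Return the device names that sit behind the RAID driver (/dev/csmi...)."""
    devices = []
    for line in scan_output.splitlines():
        fields = line.split()
        # The first field of each scan line is the device name.
        if fields and fields[0].startswith("/dev/csmi"):
            devices.append(fields[0])
    return devices


def query_commands(devices, out_dir="C:\\"):
    """Build one 'smartctl -x DEVICE > FILE' command line per device."""
    commands = []
    for number, device in enumerate(devices, start=1):
        # driveN.txt is a hypothetical naming scheme, not a smartmontools rule.
        commands.append(f"smartctl -x {device} > {out_dir}drive{number}.txt")
    return commands


for command in query_commands(csmi_devices(SCAN_OUTPUT)):
    print(command)
```

The printed lines are meant to be pasted into the elevated Command Prompt; the script itself never talks to the drives.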

I wouldn't bother with the Knoppix method any more. I dunno what those guys are doing these days, but they seem to be fooling around/screwing around with their distro. The last version I tried (7.x-something) didn't work at all for me (kernel would lock up hard during boot) on multiple machines.

Thanks.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

Windows XP 32-bit
I found smartmontools ver 6+ but couldn't get it to run. I didn't see NVIDIA listed as supported.

The RAID controller is NVIDIA 9.82, I believe Soft RAID; no card

What I mean by degradation is that during POST the array is listed as degraded.

By kicking out a disk I mean that when I go into the RAID controller/setup, the array is listed and then another disk (or two) is listed separately. Those extra disks *should* be in the array because there are no disks on the machine that are not in the array.

Since I couldn't get smartmontools to run, I took each disk out of the machine and, one at a time, put them into an external enclosure on another machine and ran HD Tune Pro on each.

One was listed as good - no warnings under Health.
Three had warnings

******************************************

Serial # WMAP41107530
HD Tune Pro: WDC WD1500ADFD-00NLR1 Health

ID                                  Current  Worst    Threshold Data    Status
(01) Raw Read Error Rate            200      200      51       0        ok       
(03) Spin Up Time                   166      163      21       4741     ok       
(04) Start/Stop Count               99       99       40       1217     ok       
(05) Reallocated Sector Count       200      200      140      0        ok       
(07) Seek Error Rate                200      200      51       0        ok       
(09) Power On Hours Count           74       74       0        19310    ok       
(0A) Spin Retry Count               100      100      51       0        ok       
(0B) Calibration Retry Count        100      100      51       0        ok       
(0C) Power Cycle Count              99       99       0        1005     ok       
(C2) Temperature                    121      98       0        26       ok       
(C4) Reallocated Event Count        200      200      0        0        ok       
(C5) Current Pending Sector         200      200      0        0        ok       
(C6) Offline Uncorrectable          200      200      0        0        ok       
(C7) Ultra DMA CRC Error Count      200      253      0        1        warning  
(C8) Write Error Rate               200      200      51       0        ok 
 
Health Status         : warning
 

***************************************************

Serial # WMAP41565239
HD Tune Pro: WDC WD1500ADFD-00NLR5 Health

ID                                  Current  Worst    Threshold Data    Status
(01) Raw Read Error Rate            200      200      51       0        ok       
(03) Spin Up Time                   160      160      21       5008     ok       
(04) Start/Stop Count               100      100      40       44       ok       
(05) Reallocated Sector Count       200      200      140      0        ok       
(07) Seek Error Rate                200      200      51       0        ok       
(09) Power On Hours Count           100      100      0        92       ok       
(0A) Spin Retry Count               100      253      51       0        ok       
(0B) Calibration Retry Count        100      253      51       0        ok       
(0C) Power Cycle Count              100      100      0        32       ok       
(C2) Temperature                    123      104      0        24       ok       
(C4) Reallocated Event Count        200      200      0        0        ok       
(C5) Current Pending Sector         200      200      0        0        ok       
(C6) Offline Uncorrectable          200      200      0        0        ok       
(C7) Ultra DMA CRC Error Count      200      253      0        30       warning  
(C8) Write Error Rate               200      200      51       0        ok  
 
Health Status         : warning
 

************************************************

Didn't get the full Health report copied on the third one but its warning was in:

Current Pending Sector and I believe the count was 1

All disks passed a quick scan

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

Correct, smartmontools does not support nVidia BIOS-level RAID.

Edit: Ah, I see where the 4th disk is. It's the one you said "looks fine". Politely: when I ask someone to provide me SMART attributes of every disk in the array, I expect someone to do exactly what I ask. You're asking for help, but then when I ask for the data needed, you don't provide it. :/

Referring to drive serial numbers to keep things sane:

WMAP41107530 looks fine. Don't worry about 1 CRC error, especially for a drive that's been in use over 19,000 hours. Ignore what the software tells you (re: "warning").

WMAP41565239 looks suspect, but it's hard for me to tell without knowing usage habits and so on. The drive reports a power-on hours count of only 92 hours. Did you recently replace or install this drive? If so, why? If you did replace it, this disk should probably be RMA'd -- 30 CRC errors in only 92 hours of power-on time is way too high. Otherwise it may be possible that the drive's HPA region for storing SMART attributes is wonky, or the drive may have some kind of firmware issue. If you really did replace the drive 92 hours ago, I recommend waiting until the next array degradation event happens and then see if the CRC error count increased.
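
To make the comparison between the two drives concrete, here is a small Python sketch that normalises the CRC error count by power-on hours, using the figures from the HD Tune reports above. The "errors per 1000 power-on hours" metric is only an illustration of the reasoning, not an official threshold or a value that SMART itself reports.

```python
def crc_errors_per_1000_hours(crc_errors, power_on_hours):
    """Normalise a raw CRC error count (attribute 0xC7) by power-on hours (0x09)."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return crc_errors / power_on_hours * 1000


# (CRC error count, power-on hours) taken from the two reports in this thread.
drives = {
    "WMAP41107530": (1, 19310),
    "WMAP41565239": (30, 92),
}

rates = {}
for serial, (errors, hours) in drives.items():
    rates[serial] = crc_errors_per_1000_hours(errors, hours)
    print(f"{serial}: {rates[serial]:.2f} CRC errors per 1000 power-on hours")
```

The first drive comes out at roughly 0.05 and the second at roughly 326, which is why the older drive can be ignored while the 92-hour drive looks suspect.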

The 3rd disk (serial unknown) reporting a non-zero number for "Data" on SMART attribute 0xC5 is almost certainly the reason for your array stalling. Any attempt to read data from that LBA will result in an I/O error. If the underlying RAID driver and/or option ROM ("BIOS" or "firmware") considers a read error worthy of ejecting a disk from the array, then that's quite an amusing design. It may also be that the driver/firmware chooses to attempt re-reads of the LBA and that results in continued I/O errors (duh) and after N number of retries the driver/firmware considers the disk bad -- again, amusing design.

I can't tell if this is the case because you used HD Tune Pro rather than smartmontools. smartmontools provides *much* more information from SMART than what HD Tune Pro does, and that information is often crucial in situations like this.

Getting SMART data from stand-alone drives attached to a Windows system via native SATA (XP or Vista or 7) is easy (doesn't matter if the OS shows a drive letter or not): smartctl -x /dev/sdX where X is a lowercase letter like "a" for the 1st disk, "b" for the 2nd, etc.

HD Tune Pro's Error Scan option, with the Quick checkbox checked, does not check every LBA. Please do not use the Error Scan option on the 3rd or 4th disks at this point -- it may result in further issues. smartmontools offers better ways to do LBA scans on disks in this condition.

I would suggest getting SMART attributes for the 3rd and 4th disks please, and I must again press for data from smartmontools smartctl -x from all drives.

Finally, did Windows' Event Log not show you any history/reasoning behind what's going on?
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

The other two disks.....
This is being run from the "good" machine with the drives from the "bad" machine in an attached external enclosure. There must be something wrong with what I'm entering into the command prompt because smartmontools is not spitting out any data. :( I included the HD Tune results as well.

I can no longer boot into the bad machine - it's toast. If I can get the array back I can rebuild from a backup on the server or I can just format it and start over. It doesn't much matter - it's my back-up machine.

There were some disk errors in the System logs.

The drive with much lower usage hours was a replacement for a drive that had gone bad.

C:\Users\Martye>smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sde -d usbjmicron # /dev/sde [USB JMicron], ATA device
/dev/csmi2,0 -d ata # /dev/csmi2,0, ATA device
/dev/csmi2,1 -d ata # /dev/csmi2,1, ATA device
/dev/csmi2,2 -d ata # /dev/csmi2,2, ATA device
/dev/csmi2,3 -d ata # /dev/csmi2,3, ATA device
 

C:\Users\Martye>smartctl -x /dev/sde
smartctl 6.0 2012-10-10 r3643 [x86_64-w64-mingw32-win7-sp1] (sf-6.0-1)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
 
Read Device Identity failed: empty IDENTIFY data
 
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
 

Serial # WMAP41573589
HD Tune Pro: WDC WD1500ADFD-00NLR4 Health

ID                                  Current  Worst    Threshold Data    Status
(01) Raw Read Error Rate            200      200      51       0        ok       
(03) Spin Up Time                   167      160      21       4675     ok       
(04) Start/Stop Count               100      100      40       846      ok       
(05) Reallocated Sector Count       200      200      140      0        ok       
(07) Seek Error Rate                200      200      51       0        ok       
(09) Power On Hours Count           78       78       0        16284    ok       
(0A) Spin Retry Count               100      100      51       0        ok       
(0B) Calibration Retry Count        100      100      51       0        ok       
(0C) Power Cycle Count              100      100      0        682      ok       
(C2) Temperature                    122      102      0        25       ok       
(C4) Reallocated Event Count        200      200      0        0        ok       
(C5) Current Pending Sector         200      200      0        1        warning  
(C6) Offline Uncorrectable          200      200      0        1        ok       
(C7) Ultra DMA CRC Error Count      200      253      0        0        ok       
(C8) Write Error Rate               200      200      51       0        ok       
 
Health Status         : warning
 

********************
C:\Users\Martye>smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sde -d usbjmicron # /dev/sde [USB JMicron], ATA device
/dev/csmi2,0 -d ata # /dev/csmi2,0, ATA device
/dev/csmi2,1 -d ata # /dev/csmi2,1, ATA device
/dev/csmi2,2 -d ata # /dev/csmi2,2, ATA device
/dev/csmi2,3 -d ata # /dev/csmi2,3, ATA device
 

C:\Users\Martye>smartctl /dev/sde
smartctl 6.0 2012-10-10 r3643 [x86_64-w64-mingw32-win7-sp1] (sf-6.0-1)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
 
ATA device successfully opened
 
Use 'smartctl -a' (or '-x') to print SMART (and more) information
 
C:\Users\Martye>smartctl -x /dev/sde
smartctl 6.0 2012-10-10 r3643 [x86_64-w64-mingw32-win7-sp1] (sf-6.0-1)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
 
Read Device Identity failed: empty IDENTIFY data
 
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
 
C:\Users\Martye>smartctl -a /dev/sde
smartctl 6.0 2012-10-10 r3643 [x86_64-w64-mingw32-win7-sp1] (sf-6.0-1)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
 
Read Device Identity failed: empty IDENTIFY data
 
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
 

Serial # WMAP41592444
HD Tune Pro: WDC WD1500ADFD-00NLR4 Health

ID                                  Current  Worst    Threshold Data    Status
(01) Raw Read Error Rate            200      200      51       0        ok       
(03) Spin Up Time                   167      161      21       4666     ok       
(04) Start/Stop Count               100      100      40       849      ok       
(05) Reallocated Sector Count       200      200      140      0        ok       
(07) Seek Error Rate                200      200      51       0        ok       
(09) Power On Hours Count           78       78       0        16303    ok       
(0A) Spin Retry Count               100      100      51       0        ok       
(0B) Calibration Retry Count        100      100      51       0        ok       
(0C) Power Cycle Count              100      100      0        681      ok       
(C2) Temperature                    120      101      0        27       ok       
(C4) Reallocated Event Count        200      200      0        0        ok       
(C5) Current Pending Sector         200      200      0        0        ok       
(C6) Offline Uncorrectable          200      200      0        0        ok       
(C7) Ultra DMA CRC Error Count      200      253      0        0        ok       
(C8) Write Error Rate               200      200      51       0        ok       
 
Health Status         : ok
 

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

smartctl -x -d usbjmicron /dev/sde may have worked. smartmontools has support for a few USB-SATA bridges (JMicron chips being one of them), but each JMicron chip is different, so sometimes it works, sometimes it doesn't. It all depends on the enclosure -- it's why using USB-based enclosures is not a decent way to do forensics. The -d flag basically gives a hint as to how to communicate with the underlying device; the auto-detection for USB-SATA bridges does not always work correctly, especially on Windows.
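
A minimal Python sketch of that fallback: if a plain `smartctl -x` attempt reports an IDENTIFY failure, reuse the `-d` type that `smartctl --scan` printed for the same device. The function names are hypothetical, the sample strings are copied from the output posted earlier in this thread, and nothing here guarantees a given bridge chip will respond.

```python
import re

IDENTIFY_FAILURE = "Read Device Identity failed"


def type_hint_for(device, scan_output):
    """Return the '-d TYPE' value that `smartctl --scan` printed for a device."""
    for line in scan_output.splitlines():
        # Scan lines look like: <device> -d <type> # <comment>
        match = re.match(r"^(\S+)\s+-d\s+(\S+)", line.strip())
        if match and match.group(1) == device:
            return match.group(2)
    return None


def retry_command(device, first_output, scan_output):
    """Suggest a retry with an explicit -d hint when IDENTIFY failed."""
    if IDENTIFY_FAILURE not in first_output:
        return None  # the first attempt identified the drive; no retry needed
    hint = type_hint_for(device, scan_output)
    if hint is None:
        return None  # the scan gave no type for this device
    return f"smartctl -x -d {hint} {device}"


scan = "/dev/sde -d usbjmicron # /dev/sde [USB JMicron], ATA device"
failed = "Read Device Identity failed: empty IDENTIFY data"
print(retry_command("/dev/sde", failed, scan))
```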

WMAP41573589 indicates (via 0xC6) that there has been 1 LBA which the drive could neither read nor auto-correct using the ECC section of the sector, and as such (via 0xC5), that LBA has been marked "suspect" (is no longer readable). The data stored in that sector (512 bytes) is lost. The only way to get the drive to re-analyse the LBA (to determine if it's actually OK or if it needs to be remapped to a spare sector) is to issue a write to it. There are many ways to go about doing this if you don't want to do an RMA (and IMO one sector being potentially unreadable after 16,000 hours of use is reasonable). I see no other problems with the drive. I would say this drive is probably the one causing you grief. Let me know how you want to proceed.

WMAP41592444 looks perfectly healthy at this point in time.
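
For anyone wanting to repeat the 0xC5/0xC6 check across several drives, the Python sketch below pulls the raw "Data" value for those two attributes out of HD Tune Pro style rows. The regular expression assumes the column layout pasted in this thread; other tools or versions may format their reports differently.

```python
import re

# One HD Tune row: (ID) name, then current, worst, threshold, data, status.
ROW = re.compile(
    r"^\((?P<id>[0-9A-F]{2})\)\s+(?P<name>.+?)\s+"
    r"(?P<current>\d+)\s+(?P<worst>\d+)\s+(?P<threshold>\d+)\s+"
    r"(?P<data>\d+)\s+(?P<status>\w+)\s*$"
)

# Attributes that count unreadable sectors.
WATCHED = {"C5": "Current Pending Sector", "C6": "Offline Uncorrectable"}


def unreadable_sector_counts(report):
    """Return {attribute id: raw data value} for the watched attributes."""
    counts = {}
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("id") in WATCHED:
            counts[match.group("id")] = int(match.group("data"))
    return counts


# Rows copied from the WMAP41573589 report above.
REPORT = """\
(05) Reallocated Sector Count       200      200      140      0        ok
(C5) Current Pending Sector         200      200      0        1        warning
(C6) Offline Uncorrectable          200      200      0        1        ok
(C7) Ultra DMA CRC Error Count      200      253      0        0        ok
"""

counts = unreadable_sector_counts(REPORT)
suspect = any(value > 0 for value in counts.values())
print(counts, "suspect" if suspect else "clean")
```

Any non-zero raw value for either attribute is worth a closer look, regardless of the "ok" or "warning" label the tool attaches.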
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

said by koitsu:

smartctl -x -d usbjmicron /dev/sde may have worked.

Certainly got a lot more output with that command.
This is the disk with Ser. #WMAP41573589

If the Linux live disk option no longer works and Smartmontools doesn't work with NVIDIA RAID I don't have a lot of choice to determine which disk might be bad except to use the enclosure.

When this disk goes into the enclosure I don't get the message that the disk needs to be formatted before it can be used. The other disks generated that message.

APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Disabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 4783) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  72) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   200   200   051    -    0
  3 Spin_Up_Time            POS---   167   160   021    -    4675
  4 Start_Stop_Count        -O--CK   100   100   040    -    847
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -O-R--   200   200   051    -    0
  9 Power_On_Hours          -O--CK   078   078   000    -    16284
 10 Spin_Retry_Count        -O--C-   100   100   051    -    0
 11 Calibration_Retry_Count -O--C-   100   100   051    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    683
194 Temperature_Celsius     -O---K   125   102   000    -    22
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--C-   200   200   000    -    1
198 Offline_Uncorrectable   -O--C-   200   200   000    -    1
199 UDMA_CRC_Error_Count    -O-R--   200   253   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   051    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
 
ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA commands not supported
Read GP Log Directory failed
 
SMART Log Directory Version 1 [multi-sector log support]
SMART Log at address 0x00 has    1 sectors [Log Directory]
SMART Log at address 0x01 has    1 sectors [Summary SMART error log]
SMART Log at address 0x02 has    2 sectors [Comprehensive SMART error log]
SMART Log at address 0x06 has    1 sectors [SMART self-test log]
SMART Log at address 0x09 has    1 sectors [Selective self-test log]
SMART Log at address 0x80 has   16 sectors [Host vendor specific log]
SMART Log at address 0x81 has   16 sectors [Host vendor specific log]
SMART Log at address 0x82 has   16 sectors [Host vendor specific log]
SMART Log at address 0x83 has   16 sectors [Host vendor specific log]
SMART Log at address 0x84 has   16 sectors [Host vendor specific log]
SMART Log at address 0x85 has   16 sectors [Host vendor specific log]
SMART Log at address 0x86 has   16 sectors [Host vendor specific log]
SMART Log at address 0x87 has   16 sectors [Host vendor specific log]
SMART Log at address 0x88 has   16 sectors [Host vendor specific log]
SMART Log at address 0x89 has   16 sectors [Host vendor specific log]
SMART Log at address 0x8a has   16 sectors [Host vendor specific log]
SMART Log at address 0x8b has   16 sectors [Host vendor specific log]
SMART Log at address 0x8c has   16 sectors [Host vendor specific log]
SMART Log at address 0x8d has   16 sectors [Host vendor specific log]
SMART Log at address 0x8e has   16 sectors [Host vendor specific log]
SMART Log at address 0x8f has   16 sectors [Host vendor specific log]
SMART Log at address 0x90 has   16 sectors [Host vendor specific log]
SMART Log at address 0x91 has   16 sectors [Host vendor specific log]
SMART Log at address 0x92 has   16 sectors [Host vendor specific log]
SMART Log at address 0x93 has   16 sectors [Host vendor specific log]
SMART Log at address 0x94 has   16 sectors [Host vendor specific log]
SMART Log at address 0x95 has   16 sectors [Host vendor specific log]
SMART Log at address 0x96 has   16 sectors [Host vendor specific log]
SMART Log at address 0x97 has   16 sectors [Host vendor specific log]
SMART Log at address 0x98 has   16 sectors [Host vendor specific log]
SMART Log at address 0x99 has   16 sectors [Host vendor specific log]
SMART Log at address 0x9a has   16 sectors [Host vendor specific log]
SMART Log at address 0x9b has   16 sectors [Host vendor specific log]
SMART Log at address 0x9c has   16 sectors [Host vendor specific log]
SMART Log at address 0x9d has   16 sectors [Host vendor specific log]
SMART Log at address 0x9e has   16 sectors [Host vendor specific log]
SMART Log at address 0x9f has   16 sectors [Host vendor specific log]
SMART Log at address 0xa0 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa1 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa2 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa3 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa4 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa5 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa6 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa7 has   16 sectors [Device vendor specific log]
SMART Log at address 0xa8 has    1 sectors [Device vendor specific log]
SMART Log at address 0xa9 has    1 sectors [Device vendor specific log]
SMART Log at address 0xaa has    1 sectors [Device vendor specific log]
SMART Log at address 0xab has    1 sectors [Device vendor specific log]
SMART Log at address 0xac has    1 sectors [Device vendor specific log]
SMART Log at address 0xad has    1 sectors [Device vendor specific log]
SMART Log at address 0xae has    1 sectors [Device vendor specific log]
SMART Log at address 0xaf has    1 sectors [Device vendor specific log]
SMART Log at address 0xb0 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb1 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb2 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb3 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb4 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb5 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb6 has    1 sectors [Device vendor specific log]
SMART Log at address 0xb7 has    1 sectors [Device vendor specific log]
SMART Log at address 0xc0 has    1 sectors [Device vendor specific log]
SMART Log at address 0xe0 has    1 sectors [SCT Command/Status]
SMART Log at address 0xe1 has    1 sectors [SCT Data Transfer]
 
SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
 
SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
 
Error 1 occurred at disk power-on lifetime: 16270 hours (677 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 97 6b 9f 40  Error: UNC 8 sectors at LBA = 0x009f6b97 = 10447767
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 01 08 97 6b 9f 41 00      00:00:32.250  READ DMA EXT
  25 01 08 87 dc 9e 41 00      00:00:32.250  READ DMA EXT
  25 01 01 4e ea a7 46 00      00:00:32.250  READ DMA EXT
  25 01 01 4e ea a7 46 00      00:00:32.250  READ DMA EXT
  61 01 00 ee 89 77 41 00      00:00:32.250  WRITE FPDMA QUEUED
 
SMART Extended Self-test Log (GP Log 0x07) not supported
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15315         -
# 2  Short offline       Completed without error       00%     13596         -
# 3  Short offline       Completed without error       00%     12376         -
# 4  Short offline       Completed without error       00%     12329         -
# 5  Conveyance offline  Completed without error       00%      1332         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
SCT Status Version:                  2
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                 22 Celsius
Power Cycle Max Temperature:         22 Celsius
Lifetime    Max Temperature:         53 Celsius
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      5/55 Celsius
Min/Max Temperature Limit:            1/60 Celsius
Temperature History Size (Index):    128 (121)
 
Index    Estimated Time   Temperature Celsius
 122    2012-12-17 12:09    36  *****************
 123    2012-12-17 12:10    36  *****************
 124    2012-12-17 12:11    37  ******************
 ...    ..(  8 skipped).    ..  ******************
   5    2012-12-17 12:20    37  ******************
   6    2012-12-17 12:21    38  *******************
   7    2012-12-17 12:22    37  ******************
   8    2012-12-17 12:23    38  *******************
   9    2012-12-17 12:24    37  ******************
  10    2012-12-17 12:25    38  *******************
 ...    ..(  3 skipped).    ..  *******************
  14    2012-12-17 12:29    38  *******************
  15    2012-12-17 12:30    37  ******************
  16    2012-12-17 12:31    38  *******************
 ...    ..( 21 skipped).    ..  *******************
  38    2012-12-17 12:53    38  *******************
  39    2012-12-17 12:54    39  ********************
  40    2012-12-17 12:55    39  ********************
  41    2012-12-17 12:56    38  *******************
  42    2012-12-17 12:57    38  *******************
  43    2012-12-17 12:58    39  ********************
  44    2012-12-17 12:59    39  ********************
  45    2012-12-17 13:00    39  ********************
  46    2012-12-17 13:01    38  *******************
  47    2012-12-17 13:02    38  *******************
  48    2012-12-17 13:03    39  ********************
 ...    ..( 10 skipped).    ..  ********************
  59    2012-12-17 13:14    39  ********************
  60    2012-12-17 13:15    38  *******************
  61    2012-12-17 13:16    39  ********************
 ...    ..( 12 skipped).    ..  ********************
  74    2012-12-17 13:29    39  ********************
  75    2012-12-17 13:30    38  *******************
  76    2012-12-17 13:31    39  ********************
 ...    ..(  5 skipped).    ..  ********************
  82    2012-12-17 13:37    39  ********************
  83    2012-12-17 13:38     ?  -
  84    2012-12-17 13:39    21  **
  85    2012-12-17 13:40    22  ***
  86    2012-12-17 13:41    22  ***
  87    2012-12-17 13:42    23  ****
  88    2012-12-17 13:43     ?  -
  89    2012-12-17 13:44    24  *****
  90    2012-12-17 13:45    24  *****
  91    2012-12-17 13:46     ?  -
  92    2012-12-17 13:47    25  ******
  93    2012-12-17 13:48    26  *******
  94    2012-12-17 13:49    27  ********
  95    2012-12-17 13:50    28  *********
  96    2012-12-17 13:51    29  **********
  97    2012-12-17 13:52     ?  -
  98    2012-12-17 13:53    22  ***
  99    2012-12-17 13:54     ?  -
 100    2012-12-17 13:55    23  ****
 101    2012-12-17 13:56     ?  -
 102    2012-12-17 13:57    23  ****
 103    2012-12-17 13:58     ?  -
 104    2012-12-17 13:59    25  ******
 105    2012-12-17 14:00    25  ******
 106    2012-12-17 14:01     ?  -
 107    2012-12-17 14:02    22  ***
 108    2012-12-17 14:03    22  ***
 109    2012-12-17 14:04    23  ****
 110    2012-12-17 14:05    25  ******
 111    2012-12-17 14:06    25  ******
 112    2012-12-17 14:07    26  *******
 113    2012-12-17 14:08    27  ********
 114    2012-12-17 14:09    28  *********
 115    2012-12-17 14:10    29  **********
 116    2012-12-17 14:11    30  ***********
 117    2012-12-17 14:12    31  ************
 118    2012-12-17 14:13    32  *************
 119    2012-12-17 14:14    32  *************
 120    2012-12-17 14:15     ?  -
 121    2012-12-17 14:16    22  ***
 
Write SCT (Get) Error Recovery Control Command failed: ATA output registers not supported
SCT (Get) Error Recovery Control command failed
 
Device Statistics (GP Log 0x04) not supported
 
ATA_READ_LOG_EXT (addr=0x11:0x00, page=0, n=1) failed: 48-bit ATA commands not supported
Read SATA Phy Event Counters failed
 

As far as how to proceed - what makes sense to you? I'm open to any suggestions.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
...
  9 Power_On_Hours          -O--CK   078   078   000    -    16284
...
197 Current_Pending_Sector  -O--C-   200   200   000    -    1
198 Offline_Uncorrectable   -O--C-   200   200   000    -    1
...
SMART Error Log Version: 1
ATA Error Count: 1
...
Error 1 occurred at disk power-on lifetime: 16270 hours (677 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.
  
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 97 6b 9f 40  Error: UNC 8 sectors at LBA = 0x009f6b97 = 10447767
  
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 01 08 97 6b 9f 41 00      00:00:32.250  READ DMA EXT
  25 01 08 87 dc 9e 41 00      00:00:32.250  READ DMA EXT
  25 01 01 4e ea a7 46 00      00:00:32.250  READ DMA EXT
  25 01 01 4e ea a7 46 00      00:00:32.250  READ DMA EXT
  61 01 00 ee 89 77 41 00      00:00:32.250  WRITE FPDMA QUEUED
 

This clearly indicates a failed read at LBA 10447767. The drive itself detected this condition during a series of 48-bit LBA reads (READ DMA EXT). The I/O error happened roughly 14 power-on hours ago (attribute 9 reads 16284; the error was logged at 16270), and what you see in attributes 197 and 198 is a result of this.
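For anyone who wants to double-check the numbers from the error log, here is a minimal Python sketch. It assumes the SN/CL/CH columns hold LBA bits 0-7, 8-15 and 16-23 (the usual layout in this style of smartctl dump) and ignores any higher LBA bits, which smartctl's own decoded value implies are zero here. The function name is made up for illustration.

```python
# Register values copied from the SMART error log entry above (hex).
# Assumption: SN/CL/CH carry LBA bits 0-7, 8-15 and 16-23 respectively.
SN, CL, CH = 0x97, 0x6B, 0x9F

def lba_from_registers(sn: int, cl: int, ch: int) -> int:
    """Combine the low three LBA bytes as printed in the error log."""
    return (ch << 16) | (cl << 8) | sn

lba = lba_from_registers(SN, CL, CH)

# Power_On_Hours (attribute 9) minus the error's power-on lifetime.
hours_since_error = 16284 - 16270

print(hex(lba), lba, hours_since_error)
```

The result should agree with the "LBA = 0x009f6b97 = 10447767" text that smartctl printed, and with the "roughly 14 hours" figure.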

So how do you want to proceed? (See second paragraph)
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

My question is whether you think the drive is salvageable or will it always be suspect and I'd be better off replacing it. If it's worth a shot I'd give writing to it a try. It can't hurt anything at this point.

Then what to do with the drive with 30 Ultra DMA CRC Errors in 94 hours of use? That seems like too much especially when compared to the other drives with many times the hours of use. That one may actually be under warranty because it was a replacement last year.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

said by rockisland:

My question is whether you think the drive is salvageable or will it always be suspect and I'd be better off replacing it. If it's worth a shot I'd give writing to it a try. It can't hurt anything at this point. :)

From my perspective there's absolutely nothing anomalous about the drive aside from at least 1 sector that may or may not be bad (won't know until a write is issued to the LBA). Your choices here:

1. Zero the entire drive (writing zeros to every LBA). HD Tune Pro can do this via the Erase tab, or you can use whatever other utility you want (CCleaner for example has this feature too). FORMAT will not do this (at least not on XP), nor will Disk Management. Take a screenshot/snapshot of the SMART attributes before and after the drive is erased. I can do the post-analysis from there.

This choice has the advantage of detecting and dealing with any other LBAs/sectors that may cause issues. Meaning: right now you only know of one, but there may be others (on other areas of the drive you haven't used yet).

On the downside, zeroing the entire drive takes a while.

I tend to recommend this method because it's easiest and can also reveal other sectors that may have issues.

I also tend to recommend that after zeroing, you issue an Error Scan (if using HD Tune Pro) of every LBA on the disk (i.e. un-check the Quick checkbox). This takes a while too, but ensures that every LBA is readable before you put the drive back into the array.

2. Issue a write to the individual LBA that the drive has issues with (LBA 10447767). The drive will re-analyse the individual sector and either remap the LBA to a spare or decide the sector is fine and keep the existing mapping.

This has the advantage of being very quick to do (a single write takes milliseconds), and does not require you to have to back up any data from the drive to begin with (latter doesn't apply in your case since it's used for RAID).

On the downside, doing this is tricky and requires familiarity with tools such as dd (I don't trust any other utility) and exactly what arguments to use (messing these up or omitting one can result in the entire drive being zeroed). You also have to read from that individual LBA first -- why? Because I have seen cases where the drive firmware says LBA X while the OS insists LBA X is perfectly fine and it's LBA X+1 which has the issue (don't ask; this is not an off-by-one mistake, this is just downright something bizarre that I've seen reported here).

In general, on RAID arrays where a checksumming filesystem is not in use (that is, with NTFS, FAT, ext2, ext3, ext4, etc.), I do not recommend this method unless after doing so you immediately tell the RAID management software to nuke the metadata on the disk and rebuild the array entirely with that drive (i.e. treat the now-repaired drive as a new disk). Failure to do this can/will result in one of your files, when read, returning 512 bytes of zeros where there was previously data. Which file is affected is also unknown/undetermined. There's nothing you can do about this situation, sadly (think about the situation if it was a standalone, non-RAID disk).

3. RMA the drive (preferably an Advanced RMA, since it ensures you get a replacement drive first, which you can test fully before sending the other drive back).

This has the advantage of being the simplest choice and usually the least painful, i.e. box the drive up and ship it off.

On the downside, Advanced RMA requires that you have a credit card handy (in case they don't receive the bad drive you get charged for the new one, at a significantly increased price), that you have proper shipping materials (anti-static peanuts/foam, anti-static bags, sturdy box, etc.) for the bad drive, and that it takes about a week to get the replacement drive. The other downside is that if you do this over the phone (please try to avoid that) you have to "prove" to the person you speak to that the drive is bad. They also ask you the question "is this drive in a RAID array?" to which you should answer NO. I've ranted about this sneaky/tricky question in a DSLR/BBR post in the past; I can dig it up if you want. Just answer no and move on. Their website, AFAIK, does not ask this question. For the RMA reason, just say "bad sectors".
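Regarding option 2 above, the "read from that individual LBA first" check could look roughly like the following. This is only a sketch in Python, not the dd procedure itself: it is read-only, it is demonstrated against a scratch file standing in for a disk, it relies on os.pread (POSIX only, not available on Windows), and real raw devices may have alignment and permission requirements that this ignores. The helper name read_lba is hypothetical.

```python
import os
import tempfile

SECTOR_SIZE = 512  # bytes per LBA, per the drive output in this thread

def read_lba(path: str, lba: int, sector_size: int = SECTOR_SIZE) -> bytes:
    """Read exactly one sector at the given LBA; raises OSError on an I/O error."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, sector_size, lba * sector_size)
    finally:
        os.close(fd)

# Demonstration against a small scratch file standing in for a disk.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\x00" * SECTOR_SIZE * 4)   # LBAs 0-3: zeros
    scratch.write(b"\xAA" * SECTOR_SIZE)       # LBA 4: marker pattern
    scratch_path = scratch.name

sector = read_lba(scratch_path, 4)
os.unlink(scratch_path)
print(len(sector), sector[:4].hex())
```

On a real disk, an unreadable sector would be expected to surface as an OSError from the read rather than as returned data.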
said by rockisland:

Then what to do with the drive with 30 Ultra DMA CRC Errors in 94 hours of use? That seems like too much especially when compared to the other drives with many times the hours of use. That one may actually be under warranty because it was a replacement last year.

I already answered this. Quote:
said by koitsu:

... If you really did replace the drive 92 hours ago, I recommend waiting until the next array degradation event happens and then see if the CRC error count [has] increased. ...

rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

Not sure what HD Tune did but the drive seems to be really toasted now. The Erase function didn't take very long at all and filled the entire screen with red segments.
The full error scan took seconds and likewise filled the screen with red segments and now the drive no longer shows up in HD Tune.

CCleaner is unusable because the drive doesn't have a drive letter assigned to it.

C:\Users\Martye>smartctl -x -d usbjmicron /dev/sde
smartctl 6.0 2012-10-10 r3643 [x86_64-w64-mingw32-win7-sp1] (sf-6.0-1)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
 
Smartctl open device: /dev/sde [USB JMicron] failed: \\.\PhysicalDrive4: Open failed, Error=2
 

I think we killed it. :)

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

HD Tune Pro didn't "do anything" in that case. The drive is not dead. The drive is in the same condition as before.

Many people, myself included, have experienced the problem you saw. To me this further indicates it's more of a Windows I/O subsystem problem than an HD Tune Pro problem: I experienced it with two separate utilities, then, without disconnecting/reconnecting the CF card, used a completely different utility which worked fine.

Proof (read, do not skim): »Re: Repair or Replace Disk Warning on Brand New WD Caviar Black.

I can step you through using dd on Windows (as shown in my post, it does work -- link to software) if you'd like. Be aware if you screw this up you can completely destroy all contents of a drive, so you need to be cautious. Start with dd --list and provide the full output here. If the output is multiple pages, please use dd --list > C:\list.txt then open C:\list.txt in Notepad and copy/paste the contents here.

If you aren't sure which drive is the correct one, disconnect the drive (unplug the USB connector), wait 15 seconds, then re-run dd --list and compare the new output to the old. It should become fairly obvious which device is relevant. If it isn't, please attach both outputs (from when the drive is attached, and from when the drive is not attached).

Again: I can help you through this, but you need to be patient.

In general, blame Windows for its nonsense/bugs/whatever, and the fact that there are not any good utilities of this sort. (I have some others I could recommend but they do stupid things like require you to unplug/replug the device for absolutely no good reason). Using Windows for forensics/repair -- serious PITA.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

I'm assuming that destroying drive contents is not an issue as this disk is a member of a RAID array. If the disk is erased it should be no different than replacing it with a new drive and letting the array rebuild. We already tried to erase it with HD Tune.

=> dd software- 0.6 beta or 0.5?

I will need you to walk me through (and thank you for offering) as I am absolutely horrible with command prompts.

Addendum: I got 0.5.zip, extracted it, hit run on dd.exe and got a command prompt type window. Typed in dd --list, hit enter, and got nothing except another copy of the text dd --list.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

First paragraph: correct. The whole premise here is to get the drive to either remap the LBA to a new sector (if the sector is determined as bad) or to clear the "suspect" state (i.e. existing sector is fine). That's all we're effectively trying to do.

0.6 beta is fine.

You need to extract dd.exe from the .zip file and place it somewhere (like C: or wherever you want; C:\ makes it easier). Then launch Command Prompt, and navigate to that path by selecting the drive letter and changing into the directory, i.e. for C:\ :

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
 
C:\Documents and Settings\jdc>C:
 
C:\Documents and Settings\jdc>cd \
 
C:\>
 

From there run dd --list. You should get output that roughly resembles this:

C:\>dd --list
rawwrite dd for windows version 0.6beta3.
Written by John Newbigin <jn@it.swin.edu.au>
This program is covered by terms of the GPL Version 2.
 
Win32 Available Volume Information
\\.\Volume{03ad2fc1-3b45-11e2-bf5c-806d6172696f}\
  link to \\?\Device\HarddiskVolume1
  fixed media
  Mounted on \\.\c:
 
\\.\Volume{03ad2fc2-3b45-11e2-bf5c-806d6172696f}\
  link to \\?\Device\HarddiskVolume2
  fixed media
  Mounted on \\.\d:
 
\\.\Volume{03ad2fc0-3b45-11e2-bf5c-806d6172696f}\
  link to \\?\Device\CdRom0
  CD-ROM
  Mounted on \\.\e:
 
\\.\Volume{fa47b5c0-3b8c-11e2-a637-806d6172696f}\
  link to \\?\Device\CdRom1
  CD-ROM
  Mounted on \\.\f:
 
NT Block Device Objects
\\?\Device\CdRom0
  size is 6682574848 bytes
\\?\Device\CdRom1
  size is 4347138048 bytes
\\?\Device\Harddisk0\Partition0
  link to \\?\Device\Harddisk0\DR0
  Fixed hard disk media. Block size = 512
  size is 120034123776 bytes
\\?\Device\Harddisk0\Partition1
  link to \\?\Device\HarddiskVolume1
\\?\Device\Harddisk1\Partition0
  link to \\?\Device\Harddisk1\DR1
  Fixed hard disk media. Block size = 512
  size is 1000204886016 bytes
\\?\Device\Harddisk1\Partition1
  link to \\?\Device\HarddiskVolume2
 
Virtual input devices
 /dev/zero   (null data)
 /dev/random (pseudo-random data)
 -           (standard input)
 
Virtual output devices
 -           (standard output)
 /dev/null   (discard the data)
 

This is the output I'm looking for, specifically one for when the drive is attached to the system, and one for when it isn't (to determine what the correct \\?\Device\xxx entry is).

To resize the Command Prompt window, please follow this guide:

»physiology.med.unc.edu/w ··· mpt.html

I see that dd doesn't output to stdout (he must be writing to the buffer directly, for no good reason), so redirecting to a file doesn't work. Sigh. I'll have to mail the author about that -- that is just downright stupid, especially for a utility that's supposed to emulate a *IX system, and I'm going to have choice words with him about that.

For now, to copy the contents of the Command Prompt window, please follow one of these guides:

»www.microsoft.com/resour ··· mfr=true
»www.megaleecher.net/Copy ··· s_Window

Then paste the output into a Notepad window and save the results somewhere (doesn't matter where). Do this once with the drive attached, and once with the drive detached, so you'll have 2 files (duh). Then upload each file here using the Preview/Attach button and let me review the rest.

If all of this is too complex/too annoying/doesn't work, I have another alternative program (GUI-based) that I could step you through, but I've never used it for erasing drives (though I do use some of the author's other software) so I don't know if it would have the same issue as HD Tune Pro or Active@ Kill Disk.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

not_attached.txt
3,157 bytes
attached.txt
3,322 bytes
I had dd 0.5 so that is what I used.
Almost nothing is too complex if I have instructions; I'm pretty good at following directions.

txt files attached.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu to rockisland

MVM

to rockisland
Thanks. Here's the important bits:

D:\>diff -u0 not_attached.txt attached.txt
--- not_attached.txt    2012-12-19 10:30:30 -0800
+++ attached.txt        2012-12-19 10:30:33 -0800
@@ -3,0 +4,2 @@
+C:\Users\Martye>C:
+
@@ -104,0 +107,4 @@
+\\?\Device\Harddisk4\Partition0
+  link to \\?\Device\Harddisk4\DR13
+  Fixed hard disk media. Block size = 512
+  size is 150039945216 bytes
 

From this we can tell the necessary device string is \\?\Device\Harddisk4\Partition0

Let's start by trying to read all LBAs on the drive -- this is done purely to make sure we have the right device string when it's attached (i.e. if you were to attach another USB device between now and then, the device string would be different). This assumes there is an access LED on the USB enclosure somewhere, indicating reading or writing:

dd if=\\?\Device\Harddisk4\Partition0 of=/dev/null bs=64k --size --progress
 

While this is running, you'll see how many bytes it has transferred from the device. Please look at your USB enclosure to see if the LED is blinking or is lit constantly -- if it is, we've got the right device string.

You can press Ctrl-C at any time to stop dd. (Meaning you don't have to wait for the command to end -- this is just a test to see if we've got the right device string)

The next step after this will be to zero the drive, so please do not disconnect the drive from the system between now and then.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

bad_path.txt
3,476 bytes
Well that didn't work - didn't like the path.
said by dd :
Error opening output file: 3 The system cannot find the path specified
So I re-ran dd --list and then immediately re-ran the command you gave me and it still didn't work. See attached.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

Try using \\?\Device\Harddisk4\DR14 instead. These should point to the same thing, but you're really supposed to use the PartitionX entries.

If that doesn't work, then this probably explains why HD Tune Pro is wigging out -- the underlying device subsystem is not working for some reason. I do not have an explanation for this, other than some bug in Windows or the USB enclosure literally crapping out (welcome to why USB-SATA bridges further complicate things when it comes to forensics -- there is a lot of under-the-hood black-box nonsense that goes on). There's really no other explanation.

I would suggest doing this:

- Shut the system down (i.e. full power off)
- If the USB drive is powered via an AC adapter (I sure hope so, because it's a 3.5" drive!), unplug the AC adapter
- Wait 15 seconds
- Power on the PC
- Once the system is up/usable, plug back in the AC adapter
- Re-run dd --list and look for the drive. You'll be able to tell because it's the only drive in your system with a size of 150039945216 bytes
- Look for the relevant/preceding \\?\Device\HarddiskX\PartitionX entry for that drive
- Use that device string for the if= parameter for dd
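The "find the entry with the right size" step in the list above can also be done mechanically once the dd --list output has been saved to a text file. A rough Python sketch follows; it assumes the listing has the same shape as the "NT Block Device Objects" section shown earlier in the thread (an unindented device line followed by indented detail lines), which may not hold for every dd version. The sample text and the function name devices_with_size are for illustration only.

```python
import re

# Sample text in the same shape as the "NT Block Device Objects" section
# shown earlier in the thread; real output may differ between dd versions.
SAMPLE = r"""
\\?\Device\Harddisk0\Partition0
  link to \\?\Device\Harddisk0\DR0
  Fixed hard disk media. Block size = 512
  size is 120034123776 bytes
\\?\Device\Harddisk0\Partition1
  link to \\?\Device\HarddiskVolume1
\\?\Device\Harddisk4\Partition0
  link to \\?\Device\Harddisk4\DR13
  Fixed hard disk media. Block size = 512
  size is 150039945216 bytes
"""

def devices_with_size(listing: str, size_bytes: int) -> list:
    """Return device strings whose indented 'size is N bytes' line matches."""
    matches = []
    current = None
    for line in listing.splitlines():
        if line.startswith("\\\\?\\"):      # unindented line: a new device entry
            current = line.strip()
        else:
            m = re.match(r"\s+size is (\d+) bytes", line)
            if m and current and int(m.group(1)) == size_bytes:
                matches.append(current)
    return matches

found = devices_with_size(SAMPLE, 150039945216)
print(found)
```

If more than one entry comes back, the size alone is not enough to identify the drive and the attach/detach comparison is the safer route.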

I can assure you this tool works just fine, and as said, do not have an explanation for why the device subsystem is behaving this way.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

retry.txt
3,700 bytes
Ran it both ways and still nothing - see attached.

Can it be because this drive is damaged enough that it does not receive a drive letter nor do I get the notice from Windows that the drive must be formatted before it can be used? That is different behavior than the other drives I tested.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

Well we're out of options under Windows at this point (the other utility I thought might work doesn't -- possibly because the unit I'm testing with has a filesystem on it, but also possibly because it's intended for CF/flash drives only).

I'm going to have to write a how-to of some sort for a small Linux distribution or possibly FreeBSD or mfsBSD, because it's fairly obvious something is being a jerk at this point. This will take me a while to do -- sorry.

And just to clarify: no, the suspect LBA is not the cause of Windows behaving like this. It would be if the suspect LBA was LBA 0, but it isn't from what I can tell.
koitsu

koitsu

MVM

Okay, here's how to do it using mfsBSD (which is standard FreeBSD but very minimal and provides enough utilities to accomplish what we need with a small footprint).

Be aware you should use a USB 2.0 controller for this. USB 3.0 is not decently supported in FreeBSD 9.0, so stick with USB 2.0 please.

Warning: If you are some random Internet user reading this "tutorial", DO NOT perform this on an SSD. This should only be used for mechanical HDDs.

If you want to make sure you don't risk destroying data on any of your other drives, I recommend detaching their SATA cables (you do not need to detach their power cable). I'm not responsible if you mess this up. :-)

1. Download this ISO and burn it to a CD (note that this assumes you have at least a 64-bit capable CPU, i.e. something made in the past 7 years or so).

2. Make sure the USB drive is attached to your system and is AC-powered.

3. Boot the CD.

4. At the main FreeBSD logo text screen, just press Enter and wait. It can take some time for things to start.

5. The kernel should load and eventually present you with a login: prompt. Enter the username root and the password mfsroot. You should now have a mfsbsd# prompt.

6. Issue the command camcontrol devlist, which should show all attached hard disks on your system (including USB ones). Example output:

mfsbsd# camcontrol devlist
<INTEL SSDSA2CW080G3 4PC10302>     at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD10EFRX-68JCSN0 01.01A01>    at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD10EFRX-68JCSN0 01.01A01>    at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD20EFRX-68AX9N0 80.00A80>    at scbus3 target 0 lun 0 (ada3,pass3)
 

In your case, you're looking for the drive labelled something like WDC WD1500ADFD-00NLR4.

What you need specifically is the device name at the end of the output, specifically labelled daX for USB-attached hard disks and adaX for ATA/SATA native (non-USB-attached) hard disks. Using the above example text, WD20EFRX-68AX9N0 is ada3 thus the device path is /dev/ada3. Ignore anything like passX or otherwise.
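If the listing is long, picking out the adaX/daX name can be sketched in Python like this. It is only an illustration: the parsing assumes lines shaped like the example output above (model string in angle brackets, device names in the trailing parentheses), and the function name disk_paths is made up.

```python
import re

# Example output copied from the post; real listings will differ.
DEVLIST = """\
<INTEL SSDSA2CW080G3 4PC10302>     at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD10EFRX-68JCSN0 01.01A01>    at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD10EFRX-68JCSN0 01.01A01>    at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD20EFRX-68AX9N0 80.00A80>    at scbus3 target 0 lun 0 (ada3,pass3)
"""

def disk_paths(devlist: str, model_fragment: str) -> list:
    """Return /dev paths of adaX/daX devices whose description contains model_fragment."""
    paths = []
    for line in devlist.splitlines():
        m = re.match(r"<(?P<desc>[^>]*)>.*\((?P<names>[^)]*)\)", line)
        if not m or model_fragment not in m.group("desc"):
            continue
        for name in m.group("names").split(","):
            # Keep ada0, da2, ...; skip passX and anything else (cd0, etc.).
            if re.fullmatch(r"a?da\d+", name.strip()):
                paths.append("/dev/" + name.strip())
    return paths

print(disk_paths(DEVLIST, "WD20EFRX"))
```

Note that a model fragment shared by several drives (WD10EFRX in the example) returns more than one path, in which case the serial number or capacity would be needed to tell them apart.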

If the disk does not show up in camcontrol devlist, then the disk is almost certainly not showing up on the USB bus. There are ways to figure this out (using usbconfig dump_device_desc), but that is outside the scope of this writing.

7. Check to make sure you can communicate with the disk reliably using dd to read from the drive starting at LBA 0, using the device path as described above for the if= parameter. Example:

mfsbsd# dd if=/dev/ada3 of=/dev/null bs=64k
 

There will be no output shown; it will appear as if the command is doing nothing. At this phase, check the LED on the USB enclosure; if it's lit or blinking, you're communicating with the right device.

BTW, the parameters in question:

* if stands for input file -- in this case, the disk
* of stands for output file -- in this case, a null device (i.e. all the data read from the disk gets thrown out)
* bs stands for block size -- in this case, 64KBytes (65536 bytes)

You can check the status of dd as well -- on FreeBSD press Ctrl-T to get the status of a currently-running process. You'd see something like:

load: 0.00  cmd: dd 31846 [physrd] 3.36r 0.00u 0.14s 0% 1676k
7928+0 records in
7928+0 records out
519569408 bytes transferred in 3.360398 secs (154615435 bytes/sec)
 

The 1st line is the system load, what kernel state the dd process is in, what the PID is, CPU usage for that process. The 2nd and 3rd lines indicate the number of blocks read/written to/from the relevant input/output files/devices, where 1 block correlates with the bs parameter. The 4th line indicates the number of bytes written to the output file/device, as well as general speed.
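To make the relationship between those lines concrete, the arithmetic for the example status output can be sketched in Python as follows (the variable names are just for illustration; the numbers are taken from the example above):

```python
BLOCK_SIZE = 64 * 1024        # bs=64k
records_out = 7928            # "7928+0 records out": 7928 full blocks, 0 partial
elapsed_secs = 3.360398       # from the last status line

bytes_transferred = records_out * BLOCK_SIZE
rate = bytes_transferred / elapsed_secs   # bytes per second

print(bytes_transferred, round(rate / (1024 * 1024), 1), "MiB/s")
```

The byte count matches the 4th line exactly; the computed rate lands within a few bytes/sec of the figure dd printed, the small difference presumably being rounding of the elapsed time.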

Assuming the USB LED is blinking or lit, press Ctrl-C to end the process; the LED should cease.

Technical note: FreeBSD's dd interacts with disks directly (what Linux users would call O_DIRECT); unlike Linux, there is no caching involved here.

8. Finally, it's time to zero the drive -- This is non-recoverable, so do not make any typos!

mfsbsd# dd if=/dev/zero of=/dev/ada3 bs=64k
 

The parameters:

* if should be the /dev/zero device, which just spits out raw zeros (byte 0x00)
* of should be the disk (i.e. you're writing to it)
* bs is block size (same as above)

Like earlier, there will be no output. This will zero the entire drive (assuming you let it finish) from LBA 0 to the end. And like earlier, you can press Ctrl-T to get a general status of things.

9. After quite some time (remember, USB 2.0 can do up to something like 35MBytes/second, which is much slower than if you were using SATA natively), the dd command should finish.

However, you will almost certainly see some errors at the very end. The message shown will be something like this:

dd: /dev/daX: end of device
XXXX+X records in
XXXX+X records out
XXX bytes transferred in X seconds (X bytes/sec)
 

This is okay -- the reason is that actual capacity of the drive does not end on an even 64KByte boundary, so it wrote as much as it could, and one of those writes was a partial write (which is fine). Trust me, the entire drive got zeroed. :-)
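For the curious, the partial final write can be worked out ahead of time. A small Python sketch, assuming the 150039945216-byte size that dd --list reported for this drive under Windows is also what FreeBSD sees through the enclosure (it may not be), and using the ~35MBytes/second figure only as a rough guess:

```python
DISK_BYTES = 150039945216     # size reported for this drive by dd --list
BLOCK_SIZE = 64 * 1024        # bs=64k
SECTOR_SIZE = 512

full_blocks, leftover_bytes = divmod(DISK_BYTES, BLOCK_SIZE)
leftover_sectors = leftover_bytes // SECTOR_SIZE

# Rough duration at the ~35 MBytes/second USB 2.0 figure mentioned above.
est_minutes = DISK_BYTES / (35 * 1024 * 1024) / 60

print(full_blocks, leftover_bytes, leftover_sectors, round(est_minutes))
```

A non-zero leftover is exactly why the last block cannot be written in full, and the time estimate gives a feel for how long "quite some time" is at that speed.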

In the case dd exits with a message like "I/O error" or some other text on-screen (if it's bold/white it's from the kernel; if it's grey-ish it's from dd), then writes at some point failed. I'd need a photo of the monitor to know what the condition/issue was. FreeBSD CAM will output quite a lot of useful information (for me anyway) about the error, but I'm not used to using disks across USB so CAM might not be able to get the actual ATA level status. I would suggest re-doing the entire procedure but with the disk attached natively via SATA (preferably AHCI enabled too if possible).

10. In the case things exited smoothly, I recommend ejecting the CD, then cleanly rebooting the system:

mfsbsd# shutdown -r now
 

11. Now go back into Windows and pull SMART statistics using smartmontools (smartctl -x ...). Attribute 197 should no longer have 1 pending LBA for analysis. I'd like to see the output regardless, as it will give some general information about the overall status of the disk after having every LBA written to.

Footnote: if you'd prefer to use a USB flash drive instead of a CD, use this instead and follow this procedure. The CD method should work regardless though, and IMO is probably easiest.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland

Premium Member

Long story short - this didn't work either.

I was able to identify the correct disk - it was da2

[The prompt I had was mfsbsd# not mfsroot#]

Issued the command mfsbsd# dd if=/dev/da2 of=/dev/null bs=64k
There was a brief flash of the led light and then error messages were spit out.

I wrote down what I could but it might not be exact.

da2 :umass - siml:1:0:0: Read (10). CDB:28 0 0 00 fc 80 00 80 0
CAM status - SCSI status error
SCSI status: Check condition
SCSI sense: Medium error
asc: 11,0 Unrecovered Read Error input/output error

There was more but the bottom line is that it kicked the disk out. I ran the camcontrol devlist command again and the disk was no longer listed.

Tried the zeroing command but since the disk wasn't listed anymore, not too surprisingly, that didn't work either.

I think we should give up on this. I've wasted far too much of your time wrestling with this recalcitrant drive.

rolfp
no-shill zone
Premium Member
join:2011-03-27
Oakland, CA

rolfp to koitsu

Premium Member

to koitsu

You can check the status of dd as well -- unlike Linux ( :-) ),

Linux does have dd progress capability. With pid, using bash history and up arrow:
[rolf@localhost 2012.desktop]$ dd if=/dev/zero of=/dev/null& pid=$!
[1] 20136
[rolf@localhost 2012.desktop]$ kill -USR1 20136
75105421+0 records in
75105420+0 records out
38453975040 bytes (38 GB) copied, 13.2714 s, 2.9 GB/s
[rolf@localhost 2012.desktop]$ kill -USR1 20136
99037284+0 records in
99037283+0 records out
50707088896 bytes (51 GB) copied, 17.5035 s, 2.9 GB/s
[rolf@localhost 2012.desktop]$ kill -USR1 20136
122459324+0 records in
122459324+0 records out
62699173888 bytes (63 GB) copied, 21.6475 s, 2.9 GB/s
[rolf@localhost 2012.desktop]$ kill -USR1 20136
140498135+0 records in
140498134+0 records out
71935044608 bytes (72 GB) copied, 24.8396 s, 2.9 GB/s
[rolf@localhost 2012.desktop]$ kill -9 20136
[1]+  Killed                  dd if=/dev/zero of=/dev/null
[rolf@localhost 2012.desktop]$
 

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

I'm aware of SIGUSR1 on Linux dd, but it's not quite the same:

1. Process itself has to support SIGUSR1 or SIGINFO handler; Ctrl-T on FreeBSD gets data from the kernel (first line) and sends SIGINFO to the underlying fg process (remaining lines),

2. Ctrl-T will always provide that first line for any process -- quite useful when you think something is deadlocked. Can't even begin to describe how many times this has been used in the past 3-4 years on FreeBSD to diagnose ZFS problems or thread deadlock problems,

3. Probably the most important part: the explanation given is how to accomplish disk zeroing, not "let's teach you all the semantics of how to use a UNIX system" (backgrounding a process, repeatedly sending SIGUSR1, etc.).

Details of FreeBSD's Ctrl-T are documented in termios(4), specifically the POSIX.1 extension section (see STATUS).

BTW, since we're nitpicking, please do not get in the habit of using kill -9 (SIGKILL). That's a very, very bad habit people get into that should not be used unless absolutely necessary. Start with SIGTERM (default) and if after a few seconds the process's SIGTERM handler doesn't end things, you can resort to ruder means. There are many processes out there which have shutdown routines mapped to SIGTERM (close sockets cleanly, close local fds, do some clean-up) -- while SIGKILL is handled by the kernel, which means the process may leave pid files and tmp files around and not do proper clean-up. Bad habit.
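The "SIGTERM first, SIGKILL only as a last resort" pattern, sketched in Python for anyone scripting this. It assumes a POSIX system (on Windows, terminate() behaves differently); the helper name stop_gracefully and the 5-second grace period are arbitrary choices for illustration.

```python
import signal
import subprocess
import sys

def stop_gracefully(proc: subprocess.Popen, grace_secs: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_secs, then fall back to SIGKILL."""
    proc.terminate()                      # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_secs)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL: no cleanup handlers run
        return proc.wait()

# A stand-in child process that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_gracefully(child)
print(exit_code == -signal.SIGTERM)
```

A negative return code from Popen means the child was ended by that signal number, so a well-behaved child stops on SIGTERM and the SIGKILL branch is never reached.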
koitsu

koitsu to rockisland

MVM

to rockisland
I see what's going on here, and the responsible party is probably the USB-SATA bridge. (I tried to warn you... )

First and foremost: the disk has an unreadable LBA. We know this, because it's almost certainly the one which is in pending ("suspect") state. So, there's going to be an I/O error when trying to read from that LBA. If it's very close to the start of the disk, then the dd command that reads from the disk is going to bail out fairly quickly once that LBA is hit. I would know if this is the case if the output from dd was shown (specifically record counts in vs. out, then doing basic math to work out the LBA region and see if the LBA reported in the SMART error log (LBA 10447767) falls in that range).

LBA 10447767 is fairly close to the start of the disk -- that is to say, this LBA can be read only moments after issuing the dd command. Each LBA on that drive is 512 bytes, and we're reading 64KBytes at a time. The byte offset on the disk is quite easy to calculate: 10447767*512 = 5,349,256,704, so around the 5GB mark from the start of the disk.

Now consider how fast a disk can read, even under USB 2.0. Let's just say you were getting 35MBytes/second. Simple math: 5349256704 / (1024*1024*35) = roughly 146 seconds, so that's about when you'd see the I/O error when reading linearly from the start of the disk (at 35MBytes/second).
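The same math as a small Python sketch, including which 64KByte dd record the suspect LBA falls into (that is the "record counts" math mentioned earlier). The 35MBytes/second rate is only the guess used above, and the variable names are for illustration.

```python
LBA = 10447767
SECTOR_SIZE = 512
BLOCK_SIZE = 64 * 1024                  # bs=64k
RATE = 35 * 1024 * 1024                 # assumed 35 MBytes/second

byte_offset = LBA * SECTOR_SIZE
block_index, offset_in_block = divmod(byte_offset, BLOCK_SIZE)
seconds_until_hit = byte_offset / RATE

print(byte_offset, block_index, offset_in_block // SECTOR_SIZE, round(seconds_until_hit))
```

block_index is the number of full 64KByte records that lie entirely before the suspect LBA, so a read that dies at this LBA would be expected to report about that many records in; the time works out to between 145 and 146 seconds.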

Make sense so far? Onwards we go:

The I/O error travels back up from the ATA layer to the USB-SATA bridge, which can quite literally choose to do whatever it wants with that ATA status code. And from what I can tell based on the CAM output, it appears that the USB-SATA bridge chooses to pass the ATA message along to the underlying host (OS), and then wedges itself and/or drops itself off the bus.

An alternate situation (for the latter part) is that the OS itself actually forced a detach of the USB device as a result of repeated I/O errors or reads which reached an internal timeout. CAM da device timeouts are 60 seconds. I have no idea what the USB driver bus timeout value is on FreeBSD.

Anyway, my recommendation at this point is to continue with step 8 anyway. If you see I/O errors happen as a result of that, then that's a very different situation. Reads != writes. ATA/SATA drives do different things with sectors when read vs. written.

After you zero the drive (assuming you get no I/O errors), you can re-issue the command for reading (step 7) and you shouldn't get I/O errors any more. That's the entire purpose behind what we're doing.

EDIT: Thanks for pointing out the mfsroot# vs. mfsbsd# prompt typo. I've fixed that. mfsroot is the password, mfsbsd is the prompt, and mfsroot is (believe it or not) something completely different from FreeBSD 8.x and earlier. I should tell Martin to change the password to mfsbsd just to keep things consistent.

rolfp
no-shill zone
Premium Member
join:2011-03-27
Oakland, CA

rolfp to koitsu

Premium Member

to koitsu
I'm not the one nitpicking; just pointing out one specific incorrect claim, as cited. The most important part: the source of all the extraneous verbiage is not this point of view.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

said by rolfp:

I'm not the one nitpicking; just pointing out one specific incorrect claim, as cited. The most important part: the source of all the extraneous verbiage is not this point of view.

Fair enough. I'll edit the post and remove the comment; I knew about SIGUSR1 prior, but in this case we have an end-user who isn't very familiar with *IX, so explaining fg/bg processes and kill and so on, just to get the status of what a command is doing = annoying compared to "just press Ctrl-T".

I don't understand the latter part of your sentence, sorry to say.
rockisland
Premium Member
join:2008-12-15
Friday Harbor, WA

rockisland to koitsu

Premium Member

to koitsu
Got it - will update tomorrow. Done beating my head against this brick wall for today.

Thank you.
rockisland

rockisland to koitsu

Premium Member

to koitsu
Didn't get a chance to pursue this yesterday but went back to it this morning and still no dice.

I did the camcontrol devlist command to make sure nothing had changed and the drive I was after was still da2

Here is the entire list:

cd0, pass0
cd1, pass1
pass2, da0 (flash card reader)
pass3, da1 (ditto)
pass4, da2 (target drive)

didn't bother with the communication check this time since that resulted in the unmounting of the drive last time

went straight to the zero command:

mfsbsd# dd if=/dev/zero of=/dev/da2 bs=64

And got this: dd: /dev/da2: invalid argument

1+0 records in
0+0 records out
0 bytes transferred
etc

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu

MVM

The bs argument should be bs=64k, not bs=64. Is that a typo here on the forum or actually on the system?

Quite often smaller blocksizes don't work when interfacing with a device that mandates a minimum blocksize (in this case, that minimum would be 512 bytes), which would explain the error.
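A quick way to sanity-check a bs= value before running dd, sketched in Python. It assumes a 512-byte sector size (as reported for this drive) and only understands a bare number or a "k" suffix meaning 1024 bytes; real dd accepts more suffixes than this. The function names are made up for illustration.

```python
SECTOR_SIZE = 512

def parse_bs(text: str) -> int:
    """Parse a dd-style size such as '64' or '64k' (k = 1024 bytes) into bytes."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)

def is_sector_multiple(bs: str, sector_size: int = SECTOR_SIZE) -> bool:
    """True if the block size is a whole number of sectors (and at least one)."""
    size = parse_bs(bs)
    return size >= sector_size and size % sector_size == 0

print(is_sector_multiple("64"), is_sector_multiple("64k"))
```

With bs=64 each write would be 64 bytes, well under one sector, which fits the "invalid argument" error; bs=64k is 128 whole sectors.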