I know I haven't been around much (new job and life have been keeping me quite busy), but I figured I'd take some time today to document a recovery I'm doing for a colleague of mine. This is also my first time dealing with a Seagate recovery (I'm used to WD disks and WD firmware behaviour).
Note: I haven't even begun the recovery process yet, but I'll try my best to do follow-ups as things progress.
The details are as follows:
- Colleague of mine has a 3TB Seagate Barracuda 7200.14 (1TB platter version), specifically model ST3000DM001-9YN166, which went bad on him -- and by bad I mean a tremendous amount of unreadable, suspect, and remapped LBAs (when I say tremendous... well, just wait for the SMART dump. It's the worst drive I've dealt with yet). He initially approached me saying "do you know how to allocate more space/room for NTFS clusters?" and I was like "wait, what? You exhausted $BadClus$ somehow? WTF?" :-)
- He managed to get "most" of the data off the drive himself (or had backups, not sure), but knew (from talking to me) to not run any programs that would either a) modify things on the drive (or as little as possible), or b) "repair" the drive (this just diminishes the chance of successful recovery if using a data recovery company -- they want the drive as untouched/unmodified as possible). However, there was a partition on the drive which he wanted files from, but couldn't read it (Windows apparently was throwing a fit over it).
- The drive was partitioned into 3 partitions using GPT. All 3 partitions consisted of NTFS-based filesystems. I point this out because the way I do recovery involves use of FreeBSD (to make a copy of the drive, e.g. a disk image or copy of the disk onto a spare), followed by use of some commercial utilities on Windows which I trust to do a decent job. The tricky part here is that I use Windows XP (limited to 2TB partition sizes, and has no GPT support), and the fact that the partition he wants data from is above the 2TB barrier (it's a 700GB partition at the very end of the disk).
- My FreeBSD box uses a Supermicro X7SBA motherboard, which offers six (6) SATA300 ports all driven by an Intel ICH7R southbridge -- however, all of those ports are used for my own storage and data, thus I had no free SATA ports.
- To further complicate things, I do not have access to the PCIe x16 slot (which only has x8 lanes anyway) on the motherboard -- it's physically blocked by the HSF I use. That left me with PCI-X slots (I do not bother with PCI-X in general), and some 32-bit PCI slots.
- Also needed to ensure whatever SATA add-on card I went with was supported by FreeBSD natively, and used a driver that made use of CAM. Given financial data points, that left me with one choice: Silicon Image using the siis(4) driver, whose author I've talked to many times and who I trust if I was to encounter oddities.
Lots of conundrums here. The choices I made were as follows:
- To rectify the lack-of-SATA-ports situation, I went looking for a PCI-based SATA300 controller that had at least 2 internal ports. I didn't want a RAID option ROM getting in the way (since I use the ICH7R's AHCI option ROM for booting, and multiple disk option ROMs often do not work together / "stack"), but if I had to take one with an option ROM, I'd deal with that if I came to it.
- I tend to like Rosewill for "generic cheap no-frills hardware" of this sort. So I found the Rosewill RC-312 -- but nobody sells them any more; everyone is using PCIe these days.
- However, the Silicon Image 3124 chip is PCI- or PCI-X-based (other/newer chips are PCIe), so I figured all I needed to do was find a different vendor's card that used the 3124 and go with that.
- I found the Syba SY-PCI40010 at Amazon for US$40 -- more expensive than I wanted, but I wasn't going to fight over cost. I made the assumption that this thing had better "just work" and put in an order. More on that in just a moment (it's one of the reasons I posted this).
- The 3TB situation with Windows XP was a little more complex. I had to really think this one out. First of all, normally I take an image of a disk (i.e. stored as a file) and then work with that -- but with a 3TB drive as the source medium, I would need a 3TB file, and that wouldn't work because of the 2TB limitation on XP, not to mention I don't own any disks that large! :-)
- What I came up with was a combination: first, said colleague happened to have a semi-new (used for 14 hours) ST3000DM001 which he could send along with the bad drive. Okay, physical capacity issue solved -- instead of making a disk image of the source disk (as a file), I'd have to just do a disk-to-disk copy. But what about the 2TB limit, and GPT partitioning, and XP? Oh, not to mention the fact I don't have a gigantic drive (ex. 4TB) available for storing a disk image?
- The methodology I'm going to try using is virtualisation, specifically directly connecting the semi-new raw disk (not a partition!) on my XP box to a Windows 7 VM using VirtualBox. I've done this before (both with raw disks and USB sticks), but the problem that kept (well, keeps) going around in my head was this: would the underlying XP OS SATA driver (Intel's AHCI/RST driver) effectively advertise the entire LBA range to the guest (Windows 7)? My gut feeling is that yes it will, and that the "2TB limitation" is purely an XP filesystem limitation.
Rephrased: my feeling is that the underlying SATA driver on XP should support more than 2TB of addressable LBAs, so as long as that's true, the guest OS (Windows 7) under VirtualBox should actually see the entire 3TB drive.
And no, I am not installing Windows 7 for this task. There are too many reasons I stick with XP and this is not up for discussion. If the methodology I've come up with doesn't work I have other avenues of choice (such as getting a SFF PC and installing Windows 7 natively on that), but "destroying" my main workstation is not an option.
So all that would address the need for GPT, in addition to the need for 2TB+ support. Crossing fingers -- I'll deal with that when I get there.
- Both disks (herein referred to as "HELP" for the bad drive, and "NEW" for the semi-new drive) arrived a few days ago, and the SY-PCI40010 arrived today.
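For anyone wondering where the "2TB barrier" mentioned above comes from, here's the arithmetic as a quick sketch. It assumes the usual explanation (a 32-bit sector number combined with 512-byte logical sectors, as in MBR partition tables), and it uses the capacity figure smartctl reports for these drives further down. The variable names are just mine for illustration.

```python
# Rough arithmetic behind the 2TB barrier, assuming 32-bit LBAs and
# 512-byte logical sectors (both assumptions stated in the text above).
SECTOR_BYTES = 512                  # logical sector size reported by smartctl
DRIVE_BYTES = 3_000_592_982_016     # "User Capacity" reported by smartctl

total_lbas = DRIVE_BYTES // SECTOR_BYTES
lba32_limit = 2 ** 32               # sectors addressable with a 32-bit LBA
limit_bytes = lba32_limit * SECTOR_BYTES
lbas_beyond = total_lbas - lba32_limit

print(f"LBAs on the drive:         {total_lbas:,}")
print(f"32-bit LBA ceiling:        {lba32_limit:,} sectors ({limit_bytes:,} bytes)")
print(f"LBAs beyond the ceiling:   {lbas_beyond:,} ({lbas_beyond * SECTOR_BYTES:,} bytes)")
print(f"Bits needed for last LBA:  {(total_lbas - 1).bit_length()}")
```

Roughly 800GB of the drive sits past the 32-bit ceiling, which fits with the wanted partition (about 700GB at the very end of the disk) living entirely above the barrier.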
First problem encountered:
- Upon hooking both 3TB drives to the SY-PCI40010 (HELP on port 0, NEW on port 1), the option ROM only saw the disk on port 0. I felt the drives and sure enough both were fully spun up, and I wasn't hearing any odd noises coming from either of them.
I disconnected HELP and hooked NEW to port 0 -- the drive showed up. I then hooked HELP to port 1 -- nope, nothing.
I was using the SATA cables that came with the card, so I tried replacing the cable on port 1. Now the behaviour changed: "intermittently" the drive would show up in the option ROM. I then swapped drives (back to HELP = port 0, NEW = port 1), and the "intermittent" behaviour remained tied to port 1.
Short version: port 1 on the new SY-PCI40010 is bad in some way. Do not care to figure out what it is (cold solder joint, bad port, broken traces, whatever) -- it's unreliable.
As such, I moved NEW to port 4. Voila, both drives showed up in the option ROM and in FreeBSD (the siis(4) driver worked through CAM just like I expected). Device names, just to keep things clear:
/dev/ada6 = HELP
/dev/ada7 = NEW
- The VERY FIRST thing I did was take SMART attribute snapshots from both disks (specifically smartctl -a from both drives) and save the output to separate files.
Here is the output from the HELP drive:
smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.3-BETA1 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-9YN166
Serial Number: W1F0NMKF
LU WWN Device Id: 5 000c50 0511be4f3
Firmware Version: CC4B
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Jun 11 15:50:37 2014 PDT
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 338) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 072 072 006 Pre-fail Always - 164837287
3 Spin_Up_Time 0x0003 094 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 367
5 Reallocated_Sector_Ct 0x0033 094 094 036 Pre-fail Always - 8776
7 Seek_Error_Rate 0x000f 065 051 030 Pre-fail Always - 610330700323
9 Power_On_Hours 0x0032 082 082 000 Old_age Always - 16556
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 129
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 48116
188 Command_Timeout 0x0032 100 091 000 Old_age Always - 226 237 247
189 High_Fly_Writes 0x003a 066 066 000 Old_age Always - 34
190 Airflow_Temperature_Cel 0x0022 063 050 045 Old_age Always - 37 (Min/Max 31/37)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 85
193 Load_Cycle_Count 0x0032 050 050 000 Old_age Always - 100392
194 Temperature_Celsius 0x0022 037 050 000 Old_age Always - 37 (0 19 0 0 0)
197 Current_Pending_Sector 0x0012 076 001 000 Old_age Always - 4000
198 Offline_Uncorrectable 0x0010 076 001 000 Old_age Offline - 4000
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 15226h+46m+59.877s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 193198657458464
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 111717728336544
SMART Error Log Version: 1
ATA Error Count: 47744 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 47744 occurred at disk power-on lifetime: 16556 hours (689 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:09:43.301 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:09:43.200 READ LOG EXT
60 00 00 ff ff ff 4f 00 00:09:40.452 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:09:40.452 WRITE FPDMA QUEUED
ea 00 00 00 00 00 00 00 00:09:40.425 FLUSH CACHE EXT
Error 47743 occurred at disk power-on lifetime: 16556 hours (689 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:09:40.452 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:09:40.452 WRITE FPDMA QUEUED
ea 00 00 00 00 00 00 00 00:09:40.425 FLUSH CACHE EXT
2f 00 01 10 00 00 00 00 00:09:40.349 READ LOG EXT
60 00 00 ff ff ff 4f 00 00:09:37.460 READ FPDMA QUEUED
Error 47742 occurred at disk power-on lifetime: 16556 hours (689 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:09:37.460 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:09:37.408 READ LOG EXT
60 00 00 ff ff ff 4f 00 00:09:34.565 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:09:34.564 WRITE FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:09:34.523 READ LOG EXT
Error 47741 occurred at disk power-on lifetime: 16556 hours (689 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:09:34.565 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:09:34.564 WRITE FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:09:34.523 READ LOG EXT
60 00 00 ff ff ff 4f 00 00:09:31.751 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:09:31.614 READ LOG EXT
Error 47740 occurred at disk power-on lifetime: 16556 hours (689 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:09:31.751 READ FPDMA QUEUED
2f 00 01 10 00 00 00 00 00:09:31.614 READ LOG EXT
60 00 00 ff ff ff 4f 00 00:09:28.800 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:09:28.800 WRITE FPDMA QUEUED
ea 00 00 00 00 00 00 00 00:09:28.773 FLUSH CACHE EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 16522 -
# 2 Short offline Completed: read failure 90% 16518 -
# 3 Short offline Completed: read failure 90% 16445 -
# 4 Short offline Completed: read failure 90% 16395 -
# 5 Short offline Completed: read failure 90% 16388 -
# 6 Short offline Completed: read failure 90% 16388 -
# 7 Short offline Completed: read failure 90% 16386 -
# 8 Short offline Completed: read failure 90% 16384 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Lots and lots to say about this awful drive:
- Attributes 1 and 7 are vendor-encoded in a way and tend to always change (Seagate F/W behaviour), so I ignored those.
- Attribute 9 indicated drive lifetime of 16556 power-on hours, so about 690 days / a little under 2 years. Not very impressive for this kind of failure.
- Attribute 5 indicated 8776 reallocated LBAs. They're 4KB LBAs, so that's minimum 35,946,496 bytes (8776 x 4096) of space reallocated/data potentially lost right off the bat.
- Attribute 187 indicated 48116 LBA read or LBA write failures during the drive's lifetime, and chances are most of these were reads.
- Attribute 188 indicates some general ATA-level CDB timeout behaviours logged, but how to decode this data I'm not sure (smartmontools obviously decodes it into 3 fields, but what they represent I don't know -- I'd need to go look at the smartctl code + commit comments to figure it out).
- Attribute 189 indicated 34 events of high-fly writes, which are usually an indicator of a drive head getting too close or being too far away from the platter. The drive has three (3) 1TB platters, thus 6 heads -- which head is anyone's guess, but I don't do physical repair/recovery so knowing which head doesn't help me at all. However, what it does tell me is that head misalignment (either at factory or over time) possibly contributed to the failure.
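- Attribute 190's raw value showed a Min/Max of 31/37C, but as far as I can tell that range only covers recent operation, i.e. my current environment -- not a surprise considering it's been absurdly hot here lately. The WORST columns suggest a hotter lifetime peak: 050 on attribute 194, and 050 on attribute 190 (which appears to be normalised as 100 minus the temperature), both pointing at roughly 50C at some point in the drive's life.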
- Attribute 193 indicated the infamous "LCC problem", where the drive had aggressively parked its heads a total of 100392 times during its lifetime. I hate this feature so much... anyway...
- Attribute 197 indicated 4000 LBAs which were marked "suspect", i.e. unreadable. I found this number to be amusing -- 4000 exactly? Just a round, even number like that? Hmm.
- Attribute 198 indicated 4000 LBAs which were marked unremappable. Now here's where it gets tricky, because Seagate drives appear to behave differently than WD drives in this regard. From what I can tell, on this drive, combined with the value in attribute 5, effectively this drive no longer has any spare LBAs/sectors, thus cannot do any further remapping. The drive has had so many errors that it's exhausted its remapping space. One word: ouch.
- The SMART error log indicated a total of 47744 errors, with the capability to only store details of the last 5. Safe to say most of those 47744 were read errors. Of the 5 shown, they all happened during an NCQ READ request.
- The SMART self-test log showed that some program (or possibly the drive firmware itself? Unsure) had been running short tests. Not sure what my colleague may have run to do this (or as I said, drive doing it itself), but clearly the tests failed in some way. Use of smartctl -x to review the extended self-test log turned up which LBAs were unreadable (for those tests), just in case I cared (but in this case I don't):
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 16522 4937310040
# 2 Short offline Completed: read failure 90% 16518 4937310040
# 3 Short offline Completed: read failure 90% 16445 4937283288
# 4 Short offline Completed: read failure 90% 16395 4937220368
# 5 Short offline Completed: read failure 90% 16388 4937220368
# 6 Short offline Completed: read failure 90% 16388 4937220368
# 7 Short offline Completed: read failure 90% 16386 4937220368
# 8 Short offline Completed: read failure 90% 16384 4937220368
As I said: what a mess!
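To put some of those raw numbers into perspective, here's the arithmetic as a small sketch. The inputs are copied from the HELP drive's smartctl output above. Two things are my assumptions: that attribute 5 counts 4KB physical sectors (the 512-byte figure is shown too, in case it actually counts logical sectors), and that the self-test log's LBAs are 512-byte logical LBAs.

```python
# Back-of-the-envelope figures from the HELP drive's SMART output.
POWER_ON_HOURS = 16556          # attribute 9
REALLOCATED = 8776              # attribute 5
LOAD_CYCLES = 100392            # attribute 193
FIRST_ERROR_LBA = 4937310040    # most recent failed short self-test
LOGICAL_BYTES = 512             # logical sector size
PHYSICAL_BYTES = 4096           # physical sector size

days = POWER_ON_HOURS / 24
realloc_bytes_4k = REALLOCATED * PHYSICAL_BYTES    # if counted in 4KB sectors
realloc_bytes_512 = REALLOCATED * LOGICAL_BYTES    # if counted in 512-byte sectors
cycles_per_hour = LOAD_CYCLES / POWER_ON_HOURS
error_offset = FIRST_ERROR_LBA * LOGICAL_BYTES     # assumes logical LBAs

print(f"Power-on time:        {days:.1f} days ({days / 365:.2f} years)")
print(f"Reallocated (4KB):    {realloc_bytes_4k:,} bytes")
print(f"Reallocated (512B):   {realloc_bytes_512:,} bytes")
print(f"Head parks per hour:  {cycles_per_hour:.2f}")
print(f"First error offset:   {error_offset:,} bytes into the disk")
```

Under those assumptions the first failing LBA works out to about 2.5TB into the disk, i.e. past the 2TB mark.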
As is always the rule when doing recovery, it's important to never make any modifications to the source material, i.e. avoid any kind of LBA remaps (if possible; if the drive does it itself on a read then we have no control over that), do not mount filesystems/partitions, yadda yadda.
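To illustrate the "read only, never write to the source" idea for the disk-to-disk copy, here's a minimal sketch of a copy loop that only ever reads from the source, zero-fills whatever it can't read, and keeps a list of the bad regions. This is not the tool I'll actually be using, just an illustration: the function name and block size are made up, and a real run would need block sizes that are a multiple of the sector size plus far smarter retry logic.

```python
import io

def copy_skipping_errors(src, dst, total_bytes, block=64 * 1024):
    """Copy total_bytes from src to dst without ever writing to src.

    Blocks that fail to read (OSError or a short read) are zero-filled on
    dst and recorded. Returns a list of (offset, length) tuples.
    """
    bad = []
    offset = 0
    while offset < total_bytes:
        length = min(block, total_bytes - offset)
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            data = b""
        if len(data) != length:
            # Conservative: treat the whole block as lost and pad with zeros.
            bad.append((offset, length))
            data = b"\x00" * length
        dst.seek(offset)
        dst.write(data)
        offset += length
    return bad

# Demo on in-memory buffers standing in for the two disks; nothing in this
# buffer fails to read, so the returned list of bad regions is empty.
demo_src = io.BytesIO(bytes(range(16)))
demo_dst = io.BytesIO(bytes(16))
print(copy_skipping_errors(demo_src, demo_dst, 16, block=4))
```

On the real hardware the source would be opened read-only (something like open("/dev/ada6", "rb")), so an accidental write to HELP isn't even possible from the script.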
I then moved on to the drive labelled NEW -- because when I'm given a drive to store data on, I test it first (record SMART stats, write zeros to every LBA, record SMART stats, read all LBAs, record SMART stats, review differences along the way), because I don't want the thing crapping out on me mid-recovery -- it just complicates and frustrates.
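The "record SMART stats, do something, record again, review differences" routine is easy to script. Below is a rough sketch that pulls the attribute table out of saved smartctl -a output and reports which attributes changed between two snapshots. The regex assumes the ten-column attribute layout shown in this post, and the function names are my own invention, not anything from smartmontools.

```python
import re

# Matches one row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ATTR_LINE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(.*)$"
)

def parse_attributes(text):
    """Return {id: (name, value, worst, thresh, raw)} from smartctl -a text."""
    attrs = {}
    for line in text.splitlines():
        m = ATTR_LINE.match(line)
        if m:
            attr_id, name, value, worst, thresh, raw = m.groups()
            attrs[int(attr_id)] = (name, int(value), int(worst), int(thresh), raw.strip())
    return attrs

def diff_attributes(before, after):
    """Return {id: (old, new)} for attributes present in both but different."""
    return {
        attr_id: (before[attr_id], new)
        for attr_id, new in after.items()
        if attr_id in before and before[attr_id] != new
    }

# Two tiny snapshots built from attribute lines quoted in this post.
snap_a = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0\n"
snap_b = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 12\n"
for attr_id, (old, new) in diff_attributes(parse_attributes(snap_a),
                                           parse_attributes(snap_b)).items():
    print(attr_id, old[0], "raw:", old[4], "->", new[4])
```

The raw column is kept as a string on purpose, since some attributes carry extra text (temperature min/max, head flying hours and so on).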
Here's the output from the NEW drive:
smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.3-BETA1 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-1CH166
Serial Number: W1F4YCJH
LU WWN Device Id: 5 000c50 0735d6462
Firmware Version: CC29
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Jun 11 15:50:33 2014 PDT
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 89) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 335) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 111 100 006 Pre-fail Always - 32370776
3 Spin_Up_Time 0x0003 095 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 253 030 Pre-fail Always - 39282
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 15
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 14
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 099 099 000 Old_age Always - 1
190 Airflow_Temperature_Cel 0x0022 066 053 045 Old_age Always - 34 (Min/Max 30/34)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 71
194 Temperature_Celsius 0x0022 034 047 000 Old_age Always - 34 (0 23 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 8h+55m+13.064s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 9132955008
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 24318750
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Everything here looked good/normal for this model of drive except for one thing: attribute 189 (high-fly writes). Okay, 1 event isn't bad, and maybe it came this way from the factory, so I'll live with it / just make mental note of it.
I then began the process of zeroing the NEW drive.
Second problem encountered:
- While zeroing a drive, since I know what I'm looking at, I tend to watch SMART stats as things progress. I immediately began seeing this (I can't wait for forum regulars to grin and smile at this one because it comes up often):
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 12
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 15
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 16
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 19
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 24
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 26
I stopped the NEW zeroing at this point and spewed quite a slew of curse words, specifically because a drive that was working fine (0 CRC errors) just suddenly began to have to do retransmissions. Everyone here probably knows my advice on this matter, so I'll recap the possibilities:
a) Bad SATA cable,
b) Bad SATA port (on either the drive, or the SATA controller),
c) Bad drive PCB,
d) Some other anomaly on the SATA controller (ex. flaky traces, badly-designed card, whatever else),
e) Extreme amounts of electronic interference.
Given that port 1 on the SY-PCI40010 was already bad, my gut feeling was that port 4 might be flaky as well. I immediately put in a replacement/RMA with Amazon because I just do not accept this kind of crap.
To rectify this problem, I did two things at once (not proper troubleshooting technique, BTW, but I really did not care to diagnose which exact thing was the source of the problem):
1) I moved NEW from port 4 to port 3
2) I replaced the SATA cable between NEW and port 3 with a cable I had personally used many times over the years with reliability (thick cable, good shielding, etc.)
I resumed the zeroing, and the CRC problem disappeared (meaning the attribute has stayed at 26).
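For those who'd rather not eyeball attribute 199 by hand, the "is it still climbing?" check boils down to looking at the differences between successive raw values. A tiny sketch follows: the first list is the six readings quoted above, the second is a hypothetical run of readings after the port/cable swap (I only know the value stayed at 26, not how many samples I took), and the function name is made up.

```python
def crc_increments(readings):
    """Return the increase in attribute 199's raw value between samples."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]

def still_climbing(readings):
    """True if any successive sample shows new CRC errors."""
    return any(delta > 0 for delta in crc_increments(readings))

during_zeroing = [12, 15, 16, 19, 24, 26]   # raw values quoted in the post
after_swap = [26, 26, 26]                   # hypothetical samples after the swap

print(crc_increments(during_zeroing), still_climbing(during_zeroing))
print(crc_increments(after_swap), still_climbing(after_swap))
```

The absolute value never goes back down, so the deltas are the only thing worth watching once the counter is non-zero.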
Now comes the question: how can I reset 26 back to 0? Well, as I've stated before to others: you can't. At least not usually. The funny thing about both drives (but I WOULD NEVER do this on the HELP/bad drive given its condition!) is that they need a firmware upgrade. Firmware updates will sometimes reset SMART attributes back to zero.
So when I get done with this whole recovery nonsense, I plan on updating the firmware on the NEW drive to CC4H and crossing my fingers. I've already told my colleague that if it doesn't zero, then he'll just have to remember the value was at 26. There's nothing else I can do about it. I feel awful, but I'm working with a generic crap SATA card (overpriced :P) and hoping for the best.
At this point about 409GB of the NEW drive has been zeroed, and the SMART attributes still look good, although the drive is up to 44C (again not surprised -- it's hot here, and the drive does have 3 platters).