reply to koitsu
Re: Bad Hard Drive(s) / Raid Array
Long story short - this didn't work either.
I was able to identify the correct disk - it was da2
[The prompt I had was mfsbsd# not mfsroot#]
Issued the command mfsbsd# dd if=/dev/da2 of=/dev/null bs=64k
There was a brief flash of the LED and then error messages were spit out.
I wrote down what I could but it might not be exact.
da2 :umass - siml:1:0:0: Read (10). CDB:28 0 0 00 fc 80 00 80 0
CAM status - SCSI status error
SCSI status: Check condition
SCSI sense: Medium error
asc: 11,0 Unrecovered Read Error input/output error
There was more but the bottom line is that it kicked the disk out. I ran the camcontrol devlist command again and the disk was no longer listed.
Tried the zeroing command but since the disk wasn't listed anymore, not too surprisingly, that didn't work either.
I think we should give up on this. I've wasted far too much of your time wrestling with this recalcitrant drive.
Mountain View, CA
I see what's going on here, and the responsible party is probably the USB-SATA bridge. (I tried to warn you... )
First and foremost: the disk has an unreadable LBA. We know this because it's almost certainly the one which is in pending ("suspect") state. So, there's going to be an I/O error when trying to read from that LBA. If it's very close to the start of the disk, then the dd command that reads from the disk is going to bail out fairly quickly once that LBA is hit. I could confirm this if the output from dd had been shown: take the record counts in vs. out, do some basic math to work out the LBA region, and see whether the LBA reported in the SMART error log (LBA 10447767) falls in that range.
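To make that "basic math" concrete, here is a rough sketch of the check, assuming 512-byte sectors and the bs=64k from the dd command above. Since the dd output was never posted, it works in the other direction: it computes the record count dd would be expected to report if LBA 10447767 really is the one that failed. The variable names are just for illustration.

```shell
#!/bin/sh
# Sketch only: assumes 512-byte LBAs and dd bs=64k (65536 bytes).
suspect_lba=10447767      # LBA from the SMART error log
sector_bytes=512
bs_bytes=65536

# Each dd record covers this many LBAs.
lbas_per_record=$(( bs_bytes / sector_bytes ))

# Number of full records read cleanly before the record holding the suspect LBA.
expected_records=$(( suspect_lba / lbas_per_record ))

# LBA range covered by the record that would fail.
first_lba=$(( expected_records * lbas_per_record ))
last_lba=$(( first_lba + lbas_per_record - 1 ))

echo "expect roughly ${expected_records}+0 records in before the error"
echo "failing record would cover LBA ${first_lba} through ${last_lba}"
```

If the "records in" figure from a real dd run lands at or near that count, the read error is consistent with the LBA in the SMART log. If it is wildly different, something else is going on.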
LBA 10447767 is fairly close to the start of the disk -- that is to say, this LBA would be reached only moments after issuing the dd command. Each LBA on that drive is 512 bytes, and we're reading 64KBytes at a time. The byte offset on the disk is quite easy to calculate: 10447767*512 = 5,349,256,704, so around the 5GB mark from the start of the disk.
Now consider how fast a disk can read, even under USB 2.0. Let's just say you were getting 35MBytes/second. Simple math: 5349256704 / (1024*1024*35) = 145 seconds or thereabouts. So when reading linearly from the start of the disk at 35MBytes/second, you'd see the I/O error roughly 145 seconds in.
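For anyone who wants to redo the numbers with a different transfer rate, the same arithmetic can be sketched in plain sh. The 35MBytes/second figure is only the guess used above, not a measured value, and the variable names are made up for the example.

```shell
#!/bin/sh
# Sketch only: 512-byte LBAs, and an assumed (not measured) read rate.
suspect_lba=10447767
sector_bytes=512
rate_mbytes=35            # assumed throughput in MBytes/second

# Byte offset of the suspect LBA from the start of the disk.
byte_offset=$(( suspect_lba * sector_bytes ))

# Bytes read per second at the assumed rate.
bytes_per_sec=$(( rate_mbytes * 1024 * 1024 ))

# Whole seconds of linear reading before that offset is reached.
seconds=$(( byte_offset / bytes_per_sec ))

echo "suspect LBA sits at byte offset ${byte_offset}"
echo "about ${seconds} seconds into a linear read at ${rate_mbytes} MBytes/second"
```

Shell arithmetic is integer-only, so the result is truncated to whole seconds, which is plenty for a ballpark figure like this.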
Make sense so far? Onwards we go:
The I/O error travels back up from the ATA layer to the USB-SATA bridge, which can quite literally choose to do whatever it wants with that ATA status code. And from what I can tell based on the CAM output, it appears that the USB-SATA bridge chooses to pass the ATA message along to the underlying host (OS), and then wedges itself and/or drops itself off the bus.
An alternate situation (for the latter part) is that the OS itself forced a detach of the USB device as a result of repeated I/O errors or reads which reached an internal timeout. CAM da device timeouts are 60 seconds. I have no idea what the USB driver bus timeout value is on FreeBSD.
Anyway, my recommendation at this point is to continue with step 8 anyway. If you see I/O errors happen as a result of that, then that's a very different situation. Reads != writes. ATA/SATA drives do different things with sectors when read vs. written.
After you zero the drive (assuming you get no I/O errors), you can re-issue the command for reading (step 7) and you shouldn't get I/O errors any more. That's the entire purpose behind what we're doing.
EDIT: Thanks for catching the mfsroot# vs. mfsbsd# prompt typo. I've fixed that. mfsroot is the password, mfsbsd is the prompt, and mfsroot is (believe it or not) something completely different from FreeBSD 8.x and earlier. I should tell Martin to change the password to mfsbsd just to keep things consistent.
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.