
JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5


Random 100% disk usage

Attachments:
AnVir Disk Load Data (screenshot)
HD Tune Pro health reports (screenshots): ST3320418AS, ST3500641AS, STM3500630AS
Drive1.txt (3,893 bytes): ST3320418AS
Drive2.txt (3,268 bytes): ST3500641AS
Drive3.txt (3,324 bytes): STM3500630AS

I have the following drives in my computer:
ST3320418AS - Boot Drive
ST3500641AS - RAID 0
STM3500630AS - RAID 0

The two RAID drives are set up as a RAID 0 using Windows Dynamic Disks and have nothing loaded on them. I have no real plans for them and they could even come out.

At random times, Drive C (the 320GB) will show 100% disk load. I first noticed it just from the activity light being on solid and the computer being dead slow. I installed AnVir Task Manager to view disk load and see nothing out of the ordinary.

When it's doing its 100% load thing, the tray icon shows 100% load for drive C only and 0% for the other drives/array. It also breaks the load down into read and write load: sometimes both read and write are 100%, sometimes read is 100% and write is 0%.

I am running BOINC on the computer, but those tasks only ever show up with disk load in brief flashes, and never during the 100% load issue. Just for good measure I shut BOINC down and it still happened, so I don't think that is the cause.

I've also attached screenshots of HD Tune Pro showing the disk health report for all three drives. Yes, the 320GB drive shows a pending sector and a CRC error that probably weren't there before I got the drive, but is that really the cause of this?

Nothing else looks really out of the ordinary to me except the F1 and F2 attributes on the 320 and I'm really not sure what's up with those. The Maxtor drive looks like it's about to go kaput if those pending sector and offline uncorrectable counts can be trusted.

I've also uploaded txt files from smartctl, since I have 1 day left on the HD Tune Pro trial version and won't be able to get subsequent data from it for much longer. It also provides another view of the same data.


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23

The OS is showing "100% load" on the disk because "disk load" is actually measured by how much time a transaction (ATA CDB) takes. If a disk locks up or acts wonky during I/O, it will show up like this. Every OS works like this, BTW.
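
To make the point about "load" concrete, here is a minimal sketch of a busy-time style calculation. It assumes the monitoring tool samples a cumulative "milliseconds with I/O outstanding" counter, which is how utilization figures of this kind are commonly derived; the function and parameter names are made up for illustration and are not AnVir's or Windows' actual API.

```python
def disk_busy_percent(busy_ms_start, busy_ms_end, interval_ms):
    """Share of a sampling interval during which the disk had I/O outstanding.

    busy_ms_start / busy_ms_end are readings of a hypothetical cumulative
    busy-time counter taken at the start and end of the interval.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    busy_ms = busy_ms_end - busy_ms_start
    percent = 100.0 * busy_ms / interval_ms
    # Clamp so counter jitter never reports less than 0% or more than 100%.
    return max(0.0, min(100.0, percent))


# A healthy disk: many quick requests adding up to 50 ms of busy time per second.
healthy = disk_busy_percent(0, 50, 1000)
# A struggling disk: a single read that the drive keeps retrying for the whole second.
stalled = disk_busy_percent(0, 1000, 1000)
print(f"healthy: {healthy:.0f}%  stalled: {stalled:.0f}%")
```

Under this model a single stuck request pins the figure at 100% even though almost no data moves, which would match a solid activity light with no process showing heavy I/O.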

As for your disks:

1) Replace the ST3320418AS. This drive is nearing SMART trip threshold for attribute 0xBB, which indicates a large number of reads or writes which could not be corrected using the associated ECC region of the disk. 0xC5 and 0xC6 also don't sit well with me. All of this for a drive that has a power-on count of 1379 hours? Yeah, bad drive.

2) The ST3500641AS is in "meh" shape, although it's more excusable since it's been used for over 30,000 hours. I don't like what I see with attribute 0xBB (again), although those could have accumulated as a result of the LBA remaps (0x05). My advice is, if you want to keep using this drive, completely zero it first, recreate the filesystem, and go from there. You will need to remember that the drive has 19 reallocated LBAs, so if it gets worse later you'll know to replace it. This drive has also experienced SMART threshold trip for external temperature sometime during its life.

3) Replace the STM3500630AS. This drive definitely has something wrong with an area or section of its platters; 0xC5/0xC6 should not be that high, even for almost 33,000 hours of use. This drive has also experienced SMART threshold trip for external temperature sometime during its life.

If these were drives I had, I would be replacing all 3.
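
The attribute IDs above are written in hex, while smartctl's attribute table lists them in decimal. A small sketch for converting between the two follows; the names are the labels smartmontools normally prints for these IDs, and the helper itself is hypothetical.

```python
# Decimal attribute ID -> name as usually shown in smartctl's attribute table.
SMART_ATTRIBUTES = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def describe_attribute(attr):
    """Return (decimal ID, hex ID, name) for an ID given as an int or a string.

    Strings are parsed with base auto-detection, so '0xBB' and '187' both work.
    """
    number = int(attr, 0) if isinstance(attr, str) else int(attr)
    name = SMART_ATTRIBUTES.get(number, "unknown")
    return number, f"0x{number:02X}", name


for hex_id in ("0x05", "0xBB", "0xC5", "0xC6"):
    print(describe_attribute(hex_id))
```

So 0xBB is attribute 187, and 0xC5/0xC6 are 197/198.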

P.S. -- You chopped off useful information from the bottom of smartctl output on all 3 examples. Please don't do this.
--
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.


JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5

Interesting info regarding disk load. I didn't think it was actually something eating up the disk with IO as that should have shown up in AnVir. Well, if it's a stalled transaction or something, then I guess it would have shown up in the list but it wouldn't have been so obvious. Either way, I'm just glad to finally have an answer. I suspected it was a disk issue but wasn't quite sure how to narrow it down.

The original goal of the 500's was to use them as backup drives for my file server. I had a 3rd and was going to stuff them in my machine and just use it as the backup system using RAID 5 within Windows. Sadly, one of those drives started power cycling constantly before turning off completely. I took it out and it does the same in another system, so I scrapped that idea, and that's why they have been sitting as an empty RAID 0.

What's bad about all this is, these drives "seemed" fine for a while in their previous use. Sure they have a LOT of "on" time (in the case of the 500's) as they were used in 24/7 file server duty before I upgraded to 1TB drives. I'd have to pull the drives out to see what I put on the label, but I don't recall them having many if any remapped LBAs (less than 5 if there were any).

The temperature issue I can see as they have been used in some very hot environments (these two 500s are the same in a set we have talked about before via PM). I'll definitely be replacing the drives. My ultimate goal is an SSD in each of my two main desktops but every time I get up enough money where I could possibly do that, something else comes along and needs the money lol.

What I am concerned about is if there is something somewhere that is killing these drives. You and I have talked at length via PM about different drives I've had and some of them experiencing wonky things like this. Granted, many of my drives are extremely high "on" time so it could be just old age but that "new" 320 is what's really bothering me about this. I'm afraid to replace it with an SSD until I figure out what the problem is for fear that whatever I put in here next will die in short order.

Not all of these drives have had the same problem in the same systems so I don't think I can narrow it down to just one computer, and I've also got some systems here that seem to be humming along just fine. That said, I did replace the batteries in my UPS so I could put that back in service. Sure, it won't help filter out power issues any better than the surge strip I had in its place (it's not a double-conversion UPS), but it will help with sags and power outages, which we have had a few of here lately.

If it's not a house power issue, is it a PSU issue? Is this even something I can test easily? Nothing else seems to be dying so I suspect it's not a PSU issue as I would assume other components would start failing too.

Oh, as for smartctl output, I'm not sure what's up with that. I ran the command in the window and simply told it to dump its output to a text file. If anything got truncated it was smartctl's doing or I didn't have a switch on that displayed the data you want. This is the command I ran:
smartctl -H -c -A /dev/sdX > C:\Joel\DriveX.txt



koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23

Issues relating to AC power wouldn't explain sector-level problems. A drive continually falling off the bus or power-cycling, of course, could be related to AC power, but not sector issues. Even improper voltage wouldn't cause this kind of problem. So no, I do not think a PSU would be the source of this issue.

The only environmental condition I can think of would be humidity, specifically high humidity (lots of water in the air). I had an old roommate back in 1997 who decided one day to leave his humidifier on while his PC was running -- that was the end of that PC. I've discussed dust with you in PM, but that's not a likely explanation either given that drives let air out but won't let particles/debris in (at least not easily).

As for the short/terse smartmontools output: smartctl -a or smartctl -x are preferred, particularly the latter. The arguments you're currently using won't examine things like the SMART error log and self-test log, which are useful; same goes for SCT region analysis.
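
As a rough sketch of the difference, the helper below builds the two smartctl invocations discussed here and, if asked, saves the full report to a file. The flags (-x, and -H -c -A) are the ones named in this thread; the function names and the idea of wrapping smartctl from Python are purely illustrative.

```python
import shutil
import subprocess


def smartctl_command(device, full=True):
    """Build the argument list: -x for the extended report, -H -c -A for the terse one."""
    options = ["-x"] if full else ["-H", "-c", "-A"]
    return ["smartctl", *options, device]


def save_report(device, out_path):
    """Run smartctl -x (if installed) and write its complete output to out_path."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found on PATH")
    # smartctl's exit status is a bitmask and can be non-zero on a sick drive,
    # so the output is kept regardless of the return code.
    result = subprocess.run(smartctl_command(device), capture_output=True, text=True)
    with open(out_path, "w") as report:
        report.write(result.stdout)
    return result.returncode


print(" ".join(smartctl_command("/dev/sdX")))
print(" ".join(smartctl_command("/dev/sdX", full=False)))
```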

I'm sorry I can't be of more help. Honestly I've never encountered someone with such high drive failure rates before. It might be worthwhile to keep all of the bad drives and consider saving up some cash + talking to a data recovery company and asking if they can simply do analysis to determine if there is any commonality between all the drive failures (vs. doing actual data recovery). If they came back and said "out of the 10 drives you sent us, 9 had indications of condensation", then you'd have a better idea. But I don't know if recovery companies do this sort of thing.


JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5

Attachments: Drive1-1.txt (24,053 bytes), Drive2-1.txt (9,148 bytes), Drive3-1.txt (10,524 bytes)
Ahhh OK, I see my mistake in the smartctl output now, not sure why I did that. I've uploaded new outputs with -x.

Dust isn't an issue, I checked it yesterday. I am in a basement though. I have a portable AC unit down here near the computer so I doubt humidity is the issue, but I have experienced that before. Full disclosure, I do have a humidifier over near my bed, which is well away from the computer, and it has only been running for two days now; this has been going on for much longer than the humidifier has been on. Also, living in Tucson, we had an evap cooler on the house and the 500's were in the house then but not the 320.

I could contact a data recovery company but that's probably going to be costly. Well, if it's not power issues and not dust or humidity, I don't know what it could be. I guess I'll need to find the best of my spare drives and swap this one out.


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23

Thanks for the output.

For the first drive, it looks like either most or all of the uncorrected errors are due to a single LBA (LBA 2969568278). The error log only has enough sectors/room to store 20 log entries, but given attributes 197/198 it's safe to say this is the only sector which needs zeroing (for re-analysis). I would suggest zeroing either that LBA or the entire drive, and then issuing a selective self-test (smartctl -t select,0-max /dev/whatever). This will issue read requests for all LBAs (including those which are remapped), and is firmware-level. You can run this while the drive is in use, but any standard I/O requests to the drive will delay the completion of the analysis (i.e. it will take much longer).

I cannot explain the behaviour of the 2nd drive. I'm not sure why attribute 187 is non-zero. This could be a firmware bug of some sort, as I would expect to see entries in the SMART error log as well as sector errors as a result. It's also very possible, for this model of drive, I'm interpreting the attribute wrong (but I doubt it; even the adjusted values are awful).

For the 3rd drive, things are even more tricky. Definitely LBA 0 exhibited a problem roughly 8400 hours ago, but what I can't explain is why attributes 197/198 have non-zero values yet show absolutely no sign of sector issues other than the single instance of LBA 0. This could also be a firmware design choice or bug; I really don't know.

So with regards to your 2nd and 3rd drives, I have no real explanation. The behaviour I'm seeing there is odd/anomalous to say the least.

One thing I will point out, however, is that both the 2nd and 3rd drives have experienced extremely high temperatures in the past (I would say roughly 50-60C). This may have caused either or both drive(s) to begin behaving oddly; something mechanical going wrong in some way, or possibly tripping some logic error in the firmware. Again, I can only speculate, but it is the one thing both drives have in common.

That's all I can really say about this situation. That's about as much forensic work as I can do.



JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5
reply to JoelC707

OK, update time. For some reason my computer refuses to boot UBCD, so I went to my test bench with my drive. I was going to try to force a zero on that LBA but am kinda glad I didn't. I started to run SpinRite on it since it would read/write the entire disk surface. Unfortunately, early on it found another suspect LBA and sat at the DynaStat recovery on that section for over an hour before I aborted it.

Booted into UBCD, ran the extended self-test, and it failed on the same LBA that SpinRite complained about. Wiped the drive with dd with no errors. Rescanned and it passes just fine. The pending sector is gone from the SMART attributes too. I'll get full details of the attributes as soon as Windows finishes installing.


JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5
reply to koitsu

Done so far: full zero, extended self-test, Windows reinstall, and selective 0-max test. So far it looks good. It looks like a couple more LBAs showed up, but nothing appears to be pending or reallocated. I'll check it occasionally and see if anything changes.

As for the 500's, those are less than stellar. I think I completely killed the Maxtor 500 by doing a full zero. It "completed" according to dd, but did so way sooner than I expected it to. Tried to pull up the SMART attribs again and it says the drive doesn't return an identify device. The Seagate 500 is currently running a full zero but I doubt I'll trust it either, even if it completes and passes tests.

Is there a way to zero a specific LBA using smartctl, testdisk or even dd or something similar?


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23

The ST3320418AS looks fine now. Attributes 5, 197, and 198 are all zero, which means the suspect LBA was in fact perfectly fine (did not need remapping). That's good.
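
For anyone wanting to repeat that check on their own output, here is a minimal sketch that pulls the raw values of attributes 5, 197 and 198 out of a smartctl-style attribute table. The two sample tables are made-up lines in the usual column layout, not the actual output from this thread (the 19 in the second one borrows the reallocated count mentioned earlier for the ST3500641AS); the function names are hypothetical.

```python
WATCHED = (5, 197, 198)  # reallocated, pending, offline-uncorrectable

CLEAN_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

REMAPPED_SAMPLE = CLEAN_SAMPLE.replace(
    "Pre-fail  Always       -       0", "Pre-fail  Always       -       19"
)


def sector_counts(attribute_table):
    """Map each watched attribute ID to its raw value (10th column)."""
    counts = {}
    for line in attribute_table.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED:
            counts[attr_id] = int(fields[9])
    return counts


def looks_clean(attribute_table):
    """True only if all watched attributes are present and have a raw value of 0."""
    counts = sector_counts(attribute_table)
    return all(counts.get(attr_id) == 0 for attr_id in WATCHED)


print(sector_counts(CLEAN_SAMPLE), looks_clean(CLEAN_SAMPLE))
print(sector_counts(REMAPPED_SAMPLE), looks_clean(REMAPPED_SAMPLE))
```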

smartctl does not issue write operations at the LBA level. The only operations it can do are selective scans, which are read-only and done at the firmware level. I have no idea what testdisk is (nor do I care what it is). You can absolutely write to a specific LBA using dd for Windows (I do this all the time), but you need to know the exact arguments to give -- one wrong or missing argument and you will end up destroying the drive.

Finally, on Windows, remember that after zeroing an LBA, CHKDSK isn't necessarily going to find anything wrong. You may have zeroed part of a data block/region of a file, which CHKDSK does not verify (it has no way to verify this anyway -- NTFS etc. are not checksumming filesystems). So which file now has 512 bytes of zero in it, where it previously may have held legit data, is unknown. This is why I recommend people just format the drive / write zeros to the entire thing.
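
To illustrate what "zeroing an LBA" amounts to without touching a real disk, the sketch below overwrites one 512-byte sector inside an ordinary scratch file. It assumes 512-byte logical sectors (as mentioned above) and uses hypothetical helper names; it deliberately does not show the device arguments for dd for Windows, since those are not given in this thread.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed logical sector size


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset at which the given LBA starts."""
    return lba * sector_size


def zero_sector(path, lba, sector_size=SECTOR_SIZE):
    """Overwrite exactly one sector with zeros, leaving everything else untouched."""
    with open(path, "r+b") as image:  # r+b: modify in place, never truncate
        image.seek(lba_to_offset(lba, sector_size))
        image.write(bytes(sector_size))


def demo():
    """Build an 8-sector scratch image filled with 0xFF, zero LBA 3, return the bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "disk.img")
        with open(path, "wb") as image:
            image.write(b"\xff" * (8 * SECTOR_SIZE))
        zero_sector(path, 3)
        with open(path, "rb") as image:
            return image.read()


image_bytes = demo()
start = lba_to_offset(3)
print(len(image_bytes), image_bytes[start:start + SECTOR_SIZE] == bytes(SECTOR_SIZE))
```

Against an image file, GNU dd does the same thing with `dd if=/dev/zero of=disk.img bs=512 seek=3 count=1 conv=notrunc`. The offset arithmetic is where a wrong block size or a missing seek goes wrong: the write lands on a different sector, or at the very start of the target.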


JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5


That's what I was afraid of, actually. I had a mostly working system, but if I zeroed out that LBA I could end up with a non-functioning system. I'd still end up formatting it, so the end result would be the same, and given the other LBAs that showed up (I believe you can see them in the logs), it was the better course anyway.

I actually haven't run CHKDSK, only smartctl for extended/selective tests. I thought about CHKDSK or SFC but figured if I'm going to wipe it then what's the point. Thanks for your help!