HarryH3
Premium
join:2005-02-21
kudos:3
Reviews:
·Suddenlink

Testing HDD's in RAID 1?

A client of mine has a system that was sold to them as part of a turnkey X-ray solution. The system has an Intel motherboard and two WD Black drives that are set up in RAID 1 using Intel Rapid Storage Technology (RST) software.

The backup software failed tonight with the message "error reading volume bitmap. Please run chkdsk /f". Is it possible to test the drives while they're in RAID? RST reports that the array status is normal. And oh yeah, I'm some 250 miles away from the system. That makes swapping cables around somewhat of a challenge!


squircle

join:2009-06-23
Oakville, ON
With fakeraid like that, the drives should still be individually accessible. If you were using *nix, I'd be able to help you, but I'd assume that they'd be accessible individually in Windows and *nix alike.

n_w95482
Premium
join:2005-08-03
Ukiah, CA
reply to HarryH3
Every RAID 1 that I've tested drives from individually has been readable on a PC, both chipset/software-based, and hardware (my 3ware RAID controller). Are you referring to testing them with other software, or just running chkdsk as-is?
--
KI6RIT

HarryH3
Premium
join:2005-02-21
kudos:3
Reviews:
·Suddenlink
I'm not a fan of chkdsk, as it doesn't really "fix" anything. It leaves files corrupted and just marks the formerly corrupted drive space as OK to use again. I'd like to pull the SMART info from the drives, but the WD tool for that reports gibberish characters when it scans for the drives. I'm hesitant to break the array just to run the WD diag tool on each drive. I really don't want to open a can of worms.

The system is running XP Pro.


Krisnatharok
Caveat Emptor
Premium
join:2009-02-11
Earth Orbit
kudos:12
HD Tune Pro?


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
reply to HarryH3
If by "test the drives" you mean "I want to check each individual drive for sector-level errors", I can step you through how to do that using smartmontools -- it's the only utility for Windows that can (potentially) talk to drives behind an Intel controller in RAID mode (what's called Intel MatrixRAID or Intel RST; this is "BIOS-level" RAID). It's a little tricky though, and you'll need administrator access. I also need to know exactly what version of Windows you're using (and if 32-bit or 64-bit).

Tools like HD Tune Pro and others cannot talk to disks being used with MatrixRAID/RST (bus enumeration will result in only 1 disk being shown: the RAID array/volume itself. And that is not a physical disk, obviously. :-) )

The error message you got indicates there may be filesystem-level damage. That could be the result of anything -- drives going bad (sector issues), or, say, the system being powered off abruptly repeatedly or some other anomalies. You can run CHKDSK /F if you want, but I would strongly recommend you at least look at the disks first (as I described above).

Please do not take the disks out of the system and place them in another unless instructed (not to sound egocentric, but by me). Doing so may destroy or impact the Intel RST metadata on the drive, requiring a full array rebuild. And if you MUST take the disks out, please only take *one* disk out at a time, otherwise you risk losing all data.

Let me know, time permitting.

--
Making life hard for others since 1977.
I speak for myself and not my employer/affiliates of my employer.

HarryH3
Premium
join:2005-02-21
kudos:3
Reviews:
·Suddenlink
Attachment: CrystalDiskInfo.txt (22,130 bytes)
Hi koitsu! I figured that you would stop by this thread at some point. Yes, thanks to RST, the array is presented to Windows as a single drive. I was able to interrogate both drives using Crystal Disk Info this morning. The output is attached to this post.

The system is running XP Pro, 32-bit.

I ran the "verify" option on the array from within RST and last night the backup ran without the error. But I'm still suspicious at this point. I don't like "anomalies" in data.

I have backups stored in two different places, one on an external drive and one on a NAS. Macrium Reflect runs each weeknight, with a full backup once a week and an incremental other nights. I also grab a full backup occasionally and push it to a different server, to have older copies if ever needed. So while it would take some time to do a restore, at least that option is available.

As for taking disks out of the box... That's only an option of last resort, since it's a 4+ hour drive for me to get to it.


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23

The Crystal Disk Info output isn't as good as I'd like -- I really do prefer smartmontools output, because smartmontools pulls down things like SATA PHY and SATA device counters (which can sometimes explain certain anomalies when SMART attributes look fine), and supports SMART error log information (which can tell me if the drive itself found and logged internal anomalies). CDI doesn't have that capability. smartmontools does have support for talking to disks behind Intel MatrixRAID/RST like I said.

So for now, I'll go with what you gave me, but I can't promise as precise an analysis as a result.

Also, one very key point you didn't provide in your thread: when you said Windows told you to "run chkdsk /f", on what drive/volume did it tell you to run this? You have 3 disks in your system, and I need to know what Windows drive letters apply to what.

For now here's the analysis I can do with the data I have:

 (1) WDC WD1001FALS-00Y6A0 : 1000.2 GB [X/2/0, cs]
 
- Attribute-wise this drive looks fine, except for one anomaly.
- The drive has been in use for a total of 6053 hours.
- The drive has been power-cycled a total of 33 times.
- The drive has accumulated a total of 163 CRC errors between the drive and the
  controller during its lifetime.  I'll explain more about those later.
 

 (2) WDC WD1001FALS-00Y6A0 : 1000.2 GB [X/2/1, cs]
 
- Attribute-wise this drive looks fine, except for one anomaly.
- The drive has been in use for a total of 6030 hours.
- The drive has been power-cycled a total of 27 times.
- The drive has accumulated a total of 91 CRC errors between the drive and the
  controller during its lifetime.
 

 (3) WDC WD20EADS-00W4B0 : 2000.3 GB [1/X/X, sa1] (V=1058, P=1140)
 
- Attribute-wise this drive looks "sort of" okay by my standards.
- This is a model of WD drive which excessively parks its heads (described on
  the Internet as "the LCC issue").
- The drive has been in use for a total of 4703 hours.
- The drive has parked its heads a total of 8302 times during its lifetime.
- The drive has been power-cycled 994 times during its lifetime.
 

So, things that catch my attention, and the only person that will know the reason for some of these is whoever built or manages the box, or knows how these disks have been used historically (i.e. in other environments/systems, how they were treated there, etc.).

1. Your two 1TB drives both have occasional CRC errors. I should be clear about this: CRC errors can be normal under some situations, specifically environments where electronic interference or RF interference are very very high. If these disks were used in a previous system, it is possible that the issue happened then. I would consider 163 and 91 CRC errors over the course of roughly 6030 hours very high. A proper environment would see maybe a CRC error or two every year, if that.
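To put "very high" into numbers, here is a minimal sketch of the arithmetic, using only the CRC error counts and power-on hours listed in the drive summaries above. It assumes the errors were spread evenly over the drives' powered-on time, which the counters alone cannot confirm; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous power-on time


def crc_errors_per_year(crc_errors: int, power_on_hours: int) -> float:
    """Scale a lifetime CRC error count to a rate per year of power-on time."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return crc_errors * HOURS_PER_YEAR / power_on_hours


# (CRC errors, power-on hours) taken from the two WD1001FALS summaries above.
drives = {"X/2/0": (163, 6053), "X/2/1": (91, 6030)}

rates = {name: crc_errors_per_year(*figures) for name, figures in drives.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:.0f} CRC errors per year of power-on time")
```

That works out to roughly 236 and 132 errors per year of power-on time, against the "error or two every year" expected from a clean environment.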

CRC errors result in either the drive or the controller re-submitting the ATA CDB + payload to the recipient, i.e. "Hey the CRC didn't match for that last transmission" "Okay, I'll resend". If these happen in rapid succession (i.e. resending over and over), it is possible that the OS declared the CDB as failed, and this could cause a filesystem-level anomaly (i.e. write failed). Please refer to the Windows Event Log for details -- something may be in there.

What causes CRC errors? The list of reasons is nearly endless, and these are among the most difficult problems to track down + solve.

The most common reasons are badly shielded cables (or cables which use crappy copper), cables which aren't making good contact (at either the hard disk end or the controller end), or dust in/around the ports (at either the hard disk end or the controller end).

What I recommend to people experiencing CRC errors is to watch the counters very, very closely and see if they increment and how often, and then see if that incremental nature correlates with data issues on the filesystem.

Most people just go crazy willy-nilly and go out and buy replacement SATA cables and then come back and say "ITS FIXED!" even though it really isn't (many of them don't bother coming back to say "uh, yeah its not fixed :(", while sometimes others do). Since you have 2 disks, you can experiment with one by replacing the SATA cable on one but not the other.

CRC errors can also be caused by electrical interference issues on or near the drive PCB itself, which is something that cannot be solved easily other than by replacing the drives. Sometimes the root cause is as simple as a shoddy soldering joint, or an electrical trace that isn't as solid/secure as it should be (i.e. PCB etching / manufacturing mistake). There is no way for a generic end-user to diagnose this kind of problem (and even I can't diagnose this, because PCBs are often multi-layer and I don't do EE).

I have heard in extreme cases that CRC errors can be caused by PSUs which are emitting massive amounts of electrical interference inside of cases. The solution there is to replace the PSU. I do not know how to debug/diagnose this scenario.

My recommendation would be to simply watch the Event Log for more indications of filesystem problems and when one happens, immediately get a snapshot of what the SMART attributes look like. We can then see what's changed and if CRC errors are potential causes.
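A small sketch of the "see what's changed" step, assuming each snapshot has already been reduced to a mapping of SMART attribute ID to raw value. The second snapshot's numbers are invented purely for illustration, and the function name is hypothetical. Attribute 199 is the UDMA CRC error counter, 9 is power-on hours, and 5 is the reallocated sector count.

```python
def changed_attributes(before: dict, after: dict) -> dict:
    """Return {attribute_id: raw_value_delta} for attributes whose raw value changed."""
    deltas = {}
    for attr_id, new_raw in after.items():
        old_raw = before.get(attr_id)
        if old_raw is not None and new_raw != old_raw:
            deltas[attr_id] = new_raw - old_raw
    return deltas


# Baseline uses figures from the thread; the "after" snapshot is hypothetical.
baseline = {5: 0, 9: 6053, 199: 163}
after_event = {5: 0, 9: 6077, 199: 171}

delta = changed_attributes(baseline, after_event)
print(delta)
```

If attribute 199 moves between the two snapshots while the sector-related attributes stay flat, CRC errors become the more likely suspect for that particular filesystem event.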

2. Your 2TB drive suffers from the "LCC issue" which means it is not suitable for use in an environment where I/O is being done to it regularly/often. I did a write-up of this issue for the WD30EZRX drives, but the WD20EADS, WD20EARS, and many other drives -- more specifically any of the "Green" or "GreenPower" drives -- exhibit this behaviour:

»koitsu.wordpress.com/2012/05/30/···parking/

The only WD drives that I know of right now on market that don't do this are the WD Red drives, and the WD Black drives. The WD Blue drives may or may not do it (I know the 2.5" ones do, but that's because they're intended for laptops, where parking heads makes sense).

Be aware present-day Seagate drives also behave this way. Earlier models of their 1TB platter disks (I don't have model numbers right now) did this but did not increment the LCC counter in SMART (thanks Seagate!), while present-day models do increment the LCC counter (so you at least know it's happening). The design/behaviour is still completely wrong for 3.5" disks however, and I keep waiting for someone to start a class-action lawsuit against the MHDD vendors for this decision.

3. Your 2TB drive has a very high power-cycle count: 994. This is separate from the LCC issue. The system isn't losing power (if it were, the 1TB drives would show a similarly high number). So unless you know that the drive itself has lost power 900+ times in the past (i.e. it was used in a system where the owner shut the power down regularly, or shut the PC off when they were done -- doesn't matter if it's a clean shutdown or an abrupt power-off), it looks to me like this drive is losing power for some reason.

I would recommend replacing this drive with a WD Red 2TB drive if you can get your hands on one. The I/O speed will be better (not as amazing as the Blacks, but better than the standard Green) and you'll be killing two birds with one stone.

4. None of your drives show any sector-level problems, which is good. So any filesystem anomalies that are occurring, depending on what the filesystem is that's seeing the anomaly, are either the result of CRC errors, or (in the 2TB drive's case) are the result of a drive doing excessive LCC or losing power abruptly.

As a result of this analysis, it is safe for you to run CHKDSK /F on whatever filesystem needs it. Please be sure to follow Windows' recommendation of letting it do the analysis during the next reboot -- do not force an active volume unmount.

Also, a question: can you please tell me what exact Intel RST driver version you're using? It matters. There were known bugs in some versions causing drives to drop off the SATA bus mysteriously (yes really, DSLR users found this out actually, and Intel has since fixed it).

I think that's about all I can say for now, other than: again, I would really prefer to see smartctl -x output for all of these drives.

o770

join:2002-06-12
Hello koitsu. Thanks for always such valuable information! I believe you know this and I just want to add that the 2.5" WD Black also uses the IDLE3 timer.


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
said by o770:

Hello koitsu. Thanks for always such valuable information! I believe you know this and I just want to add that the 2.5" WD Black also uses the IDLE3 timer.

Thanks -- I'm not surprised. The "excessive LCC" (head parking) issue should apply to 2.5" form factor drives -- they're intended to be used in laptops, and laptops are intended to be carried around and picked up/rotated/put down on surfaces, so parking heads there makes sense. However, I feel very strongly this feature should be toggleable on all drives (there ARE many servers now which use 2.5" non-SSD SATA drives!) regardless of form factor.

On WD Black drives (at least the 3.5" ones), APM is not toggleable (the feature is removed entirely), while on WD Green drives APM is toggleable and enabled by default (disabling APM also disables the excessive LCC problem). So for WD Black 2.5" drives that have APM enabled and are seeing high LCC, disabling APM (if you can) would be a workaround of sorts.

n_w95482
Premium
join:2005-08-03
Ukiah, CA
reply to koitsu
Awesome post! Do you know if the RE series drives have the same problem? If you need me to test, I have various RE1-4 drives available.


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
said by n_w95482:

Awesome post! Do you know if the RE series drives have the same problem? If you need me to test, I have various RE1-4 drives available.

My own experience shows that the only RE-series drives which behave this way are the ones that have the -GP ("GreenPower") suffix, like the RE4-GP. You won't find that drive mentioned on their site any more since it's "old" (ahem).

Basically anything that has "IntelliPark" (be very careful with this term -- WD has 3 different marketing buzzwords right now for 3 different features, and they all start with "Intelli") will behave this way. On some drives you can inhibit it by disabling APM (i.e. smartctl -s apm=off on some drives, or use whatever ability your OS offers), on others you cannot.

Present-day Seagate drives also have this same design/behaviour (all their 1TB platter models), however the drives segregate head-parking ("LCC") from APM, so disabling APM is not a workaround. There is no solution on Seagate drives, and their forum has been filled for about a year (or more) with people complaining, yet the feature remains enabled in their firmware.

The few Hitachi drives I owned did not excessively park their heads in any way. I do not know about Samsung (I don't buy their drives). But all of this is becoming moot anyway since WD and Seagate bought everyone out.

Footnotes (because I see people do this all the time):

1. SATA PM is not APM. SATA PM is power management for the actual underlying SATA PHY and has nothing to do with disk-level features. Do not disable SATA PM!
2. AAM (acoustic adjustment) has nothing to do with APM.
3. Disk APM has nothing to do with the old pre-ACPI power management capability of x86 PCs called APM. Two different things, same acronym.
4. On a purely technical level, APM and head parking are actually separate/independent -- it just so happens that on some WD drives/firmwares, repeated head parking can be inhibited by disabling APM. There is no official per-ATA-spec CDB or sub-CDB that lets you adjust the head parking/LCC behaviour.
5. On some drives, disabling APM can be accomplished by setting the APM level to 255 (0xff), rather than "off". The ATA specification does discuss this at length (and I'm familiar with it), but I'd rather not go into that here if I can avoid it.


n_w95482
Premium
join:2005-08-03
Ukiah, CA
I looked at WD's literature for the RE2/3/4 and saw no mention of IntelliPark, so I guess I don't have to worry about it.


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
said by n_w95482:

I looked at WD's literature for the RE2/3/4 and saw no mention of IntelliPark, so I guess I don't have to worry about it.

Sadly WD's product literature does not always tell the full story. For example, the product literature for the WD Red drives (when they came out) stated the drives supported IntelliPark even though they didn't. WD has since updated their literature to reflect reality. I even mention this in my tiny review (see paragraph below chart). You can't rely entirely on firmware string either (see my WD30EZRX article, very bottom of page), i.e. firmware version x.y.z is not necessarily going to universally behave one way (and I provide proof of that).

If you already have drives and aren't sure what their behaviour is, just look at SMART attribute 193 / 0xC1. If the RAW_VALUE number is abysmally high (in the thousands or tens of thousands at least), the drive is parking heads excessively. If the drive supports APM, try disabling that to inhibit the excessive head-parking.
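As a rough sketch of that check: the snippet below pulls RAW_VALUE for attributes 9 (power-on hours) and 193 (load cycle count) out of a table laid out like smartctl's attribute listing. The sample text is a hand-made stand-in whose raw values come from the WD20EADS figures earlier in the thread; the other columns are illustrative, not real output. The parks-per-hour figure is just an added convenience for comparing drives of different ages, not a threshold stated in this thread.

```python
# Hand-made sample in the style of a smartctl attribute table (not real output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4703
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8302
"""


def raw_values(attribute_table: str) -> dict:
    """Map attribute ID to integer RAW_VALUE for each well-formed table row."""
    values = {}
    for line in attribute_table.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID; RAW_VALUE is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[int(fields[0])] = int(fields[9])
    return values


attrs = raw_values(SAMPLE)
parks_per_hour = attrs[193] / attrs[9]
print(f"Load cycles: {attrs[193]}, about {parks_per_hour:.2f} per power-on hour")
```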

JoelC707
Premium
join:2002-07-09
Lanett, AL
kudos:5
reply to koitsu
said by koitsu:

So for WD Black 2.5" drives that have APM enabled and are seeing high LCC, disabling APM (if you can) would be a workaround of sorts.

Looks like I need to find some maintenance time for my servers then. I bought some 2.5" 750GB WD Black drives a few months ago for my virtual servers for VMDKs that did not need the speed of the SAS drives in the servers. If I check them and find the LCC value isn't that high, do you think I should disable APM anyway "just in case"?


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
No. Leave APM enabled / leave the defaults alone unless you're seeing evidence of excessive parking through SMART attribute 193.

HarryH3
Premium
join:2005-02-21
kudos:3
Reviews:
·Suddenlink
reply to koitsu
said by koitsu:

Also, one very key point you didn't provide in your thread: when you said Windows told you to "run chkdsk /f", on what drive/volume did it tell you to run this? You have 3 disks in your system, and I need to know what Windows drive letters apply to what.

That message came when backing up Drive C. However, Drive C is a RAID mirror set, consisting of the two 1TB drives listed.

said by koitsu:

For now here's the analysis I can do with the data I have:

 (1) WDC WD1001FALS-00Y6A0 : 1000.2 GB [X/2/0, cs]
 
- Attribute-wise this drive looks fine, except for one anomaly.
- The drive has been in use for a total of 6053 hours.
- The drive has been power-cycled a total of 33 times.
- The drive has accumulated a total of 163 CRC errors between the drive and the
  controller during its lifetime.  I'll explain more about those later.
 

 (2) WDC WD1001FALS-00Y6A0 : 1000.2 GB [X/2/1, cs]
 
- Attribute-wise this drive looks fine, except for one anomaly.
- The drive has been in use for a total of 6030 hours.
- The drive has been power-cycled a total of 27 times.
- The drive has accumulated a total of 91 CRC errors between the drive and the
  controller during its lifetime.
 

Both of the above drives were installed at the same time. It's been a while, but it appears that I booted the system a few times with only a single drive. Otherwise I would expect the in-use time and the power-cycle count to be identical between the two. However, that doesn't explain the 23-hour difference between the in-use times. :o

said by koitsu:

 (3) WDC WD20EADS-00W4B0 : 2000.3 GB [1/X/X, sa1] (V=1058, P=1140)
 
- Attribute-wise this drive looks "sort of" okay by my standards.
- This is a model of WD drive which excessively parks its heads (described on
  the Internet as "the LCC issue").
- The drive has been in use for a total of 4703 hours.
- The drive has parked its heads a total of 8302 times during its lifetime.
- The drive has been power-cycled 994 times during its lifetime.
 
...

2. Your 2TB drive suffers from the "LCC issue" which means it is not suitable for use in an environment where I/O is being done to it regularly/often. I did a write-up of this issue for the WD30EZRX drives, but the WD20EADS, WD20EARS, and many other drives -- more specifically any of the "Green" or "GreenPower" drives -- exhibit this behaviour:

3. Your 2TB drive has a very high power-cycle count: 994. This is separate from the LCC issue. The system isn't losing power (if it were, the 1TB drives would show a similarly high number). So unless you know that the drive itself has lost power 900+ times in the past (i.e. it was used in a system where the owner shut the power down regularly, or shut the PC off when they were done -- doesn't matter if it's a clean shutdown or an abrupt power-off), it looks to me like this drive is losing power for some reason.

The above drive is an external drive, connected via USB. (My Book Essential, I think). It typically doesn't show up after a reboot unless the USB cable gets unplugged and plugged back in. But yeah, it does have the stupid "park the heads at every opportunity" firmware.
It gets 1 backup file written to it each weeknight. The rest of the time it's just warming the room a bit. ;)

said by koitsu:

As a result of this analysis, it is safe for you to run CHKDSK /F on whatever filesystem needs it. Please be sure to follow Windows' recommendation of letting it do the analysis during the next reboot -- do not force an active volume unmount.

After a reboot, RST noticed a difference between the drives and automagically started verifying the data. After that completed, the filesystem error was gone. RST doesn't seem to provide much data as to just what exactly it did, so I'm at somewhat of a loss on this!

said by koitsu:

Also, a question: can you please tell me what exact Intel RST driver version you're using?

The version is 10.8.0.1003.

said by koitsu:

I think that's about all I can say for now, other than: again, I would really prefer to see smartctl -x output for all of these drives.

I have downloaded smartctl, but I've yet to determine the magic for determining which drive to run it on. /dev/sda finds the array, while /dev/pd1 finds the external drive. Can you provide some hints for getting the correct data for you?

Thanks so much for taking the time to read and help me comprehend this! :)


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
Intel's RST drivers are up to 11.2. I would strongly suggest upgrading. I've been trying to find the thread here on BBR/DSLR where some of the community found out about the drop-off-the-bus bug (and Intel's changelog did in fact confirm it), but sadly I can't find it, otherwise I'd be able to tell you what exact version fixed that bug.

Syntax for the smartctl command: »sourceforge.net/apps/trac/smartm···trollers

Specifically smartctl -x /dev/csmiX,Y, where X and Y are numbers that can start at 0. So the first drive in the RAID set might be 0,0 or it might be 0,1 or 1,0. Really not sure. You can try smartctl --scan to see if that helps give any details. Here's an old post from me about the syntax.
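Since the right X,Y pair isn't known up front, one way to work through the possibilities is to list the candidate commands and try them in order. The sketch below only builds the command strings from the syntax above and does not run anything; the upper bounds on X and Y are guesses for a two-disk array, and the function name is made up for illustration.

```python
from itertools import product


def csmi_candidates(max_x: int = 1, max_y: int = 1) -> list:
    """Build 'smartctl -x /dev/csmiX,Y' commands for X in 0..max_x, Y in 0..max_y."""
    return [
        f"smartctl -x /dev/csmi{x},{y}"
        for x, y in product(range(max_x + 1), range(max_y + 1))
    ]


# Ranges are assumptions; widen them if none of these finds the disks.
commands = csmi_candidates()
for command in commands:
    print(command)
```

Running smartctl --scan first, as suggested above, may make the guessing unnecessary.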

For non-RAID disks or standalone disks, you should use either the drive letter (i.e. smartctl -x D:) or /dev/sdX syntax (where X = "a" for the 1st drive, "b" for the 2nd, etc.). This is a good habit to get into, since on Windows 7 things like /dev/hda don't work for most setups but /dev/sda would be fine.

Generally speaking, for USB-attached disks, you may or may not be able to get SMART stats at all, as it greatly depends on if the USB-SATA bridge/chip supports SMART pass-through or not (many don't). Flip a coin. This is one of the many reasons I hate most USB enclosures.

HarryH3
Premium
join:2005-02-21
kudos:3
Reviews:
·Suddenlink
Attachment: smartctl_output.txt (46,861 bytes)
I had seen the new RST version 11.6, but it doesn't include support for XP. Thanks for the pointer to 11.2. It's updated.

The --scan option did the trick! Attached is the smartctl -x data from all three drives. (It seems to have also seen the external drive OK. Perhaps WD has some small bit of intelligence built into their USB adapter?)


koitsu
Premium,MVM
join:2002-07-16
Mountain View, CA
kudos:23
Thanks for the output -- both WD Black drives look fine, barring the CRC errors, so my previous analysis stands true.

The USB-SATA bridge/chip used in the WD enclosure probably just allows SMART passthrough, or it's a model of chip smartmontools has support for. Dunno -- not really a focus of mine right now.

Wish I had an explanation for what happened to your array that would cause filesystem issues, but the best guess I have is that the system might have lost power at some point (despite NTFS being journalled it can still have issues of this sort), possibly one of the drives fell off the bus and made things quite angry, or maybe you got bit by some RST driver bug. Lots of possibilities, with no hard confirmation.

All I can tell you is that your WD Black drives, on a sector level, look perfectly healthy. No need to replace either of them, which I think is ultimately what matters here.

HarryH3
Premium
join:2005-02-21
kudos:3
Reviews:
·Suddenlink
Thanks koitsu! BTW, do you ever sleep?

The system is on a UPS and write caching is disabled, just in case. High performance isn't an issue, but high availability is, so I try to keep it running all the time.

I'll keep an eye on things and see if perhaps upgrading to 11.2 of RST makes any difference. (Reading through the RST forums on Intel's site really makes it seem like a buggy POS).