Benchmarking the MD3000 powervault under linux
(old news - 07:32PM Sunday Sep 09 2007)
Second update: I have finished comparing RAID5 to RAID10. RAID5 is about 20% faster on sequential reads but about 20% slower on sequential writes. Orion reports that total throughput over two RAID5 LUNs is about 20% lower than over two RAID10 LUNs with a 5:95 write:read ratio. Maximum sustained IOPS on the "small I/O" test also dropped from over 6000 to 4000, a decrease of about a third. My conclusion is therefore that unless you need the maximum possible sequential read performance or the maximum possible space utilisation, you should avoid RAID5 and pick RAID10.

Update: Just a day after I posted this, Dell released the MD3000i, which appears to be an MD3000 with gig/e ports instead of SAS ports. Again, the performance of the Dell MD3000i is entirely obscure, but my guess is that it is the same as the MD3000. The controller software certainly sounds very familiar! At least this time the "fully equipped" MD3000i with four gig/e ports can blame any benchmark results on the maximum wire speed of 4 gig/e ports (less than 100 MB/sec x 4).

I've been spending some time benchmarking the Dell MD-3000 powervault storage array under SuSE 10.2 x86_64 linux. There isn't a lot of information out there on this unit; one of the more useful pages I found is this blog post: Performance of the MD3000 with ORION. In summary: it is OK for the price we paid (half retail), but this storage array, which has the guts of an old IBM DS4100 with an anemic 485 MB/sec internal bus speed, is not able to max out the total sequential read or write performance of the 15 disks it can contain. I imagine that if you expand it with MD-1000 enclosures this deficit becomes even more obvious. More on that later.

The setup

Two Dell 1950 hosts, each with two SAS 5/E HBAs (host bus adapter cards). The idea is to set up a highly available configuration. The SAS 5/E cards each have two ports, but since the MD3000 has a maximum of 4 ports (two per physical controller module) I am only using ONE port on each card, and four SAS cables. Both Dell 1950 poweredge hosts also have two internal drives connected to a PERC5/i and configured as a single mirror for the OS. The MD3000 does not support booting a host OS. It is fully populated with 15 Seagate 15k SAS drives (136 GB usable each).

Theoretical throughput

Each SAS 5/E card is PCI-X, and each slot has a dedicated bus on the 1950. Each SAS cable runs at 3 gigabit/second full duplex. Each of the 15 drives can sustain a sequential read of about 90 MB/sec and a write of nearly that much. If they were 300 GB drives, read performance would be over 100 MB/sec. There is little if any information from Dell on how the 15 drives are connected to the controller modules.

The Dell advertising blurb describes the MD3000 as having a possible peak bandwidth of 1400MB/sec:
Active-active RAID controllers can produce throughputs up to 1400MB/sec and approximately 90000 IOPS from cache

I wonder if it can reach that speed?
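As a rough sanity check, the figures quoted above can be put side by side. This is only a back-of-envelope sketch: it assumes one 3 gigabit SAS cable carries roughly 300 MB/sec of data (the figure this post uses for a single cable), and the names are made up for illustration.

```python
# Back-of-envelope ceilings built from the figures quoted in this post.
# Assumption: one 3 gigabit/second SAS cable moves roughly 300 MB/sec.

DRIVES = 15
DRIVE_SEQ_READ_MB = 90     # per-drive sequential read, MB/sec
SAS_CABLE_MB = 300         # approximate payload rate of one SAS cable
CABLES_PER_HOST = 2        # one port in use on each of the two HBAs
ADVERTISED_MB = 1400       # Dell's "up to" figure, quoted as from cache


def ceilings():
    """Return the theoretical limits, in MB/sec, keyed by description."""
    return {
        "all drives, sequential read": DRIVES * DRIVE_SEQ_READ_MB,
        "two SAS cables to one host": CABLES_PER_HOST * SAS_CABLE_MB,
        "advertised peak (from cache)": ADVERTISED_MB,
    }


if __name__ == "__main__":
    for label, mb in ceilings().items():
        print(f"{label:32s} {mb:5d} MB/sec")
```

On these assumptions the disks themselves could nearly reach the advertised figure, but a single host with two cables could not see more than about 600 MB/sec of it.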

Multi-path support the Dell way

Since each host is connected to the MD3000 via two HBAs, two cables and two MD3000 controller modules, transparent fail-over support would be an obvious item on the wish-list. The Dell resource CD provides an RDAC module that you are required to compile up yourself, and a newer mptsas kernel driver. There were several problems implementing things as Dell expect you to:
    • The Dell-provided mptsas does not compile on vanilla kernel releases after 2.6.20 (such as Fedora Core 7) due to an API change to the work queues.
    • The Dell-provided mptsas does not compile on distro kernels patched for "wide port API", because the code checks for kernel version 2.6.18 or more before enabling it. SuSE 10.2 is, for example, kernel 2.6.16 (with patches). This is easily fixed by adding two #defines.
    • I also had trouble compiling RDAC due to an incorrect symlink.
    • The Dell-provided mptsas module is newer than the official LSI drivers, but there are no release notes or history for it, so it isn't clear what it fixes or adds vs the standard module from LSI.

After trying the array successfully with Fedora Core 5 and CentOS5 (which is RHEL 5 64bit), and exploring all the above issues, in the end I settled on SuSE SLES-10-SP1 x86_64 (SuSE 10 service pack 1 for 64bit) and used it as-is. There was no need to install anything other than the Java "SMdevices/SMmonitor/SMagent" stuff on the resource CD.

Multi-path support via multipath tools

As an alternative to the IBM/Dell RDAC solution I went with multipath-tools.

Linux multipath tools provide some amount of device independent support for multipath IO. In brief, once configured correctly, they export /dev/dm-N devices that one should use instead of /dev/sd? devices. The /dev/dm-N devices are transparently (hopefully!) failed over and back depending on what the multipath daemon finds is going wrong with the underlying devices.
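To see which sd? devices sit underneath each dm-N device, one option is to read sysfs. The sketch below assumes a kernel that exposes /sys/block/dm-N/slaves; the function name is my own and this is not part of multipath-tools.

```python
import os


def dm_slaves(sysfs_block="/sys/block"):
    """Map each dm-N device to the block devices it is built on.

    Assumes the kernel exposes /sys/block/dm-N/slaves; returns an empty
    dict when the sysfs directory is not present at all.
    """
    result = {}
    if not os.path.isdir(sysfs_block):
        return result
    for name in sorted(os.listdir(sysfs_block)):
        if not name.startswith("dm-"):
            continue
        slaves_dir = os.path.join(sysfs_block, name, "slaves")
        if os.path.isdir(slaves_dir):
            result[name] = sorted(os.listdir(slaves_dir))
    return result


if __name__ == "__main__":
    for dm, slaves in dm_slaves().items():
        print(dm, "->", ", ".join(slaves) or "(no slaves)")
```

With two paths per LUN, each dm-N entry would be expected to list two sd? devices, one per HBA.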

The problem with using multipath-tools on this MD3000 is that you must verify your kernel can speak RDAC in the device-mapper. This support comes in the form of a bunch of device mapper kernel patches, and fedora core 7 and perhaps most other distros do not have these by default. (I'm out of my depth here!). You'll know that you've not got them because you can't make multipath-tools work. SuSE 10.2 does have the patches.

The multipath configuration file that I used is:


As you can see, I don't want multipath tools to try to probe either the DRAC5 management card "virtual devices", or the MD3000 management "Access disk" which appears as a 20gb drive that can't be used as a filesystem. Notice that the configuration file refers to the aforementioned kernel rdac support! (this is not the same as the Dell RDAC driver). Without the right kernel, the "path_checker" line will fail to work as will the "hardware_handler" line.

If all multipath-tools are installed without error then after fresh boot you can do something like this (the -d flag is "dry run" and is more likely to get you output than just -ll if you have any other issues with missing kernel features):


You can see that I set up two LUNs, one mapped through HBA #1 with a backup through HBA #2, and the other the reverse. The hot-standby paths are called "Ghosts" by multipath. If I joined these two LUNs up via LVM2 or mdadm (linux software raid), then in theory I would be load balancing between the two HBAs. If a single "path" to the storage array fails (HBA, cable or MD3000 controller failure) then one dm-N device moves to its buddy on the other HBA and we should still be in business.

Note that if the Ghost devices are accessed, whether deliberately or because of a real path failure, the MD3000 will report via the management console that it is in a "non-optimal" state, because a LUN has moved from its "preferred controller" to the backup controller. You can trigger this simply by using dd to read from a Ghost device.

Note: if you do not install a multipath solution and put two cables from the host to the MD3000 you will see two LUNs (sd? devices). If you try to use both at the same time, the controllers will thrash, moving the disk array back and forth from slot 0 to slot 1 trying to keep up with your access pattern and performance will be awful. Don't ask how I figured this out.

Benchmarking Introduction

So it is all set up. How fast is it?

I played around with benchmarking this thing in a number of different ways. I've used software raid to stripe md0 across dm-1 and dm-2 (hoping to see better throughput when both HBAs are teamed), and I've tried LVM2 instead of software raid. I've used the underlying devices directly, with and without partitions, and also tried Ext3 and XFS. In general, the more layers, the slower things become. For instance, Ext3 on top of LVM2 on top of dm-N on top of sd? might be 20% slower than raw access to /dev/sdX.

The benchmarking tools I tried ranged from simple dd for sequential write and read, through hdparm -t for sequential non-buffered read and "seeker" (see below) for random single-block IO, to iozone for a grid of data and Oracle "orion" for a simulation of database workloads.
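For readers who have not used dd this way, the sequential read measurement amounts to timing a loop of large reads. The following is a minimal Python sketch of that idea, not the tool I used; the function name and defaults are illustrative. Note that it reads through the host page cache, so it is only meaningful on data sets much larger than host memory.

```python
import time


def sequential_read_mb_per_sec(path, block_size=1024 * 1024, max_bytes=None):
    """Read `path` front to back in large blocks and time it.

    Returns (bytes_read, MB/sec). Reading stops at end of file or once
    at least `max_bytes` have been read (the last block may overshoot).
    """
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while max_bytes is None or total < max_bytes:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.monotonic() - start
    rate = (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Pointing it at a multipath device, for example sequential_read_mb_per_sec("/dev/dm-1", max_bytes=8 * 1024**3), would need root access and reads 8 GB.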

When running any benchmark you have to be aware of the chain of cache in use for the test. There are two possible caches of concern: the physical memory of the machine running the test & the cache on the raid controllers inside the MD3000 (512mb, supposedly, although it isn't clear if this is 256mb per controller or what). There is also probably a small individual per-drive cache but that would be overwhelmed by the other caches.

In order to avoid testing cache speed instead of I/O array speed I made sure that the hosts were rebooted with mem=768m which means they have minimal memory free for blockio cache. The Orion benchmark tool can take into account an amount of cache before it runs - it fills the cache with random data before performance measurements start - so I took advantage of that flag to kill the MD3000 cache. Other than that I basically tried to make sure the tests involved many times the physical available memory of the host.
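The sizing rule can be written down explicitly. This sketch assumes the worst case where the whole of host memory and the whole controller cache hold test data; the multiple of 8 is my own arbitrary stand-in for "many times".

```python
def min_test_size_mb(host_ram_mb, controller_cache_mb, multiple=8):
    """Smallest data set (MB) that is `multiple` times the combined caches."""
    return multiple * (host_ram_mb + controller_cache_mb)


def max_cached_fraction(test_size_mb, host_ram_mb, controller_cache_mb):
    """Upper bound on the share of the data set that could sit in cache."""
    return min(1.0, (host_ram_mb + controller_cache_mb) / test_size_mb)


if __name__ == "__main__":
    ram_mb, cache_mb = 768, 512   # mem=768m host, 512mb controller cache
    size = min_test_size_mb(ram_mb, cache_mb)
    print("test at least", size, "MB")
    print("cached share at most", max_cached_fraction(size, ram_mb, cache_mb))
```

For an 8 GB test on a mem=768m host with a 512mb controller cache, at most about 16% of the data could come from cache.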

It is interesting to note that hdparm -t is nearly useless for this MD3000, because although it avoids any host cache it can't avoid the 256mb controller cache. hdparm -t can report an MD3000 LUN as running at 300mb/sec (pretty much the speed of one SAS cable).

The MD3000 "read" cache

It is possible to disable the 'read' cache on a per LUN basis. I've seen it written that read caches on external storage units should almost always be disabled, because you want to reserve as much cache space as possible for non-blocking write operations. The read cache can be set with a script which is fed to the SMcli utility:


Setting the cache via the gui management interface is not possible.

My conclusions so far are that for throughput tests, disabling the readCache created too big an impact. Performance on long sequential reads dropped remarkably without a readCache. This was unexpected (how can a tiny 256mb cache help with reading 8gig of data sequentially?) unless the readCache is helping the controller modules read ahead using multiple drives in the disk group, and disabling the cache crimps that ability. So I left the readCache enabled.

Total throughput tests

The unit did not reach its theoretical performance in any total throughput test involving all drives. With 14 drives (7 unique and 7 more ready with duplicate data) total read performance could, theoretically, approach 80 x 14 = 1120mb/sec, over one gigabyte a second! With a single SAS card and a JBOD array there is certainly evidence of this performance under linux 2.6. Dual HBA cards and/or two SAS cables should support 600mb/sec to the host.

Unfortunately the fastest I could get dd or iozone to run was about 280mb/sec in one direction. By combining two LUNs with software raid0, so that both controllers are in use, speed rose to 370mb/sec for sequential read.

Orion reported over 400mb/sec total throughput with a mix of read and writes. IOzone would typically report around 300mb/sec seq write and a little more seq read.

By mixing dd reads and writes across three LUNs in two disk groups, total throughput grew close to 600mb/sec. Perhaps with further experimentation it would be possible to determine what size disk group is optimal, and what mix of work generates the maximum total throughput.
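To put these results in proportion, here they are as a share of the theoretical 14-drive read figure (80mb/sec x 14). The labels are shorthand for the tests described in this section, and rounding to whole percent is my choice.

```python
# Measured totals from this section versus the theoretical 80 x 14 figure.
THEORETICAL_MB = 80 * 14

MEASURED_MB = {
    "dd/iozone, one LUN, one direction": 280,
    "software raid0 over two LUNs, seq read": 370,
    "Orion, mixed reads and writes": 400,
    "mixed dd, three LUNs, two disk groups": 600,
}


def percent_of_theoretical(measured_mb, theoretical_mb=THEORETICAL_MB):
    """Return measured throughput as a percentage of the theoretical total."""
    return 100.0 * measured_mb / theoretical_mb


if __name__ == "__main__":
    for label, mb in MEASURED_MB.items():
        print(f"{label:40s} {mb:4d} mb/sec  "
              f"{percent_of_theoretical(mb):3.0f}%")
```

Even the best mixed workload reaches only a little over half of what the drives could theoretically deliver.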

Single block random reads

Using the seeker.c random seek/read utility, modified to use 48-bit random numbers seeded correctly (not from time in seconds), and run in parallel 60 times or more, I could push the enclosure to about 6000 IOPS, at which point adding more work just increased latency with no increase in total IOs per second (reading the IOs from iostat). I think this result reflects the speed at which a 14 disk LUN can work more accurately than the throughput tests did. A single SAS drive can do only a few hundred IO operations per second (depending mainly on the drive's average seek time).
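The idea behind the test is simple enough to sketch in Python. This is not seeker.c, just an illustration of the same technique under my own assumptions: several workers each issue single-block reads at random block-aligned offsets for a fixed time, with each worker's generator seeded from os.urandom rather than from the clock.

```python
import os
import random
import threading
import time


def _worker(path, blocks, block, deadline, counts, idx):
    # Seed from the OS entropy pool, not from the time in seconds, so
    # parallel workers do not all walk the same sequence of offsets.
    rng = random.Random(int.from_bytes(os.urandom(8), "big"))
    fd = os.open(path, os.O_RDONLY)
    try:
        done = 0
        while time.monotonic() < deadline:
            os.pread(fd, block, rng.randrange(blocks) * block)
            done += 1
        counts[idx] = done
    finally:
        os.close(fd)


def random_read_iops(path, seconds=10.0, workers=60, block=512):
    """Return total single-block random reads per second across workers."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # lseek to the end works for block devices as well as files.
        blocks = os.lseek(fd, 0, os.SEEK_END) // block
    finally:
        os.close(fd)
    if blocks == 0:
        raise ValueError("target is smaller than one block")
    counts = [0] * workers
    deadline = time.monotonic() + seconds
    threads = [
        threading.Thread(target=_worker,
                         args=(path, blocks, block, deadline, counts, i))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / seconds
```

As with any such test, the target has to be far larger than the caches or the figure is meaningless. For scale, 6000 IOPS spread over 14 drives is roughly 430 per drive.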

Oracle 'Orion' benchmark

The Orion manual is here.
Orion, using a 5% write / 95% read mix, generated a matrix of results, which I include below in a spreadsheet. Scroll the iframe right to see the results graph. Worryingly, however, on both hosts the full benchmark hard-locked the machine (no kernel panic) about three quarters of the way through the 3+ hour run.



The spreadsheet has two tabs, one for MB/sec the other for IOPs. The matrix of results in the first tab represents a mix of small tasks and large tasks. With many small tasks and no large tasks, total throughput rises more slowly as workload increases (the bottom curve). With all large tasks and no small tasks (top curve), total throughput looks like it will plateau around 400mb/sec as workload increases.

The orion command line uses two LUNs as though they were striped together in a single volume (-simulate raid0). The two LUNs used are not shown; they are listed in a config file that is created before orion runs.

Other Resources:

Note: it appears to me that the MD3000 uses the same controllers as the IBM model DS4100. The same SMcli utility and script, same SMagent/SMmonitor utilities, same controller memory!

Page 4 of the IBM manual gives the performance characteristics of the DS4100 that Dell do not:
IOPS from cache: 70k
IOPS from disk: 10k
Disk through-put: 485MB/sec (although DS4100 only supported slow SATA drives)

You can pick up fully populated (SATA) DS4100s on ebay for $5k; retail was $21k. Pictures of the rear reveal a very similar controller arrangement of ports.

Conclusion

Well, I am rather miffed that the MD3000 is pimped on the Dell site as a state-of-the-art (albeit lower-end) modular storage array but is actually an IBM DS4100 in dell drag - very "End Of Life" gear, no?

Documentation is appalling (the IBM manual is very good, however. Shame Dell didn't copy that as well). There is much more information on tuning the enclosure from IBM, which is also providing the RDAC kernel module, although IBM information is for their DS4100 only.

Performance is adequate for the dollars (we bought ours on ebay as reconditioned equipment) but the controller modules are clearly not capable of driving 15k SAS drives to their limit, and the controller cache memory dates from an era where 1gb of host memory was a big deal!


Random site news information and ponderings, by Justin
Forums » Benchmarking the MD3000 powervault under linux

fcisler
Premium
join:2004-06-14
Riverhead, NY

Interesting

Very Interesting, Justin.

We just received (about an hour ago) our MD3000, along with two 2950's (dual quad core, 8GB ram). We are using Windows 2003 R2 SP2, and this is for an exchange cluster.

Once it's setup, before we put it into production, I'll try and get some benchmarks out of it.

justin
Australian
join:1999-05-28
Brooklyn, NY

Host:
IPv6
Webmasters and Dev..
Business Connectiv..
Home/Office setup ..
Console/Handheld g..

Re: Interesting

Oracle/Orion is available as a windows binary. That would be a very easy run!
I'd suggest reading over the IBM redbook manual I linked to, it has an interesting section suggesting that for most applications, except those with a very high percentage of writes, this storage unit works better with RAID5 than RAID10 logical units. That was unexpected, as RAID5 seems to have gone out of favor lately.

AntiFUD

@dell.com

Benchmarking? I think not.

Perhaps when you attempt benchmarking with an operating system that is tested with and validated to work with the unit, the results will be more accurate. When the management software for a unit has to be hacked up (yes, hacked up, you even said you were out of your element there!) it cannot be depended upon to produce the same output that it did when run as intended.
As well, was any kernel tuning performed to focus on I/O? You didn't include any in the brief write up of the steps you did perform. So far, the evidence you present in this 'benchmarking' run is flimsy at best, cobbled together sewage at best. Best of luck spreading FUD.

justin
Australian
join:1999-05-28
Brooklyn, NY


edit:
September 10th, @03:05PM

Re: Benchmarking? I think not.

said by AntiFUD :

Perhaps when you attempt benchmarking with an operating system that is tested with and validated to work with the unit, the results will be more accurate.
I'm using SuSE x64 10.1, here is the certification matrix from the Dell manual:


Am I missing something?

When the management software for a unit has to be hacked up (yes, hacked up, you even said you were out of your element there!) it cannot be depended upon to produce the same output that it did when run as intended.
I'm out of my element when it comes to talking about the history of kernel patches to support rdac on device mapper, but actually my SuSE is vanilla 10.1, and the management software isn't changed one jot. Nothing was "hacked up". I got the same performance on CentOS 5 (which is RHEL, also on the certified list) with the RDAC driver, as one would hope: the RDAC driver has nothing to do with performance, it handles failover only.

As well, was any kernel tuning performed to focus on I/O?
No kernel tuning is required in order to benchmark external enclosures with IOZone or dd or Orion or Bonnie++. The kernel is hardly involved; the work is done by the HBA and the enclosure. During benchmarks the cpu is nearly 100% idle. If there is "kernel tuning" required in order to extract more performance, where is this tuning documented in the MD-3000 install guide? There is actually some minor tuning discussed in the IBM redbook manual for the DS4100: it involves being careful that the sum of all nr_requests (scsi queues) does not exceed the capabilities of the MD-3000, because if it does, data loss can result.
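For anyone who wants to check that sum, the per-device values live in sysfs. This is only a sketch: the function name is mine, it assumes the usual /sys/block/<dev>/queue/nr_requests layout, and the limit the enclosure can actually cope with has to come from the manual, I don't have a number for it.

```python
import os


def total_nr_requests(sysfs_block="/sys/block", prefix="sd"):
    """Return ({device: nr_requests}, total) for devices matching `prefix`.

    Devices whose queue/nr_requests file is missing or unreadable are
    skipped rather than treated as errors.
    """
    per_dev = {}
    if not os.path.isdir(sysfs_block):
        return per_dev, 0
    for name in sorted(os.listdir(sysfs_block)):
        if not name.startswith(prefix):
            continue
        path = os.path.join(sysfs_block, name, "queue", "nr_requests")
        try:
            with open(path) as f:
                per_dev[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return per_dev, sum(per_dev.values())


if __name__ == "__main__":
    devices, total = total_nr_requests()
    for dev, n in devices.items():
        print(dev, n)
    print("sum of nr_requests:", total)
```

Note that with the default prefix this counts every sd? device on the host, including the internal mirror, so filter it down to the array's paths before comparing against any limit.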

You didn't include any in the brief write up of the steps you did perform. So far, the evidence you present in this 'benchmarking' run is flimsy at best, cobbled together sewage at best. Best of luck spreading FUD.
The performance I got is duplicated by the other write-up out there: all.thingsit.com benchmark md-3000. Actually he got slightly lower performance.

It is interesting that an @dell.com address is attempting to undermine this write-up. Now, you couldn't have any vested interests, could you?

But I welcome comments from Dell, privately or publicly, so fire away.

Can you point to a Dell or Dell authorized benchmark that shows the unit doing better? Do you maintain that the MD-3000 does not have the guts from an IBM DS4100 which has openly advertised total through-put that matches the results above? Stuff like this I'd like to hear from Dell.

fcisler
Premium
join:2004-06-14
Riverhead, NY

Re: Benchmarking? I think not.

Anonymous did bring up one point, which doesn't affect you - but I have seen firsthand.

When we had an oracle cluster setup under linux (RHEL ES, but a CX300), I did a kernel upgrade. Instantly - performance went down the tubes. Couldn't figure out why, so I downgraded the kernel - performance was back to normal.

One weekend when I had some time, I upgraded to the newest kernel and did some testing. While transferring files back and forth to two different LUNs (one R10 and one R5), the CPUs would peg at 100%!

What the hell is this? A call to dell support, after getting transferred between "gold support" at least 5 different times, resulted in a "please do not update until dell tells you to do so". Uhh...ok? A kernel revision BUMP can degrade the performance THAT MUCH? wow!

One of the many reasons we dumped that setup. Neither Dell (who was also our support for RHEL, we could not contact them directly) nor Oracle was of any use WHATSOEVER! Seriously...we had dell set the thing up...but once performance started degrading, neither one was of any help whatsoever. Both companies were not ready to support this setup.

Anyway, back on topic....

We are actually paying for an installation of this whole shebang. I've never set up a SAS array, and for "warranty purposes" - dell must do an installation of the MD3000, two nodes of the cluster, and software. Fine by me....I'll get to learn how Dell does it (hopefully) right.

Interesting you say R5 works better than R10. This was just creating a single LUN on the array...all hardware? No software raid involved? For everything lately, except a couple multi TB arrays strictly used for a "storage repository", I've been going R10.

I haven't read through those manuals yet...but do you know off the top of your head if any of the disks also hold the "OS" for the array itself (ala emc)?

The price we got for this whole cluster is completely absurd (in a good way, I wish I could post what we paid for the array alone), and it will only be doing Exchange 2007...so I'm hoping that performance on the MD3000 isn't going to be an issue with it.

On the other hand, though, we were looking at setting up a new oracle cluster possibly with the MD....but if this is the performance - I'd be better loading up a 2950 with 4 SAS drives, doing R10 across them, and setting up two standalone units. The price difference between the new CX3-* series and MD3000 was a big issue...and I guess this is another instance of "you get what you pay for"....

justin
Australian
join:1999-05-28
Brooklyn, NY


Re: Benchmarking? I think not.

Yeah, I know there are instances of upgrades killing features that optimize for speed. It happens all the more often when there are layers involved; for instance, for a while LVM killed the performance you got out of the underlying devices. I tend not to automatically upgrade critical production stuff that works unless I need a new feature, for that very reason.

Not concerned here because the dots are all joined up. The spec published by IBM (but not Dell) matches the throughput test I've done, which matches the benchmarks done on the all.thingsit blog, give or take 20%. Lower level benchmarks match higher level ones. Stuff like that.

As for RAID5, I've not tested it yet, but that is very clearly what the IBM manual says vs RAID10. Something about the firmware in the controllers being optimized for RAID5 LUNs. It may not really be worth it, because if the whole shebang is capped at around 400mb/sec or so then there isn't a lot of point increasing sequential read throughput from 200mb/sec to perhaps 300mb/sec, with the potential to slightly decrease write speed, unless you really need the extra space RAID5 provides. There is also the issue that if all the drives are the same age they may start failing in batches, and in that case you probably want RAID10 not RAID5, as it can survive more drive failures!

The MD3000 stores the firmware on the pluggable controllers (I think). Anyway, it stores nothing on the drives, that's for sure. The firmware isn't very large, or complex. It does support some basic performance collection so you can check if you have drive hot spots, but the management interface does not give you any tools to review this info - you're on your own sucking that data out and plotting it or whatever. There are also those two "unlockable" features at ridiculous cost to look at (LUN copy and LUN snapshot). I'm not bothering with them because LVM2 gives the same thing for free.
8th year online! © 1999-2008 dslreports.com.