sk1939

Disk Transfer Speeds


Attached benchmarks: RAID 1 with 2 Seagate ST3500418AS drives; QNAP with Hitachi HDS723020BLA642; 120GB Samsung 830 Series
I had someone ask me in person the other day about the performance difference between a local disk and network storage like NFS or iSCSI, so I thought I would share a bit of what I found here.

First off, let me start by saying that while NFS is slightly faster and has less overhead, iSCSI is FAR more friendly to set up in a Windows environment. For a server environment, note the performance overhead if you don't have a NIC that supports TOE (TCP offload) and iSCSI offload (disabled for this test), such as a Broadcom NetXtreme II 5709. Also note the CPU overhead for iSCSI, which is around 4% (this PC idles between 1 and 4%, so adjust as a reference).
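For reference, the Windows-side setup with the built-in Microsoft initiator boils down to something like this from an elevated command prompt (iscsicli.exe; the portal IP and target IQN are placeholders for whatever your NAS reports):

:: Register the NAS as a target portal, see what it advertises, then log in
iscsicli QAddTargetPortal 192.168.1.50
iscsicli ListTargets
iscsicli QLoginTarget iqn.2004-04.com.qnap:ts-119:iscsi.target0
:: The LUN then shows up in Disk Management as a blank local disk to initialize and format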

Obviously your transfer speed depends on your NAS appliance and drive, which in my case is a single-drive QNAP TS-119 PII+ with a Hitachi HDS723020BLA642 (thanks again Koitsu). As a result, speeds are going to be higher with something like an EMC VNXe 3100, a dedicated storage server, or even a higher-end NAS.


koitsu
Attached: Intel 510 SSD read benchmark
Thank you for posting this -- this is more or less what I was hoping to see/get insights to.

I'm a little surprised by the iSCSI results -- I expected something higher (more like 70MBytes/sec or so). But iSCSI may not be the source of the issue -- do you see similar rates to the QNAP using other protocols (CIFS/SMB, NFS, and FTP)? Actually, strike that -- exclude NFS from the test list. It's a pain in the ass and varies horribly depending on both the server and client OSes, as well as massive numbers of tuning parameters, protocol versions (2 vs. 3 vs. 4), and whether or not you're using TCP or UDP. So yeah, stuff NFS for these tests.

The reason I mention FTP, BTW, despite not being a network filesystem protocol: it has significantly less overhead than other protocols. For example on my home LAN (gigE), from a Windows XP Pro system I can send/receive 97-98MBytes/second *repeatedly* to/from my FreeBSD box (filesystem is a ZFS mirror consisting of 2 disks, but I got this performance even with 1 disk, as well as with an empty ZFS ARC (applies to receiving only)), while CIFS/SMB is about 70MBytes/sec (using Samba + use of AIO + very specific tunings in smb.conf). In my environment, a 25MByte/sec hit is pretty major (that's 1/4th the total bandwidth available), so that's why I ask about FTP.

Also just an observation in passing, nothing to worry about or get too concerned over: the SSD results you posted look very erratic for read speeds. They appear to drop then recover every 2GBytes or so. How much space is free on that SSD presently? If very little, then it's wear-levelling. If a lot, then I wonder if there's a firmware update for it that improves things. I'd expect something a little more "linear". Or do you have HD Tune Pro's Benchmark -> Test/speed accuracy slider set all the way up at top (Fast)? I tend to drop it to the notch below that.

Attached is a read benchmark screenshot from my Intel 510 (120GB) SSD for comparison, running on Windows XP Pro (and has always been used on that OS -- read: NO TRIM SUPPORT). You can see the used disk space in the Intel tool.

sk1939
Attached benchmarks: 512k iSCSI; 512k SSD
said by koitsu:

Thank you for posting this -- this is more or less what I was hoping to see/get insights to.

I'm a little surprised by the iSCSI results -- I expected something higher (more like 70MBytes/sec or so). But iSCSI may not be the source of the issue -- do you see similar rates to the QNAP using other protocols (CIFS/SMB, NFS, and FTP)? Actually, strike that -- exclude NFS from the test list. It's a pain in the ass and varies horribly depending on both the server and client OSes, as well as massive numbers of tuning parameters, protocol versions (2 vs. 3 vs. 4), and whether or not you're using TCP or UDP. So yeah, stuff NFS for these tests.

You're welcome; glad someone was able to benefit from it.

It's a little slow, but it's in keeping with the file copy speeds I've seen so far over the network. SAMBA/AFP/CIFS nets me somewhere between 25-40MB/s, depending. However, that is based on Windows Explorer, which is notorious for lying. Interface traffic reports it peaking at around 430Mbps though, which works out to around 54MB/s (430 / 8 ≈ 54). The other thing to remember is that the QNAP uses a Marvell ARMADA 300 ARM CPU running at 2.0GHz to process everything, and that the iSCSI stack is layered on top of the EXT3 filesystem on the disk.

said by koitsu:

The reason I mention FTP, BTW, despite not being a network filesystem protocol: it has significantly less overhead than other protocols. For example on my home LAN (gigE), from a Windows XP Pro system I can send/receive 97-98MBytes/second *repeatedly* to/from my FreeBSD box (filesystem is a ZFS mirror consisting of 2 disks, but I got this performance even with 1 disk, as well as with an empty ZFS ARC (applies to receiving only)), while CIFS/SMB is about 70MBytes/sec (using Samba + use of AIO + very specific tunings in smb.conf). In my environment, a 25MByte/sec hit is pretty major (that's 1/4th the total bandwidth available), so that's why I ask about FTP.

I haven't tried FTP admittedly, but I don't expect performance to be significantly better in this particular case, since it's CPU-limited. I do expect that on a higher-powered NAS/SAN, FTP would be faster due to less overhead; however, many enterprise-grade SAN/NAS devices don't support FTP if I recall correctly, and at that point you're into other, lower-overhead technologies like Fibre Channel and FCoE. There are some benefits to iSCSI over FTP in the sense that Windows maps it as a local disk with all the benefits involved, as well as the ability to boot from iSCSI as a local disk (again, using certain network controllers).

said by koitsu:

Also just an observation in passing, nothing to worry about or get too concerned over: the SSD results you posted look very erratic for read speeds. They appear to drop then recover every 2GBytes or so. How much space is free on that SSD presently? If very little, then it's wear-levelling. If a lot, then I wonder if there's a firmware update for it that improves things. I'd expect something a little more "linear". Or do you have HD Tune Pro's Benchmark -> Test/speed accuracy slider set all the way up at top (Fast)? I tend to drop it to the notch below that.

Attached is a read benchmark screenshot from my Intel 510 (120GB) SSD for comparison, running on Windows XP Pro (and has always been used on that OS -- read: NO TRIM SUPPORT). You can see the used disk space in the Intel tool.

That is due to the use of a 64KB block size instead of the standard 512KB when running the benchmark. The Benchmark setting is set all the way down to "Accuracy". Attached is a benchmark using standard 512KB blocks.


DarkLogix

When I get a 4th drive for my QNAP I'll post some similar pics, as mine is running on an Atom, not an ARM (TS-469 Pro).

Though I also plan to get a dual or quad Intel NIC for my computer and set up teaming (my QNAP also has dual NICs on it).

Slightly OT, but any idea how to get Windows to expand a 2TB volume on a 4TB disk?

On my QNAP I started with 2x 2TB drives in RAID 1, and after I finished copying everything from my computer's 2TB drive to the iSCSI disk, I moved its 2TB drive over to the QNAP and converted it to RAID 5 (took about 8 hours).

Then on the QNAP I expanded the iSCSI LUN to 4TB, and Windows sees a 4TB disk.

But I can't get it to expand (I really don't want to have to get a drive to copy everything to and reformat the iSCSI drive).
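For reference, if Disk Management's "Extend Volume" is greyed out on a GPT disk, diskpart will sometimes do it from the command line; a minimal sketch (the volume number is a placeholder):

diskpart
DISKPART> list volume
DISKPART> select volume 3
DISKPART> extend

One gotcha worth knowing: NTFS caps a volume at roughly 2^32 clusters, so a volume formatted with 512-byte clusters can never grow past 2TB no matter what the underlying LUN reports, short of reformatting with a larger allocation unit size.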


Extide

reply to sk1939

Here is a run of my iSCSI disk at home. This disk is backed by a Linux box running ZFSonLinux, with a 7-drive array in RaidZ2 (6 drives + 1 hot spare, with two parity drives). RaidZ2 is sort of similar to RAID 6. They are all Hitachi 5K3000 2TB drives. I am using a ZVOL as the iSCSI target, and my Windows 2008 R2 server box acts as the initiator, so there is an NTFS volume on top of the ZFS ZVOL. I get just under 7TB of usable space in Windows. I am losing space to filesystems twice since I am running NTFS on top of ZFS. The Windows box has 18GB of RAM and the Linux box has 24GB of RAM. The iSCSI runs over a dedicated gigabit Ethernet link between the two boxes. Both machines also have another gigabit Ethernet connection to my LAN. I use this space for mass storage, so speed is not super important to me. All of the data on here is also backed up in other places.

NOTE1: CPU is 100% as the machine runs folding@home

NOTE2: Seek times in the first pic are very low because of caching.
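For reference, the ZVOL-plus-iSCSI arrangement described above looks roughly like this with ZFS on Linux plus the tgt (scsi-target-utils) daemon; whichever target daemon is actually in use, the shape is the same, and the pool, volume, and IQN names here are made up:

# Sparse 7TB zvol as the backing store
zfs create -s -V 7T tank/iscsi0

# New target with that zvol as LUN 1, open to all initiators
tgtadm --lld iscsi --op new --mode target --tid 1 \
       --targetname iqn.2012-10.lan.home:tank.iscsi0
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
       --backing-store /dev/zvol/tank/iscsi0
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address ALL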

sk1939

What are the stats on the Linux box? Disk transfer speed seems somewhat low given the amount of hardware involved. Maybe due to the software RAID, but still...



koitsu

Guesses for the performance hit:

* Use of 6 disks in a single ZFS vdev that makes up the entire raidz2 pool. I forget the rule of thumb here but I believe multiple vdevs of 2-disk mirrors (thus "multiple mirrors which are striped") would perform better; raidzX doesn't perform that great, and added parity (raidz2 and raidz3) results in an even bigger performance hit

* With ZFS (and I imagine any RAID or RAID-like implementation, for that matter!) the speed of the I/O transaction is often limited to the speed of the slowest disk. I.e. if you have a single disk that is performing like crap (much worse than the others), it's going to influence I/O on the entire pool. This has been proven time and time again on the FreeBSD lists, where a person has one disk that's performing like total crap (excessive ECC being done by the drive, etc.) and their I/O rates are abysmal. "zpool iostat -v 1" can help track this down, or "gstat -I500ms" on FreeBSD (I'd love to know what Linux has that's like gstat).

* Use of ZFSonLinux -- is this the fuse implementation or the kernel implementation? If the former then that explains it; if the latter then possibly all the internal/kernel-level I/O isn't optimised for speed? I'm grasping at straws on this one (if kernel-level)

* The extremely large number of abstraction layers between client I/O layer and physical disk layer. Take a look: Windows client I/O -> NTFS filesystem -> iSCSI client -> Ethernet -> iSCSI server -> ZFS on Linux -> physical disk. As an *IX admin who heavily applies the KISS principle in every way/shape/form -- yuck!

We can safely rule out improper alignment due to 4096-byte sectors because the 5K3000 2TB model drives use 512-byte sectors.

P.S. -- Extide See Profile, seek times are extremely low not because of "caching" but because of use of iSCSI. Seek times can't be measured reliably this way given the use of Ethernet (see sk1939 See Profile's iSCSI results too -- same thing); you have to look at seek times on the actual machine that has the physical disks. So yes, you can safely ignore the seek times in the benchmark, but it's not due to "caching".
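(Side note on the gstat question: on Linux the closest common equivalent appears to be iostat -x from the sysstat package -- per-device utilisation, queue depth, and throughput -- watched alongside the pool-level view:)

# Per-device extended stats at a 1-second interval (needs the sysstat package)
iostat -x 1

# Pool-level view from the ZFS side for comparison
zpool iostat -v 1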


sk1939

Makes sense.

To expand on the I/O in my post a bit further:

- Further testing shows that writes FROM the NAS are faster than writes to it, peaking around 680Mbps (85MB/s) with a single large file (not a directory).

- FTP on the NAS nets me around 5-10MB/s more than iSCSI; however, file navigation isn't as nice.

- NFS is a PITA, and with the basic (default) configuration (which you can't change without root access) performance is about the same, depending on file type. It plays well with Linux though.

- Dual-port NICs are useless if you don't have a switch that can do port channels/teaming, and have it enabled (oops). Doesn't affect performance though.
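For that last point, the switch-side half is just a port-channel on the two NAS-facing ports, with the NAS set to 802.3ad/LACP on its end; a sketch for a hypothetical Cisco IOS switch (interface numbers and VLAN are placeholders):

! Bundle the two NAS-facing ports into an LACP port-channel
interface range GigabitEthernet0/1 - 2
 channel-group 1 mode active
!
interface Port-channel1
 switchport mode access
 switchport access vlan 10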



DarkLogix
reply to koitsu


Attached: benchmark with write cache on the QNAP off; benchmark with write cache on the QNAP on.

Any ideas to improve the starting speeds?

This is on a QNAP TS-469 Pro with 3x 2TB drives, all with 64MB cache on the drives.


koitsu

You're using HD Tune 2.55. Why? Those benchmarks are read benchmarks -- HD Tune 2.55 (free) doesn't do write benchmarks! HD Tune's home page even states that quite boldly:

quote:
12 February 2008: HD Tune Pro released!

HD Tune Pro is an extended version of HD Tune which includes many new features such as: write benchmark, secure erasing, AAM setting, folder usage view, disk monitor, command line parameters and file benchmark.

So write caching isn't going to have any effect on a read benchmark.

You will need to talk to QNAP about why the NAS does not get good read speeds until the ~40% point (which is 40% of 2200GBytes). Only they can explain this behaviour.

Talking about write caching:

The term "write caching" is also too vague in this context, I'm sorry to say. Individual disk drives have write caching (and it can be toggled), so possibly that setting you've adjusted does that. I don't know. But NAS units which have actual RAM used for I/O caching (think: hardware RAID controllers with RAM) also have a form of write caching, where they actually store data written to the array in RAM and flush it to the physical disk when the NAS firmware deems it convenient/best. That form of write caching is 100% independent of disk write caching. So possibly the setting adjusts that -- but again, I don't know.
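For what it's worth, the drive-level form is the one you can poke at directly if you can get a shell on the NAS (QNAPs appear to be Linux-based); a sketch with hdparm, device name being a placeholder -- whether the QNAP toggle maps to this or to RAM-level caching is exactly the question for QNAP:

# Show whether the drive's own write cache is currently enabled
hdparm -W /dev/sda

# Turn it off / on at the drive level
hdparm -W0 /dev/sda
hdparm -W1 /dev/sda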

Like I've told you in PM, to work out things like this, you really need to rely 100% on the vendor (QNAP). They are the only ones who know how their product functions behind the scenes.


DarkLogix
I used the free one because I had installed the trial before and it timed out, so I'd have to pay for it.

The caching is related to the EXT4 filesystem; I'll get a screenshot of the page with the setting.


DarkLogix

Well, I just tried HD Tune Pro (to see if the timeout thing had stopped) and it says I need to remove all partitions to run the write test.



koitsu

said by DarkLogix:

Well, I just tried HD Tune Pro (to see if the timeout thing had stopped) and it says I need to remove all partitions to run the write test.

Yes, and it's correct -- it's a device benchmark test, not a file benchmarking test. What do you think a write benchmark actually does when benchmarking a device? It writes a bunch of data to the device directly. And what do you think that's going to do to a filesystem that's on the device?

A file-based benchmark is not going to provide the same information as a device-based benchmark. You have an entire filesystem abstraction layer in the way of the former, which has its own layer of caching as well as many other caveats. There are other utilities which use file-based tests if that's what interests you, but I do not care about those -- with a file-based benchmark, you can't do LBA benchmarking because there's no way to guarantee what LBA you get; you simply say "open a file, write X number of bytes to it" and how the filesystem layer chooses to organise those blocks of bytes is purely up to it. They may be linearly stored on the disk, and then again they may not be. Welcome to filesystem fragmentation!

TL;DR -- yes, that's correct; and to do device-level write benchmarks, you need to remove all filesystems from the device.
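As an aside, a crude device-level read test (no filesystem in the path, and read-only, so it's safe on a disk with data) can be done with dd against the raw device; device names are placeholders:

# FreeBSD: read the first 4GB of the raw device sequentially
dd if=/dev/ada1 of=/dev/null bs=1m count=4096

# Linux: same idea (GNU dd wants a capital M)
dd if=/dev/sda of=/dev/null bs=1M count=4096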


koitsu
reply to DarkLogix

said by DarkLogix:

The caching is related to the EXT4 filesystem; I'll get a screenshot of the page with the setting.

Shame on QNAP for putting this at the "Hardware" level of the configuration interface -- this is not a hardware feature, and has nothing to do with either disk write caching or "hardware RAID" write caching. Vendors today, sigh...

What's being described there is a filesystem feature of ext4 called "delayed allocation". Rather than explain it, I'll just link folks to the details and they can read it themselves:

»ext4.wiki.kernel.org/index.php/E···location

And folks using ext4 natively on QNAP products should absolutely read this:

»en.wikipedia.org/wiki/Ext4#Delay···ata_loss

According to all I can find online, QNAP products (some of them anyway) appear to be Linux-based. Thus I would ask QNAP what exact Linux kernel version they're using. If 2.6.30 or later, then great. If 2.6.29 or 2.6.28, then I would ask them if they've backported the aforementioned ext4 delayed allocation patch. If not, then folks should turn that feature off else risk data loss when power is lost (unless the device is on a UPS and you trust the UPS fully, of course).

Use of that feature only applies if the NAS itself is using ext4 as a filesystem. If iSCSI can export a "volume" which actually gets written to the NAS filesystem as a single file, and that filesystem is ext4, then you'd be susceptible to this problem. If iSCSI only exports "volumes" that correlate directly (1:1) with a RAID or RAID-like series of devices (disks), and lets the iSCSI client choose to format the volume as whatever filesystem it wants (e.g. NTFS, ext3, ext4, etc.) then delayed allocation doesn't apply (at the NAS level -- instead, if your iSCSI client system used ext4, you may want to disable the feature there, see above paragraph of course).
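If you do want delayed allocation off on an ext4 filesystem you control (e.g. an iSCSI client that formatted its LUN as ext4), it's just a mount option; a sketch with placeholder device and mount point:

# Remount the filesystem without delayed allocation
mount -o remount,nodelalloc /dev/sdb1 /mnt/data

# Or persist it in /etc/fstab
/dev/sdb1  /mnt/data  ext4  defaults,nodelalloc  0  2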

Hope that explains things a bit more. This is just further proof, in my opinion, of how/why storage devices and solutions in general are not as simple as companies (and people) want to make them out to be. They are never, ever simple.


DarkLogix

Well, my understanding of what they do on the QNAP is that it does software RAID on the drives and formats them directly with EXT4, then creates a LUN on that filesystem for iSCSI to target.

So then on Windows I have the iSCSI disk formatted with NTFS (as a GPT drive). I'm still not sure why Windows won't let me extend the volume, as I've already done the extension on the QNAP and Windows sees the full 4TB; it just can't extend the 2TB partition to take the whole drive.



DarkLogix
reply to koitsu

OK, with a kernel of 2.6.33.2 it should be safe to turn on delayed allocation, right?



koitsu

First paragraph and 2nd-to-last: »en.wikipedia.org/wiki/Ext4#Delay···ata_loss




Extide

reply to koitsu

said by koitsu:

Guesses for the performance hit:

* Use of 6 disks in a single ZFS vdev that makes up the entire raidz2 pool. I forget the rule of thumb here but I believe multiple vdevs of 2-disk mirrors (thus "multiple mirrors which are striped") would perform better; raidzX doesn't perform that great, and added parity (raidz2 and raidz3) results in an even bigger performance hit

* With ZFS (and I imagine any RAID or RAID-like implementation, for that matter!) the speed of the I/O transaction is often limited to the speed of the slowest disk. I.e. if you have a single disk that is performing like crap (much worse than the others), it's going to influence I/O on the entire pool. This has been proven time and time again on the FreeBSD lists, where a person has one disk that's performing like total crap (excessive ECC being done by the drive, etc.) and their I/O rates are abysmal. "zpool iostat -v 1" can help track this down, or "gstat -I500ms" on FreeBSD (I'd love to know what Linux has that's like gstat).

* Use of ZFSonLinux -- is this the fuse implementation or the kernel implementation? If the former then that explains it; if the latter then possibly all the internal/kernel-level I/O isn't optimised for speed? I'm grasping at straws on this one (if kernel-level)

* The extremely large number of abstraction layers between client I/O layer and physical disk layer. Take a look: Windows client I/O -> NTFS filesystem -> iSCSI client -> Ethernet -> iSCSI server -> ZFS on Linux -> physical disk. As an *IX admin who heavily applies the KISS principle in every way/shape/form -- yuck!

We can safely rule out improper alignment due to 4096-byte sectors because the 5K3000 2TB model drives use 512-byte sectors.

P.S. -- Extide See Profile, seek times are extremely low not because of "caching" but because of use of iSCSI. Seek times can't be measured reliably this way given the use of Ethernet (see sk1939 See Profile's iSCSI results too -- same thing); you have to look at seek times on the actual machine that has the physical disks. So yes, you can safely ignore the seek times in the benchmark, but it's not due to "caching".

-6 disks in RaidZ2 works well (you want 2^n data disks, so 1, 2, 4, ... and I have 4)

-The disks are all identical, and all the same age

-This is the kernel modules implementation

-The reason I bothered with the iSCSI business at all instead of just using samba is because samba just sucks and is too slow. Using iSCSI and a windows box I get significantly better performance over SMB than samba alone.

-You can't eliminate the 4k alignment issue, except for the fact that I am fairly certain (at least) that I aligned them correctly. These drives internally DO use 4k sectors, only exposing 512b as emulation.

-Also, I say the seek times are very low because I ran the benchmark twice; the first time I got seek times like what you would expect for 5400rpm drives, and the second run the seek times were flat. That's caching.

In any case, yeah the performance is a bit disappointing. I haven't really bothered to look into it too much though as it is fine for my purposes. I wouldn't be surprised if I have the disks aligned wrong, as that could easily cause performance like I am seeing.

EDIT: The average read speed during a ZPOOL scrub is ~95MB/sec, still pretty weak. I wonder if folding on the box slows down ZFS much. That could be part of it too.

EDIT2: Based on the output of zpool iostat, the bottleneck is in the network. I see a big write of about 140MB to the array (evenly distributed across all 6 drives), then about 4 seconds of nothing, then another big write, then 4 seconds of nothing. I could try messing with jumbo frames, and also putting better NICs in the boxes for the machine-to-machine link. It might actually be a lot easier to make this fast than I thought...
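For reference, both ends of the dedicated link have to agree on the MTU for jumbo frames to help; roughly this on the Linux side (interface name and address are placeholders), plus the "Jumbo Packet" property on the Windows adapter's Advanced tab:

# Raise the MTU on the dedicated storage interface
ip link set dev eth1 mtu 9000

# Verify end to end with a non-fragmenting ping (9000 minus 28 bytes of IP/ICMP headers)
ping -M do -s 8972 192.168.2.2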


koitsu

1. I didn't say 6 disks in raidz2 didn't work. I said performance-wise you're going to get better performance with a set of striped vdevs, where each vdev is a mirror consisting of 2 disks. This is akin to RAID 1+0, and as I'm sure you know, RAID 1+0 speed-wise (reads and writes) beats the living pants off RAID-6. This same design applies to ZFS as well.

raidz1 is faster than raidz2, and raidz2 is faster than raidz3. Thus, WRT point #4 below and the scrub "benchmark" you give, I am not surprised in the least, but #5 may also play a role.

2. First you say you went to iSCSI because Samba is too slow, and that using iSCSI and a Windows machine gets you good performance over SMB. I'm confused. You just said that you went with iSCSI rather than CIFS/SMB...?

3. The Hitachi 5K3000 2TB models use physical 512 byte sectors. I can assure you because 1) I sold one of these drives to sk1939 See Profile, 2) Hitachi's own documentation says so, and 3) smartctl also confirms HGST's documentation even for their 3TB models. It's fine to 4096-byte align a drive of this sort, of course! I'm just saying that alignment isn't a factor in this case. The 3TB disks use 4096-byte sectors; the highest capacity disk you can provide with 512-byte sectors is 2TB. You can't go any larger without 4096-byte, due to LBA addressing limitations. Edit: Nope, I'm completely wrong on this part. In fact I'm not even sure what I was thinking when I said that, to be honest. Maximum capacity with LBA48 = 144PB:

 2^48 = 281474976710656 LBAs
281474976710656 *  512 byte sector =   144,115,188,075,855,872 byte capacity
281474976710656 * 4096 byte sector = 1,152,921,504,606,846,976 byte capacity
 

4. A ZFS scrub is a read-only operation unless anomalies are found. So from this we can tell that reading from the pool is around 95MBytes/second (barring anything #5 causes), which is almost 2x higher than what iSCSI is putting out.

But as I said above -- and I am happy to show you benchmarks if you want to see them -- at least with ZFS+Samba on my setup (testing read speeds from the pool only, pool = a single mirror of 2 disks), I can get ~70MBytes/second via gigE, but with ZFS+FTP I can get 97-98MBytes/second. This is why I say that 95MBytes/second using a raidz2 pool sounds about right.

5. As you know ZFS is CPU-bound, so yes, I imagine your Folding client does result in a hit on performance. How much should be easy to discern: shut off the Folding client, re-run the tests, voilà. :-)

6. You cannot use "zpool iostat" SOLELY to discern where the bottleneck is. ZFS reads/writes to the disks in "bursts" (specifically it's the TXG timeout -- I'm talking about the tunable vfs.zfs.txg.timeout. DO NOT LET THE WORD "TIMEOUT" MAKE YOU THINK IT'S SOME KIND OF FAILURE TIMEOUT), so the behaviour will always look like that. I'm familiar with this tunable because on FreeBSD it used to default to 30, which was way too long; it's now 5.
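For reference, the tunable lives in slightly different places per platform; a sketch (the value shown is just the current FreeBSD default mentioned above):

# FreeBSD: read the current value; set it at boot via /boot/loader.conf
sysctl vfs.zfs.txg.timeout
echo 'vfs.zfs.txg.timeout="5"' >> /boot/loader.conf

# ZFS on Linux: the equivalent module parameter
cat /sys/module/zfs/parameters/zfs_txg_timeout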

What you need is something like gstat -I500ms on FreeBSD or iostat -x 1 on Solaris. What you need to see is device I/O speed being performed on a per-device basis, and I'm talking about at the I/O layer (e.g. libata), not at the filesystem layer. I don't know enough about Linux to know what it has that can do this.

What I'm trying to say here: I do not believe the problem is necessarily a network bottleneck. What you're describing, especially the "every 4 seconds" thing, is absolutely how ZFS behaves (on Solaris and on FreeBSD, and it sounds like on Linux as well). That's just how it works.



koitsu
reply to sk1939

I'll toss in some of my own benchmarks just for the hell of it.

Client:
- Windows XP Pro SP3 32-bit, 16GB RAM (only 4GB usable, obviously)
- Intel i7-2600K
- Gigabyte GA-P67A-UD3-B3
- AHCI enabled, NCQ enabled
- Intel RST driver set 11.2.0.1006
- Realtek 8111E NIC @ gigE
- Realtek driver set 5.800.0719.2012
- Drive D: = WD10EFRX (WD Red, 1TB); attached to SATA600 port
- Drive Y: = CIFS/SMB mount to server \\icarus\CD_Images

Client network stack tunings:

Windows Registry Editor Version 5.00
 
; Decimal equivalents
; ------------------------------------------------------------------
; DefaultTTL = 64
; EnablePMTUBHDetect = 1    (OS default, entry is for clarity)
; EnablePMTUDiscovery = 1   (OS default, entry is for clarity)
; SackOpts = 1
; Tcp1323Opts = 1           (window scaling enabled, TS disabled)
; TcpMaxDupAcks = 2         (OS default, entry is for clarity)
; TcpWindowSize = 496400    ((85*1460) * (2^2)) -- 85 is adjustable
;
; NOTE: For TcpWindowSize, the Tweak Tester on DSLR recommends a
; value between 470120 and 1251220 for our connection.
; ------------------------------------------------------------------
;
; RFC1323 timestamps are disabled "as a precaution" WRT this issue:
; http://www.dslreports.com/forum/r26290497-DSLR-timeouts-from-Linux-land
;
; ------------------------------------------------------------------
 
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"DefaultTTL"=dword:00000040
"EnablePMTUBHDetect"=dword:00000001
"EnablePMTUDiscovery"=dword:00000001
"SackOpts"=dword:00000001
"Tcp1323Opts"=dword:00000001
"TcpMaxDupAcks"=dword:00000002
"TcpWindowSize"=dword:00079310
 

Server:
- FreeBSD 9.1-PRERELEASE, 64-bit, 8GB RAM, kernel build date: Fri Oct 12 04:37:26 PDT 2012
- Intel Q9550
- Supermicro X7SBA
- AHCI enabled, NCQ enabled
- Intel 82573E NIC @ gigE
- FreeBSD em0 (Intel NIC) driver version 7.3.2
- ZFS v28 used
- Samba 3.6.7 with AIO enabled
- FTP enabled
- Disk ada1: WD10EFRX (WD Red, 1TB); attached to SATA300 port
- Disk ada2: WD10EFRX (WD Red, 1TB); attached to SATA300 port
- Disks ada1 and ada2 are a ZFS mirror (effectively RAID-1)
- /storage/CD_Images exported via Samba

Server ZFS pool details:

root@icarus:/root # zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:
 
        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
 
errors: No known data errors
 
root@icarus:/root # df -k
Filesystem   1024-blocks      Used      Avail Capacity  Mounted on
...
data/storage   943827908 426242276  517585632    45%    /storage
 

Server tunings that play a role in throughput (network or disk or other features):

/boot/loader.conf --
 
# We use Samba built with AIO support for increased network I/O speed
#
aio_load="yes"
 
/etc/sysctl.conf --
 
# Increase send/receive buffer maximums from 256KB to 16MB.
# FreeBSD 7.x and later will auto-tune the size, but only up to the max.
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
 
# Double send/receive TCP datagram memory allocation.  This defines the
# amount of memory taken up by default *per socket*.
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536
 

Server Samba tuning parameters:

#
# The below options increase throughput substantially.  Be aware
# that AIO support requires the aio.ko kernel module loaded,
# and Samba to be built with AIO enabled.  Important notes:
#
# http://www.samba.org/samba/docs/man/manpages-3/smb.conf.5.html
#
# 1) If any of the path directories reside on ZFS, disabling sendfile
# support is a wise choice.  There are known problems with sendfile
# and mmap on ZFS, including resulting in 2x the amount of memory
# used on the machine (VM cache + ZFS cache).  For further details,
# see freebsd-fs or freebsd-stable thread, subject "8.1-STABLE:
# zfs and sendfile: problem still exists".
#
# 2) (2011/10/03) "socket options" SO_SNDBUF and SO_RCVBUF do not
# appear to matter on FreeBSD, or our /etc/sysctl.conf adjustments
# somehow take care of this (or maybe AIO?).  The performance is the
# same with or without these two socket options on 8.2-STABLE and
# newer.  TCP_NODELAY is also default since Samba 2.x.
#
# 3) (2011/10/03) My previously-mentioned "aio write behind" option
# is incorrect; see the official smb.conf(5) man page for the syntax.
# It's not a yes/no toggleable, thus serves no purpose.
#
# 4) "aio {read,write} size" variables define the minimum number of
# bytes before using AIO.  We set these to 1024, although many websites
# recommend a value of 1 (which seems a bit aggressive).
#
# 5) "read raw" is enabled by default, but has no real bearing on
# speed when disabled.  Thus we leave it enabled.
#
use sendfile = no
aio read size = 1024
aio write size = 1024
 

Relevant Samba server config for CD_Images export:

[CD_Images]
        path = /storage/CD_Images
        writable = yes
        guest ok = yes
        guest only = yes
        valid users = @storage
        force user = storage
        force group = storage
 

Files involved in testing:

-rwxr--r--  1 storage  storage  668362752 Sep 22 03:11 /storage/CD_Images/FreeBSD/9.1-RC/FreeBSD-9.1-RC1-amd64-disc1.iso
-rwxr--r--  1 storage  storage  714375168 Sep 22 03:12 /storage/CD_Images/FreeBSD/9.1-RC/FreeBSD-9.1-RC1-amd64-memstick.img
 

Server rebooted to clear ZFS ARC cache. This ensures that the data being read off disk by ZFS has to be done right then and there, e.g. file is not being served from RAM.

Verification using top (note ARC size is 15MBytes):

ARC: 15M Total, 4824K MRU, 9858K MFU, 16K Anon, 146K Header, 472K Other
 

Test #1: FTP: client fetches FreeBSD-9.1-RC1-amd64-memstick.img from server

D:\>ftp icarus.home.lan
Connected to icarus.home.lan.
220 icarus.home.lan FTP server (Version 6.00LS) ready.
User (icarus.home.lan:(none)): jdc
331 Password required for jdc.
Password:
230 User jdc logged in.
ftp> bi
200 Type set to I.
ftp> cd /storage/CD_Images/FreeBSD/9.1-RC
250 CWD command successful.
ftp> get FreeBSD-9.1-RC1-amd64-memstick.img
200 PORT command successful.
150 Opening BINARY mode data connection for 'FreeBSD-9.1-RC1-amd64-memstick.img' (714375168 bytes).
226 Transfer complete.
ftp: 714375168 bytes received in 7.52Seconds 95047.25Kbytes/sec.
ftp> quit
221 Goodbye.
 

Results: 95MBytes/second

Checking ZFS ARC size after test #1, since that file's contents should now be in the ARC:

ARC: 700M Total, 683M MRU, 13M MFU, 16K Anon, 1869K Header, 1685K Other
 

Test #2: Samba: client fetches FreeBSD-9.1-RC1-amd64-disc1.iso from server

Note: the reason I'm using FreeBSD-9.1-RC1-amd64-disc1.iso (different file from test #1) is because if I used the file from test #1, data would be served from the ZFS ARC (i.e. straight out of RAM) and I wanted to make sure that I was transferring data that wasn't already in the ARC.

Note: Since the Windows copy command doesn't provide throughput/speed indicators, and I didn't want to deal with GUI crap, I resorted to using netstat -i -b -n 1 on the FreeBSD box, which indicates network throughput, then issued a copy command in Windows. It's the best I can do; sorry.

D:\>copy Y:\FreeBSD\9.1-RC\FreeBSD-9.1-RC1-amd64-disc1.iso
        1 file(s) copied.
 

root@icarus:/root # netstat -inb 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
         1     0     0         60          1     0         90     0
         1     0     0         60          1     0         90     0
         1     0     0         60          1     0         90     0
         2     0     0        150          2     0        180     0
      1177     0     0      87688        442     0   13743784     0
      3995     0     0     293736        950     0   47911412     0
      4607     0     0     338778       1096     0   55280760     0
      4471     0     0     328794       1063     0   53659096     0
      4531     0     0     333306       1078     0   54500648     0
      4458     0     0     327900       1061     0   53515502     0
      4449     0     0     327132       1057     0   53392154     0
      4269     0     0     313938       1016     0   51236930     0
      4403     0     0     323859       1048     0   52899581     0
      4516     0     0     332631       1085     0   54193717     0
      4487     0     0     330207       1076     0   53762549     0
      4493     0     0     330935       1080     0   53639864     0
      4635     0     0     341031       1107     0   55179296     0
      1435     0     0     106198        356     0   17016818     0
        17     0     0       1784         17     0       1370     0
        15     0     0       1644         16     0       1276     0
        16     0     0       1714         15     0       1182     0
^C
 

Results: about 55MBytes/second.

Checking ZFS ARC size after test #2, since that file's contents should now also be in the ARC:

ARC: 1341M Total, 1320M MRU, 15M MFU, 16K Anon, 3482K Header, 2826K Other
 

Yup, it is.

Test #3: same as test #2, except since the file contents are now in the ARC, there's basically no physical disk I/O needed for reading that file (it's completely in RAM), but obviously read(2) calls still have to be used to read from the filesystem:

D:\>del FreeBSD-9.1-RC1-amd64-disc1.iso
 
D:\>copy Y:\FreeBSD\9.1-RC\FreeBSD-9.1-RC1-amd64-disc1.iso
        1 file(s) copied.
 

root@icarus:/root # netstat -inb 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
         1     0     0         60          1     0        154     0
         1     0     0         60          1     0        122     0
       667     0     0      49906        294     0    7366712     0
      3325     0     0     244587        791     0   39938494     0
      3705     0     0     272403        881     0   44405301     0
      3406     0     0     250416        808     0   40702159     0
      3770     0     0     277386        899     0   45386522     0
      4699     0     0     345495       1116     0   56323601     0
      3336     0     0     245304        793     0   39938611     0
      4662     0     0     342819       1109     0   55966313     0
      4652     0     0     342162       1106     0   55859700     0
      4639     0     0     341154       1106     0   55609528     0
      4671     0     0     344180       1121     0   56102596     0
      4484     0     0     330385       1080     0   53701382     0
      4666     0     0     343756       1122     0   55979771     0
      4664     0     0     343636       1121     0   55918136     0
       582     0     0      43447        155     0    6732742     0
        16     0     0       1714         17     0       1370     0
        16     0     0       1714         15     0       1198     0
 

Bottom line: you can see from all the above evidence that physical disk I/O speed is not the bottleneck here (else FTP would have shown the same speed as CIFS/SMB), nor is the physical network.

That leaves two possibilities: 1) Samba's file reading I/O model might not be very optimised (e.g. read(2) using 16KByte buffers rather than maybe 64KBytes), or 2) CIFS/SMB protocol or protocol implementation model. Hard to say which without looking at source code or using ktrace/truss (strace on Linux).
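A concrete way to check possibility #1 without reading source is to attach a syscall tracer to the smbd process serving the transfer and look at the read sizes directly (the PID is a placeholder):

# Linux: watch read()/pread() sizes of the serving smbd
strace -f -e trace=read,pread64 -p 12345

# FreeBSD: rough equivalent with truss
truss -f -p 12345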

I do not use iSCSI (nor will I -- no need), so I can't do those tests as a comparison.

Welcome to all the nuances of doing "benchmarks" when combined with a network interface and filesystem-related protocols, and the fact that most of this crap (including Samba, as well as those NAS units -- I'm left with the impression the QNAP device actually uses Samba) is still "black box".

Extide


I still have part of my ZFS volume left over that is still shared out over Samba. I will do some messing around with that for comparison.



koitsu

Cool! Then we can compare results and see what's goin' on. Well, as best we can anyway.

In the meantime I should probably install some Linux distro on a VM and see if I can figure out what their equivalent to gstat or iostat is, for disk I/O monitoring.



DarkLogix

Here's a benchmark after messing around and moving my data to a properly formatted 3.5TB LUN.

(Seems Win7 used an allocation unit size that didn't permit expansion past 2TB.) So I had to make a new LUN and move nearly 1TB of data over (yay for thin-provisioned LUNs; BTW, is there any performance loss for thin over thick LUNs?).

BTW I think my Realtek NIC might be holding it back (so I think I'll get a dual or quad Intel NIC at some point and do NIC teaming).

sk1939

I've always hated Realtek NICs; their drivers sucked back in the day (probably still do) and their performance was only average at best.



koitsu

Their drivers are... "meh". It depends greatly on what NIC and PHY you get with them; some are okay (meaning no longer suffering from Rx/Tx Checksum Offloading bugs, for example), while others are just trash. My current workstation NIC does have some ARP-related bugs in the driver, pertaining to when the driver loads and begins to handle the first few Ethernet frames coming across the wire (for DHCP specifically). Took me a lot of time to analyse this.

But before crapping all over them, make sure you see the client hardware details of my post here -- because my tests are using a Realtek NIC (on the client side): »Re: Disk Transfer Speeds

My next motherboard will use an Atheros NIC, though I have no idea what sorts of "fun" I'll get to deal with there. The only NICs I trust these days are Intel, and you rarely find them on Gigabyte, Asus, MSI, or other consumer-brand boards (I have no interest in Intel boards -- and not to mention, Intel has begun shipping most of their boards with Realtek NICs! How's that for irony?). Sad panda.



DarkLogix

Well, as my iSCSI array has 2 NICs and supports NIC teaming (as does my switch), I plan to get a dual or quad Intel NIC (also for the Intel iSCSI boot ROM -- anyone have any experience with it?) so that I can do NIC teaming on both the client and array side.

I plan to set up teaming on the array soon; I just need to config a couple of ports and config the NICs on the array.

My system uses a Gigabyte board with a Realtek NIC, so I'm inclined to go quad.

If I go dual then it'll be dedicated to iSCSI, but if I go quad then I'll disable the on-board NIC.


sk1939

It works if you have it set up to boot properly; it's not like booting from PXE.

Quad NICs are expensive ($120 on eBay), compared to dual-port NICs which you can pick up for $30 or so.



DarkLogix

Well, I don't really want to boot from iSCSI, just connect to the array before Windows is loaded.

Currently I have to wait at the login screen for a little bit before logging in, as my desktop background is on the array (sure, I could move the file to the C drive, but I'd rather just have a way to connect to the array before Windows loads).

Even at, say, 2Gbit/sec it wouldn't be as fast as my SSD, but I want it to be connected pre-Windows.
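One thing that may get most of the way there with the plain Microsoft initiator (no boot ROM needed): when connecting, check "Add this connection to the list of Favorite Targets", then bind the volume on the initiator's "Volumes and Devices" tab (Auto Configure) so the iSCSI service brings it back at boot and holds dependent services until it arrives. The command-line half is roughly the following -- from memory, so double-check against iscsicli's built-in help:

:: Bind all currently mounted iSCSI volumes so boot-time services wait for them, then verify
iscsicli BindPersistentVolumes
iscsicli ReportPersistentDevices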


sk1939

Doesn't really work like that, as the Microsoft iSCSI initiator is completely separate. The purpose of iSCSI on the NIC is to allow diskless servers.



DarkLogix

Well, I was hoping it'd act like an iSCSI HBA and be able to fully handle the iSCSI so I could use it instead of the Microsoft iSCSI initiator.