Friday, January 20th, 2012 | Author:

Here’s a quick weigh-in on the new experimental device-mapper thin provisioning (and improved snapshots) that arrived in Linux kernel 3.2. I recently compiled a kernel and tested it, and it looks rather promising. I would expect this to become stable more quickly than, say, btrfs (which obviously has different design goals but could also be used as a means of snapshotting files/VM disks). With any luck, LVM will gain support for it relatively soon.

These are quick and dirty. The sequential write test was ‘dd if=/dev/zero of=testfile bs=1M count=16000 conv=fdatasync’, with results in MB/s. The random I/O test was done with fio:

[global]
ioengine=libaio
iodepth=4
invalidate=1
direct=1
thread
ramp_time=20
time_based
runtime=300

[8RandomReadWriters]
rw=randrw
numjobs=8
blocksize=4k
size=512M
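
As a rough illustration of what the dd command above measures, here is a minimal Python sketch of a sequential write that includes the final flush in the timing, which is what conv=fdatasync does. The function name and parameters are made up for this example and are not part of dd or fio; MB here means 10^6 bytes, as in dd's summary line.

```python
import os
import time

def sequential_write_mbps(path, block_size=1024 * 1024, count=16):
    """Write `count` zero-filled blocks to `path`, flush, and return MB/s."""
    block = bytes(block_size)
    # fdatasync is not available on every platform; fall back to fsync.
    flush = getattr(os, "fdatasync", os.fsync)
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            written = 0
            while written < block_size:
                # os.write may write fewer bytes than requested.
                written += os.write(fd, block[written:])
        # Include the flush in the timing so cached writes are not counted as done.
        flush(fd)
    finally:
        os.close(fd)
    elapsed = max(time.monotonic() - start, 1e-9)
    return (block_size * count) / 1e6 / elapsed
```

Calling sequential_write_mbps("testfile", count=16000) would mirror the 16000 x 1M run above, though Python overhead makes this a sanity check rather than a replacement for dd.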

 

At any rate, it looks like we’re well on our way to high performance LVM snapshots that work well (finally!).

Category: Stuff  | Leave a Comment
Thursday, December 01st, 2011 | Author:

Version 0.9 of SeekMark gains some functionality that allows it to be used as a quick and dirty random I/O generator. SeekMark does a good job of pounding your disk as hard as possible in all-or-nothing fashion, but now you can specify a delay to insert between seeks to reduce the load and simulate a particular workload. For example, you may want to test the performance of some other application while the system is semi-busy doing random I/O on a database file, or you may want to test shared storage between multiple hosts where one host has 4 processes doing 64k random reads every 20ms and another host has 2 processes, one doing 4k random writes as fast as possible and the other doing 128k reads every 50ms.

Along with this comes a new -e option that runs SeekMark in endless mode. That is, it will simply run until it’s killed.

As usual, see the SeekMark page, linked at the top of the blog.
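
To make the idea of a delay between seeks concrete, here is a small Python sketch of a throttled random reader. It is not SeekMark's code: the function name, parameters and defaults are assumptions for illustration, and unlike a real disk benchmark it makes no attempt to bypass the page cache. Passing seeks=None loops until the process is killed, loosely mimicking the endless mode described above.

```python
import os
import random
import time

def throttled_random_reads(path, io_size=65536, delay_ms=20, seeks=100, seed=None):
    """Read io_size bytes at random aligned offsets, pausing between reads.

    Returns the total number of bytes read (never returns if seeks is None).
    """
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a file or a block device.
        blocks = os.lseek(fd, 0, os.SEEK_END) // io_size
        if blocks < 1:
            raise ValueError("target is smaller than one I/O")
        total = 0
        done = 0
        while seeks is None or done < seeks:
            offset = rng.randrange(blocks) * io_size
            total += len(os.pread(fd, io_size, offset))
            done += 1
            if delay_ms > 0:
                # The delay is what turns "as fast as possible" into a fixed load.
                time.sleep(delay_ms / 1000.0)
        return total
    finally:
        os.close(fd)
```

The "4 processes doing 64k random reads every 20ms" scenario would then be four processes each calling throttled_random_reads(path, io_size=65536, delay_ms=20, seeks=None).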

Category: SeekMark  | Leave a Comment
Wednesday, August 24th, 2011 | Author:

Ceph, an up-and-coming distributed file system, has a lot of great design goals. In short, it aims to distribute both data and metadata among multiple servers, providing network storage that is both fault tolerant and scalable. Needless to say, this has me excited, and while it’s still under heavy development, I’ve been experimenting with it and thought I’d share a few simple benchmarks.

I’ve tested two different ‘flavors’ of Ceph. The first, which I believe is referred to as “Ceph filesystem”, is similar in function to NFS: the file metadata (in addition to the file data) is handled by remote network services, and the filesystem is mountable by multiple clients. The second is a “RADOS block device”, or RBD, a virtual block device created from Ceph storage. This is similar in function to iSCSI, where remote storage is mapped so that it looks like a local SCSI device. This means that it’s formatted and mounted locally, and other clients can’t use it without corruption (unless you format it with a cluster filesystem like GFS or OCFS).

If you’re wondering what RADOS is, it’s Ceph’s acronym version of RAID. It stands for “Reliable Autonomic Distributed Object Store”. Technically, the Ceph filesystem is implemented on top of RADOS, and other things are capable of using it directly as well, such as the RADOS gateway, which is a proxy server that provides object store services like that of Amazon’s S3. A librados library is also available that provides an API for customizing your own solutions.

I’ve taken the approach of comparing cephfs to nfs, and rbd to both iscsi and multiple iscsi devices striped over different servers. Mind you, Ceph provides many more features, such as snapshots and thin provisioning, not to mention the fault tolerance, but if we were to replace the function of NFS we’d put Ceph fs in its place; likewise if we replaced iSCSI, we’d use RBD. It’s good to keep this in mind because of the penalties involved with having metadata at the server; we don’t expect Ceph fs or NFS to have the metadata performance of a local filesystem.

  • Ceph (version .032) systems were 3 servers running mds+mon services. These were quad core servers, 16G RAM. The storage was provided by 3 osd servers (24 core AMD box, 32GB RAM, 28 available 2T disks, LSI 9285-8e), each server used 10 disks, one osd daemon for each 2T disk, and an enterprise SSD partitioned up with 10 x 1GB journal devices. Tried both btrfs and xfs on the osd devices, for these tests there was no difference. CRUSH placement defined that no replica should be on the same host, 2 copies of data and 3 copies of metadata. All servers had gigabit NICs.
  • Second Ceph system has monitors, mds, and osd all on one box. This was intended to be a more direct comparison to the NFS server below, and used the same storage device served up by a single osd daemon.
  • NFS server was one of the above osd servers with a group of 12 2T drives in RAID50 formatted xfs and exported.
  • RADOS benchmarks ran on the same two Ceph systems above, from which a 20T RBD device was created.
  • iSCSI server was tested with one of the above osd servers exporting a 12 disk RAID50 as a target.
  • iSCSI-md was achieved by having all three osd servers export a 12 disk RAID50 and the client striping across them.
  • All filesystems were mounted noatime,nodiratime whether available or not. All servers were running kernel 3.1.0-rc1 on CentOS 6. Benchmarks were performed using bonnie++, as well as a few simple real world tests such as copying data back and forth.

ceph-nfs-iscsi-benchmarks.ods

The sequential character writes were cpu bound on the client in all instances; the sequential block writes (and most sequential reads) were limited by the gigabit network. The Ceph fs systems seem to do well on seeks, but this did not translate directly into better performance in the create/read/delete tests. It seems that RBD is roughly in a position where it can replace iSCSI, but the Ceph fs performance needs some work (or at least some heavy tuning on my part) in order to get it up to speed.

It will take some digging to determine where the bottlenecks lie, but in my quick assessment most of the server resources were only moderately used, whether it be the monitors, mds, or osd devices. Even the fast journal SSD only ever hit 30% utilization, and didn’t help boost performance significantly over the competitors that don’t rely on it.

Still, there’s something to be said for this, as Ceph allows storage to fail, be dynamically added, thin provisioned, rebalanced, snapshotted, and much more, with passable performance, all in pre-1.0 code. I think Ceph has a big future in open source storage deployments, and I look forward to it being a mature product that we can leverage to provide dynamic, fault-tolerant network storage.

Category: Stuff  | Leave a Comment
Friday, June 03rd, 2011 | Author:

I just updated SeekMark to include a write seek test. I was initially reluctant to do this, because nobody would ever want to screw up their filesystem by performing a random write test to the disk it resides on, right?? Of course not, but occasionally you need to benchmark a disk, for the sake of benchmarking, and aren’t worried about the data. And of course, I didn’t care about that functionality until I needed it myself!

So here we have version 0.8 of SeekMark, which adds the following features:

  • write test via “-w” flag, with a required argument of “destroy-data”
  • allows for specification of io size via the “-i” flag, from 1 byte to 1048576 bytes (1 megabyte). The intended purpose of the benchmark (which is to test max iops and latency) is still best fulfilled by the default io size of 512 bytes, but changing the io size can be useful in certain situations.
  • added “-q” flag per suggestions, which skips per-thread reporting and limits output to the result totals and any errors that possibly arise

Now head on over to the SeekMark page and get it!

Category: Stuff  | Leave a Comment
Monday, February 14th, 2011 | Author:

High quality, dual screen 3840×1080 desktop backgrounds, fresh from my fractmark utility. These are large PNG files, because I’m a sucker for detail.

-y -.6003 -x-.367 -X -.3658 -l 60000

Wednesday, February 09th, 2011 | Author:

I just added a new page for SeekMark, a little program that I put together recently to test the number of random accesses per second to disk. It’s threaded and will handle RAID arrays well, depending on the number of threads you select. I’m fairly excited about how this turned out; it helped me prove someone wrong about whether or not a particular RAID card split seeks on RAID1 arrays. The page is here, or linked to at the top of my blog, for future reference. I’d appreciate hearing results/feedback if anyone out there gives it a try.

Here are some of my own results, comparing a 5-disk Linux md RAID10 array against one of its underlying disks. I’ll also show the difference that threading the app made:

single disk, one thread:

  [root@server mlsorensen]# ./seekmark -t1 -f/dev/sda4 -s1000
  Spawning worker 0
  thread 0 completed, time: 13.46, 74.27 seeks/sec

  total time: 13.46, time per request(ms): 13.465
  74.27 total seeks per sec, 74.27 seeks per sec per thread

single disk, two threads:

  [root@server mlsorensen]# ./seekmark -t2 -f/dev/sda4 -s1000
  Spawning worker 0
  Spawning worker 1
  thread 0 completed, time: 27.29, 36.64 seeks/sec
  thread 1 completed, time: 27.30, 36.63 seeks/sec

  total time: 27.30, time per request(ms): 13.650
  73.26 total seeks per sec, 36.63 seeks per sec per thread

Notice we get pretty much the same result, about 74 seeks/sec total.

5-disk md-raid 10 on top of the above disk, one thread:

  [root@server mlsorensen]# ./seekmark -t1 -f/dev/md3 -s1000
  Spawning worker 0
  thread 0 completed, time: 13.09, 76.41 seeks/sec

  total time: 13.09, time per request(ms): 13.087
  76.41 total seeks per sec, 76.41 seeks per sec per thread

Still pretty much the same thing. That’s because we’re reading one small thing and waiting for the data before continuing. Our test is blocked on a single spindle!

four threads:

  [root@server mlsorensen]# ./seekmark -t4 -f/dev/md3 -s1000
  Spawning worker 0
  Spawning worker 1
  Spawning worker 2
  Spawning worker 3
  thread 1 completed, time: 15.02, 66.57 seeks/sec
  thread 2 completed, time: 15.46, 64.69 seeks/sec
  thread 3 completed, time: 15.57, 64.24 seeks/sec
  thread 0 completed, time: 15.69, 63.74 seeks/sec

  total time: 15.69, time per request(ms): 3.922
  254.96 total seeks per sec, 63.74 seeks per sec per thread

Ah, there we go. 254 seeks per second. Now we’re putting our spindles to work!
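
The reason threads matter can be sketched in a few lines of Python. This is not SeekMark's implementation, just an illustrative approximation with made-up function names: each thread issues blocking random reads, so only with several threads are several requests outstanding at once. The totals are derived the same way as in the output above, with total time taken from the slowest thread. Note that without direct I/O a small or recently read target will be served from the page cache.

```python
import os
import random
import threading
import time

def seek_worker(path, seeks, io_size, results, index):
    """Issue `seeks` blocking random reads and record the elapsed time."""
    rng = random.Random()
    fd = os.open(path, os.O_RDONLY)
    try:
        blocks = os.lseek(fd, 0, os.SEEK_END) // io_size
        start = time.monotonic()
        for _ in range(seeks):
            # Each read blocks until the data arrives, like a single-threaded seeker.
            os.pread(fd, io_size, rng.randrange(blocks) * io_size)
        results[index] = time.monotonic() - start
    finally:
        os.close(fd)

def run_seek_test(path, threads=4, seeks=1000, io_size=512):
    """Run `threads` workers in parallel and summarize like the output above."""
    results = [0.0] * threads
    workers = [
        threading.Thread(target=seek_worker, args=(path, seeks, io_size, results, i))
        for i in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    total_time = max(max(results), 1e-9)  # the slowest thread defines the run
    total_seeks = threads * seeks
    return {
        "total_time": total_time,
        "ms_per_request": total_time * 1000.0 / total_seeks,
        "seeks_per_sec": total_seeks / total_time,
    }
```

Plugging in the four-thread run above (4000 seeks in 15.69 seconds) gives about 3.92 ms per request and roughly 255 seeks per second, matching the reported totals.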

Category: Stuff  | 16 Comments
Tuesday, February 08th, 2011 | Author:

I’ve just added a page for FractMark, a simple multi-threaded fractal-based benchmark. Read more about it (and download it) here.

On a side note, some of you may be familiar with a similarly simple i/o benchmark called PostMark. It was written under contract for Network Appliance and is known as an easy, portable, random i/o generator. At this point the source code has been pretty much abandoned as far as I can tell, so I’ve picked it up and have begun adding some bugfixes as well as some enhancements. The primary things I’ve done so far are to add an option for synchronous writes in Linux, as well as threaded transactions, which should give people the flexibility to test scenarios where they might have many processes creating random i/o.

If this interests you, I’ll be posting the source code and patches soon!

Category: Stuff  | Leave a Comment
Friday, October 08th, 2010 | Author:

I’ve been playing with blktrace recently, and wanted to understand the flow of IO better and the order in which events take place. For example, when I see something like the following, I can better make sense of it:

8,48   0       50     0.966969101   163  A   W 1370345506 + 8 <- (8,49) 1370345472
8,49   0       51     0.966969659   163  Q   W 1370345506 + 8 [flush-8:48]
8,49   0       52     0.966970358   163  M   W 1370345506 + 8 [flush-8:48]
8,48   0       53     0.966972523   163  A   W 1370345538 + 8 <- (8,49) 1370345504
8,49   0       54     0.966973082   163  Q   W 1370345538 + 8 [flush-8:48]
8,49   0       55     0.966974199   163  G   W 1370345538 + 8 [flush-8:48]
8,49   0       56     0.966974967   163  I   W 1370345538 + 8 [flush-8:48]
8,49   0       60     0.966985444   163  D   W 1370345538 + 8 [flush-8:48]
8,49   0      150     0.967601527     0  C   W 1370345538 + 8 [0]

I couldn’t find any sort of detailed, step-by-step flow of blktrace events, so I put this flowchart together.  The information is based largely on what is described in the book “Understanding the Linux Kernel”, as well as man pages and other bits I’ve gleaned from the internet.

I believe it to be largely correct, although I’m not exactly the king of flowcharts so there may be some errors in the layout. There’s a discrepancy between when a queue is plugged according to the book, vs what I see in blktrace.  The book says the queue is checked for emptiness, then the queue is plugged, then a request descriptor is allocated, but blktrace reliably shows that the queue is plugged after the request descriptor is allocated.  If anyone has better information, or perhaps a better flow chart, I’d appreciate seeing it.
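
To get a feel for the ordering of events, the trace excerpt above can be split into fields and grouped per request. The following Python sketch assumes the default blkparse column layout seen in that excerpt (device, CPU, sequence, timestamp, PID, action, RWBS, sector + blocks); the function names are invented for this example.

```python
from collections import defaultdict

SAMPLE = """\
8,48   0       50     0.966969101   163  A   W 1370345506 + 8 <- (8,49) 1370345472
8,49   0       51     0.966969659   163  Q   W 1370345506 + 8 [flush-8:48]
8,49   0       52     0.966970358   163  M   W 1370345506 + 8 [flush-8:48]
8,48   0       53     0.966972523   163  A   W 1370345538 + 8 <- (8,49) 1370345504
8,49   0       54     0.966973082   163  Q   W 1370345538 + 8 [flush-8:48]
8,49   0       55     0.966974199   163  G   W 1370345538 + 8 [flush-8:48]
8,49   0       56     0.966974967   163  I   W 1370345538 + 8 [flush-8:48]
8,49   0       60     0.966985444   163  D   W 1370345538 + 8 [flush-8:48]
8,49   0      150     0.967601527     0  C   W 1370345538 + 8 [0]
"""

def parse_blkparse_line(line):
    """Split one trace line into named fields, or return None if it doesn't fit."""
    parts = line.split()
    if len(parts) < 10 or parts[8] != "+":
        return None
    return {
        "device": parts[0], "cpu": int(parts[1]), "seq": int(parts[2]),
        "time": float(parts[3]), "pid": int(parts[4]),
        "action": parts[5], "rwbs": parts[6],
        "sector": int(parts[7]), "blocks": int(parts[9]),
    }

def stage_times(lines):
    """Map (device, sector) to the first timestamp seen for each action."""
    stages = defaultdict(dict)
    for line in lines:
        event = parse_blkparse_line(line)
        if event is not None:
            key = (event["device"], event["sector"])
            stages[key].setdefault(event["action"], event["time"])
    return stages

def latency(stages, key, start, end):
    """Seconds between two actions of one request, or None if either is missing."""
    times = stages.get(key, {})
    if start in times and end in times:
        return times[end] - times[start]
    return None

stages = stage_times(SAMPLE.splitlines())
request = ("8,49", 1370345538)
print("Q to C:", latency(stages, request, "Q", "C"))
print("D to C:", latency(stages, request, "D", "C"))
```

For the request at sector 1370345538 this works out to roughly 628 microseconds from queueing (Q) to completion (C), of which about 616 microseconds passed between dispatch to the driver (D) and completion. The first request in the excerpt was merged (M) into an existing one, so it has no D or C event of its own.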

Category: Storage  | 3 Comments
Wednesday, March 24th, 2010 | Author:

Ok, I’ve been sitting on these for a few days and wanted to get them out there. I’ve got an old server that I configured with 4x750GB Western Digital Black SATA drives and an LSI2008 controller in RAID10 with the default 64k stripe size. It’s a 4 core Xeon 5200 series, I believe, with 24G of RAM. The OS for every test was CentOS 5.4 with default virtual memory/sysctl configurations and minimal packages.

The VMs were built with 300GB virtual disks and 4GB of memory. The KVM guests had disk drivers and cache settings as indicated, and were on ext3 with a QCOW2 image (options were 1MB cluster size, preallocated metadata). The ESX guest had the pvscsi driver enabled, and a 2MB cluster size was used on the filesystem due to the filesize limitations.

First off, postmark.  For those who don’t know, postmark is a simple, yet decent utility that’s designed to give an idea of small I/O workloads. It’s tunable, but geared toward web/mail server type loads. It creates a ton of small files, then does random operations on them, and spits out the results.

Here’s the config:

set buffering false
set number 100000
set transactions 50000
set size 512 65536
set read 4096
set write 4096
set subdirectories 5
show
run
quit

The primary things I’m interested in here are the KVM virtio performance (specifically writethrough and nocache) compared to ESX4.1 and the native host disks. The IDE driver and writeback tests were done just to see what would happen, but they’re not exactly what I’d prefer to use in production. The KVM virtio driver is nearly on par with native speed when it comes to reads and writes. It falls behind a bit on the actual operations per second, but it must have made up the time somewhere, since the benchmark overall only took a hair longer. (If this confuses you, basically the benchmark goes through several stages: creating dirs, creating files, performing transactions, deleting files, deleting dirs. Only the transactions part counts towards the ops/sec number.) The ESX guest didn’t do as well with the pvscsi drivers. In fact, this benchmark alone would be enough to put to rest any of my concerns about virtio performance and about choosing KVM over the tried and true ESX.

Next up, iozone. This benchmark tries to create a sort of ‘map’ of the disk I/O by testing various file sizes at various record sizes, creating sort of a matrix. I regret to say that the read numbers are a bit skewed, as my setup didn’t include an unmountable volume on every system, and that’s really the only way to clean out the read caches between tests with this benchmark. Still, we’ve got some good write numbers and some interesting cache comparisons.

As you can see, the host’s read cache is much faster than the guest’s. Still, some of those guests are posting upwards of 1GB/s, which is not bad, but it does give us some insight into the overhead of the VM. We’re likely seeing the added latency of fetching the data from cache and passing it through, which can be pretty big when memory speed is measured in ns.

Also of note yet again is the good performance of virtio, and that writethrough and writeback score roughly the same. The KVM IDE driver didn’t fare so well. In fact, in writeback mode it caused the mount to go read-only repeatedly, so I gave up on it. ESX, again, not so good, beating out only the VMs that aren’t using any cache.

Here we see the huge performance boost that a VM using writeback cache can attain; the virtio driver has no problem with cheating. Now, we’re not talking about storage controller, battery-backup writeback, we’re talking about writes going into the host’s dirty memory and being considered complete. As such, you had better trust that your host won’t crash or suddenly reboot, or at the least make sure you’ve got snapshots you can roll back to in case of an emergency. You can be fairly certain that your writes will be committed within a minute or so at the worst (check the host’s dirty_expire_centisecs), most likely much sooner unless the host is spending a lot of time in IOWAIT. My point is that if you choose to go this route, a snapshot can be trusted to be good once a few minutes have passed since it was taken, so after a catastrophe you would at least have a recent one to fall back on.

Here is the same data, with the writeback taken out, so we can get a better look at the rest of the pack.

There’s not much exciting about the sequential graph, except for the nocache and writethrough VMs being faster than the host. As a guess I’d attribute this to the 1M cluster size on the qcow2 file, i.e. even though we’re writing 4k at a time in the VM, it’s probably writing them in much larger chunks when the writes hit the host. I also did some ‘dd’ tests in each of these systems, but the results were very similar so I’m not going to rehash them.

Random writes… here we actually see ESX perk up a bit and hold its own on 4-8k record sizes. The host is even faster on the low end, and in some respects this random write graph mirrors the iops results from our postmark test if you kind of average together the left half.

In all I must say that I’m fairly pleased with the progress of KVM and its I/O performance.

Category: Stuff  | 3 Comments
Monday, March 08th, 2010 | Author:

I’ve always had a focus on storage throughout my career. I’ve managed large enterprise vSANs with FC switches, commercial NAS filers, deployed iSCSI over ethernet, and managed ESX with both FC and NFS backends. I’ve been entrusted to build very large storage servers, up to 32U, with Linux and off the shelf components. Needless to say, I feel comfortable claiming that I know a little more than the average systems guy about storage, and particularly how Linux handles I/O. So when I turned my attention to benchmarking virtual machine disk performance, I found some interesting behaviors that most who seek to measure such things should probably be aware of, at least to interpret results, if they can’t otherwise be compensated for.

One of the primary things is how the Linux caching mechanisms can throw a wrench in things if you don’t think through what you’re doing. One needs to be aware of which caches are in effect during each test. For example, it’s common to test with datasets larger than the system’s memory in order to stretch the system beyond its ability to cache. However, consider a 4GB virtual guest on a physical server with 32GB RAM. Usually the guest systems are run with at least write-through cache from the host’s perspective (speaking in general terms, this can obviously be controlled by the end user on at least some virtualization platforms). So while the experimenter might think that using an 8GB dataset will be sufficient on the guest, or that issuing a drop_caches request between tests on the guest will suffice, this dataset is likely to be saved in its entirety in the host’s read cache as it goes to underlying storage, artificially boosting the results. Similarly, performing a write test on the guest and comparing it to the same write test on the host is almost certainly going to give the host an unfair advantage if the experimenter doesn’t take into account the increase in dirty memory available on the host, usually specified in percent of physical memory.

On top of that, there’s the complexity of testing  X number of virtual machines and forming a summation of how they all perform simultaneously on a physical host.  There are some pretty standard methods defined for doing this, such as putting some sort of load on each guest, and then benchmarking one while the others are running their dummy loads, but again, one must be careful, particularly with the dummy loads, that they’re not just looping tests that are small enough to cache, unless, of course, that’s the real-world behavior of the application, which brings me to my point.

It’s kind of a complex beast, trying to get meaningful results, and especially to share them with others who may have different expectations. One has to determine a goal in disk benchmarking, and it’s usually one of two things: the testing of raw disk performance, or an attempt to measure the real-world performance of an application or given I/O pattern. The former would involve disabling any and all caches, while the latter would strive to utilize the caches how they normally would be. The challenge in all of this, as mentioned, is that some people will value one set, while others will value the other. Raw disk performance will tell you a lot about just how good the setup is, for example whether one should go with that raid6 setup or do raid50 instead. On the other hand, does it really matter how well the disks perform without caches? Don’t we want to know how it’s actually going to run?

No matter how it’s done, the most important thing of all is to frame your data properly. “This was the goal or purpose, these are the tests, this is the setup, here are the results”. I’ve been running some tests that I’ll share shortly, but I wanted to get some of these considerations down, as I’ve rarely seen anyone mention them while reading through the benchmarks of others, which, frankly, has made much of the data I’ve seen surrounding VM performance largely useless.

Finally, lest this post be all rambling and not provide anything of concrete usefulness to individuals out there, the following are some mechanisms for controlling Linux caching.

Flush caches (page, dentries, inodes):  ‘echo 3 > /proc/sys/vm/drop_caches’

The above won’t do anything for dirty memory, which can be cleaned up with a ‘sync’. However, this won’t have much bearing on the write test you run afterward; for that you’ll need to know a little more about how dirty memory works. It would be naive to compare a system with 32G of memory, 3.2G of which can absorb pending writes, with a 4G system that only has 400M with which to cache writes.
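
The arithmetic behind that comparison is simple enough to sketch. The Python snippet below assumes a 10% ratio, which is what the 3.2G and 400M figures imply, and applies it to total memory; the kernel actually bases the limit on the memory it considers available for dirtying, so treat this as an approximation. The function name is made up for this example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def dirty_limit_bytes(mem_total_bytes, ratio_percent):
    """Approximate how much dirty memory a ratio allows, using total memory."""
    return mem_total_bytes * ratio_percent // 100

host = dirty_limit_bytes(32 * GIB, 10)
guest = dirty_limit_bytes(4 * GIB, 10)
print("host:  %.1f GiB of writes can sit in memory" % (host / GIB))
print("guest: %.0f MiB of writes can sit in memory" % (guest / MIB))
```

With the same ratio, the host can absorb eight times as much write data before anything has to reach the disks, which is the unfair advantage described above.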

In particular, two values are of importance: /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio. Both are specified as percentages. dirty_background_ratio tells you how big your dirty memory can get before pdflush kicks in and starts writing it out to disk. dirty_ratio is always higher (the code actually rewrites dirty_background_ratio to half of dirty_ratio if dirty_ratio < dirty_background_ratio), and is the point where applications can no longer just dirty more memory and are instead forced to write out to disk themselves. Usually this means that pdflush isn’t keeping up with your writes, and the system is potentially in trouble, but it could also just mean that you’ve set it very low because you don’t want to cache writes. For example, you may want to do this if you know you’ll be doing monster writes for extended periods; there’s no sense in bloating up some huge amount of dirty memory only to have the processes forced to write sync AND contend with pdflush threads trying to do writeback. On the flip side, increasing these values can give you a nice cache to absorb large, intermittent writes.

Both of these have time based counterparts, dirty_expire_centisecs and dirty_writeback_centisecs, such that pdflush will kick in and start doing writeback by age regardless of how much is there. dirty_expire_centisecs is how old dirty data must be before it is written out, and dirty_writeback_centisecs is how often pdflush wakes up to check. For example, it might do writeback at 500MB OR when data in dirty memory has been around for longer than 15 seconds. Newer kernels also allow an alternative specification as an actual number of bytes, rather than a percentage, in dirty_bytes and dirty_background_bytes.
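
When framing benchmark results it helps to record these settings alongside the numbers. A minimal Python sketch for collecting them follows; the key names are the real files under /proc/sys/vm discussed above, while the function names and the base parameter (handy for testing on a machine without /proc) are assumptions of this example.

```python
import os

VM_KEYS = (
    "dirty_ratio",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_background_bytes",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)

def read_vm_settings(base="/proc/sys/vm", keys=VM_KEYS):
    """Return {name: int or None} for each tunable; None if it can't be read."""
    settings = {}
    for key in keys:
        try:
            with open(os.path.join(base, key)) as f:
                settings[key] = int(f.read().strip())
        except (OSError, ValueError):
            # Older kernels lack the *_bytes files; other platforms lack all of them.
            settings[key] = None
    return settings

def centisecs_to_seconds(value):
    """The *_centisecs tunables are in hundredths of a second."""
    return value / 100.0

if __name__ == "__main__":
    for name, value in read_vm_settings().items():
        print(name, "=", value)
```

A dirty_expire_centisecs value of 3000, for instance, means dirty data becomes eligible for writeback once it is 30 seconds old.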

There are quite a few more things I could share, but I think I’ll leave with just one more: /proc/sys/vm/vfs_cache_pressure. Usually this is set at 100 by default. Increasing this number will cause the system to tend to clean up/minimize directory and inode read caches (the stuff that’s cleaned up by drop_caches), while decreasing the number will cause it to hoard more.

Stay tuned for some benchmarks of KVM virtio and IDE with no cache, writethrough, and writeback, compared to VMware ESX paravirtualized disks.