Wednesday, February 9th, 2011

I just added a new page for SeekMark, a little program I put together recently to measure random accesses per second to disk. It's threaded, so it can exercise RAID arrays properly, depending on the number of threads you select. I'm fairly excited about how it turned out; it helped me prove someone wrong about whether a particular RAID card did split seeks on RAID1 arrays. The page is here, or linked at the top of my blog, for future reference. I'd appreciate hearing results and feedback if anyone out there gives it a try.

Here are some of my own results, comparing a 5-disk Linux md RAID10 array against one of its underlying disks. I'll also show the difference that threading the app makes:

single disk, one thread:

  [root@server mlsorensen]# ./seekmark -t1 -f/dev/sda4 -s1000
  Spawning worker 0
  thread 0 completed, time: 13.46, 74.27 seeks/sec

  total time: 13.46, time per request(ms): 13.465
  74.27 total seeks per sec, 74.27 seeks per sec per thread

single disk, two threads:

  [root@server mlsorensen]# ./seekmark -t2 -f/dev/sda4 -s1000
  Spawning worker 0
  Spawning worker 1
  thread 0 completed, time: 27.29, 36.64 seeks/sec
  thread 1 completed, time: 27.30, 36.63 seeks/sec

  total time: 27.30, time per request(ms): 13.650
  73.26 total seeks per sec, 36.63 seeks per sec per thread

Notice we get pretty much the same total, about 74 seeks/sec, just split across the two threads.

5-disk md RAID10 on top of the disk above (and four others), one thread:

  [root@server mlsorensen]# ./seekmark -t1 -f/dev/md3 -s1000
  Spawning worker 0
  thread 0 completed, time: 13.09, 76.41 seeks/sec

  total time: 13.09, time per request(ms): 13.087
  76.41 total seeks per sec, 76.41 seeks per sec per thread

Still pretty much the same thing. That's because each request reads one small block and waits for the data before continuing. Our test is blocked on a single spindle at a time!

four threads:

  [root@server mlsorensen]# ./seekmark -t4 -f/dev/md3 -s1000
  Spawning worker 0
  Spawning worker 1
  Spawning worker 2
  Spawning worker 3
  thread 1 completed, time: 15.02, 66.57 seeks/sec
  thread 2 completed, time: 15.46, 64.69 seeks/sec
  thread 3 completed, time: 15.57, 64.24 seeks/sec
  thread 0 completed, time: 15.69, 63.74 seeks/sec

  total time: 15.69, time per request(ms): 3.922
  254.96 total seeks per sec, 63.74 seeks per sec per thread

Ah, there we go. 254 seeks per second. Now we’re putting our spindles to work!

Category: Stuff

18 Responses

  1. Daniel

    Hey,
    I've got to say I really enjoy reading what you're writing about. Very interesting and helpful!
    (And this little program is going to help me with random io benchmarks)

  2. admin

    Thanks, I just hope people find this stuff useful.

  3. Heiko

    Hi,
    tried your nice little program, but I'm a bit confused about the figures. Running it on a Supermicro X8SIE with a WD5002ABYS disk, the number of seeks keeps going up with the number of threads.
    ./seekmark.bin -tNNN -f/dev/sda -s1XXX
    1 thread: 81.69 total seeks per sec
    10 threads: 139.67 total seeks per sec
    40 threads: 179.73 total seeks per sec
    100 threads: 184.07 total seeks per sec
    200 threads: 183.53 total seeks per sec
    So that seems to be the saturation point.

    What is the correct interpretation of these numbers?
    i.e. 184 seeks/sec × 4096 bytes (kernel block) ≈ 736 KB/sec of random (read?) access

    What about random write? Is that measurable?

    Will seekmark give insight into hardware RAID controller performance?

    And while I'm griping, a -q switch would be nice to silence the "thread completed" and "spawning worker" messages.

    Thanks a lot for your work.

  4. admin

    Like the seeker.c program it's based on, the numbers shouldn't really be used as a random-read throughput test, but rather as a random access test from the system's point of view. Basically we tell the drive to 'go here', then 'go there', and see how long it takes to reach each destination.
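
    In C terms, the heart of that is just a seek-and-read loop. Here's a simplified single-threaded sketch of the idea (illustrative only, not the actual SeekMark source; the real program adds threads, options, and error handling):

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/time.h>
      #include <unistd.h>

      #define IOSIZE 512   /* small read: we measure the seek, not throughput */
      #define SEEKS  1000

      int main(int argc, char **argv)
      {
          int fd = open(argv[1], O_RDONLY);
          if (fd < 0) { perror("open"); return 1; }

          off_t size = lseek(fd, 0, SEEK_END);   /* size of the disk or file */
          char buf[IOSIZE];
          struct timeval start, end;

          gettimeofday(&start, NULL);
          for (int i = 0; i < SEEKS; i++) {
              /* pick a random block-aligned offset: "go here" */
              off_t where = ((off_t)rand() % ((size - IOSIZE) / IOSIZE)) * IOSIZE;
              lseek(fd, where, SEEK_SET);
              if (read(fd, buf, IOSIZE) != IOSIZE)  /* "wait for the data" */
                  perror("read");
          }
          gettimeofday(&end, NULL);

          double secs = (end.tv_sec - start.tv_sec) +
                        (end.tv_usec - start.tv_usec) / 1000000.0;
          printf("%.2f seeks/sec\n", SEEKS / secs);
          close(fd);
          return 0;
      }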

    It will in fact give information on your RAID solution: not really on read cache, but on how well it splits accesses out among the disks. E.g. add another disk to a RAID1 and watch the performance go up, but only if your RAID solution is smart enough to balance the requests across all disks in the array. This was the primary motivation for writing it instead of using seeker.c: with that program you always get the same number of seeks regardless of how many disks you have, because it's a single thread issuing random accesses one at a time.

    Command queuing is another thing you'll see the effects of as you add threads; you can see how intelligent the OS and hardware are at reordering requests to minimize head thrashing. You can tweak your disk's queue_depth, as well as nr_requests, and see how things change.
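
    (Those knobs live in sysfs, e.g. /sys/block/sda/queue/nr_requests and /sys/block/sda/device/queue_depth. A trivial sketch of setting one from C, assuming the disk is sda and you're root; it's equivalent to echoing a number into the file:)

      #include <stdio.h>

      int main(void)
      {
          /* deepen the block layer's request queue for sda (example value) */
          FILE *f = fopen("/sys/block/sda/queue/nr_requests", "w");
          if (!f) { perror("fopen"); return 1; }
          fprintf(f, "%d\n", 512);
          fclose(f);
          return 0;
      }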

    So basically, it's a worst-case IOPS generator. Your numbers sound about right; the SAN vendors I've worked with say to plan on about 150 IOPS per spindle for budget SATA disks when sizing loads.

    For random read/write, you’re better off with something that will test multiple record sizes and manipulate multiple files so that filesystem performance is also taken into account.

    One other thing I didn't really mention: you can point SeekMark at something like a database file instead of a disk and see how access to that file performs. If the file is very large or highly fragmented, its performance will be roughly equal to the disk's; but if it's laid out sequentially, is relatively small (perhaps 1% of the disk or less), and sits on a faster portion of the platter, you should actually see better performance seeking randomly through that file than through the whole drive.

    I'll try to remember the -q if I make another release. In the meantime, you can comment out the printf statements you don't want and recompile.

  5. Looks great, though I'd be much happier downloading a .tar.gz that I can ./configure, make, sudo make install rather than a naked binary or a .c file I need to build myself… That would give me the best of both worlds: the ability to inspect the source and make sure there is nothing nefarious, and the ability to get it running on my system easily.

  6. admin

    Thanks. I'll keep that in mind for future releases. It seemed simple enough to just compile, and there's nothing to configure, but I can understand why people would want a straightforward 'make'. For now I've just added the compile command to the page for reference.
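
    (It's a single C file using pthreads, so the build boils down to something along the lines of:)

      gcc -O2 -o seekmark seekmark.c -lpthread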

  7. Don Bradken

    Hi,
    I was testing this out, thank you. The results are a bit hard to believe, though.
    # ./seekmark -t4 -f/dev/md1 -s10000

    Benchmarking against /dev/md1 114478 MB

    Spawning worker 0 to do 10000 seeks
    Spawning worker 1 to do 10000 seeks
    Spawning worker 2 to do 10000 seeks
    Spawning worker 3 to do 10000 seeks
    thread 0 completed, time: 2.00, 5000.00 seeks/sec, 0.2ms per request
    thread 3 completed, time: 2.00, 4995.00 seeks/sec, 0.2ms per request
    thread 1 completed, time: 2.01, 4970.18 seeks/sec, 0.2ms per request
    thread 2 completed, time: 2.03, 4935.83 seeks/sec, 0.2ms per request

    total time: 2.03, time per request(ms): 0.051
    19743.34 total seeks per sec, 4935.83 seeks per sec per thread

  8. admin

    20k seeks/second and 0.05ms per seek would be normal for an SSD; some of the later models can do 50k or more. What is your disk type? How is md1 configured?

    Oh, also keep in mind that if you're testing a small partition rather than a whole disk, you may get artificially high numbers, because you're really only seeking through a small part of the platter. With a ~120GB md1 that doesn't seem to be your case, unless you have a small slice of multiple 3TB drives or something.

  9. jack

    hi there;

    root@world [~]# hdparm -tT /dev/sdb1

    /dev/sdb1:
    Timing cached reads: 11530 MB in 2.00 seconds = 5777.55 MB/sec
    Timing buffered disk reads: 1032 MB in 3.00 seconds = 343.85 MB/sec
    root@world [~]# hdparm -tT /dev/sdc1

    /dev/sdc1:
    Timing cached reads: 12268 MB in 2.00 seconds = 6146.71 MB/sec
    Timing buffered disk reads: 750 MB in 3.00 seconds = 249.89 MB/sec
    root@world [~]#

    sdb1 – 80 GB SSD
    sdc1 – 7200 RPM, 2 TB

    A friend's test shows his timing cached reads at 12000 MB/sec; is this the reason his server reads content faster?

    thank you very much for your time
    jack

  10. admin

    Probably not, but it depends on what you mean; I don't have enough information about the specific read situation you're interested in (does it load web pages faster? do backups faster? run benchmarks faster?). Cached reads come out of RAM. You're getting ~6GB/sec, and the difference between that and 12GB/sec won't be noticeable to human perception under most circumstances (I can think of a few situations where you might see a difference). Maybe your friend has more RAM available to cache things, which will make some reads faster, but without knowing specifics I can say that in general your cached read speed is not going to be the bottleneck on your server.

  11. John Lauro

    Sometimes it reports errors, especially with a high thread count. Without threads it is generally safe to assume you will get back all the data, but you still shouldn't assume so.

    From man 2 read:
    On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by
    this number. It is not an error if this number is smaller than the number of bytes requested; this may happen for
    example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we
    are reading from a pipe, or from a terminal), or because read() was interrupted by a signal. On error, -1 is returned,
    and errno is set appropriately…

    Whenever dealing with threads, you have to handle the case of being interrupted…
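
    A defensive wrapper in that spirit would retry on EINTR and keep reading after a short return. An illustrative sketch (not SeekMark's actual code):

      #include <errno.h>
      #include <unistd.h>

      /* Read exactly count bytes unless EOF or a hard error intervenes. */
      ssize_t full_read(int fd, char *buf, size_t count)
      {
          size_t got = 0;
          while (got < count) {
              ssize_t r = read(fd, buf + got, count - got);
              if (r < 0) {
                  if (errno == EINTR)
                      continue;   /* interrupted by a signal: just retry */
                  return -1;      /* genuine I/O error */
              }
              if (r == 0)
                  break;          /* end of file */
              got += r;
          }
          return (ssize_t)got;
      }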

  12. admin

    I'm having trouble reproducing. On a single platter I've gone up to 1024 threads repeatedly and tried various things on several systems; I can make the system so IO-bound that it barely responds, and still no errors.

    I'm already accounting for end of file, and I'm not reading from something that would return fewer bytes than requested unless there were a legitimate problem (a disk error). I'm also not sure I understand what you mean about a thread being interrupted causing a short read (aside from SIGKILL or SIGTERM, which stop the program anyway). Even if I force one thread to sleep while the others work, that thread finishes once it resumes.

    Perhaps you're under the impression that the code is written in such a way that the threads block each other in order to read (via a mutex or some such); this is not the case, and it would defeat the purpose of the test. Even in that scenario, shouldn't a thread interrupted during a read complete the read once resumed, rather than returning half a read?

    I hope you'll respond to this, as I admittedly haven't had a whole lot of practice with threaded applications, and if I'm doing something wrong I'd like to know. From my perspective, what you're saying sounds like claiming a 'dd' process will get fewer bytes than requested because apache is also running and preempts it occasionally.

    One of the systems I just tested on did say "Error: unable to read at current position of", but I also see corresponding disk errors in dmesg. I'd be interested in what error string you were getting (after the position).


    Error: unable to read at current position of 5380230873088: Input/output error
    ^C
    [root@client mnt]# dmesg|tail
    sd 10:0:0:1: [sdc] CDB: Read(16): 88 00 00 00 00 04 78 bd 58 00 00 00 00 08 00 00
    end_request: recoverable transport error, dev sdc, sector 19205543936
    sd 10:0:0:1: [sdc] Unhandled error code
    sd 10:0:0:1: [sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK

  13. John Lauro

    I think the string was 'Success'. Right now I'm having trouble reproducing; previously I was only getting a few occurrences with -s 500000 and -t 9, and now no problems are being reported.

    How many physical cores does your box have? There are 4 on the test box I used. You won't be able to reproduce the problem on a single-core machine (and probably not on a hyper-threaded core either), as it will never hit the race condition where a signal can cause read to return early.

  14. admin

    Ok, interesting. I tried 1024 threads on my 4-core i5-2500 and 2048 threads on a 24-core dual-processor AMD box.

    It's easy enough to catch only errors instead of checking whether read returns the number of bytes requested, although I'm still curious as to what's going on.

    I've updated the code: it will now report an error if the return from read is < 0, and warn if the return is positive but short, printing what it actually got. I don't want to just ignore the second case; I'd like to figure out what's going on even if it doesn't affect the test results in any meaningful way. The only thing that would make sense to me at this point is something wrong in the end-of-file code, such that on rare occasions it selects a position less than iosize from the end of the file (easy to test with a small file and lots of seeks).
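
    The shape of the new check is roughly this (paraphrased from the description above, not the literal source):

      #include <stdio.h>
      #include <unistd.h>

      /* error on a negative return, warn on a short-but-positive one */
      static int checked_read(int fd, char *buf, size_t iosize)
      {
          ssize_t r = read(fd, buf, iosize);
          if (r < 0) {
              perror("Error: unable to read at current position");
              return -1;
          }
          if ((size_t)r < iosize)
              fprintf(stderr, "Warning: short read, got %zd of %zu bytes\n",
                      r, iosize);
          return 0;
      }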

    seekmark code

    seekmark binary

  15. John Lauro

    I was having difficulty reproducing… but this seems a little easier… I created a 256 MB file under /dev/shm (tmpfs).

    ./seekmark -f /dev/shm/jj -t 99 -i 16386 -q

    READ benchmarking against /dev/shm/jj 256 MB

    Error: unable to read at current position of 210547824: Success
    Error: unable to read at current position of 160666416: Success
    Error: unable to read full io of 16386 from position 10091987 in file of size 268435456
    Error: unable to read at current position of 260707888: Success
    Error: unable to read full io of 16386 from position 175627616 in file of size 268435456
    Error: unable to read at current position of 259659472: Success
    Error: unable to read at current position of 218516672: Success
    Error: unable to read at current position of 18808008: Success
    Error: unable to read full io of 16386 from position 216163200 in file of size 268435456
    Error: unable to read at current position of 251391168: Success
    Error: unable to read full io of 16386 from position 84515800 in file of size 268435456
    Error: unable to read full io of 16386 from position 20497350 in file of size 268435456
    Error: unable to read at current position of 10298503: Success
    Error: unable to read full io of 16386 from position 49741456 in file of size 268435456
    Error: unable to read full io of 16386 from position 161113104 in file of size 268435456
    Error: unable to read at current position of 131998752: Success
    Error: unable to read at current position of 203560704: Success

    total time: 0.90, time per READ request(ms): 0.002
    551224.94 total seeks per sec, 5567.93 READ seeks per sec per thread

    (The above was tested on a 12-core box.)

  16. admin

    Ok, I was able to reproduce with a small file and lots of seeks. I've changed the code so that each thread opens its own fd, and that seems to have fixed the problem. Look for v0.82 shortly.

    Update: Done. Thanks for reporting.
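
    For the curious: the file position is per-descriptor state, so with one shared fd, thread A's lseek() could be overwritten by thread B's between A's seek and A's read, and near the end of the file that shows up as a short read. Giving each worker its own descriptor removes the shared state. A rough sketch of the shape of the fix (illustrative names, not the actual source):

      #include <fcntl.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static const char *path;   /* target disk or file, set before spawning */

      static void *worker(void *arg)
      {
          (void)arg;  /* unused in this sketch */
          /* private fd means a private file position: no cross-thread races */
          int fd = open(path, O_RDONLY);
          if (fd < 0) { perror("open"); return NULL; }
          /* ... per-thread lseek()/read() benchmark loop goes here ... */
          close(fd);
          return NULL;
      }

    (pread(), which takes an explicit offset and never touches the shared file position, would be another way to close the same race.)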

  17. Javier Picorel

    I'm a little confused. Running seekmark, I get the following output:

    bash-3.2# ./seekmark -t1 -f/dev/sda -s10000 -i4096

    Spawning worker 0 to do 10000 seeks
    thread 0 completed, time: 119.98, 83.35 seeks/sec, 12.0ms per request

    Thus, to calculate IOPS: 1000 ms / 12.0 ms ≈ 83.

    However, by running sar at the same time:

    sar -bp 1 40

    03:48:32 PM tps rtps wtps bread/s bwrtn/s
    Average: 141.36 139.56 1.80 1229.33 30.37

    The value tps (transfers per second) does not equal the IOPS figure. Do you know why they don't match? Thanks!

  18. admin

    Ultimately they're measuring from two different points: the benchmark tells you what it's requesting, and sar tells you what's happening at the block device. It might be alignment, e.g. if the benchmark requests a block that straddles two device blocks, that could show up as two transfers at the block level even though it's not two seeks (the blocks being adjacent).
