
Ceph and RBD benchmarks

Ceph, an up-and-coming distributed file system, has a lot of great design goals. In short, it aims to distribute both data and metadata among multiple servers, providing fault-tolerant and scalable network storage. Needless to say, this has me excited, and while it’s still under heavy development, I’ve been experimenting with it and thought I’d share a few simple benchmarks.

I’ve tested two different ‘flavors’ of Ceph. The first, I believe, is referred to as the “Ceph filesystem”, which is similar in function to NFS: the file metadata (in addition to the file data) is handled by remote network services, and the filesystem is mountable by multiple clients. The second is the “RADOS block device”, or RBD, a virtual block device created from Ceph storage. This is similar in function to iSCSI, where remote storage is presented as a local SCSI device. That means it’s formatted and mounted locally, and other clients can’t use it concurrently without corruption (unless you format it with a cluster filesystem like GFS or OCFS2).
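
To make the RBD side concrete, here’s a minimal sketch of creating a block device image from Python. The rbd/rados Python bindings may postdate the version tested here, and the ‘rbd’ pool and image name are just placeholders, so treat this as illustration rather than the exact commands used:

    import rados
    import rbd

    # Connect to the cluster; the standard config file path is an assumption.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # 'rbd' is the conventional pool for block device images.
    ioctx = cluster.open_ioctx('rbd')

    # Create a 20 TB image, roughly like the one used in these benchmarks.
    rbd.RBD().create(ioctx, 'bench-image', 20 * 1024 ** 4)
    print(rbd.RBD().list(ioctx))

    ioctx.close()
    cluster.shutdown()

The image then gets mapped on a client as a kernel block device and formatted with an ordinary (non-cluster) filesystem, which is exactly why only one client can safely use it at a time.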

If you’re wondering what RADOS is, it’s Ceph’s nod to RAID: the acronym stands for “Reliable Autonomic Distributed Object Store”. Technically, the Ceph filesystem is implemented on top of RADOS, and other things can use it directly as well, such as the RADOS gateway, a proxy server that provides object store services like those of Amazon’s S3. A librados library is also available, providing an API for building your own solutions.
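
For what it’s worth, librados is easy to poke at from Python. Here’s a minimal sketch, assuming the default ‘data’ pool and an object name I made up, that writes and reads back a single object directly against RADOS, with no filesystem or block device involved:

    import rados

    # Standard config file location is an assumption.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # 'data' was one of the default pools; any pool will do.
    ioctx = cluster.open_ioctx('data')

    ioctx.write_full('hello_object', b'stored directly in RADOS')
    print(ioctx.read('hello_object'))
    ioctx.remove_object('hello_object')

    ioctx.close()
    cluster.shutdown()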

I’ve taken the approach of comparing Ceph fs to NFS, and RBD to both a single iSCSI device and multiple iSCSI devices striped across different servers. Mind you, Ceph provides many more features, such as snapshots and thin provisioning, not to mention fault tolerance, but if we were to replace the function of NFS we’d put Ceph fs in its place; likewise, if we replaced iSCSI, we’d use RBD. It’s good to keep this in mind because of the penalties involved with handling metadata at the server; we don’t expect Ceph fs or NFS to have the metadata performance of a local filesystem.

  • Ceph (version 0.32): the mds+mon services ran on 3 quad-core servers with 16GB RAM. Storage was provided by 3 osd servers (24-core AMD boxes, 32GB RAM, 28 available 2TB disks, LSI 9285-8e); each server used 10 disks, with one osd daemon per 2TB disk, plus an enterprise SSD partitioned into 10 x 1GB journal devices. I tried both btrfs and xfs on the osd devices; for these tests there was no difference. CRUSH placement specified that no replica land on the same host, with 2 copies of data and 3 copies of metadata. All servers had gigabit NICs.
  • The second Ceph system had the mon, mds, and osd services all on one box. This was intended as a more direct comparison to the NFS server below, and used the same storage device served up by a single osd daemon.
  • The NFS server was one of the above osd servers with a group of 12 2TB drives in RAID50, formatted xfs and exported.
  • RBD benchmarks ran on the same two Ceph systems above, from which a 20TB RBD device was created.
  • The iSCSI server was one of the above osd servers exporting a 12-disk RAID50 as a target.
  • iSCSI-md was achieved by having all three osd servers each export a 12-disk RAID50, with the client striping across them.
  • All filesystems were mounted noatime,nodiratime, whether the options were supported or not. All servers ran kernel 3.1.0-rc1 on CentOS 6. Benchmarks were performed with bonnie++, plus a few simple real-world tests such as copying data back and forth; a minimal sketch of that kind of test follows this list.
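
The “simple real world tests” were nothing fancy; something along the lines of the sketch below (the mount point and file size are placeholders, not the exact ones used), which times a large sequential write on a mounted filesystem and reports MB/s:

    import os, time

    MOUNT_POINT = '/mnt/test'          # placeholder mount point
    SIZE_MB = 4096                     # keep this well above client RAM
    CHUNK = b'\0' * (1024 * 1024)      # 1 MB per write

    path = os.path.join(MOUNT_POINT, 'seqwrite.tmp')
    start = time.time()
    with open(path, 'wb') as f:
        for _ in range(SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())           # make sure the data really left the client
    elapsed = time.time() - start
    print('sequential write: %.1f MB/s' % (SIZE_MB / elapsed))
    os.remove(path)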

Full results: ceph-nfs-iscsi-benchmarks.ods

The sequential character writes were CPU-bound on the client in all cases; the sequential block writes (and most sequential reads) were limited by the gigabit network. The Ceph fs systems seem to do well on seeks, but this didn’t translate directly into better performance in the create/read/delete tests. It seems that RBD is roughly in a position where it can replace iSCSI, but the Ceph fs performance needs some work (or at least some heavy tuning on my part) to get it up to speed.

It will take some digging to determine where the bottlenecks lie, but in my quick assessment most of the server resources were only moderately used, whether on the monitors, the mds, or the osd servers. Even the fast SSD journal disk only ever hit 30% utilization, and didn’t boost performance significantly over the competitors that don’t rely on one.

Still, there’s something to be said for this: Ceph allows storage to fail, be added dynamically, be thin-provisioned, rebalanced, and snapshotted, and much more, with passable performance, all in pre-1.0 code. I think Ceph has a big future in open source storage deployments, and I look forward to it maturing into a product we can leverage to provide dynamic, fault-tolerant network storage.

2 comments

  1. Steve

    Going over your benchmarks I see speeds in excess of 125 MB/s while your network is listed as GigE.

    Can you break down how these numbers came to be? Everything other than “nfs on single server (19 TB 12 disk raid50 backend, xfs)” seq. read = 150 MB/s, and “3 disk md raid0 + xfs on iscsi to 3 servers/targets (3 x 12disk raid50) 55TB fs” seq. read = 146 MB/s seems probable. Was this measured on the client side, and how?

    Thanks,
    Steve


  2. marcusadmin (post author)

    Sure, valid question. The version of Bonnie was 1.03e; that’s what was available for the platform at the time. Even though the data size exceeded the system RAM, you can still see some caching effects in the reads. If I were to do it again today, I’d just use fio.
