Wednesday, March 24th, 2010 | Author: admin

Ok, I’ve been sitting on these for a few days and wanted to get them out there. I’ve got an old server that I configured with 4x750GB western digital black SATA drives, and used an LSI2008 controller in raid10 with the default 64k stripe size. It’s a 4 core xeon 5200 series, I believe, with 24G of RAM. The OS for every test was CentOS 5.4 with default virtual memory/sysctl configurations and minimal packages.

The VMs were built with 300GB virtual disks and 4GB of memory. The KVM guests had disk drivers and cache settings as indicated, and were on ext3 with a QCOW2 image (options were 1MB cluster size, preallocated metadata). The ESX guest had the pvscsi driver enabled, and a 2MB block size was used on the VMFS filesystem due to the file size limitations.

First off, postmark.  For those who don’t know, postmark is a simple, yet decent utility that’s designed to give an idea of small I/O workloads. It’s tunable, but geared toward web/mail server type loads. It creates a ton of small files, then does random operations on them, and spits out the results.

Here’s the config:

set buffering false
set number 100000
set transactions 50000
set size 512 65536
set read 4096
set write 4096
set subdirectories 5
show
run
quit

The primary things I’m interested in here are the KVM virtio performance (specifically writethrough and nocache) compared to ESX4.1 and the native host disks. The IDE driver and writeback tests were done just to see what would happen, but they’re not exactly what I’d prefer to use in production. The KVM virtio driver is nearly on par with native speed when it comes to reads and writes. It falls behind a bit on the actual operations per second, but it must have made up the time somewhere, since the benchmark overall only took a hair longer. If this confuses you, the benchmark goes through several stages: creating dirs, creating files, performing transactions, deleting files, deleting dirs. Only the transactions part counts towards the ops/sec number.  The ESX guest didn’t do as well with the pvscsi driver. In fact, this benchmark alone would be enough to put to rest any of my concerns about virtio performance and choosing KVM over the tried and true ESX.

Next up, iozone.  This benchmark tries to create a sort of ‘map’ of the disk I/O by testing various file sizes at various record sizes, creating a sort of matrix.  I regret to say that the read numbers are a bit skewed, as my setup didn’t include an unmountable volume on every system, and unmounting between tests is really the only way to clean out the read caches with this benchmark.  Still, we’ve got some good write numbers and some interesting cache comparisons.

As you can see, the host’s read cache is much faster than the guests’. Still, some of those guests are posting upwards of 1GB/s, which is not bad, but it does give us some insight into the overhead of the VM. We’re likely seeing the added latency of fetching the data from cache and passing it through, which can be pretty big when memory speed is measured in ns.

Also of note yet again is the good performance of virtio, and that writethrough and writeback score roughly the same.  The KVM IDE driver didn’t fare so well, in fact in writeback mode it caused the mount to go read-only repeatedly, so I gave up on it. ESX, again, not so good, beating out only the VMs that aren’t using any cache.

Here we see the huge performance boost that a VM using writeback cache can attain; the virtio driver has no problem with cheating.  Now, we’re not talking about storage controller, battery-backed writeback, we’re talking about writes going into the host’s dirty memory and being considered complete.  As such, you had better trust that your host won’t crash or suddenly reboot, or at the least make sure you’ve got snapshots you can roll back to in case of an emergency. You can be fairly certain that your writes will be committed within a minute or so at the worst (check the host’s dirty_expire_centisecs), and most likely much sooner unless the host is spending a lot of time in IOWAIT. My point is that if you choose to go this route, a snapshot only needs to be a few minutes old when catastrophe strikes for you to be confident that it’s good and was fully committed to disk.
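As a quick way to check that window on a host, here is a small sketch that reads the two time-based writeback settings and prints them in seconds. The function name centisecs_to_secs is only illustrative; the /proc paths are the standard ones.

```shell
#!/bin/bash
# Convert a value in centiseconds (the unit used by the
# vm.dirty_*_centisecs tunables) to whole seconds.
centisecs_to_secs() {
    echo $(( $1 / 100 ))
}

# Print the host's settings if the files are readable (Linux only).
for f in dirty_expire_centisecs dirty_writeback_centisecs; do
    path="/proc/sys/vm/$f"
    if [ -r "$path" ]; then
        echo "$f: $(centisecs_to_secs "$(cat "$path")") seconds"
    fi
done
```

Run it on the host rather than the guest, since it’s the host’s dirty memory that holds the guest’s “completed” writes in writeback mode.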

Here is the same data, with the writeback taken out, so we can get a better look at the rest of the pack.

Not really too much exciting about the sequential graph, except for the  nocache and writethrough VMs being faster than the host. As a guess I’d attribute this to the 1M cluster size on the qcow2 file, i.e. even though we’re writing 4k at a time in the VM, it’s probably writing them in much larger chunks when the writes hit the host. I also did some ‘dd’ tests in each of these systems, but the results were very similar so I’m not going to rehash them.
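I didn’t record the exact ‘dd’ command lines, so take the following as a sketch of the kind of sequential write test I mean rather than what was actually run. The function name, the default block size and the default count are all assumptions. The conv=fdatasync option (GNU dd) makes dd flush the file before it reports a rate, so the number isn’t just the speed of dirty memory.

```shell
#!/bin/bash
# Sequential write test: write count blocks of size bs to a file and
# print dd's summary line. conv=fdatasync forces the data to disk
# before dd reports its throughput.
dd_write_test() {
    local file=$1 bs=${2:-4k} count=${3:-1024}
    dd if=/dev/zero of="$file" bs="$bs" count="$count" conv=fdatasync 2>&1 | tail -n 1
}

# Usage, with a dataset sized to suit the system under test:
#   dd_write_test /mnt/test/ddfile 4k 262144
```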

Random writes… here we actually see ESX perk up a bit and hold its own on 4-8k record sizes. The host is even faster on the low end, and in some respects this random write graph mirrors the iops results from our postmark test if you kind of average together the left half.

In all I must say that I’m fairly pleased with the progress of KVM and its I/O performance.

Category: Stuff  | 2 Comments
Monday, March 08th, 2010 | Author: admin

I’ve always had a focus on storage throughout my career.  I’ve managed large enterprise vSANs with FC switches, commercial NAS filers, deployed iSCSI over ethernet, and managed ESX with both FC and NFS backends.  I’ve been entrusted to build very large storage servers, up to 32U, with Linux and off the shelf components.  Needless to say, I feel comfortable claiming that I know a little more than the average systems guy about storage, and particularly how Linux handles I/O. So when I turned my attention to benchmarking virtual machine disk performance, I found some interesting behaviors that anyone who seeks to measure such things should probably be aware of, at least to interpret results if the behaviors can’t otherwise be compensated for.

One of the primary things is how the Linux caching mechanisms can throw a wrench in things if you don’t think through what you’re doing.  One needs to be aware of which caches are in effect during each test. For example, it’s common to test with datasets larger than the system’s memory in order to stretch the system beyond its ability to cache. However, consider a 4GB virtual guest on a physical server with 32GB RAM.  Usually the guest systems are run with at least write-through cache from the host’s perspective (speaking in general terms, this can obviously be controlled by the end user on at least some virtualization platforms). So while the experimenter might think that using an 8GB dataset will be sufficient on the guest, or that issuing a drop_caches request between tests on the guest will suffice, this dataset is likely to be saved in its entirety in the host’s read cache as it goes to underlying storage, artificially boosting the results.  Similarly, performing a write test on the guest and comparing it to the same write test on the host is almost certainly going to give the host an unfair advantage if the experimenter doesn’t take into account the larger amount of dirty memory available on the host, which is usually specified as a percentage of physical memory.

On top of that, there’s the complexity of testing  X number of virtual machines and forming a summation of how they all perform simultaneously on a physical host.  There are some pretty standard methods defined for doing this, such as putting some sort of load on each guest, and then benchmarking one while the others are running their dummy loads, but again, one must be careful, particularly with the dummy loads, that they’re not just looping tests that are small enough to cache, unless, of course, that’s the real-world behavior of the application, which brings me to my point.

It’s kind of a complex beast, trying to get meaningful results, and especially to share them with others who may have different expectations.  One has to determine a goal in disk benchmarking, and it’s usually one of two things: the testing of raw disk performance, or an attempt to measure the real-world performance of an application or given I/O pattern.  The former would involve disabling any and all caches, while the latter would strive to utilize the caches how they normally would be.  The challenge in all of this, as mentioned, is that some people will value one set, while others will value the other.  Raw disk performance will tell you a lot about just how good the setup is, for example whether one should go with that raid6 setup or do raid50 instead. On the other hand, does it really matter how well the disks perform without caches? Don’t we want to know how it’s actually going to run?

No matter how it’s done, the most important thing of all is to frame your data properly. “This was the goal or purpose, these are the tests, this is the setup, here are the results”.  I’ve been running some tests that I’ll share shortly, but I wanted to get some of these considerations down, as I’ve rarely heard anyone speak of them while reading through the benchmarks of others, which, frankly, has made much of the data I’ve seen surrounding VM performance largely useless.

Finally, lest this post be all rambling and not provide anything of concrete usefulness to individuals out there, the following are some mechanisms for controlling Linux caching.

Flush caches (page, dentries, inodes):  ‘echo 3 > /proc/sys/vm/drop_caches’

The above won’t do anything for dirty memory, which can be cleaned up with a ‘sync’. However, a sync won’t have much bearing on the write test you run afterward; for that you’ll need to know a little more about how dirty memory works. It would be naive to compare a system with 32G of memory, 3.2G of which can absorb pending writes, with a 4G system that only has 400M with which to cache writes.
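Putting the two together, a minimal sketch of a “clean slate” helper to run between tests might look like this. The function name is made up, and the optional target argument exists only so the function can be tried without root; on a real system you’d leave it at the default. Remember that for VM testing it needs to be run on the host as well as the guest.

```shell
#!/bin/bash
# Flush dirty pages, then ask the kernel to drop its clean caches.
flush_caches() {
    local target="${1:-/proc/sys/vm/drop_caches}"
    sync                    # write out dirty memory first
    echo 3 > "$target"      # 3 = page cache + dentries + inodes
}

# Usage (as root), between benchmark runs:
#   flush_caches
```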

In particular, two values are of importance:  /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio. These two numbers are specified as percentages. dirty_background_ratio tells you how big your dirty memory can get before pdflush kicks in and starts writing it out to disk. dirty_ratio is always higher (the code actually rewrites dirty_background_ratio to half of dirty_ratio if dirty_ratio < dirty_background_ratio), and is the point where applications stop getting a free ride from dirty memory and are effectively forced to write synchronously to disk. Usually this means that pdflush isn’t keeping up with your writes, and the system is potentially in trouble, but it could also just mean that you’ve set it very low because you don’t want to cache writes.  For example, you may want to do this if you know you’ll be doing monster writes for extended periods; there’s no sense in bloating up some huge amount of dirty memory only to have the processes forced to write sync AND contend with pdflush threads trying to do writeback.  On the flip side, increasing these values can give you a nice cache to absorb large, intermittent writes.
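To put numbers on the 32G versus 4G comparison, here is a rough sketch of the arithmetic. The 10% figure is an assumption for illustration, and the kernel actually applies the percentage to available (free plus reclaimable) memory rather than to total RAM, so treat the results as ballpark.

```shell
#!/bin/bash
# Rough size, in MB, of the write cache implied by a dirty ratio.
# Arguments: memory in MB, ratio in percent.
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 ))
}

# The comparison from the text, assuming a 10% ratio on both systems:
echo "32G host: $(dirty_limit_mb 32768 10) MB"
echo "4G guest: $(dirty_limit_mb 4096 10) MB"
```

One way to level the field for a write test is to lower the host’s ratio until its limit matches the guest’s.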

There are also time based settings, dirty_expire_centisecs and dirty_writeback_centisecs, such that pdflush will wake up periodically and write back data by age regardless of how much is there: dirty_writeback_centisecs is how often it wakes up, and dirty_expire_centisecs is how old dirty data has to be before it gets written out. For example, it might do writeback at 500MB OR when data in dirty memory has been around for longer than 15 seconds.  Newer kernels also allow an alternative specification of an actual number of bytes, rather than a percentage, in dirty_bytes and dirty_background_bytes.
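The bytes and ratio forms are mutually exclusive: when one is written, the other reads back as 0. A small sketch of how to work out which limit is in effect follows; the function name and argument order are my own.

```shell
#!/bin/bash
# Effective dirty limit in bytes. A non-zero dirty_bytes value wins;
# otherwise fall back to the percentage of the given memory size.
# Arguments: memory in bytes, dirty_ratio, dirty_bytes.
effective_dirty_bytes() {
    local mem_bytes=$1 ratio=$2 bytes=$3
    if [ "$bytes" -gt 0 ]; then
        echo "$bytes"
    else
        echo $(( mem_bytes * ratio / 100 ))
    fi
}

# Usage with live values (memory size supplied by hand here):
#   effective_dirty_bytes 4294967296 \
#       "$(cat /proc/sys/vm/dirty_ratio)" "$(cat /proc/sys/vm/dirty_bytes)"
```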

There are quite a few more things I could share, but I think I’ll leave with just one more: /proc/sys/vm/vfs_cache_pressure. Usually this is set at 100 by default. Increasing this number will cause the system to tend to clean up/minimize directory and inode read caches (the stuff that’s cleaned up by drop_caches), while decreasing the number will cause it to hoard more.

Stay tuned for some benchmarks of KVM virtio and IDE with no cache, writethrough, and writeback, compared to VMware ESX paravirtualized disks.

Thursday, June 25th, 2009 | Author: admin

So you want to migrate your existing Linux partitions to software raid1… I’ve read recently about folks migrating to software raid by actually copying data. I’ve been doing this on-the-fly (sort of), without copying the data, but instead just initializing the partition as an md device and mounting it as such with the data intact. Keep in mind that md needs a slice at the end of the partition for its superblock (with the 0.90 superblock format, the default here), which is why the filesystem is shrunk with resize2fs.  Now, if you want to rearrange how your data is mounted then you’re out of luck, but if you just want to migrate existing partitions to raid partitions, here’s an example with an ext3 filesystem.

started with data on mounted /dev/sdm1, want to add /dev/sdn1 in raid1

umount /dev/sdm1

##see how many blocks we currently have

tune2fs -l /dev/sdm1 | grep "Block count"

##subtract 64 blocks from the current block count, making space for the md superblock.

resize2fs /dev/sdm1 <blocks>

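##--metadata=0.90 is given explicitly, since newer mdadm versions default to a superblock format that is not stored at the end of the device
mdadm --create /dev/md0 --metadata=0.90 --raid-devices=2 --level=raid1 /dev/sdm1 missing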

##see that initial data is still there…
mount /dev/md0 /mnt
ls -la /mnt

##add mirror device
mdadm --manage /dev/md0 --add /dev/sdn1

##check
cat /proc/mdstat
mdadm --detail /dev/md0

##use parted to change partition type to ‘fd’, linux raid auto, make sure the kernel can find it on reboot.

parted /dev/sdm set 1 raid on

parted /dev/sdn set 1 raid on
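The block arithmetic in the resize step above can be scripted instead of done by hand. This is only a sketch: the function name is made up, and it simply applies the same 64 block reserve used in the steps above to whatever ‘tune2fs -l’ reports.

```shell
#!/bin/bash
# Read 'tune2fs -l' output on stdin and print the block count minus a
# reserve (64 blocks by default) left free for the md superblock.
shrunk_block_count() {
    local reserve="${1:-64}"
    awk -F: -v r="$reserve" '/^Block count:/ { print $2 - r }'
}

# Usage against the real device, with the filesystem unmounted:
#   blocks=$(tune2fs -l /dev/sdm1 | shrunk_block_count)
#   resize2fs /dev/sdm1 "$blocks"
```

Note that resize2fs will typically insist on a clean ‘e2fsck -f’ of the filesystem before it agrees to shrink it.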

Category: Linux, Storage  | Leave a Comment
Friday, February 06th, 2009 | Author: admin

So, this morning I was installing the latest Snort from RPM at home, and ran into an issue that kept me busy for a little while. Basically what had happened was that snort was not logging to the mysql database.  I had defined my output properly in /etc/snort/snort.conf, and even verified that snort could log in, but it wasn’t even attempting to write to it. I immediately found that it was writing text files in /var/log/snort, but it took me a bit to realize that ALERTMODE was set to ‘fast’ in /etc/sysconfig/snort. This bypasses any database config you might have.  A pretty embarrassing mistake, but since I found a lot of people on forums out there with the same problem and nobody posting solutions, I figured I’d better share.
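For anyone chasing the same problem, here’s a quick sketch for checking whether the sysconfig file is overriding your output configuration. The function name is made up, and the idea that commenting out the ALERTMODE line hands control back to snort.conf is my understanding of the RPM’s init script, so verify it against your own copy.

```shell
#!/bin/bash
# Print the value of any uncommented ALERTMODE= line in the given
# sysconfig file (default: /etc/sysconfig/snort). Prints nothing if
# the setting is absent or commented out.
alertmode_of() {
    sed -n 's/^[[:space:]]*ALERTMODE=//p' "${1:-/etc/sysconfig/snort}"
}

# Usage:
#   alertmode_of     # prints 'fast' on an affected system
```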

Wednesday, February 04th, 2009 | Author: admin

I recently provided this solution for a system with high iowait. It’s a monitoring system with highly transient data, yet not entirely temporary data. The server we happened to put it on had plenty of memory and CPU, but only a mirrored set of drives (a blade), and the application isn’t really important enough to eat expensive SAN space.  The solution was to utilize the memory in the system as a ramdisk.

It’s a simple procedure. You create the ramdisk, then make an LVM logical volume from it (which is why we don’t use tmpfs). Occasionally you’ll snapshot it and backup the contents, and also back it up during a controlled shutdown. On startup, you simply re-create the ramdisk and restore the backup.

This solution should work for a variety of applications, however, you have to pay attention to certain circumstances, such as applications that might keep changes in memory rather than flushing to disk regularly (a.k.a. writeback cache, in which case you’ll want to shut down the app before performing the backup), as well as the obvious chance of losing the data collected after the last backup in the event of an unforeseen interruption such as power loss.

First off, you have to have ramdisk support in your kernel.  If support is compiled into the kernel (e.g. Red Hat based installations), you’ll need to add the option “ramdisk_size” to the kernel line in /boot/grub/grub.conf (or menu.lst), with the size specified in KB. For example, “ramdisk_size=10485760” would give you 10GB ramdisks. Afterward, simply reboot and you should have /dev/ram0 thru /dev/ram15. Yes, it creates 16 devices by default, but it doesn’t eat your memory unless you use it.

I prefer to have ramdisk support built as a module (such as opensuse does), because you can just reload the module in order to add/resize ramdisks. To do this, you have to know the filename of the module, usually rd.ko or brd.ko.

server:/lib # find . -name brd.ko
./modules/2.6.25.18-0.2-default/kernel/drivers/block/brd.ko
./modules/2.6.25.5-1.1-debug/kernel/drivers/block/brd.ko

Then load the module:

server:/lib # modprobe brd rd_size=1048576

server:/lib # fdisk -l /dev/ram0

Disk /dev/ram0: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Of course you can add this module to the appropriate module configuration file (depending on your distribution) for future reboots.
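Since both ramdisk_size and rd_size are given in KB, a tiny helper saves some mental arithmetic. This is just a sketch; the function name and the modprobe.d file name in the comments are assumptions, while the ‘options’ line itself is standard modprobe configuration syntax.

```shell
#!/bin/bash
# Convert a ramdisk size in GB to the KB value expected by
# ramdisk_size= (kernel command line) and rd_size= (brd module).
gb_to_rd_size() {
    echo $(( $1 * 1024 * 1024 ))
}

# Usage:
#   modprobe brd rd_size=$(gb_to_rd_size 1)
# Or persistently, e.g. in a file such as /etc/modprobe.d/brd.conf:
#   options brd rd_size=1048576
```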

Now that we’ve covered how the ramdisks work, here’s the script I’ve included in /etc/rc.local (Red Hat) to create/restore the ramdisk LVM volume at boot.

#create pv :
pvcreate /dev/ram0 /dev/ram1 /dev/ram2

#create vg :
vgcreate vg1 /dev/ram0 /dev/ram1 /dev/ram2

#create lv :
lvcreate -L 18G -n ramlv vg1

#create fs:
mkfs.ext2 /dev/vg1/ramlv

#mount lv :
mount -o noatime /dev/vg1/ramlv /mnt/ramlv

#restore data
cd /mnt/ramlv && tar -zxf /opt/ramlv.tar

#start the service that relies on the high performance ramdisk

/etc/init.d/zenoss start

The backup is run at shutdown and on a cron. The main reason I’m using tar with gzip is that it facilitates super fast restores, as the data I have gets a 5:1 compression ratio. With the disk being the major bottleneck in this particular server, I get roughly a 4-5x speed boost when copying the data back to ram from the compressed archive compared to copying from a file level disk backup. YMMV; another option is to simply rsync the data to disk on a schedule. With tar, the backups work harder, but don’t really impact the performance of the ramdisk while they’re running. Here’s the script:

#!/bin/bash

#logging
echo
echo "#############################################"
date
echo "#############################################"

#create snapshot:
/usr/sbin/lvcreate -L 5G -s -n ramsnap /dev/vg1/ramlv

#mount snapshot:
/bin/mount -o noatime /dev/vg1/ramsnap /mnt/snapshot

#logging
df -h /mnt/snapshot
echo TIME

#backup the only directory in /mnt/snapshot (named ‘perf’):
mv -f /opt/ramlv.tar /opt/ramlv-old.tar
cd /mnt/snapshot && time /bin/tar zcf /opt/ramlv.tar perf

#logging
ls -la /opt/ramlv.tar
/usr/sbin/lvdisplay /dev/vg1/ramsnap
/usr/sbin/lvs

#remove snapshot:
cd && /bin/umount /mnt/snapshot &&  /usr/sbin/lvremove -f /dev/vg1/ramsnap

And there you have it. Insane disk performance at your fingertips. A few things to keep in mind: you’ll want to ensure that your snapshot volume is sizable enough to hold any changes that may occur while you’re performing the backup (which is why I’ve got the logging commands in there). You’ll also probably want to keep a few backups in case the system dies while one is being performed, hence the ‘mv -f’ command in the script. Other than that, have fun, and feel free to share your experiences and improvements.
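If a single ‘-old’ copy isn’t enough, the ‘mv -f’ line could be swapped for a small rotation function along these lines. It’s a sketch only: the function name and the numbered suffix scheme are my own invention, not part of the script above.

```shell
#!/bin/bash
# Keep up to 'keep' numbered copies of a backup file:
# file -> file.1 -> file.2 -> ... with the oldest copy overwritten.
rotate_backups() {
    local file=$1 keep=${2:-3} i
    for (( i = keep - 1; i >= 1; i-- )); do
        [ -e "$file.$i" ] && mv -f "$file.$i" "$file.$((i + 1))"
    done
    [ -e "$file" ] && mv -f "$file" "$file.1"
    return 0
}

# Usage, just before the tar command in the backup script:
#   rotate_backups /opt/ramlv.tar 3
```

The restore at boot would still read /opt/ramlv.tar, falling back to /opt/ramlv.tar.1 by hand if the newest archive turns out to be incomplete.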

Category: Linux, Storage  | Leave a Comment
Monday, December 29th, 2008 | Author: admin

So, I’m interested in SSDs, or Solid State Disks.  We’re seeing them come to the enterprise storage market, claiming that they can do the work of ten or more fiber channel platter based drives, albeit without the capacity.  I presume the reason that this works from a marketing perspective is that many applications need performance more than they do capacity; I know of several instances where we look at the number of spindles and only use a fraction of the storage space on each drive.  At any rate, now that major vendors are marketing them to the enterprise, it’s only a matter of time until the good stuff trickles down to the common folk.

Most enterprise storage is SLC NAND flash, which is inherently faster and more robust than the cheaper MLC that is commonly used in USB thumb drives, memory cards, and the like.  Both technologies have undergone improvements over the years, with vendors recently marketing a ten-fold increase in write cycles for both.  Even though the technology is improving, MLC is still the more consumer oriented device, and will be for the foreseeable future, because it’s cheaper to make at a given bit density.  While this may seem like it would relegate consumer SSDs to the lackluster performance seen on USB thumb drives, vendors get around this with their flash controllers, finding creative ways to write to arrays of flash chips, boosting performance enough to make MLC a viable option in the consumer storage space, which brings me to the main point of today’s article.

There are a lot of claims and counter-claims thrown about when it comes to SSDs.  Some say that they use more power, some say that they’re more power friendly.  One camp points to great read times, while another claims poor write performance.  So, being the curious individual that I am, I decided to run some of my own tests.  What follows is my own analysis of the products that I could get my hands on.  Be forewarned, these aren’t exhaustive tests, rather I focused primarily on my usage patterns and real-world situations.

Now to introduce the contenders:

  • Western Digital 2.5″ 5400RPM Scorpio, 160GB
  • Samsung 64GB SSD
  • OCZ “Solid series”, 60GB SSD

The comparison between the two SSDs is particularly of interest to me, because the Samsung was a $500 upgrade last February (of course I got a better deal than that), and the OCZ Solid SSD was recently purchased for $135. It’s their value line product and supposedly the lowest performing of their current line-up.

The two main areas I’m going to focus on are performance and power consumption.  I’m using two platforms, a Dell m1330 laptop, and a Lenovo ideapad S10, which is an Atom-based netbook. One special thing to note regarding setup: instructions from OCZ state to turn off AHCI or risk time-outs and pauses in your system.  Apparently their drives don’t handle (or need, for that matter) some of the features of AHCI, such as Native Command Queueing.  I did not notice any difference on the m1330 with Vista SP1, but XP on the S10 definitely had long “WTF!?” pauses that were fixed by simply disabling AHCI in the BIOS. On to the benchmarks…

I’m beginning with the most relevant data captured, from iozone.  I’ll offer links to the full data, but one must be careful with interpreting the results, because iozone gives a complete overview of your entire platform, meaning that you see performance of processor caches, buffer caches, and disk. This can sometimes make it difficult to draw meaningful conclusions if one doesn’t understand all of the data. Another thing that will help you draw meaningful conclusions from the data below: you can use the Process Monitor tool from SysInternals to view the transaction sizes that your various applications use. For example, my antivirus scanner reads files in 4k requests, large files being copied with Explorer in Windows XP seem to be read and written 64k at a time, while in Vista files are read 1024k at a time and written 64k at a time. The behavior of the application, along with file sizes and their location on disk, are key in understanding the effects of the below data.

Many people have seen the phenomenal boot times of SSDs, and these tables highlight the reason. Comparing the older Samsung SSD to the Scorpio spindle, we see that random reads for the most common transaction sizes (4k-64k) are about 10 to 13 times faster. The OCZ SSD also shows this trend, and adds a big bump to sequential reads as well.  In exchange, however, we get slower writes, on the order of about 2 to 5 times, with random writes taking a big hit.  Still, it should be noted that random write performance isn’t particularly great even for the Scorpio at common transaction sizes.  All in all it seems to be a good tradeoff, especially considering that most data is write once, read many.

Another thing that this highlights is the benefit of defragmentation.  Many have asked the question “should you defragment an SSD?”, and the common wisdom is that defragmentation isn’t necessary with SSDs.  While it should indeed be limited in order to preserve the (currently unknown) longevity of the flash, one needs only to look at the difference in performance between random and sequential to see that even SSDs benefit from defragmentation.  Some people are concerned about the write lifetime of flash, and while manufacturers try to put people at ease with their various wear optimization techniques, the reality is that most of these devices are too new to have a proven track record either way.  For the record, I’ve had my Samsung for about a year now, have beaten the hell out of it as far as writes, and haven’t had any issues yet ;-) .

full iozone data -  xls csv

Here’s another look, this time by a simpler benchmark from ATTO on the XP netbook. I won’t go into too much detail here as it’s more of the same, but you can view the results by clicking on the thumbnails below.  Note that if one were to only use this tool, one might not see why the samsung SSD is subjectively much faster in day to day use than the Scorpio drive.

ATTO - WD Scorpio 160GB

ATTO - Samsung SSD 64GB

ATTO - OCZ Solid series 60GB

Next up, a real world scenario: virus scan. This should show a huge improvement when moving to SSD, according to the iozone results. Some of the information will be sequential, but most will be random.  On top of that, as I mentioned, the virus scanner I’m using seems to read files 4k at a time. The setup is Avast! Antivirus, running a standard scan on Vista SP1.

The results speak for themselves. The iozone data seems to translate into real-world performance.

Now for battery life. I performed two tests, one was watching an xvid encoded 480p movie from the hard disk, the other was pretty much idle, with a script writing a small amount of data to the hard drive every 30 seconds.  The movie was chosen because it did a good job of generating  a constant stream of i/o (64k at a time) while not being absurdly taxing on the disk like running a benchmark might, a good real-world scenario. Actual results should in theory end up somewhere between the two benchmarks.

The m1330 loses a bit of life when switching to SSD, about 8 minutes. However, with the S10 it seems to be a wash. There are too many differences between the two platforms to pinpoint the cause. It could be due to the more aggressive performance settings I have in the m1330’s power options, could be a hardware difference, or even XP vs Vista.  All we can really say is that the mileage isn’t due to the disk alone, but how the platform reacts to it. Your mileage will depend on your platform, but the difference isn’t much.

Again, we see conflicting results. The S10 likes SSDs, while the m1330 doesn’t. I searched through the power options, and there were a few differences on the m1330 in regards to processor frequencies, but the hard disk settings were the same between platforms.  I have a hunch that the S10, being a smaller, lower wattage platform overall, will be more sensitive to the actual power consumption of the drive. Make of it what you will, the differences don’t seem to be all that much either way.

In summary, it seems that SSDs, in their current incarnation, offer a large boost to read performance in exchange for a medium-small cut in write performance. There are differences in battery life, but the differences are relatively small and differ between platforms. I would like to get my hands on some of the higher-end SSDs such as the intel x25, but until the prices come down, I think that comparing these (now) budget SSDs has been a useful exercise.

Wednesday, December 03rd, 2008 | Author: admin

Well, I’m back from the holiday binge, and I’ve brought some shiny new 3D graphics with me! On that tangent, Blender is a pretty cool application, and I wish I had the time to get to know it better.  I’m not a 3D guru, didn’t take any classes in it, and have never used any high dollar computer modeling packages, so don’t take that as a professional endorsement, but for the curious it should prove to be a worthy diversion ( be sure to do the tutorials from their site). Update: Here is a link to the blender files used to create the images in this article.

On to the article.  The purpose here is to give someone who has never had any exposure to LVM, or Logical Volume Manager, a basic understanding of the concepts and how to use it. Later on I hope to dig deeper into the details, such as alignment with RAID volumes, snapshots, and other features.  I do assume some understanding of plain old partitioning, but will brush over that topic as well in order to highlight differences in the process.

Let’s start with the ‘why?’.  Most Linux users will be familiar with the idea of carving up their hard disk into one or more usable containers (partitions) and applying file systems to them, and for casual users, that’s generally sufficient.  The process might go something like this:

  • Start with a raw hard disk. The device name might come up as “/dev/sda”.
  • Using ‘fdisk’, or if you’re fancy some graphical partitioning tool, you might assign the first 200 megabytes or so as the first partition on “/dev/sda”, creating “/dev/sda1”. You decide that this will hold your system boot data, “/boot”.
  • You then assign the next one or two gigabytes, creating “/dev/sda2”, to be used as virtual memory, or swap.
  • A casual user might then take the rest of the drive space and create “/dev/sda3” for the operating system data, user data, and everything else. They may instead choose to make more partitions, one just for the operating system, one just for user data, etc, but for the sake of simplicity we’ll stick with three partitions at the moment.

Representation of a physical hard disk being partitioned into usable containers

  • You then create the filesystems on each partition. You can think of this as giving the partition structure or priming it for use.  For example, you’d run ‘mkswap /dev/sda2’ to make the partition we created into swap space, or ‘mkfs.ext2 /dev/sda1’ to allow you to store your boot files on partition sda1 using the ext2 filesystem.

Creating swap space and a boot file system out of partitions.

So here we have a fairly basic, standard partitioning setup, but what does a user do if they find that they’ve filled “/dev/sda3” with their movies?  Not only are they out of space, but their computer is unstable because the system doesn’t have any place to store temporary operating data.  Maybe the user desires to make “/dev/sda3” larger, but there’s no more room on the disk. Their only option is to add another hard drive, “/dev/sdb”, and create a new partition, “/dev/sdb1”, to be used exclusively for “/media/movies” (or what have you), but now “/dev/sda3” is mostly an empty, oversized partition.

What would really be handy in this case is to be able to create a partition out of pieces of two different physical drives. We’d want to shrink “/dev/sda3”, use it exclusively for the system data, and then take some free space from “/dev/sda” and “/dev/sdb” and create a partition for user data.

While classic partitions can be resized with certain tools, they’re limited by the physical boundaries of the drive size, as well as the location and size of the other partitions on the drive. For example, if we wanted to grow “/dev/sda3”, we’d need some free space, and it would have to be immediately adjacent to “/dev/sda3”.  This is the basic problem that LVM is designed to solve.

The primary advantage of LVM is that it abstracts physical disk boundaries away from partitions. Instead of physical disks, you now have a pool of storage that can be made up of one, two, three and three quarters disks, or whatever you may have. That pool of storage can be metered out to partitions or taken back from them in small chunks.  Don’t forget that there are other advanced functions that also make LVM useful, but are beyond the scope of this article.

So let’s get started with an overview of LVM in action. We’ve made partitions for “/boot” and swap, but now instead of making “/dev/sda3” into “/” or the root volume, we’re going to take it and a partition from a second installed drive, “/dev/sdb1”, and create what’s called a volume group. This volume group is going to serve as the pool of storage discussed earlier.

The first step is to take our partitions and mark them as physical volumes that LVM can use in a pool. This is a simple process that involves running the ‘pvcreate’ command on each partition that we want to make available. This command, simply put, creates a metadata header on each partition that will store LVM information and allow it to work its magic.  As an aside, it’s not strictly necessary to create partitions; you can ‘pvcreate’ an entire disk if you’d like (e.g. /dev/sdb instead of /dev/sdb1).

root@linux:~# pvcreate /dev/sda3 /dev/sdb1
Physical volume "/dev/sda3" successfully created
Physical volume "/dev/sdb1" successfully created

Marking partitions as physical volumes to be used with LVM

Next, we bundle those together into a single volume group.  This is done with the ‘vgcreate’ command, which writes information about the volume group into the metadata header of each physical volume in the group.  The group gets a name, here “vg0”, which will show up as the directory “/dev/vg0” once we create logical volumes in it. Note that we can add more physical volumes to this volume group at any time, using the ‘vgextend’ command.

root@linux:~# vgcreate vg0 /dev/sda3 /dev/sdb1
Volume group "vg0" successfully created

Creating a volume group from two physical volumes

Now we’ve got our big pool of storage. Notice the grid marks on it. This was my attempt to portray that the volume group is divided into small pieces, or physical extents, 4 megabytes each by default. These physical extents are the building blocks for logical volumes, which will serve as replacements for our classic partitions.  Creating logical volumes is basically the process of assigning these extents to a defined container. The extents can come from any physical volume in the volume group; by default it doesn’t really matter which (though this can optionally be controlled, for example with the contiguous flag). We’re basically just taking pieces from the pool and assigning them to a new logical volume. This process also creates device nodes for us, “/dev/vg0/lv0” and “/dev/vg0/lv1”. Note how the node path goes “/dev/<volume group name>/<logical volume name>”.

root@linux:~# lvcreate --extents 5120 --name lv0 /dev/vg0
Logical volume "lv0" created
root@linux:~# lvcreate --extents 20480 --name lv1 /dev/vg0
Logical volume "lv1" created

creating logical volumes from physical extents in pool vg0

Now we’ve got logical volumes, the LVM equivalent of partitions. Note that I created a 20 gigabyte volume called “lv0” by assigning 5,120 four-megabyte extents, and an 80 gigabyte “lv1” with 20,480 extents. I did this for the sake of the example; in practice you could also use “--size 20G” instead of “--extents 5120”.  Note also that I did not use all of the volume group; there are spare extents on the right waiting to be added to lv0, lv1, or used for a new logical volume.
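
If you want to double-check the extent arithmetic, here is a small shell sketch. The function names are made up for this example, and it assumes the default 4 MB extent size; a volume group created with a different extent size would need a different PE_SIZE_MB.

```shell
#!/bin/sh
# Assumption: default physical extent size of 4 MB.
PE_SIZE_MB=4

# Hypothetical helper: extent count -> size in GB (integer math).
extents_to_gb() {
    echo $(( $1 * PE_SIZE_MB / 1024 ))
}

# Hypothetical helper: size in GB -> extent count.
gb_to_extents() {
    echo $(( $1 * 1024 / PE_SIZE_MB ))
}

echo "lv0: $(extents_to_gb 5120) GB"
echo "lv1: $(extents_to_gb 20480) GB"
echo "a 25 GB volume would need $(gb_to_extents 25) extents"
```

The two directions are just the same multiplication read either way, which is why “--size” and “--extents” are interchangeable on the lvcreate command line.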

These new logical volumes can now be treated as normal partitions and formatted with the filesystem of your choice. In this example, we’re going to use the 20 gigabyte volume for the root filesystem “/”, and the larger, 80 gigabyte volume for user data on “/home”.

root@linux:~# mkfs.ext3 /dev/vg0/lv0

root@linux:~# mkfs.ext3 /dev/vg0/lv1

using mkfs.ext3 to format logical volumes

And that’s it for the basics. We’ve covered how to use LVM to create volumes that are a replacement for classic partitions, breaking physical disk barriers. One of the things I like about LVM is the simplicity. The commands are consistent: pvcreate, vgcreate, lvcreate, and so on. All you have to do is remember the concepts, physical volumes to volume groups to logical volumes, and you can figure out the commands and what order to do them in.

I’ll leave you with a few examples of how to view the status of your new LVM volumes, as well as how to expand a logical volume while it’s online.

Extend logical volume:

root@linux:~# lvextend --size +5G /dev/vg0/lv0
Extending logical volume lv0 to 25.00 GB
Logical volume lv0 successfully resized

root@linux:~# lvdisplay /dev/vg0/lv0
--- Logical volume ---
LV Name                /dev/vg0/lv0
VG Name                vg0
LV UUID                By4T9J-wPhq-fDYt-JuNE-t3Bc-UoB8-CC6TaV
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                25.00 GB
Current LE             6400
Segments               2
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           254:0

Resize file system:

root@linux:~# resize2fs /dev/vg0/lv0
resize2fs 1.41.3 (12-Oct-2008)
Filesystem at /dev/vg0/lv0 is mounted on /; on-line resizing required
old desc_blocks = 2, new_desc_blocks = 2
Performing an on-line resize of /dev/vg0/lv0 to 6553600 (4k) blocks.
The filesystem on /dev/vg0/lv0 is now 6553600 blocks long.

root@linux:~# df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv0    25G  173M   24G   1% /
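
As a rough cross-check, the block count that resize2fs reports should line up with the extent count from lvdisplay. This is only a sketch using the numbers from the output above; the variable names are mine, and the 4 KB filesystem block size is taken from the “(4k) blocks” note in the resize2fs output.

```shell
#!/bin/sh
# Numbers copied from the lvdisplay and resize2fs output above.
le_count=6400        # "Current LE" after the lvextend
pe_size_kb=4096      # 4 MB extents, expressed in KB
fs_block_kb=4        # ext3 block size reported by resize2fs

# Size of the logical volume in KB, divided by the filesystem block size.
expected_blocks=$(( le_count * pe_size_kb / fs_block_kb ))
echo "expected filesystem blocks: $expected_blocks"
```

If the two numbers disagree after a resize, the filesystem probably hasn’t been grown to fill the logical volume yet.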

View physical volumes:

root@linux:~# pvdisplay
--- Physical volume ---
PV Name               /dev/sda3
VG Name               vg0
PV Size               64.76 GB / not usable 3.19 MB
Allocatable           yes (but full)
PE Size (KByte)       4096
Total PE              16578
Free PE               0
Allocated PE          16578
PV UUID               YF6kv3-xA34-4UB2-uWYc-W061-e06E-XKiGzj

--- Physical volume ---
PV Name               /dev/sdb1
VG Name               vg0
PV Size               74.50 GB / not usable 1.03 MB
Allocatable           yes
PE Size (KByte)       4096
Total PE              19073
Free PE               8771
Allocated PE          10302
PV UUID               G8FjdC-PkW4-L0yq-TMsz-cwh7-XoXN-swqZOY

View volume groups:

root@linux:~# vgdisplay
--- Volume group ---
VG Name               vg0
System ID
Format                lvm2
Metadata Areas        2
Metadata Sequence No  6
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                2
Open LV               1
Max PV                0
Cur PV                2
Act PV                2
VG Size               139.26 GB
PE Size               4.00 MB
Total PE              35651
Alloc PE / Size       26880 / 105.00 GB
Free  PE / Size       8771 / 34.26 GB
VG UUID               DY0Hsk-vT57-pMnV-Rrgm-u2Hb-U8kh-Xqvgcp
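
The extent counts in these displays all tie together, which is handy when you’re working out how much room is left to hand out. Here’s a quick sketch of the bookkeeping, with the values copied by hand from the pvdisplay and vgdisplay output above (the variable names are just for illustration):

```shell
#!/bin/sh
# Values copied from the pvdisplay/vgdisplay output above.
pe_size_mb=4
pv_sda3_pe=16578     # Total PE on /dev/sda3
pv_sdb1_pe=19073     # Total PE on /dev/sdb1
vg_alloc_pe=26880    # Alloc PE in vg0
vg_free_pe=8771      # Free PE in vg0

# The volume group total is the sum of its physical volumes.
vg_total_pe=$(( pv_sda3_pe + pv_sdb1_pe ))
echo "total PE from PVs: $vg_total_pe"
echo "alloc + free PE:   $(( vg_alloc_pe + vg_free_pe ))"

# Free extents converted to space that is still available for volumes.
free_mb=$(( vg_free_pe * pe_size_mb ))
echo "free space: ${free_mb} MB (about $(( free_mb / 1024 )) GB)"
```

Both sums should match the “Total PE” line in vgdisplay, and the free space is the most you could add to lv0 or lv1 without adding another physical volume.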

View logical volumes:

root@linux:~# lvdisplay
--- Logical volume ---
LV Name                /dev/vg0/lv0
VG Name                vg0
LV UUID                By4T9J-wPhq-fDYt-JuNE-t3Bc-UoB8-CC6TaV
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                25.00 GB
Current LE             6400
Segments               2
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           254:0

--- Logical volume ---
LV Name                /dev/vg0/lv1
VG Name                vg0
LV UUID                ZjPt4x-AHK7-Q0tj-JXSZ-p9Kw-kg2C-GIdbJN
LV Write Access        read/write
LV Status              available
# open                 0
LV Size                80.00 GB
Current LE             20480
Segments               2
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           254:1

Category: Storage
Thursday, November 20th, 2008 | Author: admin

Setup:

15Mbit download (tests to 11Mbit at http://speedtest.net) via FTTH connection.

46″ 1080P Sony Bravia (not XBR, year-old model)

So, I’ve signed up for the free Netflix trial to test out streaming to the 360. Frankly, after renting titles from the video marketplace, the Netflix version looks horrible in comparison. In subjective terms it looks to be about the same quality as YouTube.  If you’re familiar with the Netflix instant play option via PC, it looks about the same. In playing, it’s quicker than the XBOX video marketplace as it doesn’t have to buffer as much. I suspect that on a standard definition television it would be quite sufficient and look similar to VHS.  It gave me ‘two bars’ on video quality testing. I recently requested an upgrade to 30Mbit down (they offer up to 60Mbit but it’s $100/mo), so hopefully that will go through soon and I can update you on whether that improves things, though I doubt the majority of people have such speeds available in the US.

UPDATE (12/05/08): Ok, so there are a few select titles that they currently offer in HD, which buffer quickly and look great in comparison to the above. It’s not very straightforward on the Netflix page which titles can stream in HD, but basically it seems that you have to go to the ‘blu-ray’ genre and look for the discs that offer ‘play it now’. I guess that would seem to make sense, but I don’t imagine they get their HD streaming content off of the Blu-Ray disc itself so I’m not sure why they’re coupled. For example, they could just as easily offer a link to the HD streaming from the DVD version of the title, or offer a section where you can browse just movies that can stream in HD.  If the plain streaming weren’t so unwatchable I wouldn’t care as much, but at this point the only streams worth watching are the HD titles offered through Netflix and the titles offered through XBOX Live marketplace.  I believe they’re just getting the hang of this and just came out of beta, so I look forward to improvements in this regard.

Category: Recreation
Wednesday, November 19th, 2008 | Author: admin

Ok, so this is more of a recreational post, but we can have one of these once in a while, right?  I, like many others, downloaded the much awaited fall XBOX update, dubbed the ‘NXE’, or New XBOX Experience.  I don’t care much for the new avatar system. I could take it or leave it, but it’s not too bad and not really overbearing and ‘in your face’ aside from actually forcing you to create one. If I had to like one thing about it, it’s the fact that my gamer pic is now of the avatar looking heroically upward and to the right, rather than the blue snail that I so enthusiastically chose from the anemic default options of the older system.

I haven’t had a lot of time to play with it, but I like the new interface much better than the blades. It opens the functionality up. For example, the ‘Iron Man’ rental from the video marketplace shows up and is visible without being out of place or looking like it’s in a designated ad spot that we’ve all been trained to ignore, whereas before I don’t think most people even realized they could rent movies.

I look forward to trying out the Netflix streaming, too.  I may update this post with my experiences on that, since it seems that most people have been looking forward to that feature the most.

I did run into a bit of a bug. I had previously had it connected to my Linux server for streaming my movie collection. It still works, but it had me download the codecs package again, and there was some weirdness with that. It acted as though I needed to download it, I got prompted to install it, then it showed it was installed but did not work. Then I went into the prompt to install again, where it was checked as downloaded and installed already. I selected it, and got ‘to use this feature, please launch the game that it was intended for’, or something to that effect. Instead, I opted to reinstall, and that time it actually showed it downloading, installing, and then it worked.

In all, though, I’d say this refresh was a good move.

Category: Recreation
Monday, November 17th, 2008 | Author: admin

What’s up world?  Things aren’t so great, eh?  Well, we feel the pain where I work as well, but it’s not so bad because we’ve always been stretched pretty thin.  Instead, people are leaving for greener pastures.  As for me, I’m sticking around for the moment.

Mainly I just wanted to post an update and stay in the habit of writing. I’ve got some documentation on LVM that I’m working on, but I took a tangent by deciding to refresh my 3D graphics hobby and am creating visuals for the write-up. It will primarily be a primer on the concept of LVM, followed by a how-to and hopefully some performance metrics.

As far as work is concerned, we’ve been working on several projects, such as migrating one of our VMware clusters off of SAN storage and on to Filer with plain old NFS.  The environment was way over-engineered, and we’ve found that we’ve got a huge, expensive piece of storage sitting there almost idle.  So off to the cheaper stuff with free (built-in, that is) de-duplication!  The other major project we’ve been working on is migrating one of our Oracle RAC clusters from one data center to another (both on-site) and in the process going from cheap, low performance SAN to more expensive, high performance SAN.  We’ve done the first portion and the systems are now running over ISL. Tomorrow we’ll take a few nodes down, move them, and bring them up in the new data center, then bring the other nodes down and move them.  In order to do this we had to fiber two switches together across rooms to provide the private cluster interconnects. Fun stuff.

I’ve also taken a look at the RHCE book. I’m about a fifth of the way through it, but haven’t done a whole lot as far as prep. Still looking forward to it, though.

Category: Stuff