ASM vs ZFS vs UFS performance

One of the items that seems to come up every time we deploy a new database (on Solaris) is whether we should use ASM or ZFS as the filesystem for the datafiles. They each have their advantages and disadvantages in management and features. I won’t go into the differences between them or the reasons for or against using one or the other today; Jason Arneil has a good introduction on this. He does not mention performance though, and that has been a point where I would say that ZFS probably carries a bit of overhead compared with ASM. I have never been able to put an actual number on the expected performance hit. My hunch was to prefer ASM when the DB has critical performance demands and to consider ZFS in situations where the extra features come in handy.

But the day has come when I am actually interested in finding out, so I sat down with Kevin Closson’s Silly Little Oracle Benchmark (SLOB) and designed a tiny little test case: a logical domain with 4 cores of a T4-2 SPARC server running Solaris 11.1, Oracle 11.2.0.3 and a 7320 ZFS storage appliance. I created two volumes (same parameters, zpool, everything) on the appliance and attached them to the server via Fibre Channel. While both HBAs support 8Gb/s, the switch is a bit older and only supports 4Gb/s, which limits the throughput to about 800MB/s. The whole test data set fits easily into the 24GB of DRAM cache in the controller, so at least for read workloads we don’t have to worry about the performance of disks or SSDs at all.

I set up 32 SLOB users and ran the test with 32 readers in parallel. I looked at the output of iostat and confirmed the result against the physical reads per second in the AWR report. I also ran tests with 8 writers, but did not bother to set up redo logs and such properly. I’ll post those numbers too, but please take them with a grain of salt.
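
As an aside, if you want to watch the LUNs while a run is in flight, plain iostat in extended mode does the trick on Solaris (the 5 second interval is just what I find convenient):

iostat -xnz 5    # extended per-device stats every 5 seconds, idle devices suppressed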

I have done these tests with ASM before, so I was not surprised to see that we max out at about 100,000 IOPS. Not too shabby. The limit here is the FC throughput, since 100,000 8kB reads per second add up to exactly the 800MB/s the 4Gb/s link can deliver.

On my first run against the tablespace on ZFS, I did not bother limiting the ARC and simply let ZFS cache and read all my data, just for the fun of measuring and comparing the effect of caching on this level. I was expecting IOPS through the roof, close to the maximum number of LIOs you would see when slobbing with a larger SGA. To my surprise, that was not the case. The only thing through the roof was kernel load on the box; iostat reported no reads against the LUN at all. Something must be going very wrong there. But since this is not representative of what we’d do in production, I went on with the tests that actually limit the ZFS ARC size. Maybe this is something interesting for another day.

Next, I limited the ZFS ARC to about 100MB and re-ran my tests. This time I was able to get the disks to do something, but I maxed out at about 6,000 IOPS before realizing that the ZFS recordsize was still at its default of 128kB, which is probably a pretty stupid idea when doing random reads of 8k data blocks. And I was not the first one to notice this; Don Seiler has written about it in more detail.
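
For reference, capping the ARC on Solaris is typically done via the zfs_arc_max tunable in /etc/system, something along these lines (the value is in bytes and a reboot is required for it to take effect):

* /etc/system: cap the ZFS ARC at roughly 100MB
set zfs:zfs_arc_max = 104857600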

Next round: drop the users, tablespace and datafile, set the ZFS recordsize to 8k, set the cache to hold metadata only, recreate everything and re-run. Here are the settings I used for this dataset:

zfs set recordsize=8k slob/dbf            # match the 8k Oracle block size
zfs set atime=off slob/dbf                # no access time updates on datafiles
zfs set primarycache=metadata slob/dbf    # keep only metadata in the ARC, not data blocks
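
Just as a sanity check (not part of the test itself), zfs get confirms that the properties are in place. Keep in mind that recordsize only applies to blocks written after the change, which is why everything had to be recreated:

zfs get recordsize,atime,primarycache slob/dbf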

And sure enough, I did get somewhat meaningful numbers this time. The AWR report listed about 74k IOPS. Interestingly, iostat showed only 60k IOPS against the disk volume, so maybe those additional 15k really were lucky enough to be fetched from the mere 100MB of ARC. I’ll give ZFS the benefit of the doubt for now. Again, I saw a lot of kernel or system CPU utilization, though not as much as when reading everything from the ZFS cache.
I played around a little with the primarycache parameter: turning it off completely decreased the throughput, while allowing data to be cached increased it. Since I was not interested in results that rely on OS-level caching, I chose to stick with caching metadata only. I also briefly tried disabling checksums but did not notice any effect on reads.

I also figured it could be interesting to compare this to UFS. The setup was easy enough: destroy the zpool, partition the disk, create a UFS filesystem with an 8k blocksize and make sure to mount it with the forcedirectio option. Then run the same workload as before. The performance is in the same ballpark as ZFS: equal for writes, but roughly 15% fewer read IOPS.
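
For completeness, the UFS variant boils down to something like the following; the device and mount point names are made up, forcedirectio bypasses the filesystem cache and the 8192 byte block size matches the Oracle blocks:

mkdir -p /u02/slob
newfs -b 8192 /dev/rdsk/c2t1d0s0                               # 8k logical block size
mount -F ufs -o forcedirectio /dev/dsk/c2t1d0s0 /u02/slob      # direct I/O, no OS caching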

storage    read IOPS (32 readers)    write IOPS (8 writers)    CPU utilization (reading)
ASM        97,845                    9,203                     55%
ZFS        48,902                    7,391                     69%
UFS        41,320                    7,469                     60%

In conclusion: in this test setup there is a significant performance hit when using ZFS or UFS instead of ASM. And it is not only time lost fetching data from disk, but also actual CPU time being burned. Still, this is no reason to rule out ZFS for datafile storage completely. There are cases where you may want easy cloning, transparent compression or even encryption and can live with a performance hit. And I am sure there are also cases where you can benefit from hybrid storage pools to outperform ASM, for example by employing a local PCIe flash card as a cache to improve ZFS performance.
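
To illustrate that last point: adding a flash device as a second-level read cache (L2ARC) to the pool is a one-liner; the device name here is purely hypothetical:

zpool add slob cache c3t0d0    # attach c3t0d0 as an L2ARC device to the slob pool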

I would have also loved to compare NFS but the ethernet link in this setup is only 1Gb, so that would be rather pointless.

UPDATE 2012-01-10: I ran the ZFS tests again with the primarycache parameter set to metadata only, to make sure I was not reading any data from the ZFS cache, and this time the numbers from iostat and the AWR report matched. I was also asked whether I could compare the performance to UFS, and that was easily added.

UPDATE 2014-01-12: I have been thinking about these results for a while now and I might have made a mistake in the setup: I may have neglected to properly align the partitions for ZFS and UFS. That would cause a single IO on the server to turn into two IOs against the storage array and would explain why those numbers came in at about 50% of the ASM numbers.
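
If I ever repeat the test, checking the alignment is easy enough: prtvtoc prints the first sector of every slice, and that offset should be a multiple of the array’s internal block or stripe size (again, the device name is just an example):

prtvtoc /dev/rdsk/c2t1d0s2    # check the "First Sector" column for each slice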

5 thoughts on “ASM vs ZFS vs UFS performance”

  1. Bjorn:

    Very nice post and use of SLOB. It’s quite interesting to see ~25,000 SLOB-PIOPS/s per core on T4! Would you be interested in collecting LIOPS for us? It’s simple. Just make the SGA sufficiently large to eliminate PREADS. Then modify reader.sql to loop 500,000 times, as cached test execution needs many more iterations. Then run a scale-up loop that probes for the number of SLOB sessions per core that gives you the best SLOBops/core. A SLOBop is one iteration of the reader.sql work loop, so if the loop is 500,000 and you run a single SLOB user (session) then you’ve performed 500K SLOBops. Then divide by job completion time to get SLOBops/s.

    Also, did you have to tweak any of the scripts in the SLOB kit for Solaris?

  2. Howdy Bjoern,

    I’m interested in your ASM test details as well. How did you set up the test? Did you use the same ZFS SA, map a single LUN through the FC, and slap ASM on it?

  3. Kevin, the only thing I needed to modify was the makefile. My compiler is gcc, not cc, but that was no biggie (it would have saved 3 seconds if the variable defined on the first line were actually used).
    I have already done a lot of LIO testing, but I still have to make sense of it, tidy things up nicely and write that blog post.

    Kelvin, that is almost how I set this up. From the same ZFS SA I created two (identical) LUNs and mapped both through FC. One forms an ASM diskgroup, the other was used to create a zpool. And I think it would also be a useful test to drop the zpool and try with UFS.

  4. Hello,

    Nice post.

    If you want to check the I/O evolution per second (number, performance…) during the SLOB test, instead of the average per second provided by the AWR snapshot, you may find these scripts helpful:

    sysstat.pl
    and
    system_event.pl

    These scripts basically take snapshots of the associated cumulative views each second (default interval) and compute the differences with the previous snapshot.

    You can filter on the stats and wait events of interest.

    These scripts are available here: http://bdrouvot.wordpress.com/perl-scripts-2/

    That way you get real-time information about what’s going on instead of the aggregated picture provided by AWR.

    Bertrand

  5. Pingback: Oracle Frühstück zu Datenbanken auf ZFS | portrix systems
