One of the items that seems to come up every time we deploy a new database (on solaris) is whether we should use ASM or ZFS as filesystem for the datafiles. They each have their advantages and disadvantages in management and features. I won’t go into explaining the differences between and reasons for or against using one or the other today but Jason Arneil has a good introduction on this. He does not mention performance though and this has been an issue where I would think/say that ZFS propably has a bit of overhead in comparison with ASM. I have never been able to put an actual number on the expected performance hit. My hunch was to prefer ASM when the DB has critical performance demands and consider ZFS in situations where the extra features come in handy.
But the day has come where I am actually interested in finding this out, so I sat down with Kevin Closson’s Silly Little Oracle Benchmark (SLOB) and designed a tiny little test case: A logical domain with 4 cores of a T4-2 sparc server with Solaris 11.1, Oracle 188.8.131.52 and a 7320 ZFS storage appliance. I created two volumes (same parameters, zpool, everything) on the Storage and attached it to the server via Fibre Channel. While both HBAs support 8Gb/s, the switch is a bit older and only supports 4Gb/s which limits the throughput to about 800MB/s. The whole test data fits easily into the 24GB of DRAM cache in the controller, so at least for read workloads, we don’t have to worry about the performance of disks or SSDs at all.
I set up 32 slob users and ran the test with 32 readers in parallel. I looked at the output of iostat and confirmed the result in the awr report where it reports the physical reads per second. I also ran tests with 8 writers but did not bother to set up redo logs and such properly. I’ll post my numbers, but please take the write numbers with a grain of salt.
I have done these tests with ASM before, so I was not surprised to see that we max out at about 100.000 IOPS. Not too shabby. The limit here is the FC throughput since we are hitting the 800MB/s with 100.000 8kB reads.
On my first run against the tablespace on ZFS, I did not bother modifying the ARC cache and simply let ZFS cache and read all my data, just for the fun of measuring and comparing the effect of caching on this level. I was expecting IOPS through the roof, close to the maximum number of LIOs that you would see when slobbing with a larger SGA. But to my surprise, that was not the case. The only thing through the roof was kernel load on the box, iostat reported no reads against the LUN. Something very wrong must be going on there. But since this is not very representative of what we’d do in production, I went on with the tests that actually do limit the ZFS ARC cache size. But maybe this is something interesting for another day.
Next, I limited the ZFS ARC cache to about 100MB and re-ran my tests. This time, I was able to get the disks to do something, but I maxed out at about 6.000 IOPS when I realized that the ZFS blocksize was set to 128kB which is propably a pretty stupid idea when doing random read tests on 8k data blocks. Oh, and I was not the first one to notice this, read about it in more detail from Don Seiler.
Next round: drop the users, tablespace and datafile, set the zfs recordsize to 8k and set the cache to only hold metadata, recreate everything and re-run. Here are the settings I used for this zfs:
zfs set recordsize=8k slob/dbf zfs set atime=off slob/dbf zfs set primarycache=metadata slob/dbf
And sure enough, I did get somewhat meaningful numbers this time.
The awr report listed about 74k IOPS. Interestingly, iostat showed only 60k IOPS against the disk volume, so maybe those additional 15k were really lucky to be fetched from the only 100MB of arc cache. I’ll give ZFS the benefit of doubt for now. Again, I did see a lot of kernel or system CPU utilization, but not as much as when I read everything from the ZFS cache.
I did play around a little bit with the primarycache parameter and turning it off completely did decrease the throughput while enabling data to be cached increased it but I was not interested in results that employ OS caching so I chose to go with the cached metadata only. I also briefly tried to disable checksumming but did not notice an effect for reads.
I also figured it could be interesting to compare this to UFS. The setup was easy enough: destroy the zpool, partition the disk, create a UFS filesystem with an 8k blocksize and make sure to mount it with the directio option. Then, run the same workload as previously. The performance is in the same ballpark as ZFS, equal for writes but about 20% less read IOPS.
|storage||read IOPS (32 readers)||write IOPS (8 writers)||CPU utilization (reading)|
In conclusion, in this test setup there is a significant performance hit in using ZFS or UFS versus ASM. And this is not only time lost in fetching data from disk but also actual CPU time being burned. Still, this is no reason to rule out ZFS completely for datafile storage. There are cases where you may want to employ easy cloning or transparent compression or even encryption and can live with a performance hit. And I am sure there are also cases where you can benefit from hybrid storage pools to outperform ASM. One could employ a local PCIe flash card as a cache to improve ZFS performance.
I would have also loved to compare NFS but the ethernet link in this setup is only 1Gb, so that would be rather pointless.
UPDATE 2012-01-10: I ran the ZFS tests again with the primarycache parameter set to metadata only to make sure I was not reading data from the ZFS cache at all and this time the numbers of iostat and the awr report matched. I was asked if I could also compare the performance to UFS and that was easily added.
UPDATE 2014-01-12: I have been thinking about these results for a while now and I might have made a mistake in the setup. I may have missed to properly align the partitions for ZFS and UFS. This would lead to one IO on the server resulting in two IOs to the storage array and explain why the numbers were about 50% of the ASM numbers.