About two years ago, Kyle Hailey suggested I blog about DTrace analytics in the ZFS Storage Appliance. I thought it was a good idea but never actually got around to doing it. This week, however, I had to debug a storage problem and may have an interesting story to share (whether it actually is interesting is up to you to decide). The investigation started because a database was sometimes experiencing slow IO. Slow meaning ‘db file sequential read’ waits exceeding 30ms (less than 10ms is what I would consider normal) for a majority of events. And sometimes because some queries still returned data quickly. I have not yet gotten around to the *sometimes* part, but I guess it has to do with caching on the storage array end. This ZFSSA has 24GB of DRAM that is used to hold what is called the ARC (Adaptive Replacement Cache), and maybe some of the queries hit data that was in the cache while others were just unlucky and had to fetch things from disk. But disk access should not take longer than 3-8ms and we were seeing an average of 25ms for some queries.
I just stumbled across this and could not find it anywhere else on the net. I set up a ZFS Appliance with Oracle VM and its storageconnect plugin according to the documentation PDF (which has pretty easy step-by-step instructions), but in this case the OVM Server and the ZFS Appliance were not in the same network, and access is denied by default in the firewall between those nets. So trying to register the appliance as FC storage led to this error, which just tells me that the connection timed out.
Oracle just released version 1.08 of the ZFSSA monitor app for iPhone and iPad. It now supports the new firmware versions AK-2011.1.7.0 (released just last week) and AK-2011.1.8.0 (not even out yet). Other than that, there does not seem to be anything new with the app and everything I have written about before still applies.
I just thought this was funny. It probably is not, but I’ll post it anyway. I just wanted to set the white locator LED on a 7120 ZFS storage appliance so that a field engineer could swap some parts to actually make it work. So I logged on to the iLOM (which is conveniently identical to all other Oracle hardware), checked the syntax for the LED again and tried to set it On:
Today, I attended an informal Oracle breakfast event which included a presentation by Joerg Moellenkamp about best practices for running Oracle databases on ZFS filesystems. There is a whitepaper that describes most of the issues that you should consider. In this post, I’d simply like to share my notes on the presentation: things that I found important or that were new to me.
Joerg made a case for using ZFS to mirror data because this will give ZFS another chance to repair broken blocks or checksums. I never thought about it that way and preferred to let the SAN take care of mirroring but I will consider this in the future.
I was aware that the more a zpool got filled up the more effort it was for the system to find free blocks and that this leads to slower performance. What I did not know is that zfs actually switches to a different algorithm to find free blocks at a certain level and that this level is configurable with metaslab_df_free_pct. Older releases switch at about 80% full and try to find the “best fit” for new data which slows things down even more. Read more about it here.
One issue that I found out about just a few days ago is that you cannot set the primarycache and secondarycache parameters independently. The way that L2ARC caching (using read-optimized SSDs as cache devices in a hybrid storage pool) works is by only writing to this second-level cache when data is cleaned out from the primary cache. So if you disable the caching of data or metadata for your primarycache (memory), then this data will also never make its way to your SSDs. This post is really helpful for understanding the internals behind it (and then it becomes very obvious).
The theory of “IOPS inflation” was also briefly discussed: Due to ZFS’s copy-on-write behaviour, blocks that are updated get written to a different location on disk which may lead to a degradation in performance for sequential reads that would benefit from the blocks being in the ‘proper’ order like backups or full scans. While this has not been an issue for our databases (and Joerg also mentioned that he only knows one case), I’d like to take some time and construct a demo for some further studies sometime.
Update 2013/02/27: Bart Sjerps wrote an excellent blog article that shows the fragmentation on the physical disks that occurs when updating random blocks with Oracle on ZFS. He uses SLOB and introduces a searchable ASCII string to look at the raw files. There is no conclusion (yet) about how big the impact on performance for full scans or backups is, but it does become very clear that fragmentation occurs easily and that this leads to more IOPS to the disks to read a number of “sequential” blocks.
The announcement probably comes at rather short notice, but on February 21 there will be another Oracle Business Breakfast in Hamburg, where Jörg Möllenkamp and Dirk Nitschke will talk about databases on ZFS and the ZFS Storage Appliance while attendees have breakfast at Oracle's expense. The concept has already proven itself at previous events, and the topic promises to be very interesting, especially since we looked into it only recently. Go to registration.
One of the items that seems to come up every time we deploy a new database (on Solaris) is whether we should use ASM or ZFS as the filesystem for the datafiles. They each have their advantages and disadvantages in management and features. I won’t go into explaining the differences between them and the reasons for or against using one or the other today, but Jason Arneil has a good introduction on this. He does not mention performance though, and this is an area where I would say that ZFS probably has a bit of overhead in comparison with ASM. I have never been able to put an actual number on the expected performance hit. My hunch was to prefer ASM when the DB has critical performance demands and consider ZFS in situations where the extra features come in handy.
But the day has come where I am actually interested in finding this out, so I sat down with Kevin Closson’s Silly Little Oracle Benchmark (SLOB) and designed a tiny little test case: a logical domain with 4 cores of a T4-2 SPARC server with Solaris 11.1, Oracle 184.108.40.206 and a 7320 ZFS storage appliance. I created two volumes (same parameters, zpool, everything) on the storage and attached them to the server via Fibre Channel. While both HBAs support 8Gb/s, the switch is a bit older and only supports 4Gb/s, which limits the throughput to about 800MB/s. The whole test data fits easily into the 24GB of DRAM cache in the controller, so at least for read workloads, we don’t have to worry about the performance of disks or SSDs at all.
I set up 32 SLOB users and ran the test with 32 readers in parallel. I looked at the output of iostat and confirmed the result in the AWR report, where it reports the physical reads per second. I also ran tests with 8 writers but did not bother to set up redo logs and such properly. I’ll post my numbers, but please take the write numbers with a grain of salt.
I have done these tests with ASM before, so I was not surprised to see that we max out at about 100,000 IOPS. Not too shabby. The limit here is the FC throughput, since we are hitting the 800MB/s with 100,000 8kB reads.
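The arithmetic behind that limit can be sketched in a few lines. This is only a back-of-the-envelope check using decimal units (1 MB = 1000 kB); the function names are made up for illustration.

```python
def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Throughput in MB/s for a given IO rate and IO size (decimal units)."""
    return iops * io_size_kb / 1000


def max_iops(link_mb_per_s: float, io_size_kb: float) -> float:
    """Upper bound on IOPS when the link bandwidth is the only limit."""
    return link_mb_per_s * 1000 / io_size_kb


# 100,000 reads of 8 kB each fill the ~800 MB/s Fibre Channel path.
print(throughput_mb_per_s(100_000, 8))  # 800.0
print(max_iops(800, 8))                 # 100000.0
```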
On my first run against the tablespace on ZFS, I did not bother modifying the ARC cache and simply let ZFS cache and read all my data, just for the fun of measuring and comparing the effect of caching on this level. I was expecting IOPS through the roof, close to the maximum number of LIOs that you would see when slobbing with a larger SGA. But to my surprise, that was not the case. The only thing through the roof was kernel load on the box, while iostat reported no reads against the LUN. Something very wrong must be going on there. But since this is not very representative of what we’d do in production, I went on with the tests that actually do limit the ZFS ARC cache size. Maybe this is something interesting for another day.
Next, I limited the ZFS ARC cache to about 100MB and re-ran my tests. This time, I was able to get the disks to do something, but I maxed out at about 6,000 IOPS. Then I realized that the ZFS recordsize was set to 128kB, which is probably a pretty stupid idea when doing random read tests on 8k data blocks. Oh, and I was not the first one to notice this; read about it in more detail from Don Seiler.
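A rough sketch of why the 128kB recordsize hurts so much here, assuming that every random 8k database read makes ZFS fetch a full 128kB record over the FC link (an assumption for illustration, not something measured in this test; the function names are made up):

```python
def read_amplification(record_kb: float, db_block_kb: float) -> float:
    """How many times more data is read than the database asked for."""
    return record_kb / db_block_kb


def link_load_mb_per_s(iops: float, record_kb: float) -> float:
    """Data moved over the link per second if every IO fetches a full record."""
    return iops * record_kb / 1000


print(read_amplification(128, 8))      # 16.0
print(link_load_mb_per_s(6_000, 128))  # 768.0
```

Under that assumption, 6,000 reads per second would already move about 768MB/s, which is close to the 800MB/s the FC path can deliver.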
Next round: drop the users, tablespace and datafile, set the zfs recordsize to 8k and set the cache to only hold metadata, recreate everything and re-run. Here are the settings I used for this zfs:
zfs set recordsize=8k slob/dbf
zfs set atime=off slob/dbf
zfs set primarycache=metadata slob/dbf
And sure enough, I did get somewhat meaningful numbers this time.
The AWR report listed about 74k IOPS. Interestingly, iostat showed only 60k IOPS against the disk volume, so maybe those additional 15k or so were really lucky to be fetched from the only 100MB of ARC cache. I’ll give ZFS the benefit of the doubt for now. Again, I did see a lot of kernel or system CPU utilization, but not as much as when I read everything from the ZFS cache.
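If the gap between the two numbers really is cache hits, the implied hit ratio is easy to work out. This is only a sketch under that assumption, with a made-up function name:

```python
def implied_hit_ratio(db_reads_per_s: float, disk_reads_per_s: float) -> float:
    """Fraction of database reads that never reached the disk volume."""
    return (db_reads_per_s - disk_reads_per_s) / db_reads_per_s


# 74k reads/s seen by the database vs. 60k reads/s seen by iostat
ratio = implied_hit_ratio(74_000, 60_000)
print(f"{ratio:.1%}")
```

That works out to a little under one in five reads, which is a lot for such a small cache.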
I did play around a little bit with the primarycache parameter: turning it off completely decreased the throughput, while enabling data to be cached increased it. But I was not interested in results that employ OS caching, so I chose to go with cached metadata only. I also briefly tried to disable checksumming but did not notice an effect for reads.
I also figured it could be interesting to compare this to UFS. The setup was easy enough: destroy the zpool, partition the disk, create a UFS filesystem with an 8k blocksize and make sure to mount it with the directio option. Then, run the same workload as previously. The performance is in the same ballpark as ZFS, equal for writes but about 20% less read IOPS.
|storage||read IOPS (32 readers)||write IOPS (8 writers)||CPU utilization (reading)|
In conclusion, in this test setup there is a significant performance hit in using ZFS or UFS versus ASM. And this is not only time lost in fetching data from disk but also actual CPU time being burned. Still, this is no reason to rule out ZFS completely for datafile storage. There are cases where you may want to employ easy cloning or transparent compression or even encryption and can live with a performance hit. And I am sure there are also cases where you can benefit from hybrid storage pools to outperform ASM. One could employ a local PCIe flash card as a cache to improve ZFS performance.
I would have also loved to compare NFS but the ethernet link in this setup is only 1Gb, so that would be rather pointless.
UPDATE 2012-01-10: I ran the ZFS tests again with the primarycache parameter set to metadata only to make sure I was not reading data from the ZFS cache at all and this time the numbers of iostat and the awr report matched. I was asked if I could also compare the performance to UFS and that was easily added.
UPDATE 2014-01-12: I have been thinking about these results for a while now and I might have made a mistake in the setup. I may have failed to properly align the partitions for ZFS and UFS. This would lead to one IO on the server resulting in two IOs to the storage array and would explain why the numbers were about 50% of the ASM numbers.
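To illustrate the alignment effect, here is a small sketch that counts how many backend blocks a single IO touches. The 8192-byte backend block size and the 512-byte partition offset are assumptions for the example, not values taken from the test setup.

```python
def backend_ios(offset_bytes: int, length_bytes: int, block_bytes: int) -> int:
    """Number of backend blocks touched by one IO at the given byte offset."""
    first_block = offset_bytes // block_bytes
    last_block = (offset_bytes + length_bytes - 1) // block_bytes
    return last_block - first_block + 1


BLOCK = 8192  # assumed backend block size

# Aligned: the 8k IO maps to exactly one backend block.
print(backend_ios(0, 8192, BLOCK))    # 1

# Misaligned by a hypothetical 512 bytes: the same IO straddles two blocks.
print(backend_ios(512, 8192, BLOCK))  # 2
```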
There was a time when a phone was a phone and you used it to make phone calls. We are past that time. I use my device more than I ever did, but the least recently used function is probably the “phone”. Anyway, here is a nifty, fun app that I like to show off, not much more than that. It is the ZFS appliance iPhone monitor. If you have a ZFS Storage Appliance 7120, 7320 or 7420, there is a little iPhone app that you can use to monitor and possibly analyze your storage system, as long as your phone has a way to reach the storage over the network. I use VPN for that.
You download the app from iTunes and enter the logon data for your storage appliance. The minimum firmware level mentioned in the release notes is 2010.08.17, but we have noticed that the app does not work with 2011.04.24.2. It is recommended to upgrade to the latest release anyway, and the app will work with that.
It is a good idea to create a new local user on the appliance that does not have administration rights granted. You then add the appliance to your app by entering a name, IP address, the name of the newly created user (or root if you are really confident) and the user’s password.
The app provides read-only access to the monitoring screens of the appliance, including your dashboard and analytics worksheets. Analytics uses DTrace probes in the underlying Solaris OS and enables you to dive deeply into everything that is going on inside your storage array, like IO latency by disk, throughput by initiator or target, cache hits and misses at the memory and flash level, and so on. This is good enough to show off the amazing suite of analytics and instrumentation on these devices. Unfortunately, you have to create and save analytics worksheets in a “real” browser before accessing them on your phone or tablet. The only thing you can do _to_ your appliance from the app is turn the system or disk locator LEDs on or off. You can’t create or modify volumes or shares from the app or reconfigure settings. That way, you cannot accidentally do anything stupid with your production systems.
I don’t really use this app for anything productive or for real-life performance troubleshooting but I do like it as a conversation-starter for ZFS storage. Another great, fun, geeky and equally useless starter is Brendan Gregg’s excellent video about shouting in the datacenter.