About two years ago, Kyle Hailey suggested I blog about DTrace analytics in the ZFS Storage Appliance. I thought it would be a good idea but I never actually got around to doing it. But this week I had some debugging to do on a storage problem and may have an interesting story to share (whether it is actually interesting is up to you to decide). The investigation started because a database was sometimes experiencing slow IO. Slow meaning ‘db file sequential read’ waits exceeding 30ms (less than 10ms is what I would consider normal) for the majority of events. And sometimes because some queries still returned data quickly. I still have not gotten around to the *sometimes* part, but I guess it has to do with caching on the storage array end. This ZFSSA has 24GB of DRAM that is used to hold what is called the ARC (Adaptive Replacement Cache), and maybe some of the queries hit data that was in the cache while others were just unlucky and had to fetch things from disk. But a disk access should not take longer than 3-8ms, and we were seeing an average of 25ms for some queries.
Anyway. The storage for this database resides on a ZFSSA, and it shares the Appliance with a bunch of other systems. Some of them are also databases, some are VMs and some are SAN boot volumes for Solaris servers. The ZFS Storage Appliance (a 7120 in this case) is basically an x64 server running Solaris plus a special GUI for management and analytics. This device has 11 disks: one is a spare and the other 10 form a mirrored zpool. There is also an SSD that is used to cache writes to the pool, a ZIL (ZFS Intent Log) in ZFS speak. The Solaris portion of the appliance not only provides ZFS but also access to DTrace instrumentation. A number of DTrace probes are accessible through the GUI in an easy analytics dashboard.
But first a word of warning: the Appliance has a “status” dashboard page with a bunch of graphs on it, along with cute icons showing the status ranging from sunny via cloudy and rainy to hurricane-style storm. This page is mostly useless. The threshold values for the icons do not make much sense, and the averages you see here (CPU, IO load, etc.) can only be used to get a very rough idea of what is going on.
So the first question I had was whether we could actually see the same latency issue on the ZFSSA that we saw in the database trace files. Easy: bring up an analytics worksheet and add the statistic for Fibre Channel operations broken down by latency. This is roughly how it looks. And as you can see, there are plenty of IOs with significant latency.
Alright, we are on to something. We can also confirm whether this latency is coming from the disks or from somewhere else in the software stack by looking at the latency by disk. And again you can see that of roughly 1000 IOPS in total, about a third cannot be satisfied in less than 10ms.
So either someone is shouting at our disks in the datacenter or we know that we do have a problem. Actually, I would not expect this Appliance to support much more than about 1000 random disk IOPS (it is just 10 disks at a slow 7,200rpm), and the next question is: where is this coming from? The next screenshot shows disk IOPS (by type of operation), FC requests by type and lastly the FC operations drilled down by LUN.
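That ~1000 IOPS ceiling can be sanity-checked with a quick back-of-the-envelope calculation. This is just a sketch under my own assumptions: the 8ms average seek time is a typical figure for a 7,200rpm drive, not something measured on this Appliance.

```python
# Rough estimate of random IOPS for a pool of ten 7,200rpm disks.
# AVG_SEEK_MS is an assumed typical average seek time, not a measured value.

RPM = 7200
AVG_SEEK_MS = 8.0                        # assumption: typical 7,200rpm drive
avg_rotational_ms = (60_000 / RPM) / 2   # half a revolution: ~4.17 ms

# One random IO costs roughly one seek plus half a rotation.
iops_per_disk = 1000 / (AVG_SEEK_MS + avg_rotational_ms)
pool_iops = 10 * iops_per_disk           # 10 data disks in the mirrored pool

print(f"~{iops_per_disk:.0f} IOPS per disk, ~{pool_iops:.0f} IOPS for the pool")
```

This lands in the low 800s for the whole pool, which is why sustained load north of 1000 random disk IOPS on this box is already suspicious.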
There are a number of interesting observations:
- there is a ton of writes going to the disks
- the number of write operations coming into the Appliance via FC is an order of magnitude lower (less than 100 compared to more than 1000) but shows the same pattern
- most of these writes come from two LUNs. These LUNs are exported to two Solaris servers that have a ZFS filesystem built on top of them.
After a bit of thinking and digging around, we can explain why the incoming IOs are amplified like that. And the reason is as simple and stupid as it can be. I just mentioned that there is a ZFS filesystem built on top of that LUN or volume. That ZFS is using the default recordsize of 128kB, meaning that all files larger than 128kB are written in chunks of 128kB, so every write operation sent to the storage backend is 128kB in size. You can verify this on the server like this:
```
sol11server ~ # zfs get recordsize zp01
NAME  PROPERTY    VALUE  SOURCE
zp01  recordsize  128K   default
```
The zvol on the storage side, on the other hand, has a different recordsize. You can check this by clicking the edit button on the LUN in the Shares/LUN view of the ZFSSA. This is how it will look:
So on this end the recordsize is a mere 8kB. And this is where the puzzle pieces come back together. The ZFS on the server writes data out in chunks of 128kB. Each of these is split into 16 chunks of 8kB, and each of those can reside on a different disk. This is what amplifies the workload between the LUN and the physical disks. So the very easy (and rather obvious) takeaway: align the recordsizes when you build a ZFS filesystem on a zvol. Actually, this is true for pretty much any filesystem. What makes this trap so easy to fall into is the fact that both of these values are defaults. When you create a zpool and ZFS filesystem on Solaris, 128kB is the default. And when you create a zvol on the ZFS Storage Appliance, 8kB is the default.
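The fan-out can be sketched numerically. The 80 writes/s figure below is a hypothetical number chosen for illustration (the actual graphs showed "less than 100"), not a value read off this Appliance:

```python
# Sketch of the write amplification between the server-side ZFS filesystem
# and the appliance: each 128 kB server-side record is split into 8 kB
# zvol blocks on the backend.

SERVER_RECORDSIZE_KB = 128   # Solaris ZFS filesystem default
ZVOL_BLOCKSIZE_KB = 8        # ZFSSA zvol/LUN default

backend_writes_per_record = SERVER_RECORDSIZE_KB // ZVOL_BLOCKSIZE_KB
print(backend_writes_per_record)   # 16 backend writes per incoming write

# Hypothetical incoming FC write rate, for illustration only:
incoming_fc_writes_per_s = 80
print(incoming_fc_writes_per_s * backend_writes_per_record)   # 1280
```

A 16x factor is consistent with what the analytics showed: fewer than 100 FC write requests per second fanning out to well over 1000 disk writes per second.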
My next task: figure out how to get out of this. You cannot change the recordsize of an existing zvol, so I am out of luck on the ZFSSA side. You can change the recordsize of a ZFS filesystem, but I *think* that will only affect new files, not existing ones. I have to check. The last option would be to create a new volume and migrate the storage from the old one, but that requires some downtime for the applications on that server.
But what I really wanted to show is how the DTrace analytics part of the ZFSSA helps to quickly identify these issues. In this case it was the ability to check the incoming IO requests on a per-LUN or per-initiator basis to figure out where to look for the underlying problem.