DTrace analytics in ZFS Storage Appliance

About two years ago, Kyle Hailey suggested I blog about DTrace analytics in the ZFS Storage Appliance. I thought it would be a good idea but I never actually got around to doing it. This week, however, I had some debugging to do on a storage problem and may have an interesting story to share (whether it actually is interesting is up to you to decide). The investigation started because a database was sometimes experiencing slow IO. Slow meaning ‘db file sequential read’ waits exceeding 30ms (less than 10ms is what I would consider normal) for a majority of events. And *sometimes* because some queries still returned data quickly. I still have not gotten around to the *sometimes* part of it, but I guess it has to do with caching on the storage array end. This ZFSSA has 24GB of DRAM that is used to hold what is called the ARC (Adaptive Replacement Cache), and maybe some of the queries hit data that was in the cache while others were just unlucky and had to fetch everything from disk. But disk access should not take longer than 3-8ms and we were seeing an average of 25ms for some queries. Continue reading
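
As a quick aside that is not part of the original post: a latency distribution like the one we were staring at can also be collected on the database host with the DTrace io provider. This is just a minimal sketch, and the aggregation key and the microsecond scale are my own choice:

# bucket the time between io:::start and io:::done per device, in microseconds (Ctrl-C to print)
dtrace -n 'io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ { @[args[1]->dev_statname] = quantize((timestamp - ts[arg0]) / 1000); ts[arg0] = 0; }'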

revisiting Solaris 11 Automatic Installer without DHCP

A while ago I blogged about my experience with the new automatic installer in Solaris 11 and my special setup in which I refused to use a DHCP service, because DHCP is simply something we do not use in our datacenter. That particular blog post has been one of my most popular (by visits), so this must be something that other admins experience as well.

One of the issues I came across was that while you can easily specify the IP address, gateway and installation server on the command line without DHCP, the install will break at the point where it needs to install some packages from ‘pkg.oracle.com’ but cannot resolve that name because no DNS servers have been set up. I described a rather tedious workaround in that post but came up with something a little better today. Continue reading
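
This is not necessarily the exact workaround from the full post, but for context: on Solaris 11, DNS is configured through SMF rather than by editing /etc/resolv.conf directly, so whatever the installer ends up doing has to boil down to something like this (name server address and domain are placeholders):

# point the dns/client service at a name server and search domain (example values)
svccfg -s svc:/network/dns/client setprop config/nameserver = net_address: "(192.0.2.53)"
svccfg -s svc:/network/dns/client setprop config/domain = astring: "example.com"
svcadm refresh svc:/network/dns/client
svcadm enable svc:/network/dns/client
# and host lookups have to be switched over to DNS as well
svccfg -s svc:/system/name-service/switch setprop config/host = astring: '"files dns"'
svcadm refresh svc:/system/name-service/switch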

troubleshooting a CPU-eating java process

Just yesterday we had a situation where a java process (a JBoss app server) started to raise the CPU load on a machine significantly. The load average increased and top showed that the process consumed a lot of CPU. The planned “workaround”? Blame something (garbage collection is always an easy victim), restart and hope for the best. After all, there is not much to diagnose here if there is nothing in the logs, right? Wrong! There are a few things we can do with a running java VM to diagnose these issues from a Solaris or Linux command line and find the root cause. Continue reading
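
The gist of the technique is a sketch like the following (PID and thread numbers are made up): map the hottest LWP that prstat reports to the matching Java thread in a jstack dump via its native thread id.

# show per-thread CPU usage of the Java process (PID 4242 is just an example)
prstat -mL -p 4242
# convert the LWP id of the top consumer (say 57) to hex
printf "0x%x\n" 57
# dump all Java thread stacks and find the thread with the matching native id
jstack 4242 | grep "nid=0x39"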

notes on Oracle’s new lineup of SPARC T5 and M5 servers

Larry Ellison himself just announced the latest generation of SPARC processors. Both he and John Fowler talked a lot about benchmarks and how these systems compete with IBM Power systems. Very exciting stuff, but the announcement was a bit light on technical details. I have compiled some information about these new systems:

The T4, T5 and M5 share more than the names suggest. All of these CPUs are made up of the same basic building block: the S3 core. So each of these processors has the same basic per-core features of 8 threads, two pipelines and everything else that was already present in the T4. The T4 has 8 of those S3 cores per chip and 4MB of L3 cache. The T5 packs twice the cores and cache: 16 cores and 8MB of L3 cache, clocked at 3.6GHz compared to the T4’s 2.85GHz (3GHz in the T4-4).
The name M5 would suggest a successor to the SPARC64 chips used in the M3000 to M9000 series systems. But in fact the M5 CPU is also built from the same S3 core as the T4 and T5, with 6 cores and 48MB of L3 cache. So all of these chips support Oracle VM for SPARC (I still like to call it LDoms), even with live migration across these machines. They also all feature 10GbE and cryptographic acceleration directly on the CPU.

The T4 systems are still available and won’t be EOL’d anytime soon. In fact, the T5 completes the portfolio at the top end while the smaller and mid-size systems continue to be covered by T4 systems. The glueless 8-socket T5-8 is the top of the line with 128 cores that all access the same main memory with a single hop. But even when you compare the 2- and 4-socket variants it becomes clear that for the T5, size does matter. What Larry and John did not mention was that in addition to the T5-2, T5-4 and T5-8 there will also be a T5-1B blade module with just a single T5 CPU, but there will be no T5-1 (as of today).

I am obviously a big fan of Solaris but I am also very curious whether Oracle will ever bring Linux to the SPARC platform. Back in the old days, Sun had a cooperation with Ubuntu on the T1000 and T2000 systems for some time, but it was not a huge success. Larry made some comments in 2010 about wanting to port OEL to SPARC but has yet to follow through.

When the systems were announced, Oracle had yet to update the core factor table to include a factor for the T5 chips, and anything but 0.5 would have put a dent into all performance/price calculations. By now the core factor table has been updated and lists both the T5 and M5 processors with a factor of 0.5, which means that organizations can upgrade existing SPARC or Intel database systems to the same number of T5 cores without having to worry about adding extra costs for new DB licenses (for example, 16 cores at a factor of 0.5 still come out to 8 processor licenses). Additionally, LDom virtualization and hard partitioning help out by allowing you to run a database on a subset of cores and only license what you actually need or use.

trying to make solaris a little better every day

I just received a long error stack followed by this nice little request to log an SR and report this.

pkg: This is an internal error in pkg(5) version 93c2e5a1fc89. Please log a
Service Request about this issue including the information above and this
message.

I guess we’ll see whether the issue gets fixed because of my report, but having logged a lot of feature requests and bug reports recently I have to say it is very satisfying to go over release notes and find fixes for things you reported. By the way, if you are curious what happened up there: I (actually a script I was calling) tried to set up a repository from a local directory which had some whitespace in the path, which is apparently not cool if you want to turn that into a URI.
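
To give you an idea of the shape of the command that triggered it: something along these lines, with a made-up path, is roughly what the script did (I have not re-checked whether this exact sequence still reproduces the traceback):

# a local package repository in a directory with whitespace in its name
pkgrepo create "/var/tmp/local repo"
# adding that path as a publisher origin makes pkg(5) turn it into a file:// URI,
# and the whitespace in the path is what tripped the internal error for me
pkg set-publisher -g "/var/tmp/local repo" solaris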

Oracle databases on ZFS

Today I attended an informal Oracle breakfast event which included a presentation by Joerg Moellenkamp about best practices for running Oracle databases on ZFS filesystems. There is a whitepaper that describes most of the issues you should consider. In this post, I’d simply like to share my notes on the presentation: things that I find important or that were new to me.

Joerg made a case for using ZFS to mirror data because this gives ZFS another chance to repair blocks whose checksums do not match. I never thought about it that way and preferred to let the SAN take care of mirroring, but I will consider this in the future.
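
A minimal sketch of what that means in practice, with placeholder device names: with a ZFS-level mirror, a block that fails its checksum on one side can be repaired from the other copy, and a scrub lets ZFS verify and fix every block proactively.

# mirror two LUNs at the ZFS level instead of relying only on the SAN
zpool create datapool mirror c0t5000C50012345678d0 c0t5000C50087654321d0
# read and verify every block; any repairs show up under the CKSUM column
zpool scrub datapool
zpool status -v datapool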

I was aware that the more a zpool fills up, the more effort it takes the system to find free blocks, and that this leads to slower performance. What I did not know is that ZFS actually switches to a different algorithm for finding free blocks at a certain level, and that this level is configurable with metaslab_df_free_pct. Older releases switch at about 80% full and try to find the “best fit” for new data, which slows things down even more. Read more about it here.
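
For the curious: like other ZFS tunables, the current value can be read from the running kernel with mdb and overridden in /etc/system. The value 35 below is only an example, not a recommendation.

# read the current threshold from the running kernel
echo "metaslab_df_free_pct/D" | mdb -k
# override it persistently (example value, takes effect after the next reboot)
echo "set zfs:metaslab_df_free_pct = 35" >> /etc/system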

One issue that I only found out about a few days ago is that you cannot set the primarycache and secondarycache parameters independently. The way L2ARC caching (using read-optimized SSDs as cache devices in a hybrid storage pool) works is that data is only written to this second-level cache when it is evicted from the primary cache. So if you disable the caching of data or metadata for your primarycache (memory), then this data will also never make its way to your SSDs. This post is really helpful for understanding the internals behind it (and then it becomes very obvious).
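
The properties themselves are ordinary dataset properties (the dataset name below is a placeholder); the point is simply that primarycache=metadata effectively also starves the L2ARC of data blocks, no matter what secondarycache says.

# show what is cached in the ARC (memory) and the L2ARC (cache devices) for a dataset
zfs get primarycache,secondarycache dbpool/oradata
# caching only metadata in the ARC also means data blocks will never reach the L2ARC
zfs set primarycache=metadata dbpool/oradata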

The theory of “IOPS inflation” was also briefly discussed: due to ZFS’s copy-on-write behaviour, blocks that are updated get written to a different location on disk, which may lead to degraded performance for sequential reads, such as backups or full scans, that would benefit from the blocks being in the ‘proper’ order. While this has not been an issue for our databases (and Joerg also mentioned that he only knows of one such case), I’d like to take some time and construct a demo for further study.

Update 2013/02/27: Bart Sjerps wrote an excellent blog article that shows the fragmentation on the physical disks that occurs when updating random blocks with Oracle on ZFS. He uses SLOB and introduces a searchable ASCII string to look at the raw files. There is no conclusion (yet) about how big the impact on performance for full scans or backups is, but it does become very clear that fragmentation occurs easily and that it will lead to more IOPS to the disks when reading a number of “sequential” blocks.

Solaris 11 datalink multipath

The release notes of Solaris 11.1 mention a new network trunking feature that does not require LACP or the setup of IPMP. I finally had time to take a closer look at it. The motivation for trunking or aggregation is availability and load balancing. The idea is to combine multiple interfaces so that the system survives the failure of a NIC, cable or switch (availability) and also allows a higher data throughput than a single interface could provide (load balancing). With Solaris 11.1 there are three basic methods to choose from and I want to briefly introduce them. There actually is a pretty good comparison chart in the documentation. Continue reading
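
As a teaser, with example interface names: the new mode is exposed through dladm as just another aggregation mode, so a datalink multipathing trunk over two NICs boils down to this.

# create a DLMP aggregation over two physical links; no switch-side LACP configuration needed
dladm create-aggr -m dlmp -l net0 -l net1 aggr0
# check the aggregation mode and the state of its ports
dladm show-aggr -x aggr0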

using PCIe direct IO with ldoms

Recent versions of Logical Domains (or Oracle VM for SPARC) allow you to assign single PCIe devices to a guest ldom so that IO from that ldom does not have to go through the primary domain. I am setting this up for two FC HBAs on a T4-2 system with two domains, one for prod and one for test. Assigning DIO devices to a guest domain (which then becomes an I/O domain) will prevent you from doing live migration of this domain, and it also introduces a new dependency on the primary domain: if the primary goes down or reboots, so does the PCIe bus and with it the access to the HBA. But since we also boot from a ZFS-backed virtual disk provided by the primary domain, this dependency was already there as well. Another option would be to assign a whole PCIe bus to a guest domain (making it a so-called root domain), but extra caution needs to be taken if the primary domain boots from a disk controller attached to the PCIe bus that is to be shared. And some more thought needs to go into your networking configuration as well.
The whole process is well documented; this post basically repeats the steps that I have taken and adds the multipath configuration in the guest domain. Continue reading
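
In compressed form the sequence looks roughly like this; the endpoint name is a placeholder you would take from the ldm list-io output of your own system.

# list the PCIe buses and endpoint devices and which domain currently owns them
ldm list-io
# release the endpoint from the primary domain under delayed reconfiguration, then reboot the primary
ldm start-reconf primary
ldm remove-io /SYS/MB/PCIE5 primary
init 6
# after the reboot, hand the endpoint to the stopped guest domain and start it
ldm add-io /SYS/MB/PCIE5 guest1
ldm start guest1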

debugging IO calls in Solaris 11 with kstat

Solaris 11 update 1 was released just a few days ago and I have been curious to find out about its new features. One of the things mentioned in the What’s New guide that caught my attention was “File System Statistics for Oracle Solaris Zones”. Until now, there was no way to tell which zone was responsible for how much IO. So if you found high IO utilization on your devices using iostat or zpool iostat or anything similar, it was no trivial task to find the responsible zone or process. DTrace scripts like iotop or iosnoop helped a bit, but in the case of ZFS they do not get you very far because they do not report the user process that is really responsible for the IO. This is where these new kstat statistics for each filesystem type in each zone come in handy. Continue reading
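
The pre-11.1, system-wide view of these counters already exists as the vopstats kstats that fsstat is built on, and the per-zone breakdown lives in the same kstat framework. The grep below is only a way to browse for the new statistics; the actual names are in the full post.

# system-wide ZFS vnode operation counters (this is what fsstat reports)
kstat -p unix:0:vopstats_zfs
# browse the kstat namespace for the new per-zone filesystem statistics
kstat -p | grep -i vopstats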

Solaris 11.1 update (new features)

John Fowler announced the first major update release of Solaris 11 at OpenWorld in October and now it is available for download. I have not tried it yet but will definitely do so in the very near future. The changelog (What’s New in Solaris 11.1) guide is 19 pages long and has a lot of information. Some of these new features are very interesting and I hope I can set aside some time to play with and blog about them. These are the ones that caught most of my attention:

Easy integration of Solaris Zones on shared storage. Basically, you offer a LUN to zonecfg and it will implicitly create a new zpool for that zone. We have done this manually before to make it easier to move zones around between servers, basically a poor man’s Solaris Cluster. Along with a promised 97% performance improvement when attaching a zone on another server, this could make the process much smoother.
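
A hedged sketch of how this is expressed in zonecfg in 11.1 (zone name and storage URI are made up; iscsi:// or lu: URIs work as well):

# create a zone configuration whose root zpool is implicitly created on the given shared LUN
zonecfg -z zone1 "create; set zonepath=/zones/zone1; add rootzpool; add storage dev:dsk/c0t600144F0ABCD0000d0; end; commit"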

I do not really get “File System Statistics for Oracle Solaris Zones” yet, but if it allows me to use iostat or similar tools to find the utilization per zone, that would be great. Obviously this is not an issue for zones that already reside on their own LUN, but if multiple zones share a filesystem or zpool it is a pain to find out which one actually issues most of the IO requests. Another great tool for consolidated environments. (Update 10/27/2012: blogged about it)

They promise to increase the file transfer throughput of SSH and SCP, which may come in handy. Not that I really need this in production, but when I transfer gigabytes of installation media with scp (because I am too lazy to set up anything else) I am happy about a few more minutes of productivity.

Link Aggregation across multiple switches – this should be interesting to investigate. With the current implementation you can only use dladm for IEEE 802.3ad link aggregation, which works great but does not span multiple switches, which implies that you cannot set up redundancy through two separate switches (we overcome this by using Cisco 6500 switches with redundant supervisors and linecards, but it is still only one chassis). Solaris IP Multipathing, or IPMP, was a solution that offered more availability, but it is also a pain to set up and does not allow the bandwidth to scale with the number of NICs. (Update 02/09/2013: blogged about it)

The guide also vaguely mentions improvements made specifically to run the Oracle database faster. I have been looking forward to improvements like these ever since Oracle acquired Sun and am excited to try, test and learn more about this. They appear to have tweaked the way memory is handled and allocated, speeding up instance startup by a factor of 8. Steve Sistare has some more details on this. And there is some kernel support for RAC locking:

In the kernel itself, there has been a long history of improvements to benefit Oracle software,
the latest being acceleration for Oracle RAC where improvements in lock management are
expected to yield up to 20 percent throughput improvement over Oracle Solaris 11 11/11.

Please check back in the next few weeks as I will take a closer look at some of these features and blog about them in detail.

oow solaris zone cloning demo

At our OpenWorld kiosk we want to show a few of the unique Solaris features that we use every day. One of them is the cloning of zones, or containers. So I set up a simple demo that I want to share so you can try this at home. You would script all of these steps (and more) to automate the whole process, but for the demo I want to show all the steps manually so you get an idea of how simple this is, instead of just acknowledging that a script can clone an environment in a few seconds. I prepared this by installing Solaris 11, creating a link aggregation interface called aggr1 (but you can use your interface of choice) and a zpool called zp01 that will hold my zone roots. Continue reading
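
To give you an idea of the shape of the demo before you read the full post (zone names are placeholders; zone1 is an installed and halted source zone):

# export the source zone's configuration as the starting point for the new zone
zonecfg -z zone1 export -f /tmp/zone2.cfg
# adjust zonepath, network settings etc. in that file, then create the new configuration from it
zonecfg -z zone2 -f /tmp/zone2.cfg
# cloning on ZFS is a snapshot plus clone of the zone root, so this only takes a few seconds
zoneadm -z zone2 clone zone1
zoneadm -z zone2 boot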