While Oracle may – even with the newest release – not be web scale, it sure has flex everything in RAC. These are actually really great features, but with the names being so similar (just like flash and flashback in current releases) it is easy to get them confused, and I had a very hard time explaining these technologies to Martin Berger and Allan Robertson via twitter. The two features I want to write about today are flex ASM and flex cluster, and I was lucky enough to sit in two excellent presentations by Markus Michalewicz and Nitin Vengurlekar at Collaborate 2013.
I just received the fantastic news that OTN is recognizing me as an Oracle ACE Director. I feel very honored to be considered part of this special club.
The Oracle ACE program recognizes enthusiasts and community contributors by acknowledging the effort they put into sharing knowledge. On top of that, the ACE Director designation supports individuals in engaging even more with Oracle through regular web conferences with engineering, an annual briefing before OOW and assistance with speaking opportunities. All they are asking for in return are things I enjoy doing anyway: write the occasional blog post or article about Oracle technologies and speak to and with old and new friends and fellow nerds at conferences to share and spread enthusiasm for technology.
Thanks go out to everyone in this community for giving so much knowledge and motivation to share. You are great friends and I enjoy every meeting at conferences as well as following you guys on the interwebs. My partners and colleagues at portrix, thank you for providing a platform where innovation and constant learning are encouraged and seen as a competitive edge. My parents are still driving me mad by demanding I come and fix their PCs (they have more than me) every time they accidentally delete “the internet”. According to them, “fixing PCs” is what I studied in college and what I do for a living. And I guess they are not too far off.
Larry Ellison himself just announced the latest generation of SPARC processors. Both he and John Fowler talked a lot about benchmarks and how these systems compete with IBM Power series systems. Very exciting stuff, but the announcement was a bit light on technical detail. I have compiled some information about these new systems:
The T4, T5 and the M5 share more than the names suggest. All of these CPUs are made up of the same basic building block: the S3 core. So each of these processors has the same basic per-core features of 8 threads, two pipelines and everything else that was already present in the T4. The T4 has 8 of those S3 cores per chip and 4MB of L3 cache. The T5 simply packs twice the cores and cache: 16 cores and 8MB of L3 cache, clocked at 3.6GHz compared to the T4's 2.85GHz (3GHz in the T4-4).
The name M5 would suggest a successor to the SPARC64 chips used in the M3000 to M9000 series systems. But in fact the M5 CPU is also built from the same S3 core as the T4 and T5, with 6 cores and 48MB of L3 cache. So all of these chips will support Oracle VM for SPARC (I still like to call it LDoms), even with live migration across these machines. They also all feature 10GbE and cryptographic acceleration directly on the CPU.
The T4 systems are still available and won’t be EOL’d anytime soon. In fact, the T5 completes the portfolio at the top end while smaller and mid-size systems are still only covered by T4 systems. The glueless 8-socket T5-8 system is the top of the line with 128 cores that all access the same main memory with a single hop. But even when you compare the 2- and 4-socket variants it becomes clear that for T5, size does matter. What Larry and John did not mention was that in addition to the T5-2, T5-4 and T5-8 there will also be a T5-1B blade module with just a single T5 CPU, but there will be no T5-1 (as of today).
I am obviously a big fan of Solaris but I am also very curious whether Oracle will ever bring Linux to the SPARC platform. Back in the very old days, Sun had a cooperation with Ubuntu on the T1000 and T2000 systems for some time, but it was not a huge success. Larry made some comments about wanting to port OEL to SPARC back in 2010 but has yet to follow through.
Oracle had yet to update the core factor table to include a factor for T5 chips when these systems launched; anything but 0.5 would have put a dent into all performance/price calculations. The core factor table has since been updated and lists both the T5 and M5 processors with a factor of 0.5, which means that organizations can upgrade existing SPARC or Intel database systems to the same number of T5 cores without having to worry about extra costs for new DB licenses. A 16-core x86 server and a 16-core T5 partition, for example, both come out at the same 8 processor licenses. Additionally, LDom virtualization and hard partitioning help out by allowing you to run a database on a subset of cores and only license what you actually need or use.
This morning, DOAG’s virtual ballot boxes opened. All members elect the representatives of the delegates’ assembly. This new body was introduced by a change to the association’s statutes last November and takes over most of the duties of the general members’ meeting. In the future, the delegates will vote on and formally discharge the DOAG board and approve the financial report.
As a representative for corporate members with fewer than 500 employees, I am also standing for election and would appreciate every vote from the membership. An email with a link to the “ballot” was sent out over the last few days.
Today, I attended an informal Oracle breakfast event which included a presentation by Joerg Moellenkamp about best practices for running Oracle databases on ZFS filesystems. There is a whitepaper that describes most of the issues that you should consider. In this post, I’d simply like to share my notes on the presentation: things that I find important or that were new to me.
Joerg made a case for using ZFS to mirror data because this gives ZFS a second copy from which it can repair blocks whose checksums do not match. I never thought about it that way and preferred to let the SAN take care of mirroring, but I will consider this in the future.
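For reference, this is what that looks like on the command line (the device names here are made up): with a mirrored pool, a scrub can repair any block whose checksum does not match by reading the second copy.

# hypothetical device names, for illustration only
zpool create dbpool mirror c5t0d0 c5t1d0

# a scrub re-reads every block and repairs checksum mismatches from the mirror copy
zpool scrub dbpool
zpool status -v dbpool    # lists any checksum errors that were found and repaired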
I was aware that the more a zpool fills up, the more effort it takes the system to find free blocks, and that this leads to slower performance. What I did not know is that ZFS actually switches to a different allocation algorithm at a certain fill level and that this level is configurable with metaslab_df_free_pct. Older releases switch at about 80% full and try to find the “best fit” for new data, which slows things down even more. Read more about it here.
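If you want to check or change that threshold yourself, something along these lines should work on Solaris. The tunable name is real, but the default differs between releases, so treat the value below as an example only.

# read the current value from the running kernel
echo metaslab_df_free_pct/D | mdb -k

# to change it persistently, add a line like the following to /etc/system and reboot
# (the value 4 is only an example):
#   set zfs:metaslab_df_free_pct = 4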
One issue that I only found out about a few days ago is that you cannot set the primarycache and secondarycache parameters independently. The way L2ARC caching (using read-optimized SSDs as cache devices in a hybrid storage pool) works is that data is only written to this second-level cache when it is evicted from the primary cache. So if you disable the caching of data or metadata for your primarycache (memory), then this data will also never make its way to your SSDs. This post is really helpful to understand the internals behind it (and then it becomes very obvious).
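A quick way to see that dependency on your own system (the filesystem name is made up):

zfs get primarycache,secondarycache tank/oradata

# disabling data caching in the ARC...
zfs set primarycache=metadata tank/oradata

# ...means data blocks are never evicted from the ARC into the L2ARC,
# so the SSD cache devices will only ever hold metadata as well,
# regardless of what secondarycache is set to.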
The theory of “IOPS inflation” was also briefly discussed: due to ZFS’s copy-on-write behaviour, blocks that are updated get written to a different location on disk, which may degrade performance for sequential reads that would benefit from the blocks being in the ‘proper’ order, such as backups or full scans. While this has not been an issue for our databases (and Joerg also mentioned that he only knows of one such case), I’d like to take some time and construct a demo for further study.
Update 2013/02/27: Bart Sjerps wrote an excellent blog article that shows the fragmentation on the physical disks that occurs when updating random blocks with Oracle on ZFS. He uses SLOB and introduces a searchable ASCII string to look at the raw files. There is no conclusion (yet) about how big the impact on performance for full scans or backups is, but it does become very clear that fragmentation occurs easily and that this leads to more IOPS to the disks to read a number of “sequential” blocks.
This announcement is probably a bit short notice, but on February 21st there will be another Oracle Business Breakfast in Hamburg, where Jörg Möllenkamp and Dirk Nitschke will talk about databases on ZFS and the ZFS Storage Appliance while attendees have breakfast at Oracle’s expense. The concept has proven itself at the last few events, and the topic promises to be very interesting, especially since we only recently looked into it ourselves. Register here.
One of the items that seems to come up every time we deploy a new database (on Solaris) is whether we should use ASM or ZFS as the filesystem for the datafiles. They each have their advantages and disadvantages in management and features. I won’t go into explaining the differences between and reasons for or against one or the other today, but Jason Arneil has a good introduction on this. He does not mention performance though, and this has been an area where I would think that ZFS probably has a bit of overhead compared with ASM. I have never been able to put an actual number on the expected performance hit. My hunch was to prefer ASM when the DB has critical performance demands and to consider ZFS in situations where the extra features come in handy.
But the day has come where I am actually interested in finding this out, so I sat down with Kevin Closson’s Silly Little Oracle Benchmark (SLOB) and designed a tiny little test case: a logical domain with 4 cores of a T4-2 SPARC server with Solaris 11.1, Oracle 11.2 and a 7320 ZFS Storage Appliance. I created two volumes (same parameters, zpool, everything) on the storage and attached them to the server via Fibre Channel. While both HBAs support 8Gb/s, the switch is a bit older and only supports 4Gb/s, which limits the throughput to about 800MB/s. The whole test data fits easily into the 24GB of DRAM cache in the controller, so at least for read workloads we don’t have to worry about the performance of disks or SSDs at all.
I set up 32 SLOB users and ran the test with 32 readers in parallel. I looked at the output of iostat and confirmed the result against the physical reads per second in the AWR report. I also ran tests with 8 writers but did not bother to set up redo logs and such properly. I’ll post my numbers, but please take the write numbers with a grain of salt.
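For reference, this is roughly how such a run is kicked off with the SLOB version I used. The argument order of runit.sh (writers first, then readers) differs between SLOB releases, so check the README of your copy before trusting this.

# create 32 SLOB schemas in a tablespace called SLOB (the name is up to you)
./setup.sh SLOB 32

# pure read test: 0 writers, 32 readers
./runit.sh 0 32

# write test: 8 writers, 0 readers (remember, my redo setup was not tuned)
./runit.sh 8 0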
I have done these tests with ASM before, so I was not surprised to see that we max out at about 100,000 IOPS. Not too shabby. The limit here is the FC throughput since we are hitting the 800MB/s with 100,000 8kB reads.
On my first run against the tablespace on ZFS, I did not bother modifying the ARC cache and simply let ZFS cache and read all my data, just for the fun of measuring and comparing the effect of caching on this level. I was expecting IOPS through the roof, close to the maximum number of LIOs that you would see when slobbing with a larger SGA. But to my surprise, that was not the case. The only thing through the roof was kernel load on the box, iostat reported no reads against the LUN. Something very wrong must be going on there. But since this is not very representative of what we’d do in production, I went on with the tests that actually do limit the ZFS ARC cache size. But maybe this is something interesting for another day.
Next, I limited the ZFS ARC cache to about 100MB and re-ran my tests. This time I was able to get the disks to do something, but I maxed out at about 6,000 IOPS when I realized that the ZFS recordsize was set to 128kB, which is probably a pretty stupid idea when doing random read tests on 8k data blocks. Oh, and I was not the first one to notice this; read about it in more detail from Don Seiler.
Next round: drop the users, tablespace and datafile, set the zfs recordsize to 8k and set the cache to only hold metadata, recreate everything and re-run. Here are the settings I used for this zfs:
zfs set recordsize=8k slob/dbf
zfs set atime=off slob/dbf
zfs set primarycache=metadata slob/dbf
And sure enough, I did get somewhat meaningful numbers this time.
The AWR report listed about 74k IOPS. Interestingly, iostat showed only about 60k IOPS against the disk volume, so maybe the remaining reads were really lucky to be fetched from the only 100MB of ARC cache. I’ll give ZFS the benefit of the doubt for now. Again, I did see a lot of kernel or system CPU utilization, but not as much as when I read everything from the ZFS cache.
I did play around a little bit with the primarycache parameter and turning it off completely did decrease the throughput while enabling data to be cached increased it but I was not interested in results that employ OS caching so I chose to go with the cached metadata only. I also briefly tried to disable checksumming but did not notice an effect for reads.
I also figured it could be interesting to compare this to UFS. The setup was easy enough: destroy the zpool, partition the disk, create a UFS filesystem with an 8k block size and make sure to mount it with the directio option. Then run the same workload as before. The performance is in the same ballpark as ZFS: equal for writes but about 20% fewer read IOPS.
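Roughly what that looked like (the device and mount point names are made up; the disk was sliced with format beforehand):

zpool destroy slob
newfs -b 8192 /dev/rdsk/c5t2d0s0
mkdir -p /u02/slob
mount -o forcedirectio /dev/dsk/c5t2d0s0 /u02/slob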
storage | read IOPS (32 readers) | write IOPS (8 writers) | CPU utilization (reading)
In conclusion, in this test setup there is a significant performance hit in using ZFS or UFS versus ASM. And this is not only time lost in fetching data from disk but also actual CPU time being burned. Still, this is no reason to rule out ZFS completely for datafile storage. There are cases where you may want to employ easy cloning or transparent compression or even encryption and can live with a performance hit. And I am sure there are also cases where you can benefit from hybrid storage pools to outperform ASM. One could employ a local PCIe flash card as a cache to improve ZFS performance.
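Adding such a device to an existing pool is a one-liner (the device name is made up). Just keep in mind that the L2ARC is only fed from blocks that the primarycache setting allows into the ARC in the first place.

# add a PCIe flash card (or read-optimized SSD) as an L2ARC cache device
zpool add slob cache c6t1d0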
I would have also loved to compare NFS but the ethernet link in this setup is only 1Gb, so that would be rather pointless.
UPDATE 2013-01-10: I ran the ZFS tests again with the primarycache parameter set to metadata only to make sure I was not reading data from the ZFS cache at all, and this time the numbers from iostat and the AWR report matched. I was asked if I could also compare the performance to UFS, and that was easily added.
There was a time when a phone was a phone and you used it to make phone calls. We are past that time. I use my device more than I ever did, but the least frequently used function is probably the “phone”. Anyway, here is a nifty, fun app that I like to show off – not much more than that: the ZFS appliance iPhone monitor. If you have a ZFS Storage Appliance 7120, 7320 or 7420, there is a little iPhone app that you can use to monitor and possibly analyze your storage system, as long as your phone has a way to reach the storage over the network. I use VPN for that.
You download the app from iTunes and enter the logon data for your storage appliance. The minimum firmware level they mention in the release notes is 2010.08.17, but we have noticed that the app does not work with 2011.04.24.2. It is recommended to upgrade to the latest release anyway, and the app will work with that.
It is a good idea to create a new local user on the appliance that does not have administration rights granted. You then add the appliance to the app by entering a name, IP address, the name of the newly created user (or root if you are really confident) and the user’s password.
The app provides read-only access to the monitoring screens of the appliance, including your dashboard and analytics worksheets. Analytics uses DTrace probes in the underlying Solaris OS and enables you to dive deeply into everything that is going on inside your storage array, like IO latency by disk, throughput by initiator or target, cache hits and misses at the memory and flash level and so on. This is good enough to show off the amazing suite of analytics and instrumentation on these devices. Unfortunately, you have to create and save analytics worksheets in a “real” browser before accessing them on your phone or tablet. The only thing you can do _to_ your appliance from the app is turning the system or disk locator LEDs on or off. You can’t create or modify volumes or shares from the app or reconfigure settings. That way, you cannot accidentally do anything stupid with your production systems.
I don’t really use this app for anything productive or for real-life performance troubleshooting but I do like it as a conversation-starter for ZFS storage. Another great, fun, geeky and equally useless starter is Brendan Gregg’s excellent video about shouting in the datacenter.
Here is a quick post concerning storage space requirements when using the Total Recall database feature. We did not notice this when we first set things up and were quite surprised when we looked a little closer at the space usage patterns of our flashback archives after a few weeks.
The default size for the initial extent of a partition was changed in 11.2.0.2 from 64kB to 8MB (details), probably with the intent that any table worth partitioning would usually be big enough to call for large extents anyway. But when you are using flashback archives, a new partition will be generated for you daily, no matter whether you actually change anything in the base table (and generate FBA data) or not. Multiply this by the number of tables you have enabled for Total Recall and this can easily add up to a significant amount of data: with 100 archived tables, for example, that is 800MB of freshly allocated extents every single day, even if nothing changes.
This really would not be a very big issue if the flashback archive tables made use of deferred segment creation. But even though this is enabled at the instance level, the archives are implicitly created with “SEGMENT CREATION IMMEDIATE”. This is a look at the DDL of one of the underlying fba_hist tables:
CREATE TABLE "GPM"."SYS_FBA_HIST_75880" ( "RID" VARCHAR2(4000 BYTE), "STARTSCN" NUMBER, "ENDSCN" NUMBER, "XID" RAW(8), "OPERATION" VARCHAR2(1 BYTE), "DTYPE" VARCHAR2(124 BYTE), "ID" NUMBER(19,0), "CODE" VARCHAR2(64 BYTE), "NAME" VARCHAR2(512 BYTE), [...] ) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 COMPRESS FOR OLTP STORAGE( BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "FB_ARCHIVE" PARTITION BY RANGE ("ENDSCN") (PARTITION "PART_14188342" VALUES LESS THAN (14188342) SEGMENT CREATION IMMEDIATE PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 COMPRESS FOR OLTP LOGGING STORAGE(INITIAL 8388608 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "FB_ARCHIVE" , ...
At this moment there is no way to change the partitioning attributes (to either make the initial extents smaller or set segment creation to deferred) that are used for flashback archives, so the only chance to influence this is by altering the new (hidden) parameter _partition_large_extents _before_ enabling Total Recall for a table.
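Something along these lines, run before turning on the archive for a table, should do it. The table and archive names are made up, and as with any underscore parameter you should confirm with Oracle Support before setting it.

sqlplus / as sysdba <<'EOF'
-- fall back to the old 64kB initial extents for newly created partitions
alter system set "_partition_large_extents" = FALSE scope=both;

-- only now enable Total Recall for the table (hypothetical names)
alter table gpm.rates flashback archive fb_archive_7y;
EOF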
A few days have passed, and with some distance between now and the annual DOAG conference in Nuremberg, it is time to put down some words and review the past week’s events.
Let me start with the conclusion: best DOAG ever! And that is not because there was a huge cake for the conference’s 25th anniversary but because of the small things that happen between sessions. This year was ripe with networking opportunities and nice little chats, both with old friends and new ones.
After having spent the last sunny days of the year rocking out to the Hives at Yerba Buena Gardens in San Francisco, the days are getting shorter, darker and colder back here in Germany. As fall slowly morphs into winter, it is time for the annual trip to Nuremberg for the largest Oracle conference in Europe: DOAG. This year will be especially busy, trying to share as much knowledge as possible in 3 (and a half) days: two presentations, one expert panel discussion and three full days of RACattack!
Today, at the annual German partner satellite day, portrix received an award as Oracle’s Specialized Partner of the Year for the Travel and Transportation industry. Being recognized like that is a huge honour for us, as it shows that others see us in the same light as we see ourselves.
So what exactly is our offering for the transportation industry? First of all, our software global price management (GPM) ensures the highest level of transparency and easy implementation when calculating shipping rates for LSPs. The software manages all existing rates that LSPs may have with their forwarders, including all additional costs and surcharges, and calculates and compares the shipping rates over different routes and carriers. The software is a huge success and we have already sold and implemented it for a number of LSPs.
In addition to providing a business benefit to our customers, we are also very proud of the technology behind it. We integrate tightly with the Oracle 11g database; two examples are Total Recall and Advanced Queueing. Total Recall allows us to set up flashback archives on our core tables and keep this historical data for several years. Not only is this important for compliance reasons, we also use this technology in our application to provide views of older versions of data or even perform Rate Retrieval queries as if they were done at a past point in time. You can hear more about this in my presentations at the DOAG and UKOUG conferences later this year.
In order to provide the highest levels of performance for sometimes very complex data structures, we do a lot of caching of Java objects within our application servers. And while caching is as easy as storing things in memory, knowing when you need to invalidate or refresh that cache is vital, especially in an environment where several app servers or even other apps might modify your base data. This is where we utilize Advanced Queueing to push messages from the database to the servers whenever critical data changes, so the app knows when to refresh the cached data. I hope I can find time to give you some simplified code examples later.
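Until then, here is a minimal sketch of the general idea (this is not our actual code; the queue, table and payload names are all made up for illustration): the database owns a queue, a trigger or the data access layer enqueues a short text message naming the changed entity, and every app server listens on that queue (for example via JMS) and flushes the matching cache entries. A real setup would use a multi-consumer queue with one subscriber per app server.

sqlplus / as sysdba <<'EOF'
begin
  -- queue infrastructure for cache invalidation messages
  dbms_aqadm.create_queue_table(
    queue_table        => 'gpm.cache_inval_qt',
    queue_payload_type => 'SYS.AQ$_JMS_TEXT_MESSAGE');
  dbms_aqadm.create_queue('gpm.cache_inval_q', 'gpm.cache_inval_qt');
  dbms_aqadm.start_queue('gpm.cache_inval_q');
end;
/

-- whenever rate data changes, push a message that names the stale entity
create or replace trigger gpm.rates_cache_inval
after insert or update or delete on gpm.rates
declare
  l_opts  dbms_aq.enqueue_options_t;
  l_props dbms_aq.message_properties_t;
  l_msg   sys.aq$_jms_text_message;
  l_id    raw(16);
begin
  l_msg := sys.aq$_jms_text_message.construct;
  l_msg.set_text('RATES');
  dbms_aq.enqueue(queue_name         => 'gpm.cache_inval_q',
                  enqueue_options    => l_opts,
                  message_properties => l_props,
                  payload            => l_msg,
                  msgid              => l_id);
end;
/
EOF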
But the database is not our only point of contact with the red stack. In addition to providing the software for our customers to operate in their own datacenters, we also run it in our own datacenter in a SaaS model. That platform is implemented completely on the red stack, including Solaris 11, Sun Fire servers and ZFS unified storage appliances. Our internal deployments for development and the SaaS platform rely heavily on Solaris zone virtualization, allowing us to clone and deploy a number of servers on a single box. Building these zones is heavily scripted, and upgrading our software version (or patching an application server) is usually done by creating a new zone and replacing the old one instead of touching the existing installation. This way we know that we can reproduce the exact setup later. Solaris 11 also offers great features for analyzing application performance, including DTrace.
Everything is complemented with ZFS storage. Hybrid storage pools (combining flash and disk storage) work great in our environment where we need to keep (but rarely touch) a huge amount of historical data. Snapshots and cloning enable us to make copies of complete databases for staging, testing and development purposes with very little time and effort.
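The mechanics behind this are as simple as the two commands below (dataset names made up); for a consistent database copy you would of course put the database into backup mode or shut it down before taking the snapshot.

# point-in-time snapshot of the dataset holding the datafiles
zfs snapshot tank/oradata/prod@staging_refresh

# writable clone for the staging database; only changed blocks consume new space
zfs clone tank/oradata/prod@staging_refresh tank/oradata/staging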