Just yesterday we had a situation where a Java process (a JBoss app server) started to raise the CPU load on a machine significantly. The load average increased and top showed that the process consumed a lot of CPU. The planned “workaround”? Blame something (garbage collection is always an easy victim), restart and hope for the best. After all, there is not much to diagnose if there is nothing in the logs, right? Wrong! There are a few things we can do with a running Java VM to diagnose these issues from a Solaris or Linux command line and find the root cause.
This system is Solaris 11 with Java 7, but the troubleshooting steps and tools are the same on older versions of Solaris and Java as well as on Linux.
Turn on thread-view in top
The first thing we can do is to turn on the thread-view in top. Just start top and hit “H” or start with the “-t” flag. Here is an output from my machine yesterday:
```
load averages:  5.11,  5.23,  5.27;  up 52+01:50:08                      15:13:15
266 threads: 260 sleeping, 6 on cpu
CPU states: 36.7% idle, 62.8% user,  0.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 1975 ctxsw, 168 trap, 2246 intr, 1399 syscall, 6 flt
Memory: 128G phys mem, 3915M free mem, 132G total swap, 132G free swap

   PID   LWP USERNAME PRI NICE  SIZE   RES STATE    TIME     CPU COMMAND
 26313  9999 portrix    0    0   11G   11G cpu/1   22.6H  12.71% java
 26313  9999 portrix    0    0   11G   11G cpu/5   23.1H  12.64% java
 26313  9999 portrix    0    0   11G   11G cpu/0   22.9H  12.62% java
 26313  9999 portrix    0    0   11G   11G cpu/3   22.7H  12.54% java
 26313    65 portrix    0    0   11G   11G cpu/6   25.7H  12.54% java
 26313  9999 portrix   59    0   11G   11G sleep   69:41   0.19% java
 26313  9999 portrix   60    0   11G   11G sleep    0:02   0.05% java
 26313    35 portrix   59    0   11G   11G sleep  297:05   0.03% java
 26313  9999 portrix   59    0   11G   11G sleep   74:35   0.03% java
 26313  9999 portrix   59    0   11G   11G sleep   35:29   0.01% java
 26313    21 portrix   59    0   11G   11G sleep   46:46   0.01% java
  9705     1 portrix   59    0 4400K 2868K cpu/4    0:00   0.00% top
 26313    11 portrix   59    0   11G   11G sleep    5:51   0.00% java
 26313    18 portrix   59    0   11G   11G sleep   22:15   0.00% java
 25257     1 webservd  59    0 8024K 1188K sleep    6:23   0.00% nginx
 25258     1 webservd  59    0 8024K  996K sleep    6:22   0.00% nginx
```
What can we read from this?
- The load average has been steady around 5 for a while, meaning that on average 5 processes or LWPs have been, and still are, running on CPU
- process 26313 has 5 LWPs (threads) that are hogging CPU. This is an 8 core machine and 12.5% CPU utilization in top represents 100% of one of those CPU cores. So these 5 threads are truly stuck spinning around and burning as much CPU as they can.
- I could not figure out why the LWP number is 9999 for some of these LWPs
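As an aside, the per-thread CPU accounting that top's thread view gives us is also available from inside the VM through JMX. Here is a minimal sketch of my own (not something the application used) that prints CPU time per Java thread:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        // An in-VM alternative to top's thread view: ask the ThreadMXBean
        // how much CPU time each live Java thread has consumed so far.
        ThreadMXBean tm = ManagementFactory.getThreadMXBean();
        if (!tm.isThreadCpuTimeSupported()) {
            System.out.println("per-thread CPU time not supported on this VM");
            return;
        }
        for (long id : tm.getAllThreadIds()) {
            long ns = tm.getThreadCpuTime(id); // -1 if the thread already died
            ThreadInfo info = tm.getThreadInfo(id); // null if it already died
            if (ns > 0 && info != null) {
                System.out.println(info.getThreadName()
                        + " cpuMs=" + (ns / 1_000_000));
            }
        }
    }
}
```

Sorting that output by cpuMs would single out the same hot threads that top shows, without leaving the JVM.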
Analyze GC behaviour with jstat
Java comes with a few command-line utilities that can be used to investigate some of the VM's properties. One of those utilities is jstat, which can be used (among other things) to read statistics about the garbage collector. I printed those stats in intervals of 10 seconds:
```
portrix@jboss42 ~ $ /usr/java/bin/jstat -gc 26313 10s
 S0C      S1C      S0U  S1U     EC        EU        OC        OU        PC       PU       YGC   YGCT    FGC FGCT    GCT
106816.0 108224.0 0.0  88819.4 3196032.0 2143395.7 6828032.0 6611615.9 131072.0 119782.1 11187 431.555 27  101.982 533.537
106816.0 108224.0 0.0  88819.4 3196032.0 2504908.0 6828032.0 6611615.9 131072.0 119782.1 11187 431.555 27  101.982 533.537
106816.0 108224.0 0.0  88819.4 3196032.0 2519750.7 6828032.0 6611615.9 131072.0 119782.1 11187 431.555 27  101.982 533.537
106816.0 108224.0 0.0  88819.4 3196032.0 2804560.2 6828032.0 6611615.9 131072.0 119782.1 11187 431.555 27  101.982 533.537
106816.0 108224.0 0.0  88819.4 3196032.0 3098403.7 6828032.0 6611615.9 131072.0 119782.1 11187 431.555 27  101.982 533.537
106816.0 108224.0 0.0  88819.4 3196032.0 3194522.2 6828032.0 6611615.9 131072.0 119782.1 11187 431.555 27  101.982 533.537
```
I don’t know a whole lot about the different pools and generations in Java memory management or the way GC works, but at this point I am merely looking at the levels of activity. Notice how none of these numbers change by much. The columns I am mostly interested in are the integer counters YGC (young generation garbage collections – the “cheap” kind) and FGC (full garbage collections – these usually halt the VM for a few seconds). Neither one increases at all over these samples – a good indicator that this Java VM is not wasting a whole lot of time on garbage collection. A process with heap memory problems, and thus GC problems, would show these counters climbing constantly. But in this case GC has an alibi; we cannot blame it for this issue, so let’s move on to the next little troubleshooting tool.
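The collection counts and times that jstat reports under YGC/YGCT and FGC/FGCT are also exposed inside the VM via JMX. A small sketch of my own, for illustration:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounts {
    public static void main(String[] args) {
        // Each GarbageCollectorMXBean covers one collector; on a typical
        // HotSpot setup the young-generation collector's numbers correspond
        // to jstat's YGC/YGCT and the old-generation collector's to FGC/FGCT
        // (the bean names depend on which collector the VM is running).
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Polling these counters from a monitoring agent gives the same “is GC the problem?” answer as watching jstat, without shell access to the box.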
Review running threads in a stack dump with jstack
We saw earlier that this system has 5 threads that are just spinning on CPU. It would be very useful to learn what these threads are doing and why. The best way (that I know of) to find this information is to take one or a few stack dumps of the running VM and check which threads are being executed at that time. The jstack utility returns the stack dump of each Java thread along with its state. Since 5 threads are on CPU almost all the time, I expect to find them in the running state in every stack dump, and hopefully to see a similar stack for them each time. This is how you run jstack. It usually returns quickly and does not seem to impact the performance of the running VM much. Wait a few seconds between the runs.
```
portrix@jboss42 ~ $ /usr/java/bin/jstack 26313 > /var/tmp/jstack.out
portrix@jboss42 ~ $ /usr/java/bin/jstack 26313 > /var/tmp/jstack2.out
portrix@jboss42 ~ $ /usr/java/bin/jstack 26313 > /var/tmp/jstack3.out
```
This saves the stack dumps to three files. Have a look at them and don’t be intimidated by the amount of information; we do not need to understand all of it. All we are looking for is the 5 threads in the running state, so search for “RUNNABLE”. Ignore the first hit – it will just be the jstack utility itself:
```
"Attach Listener" daemon prio=3 tid=0x0000000004e43800 nid=0x505e waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
```
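As an aside: if jstack is not available for some reason, the per-thread sections it prints can be approximated from inside the VM itself. A minimal sketch of my own:

```java
import java.util.Map;

public class MiniDump {
    public static void main(String[] args) {
        // Poor man's jstack: print every live thread with its state and
        // stack, roughly mirroring the per-thread sections of a jstack dump.
        for (Map.Entry<Thread, StackTraceElement[]> e :
                Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            System.out.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```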
There may also be some other runnable threads that we should ignore. But further down, we can find 4-5 stacks with this information (as a bonus, the nid field jstack prints is the native thread id in hex, so it can be matched against the LWP column from top):
```
"http-0.0.0.0-8080-Processor108" daemon prio=3 tid=0x000000000300e000 nid=0x398f runnable [0xffff80ff9451d000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.getEntry(HashMap.java:446)
        at java.util.HashMap.get(HashMap.java:405)
        at net.portrix.sw.backend.user.SomeCache.access(SomeCache.java:38)
        ...
```
And I could find these same stacks in the other dump files. So the threads we saw spinning in top earlier are burning their time in HashMap.getEntry. Remember, I know next to nothing about Java, but I can enter this into Google along with something like “100% CPU”. This yields plenty of results. This blog post by Pierre-Hugues Charbonneau is very similar to the one I am currently writing, and this article on Stack Overflow explains things a bit more. Apparently HashMap is not thread-safe and can lead to locking or spinning issues when accessed concurrently; one should either use ConcurrentHashMap or serialize access to the object yourself. This information, together with the code in our software that calls it, should be enough for development to fix the issue – so instead of just turning things off and on again, we can actually fix the root cause.
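For illustration, here is a hypothetical sketch of what such a fix might look like. The class and field names below are my own invention modelled on the stack trace, not the actual application code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reconstruction of the cache seen in the stack trace.
public class SomeCache {
    // With a plain HashMap, concurrent unsynchronized writes can corrupt
    // the internal bucket structure (a resize racing with puts is the
    // commonly cited cause), after which get() can loop forever in
    // getEntry() - exactly the spinning-on-CPU symptom seen in top.
    // ConcurrentHashMap is safe for concurrent readers and writers.
    private final Map<String, Object> entries = new ConcurrentHashMap<>();

    public Object access(String key) {
        return entries.get(key);
    }

    public void store(String key, Object value) {
        entries.put(key, value);
    }
}
```

The alternative mentioned above would be to keep the HashMap and wrap every access in a shared lock (or Collections.synchronizedMap), trading some throughput for a smaller code change.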