troubleshooting a CPU-eating Java process

Just yesterday we had a situation where a Java process (a JBoss app server) started to drive up the CPU load on a machine significantly. The load average increased and top showed that the process consumed a lot of CPU. The planned “workaround”? Blame something (garbage collection is always an easy victim), restart and hope for the best. After all, there is not much to diagnose if there is nothing in the logs, right? Wrong! There are a few things we can do with a running Java VM to diagnose these issues from a Solaris or Linux command line and find the root cause.

This system is Solaris 11 with Java 7, but the troubleshooting steps and tools are the same on older versions of Solaris and Java as well as on Linux.

turn on thread-view in top

The first thing we can do is to turn on the thread-view in top. Just start top and hit “H” or start with the “-t” flag. Here is an output from my machine yesterday:

load averages:  5.11,  5.23,  5.27;               up 52+01:50:08 15:13:15
266 threads: 260 sleeping, 6 on cpu
CPU states: 36.7% idle, 62.8% user,  0.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 1975 ctxsw, 168 trap, 2246 intr, 1399 syscall, 6 flt
Memory: 128G phys mem, 3915M free mem, 132G total swap, 132G free swap

   PID  LWP USERNAME PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 26313 9999 portrix    0    0   11G   11G cpu/1   22.6H 12.71% java
 26313 9999 portrix    0    0   11G   11G cpu/5   23.1H 12.64% java
 26313 9999 portrix    0    0   11G   11G cpu/0   22.9H 12.62% java
 26313 9999 portrix    0    0   11G   11G cpu/3   22.7H 12.54% java
 26313   65 portrix    0    0   11G   11G cpu/6   25.7H 12.54% java
 26313 9999 portrix   59    0   11G   11G sleep   69:41  0.19% java
 26313 9999 portrix   60    0   11G   11G sleep    0:02  0.05% java
 26313   35 portrix   59    0   11G   11G sleep  297:05  0.03% java
 26313 9999 portrix   59    0   11G   11G sleep   74:35  0.03% java
 26313 9999 portrix   59    0   11G   11G sleep   35:29  0.01% java
 26313   21 portrix   59    0   11G   11G sleep   46:46  0.01% java
  9705    1 portrix   59    0 4400K 2868K cpu/4    0:00  0.00% top
 26313   11 portrix   59    0   11G   11G sleep    5:51  0.00% java
 26313   18 portrix   59    0   11G   11G sleep   22:15  0.00% java
 25257    1 webservd  59    0 8024K 1188K sleep    6:23  0.00% nginx
 25258    1 webservd  59    0 8024K  996K sleep    6:22  0.00% nginx

What can we read from this?

  • The load average has been steady at around 5 for a while, meaning that roughly 5 processes or LWPs have been, and still are, running on CPU.
  • Process 26313 has 5 LWPs (threads) that are hogging CPU. This is an 8-core machine, so 12.5% CPU utilization in top corresponds to 100% of one of those cores (100% / 8 = 12.5%). These 5 threads are truly stuck spinning and burning as much CPU as they can get. We will come back to these LWP numbers when we look at the stack dumps; see the sketch after this list.
  • I could not figure out why the LWP number is 9999 for some of these LWPs.
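
A side note on those LWP numbers: jstack, which we will use further down, prints each thread's native id as a hexadecimal nid field, and on Solaris that nid is (as far as I know) simply the LWP id. Converting the decimal LWP number from top to hex therefore lets you match a hot LWP to its Java stack trace. A minimal sketch of the conversion (the class name is made up for illustration):

// Hypothetical helper: convert the decimal LWP id shown by top/prstat into
// the hexadecimal nid that jstack prints, so a CPU-hogging thread can be
// matched to its stack trace. E.g. LWP 65 from the output above becomes
// nid=0x41.
public class LwpToNid {
    public static void main(String[] args) {
        int lwp = args.length > 0 ? Integer.parseInt(args[0]) : 65;
        System.out.println("nid=0x" + Integer.toHexString(lwp));
    }
}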

Analyze GC behaviour with jstat

Java comes with a few command-line utilities that can be used to investigate some of the VM's properties. One of those utilities is jstat, which can be used (among other things) to read statistics about the garbage collector. I printed those stats at 10-second intervals:

portrix@jboss42 ~ $ /usr/java/bin/jstat -gc 26313 10s
S0C      S1C       S0U   S1U     EC        EU        OC         OU         PC       PU        YGC    YGCT     FGC   FGCT    GCT   
106816.0 108224.0  0.0   88819.4 3196032.0 2143395.7 6828032.0  6611615.9  131072.0 119782.1  11187  431.555  27    101.982  533.537
106816.0 108224.0  0.0   88819.4 3196032.0 2504908.0 6828032.0  6611615.9  131072.0 119782.1  11187  431.555  27    101.982  533.537
106816.0 108224.0  0.0   88819.4 3196032.0 2519750.7 6828032.0  6611615.9  131072.0 119782.1  11187  431.555  27    101.982  533.537
106816.0 108224.0  0.0   88819.4 3196032.0 2804560.2 6828032.0  6611615.9  131072.0 119782.1  11187  431.555  27    101.982  533.537
106816.0 108224.0  0.0   88819.4 3196032.0 3098403.7 6828032.0  6611615.9  131072.0 119782.1  11187  431.555  27    101.982  533.537
106816.0 108224.0  0.0   88819.4 3196032.0 3194522.2 6828032.0  6611615.9  131072.0 119782.1  11187  431.555  27    101.982  533.537

I don’t know a whole lot about the different pools and generations in Java memory management and the way GC works, but at this point I am merely looking at the level of activity. Notice how none of these numbers really change by much. The columns I am mostly interested in are the integer counters YGC (young generation garbage collections – the “cheap” kind) and FGC (full garbage collections – these usually halt the VM for a few seconds). Neither one increases by much, which is a good indicator that this Java VM is not wasting a lot of time on garbage collection. When a process has heap memory problems and therefore GC problems, you will see these counters climb quickly. In this case GC has an alibi; we cannot blame it for this issue, so let’s move on to the next little troubleshooting tool.
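
If you prefer to confirm the same thing from inside the application rather than from the command line, the standard java.lang.management API exposes comparable counters. This is just a minimal sketch to illustrate the idea, not something that was part of the original troubleshooting session:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: print cumulative GC counters that roughly correspond to
// the YGC/FGC/GCT columns of jstat, sampled a few times at 10-second
// intervals. Run something like this inside the VM you want to observe.
public class GcStats {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 6; i++) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: collections=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(10000);
        }
    }
}

If the collection counts barely move between samples, the VM is not spending its time in GC, which is the same conclusion we just drew from the jstat output.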

Review running threads in a stack dump with jstack

We saw earlier that this system has 5 threads that are just spinning on CPU. It would be very useful to learn what these threads are doing and why. The best way (that I know of) to find this information is to take one or a few stack dumps of the running VM and check which threads are being executed at that time. The jstack utility returns the stack trace of each Java thread along with its state. Since 5 threads are on CPU almost all the time, I expect to find them in the running state in every dump and hopefully also to see a similar stack trace each time. This is how you run jstack; it usually returns quickly and does not seem to impact the performance of the running VM by much. Wait a few seconds between the runs.

portrix@jboss42 ~ $ /usr/java/bin/jstack 26313 > /var/tmp/jstack.out
portrix@jboss42 ~ $ /usr/java/bin/jstack 26313 > /var/tmp/jstack2.out
portrix@jboss42 ~ $ /usr/java/bin/jstack 26313 > /var/tmp/jstack3.out

This saves the stack dump to three files. Have a look at them and don’t be intimidated by the amount of information; we do not need to understand all of it. All we are looking for is the 5 threads in running state, so search for “RUNNABLE”. Ignore the first hit; it will just be the jstack utility itself:

"Attach Listener" daemon prio=3 tid=0x0000000004e43800 nid=0x505e waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

There may also be some other runnable threads that we can ignore. But further down, we find 4-5 stacks with this information:

"http-0.0.0.0-8080-Processor108" daemon prio=3 tid=0x000000000300e000 nid=0x398f runnable [0xffff80ff9451d000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.getEntry(HashMap.java:446)
        at java.util.HashMap.get(HashMap.java:405)
        at net.portrix.sw.backend.user.SomeCache.access(SomeCache.java:38)
...

And I could find these same stacks in the other dump files. So the threads we saw spinning in top earlier are burning their time in HashMap.getEntry. Remember, I know next to nothing about Java, but I can enter this into Google along with something like “100% CPU”. That yields plenty of results. This blog post by Pierre-Hugues Charbonneau is very similar to the one I am currently writing, and this article on Stack Overflow explains things a bit more. Apparently HashMap is not thread-safe and can lead to threads spinning endlessly when it is accessed concurrently; one should either use ConcurrentHashMap or serialize access to the map yourself. This information, together with the code in our software that calls it, should be enough for development to fix the root cause instead of just turning things off and on again. A rough sketch of what such a fix could look like follows below.
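
I do not know what SomeCache actually looks like inside, but to illustrate the kind of fix this points to, here is a minimal sketch of a cache backed by ConcurrentHashMap instead of a plain HashMap. Everything except the class name from the stack trace is made up for illustration:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the fix: back the cache with a ConcurrentHashMap so
// that concurrent reads and writes are safe. The internals and the ValueLoader
// interface are illustrative, not the actual application code.
public class SomeCache<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<K, V>();

    // roughly what an access() method might do: look the value up and
    // compute/store it on a miss
    public V access(K key, ValueLoader<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.load(key);
            // putIfAbsent keeps the first value if another thread won the race
            V previous = entries.putIfAbsent(key, value);
            if (previous != null) {
                value = previous;
            }
        }
        return value;
    }

    public interface ValueLoader<K, V> {
        V load(K key);
    }
}

The well-documented failure mode of an unsynchronized HashMap is that a concurrent resize can corrupt its internal bucket chains, after which get() loops forever in getEntry(), which matches the symptom we saw in top and jstack. ConcurrentHashMap (or external synchronization) avoids that.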
