JVM Diagnostics & Troubleshooting in Production
When backend systems suffer from performance issues in production (e.g., memory leaks, high CPU usage, or stuck threads), developers must know how to inspect the running JVM.
1. Core JVM Command-Line Tools
The JDK includes several utilities for querying and analyzing running Java processes.
| Tool | Purpose | Key Commands |
|---|---|---|
| jcmd | Universal command tool (recommended) | `jcmd <pid> VM.uptime`, `jcmd <pid> Thread.print` |
| jstack | Prints thread stack traces | `jstack -l <pid>` |
| jmap | Generates heap dumps and statistics | `jmap -dump:live,format=b,file=heap.hprof <pid>` |
| jstat | Monitors garbage collection and compilation | `jstat -gcutil <pid> 1000` |
| jinfo | Views/modifies JVM system properties and flags | `jinfo -flags <pid>` |
2. Troubleshooting Thread Contention & Deadlocks
When system performance slows down, threads are often waiting on locks or external resources.
Identifying Deadlocks with jstack
A deadlock happens when Thread 1 holds Lock A and waits for Lock B, while Thread 2 holds Lock B and waits for Lock A.
- Find the Java process ID (PID):

      jps -l
      # or
      jcmd

- Print the thread dump:

      jstack <pid> > thread_dump.txt

- Search `thread_dump.txt` for `"Found one Java-level deadlock:"`. The JVM automatically identifies deadlocks and prints the exact locks and threads involved:

      Found one Java-level deadlock:
      =============================
      "Thread-1":
        waiting to lock monitor 0x00007f (object 0x00a1, a java.lang.Object),
        which is held by "Thread-2"
      "Thread-2":
        waiting to lock monitor 0x00007e (object 0x00a2, a java.lang.Object),
        which is held by "Thread-1"
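The same deadlock check that jstack performs is also reachable from inside the process through the standard `ThreadMXBean`. The following is a minimal sketch, not a production monitor: the class name `DeadlockProbe`, the thread names, and the polling interval are made up for illustration. It deliberately reproduces the Lock A / Lock B scenario described above on two daemon threads and then asks the JVM which threads are deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockProbe {

    /** Returns info on the threads currently deadlocked, or an empty array if there are none. */
    public static ThreadInfo[] findDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (ids == null) {
            return new ThreadInfo[0];
        }
        return threads.getThreadInfo(ids, true, true);
    }

    /** Starts two daemon threads that take two locks in opposite order (illustrative only). */
    public static void createDeadlock() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockInOrder(lockA, lockB, bothHoldFirstLock), "demo-thread-1");
        Thread t2 = new Thread(() -> lockInOrder(lockB, lockA, bothHoldFirstLock), "demo-thread-2");
        t1.setDaemon(true); // daemon threads do not keep the JVM alive
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread holds its first lock too
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // never reached: the other thread holds this lock and waits for ours
            }
        }
    }

    /** Polls until a deadlock is reported or the timeout elapses. */
    public static ThreadInfo[] awaitDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        ThreadInfo[] found = findDeadlocks();
        while (found.length == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            found = findDeadlocks();
        }
        return found;
    }

    public static void main(String[] args) {
        createDeadlock();
        for (ThreadInfo info : awaitDeadlock(5_000)) {
            System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}
```

A check like `findDeadlocks()` could be wired into a health endpoint or a scheduled task, which helps when shell access to the container is not available.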
Troubleshooting CPU Spikes
If a JVM container exhibits 100% CPU usage:
- Find which thread ID (TID) is consuming the CPU using `top -H -p <pid>`. Note down the TID in decimal (e.g., `12345`).
- Convert the decimal TID to hexadecimal: `12345` → `0x3039`.
- Capture a thread dump: `jstack <pid> > threads.txt`.
- Look for the `nid` (native thread ID) matching `0x3039` in the thread dump. This points you directly to the offending line of code:

      "Pool-worker-1" #23 prio=5 os_prio=0 cpu=45.2% ... nid=0x3039 runnable [0x00007f...]
         java.lang.Thread.State: RUNNABLE
          at com.example.service.HeavyTask.loopForever(HeavyTask.java:42)
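The decimal-to-hex conversion and the lookup in the dump are easy to get wrong by hand, so here is a small sketch of both steps. The class name `NidLookup` and its methods are hypothetical helpers, and the dump lines in `main` are the illustrative ones from the example above, not real output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class NidLookup {

    /** Formats a decimal OS thread ID as a hex nid value, e.g. 12345 -> "0x3039". */
    public static String toNid(long decimalTid) {
        return "0x" + Long.toHexString(decimalTid);
    }

    /** Returns the first thread-dump line that carries the nid for the given decimal TID. */
    public static Optional<String> findThreadHeader(List<String> dumpLines, long decimalTid) {
        String needle = "nid=" + toNid(decimalTid);
        for (String line : dumpLines) {
            // Compare whole tokens so that nid=0x3039 does not match nid=0x30391.
            for (String token : line.trim().split("\\s+")) {
                if (token.equals(needle)) {
                    return Optional.of(line);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "\"Pool-worker-1\" #23 prio=5 os_prio=0 nid=0x3039 runnable [0x00007f...]",
                "   java.lang.Thread.State: RUNNABLE",
                "    at com.example.service.HeavyTask.loopForever(HeavyTask.java:42)");
        System.out.println(toNid(12345)); // prints 0x3039
        System.out.println(findThreadHeader(dump, 12345).orElse("no matching thread"));
    }
}
```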
3. Troubleshooting Memory Leaks & OutOfMemoryError (OOM)
If heap usage keeps growing and does not drop after garbage collection, the application likely has a memory leak.
1. Real-time GC Tracking with jstat
Run jstat to check if memory is reclaimed after Full GC (FGC):
jstat -gcutil <pid> 1000
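- Look at the `O` (Old gen percentage) and `M` (Metaspace percentage) columns.
- If `O` stays close to `100%` while the `FGC` count keeps incrementing, full collections are not reclaiming old-generation memory, which points to a leak (or to a heap that is too small for the live data set).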
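The rule of thumb above can be expressed as a small check over captured `jstat -gcutil` output. This is a sketch under stated assumptions: the class `GcUtilCheck`, the 90% threshold, and the sample rows in `main` are invented for illustration, and the parser assumes the first line is the jstat header and that the `O` and `FGC` columns hold plain numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GcUtilCheck {

    /** One jstat -gcutil row: old-gen occupancy in percent and the cumulative full-GC count. */
    public static final class Sample {
        final double oldPercent;
        final long fullGcCount;

        Sample(double oldPercent, long fullGcCount) {
            this.oldPercent = oldPercent;
            this.fullGcCount = fullGcCount;
        }
    }

    /** Parses a header line plus data rows; columns are located by name ("O", "FGC"). */
    public static List<Sample> parse(List<String> lines) {
        List<String> header = Arrays.asList(lines.get(0).trim().split("\\s+"));
        int oldIdx = header.indexOf("O");
        int fgcIdx = header.indexOf("FGC");
        if (oldIdx < 0 || fgcIdx < 0) {
            throw new IllegalArgumentException("header must contain O and FGC columns");
        }
        List<Sample> samples = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.trim().split("\\s+");
            samples.add(new Sample(Double.parseDouble(cols[oldIdx]), Long.parseLong(cols[fgcIdx])));
        }
        return samples;
    }

    /** True when full GCs ran during the window but old gen never fell below the threshold. */
    public static boolean looksLikeLeak(List<Sample> samples, double thresholdPercent) {
        if (samples.size() < 2) {
            return false;
        }
        long firstCount = samples.get(0).fullGcCount;
        long lastCount = samples.get(samples.size() - 1).fullGcCount;
        boolean fullGcRan = lastCount > firstCount;
        boolean oldStaysHigh = samples.stream().allMatch(s -> s.oldPercent >= thresholdPercent);
        return fullGcRan && oldStaysHigh;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not captured from a real process.
        List<String> output = Arrays.asList(
                "  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT",
                "  0.00   0.00  71.23  98.71  94.10  90.05    120    1.950     7    6.120    8.070",
                "  0.00   0.00  15.02  98.93  94.10  90.05    120    1.950     8    7.010    8.960",
                "  0.00   0.00  44.80  99.12  94.11  90.05    120    1.950     9    7.930    9.880");
        System.out.println("leak suspected: " + looksLikeLeak(parse(output), 90.0));
    }
}
```

A positive result only says that full collections are not freeing old-generation space; a heap dump is still needed to find out what is holding the memory.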
2. Capture a Heap Dump
If an OOM occurs, you want a heap dump. Enable automatic dumps in your JVM arguments:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/logs/heap.hprof
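To confirm that these flags actually reached the running JVM (for example when startup arguments pass through several wrapper scripts), the values can be read back through HotSpot's diagnostic MXBean. This is a sketch that assumes a HotSpot-based JVM; the class name `HeapDumpFlagCheck` is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpFlagCheck {

    /** Returns the current value of a HotSpot VM option as a string. */
    public static String flagValue(String optionName) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // getVMOption throws IllegalArgumentException for unknown option names.
        return bean.getVMOption(optionName).getValue();
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError = " + flagValue("HeapDumpOnOutOfMemoryError"));
        System.out.println("HeapDumpPath = " + flagValue("HeapDumpPath"));
    }
}
```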
To trigger a heap dump manually on a running process:
jcmd <pid> GC.heap_dump /var/logs/manual_heap.hprof
# or
jmap -dump:live,format=b,file=/var/logs/manual_heap.hprof <pid>
Note: Capture dumps during low-traffic periods if possible, as writing gigabytes of heap data freezes the JVM during the write.
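A dump can also be triggered from inside the application, for instance behind an admin-only endpoint. The sketch below uses `HotSpotDiagnosticMXBean.dumpHeap`, which exists on HotSpot-based JVMs; the class name `HeapDumper`, the temp-directory location, and the file name are assumptions for illustration. The same caveat about pausing the JVM applies.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumper {

    /**
     * Writes a heap dump into a fresh temporary directory and returns its path.
     * liveOnly = true dumps only reachable objects, similar to jmap's "live" option.
     */
    public static Path dumpToTempDir(boolean liveOnly) {
        try {
            // dumpHeap refuses to overwrite an existing file, so use a new directory.
            Path dir = Files.createTempDirectory("heap-dumps");
            Path target = dir.resolve("manual_heap.hprof"); // recent JDKs require the .hprof suffix
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toString(), liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dump = dumpToTempDir(true);
        System.out.println("Heap dump written to " + dump + " (" + dump.toFile().length() + " bytes)");
    }
}
```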
3. Analyze Dumps using Eclipse MAT (Memory Analyzer Tool)
Open the .hprof file in MAT:
- Leak Suspects Report: MAT automatically groups instances and identifies the dominant memory consumers.
- Dominator Tree: Lists objects sorted by their retained size (how much memory is freed if the object is garbage collected).
- Paths to GC Roots: Right-click on a leaking object → Path To GC Roots → exclude all phantom/weak/soft references. This shows which strong reference keeps the object in memory.
4. Off-Heap Memory Leak Troubleshooting
Sometimes, memory leaks happen outside the heap, causing the container process to get killed by the OS (OOM Killer) even though heap usage is low. For a detailed diagram and structural overview of where these Native/Off-Heap components reside, see the JVM Memory Layout Section.
Diagnostic Steps
- Enable Native Memory Tracking (NMT): Start your JVM with NMT enabled:

      -XX:NativeMemoryTracking=detail

- Establish Baseline:

      jcmd <pid> VM.native_memory baseline

- Check Differences: After the leak grows, print the NMT diff:

      jcmd <pid> VM.native_memory detail.diff
- Interpret Output: Look for growth in the Internal or Symbol sections (on newer JDKs, also check Other, where direct buffer memory is reported, and Class). Large allocations here are typically:
  - Direct ByteBuffers (`ByteBuffer.allocateDirect()`) from Netty or file transfers.
  - Unreleased class loaders creating class definition leaks.
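Direct buffer usage can also be watched from inside the process, without NMT, through the standard `BufferPoolMXBean`. The sketch below is illustrative only: the class name `DirectBufferUsage` and the 16 MiB allocation are made up, and the bean only accounts for buffers created via `ByteBuffer.allocateDirect()`; native memory that a library allocates by other means will not appear here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectBufferUsage {

    /** Returns the bytes currently held by the JVM's "direct" buffer pool, or -1 if unavailable. */
    public static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024); // off-heap allocation
        long after = directBytesUsed();
        System.out.println("direct pool grew by " + (after - before) + " bytes");
        System.out.println("buffer capacity: " + buffer.capacity()); // keeps the buffer reachable
    }
}
```

Exporting this value as a metric gives an early signal when direct memory climbs while heap usage stays flat, which matches the OOM Killer scenario described above.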