Profiling JVM Applications in Production
Profiling JVM Applications in Production
in Production
Sasha Goldshtein @goldshtn
CTO, Sela Group github.com/goldshtn
https://s.sashag.net/srecon0318
Workshop Introduction
• Mission:
Apply modern, low-overhead, production-ready tools to monitor and
improve JVM application performance on Linux
• Objectives:
qIdentifying overloaded resources
qProfiling for CPU bottlenecks
qVisualizing and exploring stack traces using flame graphs
qRecording system events (I/O, network, GC, etc.)
qProfiling for heap allocations
Course Introduction
• Target audience:
Application developers, system administrators, production engineers
• Prerequisites:
Understanding of JVM fundamentals, experience with Linux system
administration, familiarity with OS concepts
• Lab environment:
EC2, delivered through the browser during the course dates
• Course hands-on labs:
https://github.com/goldshtn/linux-tracing-workshop
Course Plan
• JVM and Linux performance information sources
• CPU sampling
• Flame graphs and symbols
• Lab: Profiling with perf and async-profiler
• eBPF
• BCC tools
• Lab: Tracing file opens
• GC tracing and allocation profiling
• Lab: Allocation profiling
The Lab Environment
• Follow the link provided by the instructor
• Sign up or log in with Google
• Enter the classroom token
• Click the beaker-in-a-cloud icon to get
your own lab instance
• Wait for the terminal to initialize
JVM and Linux
Performance Sources
Performance Information Sources attach interface
(jcmd)
Java Flight
Java applications Recorder
Other USDT (dtrace)
JMX applications probes
mbeans Class loader
JVM
Serviceability API
uprobes System libraries GC JIT JVMTI agents
hsperf (jstat)
Syscall interface +PrintCompilation
Filesystem TCP/IP +PrintGC & other
Kernel
kprobes
Block I/O Ethernet Scheduler Mem Tracepoints
Tracepoints Device drivers ”software events”
PMU
Other devices CPU events
USE Checklist for Linux Systems
http://www.brendangregg.com/USEmethod/use-linux.html
U: mpstat -P 0 CPU
S: vmstat 1
E: perf Core Core
U: perf
U: iostat 1
LLC
S: iostat -xz 1
U: perf E: …/ioerr_cnt
FSB
Memory I/O
GFX PCIe SSD
controller controller
allocate
U: top, U: JVMTI,
jstack USDT
E: JVMTI
OVERHEAD
This traces various kernel page cache functions and maintains in-kernel counts, which are asynchronously copied
to user-space. While the rate of operations can be very high (>1G/sec) we can have up to 34% overhead, this is still
a relatively efficient way to trace these events, and so the overhead is expected to be small for normal
workloads. Measure in a test environment.
—man cachestat (from BCC)
CPU Sampling
Sampling vs. Tracing
• Sampling works by getting a snapshot or a call stack every N
occurrences of an interesting event
• For most events, implemented in the PMU using overflow counters and
interrupts
CPU sample
pid 121 pid 121 pid 408 pid 188 CPU time
GC blocked Thread 1
$ cat /tmp/perf-1882.map
7f2cd1108880 1e8 Ljava/lang/System;::arraycopy
7f2cd1108c00 200 Ljava/lang/String;::hashCode
7f2cd1109120 2e0 Ljava/lang/String;::indexOf
7f2cd1109740 1c0 Ljava/lang/String;::charAt
…
7f2cd110ce80 120 LHello;::doStuff
7f2cd110d280 140 LHello;::fidget
7f2cd110d5c0 120 LHello;::fidget
7f2cd110d8c0 120 LHello;::fidget
…
Generating Map Files
• For interpreted or JIT-compiled languages, map files need to be
generated at runtime
• Java: perf-map-agent
create-java-perf-map.sh $(pidof java)
• This is a JVMTI agent that attaches on demand to the Java process
• Additional options include dottedclass, unfoldall, sourcepos
• Consider -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints for
more accurate inline info
• Other runtimes:
• Node: node --perf-basic-prof-only-functions app.js
• Mono: mono --jitmap ...
• .NET Core: export COMPlus_PerfMapEnabled=1
Fixed Report; Still Broken
# perf report --stdio
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................. ......................
#
100.00% 0.00% java perf-3828.map [.] call_stub
|
---call_stub
LHello;::fidget
…
Walking Stacks
• To successfully walk stacks, perf requires* FPO to be disabled
• This is an optimization that uses EBP/RBP as a general-purpose register rather
than a frame pointer
• C/C++: -fno-omit-frame-pointer
• Java: -XX:+PreserveFramePointer since Java 8u60
When debug information is present, perf can use libunwind and figure out FPO-
enabled stacks, but not for dynamic languages
Fixed Report
# perf report --stdio
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................. ......................
#
100.00% 99.65% java perf-4005.map [.] LHello;::fidget
|
--99.65%--start_thread
JavaMain
jni_CallStaticVoidMethod
jni_invoke_static
JavaCalls::call_helper
call_stub
LHello;::main
LHello;::doStuff
LHello;::identifyWidget
LHello;::fidget
…
Real-World Stack Reports
# perf report --stdio | wc -l
14823
Flame Graphs
• A visualization method (adjacency graph), very
useful for stack traces, invented by Brendan
Gregg
• http://www.brendangregg.com/flamegraphs.html
• Turns 1000s of stack trace pages into a single
interactive graph
• Example scenarios:
• Identify CPU hotspots on the system/application
• Show stacks that perform heavy disk accesses
• Find threads that block for a long time and the stack
where they do it
Reading a Flame Graph
• Each rectangle is a function • Wider frames are more common
• Y-axis: stack depth • Supports zoom, find
• X-axis: sorted stacks (not time) • Filter with grep 😎
Generating a Flame Graph
$ git clone https://github.com/BrendanGregg/FlameGraph
$ sudo perf record -F 97 -g -p `pidof java` -- sleep 10
$ sudo perf script |
FlameGraph/stackcollapse-perf.pl |
FlameGraph/flamegraph.pl > flame.svg
Not Just For Methods
• For just a package-level understanding of where your time goes, use
pkgsplit-perf.pl and generate a package-level flame graph:
From http://www.brendangregg.com/blog/2017-06-30/package-flame-graph.html
Lab: CPU Investigation With perf And
Flame Graphs 💻
Problems with perf
• Only Java 8u60 and later is supported (to disable FPO)
• Disabling FPO has a small performance impact (up to 10% in
pathological cases)
• Symbol resolution requires an additional agent
• Interpreter frames can’t be resolved (shown as “Interpreter”)
• Recompiled methods can be misreported (appear more than once in
the perf map)
• Stack depth is usually limited to 127 (again, think Spring)
• Can be configured since Linux 4.8 using
/proc/sys/kernel/perf_event_max_stack
async-profiler
JVMTI Agents
• A JVMTI (JVM Tool Interface) agent can be loaded with -agentpath
or attached through the JVM attach interface
• Examples of functionality:
• Trace thread start and stop events
• Count monitor contentions and wait times
• Aggregate class load and unload information
• Full event reference:
http://docs.oracle.com/javase/8/docs/platform/jvmti/jvmti.html
AsyncGetCallTrace
• Internal API introduced to support lightweight profiling in Oracle
Developer Studio
• Produces a single thread’s stack without waiting for a safepoint
• Designed to be called from a signal handler
• Used by Honest Profiler (by Richard Warburton and contributors):
https://github.com/jvm-profiling-tools/honest-profiler
async-profiler
• Open source profiler by Andrei Pangin and contributors:
https://github.com/jvm-profiling-tools/async-profiler
kernel user
perf async-profiler
• Java ≧8u60 to disable FPO • Works on older Java versions
• Disabling FPO has a perf penalty • FPO can stay on
• Need a map file • No map file is required
• Interpreter frames are not • Interpreter frames are
supported supported
• System-wide profiling is possible • In theory, native and Java stacks
• Can profile containers from the don’t always sync
host (or from a sidecar) • Profiling runs in-process (so, in-
container)
Lab: Profiling With async-profiler 💻
eBPF
What’s Wrong With perf?
• perf relies on pushing a lot of data to user space, through files, for
analysis
• Downloading a file at ∼1Gb/s produces ∼89K netif_receive_skb events/s
(19MB/s including stacks)
perf_events perf.data
BPF: 1990
• Invented by McCanne and Jacobson at Berkeley, 1990-1992:
instruction set, representation, implementation of packet filters
kernel user
probes BPF runtime control program
sockets verifier & JIT control program
BPF program
syscalls
BPF program
BPF map BPF compiler
BPF Tracing
kernel user
①
kprobes BPF runtime control
② control program
program
tracepoints BPF program
③ ③ ⑤ application
perf_events
map USDT
① installs BPF program and attaches to events ④ user-space program is invoked with
② events invoke the BPF program data from the shared buffer
③ BPF program updates a map or pushes a ⑤ user-space program reads statistics
new event to a buffer shared with user-space from the map and clears it if necessary
BPF Tracing Features in The Linux Kernel
Version Feature Scenarios
4.1 kprobes/uprobes attach Dynamic tracing with BPF becomes possible
24 4.1 bpf_trace_printk BPF programs can print output to ftrace pipe
4.3 perf_events output Efficient tracing of large amounts of data for
16.04 analysis in user-space
4.6 Stack traces Efficient aggregation of call stacks for profiling
or tracing
4.7 Tracepoints support API stability for tracing programs
25 4.9 perf_events attach Low-overhead profiling and PMU sampling
16.10
The Old Way And The New Way
kernel user monitor
LATμs # distribution
VFS k{,ret}probe:
vfs_read perf | awk | … 0 - 1
1 – 2
…
…
|@@@@
|@
|
|
2 - 4 … |@@@@@@@@ |
perf_events perf.data
mountsnoop
*fsslower Device drivers memleak
*fsdist oomkill
dcstat biotop tcptop slabratetop
dcsnoop hardirqs
biolatency tcplife
mdflush softirqs
biosnoop tcpconnect
ttysnoop
bitesize tcpaccept
BCC Linux Performance Checklist
1. execsnoop 8. tcpaccept
2. opensnoop 9. tcptop
3. ext4slower 10.gethostlatency
(or btrfs*, xfs*, zfs*) 11.cpudist
4. biolatency 12.runqlat
5. biosnoop 13.profile
6. cachestat
7. tcpconnect
Some BCC Tools
# ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:49:17 bash 3616 R 128 0 7.75 cksum
06:49:17 cksum 3616 R 39552 0 1.34 [
06:49:17 cksum 3616 R 96 0 5.36 2to3-2.7
06:49:17 cksum 3616 R 96 0 14.94 2to3-3.4
^C
# execsnoop
PCOMM PID RET ARGS
bash 15887 0 /usr/bin/man ls
preconv 15894 0 /usr/bin/preconv -e UTF-8
man 15896 0 /usr/bin/tbl
man 15897 0 /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
^C
Some BCC Tools
# runqlat -p `pidof java` 10 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 11 |* |
2 -> 3 : 7 | |
4 -> 7 : 133 |****************** |
8 -> 15 : 288 |****************************************|
16 -> 31 : 205 |**************************** |
32 -> 63 : 38 |***** |
64 -> 127 : 11 |* |
128 -> 255 : 5 | |
256 -> 511 : 3 | |
512 -> 1023 : 1 | |
1024 -> 2047 : 3 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 3 | |
BCC’s profile Tool
# profile 10 -F 97 -K # kernel stacks only
…
ffffffffa4818691 __lock_text_start
ffffffffa45b0341 ata_scsi_queuecmd
ffffffffa458813d scsi_dispatch_cmd
ffffffffa458b021 scsi_request_fn
ffffffffa43be643 __blk_run_queue
ffffffffa43c3bc1 blk_queue_bio
ffffffffa43c1cf2 generic_make_request
ffffffffa43c1e4d submit_bio
ffffffffa43b825d submit_bio_wait
ffffffffa43c5c65 blkdev_issue_flush
ffffffffa4309b4d ext4_sync_fs
ffffffffa428b260 sync_fs_one_sb
ffffffffa425a553 iterate_supers
ffffffffa428b374 sys_sync
ffffffffa4003c17 do_syscall_64
ffffffffa4818bab return_from_SYSCALL_64
- stress (3303)
14
BCC’s profile Tool
kernel user monitor
PMU perf script | fold
| flamegraph
cpu-clocks
perf_events perf.data
BPF map
BPF program
BPF stacks
Lab: Snooping File Opens 💻
General-Purpose BCC Tools
Tracing Sources For BCC Tools
kernel user
kprobes application
tcp_sendmsg
USDT
tracepoints hotspot:class_loaded
sched:sched_switch BPF
program
perf_events application
cpu-clocks
uprobes
mysqld:…mysql_parse…
USDT Probes in (Some) High-Level Languages
OpenJDK Node.js
hotspot:gc_begin node:http_server_request
Oracle JDK
hotspot:thread_start node:http_client_request
hotspot:method_entry node:gc_begin