Ftrace Tutorial
Steven Rostedt <srostedt@redhat.com>
(rostedt@goodmis.org)
Introduction
Kernel internal tracer
Derived from -rt patch Latency Tracer
Plugin tracers
ftrace : function tracer
irqsoff : interrupt disabled latency
wakeup : latency of highest priority task to
wake up
sched_switch: task context switches
(more)
Ring buffer
Saved traces (snap shots)
used to save maximum latency traces
The Debug File System
/sys/kernel/debug
I prefer:
mkdir /debug
mount -t debugfs nodev /debug
/etc/fstab
debugfs /sys/kernel/debug debugfs defaults 0 0
debugfs /debug debugfs defaults 0 0
/debug/tracing
available_tracers
current_tracer
tracing_enabled
trace
latency_trace
trace_pipe
iter_ctrl
tracing_max_latency
tracing_cpumask
trace_entries
Selecting a tracer
#cat/debug/tracing/available_tracers
wakeuppreemptirqsoffpreemptoffirqsoffftracesysprofsched_switchnone
#echowakeup>/debug/tracing/current_tracer
#cat/debug/tracing/current_tracer
wakeup
The none tracer
No tracer selected
none is special
it is not a tracer
echo none > /debug/tracing/current_tracer
Starting a trace
do not relay on tracing being enabled
echo 1 > /debug/tracing/tracing_enabled
note, make sure to have a space between the '1'
and the '>'. This has burnt many a kernel
programmer.
The enabled stays across tracers.
echo 1 > /debug/tracing/tracing_enabled
echo ftrace > /debug/tracing/current_tracer
echo irqsoff > /debug/tracing/current_tracer
Stopping a trace
echo 0 > /debug/tracing/tracing_enabled
do not forget that space!
Or in a program:
inttrace_fd;
[...]
intmain(intargc,char*argv[]){
[...]
trace_fd=open("/debug/tracing/tracing_enabled",O_WRONLY);
[...]
if(condition_hit()){
write(trace_fd,"0",1);
}
[...]
}
Reading the Output
latency_trace
trace
trace_pipe
Latency Trace Output
#tracer:irqsoff
#
irqsofflatencytracev1.1.5on2.6.26tip
latency:971us,#3/3,CPU#1|(M:preemptVP:0,KP:0,SP:0HP:0#P:2)
|task:swapper0(uid:0nice:20policy:0rt_prio:0)
=>startedat:acpi_os_acq uire_lock
=>endedat:cpuidle_idle_call
#_=>CPU#
#/_=>irqsoff
#|/_=>needresched
#||/_=>hardirq/softirq
#|||/_=>preemptdepth
#||||/
#|||||delay
#cmdpid|||||time|caller
#\/|||||\|/
<idle>01d..11us!:_spin_lock_irqsave(acpi_os_acq uire_lock)
<idle>01d..1971us:acpi_idle_enter_bm(cpuidle_idle_call)
<idle>01d..2972us:trace_hardirqs_on(cpuidle_idle_call)
Various outputs
<idle>01d.h21335164us:tick_sched_timer(__run_hrtimer)
<idle>00.Ns21386686us+:_spin_lock_irq(run_timer_softirq)
<idle>01d.H4138821
7us:ktime_get_ts(ktime_get)
bash34981....1576794us:rw_verify_area(vfs_write)
bash34981d..4120768us+:0:140:R+3096:120:Sgnometer minal
bash34981d..3120796us!:3498:120:S==>0:140:R<idle>
trace output
<idle>0[01]1977
.853298:read_hpet<getnstimeofday
<idle>0[01]1977
.853300:set_nor malized_timespec<ktime_get_ts
<idle>0[01]1977
.853301:_spin_lock<hrtimer_interrupt
bash3498[01]2057
.488856:0:140:R+3096:120:S
gnometer minal3096[00]2057
.488882:3096:120:S==>0:140:R
iter_ctrl
print-parent
sym-offset
sym-addr
verbose
raw
hex
binary
block
stacktrace
sched-tree
Using iter_ctrl
<idle>0[01]2975.463936:tick_program_event<tick_broadcast_oneshot_control
#echonoprintparent>/debug/tracing/iter_ctrl
<idle>0[01]2975.463936:tick_program_event
#echosymoffset>/debug/tracing/iter_ctrl
<idle>0[01]2975.463936:tick_program_event+0xb/0x6a
#echosymaddr>/debug/tracing/iter_ctrl
<idle>0[01]2975.463936:tick_program_event+0xb/0x6a<ffffffff81057c49>
The tracers
sched_switch
ftrace
wakeup
irqsoff
preemptoff
preemptirqsoff
Available Tracers?
#cat/debug/tracing/available_tracers
wakeuppreemptirqsoffpreemptoffirqsoffftracesysprofsched_switchnone
sched_switch
Traces task wakeups
Traces task context switches
bash3498[01]5459.824565:0:140:R+7971:120:R
<idle>0[00]5459.824836:0:140:R==>7971:120:R
bash3498[01]5459.824984:3498:120:S==>0:140:R
<idle>0[01]5459.825342:0:140:R==>7971:120:R
ls7971[00]5459.825380:7971:120:R+3:0:S
ls7971[00]5459.825384:7971:120:R==>3:0:R
migration/03[00]5459.825401:3:0:S==>0:140:R
ls7971[01]5459.825565:7971:120:R+598:115:S
stacktrace
iter_ctrl that effects the tracing itself
bash3498[01]6216.772637
:0:140:R+8495:120:R
bash3498[01]6216.772639:do_fork<=sys_clone<=ptregscall_common<=<=0<=0
<=0<=0
<idle>0[00]6216.773108:0:140:R==>8495:120:R
<idle>0[00]6216.773109:schedule<=cpu_idle<=rest_init<=<=0<=0<=0<=
0
bash3498[01]6216.773234:3498:120:S==>0:140:R
bash3498[01]6216.773235:schedule<=do_wait<=sys_wait4<=tracesys<=<=0<=
0<=0
ls8495[00]6216.773719:8495:120:R+3:0:S
ls8495[00]6216.773720:wake_up_process<=sched_exec<=do_execve<=sys_execve
<=stub_execve<=<=0<=0
ls8495[00]6216.773889:8495:120:R==>3:0:R
ftrace - function tracer
Traces at every non inline function
Other functions not traced
annotated with notrace
Makefile with CFLAGS_REMOVE_... = -pg
Must have /proc/sys/kernel/ftrace_enabled=1
Appears in most other tracers
Very verbose
init1[00]6710.079562:_spin_lock<tick_sched_timer
init1[00]6710.079563:do_timer<tick_sched_timer
bash3498[01]6710.079810:mutex_unlock<tracing_ctrl_write
bash3498[01]6710.079812:dnotify_parent<vfs_write
bash3498[01]6710.079813:_spin_lock<dnotify_parent
bash3498[01]6710.079814:_spin_unlock<dnotify_parent
bash3498[01]6710.079815:inotify_dentry_parent_q ueue_event<vfs_write
bash3498[01]6710.079815:inotify_inode_q ueue_event<vfs_write
bash3498[01]6710.07981
:syscall_trace_leave<int_very_careful
7
Latency Tracers
Stores the last maximum latency trace
wakeup : scheduling latency of RT tasks
irqsoff : interrupts off
preemptoff : preemption off
preemptirqsoff: interrupts and/or preemption
off
tracing_max_latency
wakeup - sched latency
Only traces RT tasks
use LatencyTop for non-RT tasks
Records and traces the maximum latency an
RT task took from wake up to schedule
Remember to reset tracing_max_latency
#echowakeup>/debug/tracing/current_tracer
#echo0>/debug/tracing/tracing_max_latency
#echo1>/debug/tracing/tracing_enabled
#chr tf10usleep10
#echo0>/debug/tracing/tracing_enabled
Wakeup without
function tracing
#tracer:wakeup
#
wakeuplatencytracev1.1.5on2.6.26tip
latency:9us,#2/2,CPU#1|(M:preemptVP:0,KP:0,SP:0HP:0#P:2)
|task:migration/17663(uid:0nice:5policy:1rt_prio:99)
#_=>CPU#
#/_=>irqsoff
#|/_=>needresched
#||/_=>hardirq/softirq
#|||/_=>preemptdepth
#||||/
#|||||delay
#cmdpid|||||time|caller
#\/|||||\|/
usleep102371d..22us+:try_to_wake_up(wake_up_process)
usleep102371d..39us:schedule(preempt_schedule)
With function tracing
#tracer:wakeup
#
wakeuplatencytracev1.1.5on2.6.26tip
latency:19us,#18/18,CPU#0|(M:preemptVP:0,KP:0,SP:0HP:0#P:2)
|task:usleep10133(uid:0nice:0policy:1rt_prio:10)
#_=>CPU#
#/_=>irqsoff
#|/_=>needresched
#||/_=>hardirq/softirq
#|||/_=>preemptdepth
#||||/
#|||||delay
#cmdpid|||||time|caller
#\/|||||\|/
<idle>00d.h41us:try_to_wake_up(wake_up_process)
<idle>00dNh42us:_spin_unlock_irqrestore(try_to_wake_up)
<idle>00dNh33us:_spin_lock(__run_hrtimer)
<idle>00dNh44us:_spin_unlock(hrtimer_interrupt)
<idle>00dNh35us:tick_program_event(hrtimer_interrupt)
[...]
<idle>00.N.214us:_spin_lock_irqsave(hrtick_set)
<idle>00dN.315us+:_spin_unlock_irqrestore(hrtick_set)
<idle>00dN.216us:_spin_lock(schedule)
irqsoff
local_irq_save(flags);
[...]
preempt_disabled();
[...]
local_irq_restore(flags);
[...]
preempt_enabled();
preemptoff
local_irq_save(flags);
[...]
preempt_disabled();
[...]
local_irq_restore(flags);
[...]
preempt_enabled();
preemptirqsoff
local_irq_save(flags);
[...]
preempt_disabled();
[...]
local_irq_restore(flags);
[...]
preempt_enabled();
trace_entries
Not enough data recorded
Too much data recorded
Run-time configurable
Must be done with none tracer or it will give
an -EBUSY
Number is number of entries, but the buffers
are allocate via pages.
If more entries can fit on a page that was
allocated to handle requested entries, the
remaining page will be filled with entries
Dynamic Ftrace
(the fun begins!)
Produces non-measurable overhead
Requires kernel thread ftraced to check for
more updates
Calls kstop_machine to execute text
modification
Not safe to modify code text in SMP environment
/debug/tracing/ftraced_enabled
How it works?
With the gcc profiler switch -pg
Every non-inline function calls mcount
00001adb<do_fork>:
1adb:55push%ebp
1adc:89e5mov%esp,%ebp
1ade:57push%edi
1adf:56push%esi
1ae0:53push%ebx
1ae1:83ec1csub$0x1c,%esp
1ae4:e8fcffffffcall1ae5<do_fork+0xa>
1ae5:R_386_PC32mcount
1ae9:89c3mov%eax,%ebx
1aeb:89c7mov%eax,%edi
1aed:81e300000002and$0x2000000,%ebx
1af3:89cemov%ecx,%esi
Non dynamic i368 mcount
ENTRY(mcount)
cmpl$ftrace_stub,ftrace_trace_function
jnztrace
.globlftrace_stub
ftrace_stub:
ret
/*takenfromglibc*/
trace:
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
movl0x4(%ebp),%edx
subl$MCOUNT_INSN_SIZE,%eax
call*ftrace_trace_function
popl%edx
popl%ecx
popl%eax
jmpftrace_stub
END(mcount)
Dynamic i386 mcount
ENTRY(mcount)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
subl$MCOUNT_INSN_SIZE,%eax
.globlmcount_call
mcount_call:
callftrace_stub
popl%edx
popl%ecx
popl%eax
ret
END(mcount)
Call ftrace_record_ip
ENTRY(mcount)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
subl$MCOUNT_INSN_SIZE,%eax
.globlmcount_call
mcount_call:
callftrace_record_ip
popl%edx
popl%ecx
popl%eax
ret
END(mcount)
ftrace_record_ip
HASH
do_fork
do_fork+0x9
ftrace_record_ip
ftraced
HASH
ftraced
List
do_fork+0x9
do_fork+0x9
kstop_machine
(modify code: nop)
nop
00001adb<do_fork>:
1adb:55push%ebp
1adc:89e5mov%esp,%ebp
1ade:57push%edi
1adf:56push%esi
1ae0:53push%ebx
1ae1:83ec1csub$0x1c,%esp
1ae4:908d742600nop
1ae9:89c3mov%eax,%ebx
1aeb:89c7mov%eax,%edi
1aed:81e300000002and$0x2000000,%ebx
1af3:89cemov%ecx,%esi
Starting of ftrace
00001adb<do_fork>:
1adb:55push%ebp
1adc:89e5mov%esp,%ebp
1ade:57push%edi
1adf:56push%esi
1ae0:53push%ebx
1ae1:83ec1csub$0x1c,%esp
1ae4:e8fcffffffcall1ae5
<do_fork+0xa>
1ae5:R_386_PC32ftrace_caller
1ae9:89c3mov%eax,%ebx
1aeb:89c7mov%eax,%edi
1aed:81e300000002and$0x2000000,%ebx
1af3:89cemov%ecx,%esi
ftrace_caller
ENTRY(ftrace_caller)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
movl0x4(%ebp),%edx
subl$MCOUNT_INSN_SIZE,%eax
.globlftrace_call
ftrace_call:
callftrace_stub
popl%edx
popl%ecx
popl%eax
.globlftrace_stub
ftrace_stub:
ret
END(ftrace_caller)
Registering an ftrace caller
ENTRY(ftrace_caller)
pushl%eax
pushl%ecx
pushl%edx
movl0xc(%esp),%eax
movl0x4(%ebp),%edx
subl$MCOUNT_INSN_SIZE,%eax
.globlftrace_call
ftrace_call:
callfunction_trace_call
popl%edx
popl%ecx
popl%eax
.globlftrace_stub
ftrace_stub:
ret
END(ftrace_caller)
Selective function tracer
tracing is dynamically enabled
have a list of functions that need to be traced
Why not filter which functions we trace?
Picking what functions to trace
/debug/tracing/available_filter_functions
/debug/tracing/set_ftrace_filter
/debug/tracing/set_ftrace_notrace
available_filter_functions
$cat/debug/tracing/available_filter_functions|
head
filelock_init
__rcu_read_lock
kmem_cache_create
notifier_call_chain
down_write
__rcu_read_unlock
_spin_lock_irq
_spin_unlock_irq
_spin_lock
__kmalloc
set_ftrace_filter
#echosys_open>/debug/tracing/set_ftrace_filter
#echoftrace>/debug/tracing/current_tracer
#echo1>/debug/tracing/tracing_enabled
#ls>/dev/null
#echo0>/debug/tracing/tracing_enabled
#cat/debug/tracing/trace
#tracer:ftrace
#
#TASKPIDCPU#TIMESTAMPFUNCTION
#|||||
ls5652[00]2320.450897
:sys_open<tracesys
ls5652[01]2320.452238:sys_open<tracesys
ls5652[01]2320.452311:sys_open<tracesys
ls5652[01]2320.452436:sys_open<tracesys
ls5652[01]2320.452548:sys_open<tracesys
ls5652[01]2320.452651:sys_open<tracesys
ls5652[01]2320.452749:sys_open<tracesys
ls5652[01]2320.452854:sys_open<tracesys
ls5652[01]2320.452950:sys_open<tracesys
ls5652[01]2320.453551:sys_open<tracesys
ls5652[01]2320.453767
:sys_open<tracesys
ls5652[01]2320.453851:sys_open<tracesys
bash5495[00]2320.453977
:sys_open<tracesys
ls5652[01]2320.454030:sys_open<tracesys
set_ftrace_notrace
Modify like set_ftrace_filter
Acts like a notrace added to the function
The function will not be traced even if in the
set_ftrace_filter
set_ftrace_* wildcards
Prefix: echo 'sys_*' > /debug/tracing/set...
Postfix: echo '*lock' > /debug/tracing/set...
Included: echo '*device*' > /debug/tracing/set...
Anything else:
use grep on available_filter_functions
#grep'^selinux.*open$'/debug/tracing/available_filter_functions>\
/debug/tracing/set_ftrace_notrace
Todo:
ftrace dump on OOPS
change sleep interval of ftraced thread
use CPU clock (aka TSC) for interrupt and
preemption latency traces
option to force per CPU trace interleaving
integrity
printk like hooks (for debugging purposes
only)
Hooks for tuna to show in the oscilloscope