Ftrace

Alex Dzyoba
Originally published at alex.dzyoba.com ・ 9 min read


Ftrace is a framework for tracing and profiling the Linux kernel with the following features:

  • Kernel function tracing
  • Call graph tracing
  • Tracepoints support
  • Dynamic tracing via kprobes
  • Statistics for kernel functions
  • Statistics for kernel events

Essentially, ftrace is built around a smart lockless ring buffer implementation (see Documentation/trace/ring-buffer-design.txt). That buffer stores all the ftrace data and is exported via debugfs1 in /sys/kernel/debug/tracing/. All manipulation is done with simple file operations in this directory.

How ftrace works

As I’ve just said, ftrace is a framework, meaning that it provides only the ring buffer; all the real work is done by so-called tracers. Currently, ftrace includes the following tracers:

  • function – the default tracer;
  • function_graph – constructs call graphs;
  • irqsoff, preemptoff, preemptirqsoff, wakeup, wakeup_rt – latency tracers. These are the origins of ftrace; they first appeared in the -rt kernel. I won’t give you a lot of info on this topic because it’s more about realtime, scheduling and hardware stuff;
  • nop – you can guess.

Also, as additional features, you get:

  • kernel tracepoints support;
  • kprobes support;
  • blktrace support, though it’s going to be deleted.

Now let’s look at specific tracers.

Function tracing

The main ftrace feature is, well, function tracing (the function and function_graph tracers). To achieve this, kernel functions are instrumented with mcount calls, just like with gprof. But the kernel mcount, of course, differs completely from the userspace one, because it’s architecture dependent. This dependency is required to be able to build call graphs, and more specifically, to get the caller address from the previous stack frame.

This mcount call is inserted in the function prologue, and when tracing is turned off it does nothing. But when it’s turned on, it calls an ftrace handler that, depending on the current tracer, writes different data to the ring buffer.

Events tracing

Event tracing is done with the help of tracepoints. You set an event via the set_event file in /sys/kernel/debug/tracing, and then it will be traced into the ring buffer. For example, to trace kmalloc, just issue

echo kmalloc > /sys/kernel/debug/tracing/set_event

and now you can see in trace:

tail-7747 [000] .... 12584.876544: kmalloc: call_site=c06c56da ptr=e9cf9eb0 bytes_req=4 bytes_alloc=8 gfp_flags=GFP_KERNEL|GFP_ZERO

and the format is the same as in include/trace/events/kmem.h, meaning it’s just a tracepoint.
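Since each record is plain key=value text, it’s easy to post-process with standard tools. A minimal sketch that pulls the size fields out of the sample line above:

```shell
# The kmalloc event line from the trace above, verbatim.
line='tail-7747 [000] .... 12584.876544: kmalloc: call_site=c06c56da ptr=e9cf9eb0 bytes_req=4 bytes_alloc=8 gfp_flags=GFP_KERNEL|GFP_ZERO'

# Split on spaces and keep only the size-related key=value pairs.
echo "$line" | tr ' ' '\n' | grep '^bytes'
# bytes_req=4
# bytes_alloc=8
```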

kprobes tracing

Kernel 3.10 added kprobes and kretprobes support to ftrace, so now you can do dynamic tracing without writing your own kernel module. Unfortunately, there isn’t much data you can fetch with it, just:

  • Registers values
  • Memory dumps
  • Symbols values
  • Stack values
  • Return values (kretprobes)

And again, this output is written to the ring buffer. You can also calculate some statistics over it.

Let’s trace something that doesn’t have a tracepoint – something not from the kernel itself but from a kernel module.

On my Samsung N210 laptop I have the ath9k WiFi module, which most likely doesn’t have any tracepoints. To check, just grep available_events:

[tracing]# grep ath available_events 
cfg80211:rdev_del_mpath
cfg80211:rdev_add_mpath
cfg80211:rdev_change_mpath
cfg80211:rdev_get_mpath
cfg80211:rdev_dump_mpath
cfg80211:rdev_return_int_mpath_info
ext4:ext4_ext_convert_to_initialized_fastpath

Let’s see what functions we can put a probe on:

[tracing]# grep "\[ath9k\]" /proc/kallsyms | grep ' t ' | grep rx
f82e6ed0 t ath_rx_remove_buffer [ath9k]
f82e6f60 t ath_rx_buf_link.isra.25 [ath9k]
f82e6ff0 t ath_get_next_rx_buf [ath9k]
f82e7130 t ath_rx_edma_buf_link [ath9k]
f82e7200 t ath_rx_addbuffer_edma [ath9k]
f82e7250 t ath_rx_edma_cleanup [ath9k]
f82f3720 t ath_debug_stat_rx [ath9k]
f82e7a70 t ath_rx_tasklet [ath9k]
f82e7310 t ath_rx_cleanup [ath9k]
f82e7800 t ath_calcrxfilter [ath9k]
f82e73e0 t ath_rx_init [ath9k]

(The first grep filters symbols from the ath9k module, the second filters functions that reside in the text section, and the last filters receive-path functions.)
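The three greps can also be collapsed into a single awk invocation. Here is a sketch over a hypothetical excerpt (the real input is /proc/kallsyms; the fields are address, symbol type, name, module):

```shell
# A made-up /proc/kallsyms excerpt for illustration.
cat > /tmp/kallsyms.sample <<'EOF'
f82e6ed0 t ath_rx_remove_buffer [ath9k]
f82e7800 t ath_calcrxfilter [ath9k]
f82f0000 T ath_exported_func [ath9k]
c1000000 t unrelated_func
EOF

# $2 == "t": local text symbols; $4: module name; $3 ~ /rx/: receive path.
awk '$2 == "t" && $4 == "[ath9k]" && $3 ~ /rx/ { print $3 }' /tmp/kallsyms.sample
# ath_rx_remove_buffer
# ath_calcrxfilter
```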

For example, we will trace the ath_get_next_rx_buf function.

[tracing]# echo 'r:ath_probe ath9k:ath_get_next_rx_buf $retval' >> kprobe_events

This command is not from the top of my head – check Documentation/trace/kprobetrace.txt

This puts a retprobe on our function and fetches the return value (which is just a pointer).
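For reference, probe definitions follow the grammar from Documentation/trace/kprobetrace.txt: p[:EVENT] SYMBOL [FETCHARGS] for entry probes and r[:EVENT] SYMBOL [FETCHARGS] for return probes. The do_sys_open lines below are the stock examples from that document, not part of this session:

```
p:myprobe do_sys_open dfd=%ax filename=%dx
r:myretprobe do_sys_open $retval
```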

After we’ve placed the probe, we must enable it:

[tracing]# echo 1 > events/kprobes/enable

Then we can look for output in the trace file, and here it is:

midori-6741 [000] d.s. 3011.304724: ath_probe: (ath_rx_tasklet+0x35a/0xc30 [ath9k] <- ath_get_next_rx_buf) arg1=0xf6ae39f4

Example (block_hasher)

By default, ftrace collects info about all kernel functions, and that’s huge. But, being a sophisticated kernel mechanism, ftrace has a lot of features, all kinds of options, tunable params and so on that I won’t talk about here, because there are plenty of manuals and articles on LWN (see the To read section). Hence, it’s no wonder that we can, for example, filter by PID. Here is the script:

#!/bin/sh

DEBUGFS=$(grep debugfs /proc/mounts | awk '{ print $2; }')

# Reset collected stats by toggling the profiler off and on
echo 0 > $DEBUGFS/tracing/function_profile_enabled
echo 1 > $DEBUGFS/tracing/function_profile_enabled

# Trace only this shell's PID; exec below reuses it for the target
echo $$ > $DEBUGFS/tracing/set_ftrace_pid
echo function > $DEBUGFS/tracing/current_tracer

exec "$@"

The function_profile_enabled file turns on collection of statistical info.

Launch our magic script:

./ftrace-me ./block_hasher -d /dev/md127 -b 1048576 -t10 -n10000

get per-processor statistics from files in tracing/trace_stat/

head -n50 tracing/trace_stat/function* > ~/trace_stat

and see the top 5:

==> function0 <==
  Function                         Hit        Time              Avg
  --------                         ---        ----              ---
  schedule                         444425     8653900277 us     19472.12 us
  schedule_timeout                 36019      813403521 us      22582.62 us
  do_IRQ                           8161576    796860573 us      97.635 us
  do_softirq                       486268     791706643 us      1628.128 us
  __do_softirq                     486251     790968923 us      1626.667 us

==> function1 <==
  Function                         Hit        Time              Avg
  --------                         ---        ----              ---
  schedule                         1352233    13378644495 us    9893.742 us
  schedule_hrtimeout_range         11853      2708879282 us     228539.5 us
  poll_schedule_timeout            7733       2366753802 us     306058.9 us
  schedule_timeout                 176343     1857637026 us     10534.22 us
  schedule_timeout_interruptible   95         1637633935 us     17238251 us

==> function2 <==
  Function                         Hit        Time              Avg
  --------                         ---        ----              ---
  schedule                         1260239    9324003483 us     7398.599 us
  vfs_read                         215859     884716012 us      4098.582 us
  do_sync_read                     214950     851281498 us      3960.369 us
  sys_pread64                      13136      830103896 us      63193.04 us
  generic_file_aio_read            14955      830034649 us      55502.14 us

(Don’t pay attention to schedule – it’s just the scheduler being called.)
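Because trace_stat keeps one file per CPU, it can be handy to merge them into a single ranking. A sketch, assuming the files follow the layout shown above (the sample below is a trimmed copy of it):

```shell
# Fake per-CPU stat files mimicking tracing/trace_stat/function* layout.
mkdir -p /tmp/trace_stat
cat > /tmp/trace_stat/function0 <<'EOF'
  Function Hit Time Avg
  -------- --- ---- ---
  schedule 444425 8653900277 us 19472.12 us
  do_IRQ 8161576 796860573 us 97.635 us
EOF
cat > /tmp/trace_stat/function1 <<'EOF'
  Function Hit Time Avg
  -------- --- ---- ---
  schedule 1352233 13378644495 us 9893.742 us
EOF

# Sum hit counts per function across all CPUs, busiest first.
awk 'FNR > 2 { hit[$1] += $2 } END { for (f in hit) print hit[f], f }' \
    /tmp/trace_stat/function* | sort -rn
# 8161576 do_IRQ
# 1796658 schedule
```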

Most of the time is spent in schedule, do_IRQ, schedule_hrtimeout_range and vfs_read, meaning that we are either waiting on reads or waiting on some timeout. Now that’s strange! To make it clearer, we can disable the so-called graph time so that child functions aren’t counted. Let me explain: by default, ftrace counts a function’s time as the time of the function itself plus all of its subroutine calls; that behavior is controlled by the graph_time option. For example, if f spends 10 us in its own code and 90 us in a callee g, then f is accounted 100 us with graph_time on and only 10 us with it off.

So we turn it off:

echo 0 > options/graph_time

And collect the profile again:

==> function0 <==
  Function                Hit        Time            Avg
  --------                ---        ----            ---
  schedule                34129      6762529800 us   198146.1 us
  mwait_idle              50428      235821243 us    4676.394 us
  mempool_free            59292718   27764202 us     0.468 us
  mempool_free_slab       59292717   26628794 us     0.449 us
  bio_endio               49761249   24374630 us     0.489 us

==> function1 <==
  Function                Hit        Time            Avg
  --------                ---        ----            ---
  schedule                958708     9075670846 us   9466.564 us
  mwait_idle              406700     391923605 us    963.667 us
  _spin_lock_irq          22164884   15064205 us     0.679 us
  __make_request          3890969    14825567 us     3.810 us
  get_page_from_freelist  7165243    14063386 us     1.962 us

Now we see the amusing mwait_idle that somebody is somehow calling, and we can’t tell how that happens.

Maybe we should get a function graph! We know that it all starts with pread, so let’s try to trace down the function calls from pread.

By that moment, I had gotten tired of reading and writing debugfs files and started to use the CLI interface to ftrace, which is trace-cmd.

Using trace-cmd is dead simple – first you record with trace-cmd record, and then you analyze it with trace-cmd report.

Record:

trace-cmd record -p function_graph -o graph_pread.dat -g sys_pread64 \
        ./block_hasher -d /dev/md127 -b 1048576 -t10 -n100

Look:

trace-cmd report -i graph_pread.dat | less

And it’s disappointing.

block_hasher-4102 [001] 2764.516562: funcgraph_entry: | __page_cache_alloc() {
block_hasher-4102 [001] 2764.516562: funcgraph_entry: | alloc_pages_current() {
block_hasher-4102 [001] 2764.516562: funcgraph_entry: 0.052 us | policy_nodemask();
block_hasher-4102 [001] 2764.516563: funcgraph_entry: 0.058 us | policy_zonelist();
block_hasher-4102 [001] 2764.516563: funcgraph_entry: | __alloc_pages_nodemask() {
block_hasher-4102 [001] 2764.516564: funcgraph_entry: 0.054 us | _cond_resched();
block_hasher-4102 [001] 2764.516564: funcgraph_entry: 0.063 us | next_zones_zonelist();
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | SyS_pread64() {
block_hasher-4102 [001] 2764.516564: funcgraph_entry: | get_page_from_freelist() {
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | __fdget() {
block_hasher-4102 [001] 2764.516565: funcgraph_entry: 0.052 us | next_zones_zonelist();
block_hasher-4109 [000] 2764.516565: funcgraph_entry: | __fget_light() {
block_hasher-4109 [000] 2764.516565: funcgraph_entry: 0.217 us | __fget();
block_hasher-4102 [001] 2764.516565: funcgraph_entry: 0.046 us | __zone_watermark_ok();
block_hasher-4102 [001] 2764.516566: funcgraph_entry: 0.057 us | __mod_zone_page_state();
block_hasher-4109 [000] 2764.516566: funcgraph_exit: 0.745 us | }
block_hasher-4109 [000] 2764.516566: funcgraph_exit: 1.229 us | }
block_hasher-4102 [001] 2764.516566: funcgraph_entry: | zone_statistics() {
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | vfs_read() {
block_hasher-4102 [001] 2764.516566: funcgraph_entry: 0.064 us | __inc_zone_state();
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | rw_verify_area() {
block_hasher-4109 [000] 2764.516567: funcgraph_entry: | security_file_permission() {
block_hasher-4102 [001] 2764.516567: funcgraph_entry: 0.057 us | __inc_zone_state();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: 0.048 us | cap_file_permission();
block_hasher-4102 [001] 2764.516567: funcgraph_exit: 0.907 us | }
block_hasher-4102 [001] 2764.516567: funcgraph_entry: 0.056 us | bad_range();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: 0.115 us | __fsnotify_parent();
block_hasher-4109 [000] 2764.516568: funcgraph_entry: 0.159 us | fsnotify();
block_hasher-4102 [001] 2764.516568: funcgraph_entry: | mem_cgroup_bad_page_check() {
block_hasher-4102 [001] 2764.516568: funcgraph_entry: | lookup_page_cgroup_used() {
block_hasher-4102 [001] 2764.516568: funcgraph_entry: 0.052 us | lookup_page_cgroup();
block_hasher-4109 [000] 2764.516569: funcgraph_exit: 1.958 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 0.435 us | }
block_hasher-4109 [000] 2764.516569: funcgraph_exit: 2.487 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 0.813 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 4.666 us | }

First of all, there is no straight function call chain – it’s constantly interrupted and transferred to another CPU. Secondly, there is a lot of noise, e.g. the __inc_zone_state and __page_cache_alloc calls. And finally, there are neither mdraid functions nor mwait_idle calls!
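One way to untangle the interleaving is to keep a single task’s lines, since the task-PID pair sits right in the first column. A sketch over a few lines of the output above:

```shell
# A few interleaved lines from the report above.
cat > /tmp/graph.mixed <<'EOF'
block_hasher-4102 [001] 2764.516564: funcgraph_entry: | get_page_from_freelist() {
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | SyS_pread64() {
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | vfs_read() {
EOF

# Keep only the thread with PID 4109.
grep '^block_hasher-4109' /tmp/graph.mixed
```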

And the reasons are ftrace’s default sources (tracepoints) and the async/callback nature of kernel code. You won’t see a direct function call chain from sys_pread64 – the kernel doesn’t work that way.

But what if we set up kprobes on mdraid functions? No problem! Just add return probes for mwait_idle and md_make_request:

# echo 'r:md_make_request_probe md_make_request $retval' >> kprobe_events 
# echo 'r:mwait_probe mwait_idle $retval' >> kprobe_events

Repeat the routine with trace-cmd to get function graph:

# trace-cmd record -p function_graph -o graph_md.dat -g md_make_request -e md_make_request_probe -e mwait_probe -F \
            ./block_hasher -d /dev/md0 -b 1048576 -t10 -n100

-e enables a particular event. Now look at the function graph:

block_hasher-28990 [000] 10235.125319: funcgraph_entry: | md_make_request() {
block_hasher-28990 [000] 10235.125321: funcgraph_entry: | make_request() {
block_hasher-28990 [000] 10235.125322: funcgraph_entry: 0.367 us | md_write_start();
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | bio_clone_mddev() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | bio_alloc_bioset() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | mempool_alloc() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: 0.178 us | _cond_resched();
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | mempool_alloc_slab() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | kmem_cache_alloc() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | cache_alloc_refill() {
block_hasher-28990 [000] 10235.125325: funcgraph_entry: 0.275 us | _spin_lock();
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 1.072 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 1.721 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 2.085 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 2.865 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_entry: 0.187 us | bio_init();
block_hasher-28990 [000] 10235.125327: funcgraph_exit: 3.665 us | }
block_hasher-28990 [000] 10235.125327: funcgraph_entry: 0.229 us | __bio_clone();
block_hasher-28990 [000] 10235.125327: funcgraph_exit: 4.584 us | }
block_hasher-28990 [000] 10235.125328: funcgraph_entry: 1.093 us | raid5_compute_sector();
block_hasher-28990 [000] 10235.125330: funcgraph_entry: | blk_recount_segments() {
block_hasher-28990 [000] 10235.125330: funcgraph_entry: 0.340 us | __blk_recalc_rq_segments();
block_hasher-28990 [000] 10235.125331: funcgraph_exit: 0.769 us | }
block_hasher-28990 [000] 10235.125331: funcgraph_entry: 0.202 us | _spin_lock_irq();
block_hasher-28990 [000] 10235.125331: funcgraph_entry: 0.194 us | generic_make_request();
block_hasher-28990 [000] 10235.125332: funcgraph_exit: + 10.613 us | }
block_hasher-28990 [000] 10235.125332: funcgraph_exit: + 13.638 us | }

Much better! But for some reason, it doesn’t have mwait_idle calls, and it just calls generic_make_request. I tried to record a function graph for generic_make_request (the -g option). Still no luck. I extracted all functions containing wait, and here is the result:

# grep 'wait' graph_md.graph | cut -f 2 -d'|' | awk '{print $1}' | sort -n | uniq -c
     18 add_wait_queue()
   2064 bit_waitqueue()
      1 bit_waitqueue();
   1194 finish_wait()
     28 page_waitqueue()
   2033 page_waitqueue();
   1222 prepare_to_wait()
     25 remove_wait_queue()
      4 update_stats_wait_end()
    213 update_stats_wait_end();

(cut separates out the function names, awk prints only the names, and sort with uniq reduces them to unique names with counts)
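To see what each stage of that pipeline does, here it is run over a few sample funcgraph lines:

```shell
cat > /tmp/graph.sample <<'EOF'
block_hasher-4109 [000] 2764.516565: funcgraph_entry: | prepare_to_wait() {
block_hasher-4109 [000] 2764.516566: funcgraph_entry: 0.115 us | finish_wait();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: | prepare_to_wait() {
EOF

# Same stages as above: select wait lines, cut the name column, count unique names.
grep 'wait' /tmp/graph.sample | cut -f 2 -d'|' | awk '{print $1}' | sort | uniq -c
```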

Nothing related to timeouts. I tried to grep for timeout and, damn, nothing suspicious.

So right now I’m going to stop, because this isn’t going anywhere.

Conclusion

Well, it was really fun! ftrace is a powerful tool, but it’s made for debugging, not profiling. I was able to get a kernel function call graph, get statistics for kernel execution at the source code level (can you believe it?), and trace an unknown function – and all of that thanks to ftrace. Bless it!

To read


  1. This is how debugfs is mounted: mount -t debugfs none /sys/kernel/debug ↩
