DynamoRIO
Profiling DynamoRIO and Clients

Linux

DynamoRIO PC self-sampling

The client can use dr_set_itimer() for programmatic PC self-sampling, with dr_where_am_i() providing information on where the sample was taken. This provides general categorization of where time is being spent in the overall instrumentation system, with potential to drill down further offline based on the PC.

PC sampling via DR's -prof_pcs runtime option is also available internally, to varying degrees on different platforms, but it is not polished and is missing some pieces (see the bottom of this page).

External sampling tools

Perf and oprofile are the two prominent sampling profilers on Linux today. Perf is newer and has a nicer interface, but it requires patching and building from source in order to get symbols for DR. oprofile is typically available on older distros, but it's not available on Ubuntu Precise, it seems to cause system lockups, and we're not sure we trust the results.

Before doing any micro-optimization based on a profile, disable CPU frequency scaling and then take the measurements:

for N in /sys/devices/system/cpu/cpu[0-9]*; do echo performance | sudo tee $N/cpufreq/scaling_governor ; done
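To confirm that the governor change took effect on every core, it can help to print the current setting per CPU. The following is a minimal sketch, not part of DR: the function name show_governors and its optional directory argument are hypothetical, and it assumes the usual sysfs layout used by the loop above.

```shell
# show_governors [CPU_SYSFS_DIR]
# Prints "cpuN <governor>" for every CPU that exposes a cpufreq governor.
# The optional argument (hypothetical, for illustration) overrides the
# sysfs directory; it defaults to the path used in the loop above.
# The body runs in a subshell so its variables do not leak to the caller.
show_governors() (
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*; do
        gov="$dir/cpufreq/scaling_governor"
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -r "$gov" ] || continue
        printf '%s %s\n' "${dir##*/}" "$(cat "$gov")"
    done
)
```

Running show_governors with no arguments before and after the loop above should show each CPU switching to "performance".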

oprofile

To install oprofile, type:

# Or other distro command
sudo apt-get install oprofile
# We don't need to profile the kernel
sudo /usr/bin/opcontrol --no-vmlinux

To make sudo opcontrol work without a password, run sudo visudo and add one line to the /etc/sudoers file:

your_username ALL=NOPASSWD: /usr/bin/opcontrol

To run oprofile, you can use a script like the following to start and stop it around the command you wish to run:

sudo opcontrol --shutdown || true # Shutdown existing oprofile daemon, if any
sudo opcontrol --reset # Throw away previously collected data
sudo opcontrol -c 0 # Replace 0 with N if you need callgraph
sudo opcontrol --start
your_command_here
sudo opcontrol --stop
sudo opcontrol --dump # Dumps data into local file
# Now get the report:
opreport -t 1 -l object_file_to_get_stats_for
# or to get the callgraph (don't forget to -c N above!)
opreport -c -l object_file_to_get_stats_for

Example report output:

[rnk@wittenberg src]$ opreport -t 2 -l ../../dynamorio/build/install/lib64/release/libdynamorio.so.3.2
...
samples % symbol name
4971 7.1208 insert_exit_stub_other_flags
2107 3.0182 mutex_lock
2082 2.9824 decode_sizeof
1774 2.5412 build_bb_ilist
1695 2.4280 encoding_possible_pass
1653 2.3679 fragment_lookup_fine_and_coarse
1647 2.3593 instr_encode_common
1502 2.1516 dispatch
1399 2.0040 hashtable_fragment_lookup
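When comparing runs, it can be handy to reduce a report like this to a single number, such as the share of samples covered by the listed symbols. The following is a rough sketch using awk: the function name sum_opreport_percent is hypothetical, and it assumes the three-column "samples % symbol name" layout shown above (reports with extra columns would need a different field index).

```shell
# sum_opreport_percent
# Reads `opreport -l` text on stdin and prints the sum of the "%" column.
# Only rows whose first field is an integer sample count and whose second
# field is a number are counted, so headers and "..." lines are ignored.
sum_opreport_percent() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ { total += $2 }
        END { printf "%.4f\n", total }
    '
}
```

For instance, piping the nine symbol rows above through sum_opreport_percent prints 26.9734, meaning the symbols above the 2% threshold account for roughly 27% of the samples in this library.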

perf

Perf currently does not handle symbols in DSOs that have a preferred base, and they only recently added support for following .gnu_debuglink. Since profiling without symbols isn't very useful, the following instructions are for building perf from source with a patch I wrote to fix the problem.

The patch to get good symbols with perf is available here: https://github.com/rnk/linux/compare/perf-p_vaddr.diff

You can clone the entire branch, or you can apply the patch to some other copy of the Linux kernel source. Either way, cd into tools/perf and run 'make' to build just perf. It will warn you about each library or header that it can't find, and you can install the appropriate package.

Running 'make install' as a normal user will install to $HOME/bin and $HOME/libexec.

Doing a run and getting a report is simple:

perf record your_command # Stores result in cwd/perf.data
perf report

Example output:

[rnk@wittenberg src]$ perf report | head
# Overhead Command Shared Object Symbol
# ........ .............. ................... ...........................................................................
#
17.68% DumpRenderTree perf-14440.map [.] 0x0000000071c59213
5.14% DumpRenderTree libdynamorio.so.3.2 [.] insert_exit_stub_other_flags
4.51% DumpRenderTree libdynamorio.so.3.2 [.] hashtable_fragment_lookup.isra.31
2.27% DumpRenderTree libdynamorio.so.3.2 [.] mutex_lock
1.94% DumpRenderTree libdynamorio.so.3.2 [.] build_bb_ilist
1.92% DumpRenderTree libdynamorio.so.3.2 [.] encoding_possible_pass

The perf-NNN.map DSO corresponds to DR's code cache. As the output above shows, at the time of writing, stub updating is a hotspot. You can focus in on just DR by passing "-d libdynamorio.so.3.2" to perf report.
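To see how time splits between the code cache and DR itself, the per-symbol rows can be totalled per shared object. The following is a small sketch with awk: the function name overhead_by_dso is hypothetical, and it assumes the four-column text layout shown above with a command name that contains no spaces. Running perf report with "--sort dso" should give a similar breakdown directly.

```shell
# overhead_by_dso
# Reads `perf report` text on stdin and prints "<dso> <total overhead %>",
# one line per shared object, largest total first.
overhead_by_dso() {
    awk '
        # Skip comment/header lines and anything not starting with "NN.NN%".
        /^#/ || $1 !~ /^[0-9.]+%$/ { next }
        # Field 1 is the overhead, field 3 is the shared object.
        { sub(/%$/, "", $1); total[$3] += $1 }
        END { for (dso in total) printf "%s %.2f\n", dso, total[dso] }
    ' | sort -k2,2nr
}
```

A typical invocation would be "perf report | overhead_by_dso". Note that it only totals the rows it is given, so piping through head first undercounts.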

To get a combined source and asm annotation, you can use "perf annotate -s insert_exit_stub_other_flags". Example output:

...
byte * ▒
│ insert_relative_jump(byte *pc, cache_pc target, bool hot_patch) ▒
│ { ▒
│ ASSERT(pc != NULL); ▒
│ *pc = JMP_OPCODE; ▒
│ pc++; ◆
0.24 │ dc: lea 0x1(%rax),%rdx ▒
│ ▒
byte * ▒
│ insert_relative_jump(byte *pc, cache_pc target, bool hot_patch) ▒
│ { ▒
│ ASSERT(pc != NULL); ▒
│ *pc = JMP_OPCODE; ▒
0.16 │ movb $0xe9,(%rax) ▒
byte * ▒
│ insert_relative_target(byte *pc, cache_pc target, bool hot_patch) ▒
│ { ▒
│ ▒
int value = (int)(ptr_int_t)(target - pc - 4); ▒
0.16 │ sub %rdx,%r14 ▒
│ sub $0x4,%r14d ▒
│ IF_X64(ASSERT(CHECK_TRUNCATE_TYPE_int(target - pc - 4))); ▒
│ ATOMIC_4BYTE_WRITE(pc, value, hot_patch); ▒
│ xchg %r14d,(%rdx) ▒
│ *((ptr_uint_t *)pc) = (ptr_uint_t)l; pc += sizeof(l); ▒
│ #ifdef X64 ▒
│ } ▒
│ #endif ▒
│ ▒
│ pc = insert_relative_jump(pc, exit_target, NOT_HOT_PATCHABLE); ▒
97.40 │ add $0x5,%rax

97% of the samples in this function landed on "add $0x5,%rax", which is misleading: sampled PCs tend to skid, so a sample is often attributed to the instruction after the one that actually stalled. The expensive instruction is more likely the "xchg %r14d,(%rdx)" just before it, which we use to atomically update the code cache. In this particular case we happen to be emitting the full exit stub, so it's unlikely that this needs to be an atomic update.

Windows

We've used AMD CodeAnalyst successfully.

TODO, more detail.

Cross-platform -prof_pcs

There are many open issues for cleaning this up, such as issue 140, issue 359, issue 767. On Linux the dr_set_itimer() solution above provides programmatic support.
