DynamoRIO API
Cache Simulator

drcachesim is a DynamoRIO client that collects memory access traces and feeds them to either an online or offline tool for analysis. The default analysis tool is a CPU cache simulator; other provided tools compute metrics such as reuse distance. The trace collector and simulator support multiple processes, each with multiple threads. The analysis tool framework is extensible, supporting the creation of new tools that can operate both online and offline.

Overview

drcachesim consists of two components: a tracer and an analyzer. The tracer collects a memory access trace from each thread within each application process. The analyzer consumes the traces (online or offline) and performs customized analysis. It is designed to be extensible, allowing users to easily implement a simulator for different devices, such as CPU caches, TLBs, page caches, etc. (see Extending the Simulator), or to build arbitrary trace analysis tools (see Creating New Analysis Tools). The default analyzer simulates the architectural behavior of caching devices for a target application (or multiple applications).

Running the Simulator

To launch drcachesim, use the -t flag to drrun:

$ bin64/drrun -t drcachesim -- /path/to/target/app <args> <for> <app>

The target application will be launched under a DynamoRIO tracer client that gathers all of its memory references and passes them to the simulator via a pipe. (See Offline Traces and Analysis for how to dump a trace for offline analysis.) Any child processes are followed and profiled as well, with their memory references also passed to the simulator.

Here is an example:

$ bin64/drrun -t drcachesim -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (1 thread(s))
L1I stats:
Hits: 258,433
Misses: 1,148
Miss rate: 0.44%
L1D stats:
Hits: 93,654
Misses: 2,624
Prefetch hits: 458
Prefetch misses: 2,166
Miss rate: 2.73%
Core #1 (1 thread(s))
L1I stats:
Hits: 8,895
Misses: 99
Miss rate: 1.10%
L1D stats:
Hits: 3,448
Misses: 156
Prefetch hits: 26
Prefetch misses: 130
Miss rate: 4.33%
Core #2 (1 thread(s))
L1I stats:
Hits: 4,150
Misses: 101
Miss rate: 2.38%
L1D stats:
Hits: 1,578
Misses: 130
Prefetch hits: 25
Prefetch misses: 105
Miss rate: 7.61%
Core #3 (0 thread(s))
LL stats:
Hits: 1,414
Misses: 2,844
Prefetch hits: 824
Prefetch misses: 1,577
Local miss rate: 66.79%
Child hits: 370,667
Total miss rate: 0.76%

Analysis Tool Suite

In addition to the CPU cache simulator, other analysis tools that operate on memory address traces are available. The tool to run is selected with the -simulator_type parameter.

To simulate TLB devices instead of caches, pass TLB to -simulator_type:

$ bin64/drrun -t drcachesim -simulator_type TLB -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
TLB simulation results:
Core #0 (1 thread(s))
L1I stats:
Hits: 252,412
Misses: 401
Miss rate: 0.16%
L1D stats:
Hits: 87,132
Misses: 9,127
Miss rate: 9.48%
LL stats:
Hits: 9,315
Misses: 213
Local miss rate: 2.24%
Child hits: 339,544
Total miss rate: 0.06%
Core #1 (1 thread(s))
L1I stats:
Hits: 8,709
Misses: 20
Miss rate: 0.23%
L1D stats:
Hits: 3,544
Misses: 55
Miss rate: 1.53%
LL stats:
Hits: 15
Misses: 60
Local miss rate: 80.00%
Child hits: 12,253
Total miss rate: 0.49%
Core #2 (1 thread(s))
L1I stats:
Hits: 1,622
Misses: 21
Miss rate: 1.28%
L1D stats:
Hits: 689
Misses: 35
Miss rate: 4.83%
LL stats:
Hits: 3
Misses: 53
Local miss rate: 94.64%
Child hits: 2,311
Total miss rate: 2.24%
Core #3 (0 thread(s))

To compute reuse distance metrics:

$ bin64/drrun -t drcachesim -simulator_type reuse_distance -reuse_distance_histogram -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Reuse distance tool aggregated results:
Total accesses: 349632
Unique accesses: 196603
Unique cache lines accessed: 4235
Reuse distance mean: 14.64
Reuse distance median: 1
Reuse distance standard deviation: 104.10
Reuse distance histogram:
Distance Count Percent Cumulative
0 153029 44.36% 44.36%
1 101294 29.37% 73.73%
2 14116 4.09% 77.82%
3 14248 4.13% 81.95%
4 8894 2.58% 84.53%
5 2733 0.79% 85.32%
...
==================================================
Reuse distance tool results for shard 29327 (thread 29327):
Total accesses: 335084
Unique accesses: 187927
Unique cache lines accessed: 4148
Reuse distance mean: 14.77
Reuse distance median: 1
Reuse distance standard deviation: 106.02
Reuse distance histogram:
Distance Count Percent Cumulative
0 147157 44.47% 44.47%
1 96820 29.26% 73.72%
2 13613 4.11% 77.84%
3 13834 4.18% 82.02%
4 8666 2.62% 84.64%
5 2552 0.77% 85.41%
...
3658 29 0.01% 100.00%
3851 1 0.00% 100.00%
Reuse distance threshold = 100 cache lines
Top 10 frequently referenced cache lines
cache line: #references #distant refs
0x7f2a86b3fd80: 27980, 0
0x7f2a86b3fdc0: 18823, 0
0x7f2a88388fc0: 16409, 111
0x7f2a8838abc0: 15176, 6
0x7f2a883884c0: 9930, 20
0x7f2a88388480: 7944, 20
0x7f2a88388500: 7574, 20
0x7f2a88398d00: 7390, 100
0x7f2a86b3fd40: 6668, 0
0x7f2a88388440: 5717, 20
Top 10 distant repeatedly referenced cache lines
cache line: #references #distant refs
0x7f2a885a4180: 246, 132
0x7f2a87504ec0: 202, 128
0x7f2a875044c0: 323, 126
0x7f2a885a4480: 220, 126
0x7f2a87504f00: 293, 124
0x7f2a86fd7e00: 289, 124
0x7f2a875049c0: 221, 124
0x7f2a875053c0: 270, 122
0x7f2a86db9c00: 269, 122
0x7f2a875047c0: 201, 122
==================================================
Reuse distance tool results for shard 29328 (thread 29328):
Total accesses: 12216
Unique accesses: 7251
Unique cache lines accessed: 319
Reuse distance mean: 12.98
Reuse distance median: 1
Reuse distance standard deviation: 38.19
Reuse distance histogram:
Distance Count Percent Cumulative
0 4965 41.73% 41.73%
1 3758 31.59% 73.32%
2 411 3.45% 76.78%
3 348 2.93% 79.70%
4 179 1.50% 81.21%
5 152 1.28% 82.48%
...
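The reuse distance of an access is the number of distinct cache lines touched since the previous access to the same line. The definition can be sketched in a few lines of Python; this is an illustration only, not the tool's actual implementation, which uses a data structure that scales to long traces:

```python
def reuse_distances(addresses, line_size=64):
    """Yield the reuse distance of each access (None on first touch of a line)."""
    stack = []                      # cache lines, most recently used last
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            pos = stack.index(line)
            # Distance = number of distinct lines touched since the last access
            # to this line (0 for back-to-back accesses to the same line).
            yield len(stack) - 1 - pos
            stack.pop(pos)
        else:
            yield None
        stack.append(line)

# Two accesses within one line, one intervening line, then a return to the first:
dists = list(reuse_distances([0x100, 0x104, 0x180, 0x100]))
# → [None, 0, None, 1]
```

The large Distance-0 bucket in the histograms above corresponds to consecutive accesses landing in the same cache line.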

A reuse time tool is also provided, which counts the total number of memory accesses (without considering uniqueness) between accesses to the same address:

$ bin64/drrun -t drcachesim -simulator_type reuse_time -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Reuse time tool aggregated results:
Total accesses: 88281
Total instructions: 261315
Mean reuse time: 433.47
Reuse time histogram:
Distance Count Percent Cumulative
1 27893 32.84% 32.84%
2 10948 12.89% 45.73%
3 5789 6.82% 52.54%
...
==================================================
Reuse time tool results for shard 29482 (thread 29482):
Total accesses: 84194
Total instructions: 250854
Mean reuse time: 450.01
Reuse time histogram:
Distance Count Percent Cumulative
1 26677 32.86% 32.86%
2 10508 12.95% 45.81%
3 5427 6.69% 52.50%
...
==================================================
Reuse time tool results for shard 29483 (thread 29483):
Total accesses: 3411
Total instructions: 8805
Mean reuse time: 86.36
Reuse time histogram:
Distance Count Percent Cumulative
1 1014 31.56% 31.56%
2 363 11.30% 42.86%
3 308 9.59% 52.44%
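Reuse time differs from reuse distance in that every intervening access is counted, not just distinct lines. A minimal Python sketch of the definition (illustrative only):

```python
def reuse_times(addresses):
    """Yield, for each repeated access, the number of accesses since the
    previous access to the same address, counting duplicates."""
    last_index = {}
    for i, addr in enumerate(addresses):
        if addr in last_index:
            yield i - last_index[addr]
        last_index[addr] = i

# Back-to-back reuse has time 1, matching the histograms starting at Distance 1:
times = list(reuse_times([0xA, 0xB, 0xA, 0xA]))
# → [2, 1]
```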

To see just the counts of instructions and memory references, broken down by thread, use the basic counts tool:

$ bin64/drrun -t drcachesim -simulator_type basic_counts -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
267193 total (fetched) instructions
345 total non-fetched instructions
0 total prefetches
67686 total data loads
22503 total data stores
3 total threads
280 total scheduling markers
0 total transfer markers
0 total other markers
Thread 247451 counts:
255009 (fetched) instructions
345 non-fetched instructions
0 prefetches
64453 data loads
21243 data stores
258 scheduling markers
0 transfer markers
0 other markers
Thread 247453 counts:
9195 (fetched) instructions
0 non-fetched instructions
0 prefetches
2444 data loads
937 data stores
12 scheduling markers
0 transfer markers
0 other markers
Thread 247454 counts:
2989 (fetched) instructions
0 non-fetched instructions
0 prefetches
789 data loads
323 data stores
10 scheduling markers
0 transfer markers
0 other markers

The non-fetched instructions are x86 string loop instructions, where subsequent iterations do not incur a fetch. They are included in the trace as a different type of trace entry to support core simulators in addition to cache simulators.

The opcode_mix tool uses the non-fetched instruction information, along with the preserved libraries and binaries from the traced execution, to gather more information on each executed instruction than was stored in the trace. It supports only offline traces, and the modules.log file created during post-processing of the trace must be preserved. The results are broken down by the opcodes used in DR's IR; mov, for example, is split into separate load and store opcodes, though both share the public string "mov":

$ bin64/drrun -t drcachesim -offline -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
$ bin64/drrun -t drcachesim -simulator_type opcode_mix -indir drmemtrace.*.dir
Opcode mix tool results:
267271 : total executed instructions
36432 : mov
31075 : mov
24715 : add
22579 : test
22539 : cmp
12137 : lea
11136 : jnz
10568 : movzx
10243 : jz
9056 : and
8064 : jnz
7279 : jz
5659 : push
4528 : sub
4357 : pop
4001 : shr
3427 : jnbe
2634 : mov
2469 : shl
2344 : jb
2291 : ret
2178 : xor
2164 : call
2111 : pcmpeqb
1472 : movdqa
...

The view tool prints disassembled instructions in att, intel, arm, or DR format for offline traces. The -skip_refs and -sim_refs flags can be used to set a start point and end point for the disassembled view. Note that these flags count the number of instructions skipped or displayed, which is distinct from the number of trace entries.

The tool also displays metadata marker entries for timestamps, for the core and thread on which the subsequent instruction sequence was executed, and for kernel and system call transfers (which correspond to signal or other event handler interruptions of the regular execution flow).

$ bin64/drrun -t drcachesim -simulator_type view -sim_refs 20 -indir drmemtrace.*.dir
<marker: timestamp 13218166936578899>
<marker: tid 46977 on core 7>
0x00007f3a5127d870 48 83 ec 48 sub $0x48, %rsp
0x00007f3a5127d874 0f 31 rdtsc
0x00007f3a5127d876 48 c1 e2 20 shl $0x20, %rdx
0x00007f3a5127d87a 89 c0 mov %eax, %eax
0x00007f3a5127d87c 48 09 c2 or %rax, %rdx
0x00007f3a5127d87f 48 8b 05 ea 25 22 00 mov <rel> 0x00007f3a5149fe70, %rax
0x00007f3a5127d886 48 89 15 d3 23 22 00 mov %rdx, <rel> 0x00007f3a5149fc60
0x00007f3a5127d88d 48 8d 15 dc 25 22 00 lea <rel> 0x00007f3a5149fe70, %rdx
0x00007f3a5127d894 49 89 d6 mov %rdx, %r14
0x00007f3a5127d897 4c 2b 35 62 27 22 00 sub <rel> 0x00007f3a514a0000, %r14
0x00007f3a5127d89e 48 85 c0 test %rax, %rax
0x00007f3a5127d8a1 48 89 15 40 31 22 00 mov %rdx, <rel> 0x00007f3a514a09e8
0x00007f3a5127d8a8 4c 89 35 29 31 22 00 mov %r14, <rel> 0x00007f3a514a09d8
0x00007f3a5127d8af 0f 84 9b 00 00 00 jz $0x00007f3a5127d950
0x00007f3a5127d8b5 4c 8d 05 84 27 22 00 lea <rel> 0x00007f3a514a0040, %r8
0x00007f3a5127d8bc 49 b9 d8 03 00 80 03 mov $0x00000003800003d8, %r9
00 00 00
0x00007f3a5127d8c6 48 b9 78 fb ff 7f 03 mov $0x000000037ffffb78, %rcx
00 00 00
0x00007f3a5127d8d0 48 8d 35 41 31 22 00 lea <rel> 0x00007f3a514a0a18, %rsi
0x00007f3a5127d8d7 bf ff ff ff 6f mov $0x6fffffff, %edi
0x00007f3a5127d8dc 41 bb ff fd ff 6f mov $0x6ffffdff, %r11d
View tool results:
20 : total disassembled instructions

Here is an example of a signal handler interrupting the regular flow:

0x00007fa87c6c0512 eb 5a jmp $0x00007fa87c6c056e
0x00007fa87c6c056e 80 bd 7c ff ff ff 00 cmp -0x84(%rbp), $0x00
0x00007fa87c6c0575 0f 85 e5 03 00 00 jnz $0x00007fa87c6c0960
<marker: kernel xfer to handler>
<marker: timestamp 13218875821472138>
<marker: tid 159754 on core 0>
0x00007fa879bb88dc 55 push %rbp
0x00007fa879bb88dd 48 89 e5 mov %rsp, %rbp
0x00007fa879bb88e0 48 83 ec 40 sub $0x40, %rsp
0x00007fa879bb88e4 89 7d dc mov %edi, -0x24(%rbp)
0x00007fa879bb88e7 48 89 75 d0 mov %rsi, -0x30(%rbp)
0x00007fa879bb88eb 48 89 55 c8 mov %rdx, -0x38(%rbp)
0x00007fa879bb88ef 83 7d dc 0a cmp -0x24(%rbp), $0x0a
0x00007fa879bb88f3 74 0e jz $0x00007fa879bb8903
0x00007fa879bb8903 48 8b 45 c8 mov -0x38(%rbp), %rax
0x00007fa879bb8907 48 83 c0 28 add $0x28, %rax
0x00007fa879bb890b 48 89 45 f8 mov %rax, -0x08(%rbp)
0x00007fa879bb890f 48 8b 45 f8 mov -0x08(%rbp), %rax
0x00007fa879bb8913 48 8b 80 80 00 00 00 mov 0x80(%rax), %rax
0x00007fa879bb891a 48 89 45 f0 mov %rax, -0x10(%rbp)
0x00007fa879bb891e eb 6d jmp $0x00007fa879bb898d
0x00007fa879bb898d 90 nop
0x00007fa879bb898e c9 leave
0x00007fa879bb898f c3 ret
0x00007fa87c6ca3a0 48 c7 c0 0f 00 00 00 mov $0x0000000f, %rax
0x00007fa87c6ca3a7 0f 05 syscall
<marker: timestamp 13218875821472148>
<marker: tid 159754 on core 0>
<marker: syscall xfer>
<marker: timestamp 13218875821475975>
<marker: tid 159754 on core 4>
0x00007fa87c6c057b 48 8b 75 c8 mov -0x38(%rbp), %rsi
0x00007fa87c6c057f 64 48 33 34 25 28 00 xor %fs:0x28, %rsi
00 00

The func_view tool displays the function argument and return values recorded for function names specified at tracing time. See Tracing Function Calls for more information.

$ bin64/drrun -t drcachesim -offline -record_function 'fib|1' -- ~/test/fib 5
$ bin64/drrun -t drcachesim -simulator_type func_view -indir drmemtrace.*.dir
0x7fc06d2288eb => common.fib!fib(0x5)
0x7fc06d22888e => common.fib!fib(0x4)
0x7fc06d22888e => common.fib!fib(0x3)
0x7fc06d22888e => common.fib!fib(0x2)
0x7fc06d22888e => common.fib!fib(0x1) => 0x1
0x7fc06d22889d => common.fib!fib(0x0) => 0x1
=> 0x2
0x7fc06d22889d => common.fib!fib(0x1) => 0x1
=> 0x3
0x7fc06d22889d => common.fib!fib(0x2)
0x7fc06d22888e => common.fib!fib(0x1) => 0x1
0x7fc06d22889d => common.fib!fib(0x0) => 0x1
=> 0x2
=> 0x5
0x7fc06d22889d => common.fib!fib(0x3)
0x7fc06d22888e => common.fib!fib(0x2)
0x7fc06d22888e => common.fib!fib(0x1) => 0x1
0x7fc06d22889d => common.fib!fib(0x0) => 0x1
=> 0x2
0x7fc06d22889d => common.fib!fib(0x1) => 0x1
=> 0x3
=> 0x8
Function view tool results:
Function id=0: common.fib!fib
15 calls
15 returns

The top referenced cache lines are displayed by the histogram tool:

$ bin64/drrun -t drcachesim -simulator_type histogram -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Cache line histogram tool results:
icache: 1134 unique cache lines
dcache: 3062 unique cache lines
icache top 10
0x7facdd013780: 30929
0x7facdb789fc0: 27664
0x7facdb78a000: 18629
0x7facdd003e80: 18176
0x7facdd003500: 11121
0x7facdd0034c0: 9763
0x7facdd005940: 8865
0x7facdd003480: 8277
0x7facdb789f80: 6660
0x7facdd003540: 5888
dcache top 10
0x7ffcc35e7d80: 4088
0x7ffcc35e7d40: 3497
0x7ffcc35e7e00: 3478
0x7ffcc35e7f40: 2919
0x7ffcc35e7dc0: 2837
0x7facdbe2e980: 2452
0x7facdbe2ec80: 2273
0x7ffcc35e7e80: 2194
0x7facdb6625c0: 2016
0x7ffcc35e7e40: 1997

Configuration File

drcachesim supports reconfigurable cache hierarchies defined in a configuration file. The configuration file is a text file with the following formatting rules.

  • A comment starts with two slashes followed by one or more spaces. Anything after the '// ' until the end of the line is considered a comment and ignored.
  • A parameter's name and its value are listed consecutively with white space (spaces, tabs, or a new line) between them.
  • Parameters must be separated by white space. Including one parameter per line helps keep the configuration file more human-readable.
  • A cache's parameters must be enclosed inside braces and preceded by the cache's user-chosen unique name.
  • Parameters can be listed in any order.
  • Parameters not included in the configuration file take their default values.
  • String values must not be enclosed in quotations.

Supported common parameters and their value types (each of these parameters sets the corresponding option with the same name described in Simulator Parameters):

  • num_cores <unsigned int>
  • line_size <unsigned int>
  • skip_refs <unsigned int>
  • warmup_refs <unsigned int>
  • warmup_fraction <float in [0,1]>
  • sim_refs <unsigned int>
  • cpu_scheduling <bool>
  • verbose <unsigned int>
  • coherence <bool>

Supported cache parameters and their value types:

  • type <string, one of "instruction", "data", or "unified">
  • core <unsigned int in [0, num_cores)>
  • size <unsigned int, power of 2>
  • assoc <unsigned int, power of 2>
  • inclusive <bool>
  • parent <string>
  • replace_policy <string, one of "LRU", "LFU", or "FIFO">
  • prefetcher <string, one of "nextline" or "none">
  • miss_file <string>

Example:

// Configuration for a single-core CPU.
// Common params.
num_cores 1
line_size 64
cpu_scheduling true
sim_refs 8888888
warmup_fraction 0.8
// Cache params.
P0L1I { // P0 L1 instruction cache
type instruction
core 0
size 65536 // 64K
assoc 8
parent P0L2
replace_policy LRU
}
P0L1D { // P0 L1 data cache
type data
core 0
size 65536 // 64K
assoc 8
parent P0L2
replace_policy LRU
}
P0L2 { // P0 L2 unified cache
size 512K
assoc 16
inclusive true
parent LLC
replace_policy LRU
}
LLC { // LLC
size 1M
assoc 16
inclusive true
parent mem
replace_policy LRU
miss_file misses.txt
}
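The formatting rules above amount to a simple whitespace-separated token grammar with '// ' comments and brace-delimited cache blocks. A hypothetical Python reader illustrating that grammar (the simulator's real parser is C++; all names here are illustrative):

```python
def tokenize(text):
    """Split a config file into tokens, dropping '// ' comments."""
    tokens = []
    for line in text.splitlines():
        comment = line.find("// ")
        if comment != -1:
            line = line[:comment]     # ignore everything after '// '
        tokens.extend(line.split())   # any whitespace separates tokens
    return tokens

def parse_config(text):
    """Return (common, caches): top-level name/value pairs plus one
    dict of parameters per named cache block."""
    tokens = tokenize(text)
    common, caches = {}, {}
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] == "{":
            name = tokens[i]          # user-chosen unique cache name
            i += 2
            params = {}
            while tokens[i] != "}":
                params[tokens[i]] = tokens[i + 1]
                i += 2
            caches[name] = params
            i += 1
        else:
            common[tokens[i]] = tokens[i + 1]
            i += 2
    return common, caches

example = """num_cores 1
line_size 64
P0L1I { // P0 L1 instruction cache
type instruction
core 0
}"""
common, caches = parse_config(example)
```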

Offline Traces and Analysis

To dump a trace for future offline analysis, use the offline parameter:

$ bin64/drrun -t drcachesim -offline -- /path/to/target/app <args> <for> <app>

The collected traces will be dumped into a newly created directory, which can be passed to drcachesim for offline cache simulation with the -indir option:

$ bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/

The direct results of the -offline run are raw, compacted files, stored in a raw/ subdirectory of the drmemtrace.app.pid.xxxx.dir directory. The -indir option both converts the data to a canonical trace form and passes the resulting data to the cache simulator. The canonical trace data is stored by -indir in a trace/ subdirectory inside the drmemtrace.app.pid.xxxx.dir/ directory. Both the raw and canonical data use a separate file per application thread. If the canonical data already exists, future runs use it rather than re-converting. Either the top-level directory or the trace/ subdirectory may be passed to -indir:

$ bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/trace

The canonical trace files may be manually compressed with gzip, as the trace reader supports reading gzipped files.

Older versions of the simulator produced a single trace file containing all threads interleaved. The -infile option supports reading these legacy files:

$ gzip drmemtrace.app.pid.xxxx.dir/drmemtrace.trace
$ bin64/drrun -t drcachesim -infile drmemtrace.app.pid.xxxx.dir/drmemtrace.trace.gz

The same analysis tools used online are available for offline: the trace format is identical.

Tracing a Subset of Execution

While the cache simulator supports skipping references, for large applications the overhead of the tracing itself is too high to conveniently trace the entire execution. There are several methods of tracing only during a desired window of execution.

The -trace_after_instrs option delays tracing by the specified number of dynamic instruction executions. This can be used to skip initialization and arrive at the desired starting point. The trace's length can also be limited by the -exit_after_tracing option.

If the application can be modified, it can be linked with the drcachesim tracer and use DynamoRIO's start/stop API routines dr_app_setup_and_start() and dr_app_stop_and_cleanup() to delimit the desired trace region. As an example, see our burst_static test application.

Simulator Details

The simulator can be extended to model a variety of caching devices. Currently, CPU caches and TLBs are implemented. The type of device to simulate is specified with the "-simulator_type" parameter (see Simulator Parameters).

The CPU cache simulator models a configurable number of cores, each with an L1 data cache and an L1 instruction cache. Currently there is a single shared L2 unified cache, but we would like to extend support to arbitrary cache hierarchies (see Current Limitations). The cache line size and each cache's total size and associativity are user-specified (see Simulator Parameters).
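The behavior being modeled, a set-associative cache with (by default) LRU replacement, can be sketched compactly in Python. This is a conceptual illustration, not the simulator's C++ implementation:

```python
class LRUCache:
    """Toy set-associative cache model with LRU replacement."""
    def __init__(self, size, assoc, line_size=64):
        self.line_size = line_size
        self.assoc = assoc
        self.num_sets = size // (line_size * assoc)
        # Each set holds tags ordered least- to most-recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        ways = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in ways:
            self.hits += 1
            ways.remove(tag)        # move to MRU position
        else:
            self.misses += 1
            if len(ways) >= self.assoc:
                ways.pop(0)         # evict the LRU way
        ways.append(tag)

# Sized like the hypothetical P0L1D in the configuration example: 64K, 8-way.
cache = LRUCache(size=65536, assoc=8)
for addr in (0x1000, 0x1004, 0x2000, 0x1008):
    cache.access(addr)
# → 2 hits (same-line accesses) and 2 misses (cold lines)
```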

The TLB simulator models a configurable number of cores, each with an L1 instruction TLB, an L1 data TLB, and an L2 unified TLB. Each TLB's entry number and associativity, and the virtual/physical page size, are user-specified (see Simulator Parameters).

Neither simulator has a simple way to know which core each thread executed on for each of its instructions. The tracer records which core a thread is on each time it writes out a full trace buffer, giving an approximation of the actual scheduling at the granularity of the trace buffer size. By default, the cache and TLB simulators ignore that information and schedule threads onto simulated cores in a static round-robin fashion, with load balancing to fill in gaps with new threads after threads exit. The option "-cpu_scheduling" (see Simulator Parameters) can be used to instead map each physical CPU to a simulated core and use the recorded CPU of each segment of thread execution to schedule execution in a manner that more closely resembles the traced execution on the physical machine. Below is example output using this option for an application with many threads on a physical machine with 8 CPUs; the 8 CPUs are mapped to the 4 simulated cores:

$ bin64/drrun -t drcachesim -cpu_scheduling -- ~/test/pi_estimator 20
Estimation of pi is 3.141592653798125
<Stopping application /home/bruening/dr/test/threadsig (213517)>
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (2 traced CPU(s): #2, #5)
L1I stats:
Hits: 2,756,429
Misses: 1,190
Miss rate: 0.04%
L1D stats:
Hits: 1,747,822
Misses: 13,511
Prefetch hits: 2,354
Prefetch misses: 11,157
Miss rate: 0.77%
Core #1 (2 traced CPU(s): #4, #0)
L1I stats:
Hits: 472,948
Misses: 299
Miss rate: 0.06%
L1D stats:
Hits: 895,099
Misses: 1,224
Prefetch hits: 253
Prefetch misses: 971
Miss rate: 0.14%
Core #2 (2 traced CPU(s): #1, #7)
L1I stats:
Hits: 448,581
Misses: 649
Miss rate: 0.14%
L1D stats:
Hits: 811,483
Misses: 1,723
Prefetch hits: 378
Prefetch misses: 1,345
Miss rate: 0.21%
Core #3 (2 traced CPU(s): #6, #3)
L1I stats:
Hits: 275,192
Misses: 154
Miss rate: 0.06%
L1D stats:
Hits: 522,655
Misses: 850
Prefetch hits: 173
Prefetch misses: 677
Miss rate: 0.16%
LL stats:
Hits: 12,491
Misses: 7,109
Prefetch hits: 8,922
Prefetch misses: 5,228
Local miss rate: 36.27%
Child hits: 7,933,367
Total miss rate: 0.09%

The memory access traces contain some optimizations that combine references for one basic block together. This may result in not considering some thread interleavings that could occur natively. There are no other disruptions to thread ordering, however, and the application runs with all of its threads concurrently just like it would natively (although slower).

Once every process has exited, the simulator prints cache miss statistics for each cache to stderr. The simulator is designed to be extensible, allowing for different cache studies to be carried out: see Extending the Simulator.

For L2 caching devices, the L1 caching devices are considered its children. Two separate miss rates are computed, one (the "Local miss rate") considering just requests that reach L2 while the other (the "Total miss rate") includes the child hits.
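A quick check with the LL numbers from the first example above (hits 1,414; misses 2,844; child hits 370,667) reproduces the reported rates. Note that, judging by the example's arithmetic, the demand hit and miss counts here exclude the separately reported prefetch counts:

```python
hits, misses, child_hits = 1414, 2844, 370667

# Local miss rate: only demand requests that actually reach the LL cache.
local = misses / (hits + misses)
# Total miss rate: child (L1) hits count as hits for the whole hierarchy.
total = misses / (child_hits + hits + misses)

print(f"local {local:.2%}, total {total:.2%}")
# → local 66.79%, total 0.76%
```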

For memory requests that cross blocks, each block touched is considered separately, resulting in separate hit and miss statistics. This can be changed by implementing a custom statistics gatherer (see Extending the Simulator).

Software and hardware prefetches are combined in the prefetch hit and miss statistics, which are reported separately from regular loads and stores. To isolate software prefetch statistics, disable the hardware prefetcher by running with "-data_prefetcher none" (see Simulator Parameters). While misses from software prefetches are included in cache miss files, misses from hardware prefetches are not.

Cache Miss Analyzer

The cache simulator can be used to analyze the stream of last-level cache (LLC) miss addresses. This can be useful when looking for patterns that can be utilized in software prefetching. The current analyzer can only identify simple stride patterns, but it can be extended to search for more complex patterns. To invoke the miss analyzer, pass miss_analyzer to the -simulator_type parameter. To write the prefetching hints to a file use the -LL_miss_file parameter to specify the file's path and name.
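As an illustration of what a "simple stride pattern" means here, a toy detector might look for a dominant constant delta in the stream of miss addresses. This is a conceptual sketch; the analyzer's actual heuristics differ:

```python
from collections import Counter

def dominant_stride(miss_addresses, min_fraction=0.5):
    """Return the most common delta between consecutive miss addresses,
    or None if no single stride dominates the stream."""
    deltas = Counter(b - a for a, b in zip(miss_addresses, miss_addresses[1:]))
    if not deltas:
        return None
    stride, count = deltas.most_common(1)[0]
    if count >= min_fraction * (len(miss_addresses) - 1):
        return stride
    return None

# A stream walking an array, missing on each new 64-byte cache line,
# with one unrelated miss in the middle:
stride = dominant_stride([0x1000, 0x1040, 0x1080, 0x10c0, 0x2000, 0x2040])
# → 64 (0x40), a candidate for a software-prefetch hint
```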

For example, to run the analyzer on a benchmark called "my_benchmark" and store the prefetching recommendations in a file called "rec.csv", run the following:

$ bin64/drrun -t drcachesim -simulator_type miss_analyzer -LL_miss_file rec.csv -- my_benchmark

Physical Addresses

The memory access tracing client gathers virtual addresses. On Linux, if the kernel allows user-mode applications access to the /proc/self/pagemap file, physical addresses may be used instead. This can be requested via the -use_physical runtime option (see Simulator Parameters). This works on current kernels but is expected to stop working from user mode on future kernels due to recent security changes (see http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce).
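For reference, each /proc/self/pagemap entry is a 64-bit value in which bit 63 indicates the page is present and bits 0-54 hold the page frame number (PFN). A hedged Python sketch of the translation, assuming a 4 KiB page size and a kernel that exposes the PFN to this process:

```python
import struct

PAGE_SIZE = 4096

def decode_pagemap_entry(entry, vaddr, page_size=PAGE_SIZE):
    """Decode one 64-bit pagemap entry: bit 63 = page present,
    bits 0-54 = page frame number."""
    if not entry & (1 << 63):
        return None                       # page not present
    pfn = entry & ((1 << 55) - 1)
    if pfn == 0:
        return None                       # PFN hidden from unprivileged callers
    return pfn * page_size + (vaddr % page_size)

def virt_to_phys(vaddr):
    """Look up the pagemap entry for vaddr in this process (Linux only)."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)  # one 8-byte entry per virtual page
        entry = struct.unpack("<Q", f.read(8))[0]
    return decode_pagemap_entry(entry, vaddr)
```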

Core Simulation Support

The drcachesim trace format includes information intended for use by core simulators as well as pure cache simulators. For traces that are not filtered by an online first-level cache, each data reference is preceded by the instruction fetch entry for the instruction that issued the data request. Additionally, on x86, string loop instructions involve a single instruction fetch followed by a loop of loads and/or stores. A drcachesim trace includes a special "no-fetch" instruction entry per iteration so that core simulators have the instruction information to go along with each load and store, while cache simulators can ignore these "no-fetch" entries and avoid incorrectly inflating instruction fetch statistics.

Offline traces guarantee that the branch target instruction entry immediately follows its branch instruction, with no intervening thread switch. This allows a core simulator to identify the target of a branch by examining the subsequent trace entry. However, this guarantee does not hold when a kernel event such as a signal is delivered immediately after a branch.

Traces include scheduling markers providing the timestamp and hardware thread identifier on each thread transition, allowing a simulator to more closely match the actual hardware if so desired.

Traces also include markers indicating disruptions in user mode control flow such as signal handler entry and exit.

A final feature that aids core simulators is the pair of interfaces module_mapper_t::get_loaded_modules() and module_mapper_t::find_mapped_trace_address(), which facilitate reading the raw bytes for each instruction in order to obtain the opcode and full operand information.

Extending the Simulator

The drcachesim tool was designed to be extensible, allowing users to easily model different caching devices, implement different models, and gather custom statistics.

To model different caching devices, subclass the simulator_t, caching_device_t, caching_device_block_t, and caching_device_stats_t classes.

To implement a different cache model, subclass the cache_t class and override the request(), access_update(), and/or replace_which_way() method(s).

Statistics gathering is separated out into the caching_device_stats_t class. To implement custom statistics, subclass caching_device_stats_t and override the access(), child_access(), flush(), and/or print_stats() methods.

Customizing the Tracer

The tracer supports customization for special-purpose i/o via drmemtrace_replace_file_ops(), allowing traces to be written to locations not supported by simple UNIX file operations. One way to use this function is to create a new client that links with the provided drmemtrace_static library and sets up the drmemtrace/drmemtrace.h header via:

use_DynamoRIO_drmemtrace_tracer(mytool)

and provides its own dr_client_main(), which calls drmemtrace_client_main().

The tracer also supports storing custom data with each module (i.e., library or executable) such as a build identifier via drmemtrace_custom_module_data(). The custom data may be retrieved by creating a custom offline trace post-processor and using the module_mapper_t class.

Tracing Function Calls

The tracer supports recording argument and return values for specified functions. This feature is currently limited to offline mode (see Offline Traces and Analysis). The -record_function parameter lists the function names to trace. Requested names are located in each library, and each instance is traced separately. The number of arguments to record is appended to each name, separated by a bar character; multiple functions are separated by an ampersand. Here is an example:

$ bin64/drrun -t drcachesim -offline -record_function 'fib|1&calloc|2'

Within the trace, each function is identified by a numeric identifier. The list of recorded functions, each with its identifier, is placed into a file "funclist.log" in the trace directory, where the sample tool func_view uses it to provide a linear function call trace as well as summary statistics as shown above.

The -record_heap parameter requests recording of a pre-determined set of functions related to heap allocation. The -record_heap_value parameter controls the contents of this set.

Creating New Analysis Tools

drcachesim provides a drmemtrace analysis tool framework to make it easy to create new trace analysis tools. A new tool should subclass analysis_tool_t.

Concurrent processing of traces is supported by logically splitting a trace into "shards" which are each processed sequentially. The default shard is a traced application thread, but the tool interface can support other divisions. For tools that support concurrent processing of shards and do not need to see a single time-sorted interleaved merged trace, the interface functions with the parallel_ prefix should be overridden, and parallel_shard_supported() should return true. parallel_shard_init() will be invoked for each shard prior to invoking parallel_shard_memref() for each entry in that shard; the data structure returned from parallel_shard_init() will be passed to parallel_shard_memref() for each trace entry for that shard. The concurrency model used guarantees that all entries from any one shard are processed by the same single worker thread, so no synchronization is needed inside the parallel_ functions. A single worker thread invokes print_results() as well.
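The shard-parallel contract can be mimicked in Python: each shard is processed start-to-finish by a single worker with its own state object, so no locking is needed, and results are aggregated once at the end. Method names here mirror the C++ interface, but this is an analogy, not the framework itself:

```python
from concurrent.futures import ThreadPoolExecutor

class CountTool:
    """Toy analogue of an analysis tool that counts entries per shard."""
    def parallel_shard_init(self, shard_id):
        return {"shard": shard_id, "count": 0}   # per-shard state, no locks

    def parallel_shard_memref(self, state, entry):
        state["count"] += 1                      # sees one shard's entries only

    def print_results(self, shard_states):
        # Whole-trace aggregation happens once, in a single thread.
        return sum(s["count"] for s in shard_states)

def run_parallel(tool, shards):
    def process(item):
        shard_id, entries = item
        state = tool.parallel_shard_init(shard_id)
        for entry in entries:        # one worker sees a whole shard, in order
            tool.parallel_shard_memref(state, entry)
        return state
    with ThreadPoolExecutor() as pool:
        states = list(pool.map(process, shards.items()))
    return tool.print_results(states)

total = run_parallel(CountTool(), {1: [0x10, 0x14], 2: [0x20]})
# → 3
```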

For serial operation, process_memref() operates on one entry at a time from a single, sorted, interleaved stream of trace entries. In the default mode of operation, the analyzer_t class iterates over the trace and calls the process_memref() function of each tool. An alternative mode exposes the iterator, allowing a separate control infrastructure to be built; this alternative mode does not support parallel operation at this time.

Both parallel and serial operation can be supported by a tool, typically by having process_memref() create per-shard data when a traced thread is first seen and then invoke parallel_shard_memref() to do the work.

For both parallel and serial operation, the function print_results() should be overridden. It is called just once after processing all trace data and it should present the results of the analysis. For parallel operation, any desired aggregation across the whole trace should occur here as well, while shard-specific results can be presented in parallel_shard_exit().

Today, parallel analysis is only supported for offline traces. Support for online traces may be added in the future.

Each trace entry is of type memref_t and represents one instruction or data reference or a metadata operation such as a thread exit or marker. There are built-in scheduling markers providing the timestamp and cpu identifier on each thread transition. Other built-in markers indicate disruptions in user mode control flow such as signal handler entry and exit.

CMake support is provided for including the headers and linking the libraries of the drmemtrace framework. A new CMake function is defined in the DynamoRIO package which sets the include directory for using the drmemtrace/ headers:

use_DynamoRIO_drmemtrace(mytool)

The drmemtrace_analyzer library exported by the DynamoRIO package is the main library to link when building a new tool. The tools described above are also exported as the libraries drmemtrace_basic_counts, drmemtrace_view, drmemtrace_opcode_mix, drmemtrace_histogram, drmemtrace_reuse_distance, drmemtrace_reuse_time, drmemtrace_simulator, and drmemtrace_func_view and can be created using the basic_counts_tool_create(), opcode_mix_tool_create(), histogram_tool_create(), reuse_distance_tool_create(), reuse_time_tool_create(), view_tool_create(), cache_simulator_create(), tlb_simulator_create(), and func_view_create() functions.
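
Putting these pieces together, a CMakeLists.txt for a new tool might look like the following sketch (here `mytool.cpp` is a placeholder source file, and DynamoRIO_DIR is assumed to point at an installed DynamoRIO package):

```cmake
cmake_minimum_required(VERSION 3.7)
project(mytool)
find_package(DynamoRIO REQUIRED)
add_executable(mytool mytool.cpp)
# Sets the include directory for the drmemtrace/ headers:
use_DynamoRIO_drmemtrace(mytool)
# Link the main analyzer library exported by the package:
target_link_libraries(mytool drmemtrace_analyzer)
```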

Simulator Parameters

drcachesim's behavior can be controlled through options passed after -t drcachesim but prior to the "--" delimiter on the command line:

$ bin64/drrun -t drcachesim <options> <to> <drcachesim> -- /path/to/target/app <args> <for> <app>

Boolean options can be disabled using a "-no_" prefix.

The parameters available are described below:

  • -offline
    default value: false
    By default, traces are processed online, sent over a pipe to a simulator. If this option is enabled, trace data is instead written to files in -outdir for later offline analysis. No simulator is executed.
  • -ipc_name
    default value: drcachesimpipe
    For online tracing and simulation (the default, unless -offline is requested), specifies the name of the named pipe used to communicate between the target application processes and the caching device simulator. On Linux this can include an absolute path (if it doesn't, a default temp directory will be used). A unique name must be chosen for each instance of the simulator being run at any one time. On Windows, the name is limited to 247 characters.
  • -outdir
    default value: .
    For the offline analysis mode (when -offline is requested), specifies the path to a directory where per-thread trace files will be written.
  • -subdir_prefix
    default value: drmemtrace
    For the offline analysis mode (when -offline is requested), specifies the prefix for the name of the sub-directory where per-thread trace files will be written. The sub-directory is created inside -outdir and has the form 'prefix.app-name.pid.id.dir'.
  • -indir
    default value: ""
    After a trace file is produced via -offline into -outdir, it can be passed to the simulator via this flag pointing at the subdirectory created in -outdir. The -offline tracing produces raw data files which are converted into final trace files on the first execution with -indir. The raw files can also be manually converted using the drraw2trace tool. Legacy single trace files with all threads interleaved into one are not supported with this option: use -infile instead.
  • -infile
    default value: ""
    Directs the simulator to use a single all-threads-interleaved-into-one trace file. This is a legacy file format that is no longer produced.
  • -jobs
    default value: -1
    By default, both post-processing of offline raw trace files and analysis of trace files is parallelized. This option controls the number of concurrent jobs. 0 disables concurrency and uses a single thread to perform all operations. A negative value sets the job count to the number of hardware threads, with a cap of 16.
  • -module_file
    default value: ""
    The opcode_mix tool needs the modules.log file (generated by the offline post-processing step in the raw/ subdirectory) in addition to the trace file. If the file is named modules.log and is in the same directory as the trace file, or a raw/ subdirectory below the trace file, this parameter can be omitted.
  • -funclist_file
    default value: ""
    The func_view tool needs the mapping from function name to identifier that was recorded during offline tracing. This data is stored in its own separate file in the raw/ subdirectory. If the file is named funclist.log and is in the same directory as the trace file, or a raw/ subdirectory below the trace file, this parameter can be omitted.
  • -cores
    default value: 4
    Specifies the number of cores to simulate.
  • -line_size
    default value: 64
    Specifies the cache line size, which is assumed to be identical for L1 and L2 caches. Must be a power of 2.
  • -L1I_size
    default value: 32K
    Specifies the total size of each L1 instruction cache. Must be a power of 2 and a multiple of -line_size.
  • -L1D_size
    default value: 32K
    Specifies the total size of each L1 data cache. Must be a power of 2 and a multiple of -line_size.
  • -L1I_assoc
    default value: 8
    Specifies the associativity of each L1 instruction cache. Must be a power of 2.
  • -L1D_assoc
    default value: 8
    Specifies the associativity of each L1 data cache. Must be a power of 2.
  • -LL_size
    default value: 8M
    Specifies the total size of the unified last-level (L2) cache. Must be a power of 2 and a multiple of -line_size.
  • -LL_assoc
    default value: 16
    Specifies the associativity of the unified last-level (L2) cache. Must be a power of 2.
  • -LL_miss_file
    default value: ""
    If non-empty, when running the cache simulator, requests that every last-level cache miss be written to a file at the specified path. Each miss is written in text format as a <program counter, address> pair. If this tool is linked with zlib, the file is written in gzip-compressed format. If non-empty, when running the cache miss analyzer, requests that prefetching hints based on the miss analysis be written to the specified file. Each hint is written in text format as a <program counter, stride, locality level> tuple.
  • -L0_filter
    default value: false
    Filters out instruction and data hits in a 'zero-level' cache during tracing itself, shrinking the final trace to only contain instruction and data accesses that miss in this initial cache. This cache is direct-mapped with sizes equal to -L0I_size and -L0D_size. It uses virtual addresses regardless of -use_physical.
  • -L0I_size
    default value: 32K
    Specifies the size of the 'zero-level' instruction cache for -L0_filter. Must be a power of 2 and a multiple of -line_size, unless it is set to 0, which disables instruction fetch entries from appearing in the trace.
  • -L0D_size
    default value: 32K
    Specifies the size of the 'zero-level' data cache for -L0_filter. Must be a power of 2 and a multiple of -line_size, unless it is set to 0, which disables data entries from appearing in the trace.
  • -coherence
    default value: false
    Writes to cache lines will invalidate other private caches that hold that line.
  • -use_physical
    default value: false
    If available, the default virtual addresses will be translated to physical. This is not possible from user mode on all platforms. This is not supported with -offline at this time.
  • -virt2phys_freq
    default value: 0
    This option only applies if -use_physical is enabled. The virtual to physical mapping is cached for performance reasons, yet the underlying mapping can change without notice. This option controls the frequency with which the cached value is ignored in order to re-access the actual mapping and ensure accurate results. The units are the number of memory accesses per forced access. A value of 0 uses the cached values for the entire application execution.
  • -cpu_scheduling
    default value: false
    By default, the simulator schedules threads to simulated cores in a static round-robin fashion. This option instead causes the scheduler to use the recorded cpu that each thread executed on (at the granularity of the trace buffer size), mapping traced cpus to cores and running each segment of each thread on the core that owns the recorded cpu for that segment.
  • -max_trace_size
    default value: 0
    If non-zero, this sets a maximum size on the amount of raw trace data gathered for each thread. This is not an exact limit: it may be exceeded by the size of one internal buffer. Once reached, instrumentation continues for that thread, but no further data is recorded.
  • -trace_after_instrs
    default value: 0
    If non-zero, this causes tracing to be suppressed until this many dynamic instruction executions are observed. At that point, regular tracing is put into place. Use -max_trace_size to set a limit on the subsequent trace length.
  • -exit_after_tracing
    default value: 0
    If non-zero, after tracing the specified number of references, the process is exited with an exit code of 0. The reference count is approximate.
  • -online_instr_types
    default value: false
    By default, offline traces include some information on the types of instructions, branches in particular. For online traces, this comes at a performance cost, so it is turned off by default.
  • -replace_policy
    default value: LRU
    Specifies the replacement policy for caches. Supported policies: LRU (Least Recently Used), LFU (Least Frequently Used), FIFO (First-In-First-Out).
  • -data_prefetcher
    default value: nextline
    Specifies the hardware data prefetcher policy. The currently supported policies are 'nextline' (fetch the subsequent cache line) and 'none' (disables hardware prefetching). The prefetcher is located between the L1D and LL caches.
  • -page_size
    default value: 4K
    Specifies the virtual/physical page size.
  • -TLB_L1I_entries
    default value: 32
    Specifies the number of entries in each L1 instruction TLB. Must be a power of 2.
  • -TLB_L1D_entries
    default value: 32
    Specifies the number of entries in each L1 data TLB. Must be a power of 2.
  • -TLB_L1I_assoc
    default value: 32
    Specifies the associativity of each L1 instruction TLB. Must be a power of 2.
  • -TLB_L1D_assoc
    default value: 32
    Specifies the associativity of each L1 data TLB. Must be a power of 2.
  • -TLB_L2_entries
    default value: 1024
    Specifies the number of entries in each unified L2 TLB. Must be a power of 2.
  • -TLB_L2_assoc
    default value: 4
    Specifies the associativity of each unified L2 TLB. Must be a power of 2.
  • -TLB_replace_policy
    default value: LFU
    Specifies the replacement policy for TLBs. Supported policies: LFU (Least Frequently Used).
  • -simulator_type
    default value: cache
    Specifies the type of the simulator. Supported types: cache, miss_analyzer, TLB, reuse_distance, reuse_time, histogram, or basic_counts.
  • -verbose
    default value: 0
    Verbosity level for notifications.
  • -show_func_trace
    default value: true
    In the func_trace tool, this controls whether every traced call is shown or instead only aggregate statistics are shown.
  • -disable_optimizations
    default value: false
    Disables various optimizations where information is omitted from offline trace recording when it can be reconstructed during post-processing. This is meant for testing purposes.
  • -dr
    default value: ""
    Specifies the path of the DynamoRIO root directory.
  • -dr_debug
    default value: false
    Requests use of the debug build of DynamoRIO rather than the release build.
  • -dr_ops
    default value: ""
    Specifies the options to pass to DynamoRIO.
  • -tracer
    default value: ""
    The full path to the tracer library.
  • -skip_refs
    default value: 0
    Specifies the number of references to skip in the beginning of the application execution. These memory references are dropped instead of being simulated.
  • -warmup_refs
    default value: 0
    Specifies the number of memory references to warm up caches before simulation. The warmup references come after the skipped references and before the simulated references. This flag is incompatible with -warmup_fraction.
  • -warmup_fraction
    default value: 0
    Specifies the fraction of last-level cache blocks to be loaded such that the cache is considered warmed up before simulation. The warmup fraction is computed after the skipped references and before the simulated references. This flag is incompatible with -warmup_refs.
  • -sim_refs
    default value: 8589934592G
    Specifies the number of memory references to simulate. The simulated references come after the skipped and warmup references, and the references following the simulated ones are dropped.
  • -view_syntax
    default value: att
    Specifies the syntax to use when viewing disassembled offline traces. The option can be set to one of att (default), intel, dr, and arm. An invalid specification falls back to the default.
  • -config_file
    default value: ""
    The full path to the cache hierarchy configuration file.
  • -report_top
    default value: 10
    Specifies the number of top results to be reported.
  • -reuse_distance_threshold
    default value: 100
    Specifies the reuse distance threshold for reporting the distant repeated references. A reference is a distant repeated reference if the distance to the previous reference on the same cache line exceeds the threshold.
  • -reuse_distance_histogram
    default value: false
    By default only the mean, median, and standard deviation of the reuse distances are reported. This option prints out the full histogram of reuse distances.
  • -reuse_skip_dist
    default value: 500
    Specifies the distance between nodes in the skip list. For optimal performance, set this to a value close to the estimated average reuse distance of the dataset.
  • -reuse_verify_skip
    default value: false
    Verifies every skip list-calculated reuse distance with a full list walk. This incurs significant additional overhead. This option is only available in debug builds.
  • -record_function
    default value: ""
    Records an invocation trace for the specified function(s) given in the option value. The value should fit this format: function_name|func_args_num (e.g., -record_function "memset|3") with an optional suffix "|noret" (e.g., -record_function "free|1|noret"). The trace will contain information for each function invocation's return address, function argument value(s), and (unless "|noret" is specified) function return value. (If multiple requested functions map to the same address and differ in whether "noret" was specified or in the number of args, the attributes from the first one requested will be used.) Only pointer-sized arguments and return values are recorded. The trace identifies which function is involved via a numeric ID entry prior to each set of value entries. The mapping from numeric ID to library-qualified symbolic name is recorded during tracing in a file "funclist.log" whose format is described by the drmemtrace_get_funclist_path() function's documentation. If the target function is in the dynamic symbol table, then the function_name should be a mangled name (e.g. "_Znwm" for "operator new", "_ZdlPv" for "operator delete"). Otherwise, the function_name should be a demangled name. Recording multiple functions can be achieved by using the separator "&" (e.g., -record_function "memset|3&memcpy|3"), or by specifying multiple -record_function options (e.g., -record_function "memset|3" -record_function "memcpy|3"). Note that the provided function name should be unique, and should not collide with the existing heap functions (see -record_heap_value) if the -record_heap option is enabled.
  • -record_heap
    default value: false
    A convenience option that enables recording a trace for the heap functions defined in -record_heap_value. Specifying this option is equivalent to specifying -record_function [heap_functions], where [heap_functions] is the value of -record_heap_value.
  • -record_heap_value
    default value: malloc|1&free|1|noret&tc_malloc|1&tc_free|1|noret&__libc_malloc|1&__libc_free|1|noret&calloc|2&_Znwm|1&_ZnwmRKSt9nothrow_t|2&_ZnwmSt11align_val_t|2&_ZnwmSt11align_val_tRKSt9nothrow_t|3&_ZnwmPv|2&_Znam|1&_ZnamRKSt9nothrow_t|2&_ZnamSt11align_val_t|2&_ZnamSt11align_val_tRKSt9nothrow_t|3&_ZnamPv|2&_ZdlPv|1|noret&_ZdlPvRKSt9nothrow_t|2|noret&_ZdlPvSt11align_val_t|2|noret&_ZdlPvSt11align_val_tRKSt9nothrow_t|3|noret&_ZdlPvm|2|noret&_ZdlPvmSt11align_val_t|3|noret&_ZdlPvS_|2|noret&_ZdaPv|1|noret&_ZdaPvRKSt9nothrow_t|2|noret&_ZdaPvSt11align_val_t|2|noret&_ZdaPvSt11align_val_tRKSt9nothrow_t|3|noret&_ZdaPvm|2|noret&_ZdaPvmSt11align_val_t|3|noret&_ZdaPvS_|2|noret
    Functions recorded by -record_heap. The option value should fit the same format required by -record_function. These functions will not be traced unless -record_heap is specified.
  • -record_dynsym_only
    default value: false
    Symbol lookup can be expensive for large applications and libraries. This option causes the symbol lookup for -record_function and -record_heap to look in the dynamic symbol table only.
  • -record_replace_retaddr
    default value: false
    Function wrapping can be expensive for large concurrent applications. This option causes the post-function control point to be located using return address replacement, which has lower overhead, but runs the risk of breaking an application that examines or changes its own return addresses in the recorded functions.
  • -miss_count_threshold
    default value: 50000
    Specifies the minimum number of LLC misses of a load for it to be eligible for analysis in search of patterns in the miss address stream.
  • -miss_frac_threshold
    default value: 0.005
    Specifies the minimum fraction of LLC misses of a load (from all misses) for it to be eligible for analysis in search of patterns in the miss address stream.
  • -confidence_threshold
    default value: 0.75
    Specifies the minimum confidence to include a discovered pattern in the output results. Confidence in a discovered pattern for a load instruction is calculated as the fraction of the load's misses with the discovered pattern over all the load's misses.

Current Limitations

The drcachesim tool is a work in progress. We welcome contributions in its areas of missing functionality.

Comparison to Other Simulators

drcachesim is one of the few simulators to support multiple processes. This feature requires an out-of-process simulator and inter-process communication. A single-process design would incur less overhead. Thus, we expect drcachesim to pay for its multi-process support with potentially unfavorable performance versus single-process simulators.

When comparing cache hits, misses, and miss rates across simulators, the details can vary substantially. For example, some other simulators (such as cachegrind) do not split memory references that cross cache lines into multiple hits or misses, while drcachesim does split them. Instructions that reference multiple memory words on the same cache line (such as ldm on ARM) are considered single accesses by drcachesim, while other simulators (such as cachegrind) may split such an access into separate pieces. A final example involves string loop instructions on x86. drcachesim considers only the first iteration to involve an instruction fetch (presenting subsequent iterations as "non-fetched instructions" which the simulator ignores; the basic_counts tool does show these as a separate statistic), while other simulators (incorrectly) issue a fetch to the instruction cache on every iteration of the string loop.