DynamoRIO API
Cache Simulator

drcachesim is a DynamoRIO client that collects memory access traces and feeds them to an online or offline tool for analysis. The default analysis tool is a cache simulator which simulates a set of specific caching devices, e.g., CPU caches and TLBs. The trace collector and simulator support multiple processes each with multiple threads.

Overview

drcachesim consists of two components: a tracer and an analyzer. The tracer collects a memory access trace from each thread within each application process. The analyzer consumes the traces (online or offline) and performs customized analysis. It is designed to be extensible, allowing users to easily implement a simulator for different devices, such as CPU caches, TLBs, page caches, etc. (see Extending the Simulator). The default analyzer simulates the architectural behavior of caching devices for a target application (or multiple applications).

Running the Simulator

To launch drcachesim, use the -t flag to drrun:

bin64/drrun -t drcachesim -- /path/to/target/app <args> <for> <app>

The target application will be launched under a DynamoRIO tracer client that gathers all of its memory references and passes them to the simulator over a named pipe. Any child processes will be followed and profiled as well, with their memory references likewise passed to the simulator.

To dump the trace for future offline analysis:

bin64/drrun -t drcachesim -offline -- /path/to/target/app <args> <for> <app>

The collected traces will be dumped into a newly created directory, which can then be passed to drcachesim for offline cache simulation:

bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/

Simulator Details

The simulator can generally be extended to model a variety of caching devices; currently, CPU caches and TLBs are implemented. The type of device to simulate is specified via the "-simulator_type" parameter (see Simulator Parameters).

The CPU cache simulator models a configurable number of cores, each with an L1 data cache and an L1 instruction cache. Currently there is a single shared L2 unified cache, but we would like to extend support to arbitrary cache hierarchies (see Current Limitations). The cache line size and each cache's total size and associativity are user-specified (see Simulator Parameters).
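
For example, the following invocation (with hypothetical sizes) simulates two cores, each with a 64K L1 data cache, sharing a 16M L2:

bin64/drrun -t drcachesim -cores 2 -L1D_size 64K -LL_size 16M -- /path/to/target/app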

The TLB simulator models a configurable number of cores, each with an L1 instruction TLB, an L1 data TLB, and a unified L2 TLB. The number of entries in each TLB, each TLB's associativity, and the virtual/physical page size are all user-specified (see Simulator Parameters).
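
For example, a TLB simulation with hypothetical parameter values could be requested as follows:

bin64/drrun -t drcachesim -simulator_type TLB -TLB_L2_entries 2048 -page_size 4K -- /path/to/target/app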

Neither simulator has a simple way to know which core any particular thread executed on at a given point in time. Instead, each uses a simple static scheduling of threads to cores: a round-robin assignment, with load balancing to fill in gaps with new threads after earlier threads exit.
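
An illustrative sketch of such an assignment scheme (not drcachesim's actual code) is shown below: each new thread is placed on the least-loaded core, with ties broken toward the lowest-numbered core, which degenerates to plain round-robin while all loads are equal.

#include <cstddef>
#include <vector>

// Hypothetical illustration only; the real drcachesim scheduling code
// may differ in its details.
static int
assign_core(std::vector<int> &threads_per_core)
{
    // Pick the least-loaded core.  Ties go to the lowest-numbered core,
    // which yields round-robin while all loads are equal; when a thread
    // exits, its core's count is decremented, so the next new thread
    // fills that gap first.
    std::size_t best = 0;
    for (std::size_t i = 1; i < threads_per_core.size(); ++i) {
        if (threads_per_core[i] < threads_per_core[best])
            best = i;
    }
    ++threads_per_core[best];
    return static_cast<int>(best);
}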

The memory access traces contain some optimizations that combine the references for one basic block together. As a result, some thread interleavings that could occur natively may not be considered. There are no other disruptions to thread ordering, however, and the application runs with all of its threads concurrently, just as it would natively (although more slowly).

Once every process has exited, the simulator prints cache miss statistics for each cache to stderr. The simulator is designed to be extensible, allowing for different cache studies to be carried out: see Extending the Simulator.

For an L2 caching device, the L1 caching devices are considered its children. Two separate miss rates are computed: the "Local miss rate" considers only requests that reach the L2, while the "Total miss rate" also includes the child hits.
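
For example, with hypothetical counts of 1000 hits in the L1 children, 100 requests reaching the L2, and 20 L2 misses, the local miss rate is 20/100 = 20%, while the total miss rate is 20/(100 + 1000) ≈ 1.8%.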

For memory requests that cross blocks, each block touched is considered separately, resulting in separate hit and miss statistics. This can be changed by implementing a custom statistics gatherer (see Extending the Simulator).
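
For example, with a 64-byte line size, an 8-byte access starting at byte offset 60 of one line touches two cache lines and therefore produces two separate hit-or-miss events.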

Physical Addresses

The memory access tracing client gathers virtual addresses. On Linux, if the kernel allows user-mode applications access to the /proc/self/pagemap file, physical addresses may be used instead. This can be requested via the -use_physical runtime option (see Simulator Parameters). This works on current kernels but is expected to stop working from user mode on future kernels due to recent security changes (see http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce).
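
For example:

bin64/drrun -t drcachesim -use_physical -- /path/to/target/app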

Current Limitations

The drcachesim tool is a work in progress. We welcome contributions in its areas of missing functionality, such as support for arbitrary cache hierarchies.

Extending the Simulator

The drcachesim tool was designed to be extensible, allowing users to easily model different caching devices, implement different models, and gather custom statistics.

To model different caching devices, subclass the simulator_t, caching_device_t, caching_device_block_t, and caching_device_stats_t classes.

To implement a different cache model, subclass the cache_t class and override the request(), access_update(), and/or replace_which_way() method(s).
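
As a minimal sketch, a random-replacement cache might look roughly as follows; the method signature and the associativity_ member shown here are assumptions for illustration, so consult the drcachesim headers (e.g., cache.h) for the exact prototypes.

#include <stdlib.h>

#include "cache.h"

// Hypothetical random-replacement cache: a sketch only, not part of
// drcachesim itself.
class cache_random_t : public cache_t {
 protected:
    // Assumed signature: takes the index of the first block of the
    // target set and returns the way to evict.
    virtual int
    replace_which_way(int block_idx)
    {
        // Evict a uniformly random way instead of consulting recency
        // or frequency data.  The associativity_ member name is an
        // assumption; check caching_device.h for the actual field.
        return rand() % associativity_;
    }
};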

Statistics gathering is separated out into the caching_device_stats_t class. To implement custom statistics, subclass caching_device_stats_t and override the access(), child_access(), flush(), and/or print_stats() methods.
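
For example, a custom gatherer that tracks an extra counter might look roughly like this sketch; the access() and print_stats() signatures are assumptions here, so consult caching_device_stats.h for the real prototypes.

#include <iostream>
#include <string>

#include "caching_device_stats.h"

// Hypothetical custom statistics gatherer: a sketch only.  The
// constructor and the code that installs it into a cache are omitted.
class custom_stats_t : public caching_device_stats_t {
 public:
    virtual void
    access(const memref_t &memref, bool hit)
    {
        // Keep the default counts, then update a custom counter.
        caching_device_stats_t::access(memref, hit);
        if (!hit)
            custom_misses_++;
    }
    virtual void
    print_stats(std::string prefix)
    {
        caching_device_stats_t::print_stats(prefix);
        std::cerr << prefix << "Custom miss count: " << custom_misses_ << "\n";
    }
 private:
    long long custom_misses_ = 0;
};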

Simulator Parameters

drcachesim's behavior can be controlled through options passed after -t drcachesim but prior to the "--" delimiter on the command line:

bin64/drrun -t drcachesim <options> <to> <drcachesim> -- /path/to/target/app <args> <for> <app>

Boolean options can be disabled using a "-no_" prefix.
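
For example, -offline (a boolean option) can be explicitly disabled as follows:

bin64/drrun -t drcachesim -no_offline -- /path/to/target/app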

The parameters available are described below:

  • -offline
    default value: false
    By default, traces are processed online, sent over a pipe to a simulator. If this option is enabled, trace data is instead written to files in -outdir for later offline analysis. No simulator is executed.
  • -ipc_name
    default value: drcachesimpipe
    For online tracing and simulation (the default, unless -offline is requested), specifies the base name of the named pipe used to communicate between the target application processes and the caching device simulator. A unique name must be chosen for each instance of the simulator being run at any one time. On Windows, the name is limited to 247 characters.
  • -outdir
    default value: .
    For the offline analysis mode (when -offline is requested), specifies the path to a directory where per-thread trace files will be written.
  • -indir
    default value: ""
    After a trace file is produced via -offline into -outdir, it can be passed to the simulator via this flag pointing at the subdirectory created in -outdir.
  • -infile
    default value: ""
Directs the simulator to use a trace file (not a raw data file from -offline: such a file needs to be converted via drposttrace or -indir first).
  • -cores
    default value: 4
    Specifies the number of cores to simulate.
  • -line_size
    default value: 64
    Specifies the cache line size, which is assumed to be identical for L1 and L2 caches.
  • -L1I_size
    default value: 32K
    Specifies the total size of each L1 instruction cache.
  • -L1D_size
    default value: 32K
    Specifies the total size of each L1 data cache.
  • -L1I_assoc
    default value: 8
    Specifies the associativity of each L1 instruction cache.
  • -L1D_assoc
    default value: 8
    Specifies the associativity of each L1 data cache.
  • -LL_size
    default value: 8M
    Specifies the total size of the unified last-level (L2) cache.
  • -LL_assoc
    default value: 16
    Specifies the associativity of the unified last-level (L2) cache.
  • -use_physical
    default value: false
If enabled and supported by the platform, the virtual addresses gathered by default will be translated to physical addresses. This is not possible from user mode on all platforms.
  • -virt2phys_freq
    default value: 0
    This option only applies if -use_physical is enabled. The virtual to physical mapping is cached for performance reasons, yet the underlying mapping can change without notice. This option controls the frequency with which the cached value is ignored in order to re-access the actual mapping and ensure accurate results. The units are the number of memory accesses per forced access. A value of 0 uses the cached values for the entire application execution.
  • -max_trace_size
    default value: 0
    If non-zero, this sets a maximum size on the amount of raw trace data gathered for each thread. This is not an exact limit: it may be exceeded by the size of one internal buffer. Once reached, instrumentation continues for that thread, but no further data is recorded.
  • -online_instr_types
    default value: false
    By default, offline traces include some information on the types of instructions, branches in particular. For online traces, this comes at a performance cost, so it is turned off by default.
  • -replace_policy
    default value: LRU
    Specifies the replacement policy for caches. Supported policies: LRU (Least Recently Used), LFU (Least Frequently Used), FIFO (First-In-First-Out).
  • -page_size
    default value: 4K
    Specifies the virtual/physical page size.
  • -TLB_L1I_entries
    default value: 32
    Specifies the number of entries in each L1 instruction TLB.
  • -TLB_L1D_entries
    default value: 32
    Specifies the number of entries in each L1 data TLB.
  • -TLB_L1I_assoc
    default value: 32
    Specifies the associativity of each L1 instruction TLB.
  • -TLB_L1D_assoc
    default value: 32
    Specifies the associativity of each L1 data TLB.
  • -TLB_L2_entries
    default value: 1024
    Specifies the number of entries in each unified L2 TLB.
  • -TLB_L2_assoc
    default value: 4
    Specifies the associativity of each unified L2 TLB.
  • -TLB_replace_policy
    default value: LFU
    Specifies the replacement policy for TLBs. Supported policies: LFU (Least Frequently Used).
  • -simulator_type
    default value: cache
    Specifies the type of the simulator. Supported types: cache, TLB.
  • -verbose
    default value: 0
    Verbosity level for notifications.
  • -dr
    default value: ""
    Specifies the path of the DynamoRIO root directory.
  • -dr_debug
    default value: false
    Requests use of the debug build of DynamoRIO rather than the release build.
  • -dr_ops
    default value: ""
    Specifies the options to pass to DynamoRIO.
  • -tracer
    default value: ""
    The full path to the tracer library.
  • -skip_refs
    default value: 0
Specifies the number of references to skip at the beginning of the application execution. These memory references are dropped instead of being simulated.
  • -warmup_refs
    default value: 0
    Specifies the number of memory references to warm up caches before simulation. The warmup references come after the skipped references and before the simulated references.
  • -sim_refs
    default value: 8589934592G
Specifies the number of memory references simulated. The simulated references come after the skipped and warmup references, and any references following the simulated ones are dropped (see the example after this list).
  • -report_top
    default value: 10
    Specifies the number of top results to be reported.
  • -reuse_distance_threshold
    default value: 100
Specifies the reuse distance threshold for reporting distant repeated references. A reference is a distant repeated reference if the distance to the previous reference to the same cache line exceeds this threshold.
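
For example, the following hypothetical invocation drops the first 1000000 references, warms up the caches with the next 4000000, and simulates the 100000000 references after that:

bin64/drrun -t drcachesim -skip_refs 1000000 -warmup_refs 4000000 -sim_refs 100000000 -- /path/to/target/app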

Comparison to Other Simulators

drcachesim is one of the few simulators to support multiple processes. This feature requires an out-of-process simulator and inter-process communication. A single-process design would incur less overhead. Thus, we expect drcachesim to pay for its multi-process support with potentially unfavorable performance versus single-process simulators.

When comparing cache hits, misses, and miss rates across simulators, the details can vary substantially. For example, some other simulators (such as cachegrind) do not split memory references that cross cache lines into multiple hits or misses. Additionally, instructions that reference multiple memory words (such as ldm on ARM) are considered to be single accesses by drcachesim, while other simulators (such as cachegrind) may split the accesses into separate pieces.