drcachesim is a DynamoRIO client that collects memory access traces and feeds them to an online or offline tool for analysis. The default analysis tool is a cache simulator which simulates a set of specific caching devices, e.g., CPU caches and TLBs. The trace collector and simulator support multiple processes each with multiple threads.
drcachesim consists of two components: a tracer and an analyzer. The tracer collects a memory access trace from each thread within each application process. The analyzer consumes the traces (online or offline) and performs customized analysis. It is designed to be extensible, allowing users to easily implement a simulator for different devices, such as CPU caches, TLBs, page caches, etc. (see Extending the Simulator). The default analyzer simulates the architectural behavior of caching devices for a target application (or multiple applications).
To launch drcachesim, use the -t flag to drrun:
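For example (a sketch assuming a 64-bit build; bin64/drrun and the application path are placeholders for your install):

    bin64/drrun -t drcachesim -- /path/to/target/app <args>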
The target application will be launched under a DynamoRIO tracer client that gathers all of its memory references and passes them to the simulator via a pipe. Any child processes will be followed into and profiled, with their memory references passed to the simulator as well.
To dump the trace for future offline analysis:
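A sketch of such an invocation, using the -offline option (paths are again placeholders):

    bin64/drrun -t drcachesim -offline -- /path/to/target/app <args>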
The collected traces will be dumped into a newly created directory, which can be passed to drcachesim for offline cache simulation:
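For example, via the -indir option (the directory name below is illustrative; the actual name encodes the application name and process id):

    bin64/drrun -t drcachesim -indir drmemtrace.app.1234.dir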
Generally, the simulator can be extended to model a variety of caching devices; currently, CPU caches and TLBs are implemented. The type of device to simulate is selected with the "-simulator_type" parameter (see Simulator Parameters).
The CPU cache simulator models a configurable number of cores, each with an L1 data cache and an L1 instruction cache. Currently there is a single shared L2 unified cache, but we would like to extend support to arbitrary cache hierarchies (see Current Limitations). The cache line size and each cache's total size and associativity are user-specified (see Simulator Parameters).
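For instance, a run overriding a few of these parameters might look like the following sketch (option names assumed to match the Simulator Parameters list; sizes are in bytes):

    bin64/drrun -t drcachesim -cores 4 -line_size 64 -L1D_size 65536 -LL_size 8388608 -- /path/to/target/app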
The TLB simulator models a configurable number of cores, each with an L1 instruction TLB, an L1 data TLB, and an L2 unified TLB. Each TLB's entry number and associativity, and the virtual/physical page size, are user-specified (see Simulator Parameters).
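For instance (a sketch; the TLB parameter names here are assumptions to be verified against the Simulator Parameters list):

    bin64/drrun -t drcachesim -simulator_type TLB -TLB_L2_entries 1024 -TLB_L2_assoc 4 -- /path/to/target/app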
Neither simulator has a simple way to know which core any particular thread executed on at a given point in time. Instead, each uses a simple static scheduling of threads to cores: a round-robin assignment, with load balancing to fill in gaps with new threads after earlier threads exit.
The memory access traces contain some optimizations that combine references for one basic block together. This may result in not considering some thread interleavings that could occur natively. There are no other disruptions to thread ordering, however, and the application runs with all of its threads concurrently, just as it would natively (although more slowly).
Once every process has exited, the simulator prints cache miss statistics for each cache to stderr. The simulator is designed to be extensible, allowing for different cache studies to be carried out: see Extending the Simulator.
For an L2 caching device, the L1 caching devices are considered its children. Two separate miss rates are computed: the "Local miss rate" considers just the requests that reach the L2, while the "Total miss rate" also includes the child hits.
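As a worked example: if the L1 children receive 1000 requests and hit on 900 of them, and the L2 misses on 40 of the 100 requests that reach it, the local miss rate is 40/100 = 40%, while the total miss rate is 40/(100 + 900) = 4%.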
For memory requests that cross blocks, each block touched is considered separately, resulting in separate hit and miss statistics. This can be changed by implementing a custom statistics gatherer (see Extending the Simulator).
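For example, with 64-byte blocks, an 8-byte load at address 0x3ffc touches both the block at 0x3fc0 and the block at 0x4000, and each of those two accesses is counted as its own hit or miss.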
The memory access tracing client gathers virtual addresses. On Linux, if the kernel allows user-mode applications access to the /proc/self/pagemap file, physical addresses may be used instead. This can be requested via the -use_physical runtime option (see Simulator Parameters). This works on current kernels but is expected to stop working from user mode on future kernels due to recent security changes (see http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce).
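For example (paths are placeholders):

    bin64/drrun -t drcachesim -use_physical -- /path/to/target/app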
The drcachesim tool is a work in progress. We welcome contributions in these areas of missing functionality:
The drcachesim tool was designed to be extensible, allowing users to easily model different caching devices, implement different models, and gather custom statistics.
To model different caching devices, subclass the simulator_t, caching_device_t, caching_device_block_t, and caching_device_stats_t classes.
To implement a different cache model, subclass the cache_t class and override the request(), access_update(), and/or replace_which_way() methods.
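As a hedged sketch of the idea (the header name, the replace_which_way() signature, and the associativity_ member are assumptions; check the cache_t and caching_device_t declarations in your DynamoRIO version, and note that constructor details are omitted):

    #include <cstdlib>
    #include "cache.h" // assumed header for cache_t

    // Hypothetical cache that picks a random victim way instead of LRU.
    class cache_random_t : public cache_t {
    protected:
        // Signature assumed: selects which way to evict in the set
        // that contains block_idx.
        int replace_which_way(int block_idx) override
        {
            return std::rand() % associativity_; // associativity_ assumed protected
        }
    };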
Statistics gathering is separated out into the caching_device_stats_t class. To implement custom statistics, subclass caching_device_stats_t and override the access() method.
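A similarly hedged sketch (the header name, the access() signature, the memref_t field names, and base-constructor details are assumptions to be checked against your version of the sources):

    #include <cstdint>
    #include "caching_device_stats.h" // assumed header for caching_device_stats_t

    // Hypothetical gatherer that additionally counts store (write) misses.
    class store_miss_stats_t : public caching_device_stats_t {
    public:
        void access(const memref_t &memref, bool hit) override
        {
            // Preserve the default hit/miss counting, then add our own.
            caching_device_stats_t::access(memref, hit);
            if (!hit && memref.data.type == TRACE_TYPE_WRITE)
                ++store_misses_;
        }
    private:
        int64_t store_misses_ = 0;
    };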
drcachesim's behavior can be controlled through options passed after drcachesim but prior to the "--" delimiter on the command line:
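For example, to select the TLB simulator rather than the default cache simulator:

    bin64/drrun -t drcachesim -simulator_type TLB -- /path/to/target/app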
Boolean options can be disabled using a "-no_" prefix.
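For instance, a boolean option such as -use_physical can be explicitly turned off by passing -no_use_physical (illustrative only, as that option is assumed here to already default to off).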
The parameters available are described below:
drcachesim is one of the few simulators to support multiple processes. This feature requires an out-of-process simulator and inter-process communication. A single-process design would incur less overhead. Thus, we expect drcachesim to pay for its multi-process support with potentially unfavorable performance versus single-process simulators.
When comparing cache hits, misses, and miss rates across simulators, the details can vary substantially. For example, some other simulators (such as cachegrind) do not split memory references that cross cache lines into multiple hits or misses. Additionally, instructions that reference multiple memory words (such as ldm on ARM) are considered to be single accesses by drcachesim, while other simulators (such as cachegrind) may split the accesses into separate pieces.