Simulator Details

The simulator is designed to be extended to model a variety of caching devices; currently, CPU caches and TLBs are implemented. The type of device to simulate is selected with the parameter "-simulator_type" (see Simulator Parameters).
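
For example, the TLB simulator can be selected as follows, where "myapp" is just a placeholder for the application being traced (the cache simulator is the default):

$ bin64/drrun -t drcachesim -simulator_type TLB -- ./myapp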

The CPU cache simulator models a configurable number of cores, each with an L1 data cache and an L1 instruction cache. Currently there is a single shared L2 unified cache, but we would like to extend support to arbitrary cache hierarchies (see Current Limitations). The cache line size and each cache's total size and associativity are user-specified (see Simulator Parameters).
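
For example, a four-core configuration with 32KB L1 caches, an 8MB last-level cache, and 64-byte lines might be requested as follows (the exact parameter names and their defaults are listed under Simulator Parameters; the values here are only illustrative):

$ bin64/drrun -t drcachesim -cores 4 -line_size 64 -L1I_size 32768 -L1D_size 32768 -LL_size 8388608 -- ./myapp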

The TLB simulator models a configurable number of cores, each with an L1 instruction TLB, an L1 data TLB, and an L2 unified TLB. The number of entries and the associativity of each TLB, as well as the virtual/physical page size, are user-specified (see Simulator Parameters).
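
A TLB simulation is configured in the same way; the parameter names below are meant to mirror those documented under Simulator Parameters, and the values are only illustrative:

$ bin64/drrun -t drcachesim -simulator_type TLB -cores 4 -TLB_L1I_entries 64 -TLB_L1D_entries 64 -TLB_L2_entries 1024 -page_size 4096 -- ./myapp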

Neither simulator has a simple way to know which core any particular thread executed on for each of its instructions. The tracer records which core a thread is running on each time it writes out a full trace buffer, giving an approximation of the actual scheduling (at the granularity of the trace buffer size). By default, the cache and TLB simulators ignore that information and schedule threads to the simulated cores in a static round-robin fashion, with load balancing to fill in gaps with new threads after threads exit. The option "-cpu_scheduling" (see Simulator Parameters) can be used instead to map each physical cpu to a simulated core and to use the recorded cpu on which each segment of thread execution occurred to schedule execution in a manner that more closely resembles the traced execution on the physical machine. Below is an example of the output produced with this option when running an application with many threads on a physical machine with 8 cpus; the 8 cpus are mapped to the 4 simulated cores:

$ bin64/drrun -t drcachesim -cpu_scheduling -- ~/test/pi_estimator 20
Estimation of pi is 3.141592653798125
<Stopping application /home/bruening/dr/test/threadsig (213517)>
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (2 traced CPU(s): #2, #5)
  L1I stats:
    Hits: 2,756,429
    Misses: 1,190
    Miss rate: 0.04%
  L1D stats:
    Hits: 1,747,822
    Misses: 13,511
    Prefetch hits: 2,354
    Prefetch misses: 11,157
    Miss rate: 0.77%
Core #1 (2 traced CPU(s): #4, #0)
  L1I stats:
    Hits: 472,948
    Misses: 299
    Miss rate: 0.06%
  L1D stats:
    Hits: 895,099
    Misses: 1,224
    Prefetch hits: 253
    Prefetch misses: 971
    Miss rate: 0.14%
Core #2 (2 traced CPU(s): #1, #7)
  L1I stats:
    Hits: 448,581
    Misses: 649
    Miss rate: 0.14%
  L1D stats:
    Hits: 811,483
    Misses: 1,723
    Prefetch hits: 378
    Prefetch misses: 1,345
    Miss rate: 0.21%
Core #3 (2 traced CPU(s): #6, #3)
  L1I stats:
    Hits: 275,192
    Misses: 154
    Miss rate: 0.06%
  L1D stats:
    Hits: 522,655
    Misses: 850
    Prefetch hits: 173
    Prefetch misses: 677
    Miss rate: 0.16%
LL stats:
    Hits: 12,491
    Misses: 7,109
    Prefetch hits: 8,922
    Prefetch misses: 5,228
    Local miss rate: 36.27%
    Child hits: 7,933,367
    Total miss rate: 0.09%

The memory access traces include optimizations that group together the references from a single basic block. As a result, some thread interleavings that could occur natively may not be observed. There are no other disruptions to thread ordering, however: the application runs with all of its threads concurrently, just as it would natively (although more slowly).

Once every process has exited, the simulator prints cache miss statistics for each cache to stderr. The simulator is designed to be extensible, allowing for different cache studies to be carried out: see Extending the Simulator.

For an L2 caching device, the L1 caching devices are considered its children. Two separate miss rates are computed: the "Local miss rate" considers only the requests that reach the L2, while the "Total miss rate" also counts the child hits. This generalizes to deeper hierarchies: lower level caches are children, and the reported child hits are cumulative across all lower levels.
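
Using the demand (non-prefetch) LL numbers from the example output above:

Local miss rate = 7,109 / (12,491 + 7,109) = 36.27%
Total miss rate = 7,109 / (7,933,367 + 12,491 + 7,109) = 0.09%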

For memory requests that cross block boundaries, each block touched is counted separately, resulting in separate hit and miss statistics. This can be changed by implementing a custom statistics gatherer (see Extending the Simulator).
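
For example, with 64-byte cache lines, an 8-byte load at address 0x7c spans the lines starting at 0x40 and 0x80, so it is counted as two separate requests, each of which can hit or miss independently.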

Software and hardware prefetches are combined in the prefetch hit and miss statistics, which are reported separately from regular loads and stores. To isolate software prefetch statistics, disable the hardware prefetcher by running with "-data_prefetcher none" (see Simulator Parameters). While misses from software prefetches are included in cache miss files, misses from hardware prefetches are not.
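
For example, to gather statistics that reflect only software prefetches (with "myapp" again standing in for the traced application):

$ bin64/drrun -t drcachesim -data_prefetcher none -- ./myapp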