DynamoRIO
Simulator Details

The simulator can be extended to model a variety of caching devices. Currently, CPU caches and TLBs are implemented. The type of device to simulate is selected with the "-tool" parameter (see Simulator Parameters).

The CPU cache simulator models a configurable number of cores, each with an L1 data cache and an L1 instruction cache. Currently there is a single shared L2 unified cache, but we would like to extend support to arbitrary cache hierarchies (see Current Limitations). The cache line size and each cache's total size and associativity are user-specified (see Simulator Parameters).
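As a rough illustration of how these user-specified parameters relate, the number of lines and sets in a set-associative cache follows from its total size, associativity, and line size. The helper below is a hypothetical sketch (the name `cache_geometry` is not part of the simulator) and uses the cache parameters that appear in the sample output later in this section.

```python
def cache_geometry(total_size, associativity, line_size):
    """Return (num_lines, num_sets) for a set-associative cache.

    Hypothetical helper for illustration only; not the simulator's code.
    """
    if total_size % (associativity * line_size) != 0:
        raise ValueError("total size must be a multiple of associativity * line size")
    num_lines = total_size // line_size
    num_sets = num_lines // associativity
    return num_lines, num_sets


# L1 parameters as printed in the sample output (size=32768, assoc=8, block=64).
l1 = cache_geometry(32768, 8, 64)
# Last-level cache parameters from the same output (size=8388608, assoc=16, block=64).
ll = cache_geometry(8388608, 16, 64)
print(l1, ll)
```

With these values, each L1 cache has 512 lines in 64 sets, and the last-level cache has 131072 lines in 8192 sets.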

The TLB simulator models a configurable number of cores, each with an L1 instruction TLB, an L1 data TLB, and an L2 unified TLB. Each TLB's number of entries and associativity, and the virtual/physical page size, are user-specified (see Simulator Parameters).
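To make the role of the page size concrete, the sketch below splits an address into a page number and an offset, which is the general idea behind a TLB lookup: all accesses within one page map to the same TLB entry. The function names, the 4096-byte page size, and the 64-entry, 4-way configuration are assumptions chosen for illustration, not the simulator's defaults or code.

```python
def split_virtual_address(addr, page_size):
    """Split addr into (virtual page number, offset within the page)."""
    # A page size must be a positive power of two for this split to make sense.
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page size must be a power of two")
    return addr // page_size, addr % page_size


def tlb_num_sets(num_entries, associativity):
    """Number of sets in a set-associative TLB (hypothetical helper)."""
    if num_entries % associativity != 0:
        raise ValueError("entry count must be a multiple of the associativity")
    return num_entries // associativity


# Assumed values: 4096-byte pages and a 64-entry, 4-way TLB.
page_number, offset = split_virtual_address(0x7F001234, 4096)
sets = tlb_num_sets(64, 4)
print(hex(page_number), hex(offset), sets)
```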

Neither simulator has a simple way to know which core a particular thread executed on for each of its instructions. The tracer records which core a thread is on each time it writes out a full trace buffer, which gives an approximation of the actual scheduling, but this approximation is not representative because of tracing overhead (see As-Traced Schedule Limitations). By default, the cache and TLB simulators ignore that information and schedule threads to simulated cores in a static round-robin fashion, with load balancing to fill in gaps with new threads after threads exit. The option "-cpu_scheduling" (see Simulator Parameters) can be used instead to map each physical cpu to a simulated core and to schedule execution following the "as traced" schedule, using the recorded cpu on which each segment of thread execution occurred; as just noted, however, this is not representative. Instead, we recommend using offline traces and dynamic re-scheduling with the -core_serial parameter, as explained in Dynamic Scheduling. Here is an example:

$ bin64/drrun -t drmemtrace -offline -- ~/test/pi_estimator 8 20
Estimation of pi is 3.141592653798125
$ bin64/drrun -t drcachesim -core_serial -cores 3 -indir drmemtrace.pi_estimator.*.dir
Cache simulation results:
Core #0 (traced CPU(s): #0)
  L1I0 (size=32768, assoc=8, block=64, LRU) stats:
    Hits: 1,853,727
    Misses: 2,152
    Compulsory misses: 2,045
    Invalidations: 0
    Miss rate: 0.12%
  L1D0 (size=32768, assoc=8, block=64, LRU) stats:
    Hits: 605,114
    Misses: 11,973
    Compulsory misses: 9,845
    Invalidations: 0
    Prefetch hits: 1,880
    Prefetch misses: 10,093
    Miss rate: 1.94%
Core #1 (traced CPU(s): #1)
  L1I1 (size=32768, assoc=8, block=64, LRU) stats:
    Hits: 942,992
    Misses: 461
    Compulsory misses: 366
    Invalidations: 0
    Miss rate: 0.05%
  L1D1 (size=32768, assoc=8, block=64, LRU) stats:
    Hits: 385,134
    Misses: 534
    Compulsory misses: 775
    Invalidations: 0
    Prefetch hits: 144
    Prefetch misses: 390
    Miss rate: 0.14%
Core #2 (traced CPU(s): #2)
  L1I2 (size=32768, assoc=8, block=64, LRU) stats:
    Hits: 944,622
    Misses: 453
    Compulsory misses: 365
    Invalidations: 0
    Miss rate: 0.05%
  L1D2 (size=32768, assoc=8, block=64, LRU) stats:
    Hits: 385,808
    Misses: 537
    Compulsory misses: 791
    Invalidations: 0
    Prefetch hits: 140
    Prefetch misses: 397
    Miss rate: 0.14%
LL (size=8388608, assoc=16, block=64, LRU) stats:
    Hits: 8,091
    Misses: 8,019
    Compulsory misses: 13,173
    Invalidations: 0
    Prefetch hits: 5,693
    Prefetch misses: 5,187
    Local miss rate: 49.78%
    Child hits: 5,119,561
    Total miss rate: 0.16%
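
The default static assignment described before the example (round-robin with load balancing) can be approximated by always placing a new thread on the simulated core that currently has the fewest live threads. The class below is a minimal sketch under that assumption; the class and method names are hypothetical, and the simulator's actual bookkeeping may differ.

```python
class StaticCoreAssigner:
    """Hypothetical sketch of static thread-to-core assignment."""

    def __init__(self, num_cores):
        self.thread_counts = [0] * num_cores  # live threads per simulated core
        self.thread_to_core = {}              # thread id -> assigned core

    def core_for(self, tid):
        # A thread keeps the same core for its whole lifetime (static assignment).
        if tid not in self.thread_to_core:
            # Pick the least-loaded core; ties go to the lowest index, which
            # gives round-robin order while all cores are equally loaded.
            core = min(range(len(self.thread_counts)),
                       key=lambda c: self.thread_counts[c])
            self.thread_counts[core] += 1
            self.thread_to_core[tid] = core
        return self.thread_to_core[tid]

    def thread_exit(self, tid):
        # Free the slot so that a later new thread can fill the gap.
        core = self.thread_to_core.pop(tid)
        self.thread_counts[core] -= 1


assigner = StaticCoreAssigner(3)
initial = [assigner.core_for(tid) for tid in (101, 102, 103, 104)]
assigner.thread_exit(102)            # leaves a gap on core 1
refill = assigner.core_for(105)      # the new thread fills that gap
print(initial, refill)
```

In this sketch the first four threads land on cores 0, 1, 2, and 0, and after thread 102 exits the next new thread is placed on core 1.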

The memory access traces contain some optimizations that combine references for one basic block together. As a result, some thread interleavings that could occur natively may not be considered. There are no other disruptions to thread ordering, however, and the application runs with all of its threads concurrently just like it would natively (although slower).

Once every process has exited, the simulator prints cache miss statistics for each cache to stderr. The simulator is designed to be extensible, allowing for different cache studies to be carried out: see Extending the Simulator.

For L2 caching devices, the L1 caching devices are considered their children. Two separate miss rates are computed: the "Local miss rate" considers only the requests that reach L2, while the "Total miss rate" also includes the child hits. This generalizes to deeper hierarchies: lower-level caches are children, and reported child hits are cumulative across all lower levels.
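
The two miss rates can be checked against the sample output shown earlier. The sketch below assumes the local rate is misses divided by hits plus misses, and that the total rate adds the child hits to the denominator; these formulas are an inference that reproduces the printed percentages, not code taken from the simulator, and the function names are hypothetical.

```python
def local_miss_rate(hits, misses):
    """Miss rate over only the requests that reach this cache."""
    return misses / (hits + misses)


def total_miss_rate(hits, misses, child_hits):
    """Miss rate over all requests, including those that hit in child caches."""
    return misses / (hits + misses + child_hits)


# LL statistics copied from the sample output.
ll_hits, ll_misses, ll_child_hits = 8_091, 8_019, 5_119_561
local = f"{local_miss_rate(ll_hits, ll_misses):.2%}"
total = f"{total_miss_rate(ll_hits, ll_misses, ll_child_hits):.2%}"
print("Local miss rate:", local)   # 49.78%
print("Total miss rate:", total)   # 0.16%

# Observation about this particular output: the reported child hits equal the
# sum of all L1 hits plus the L1 data cache prefetch hits.
l1_hits = [1_853_727, 605_114, 942_992, 385_134, 944_622, 385_808]
l1d_prefetch_hits = [1_880, 144, 140]
child_hits_sum = sum(l1_hits) + sum(l1d_prefetch_hits)
print(child_hits_sum == ll_child_hits)
```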

For memory requests that cross blocks, each block touched is considered separately, resulting in separate hit and miss statistics. This can be changed by implementing a custom statistics gatherer (see Extending the Simulator).
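
As an illustration of how a request is split, the sketch below lists the cache blocks covered by an access of a given size. The function name is hypothetical, and the 64-byte block size matches the sample output rather than any fixed property of the simulator.

```python
def blocks_touched(addr, size, line_size=64):
    """Return the starting address of each cache block covered by the request."""
    if size <= 0:
        raise ValueError("size must be positive")
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return [block * line_size for block in range(first, last + 1)]


# An 8-byte access starting 4 bytes before a block boundary touches two
# blocks, so it is counted as two separate cache accesses.
crossing = blocks_touched(0x103C, 8)
# The same access aligned to the start of a block touches only one.
aligned = blocks_touched(0x1000, 8)
print([hex(b) for b in crossing], [hex(b) for b in aligned])
```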

Software and hardware prefetches are combined in the prefetch hit and miss statistics, which are reported separately from regular loads and stores. To isolate software prefetch statistics, disable the hardware prefetcher by running with "-data_prefetcher none" (see Simulator Parameters). While misses from software prefetches are included in cache miss files, misses from hardware prefetches are not.