DynamoRIO
Emulating Scatter and Gather Instructions

Background

x86

The x86 gather and scatter instructions were introduced in the AVX2 and AVX512 instruction set extensions. They allow loading or storing a subset of elements in a vector from/to multiple non-contiguous addresses.

AVX2 has only gather instructions, no scatter instructions, whereas AVX512 has both. AVX2 is limited to 256-bit length vectors, whereas AVX512 has 512-bit support. Both support masking of individual memory accesses, using either a special mask register in AVX512 or another vector in AVX2.

Examples of these instructions are (in DR’s IR):

vpgatherdd %rax(,%ymm11,4)[4byte] %ymm13 -> %ymm12 %ymm13

Above is an AVX2 gather instruction that reads 32-bit doublewords into the 256-bit ymm12 vector from addresses generated by adding the base address in rax to the corresponding index elements in ymm11, conditionally based on the masks in ymm13. Elements may be gathered in any order. When an element is read, its mask is cleared. If a load faults, all elements to its right (closer to the LSB) will already be complete.

vpscatterdd {%k1} %xmm10 -> %rax(,%xmm11,4)[4byte] %k1

Above is an AVX512 scatter instruction that writes 32-bit doublewords from the 128-bit xmm10 vector to addresses generated by adding the base address in rax to the corresponding index element in xmm11, conditionally based on the mask register k1. Elements may be scattered in any order. When an element is stored, its mask is cleared. If a store faults, all elements to its right (closer to the LSB) will already be complete.

AArch64

The AArch64 SVE (Scalable Vector Extension) introduced the first AArch64 scatter and gather instructions.

SVE has two scatter and gather addressing modes (scalar+vector and vector+immediate) and two predicated contiguous load and store addressing modes (scalar+scalar and scalar+immediate). The SVE2 instruction set extension adds a third scatter/gather addressing mode: vector+scalar.

The predicated contiguous instructions do not use vector-based addressing, but they share other similarities with the scatter and gather instructions, which means DynamoRIO handles them in a similar way.

All scatter/gather/predicated contiguous instructions access elements conditionally based on the mask value of a governing predicate register. Loads are always zeroing: inactive elements are set to 0 in the destination register.

Scalar+vector

ld1b (%x0,%z0.s,uxtw)[1byte] %p0/z -> %z1.s

Above is a scalar+vector gather load that reads 8-bit values which are zero-extended and written to 32-bit elements of the z1 vector register. Addresses are calculated by adding the base address from the scalar x0 register to the corresponding 32-bit element in the vector index register z0. Elements that are not active in the mask contained by the governing predicate register p0 are not loaded and the corresponding element in the destination register z1 is set to 0.

st1h %z17.d %p5 -> (%x19,%z20.d)[2byte]

Above is a scalar+vector scatter store that writes the lowest 16-bits of the 64-bit elements of the z17 vector register. Addresses are calculated by adding the base address from the scalar x19 register to the corresponding 64-bit element in the vector index register z20. Elements that are not active in the mask contained by the governing predicate register p5 are not stored.

Vector+immediate

ld1sw +0x18(%z8.d)[4byte] %p2/z -> %z6.d

Above is a vector+immediate gather load that reads 32-bit values which are sign-extended and written to 64-bit elements of the z6 vector register. Addresses are calculated by adding the immediate value 0x18 to the corresponding 64-bit element in the vector base register z8. Elements that are not active in the mask contained by the governing predicate register p2 are zeroed.

Vector+scalar

stnt1d %z27.d %p7 -> (%z25.d,%x29)[8byte]

Introduced with SVE2. The above instruction is a scatter store that writes the 64-bit elements of the z27 vector register. Addresses are calculated by adding the value of scalar register x29 to the corresponding 64-bit element in the vector base register z25. Elements that are not active in the mask contained by the governing predicate register p7 are not stored.

Scalar+scalar

ld1sb (%x17,%x18)[1byte] %p5/z -> %z16.d

The above instruction is a scalar+scalar predicated contiguous load. 8-bit values are loaded, sign-extended to 64 bits, and written to the vector register z16. The address of the first element is calculated by adding the value of the scalar index register x18 to the value of the scalar base register x17. Addresses for subsequent elements are calculated by adding the size of the loaded value (1 byte in this example) to the address of the previous element. Elements that are not active in the mask contained by the governing predicate register p5 are zeroed.

Scalar+immediate

st1w %z10.s %p3 -> -0x60(%x11)[4byte]

The above instruction is a scalar+immediate predicated contiguous store that writes the 32-bit elements of the z10 vector register. The address for the first element is calculated by adding the immediate value -0x60 to the value of the scalar base register x11. Addresses for subsequent elements are calculated by adding the size of the stored value (4 bytes in this example) to the address of the previous element. Elements that are not active in the mask contained by the governing predicate register p3 are not stored.

Note that DynamoRIO IR and disassembly for scalar+immediate instructions give the offset in bytes, but the instruction itself uses a 4-bit signed immediate which is multiplied by the current SVE vector length in bytes. Arm assembly syntax uses a vector-length-agnostic representation for the offset: #<imm4>, MUL VL

So the above instruction might be encoded as

st1w z10.s, p3, [x11, #-3, MUL VL]

if the current vector length is 32 bytes, or

st1w z10.s, p3, [x11, #-6, MUL VL]

if it is 16 bytes, and cannot be encoded for vector lengths > 32 bytes.

Non-faulting loads

Non-faulting loads (ldnf*) do not raise a fault when an element read faults. Instead, a special predicate register, FFR, is updated: the FFR element corresponding to the element that faulted, and all elements higher than it, are set to 0, while elements lower than the faulting element are unchanged. Non-faulting loads support scalar+immediate addressing.

First-faulting loads

First-faulting loads (ldff*) fault like a normal gather/load instruction if the first active element read faults, and behave like a non-faulting load (updating FFR instead of faulting) if the first active element succeeds but a later active element read faults. First-faulting loads support scalar+scalar, scalar+vector, and vector+immediate addressing.

Problem Statement

Scatter and gather instructions pose a challenge to DynamoRIO clients that observe memory addresses, such as address tracing tools (e.g. drcachesim, which collects memory address and control-flow traces) and taint tracking tools. They are complex to handle because:

  • A single gather or scatter instruction loads from or stores to multiple addresses
  • Accessed addresses may be non-contiguous
  • Each access is conditional based on some mask

DynamoRIO clients only see the scatter or gather instruction and need to do more work to extract all accessed addresses. This is unlike regular scalar loads or stores, where the accessed address is readily available. The goal of this work is to make it easier for DR clients to observe these addresses. We achieve this by expanding the scatter and gather instructions into a functionally equivalent sequence of scalar stores and loads. This way, DR clients will see regular store and load instructions which they can instrument as usual. This is similar to what DR does for repeat string operations (like rep movs and repnz cmps): it converts them into a loop so that each memory access is made by a separate dynamic instruction. This method has worked well for such instructions that implicitly issue multiple memory accesses.

Original issue: DynamoRIO/dynamorio#2985.

Design

This required the addition of new support in various DynamoRIO components, like drreg, drx, drmgr and core DR. Multiple contributors worked on designing and implementing the required changes.

Scatter/gather Instruction Expansion

Owner: Hendrik Greving

As described above, we can simplify work for DR clients by replacing each scatter and gather instruction with a functionally equivalent sequence of scalar stores and loads. The expanded sequence is the unrolled version of the following loop:

num_accesses = vector_size / element_size
for i = 0, 1, ..., (num_accesses-1), do
    extract mask for the ith access from mask reg or mask vector
    if mask is set, then
        if index is vector, then
            extract ith element of index vector
            compute address = base + ith index element
        else // base is vector
            extract ith element of base vector
            compute address = ith base element + index
        done
        if instr_is_gather, then
            load data from address into a scalar reg
            insert scalar data into destination vector
        else // instr_is_scatter
            extract scalar data from source vector to scalar reg
            store data from scalar reg to address
        done
        if x86, then
            clear ith mask in mask reg or mask vector
        done
    done
done

Due to the x86 ISA, the extraction/insertion of the scalar value from/to the vector may involve multiple steps, e.g. to extract a 32-bit scalar value from a 512-bit zmm reg, we first need to extract a 128-bit xmm from it.

drmgr in DR provides multiple phases of instrumentation. Our expansion is done in the first phase, known as app2app. As the name suggests, this phase is intended to transform app instructions into equivalent instructions. For simplicity, we also separate the scatter or gather instruction from its basic block and create a separate fragment containing only the expanded sequence. The logic for expanding scatter and gather instructions is implemented in the drx extension library as drx_expand_scatter_gather, and can be used by any client that needs it, including drcachesim. This support was added by commit.

As an example, the following are the expansions of some instructions.

Expansion for x86 gather

vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
+0 m4 @0x00007fdb2ac5e6a0 65 48 a3 e0 00 00 00 mov %rax -> %gs:0x000000e0[8byte]
00 00 00 00
+11 m4 @0x00007fdb2ac5eac0 9f lahf -> %ah
+12 m4 @0x00007fdb2ac5ea40 0f 90 c0 seto -> %al // Spill aflags using drreg.
+15 m4 @0x00007fdb2ac5efa8 65 48 89 0c 25 e8 00 mov %rcx -> %gs:0x000000e8[8byte] // Spill the scratch GPR using drreg.
00 00
+24 m4 @0x00007fdb2ac0ec70 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+33 m4 @0x00007fdb2ac0ebf0 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+40 m4 @0x00007fdb2ac5f0a8 48 8b 09 mov (%rcx)[8byte] -> %rcx
+43 m4 @0x00007fdb2ac5e7f0 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+47 m4 @0x00007fdb2ac0f488 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+51 m4 @0x00007fdb2ac0ed38 62 f1 7c 48 29 01 vmovaps {%k0} %zmm0 -> (%rcx)[64byte] // Manually spill the scratch zmm reg.
+57 m4 @0x00007fdb2ac0ee00 <label>
+57 L4 @0x00007fdb2ac0efb0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Expansion for the first vector element starts here.
+63 L4 @0x00007fdb2ac0bee8 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract mask for the first element.
+69 L4 @0x00007fdb2ac0c750 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+72 L4 @0x00007fdb2ac0be68 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+78 L4 @0x00007fdb2ac0bca0 0f 84 fa ff ff ff jz @0x00007fdb2ac0ee98[8byte] // Check whether to load the first element based on mask.
+84 L4 @0x00007fdb2ac0c6d0 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+90 L4 @0x00007fdb2ac0be00 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract index for the first load address.
+96 L4 @0x00007fdb2ac0b8b8 48 63 c9 movsxd %ecx -> %rcx
+99 L4 @0x00007fdb2ac0cc20 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx // Load the first element into a scalar reg.
+106 L4 @0x00007fdb2ac5e5b8 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+112 L4 @0x00007fdb2ac5e870 c4 e3 79 22 c1 00 vpinsrd %xmm0 %ecx $0x00 -> %xmm0
+118 L4 @0x00007fdb2ac5e9c0 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12 // Insert the first element into the destination vector reg
+124 L4 @0x00007fdb2ac5eb40 33 c9 xor %ecx %ecx -> %ecx
+126 L4 @0x00007fdb2ac5ebc0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+132 L4 @0x00007fdb2ac5eda8 c4 e3 79 22 c1 00 vpinsrd %xmm0 %ecx $0x00 -> %xmm0
+138 L4 @0x00007fdb2ac5ec40 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13 // Clear the mask bit for the first element.
+144 m4 @0x00007fdb2ac0ee98 <label>
+144 L4 @0x00007fdb2ac5ed40 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the second vector element.
+150 L4 @0x00007fdb2ac5ee28 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+156 L4 @0x00007fdb2ac5eea8 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+159 L4 @0x00007fdb2ac5e638 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+165 L4 @0x00007fdb2ac5e788 0f 84 fa ff ff ff jz @0x00007fdb2ac5ecc0[8byte]
+171 L4 @0x00007fdb2ac5e720 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+177 L4 @0x00007fdb2ac5e958 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+183 L4 @0x00007fdb2ac5e8f0 48 63 c9 movsxd %ecx -> %rcx
+186 L4 @0x00007fdb2ac5f028 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+193 L4 @0x00007fdb2ac5ef28 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+199 L4 @0x00007fdb2ac0baa0 c4 e3 79 22 c1 01 vpinsrd %xmm0 %ecx $0x01 -> %xmm0
+205 L4 @0x00007fdb2ac5f128 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+211 L4 @0x00007fdb2ac0ea60 33 c9 xor %ecx %ecx -> %ecx
+213 L4 @0x00007fdb2ac0e9c8 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+219 L4 @0x00007fdb2ac0e930 c4 e3 79 22 c1 01 vpinsrd %xmm0 %ecx $0x01 -> %xmm0
+225 L4 @0x00007fdb2ac0cbb8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+231 m4 @0x00007fdb2ac5ecc0 <label>
+231 L4 @0x00007fdb2ac5e538 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the third vector element.
+237 L4 @0x00007fdb2ac5e4b8 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+243 L4 @0x00007fdb2ac5df68 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+246 L4 @0x00007fdb2ac5e438 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+252 L4 @0x00007fdb2ac5e3b8 0f 84 fa ff ff ff jz @0x00007fdb2ac0f018[8byte]
+258 L4 @0x00007fdb2ac5e050 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+264 L4 @0x00007fdb2ac5e338 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+270 L4 @0x00007fdb2ac5e2b8 48 63 c9 movsxd %ecx -> %rcx
+273 L4 @0x00007fdb2ac5e238 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+280 L4 @0x00007fdb2ac5e1b8 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+286 L4 @0x00007fdb2ac5e138 c4 e3 79 22 c1 02 vpinsrd %xmm0 %ecx $0x02 -> %xmm0
+292 L4 @0x00007fdb2ac5e0b8 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+298 L4 @0x00007fdb2ac5dfd0 33 c9 xor %ecx %ecx -> %ecx
+300 L4 @0x00007fdb2ac5dee8 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+306 L4 @0x00007fdb2ac0f080 c4 e3 79 22 c1 02 vpinsrd %xmm0 %ecx $0x02 -> %xmm0
+312 L4 @0x00007fdb2ac0f0e8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+318 m4 @0x00007fdb2ac0f018 <label>
+318 L4 @0x00007fdb2ac0f220 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the fourth vector element.
+324 L4 @0x00007fdb2ac0f2a0 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+330 L4 @0x00007fdb2ac0f150 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+333 L4 @0x00007fdb2ac0f388 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+339 L4 @0x00007fdb2ac0f408 0f 84 fa ff ff ff jz @0x00007fdb2ac0f1b8[8byte]
+345 L4 @0x00007fdb2ac0ba20 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+351 L4 @0x00007fdb2ac0f320 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+357 L4 @0x00007fdb2ac0f508 48 63 c9 movsxd %ecx -> %rcx
+360 L4 @0x00007fdb2ac0f588 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+367 L4 @0x00007fdb2ac0ef30 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+373 L4 @0x00007fdb2ac0c050 c4 e3 79 22 c1 03 vpinsrd %xmm0 %ecx $0x03 -> %xmm0
+379 L4 @0x00007fdb2ac0c1c8 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+385 L4 @0x00007fdb2ac0c3c8 33 c9 xor %ecx %ecx -> %ecx
+387 L4 @0x00007fdb2ac0bfd0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+393 L4 @0x00007fdb2ac5de68 c4 e3 79 22 c1 03 vpinsrd %xmm0 %ecx $0x03 -> %xmm0
+399 L4 @0x00007fdb2ac5dde8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+405 m4 @0x00007fdb2ac0f1b8 <label>
+405 L4 @0x00007fdb2ac5d898 c4 41 11 ef ed vpxor %xmm13 %xmm13 -> %xmm13 // Zero the mask reg.
+410 m4 @0x00007fdb2ac5dd68 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+419 m4 @0x00007fdb2ac5dce8 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+426 m4 @0x00007fdb2ac5d980 48 8b 09 mov (%rcx)[8byte] -> %rcx
+429 m4 @0x00007fdb2ac5dc68 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+433 m4 @0x00007fdb2ac5dbe8 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+437 m4 @0x00007fdb2ac5db68 62 f1 7c 48 28 01 vmovaps {%k0} (%rcx)[64byte] -> %zmm0 // Manually restore the scratch zmm reg.
+443 m4 @0x00007fdb2ac5dae8 65 48 8b 0c 25 e8 00 mov %gs:0x000000e8[8byte] -> %rcx // Restore the scratch GPR using drreg.
00 00
+452 m4 @0x00007fdb2ac5da68 3c 81 cmp %al $0x81
+454 m4 @0x00007fdb2ac5d9e8 9e sahf %ah
+455 m4 @0x00007fdb2ac5d900 65 48 a1 e0 00 00 00 mov %gs:0x000000e0[8byte] -> %rax // Restore aflags using drreg.
00 00 00 00
+466 m4 @0x00007fdb2ac5d818 <label>

Expansion for x86 scatter

vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
+0 m4 @0x00007fdb2ac106e0 65 48 a3 e0 00 00 00 mov %rax -> %gs:0x000000e0[8byte]
00 00 00 00
+11 m4 @0x00007fdb2ac10760 9f lahf -> %ah
+12 m4 @0x00007fdb2ac107e0 0f 90 c0 seto -> %al // Spill aflags using drreg.
+15 m4 @0x00007fdb2ac100a8 65 48 89 0c 25 e8 00 mov %rcx -> %gs:0x000000e8[8byte] // Spill the first scratch GPR using drreg.
00 00
+24 m4 @0x00007fdb2ac10110 65 48 89 14 25 f0 00 mov %rdx -> %gs:0x000000f0[8byte] // Spill the second scratch GPR using drreg.
00 00
+33 m4 @0x00007fdb2ac0fed8 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+42 m4 @0x00007fdb2ac0ff40 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+49 m4 @0x00007fdb2ac0f608 48 8b 09 mov (%rcx)[8byte] -> %rcx
+52 m4 @0x00007fdb2ac10860 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+56 m4 @0x00007fdb2ac108e0 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+60 m4 @0x00007fdb2ac0ca50 62 f1 7c 48 29 01 vmovaps {%k0} %zmm0 -> (%rcx)[64byte] // Manually spill the scratch zmm reg.
+66 m4 @0x00007fdb2ac10660 <label>
+66 L4 @0x00007fdb2ac10560 c5 f8 93 c9 kmovw %k1 -> %ecx // Expansion for the first vector element starts here.
+70 L4 @0x00007fdb2ac104f8 f7 c1 01 00 00 00 test %ecx $0x00000001
+76 L4 @0x00007fdb2ac10478 0f 84 fa ff ff ff jz @0x00007fdb2ac105e0[8byte] // Check whether to store the first element based on mask.
+82 L4 @0x00007fdb2ac103f8 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+89 L4 @0x00007fdb2ac10378 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract index for the first store address.
+95 L4 @0x00007fdb2ac102f8 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+102 L4 @0x00007fdb2ac10278 c4 e3 79 16 c2 00 vpextrd %xmm0 $0x00 -> %edx // Extract the element for the first store.
+108 L4 @0x00007fdb2ac101f8 48 63 c9 movsxd %ecx -> %rcx
+111 L4 @0x00007fdb2ac10178 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte] // Store the first element.
+118 L4 @0x00007fdb2ac10028 b9 01 00 00 00 mov $0x00000001 -> %ecx
+123 m4 @0x00007fdb2ac0ffa8 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte] // Spill the third scratch GPR using drreg.
00 00
+132 m4 @0x00007fdb2ac0fe58 c5 f8 93 d8 kmovw %k0 -> %ebx // Manually spill the scratch mask reg k0 to the scratch GPR.
+136 L4 @0x00007fdb2ac0fdd8 c5 f8 92 c1 kmovw %ecx -> %k0
+140 L4 @0x00007fdb2ac0fbf0 c5 fc 42 c9 kandnw %k0 %k1 -> %k1 // Clear bit for the first element in the mask reg.
+144 m4 @0x00007fdb2ac0fd58 c5 f8 92 c3 kmovw %ebx -> %k0 // Manually restore the scratch mask reg from the scratch GPR.
+148 m4 @0x00007fdb2ac0fcd8 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx // Restore the third scratch GPR using drreg.
00 00
+157 m4 @0x00007fdb2ac105e0 <label>
+157 L4 @0x00007fdb2ac0fb70 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the second vector element.
+161 L4 @0x00007fdb2ac0faf0 f7 c1 02 00 00 00 test %ecx $0x00000002
+167 L4 @0x00007fdb2ac0fa70 0f 84 fa ff ff ff jz @0x00007fdb2ac0fc58[8byte]
+173 L4 @0x00007fdb2ac0f9f0 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+180 L4 @0x00007fdb2ac0f970 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+186 L4 @0x00007fdb2ac0f8f0 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+193 L4 @0x00007fdb2ac0f870 c4 e3 79 16 c2 01 vpextrd %xmm0 $0x01 -> %edx
+199 L4 @0x00007fdb2ac0f7f0 48 63 c9 movsxd %ecx -> %rcx
+202 L4 @0x00007fdb2ac0f770 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+209 L4 @0x00007fdb2ac0f6f0 b9 02 00 00 00 mov $0x00000002 -> %ecx
+214 m4 @0x00007fdb2ac0f670 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+223 m4 @0x00007fdb2ac0c448 c5 f8 93 d8 kmovw %k0 -> %ebx
+227 L4 @0x00007fdb2ac0c2c8 c5 f8 92 c1 kmovw %ecx -> %k0
+231 L4 @0x00007fdb2ac0c9e8 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+235 m4 @0x00007fdb2ac0c980 c5 f8 92 c3 kmovw %ebx -> %k0
+239 m4 @0x00007fdb2ac0c818 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+248 m4 @0x00007fdb2ac0fc58 <label>
+248 L4 @0x00007fdb2ac0c518 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the third vector element.
+252 L4 @0x00007fdb2ac0c668 f7 c1 04 00 00 00 test %ecx $0x00000004
+258 L4 @0x00007fdb2ac0cab8 0f 84 fa ff ff ff jz @0x00007fdb2ac0ba20[8byte]
+264 L4 @0x00007fdb2ac0cb20 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+271 L4 @0x00007fdb2ac0c248 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+277 L4 @0x00007fdb2ac0bd98 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+284 L4 @0x00007fdb2ac0c0d0 c4 e3 79 16 c2 02 vpextrd %xmm0 $0x02 -> %edx
+290 L4 @0x00007fdb2ac0c900 48 63 c9 movsxd %ecx -> %rcx
+293 L4 @0x00007fdb2ac0bfd0 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+300 L4 @0x00007fdb2ac0c3c8 b9 04 00 00 00 mov $0x00000004 -> %ecx
+305 m4 @0x00007fdb2ac0c1c8 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+314 m4 @0x00007fdb2ac0c050 c5 f8 93 d8 kmovw %k0 -> %ebx
+318 L4 @0x00007fdb2ac0ef30 c5 f8 92 c1 kmovw %ecx -> %k0
+322 L4 @0x00007fdb2ac0f588 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+326 m4 @0x00007fdb2ac0f508 c5 f8 92 c3 kmovw %ebx -> %k0
+330 m4 @0x00007fdb2ac0f320 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+339 m4 @0x00007fdb2ac0ba20 <label>
+339 L4 @0x00007fdb2ac0f408 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the fourth vector element.
+343 L4 @0x00007fdb2ac0f388 f7 c1 08 00 00 00 test %ecx $0x00000008
+349 L4 @0x00007fdb2ac0f150 0f 84 fa ff ff ff jz @0x00007fdb2ac0f488[8byte]
+355 L4 @0x00007fdb2ac0f2a0 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+362 L4 @0x00007fdb2ac0f220 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+368 L4 @0x00007fdb2ac0f1b8 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+375 L4 @0x00007fdb2ac0f0e8 c4 e3 79 16 c2 03 vpextrd %xmm0 $0x03 -> %edx
+381 L4 @0x00007fdb2ac0f080 48 63 c9 movsxd %ecx -> %rcx
+384 L4 @0x00007fdb2ac0f018 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+391 L4 @0x00007fdb2ac0cbb8 b9 08 00 00 00 mov $0x00000008 -> %ecx
+396 m4 @0x00007fdb2ac0e930 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+405 m4 @0x00007fdb2ac0e9c8 c5 f8 93 d8 kmovw %k0 -> %ebx
+409 L4 @0x00007fdb2ac0ea60 c5 f8 92 c1 kmovw %ecx -> %k0
+413 L4 @0x00007fdb2ac0eb28 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+417 m4 @0x00007fdb2ac0ebf0 c5 f8 92 c3 kmovw %ebx -> %k0
+421 m4 @0x00007fdb2ac0ec70 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+430 m4 @0x00007fdb2ac0f488 <label>
+430 L4 @0x00007fdb2ac0ed38 c4 e1 f4 47 c9 kxorq %k1 %k1 -> %k1 // Clear the mask reg.
+435 m4 @0x00007fdb2ac0ee00 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+444 m4 @0x00007fdb2ac0ee98 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+451 m4 @0x00007fdb2ac0efb0 48 8b 09 mov (%rcx)[8byte] -> %rcx
+454 m4 @0x00007fdb2ac0bee8 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+458 m4 @0x00007fdb2ac0c750 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+462 m4 @0x00007fdb2ac0be68 62 f1 7c 48 28 01 vmovaps {%k0} (%rcx)[64byte] -> %zmm0 // Manually restore the scratch zmm reg.
+468 m4 @0x00007fdb2ac0bca0 65 48 8b 0c 25 e8 00 mov %gs:0x000000e8[8byte] -> %rcx // Restore the first scratch GPR using drreg.
00 00
+477 m4 @0x00007fdb2ac0c6d0 65 48 8b 14 25 f0 00 mov %gs:0x000000f0[8byte] -> %rdx // Restore the second scratch GPR using drreg.
00 00
+486 m4 @0x00007fdb2ac0be00 3c 81 cmp %al $0x81
+488 m4 @0x00007fdb2ac0b8b8 9e sahf %ah
+489 m4 @0x00007fdb2ac0cc20 65 48 a1 e0 00 00 00 mov %gs:0x000000e0[8byte] -> %rax // Restore aflags using drreg.
00 00 00 00
+500 m4 @0x00007fdb2ac0baa0 <label>

Expansion for AArch64 gather

ldff1sb (%x1,%z2.d)[1byte] %p3/z -> %z28.d
str %x0 -> +0x0148(%x28)[8byte] // Save flags using drreg
mrs %nzcv -> %x0
str %x0 -> +0x0150(%x28)[8byte]
ldr +0x0148(%x28)[8byte] -> %x0
str %x0 -> +0x0148(%x28)[8byte] // Save scratch GPR using drreg
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr +0x20(%x0)[8byte] -> %x0
str %z28 -> (%x0)[32byte] // Save the value of the destination register in case we
// need to restore its value on a fault.
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
str %p0 -> (%x0)[4byte] // Spill a predicate register to use as the loop variable mask
<label note=0x0000000000000001>
dup $0x00 lsl $0x00 -> %z28.d // Clear destination register
pfalse -> %p0.b // Initialize loop variable to 0
pnext %p3 %p0.d -> %p0.d // Set loop variable to the first active element
b.eq @0x0000fffda4f27518[8byte] // If no active elements, break the loop
lastb %p0 %z2.d -> %x0 // Extract the first active element index to scratch GPR
ldrsb (%x1,%x0)[1byte] -> %x0 // Load the first element to scratch GPR
cpy %p0/m %x0 -> %z28.d // Copy scratch GPR to current element of destination register
pnext %p3 %p0.d -> %p0.d // Repeat for the next active element
b.eq @0x0000fffda4f27518[8byte]
lastb %p0 %z2.d -> %x0
ldrsb (%x1,%x0)[1byte] -> %x0
cpy %p0/m %x0 -> %z28.d
pnext %p3 %p0.d -> %p0.d // Repeat for the next active element
b.eq @0x0000fffda4f27518[8byte]
lastb %p0 %z2.d -> %x0
ldrsb (%x1,%x0)[1byte] -> %x0
cpy %p0/m %x0 -> %z28.d
pnext %p3 %p0.d -> %p0.d // Repeat for the next active element
b.eq @0x0000fffda4f27518[8byte]
lastb %p0 %z2.d -> %x0
ldrsb (%x1,%x0)[1byte] -> %x0
cpy %p0/m %x0 -> %z28.d
<label note=0x0000000000000000>
<label note=0x0000000000000000>
<label note=0x0000000000000002>
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr (%x0)[4byte] -> %p0 // Restore spilled predicate register
ldr +0x0148(%x28)[8byte] -> %x0
str %x0 -> +0x0148(%x28)[8byte]
ldr +0x0150(%x28)[8byte] -> %x0 // Restore flags using drreg
msr %x0 -> %nzcv
ldr +0x0148(%x28)[8byte] -> %x0 // Restore spilled GPR
b $0x00000000004001d0

Expansion for AArch64 predicated contiguous store

st2w %z28.s %z29.s %p2 -> (%x1,%x2,lsl #2)[4byte]
str %x0 -> +0x0148(%x28)[8byte]
mrs %nzcv -> %x0 // Spill flags using drreg
str %x0 -> +0x0150(%x28)[8byte]
ldr +0x0148(%x28)[8byte] -> %x0
str %x0 -> +0x0148(%x28)[8byte] // Spill scratch GPRs using drreg
str %x3 -> +0x0158(%x28)[8byte]
str %x4 -> +0x0160(%x28)[8byte]
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr +0x20(%x0)[8byte] -> %x0
str %z0 -> (%x0)[32byte] // Manually spill scratch vector Z register
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
str %p0 -> (%x0)[4byte] // Manually spill scratch predicate P register
<label note=0x0000000000000001>
add %x1 %x2 uxtx $0x0000000000000002 -> %x4 // Calculate start address
index $0x00 $0x02 -> %z0.s // Initialize vector index register with value [0, 2, 4, ..]
pfalse -> %p0.b // Initialize loop variable to 0
pnext %p2 %p0.s -> %p0.s // Set loop variable to the first active element
b.eq @0x0000fffdb29fe3d0[8byte] // If no active elements, break the loop
lastb %p0 %z0.s -> %x0 // Extract vector index to GPR
lastb %p0 %z28.s -> %x3 // Extract vector element value from first source register to GPR
str %w3 -> (%x4,%x0,lsl #2)[4byte] // Store first register element value
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0 // Add 1 to index value
lastb %p0 %z29.s -> %x3 // Extract vector element value from second source register to GPR
str %w3 -> (%x4,%x0,lsl #2)[4byte] // Store second register element value
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
<label note=0x0000000000000000>
<label note=0x0000000000000002>
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr +0x20(%x0)[8byte] -> %x0
ldr (%x0)[32byte] -> %z0 // Manually restore scratch vector register
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr (%x0)[4byte] -> %p0 // Manually restore scratch predicate register
ldr +0x0148(%x28)[8byte] -> %x0 // Restore GPRs using drreg
ldr +0x0158(%x28)[8byte] -> %x3
ldr +0x0160(%x28)[8byte] -> %x4
str %x0 -> +0x0148(%x28)[8byte]
ldr +0x0150(%x28)[8byte] -> %x0 // Restore flags using drreg
msr %x0 -> %nzcv
ldr +0x0148(%x28)[8byte] -> %x0

As shown by the above expanded scatter and gather sequences, we require scratch registers for the expansion. The GPR scratch registers are obtained using drreg, whereas the scratch vector register and the scratch mask register are obtained by manually spilling them.

We need to make sure that we restore the application state correctly when a state restoration event occurs. Such an event can be a fault in one of the scalar loads or stores in the expanded sequence, a fault in instrumentation added by some other DR client, or an async event like DR detach. While the registers spilled via drreg are restored by drreg's state restoration logic, drx still needs to restore the scratch mask register that is spilled manually to a GPR, and the scratch vector register that is spilled manually to a drx spill slot. We also need to ensure that the mask bit for the previous access is cleared if the state restore event happened after the load or store completed but before we could reflect it in the mask. When a state restore event occurs, we walk the expanded sequence using a state machine until we reach the faulting pc, keeping track of the state that needs to be restored (commit, commit).
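The mask-clearing invariant can be illustrated with a small toy model (a hypothetical helper, not DR code): after a fault, the restored mask must have the bits for all completed elements cleared, including an element whose access completed before the fault but whose mask bit had not yet been updated by the expanded sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch of the mask-restore invariant: elements 0..completed-1
 * finished and had their mask bits cleared; if the access for element
 * `completed` also finished before the fault, its bit must be cleared
 * too, even though the expanded code had not yet updated the mask. */
static uint32_t
restored_mask(uint32_t orig_mask, int completed, int pending_access_done)
{
    uint32_t mask = orig_mask;
    for (int i = 0; i < completed; i++)
        mask &= ~(1u << i);
    if (pending_access_done)
        mask &= ~(1u << completed);
    return mask;
}
```

A completed element's bit is already zero if it was inactive to begin with, so unconditionally clearing the low bits is harmless.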

As pointed out above, this expansion is done in the app2app phase. DR clients may use drreg to get scratch registers for their instrumentation in later phases (like insertion or instru2instru). While drreg indeed supported some basic usage outside the insertion phase, it did not mitigate bad interactions arising from such multi-phase use. The following section describes the changes made in drreg to support multi-phase use.

Drreg Support For Multi-phase Reservations

Owner: Abhinav Sharma

Upstream issue: DynamoRIO/dynamorio#3823

Drreg is DynamoRIO’s register reservation framework. It allows users to reserve a register to use as scratch. Internally, drreg automatically performs the following functions so that the user does not need to:

  • keeps all required bookkeeping, like the spill slot to spilled register mapping
  • restores spilled registers to their application values before they are read by an application instruction, and re-spills them if they are written by an application instruction
  • performs application state restoration on state restore events, like an application fault or DR detach

While expanding a scatter or gather instruction in the app2app phase, we need a scratch register to hold the scalar values and masks. In later phases (like the insertion or the instru2instru phase), drcachesim and other DR clients may also use drreg to get scratch registers for their instrumentation.

Drreg initially supported only insertion phase use, with some basic support in other phases. Importantly, it did not attempt to avoid any bad interactions between the multiple phases. To support multi-phase use of drreg, we needed to solve the following:

  • avoid spill slot conflict across multiple phases: multi-phase use can potentially lead to spill slot conflicts if the same slot is selected in multiple phases. This may clobber the spilled application value and cause the application to crash or otherwise fail.
  • allow aflags spill to any slot: drreg hardcoded the aflags spill slot as the zero-th slot, to simplify some logic. To support the ability to spill aflags in multiple phases, drreg should be able to use any spill slot for aflags.
  • application state restore logic: on a state restore event, we should be able to figure out which slot contains each spilled register's app value. This is complicated by the fact that registers may be spilled by instrumentation added by multiple phases, and the spill regions may overlap which causes the spilled application value to be moved between spill slots.

We explored the following ideas to avoid spill slot conflicts in drreg:

Disjoint slot spaces or arenas

We can ask drreg to create slot spaces or arenas at init time, which are assigned disjoint spill slots. When reserving a register, the user passes in a "space/arena Id" to instruct drreg to pick free slots only from that arena. This requires keeping some global drreg state. This also requires the user to guess the best configuration for assigning slots to the arenas, and passing the correct arena Id before each reservation. It may artificially make some spill slots unavailable for use, thereby reducing efficiency.

Assign phase Id to slots

Instead of creating slot spaces at init time with a best-guess assignment of slots, we can assign a phase Id to each slot when it is first requested in that phase. We then avoid using a slot in any phase other than the one to which it was assigned. This also requires keeping some global drreg state, and it does not help avoid spill slot conflicts between multiple clients in the same phase.

Preferred: Scan fragment to determine eligible slots

When picking a spill slot, we can determine whether using it will cause a slot conflict by scanning for its uses in the current fragment after the current instruction. We pick only a spill slot that has no later uses in the current fragment. This does not require any init time guesses or any global drreg state, imposes no additional responsibilities on the user, and also works for multiple clients in the same phase. This was implemented to pick spill slots for GPRs (commit) and for aflags (commit).
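As a sketch of this strategy (a toy model, not the actual drreg implementation), each instruction in the fragment can be summarized by the spill slot it touches, and a slot is eligible only if it has no use after the current instruction:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4

/* Toy model of the preferred slot-selection strategy: slot_used_by_instr[i]
 * is the spill slot instruction i touches (-1 if none). Starting from the
 * instruction after the current one, a slot is conflict-free only if it has
 * no later use in the fragment. Returns the first such slot, or -1. */
static int
pick_slot(const int *slot_used_by_instr, int num_instrs, int cur)
{
    bool used_later[NUM_SLOTS] = { false };
    for (int i = cur + 1; i < num_instrs; i++) {
        int s = slot_used_by_instr[i];
        if (s >= 0 && s < NUM_SLOTS)
            used_later[s] = true;
    }
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (!used_later[s])
            return s; /* first slot with no later use */
    }
    return -1; /* no eligible slot in this fragment */
}
```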

State Restoration For Drreg

Owner: Abhinav Sharma

Upstream issue: DynamoRIO/dynamorio#3823, DynamoRIO/dynamorio#3801

On a state restore event, drreg should be able to restore all spilled registers to their application values.

Unfortunately, when a state restore event happens, we only have the encoded fragment, and none of the drreg state, like the register to spill slot mappings. We need to reconstruct this state based on the faulting pc and the encoded fragment.

It is complex to determine which registers need to be restored and from which spill slot. This is because drreg automatically adds spill and restore instructions to handle various complex cases, like re-spilling a reserved register after an application instruction writes it, and restoring a reserved register before an application instruction reads it. Drreg also uses various optimisations, like lazily restoring application values in case the register is reserved again. This is even more complex for aflags, whose spill and restore require at least two steps: spilling aflags involves reading them into a register using lahf and then writing that register to a spill slot, while restoring aflags involves reading them from the spill slot into a register and then writing them back using sahf; an additional step reads or writes the overflow flag if needed. In some cases, aflags are even kept in a register as an optimisation.

Additionally, in multi-phase use, a register may be spilled by multiple phases, with a separate spill slot for each phase. The application value for the register may reside in one or more spill slots, and may also move between spill slots based on how the spill regions from different phases overlap. See various tricky scenarios in drreg-test.c.

We explored two ways to adapt drreg’s state restoration logic to multi-phase use. This also fixed some known existing issues with drreg: DynamoRIO/dynamorio#4933, DynamoRIO/dynamorio#4939.

Track app values as they are moved between slots and registers

At a state restoration event, we walk the faulting fragment from the beginning to the faulting instruction, keeping track of where the native value of each register is present. At any point, it may be in the register itself, in a spill slot, or in both. We track gpr_is_native, to denote whether a register contains its native app value; and spill_slot_to_reg, to denote which register’s app value a spill slot contains.

  • When a register is written by an application instruction, we invalidate all spill_slot_to_reg entries that are mapped to that register, and also set gpr_is_native for that register.
  • When a register is written by a non-drreg meta instruction, we clear gpr_is_native for that reg.
  • When a register is loaded by drreg from the slot it was spilled to, we set gpr_is_native.
  • When a register is spilled to some spill slot, we set spill_slot_to_reg for that spill slot to that reg.

This strategy allows us to robustly keep track of the various corner cases that can arise in drreg, like spill regions from different phases overlapping (nesting or just overlapping), and the other known issues linked above. This was implemented by this commit.
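The four tracking rules above can be sketched as a toy state machine (hypothetical types and names, not the actual drreg code):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 4
#define NUM_SLOTS 4

typedef enum { APP_WRITE, META_WRITE, DRREG_RESTORE, DRREG_SPILL } op_kind_t;

typedef struct {
    op_kind_t kind;
    int reg;  /* register involved */
    int slot; /* slot involved for DRREG_SPILL; -1 otherwise */
} op_t;

typedef struct {
    bool gpr_is_native[NUM_REGS];
    int slot_to_reg[NUM_SLOTS]; /* -1 = slot holds no app value */
} track_t;

/* Toy walk over a fragment applying the four tracking rules. */
static void
walk(track_t *t, const op_t *ops, int n)
{
    for (int r = 0; r < NUM_REGS; r++)
        t->gpr_is_native[r] = true;
    for (int s = 0; s < NUM_SLOTS; s++)
        t->slot_to_reg[s] = -1;
    for (int i = 0; i < n; i++) {
        switch (ops[i].kind) {
        case APP_WRITE: /* app redefines the reg: spilled copies go stale */
            for (int s = 0; s < NUM_SLOTS; s++)
                if (t->slot_to_reg[s] == ops[i].reg)
                    t->slot_to_reg[s] = -1;
            t->gpr_is_native[ops[i].reg] = true;
            break;
        case META_WRITE: /* tool instrumentation clobbers the reg */
            t->gpr_is_native[ops[i].reg] = false;
            break;
        case DRREG_RESTORE: /* drreg reloads the app value from its slot */
            t->gpr_is_native[ops[i].reg] = true;
            break;
        case DRREG_SPILL: /* slot now holds this reg's app value */
            t->slot_to_reg[ops[i].slot] = ops[i].reg;
            break;
        }
    }
}
```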

The drawback of this approach is that it needs to be aware of methods of spilling and restoring registers outside drreg (dropped PR). DynamoRIO uses various such methods internally (spilling to the stack, or to slots not managed by drreg), and clients may use their own unique methods. A non-drreg meta instruction may therefore restore an application value to a register without this approach recognising it, causing it to lose track of that register’s application value. We dropped this approach on encountering DynamoRIO/dynamorio#4963.

Preferred: Pairing restores with spills (instead of the other way)

The key observation behind this approach is that it is easier to find the matching spill for a given restore than to find the matching restore for a given spill. This is because there may be other restores besides the final one, e.g. restores before app reads, user-prompted restores, etc., which make it hard to determine exactly where the spill region for a register or aflags ends. An additional complexity is that aflags re-spills may not use the same slot, which makes differentiating spills from multiple phases difficult.

Each restore must have a matching spill. Based on this observation, we scan the faulting fragment from end to beginning, matching register restores to their spills. When we reach the faulting instruction, any restore for which we did not see the matching spill yet must be performed by the drreg state restoration. This was implemented by (commit).

This algorithm does not need to be aware of non-drreg methods of spilling/restoring registers. Note that, like the general drreg operation, this method does not restore the application value of a spilled GPR/aflags if they are dead at the faulting instruction. However, even dead registers need to be restored when drreg_options_t.conservative is set. This can be handled if there is additional metadata available to the drreg state restore callback (DynamoRIO/dynamorio#3801).
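A toy model of this backward walk (hypothetical types and names, not the actual drreg code): scanning from the fragment's end down to the faulting instruction, a restore whose matching spill lies at or before the fault identifies a register that the state-restore handler must restore.

```c
#include <assert.h>

#define NUM_REGS 4

typedef enum { SPILL, RESTORE, OTHER } ev_kind_t;

typedef struct {
    ev_kind_t kind;
    int reg;
} ev_t;

/* Toy backward walk: scan from the end of the fragment down to (but not
 * including) the faulting instruction, pairing each restore with its spill.
 * A register whose restore was seen but whose spill was not (i.e. the spill
 * precedes the fault) must be restored by the state-restore handler.
 * Returns a bitmask of such registers. */
static unsigned
regs_to_restore(const ev_t *ops, int n, int fault_idx)
{
    int pending_restores[NUM_REGS] = { 0 };
    for (int i = n - 1; i > fault_idx; i--) {
        if (ops[i].kind == RESTORE)
            pending_restores[ops[i].reg]++;
        else if (ops[i].kind == SPILL && pending_restores[ops[i].reg] > 0)
            pending_restores[ops[i].reg]--; /* spill matched its restore */
    }
    unsigned mask = 0;
    for (int r = 0; r < NUM_REGS; r++)
        if (pending_restores[r] > 0)
            mask |= 1u << r;
    return mask;
}
```

When both the spill and the restore lie after the fault, they cancel out and no restoration is needed for that register.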

Simplifying Instrumentation For Emulated Instructions

Owner: Derek Bruening

Upstream Issue: DynamoRIO/dynamorio#4865

Emulated sequences like the expanded scatter and gather sequence described above pose another challenge for clients that need to observe both instructions and memory references. For observing instructions, these clients should see the original application instruction (that is, the scatter or gather instruction), whereas for observing memory references, they should see the emulated sequence (that is, all the individual scalar stores or loads). DynamoRIO should absorb this complexity and provide the required events to the client.

We implemented drmgr_orig_app_instr_for_fetch, drmgr_orig_app_instr_for_operands and drmgr_in_emulation_region APIs (commit, commit) that return the appropriate instruction to the client to be used for either instruction instrumentation or memory reference instrumentation. These were subsequently used in drcachesim as well (commit).

Support For Vector Reservation

Owner: Abhinav Sharma

The scatter and gather expansions require scratch vector registers, for which we need the capability to spill and restore vector registers. Following are the design choices:

  • Extend drreg to support reservation for vector registers. DynamoRIO/dynamorio#3844 aims to add this support.
  • Use custom spill and restore logic in drx. We can do this by reserving memory in TLS to use as a spill slot.

Some observations about this use-case for vector reservation:

  • We need to spill only one vector register, so we do not need sophisticated spill slot management logic.
  • The spilled vector register will not need to be restored for app reads, or re-spilled after app writes. Note that we will not encounter any application instructions that use the spilled vector register, because it is spilled only for the duration of the expanded scatter or gather sequence.

Extending drreg to support vector spilling is a complex task. Given the above observations, the current use case does not justify the effort. Therefore, we chose to implement custom spill logic in drx (commit, commit).
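A minimal sketch of such custom spill logic, assuming a per-thread structure standing in for DR's TLS (the real drx implementation reserves an actual TLS slot and differs in detail):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 64 /* enough for a 512-bit zmm or SVE vector */

/* Toy per-thread spill area; real drx reserves memory in DR's TLS. */
typedef struct {
    uint8_t vec_slot[VEC_BYTES];
} tls_t;

/* Save the vector register's contents into the per-thread slot. */
static void
spill_vector(tls_t *tls, const uint8_t *vec_reg)
{
    memcpy(tls->vec_slot, vec_reg, VEC_BYTES);
}

/* Restore the vector register's contents from the per-thread slot. */
static void
restore_vector(const tls_t *tls, uint8_t *vec_reg)
{
    memcpy(vec_reg, tls->vec_slot, VEC_BYTES);
}
```

Because only one vector register is spilled and only for the duration of the expanded sequence, a single fixed slot per thread suffices and no slot-management logic is required.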

Using The Expansion In DR Clients

Owner: Abhinav Sharma

Clients that need to observe each memory reference must use the drx_expand_scatter_gather API. This was added in the app2app phase of drcachesim and other DynamoRIO clients (commit). This also required fixing some issues (crashes and correctness problems) that surfaced when all pieces were integrated (commit, commit).

Testing On Large Apps

Owner: Abhinav Sharma

drcachesim was successfully used to trace an application with scatter and gather instructions. The resulting trace was observed to have millions of such instructions. We also verified correctness by comparing application output with and without tracing.

DR_API bool instr_is_gather(instr_t *instr)