DynamoRIO
Emulating Scatter and Gather Instructions

Background

x86

The x86 gather and scatter instructions were introduced in the AVX2 and AVX512 instruction set extensions. They allow loading or storing a subset of elements in a vector from/to multiple non-contiguous addresses.

AVX2 has only gather instructions, no scatter instructions, whereas AVX512 has both. AVX2 is limited to 256-bit length vectors, whereas AVX512 has 512-bit support. Both support masking of individual memory accesses, using either a special mask register in AVX512 or another vector in AVX2.

Examples of these instructions are (in DR’s IR):

vpgatherdd %rax(,%ymm11,4)[4byte] %ymm13 -> %ymm12 %ymm13

Above is an AVX2 gather instruction that reads 32-bit doublewords into the 256-bit ymm12 vector from addresses generated by adding the base address in rax to the corresponding index elements in ymm11, conditionally based on the masks in ymm13. Elements may be gathered in any order. When an element is read, its mask is cleared. If a load faults, all elements to its right (closer to the LSB) will already be complete.

vpscatterdd {%k1} %xmm10 -> %rax(,%xmm11,4)[4byte] %k1

Above is an AVX512 scatter instruction that writes 32-bit doublewords from the 128-bit xmm10 vector to addresses generated by adding the base address in rax to the corresponding index element in xmm11, conditionally based on the mask register k1. Elements may be scattered in any order. When an element is stored, its mask is cleared. If a store faults, all elements to its right (closer to the LSB) will already be complete.

AArch64

The AArch64 SVE (Scalable Vector Extension) introduced the first AArch64 scatter and gather instructions.

SVE has two scatter and gather addressing modes (scalar+vector and vector+immediate) and two predicated contiguous load and store addressing modes (scalar+scalar and scalar+immediate). The SVE2 instruction set extension adds a third scatter/gather addressing mode: vector+scalar.

The predicated contiguous instructions do not use vector-based addressing, but they share other similarities with the scatter and gather instructions, which means DynamoRIO handles them in a similar way.

All scatter/gather/predicated contiguous instructions access elements conditionally based on the mask value of a governing predicate register. Loads are always zeroing: inactive elements are set to 0 in the destination register.

Scalar+vector

ld1b (%x0,%z0.s,uxtw)[1byte] %p0/z -> %z1.s

Above is a scalar+vector gather load that reads 8-bit values which are zero-extended and written to 32-bit elements of the z1 vector register. Addresses are calculated by adding the base address from the scalar x0 register to the corresponding 32-bit element in the vector index register z0. Elements that are not active in the mask contained by the governing predicate register p0 are not loaded and the corresponding element in the destination register z1 is set to 0.

st1h %z17.d %p5 -> (%x19,%z20.d)[2byte]

Above is a scalar+vector scatter store that writes the lowest 16-bits of the 64-bit elements of the z17 vector register. Addresses are calculated by adding the base address from the scalar x19 register to the corresponding 64-bit element in the vector index register z20. Elements that are not active in the mask contained by the governing predicate register p5 are not stored.

Vector+immediate

ld1sw +0x18(%z8.d)[4byte] %p2/z -> %z6.d

Above is a vector+immediate gather load that reads 32-bit values which are sign-extended and written to 64-bit elements of the z6 vector register. Addresses are calculated by adding the immediate value 0x18 to the corresponding 64-bit element in the vector base register z8. Elements that are not active in the mask contained by the governing predicate register p2 are zeroed.

Vector+scalar

stnt1d %z27.d %p7 -> (%z25.d,%x29)[8byte]

Introduced with SVE2. The above instruction is a scatter store that writes the 64-bit elements of the z27 vector register. Addresses are calculated by adding the value of scalar register x29 to the corresponding 64-bit element in the vector base register z25. Elements that are not active in the mask contained by the governing predicate register p7 are not stored.

Scalar+scalar

ld1sb (%x17,%x18)[1byte] %p5/z -> %z16.d

The above instruction is a scalar+scalar predicated contiguous load. 8-bit values are loaded, sign-extended to 64 bits, and written to the vector register z16. The address of the first element is calculated by adding the value of the scalar index register x18 to the value of the scalar base register x17. Addresses for subsequent elements are calculated by adding the size of the loaded value (1 byte in this example) to the address of the previous element. Elements that are not active in the mask contained by the governing predicate register p5 are zeroed.

Scalar+immediate

st1w %z10.s %p3 -> -0x60(%x11)[4byte]

The above instruction is a scalar+immediate predicated contiguous store that writes the 32-bit elements of the z10 vector register. The address for the first element is calculated by adding the immediate value -0x60 to the value of the scalar base register x11. Addresses for subsequent elements are calculated by adding the size of the stored value (4 bytes in this example) to the address of the previous element. Elements that are not active in the mask contained by the governing predicate register p3 are not stored.

Note that DynamoRIO IR and disassembly for scalar+immediate instructions give the offset in bytes, but the instruction itself uses a 4-bit signed immediate which is multiplied by the current SVE vector length in bytes. Arm assembly syntax uses a vector-length-agnostic representation for the offset: #<imm4>, MUL VL

So the above instruction might be encoded as

st1w z10.s, p3, [x11, #-3, MUL VL]

if the current vector length is 32 bytes, or

st1w z10.s, p3, [x11, #-6, MUL VL]

if it is 16 bytes, and cannot be encoded for vector lengths > 32 bytes.

Non-faulting loads

Non-faulting loads (ldnf*) do not raise a fault when an element read faults. Instead, a special predicate register, FFR, is updated: the FFR element corresponding to the element that faulted, and all elements higher than it, are set to 0, while elements lower than the faulting element are unchanged. Non-faulting loads support scalar+immediate addressing.

First-faulting loads

First-faulting loads (ldff*) fault like a normal gather/load instruction if the first active element read faults, and behave like a non-faulting load (updating FFR instead of faulting) if the first active element succeeds but a later active element read faults. First-faulting loads support scalar+scalar, scalar+vector, and vector+immediate addressing.

Problem Statement

Scatter and gather instructions pose a challenge to DynamoRIO clients that observe memory addresses, such as address tracing tools (e.g. drcachesim, which collects memory address and control-flow traces) and taint tracking tools. They are complex to handle because:

  • A single gather or scatter instruction loads from or stores to multiple addresses
  • Accessed addresses may be non-contiguous
  • Each access is conditional based on some mask

DynamoRIO clients only see the scatter or gather instruction and need to do more work to extract all accessed addresses. This is unlike regular scalar loads or stores, where the accessed address is readily available. The goal of this work is to make it easier for DR clients to observe these addresses. We achieve this by expanding the scatter and gather instructions into a functionally equivalent sequence of scalar stores and loads. This way, DR clients will see regular store and load instructions which they can instrument as usual. This is similar to what DR does for repeat string operations (like rep movs and repnz cmps): it converts them into a loop so that each memory access is made by a separate dynamic instruction. This method has worked well for such instructions that implicitly issue multiple memory accesses.

Original issue: DynamoRIO/dynamorio#2985.

Design

This required the addition of new support in various DynamoRIO components, like drreg, drx, drmgr and core DR. Multiple contributors worked on designing and implementing the required changes.

Scatter/gather Instruction Expansion

Owner: Hendrik Greving

As described above, we can simplify work for DR clients by replacing each scatter and gather instruction with a functionally equivalent sequence of scalar stores and loads. The expanded sequence is the unrolled version of the following loop:

num_accesses = vector_size / element_size
for i = 0, 1, ..., (num_accesses-1), do
    extract mask for the ith access from mask reg or mask vector
    if mask is set, then
        if index is vector, then
            extract ith element of index vector
            compute address = base + ith index element
        else // base is vector
            extract ith element of base vector
            compute address = ith base element + index
        done
        if instr_is_gather, then
            load data from address into a scalar reg
            insert scalar data into destination vector
        else // instr_is_scatter
            extract scalar data from source vector to scalar reg
            store data from scalar reg to address
        done
        if x86, then
            clear ith mask in mask reg or mask vector
        done
    done
done

Due to the x86 ISA, the extraction/insertion of the scalar value from/to the vector may involve multiple steps, e.g. to extract a 32-bit scalar value from a 512-bit zmm reg, we first need to extract a 128-bit xmm from it.

drmgr in DR provides multiple phases of instrumentation. Our expansion is done in the first phase, known as app2app. As the name suggests, this phase is intended to transform app instructions into equivalent instructions. For simplicity, we also separate the scatter or gather instruction from its basic block and create a separate fragment containing only the expanded sequence. The logic for expanding scatter and gather instructions is implemented in the drx extension library as drx_expand_scatter_gather, and can be used by any client that needs it, including drcachesim. This support was added by commit.

As an example, the following are the expansions of some instructions.

Expansion for x86 gather

vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
+0 m4 @0x00007fdb2ac5e6a0 65 48 a3 e0 00 00 00 mov %rax -> %gs:0x000000e0[8byte]
00 00 00 00
+11 m4 @0x00007fdb2ac5eac0 9f lahf -> %ah
+12 m4 @0x00007fdb2ac5ea40 0f 90 c0 seto -> %al // Spill aflags using drreg.
+15 m4 @0x00007fdb2ac5efa8 65 48 89 0c 25 e8 00 mov %rcx -> %gs:0x000000e8[8byte] // Spill the scratch GPR using drreg.
00 00
+24 m4 @0x00007fdb2ac0ec70 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+33 m4 @0x00007fdb2ac0ebf0 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+40 m4 @0x00007fdb2ac5f0a8 48 8b 09 mov (%rcx)[8byte] -> %rcx
+43 m4 @0x00007fdb2ac5e7f0 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+47 m4 @0x00007fdb2ac0f488 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+51 m4 @0x00007fdb2ac0ed38 62 f1 7c 48 29 01 vmovaps {%k0} %zmm0 -> (%rcx)[64byte] // Manually spill the scratch zmm reg.
+57 m4 @0x00007fdb2ac0ee00 <label>
+57 L4 @0x00007fdb2ac0efb0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Expansion for the first vector element starts here.
+63 L4 @0x00007fdb2ac0bee8 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract mask for the first element.
+69 L4 @0x00007fdb2ac0c750 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+72 L4 @0x00007fdb2ac0be68 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+78 L4 @0x00007fdb2ac0bca0 0f 84 fa ff ff ff jz @0x00007fdb2ac0ee98[8byte] // Check whether to load the first element based on mask.
+84 L4 @0x00007fdb2ac0c6d0 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+90 L4 @0x00007fdb2ac0be00 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract index for the first load address.
+96 L4 @0x00007fdb2ac0b8b8 48 63 c9 movsxd %ecx -> %rcx
+99 L4 @0x00007fdb2ac0cc20 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx // Load the first element into a scalar reg.
+106 L4 @0x00007fdb2ac5e5b8 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+112 L4 @0x00007fdb2ac5e870 c4 e3 79 22 c1 00 vpinsrd %xmm0 %ecx $0x00 -> %xmm0
+118 L4 @0x00007fdb2ac5e9c0 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12 // Insert the first element into the destination vector reg
+124 L4 @0x00007fdb2ac5eb40 33 c9 xor %ecx %ecx -> %ecx
+126 L4 @0x00007fdb2ac5ebc0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+132 L4 @0x00007fdb2ac5eda8 c4 e3 79 22 c1 00 vpinsrd %xmm0 %ecx $0x00 -> %xmm0
+138 L4 @0x00007fdb2ac5ec40 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13 // Clear the mask bit for the first element.
+144 m4 @0x00007fdb2ac0ee98 <label>
+144 L4 @0x00007fdb2ac5ed40 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the second vector element.
+150 L4 @0x00007fdb2ac5ee28 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+156 L4 @0x00007fdb2ac5eea8 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+159 L4 @0x00007fdb2ac5e638 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+165 L4 @0x00007fdb2ac5e788 0f 84 fa ff ff ff jz @0x00007fdb2ac5ecc0[8byte]
+171 L4 @0x00007fdb2ac5e720 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+177 L4 @0x00007fdb2ac5e958 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+183 L4 @0x00007fdb2ac5e8f0 48 63 c9 movsxd %ecx -> %rcx
+186 L4 @0x00007fdb2ac5f028 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+193 L4 @0x00007fdb2ac5ef28 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+199 L4 @0x00007fdb2ac0baa0 c4 e3 79 22 c1 01 vpinsrd %xmm0 %ecx $0x01 -> %xmm0
+205 L4 @0x00007fdb2ac5f128 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+211 L4 @0x00007fdb2ac0ea60 33 c9 xor %ecx %ecx -> %ecx
+213 L4 @0x00007fdb2ac0e9c8 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+219 L4 @0x00007fdb2ac0e930 c4 e3 79 22 c1 01 vpinsrd %xmm0 %ecx $0x01 -> %xmm0
+225 L4 @0x00007fdb2ac0cbb8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+231 m4 @0x00007fdb2ac5ecc0 <label>
+231 L4 @0x00007fdb2ac5e538 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the third vector element.
+237 L4 @0x00007fdb2ac5e4b8 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+243 L4 @0x00007fdb2ac5df68 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+246 L4 @0x00007fdb2ac5e438 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+252 L4 @0x00007fdb2ac5e3b8 0f 84 fa ff ff ff jz @0x00007fdb2ac0f018[8byte]
+258 L4 @0x00007fdb2ac5e050 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+264 L4 @0x00007fdb2ac5e338 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+270 L4 @0x00007fdb2ac5e2b8 48 63 c9 movsxd %ecx -> %rcx
+273 L4 @0x00007fdb2ac5e238 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+280 L4 @0x00007fdb2ac5e1b8 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+286 L4 @0x00007fdb2ac5e138 c4 e3 79 22 c1 02 vpinsrd %xmm0 %ecx $0x02 -> %xmm0
+292 L4 @0x00007fdb2ac5e0b8 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+298 L4 @0x00007fdb2ac5dfd0 33 c9 xor %ecx %ecx -> %ecx
+300 L4 @0x00007fdb2ac5dee8 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+306 L4 @0x00007fdb2ac0f080 c4 e3 79 22 c1 02 vpinsrd %xmm0 %ecx $0x02 -> %xmm0
+312 L4 @0x00007fdb2ac0f0e8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+318 m4 @0x00007fdb2ac0f018 <label>
+318 L4 @0x00007fdb2ac0f220 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the fourth vector element.
+324 L4 @0x00007fdb2ac0f2a0 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+330 L4 @0x00007fdb2ac0f150 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+333 L4 @0x00007fdb2ac0f388 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+339 L4 @0x00007fdb2ac0f408 0f 84 fa ff ff ff jz @0x00007fdb2ac0f1b8[8byte]
+345 L4 @0x00007fdb2ac0ba20 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+351 L4 @0x00007fdb2ac0f320 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+357 L4 @0x00007fdb2ac0f508 48 63 c9 movsxd %ecx -> %rcx
+360 L4 @0x00007fdb2ac0f588 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+367 L4 @0x00007fdb2ac0ef30 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+373 L4 @0x00007fdb2ac0c050 c4 e3 79 22 c1 03 vpinsrd %xmm0 %ecx $0x03 -> %xmm0
+379 L4 @0x00007fdb2ac0c1c8 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+385 L4 @0x00007fdb2ac0c3c8 33 c9 xor %ecx %ecx -> %ecx
+387 L4 @0x00007fdb2ac0bfd0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+393 L4 @0x00007fdb2ac5de68 c4 e3 79 22 c1 03 vpinsrd %xmm0 %ecx $0x03 -> %xmm0
+399 L4 @0x00007fdb2ac5dde8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+405 m4 @0x00007fdb2ac0f1b8 <label>
+405 L4 @0x00007fdb2ac5d898 c4 41 11 ef ed vpxor %xmm13 %xmm13 -> %xmm13 // Zero the mask reg.
+410 m4 @0x00007fdb2ac5dd68 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+419 m4 @0x00007fdb2ac5dce8 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+426 m4 @0x00007fdb2ac5d980 48 8b 09 mov (%rcx)[8byte] -> %rcx
+429 m4 @0x00007fdb2ac5dc68 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+433 m4 @0x00007fdb2ac5dbe8 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+437 m4 @0x00007fdb2ac5db68 62 f1 7c 48 28 01 vmovaps {%k0} (%rcx)[64byte] -> %zmm0 // Manually restore the scratch zmm reg.
+443 m4 @0x00007fdb2ac5dae8 65 48 8b 0c 25 e8 00 mov %gs:0x000000e8[8byte] -> %rcx // Restore the scratch GPR using drreg.
00 00
+452 m4 @0x00007fdb2ac5da68 3c 81 cmp %al $0x81
+454 m4 @0x00007fdb2ac5d9e8 9e sahf %ah
+455 m4 @0x00007fdb2ac5d900 65 48 a1 e0 00 00 00 mov %gs:0x000000e0[8byte] -> %rax // Restore aflags using drreg.
00 00 00 00
+466 m4 @0x00007fdb2ac5d818 <label>

Expansion for x86 scatter

vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
+0 m4 @0x00007fdb2ac106e0 65 48 a3 e0 00 00 00 mov %rax -> %gs:0x000000e0[8byte]
00 00 00 00
+11 m4 @0x00007fdb2ac10760 9f lahf -> %ah
+12 m4 @0x00007fdb2ac107e0 0f 90 c0 seto -> %al // Spill aflags using drreg.
+15 m4 @0x00007fdb2ac100a8 65 48 89 0c 25 e8 00 mov %rcx -> %gs:0x000000e8[8byte] // Spill the first scratch GPR using drreg.
00 00
+24 m4 @0x00007fdb2ac10110 65 48 89 14 25 f0 00 mov %rdx -> %gs:0x000000f0[8byte] // Spill the second scratch GPR using drreg.
00 00
+33 m4 @0x00007fdb2ac0fed8 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+42 m4 @0x00007fdb2ac0ff40 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+49 m4 @0x00007fdb2ac0f608 48 8b 09 mov (%rcx)[8byte] -> %rcx
+52 m4 @0x00007fdb2ac10860 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+56 m4 @0x00007fdb2ac108e0 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+60 m4 @0x00007fdb2ac0ca50 62 f1 7c 48 29 01 vmovaps {%k0} %zmm0 -> (%rcx)[64byte] // Manually spill the scratch zmm reg.
+66 m4 @0x00007fdb2ac10660 <label>
+66 L4 @0x00007fdb2ac10560 c5 f8 93 c9 kmovw %k1 -> %ecx // Expansion for the first vector element starts here.
+70 L4 @0x00007fdb2ac104f8 f7 c1 01 00 00 00 test %ecx $0x00000001
+76 L4 @0x00007fdb2ac10478 0f 84 fa ff ff ff jz @0x00007fdb2ac105e0[8byte] // Check whether to store the first element based on mask.
+82 L4 @0x00007fdb2ac103f8 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+89 L4 @0x00007fdb2ac10378 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract index for the first store address.
+95 L4 @0x00007fdb2ac102f8 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+102 L4 @0x00007fdb2ac10278 c4 e3 79 16 c2 00 vpextrd %xmm0 $0x00 -> %edx // Extract the element for the first store.
+108 L4 @0x00007fdb2ac101f8 48 63 c9 movsxd %ecx -> %rcx
+111 L4 @0x00007fdb2ac10178 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte] // Store the first element.
+118 L4 @0x00007fdb2ac10028 b9 01 00 00 00 mov $0x00000001 -> %ecx
+123 m4 @0x00007fdb2ac0ffa8 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte] // Spill the third scratch GPR using drreg.
00 00
+132 m4 @0x00007fdb2ac0fe58 c5 f8 93 d8 kmovw %k0 -> %ebx // Manually spill the scratch mask reg k0 to the scratch GPR.
+136 L4 @0x00007fdb2ac0fdd8 c5 f8 92 c1 kmovw %ecx -> %k0
+140 L4 @0x00007fdb2ac0fbf0 c5 fc 42 c9 kandnw %k0 %k1 -> %k1 // Clear bit for the first element in the mask reg.
+144 m4 @0x00007fdb2ac0fd58 c5 f8 92 c3 kmovw %ebx -> %k0 // Manually restore the scratch mask reg from the scratch GPR.
+148 m4 @0x00007fdb2ac0fcd8 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx // Restore the third scratch GPR using drreg.
00 00
+157 m4 @0x00007fdb2ac105e0 <label>
+157 L4 @0x00007fdb2ac0fb70 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the second vector element.
+161 L4 @0x00007fdb2ac0faf0 f7 c1 02 00 00 00 test %ecx $0x00000002
+167 L4 @0x00007fdb2ac0fa70 0f 84 fa ff ff ff jz @0x00007fdb2ac0fc58[8byte]
+173 L4 @0x00007fdb2ac0f9f0 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+180 L4 @0x00007fdb2ac0f970 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+186 L4 @0x00007fdb2ac0f8f0 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+193 L4 @0x00007fdb2ac0f870 c4 e3 79 16 c2 01 vpextrd %xmm0 $0x01 -> %edx
+199 L4 @0x00007fdb2ac0f7f0 48 63 c9 movsxd %ecx -> %rcx
+202 L4 @0x00007fdb2ac0f770 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+209 L4 @0x00007fdb2ac0f6f0 b9 02 00 00 00 mov $0x00000002 -> %ecx
+214 m4 @0x00007fdb2ac0f670 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+223 m4 @0x00007fdb2ac0c448 c5 f8 93 d8 kmovw %k0 -> %ebx
+227 L4 @0x00007fdb2ac0c2c8 c5 f8 92 c1 kmovw %ecx -> %k0
+231 L4 @0x00007fdb2ac0c9e8 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+235 m4 @0x00007fdb2ac0c980 c5 f8 92 c3 kmovw %ebx -> %k0
+239 m4 @0x00007fdb2ac0c818 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+248 m4 @0x00007fdb2ac0fc58 <label>
+248 L4 @0x00007fdb2ac0c518 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the third vector element.
+252 L4 @0x00007fdb2ac0c668 f7 c1 04 00 00 00 test %ecx $0x00000004
+258 L4 @0x00007fdb2ac0cab8 0f 84 fa ff ff ff jz @0x00007fdb2ac0ba20[8byte]
+264 L4 @0x00007fdb2ac0cb20 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+271 L4 @0x00007fdb2ac0c248 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+277 L4 @0x00007fdb2ac0bd98 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+284 L4 @0x00007fdb2ac0c0d0 c4 e3 79 16 c2 02 vpextrd %xmm0 $0x02 -> %edx
+290 L4 @0x00007fdb2ac0c900 48 63 c9 movsxd %ecx -> %rcx
+293 L4 @0x00007fdb2ac0bfd0 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+300 L4 @0x00007fdb2ac0c3c8 b9 04 00 00 00 mov $0x00000004 -> %ecx
+305 m4 @0x00007fdb2ac0c1c8 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+314 m4 @0x00007fdb2ac0c050 c5 f8 93 d8 kmovw %k0 -> %ebx
+318 L4 @0x00007fdb2ac0ef30 c5 f8 92 c1 kmovw %ecx -> %k0
+322 L4 @0x00007fdb2ac0f588 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+326 m4 @0x00007fdb2ac0f508 c5 f8 92 c3 kmovw %ebx -> %k0
+330 m4 @0x00007fdb2ac0f320 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+339 m4 @0x00007fdb2ac0ba20 <label>
+339 L4 @0x00007fdb2ac0f408 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the fourth vector element.
+343 L4 @0x00007fdb2ac0f388 f7 c1 08 00 00 00 test %ecx $0x00000008
+349 L4 @0x00007fdb2ac0f150 0f 84 fa ff ff ff jz @0x00007fdb2ac0f488[8byte]
+355 L4 @0x00007fdb2ac0f2a0 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+362 L4 @0x00007fdb2ac0f220 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+368 L4 @0x00007fdb2ac0f1b8 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+375 L4 @0x00007fdb2ac0f0e8 c4 e3 79 16 c2 03 vpextrd %xmm0 $0x03 -> %edx
+381 L4 @0x00007fdb2ac0f080 48 63 c9 movsxd %ecx -> %rcx
+384 L4 @0x00007fdb2ac0f018 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+391 L4 @0x00007fdb2ac0cbb8 b9 08 00 00 00 mov $0x00000008 -> %ecx
+396 m4 @0x00007fdb2ac0e930 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+405 m4 @0x00007fdb2ac0e9c8 c5 f8 93 d8 kmovw %k0 -> %ebx
+409 L4 @0x00007fdb2ac0ea60 c5 f8 92 c1 kmovw %ecx -> %k0
+413 L4 @0x00007fdb2ac0eb28 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+417 m4 @0x00007fdb2ac0ebf0 c5 f8 92 c3 kmovw %ebx -> %k0
+421 m4 @0x00007fdb2ac0ec70 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+430 m4 @0x00007fdb2ac0f488 <label>
+430 L4 @0x00007fdb2ac0ed38 c4 e1 f4 47 c9 kxorq %k1 %k1 -> %k1 // Clear the mask reg.
+435 m4 @0x00007fdb2ac0ee00 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+444 m4 @0x00007fdb2ac0ee98 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+451 m4 @0x00007fdb2ac0efb0 48 8b 09 mov (%rcx)[8byte] -> %rcx
+454 m4 @0x00007fdb2ac0bee8 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+458 m4 @0x00007fdb2ac0c750 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+462 m4 @0x00007fdb2ac0be68 62 f1 7c 48 28 01 vmovaps {%k0} (%rcx)[64byte] -> %zmm0 // Manually restore the scratch zmm reg.
+468 m4 @0x00007fdb2ac0bca0 65 48 8b 0c 25 e8 00 mov %gs:0x000000e8[8byte] -> %rcx // Restore the first scratch GPR using drreg.
00 00
+477 m4 @0x00007fdb2ac0c6d0 65 48 8b 14 25 f0 00 mov %gs:0x000000f0[8byte] -> %rdx // Restore the second scratch GPR using drreg.
00 00
+486 m4 @0x00007fdb2ac0be00 3c 81 cmp %al $0x81
+488 m4 @0x00007fdb2ac0b8b8 9e sahf %ah
+489 m4 @0x00007fdb2ac0cc20 65 48 a1 e0 00 00 00 mov %gs:0x000000e0[8byte] -> %rax // Restore aflags using drreg.
00 00 00 00
+500 m4 @0x00007fdb2ac0baa0 <label>

Expansion for AArch64 gather

ldff1sb (%x1,%z2.d)[1byte] %p3/z -> %z28.d
str %x0 -> +0x0148(%x28)[8byte] // Save flags using drreg
mrs %nzcv -> %x0
str %x0 -> +0x0150(%x28)[8byte]
ldr +0x0148(%x28)[8byte] -> %x0
str %x0 -> +0x0148(%x28)[8byte] // Save scratch GPR using drreg
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr +0x20(%x0)[8byte] -> %x0
str %z28 -> (%x0)[32byte] // Save the value of the destination register in case we
// need to restore its value on a fault.
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
str %p0 -> (%x0)[4byte] // Spill a predicate register to use as the loop variable mask
<label note=0x0000000000000001>
dup $0x00 lsl $0x00 -> %z28.d // Clear destination register
pfalse -> %p0.b // Initialize loop variable to 0
pnext %p3 %p0.d -> %p0.d // Set loop variable to the first active element
b.eq @0x0000fffda4f27518[8byte] // If no active elements, break the loop
lastb %p0 %z2.d -> %x0 // Extract the first active element index to scratch GPR
ldrsb (%x1,%x0)[1byte] -> %x0 // Load the first element to scratch GPR
cpy %p0/m %x0 -> %z28.d // Copy scratch GPR to current element of destination register
pnext %p3 %p0.d -> %p0.d // Repeat for the next active element
b.eq @0x0000fffda4f27518[8byte]
lastb %p0 %z2.d -> %x0
ldrsb (%x1,%x0)[1byte] -> %x0
cpy %p0/m %x0 -> %z28.d
pnext %p3 %p0.d -> %p0.d // Repeat for the next active element
b.eq @0x0000fffda4f27518[8byte]
lastb %p0 %z2.d -> %x0
ldrsb (%x1,%x0)[1byte] -> %x0
cpy %p0/m %x0 -> %z28.d
pnext %p3 %p0.d -> %p0.d // Repeat for the next active element
b.eq @0x0000fffda4f27518[8byte]
lastb %p0 %z2.d -> %x0
ldrsb (%x1,%x0)[1byte] -> %x0
cpy %p0/m %x0 -> %z28.d
<label note=0x0000000000000000>
<label note=0x0000000000000000>
<label note=0x0000000000000002>
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr (%x0)[4byte] -> %p0 // Restore spilled predicate register
ldr +0x0148(%x28)[8byte] -> %x0
str %x0 -> +0x0148(%x28)[8byte]
ldr +0x0150(%x28)[8byte] -> %x0 // Restore flags using drreg
msr %x0 -> %nzcv
ldr +0x0148(%x28)[8byte] -> %x0 // Restore spilled GPR
b $0x00000000004001d0

Expansion for AArch64 predicated contiguous store

st2w %z28.s %z29.s %p2 -> (%x1,%x2,lsl #2)[4byte]
str %x0 -> +0x0148(%x28)[8byte]
mrs %nzcv -> %x0 // Spill flags using drreg
str %x0 -> +0x0150(%x28)[8byte]
ldr +0x0148(%x28)[8byte] -> %x0
str %x0 -> +0x0148(%x28)[8byte] // Spill scratch GPRs using drreg
str %x3 -> +0x0158(%x28)[8byte]
str %x4 -> +0x0160(%x28)[8byte]
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr +0x20(%x0)[8byte] -> %x0
str %z0 -> (%x0)[32byte] // Manually spill scratch vector Z register
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
str %p0 -> (%x0)[4byte] // Manually spill scratch predicate P register
<label note=0x0000000000000001>
add %x1 %x2 uxtx $0x0000000000000002 -> %x4 // Calculate start address
index $0x00 $0x02 -> %z0.s // Initialize vector index register with value [0, 2, 4, ..]
pfalse -> %p0.b // Initialize loop variable to 0
pnext %p2 %p0.s -> %p0.s // Set loop variable to the first active element
b.eq @0x0000fffdb29fe3d0[8byte] // If no active elements, break the loop
lastb %p0 %z0.s -> %x0 // Extract vector index to GPR
lastb %p0 %z28.s -> %x3 // Extract vector element value from first source register to GPR
str %w3 -> (%x4,%x0,lsl #2)[4byte] // Store first register element value
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0 // Add 1 to index value
lastb %p0 %z29.s -> %x3 // Extract vector element value from second source register to GPR
str %w3 -> (%x4,%x0,lsl #2)[4byte] // Store second register element value
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
pnext %p2 %p0.s -> %p0.s // Repeat for next active element
b.eq @0x0000fffdb29fe3d0[8byte]
lastb %p0 %z0.s -> %x0
lastb %p0 %z28.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
add %x0 $0x0000000000000001 lsl $0x0000000000000000 -> %x0
lastb %p0 %z29.s -> %x3
str %w3 -> (%x4,%x0,lsl #2)[4byte]
<label note=0x0000000000000000>
<label note=0x0000000000000002>
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr +0x20(%x0)[8byte] -> %x0
ldr (%x0)[32byte] -> %z0 // Manually restore scratch vector register
ldr +0x38(%x28)[8byte] -> %x0
ldr +0x0f50(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr +0x10(%x0)[8byte] -> %x0
ldr (%x0)[8byte] -> %x0
ldr (%x0)[4byte] -> %p0 // Manually restore scratch predicate register
ldr +0x0148(%x28)[8byte] -> %x0 // Restore GPRs using drreg
ldr +0x0158(%x28)[8byte] -> %x3
ldr +0x0160(%x28)[8byte] -> %x4
str %x0 -> +0x0148(%x28)[8byte]
ldr +0x0150(%x28)[8byte] -> %x0 // Restore flags using drreg
msr %x0 -> %nzcv
ldr +0x0148(%x28)[8byte] -> %x0

As shown by the above expanded scatter and gather sequences, we require scratch registers for the expansion. The GPR scratch registers are obtained using drreg, whereas the scratch vector register and the scratch mask register are obtained by manually spilling them.

We need to make sure that we restore the application state correctly when a state restoration event occurs. Such an event can be a fault in one of the scalar loads or stores in the expanded sequence, a fault in instrumentation added by some other DR client, or an async event like DR detach. While the registers spilled via drreg are restored by drreg's state restoration logic, drx still needs to restore the scratch mask register that is spilled manually to a GPR, and the scratch vector register that is spilled manually to a drx spill slot. We also need to ensure that the mask bit for the previous access is cleared if the state restore event happened after the load or store completed but before we could reflect it in the mask. When a state restore event occurs, we walk the expanded sequence using a state machine until we reach the faulting pc, keeping track of the state that needs to be restored (commit, commit).
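The mask-clearing invariant can be illustrated with a small toy model (a hypothetical helper, not DR code): after a fault, the restored mask must have the bits for all completed elements cleared, including an element whose access completed before the fault but whose mask bit had not yet been updated by the expanded sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch of the mask-restore invariant: elements 0..completed-1
 * finished and had their mask bits cleared; if the access for element
 * `completed` also finished before the fault, its bit must be cleared
 * too, even though the expanded code had not yet updated the mask. */
static uint32_t
restored_mask(uint32_t orig_mask, int completed, int pending_access_done)
{
    uint32_t mask = orig_mask;
    for (int i = 0; i < completed; i++)
        mask &= ~(1u << i);
    if (pending_access_done)
        mask &= ~(1u << completed);
    return mask;
}
```

A completed element's bit is already zero if it was inactive to begin with, so unconditionally clearing the low bits is harmless.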

As pointed out above, this expansion is done in the app2app phase. DR clients may use drreg to get scratch registers for their instrumentation in later phases (like insertion or instru2instru). While drreg indeed supported some basic usage outside the insertion phase, it did not mitigate bad interactions arising from such multi-phase use. The following section describes the changes made in drreg to support multi-phase use.

Drreg Support For Multi-phase Reservations

Owner: Abhinav Sharma

Upstream issue: DynamoRIO/dynamorio#3823

Drreg is DynamoRIO’s register reservation framework. It allows users to reserve a register to use as scratch. Internally, drreg automatically performs the following functions so that the user does not need to:

  • keeps all required bookkeeping, like the spill slot to spilled register mapping
  • restores spilled registers to their application values before they are read by an application instruction, and re-spills them if they are written by an application instruction
  • performs application state restoration on state restore events, like an application fault or DR detach

While expanding a scatter or gather instruction in the app2app phase, we need a scratch register to hold the scalar values and masks. In later phases (like the insertion or the instru2instru phase), drcachesim and other DR clients may also use drreg to get scratch registers for their instrumentation.

Drreg initially supported only insertion phase use, with some basic support in other phases. Importantly, it did not attempt to avoid any bad interactions between the multiple phases. To support multi-phase use of drreg, we needed to solve the following:

  • avoid spill slot conflict across multiple phases: multi-phase use can potentially lead to spill slot conflicts if the same slot is selected in multiple phases. This may clobber the spilled application value and cause the application to crash or otherwise fail.
  • allow aflags spill to any slot: drreg hardcoded the aflags spill slot as the zero-th slot, to simplify some logic. To support the ability to spill aflags in multiple phases, drreg should be able to use any spill slot for aflags.
  • application state restore logic: on a state restore event, we should be able to figure out which slot contains each spilled register's app value. This is complicated by the fact that registers may be spilled by instrumentation added by multiple phases, and the spill regions may overlap which causes the spilled application value to be moved between spill slots.

We explored the following ideas to avoid spill slot conflicts in drreg:

Disjoint slot spaces or arenas

We can ask drreg to create slot spaces or arenas at init time, which are assigned disjoint spill slots. When reserving a register, the user passes in a "space/arena Id" to instruct drreg to pick free slots only from that arena. This requires keeping some global drreg state. This also requires the user to guess the best configuration for assigning slots to the arenas, and passing the correct arena Id before each reservation. It may artificially make some spill slots unavailable for use, thereby reducing efficiency.

Assign phase Id to slots

Instead of creating slot spaces at init time with a best-guess assignment of slots, we can assign a phase Id to each slot when it is first requested in that phase. We then avoid using a slot in any phase other than the one to which it was assigned. This also requires keeping some global drreg state, and it does not help avoid spill slot conflicts between multiple clients in the same phase.

Preferred: Scan fragment to determine eligible slots

When picking a spill slot, we can determine whether using it will cause a slot conflict by scanning for its uses in the current fragment after the current instruction. We pick only a spill slot that has no later uses in the current fragment. This does not require any init time guesses or any global drreg state, imposes no additional responsibilities on the user, and also works for multiple clients in the same phase. This was implemented to pick spill slots for GPRs (commit) and for aflags (commit).
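As a sketch of this strategy (a toy model, not the actual drreg implementation), each instruction in the fragment can be summarized by the spill slot it touches, and a slot is eligible only if it has no use after the current instruction:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4

/* Toy model of the preferred slot-selection strategy: slot_used_by_instr[i]
 * is the spill slot instruction i touches (-1 if none). Starting from the
 * instruction after the current one, a slot is conflict-free only if it has
 * no later use in the fragment. Returns the first such slot, or -1. */
static int
pick_slot(const int *slot_used_by_instr, int num_instrs, int cur)
{
    bool used_later[NUM_SLOTS] = { false };
    for (int i = cur + 1; i < num_instrs; i++) {
        int s = slot_used_by_instr[i];
        if (s >= 0 && s < NUM_SLOTS)
            used_later[s] = true;
    }
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (!used_later[s])
            return s; /* first slot with no later use */
    }
    return -1; /* no eligible slot in this fragment */
}
```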

State Restoration For Drreg

Owner: Abhinav Sharma

Upstream issue: DynamoRIO/dynamorio#3823, DynamoRIO/dynamorio#3801

On a state restore event, drreg should be able to restore all spilled registers to their application values.

Unfortunately, when a state restore event happens, we only have the encoded fragment, and none of the drreg state, like the register to spill slot mappings. We need to reconstruct this state based on the faulting pc and the encoded fragment.

It is complex to determine which registers need to be restored and from which spill slot. This is because drreg automatically adds spill and restore instructions to handle various complex cases, like re-spilling a reserved register after an application instruction writes it, and restoring a reserved register before an application instruction reads it. Drreg also uses various optimisations, like lazily restoring application values in case the register is reserved again. This is even more complex for aflags, whose spill and restore require at least two steps: spilling aflags involves reading them into a register using lahf and then writing that register to a spill slot, while restoring aflags involves reading them from the spill slot into a register and then writing them back using sahf; an additional step reads or writes the overflow flag if needed. In some cases, aflags are even kept in a register as an optimisation.

Additionally, in multi-phase use, a register may be spilled by multiple phases, with a separate spill slot for each phase. The application value for the register may reside in one or more spill slots, and may also move between spill slots based on how the spill regions from different phases overlap. See various tricky scenarios in drreg-test.c.

We explored two ways to adapt drreg’s state restoration logic to multi-phase use. This also fixed some known existing issues with drreg: DynamoRIO/dynamorio#4933, DynamoRIO/dynamorio#4939.

Track app values as they are moved between slots and registers

At a state restoration event, we walk the faulting fragment from the beginning to the faulting instruction, keeping track of where the native value of each register is present. At any point, it may be in the register itself, in a spill slot, or in both. We track gpr_is_native, to denote whether a register contains its native app value; and spill_slot_to_reg, to denote which register’s app value a spill slot contains.

  • When a register is written by an application instruction, we invalidate all spill_slot_to_reg entries that are mapped to that register, and also set gpr_is_native for that register.
  • When a register is written by a non-drreg meta instruction, we clear gpr_is_native for that reg.
  • When a register is loaded by drreg from the slot it was spilled to, we set gpr_is_native.
  • When a register is spilled to some spill slot, we set spill_slot_to_reg for that spill slot to that reg.

This strategy allows us to robustly keep track of the various corner cases that can arise in drreg, like spill regions from different phases overlapping (nesting or just overlapping), and the other known issues linked above. This was implemented by this commit.
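The four tracking rules above can be sketched as a toy state machine (hypothetical types and names, not the actual drreg code):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 4
#define NUM_SLOTS 4

typedef enum { APP_WRITE, META_WRITE, DRREG_RESTORE, DRREG_SPILL } op_kind_t;

typedef struct {
    op_kind_t kind;
    int reg;  /* register involved */
    int slot; /* slot involved for DRREG_SPILL; -1 otherwise */
} op_t;

typedef struct {
    bool gpr_is_native[NUM_REGS];
    int slot_to_reg[NUM_SLOTS]; /* -1 = slot holds no app value */
} track_t;

/* Toy walk over a fragment applying the four tracking rules. */
static void
walk(track_t *t, const op_t *ops, int n)
{
    for (int r = 0; r < NUM_REGS; r++)
        t->gpr_is_native[r] = true;
    for (int s = 0; s < NUM_SLOTS; s++)
        t->slot_to_reg[s] = -1;
    for (int i = 0; i < n; i++) {
        switch (ops[i].kind) {
        case APP_WRITE: /* app redefines the reg: spilled copies go stale */
            for (int s = 0; s < NUM_SLOTS; s++)
                if (t->slot_to_reg[s] == ops[i].reg)
                    t->slot_to_reg[s] = -1;
            t->gpr_is_native[ops[i].reg] = true;
            break;
        case META_WRITE: /* tool instrumentation clobbers the reg */
            t->gpr_is_native[ops[i].reg] = false;
            break;
        case DRREG_RESTORE: /* drreg reloads the app value from its slot */
            t->gpr_is_native[ops[i].reg] = true;
            break;
        case DRREG_SPILL: /* slot now holds this reg's app value */
            t->slot_to_reg[ops[i].slot] = ops[i].reg;
            break;
        }
    }
}
```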

The drawback of this approach is that it needs to be aware of methods of spilling and restoring registers outside drreg (dropped PR). DynamoRIO uses various such methods internally (spilling to the stack, or to slots not managed by drreg), and clients may use their own unique methods. A non-drreg meta instruction may therefore restore an application value to a register without this approach recognising it, causing it to lose track of that register’s application value. We dropped this approach on encountering DynamoRIO/dynamorio#4963.

Preferred: Pairing restores with spills (instead of the other way)

The key observation behind this approach is that it is easier to find the matching spill for a given restore than to find the matching restore for a given spill. This is because there may be other restores besides the final one, e.g. restores before app reads, user-prompted restores, etc., which make it hard to determine exactly where the spill region for a register or aflags ends. An additional complexity is that aflags re-spills may not use the same slot, which makes differentiating spills from multiple phases difficult.

Each restore must have a matching spill. Based on this observation, we scan the faulting fragment from end to beginning, matching register restores to their spills. When we reach the faulting instruction, any restore for which we did not see the matching spill yet must be performed by the drreg state restoration. This was implemented by (commit).

This algorithm does not need to be aware of non-drreg methods of spilling/restoring registers. Note that, like the general drreg operation, this method does not restore the application value of a spilled GPR/aflags if they are dead at the faulting instruction. However, even dead registers need to be restored when drreg_options_t.conservative is set. This can be handled if there is additional metadata available to the drreg state restore callback (DynamoRIO/dynamorio#3801).
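A toy model of this backward walk (hypothetical types and names, not the actual drreg code): scanning from the fragment's end down to the faulting instruction, a restore whose matching spill lies at or before the fault identifies a register that the state-restore handler must restore.

```c
#include <assert.h>

#define NUM_REGS 4

typedef enum { SPILL, RESTORE, OTHER } ev_kind_t;

typedef struct {
    ev_kind_t kind;
    int reg;
} ev_t;

/* Toy backward walk: scan from the end of the fragment down to (but not
 * including) the faulting instruction, pairing each restore with its spill.
 * A register whose restore was seen but whose spill was not (i.e. the spill
 * precedes the fault) must be restored by the state-restore handler.
 * Returns a bitmask of such registers. */
static unsigned
regs_to_restore(const ev_t *ops, int n, int fault_idx)
{
    int pending_restores[NUM_REGS] = { 0 };
    for (int i = n - 1; i > fault_idx; i--) {
        if (ops[i].kind == RESTORE)
            pending_restores[ops[i].reg]++;
        else if (ops[i].kind == SPILL && pending_restores[ops[i].reg] > 0)
            pending_restores[ops[i].reg]--; /* spill matched its restore */
    }
    unsigned mask = 0;
    for (int r = 0; r < NUM_REGS; r++)
        if (pending_restores[r] > 0)
            mask |= 1u << r;
    return mask;
}
```

When both the spill and the restore lie after the fault, they cancel out and no restoration is needed for that register.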

Simplifying Instrumentation For Emulated Instructions

Owner: Derek Bruening

Upstream Issue: DynamoRIO/dynamorio#4865

Emulated sequences like the expanded scatter and gather sequence described above pose another challenge for clients that need to observe both instructions and memory references. For observing instructions, these clients should see the original application instruction (that is, the scatter or gather instruction), whereas for observing memory references, they should see the emulated sequence (that is, all the individual scalar stores or loads). DynamoRIO should absorb this complexity and provide the required events to the client.

We implemented drmgr_orig_app_instr_for_fetch, drmgr_orig_app_instr_for_operands and drmgr_in_emulation_region APIs (commit, commit) that return the appropriate instruction to the client to be used for either instruction instrumentation or memory reference instrumentation. These were subsequently used in drcachesim as well (commit).

Support For Vector Reservation

Owner: Abhinav Sharma

The scatter and gather expansions require scratch vector registers, for which we need the capability to spill and restore vector registers. Following are the design choices:

  • Extend drreg to support reservation for vector registers. DynamoRIO/dynamorio#3844 aims to add this support.
  • Use custom spill and restore logic in drx. We can do this by reserving memory in TLS to use as a spill slot.

Some observations about this use-case for vector reservation:

  • We need to spill only one vector register, so we do not need sophisticated spill slot management logic.
  • The spilled vector register will not need to be restored for app reads, or re-spilled after app writes. Note that we will not encounter any application instructions that use the spilled vector register, because it is spilled only for the duration of the expanded scatter or gather sequence.

Extending drreg to support vector spilling is a complex task. Given the above observations, the current use case does not justify the effort. Therefore, we chose to implement custom spill logic in drx (commit, commit).
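A minimal sketch of such custom spill logic, assuming a per-thread structure standing in for DR's TLS (the real drx implementation reserves an actual TLS slot and differs in detail):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 64 /* enough for a 512-bit zmm or SVE vector */

/* Toy per-thread spill area; real drx reserves memory in DR's TLS. */
typedef struct {
    uint8_t vec_slot[VEC_BYTES];
} tls_t;

/* Save the vector register's contents into the per-thread slot. */
static void
spill_vector(tls_t *tls, const uint8_t *vec_reg)
{
    memcpy(tls->vec_slot, vec_reg, VEC_BYTES);
}

/* Restore the vector register's contents from the per-thread slot. */
static void
restore_vector(const tls_t *tls, uint8_t *vec_reg)
{
    memcpy(vec_reg, tls->vec_slot, VEC_BYTES);
}
```

Because only one vector register is spilled and only for the duration of the expanded sequence, a single fixed slot per thread suffices and no slot-management logic is required.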

Using The Expansion In DR Clients

Owner: Abhinav Sharma

Clients that need to observe each memory reference must use the drx_expand_scatter_gather API. This was added in the app2app phase of drcachesim and other DynamoRIO clients (commit). This also required fixing some issues (crashes and correctness problems) that surfaced when all pieces were integrated (commit, commit).

Testing On Large Apps

Owner: Abhinav Sharma

drcachesim was successfully used to trace an application with scatter and gather instructions. The resulting trace was observed to have millions of such instructions. We also verified correctness by comparing application output with and without tracing.

DR_API bool instr_is_gather(instr_t *instr)