DynamoRIO
Background
x86
The x86 gather and scatter instructions were introduced in the AVX2 and AVX512 instruction set extensions. They allow loading or storing a subset of elements in a vector from/to multiple non-contiguous addresses.
AVX2 has only gather instructions, no scatter instructions, whereas AVX512 has both. AVX2 is limited to 256-bit length vectors, whereas AVX512 has 512-bit support. Both support masking of individual memory accesses, using either a special mask register in AVX512 or another vector in AVX2.
Examples of these instructions are (in DR’s IR):
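For illustration, an AVX2 gather of the kind described below might be written as follows (Intel assembly syntax rather than DR's IR listing; operand choices are a sketch matching the description):

```asm
; AVX2 gather: dword elements, base rax, dword indices in ymm11, mask ymm13
vpgatherdd ymm12, [rax + ymm11*1], ymm13
```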
Above is an AVX2 gather instruction that reads 32-bit doublewords into the 256-bit ymm12 vector from addresses generated by adding the base address in rax to the corresponding index elements in ymm11, conditionally based on the masks in ymm13. Elements may be gathered in any order. When an element is read, its mask is cleared. If a load faults, all elements to its right (closer to the LSB) will have completed.
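The scatter counterpart described next might be written as (Intel assembly syntax, illustrative):

```asm
; AVX512 scatter: dword elements, base rax, dword indices in xmm11, mask k1
vpscatterdd [rax + xmm11*1]{k1}, xmm10
```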
Above is an AVX512 scatter instruction that writes 32-bit doublewords from the 128-bit xmm10 vector to addresses generated by adding the base address in rax to the corresponding index element in xmm11, conditionally based on the mask register k1. Elements may be scattered in any order. When an element is stored, its mask is cleared. If a store faults, all elements to its right (closer to the LSB) will have completed.
AArch64
The AArch64 SVE (Scalable Vector Extension) introduced the first AArch64 scatter and gather instructions.
SVE has two scatter and gather addressing modes (scalar+vector and vector+immediate) and two predicated contiguous load and store addressing modes (scalar+scalar and scalar+immediate). The SVE2 instruction set extension adds a third scatter/gather addressing mode: vector+scalar.
The predicated contiguous instructions do not use vector-based addressing, but they share enough other similarities with the scatter and gather instructions that DynamoRIO handles them in a similar way.
All scatter/gather/predicated contiguous instructions access elements conditionally based on the mask value of a governing predicate register. Loads are always zeroing: inactive elements are set to 0 in the destination register.
Scalar+vector
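An instruction of this form might be (Arm assembly syntax; operand choices are an illustrative sketch matching the description below):

```asm
; SVE gather: 8-bit loads zero-extended into the 32-bit elements of z1
ld1b { z1.s }, p0/z, [x0, z0.s, uxtw]
```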
Above is a scalar+vector gather load that reads 8-bit values which are zero-extended and written to 32-bit elements of the z1 vector register. Addresses are calculated by adding the base address from the scalar x0 register to the corresponding 32-bit element in the vector index register z0. Elements that are not active in the mask contained by the governing predicate register p0 are not loaded, and the corresponding element in the destination register z1 is set to 0.
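The scatter store described next might be written as (Arm assembly syntax, illustrative):

```asm
; SVE scatter: low 16 bits of each 64-bit element of z17
st1h { z17.d }, p5, [x19, z20.d]
```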
Above is a scalar+vector scatter store that writes the lowest 16 bits of the 64-bit elements of the z17 vector register. Addresses are calculated by adding the base address from the scalar x19 register to the corresponding 64-bit element in the vector index register z20. Elements that are not active in the mask contained by the governing predicate register p5 are not stored.
Vector+immediate
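An instruction of this form might be (Arm assembly syntax, illustrative):

```asm
; SVE gather: 32-bit loads sign-extended into the 64-bit elements of z6
ld1sw { z6.d }, p2/z, [z8.d, #0x18]
```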
Above is a vector+immediate gather load that reads 32-bit values which are sign-extended and written to 64-bit elements of the z6 vector register. Addresses are calculated by adding the immediate value 0x18 to the corresponding 64-bit element in the vector base register z8. Elements that are not active in the mask contained by the governing predicate register p2 are zeroed.
Vector+scalar
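An instruction of this form might be (Arm assembly syntax, illustrative):

```asm
; SVE2 scatter: 64-bit elements of z27, vector base z25, scalar offset x29
st1d { z27.d }, p7, [z25.d, x29]
```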
Introduced with SVE2. The above instruction is a scatter store that writes the 64-bit elements of the z27 vector register. Addresses are calculated by adding the value of the scalar register x29 to the corresponding 64-bit element in the vector base register z25. Elements that are not active in the mask contained by the governing predicate register p7 are not stored.
Scalar+scalar
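An instruction of this form might be (Arm assembly syntax, illustrative):

```asm
; SVE contiguous load: 8-bit loads sign-extended into the 64-bit elements of z16
ld1sb { z16.d }, p5/z, [x17, x18]
```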
The above instruction is a scalar+scalar predicated contiguous load. 8-bit values are loaded, sign-extended to 64 bits, and written to the vector register z16. The address of the first element is calculated by adding the value of the scalar index register x18 to the value of the scalar base register x17. Addresses for subsequent elements are calculated by adding the size of the loaded value (1 byte in this example) to the address of the previous element. Elements that are not active in the mask contained by the governing predicate register p5 are zeroed.
Scalar+immediate
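An instruction of this form might be (Arm assembly syntax, illustrative; the MUL VL immediate shown assumes a 32-byte vector length, for which a -0x60 byte offset is -3 vector lengths):

```asm
; SVE contiguous store: 32-bit elements of z10, byte offset -0x60
st1w { z10.s }, p3, [x11, #-3, mul vl]
```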
The above instruction is a scalar+immediate predicated contiguous store that writes the 32-bit elements of the z10 vector register. The address of the first element is calculated by adding the immediate value -0x60 to the value of the scalar base register x11. Addresses for subsequent elements are calculated by adding the size of the stored value (4 bytes in this example) to the address of the previous element. Elements that are not active in the mask contained by the governing predicate register p3 are not stored.
Note that DynamoRIO IR and disassembly for scalar+immediate instructions give the offset in bytes, but the instruction itself uses a 4-bit signed immediate which is multiplied by the current SVE vector length in bytes. Arm assembly syntax uses a vector-length-agnostic representation for the offset: #<imm4>, MUL VL.
So the above instruction (byte offset -0x60) might be encoded with an immediate of #-3, MUL VL if the current vector length is 32 bytes, or #-6, MUL VL if it is 16 bytes, and cannot be encoded for vector lengths > 32 bytes.
Non-faulting loads
Non-faulting loads (ldnf*) do not raise a fault if an element read faults. Instead, a special predicate register, the first-fault register (FFR), is updated: the FFR element corresponding to the element that faulted, and all elements higher than it, are set to 0. Elements lower than the faulting element are unchanged. Non-faulting loads support scalar+immediate addressing.
First-faulting loads
First-faulting loads (ldff*) behave like a normal gather/load instruction if the first active element causes a fault, and behave like a non-faulting load if the first active element succeeds but a later active element read causes a fault. First-faulting loads support scalar+scalar, scalar+vector, and vector+immediate addressing.
Problem Statement
Scatter and gather instructions pose a challenge to DynamoRIO clients that observe memory addresses, such as address tracing tools (e.g. drcachesim, which collects memory address and control flow traces) and taint tracking tools. They are complex to handle because:
- A single gather or scatter instruction loads from or stores to multiple addresses
- Accessed addresses may be non-contiguous
- Each access is conditional based on some mask
DynamoRIO clients only see the scatter or gather instruction and need to do more work to extract all accessed addresses. This is unlike regular scalar loads or stores, where the accessed address is readily available. The goal of this work is to make it easier for DR clients to observe these addresses. We achieve this by expanding the scatter and gather instructions into a functionally equivalent sequence of scalar stores and loads. This way, DR clients see regular store and load instructions which they can instrument as usual. This is similar to what DR does for repeat string operations (like rep movs and repnz cmps): it converts them into a loop so that each memory access is made by a separate dynamic instruction. This method has worked well for such instructions that implicitly issue multiple memory accesses.
Original issue: DynamoRIO/dynamorio#2985.
Design
This required the addition of new support in various DynamoRIO components, like drreg, drx, drmgr and core DR. Multiple contributors worked on designing and implementing the required changes.
Scatter/gather Instruction Expansion
Owner: Hendrik Greving
As described above, we can simplify work for DR clients by replacing each scatter and gather instruction with a functionally equivalent sequence of scalar stores and loads. The expanded sequence is the unrolled version of the following loop:
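As a sketch (illustrative names; the real expansion is unrolled and operates on vector and mask registers rather than C arrays), the gather case corresponds to a loop like:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative C model of the loop a gather expands to.  Assumes 32-bit
 * elements and a per-element mask; the real expansion emits one scalar
 * load per element rather than a runtime loop. */
void emulated_gather(uint32_t *dst, uint8_t *mask, const int32_t *index,
                     const uint8_t *base, size_t n_elems)
{
    for (size_t i = 0; i < n_elems; i++) {
        if (mask[i]) {
            /* Scalar load from base + index, inserted into element i. */
            dst[i] = *(const uint32_t *)(const void *)(base + index[i]);
            /* Clear this element's mask bit once its access completes. */
            mask[i] = 0;
        }
    }
}
```

A scatter is the same loop with the load replaced by a scalar store of element i.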
Due to the x86 ISA, the extraction/insertion of the scalar value from/to the vector may involve multiple steps; e.g., to extract a 32-bit scalar value from a 512-bit zmm register, we first need to extract a 128-bit xmm from it.
drmgr in DR provides multiple phases of instrumentation. Our expansion is done in the first phase, known as app2app. As the name suggests, this phase is intended to transform app instructions into equivalent instructions. For simplicity, we also separate out the scatter and gather instructions from their basic block and create a separate fragment with only the expanded sequence. The logic for expanding scatter and gather instructions is implemented in the drx extension library as drx_expand_scatter_gather, and can be used by any client that needs it, including drcachesim. This support was added by commit.
As an example, the following are the expansions of some instructions.
Expansion for x86 gather
Expansion for x86 scatter
Expansion for AArch64 gather
Expansion for AArch64 predicated contiguous store
As shown by the above expanded scatter and gather sequences, we require scratch registers for the expansion. The GPR scratch registers are obtained using drreg, whereas the scratch vector register and the scratch mask register are obtained by manually spilling them.
We need to make sure that we restore the application state correctly when a state restoration event occurs, which can be a fault in one of the scalar loads or stores in the expanded sequence, a fault in instrumentation added by some other DR client, or some async event like DR detach. While the spilled registers obtained from drreg are restored by the drreg state restoration logic, drx still needs to restore the scratch mask register that is spilled manually to a GPR, and the scratch zmm register that is spilled manually to a drx spill slot. We also need to ensure that the bit for the previous access is cleared if the state restore event happened after the load or store completed but before we could reflect it in the mask. When a state restore event occurs, we walk the expanded sequence using a state machine till we reach the faulting pc, keeping track of the state that needs to be restored (commit, commit).
As pointed out above, this expansion is done in the app2app phase. DR clients may use drreg to get scratch registers for their instrumentation in later phases (like insertion or instru2instru). While drreg indeed supports some basic usage outside of the insertion phase, it does not mitigate bad interactions by such multi-phase use. The following section talks about the changes made in drreg to support multi-phase use.
Drreg Support For Multi-phase Reservations
Owner: Abhinav Sharma
Upstream issue: DynamoRIO/dynamorio#3823
Drreg is DynamoRIO’s register reservation framework. It allows users to reserve a register to use as scratch. Internally, drreg automatically performs the following functions so that the user does not need to. Drreg:
- keeps all required book-keeping like the spill slot to spilled register mapping
- restores spilled registers to their application value before they are read by an application instruction; also, it re-spills the spilled registers if they are written by an application instruction.
- performs application state restoration on state restore events, like encountering an application fault or a DR detach.
While expanding a scatter or gather instruction in the app2app phase, we need a scratch register to hold the scalar values and masks. In later phases (like the insertion or the instru2instru phase), drcachesim and other DR clients may also use drreg to get scratch registers for their instrumentation.
Drreg initially supported only insertion phase use, with some basic support in other phases. Importantly, it did not attempt to avoid any bad interactions between the multiple phases. To support multi-phase use of drreg, we needed to solve the following:
- avoid spill slot conflict across multiple phases: multi-phase use can potentially lead to spill slot conflicts if the same slot is selected in multiple phases. This may clobber the spilled application value and cause the application to crash or otherwise fail.
- allow aflags spill to any slot: drreg hardcoded the aflags spill slot as the zero-th slot, to simplify some logic. To support the ability to spill aflags in multiple phases, drreg should be able to use any spill slot for aflags.
- application state restore logic: on a state restore event, we should be able to figure out which slot contains each spilled register's app value. This is complicated by the fact that registers may be spilled by instrumentation added by multiple phases, and the spill regions may overlap which causes the spilled application value to be moved between spill slots.
We explored the following ideas to avoid spill slot conflicts in drreg:
Disjoint slot spaces or arenas
We can ask drreg to create slot spaces or arenas at init time, which are assigned disjoint spill slots. When reserving a register, the user passes in a "space/arena Id" to instruct drreg to pick free slots only from that arena. This requires keeping some global drreg state. This also requires the user to guess the best configuration for assigning slots to the arenas, and passing the correct arena Id before each reservation. It may artificially make some spill slots unavailable for use, thereby reducing efficiency.
Assign phase Id to slots
Instead of creating slot spaces at init time with a best-guess assignment of slots, we can instead assign a phase Id to slots when they are requested in that phase. We then avoid using slots that are already assigned a phase Id, when we are not in that phase where the slot was used before. This also requires keeping some global drreg state. This does not help in avoiding spill slot conflicts between multiple clients in the same phase.
Preferred: Scan fragment to determine eligible slots
When picking a spill slot, we can determine whether using it will cause a slot conflict by scanning for its uses in the current fragment after the current instruction. We pick only a spill slot that has no later use in the current fragment. This does not require any init-time guesses or keeping any global drreg state. It does not impose any additional responsibilities on the users, and it also works for multiple clients in the same phase. This was implemented to pick spill slots for GPRs (commit) and aflags too (commit).
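This scan can be sketched as follows, over a hypothetical flattened view of a fragment in which each instruction records the spill slot it references, if any (the real implementation walks DR's instruction list):

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 4

/* Hypothetical per-instruction summary: which spill slot the instruction
 * references, or -1 if none. */
typedef struct {
    int slot_used;
} instr_info_t;

static bool slot_used_later(const instr_info_t *frag, size_t n,
                            size_t cur, int slot)
{
    for (size_t i = cur + 1; i < n; i++)
        if (frag[i].slot_used == slot)
            return true;
    return false;
}

/* Return the first slot with no use after the current instruction,
 * or -1 if every slot has a later (conflicting) use. */
int pick_spill_slot(const instr_info_t *frag, size_t n, size_t cur)
{
    for (int slot = 0; slot < NUM_SLOTS; slot++)
        if (!slot_used_later(frag, n, cur, slot))
            return slot;
    return -1;
}
```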
State Restoration For Drreg
Owner: Abhinav Sharma
Upstream issue: DynamoRIO/dynamorio#3823, DynamoRIO/dynamorio#3801
On a state restore event, drreg should be able to restore all spilled registers to their application values.
Unfortunately, when a state restore event happens, we only have the encoded fragment, and none of the drreg state, like the register to spill slot mappings. We need to reconstruct this state based on the faulting pc and the encoded fragment.
It is complex to determine which registers need to be restored and from which spill slot. This is because drreg automatically adds spill and restore instructions to handle various complex cases, like automatic re-spilling of reserved registers after their application write instruction, and automatic restore of reserved registers before their application read instruction. Drreg also uses various optimisations, like lazy restores for application values in case the register is reserved again. This is even more complex for aflags, for which spill and restore require at least two steps (spilling aflags involves reading aflags into a register using lahf and then writing that register to a spill slot; restoring aflags involves reading aflags from its spill slot to a register, and then writing aflags from that register using sahf), plus an additional step for reading or writing the overflow flag if needed. In some cases, aflags are even kept in a register as an optimisation.
Additionally, in multi-phase use, a register may be spilled by multiple phases, with a separate spill slot for each phase. The application value for the register may reside in one or more spill slots, and may also move between spill slots based on how the spill regions from different phases overlap. See various tricky scenarios in drreg-test.c.
We explored two ways to adapt drreg’s state restoration logic to multi-phase use. This also fixed some known existing issues with drreg: DynamoRIO/dynamorio#4933, DynamoRIO/dynamorio#4939.
Track app values as they are moved between slots and registers
At a state restoration event, we walk the faulting fragment from the beginning to the faulting instruction, and we keep track of where the native value of each register is present. At any point, it may be present in the register itself, a spill slot, or both. We track gpr_is_native to denote whether a register contains its native app value or not, and spill_slot_to_reg to denote which register’s app value a spill slot contains.
- When a register is written by an application instruction, we invalidate all spill_slot_to_reg entries that are mapped to that register, and also set gpr_is_native for that register.
- When a register is written by a non-drreg meta instruction, we clear gpr_is_native for that reg.
- When a register is loaded by drreg from the slot it was spilled to, we set gpr_is_native.
- When a register is spilled to some spill slot, we set spill_slot_to_reg for that spill slot to that reg.
This strategy allows us to robustly keep track of the various corner cases that can arise in drreg, like spill regions from different phases overlapping (nesting or just overlapping), and the other known issues linked above. This was implemented by this commit.
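The four rules can be sketched as a small state machine (simplified, hypothetical types; the real walk operates on DR's decoded instruction list):

```c
#include <stdbool.h>

#define NUM_GPRS 16
#define NUM_SLOTS 8
#define REG_NONE (-1)

typedef struct {
    bool gpr_is_native[NUM_GPRS];     /* register holds its app value */
    int spill_slot_to_reg[NUM_SLOTS]; /* whose app value the slot holds */
} restore_state_t;

void restore_state_init(restore_state_t *s)
{
    for (int r = 0; r < NUM_GPRS; r++)
        s->gpr_is_native[r] = true;   /* all registers start native */
    for (int slot = 0; slot < NUM_SLOTS; slot++)
        s->spill_slot_to_reg[slot] = REG_NONE;
}

/* App instruction writes reg: slot copies become stale; reg is native. */
void on_app_write(restore_state_t *s, int reg)
{
    for (int slot = 0; slot < NUM_SLOTS; slot++)
        if (s->spill_slot_to_reg[slot] == reg)
            s->spill_slot_to_reg[slot] = REG_NONE;
    s->gpr_is_native[reg] = true;
}

/* Non-drreg meta instruction writes reg: reg no longer holds its app value. */
void on_meta_write(restore_state_t *s, int reg)
{
    s->gpr_is_native[reg] = false;
}

/* drreg reloads reg from the slot it was spilled to. */
void on_drreg_restore(restore_state_t *s, int reg)
{
    s->gpr_is_native[reg] = true;
}

/* reg's current app value is spilled to slot. */
void on_spill(restore_state_t *s, int reg, int slot)
{
    s->spill_slot_to_reg[slot] = reg;
}
```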
The drawback of this approach is that it needs to be aware of other methods of spilling and restoring registers outside drreg (dropped PR). DynamoRIO uses various such methods internally (spilling to stack, slots not managed by drreg), and also the client may use their own unique methods. So, some non-drreg meta instructions may actually restore an application value to a register, but this approach will not be able to recognize that. This may cause it to lose track of some register’s application value. We dropped this approach on encountering DynamoRIO/dynamorio#4963.
Preferred: Pairing restores with spills (instead of the other way)
The key observation behind this approach is that it is easier to find the matching spill for a given restore, than to find the matching restore for a given spill. This is because there may be other restores besides the final restore, e.g. restores for app read, user prompted restores, etc. This makes it hard to find exactly where the spill region for a register/aflags ends. Additional complexities include the fact that aflags re-spills may not use the same slot, which makes differentiating spills from multiple phases difficult.
Each restore must have a matching spill. Based on this observation, we scan the faulting fragment from end to beginning, matching register restores to their spills. When we reach the faulting instruction, any restore for which we did not see the matching spill yet must be performed by the drreg state restoration. This was implemented by (commit).
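A sketch of the backward scan over a simplified representation of the fragment (the real walk also tracks slots, aflags, and multi-phase nesting):

```c
#include <stdbool.h>

#define NUM_GPRS 16

typedef enum { INSTR_OTHER, INSTR_SPILL, INSTR_RESTORE } instr_kind_t;

typedef struct {
    instr_kind_t kind;
    int reg; /* register spilled or restored, if any */
} frag_instr_t;

/* Scan backward from the end of the fragment to the faulting instruction,
 * pairing each restore with its spill.  A restore still unpaired when we
 * reach the fault means its spill lies at or before the fault, i.e. that
 * register is currently spilled and must be restored. */
void find_regs_to_restore(const frag_instr_t *frag, int n, int fault_idx,
                          bool to_restore[NUM_GPRS])
{
    for (int r = 0; r < NUM_GPRS; r++)
        to_restore[r] = false;
    for (int i = n - 1; i > fault_idx; i--) {
        if (frag[i].kind == INSTR_RESTORE)
            to_restore[frag[i].reg] = true;  /* unpaired so far */
        else if (frag[i].kind == INSTR_SPILL)
            to_restore[frag[i].reg] = false; /* pairs the later restore */
    }
}
```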
This algorithm does not need to be aware of non-drreg methods of spilling/restoring registers. Note that, like the general drreg operation, this method does not restore the application value of a spilled GPR/aflags if they are dead at the faulting instruction. However, even dead registers need to be restored when drreg_options_t.conservative is set. This can be handled if there is additional metadata available to the drreg state restore callback (DynamoRIO/dynamorio#3801).
Simplifying Instrumentation For Emulated Instructions
Owner: Derek Bruening
Upstream Issue: DynamoRIO/dynamorio#4865
Emulated sequences like the expanded scatter and gather sequence described above pose another challenge for clients that need to observe instructions and memory references both. For observing instructions, these clients should see the original application instruction (that is, the scatter or gather instruction), whereas for observing memory references, they should see the emulated sequence (that is, all the individual scalar stores or loads). DynamoRIO should absorb this complexity and provide the required events to the client.
We implemented the drmgr_orig_app_instr_for_fetch, drmgr_orig_app_instr_for_operands, and drmgr_in_emulation_region APIs (commit, commit) that return the appropriate instruction to the client to be used for either instruction instrumentation or memory reference instrumentation. These were subsequently used in drcachesim as well (commit).
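As a sketch, a client's insertion-phase callback might use these APIs along the following lines (instrument_instr and instrument_mem are hypothetical client helpers, not DR functions):

```c
/* Sketch of an insertion-phase callback; instrument_instr and
 * instrument_mem are hypothetical helpers supplied by the client. */
static dr_emit_flags_t
event_insert(void *drcontext, void *tag, instrlist_t *bb, instr_t *where,
             bool for_trace, bool translating, void *user_data)
{
    /* For instruction-fetch instrumentation: the original app instruction
     * (e.g., the gather itself), or NULL if this point has none. */
    instr_t *instr_fetch = drmgr_orig_app_instr_for_fetch(drcontext);
    if (instr_fetch != NULL)
        instrument_instr(drcontext, bb, where, instr_fetch);
    /* For memory instrumentation: the instruction whose operands describe
     * the access (e.g., each scalar load/store of the expansion). */
    instr_t *instr_operands = drmgr_orig_app_instr_for_operands(drcontext);
    if (instr_operands != NULL &&
        (instr_reads_memory(instr_operands) ||
         instr_writes_memory(instr_operands)))
        instrument_mem(drcontext, bb, where, instr_operands);
    return DR_EMIT_DEFAULT;
}
```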
Support For Vector Reservation
Owner: Abhinav Sharma
The scatter and gather expansions require scratch vector registers, for which we need the capability to spill and restore vector registers. Following are the design choices:
- Extend drreg to support reservation for vector registers. DynamoRIO/dynamorio#3844 aims to add this support.
- Use custom spill and restore logic in drx. We can do this by reserving memory in TLS to use as a spill slot.
Some observations about this use-case for vector reservation:
- We need to spill only one vector register, so we do not need sophisticated spill slot management logic.
- The spilled vector register will not need to be restored for app reads, or re-spilled after app writes. Note that we will not encounter any application instructions that use the spilled vector register, because it needs to be spilled only for the duration of the expanded scatter or gather sequence.
Extending drreg to support vector spilling is a complex task. Given the above observations, the current use case does not justify the effort. Therefore, we chose to implement custom spill logic in drx (commit, commit).
Using The Expansion In DR Clients
Owner: Abhinav Sharma
Clients that need to observe each memory reference must use the drx_expand_scatter_gather API. This was added in the app2app phase of drcachesim and other DynamoRIO clients (commit). This also required fixing some issues (crashes and correctness problems) that surfaced when all pieces were integrated (commit, commit).
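A minimal app2app callback using this API might look as follows (a sketch; error handling simplified):

```c
/* Sketch of an app2app callback that expands scatter/gather instructions
 * in the basic block before later instrumentation phases see it. */
static dr_emit_flags_t
event_app2app(void *drcontext, void *tag, instrlist_t *bb,
              bool for_trace, bool translating)
{
    bool expanded;
    /* Replaces any scatter/gather in bb with a scalar load/store sequence;
     * expanded reports whether anything was rewritten. */
    if (!drx_expand_scatter_gather(drcontext, bb, &expanded))
        DR_ASSERT_MSG(false, "drx_expand_scatter_gather failed");
    return DR_EMIT_DEFAULT;
}
```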
Testing On Large Apps
Owner: Abhinav Sharma
drcachesim was successfully used to trace an application with scatter and gather instructions. The resulting trace was observed to have millions of such instructions. We also verified correctness by comparing application output with and without tracing.