DynamoRIO
AArch64 Port

This page contains a record of some design decisions for the port to AArch64. The AArch64 master issue, #1569, has a list of commits and some more up-to-date information on the status of this port.

Introduction to AArch64

AArch64 is the ARM architecture's 64-bit execution state, which was introduced in version 8 of the architecture, ARMv8, announced in 2011. There have been subsequent updates to the architecture: ARMv8.2 was announced in 2016.

ARM defines three architecture "profiles" (A, R and M), representing architecture configurations and subsets appropriate to different market segments. For DynamoRIO we are only concerned with the "application profile", ARMv8-A, which includes virtual memory.

ARMv8 also defines the 32-bit execution state, AArch32, which uses the A32 ("ARM") and T32 ("Thumb") instruction sets familiar from previous versions of the ARM architecture. It is only possible to switch between AArch32 and AArch64 on an exception. A system that runs AArch64 software may or may not also be able to run AArch32 software. Although there are many similarities between AArch32 and AArch64, there are also some fundamental differences, so for many purposes it is helpful to think of them as separate architectures. This is the approach taken by DynamoRIO, which has the preprocessor macros AARCH64, ARM and X86, and subdirectories in the source code with the same names in lower case. There is also a preprocessor macro AARCHXX, with a corresponding subdirectory, to facilitate sharing code between AArch32 and AArch64 where this is convenient.

Note that in DynamoRIO's source code, as in many other places, "ARM" is used to mean AArch32.

Linux uses the name "arm64" for its AArch64 architecture (which includes an ABI and other things not specified by the ARM Architecture). GCC and other tool chains use "aarch64" (lower case). So there is a Debian package called "gcc-aarch64-linux-gnu", which is the "GNU C compiler for the arm64 architecture".

The AArch64 user-mode execution state consists of:

  • X0-X30: 31 64-bit general-purpose registers. X30 is used as the procedure link register.
  • A 64-bit program counter (PC) and stack pointer (SP). Unlike in AArch32, these are distinct from the numbered registers.
  • V0-V31: 32 128-bit registers for floating-point and SIMD.
  • NZCV: Condition Flags (the top bits of a 32-bit register).
  • FPCR: Floating-Point Control Register (32 bits, some unused).
  • FPSR: Floating-Point Status Register (32 bits, some unused).
  • TPIDR_EL0: a 64-bit system register that is readable and writable in user mode. Linux uses it for thread-local storage (TLS).

The ARM architecture is bi-endian: the operating system can switch between little-endian and big-endian handling of data, with little-endian as the default. The Linux arm64 kernel can be configured as big-endian but all major Linux arm64 distributions are little-endian.

IR decisions

AArch64 has 31, not 32, general-purpose registers. Depending on the context, the value 31 in an encoding may refer either to the stack pointer or, more often, to the "zero register", which is read as zero and unaffected by a write (it is a pseudo-register). DynamoRIO's internal representation (IR) distinguishes between XSP and XZR. In the enum, DR_REG_XSP follows DR_REG_X30 and is included in the range DR_REG_START_GPR to DR_REG_STOP_GPR even though XSP is not usually interchangeable with other X registers. DR_REG_XZR is not included in the "GPR" range.

The IR distinguishes between the "X" registers and the "W" registers, which are aliases for the lower 32 bits of an X register. Writing to a W register sets the top half of the corresponding X register to zero. Similarly, there are aliases for the lowest part of an FP/SIMD register: DR_REG_B0 (8 bits), DR_REG_H0 (16 bits), DR_REG_S0 (32 bits), DR_REG_D0 (64 bits), and DR_REG_Q0 (all 128 bits). (This is a noteworthy difference from AArch32: in AArch32, S3 is the highest word of D1 and of Q0; in AArch64, S3 is the lowest word of D3 and of Q3.)
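The zero-extension rule for W registers can be modelled in a few lines of C. This is an illustrative sketch only: the function names are hypothetical and do not correspond to anything in DynamoRIO's source.

```c
#include <stdint.h>

/* Hypothetical model of a write to Wn: the result is the new value of Xn.
 * The upper 32 bits are cleared rather than preserved, so no bits of the
 * old X value survive the write. */
uint64_t
model_write_w(uint64_t old_x, uint32_t value)
{
    (void)old_x; /* Deliberately unused: the old contents are discarded. */
    return (uint64_t)value;
}

/* Hypothetical model of a read of Wn: the low 32 bits of Xn. */
uint32_t
model_read_w(uint64_t x)
{
    return (uint32_t)(x & 0xffffffffu);
}
```

For example, model_write_w(0xffffffffffffffff, 0x12345678) yields 0x0000000012345678.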

There are the expected differences between DynamoRIO's IR and the standard assembly language. In particular, DynamoRIO lists source and destination registers separately. A register operand that is both read and written must appear in both lists, as must a register whose contents are only partly overwritten by an instruction. An example is MOVK, which overwrites part of a general-purpose register with a constant value.
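To see why MOVK needs its destination register listed as a source too, it may help to model the instruction in C. The function below is a hypothetical sketch of the architectural behaviour of MOVK Xd, #imm16, LSL #shift. It is not DynamoRIO code.

```c
#include <stdint.h>

/* Model of MOVK Xd, #imm16, LSL #shift, where shift is 0, 16, 32 or 48.
 * Only one 16-bit field of Xd is replaced; the other 48 bits come from the
 * old value, so the result depends on Xd as an input. */
uint64_t
model_movk(uint64_t old_xd, uint16_t imm16, unsigned shift)
{
    uint64_t field_mask = (uint64_t)0xffff << shift;
    return (old_xd & ~field_mask) | ((uint64_t)imm16 << shift);
}
```

For instance, model_movk(0x1111222233334444, 0xabcd, 16) gives 0x11112222abcd4444: three of the four 16-bit fields are carried over from the old register value.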

Descriptions of the ARM architecture distinguish between "instructions" and "aliases". For example, CMP X1, X2 is an alias for SUBS XZR, X1, X2: a flag-setting subtract that discards the result by specifying the zero register as the destination. A typical assembler accepts both of these forms and generates the same instruction, which is typically disassembled as CMP. DynamoRIO's AArch64 IR ignores aliases, so there is no OP_cmp. For convenience, however, there are (or should be) macros in aarch64/instr_create_api.h corresponding to the standard aliases.
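The relationship between an alias and its underlying instruction can be shown at the level of raw encodings. The sketch below handles only the 64-bit shifted-register form of SUBS with a shift of LSL #0. The function names are hypothetical and are not the macros from aarch64/instr_create_api.h.

```c
#include <stdint.h>

/* In this instruction, register number 31 in the Rd field means XZR. */
#define A64_REGNUM_31 31u

/* SUBS Xd, Xn, Xm (64-bit, shifted register, LSL #0):
 * fixed bits 0xEB000000, Rm in bits 20:16, Rn in bits 9:5, Rd in bits 4:0. */
uint32_t
encode_subs_x(unsigned rd, unsigned rn, unsigned rm)
{
    return 0xeb000000u | ((rm & 31u) << 16) | ((rn & 31u) << 5) | (rd & 31u);
}

/* CMP Xn, Xm is not a separate instruction: it is SUBS with XZR as the
 * destination, so the subtraction result is discarded. */
uint32_t
encode_cmp_x(unsigned rn, unsigned rm)
{
    return encode_subs_x(A64_REGNUM_31, rn, rm);
}
```

With these definitions, encode_cmp_x(1, 2) and encode_subs_x(31, 1, 2) both produce 0xEB02003F, which is why a decoder without alias support reports SUBS for what an assembler source file called CMP.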

There is no DR_REG_PC for AArch64. Literal loads and instructions that generate a PC-relative address are represented as in X86_64, using REL_ADDR_kind, not as in ARM/AArch32.

TBD: NZCV, FPCR, FPSR, SIMD instructions.

Encoder/decoder

AArch64 has a single instruction set, called "A64", in which all instructions have 32 bits. The encoding is relatively simple and consistent, which makes it possible in some cases to deduce properties of an instruction without fully decoding it. For example, a general-purpose register operand is encoded in one of four positions in the instruction word, so it may be possible to know that an instruction does not read or write a given register even without knowing anything else about the instruction. Similarly, it is possible to recognise a potential load/store instruction by examining just a few bits.
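Both of these shortcuts can be sketched in C. The code assumes the four register positions are bits 4:0, 9:5, 14:10 and 20:16, and that the load/store encoding group is identified by bit 27 being set and bit 25 being clear. The function names are hypothetical and this is not DynamoRIO's decoder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conservative test: true if any of the four 5-bit fields that can hold a
 * general-purpose register number equals regnum (0..30). A false result
 * means the instruction cannot name that register directly. A true result
 * may be a false positive, because a field can also hold an immediate. */
bool
a64_may_reference_gpr(uint32_t enc, unsigned regnum)
{
    return ((enc >> 0) & 31u) == regnum || ((enc >> 5) & 31u) == regnum ||
        ((enc >> 10) & 31u) == regnum || ((enc >> 16) & 31u) == regnum;
}

/* True if the encoding falls in the load/store group: bit 27 set and
 * bit 25 clear. Not every encoding in the group is an allocated instruction. */
bool
a64_in_load_store_group(uint32_t enc)
{
    return (enc & 0x0a000000u) == 0x08000000u;
}
```

As a usage example, 0xF9400020 (LDR X0, [X1]) is in the load/store group, while 0x8B020020 (ADD X0, X1, X2) is not and cannot reference X28.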

Encodings are described in "codec.txt", which is processed by "codec.py" to generate several C source files. In order to avoid adding Python as a build requirement these generated files are included in the source. A developer who modifies "codec.txt" should run "codec.py" manually.

Adding a new instruction to "codec.txt" will often require adding a new operand type, for which encoder and decoder functions must be added in "codec.c".

Currently the instruction bit patterns listed in "codec.txt" are not allowed to overlap. A possible extension would be to allow a more specific pattern (one with fewer 'x' bits) to override a less specific pattern. This would allow NOP, YIELD, WFE, WFI, SEV and SEVL to be defined as special cases of HINT, but there are other ways of handling HINT so this single case is not a strong argument for extending the notation. Also, there may be other ways of extending the notation that are inconsistent with the approach just described.
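The "more specific pattern wins" rule could look roughly like the following. This is a sketch of the possible extension, not of what "codec.py" currently generates. The table layout, type name and function names are all hypothetical, and each pattern is represented as a mask of fixed bits plus the value those bits must have.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t mask;  /* 1 = fixed bit, 0 = 'x' (don't-care) bit */
    uint32_t value; /* required value of the fixed bits */
    const char *name;
} pattern_t;

/* HINT leaves bits 11:5 free; the named hints fix all 32 bits. */
static const pattern_t patterns[] = {
    { 0xfffff01fu, 0xd503201fu, "hint" }, { 0xffffffffu, 0xd503201fu, "nop" },
    { 0xffffffffu, 0xd503203fu, "yield" }, { 0xffffffffu, 0xd503205fu, "wfe" },
    { 0xffffffffu, 0xd503207fu, "wfi" }, { 0xffffffffu, 0xd503209fu, "sev" },
    { 0xffffffffu, 0xd50320bfu, "sevl" },
};

/* Number of fixed bits in a mask: more fixed bits means more specific. */
static unsigned
count_fixed_bits(uint32_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Returns the name of the matching pattern with the fewest 'x' bits,
 * or NULL if no pattern in the table matches. */
const char *
match_most_specific(uint32_t enc)
{
    const char *best = NULL;
    unsigned best_bits = 0;
    for (size_t i = 0; i < sizeof(patterns) / sizeof(patterns[0]); i++) {
        if ((enc & patterns[i].mask) != patterns[i].value)
            continue;
        unsigned bits = count_fixed_bits(patterns[i].mask);
        if (best == NULL || bits > best_bits) {
            best = patterns[i].name;
            best_bits = bits;
        }
    }
    return best;
}
```

Here 0xD503201F resolves to "nop" rather than "hint", while an unnamed hint such as 0xD50320DF still falls back to "hint". A real implementation would also need a rule for two matching patterns with the same number of fixed bits, which this sketch resolves arbitrarily in favour of the earlier table entry.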

As of the end of 2016, DynamoRIO's encoder/decoder handles all the load/store instructions, including load/store of FP/SIMD registers, and all the instructions that do not operate on FP/SIMD registers, up to ARMv8.2.

Because the decoder is incomplete, unrecognised instructions are decoded as instances of a generic instruction, OP_xx, which is regarded as reading and writing the general-purpose registers referenced in the four places in the instruction word where the number of a general-purpose register might appear. This ensures that undecoded FP/SIMD instructions are correctly (though perhaps inefficiently) handled when they might read or write the "stolen" register.

Stolen register

DynamoRIO uses a "stolen" register on AArch64 for the same reason as on AArch32: it is not possible to use TPIDR_EL0/TPIDRURO directly as an address for accessing memory. The stolen register may be specified on the command line at run time; by default it is X28.

If the fragment cache were not shared between threads it would be possible to avoid stealing a general-purpose register: borrow TPIDR_EL0 instead and spill registers, when necessary, by first spilling a general-purpose register into TPIDR_EL0 and then generating a memory address with ADRP. This way one could avoid the expense of mangling instructions that use a stolen general-purpose register, but instrumentation would be more expensive in some cases, so the value for DynamoRIO of this approach is unclear.

Reachability

An AArch64 unconditional immediate/direct branch (B or BL) has a range of +/- 128 MiB. If the fragment cache were restricted to a 128 MiB block of memory then it would be possible to branch from any fragment to any other fragment. DynamoRIO does not currently restrict the memory range used for the fragment cache so in general it is necessary to use a register/indirect branch when exiting from a fragment. There are opportunities for improvement in this area.
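The range limit follows from the encoding: B and BL carry a signed 26-bit word offset, which gives byte offsets from -2^27 to 2^27 - 4. A minimal sketch of the reachability check and of the encoding of B follows. The function names are hypothetical and this is not how DynamoRIO emits its exit stubs.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a B or BL at from_pc can target to_pc directly: the displacement
 * must be a multiple of 4 and fit in a signed 28-bit byte offset. */
bool
a64_b_reaches(uint64_t from_pc, uint64_t to_pc)
{
    int64_t disp = (int64_t)(to_pc - from_pc);
    int64_t limit = (int64_t)1 << 27; /* 128 MiB */
    return (disp & 3) == 0 && disp >= -limit && disp < limit;
}

/* Encodes B from from_pc to to_pc. The caller is expected to have checked
 * a64_b_reaches() first; out-of-range displacements are silently truncated. */
uint32_t
a64_encode_b(uint64_t from_pc, uint64_t to_pc)
{
    uint64_t disp = to_pc - from_pc;
    return 0x14000000u | ((uint32_t)(disp >> 2) & 0x03ffffffu);
}
```

When a64_b_reaches() returns false, the exit has to load the target into a register and use an indirect branch instead, as described above.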

Self-modifying code

The X86 architecture requires hardware to detect when the instruction cache has been invalidated by a write to memory, so DynamoRIO must detect when code that has already been rewritten into the fragment cache is subsequently modified, which is not trivial to implement efficiently.

The ARM architecture requires software to perform explicit synchronisation between writing instructions to memory and executing those instructions. In AArch32 this cannot be done in user mode, so 32-bit ARM Linux uses a system call (SYS_cacheflush), which DynamoRIO can easily intercept.

In AArch64 there are user-mode instructions for synchronising the instruction cache, so DynamoRIO must mangle these instructions so as to detect when a program may have legally modified itself.

The prescribed recipe for synchronising the instruction cache is implemented by clear_icache() in "dr_helper.c". DynamoRIO detects when an app has performed these operations by mangling the IC and ISB instructions. A program will typically invoke IC on a contiguous set of cache lines and then invoke ISB. DynamoRIO therefore mangles IC into a call to a procedure that updates the recorded set of cache lines, provided they are contiguous, without returning to the C runtime, because such a return would involve saving nearly all the registers (about 800 bytes). A return to the C runtime, with X0 set to linkstub_selfmod, occurs only when an ISB instruction is executed after one or more IC instructions have been executed.
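The bookkeeping described above can be illustrated with a small model in C. This is a hypothetical sketch, not DynamoRIO's implementation: the structure, the function names, and the choice to pass the cache line size as a parameter are all assumptions made for illustration (real code would obtain the line size from the hardware, and the real procedure is reached from mangled code rather than called from C).

```c
#include <stdbool.h>
#include <stdint.h>

/* Pending range of invalidated code, [begin, end). Empty if begin == end. */
typedef struct {
    uint64_t begin;
    uint64_t end;
} icache_range_t;

/* Models one IC on the line containing addr. line_size must be a power of
 * two. Returns true if the line was merged into the pending range, and
 * false if it is not contiguous with it, in which case a slower path
 * would be needed. */
bool
record_ic_line(icache_range_t *r, uint64_t addr, uint64_t line_size)
{
    uint64_t line = addr & ~(line_size - 1);
    if (r->begin == r->end) { /* First line since the last ISB. */
        r->begin = line;
        r->end = line + line_size;
        return true;
    }
    if (line >= r->begin && line < r->end)
        return true; /* Already covered. */
    if (line == r->end) { /* Extends the range upwards. */
        r->end += line_size;
        return true;
    }
    if (line + line_size == r->begin) { /* Extends the range downwards. */
        r->begin = line;
        return true;
    }
    return false;
}

/* Models ISB: if any IC has been recorded, reports the range and clears it.
 * Only a true result corresponds to the expensive return to the C runtime. */
bool
take_pending_range(icache_range_t *r, uint64_t *begin, uint64_t *end)
{
    if (r->begin == r->end)
        return false;
    *begin = r->begin;
    *end = r->end;
    r->begin = r->end = 0;
    return true;
}
```

In this model, a run of IC operations over adjacent lines costs only a few comparisons each, and an ISB with no preceding IC costs nothing beyond the emptiness check.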