Overview
Memory address traces (or "memtraces") are critical tools for analyzing application behavior and performance. DynamoRIO's Tracing and Analysis Framework provides tools for gathering and analyzing memtraces. Today, these memtraces are gathered either during a full execution of a standalone application, or during a short burst of execution on a server (using the start/stop interfaces described at Tracing a Subset of Execution). For long-running benchmarks with phased behavior, however, we would like to instead gather a series of smaller memtrace samples, to ease simulation while still representing the application in the aggregate.
Initial Use Case: SPEC2017
An initial use case driving this work is the gathering of memtraces from SPEC2017. These benchmarks execute tens of trillions of instructions each and exhibit phased behavior. Our goal is to gather uniformly sampled memtrace sequences across the whole execution. To support a full warmup for each sample, our target is 10 billion instructions per sample, and something like 50 samples per benchmark should suffice to capture the major phases. With these parameters, a 25-trillion-instruction benchmark would produce a series of 10-billion-instruction traces, each followed by 490 billion instructions of untraced execution.
Design Point: Separate Traces v. Merged-with-Markers
Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC benchmark, there are two main ways to store them. We could create 50 independent sets of trace files, each with its own metadata and separate set of sharded data files. A simulator could either simulate all 50 separately and aggregate just the resulting statistics, or a single instance of a simulator could skip ahead to each new sample and use the first portion of the sample to warm up caches and other state.
The alternative is to store all the data in one set of data files, with metadata markers inserted to indicate the division points between the samples. This doesn’t support the separate simulation model up front, though we could provide an iterator interface that skips ahead to a target window and stops at the end of that window (or the simulator could be modified to stop when it sees a window separation marker). However, this will not be as efficient for parallel simulation with separate simulator instances for each window, since the skipping ahead will take some time. This arrangement does more easily support the single-simulator-instance approach, and more readily fits with online simulation.
In terms of implementation, there are several factors to consider here.
Separate raw files
If we want separate final traces, the simplest approach at first glance is to produce a separate set of raw files for each tracing window. These would be post-processed separately and independently.
However, implementing this split does not fit well with the current interfaces. To work with other filesystems, we have separated out the i/o, and in particular directory creation; furthermore, some proprietary uses rework the output completely to write to a fixed set of destinations which are mapped back to per-thread raw files during post-processing. Creating new sets of outputs at tracing time would require new interfaces and significant refactoring for these use cases.
For typical use with files on the local disk, we could add creation of a new directory (and duplication of the module file) for each window by the tracing thread that hits the end-of-window trigger. The other threads would each create a new output raw file each time they transitioned to a new window (see also the Proposal A discussion below). This is implemented today with the -split_windows option.
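For the common local-disk case, gathering windowed traces then becomes a matter of command-line options. As a rough sketch (these drcachesim option names exist today, but the values here are scaled-down illustrations rather than a verified SPEC invocation):

    drrun -t drcachesim -offline -trace_after_instrs 100M \
        -trace_for_instrs 10M -retrace_every_instrs 490M \
        -split_windows -- ./myapp

Here -trace_for_instrs sets the traced window size, -retrace_every_instrs controls how often tracing restarts, and -split_windows requests per-window output.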
Splitting during raw2trace
Alternatively, we could keep a single raw file for each thread and split it up into per-window final trace files during postprocessing by the raw2trace tool. We would use markers inserted at the window transition points to identify where to separate.
raw2trace would need to create a new output directory and duplicate the trace headers and the module file for each window. As with separate raw files, this goes against the current i/o separation, where today we pass in a list of all the input and output files up front and raw2trace never opens a file on its own, in order to better support proprietary filesystems with the upstream code.
Another concern here is hitting file size limits with a single raw file across many sample traces. For the example above of 50 10-billion-instruction traces, if we assume an average of 2 dynamic instructions per 8-byte raw entry, each window might contain 40GB of data, reaching 2TB for all 50. Furthermore, the final trace is even larger.
The file size problem gets worse if we use a constant sampling interval across SPEC2017, since some SPEC2017 benchmarks execute many more instructions than others. The bwaves_s benchmark has 382 trillion instructions, so a constant interval might give it 50x more data than other benchmarks, exceeding the file size limit. A constant number of windows per benchmark is preferred for this reason.
Splitting after raw2trace using an analyzer
Given the complexities of splitting in earlier steps, given that we may want to use a single simulator instance to process all of the sample traces, and given that online trace analysis will likely also use a single simulator instance, perhaps we should not try to split the samples at all and should instead treat the 50 samples as a single trace with internal markers indicating the window divisions.
Online and offline analyzers can use window ID markers to fast-forward, or to pause and align each thread at the next window boundary. Perhaps the existing serial iterator could gain built-in support for aligning the windows.
If single-file final traces exist, we would need to update all our existing analyzers to handle the gaps in the traces: resetting state for function and callstack trackers, and keeping per-window numbers for statistics gatherers.
We can also create an analyzer that splits a final trace up if we do want separate traces.
Decision: Split the final trace with an analyzer
Separate files seem to be the most flexible and useful setup for our expected use cases, in particular parallel simulation. But given that separating early in the pipeline is complex, we'll split after raw2trace post-processing, with a new analyzer tool.
We’ll update some simple inspection and sanity tools (view, basic_counts, and invariant_checker) to handle and visualize windows, but we’ll assume that trace windows will be split before being analyzed by any more complex analysis tools. For online traces we will probably stick with multi-window-at-once.
We’ll create a tool to manually split up multi-window trace files.
Update: We ended up implementing splitting at the raw file output level for the local-disk use case; we may additionally implement splitting in an analyzer for other use cases.
Design Point: Continuous Control v. Re-Attach
One method of obtaining multiple traces is to repeat today’s bursts over and over, with a full detach from the application after each trace. However, each attach point is expensive, with long periods of profiling and code cache pre-population. While a scheme of sharing the profiling and perhaps code cache could be developed while keeping a full detach, a simpler approach is to remain in control but switch from tracing to instruction counting in between tracing windows. Instruction counting is needed to determine where to start the next window in any case.
Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If lower overhead is desired in the future, a scheme using a full detach along with hardware performance counters to count instructions could be investigated. The decision for the initial implementation, however, is to use the simpler alternating tracing and counting instrumentation windows.
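To illustrate the counting side, below is a minimal sketch of block-level instruction counting using the public drx counter API (assuming x86, where drx_insert_counter_update takes a single spill-slot argument; drmgr/drx initialization and event registration are omitted):

    static uint64 global_instr_count; /* racy update; pass DRX_COUNTER_LOCK for atomicity */

    static dr_emit_flags_t
    event_insertion(void *drcontext, void *tag, instrlist_t *bb, instr_t *where,
                    bool for_trace, bool translating, void *user_data)
    {
        /* Insert one counter update per block, at the top. */
        if (!drmgr_is_first_instr(drcontext, where))
            return DR_EMIT_DEFAULT;
        int num_instrs = 0;
        for (instr_t *ins = instrlist_first_app(bb); ins != NULL;
             ins = instr_get_next_app(ins))
            num_instrs++;
        /* SPILL_SLOT_MAX + 1 asks drx to use drreg-managed slots. */
        drx_insert_counter_update(drcontext, bb, where, SPILL_SLOT_MAX + 1,
                                  &global_instr_count, num_instrs,
                                  DRX_COUNTER_64BIT);
        return DR_EMIT_DEFAULT;
    }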
Design Point: Instrumentation Dispatch v. Flushing
As the prior section concluded, we plan to alternate between tracing and instruction counting. There are two main approaches to varying instrumentation during execution: inserting all cases up front with a dispatch to the desired current scheme, and replacing instrumentation by flushing the system’s software code cache when changing schemes.
Flushing is an expensive process, and it can be fragile, as the lower-overhead forms of flushing open up race conditions between threads executing the old and the new code cache contents. This complexity is one reason we decided to use a dispatch approach instead for our initial implementation.
With dispatch, we insert both tracing and counting instrumentation for each block in the software code cache. Dispatch code at the top of the block selects which scheme to use. The current mode, either tracing or counting, is stored in memory and needs to be synchronized across all threads.
The simplest method of synchronizing the instrumentation mode is to store it in a global variable, have the dispatch code use a load-acquire to read it, and modify it with a store-release. There is overhead to a load-acquire at the top of every block, but experimentation shows that it is reasonable compared to the overhead of the instrumentation itself even for instruction counting mode, and its simplicity makes it our choice for the initial implementation.
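In C11-atomics terms, the synchronization idiom is the following (a conceptual analogue only; the real check is emitted by drbbdup as inlined machine code at the top of each block):

    #include <stdatomic.h>

    typedef enum { MODE_COUNTING = 0, MODE_TRACING = 1 } trace_mode_t;

    static _Atomic trace_mode_t current_mode = MODE_COUNTING;

    /* Dispatch: each block's prologue performs a load-acquire of the mode. */
    static inline trace_mode_t
    read_mode(void)
    {
        return atomic_load_explicit(&current_mode, memory_order_acquire);
    }

    /* The window trigger flips the mode with a store-release. */
    static inline void
    set_mode(trace_mode_t mode)
    {
        atomic_store_explicit(&current_mode, mode, memory_order_release);
    }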
The mechanisms for creating the dispatch and the separate instrumentation copies for the modes are provided for us by the drbbdup library. This library was, however, missing some key pieces that we had to add.
AArch64 support for drbbdup
The drbbdup library had no AArch64 support initially. It needed updates to its generated code sequences, as well as handling of AArch64's weak memory model. To support storing the mode variable in global memory despite the short reach of AArch64 memory addressing modes, we added explicit support for using a scratch register to hold the address. For the weak memory model, explicit support for using a load-acquire to read the value was added.
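A rough sketch of the resulting sequence using DR's IR follows (insert_load_mode_aarch64 is a hypothetical helper for illustration, not drbbdup's actual code):

    static void
    insert_load_mode_aarch64(void *drcontext, instrlist_t *ilist, instr_t *where,
                             void *mode_addr, reg_id_t scratch)
    {
        /* The mode variable can live anywhere in memory, beyond the reach of
         * AArch64 addressing modes, so materialize its address first. */
        instrlist_insert_mov_immed_ptrsz(drcontext, (ptr_int_t)mode_addr,
                                         opnd_create_reg(scratch), ilist, where,
                                         NULL, NULL);
        /* LDAR provides the load-acquire required by the weak memory model. */
        instrlist_meta_preinsert(
            ilist, where,
            INSTR_CREATE_ldar(drcontext, opnd_create_reg(scratch),
                              OPND_CREATE_MEMPTR(scratch, 0)));
    }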
Function wrapping support for drbbdup
The function wrapping library used when tracing does not work with drbbdup out of the box, due to their different control models. We added an inverted control mode to the wrapping library to allow enabling wrapping only for the tracing mode and not for the counting mode.
However, there is a complication: a transition to counting mode while in the middle of a wrapped function can cause problems if the cleanup on function exit is skipped. We ended up adding support for cleanup-only execution of the function wrapping actions, which is invoked while counting, with the full wrapping actions enabled while tracing.
A final minor change, to support DRWRAP_REPLACE_RETADDR, was to use the block's tag rather than its first instruction's application PC, since that wrapping scheme creates some blocks containing nothing but synthetic instructions with no corresponding application PC.
Write-xor-execute support for drbbdup
The drbbdup library uses its own generated code area for out-of-line callouts, but only for dynamically discovered instrumentation cases. This does not play well with write-xor-execute security features, where we need to separate writable from executable code: the special mechanism we have for DynamoRIO's own code cache does not currently extend to tools using their own caches. To solve this for our use case, which has only static cases, we simply disabled the drbbdup code cache behind a new flag that disables dynamic case support.
Emulation support for drbbdup
The memtrace code uses emulation sequence markers to instrument the proper application instructions in the face of various expansions, such as those for repeated string loops and scatter-gather instructions. However, the drbbdup library presents its own interface layer which hid these markers, and in fact resulted in memtrace missing some application instructions; drbbdup needed updates to pass the emulation markers through to its users.
Consider partial detach with PMU instruction counting for non-tracing windows?
We could reduce the overhead of the non-tracing windows, where we're only counting instructions, by using the PMU to count for us. We would do something like a detach, but keep our code cache (assuming no code changes by the application), and re-attach when the PMU count reaches the target.
Handling Phase Transitions
For a normal memtrace burst, we completely detach from the server at the end of our desired trace duration. This detach process synchronizes with every application thread.
For multi-window traces, we are instead using multi-case dispatched instrumentation, changing the instrumentation type for each window. There is no detach step to wake up all of the threads and have them flush their trace buffers, and we are deliberately trying to avoid a global synchronization point. Yet we would prefer clean transitions between windows, whether that means separate raw files or accurately placed markers.
Key step: Add end-of-block phase change check
We do flush the trace buffer prior to each syscall, so a thread at a kernel wait point should have an empty buffer and not be a concern.
The main concern is a thread not in a wait state that happens to not be scheduled consistently for a long time and so does not fill up its buffer until well after the window ends.
We can augment the current end-of-block flush check, which today only looks for the buffer being full, with a check for the prior window having ended: we keep a global window ordinal and store its value per thread at the start of filling up a new buffer. (This is better than simply checking the drbbdup mode value for being in non-tracing mode, as that would not catch a double mode change.) If the prior window has ended, we can flush the buffer, or simply add a marker, depending on the scheme (see below).
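A sketch of the augmented check (all names here are illustrative, not the actual drmemtrace identifiers):

    static volatile uint64 global_window; /* incremented by the window trigger */

    typedef struct {
        byte *buf_base, *buf_ptr, *buf_end;
        uint64 window_at_buf_start; /* global_window sampled at buffer start */
    } per_thread_t;

    static void
    end_of_block_check(per_thread_t *pt)
    {
        /* Simplified: the real code would use an acquire load of the ordinal. */
        if (pt->buf_ptr >= pt->buf_end ||            /* existing: buffer full */
            pt->window_at_buf_start != global_window /* new: window ended */) {
            flush_buffer(pt); /* hypothetical: write out and reset buf_ptr */
            pt->window_at_buf_start = global_window;
        }
    }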
A thread that receives a signal mid-block (it would have to be a synchronous signal, as DR waits until the end of the block for asynchronous ones) will skip its end-of-block checks and be redirected to run the app's signal handler, but it will hit the checks at the first block of the handler.
The worst-case inaccuracy here is a thread that starts writing in window N but ends up unscheduled until a much later window M. Even then, at most one basic block's worth of trace entries will be placed into window N despite happening later. Thus we have "basic block accuracy", which is pretty good, as a basic block typically contains only a few instructions.
Proposal A: Separate raw files split at flush time
If we're splitting raw files (see above), we would use the end-of-block window-change flush to emit a thread exit and create a new file. In post-processing, we'd add missing thread exits for threads that don't have them, covering waiting threads that never reached a flush.
As discussed above, the trigger thread would create a new directory for each window. A just-finished buffer is written to the directory corresponding to the window in which it was started.
A thread that is unscheduled for a long time could have a nearly full buffer that is not written out until many windows later, but it would still be written to the directory for the old window. The next buffer would go to a new file in the new window, with no files in the intervening window directories.
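For the local-disk case, the per-window output creation might look roughly like this (hypothetical variable names; Linux path separators assumed; error handling omitted):

    /* Trigger thread, once per window: */
    char dir[MAXIMUM_PATH];
    dr_snprintf(dir, sizeof(dir), "%s/window.%04llu", logdir,
                (unsigned long long)window_ordinal);
    dr_create_dir(dir);
    /* ...plus duplication of the module file into the new directory. */

    /* Each thread, on its first flush in the new window: */
    char path[MAXIMUM_PATH];
    dr_snprintf(path, sizeof(path), "%s/%s", dir, raw_name_for_thread);
    file_t f = dr_open_file(path, DR_FILE_WRITE_REQUIRE_NEW);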
Proposal B (winner): Label buffers with owning window
If we add the window ordinal to every buffer header, we can identify which window each buffer belongs to, avoiding the need for separate raw files. A window-end flush ensures that a buffer belongs solely to the window identified in its header; the next buffer will carry the new window value.
This scheme can be used with file splitting during raw2trace, or the splitting can wait until analysis time. Each thread has one raw file which contains all windows of the execution.
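Conceptually, the buffer header gains one field (an illustrative layout, not the actual drmemtrace raw format):

    typedef struct {
        uint64 thread_id;
        uint64 window; /* global window ordinal when this buffer was started */
        /* ...existing header fields... */
    } buffer_header_t;

    /* At buffer start: header->window = global_window.
     * raw2trace (or a later analyzer) then groups buffers by header->window. */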
Proposal C: Trigger thread identifies buffer transition point of the other threads
For this proposal, the thread triggering the end of the window walks the other threads' buffers and identifies the phase transition point inside each, telling the post-processor where to split them.
I considered having the triggerer also flush the buffers, but that is challenging due to races with the owner flushing its own buffer. Plus, it still requires post-processing help to identify the precise point for splitting the buffer (without synchronization, the triggerer can only get close).
To avoid barriers on common-case trace buffer writes, we use a lazy scheme where the triggerer does not modify the trace buffers themselves, but instead records which portion it has written out using a separate variable that is never accessed on the fast path.
Implementation:
- The tracer maintains a global list of thread buffers, protected by a global mutex acquired on thread init and exit.
- Each trace buffer has a corresponding set of externally_written variables, each holding a distance into the buffer that was written out by another thread.
- On hitting the trace window endpoint threshold, the triggering thread grabs the mutex and walks the buffers.
- The triggerer doesn't have the current buffer position pointer. Instead it walks the buffer until it reaches zeroed memory (we zero the buffer after each flush). We have no synchronization with the owning thread, but observing writes out of order should be fine, since we would just miss one by stopping too early. We need to fix things up in post-processing in any case, because we need the phase transition to be at a clean point, and the triggerer cannot identify that point on its own: if it stops at an instruction entry, it does not know whether memrefs follow. In post-processing we adjust that position to the end of the block, and we split the buffer contents around that point into the neighboring traces.
- The triggerer does a store-release of the furthest-written point into the externally_written variable (see the sketch after this list).
- Whenever a thread writes out its buffer, it does a load-acquire on the externally_written variable, and if it is non-zero it writes a marker into the buffer header. Post-processing reads the marker and uses it to split the buffer at the nearest block boundary after the marker value.
- If windows are small enough that the triggerer doesn't complete its buffer walk before a new window starts, other threads' buffers may go entirely into the new window. That seems acceptable: if the windows are that small, then in the absence of application synchronization the resulting window split is still a possible thread ordering.
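A sketch of the lazy handoff (illustrative names; walk_until_zeroed and emit_externally_written_marker are hypothetical helpers, and C11 atomics stand in for the actual load-acquire/store-release code):

    typedef struct {
        byte *buf_base, *buf_end;
        _Atomic size_t externally_written; /* never touched on the fast path */
    } buffer_info_t;

    /* Triggerer, holding the thread-list mutex: walk until zeroed memory,
     * write that much out, then publish the extent with a store-release. */
    size_t written = walk_until_zeroed(info->buf_base, info->buf_end);
    write_out(info->buf_base, written);
    atomic_store_explicit(&info->externally_written, written,
                          memory_order_release);

    /* Owner, at its next flush: load-acquire the extent; if non-zero, record
     * a header marker so post-processing can split at the nearest block
     * boundary after it. */
    size_t ext = atomic_load_explicit(&info->externally_written,
                                      memory_order_acquire);
    if (ext != 0)
        emit_externally_written_marker(header, ext);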
This scheme still ends up with only block-level accuracy, since the trigger thread's marked transition point must be adjusted to a block boundary in post-processing anyway. Thus it does not seem any better than the other schemes, and it is more complex.
Decision: Proposal B
We decided to go with Proposal B as it works best with our file separation decision and is simple to implement.
Online Traces
For offline analysis, it makes sense to treat each window trace as separate and to simulate the windows independently (while aggregating the results to cover the whole application).
For online analysis, however, we have the same instance of the simulator or analysis tool throughout the whole application run. It would get confused if it threw away its thread bookkeeping on a thread exit at a window boundary.
Either we have a window-controller simulator that spins up and tears down a new instance of the real target simulator/tool for each window, or we introduce new "end phase/start phase" markers. If we have split offline traces, though, those markers would exist only for online use, which does not sound appealing. Simulators/tools would need special handling for them: reporting statistics for the phase while continuing to aggregate for a multi-phase report, or something similar.
We might want combined files for offline too, as discussed above. That would unify the two, which is appealing.