Stalker
Introduction
Stalker is Frida’s code tracing engine. It allows threads to be followed, capturing every function, every block, even every instruction which is executed. A very good overview of the Stalker engine is provided here and we recommend that you read it carefully first. Obviously, the implementation is somewhat architecture-specific, although there is much in common between the architectures. Stalker currently supports the AArch64 architecture commonly found on mobile phones and tablets running Android or iOS, as well as the Intel 64 and IA-32 architectures commonly found on desktops and laptops. This page intends to take things to the next level of detail: it dissects the ARM64 implementation of Stalker and explains in more detail exactly how it works. It is hoped that this may help future efforts to port Stalker to other hardware architectures.
Disclaimer
Whilst this article will cover a lot of the details of the inner workings of Stalker, it won’t cover back-patching in real detail. It is intended as a starting point to help others understand the technology, and Stalker is fiendishly complicated enough without this! To be fair though, this complexity isn’t there without reason: it is there to minimize the overhead of what is an inherently expensive operation. Lastly, while this article will cover the key concepts of the implementation and will extract some critical parts of the implementation for a line-by-line analysis, there will be some final details of the implementation left for the reader to discover by reading the source code. However, it is hoped it will prove to be a very useful head-start.
Table of contents
- Introduction
- Disclaimer
- Use Cases
- Following
- Basic Operation
- Options
- Terminology
- Slabs
- Blocks
- Instrumenting Blocks
- Helpers
- Context
- Context Helpers
- Reading/Writing Context
- Control flow
- Gates
- Virtualize functions
- Emitting events
- Unfollow and tidy up
- Miscellaneous
Use Cases
To start to understand the implementation of Stalker, we must first understand in detail what it offers to the user. Whilst Stalker can be invoked directly through its native Gum interface, most users will instead call it via the JavaScript API which will call these Gum methods on their behalf. The TypeScript type definitions for Gum are well commented and provide a little more detail still.
The main API to Stalker from JavaScript is:
- Stalker.follow([threadId]): start stalking threadId (or the current thread if omitted)
Let’s consider when these calls may be used. Stalking where you provide a thread
ID is likely to be used where you have a thread of interest and are wondering
what it is doing. Perhaps it has an interesting name? Thread names can be found
using cat /proc/PID/task/TID/comm
. Or perhaps you walked the threads in your
process using the Frida JavaScript API Process.enumerateThreads()
and then
used a NativeFunction to call:
Using this along with Thread.backtrace() to dump thread stacks can give you a really good overview of what a process is doing.
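On Linux, the same thread-name lookup can also be done natively by reading procfs directly. A minimal sketch, assuming a Linux procfs is mounted (the helper name read_thread_name is illustrative, not a Frida API; /proc/self/comm is a convenient shortcut for the current thread):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read a thread's name from procfs. The name of thread `tid` in process
 * `pid` lives at /proc/<pid>/task/<tid>/comm; for the calling thread,
 * /proc/self/comm works too. Returns 0 on success, -1 on failure. */
static int read_thread_name (const char *path, char *buf, size_t len)
{
  FILE *f = fopen (path, "r");
  if (f == NULL)
    return -1;
  if (fgets (buf, (int) len, f) == NULL)
  {
    fclose (f);
    return -1;
  }
  fclose (f);
  buf[strcspn (buf, "\n")] = '\0'; /* strip the trailing newline */
  return 0;
}
```

This is essentially what you are doing from JavaScript when you shell out to cat or call pthread_getname_np() via a NativeFunction.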
The other scenario where you might call Stalker.follow()
is perhaps from a
function which has been
intercepted or replaced. In
this scenario, you have found a function of interest and you want to understand
how it behaves, you want to see which functions or perhaps even code blocks the
thread takes after a given function is called. Perhaps you want to compare the
direction the code takes with different input, or perhaps you want to modify the
input to see if you can get the code to take a particular path.
In either of these scenarios, although Stalker has to work slightly differently
under the hood, it is all managed by the same simple API for the user,
Stalker.follow()
.
Following
When the user calls Stalker.follow()
, under the hood, the JavaScript engine
calls through to either gum_stalker_follow_me()
to follow the current thread,
or gum_stalker_follow(thread_id)
to follow another thread in the process.
gum_stalker_follow_me
In the case of gum_stalker_follow_me()
, the Link Register is used to determine
the instruction at which to start stalking. In AArch64 architecture, the Link
Register (LR) is set to the address of the instruction to continue execution
following the return from a function call, it is set to the address of the next
instruction by instructions such as BL and BLR. As there is only one link
register, if the called function is to call another routine, then the value of
LR must be stored (typically this will be on the stack). This value will
subsequently be loaded back from the stack into a register and the RET
instruction used to return control back to the caller.
Let’s look at the code for gum_stalker_follow_me()
. This is the function
prototype:
So we can see the function is called by the QuickJS or V8 runtime passing 3 arguments. The first is the Stalker instance itself. Note that there may be multiple of these if multiple scripts are loaded at once. The second is a transformer, this can be used to transform the instrumented code as it is being written (more on this later). The last parameter is the event sink, this is where the generated events are passed as the Stalker engine runs.
We can see that the first instruction STP stores a pair of registers onto the
stack. Notice the expression [sp, -16]!. This is a pre-decrement, which means that the stack pointer is first decremented by 16 bytes, and then the two 8-byte register values are stored. We can see the corresponding instruction
ldp x29, x30, [sp], 16
at the bottom of the function. This is restoring these
two register values from the stack back into the registers. But what are these
two registers?
Well, X30
is the Link Register and X29
is the Frame Pointer register. Recall
that we must store the link register to the stack if we wish to call another
function as this will cause it to be overwritten and we need this value in order
that we can return to our caller.
The frame pointer points to the top of the stack at the point the function was called, so that all the stack-passed arguments and stack-based local variables can be accessed at a fixed offset from it. Again, we need to save and restore this register, as each function has its own value for it: we must store the value which our caller put in there and restore it before we return. Indeed, you can see in the next instruction
mov x29, sp
that we set the frame pointer to the current stack pointer.
We can see the next instruction mov x3, x30
, puts the value of the link
register into X3. The first 8 arguments on AArch64 are passed in the registers
X0-X7. So this is being put into the register used for the fourth argument. We
then call (branch with link) the function _gum_stalker_do_follow_me()
. So we
can see that we pass the first three arguments in X0-X2 untouched, so that
_gum_stalker_do_follow_me()
receives the same values we were called with.
Finally, we can see after this function returns, we branch to the address we
receive as its return value. (In AArch64 the return value of a function is
returned in X0).
gum_stalker_follow
This routine has a very similar prototype to gum_stalker_follow_me()
, but has
the additional thread_id
parameter. Indeed, if asked to follow the current
thread, then it will call that function. Let’s look at the case when another
thread ID is specified though.
We can see that this calls the function gum_process_modify_thread()
. This
isn’t part of Stalker, but part of Gum itself. This function takes a callback (along with a context parameter for it) which is invoked with the thread’s context structure. This
callback can then modify the GumCpuContext
structure and
gum_process_modify_thread()
will then write the changes back. We can see the
context structure below, as you can see it contains fields for all of the
registers in the AArch64 CPU. We can also see below the function prototype of
our callback.
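The overall shape of this read–modify–write pattern can be sketched as follows. This is a simplified stand-in, not the real Gum implementation: the context structure here holds only two registers, and reading/writing the real thread state (which Gum does via ptrace on Linux, for example) is simulated with a plain variable:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for GumCpuContext: just PC and LR. */
typedef struct { uint64_t pc; uint64_t lr; } CpuContext;

typedef void (*ModifyThreadFunc) (CpuContext *ctx, void *user_data);

/* Simulated "remote thread state" that would really be fetched and
 * written back through an OS-specific debugging API. */
static CpuContext fake_thread_state = { 0x1000, 0x2000 };

/* Read the thread's registers, let the callback mutate them, then
 * write the (possibly modified) registers back. */
static void process_modify_thread (ModifyThreadFunc func, void *user_data)
{
  CpuContext ctx = fake_thread_state; /* "read the registers" */
  func (&ctx, user_data);             /* callback may modify them */
  fake_thread_state = ctx;            /* "write the changes back" */
}

/* Example callback: redirect the program counter, e.g. to point the
 * thread at freshly written instrumented code. */
static void redirect_pc (CpuContext *ctx, void *user_data)
{
  ctx->pc = *(uint64_t *) user_data;
}
```

This is the mechanism gum_stalker_infect() relies upon: its callback rewrites the thread’s program counter so that, once the registers are written back, the thread resumes inside code that Stalker controls.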
So, how does gum_process_modify_thread()
work? Well it depends on the
platform. On Linux (and Android) it uses the ptrace
API (the same one used by
GDB) to attach to the thread and read and write registers. But there are a host
of complexities. On Linux, you cannot ptrace your own process (or indeed any in
the same process group), so Frida creates a clone of the current process in its
own process group and shares the same memory space. It communicates with it
using a UNIX socket. This cloned process acts as a debugger, reading the
registers of the original target process and storing them in the shared memory
space and then writing them back to the process on demand. Oh and then there is
PR_SET_DUMPABLE
and PR_SET_PTRACER
which control the permissions of who is
allowed to ptrace our original process.
Now you will see that the functionality of gum_stalker_infect()
is actually
quite similar to that of _gum_stalker_do_follow_me()
we mentioned earlier.
Both functions carry out essentially the same job; however, while _gum_stalker_do_follow_me() runs on the target thread, gum_stalker_infect() does not, so it must write some code to be called by the
target thread using the
GumArm64Writer
rather than calling functions directly.
We will cover these functions in more detail shortly, but first we need a little more background.
Basic Operation
Code can be thought of as a series of blocks of instructions (also known as basic blocks). Each block starts with an optional series of instructions which run in sequence (optional because two branch instructions may appear back to back, leaving a block with no straight-line body) and ends when we encounter an instruction which causes (or can cause) execution to continue at an instruction other than the one immediately following it in memory.
Stalker works on one block at a time. It starts with either the block after the
return to the call to gum_stalker_follow_me()
or the block of code to which
the instruction pointer of the target thread is pointing when
gum_stalker_follow()
is called.
Stalker works by allocating some memory and writing to it a new instrumented copy of the original block. Instructions may be added to generate events, or carry out any of the other features the Stalker engine offers. Stalker must also relocate instructions as necessary. Consider the following instruction:
ADR Address of label at a PC-relative offset.
ADR Xd, label
Xd Is the 64-bit name of the general-purpose destination register, in the range 0 to 31.
label Is the program label whose address is to be calculated. It is an offset from the address of this instruction, in the range ±1MB.
If this instruction is copied to a different location in memory and executed, then because the address of the label is calculated by adding an offset to the current instruction pointer, the computed value would be different. Fortunately, Gum has a Relocator for just this purpose, which is capable of modifying the instruction given its new location so that the correct address is calculated.
Now, recall we said that Stalker works one block at a time. How, then, do we instrument the next block? Remember that each block ends with a branch instruction. If we modify this branch to instead branch back into the Stalker engine, while ensuring we store the destination where the branch was intending to end up, we can instrument the next block and redirect execution there instead. This same simple process can continue with one block after the next.
Now, this process can be a little slow, so there are a few optimizations which we can apply. First of all, if we execute the same block of code more than once (e.g. a loop, or maybe just a function called multiple times) we don’t have to re-instrument it all over again. We can just re-execute the same instrumented code. For this reason, a hashtable is kept of all of the blocks which we have encountered before and where we put the instrumented copy of the block.
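This caching step can be sketched like so. It is a toy model, not Stalker’s actual code: the real cache lives in the GumExecCtx and uses a proper hashtable, whereas this sketch uses a small fixed-size open-addressed table, and "instrumenting" a block is faked by computing a displaced address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIZE 64

/* Maps the address of an original block to its instrumented copy. */
typedef struct { uint64_t real_address; uint64_t code_address; } CacheEntry;

static CacheEntry cache[CACHE_SIZE];
static int instrument_count = 0;

/* Stand-in for the expensive step: writing an instrumented copy. */
static uint64_t instrument_block (uint64_t real_address)
{
  instrument_count++;
  return real_address + 0x100000; /* pretend the copy lives at this offset */
}

/* Return the instrumented copy for a block, instrumenting it only on
 * first encounter; later encounters hit the cache. */
static uint64_t obtain_block_for (uint64_t real_address)
{
  size_t i = (size_t) (real_address >> 2) % CACHE_SIZE;
  while (cache[i].real_address != 0 && cache[i].real_address != real_address)
    i = (i + 1) % CACHE_SIZE; /* linear probe on collision */
  if (cache[i].real_address == 0)
  {
    cache[i].real_address = real_address;
    cache[i].code_address = instrument_block (real_address);
  }
  return cache[i].code_address;
}
```

The key property is that re-executing a block (a loop body, a hot function) costs only a lookup, not a re-instrumentation.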
Secondly, when a call instruction is encountered, after emitting the
instrumented call, we then emit a landing pad which we can return to without
having to re-enter Stalker. Stalker builds a side-stack, using GumExecFrame
structures which record the true return address (real_address
) and this
landing pad (code_address
). When a function returns, we emit code that will
check the return address in the side-stack against the real_address
and if it
matches, it can simply return to the code_address
without re-entering the
runtime. This landing pad initially will contain code which enters the Stalker
engine to instrument the next block, but it can later be backpatched to branch
directly to this block. This means that the entire return sequence can be
handled without the expense of entering and leaving Stalker.
If the return address doesn’t match the real_address stored in the
GumExecFrame
, or we run out of space in the side-stack, we simply start
building a new one again from scratch. We need to preserve the value of LR
whilst the application code is executing so that the application cannot use this
to detect the presence of Stalker (anti-debugging) or in case it is using it for
any other purpose besides simply returning (e.g. to reference inline data in the
code section). Also, we want Stalker to be able to unfollow at any time, so we
don’t want to be having to go back up our stack correcting LR values which we
have modified along the way.
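The side-stack bookkeeping described above can be sketched as follows. The two field names match GumExecFrame, but the push/return logic is a simplification of what Stalker actually emits inline into instrumented blocks (here, returning 0 stands in for falling back into the engine):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two fields described above. */
typedef struct {
  uint64_t real_address; /* where the application expects to return */
  uint64_t code_address; /* the landing pad in the instrumented copy */
} ExecFrame;

#define STACK_DEPTH 16

static ExecFrame frames[STACK_DEPTH];
static int depth = 0;

/* On a call: record the pair, or start over if we run out of space. */
static void push_frame (uint64_t real_address, uint64_t code_address)
{
  if (depth == STACK_DEPTH)
    depth = 0; /* out of space: rebuild the side-stack from scratch */
  frames[depth].real_address = real_address;
  frames[depth].code_address = code_address;
  depth++;
}

/* On a return: if LR matches the recorded real_address, jump straight
 * to the landing pad; otherwise (signalled here by returning 0) discard
 * the side-stack and re-enter the engine to resolve the target. */
static uint64_t pop_frame (uint64_t lr)
{
  if (depth > 0 && frames[depth - 1].real_address == lr)
    return frames[--depth].code_address;
  depth = 0; /* mismatch: start building a new side-stack */
  return 0;
}
```

Note that LR itself is left untouched in the application’s view; the side-stack is what lets the fast return path work without clobbering it.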
Finally, whilst we always replace branches with calls back to Stalker to
instrument the next block, depending on the configuration of
Stalker.trustThreshold
, we may backpatch such instrumented code to replace
the call with a direct branch to the next instrumented block instead.
Deterministic branches (e.g. the destination is fixed and the branch is not
conditional) are simple, we can just replace the branch to Stalker with one to
the next block. But we can deal with conditional branches too, if we instrument
both blocks of code (the one if the branch is taken and the one if it isn’t).
Then we can replace the original conditional branch with a conditional branch which directs control flow to the instrumented version of the block encountered when the branch was taken, followed by an unconditional branch to the other
instrumented block. We can also deal partially with branches where the target is
not static. Say our branch is something like:
This sort of instruction is common when calling a function pointer, or class method. Whilst the value of X0 can change, quite often it will actually always be the same. In this case, we can replace the final branch instruction with code which compares the value of X0 against our known function, and if it matches branches to the address of the instrumented copy of the code. This can then be followed by an unconditional branch back to the Stalker engine if it doesn’t match. So if the value of the function pointer say is changed, then the code will still work and we will re-enter Stalker and instrument wherever we end up. However, if as we expect it remains unchanged then we can bypass the Stalker engine altogether and go straight to the instrumented function.
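This backpatch behaves like a one-entry inline cache. A sketch of the logic (in C rather than the AArch64 sequence Stalker actually emits; enter_engine and the +0x100000 displacement are purely illustrative, and 0 is used as the "empty" cache value):

```c
#include <assert.h>
#include <stdint.h>

static int engine_entries = 0;

/* Stand-in for re-entering Stalker to instrument whatever the branch
 * target register points at. */
static uint64_t enter_engine (uint64_t target)
{
  engine_entries++;
  return target + 0x100000; /* pretend the instrumented copy lives here */
}

/* One-entry inline cache mimicking the backpatched indirect branch:
 * "if x0 == cached target, branch to its instrumented copy,
 *  else fall back into the engine and update the cache". */
static uint64_t cached_target = 0; /* 0 means "empty" in this sketch */
static uint64_t cached_code = 0;

static uint64_t dispatch (uint64_t x0)
{
  if (x0 == cached_target)
    return cached_code;            /* fast path: bypass the engine */
  cached_code = enter_engine (x0); /* slow path: instrument and cache */
  cached_target = x0;
  return cached_code;
}
```

If the function pointer keeps its usual value, every dispatch after the first takes the fast path; if it changes, correctness is preserved because the comparison fails and we drop back into the engine.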
Options
Now let’s look at the options when we follow a thread with Stalker. Stalker
generates events as a followed thread executes; these are placed onto a queue and flushed either periodically or manually by the user. Events are delivered in batches via the EventSink::process vfunc, as re-entering the JavaScript runtime to process events one at a time would be prohibitively expensive. The size and time period can be configured by the options. Events can
be generated on a per-instruction basis either for calls, returns or all
instructions. Or they can be generated on a block basis, either when a block is
executed, or when it is instrumented by the Stalker engine.
We can also provide one of two callbacks onReceive
or onCallSummary
. The
former will quite simply deliver a binary blob containing the raw events
generated by Stalker, in the order in which they were generated. (Stalker.parse() can be used to turn this into a JS array of tuples representing the events.) The second aggregates these results, simply returning a count of
times each function was called. This is more efficient than onReceive
, but the
data is much less granular.
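The aggregation that onCallSummary performs can be sketched like this (a simplified model: a real call event also carries the call site and depth, and the real summary is keyed by target address in a hashtable rather than this small linear table):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A raw call event, reduced to just the call target. */
typedef struct { uint64_t target; } CallEvent;

#define MAX_TARGETS 16

static uint64_t targets[MAX_TARGETS];
static unsigned counts[MAX_TARGETS];
static size_t num_targets = 0;

/* Collapse a batch of raw events into per-target call counts: the kind
 * of summary onCallSummary delivers instead of onReceive's raw blob. */
static void summarize (const CallEvent *events, size_t n)
{
  for (size_t i = 0; i != n; i++)
  {
    size_t j;
    for (j = 0; j != num_targets && targets[j] != events[i].target; j++)
      ;
    if (j == num_targets)
      targets[num_targets++] = events[i].target;
    counts[j]++;
  }
}

static unsigned count_for (uint64_t target)
{
  for (size_t j = 0; j != num_targets; j++)
    if (targets[j] == target)
      return counts[j];
  return 0;
}
```

The trade-off is exactly as described: the summary is cheap to deliver and consume, but the ordering and per-event detail of the raw stream is lost.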
Terminology
Before we can carry on with describing the detailed implementation of Stalker, we first need to understand some key terminology and concepts that are used in the design.
Probes
Whilst a thread is running outside of Stalker, you may be familiar with using
Interceptor.attach()
to get a callback when a given function is called. When a
thread is running in Stalker, however, these interceptors may not work. These
interceptors work by patching the first few instructions (prologue) of the
target function to re-direct execution into Frida. Frida copies and relocates
these first few instructions somewhere else so that after the onEnter
callback
has been completed, it can re-direct control flow back to the original function.
The reason these may not work within Stalker is simple: the original function is never called. Each block, before it is executed, is instrumented elsewhere in memory, and it is this copy which is executed. Stalker supports the API function
Stalker.addCallProbe(address, callback[, data])
to provide this functionality
instead. If our Interceptor
has been attached before the block is
instrumented, or Stalker’s trustThreshold
is configured so that our block will
be re-instrumented then our Interceptor
will work (as the patched instructions
will be copied across to the new instrumented block). Otherwise it won’t. Of
course, we want to be able to support hooking functions when these conditions
aren’t met. The average user of the API might not be familiar with this level
of detail of the design and so call probes solve this problem.
The optional data parameter is passed when the probe callback is
registered and will be passed to the callback routine when executed. This
pointer, therefore, needs to be stored in the Stalker engine. The address also needs to be stored, so that when an instruction which calls the function is encountered, the code can be instrumented to call the probe callback first. As
multiple functions may call the one to which you add the probe, many
instrumented blocks may contain additional instructions to call the probe
function. Thus whenever a probe is added or removed, the cached instrumented
blocks are all destroyed and so all code has to be re-instrumented. Note that
this data parameter is only used if the callback
is a C callback – e.g.
implemented using CModule
– as when JavaScript is used, it is simpler to use
a closure to capture any required state.
Trust Threshold
Recall that one of the simple optimizations we apply is that if we attempt to execute a block more than once, on subsequent occasions, we can simply call the instrumented block we created last time around? Well, that only works if the code we are instrumenting hasn’t changed. In the case of self-modifying code (which is quite often used as an anti-debugging/anti-disassembly technique to attempt to frustrate analysis of security critical code) the code may change, and hence the instrumented block cannot be re-used. So, how do we detect if a block has changed? We simply keep a copy of the original code in the data-structure along with the instrumented version. Then when we encounter a block again, we can compare the code we are going to instrument with the version we instrumented last time and if they match, we can re-use the block. But performing the comparison every time a block runs may slow things down. So again, this is an area where Stalker can be customized.
Stalker.trustThreshold
: an integer specifying how many times a piece of code needs to be executed before it is assumed it can be trusted to not mutate. Specify -1 for no trust (slow), 0 to trust code from the get-go, and N to trust code after it has been executed N times. Defaults to 1.
In actual fact, the value of N is the number of times the block needs to be re-executed and match the previously instrumented block (i.e. be unchanged) before we stop performing the comparison. Note that the original copy of the
code block is still stored even when the trust threshold is set to -1
or 0
.
Whilst it is not actually needed for these values, it has been retained to keep
things simple. In any case, neither of these is the default setting.
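The decision logic can be sketched as follows. This is a simplified model of the semantics described above, not Stalker’s code: the block holds a fixed 16-byte snapshot here, and the field name recycle_count is the one Stalker uses for this counter:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified view of a cached block: a snapshot of the original bytes
 * taken at instrumentation time, plus a reuse counter. */
typedef struct {
  uint8_t original[16]; /* copy of the code as last instrumented */
  int recycle_count;    /* times re-encountered unchanged */
} Block;

/* Decide whether a previously instrumented block may be reused,
 * following the trustThreshold semantics:
 *   -1: never trust, always re-instrument
 *    0: trust immediately
 *    N: trust after N unchanged re-encounters. */
static int can_reuse (Block *block, const uint8_t *current_code,
    int trust_threshold)
{
  if (trust_threshold < 0)
    return 0; /* never trusted: caller must re-instrument */
  if (block->recycle_count >= trust_threshold)
    return 1; /* trusted: skip the comparison entirely */
  if (memcmp (block->original, current_code,
      sizeof (block->original)) != 0)
    return 0; /* code changed: must re-instrument */
  block->recycle_count++;
  return 1; /* unchanged: reuse, one step closer to being trusted */
}
```

Note the consequence visible in the sketch: once a block is trusted, even self-modified code would go undetected, which is exactly why -1 exists as an option.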
Excluded Ranges
Stalker also has the API Stalker.exclude(range)
that’s passed a base and limit
used to prevent Stalker from instrumenting code within these regions. Consider,
for example, your thread calls malloc()
inside libc
. You most likely don’t
care about the inner workings of the heap and this is not only going to slow
down performance, but also generate a whole lot of extraneous events you don’t
care about. One thing to consider, however, is that as soon as a call is made to
an excluded range, stalking of that thread is stopped until it returns. That
means, if that thread were to call a function which is not inside a restricted
range, a callback perhaps, then this would not be captured by Stalker. Just as
this can be used to stop the stalking of a whole library, it can be used to stop
stalking a given function (and its callees) too. This can be particularly useful
if your target application is statically linked. Here, we cannot simply ignore
all calls to libc
, but we can find the symbol for malloc()
using
Module.enumerateSymbols()
and ignore that single function.
Freeze/Thaw
As an extension to DEP, some systems prevent pages from being marked writable and executable at the same time. Thus Frida must toggle the page permissions between writable and executable to write instrumented code, and allow that code to execute respectively. When pages are executable, they are said to be frozen (as they cannot be changed) and when they are made writeable again, they are considered thawed.
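A POSIX sketch of this toggling, assuming mmap() and mprotect() are available (the names alloc_code_page, freeze and thaw are illustrative, not the actual Gum functions; on systems which permit RWX mappings both toggles can simply be no-ops):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

/* Freeze: make the page executable (and no longer writable). */
static void freeze (void *page, size_t size)
{
  mprotect (page, size, PROT_READ | PROT_EXEC);
}

/* Thaw: make the page writable again so new code can be emitted. */
static void thaw (void *page, size_t size)
{
  mprotect (page, size, PROT_READ | PROT_WRITE);
}

/* Allocate a page-aligned buffer suitable for holding generated code. */
static void *alloc_code_page (size_t size)
{
  return mmap (NULL, size, PROT_READ | PROT_WRITE,
      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

Writing must happen only between a thaw and the following freeze; executing must happen only while frozen.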
Call Instructions
AArch64, unlike Intel, doesn’t have a single explicit CALL instruction with different forms to cope with all supported scenarios. Instead, it uses a number of different instructions to offer support for function calls. These
instructions all branch to a given location and update the Link register, LR
,
with the return address:
BL
BLR
BLRAA
BLRAAZ
BLRAB
BLRABZ
For simplicity, in the remainder of this article, we will refer to this collection of instructions as “call instructions”.
Frames
Whenever Stalker encounters a call, it stores the return address and the address of the instrumented return block forwarder in a structure and adds these to a stack stored in a data-structure of its own. It uses this as a speculative optimization, and also as a heuristic to approximate the call depth when emitting call and return events.
Transformer
A GumStalkerTransformer
type is used to generate the instrumented code. The
implementation of the default transformer looks like this:
It is called by the function responsible for generating instrumented code,
gum_exec_ctx_obtain_block_for()
and its job is to generate the instrumented
code. We can see that it does this using a loop to process one instruction at a
time. First retrieving an instruction from the iterator, then telling Stalker to
instrument the instruction as is (without modification). These two functions are
implemented inside Stalker itself. The first is responsible for parsing a
cs_insn
and updating the internal state. This cs_insn
type is a datatype
used by the internal Capstone disassembler to
represent an instruction. The second is responsible for writing out the
instrumented instruction (or set of instructions). We will cover these in more
detail later.
Rather than using the default transformer, the user can instead provide a custom implementation which can replace and insert instructions at will. A good example is provided in the API documentation.
Callouts
Transformers can also make callouts. That is they instruct Stalker to emit
instructions to make a call to a JavaScript function – or plain C callback,
e.g. implemented using CModule – passing the CPU context and an optional
context parameter. This function is then able to modify or inspect registers at
will. This information is stored in a GumCallOutEntry
.
EOB/EOI
Recall that the Relocator is heavily involved in generating the instrumented code. It has two important properties which control its state.
End of Block (EOB) indicates that the end of a block has been reached. This occurs when we encounter any instruction which transfers control: a branch, a call, or a return instruction.
End of Input (EOI) indicates that not only have we reached the end of a block, but we have possibly reached the end of the input, i.e. what follows this instruction may not be another instruction. This is not the case for a call instruction, since control will (typically) pass back to the next instruction when the callee returns, and so more instructions must follow. (Note that a compiler will typically generate a plain branch instruction for a call to a non-returning function like exit().) Whilst there is no guarantee of valid instructions following a call instruction, we can speculatively optimize for this being the case. If we encounter a non-conditional branch instruction, or a return instruction, however, it is quite possible that there will be no code following afterwards.
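The distinction between the two flags can be sketched as a simple classification over instruction kinds (the enum and function names here are illustrative, not the Relocator’s actual API):

```c
#include <assert.h>

/* Instruction categories relevant to block-ending decisions. */
typedef enum {
  INSN_OTHER,     /* straight-line instruction */
  INSN_CALL,      /* BL, BLR, ...: control (usually) comes back */
  INSN_JUMP,      /* unconditional B or BR */
  INSN_COND_JUMP, /* B.cond, CBZ, TBZ, ... */
  INSN_RET        /* RET */
} InsnKind;

/* EOB: any transfer of control ends the block. */
static int ends_block (InsnKind kind)
{
  return kind != INSN_OTHER;
}

/* EOI: only set when what follows might not be code at all. After a
 * call, or after a conditional jump (whose fall-through path must
 * exist), more instructions follow; after an unconditional jump or a
 * return, they might not. */
static int ends_input (InsnKind kind)
{
  return kind == INSN_JUMP || kind == INSN_RET;
}
```

So every EOI is also an EOB, but not vice versa.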
Prologues/Epilogues
When control flow is redirected from the program into the Stalker engine, the registers of the CPU must be saved so that Stalker can run and make use of the registers and restore them before control is passed back to the program so that no state is lost.
The Procedure Call Standard for AArch64 states that some registers (notably X19 to X29) are callee saved registers. This means that when the compiler generates code which makes use of these registers, it must store them first. Hence it is not strictly necessary to save these registers to the context structure, since they will be restored if they are used by the code within the Stalker engine. This “minimal” context is sufficient for most purposes.
However, if the Stalker engine is to call a probe registered by
Stalker.addCallProbe()
, or a callout created by iterator.putCallout()
(called by a Transformer), then these callbacks will expect to receive the full
CPU context as an argument. And they will expect to be able to modify this
context and for the changes to take effect when control is passed back to the
application code. Thus for these instances, we must write a “full” context and
its layout must match the expected format dictated by the structure
GumArm64CpuContext
.
Note however, that the code necessary to write out the necessary CPU registers (the prologue) in either case is quite long (tens of instructions). And the code to restore them afterwards (the epilogue) is similar in length. We don’t want to write these at the beginning and end of every block we instrument. Therefore we write these (in the same way we write the instrumented blocks) into a common memory location and simply emit call instructions at the beginning and end of each instrumented block to call these functions. These common memory locations are referred to as helpers. The following functions create these prologues and epilogues.
Finally, note that in the AArch64 architecture, it is only possible to make a direct branch to code within ±128 MB of the caller, and using an indirect branch is more expensive (both in terms of code size and performance). Therefore, as we write more and more instrumented blocks, we will get further and further away from the shared prologue and epilogue. If we get more than 128 MB away, we simply write out another copy of these prologues and epilogues to use. This gives us a very reasonable tradeoff.
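The reachability decision can be sketched as follows. The range check reflects AArch64’s direct-branch encoding (a signed 26-bit word offset, i.e. roughly ±128 MB); the function names and the re-emission policy here are a simplification of what Stalker does:

```c
#include <assert.h>
#include <stdint.h>

/* An AArch64 BL/B encodes a signed 26-bit word offset: +/- 128 MB. */
#define BL_RANGE ((int64_t) 128 * 1024 * 1024)

static int is_directly_reachable (uint64_t from, uint64_t to)
{
  int64_t distance = (int64_t) (to - from);
  return distance >= -BL_RANGE && distance < BL_RANGE;
}

/* Decide whether the previously written helper can still be used from
 * a freshly allocated slab, or must be re-emitted into it. Returns the
 * helper address to call (0 means "no helper written yet"). */
static uint64_t ensure_helper_reachable (uint64_t slab_code_address,
    uint64_t last_helper_address)
{
  if (last_helper_address != 0 &&
      is_directly_reachable (slab_code_address, last_helper_address))
    return last_helper_address; /* reuse the nearby copy */
  return slab_code_address;     /* re-emit the helper into this slab */
}
```

Since a slab is only 4 MB, a helper written into the current slab is always reachable from anywhere in that slab’s tail.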
Counters
Finally, there are a series of counters recording the number of each type of instruction encountered at the end of an instrumented block. These are only used by the test-suite to guide the developer during performance tuning, indicating which branch types most commonly require a full context-switch into Stalker to resolve the target.
Slabs
Let’s now take a look at where Stalker stores its instrumented code, in slabs. Below are the data-structures used to hold it all:
Now let’s look at some code when Stalker is initialized which configures their size:
So we can see that each slab is 4 MB in size. A 12th of this slab is reserved
for its header, the GumSlab
structure itself including its GumExecBlock
array. Note that this is defined as a zero length array at the end of the
GumSlab
structure, but the actual number of these which can fit in the header
of the slab is calculated and stored in slab_max_blocks
.
So what is the remainder of the slab used for? Whilst the header of the slab is used for all the accounting information, the remainder (henceforth referred to as the tail) of the slab is used for the instrumented instructions themselves (they are stored inline in the slab).
So why is a 12th of the slab allocated for the header and the remainder for the
instructions? Well the length of each block to be instrumented will vary
considerably and may be affected by the compiler being used and its optimization
settings. Some rough empirical testing showed that given the average length of
each block this might be a reasonable ratio to ensure we didn’t run out of space
for new GumExecBlock
entries before we ran out of space for new instrumented
blocks in the tail and vice versa.
Let’s now look at the code which creates them:
Here, we can see that the data
field points to the start of the tail where
instructions can be written after the header. The offset
field keeps track of
our offset into the tail. The size
field keeps track of the total number of
bytes available in the tail. The num_blocks
field keeps track of how many
instrumented blocks have been written to the slab.
Note that where possible we allocate the slab with RWX permissions so that we don’t have to freeze and thaw it all of the time. On systems which support RWX the freeze and thaw functions become no-ops.
Lastly, we can see that each slab contains a next
pointer which can be used to
link slabs together to form a singly-linked list. This is used so we can walk
them and dispose them all when Stalker is finished.
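The layout arithmetic described above can be sketched like this. The structures are simplified stand-ins for GumSlab and GumExecBlock (the real ones carry more fields), but the division of the slab into a header (holding the block array) and a tail (holding code), and the computation of how many block entries fit, follow the description:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sizes matching the text: 4 MB slabs, 1/12th header. */
#define SLAB_SIZE (4 * 1024 * 1024)
#define HEADER_SIZE (SLAB_SIZE / 12)

/* Simplified GumExecBlock. */
typedef struct {
  uint64_t real_address;
  uint64_t code_address;
} ExecBlock;

/* Simplified slab header with a trailing flexible block array, as in
 * GumSlab. The tail (code area) starts HEADER_SIZE bytes in. */
typedef struct {
  uint8_t *data;      /* start of the tail */
  size_t offset;      /* bytes of tail used so far */
  size_t size;        /* total bytes available in the tail */
  size_t num_blocks;  /* blocks written so far */
  ExecBlock blocks[]; /* zero-length array: lives inside the header */
} Slab;

/* How many ExecBlock entries fit in the header after the fixed fields;
 * the real engine stores this in slab_max_blocks. */
static size_t slab_max_blocks (void)
{
  return (HEADER_SIZE - sizeof (Slab)) / sizeof (ExecBlock);
}

static size_t slab_tail_size (void)
{
  return SLAB_SIZE - HEADER_SIZE;
}
```

This is why the zero-length array trick is used: the header and its block array occupy one contiguous prefix of the slab, and everything after it is raw space for instrumented code.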
Blocks
Now we understand how the slabs work. Let’s look in more detail at the blocks. As we know, we can store multiple blocks in a slab and write their instructions to the tail. Let’s look at the code to allocate a new block:
The function first checks if there is space for a minimally sized block in the
tail of the slab (1024 bytes) and whether there is space in the array of
GumExecBlocks
in the slab header for a new entry. If so, then a new entry is created in the array and its pointers are set to reference the GumExecCtx (the main Stalker session context) and the GumSlab. The code_begin and code_end pointers are both set to the first free byte in the tail. The
recycle_count
used by the trust threshold mechanism to determine how many
times the block has been encountered unmodified is reset to zero, and the
remainder of the tail is thawed to allow code to be written to it.
Next if the trust threshold is set to less than zero (recall -1 means blocks are
never trusted and always re-written) then we reset the slab offset
(the
pointer to the first free byte in the tail) and start over. This means that any
instrumented code written for any blocks within the slab will be overwritten.
Finally, if there is no space left in the current slab and we can’t overwrite it (because the trust threshold means blocks may be re-used), then we must allocate a
new slab by calling gum_exec_ctx_add_slab()
, which we looked at above. We then
call gum_exec_ctx_ensure_inline_helpers_reachable()
, more on that in a moment,
and then we allocate our block from the new slab.
Recall, that we use helpers (such as the prologues and epilogues that save and
restore the CPU context) to prevent having to duplicate these instructions at
the beginning and end of every block. As we need to be able to call these from
instrumented code we are writing to the slab, and we do so with a direct branch
that can only reach ±128 MB from the call site, we need to ensure we can get to
them. If we haven’t written them before, then we write them to our current slab.
Note that these helper functions need to be reachable from any instrumented
instruction written in the tail of the slab. Because our slab is only 4 MB in
size, if our helpers are written in our current slab then they will be
reachable just fine. If we are allocating a subsequent slab and it is close
enough to the previous slab (we only retain the location we last wrote the
helper functions to) then we might not need to write them out again and can just
rely upon the previous copy in the nearby slab. Note that we are at the mercy of
mmap()
for where our slab is allocated in virtual memory and ASLR may dictate
that our slab ends up nowhere near the previous one.
We can only assume that either this is unlikely to be a problem, or that this
has been factored into the size of the slabs to ensure that writing the helpers
to each slab isn’t much of an overhead because it doesn’t use a significant
proportion of their space. An alternative could be to store every location every
time we have written out a helper function so that we have more candidates to
choose from (maybe our slab isn’t allocated nearby the one previously allocated,
but perhaps it is close enough to one of the others). Otherwise, we could
consider making a custom allocator using mmap()
to reserve a large (e.g. 128
MB) region of virtual address space and then use mmap()
again to commit the
memory one slab at a time as needed. But these ideas are perhaps both overkill.
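The reachability test can be sketched as follows (a simplified stand-in for gum_exec_ctx_is_helper_reachable(); the names are illustrative, and the ±128 MB limit comes from the range of the AArch64 direct branch instructions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AArch64 B/BL immediates can reach +/-128 MB from the call site. */
#define GUM_B_MAX_DISTANCE ((int64_t) 128 * 1024 * 1024)

/* A previously written helper can be reused only if a direct branch
 * from anywhere in the new slab's tail can reach it. */
static bool
helper_reachable_from_slab (uintptr_t helper, uintptr_t slab_begin,
    uintptr_t slab_end)
{
  int64_t d1 = (int64_t) helper - (int64_t) slab_begin;
  int64_t d2 = (int64_t) helper - (int64_t) slab_end;
  if (d1 < 0) d1 = -d1;
  if (d2 < 0) d2 = -d2;
  return d1 < GUM_B_MAX_DISTANCE && d2 < GUM_B_MAX_DISTANCE;
}
```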
Instrumenting Blocks
The main function which instruments a code block is called
gum_exec_ctx_obtain_block_for()
. It first looks for an existing block in the
hash table which is indexed on the address of the original block which was
instrumented. If it finds one and the aforementioned constraints around the
trust threshold are met then it can simply be returned.
The fields of the GumExecBlock
are used as follows. The real_begin
is set to
the start of the original block of code to be instrumented. The code_begin
field points to the first free byte of the tail (remember this was set by the
gum_exec_block_new()
function discussed above). A GumArm64Relocator
is
initialized to read code from the original code at real_begin
and a
GumArm64Writer
is initialized to write its output to the slab starting at
code_begin
. Each of these items is packaged into a GumGeneratorContext
and
finally this is used to construct a GumStalkerIterator
.
This iterator is then passed to the transformer. Recall that the default implementation is as follows:
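A minimal C sketch of that default behavior (the iterator type here is a simplified stand-in, not Gum’s real API):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for Gum's iterator API (illustrative only). */
typedef struct {
  int pos;      /* next instruction index in the input block */
  int len;      /* number of instructions in the block */
  int emitted;  /* how many instructions keep() has written out */
} GumStalkerIterator;

static bool
stalker_iterator_next (GumStalkerIterator * it)
{
  if (it->pos == it->len)
    return false; /* end of block reached */
  it->pos++;
  return true;
}

static void
stalker_iterator_keep (GumStalkerIterator * it)
{
  it->emitted++; /* stand-in for writing the relocated instruction */
}

/* The shape of the default transformer: keep every instruction as-is. */
static void
default_transform (GumStalkerIterator * it)
{
  while (stalker_iterator_next (it))
    stalker_iterator_keep (it);
}
```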
We will gloss over the details of gum_stalker_iterator_next()
and
gum_stalker_iterator_keep()
for now. But in essence, this causes the iterator
to read code one instruction at a time from the relocator, and write the
relocated instruction out using the writer. Following this process, the
GumExecBlock
structure can be updated. Its field real_end
can be set to the
address where the relocator read up to, and its field code_end
can be set to
the address which the writer wrote up to. Thus real_begin
and real_end
mark
the limits of the original block, and code_begin
and code_end
mark the
limits of the newly instrumented block. Finally,
gum_exec_ctx_obtain_block_for()
calls gum_exec_block_commit()
which takes a
copy of the original block and places it immediately after the instrumented
copy. The field real_snapshot
points to this (and is thus identical to
code_end
). Next the slab’s offset
field is updated to reflect the space used
by our instrumented block and our copy of the original code. Finally, the block
is frozen to allow it to be executed.
Now let’s just return to a few more details of the function
gum_exec_ctx_obtain_block_for()
. First we should note that each block has a
single instruction prefixed.
This instruction is the restoration prolog (denoted by
GUM_RESTORATION_PROLOG_SIZE
). This is skipped in “bootstrap” usage – hence you
will note this constant is added on by _gum_stalker_do_follow_me()
and
gum_stalker_infect()
when returning the address of the instrumented code. When
return instructions are instrumented, however, if the return is to a block which
has already been instrumented, then we can simply return to that block rather
than returning back into the Stalker engine. This code is written by
gum_exec_block_write_ret_transfer_code()
. In a worst-case scenario, where we
may need to use registers to perform the final branch to the instrumented block,
this function stores them into the stack, and the code to restore these from the
stack is prefixed in the block itself. Hence, in the event that we can return
directly to an instrumented block, we return to this first instruction rather
than skipping GUM_RESTORATION_PROLOG_SIZE
bytes.
Secondly, we can see gum_exec_ctx_obtain_block_for()
does the following after
the instrumented block is written:
This inserts a break instruction which is intended to simplify debugging.
Lastly, if Stalker is configured to, gum_exec_ctx_obtain_block_for()
will
generate an event of type GUM_COMPILE
when compiling the block.
Helpers
We can see from gum_exec_ctx_ensure_inline_helpers_reachable()
that we have a
total of 6 helpers. These helpers are common fragments of code which are needed
repeatedly by our instrumented blocks. Rather than emitting the code they
contain repeatedly, we instead write it once and place a call or branch
instruction to have our instrumented code execute it. Recall that the helpers
are written into the same slabs we are writing our instrumented code into and
that if possible we can re-use the helper written into a previous nearby slab
rather than putting a copy in each one.
This function calls gum_exec_ctx_ensure_helper_reachable()
for each helper
which in turn calls gum_exec_ctx_is_helper_reachable()
to check if the helper
is within range, or otherwise calls the callback passed as the second argument
to write out a new copy.
So, what are our 6 helpers? We have 2 for writing prologues which store the
register context: one for a full context and one for a minimal context. We will
cover these later. We also have 2 for the corresponding epilogues which restore
the registers. The other two, last_stack_push
and last_stack_pop_and_go
, are
used when instrumenting call instructions.
Before we analyze these two in detail, we first need to understand the frame
structures. We can see from the code snippets below that we allocate a page to
contain GumExecFrame
structures. These structures are stored sequentially in
the page like an array and are populated starting with the entry at the end of
the page. Each frame contains the address of the original block and the address
of the instrumented block which we generated to replace it:
last_stack_push
Much of the complexity in understanding Stalker and the helpers in particular is that some functions – let’s call them writers – write code which is executed at a later point. These writers have branches in themselves which determine exactly what code to write, and the written code can also sometimes have branches too. The approach I will take for these two helpers therefore is to show pseudo code for the assembly which is emitted into the slab which will be called by instrumented blocks.
The pseudo code for this helper is shown below:
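The real helper is emitted as assembly; the following self-contained C sketch (illustrative names and sizes) captures the same logic:

```c
#include <assert.h>
#include <stdint.h>

#define GUM_EXEC_FRAME_PAGE_SIZE 8192 /* illustrative: 512 x 16-byte frames */

typedef struct {
  void * real_address; /* return address in the original code */
  void * code_address; /* matching landing pad in the instrumented copy */
} GumExecFrame;

typedef struct {
  GumExecFrame * first_frame;   /* the entry at the END of the page */
  GumExecFrame * current_frame; /* last used entry; grows downward */
} GumExecCtx;

/* Sketch of the last_stack_push helper. */
static void
last_stack_push (GumExecCtx * ctx, void * real_address, void * code_address)
{
  /* The entry at the very beginning of the page is page-aligned; once we
   * reach it, the stack is full and we silently drop the frame. */
  if (((uintptr_t) ctx->current_frame % GUM_EXEC_FRAME_PAGE_SIZE) == 0)
    return;

  ctx->current_frame--; /* full, descending */
  ctx->current_frame->real_address = real_address;
  ctx->current_frame->code_address = code_address;
}
```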
As we can see, this helper is actually a simple function which takes two
arguments, the real_address
and the code_address
to store in the next
GumExecFrame
structure. Note that our stack is written backwards from the end
of the page in which they are stored towards the start and that current_frame
points to the last used entry (so our stack is full and descending). Also note
we have a conditional check to see whether we are on the last entry (the one at
the very beginning of the page will be page-aligned) and if we have run out of
space for more entries (we have space for 512) then we simply do nothing. If we
have space, we write the values from the parameters into the entry and retard
the current_frame
pointer to point to it.
This helper is used when virtualizing call instructions. Virtualizing is the
name given to the replacement of an instruction (typically one relating to
branching) with a series of instructions which, instead of executing the
intended block, allow Stalker to manage the control-flow. Recall that as our
transformer walks
the instructions using the iterator and calls iterator.keep()
we output our
transformed instruction. When we encounter a branch, we need to emit code to
call back into the Stalker engine so that it can instrument that block, but if
the branch statement is a call instruction (BL
, BLX
etc) we also need to
emit a call to the above helper to store the stack frame information. This
information is used when emitting call events as well as later when optimizing
the return.
last_stack_pop_and_go
Now let’s look at the last_stack_pop_and_go
helper. To understand this, we also
need to understand the code written by
gum_exec_block_write_ret_transfer_code()
(the code that calls it), as well as
that written by gum_exec_block_write_exec_generated_code()
which it calls. We
will skip over pointer authentication for now.
So this code is a little harder. It isn’t really a function and the actual assembly written by it is muddied a little by the need to save and restore registers. But the essence of it is this: When virtualizing a return instruction this helper is used to optimize passing control back to the caller. ret_reg contains the address of the block to which we are intending to return.
Let’s take a look at the definition of the return instruction:
RET Return from subroutine, branches unconditionally to an address in a register, with a hint that this is a subroutine return.
RET {Xn} Where:
Xn Is the 64-bit name of the general-purpose register holding the address to be branched to, in the range 0 to 31. Defaults to X30 if absent.
As we can see, we are going to return to an address passed in a register.
Typically, we can predict the register value and where we will return to, as the
compiler will emit assembly code so that the register is set to the address of
the instruction immediately following the call which got us there. After
emitting an instrumented call, we emit a little landing pad directly after it,
which will call back into Stalker to instrument the next block. This landing pad can
later be backpatched (if the conditions are right) to avoid re-entering Stalker
altogether. We store the addresses of both the original block following the call
and this landing pad in the GumExecFrame
structure, so we can simply
virtualize our return instruction by replacing it with instructions which simply
branch to this landing pad. We don’t need to re-enter the Stalker engine each
time we see a return instruction and get a nice performance boost. Simple!
However, we must bear in mind that not all calls will result in a return. A
common technique for hostile or specialized code is to make a call in order to
use the LR
to determine the current position of the instruction pointer. This
value may then be used for introspection purposes (e.g. to validate code to
detect modification, to decrypt or unscramble code, etc.).
Also, remember that the user can use a custom transform to modify instructions as they see fit: they can insert instructions which modify register values, or perhaps a callout function which is passed the context structure, allowing them to modify register values as they like. Now consider what happens if they modify the value in the return register!
So we can see that the helper checks the value of the return register against
the value of the real_address
stored in the GumExecFrame
. If it matches,
then all is well and we can simply branch directly back to the landing pad.
Recall on the first instance, this simply re-enters Stalker to instrument the
next block and branches to it, but at a later point backpatching may be used to
directly branch to this instrumented block and avoid re-entering Stalker
altogether.
Otherwise, we follow a different path. First, the array of GumExecFrame
structures is cleared: now that our control-flow has deviated, we will start
building our stack afresh. We accept that we will take this same slower path for any previous
frames in the call-stack we recorded so far if we ever return to them, but will
have the possibility of using the fast path for new calls we encounter from here
on out (until the next time a call instruction is used in an unconventional
manner).
We make a minimal prologue (our instrumented code is now going to have to
re-enter Stalker) and we need to be able to restore the application’s registers
before we return control back to it. We call the entry gate for return,
gum_exec_ctx_replace_current_block_from_ret()
(more on entry gates later). We
then execute the corresponding epilogue before branching to the ctx->resume_at
pointer which is set by Stalker during the above call to
gum_exec_ctx_replace_current_block_from_ret()
to point to the new instrumented
block.
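The two paths just described can be condensed into a C sketch (illustrative names; the stand-in gate below just records where Stalker would resume, whereas the real gum_exec_ctx_replace_current_block_from_ret() instruments the target block):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  void * real_address;
  void * code_address;
} GumExecFrame;

typedef struct {
  GumExecFrame * first_frame;
  GumExecFrame * current_frame;
  void * resume_at;
} GumExecCtx;

/* Stand-in for the entry gate: in Stalker this instruments the block at
 * ret_reg and stores the instrumented address in ctx->resume_at. */
static void
replace_current_block_from_ret (GumExecCtx * ctx, void * ret_reg)
{
  ctx->resume_at = ret_reg; /* illustrative: pretend a 1:1 mapping */
}

/* Sketch of the last_stack_pop_and_go logic. */
static void *
last_stack_pop_and_go (GumExecCtx * ctx, void * ret_reg)
{
  GumExecFrame * frame = ctx->current_frame;

  if (frame->real_address == ret_reg)
  {
    /* Fast path: pop the frame, branch straight to the landing pad. */
    ctx->current_frame = frame + 1;
    return frame->code_address;
  }

  /* Slow path: the return address was produced by an unconventional call
   * or tampered with; discard the recorded stack and re-enter Stalker. */
  ctx->current_frame = ctx->first_frame;
  replace_current_block_from_ret (ctx, ret_reg);
  return ctx->resume_at;
}
```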
Context
Let’s look at the prologues and epilogues now.
We can see that these do little other than call the corresponding prologue or
epilogue helpers. We can see that the prologue will store X19
and the link
register onto the stack. These are then restored into X19
and X20
at the end
of the epilogue. This is because X19
is needed as scratch space to write the
context blocks and the link register needs to be preserved as it will be
clobbered by the call to the helper.
The LDP and STP instructions load and store a pair of registers respectively and have the option to increment or decrement the stack pointer. This increment or decrement can be carried out either before, or after the values are loaded or stored.
Note also the offset at which these registers are placed. They are stored at
16
bytes + GUM_RED_ZONE_SIZE
beyond the top of the stack. Note that our
stack on AArch64 is full and descending. This means that the stack grows toward
lower addresses and the stack pointer points to the last item pushed (not to the
next empty space). So, if we subtract 16 bytes from the stack pointer, then this
gives us enough space to store the two 64-bit registers. Note that the stack
pointer must be decremented before the store (pre-decrement) and incremented
after the load (post-increment).
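Concretely, the save and restore described above look something like this sketch (helper names are illustrative; GUM_RED_ZONE_SIZE is 128 here):

```
; prologue call site: step over the red zone, then push X19 and LR
stp x19, x30, [sp, #-(16 + 128)]!   ; pre-decrement SP, then store the pair
bl  prologue_helper                  ; clobbers LR, hence the save above
...
bl  epilogue_helper
ldp x19, x20, [sp], #(16 + 128)      ; load the pair, then post-increment SP
```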
So what is GUM_RED_ZONE_SIZE
? The
redzone is a 128-byte area beyond the stack pointer which a function can use to store temporary
variables. This allows a function to store data in the stack without the need to
adjust the stack pointer all of the time. Note that this call to the prologue is
likely the first thing to be carried out in our instrumented block: we don’t
know what local variables the application code has stored in the redzone, and so
we must ensure that we advance the stack pointer beyond it before we start using
the stack to store information for the Stalker engine.
Context Helpers
Now that we have looked at how these helpers are called, let us now have a look at the helpers themselves. Although there are two prologues and two epilogues (full and minimal), they are both written by the same function as they have much in common. The version which is written is based on the function parameters. The easiest way to present these is with annotated code:
Now let’s look at the epilogue:
This is all quite complicated. Partly this is because we have only a single register to use as scratch space, partly because we want to keep the prologue and epilogue code stored inline in the instrumented block to a bare minimum, and partly because our context values can be changed by callouts and the like. But hopefully it all now makes sense.
Reading/Writing Context
Now that we have our context saved, whether it was a full context or just the minimal one, Stalker may need to read registers from the context to see what the state of the application code was. For example, to find the address which a branch or return instruction was going to branch to, so that we can instrument the block.
When Stalker writes the prologue and epilogue code, it does so by calling
gum_exec_block_open_prolog()
and gum_exec_block_close_prolog()
. These store
the type of prologue which has been written in gc->opened_prolog
.
Therefore when we want to read a register, this can be achieved with the single
function gum_exec_ctx_load_real_register_into()
. This determines which kind of
prologue is in use and calls the relevant routine accordingly. Note that these
routines don’t actually read the registers, they emit code which reads them.
Reading registers from the full frame is actually the simplest. We can see the
code closely matches the structure used to pass the context to callouts etc.
Remember that in each case register X20
points to the base of the context
structure.
Reading from the minimal context is actually a little harder. X0
through X18
are simple: they are stored in the context block. After X18
is 8 bytes padding
(to make a total of 10 pairs of registers) followed by X29
and X30
. This
makes a total of 11 pairs of registers. Immediately following these are the
NEON/floating-point registers (totaling 128 bytes). Finally, X19
and X20
are
stored above this, as they are restored by the inline epilogue code written by
gum_exec_ctx_write_epilog()
.
Control flow
Execution of Stalker begins at one of 3 entry points:
_gum_stalker_do_follow_me()
gum_stalker_infect()
gum_exec_ctx_replace_current_block_with()
The first two we have already covered, these initialize the Stalker engine and
start instrumenting the first block of execution.
gum_exec_ctx_replace_current_block_with()
is used to instrument subsequent
blocks. In fact, the main difference between this function and the preceding two
is that the Stalker engine has already been initialized and hence this work
doesn’t need to be repeated. All three call gum_exec_ctx_obtain_block_for()
to
generate the instrumented block.
We covered gum_exec_ctx_obtain_block_for()
previously in our section on
transformers. It calls the transformer implementation in use, which by default
calls gum_stalker_iterator_next()
which calls the relocator using
gum_arm64_relocator_read_one()
to read the next relocated instruction. Then it
calls gum_stalker_iterator_keep()
to generate the instrumented copy. It does
this in a loop until gum_stalker_iterator_next()
returns FALSE
as it has
reached the end of the block.
Most of the time gum_stalker_iterator_keep()
will simply call
gum_arm64_relocator_write_one()
to emit the relocated instruction as is.
However, if the instruction is a branch or return instruction it will call
gum_exec_block_virtualize_branch_insn()
or
gum_exec_block_virtualize_ret_insn()
respectively. These two virtualization
functions which we will cover in more detail later, emit code to transfer
control back into gum_exec_ctx_replace_current_block_with()
via an entry gate
ready to process the next block (unless there is an optimization where we can
bypass Stalker and go direct to the next instrumented block, or we are entering
into an excluded range).
Gates
Entry gates are generated by a macro, one for each of the different instruction
types found at the end of a block. When we virtualize each of these types of
instruction, we direct control flow back to the
gum_exec_ctx_replace_current_block_with()
function via one of these gates. We
can see that the implementation of the gate is quite simple: it updates a
counter of how many times it has been called and passes control to
gum_exec_ctx_replace_current_block_with()
passing through the parameters it
was called with, the GumExecCtx
and the start_address
of the next block to
be instrumented.
These counters can be displayed by the following routine. They are only meant to be used by the test-suite rather than being exposed to the user through the API.
Virtualize functions
Let’s now look in more detail at the virtualization functions we use for replacing the branch instruction found at the end of each block. We have four of these functions:
gum_exec_block_virtualize_branch_insn()
gum_exec_block_virtualize_ret_insn()
gum_exec_block_virtualize_sysenter_insn()
gum_exec_block_virtualize_linux_sysenter()
We can see that two of these relate to syscalls (and in fact, one calls the other); we will cover these later. Let’s look at the ones for branches and returns.
gum_exec_block_virtualize_branch_insn
This routine first determines whether the destination of the branch comes from
an immediate offset in the instruction, or a register. In the case of the
latter, we don’t extract the value just yet, we only determine which register.
This is referred to as the target
. The next section of the function deals with
branch instructions. This includes both conditional and non-conditional
branches. For conditional branches, the destination used if the branch is not
taken is referred to as cond_target
; this is set to the address of the next instruction
in the original block.
Likewise regular_entry_func
and cond_entry_func
are used to hold the entry
gates which will be used to handle the branch. The former is used to hold the
gate used for non-conditional branches and cond_entry_func
holds the gate to
be used for a conditional branch (whether it is taken or not).
The function gum_exec_block_write_jmp_transfer_code()
is used to write the
code required to branch to the entry gate. For non-conditional branches this is
simple: we call the function passing the target
and the regular_entry_func
.
For conditional branches things are slightly more complicated. Our output looks
like the following pseudo-code:
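A sketch, assuming the original instruction was CBZ X0, target (labels are illustrative):

```
; original instruction:  cbz x0, target
cbnz x0, cond_not_taken            ; note: the branch is inverted
  ; gum_exec_block_write_jmp_transfer_code(target, cond_entry_func)
  <enter Stalker via cond_entry_func, real address = target>
cond_not_taken:
  ; gum_exec_block_write_jmp_transfer_code(cond_target, cond_entry_func)
  <enter Stalker via cond_entry_func, real address = next instruction>
```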
Here, we can see that we first write a branch instruction into our instrumented
block, since in the instrumented block we also need to determine whether we
should take the branch or not. But instead of branching directly to the target,
just as for the non-conditional branches, we use
gum_exec_block_write_jmp_transfer_code()
to write code to jump back into
Stalker via the relevant entry gate, passing the real address we would have
branched to. Note, however, that the branch is inverted from the original (e.g.
CBZ
would be replaced by CBNZ
).
Now, let’s look at how gum_exec_block_virtualize_branch_insn()
handles calls.
First we emit code to generate the call event if we are configured to. Next we
check if there are any probes in use. If there are, then we call
gum_exec_block_write_call_probe_code()
to emit the code necessary to call any
registered probe callback. Next, we check if the call is to an excluded range
(note that we can only do this if the call is to an immediate address). If it
is, then we emit the instruction as is, but we follow this by using
gum_exec_block_write_jmp_transfer_code()
as we did when handling branches, to
emit code which calls back into Stalker right after, to instrument the block at the
return address. Note that here we use the excluded_call_imm
entry gate.
Finally, if it is just a normal call expression, then we use the function
gum_exec_block_write_call_invoke_code()
to emit the code to handle the call.
This function is pretty complicated as a result of all of the optimization for
backpatching, so we will only look at the basics.
Remember earlier that in gum_exec_block_virtualize_branch_insn()
, we could
only check if our call was to an excluded range if the target was specified in
an immediate? Well if the target was specified in a register, then here we emit
code to check whether the target is in an excluded range. This is done by
loading the target register using
gum_exec_ctx_write_push_branch_target_address()
(which in turn calls
gum_exec_ctx_load_real_register_into()
which we covered ealier to read the
context) and emitting code to call
gum_exec_block_check_address_for_exclusion()
whose implementation is quite
self-explanatory. If it is excluded then a branch is taken and similar code to
that described when handling excluded immediate calls discussed above is used.
Next we emit code to call the entry gate and generate the instrumented block of
the callee. Then we call the helper last_stack_push
to add our GumExecFrame
to
our context containing the original and instrumented block address. The real and
instrumented code addresses are read from the current cursor positions of the
GeneratorContext and CodeWriter respectively, and we then generate the required
landing pad for the return address (this is the optimization we covered
earlier, we can jump straight to this block when executing the virtualized
return statement rather than re-entering Stalker). Lastly we use
gum_exec_block_write_exec_generated_code()
to emit code to branch to the
instrumented callee.
gum_exec_block_virtualize_ret_insn
After looking at the virtualization of call instructions, you will be pleased to
know that this one is relatively simple! If configured, this function calls
gum_exec_block_write_ret_event_code()
to generate an event for the return
statement. Then it calls gum_exec_block_write_ret_transfer_code()
to generate
the code required to handle the return instruction. This one is simple too, it
emits code to call the last_stack_pop_and_go
helper we covered earlier.
Emitting events
Events are one of the key outputs of the Stalker engine. They are emitted by the following functions. Their implementation again is quite self-explanatory:
gum_exec_ctx_emit_call_event()
gum_exec_ctx_emit_ret_event()
gum_exec_ctx_emit_exec_event()
gum_exec_ctx_emit_block_event()
One thing to note with each of these functions, however, is that they all call
gum_exec_block_write_unfollow_check_code()
to generate code for checking if
Stalker is to stop following the thread. We’ll have a look at this in more
detail next.
Unfollow and tidy up
If we look at the function which generates the instrumented code to check if we
are being asked to unfollow, we can see it causes the thread to call
gum_exec_ctx_maybe_unfollow()
, passing the address of the next instruction to
be instrumented. We can see that if the state has been set to stop following,
then we simply branch back to the original code.
A quick note about pending calls. If we have a call to an excluded range, then we emit the original call in the instrumented code followed by a call back to Stalker. Whilst the thread is running in the excluded range, however, we cannot control the instruction pointer until it returns. We therefore need to simply keep track of these and wait for the thread to exit the excluded range.
Now we can see how a running thread gracefully goes back to running normal uninstrumented code, let’s see how we stop stalking in the first place. We have two possible ways to stop stalking:
gum_stalker_unfollow_me()
gum_stalker_unfollow()
The first is quite simple: we set the state to stop following, then call
gum_exec_ctx_maybe_unfollow()
to attempt to stop the current thread from being
followed, and then dispose of the Stalker context.
We notice here that we pass NULL
as the address to
gum_exec_ctx_maybe_unfollow()
, which may seem odd, but we can see that in this
instance it isn’t used: when we instrument a block (remember,
gum_exec_ctx_replace_current_block_with()
is where the entry gates direct us
to instrument subsequent blocks) we check whether we are about to call
gum_unfollow_me()
, and if so we return the original block from the
function rather than the address of the instrumented block generated by
gum_exec_ctx_obtain_block_for()
. Therefore we can see that this is a special
case and this function isn’t stalked. We simply jump to the real function, so at
this point we have stopped stalking the thread forever. This handling differs
from excluded ranges: for those we retain the original call instruction in an
instrumented block, but then follow it with a call back into Stalker. In this
case, we are just vectoring back to an original uninstrumented block:
Let’s look at gum_stalker_unfollow()
now:
This function looks through the list of contexts looking for the one for the
requested thread. Again, it sets the state of the context to
GUM_EXEC_CTX_UNFOLLOW_PENDING
. If the thread has already run, we must wait for
it to check this flag and return to normal execution. However, if it has not run
(perhaps it was in a blocking syscall when we asked to follow it and never got
infected in the first instance) then we can disinfect it ourselves by calling
gum_process_modify_thread()
to modify the thread context (this function was
described in detail earlier) and using gum_stalker_disinfect()
as our callback
to perform the changes. This simply checks to see if the program counter was set
to point to the infect_thunk
and resets the program counter back to its
original value. The infect_thunk
is created by gum_stalker_infect()
which is
the callback used by gum_stalker_follow()
to modify the context. Recall that
whilst some of the setup can be carried out on behalf of the target thread, some
has to be done in the context of the target thread itself (in particular setting
variables in thread-local storage). Well, it is the infect_thunk
which
contains that code.
Miscellaneous
Hopefully we have now covered the most important aspects of Stalker and have provided a good background on how it works. We do have a few other observations though, which may be of interest.
Exclusive Store
The AArch64 architecture has support for exclusive load/store instructions. These instructions are intended to be used for synchronization. If an exclusive load is performed from a given address and an exclusive store is later attempted to the same location, the CPU is able to detect any other stores (exclusive or otherwise) to the same location in the intervening period, and the store fails.
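For context, the kind of sequence at stake looks like this sketch of a counter increment; any store or exception between the exclusive load and store clears the exclusive monitor and makes the store fail:

```
retry:
  ldxr  w1, [x0]        ; exclusive load of the counter
  add   w1, w1, #1      ; test/modify the value
  stxr  w2, w1, [x0]    ; exclusive store; w2 is non-zero on failure
  cbnz  w2, retry       ; if the monitor was disturbed, try again
```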
Obviously, these types of primitives are likely to be used for constructs such as mutexes and semaphores. Multiple threads may attempt to load the current count of the semaphore, test whether it is already full, then increment and store the new value back to take the semaphore. These exclusive operations are ideal for just such a scenario. Consider, though, what would happen if multiple threads were competing for the same resource. If one of those threads were being traced by Stalker, it would always lose the race. Also, these instructions are easily disturbed by other kinds of CPU operations, so if we do something complex like emit an event between a load and a store, we are going to cause the store to fail every time and end up looping indefinitely. Stalker, however, deals with such a scenario:
Here, we can see that the iterator records when it sees an exclusive load and tracks how many instructions have passed since. This is continued for up to four instructions, a limit determined by empirical testing of how many instructions are typically needed to load, test, modify and store the value. This is then used to prevent any instrumentation being emitted which isn’t strictly necessary:
Exhausted Blocks
Whilst we check to ensure a minimum amount of space for our current instrumented
block is left in the slab before we start (and allocate a new one if we fall
below this minimum), we cannot predict how long a sequence of instructions we
are likely to encounter in our input block. Nor is it simple to determine
exactly how many instructions of output we will need to write the necessary
instrumentation (we have possible code for emitting the different types of
event, checking for excluded ranges, virtualizing instructions found at the end
of the block etc.). Also, trying to allow for the instrumented code to be
non-sequential is fraught with difficulty. So the approach taken is to ensure
that each time we read a new instruction from the iterator there is at least
1024 bytes of space in the slab for our output. If this is not the case, then we
store the current address in continuation_real_address
and return FALSE
so
that the iterator ends.
Our caller, gum_exec_ctx_obtain_block_for(), which is walking the iterator to
generate the block, then acts exactly as if a branch instruction to the next
instruction had been encountered, essentially terminating the current block
and starting the next one.
It is as if the following instructions had been encountered in the input right before the instruction which would not have had sufficient space:
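A plausible reconstruction of that synthetic input (an unconditional branch to the very next address, shown here as illustrative AArch64 assembly):

```
b next_insn          // unconditional branch to the very next instruction
next_insn:
  // ...the instruction that lacked space now begins a fresh block
```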
Syscall Virtualization
Syscalls are the entry points from user-mode into kernel-mode. They are how
applications ask the kernel to carry out operations on their behalf, whether
that be opening files or reading network sockets. On AArch64 systems, this is
carried out using the SVC instruction, whereas on Intel the instruction is
sysenter. Hence the terms syscall and sysenter here are used synonymously.
Syscall virtualization is carried out by the following routine. We can see we only do anything on Linux systems:
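In pseudocode, the routine looks something like the following (the structure matches the description above; the exact function names in Frida's source may differ):

```
virtualize_sysenter_insn(block, gc):
    if building for Linux:
        gum_exec_block_virtualize_linux_sysenter(block, gc)
    else:
        copy the original syscall instruction into the output unchanged
```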
This is required because of the clone syscall. This syscall creates a new
process which shares execution context with the parent, such as file handles,
virtual address space, and signal handlers. In essence, it creates a new
thread. But the current thread is being traced by Stalker, and clone is going
to create an exact replica of it. Given that Stalker contexts are managed on a
per-thread basis, we should not be stalking this new child.
Note that for syscalls on AArch64 the first 8 arguments are passed in
registers X0 through X7 and the syscall number is passed in X8; any additional
arguments are passed on the stack. The return value of the syscall is returned
in X0.
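To make the convention concrete, here is an illustrative (not from Frida) raw write(2) syscall on AArch64 Linux:

```
mov x0, #1        // arg0: file descriptor (stdout)
ldr x1, =msg      // arg1: pointer to the buffer
mov x2, #len      // arg2: number of bytes
mov x8, #64       // syscall number: __NR_write on AArch64 Linux
svc #0            // trap into the kernel
// on return, x0 holds the bytes written (or a negative errno)
```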
The function gum_exec_block_virtualize_linux_sysenter()
generates the
necessary instrumented code to deal with such a syscall. We will look at the
pseudo code below:
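In outline, the generated code behaves like this (pseudocode consistent with the description that follows; details in the actual implementation may differ):

```
if x8 == __NR_clone:
    perform the original syscall
    if x0 == 0:
        // child thread: branch to the next instruction in the original
        // (uninstrumented) block, escaping Stalker entirely
        goto next_instruction_real_address
    // parent thread: fall through and continue in instrumented code
else:
    perform the original syscall (copied from the original block)
```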
We can see that it first checks whether we are dealing with a clone syscall;
if not, it simply performs the original syscall and that is all (the original
syscall instruction is copied from the original block). If it is a clone
syscall, then we again perform the original syscall, but at this point we have
two threads of execution, and the syscall returns a different value in each:
the original thread receives the child's PID as its return value, whereas the
child receives 0.
If we receive a non-zero value, we can simply continue as we were. We want to continue stalking the thread and allow execution to carry on with the next instruction. If, however, we receive a return value of 0, then we are in the child thread. We therefore carry out a branch to the next instruction in the original block ensuring that the child continues to run without any interruption from Stalker.
Pointer Authentication
Last of all, we should note that newer versions of iOS have introduced pointer authentication codes. Pointer authentication codes (PACs) make use of unused bits in pointers (the high bits of virtual addresses are commonly unused, as most systems have a maximum of 48 bits of virtual address space) to store authentication values. These values are calculated by combining the original pointer, a context parameter (typically the contents of another register) and a cryptographic key. The idea is that the key cannot be read or written from user-mode, and the resulting pointer authentication code cannot be guessed without access to it.
Let’s look at the following fragment of code:
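A plausible reconstruction of such a fragment, matching the description below (a PAC-protected function prologue and epilogue on AArch64):

```
pacia lr, sp                  // sign LR using SP as context, giving LR'
stp   x29, x30, [sp, #-16]!   // store FP and the signed LR on the stack
// ... function body ...
ldp   x29, x30, [sp], #16     // restore FP and the signed LR
autia lr, sp                  // authenticate LR'; faults if it was tampered with
ret
```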
The pacia instruction combines the values of LR, SP and the key to generate a
version of LR with the authentication code, LR', and stores it back into the
LR register. This value is stored on the stack and later restored at the end
of the function. The autia instruction validates the value of LR'.
This is possible since the PAC in the high bits of LR can be stripped to give
the original LR value, and the pointer authentication code can be regenerated
as before using SP and the key. The result is checked against LR'. If the
values don't match, the instruction generates a fault. Thus if the value of LR
stored on the stack is modified, or the stack pointer itself is corrupted,
then the validation will fail. This is useful to prevent the building of ROP
chains, which require return addresses to be stored on the stack. Since LR' is
now stored on the stack instead of LR, valid return addresses cannot be forged
without the key.
Frida also needs to take this into account when generating code. When reading
pointers from registers used by the application (e.g. to determine the
destination of an indirect branch or return), it is necessary to strip these
pointer authentication codes from the address before it is used. This is
achieved using the function gum_arm64_writer_put_xpaci_reg().