Contemporary computer systems are extremely powerful and most complex components and libraries are built like a luxury car: they include a lot of comfort and safety technologies which are designed to improve the life of the user of said components. This also facilitates code reuse via modular programming and generally improves maintainability.
Unfortunately, these complex structures, the improved comfort for the library user, and the commendable flexibility have a flip side: they lead to a lot of additional work at runtime! You first fill and then parse complex data structures, and this takes time. You often produce a lot of information on the low levels which is just not used on higher levels, and this work is also not free.
This new validator is built differently. It only keeps around the indispensable minimum of the information needed to prove (or disprove) that code is safe. Similarly to how an F1 car uses custom-designed car seats, we use custom-designed data structures to push the data from one point of the validator to another. We only collect the bare minimum of the information (and perhaps a little bit besides that to make testing possible), and if the requirements change we often change all the pieces: from the gen_dfa input data format to the highest-level dfa_validate_32.c/dfa_validate_64.c external API adapters.
This streamlining was one of the most important design goals of the new validator. And indeed the code which reaches the CPU is very simple: it does not contain complex data structures and multilayered functions, while all the previous validators had many layers and quite a few complex data structures. How can it be? Were all these structures superfluous and unnecessary? Well… not really. The new validator throws away all that complexity and trades it for a few comparisons and jumps. Tens of thousands of comparisons and a similar number of jumps, to be exact. In a single flat function. Basically we trade runtime complexity for build-time complexity.
But note that build-time complexity ≠ source code complexity. Since our goal is to produce an extremely fast validator, not an extremely complicated validator with impenetrable source, we try to keep its source as simple as feasible. To achieve this we employ ragel and our own generator of ragel code. Why two levels of indirection? Ragel is an industry-standard tool for DFA generation (you can find it in most Linux distributions, its Wikipedia article was added in 2006, etc.) and our generator is used to produce ragel definitions from a textual description of the x86 instruction set. Said textual description uses a form which is pretty close to what you can find in the AMD manual (Intel manuals use similar acronyms, but they take a significantly different approach to describing VEX-encoded commands). The initial goal was to use snippets from the manual, but this proved to be infeasible in a few cases because the manual is designed to be read by humans. For example, POPF/D/Q Fv describes an instruction which has different names in legacy 16bit mode, 32bit ia32 mode and 64bit x86-64 mode, while STOSW/D/Q Yv, rAX is an instruction which is called differently if different prefixes are used. The fact that you have access to all three forms of STOSW/D/Q in x86-64 mode but can only use POPF and POPFQ is not reflected anywhere. To solve this problem we use a slightly more formalized (and thus machine-parseable) description, but, thankfully, such cases are rare, thus most commands in our tables are described exactly as they are described in the AMD manual.
If you know how Ragel works, or if the phrase “Ragel is a compiler of finite state machines and it can produce not just finite-state automata but finite state transducers and we use this capability in our work” makes perfect sense to you, then you can skip the next [optional] part.
I won't explain what a DFA is (it's explained in CS courses you took years back… or you can refresh your knowledge on Wikipedia). But I'll explain a little about Ragel's take on finite state transducers. Extensive documentation with all the gory details is on Ragel's site, but while it explains how to use Ragel it does not explain what it is and why you might want to use it.
Let's start with the first question: what it is. Ragel is a compiler of DFA machines… but with a twist. You describe a DFA structure using a simple RE-style format and Ragel generates the corresponding code in C (D/Go/Java/Ruby/etc: Ragel supports a lot of languages, but we are interested in C here). When you describe the DFA you just write acceptable bytes and then use the following operations: concatenation (“1 . 2” will accept “1” followed by “2”), union (“1 | 2” will accept either “1” or “2”), intersection (“('a'..'n') & ('m'..'z')” will accept either “m” or “n”), difference (“('a'..'n') - ('m'..'z')” will accept everything between “a” and “l”, but will not accept either “m” or “n”) and Kleene star (“(1 | 2)*” will accept any number of “1” or “2”).
These operations can produce quite non-trivial results: e.g. “("b" . ("aa"+ | "aaa"+))*” will produce the following DFA:
If, instead of “("aa"+ | "aaa"+)” in the example above, you use something like “("a"{5}+ | "a"{7}+ | "a"{11}+)” then the resulting DFA will include almost four hundred nodes and over five hundred transitions! This limits the applicability of DFA technology: e.g. it's possible to describe a “valid code sequence” (including bundles, “restricted registers” and everything else) as a DFA, but… said DFA will include millions of nodes and billions of transitions!
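The node counts above can be estimated by hand. A DFA for a union of repeated blocks has to remember the number of “a”s seen so far modulo every block length at once, so the run of “a”s needs roughly as many states as the least common multiple of the lengths. The sketch below is only that back-of-the-envelope estimate, not something Ragel computes; the helper name estimate_run_states is made up for the illustration.

```c
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned long gcd_ul(unsigned long a, unsigned long b) {
  while (b != 0) {
    unsigned long t = a % b;
    a = b;
    b = t;
  }
  return a;
}

/*
 * Hypothetical helper: least common multiple of the block lengths.
 * This approximates the number of DFA states needed to track a run of
 * "a"s for ("a"{p1}+ | "a"{p2}+ | ...), because the automaton must know
 * the position inside every block simultaneously.
 */
unsigned long estimate_run_states(const unsigned long *periods, size_t count) {
  unsigned long result = 1;
  size_t i;
  for (i = 0; i < count; i++) {
    result = result / gcd_ul(result, periods[i]) * periods[i];
  }
  return result;
}
```

For the lengths 5, 7 and 11 the estimate is 385, which agrees with “almost four hundred nodes”; for the original lengths 2 and 3 it is just 6.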
2.1. Ragel actions.
To overcome this problem Ragel offers so-called "actions": pieces of code which are called when certain pieces in DFA are reached. E.g. we can mark begin and end of “aa” (or “aaa”) in the example above—“("b" . (("aa" >begin @end)+ | ("aaa" >begin @end)+ ))*” produces the following DFA:
Let's see what happens if we'll feed it with “baaaaaaaaa” sequence:
- offset 0: nothing
- offset 1: begin
- offset 2: end
- offset 3: begin then end
- offset 4: end then begin
- offset 5: begin
- offset 6: end
- offset 7: begin
- offset 8: end
- offset 9: begin then end
Hmm. Something is wrong here: why do we have so many begin's and end's?!! Let's try to change the DFA a bit: “("b" . (("aa" >begin2 @end2)+ | ("aaa" >begin3 @end3)+ ))*” produces the following DFA:
This time we have:
- offset 0: nothing
- offset 1: begin2 then begin3
- offset 2: end2
- offset 3: begin2 then end3
- offset 4: end2 then begin3
- offset 5: begin2
- offset 6: end2 then end3
- offset 7: begin2 then begin3
- offset 8: end2
- offset 9: begin2 then end3
Ah-ha. Now everything is clear. A DFA is a DFA: it does not support memory and it does not support rollbacks. This means that our DFA is processing two branches simultaneously: both “"aa"+” and “"aaa"+”. We'll need to keep this in mind. Another couple of observations:
- When we used just the begin action, the action begin was called once, but when we split it in two (begin2 and begin3) both are called! By default Ragel merges actions.
- Actions are called in non-random order. Take a look at offset 4: end2 is called before begin3. That's because begin3 has lower priority than end2! Note that in the previous example this same effect was observed, but it was quite mysterious there. The closer the action is to the beginning of the source file the higher its priority.
Here is the build diagram:
*.def files contain instruction definitions taken almost verbatim from the AMD instruction manual. They are parsed by gen_dfa, which in turn produces ragel definitions of a regular language of all instructions (validator_x86_32_instruction.rl). This regular language (machine in ragel terms) is used as a building block to define the language of all 'valid' bundles (give or take some subtle details we will discuss later). The language of valid bundles is defined in validator_x86_32.rl.
To understand how the validator works it's best to start from the function ValidateChunkIA32 in validator_x86_32.rl. Said function is very short and “simple”: it allocates a couple of arrays (valid_targets and jump_dests), then cycles over code passed to it (processing it in bundle-sized chunks) and at the end it compares valid jump targets and collected jump destinations… that's it. Oh, and it also includes a couple of cryptic lines right in the middle of the innermost cycle:
%% write init;
%% write exec;
These lines instruct ragel to insert DFA code (in C) here. The resulting output will go to the file validator_x86-32.c, which performs actual validation.
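The final comparison of valid jump targets against collected jump destinations amounts to a subset check on two bitmaps. The sketch below shows that idea under stated assumptions: the bitmap_word type is taken to be a 32-bit word, and the helper names bitmap_set and all_jumps_land_on_instructions are hypothetical, not the validator's actual functions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t bitmap_word; /* assumed word size for this sketch */
enum { SKETCH_WORD_BITS = 32 };

/* Sets one bit; bit N stands for byte offset N in the code chunk. */
void bitmap_set(bitmap_word *bitmap, size_t bit) {
  bitmap[bit / SKETCH_WORD_BITS] |= (bitmap_word)1 << (bit % SKETCH_WORD_BITS);
}

/*
 * Returns 1 if every collected jump destination is also a valid target
 * (jump_dests is a subset of valid_targets), 0 otherwise.
 */
int all_jumps_land_on_instructions(const bitmap_word *valid_targets,
                                   const bitmap_word *jump_dests,
                                   size_t words) {
  size_t i;
  for (i = 0; i < words; i++) {
    /* Any bit set in jump_dests but clear in valid_targets is a bad jump. */
    if ((jump_dests[i] & ~valid_targets[i]) != 0) {
      return 0;
    }
  }
  return 1;
}
```

Working a word at a time keeps the final pass cheap: the cost is proportional to the chunk size divided by the word width, regardless of how many jumps were seen.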
Our main DFA is “(one_instruction | special_instruction)*”, i.e. it accepts sequences of “normal” instructions and “special” instructions.
It consumes input byte by byte from the current_position pointer until one of the ending conditions is met (one of them is current_position == end_position).
Note that even if the automaton leaves prematurely (before the end of the bundle), validation goes on from the beginning of the next bundle. If one bundle is rejected then the whole chunk is always rejected, but this approach makes it possible to diagnose more errors in one pass, which helps while code is being developed.
Apparently the collection of valid jump targets and actual target destinations happens inside this automaton. How?
There are three “special” instructions in the IA32 case: naclcall, nacljmp and mov %gs:0x0/0x4,%reg (the public ABI allows read-only access to %gs:0, and read-only access to %gs:4 is allowed for IRT). The last one is declared as a “special” instruction to simplify the validation logic (and DFA, too): instead of accepting all versions of the mov %gs:something,%reg instruction followed by additional logic which rejects most possibilities (only plain vanilla “zero” is allowed here as per the ABI) we only describe this one version of the instruction and ragel does the rest. naclcall and nacljmp are two-instruction sequences: and $~0x1f, %eXX followed by call %eXX or jmp %eXX. The first instruction (and $~0x1f, %eXX) can also be used as a “normal” instruction.
Just like in the example above there are two actions. The “generic cleanup” one (end_of_instruction_cleanup) is triggered at the instruction end (“normal” or “special”): it's used to remember the beginning of the instruction, to clear instruction_info_collected, and to mark the first byte of the instruction as a valid target for a direct jump. The second action is triggered at the final byte of naclcall or nacljmp: it expands the boundaries of the instruction to cover both “component” instructions of naclcall or nacljmp and also marks the instruction as “special”. Note that this naclcall/nacljmp-specific action is textually placed above the end_of_instruction_cleanup action in the file, and that means that it'll be processed first (which is important because we don't want to mark call/jmp in naclcall/nacljmp as a valid jump target!). This guarantees that end_of_instruction_cleanup will not mark the start of the and $~0x1f, %eXX as a valid jump target.
There is one additional action which is declared as “$err”. This is the error fallback action: it's called whenever there is no transition for a particular byte in our DFA. This means we've hit either a forbidden instruction like lgdt or some undefined byte sequence… in both cases the UNRECOGNIZED_INSTRUCTION error is reported and processing is stopped.
This explains how the valid_targets array is filled and invalid instructions are rejected.
But of course there is jump_dests, too. Special instructions don't touch it, but something obviously fills the array. This can only be the result of processing of normal instructions, thus we need to go deeper. Where does it all come from? To understand that we need to look at the [autogenerated] validator_x86_32_instruction.rl file. It looks like this:
one_instruction =
  (branch_hint? 0x77 rel8) |
  (branch_hint? (0x0f 0x87) rel32) |
  ((0x0f 0x01 0xd0) @CPUFeature_FXSR)
  ;
0x77 and 0x0f 0x87 are opcodes for the ja (aka jnbe) instruction, but what are branch_hint? and rel8/rel32 doing here? Well, “?” means “optional” (like in most RE-engines) and both the branch_hint and rel8/rel32 definitions are references to machines defined in the parse_instruction.rl file. The whole construct describes the part of the DFA which is designed to accept the ja (aka jnbe) instruction, complete with the optional P4-inspired branch prediction prefix. The definition of branch_hint is trivial and obvious (“branch_hint = 0x2e | 0x3e;” if you want to know), but rel8/rel32 are somewhat more “interesting”:
rel8 = any @rel8_operand;
rel32 = any{4} @rel32_operand;
rel8_operand/rel32_operand are not present in validator_x86_32_instruction.rl, they are in the parse_instruction.rl file! But the definition itself is pretty trivial: they just call simple functions from validator_internal.h:
action rel8_operand {
  Rel8Operand(current_position + 1, codeblock, jump_dests, size,
              &instruction_info_collected);
}
action rel32_operand {
  Rel32Operand(current_position + 1, codeblock, jump_dests, size,
               &instruction_info_collected);
}
static FORCEINLINE int MarkJumpTarget(size_t jump_dest,
                                      bitmap_word *jump_dests,
                                      size_t size) {
  if ((jump_dest & kBundleMask) == 0) {
    return TRUE;
  }
  if (jump_dest >= size) {
    return FALSE;
  }
  BitmapSetBit(jump_dests, jump_dest);
  return TRUE;
}
static FORCEINLINE void Rel8Operand(const uint8_t *rip,
                                    const uint8_t codeblock[],
                                    bitmap_word *jump_dests,
                                    size_t jumpdests_size,
                                    uint32_t *instruction_info_collected) {
  int8_t offset = rip[-1];
  size_t jump_dest = offset + (rip - codeblock);
  if (MarkJumpTarget(jump_dest, jump_dests, jumpdests_size))
    *instruction_info_collected |= RELATIVE_8BIT;
  else
    *instruction_info_collected |= RELATIVE_8BIT | DIRECT_JUMP_OUT_OF_RANGE;
}
static FORCEINLINE void Rel32Operand(const uint8_t *rip,
                                     const uint8_t codeblock[],
                                     bitmap_word *jump_dests,
                                     size_t jumpdests_size,
                                     uint32_t *instruction_info_collected) {
  int32_t offset =
      rip[-4] + 256U * (rip[-3] + 256U * (rip[-2] + 256U * (rip[-1])));
  size_t jump_dest = offset + (rip - codeblock);
  if (MarkJumpTarget(jump_dest, jump_dests, jumpdests_size))
    *instruction_info_collected |= RELATIVE_32BIT;
  else
    *instruction_info_collected |= RELATIVE_32BIT | DIRECT_JUMP_OUT_OF_RANGE;
}
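If the jump destination is not bundle-aligned and lies outside of the code block, the instruction is flagged with DIRECT_JUMP_OUT_OF_RANGE.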
While the validator for ia32 mode is very simple and short (it also produces pretty compact code), the validator for x86-64 mode is different. It still has all the same properties the validator for ia32 mode had (valid_targets and jump_dests arrays, “normal” and “special” instructions, bundles and rel8_operand/rel32_operand actions), but it adds quite a few additional twists to the whole scheme.
It's created in a process which is similar to the process which creates the ia32 validator.
First of all: the ia32 mode validator had one DFA in it and two arrays which kept track of the instruction boundaries, but x86-64 has a few more state variables. Most of them (rex_prefix, vex_prefix2, vex_prefix3, operand_states, base, and index) keep track of the instruction parts (and thus they are cleared before each instruction), but one variable called restricted_register is used to tie different instructions together. As the name implies it keeps track of the restricted register (if any). The restricted register in the NaCl SFI model on x86-64 systems is a general purpose register which has the top 32 bits cleared. Note that not all restricted registers are born equal: most registers can be restricted and then forgotten (if you write to %eax and do nothing with the value before call then nothing problematic or dangerous can ever happen), but %esp and %ebp are exceptions. If you write to %esp then the very next instruction must be add %r15,%rsp or lea (%r15,%rsp,1),%rsp; %rbp has similar requirements. This means that if, at the end of a bundle, the restricted register is %rsp or %rbp, then the program is invalid. For the same reason, if, at the beginning of a normal instruction (this includes the first instruction in the “compound”), we see the restricted %rsp or %rbp, then it's an error too. On the other hand, the few rare special instructions which are used to restore the SFI invariant WRT %rsp or %rbp will only be accepted if the restricted register is either %rsp or %rbp (depending on the special instruction).
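The rules for a restricted %rsp or %rbp described above boil down to two small predicates. The following is a simplified model, not the validator's code: the enum SketchRegister, its numeric values and both function names are assumptions chosen for the illustration (only the names NO_REG, REG_RAX, REG_RSP and REG_RBP appear in the validator itself).

```c
/* Hypothetical subset of register names; the numeric values are assumed. */
enum SketchRegister { NO_REG, REG_RAX, REG_RSP, REG_RBP };

/*
 * A normal instruction may start (and a bundle may end) only when no
 * 32-bit write to the stack or frame pointer is still pending.
 * Other restricted registers can simply be forgotten.
 */
int restricted_state_allows_normal(enum SketchRegister restricted) {
  return restricted != REG_RSP && restricted != REG_RBP;
}

/*
 * A fix-up instruction such as add %r15,%rsp is accepted only when the
 * register it repairs is exactly the one that is currently restricted.
 */
int restricted_state_allows_fixup(enum SketchRegister restricted,
                                  enum SketchRegister repaired) {
  return (repaired == REG_RSP || repaired == REG_RBP) &&
         restricted == repaired;
}
```

In this model a write to %esp sets the state to REG_RSP, after which only the matching fix-up passes; a write to %eax sets REG_RAX, after which any normal instruction or a bundle end is fine.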
The hard part is, as before, in the DFA. First of all, the main machine is similar to what we had in ia32 mode, but subtly different: it's “(normal_instruction | special_instruction)*” now. I.e.: one_instruction is replaced with normal_instruction. And what is normal_instruction? Why, it's “one_instruction - special_instruction”, of course! Well… this is unexpected: why would we want to remove special_instructions from normal_instructions only to add them back? The answer is related to actions: recall how actions work. When we remove special_instruction from one_instruction we also remove the associated actions. This is important in the x86-64 case because some special instructions are just normal instructions which are permitted to violate the usual rules! E.g. the “special” instruction and $~0x1f,%rsp (which is used to align the stack pointer) changes %rsp directly, which is usually forbidden, but because of the properties of and $xxx,… (for any $xxx < 0) we know that invariants will not be violated.
This approach works well, but only if violations are detected at the instruction end. E.g. the aforementioned and $~0x1f,%rsp instruction is encoded as 0x48 0x83 0xe4 0xe0 and after we've read 0x48 0x83 0xe4 we already know it's a normal instruction (opcode 0x83 means it's and) which writes to %rsp (0x48 opcode 0xe4 means it's some instruction which accepts some kind of immediate and writes to %rsp). If we signal the error at this point, the fact that we later find out it's a special_instruction, which is accepted anyway, will not matter: the SP_MODIFIED error will be triggered, which means the code is rejected!
This means that we cannot do the actual condition checking till the very end of the normal instruction (we can try to process some, but not all, of them, but this approach will be quite complex and fragile, which is not something you want in the most critical security piece). There is an exception: memory access. This one is checked inline: memory access outside of the “40GiB safe area” is strictly forbidden no matter how “special” the instruction is. That's why it's checked immediately after operands discovery. This is what the relevant fragment for the and instruction looks like:
(0x83 (opcode_4 any* & operand_disp @check_access) imm8 @process_0_operand) |
(0x83 (opcode_4 any* & operand_rip @check_access) imm8 @process_0_operand) |
(REX_B? 0x83 (opcode_4 any* & single_register_memory @check_access) imm8 @process_0_operand) |
(REX_X? 0x83 (opcode_4 any* & operand_sib_pure_index @check_access) imm8 @process_0_operand) |
(REX_XB? 0x83 (opcode_4 any* & operand_sib_base_index @check_access) imm8 @process_0_operand) |
(lock 0x83 (opcode_4 any* & operand_disp @check_access) imm8 @process_0_operand) |
(lock 0x83 (opcode_4 any* & operand_rip @check_access) imm8 @process_0_operand) |
(lock REX_B? 0x83 (opcode_4 any* & single_register_memory @check_access) imm8 @process_0_operand) |
(lock REX_X? 0x83 (opcode_4 any* & operand_sib_pure_index @check_access) imm8 @process_0_operand) |
(lock REX_XB? 0x83 (opcode_4 any* & operand_sib_base_index @check_access) imm8 @process_0_operand) |
(REX_B? 0x83 (opcode_4 @operand0_32bit any* & modrm_registers @operand0_from_modrm_rm) imm8 @process_1_operand) |
check_access is triggered after parsing ModRM/SIB bytes, but before parsing the immNN field, while the process_N_operands action is triggered at the very end of the “normal” instruction. Even if the instruction does not use the immNN field, the check_access action is still triggered before the process_N_operands action. This is important because the check_access action actually depends on the previous state of the restricted_register variable while the process_N_operands action changes the restricted_register variable. Note that it's only triggered for “normal” instructions: “special” instructions either do the work themselves (e.g. add %r15,%rsp, which is only valid if the previous state of the restricted_register variable was REG_RSP and changes it to NO_REG in case of success) or call the usual process_N_operands action (e.g. mov %rsp,%rbp calls process_0_operands, which ensures that this operation is not called when restricted_register is set to the REG_RSP/REG_RBP state and transitions it to the NO_REG state).
You can find yet another surprising thing in the snippet above: the and instruction is handled either as an instruction with zero operands or as an instruction with one operand… but of course in reality it always has two operands! Something is strange here… Well, sure: the decoder part of the validator is as streamlined as possible. We just ignore all non-register arguments and arguments which are not written to (but we don't ignore memory accesses if they happen here, of course). That's why and has either one or zero operands as far as the validator is concerned.
And, finally, there is a twist related to “superinstructions” (sequences of normal instructions followed by a “dangerous” instruction like jmp *%rax or maskmovq): in the ia32 case we only had naclcall and nacljmp, they both included two instructions and we only needed to avoid marking the second one as a “valid jump target”. In x86-64 mode “superinstructions” can be much longer (they can include two, three, or even five instructions!) thus the simple approach (don't mark the beginning of the “dangerous” superinstruction as a valid jump target) does not work. Instead we mark all bytes of the superinstruction as “invalid jump target” using the UnmarkValidJumpTargets function. Note that we also need to mark the beginning of a regular instruction as “invalid jump target” if said instruction uses restricted_register: that's why we mark the first byte of the next instruction as the “valid jump target” in end_of_instruction_cleanup, not the first byte of the current instruction. Note that even invalid instructions would be marked as valid jump targets in this scheme, but we don't care about this peculiarity because the validation result will be negative anyway.
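Clearing a whole superinstruction from the set of valid jump targets is a range operation on a bitmap. The sketch below illustrates the idea only: the real UnmarkValidJumpTargets signature is not shown in this text, so the function names, the 32-bit word size and the (start, length) parameters here are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_BITS = 32 }; /* assumed bitmap word width */

/* Marks the byte at the given offset as a valid jump target. */
void sketch_mark_target(uint32_t *valid_targets, size_t offset) {
  valid_targets[offset / SKETCH_BITS] |= (uint32_t)1 << (offset % SKETCH_BITS);
}

/* Returns 1 if the byte at the given offset is a valid jump target. */
int sketch_is_target(const uint32_t *valid_targets, size_t offset) {
  return (valid_targets[offset / SKETCH_BITS] >> (offset % SKETCH_BITS)) & 1;
}

/*
 * Hypothetical counterpart of UnmarkValidJumpTargets: clears the bits for
 * every byte in [start, start + length), so that no jump may land anywhere
 * inside the superinstruction, including the starts of its components.
 */
void sketch_unmark_targets(uint32_t *valid_targets, size_t start,
                           size_t length) {
  size_t i;
  for (i = start; i < start + length; i++) {
    valid_targets[i / SKETCH_BITS] &= ~((uint32_t)1 << (i % SKETCH_BITS));
  }
}
```

A bit-by-bit loop is enough for a sketch because superinstructions are short; a production version would more likely clear whole masks at once.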
Operands handling is, again, not that complex… if you are familiar with bit operations. An initial version of the validator used a simple array of records to store the information and everything worked well… with GCC, that is. MSVC produced awful code which was almost 30% slower and also needed twenty minutes to compile, thus we replaced this simple version with the current macro-based one.
All the information about encountered operands is collected in a single scalar variable operand_states. The layout of said variable looks like this:
Bits | Field
63..39 | padding
38..37 | operand4: register_type
36..32 | operand4: register_name (bit 36 is 0 if normal register)
31 | padding
30..29 | operand3: register_type
28..24 | operand3: register_name (bit 28 is 0 if normal register)
23 | padding
22..21 | operand2: register_type
20..16 | operand2: register_name (bit 20 is 0 if normal register)
15 | padding
14..13 | operand1: register_type
12..8 | operand1: register_name (bit 12 is 0 if normal register)
7 | padding
6..5 | operand0: register_type
4..0 | operand0: register_name (bit 4 is 0 if normal register)
Register names are defined in the register_name enum: the first 16 are identical to the AMD/Intel names (from REG_RAX to REG_R15) while the other 16 are used (partially) to describe non-register operands (memory operand, immediate operand, REG_RIP and REG_RIZ, etc). This means that if the operand's name is >15 then it can be ignored. There are only four operand types: OperandSandboxIrrelevant, OperandSandbox8bit, OperandSandboxRestricted, and OperandSandboxUnrestricted. The first type is something not related to a general purpose register (x87, MMX, XMM, or YMM registers fall into this category). We need to handle 8bit operands specially because they are finicky: if a REX byte is used they access %spl, %bpl, %sil, and %dil, but when a REX byte is not used the same numbers are reused for %ah, %ch, %dh, and %bh! The last two types are the most important: these are 32bit operands (which will make the appropriate register “restricted”) or 16bit/64bit operands (these may affect the register in question negatively if it's %rbp, %rsp, or %r15, but for other registers these are just ignored). Note that if you assign 0 to this variable then all operands will be of the OperandSandboxIrrelevant type.
Now the set of macros used to work with operands should look less mysterious:
#define SET_OPERAND_NAME(INDEX, REGISTER_NAME) \
  operand_states |= ((REGISTER_NAME) << ((INDEX) << 3))
#define SET_OPERAND_FORMAT(INDEX, FORMAT) \
  SET_OPERAND_FORMAT_ ## FORMAT(INDEX)
#define SET_OPERAND_FORMAT_OPERAND_FORMAT_8_BIT(INDEX) \
  operand_states |= OPERAND_SANDBOX_8BIT << (5 + ((INDEX) << 3))
#define SET_OPERAND_FORMAT_OPERAND_FORMAT_16_BIT(INDEX) \
  operand_states |= OPERAND_SANDBOX_UNRESTRICTED << (5 + ((INDEX) << 3))
#define SET_OPERAND_FORMAT_OPERAND_FORMAT_32_BIT(INDEX) \
  operand_states |= OPERAND_SANDBOX_RESTRICTED << (5 + ((INDEX) << 3))
#define SET_OPERAND_FORMAT_OPERAND_FORMAT_64_BIT(INDEX) \
  operand_states |= OPERAND_SANDBOX_UNRESTRICTED << (5 + ((INDEX) << 3))
#define CHECK_OPERAND(INDEX, REGISTER_NAME, KIND) \
  ((operand_states & (0xff << ((INDEX) << 3))) == \
   ((((KIND) << 5) | (REGISTER_NAME)) << ((INDEX) << 3)))
Calls like SET_OPERAND_NAME(0, REG_RAX) are used by actions to set the name of the operand (this particular one is used by the operand0_rax action) while calls like SET_OPERAND_FORMAT(0, OPERAND_FORMAT_2_BIT) are used by actions to set the type of the operand (this particular one is used by the operand0_2bit action). Note that we don't handle 2bit operands in the set of macros above. This is not a mistake: 2bit operands are only ever used as immediate operands (and then only in two instructions: vpermil2pd and vpermil2ps) and we don't process immediate operands here. If they are for some reason left in the validator_x86_64_instruction.rl file this will lead to a compile-time error, not to some kind of weird overflow which may [potentially] produce a security hole.
Almost all manipulations with operand_states are done using the macros described above, but there is one more construct which accesses operand_states directly:
#define CHECK_OPERAND_RESTRICTED(INDEX) \
  /* Take 2 bits of operand type from operand_states as *restricted_register */\
  /* and also make sure operand_states denotes a register (4th bit == 0). */\
  (operand_states & (0x70 << ((INDEX) << 3))) == \
    (OPERAND_SANDBOX_RESTRICTED << (5 + ((INDEX) << 3)))
If you recall the layout of operand_states then it's pretty easy to understand what goes on here: (operand_states & (0x70 << ((INDEX) << 3))) == (OPERAND_SANDBOX_RESTRICTED << (5 + ((INDEX) << 3))) yields TRUE if and only if the operand with the given INDEX is a “normal” register and it's of type OperandSandboxRestricted. This is actually the central piece of the restricted_register handling: most other pieces just return it back to the NO_REG state.
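The packing scheme can be tried out with plain functions instead of macros. This is a minimal sketch under stated assumptions: the text only says that 0 means OperandSandboxIrrelevant, so the other three type values below are guesses, SKETCH_MEMORY_OPERAND is a made-up name for “some name above 15”, and the function names are hypothetical. The register numbers follow the usual AMD/Intel encoding. A 64-bit type is used throughout so that operand 4 can be shifted safely.

```c
#include <stdint.h>

/* Only IRRELEVANT == 0 comes from the text; the other values are assumed. */
enum SketchOperandKind {
  SKETCH_IRRELEVANT = 0,
  SKETCH_8BIT = 1,
  SKETCH_RESTRICTED = 2,
  SKETCH_UNRESTRICTED = 3
};

/* Register numbers as in the AMD/Intel encoding; 16 is a made-up
 * stand-in for a non-register operand name. */
enum { SKETCH_RAX = 0, SKETCH_RSP = 4, SKETCH_R15 = 15,
       SKETCH_MEMORY_OPERAND = 16 };

/* Stores name (5 bits) and kind (2 bits) in the byte of operand `index`. */
uint64_t sketch_set_operand(uint64_t states, unsigned index, unsigned name,
                            unsigned kind) {
  return states | ((uint64_t)name << (index * 8))
                | ((uint64_t)kind << (5 + index * 8));
}

/* Same test as CHECK_OPERAND_RESTRICTED: the two type bits must say
 * "restricted" and the top bit of the name must be 0 (a real register). */
int sketch_operand_is_restricted(uint64_t states, unsigned index) {
  return (states & ((uint64_t)0x70 << (index * 8))) ==
         ((uint64_t)SKETCH_RESTRICTED << (5 + index * 8));
}
```

The mask 0x70 covers the two type bits plus the high bit of the name, which is why a 32bit memory operand (name above 15) fails the test even though its type is “restricted”.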
CPUID support.
CPUID support is implemented using a large set of actions embedded in the definition of instructions (see, e.g. @CPUFeature_FXSR in the line for instruction 0x0f 0x01 0xd0 AKA xgetbv). CPUID-related actions are triggered when we know the identity of the instruction (which happens at different times for different instructions: some instructions are detected when the opcode is read, some use an opcode extension, etc.; AMD/Intel manuals contain all the gory details), but the definitions of said actions in validator_x86_32_instruction.rl are very simple:
action CPUFeature_FXSR {
  SET_CPU_FEATURE(CPUFeature_FXSR);
}
SET_CPU_FEATURE lives in validator_internal.h, where it is defined as:
if (!(FEATURE(kValidatorCPUIDFeatures.data))) { \
  instruction_info_collected |= UNRECOGNIZED_INSTRUCTION; \
} \
if (!(FEATURE(cpu_features->data))) { \
  instruction_info_collected |= CPUID_UNSUPPORTED_INSTRUCTION; \
}
CPUFeature_FXSR is not the name of a variable, but the name of a macro. This is needed to handle special cases where a CPUFeature does not correspond to a single CPUID bit. E.g. the prefetch instruction is available when either of two bits is set: the 3DNow! bit or the dedicated Prefetch instruction bit. AMD documentation also claims prefetch is always available if the LongMode bit is set, but Intel documentation does not support this assertion. On the other hand vaesenc is available when both the AES and AVX bits are set. And our ABI permits lzcnt and tzcnt unconditionally (thus CPUFeature_LZCNT does not check for anything but just returns TRUE in all cases).
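The reason features are macros rather than plain bit indices can be shown with a small sketch. Everything here is illustrative: the bit indices, the SketchFeature_* names, the SKETCH_HAS_FEATURE wrapper and the three helper functions are assumptions, and the real feature table layout is not shown in this text. Only the pattern of passing a macro name and applying it to a data array mirrors SET_CPU_FEATURE.

```c
#include <stdbool.h>

/* Hypothetical feature-bit indices for this sketch only. */
enum { BIT_3DNOW, BIT_PREFETCH, BIT_AES, BIT_AVX, BIT_COUNT };

/* Each feature is a macro over the data array, so it can combine bits. */
#define SketchFeature_Prefetch(DATA) ((DATA)[BIT_3DNOW] || (DATA)[BIT_PREFETCH])
#define SketchFeature_AESAVX(DATA) ((DATA)[BIT_AES] && (DATA)[BIT_AVX])
/* Always permitted, whatever the CPU reports. */
#define SketchFeature_LZCNT(DATA) ((void)(DATA), true)

/* The feature name is passed as a macro and applied to the data array,
 * the same way FEATURE(...) is applied inside SET_CPU_FEATURE. */
#define SKETCH_HAS_FEATURE(FEATURE, DATA) (FEATURE(DATA))

bool sketch_prefetch_allowed(const bool *data) {
  return SKETCH_HAS_FEATURE(SketchFeature_Prefetch, data);
}

bool sketch_vaesenc_allowed(const bool *data) {
  return SKETCH_HAS_FEATURE(SketchFeature_AESAVX, data);
}

bool sketch_lzcnt_allowed(const bool *data) {
  return SKETCH_HAS_FEATURE(SketchFeature_LZCNT, data);
}
```

With this arrangement an “either bit” feature, a “both bits” feature and an “always on” feature all go through the same call site, and the compiler can fold the constant case away.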
Note: there are two CPUID masks: a hardcoded one (it can be replaced if you link in a different definition of the validator_cpuid_features global variable in your program) and a runtime-supplied one (usually obtained from an actual CPUID call in production, but hardcoded in tests). New instructions are first added in the “production disabled” mode and must pass a security review before they can be used in Chrome.
Dynamic code modification support is implemented with the help of the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION option. Normally the user callback is only used when some kind of error is detected, but if this option is used then the callback is called after each instruction. When that happens, the callback has all the information needed to process the instruction: collected errors, information about immediates, etc.
All that information is squeezed into the instruction_info_collected variable. It has the following format:
Bits | Meaning
3..0 | Cumulative size of any fields.
4 | Instruction has two immediates.
6..5 | Instruction displacement size.
7 | Instruction has relative offset.
12..8 | ia32 mode: reserved; amd64 mode: Register, zero-extended by the instruction.
13 | ia32 mode: reserved; amd64 mode: Instruction is valid, but it accesses memory using a register which is zero-extended by the previous instruction.
14 | DFA error: invalid instruction. Validation then resumes from the next bundle.
15 | Unaligned direct jump to address outside of given region.
16 | Instruction is not supported for a given CPUID mask.
17 | ia32 mode: reserved; amd64 mode: Base register is not %rbp, %rsp, or %r15.
18 | ia32 mode: reserved; amd64 mode: Index register is not zero-extended by previous instruction.
19 | ia32 mode: reserved; amd64 mode: %rbp/%rsp sandboxing detected. Next two bits reveal details of the error.
21..20 | ia32 mode: reserved; amd64 mode (only if some %rbp/%rsp related error is detected):
  00: Instruction which zero-extends %rbp must be followed by add %r15,%rbp, lea (%rbp,%r15,1),%rbp, or lea 0x0(%rbp,%r15,1),%rbp.
  01: add %r15,%rbp, lea (%rbp,%r15,1),%rbp, or lea 0x0(%rbp,%r15,1),%rbp is used after instruction which does not zero-extend %rbp.
  10: Instruction which zero-extends %rsp must be followed by add %r15,%rsp or lea (%rsp,%r15,1),%rsp.
  11: add %r15,%rsp or lea (%rsp,%r15,1),%rsp is used after instruction which does not zero-extend %rsp.
┊ | ┊ | ┊ | ┊ | ┊ | ┊ | ┊ | ┊ | ┊ | └ %r15b , %r15w , %r15d , or %r15 is modified. %r15 is untouchable in amd64 mode. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | ┊ | ┊ | ┊ | ┊ | ┊ | ┊ | └ ia32 mode: reserved; amd64 mode: %bpl , %bp , or %rbp is incorrectly modified. Only %rbp can be modified and then only by special instructions. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | ┊ | ┊ | ┊ | ┊ | ┊ | └ ia32 mode: reserved; amd64 mode: %spl , %sp , or %rsp is incorrectly modified. Only %rsp can be modified and then only by special instructions. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | ┊ | ┊ | ┊ | ┊ | └ Bad call alignment: call must end at the end of the bundle, since nacljmp only can jump to aligned address. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | ┊ | ┊ | ┊ | └ Reserved. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | ┊ | ┊ | └ ia32 mode: reserved; amd64 mode: Instruction is modifiable. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | ┊ | └ Special instruction (uses different validation rules from the regular instruction). Can not be changed in ia32bit mode. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | ┊ | └ Last byte is not immediate. It's either opcode, register number or register number and two-bit immediate. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
┊ | └ Invalid jump target. When this flag is set instruction_begin and instruction_end both point to the jump target instruction, not to the jump instruction itself. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
└ Reserved. |
Note that half of the information does not make sense for ia32 mode and is not collected by ValidateChunkIA32.
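As a minimal sketch of how a callback might pick such a word apart, the helpers below mask out one size field and one flag. The numeric values are assumptions made for illustration and should be checked against the validator's own header; ANYFIELDS_SIZE_MASK and both helper functions are hypothetical names, only SPECIAL_INSTRUCTION appears in this document, and the size is assumed to be counted in bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; consult the validator's header for
 * the real constants. ANYFIELDS_SIZE_MASK is a hypothetical name. */
#define ANYFIELDS_SIZE_MASK 0x0000000fu
#define SPECIAL_INSTRUCTION 0x10000000u

/* Cumulative size of the instruction's anyfields (assumed to be in bytes). */
static unsigned anyfields_size(uint32_t validation_info) {
  return (unsigned)(validation_info & ANYFIELDS_SIZE_MASK);
}

/* 1 if the instruction follows special validation rules, 0 otherwise. */
static int is_special_instruction(uint32_t validation_info) {
  return (validation_info & SPECIAL_INSTRUCTION) != 0;
}
```

Every other field can be read the same way: a mask for multi-bit fields, a single-bit test for flags.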
Using this information you can determine if the given instruction follows special rules (only naclcall and nacljmp in ia32 mode; a lot of different commands in amd64 mode: %rbp/%rsp modifications, string instructions, naclcall, and nacljmp), if it includes relative offsets (commands like jcc, jmp, loopcc, or call), displacements (most commands which access memory support displacements), or immediates (immediates are supported by many different commands; they can be combined with a displacement if the command accesses memory). Tests might use the collected information to precisely separate different anyfields (immediates, displacements, relative offsets), but in production only a few bits are used to determine if the instruction can be changed: in ia32 mode only the special instructions naclcall and nacljmp cannot be changed, while the amd64 situation is the opposite: only call and mov instructions can be changed, and only in their anyfields part.
Code replacement is not performed by the ValidateChunk* function directly. Instead it's done by a higher-level function in dfa_validate_*.c.
It calls ValidateChunk* with the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION option to compare lengths of instructions in two fragments in the callback. IA32 mode uses the SPECIAL_INSTRUCTION flag in the callback's validation_info to determine if the instruction can be changed (all non-special instructions are fair game), but in amd64 mode we only allow changes in a few hand-picked instructions (currently call and mov) and they are marked with the MODIFIABLE_INSTRUCTION flag.
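The per-mode rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the code from dfa_validate_*.c: the flag values are assumed, and enum ValidatorMode and instruction_may_be_changed are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag values; only the flag names come from the text. */
#define SPECIAL_INSTRUCTION    0x10000000u
#define MODIFIABLE_INSTRUCTION 0x08000000u

enum ValidatorMode { MODE_IA32, MODE_AMD64 };  /* hypothetical helper type */

/* Returns 1 if code replacement may alter this instruction at all. */
static int instruction_may_be_changed(enum ValidatorMode mode,
                                      uint32_t validation_info) {
  if (mode == MODE_IA32) {
    /* ia32: everything except special instructions is fair game. */
    return (validation_info & SPECIAL_INSTRUCTION) == 0;
  }
  /* amd64: only explicitly marked instructions may change. */
  return (validation_info & MODIFIABLE_INSTRUCTION) != 0;
}
```

The two modes test different flags with opposite polarity: ia32 is a deny-list keyed on SPECIAL_INSTRUCTION, amd64 is an allow-list keyed on MODIFIABLE_INSTRUCTION.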
One tricky thing there is the handling of relative jumps and calls: if a relative jump (or call) triggers DIRECT_JUMP_OUT_OF_RANGE but is bit-for-bit identical to the original instruction, it's accepted anyway: this means that this particular jump (or call) jumps to (or calls) some valid position outside of the given range. If it must be changed then you need to pass a bigger region to the ValidatorCodeReplacement_x86_* function; this way the validator will have a chance to check the landing place for validity (this is, of course, not needed if the landing point is bundle-aligned).
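A minimal sketch of the rule for out-of-range jumps follows. It covers only the check described above and ignores every other error bit; the flag value is an assumption and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRECT_JUMP_OUT_OF_RANGE 0x00008000u  /* assumed value */

/* Returns 1 if an out-of-range report may be ignored for this instruction:
 * either nothing was reported, or the new bytes equal the old ones. */
static int out_of_range_jump_is_acceptable(uint32_t validation_info,
                                           const uint8_t *old_insn,
                                           const uint8_t *new_insn,
                                           size_t length) {
  if ((validation_info & DIRECT_JUMP_OUT_OF_RANGE) == 0)
    return 1;  /* The jump target lies inside the given region. */
  /* An unchanged jump already passed validation with the original code. */
  return memcmp(old_insn, new_insn, length) == 0;
}
```

An unchanged jump is safe because its target was already checked when the original code was validated; a changed one needs the larger region so that its new target can be examined.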
In ia32 mode the whole instruction can be changed, but in amd64 mode we don't allow arbitrary changes to the instruction; we only allow changes to anyfields (immediates, displacements, relative offsets), which is somewhat tricky: most instructions put them at the end, but some instructions use the last byte for:
- an opcode: cmpccsd/vcmpccsd and cmpccss/vcmpccss, and pclmulqdq/vpclmulqdq.
- a register number: vblendvpd/vblendvps, some FMA4 instructions (such as vfmaddsubpd), and some XOP instructions (such as vpperm).
- a register number and a two-bit immediate: vpermil2pd/vpermil2ps.
All these instructions set the LAST_BYTE_IS_NOT_IMMEDIATE flag; the last form can be distinguished because it sets the IMMEDIATE_2BIT flag (which actually includes the LAST_BYTE_IS_NOT_IMMEDIATE flag).
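Because IMMEDIATE_2BIT includes LAST_BYTE_IS_NOT_IMMEDIATE, the order and form of the tests matter. The sketch below classifies the last byte; the numeric values are assumptions (in particular the extra bit that IMMEDIATE_2BIT adds), and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; the only property relied upon is that IMMEDIATE_2BIT
 * is a superset of LAST_BYTE_IS_NOT_IMMEDIATE, as the text states. */
#define LAST_BYTE_IS_NOT_IMMEDIATE 0x20000000u
#define IMMEDIATE_2BIT             0x20000010u

enum LastByteKind {                      /* hypothetical helper type */
  LAST_BYTE_IMMEDIATE,                   /* ordinary anyfield byte */
  LAST_BYTE_OPCODE_OR_REGISTER,          /* must not change at all */
  LAST_BYTE_REGISTER_AND_2BIT_IMMEDIATE  /* only two bits may change */
};

static enum LastByteKind classify_last_byte(uint32_t validation_info) {
  /* Compare against the full mask: a plain "& IMMEDIATE_2BIT" would also
   * fire for instructions that only set LAST_BYTE_IS_NOT_IMMEDIATE. */
  if ((validation_info & IMMEDIATE_2BIT) == IMMEDIATE_2BIT)
    return LAST_BYTE_REGISTER_AND_2BIT_IMMEDIATE;
  if (validation_info & LAST_BYTE_IS_NOT_IMMEDIATE)
    return LAST_BYTE_OPCODE_OR_REGISTER;
  return LAST_BYTE_IMMEDIATE;
}
```

Testing the more specific mask first keeps the two-bit-immediate form from being mistaken for the fully fixed last byte.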
This is done by a very simple function which uses the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION mode to process instructions one after another.
The only remaining issue (but a big one) is the generation of the actual decoders ({decoder,validator}_x86_{32,64}_instruction.rl files). This is a big part of the whole package but, thankfully, it happens in a significantly less hostile environment: the decoder and validator must work even if they are processing specially-crafted files created by clever adversaries, while gen_dfa processes data files created by us and only has to handle certain “good” files correctly.
To understand how it works it's better to start with the decoders. Remember how we've talked about “streamlined data structures”, “indispensable minimum of the information”, etc? This approach produces a fast and [relatively] simple validator, but it makes it hard to test and debug. To facilitate testing and debugging we create separate decoders: these return all the information about all the instructions they can parse and in fact can produce output identical to objdump's output.
They are used to verify the description of the instructions from .def files, with special attention to the length of said instructions.
Decoders are created using the familiar process.
There are a few big differences between standalone decoders and the simplified decoders embedded in ValidateChunkIA32/ValidateChunkAMD64:
- … .def files.
- … struct instruction, common for both decoders.
All these facts mean that standalone decoders are significantly larger and slower, but also much easier to understand. For each regular instruction, the validator DFA and the decoder DFA define exactly the same language and differ only in their actions; thus the validator and the decoder accept the same set of byte sequences as instructions.
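The "same language, different actions" idea can be shown with a toy recognizer. This is not the generated ragel code: the two-instruction byte format, the shared walker, and every name below are assumptions made for illustration. The real validator and decoder are separate generated automata, whereas this sketch reuses one transition function with two action callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy format (illustrative only): 0x90 is a one-byte instruction,
 * 0xb8 is followed by four immediate bytes. */
typedef void (*InstructionAction)(const uint8_t *begin, size_t length,
                                  void *data);

/* Returns 1 if the buffer splits cleanly into toy instructions. */
static int run_toy_dfa(const uint8_t *code, size_t size,
                       InstructionAction action, void *data) {
  size_t pos = 0;
  while (pos < size) {
    size_t length;
    if (code[pos] == 0x90) length = 1;
    else if (code[pos] == 0xb8) length = 5;
    else return 0;                      /* no transition: reject */
    if (length > size - pos) return 0;  /* truncated instruction */
    action(code + pos, length, data);
    pos += length;
  }
  return 1;
}

/* "Validator" action: keeps nothing beyond accept/reject. */
static void validator_action(const uint8_t *begin, size_t length,
                             void *data) {
  (void)begin; (void)length; (void)data;
}

/* "Decoder" action: records details about every instruction. */
struct DecodeStats { size_t instructions; size_t bytes; };

static void decoder_action(const uint8_t *begin, size_t length, void *data) {
  struct DecodeStats *stats = data;
  (void)begin;
  stats->instructions++;
  stats->bytes += length;
}
```

Both actions see exactly the same instruction boundaries, so any byte sequence accepted with one is accepted with the other; only the amount of information collected differs.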