Contemporary computer systems are extremely powerful, and most complex components and libraries are built like a luxury car: they include a lot of comfort and safety technologies designed to improve the life of the user of said components. This also facilitates code reuse via modular programming and generally improves maintainability.
Unfortunately these complex structures, the improved comfort for the library user, and the commendable flexibility have a flip side: they lead to a lot of additional work at runtime! You first fill and then parse complex data structures, and this takes time. You often produce a lot of information on the low levels which is just not used on higher levels, and this work is not free either.
The new validator is built differently. It keeps around only the indispensable minimum of information needed to prove (or disprove) that code is safe. Similarly to how an F1 car uses custom-designed seats, we use custom-designed data structures to push data from one point of the validator to another. We collect only the bare minimum of information (and perhaps a little bit besides that to make testing possible), and if the requirements change we often change all the pieces: from the gen_dfa input data format to the highest-level dfa_validate_32.c/dfa_validate_64.c external API adapters.
This streamlining was one of the most important design goals of the new validator. And indeed the code which reaches the CPU is very simple: it does not contain complex data structures and multilayered functions, while all the previous validators had many layers and quite a few complex data structures. How can that be? Were all these structures superfluous and unnecessary? Well… not really. The new validator throws away all that complexity and trades it for a few comparisons and jumps. Tens of thousands of comparisons and a similar number of jumps, to be exact. In a single flat function. Basically we trade runtime complexity for build-time complexity.
But note that build-time complexity ≠ source code complexity. Since our goal is to produce an extremely fast validator, not an extremely complicated validator with impenetrable source, we try to keep its source as simple as feasible. To achieve this we employ ragel and our own generator of ragel code. Why two levels of indirection? Ragel is an industry-standard tool for DFA generation (you can find it in most Linux distributions, the Wikipedia article was added in 2006, etc.) and our generator is used to produce ragel code from a textual description of the x86 instruction set. Said textual description uses a form which is pretty close to what you can find in the AMD manual (Intel manuals use similar acronyms, but they take a significantly different approach to describing VEX-encoded commands). The initial goal was to use snippets from the manual, but this proved to be infeasible in a few cases because the manual is designed to be read by humans. For example POPF/D/Q Fv describes an instruction which has different names in legacy 16bit mode, 32bit ia32 mode, and 64bit x86-64 mode, while STOSW/D/Q Yv, rAX is an instruction which is called differently if different prefixes are used. The fact that you have access to all three forms of STOSW/D/Q in x86-64 mode but can only use POPF and POPFD is not reflected anywhere. To solve this problem we use a slightly more formalized (and thus machine-parseable) description but, thankfully, such cases are rare, thus most commands in our tables are described exactly as they are described in the AMD manual.
If you know how Ragel works, or if the phrase “Ragel is a compiler of finite state machines and it can produce not just finite-state automata but finite state transducers and we use this capability in our work” makes perfect sense and clarifies the affairs to you, then you can skip the next [optional] part.
I'll not explain what a DFA is (it's explained in the CS course you took years back… or you can refresh your knowledge on Wikipedia). But I'll explain a little about Ragel's take on finite state transducers. Extensive documentation with all the gory details is on Ragel's site, but while it explains how to use Ragel it does not explain what it is and why you may want to use it.
Let's start with the first question: what it is. Ragel is a compiler of DFA machines… but with a twist. You describe the DFA structure using a simple RE-style format and Ragel generates the corresponding code in C (D/Go/Java/Ruby/etc.: Ragel supports a lot of languages, but we are interested in C here). When you describe the DFA you just write acceptable bytes and then use the following operations: concatenation (“1 . 2” will accept “1” followed by “2”), union (“1 | 2” will accept either “1” or “2”), intersection (“('a'..'n') & ('m'..'z')” will accept either “m” or “n”), difference (“('a'..'n') - ('m'..'z')” will accept everything between “a” and “l”, but will not accept either “m” or “n”), and Kleene star (“(1 | 2)*” will accept any number of “1” or “2”).
These operations can produce quite non-trivial result: e.g. “("b" . ("aa"+ | "aaa"+))*” will produce the following DFA:
If, instead of “("aa"+ | "aaa"+)” in the example above, you use something like “("a"{5}+ | "a"{7}+ | "a"{11}+)” then the resulting DFA will include almost four hundred nodes and over five hundred transitions! This limits the applicability of DFA technology: e.g. it's possible to describe a "valid code sequence" (including bundles, "restricted registers", and everything else) as a DFA, but… said DFA will include millions of nodes and billions of transitions!
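A back-of-the-envelope way to see where numbers of that order come from: to decide whether a run of “a” has a length divisible by 5, 7, or 11, a DFA has to remember the run length modulo lcm(5, 7, 11). The sketch below only does this arithmetic; it is not Ragel output, the function names are made up for the illustration, and the exact node count Ragel reports may differ by a few states.

```c
#include <assert.h>

/* Greatest common divisor (Euclid's algorithm). */
unsigned gcd(unsigned a, unsigned b) {
  while (b != 0) {
    unsigned t = a % b;
    a = b;
    b = t;
  }
  return a;
}

/* Least common multiple of two positive numbers. */
unsigned lcm(unsigned a, unsigned b) {
  assert(a > 0 && b > 0);
  return a / gcd(a, b) * b;
}

/* Counts run lengths r in [1, period] that are a multiple of 5, 7, or 11,
 * i.e. the points where a run of "a" may legally end and "b" may follow. */
unsigned count_accepting(unsigned period) {
  unsigned r, count = 0;
  for (r = 1; r <= period; r++) {
    if (r % 5 == 0 || r % 7 == 0 || r % 11 == 0) {
      count++;
    }
  }
  return count;
}
```

Here lcm(lcm(5, 7), 11) is 385 and count_accepting(385) is 145: roughly 385 states with one “a” transition each, plus 145 “b” transitions out of the accepting ones, which is in line with the “almost four hundred” and “over five hundred” figures.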
2.1. Ragel actions.
To overcome this problem Ragel offers so-called "actions": pieces of code which are called when certain pieces in DFA are reached. E.g. we can mark begin and end of “aa” (or “aaa”) in the example above—“("b" . (("aa" >begin @end)+ | ("aaa" >begin @end)+ ))*” produces the following DFA:
Let's see what happens if we'll feed it with “baaaaaaaaa” sequence:
- offset 0: nothing
- offset 1: begin
- offset 2: end
- offset 3: begin then end
- offset 4: end then begin
- offset 5: begin
- offset 6: end
- offset 7: begin
- offset 8: end
- offset 9: begin then end
Hmm. Something is wrong here: why do we have so many begin's and end's?!! Let's try to change the DFA a bit: “("b" . (("aa" >begin2 @end2)+ | ("aaa" >begin3 @end3)+ ))*” produces the following DFA:
This time we have:
- offset 0: nothing
- offset 1: begin2 then begin3
- offset 2: end2
- offset 3: begin2 then end3
- offset 4: end2 then begin3
- offset 5: begin2
- offset 6: end2 then end3
- offset 7: begin2 then begin3
- offset 8: end2
- offset 9: begin2 then end3
Ah-ha. Now everything is clear. A DFA is a DFA: it does not support memory and it does not support rollbacks. This means that our DFA is processing two branches simultaneously: both “"aa"+” and “"aaa"+”. We'll need to keep this in mind. A couple of other observations:
- When we used just the begin action, begin was called once, but when we split it in two (begin2 and begin3) both are called! By default Ragel merges actions.
- Actions are called in non-random order. Take a look at offset 4: end2 is called before begin3. That's because begin3 has lower priority than end2! Note that in the previous example this same effect was observed, but it was quite mysterious there. The closer the action is to the beginning of the source file, the higher its priority is.
Here is the build diagram:
*.def files contain instruction definitions taken almost verbatim from the AMD instruction manual. They are parsed by gen_dfa, which in turn produces a ragel definition of the regular language of all instructions (validator_x86_32_instruction.rl). This regular language (a machine in ragel terms) is used as a building block to define the language of all 'valid' bundles (give or take some subtle details we will discuss later). The language of valid bundles is defined in validator_x86_32.rl.
To understand how the validator works it's best to start from the function ValidateChunkIA32 in validator_x86_32.rl. Said function is very short and “simple”: it allocates a couple of arrays (valid_targets and jump_dests), then cycles over the code passed to it (processing it in bundle-sized chunks), and at the end it compares valid jump targets and collected jump destinations… that's it. Oh, and it also includes a couple of cryptic lines right in the middle of the innermost cycle:
%% write init;
%% write exec;
These lines instruct ragel to insert the DFA code (in C) here. The resulting output goes to the file validator_x86-32.c, which performs the actual validation.
Our main DFA is “(one_instruction | special_instruction)*”, i.e. it accepts a sequence of “normal” instructions and “special” instructions.
It consumes the code byte by byte from the current_position pointer until an ending condition is met, for example when the end of the bundle is reached (current_position == end_of_bundle).
Note that even if the automaton leaves prematurely (before the end of the bundle), validation goes on from the beginning of the next bundle. If one bundle is rejected then the whole chunk is always rejected, but this approach makes it possible to diagnose more errors in one pass, which helps while code is being developed.
Apparently collection of valid jump targets and actual target destinations happens inside this automaton. How?
Just like in the example above there are two actions. The first one is triggered at the beginning of the instruction (“normal” or “special”): it's used to remember the beginning of the instruction, to clear instruction_info_collected, and to mark the first byte of the instruction as a valid target for a direct jump. The second one is triggered at the final byte of the instruction (“normal” or “special”) and is used to report errors. There is also one additional action which is declared as “$err”. This is the error fallback action: it's called whenever there is no transition for a particular byte in our DFA. This means we've hit either a forbidden instruction like lgdt or some undefined byte sequence… in both cases the UNRECOGNIZED_INSTRUCTION error is reported and processing is stopped.
There are three “special” instructions in the IA32 case: naclcall, nacljmp, and mov %gs:0x0/0x4,%reg (the public ABI allows read-only access to %gs:0, and read-only access to %gs:4 is allowed for IRT). The last one is declared as a “special” instruction to simplify the validation logic (and the DFA, too): instead of accepting all versions of the mov %gs:something,%reg instruction followed by additional logic which rejects most possibilities (only plain vanilla “zero” is allowed here as per the ABI), we describe only this one version of the instruction and ragel does the rest. naclcall and nacljmp include a special action which clears the “valid destination address” bit (remember the story with the begin and end actions above? When the first byte of the second half of naclcall/nacljmp is processed, it's processed both as part of the naclcall/nacljmp and as the start of a regular instruction, too).
This explains how the valid_targets array is filled and how invalid instructions are rejected. Note that even an invalid instruction would be marked as a valid jump target, but we don't care about this peculiarity because the validation result will be negative anyway.
But of course there is jump_dests, too. Special instructions don't touch it, but something obviously fills the array, doesn't it? This can only be the result of processing normal instructions, thus we need to go deeper. Where does it all come from? To understand that we need to look at the [autogenerated] validator_x86_32_instruction.rl file. The file looks like this:
one_instruction =
(branch_hint? 0x77 rel8) |
(branch_hint? (0x0f 0x87) rel32) |
((0x0f 0x01 0xd0) @CPUFeature_FXSR)
;
0x77 and 0x0f 0x87 are opcodes for the ja (aka jnbe) instruction, but what are branch_hint? and rel8/rel32 doing here? Well, “?” means “optional” (like in most RE engines) and both branch_hint and rel8/rel32 are references to machines defined in the semi-manual “simple helper machines and actions” part of the validator_x86_32_instruction.rl file. The whole construct describes the part of the DFA which is designed to accept the ja (aka jnbe) instruction, complete with the optional P4-inspired branch prediction prefix. The definition of branch_hint is trivial and obvious (“branch_hint = 0x2e | 0x3e;” if you want to know), but rel8/rel32 are somewhat more “interesting”:
rel8 = any @rel8_operand;
rel32 = any{4} @rel32_operand;
rel8_operand/rel32_operand are not present in validator_x86_32_instruction.rl, they are in the validator_x86_32.rl file! But the definition itself is pretty trivial:
action rel8_operand {
  int8_t offset = (uint8_t) (p[0]);
  /* p points at the last byte of the instruction, hence the "+ 1". */
  size_t jump_dest = offset + (p - data) + 1;
  if (!MarkJumpTarget(jump_dest, jump_dests, size)) {
    instruction_info_collected |= DIRECT_JUMP_OUT_OF_RANGE;
  }
}
action rel32_operand {
  /* rel32 is little-endian: p[-3] is the least significant byte. */
  int32_t offset =
      (p[-3] + 256U * (p[-2] + 256U * (p[-1] + 256U * ((uint32_t) p[0]))));
  size_t jump_dest = offset + (p - data) + 1;
  if (!MarkJumpTarget(jump_dest, jump_dests, size)) {
    instruction_info_collected |= DIRECT_JUMP_OUT_OF_RANGE;
  }
}
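Both actions compute the jump destination relative to the end of the instruction and record it in jump_dests; if MarkJumpTarget rejects the destination, the instruction is flagged with DIRECT_JUMP_OUT_OF_RANGE.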
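To make the bookkeeping more tangible, here is a simplified sketch of how the two arrays could cooperate. It assumes one bit per code byte; the helper names (bitmap_set, bitmap_get, decode_rel32, mark_jump_dest, all_jumps_valid) are hypothetical and the real MarkJumpTarget may well treat out-of-range destinations differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sets bit `index` in a bitmap that holds one bit per code byte. */
void bitmap_set(uint8_t *bitmap, size_t index) {
  bitmap[index / 8] |= (uint8_t)(1u << (index % 8));
}

/* Returns bit `index` of the bitmap (0 or 1). */
int bitmap_get(const uint8_t *bitmap, size_t index) {
  return (bitmap[index / 8] >> (index % 8)) & 1;
}

/* Decodes a little-endian rel32 whose LAST byte is at `p`. */
int64_t decode_rel32(const uint8_t *p) {
  uint32_t raw = (uint32_t)p[-3] | ((uint32_t)p[-2] << 8) |
                 ((uint32_t)p[-1] << 16) | ((uint32_t)p[0] << 24);
  /* Sign-extend without relying on implementation-defined conversions. */
  return (raw & 0x80000000u) ? (int64_t)raw - 4294967296LL : (int64_t)raw;
}

/* Records the destination of a jump whose last byte is at `p`.
 * Returns 0 if the destination falls outside [0, size). */
int mark_jump_dest(const uint8_t *data, const uint8_t *p, int64_t offset,
                   uint8_t *jump_dests, size_t size) {
  int64_t dest;
  assert(p >= data);
  dest = offset + (int64_t)(p - data) + 1;
  if (dest < 0 || (uint64_t)dest >= size) {
    return 0;
  }
  bitmap_set(jump_dests, (size_t)dest);
  return 1;
}

/* Final check: every collected destination must be an instruction start. */
int all_jumps_valid(const uint8_t *valid_targets, const uint8_t *jump_dests,
                    size_t size) {
  size_t i;
  for (i = 0; i < size; i++) {
    if (bitmap_get(jump_dests, i) && !bitmap_get(valid_targets, i)) {
      return 0;
    }
  }
  return 1;
}
```

For example, the six bytes 0x0f 0x87 0xfa 0xff 0xff 0xff encode ja with rel32 equal to -6, so the destination is offset 0, the first byte of the jump itself.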
While the validator for ia32 mode is very simple and short (it also produces pretty compact code), the validator for x86-64 mode is different. It still has all the same properties the validator for ia32 mode had (valid_targets and jump_dests arrays, “normal” and “special” instructions, bundles, and rel8_operand/rel32_operand actions), but it adds quite a few additional twists to the whole scheme.
It's created in a process which is similar to the process which creates the ia32 validator.
First of all: the ia32 mode validator had one DFA in it and two arrays which kept track of the instruction boundaries, but x86-64 has a few more state variables. Most of them (rex_prefix, vex_prefix2, vex_prefix3, operand_states, base, and index) keep track of the instruction parts (and thus they are cleared before each instruction), but one variable called restricted_register is used to tie different instructions together. As the name implies it keeps track of the restricted register (if any). A restricted register in the NaCl SFI model on x86-64 systems is a general purpose register which has its top 32 bits cleared. Note that not all restricted registers are born equal: most registers can be restricted and then forgotten (if you write to %eax and do nothing with the value before call then nothing problematic or dangerous can ever happen), but %esp and %ebp are exceptions. If you write to %esp then the very next instruction must be add %r15,%rsp or lea (%r15,%rsp,1),%rsp, and %rbp has similar requirements. This means that if at the end of a bundle the restricted register is %rsp or %rbp then the program is invalid. For the same reason, if at the beginning of a normal instruction (this includes the first instruction in a “compound”) we see restricted %rsp or restricted %rbp then it's an error, too. On the other hand, the few rare special instructions which are used to restore the SFI invariant WRT %rsp or %rbp will only be accepted if the restricted register is %rsp or %rbp respectively (depending on the special instruction).
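The rules for %rsp and %rbp boil down to two tiny predicates. This is only a sketch of the rules as stated above: the enum, its values other than the hardware numbers of %rsp (4) and %rbp (5), and the function names are assumptions made for the illustration, not the validator's actual code.

```c
#include <assert.h>

/* Hypothetical register identifiers; 4 and 5 are the hardware numbers of
 * %rsp and %rbp, the value chosen for "no register" is arbitrary. */
enum sketch_register {
  SKETCH_REG_RAX = 0,
  SKETCH_REG_RSP = 4,
  SKETCH_REG_RBP = 5,
  SKETCH_NO_REG = 32
};

/* Nonzero if a bundle may end, or a normal instruction may begin, while
 * `restricted` is the current restricted register. */
int boundary_allowed(enum sketch_register restricted) {
  return restricted != SKETCH_REG_RSP && restricted != SKETCH_REG_RBP;
}

/* Nonzero if a special instruction that restores the invariant for `target`
 * (e.g. add %r15,%rsp for SKETCH_REG_RSP) is acceptable right now. */
int fixup_allowed(enum sketch_register restricted,
                  enum sketch_register target) {
  assert(target == SKETCH_REG_RSP || target == SKETCH_REG_RBP);
  return restricted == target;
}
```

So a restricted %eax can simply be forgotten at a boundary, while a restricted %esp must be consumed by the matching fix-up instruction first.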
The hard part is, as before, in the DFA. First of all, the main machine is similar to what we had in ia32 mode, but subtly different: it's “(normal_instruction | special_instruction)*” now. I.e.: one_instruction is replaced with normal_instruction. And what is normal_instruction? Why, it's “one_instruction - special_instruction”, of course! Well… this is unexpected: why would we want to remove special_instructions from normal_instructions only to add them back? The answer is related to actions: recall how actions work. When we remove special_instruction from one_instruction we also remove the associated actions. This is important in the x86-64 case because some special instructions are just normal instructions which are permitted to violate the usual rules! E.g. the “special” instruction and $~0x1f,%rsp (which is used to align the stack pointer) changes %rsp directly, which is usually forbidden, but because of the properties of and $xxx,… (for any $xxx < 0) we know that the invariants will not be violated.
This approach works well, but only if violations are detected at the instruction end. E.g. the aforementioned and $~0x1f,%rsp instruction is encoded as 0x48 0x83 0xe4 0xe0, and after we've read 0x48 0x83 0xe4 we already know it's a normal instruction (opcode 0x83 means it's and) which writes to %rsp (the pattern 0x48 opcode 0xe4 means it's some instruction which accepts some kind of immediate and writes to %rsp). If we signal the error at this point, then the fact that later we'll find out it's a special_instruction which is accepted anyway will not matter: the SPL_MODIFIED error will be triggered, which will mean that the code is rejected!
This means that we cannot do the actual condition checking till the very end of a normal instruction (we could try to process some of the conditions early, but not all of them, and this approach would be quite complex and fragile: not something you want in the most critical security piece). But there is an exception: memory access. This one is checked inline: memory access outside of the “40GiB safe area” is strictly forbidden no matter how “special” the instruction is. That's why it's checked immediately after operand discovery. This is what the relevant fragment for the and instruction looks like:
(0x83 (opcode_4 any* & any . any* & operand_disp @check_access) imm8 @process_0_operands) |
(0x83 (opcode_4 any* & any . any* & operand_rip @check_access) imm8 @process_0_operands) |
(REX_B? 0x83 (opcode_4 any* & any . any* & single_register_memory @check_access) imm8 @process_0_operands) |
(REX_X? 0x83 (opcode_4 any* & any . any* & operand_sib_pure_index @check_access) imm8 @process_0_operands) |
(REX_XB? 0x83 (opcode_4 any* & any . any* & operand_sib_base_index @check_access) imm8 @process_0_operands) |
(lock 0x83 (opcode_4 any* & any . any* & operand_disp @check_access) imm8 @process_0_operands) |
(lock 0x83 (opcode_4 any* & any . any* & operand_rip @check_access) imm8 @process_0_operands) |
(lock REX_B? 0x83 (opcode_4 any* & any . any* & single_register_memory @check_access) imm8 @process_0_operands) |
(lock REX_X? 0x83 (opcode_4 any* & any . any* & operand_sib_pure_index @check_access) imm8 @process_0_operands) |
(lock REX_XB? 0x83 (opcode_4 any* & any . any* & operand_sib_base_index @check_access) imm8 @process_0_operands) |
(REX_B? 0x83 (opcode_4 @operand0_32bit any* & modrm_registers @operand0_from_modrm_rm) imm8 @process_1_operand) |
check_access is triggered after parsing the ModRM/SIB bytes, but before parsing the immNN field, while the process_N_operands action is triggered at the very end of the “normal” instruction. Even if the instruction does not use an immNN field, the check_access action is still triggered before the process_N_operands action. This is important because the check_access action actually depends on the previous state of the restricted_register variable while the process_N_operands action changes the restricted_register variable. Note that it's only triggered for “normal” instructions: “special” instructions either do the work themselves (e.g. add %r15,%rsp, which is only valid if the previous state of the restricted_register variable was REG_RSP and changes it to NO_REG in case of success) or call the usual process_N_operands action (e.g. mov %rsp,%rbp calls process_0_operands, which ensures that this operation is not called when restricted_register is set to the REG_RSP/REG_RBP state and transitions it to the NO_REG state).
You can find yet another surprising thing in the snippet above: the and instruction is handled either as an instruction with zero operands or as an instruction with one operand… but of course in reality it always has two operands! Something is strange here… Well, sure: the decoder part of the validator is as streamlined as possible. We just ignore all non-register arguments and arguments which are not written to (but we don't ignore memory accesses if they happen here, of course). That's why and has either one or zero operands as far as the validator is concerned.
Operand handling, again, is not that complex… if you are familiar with bit operations. The initial version of the validator used a simple array of records to store the information and everything worked well… with GCC, that is. MSVC produced awful code which was almost 30% slower and also needed twenty minutes to do so, thus we replaced this simple version with the current macro-based one.
All the information about encountered operands is collected in a single scalar variable operand_states. The layout of said variable looks like this:
Bits | Contents
63..39 | padding
38..37 | operand4: register_type
36..32 | operand4: register_name (bit 36 is 0 if normal register)
31 | padding
30..29 | operand3: register_type
28..24 | operand3: register_name (bit 28 is 0 if normal register)
23 | padding
22..21 | operand2: register_type
20..16 | operand2: register_name (bit 20 is 0 if normal register)
15 | padding
14..13 | operand1: register_type
12..8 | operand1: register_name (bit 12 is 0 if normal register)
7 | padding
6..5 | operand0: register_type
4..0 | operand0: register_name (bit 4 is 0 if normal register)
Register names are defined in the register_name enum: the first 16 are identical to the AMD/Intel names (from REG_RAX to REG_R15) while the other 16 are used (partially) to describe non-register operands (memory operand, immediate operand, REG_RIP and REG_RIZ, etc.). This means that if an operand's name is >15 then it can be ignored. There are only four operand types: OperandSandboxIrrelevant, OperandSandbox8bit, OperandSandboxRestricted, and OperandSandboxUnrestricted. The first type is something not related to a general purpose register (x87, MMX, XMM, or YMM registers fall into this category). We need to handle 8bit operands specially because they are finicky: if a REX byte is used they access %spl, %bpl, %sil, and %dil, but when a REX byte is not used the same numbers are reused for %ah, %ch, %dh, and %bh! The last two types are the most important: these are 32bit operands (which will make the appropriate register “restricted”) or 16bit/64bit operands (these may affect the register in question negatively if that's %rbp, %rsp, or %r15, but for other registers these are just ignored). Note that if you assign 0 to this variable then all operands will be of the OperandSandboxIrrelevant type.
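The finicky nature of 8bit operands is easy to show with a lookup. The function below is purely illustrative (reg8_name is not part of the validator); it covers only register numbers 0..7 and maps a number plus the presence of a REX byte to the register actually accessed.

```c
#include <assert.h>

/* Returns the AT&T name of the 8-bit register with the given hardware
 * number (0..7). `has_rex` is nonzero if the instruction carries a REX byte. */
const char *reg8_name(unsigned number, int has_rex) {
  static const char *const low[4] = {"%al", "%cl", "%dl", "%bl"};
  static const char *const legacy_high[4] = {"%ah", "%ch", "%dh", "%bh"};
  static const char *const with_rex[4] = {"%spl", "%bpl", "%sil", "%dil"};
  assert(number < 8);
  if (number < 4) {
    return low[number]; /* Numbers 0..3 mean the same with or without REX. */
  }
  /* Numbers 4..7 are the ones that change meaning. */
  return has_rex ? with_rex[number - 4] : legacy_high[number - 4];
}
```

Number 4 is the interesting case for sandboxing: without REX it names %ah, which is harmless, but with REX it names %spl, the low byte of the stack pointer.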
Now the set of macros used to work with operands should look less mysterious:
#define SET_OPERAND_NAME(N, S) operand_states |= ((S) << ((N) * 8))
#define SET_OPERAND_TYPE(N, T) SET_OPERAND_TYPE_ ## T(N)
#define SET_OPERAND_TYPE_OPERAND_SIZE_8_BIT(N) operand_states |= OperandSandbox8bit << (5 + ((N) << 3))
#define SET_OPERAND_TYPE_OPERAND_SIZE_16_BIT(N) operand_states |= OperandSandboxUnrestricted << (5 + ((N) << 3))
#define SET_OPERAND_TYPE_OPERAND_SIZE_32_BIT(N) operand_states |= OperandSandboxRestricted << (5 + ((N) << 3))
#define SET_OPERAND_TYPE_OPERAND_SIZE_64_BIT(N) operand_states |= OperandSandboxUnrestricted << (5 + ((N) << 3))
#define CHECK_OPERAND(N, S, T) ((operand_states & (0xff << ((N) << 3))) == ((S | (T << 5)) << ((N) << 3)))
Calls like SET_OPERAND_NAME(0, REG_RAX) are used by actions to set the name of the operand (this particular one is used by the operand0_rax action) while calls like SET_OPERAND_TYPE(0, OPERAND_SIZE_2_BIT) are used by actions to set the type of the operand (this particular one is used by the operand0_2bit action). Note that we don't handle 2bit operands in the set of macros above. This is not a mistake: 2bit operands are only ever used as immediate operands (and then only in two instructions: vpermil2pd and vpermil2ps) and we don't process immediate operands here. If for some reason they are left in the validator_x86_64_instruction.rl file, this will lead to a compile-time error, not to some kind of weird overflow which may [potentially] produce a security hole.
Almost all manipulations with operand_states are done using the macros described above, but there is one construct in the process_N_operands function which accesses operand_states directly:
/* Take 2 bits of operand type from operand_states as *restricted_register,
* make sure operand_states denotes a register (4th bit == 0). */
} else if ((operand_states & 0x70) == (OperandSandboxRestricted << 5)) {
*restricted_register = operand_states & 0x0f;
}
If you recall the layout of operand_states then it's pretty easy to understand what goes on here: (operand_states & 0x70) == (OperandSandboxRestricted << 5) yields TRUE if and only if the zeroth operand is a “normal” register and it's of type OperandSandboxRestricted. This is actually the central piece of restricted_register handling: most other pieces just return it back to the NO_REG state.
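The packing and the 0x70 check can be tried out in isolation. In this sketch the function names and SKETCH_NO_REG are hypothetical, and the numeric values 1..3 of the operand types are an assumption (only OperandSandboxIrrelevant == 0 follows from the text); the bit positions follow the layout described above.

```c
#include <assert.h>
#include <stdint.h>

/* Operand types; only the zero value is confirmed by the text, the
 * rest of the numbering is assumed for this illustration. */
enum {
  OperandSandboxIrrelevant = 0,
  OperandSandbox8bit = 1,
  OperandSandboxRestricted = 2,
  OperandSandboxUnrestricted = 3
};

/* Arbitrary marker for "no restricted register" in this sketch. */
enum { SKETCH_NO_REG = 0x1f };

/* Stores name (5 bits) and type (2 bits) of operand n (0..4) into its byte. */
uint64_t set_operand(uint64_t operand_states, unsigned n, unsigned name,
                     unsigned type) {
  assert(n < 5 && name < 32 && type < 4);
  return operand_states | ((uint64_t)(name | (type << 5)) << (n * 8));
}

/* Mask 0x70 covers bit 4 (must be 0 for a normal register) and the two
 * type bits of operand 0. Returns the register number or SKETCH_NO_REG. */
unsigned restricted_by_operand0(uint64_t operand_states) {
  if ((operand_states & 0x70) == ((uint64_t)OperandSandboxRestricted << 5)) {
    return (unsigned)(operand_states & 0x0f);
  }
  return SKETCH_NO_REG;
}
```

Note how a single comparison rejects both a non-register operand (bit 4 set) and a wrong operand type, which is why the mask is 0x70 rather than 0x60.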
CPUID support.
CPUID support is implemented using a large set of actions embedded in the definitions of instructions (see, e.g. @CPUFeature_FXSR in the line for instruction 0x0f 0x01 0xd0 AKA xgetbv). CPUID-related actions are triggered when we know the identity of the instruction (which happens at different times for different instructions: some instructions are detected when the opcode is read, some use an opcode extension, etc.; AMD/Intel manuals contain all the gory details), but the definitions of said actions in validator_x86_32_instruction.rl are very simple:
action CPUFeature_FXSR {
SET_CPU_FEATURE(CPUFeature_FXSR);
}
SET_CPU_FEATURE comes from validator_internal.h, where it is defined as
if (!(F##_Allowed)) { \
  instruction_info_collected |= UNRECOGNIZED_INSTRUCTION; \
} \
if (!(F)) { \
  instruction_info_collected |= CPUID_UNSUPPORTED_INSTRUCTION; \
}
Note that CPUFeature_FXSR is not the name of a variable, but the name of a macro definition. This is needed to handle special cases where a CPUFeature does not correspond to a single CPUID bit. E.g. the prefetch instruction is available when any one of two bits is set: the 3DNow! bit or the dedicated Prefetch instruction bit. AMD documentation also claims prefetch is always available if the LongMode bit is set, but Intel documentation does not support this assertion. On the other hand vaesenc is available when both the AES and AVX bits are set. And our ABI permits lzcnt and tzcnt unconditionally (thus CPUFeature_LZCNT does not check for anything but just returns TRUE in all cases).
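A sketch of why macros are convenient here: each feature becomes an arbitrary boolean expression over CPUID bits. Everything in the block is hypothetical (the bit positions, the SKETCH_ names, has_bit, check_feature); it only mirrors the three cases mentioned above and the two-step check from SET_CPU_FEATURE.

```c
#include <assert.h>

/* Hypothetical bit positions inside a made-up CPUID mask. */
enum { BIT_3DNOW = 0, BIT_PREFETCH = 1, BIT_AES = 2, BIT_AVX = 3 };

/* Hypothetical error flags standing in for the real ones. */
enum {
  SKETCH_UNRECOGNIZED_INSTRUCTION = 1,
  SKETCH_CPUID_UNSUPPORTED_INSTRUCTION = 2
};

/* Returns nonzero if bit `bit` is set in `mask`. */
int has_bit(unsigned mask, unsigned bit) {
  assert(bit < 32);
  return (int)((mask >> bit) & 1u);
}

/* prefetch: either of two bits is enough. */
#define SKETCH_FEATURE_PREFETCH(mask) \
  (has_bit((mask), BIT_3DNOW) || has_bit((mask), BIT_PREFETCH))
/* vaesenc: both bits are required. */
#define SKETCH_FEATURE_VAES(mask) \
  (has_bit((mask), BIT_AES) && has_bit((mask), BIT_AVX))
/* lzcnt/tzcnt: permitted unconditionally. */
#define SKETCH_FEATURE_LZCNT(mask) ((void)(mask), 1)

/* Mirrors the two checks of SET_CPU_FEATURE: first the "allowed" flag,
 * then the feature itself. Returns the accumulated error flags. */
unsigned check_feature(int allowed, int supported) {
  unsigned errors = 0;
  if (!allowed) {
    errors |= SKETCH_UNRECOGNIZED_INSTRUCTION;
  }
  if (!supported) {
    errors |= SKETCH_CPUID_UNSUPPORTED_INSTRUCTION;
  }
  return errors;
}
```

With plain variables each instruction would need exactly one CPUID bit; with macros the AND/OR combinations cost nothing at the call site.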
Note: there are two CPUID masks: a hardcoded one (it can be replaced if you link a different definition of the validator_cpuid_features global variable into your program) and a runtime-supplied one (usually obtained from an actual CPUID call in production, but hardcoded in tests). New instructions are first added in “production disabled” mode and must pass a security review before they can be used in Chrome.
Dynamic code modification support is implemented with the help of the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION option. Normally the user callback is only used when some kind of error is detected, but if this option is used then the callback is called after each instruction. When that happens the callback has all the information needed to process the instruction: collected errors, information about immediates, etc.
All that information is squeezed into the instruction_info_collected variable. It has the following format:
- bits 3..0: Cumulative size of anyfields.
- bit 4: Instruction has two immediates.
- bits 6..5: Instruction displacement size.
- bit 7: Instruction has relative offset.
- bits 12..8: ia32 mode: reserved; amd64 mode: Register, zero-extended by the instruction.
- bit 13: ia32 mode: reserved; amd64 mode: Instruction is valid, but it accesses memory using a register which is zero-extended by the previous instruction.
- bit 14: DFA error: invalid instruction. Validation then resumes from the next bundle.
- bit 15: Unaligned direct jump to address outside of given region.
- bit 16: Instruction is not supported for a given CPUID mask.
- bit 17: ia32 mode: reserved; amd64 mode: Base register is not %rbp, %rsp, or %r15.
- bit 18: ia32 mode: reserved; amd64 mode: Index register is not zero-extended by previous instruction.
- bit 19: ia32 mode: reserved; amd64 mode: %rbp/%rsp sandboxing error detected. The next two bits reveal details of the error.
- bits 21..20: ia32 mode: reserved; amd64 mode (only if some %rbp/%rsp related error is detected):
  00: Instruction which zero-extends %rbp must be followed by add %r15,%rbp, lea (%rbp,%r15,1),%rbp, or lea 0x0(%rbp,%r15,1),%rbp.
  01: add %r15,%rbp, lea (%rbp,%r15,1),%rbp, or lea 0x0(%rbp,%r15,1),%rbp is used after instruction which does not zero-extend %rbp.
  10: Instruction which zero-extends %rsp must be followed by add %r15,%rsp or lea (%rsp,%r15,1),%rsp.
  11: add %r15,%rsp or lea (%rsp,%r15,1),%rsp is used after instruction which does not zero-extend %rsp.
- bit 22: %r15b, %r15w, %r15d, or %r15 is modified. %r15 is untouchable in amd64 mode.
- bit 23: ia32 mode: reserved; amd64 mode: %bpl, %bp, or %rbp is incorrectly modified. Only %rbp can be modified and then only by special instructions.
- bit 24: ia32 mode: reserved; amd64 mode: %spl, %sp, or %rsp is incorrectly modified. Only %rsp can be modified and then only by special instructions.
- bit 25: Bad call alignment: call must end at the end of the bundle, since nacljmp can only jump to an aligned address.
- bit 26: Reserved.
- bit 27: ia32 mode: reserved; amd64 mode: Instruction is modifiable.
- bit 28: Special instruction (uses different validation rules from the regular instruction). Can not be changed in ia32 mode.
- bit 29: Last byte is not immediate. It's either opcode, register number, or register number and two-bit immediate.
- bit 30: Invalid jump target. When this flag is set instruction_begin and instruction_end both point to the jump target instruction, not to the jump instruction itself.
- bit 31: Reserved.
Note that half of the information does not make sense for ia32 mode and is not collected by ValidateChunkIA32.
Using this information you can determine whether a given instruction follows special rules (only naclcall and nacljmp in ia32 mode; many different commands in amd64 mode: %rbp/%rsp modifications, string instructions, naclcall, and nacljmp), and whether it includes relative offsets (commands like jcc, jmp, loopcc, or call), displacements (most commands which access memory support displacements), or immediates (supported by many different commands; they can be combined with a displacement if the command accesses memory). Tests may use the collected information to precisely separate the different anyfields (immediates, displacements, relative offsets), but in production only a few bits are used to determine whether the instruction can be changed: in ia32 mode only the special instructions naclcall and nacljmp cannot be changed, while in amd64 mode the situation is the opposite: only call and mov instructions can be changed, and only in their anyfields part.
Code replacement is not performed by the ValidateChunk* functions directly. Instead it is done by a higher-level function in dfa_validate_*.c.
It calls ValidateChunk* with the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION option and compares the lengths of the instructions in the two fragments in the callback. In ia32 mode the callback uses the SPECIAL_INSTRUCTION flag in validation_info to determine whether an instruction can be changed (all non-special instructions are fair game), but in amd64 mode we only allow changes in a few hand-picked instructions (currently call and mov), which are marked with the MODIFIABLE_INSTRUCTION flag.
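The per-mode rule above can be sketched as a small predicate. This is only an illustration: the flag names come from the text, but the bit values, the ValidatorMode enum, and the InstructionMayChange helper are assumptions made for this sketch, not the validator's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch; the real
 * constants are defined by the validator headers. */
#define SPECIAL_INSTRUCTION    0x00000004u
#define MODIFIABLE_INSTRUCTION 0x00000008u

/* Hypothetical mode selector, not part of the real API. */
enum ValidatorMode { MODE_IA32, MODE_AMD64 };

/* Returns 1 if code replacement may alter the instruction described
 * by validation_info, 0 otherwise. */
int InstructionMayChange(enum ValidatorMode mode, uint32_t validation_info) {
  assert(mode == MODE_IA32 || mode == MODE_AMD64);
  if (mode == MODE_IA32) {
    /* ia32: everything except special instructions is fair game. */
    return (validation_info & SPECIAL_INSTRUCTION) == 0;
  }
  /* amd64: only explicitly marked instructions may change. */
  return (validation_info & MODIFIABLE_INSTRUCTION) != 0;
}
```

Note the asymmetry: ia32 works as a blacklist (one flag forbids changes) while amd64 works as a whitelist (one flag permits them).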
One tricky thing here is the handling of relative jumps and calls: if a relative jump (or call) triggers DIRECT_JUMP_OUT_OF_RANGE but is bit-for-bit identical to the original instruction, it is accepted anyway, because this means that this particular jump (or call) targets some valid position outside of the given range. If it must be changed, you need to pass a bigger region to the ValidatorCodeReplacement_x86_* function; this way the validator has a chance to check the landing place for validity (this is, of course, not needed if the landing point is bundle-aligned).
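A minimal sketch of that acceptance rule follows. The flag name is taken from the text; its bit value and the JumpTargetAcceptable helper are hypothetical and exist only to illustrate the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit value, chosen only for this sketch. */
#define DIRECT_JUMP_OUT_OF_RANGE 0x00000001u

/* Returns 1 if the instruction passes the out-of-range check: either
 * the flag is not set at all, or the new bytes are identical to the
 * old ones, so the target was already accepted when the original code
 * was validated. */
int JumpTargetAcceptable(const uint8_t *old_insn, const uint8_t *new_insn,
                         size_t length, uint32_t validation_info) {
  assert(old_insn != NULL && new_insn != NULL);
  if ((validation_info & DIRECT_JUMP_OUT_OF_RANGE) == 0) {
    return 1;  /* Target lies inside the checked region. */
  }
  /* Out of range: tolerate only an unchanged instruction. */
  return memcmp(old_insn, new_insn, length) == 0;
}
```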
In ia32 mode the whole instruction can be changed, but in amd64 mode we don't allow arbitrary changes to the instruction; we only allow changes to anyfields (immediates, displacements, relative offsets). This is somewhat tricky: most instructions put them at the end, but some instructions use the last byte for something else:
Opcode: for example cmpccsd/vcmpccsd and cmpccss/vcmpccss, and pclmulqdq/vpclmulqdq.
Register number: some AVX instructions (such as vblendvpd/vblendvps), some FMA4 instructions (such as vfmaddsubpd), and some XOP instructions (such as vpperm).
Register number and two-bit immediate: vpermil2pd/vpermil2ps.
All these instructions set the LAST_BYTE_IS_NOT_IMMEDIATE flag; the last form can be distinguished because it sets the IMMEDIATE_2BIT flag (which actually includes the LAST_BYTE_IS_NOT_IMMEDIATE flag).
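The amd64 comparison can be sketched roughly as follows. The flag names come from the text, but their bit values, both helper functions, and the anyfields_size parameter (the number of anyfield bytes, assumed to be known from the collected instruction information) are assumptions of this sketch. It is also deliberately conservative: when the last byte is not an immediate, the whole byte is treated as fixed, including any two-bit immediate it may carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit values. IMMEDIATE_2BIT is defined as a superset of
 * LAST_BYTE_IS_NOT_IMMEDIATE to mirror the "includes" relation. */
#define LAST_BYTE_IS_NOT_IMMEDIATE 0x00000002u
#define IMMEDIATE_2BIT             0x00000022u

/* Returns 1 only if every bit of IMMEDIATE_2BIT is set. */
int HasTwoBitImmediate(uint32_t validation_info) {
  return (validation_info & IMMEDIATE_2BIT) == IMMEDIATE_2BIT;
}

/* Assumed layout: [fixed prefix][anyfields][optional non-immediate last
 * byte]. Returns 1 if old and new differ only inside the anyfields. */
int OnlyAnyfieldsChanged(const uint8_t *old_insn, const uint8_t *new_insn,
                         size_t length, size_t anyfields_size,
                         uint32_t validation_info) {
  size_t tail = (validation_info & LAST_BYTE_IS_NOT_IMMEDIATE) ? 1 : 0;
  size_t fixed_prefix;
  assert(anyfields_size + tail <= length);
  fixed_prefix = length - anyfields_size - tail;
  if (memcmp(old_insn, new_insn, fixed_prefix) != 0) {
    return 0;  /* Prefixes, opcode, or ModRM/SIB bytes were altered. */
  }
  if (tail != 0 && old_insn[length - 1] != new_insn[length - 1]) {
    return 0;  /* The trailing opcode or register byte was altered. */
  }
  return 1;
}
```

For a ten-byte mov with an eight-byte immediate, for instance, only the first two bytes would be compared.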
This is done by a very simple function which uses CALL_USER_CALLBACK_ON_EACH_INSTRUCTION mode to process instructions one after another.
The only remaining issue (but a big one) is the generation of the actual decoders ({decoder,validator}_x86_{32,64}_instruction.rl files). This is a big part of the whole package but, thankfully, it happens in a significantly less hostile environment: the decoder and validator must work even when processing a specially crafted file created by a clever adversary, while gen_dfa processes data files created by us and only needs to correctly process certain “good” files.
To understand how it works it's better to start with the decoders. Remember how we talked about “streamlined data structures”, “indispensable minimum of the information”, etc.? This approach produces a fast and [relatively] simple validator, but makes it hard to test and debug. To facilitate testing and debugging we create separate decoders: these return all the information about all the instructions they can parse and in fact can produce output identical to objdump's output.
They are used to verify the descriptions of the instructions from the .def files, with special attention to the lengths of said instructions.
Decoders are created using a familiar process.
There are a few big differences between the standalone decoders and the simplified decoders embedded in ValidateChunkIA32/ValidateChunkAMD64:
… .def files.
… struct instruction, common to both decoders.
All these facts mean that the standalone decoders are significantly larger and slower, but also much easier to understand. The simplified decoders use the exact same DFA, with only some actions changed or omitted.