Source: http://en.wikipedia.org/wiki/File:Duesenberg.jpg
Source: http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Felipe_Massa_2011_Malaysia_FP1.jpg/800px-Felipe_Massa_2011_Malaysia_FP1.jpg

New, DFA-based validator with 5-10x speed of the original one, or…
Luxury car to F1 car.

Trust me: every problem in computer science may be solved by an indirection, but those indirections are expensive. Pointer chasing is just about the most expensive thing you can do on modern CPU's.
—Linus Torvalds
  1. DFA, Ragel, macros and inline functions, oh my…
  2. What is ragel and how it works.
    1. Ragel actions.
  3. Validator for x86-32 mode
    1. Valid jump targets (instruction boundaries).
    2. Jump destinations.
  4. Validation for x86-64 mode.
    1. “Secondary” states.
    2. “Normal” instructions.
    3. Operands handling.
  5. Features beyond minimal validation.
    1. CPUID support.
    2. Dynamic code modification support.
      1. Replacement validation.
      2. Replacement copying.
  6. Decoders.

1. DFA, Ragel, macros and inline functions, oh my…

Contemporary computer systems are extremely powerful, and most complex components and libraries are built like a luxury car: they include a lot of comfort and safety technologies designed to improve the life of whoever uses them. This also facilitates code reuse via modular programming and generally improves maintainability.

Unfortunately these complex structures, improved comfort for the library user and commendable flexibility have a flip side: they lead to a lot of additional work at runtime! You first fill and then parse complex data structures, and this takes time. You often produce a lot of information on the low levels which is simply not used on higher levels, and this work is not free either.

The new validator is built differently. It keeps around only the indispensable minimum of information needed to prove (or disprove) that code is safe. Just as an F1 car uses custom-designed seats, we use custom-designed data structures to push data from one point of the validator to another. We collect only the bare minimum of information (and perhaps a little bit besides that to make testing possible), and when the requirements change we often change all the pieces: from the gen_dfa input data format to the highest-level dfa_validate_32.c/dfa_validate_64.c external API adapters.

This streamlining was one of the most important design goals of the new validator. And indeed the code which reaches the CPU is very simple: it does not contain complex data structures and multilayered functions, while all the previous validators had many layers and quite a few complex data structures. How can it be? Were all these structures superfluous and unnecessary? Well… not really. The new validator throws away all that complexity and trades it for a few comparisons and jumps. Tens of thousands of comparisons and a similar number of jumps, to be exact. In a single flat function. Basically we trade runtime complexity for build-time complexity.

But note that build-time complexity ≠ source code complexity. Since our goal is to produce an extremely fast validator, not an extremely complicated validator with impenetrable source, we try to keep its source as simple as feasible. To achieve this we employ ragel and our own generator of ragel code. Why two levels of indirection? Ragel is an industry-standard tool for DFA generation (you can find it in most Linux distributions, the Wikipedia article on it was added in 2006, etc.) and our generator is used to produce ragel source from a textual description of the x86 instruction set. Said textual description uses a form which is pretty close to what you can find in the AMD manual (Intel manuals use similar acronyms, but they take a significantly different approach to describing VEX-encoded commands). The initial goal was to use snippets from the manual, but this proved to be infeasible in a few cases because the manual is designed to be read by humans. For example, POPF/D/Q Fv describes an instruction which has different names in legacy 16bit mode, 32bit ia32 mode and 64bit x86-64 mode, while STOSW/D/Q Yv, rAX is an instruction which is called differently depending on which prefixes are used. The fact that you have access to all three forms of STOSW/D/Q in x86-64 mode but can only use POPF and POPFQ is not reflected anywhere. To solve this problem we use a slightly more formalized (and thus machine-parseable) description, but, thankfully, such cases are rare, thus most commands in our tables are described exactly as they are described in the AMD manual.

2. What is ragel and how it works.

If you know how Ragel works or if the phrase “Ragel is compiler of finite state machines and it can produce not just finite-state automata but finite state transducers and we use this capability in our work” makes perfect sense and clarifies the affairs to you then you can skip next [optional] part.

I'll not explain what a DFA is (it's explained in the CS course you took years back… or you can refresh your knowledge on Wikipedia). But I'll explain a little about Ragel's take on finite state transducers. Extensive documentation with all the gory details is on Ragel's site, but while it explains how to use Ragel it does not explain what it is and why you may want to use it.

Let's start with the first question: what it is. Ragel is a compiler of DFA machines… but with a twist. You describe the DFA structure using a simple RE-style format and Ragel generates the corresponding code in C (D/Go/Java/Ruby/etc: Ragel supports a lot of languages, but we are interested in C here). When you describe the DFA you just write acceptable bytes and then use the following operations: concatenation (“1 . 2” will accept “1” followed by “2”), union (“1 | 2” will accept either “1” or “2”), intersection (“('a'..'n') & ('m'..'z')” will accept either “m” or “n”), difference (“('a'..'n') - ('m'..'z')” will accept everything between “a” and “l”, but will not accept either “m” or “n”) and Kleene star (“(1 | 2)*” will accept any number of “1” or “2”).

These operations can produce quite non-trivial result: e.g. “("b" . ("aa"+ | "aaa"+))*” will produce the following DFA:


If, instead of “("aa"+ | "aaa"+)” in the example above, you use something like “("a"{5}+ | "a"{7}+ | "a"{11}+)” then the resulting DFA will include almost four hundred nodes and over five hundred transitions! This limits the applicability of DFA technology: e.g. it's possible to describe a "valid code sequence" (including bundles, "restricted registers" and everything else) as a DFA, but… said DFA will include millions of nodes and billions of transitions!
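One way to see where a number of this size comes from (a back-of-the-envelope sketch, not a description of Ragel's internals): a DFA has no counters, so to follow all three alternatives at once it has to tell apart every combination of positions inside the 5-, 7- and 11-byte groups. The hypothetical helper below simply counts those combinations.

```c
#include <stddef.h>

/* Counts the distinct (n mod 5, n mod 7, n mod 11) triples that occur while
 * reading n = 1, 2, 3, ... bytes of 'a'. Each triple is one combination of
 * positions inside the three repeated groups. */
size_t sketch_count_position_combinations(void) {
  unsigned char seen[5][7][11] = {{{0}}};
  size_t count = 0;
  unsigned n;

  for (n = 1; n <= 2 * 5 * 7 * 11; ++n) {
    unsigned char *slot = &seen[n % 5][n % 7][n % 11];
    if (!*slot) {
      *slot = 1;
      ++count;
    }
  }
  return count;
}
```

The function returns 385 (that is, 5 × 7 × 11), which is in line with the "almost four hundred nodes" quoted above.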

2.1. Ragel actions.

To overcome this problem Ragel offers so-called "actions": pieces of code which are called when certain pieces in DFA are reached. E.g. we can mark begin and end of “aa” (or “aaa”) in the example above—“("b" . (("aa" >begin @end)+ | ("aaa" >begin @end)+ ))*” produces the following DFA:

Let's see what happens if we feed it the “baaaaaaaaa” sequence:

  • offset 0: nothing
  • offset 1: begin
  • offset 2: end
  • offset 3: begin then end
  • offset 4: end then begin
  • offset 5: begin
  • offset 6: end
  • offset 7: begin
  • offset 8: end
  • offset 9: begin then end

Hmm. Something is wrong here: why do we have so many begin's and end's?!! Let's try to change the DFA a bit: “("b" . (("aa" >begin2 @end2)+ | ("aaa" >begin3 @end3)+ ))*” produces the following DFA:

This time we have:

  • offset 0: nothing
  • offset 1: begin2 then begin3
  • offset 2: end2
  • offset 3: begin2 then end3
  • offset 4: end2 then begin3
  • offset 5: begin2
  • offset 6: end2 then end3
  • offset 7: begin2 then begin3
  • offset 8: end2
  • offset 9: begin2 then end3

Ah-ha. Now everything is clear. A DFA is a DFA: it does not support memory and it does not support rollbacks. This means that our DFA is processing two branches simultaneously: both “"aa"+” and “"aaa"+”. We'll need to keep this in mind. A couple of other observations:

  1. When we used just the begin action it was called once, but when we split it in two (begin2 and begin3) both are called! By default Ragel merges actions.
  2. Actions are called in a non-random order. Take a look at offset 4: end2 is called before begin3. That's because begin3 has lower priority than end2! Note that in the previous example this same effect was observed, but it was quite mysterious there. The closer the action is to the beginning of the source file, the higher its priority is.
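The "two branches at once" behaviour can be reproduced without Ragel. The following is a hand-written stand-in, not generated code: sketch_trace_actions is a hypothetical helper which tracks the position inside the “aa” group and inside the “aaa” group in parallel and always reports the “aa” actions first, mimicking the priority order observed above.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes one "offset: actions" line per input byte into out.
 * Returns 0 on success, -1 if the input is not accepted by
 * ("b" . ("aa"+ | "aaa"+))* up to that point or out is too small. */
int sketch_trace_actions(const char *input, char *out, size_t out_size) {
  size_t used = 0;
  size_t run = 0;   /* number of 'a' bytes seen since the last 'b' */
  int seen_b = 0;
  size_t i;

  if (out_size == 0) return -1;
  out[0] = '\0';
  for (i = 0; input[i] != '\0'; ++i) {
    const char *first = NULL;   /* action of the "aa" branch */
    const char *second = NULL;  /* action of the "aaa" branch */
    int n;

    if (input[i] == 'b') {
      /* A new 'b' is legal only if one of the branches has just finished. */
      if (seen_b && (run == 0 || (run % 2 != 0 && run % 3 != 0))) return -1;
      seen_b = 1;
      run = 0;
    } else if (input[i] == 'a' && seen_b) {
      ++run;
      first = (run % 2 == 1) ? "begin2" : "end2";
      if (run % 3 == 1) second = "begin3";
      else if (run % 3 == 0) second = "end3";
    } else {
      return -1;
    }

    if (first == NULL)
      n = snprintf(out + used, out_size - used, "%zu: nothing\n", i);
    else if (second == NULL)
      n = snprintf(out + used, out_size - used, "%zu: %s\n", i, first);
    else
      n = snprintf(out + used, out_size - used, "%zu: %s then %s\n", i, first, second);
    if (n < 0 || (size_t)n >= out_size - used) return -1;
    used += (size_t)n;
  }
  return 0;
}
```

Calling it on “baaaaaaaaa” yields the same sequence of begin2/end2/begin3/end3 events as the second list above. Neither branch is ever dropped, which is exactly why the actions interleave.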

3. Validator for x86-32 mode.

Here is the build diagram:


Gray elements are hand-written, white elements are generated and dark-gray are aforementioned code generators.

*.def files contain instruction definitions taken almost verbatim from AMD instruction manual. They are parsed by gen_dfa, which in turn produces ragel definition of regular language of all instructions (validator_x86_32_instruction.rl). This regular language (machine in ragel terms) is used as a building block to define language of all 'valid' bundles (give or take some subtle details we will discuss later). Language of valid bundles is defined in validator_x86_32.rl.

To understand how validator works it's best to start from function ValidateChunkIA32 in validator_x86_32.rl. Said function is very short and “simple”: it allocates couple of arrays (valid_targets and jump_dests), then cycles over code passed to it (processing it in bundle-sized chunks) and at the end it compares valid jump targets and collected jump destinations… that's it. Oh, and it also includes couple of cryptic lines right in the middle of innermost cycle:


    %% write init;
    %% write exec;

These lines instruct ragel to insert DFA code (in C) here. Resulting output will go to file validator_x86-32.c, which performs actual validation.

Our main DFA is “(one_instruction | special_instruction)*”—i.e. it accepts sequence of “normal” instructions and “special” instructions. It consumes byte by byte from current_position pointer until one of the following ending conditions is met:

If the end of the bundle is reached when the automaton is not in an accepting state, it means the instruction we are currently reading crosses the bundle boundary. Strictly speaking, it may not be a valid instruction at all, only a prefix of some valid instruction (we haven't seen what's in the next bundle yet), but either way it's something that definitely violates the ABI.

Note that even if automaton leaves prematurely (before the end of bundle), validation goes on from the beginning of the next bundle. If one bundle is rejected then the whole chunk is always rejected, but this approach makes it possible to diagnose more errors in one pass which helps while code is developed.

Apparently collection of valid jump targets and actual target destinations happens inside this automaton. How?

3.1. Valid jump targets (instruction boundaries).

Just like in the example above there are two actions. The first one is triggered at the beginning of the instruction (“normal” or “special”): it's used to remember the beginning of the instruction, to clear instruction_info_collected, and to mark the first byte of the instruction as a valid target for a direct jump. The second one is triggered at the final byte of the instruction (“normal” or “special”) and is used to report errors. There is also one additional action which is declared as “$err”. This is the error fallback action: it's called whenever there is no transition for a particular byte in our DFA. This means we've hit either a forbidden instruction like lgdt or some undefined byte sequence… in both cases the UNRECOGNIZED_INSTRUCTION error is reported and processing is stopped.

There are three “special” instructions in the IA32 case: naclcall, nacljmp and mov %gs:0x0/0x4,%reg (the public ABI allows read-only access to %gs:0, and read-only access to %gs:4 is allowed for IRT). The last one is declared as a “special” instruction to simplify the validation logic (and the DFA, too): instead of accepting all versions of the mov %gs:something,%reg instruction followed by additional logic which rejects most possibilities (only the plain vanilla “zero” is allowed here as per the ABI) we only describe this one version of the instruction and ragel does the rest. naclcall and nacljmp include a special action which clears the “valid destination address” bit (remember the story with begin and end actions above? when the first byte of the second half of naclcall/nacljmp is processed it's processed both as part of the naclcall/nacljmp and as the start of a regular instruction, too).

This explains how valid_targets array is filled and invalid instructions are rejected. Note that even invalid instruction would be marked as valid jump target, but we don't care about this peculiarity because validation result will be negative anyway.
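The bookkeeping described above can be pictured as two bitmaps with one bit per code byte. The sketch below is an illustration under assumptions: the helper names and the 32-bit word layout are invented here and are not the validator's actual helpers, and the special treatment of bundle-aligned destinations is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One bit per code byte, packed into 32-bit words (assumed layout). */
void sketch_bitmap_set(uint32_t *map, size_t i) {
  map[i / 32] |= (uint32_t)1 << (i % 32);
}

/* Used, for example, to un-mark the second half of naclcall/nacljmp. */
void sketch_bitmap_clear(uint32_t *map, size_t i) {
  map[i / 32] &= ~((uint32_t)1 << (i % 32));
}

int sketch_bitmap_get(const uint32_t *map, size_t i) {
  return (int)((map[i / 32] >> (i % 32)) & 1u);
}

/* Final comparison: returns 1 when every collected jump destination is also
 * marked as an instruction start, 0 otherwise. size is the code size in bytes. */
int sketch_all_jumps_land_on_boundaries(const uint32_t *valid_targets,
                                        const uint32_t *jump_dests,
                                        size_t size) {
  size_t words = (size + 31) / 32;
  size_t w;

  for (w = 0; w < words; ++w) {
    if (jump_dests[w] & ~valid_targets[w]) return 0;
  }
  return 1;
}
```

Comparing whole words rather than single bits keeps the final pass short: one AND-NOT per 32 code bytes.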

3.2. Jump destinations.

But of course there are jump_dests, too. Special instructions don't touch it, but something obviously fills the array, doesn't it? This can only be the result of processing normal instructions, thus we need to go deeper. Where does it all come from? To understand that we need to look at the [autogenerated] validator_x86_32_instruction.rl file. The file looks like this:


    ⋮
  Semi-manual simple helper machines and actions
    ⋮
  one_instruction =
      ⋮
    (branch_hint? 0x77 rel8) |
    (branch_hint? (0x0f 0x87) rel32) |
      ⋮
    ((0x0f 0x01 0xd0) @CPUFeature_FXSR);

0x77 and 0x0f 0x87 are opcodes for the ja (aka jnbe) instruction, but what are branch_hint? and rel8/rel32 doing here? Well, “?” means “optional” (like in most RE engines) and both branch_hint and rel8/rel32 are references to machines defined in the semi-manual simple helper machines and actions part of the validator_x86_32_instruction.rl file. The whole construct describes the part of the DFA which is designed to accept the ja (aka jnbe) instruction, complete with the optional P4-inspired branch prediction prefix. The definition of branch_hint is trivial and obvious (“branch_hint = 0x2e | 0x3e;” if you want to know), but rel8/rel32 are somewhat more “interesting”:
    rel8 = any @rel8_operand;
    rel32 = any{4} @rel32_operand;
It's “more interesting” not because it's complex or non-obvious. The interesting part here is the fact that the actions rel8_operand/rel32_operand are not present in validator_x86_32_instruction.rl; they are in the validator_x86_32.rl file! But the definition itself is pretty trivial:
  action rel8_operand {
    int8_t offset = (uint8_t) (p[0]);
    size_t jump_dest = offset + (p - data) + 1;

    if (!MarkJumpTarget(jump_dest, jump_dests, size)) {
      instruction_info_collected |= DIRECT_JUMP_OUT_OF_RANGE;
    }
  }
  action rel32_operand {
    int32_t offset =
        (p[-3] + 256U * (p[-2] + 256U * (p[-1] + 256U * ((uint32_t) p[0]))));
    size_t jump_dest = offset + (p - data) + 1;

    if (!MarkJumpTarget(jump_dest, jump_dests, size)) {
      instruction_info_collected |= DIRECT_JUMP_OUT_OF_RANGE;
    }
  }
We just check whether the jump target passes the preliminary check (a direct jump outside of the region is always invalid), and if it does not, we record the DIRECT_JUMP_OUT_OF_RANGE error.
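The arithmetic in rel32_operand can be tried in isolation. The following standalone sketch assumes, as the action above does, that the pointer refers to the last byte of a little-endian 4-byte field and that the destination is counted from the byte after it. The function names and the exact bounds check are hypothetical; the real check lives in MarkJumpTarget.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian signed 32-bit value whose LAST byte is at `last`. */
int32_t sketch_read_rel32(const uint8_t *last) {
  uint32_t raw = (uint32_t)last[-3] | ((uint32_t)last[-2] << 8) |
                 ((uint32_t)last[-1] << 16) | ((uint32_t)last[0] << 24);

  /* Convert to signed without relying on implementation-defined casts. */
  if (raw >= 0x80000000u) return (int32_t)(raw - 0x80000000u) + INT32_MIN;
  return (int32_t)raw;
}

/* Computes the jump destination for a rel32 field ending at last_index.
 * Returns 1 and stores the destination if it falls inside [0, size),
 * returns 0 otherwise (the DIRECT_JUMP_OUT_OF_RANGE case). */
int sketch_rel32_destination(const uint8_t *data, size_t size,
                             size_t last_index, size_t *dest) {
  int64_t target;

  if (last_index < 3 || last_index >= size) return 0;
  target = (int64_t)last_index + 1 + sketch_read_rel32(data + last_index);
  if (target < 0 || (uint64_t)target >= (uint64_t)size) return 0;
  *dest = (size_t)target;
  return 1;
}
```

For the bytes 0x0f 0x87 0x10 0x00 0x00 0x00 at the start of a 32-byte region the instruction ends at offset 5, the offset is 16, and the destination is 6 + 16 = 22.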

4. Validation for x86-64 mode.

While validator for ia32 mode is very simple and short (it also produces pretty compact code) validator for x86-64 mode is different. It still has all the same properties validator for ia32 mode had (valid_targets and jump_dests arrays, “normal” and “special” instructions, bundles and rel8_operand/rel32_operand actions), but it adds quite a few additional twists to the whole scheme.

It's created in a process which is similar to the process which creates the ia32 validator.


Gray elements are hand-written, white elements are generated and dark-gray are code generators.

4.1. “Secondary” states.

First of all: the ia32 mode validator had one DFA in it and two arrays which kept track of the instruction boundaries, but x86-64 has a few more state variables. Most of them (rex_prefix, vex_prefix2, vex_prefix3, operand_states, base, and index) keep track of the instruction parts (and thus they are cleared before each instruction), but one variable called restricted_register is used to tie different instructions together. As the name implies it keeps track of the restricted register (if any). A restricted register in the NaCl SFI model on x86-64 systems is a general purpose register which has its top 32 bits cleared. Note that not all restricted registers are born equal: most registers can be restricted and then forgotten (if you write to %eax and do nothing with the value before a call then nothing problematic or dangerous can ever happen), but %esp and %ebp are exceptions. If you write to %esp then the very next instruction must be add %r15,%rsp or lea (%r15,%rsp,1),%rsp, and %rbp has similar requirements. This means that if at the end of a bundle the restricted register is %rsp or %rbp then the program is invalid. For the same reason, if at the beginning of a normal instruction (this includes the first instruction in a “compound”) we see restricted %rsp or restricted %rbp then it's an error, too. On the other hand, the few rare special instructions which are used to restore the SFI invariant WRT %rsp or %rbp will only be accepted if the restricted register is %rsp xor %rbp (depending on the special instruction).
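The rules for restricted %rsp and %rbp amount to a tiny state check. The sketch below restates them in C under assumptions: the SK_ names are invented for this illustration, the register numbers follow the usual x86 encoding, and the numeric value of the "no register" state is made up.

```c
/* Hypothetical stand-ins for the validator's register names. */
enum sketch_register {
  SK_REG_RAX = 0,
  SK_REG_RSP = 4,
  SK_REG_RBP = 5,
  SK_REG_R15 = 15,
  SK_NO_REG = 16  /* assumed value, chosen only for this sketch */
};

/* A normal instruction may not start while %rsp or %rbp is restricted. */
int sketch_normal_instruction_allowed(int restricted_register) {
  return restricted_register != SK_REG_RSP &&
         restricted_register != SK_REG_RBP;
}

/* The same condition applies at the end of a bundle. */
int sketch_bundle_end_allowed(int restricted_register) {
  return sketch_normal_instruction_allowed(restricted_register);
}

/* Special instruction add %r15,%rsp: accepted only right after an
 * instruction that restricted %rsp; on success the state is reset. */
int sketch_accept_add_r15_rsp(int *restricted_register) {
  if (*restricted_register != SK_REG_RSP) return 0;
  *restricted_register = SK_NO_REG;
  return 1;
}
```

A restricted %rax, by contrast, passes both checks and can simply be forgotten.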

4.2. “Normal” instructions.

The hard part is, as before, in the DFA. First of all, the main machine is similar to what we had in ia32 mode, but subtly different: it's “(normal_instruction | special_instruction)*” now. I.e.: one_instruction is replaced with normal_instruction. And what is normal_instruction? Why, it's “one_instruction - special_instruction”, of course! Well… this is unexpected: why would we want to remove special_instructions from normal_instructions only to add them back? The answer is related to actions: recall how actions work. When we remove special_instruction from one_instruction we also remove the associated actions. This is important in the x86-64 case because some special instructions are just normal instructions which are permitted to violate the usual rules! E.g. the “special” instruction and $~0x1f,%rsp (which is used to align the stack pointer) changes %rsp directly, which is usually forbidden, but because of the properties of and $xxx,… (for any $xxx < 0) we know that invariants will not be violated.

This approach works well, but only if violations are detected at the instruction end. E.g. the aforementioned and $~0x1f,%rsp instruction is encoded as 0x48 0x83 0xe4 0xe0. After we've read 0x48 0x83 0xe4 we already know it's a normal instruction (opcode 0x83 with opcode extension 4 in the following byte means it's and) which writes to %rsp (0x48 opcode 0xe4 means it's some instruction which accepts some kind of immediate and writes to %rsp). If we signal the error at this point, then the fact that later we'll find out it's a special_instruction which is accepted anyway will not matter: the SPL_MODIFIED error will be triggered, which means that the code is rejected!

This means that we can not do the actual condition checking till the very end of a normal instruction (we could try to process some of the conditions early but not all of them, but this approach would be quite complex and fragile, which is not something you want in the most critical security piece). But there is an exception: memory access. This one is checked inline: memory access outside of the “40GiB safe area” is strictly forbidden no matter how “special” the instruction is. That's why it's checked immediately after operands discovery. This is what the relevant fragment for the and instruction looks like:


    (0x83 (opcode_4 any* & any . any* & operand_disp @check_access) imm8 @process_0_operands) |
    (0x83 (opcode_4 any* & any . any* & operand_rip @check_access) imm8 @process_0_operands) |
    (REX_B? 0x83 (opcode_4 any* & any . any* & single_register_memory @check_access) imm8 @process_0_operands) |
    (REX_X? 0x83 (opcode_4 any* & any . any* & operand_sib_pure_index @check_access) imm8 @process_0_operands) |
    (REX_XB? 0x83 (opcode_4 any* & any . any* & operand_sib_base_index @check_access) imm8 @process_0_operands) |
    (lock 0x83 (opcode_4 any* & any . any* & operand_disp @check_access) imm8 @process_0_operands) |
    (lock 0x83 (opcode_4 any* & any . any* & operand_rip @check_access) imm8 @process_0_operands) |
    (lock REX_B? 0x83 (opcode_4 any* & any . any* & single_register_memory @check_access) imm8 @process_0_operands) |
    (lock REX_X? 0x83 (opcode_4 any* & any . any* & operand_sib_pure_index @check_access) imm8 @process_0_operands) |
    (lock REX_XB? 0x83 (opcode_4 any* & any . any* & operand_sib_base_index @check_access) imm8 @process_0_operands) |
    (REX_B? 0x83 (opcode_4 @operand0_32bit any* & modrm_registers @operand0_from_modrm_rm) imm8 @process_1_operand) |
As you can see, check_access is triggered after parsing the ModRM/SIB bytes but before parsing the immNN field, while the process_N_operands action is triggered at the very end of the “normal” instruction. Even if the instruction does not use an immNN field, the check_access action is still triggered before the process_N_operands action. This is important because the check_access action actually depends on the previous state of the restricted_register variable while the process_N_operands action changes the restricted_register variable. Note that it's only triggered for “normal” instructions. “Special” instructions either do the work themselves (e.g. add %r15,%rsp, which is only valid if the previous state of the restricted_register variable was REG_RSP and changes it to NO_REG in case of success) or call the usual process_N_operands action (e.g. mov %rsp,%rbp calls process_0_operands, which ensures that this operation is not called when restricted_register is in the REG_RSP/REG_RBP state and transitions it to the NO_REG state).

You can find yet another surprising thing in the snippet above: the and instruction is handled either as an instruction with zero operands or as an instruction with one operand… but of course in reality it always has two operands! Something is strange here… Well, sure: the decoder part of the validator is as streamlined as possible. We just ignore all non-register arguments and arguments which are not written to (but we don't ignore memory accesses if they happen here, of course). That's why and has either one or zero operands as far as the validator is concerned.
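The encoding 0x48 0x83 0xe4 0xe0 of and $~0x1f,%rsp discussed earlier in this section can be taken apart by hand. The helpers below are a hypothetical illustration of the standard x86 field layout (REX prefix, ModRM byte, sign-extended imm8); they are not part of the validator, which never decodes fields this way at runtime.

```c
#include <stdint.h>

struct sketch_modrm {
  unsigned mod;  /* bits 7..6: 3 means register-direct operand */
  unsigned reg;  /* bits 5..3: opcode extension for opcode 0x83 */
  unsigned rm;   /* bits 2..0: register number (4 is %rsp) */
};

struct sketch_modrm sketch_split_modrm(uint8_t byte) {
  struct sketch_modrm fields;
  fields.mod = (byte >> 6) & 3u;
  fields.reg = (byte >> 3) & 7u;
  fields.rm = byte & 7u;
  return fields;
}

/* REX prefixes occupy 0x40..0x4f; bit 3 (W) selects 64-bit operand size. */
int sketch_is_rex(uint8_t byte) { return (byte & 0xf0) == 0x40; }
int sketch_rex_w(uint8_t byte) { return (byte >> 3) & 1; }

/* imm8 operands of opcode 0x83 are sign-extended to the operand size. */
int sketch_sign_extend_imm8(uint8_t byte) {
  return byte >= 0x80 ? (int)byte - 0x100 : (int)byte;
}
```

For this instruction 0x48 is REX with W set, ModRM 0xe4 splits into mod = 3, reg = 4 (the and member of the 0x83 group) and rm = 4 (%rsp), and the immediate 0xe0 sign-extends to -32, which is ~0x1f.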

4.3. Operands handling.

Operands handling, again, is not that complex… if you are familiar with bit operations. The initial version of the validator used a simple array of records to store the information and everything worked well… with GCC, that is. MSVC produced awful code which was almost 30% slower and also needed twenty minutes to do so, thus we replaced this simple version with the current macro-based one.

All the information about encountered operands is collected in a single scalar variable operand_states. The layout of said variable looks like this:

  Bits      Field
  63..39    padding
  38..37    operand4: register_type
  36..32    operand4: register_name (bit 36 is 0 if normal register)
  31        padding
  30..29    operand3: register_type
  28..24    operand3: register_name (bit 28 is 0 if normal register)
  23        padding
  22..21    operand2: register_type
  20..16    operand2: register_name (bit 20 is 0 if normal register)
  15        padding
  14..13    operand1: register_type
  12..8     operand1: register_name (bit 12 is 0 if normal register)
  7         padding
  6..5      operand0: register_type
  4..0      operand0: register_name (bit 4 is 0 if normal register)

Register names are defined in the register_name enum: the first 16 are identical to the AMD/Intel names (from REG_RAX to REG_R15) while the other 16 are used (partially) to describe non-register operands (memory operand, immediate operand, REG_RIP and REG_RIZ, etc). This means that if an operand's name is >15 then it can be ignored. There are only four operand types: OperandSandboxIrrelevant, OperandSandbox8bit, OperandSandboxRestricted, and OperandSandboxUnrestricted. The first type is something not related to a general purpose register (x87, MMX, XMM, or YMM registers fall into this category). We need to handle 8bit operands specially because they are finicky: if a REX byte is used they access %spl, %bpl, %sil, and %dil, but when a REX byte is not used the same numbers are reused for %ah, %ch, %dh, and %bh! The last two types are the most important: these are 32bit operands (which will make the appropriate register “restricted”) or 16bit/64bit operands (these may affect the register in question negatively if that's %rbp, %rsp, or %r15, but for other registers these are just ignored). Note that if you assign 0 to this variable then all operands will be of OperandSandboxIrrelevant type.

Now the set of macros used to work with operands should look less mysterious:


#define SET_OPERAND_NAME(N, S) operand_states |= ((S) << ((N) * 8))
#define SET_OPERAND_TYPE(N, T) SET_OPERAND_TYPE_ ## T(N)
#define SET_OPERAND_TYPE_OPERAND_SIZE_8_BIT(N) operand_states |= OperandSandbox8bit << (5 + ((N) << 3))
#define SET_OPERAND_TYPE_OPERAND_SIZE_16_BIT(N) operand_states |= OperandSandboxUnrestricted << (5 + ((N) << 3))
#define SET_OPERAND_TYPE_OPERAND_SIZE_32_BIT(N) operand_states |= OperandSandboxRestricted << (5 + ((N) << 3))
#define SET_OPERAND_TYPE_OPERAND_SIZE_64_BIT(N) operand_states |= OperandSandboxUnrestricted << (5 + ((N) << 3))
#define CHECK_OPERAND(N, S, T) ((operand_states & (0xff << ((N) << 3))) == ((S | (T << 5)) << ((N) << 3)))
Calls like SET_OPERAND_NAME(0, REG_RAX) are used by actions to set the name of the operand (this particular one is used by the operand0_rax action) while calls like SET_OPERAND_TYPE(0, OPERAND_SIZE_2_BIT) are used by actions to set the type of the operand (this particular one is used by the operand0_2bit action). Note that we don't handle 2bit operands in the set of macros above. This is not a mistake: 2bit operands are only ever used as immediate operands (and then only in two instructions: vpermil2pd and vpermil2ps) and we don't process immediate operands here. If for some reason they are left in the validator_x86_64_instruction.rl file, this will lead to a compile-time error, not to some kind of weird overflow which may [potentially] produce a security hole.

Almost all manipulations with operand_states are done using the macros described above, but there is one construct in the process_N_operands function which accesses operand_states directly:


    /* Take 2 bits of operand type from operand_states as *restricted_register,
     * make sure operand_states denotes a register (4th bit == 0). */
    } else if ((operand_states & 0x70) == (OperandSandboxRestricted << 5)) {
      *restricted_register = operand_states & 0x0f;
    }
If you take a look at the layout of operand_states then it's pretty easy to understand what goes on here: (operand_states & 0x70) == (OperandSandboxRestricted << 5) yields TRUE if and only if the zeroth operand is a “normal” register and it's of type OperandSandboxRestricted. This is actually the central piece of the restricted_register handling: most other pieces just return it back to the NO_REG state.
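To experiment with this layout outside the validator, the packing can be written as ordinary functions. This is a sketch under assumptions: only the value 0 for the "irrelevant" type is stated in the text, so the other three type values below are guesses, and all sketch_/SK_ names are hypothetical. The field positions (5-bit name at bit 8·N, 2-bit type at bit 8·N + 5) follow the macros above.

```c
#include <stdint.h>

/* Assumed numbering; the text only guarantees that "irrelevant" is 0. */
enum sketch_operand_type {
  SK_IRRELEVANT = 0,
  SK_8BIT = 1,
  SK_RESTRICTED = 2,
  SK_UNRESTRICTED = 3
};

/* Adds operand n (0..4) with the given 5-bit name and 2-bit type. */
uint64_t sketch_pack_operand(uint64_t states, unsigned n,
                             unsigned name, unsigned type) {
  return states | ((uint64_t)(name & 0x1fu) << (n * 8)) |
         ((uint64_t)(type & 0x3u) << (5 + n * 8));
}

unsigned sketch_operand_name(uint64_t states, unsigned n) {
  return (unsigned)((states >> (n * 8)) & 0x1fu);
}

unsigned sketch_operand_type(uint64_t states, unsigned n) {
  return (unsigned)((states >> (5 + n * 8)) & 0x3u);
}

/* Mirrors the 0x70 test: bit 4 must be 0 (a real register, name <= 15)
 * and bits 6..5 must hold the "restricted" type. */
int sketch_operand0_restricts(uint64_t states) {
  return (states & 0x70u) == ((uint64_t)SK_RESTRICTED << 5);
}
```

With these definitions a 32bit write to register 4 packs to 0x44 and passes the test, while the same type attached to a name above 15 (a non-register operand) sets bit 4 and fails it.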

5. Features beyond minimal validation.

5.1. CPUID support.

CPUID support is implemented using a large set of actions embedded in the definitions of instructions (see, e.g. @CPUFeature_FXSR in the line for instruction 0x0f 0x01 0xd0 AKA xgetbv). CPUID-related actions are triggered when we know the identity of the instruction (which happens at different times for different instructions: some instructions are detected when the opcode is read, some use an opcode extension, etc; AMD/Intel manuals contain all the gory details), but the definitions of said actions in validator_x86_32_instruction.rl are very simple:


  action CPUFeature_FXSR {
    SET_CPU_FEATURE(CPUFeature_FXSR);
  }
This time magic is in validator_internal.h. SET_CPU_FEATURE is defined as
  if (!(F##_Allowed)) { \
    instruction_info_collected |= UNRECOGNIZED_INSTRUCTION; \
  } \
  if (!(F)) { \
    instruction_info_collected |= CPUID_UNSUPPORTED_INSTRUCTION; \
  }
IOW: it's pretty straightforward and simple, but there is a twist: CPUFeature_FXSR is not the name of a variable, but the name of a macro definition. This is needed to handle special cases where a CPUFeature does not correspond to a single CPUID bit. E.g. the prefetch instruction is available when either of two bits is set: the 3DNow! bit or the dedicated Prefetch instruction bit. AMD documentation also claims prefetch is always available if the LongMode bit is set, but Intel documentation does not support this assertion. On the other hand, vaesenc is available when both the AES and AVX bits are set. And our ABI permits lzcnt and tzcnt unconditionally (thus CPUFeature_LZCNT does not check for anything but just returns TRUE in all cases).
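The idea that a "feature" is a macro over several CPUID bits can be sketched as follows. Everything here is hypothetical: the struct, the SK_ names and the two flag values are invented for illustration, and the real macros read their bits from the validator's own CPUID structures rather than from an explicit argument.

```c
#include <stdbool.h>

struct sketch_cpu_bits {
  bool has_3dnow;
  bool has_prefetch_bit;
  bool has_aes;
  bool has_avx;
};

#define SK_UNRECOGNIZED 0x1u       /* stand-in for UNRECOGNIZED_INSTRUCTION */
#define SK_CPUID_UNSUPPORTED 0x2u  /* stand-in for CPUID_UNSUPPORTED_INSTRUCTION */

/* Each "feature" is an expression over several bits, not a single bit. */
#define SK_FEATURE_PREFETCH(bits) ((bits)->has_3dnow || (bits)->has_prefetch_bit)
#define SK_FEATURE_AESAVX(bits) ((bits)->has_aes && (bits)->has_avx)
#define SK_FEATURE_LZCNT(bits) ((void)(bits), true)

/* Checks feature F against the allowed mask and the actual CPU mask. */
#define SK_SET_CPU_FEATURE(F, allowed, actual, info)       \
  do {                                                     \
    if (!F(allowed)) *(info) |= SK_UNRECOGNIZED;           \
    if (!F(actual)) *(info) |= SK_CPUID_UNSUPPORTED;       \
  } while (0)

unsigned sketch_check_prefetch(const struct sketch_cpu_bits *allowed,
                               const struct sketch_cpu_bits *actual) {
  unsigned info = 0;
  SK_SET_CPU_FEATURE(SK_FEATURE_PREFETCH, allowed, actual, &info);
  return info;
}

unsigned sketch_check_vaesenc(const struct sketch_cpu_bits *allowed,
                              const struct sketch_cpu_bits *actual) {
  unsigned info = 0;
  SK_SET_CPU_FEATURE(SK_FEATURE_AESAVX, allowed, actual, &info);
  return info;
}

unsigned sketch_check_lzcnt(const struct sketch_cpu_bits *allowed,
                            const struct sketch_cpu_bits *actual) {
  unsigned info = 0;
  SK_SET_CPU_FEATURE(SK_FEATURE_LZCNT, allowed, actual, &info);
  return info;
}
```

Because the feature name is passed as a macro rather than a value, an OR of two bits, an AND of two bits and an unconditional TRUE all go through the same two-line check.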

Note: there are two CPUID masks: hardcoded one (it can be replaced if you link in different definition of validator_cpuid_features global variable in your program) and runtime-supplied one (usually obtained from actual CPUID call in production, but hardcoded in tests). New instructions are first added in “production disabled” mode and must pass a security review before they can be used in Chrome.

5.2. Dynamic code modification support.

Dynamic code modification support is implemented with the help of the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION option. Normally the user callback is only used when some kind of error is detected, but if this option is used then the callback is called after each instruction. When that happens the callback has all the information needed to process the instruction: collected errors, information about immediates, etc.

All that information is squeezed in instruction_info_collected variable. It has the following format:

  Bits      Meaning
  31        Reserved.
  30        Invalid jump target. When this flag is set instruction_begin and instruction_end both point to the jump target instruction, not to the jump instruction itself.
  29        Last byte is not immediate. It's either opcode, register number or register number and two-bit immediate.
  28        Special instruction (uses different validation rules from the regular instruction). Can not be changed in ia32 mode.
  27        ia32 mode: reserved; amd64 mode: Instruction is modifiable.
  26        Reserved.
  25        Bad call alignment: call must end at the end of the bundle, since nacljmp only can jump to aligned address.
  24        ia32 mode: reserved; amd64 mode: %spl, %sp, or %rsp is incorrectly modified. Only %rsp can be modified and then only by special instructions.
  23        ia32 mode: reserved; amd64 mode: %bpl, %bp, or %rbp is incorrectly modified. Only %rbp can be modified and then only by special instructions.
  22        %r15b, %r15w, %r15d, or %r15 is modified. %r15 is untouchable in amd64 mode.
  21..20    ia32 mode: reserved; amd64 mode (only if some %rbp/%rsp related error is detected):
              00: Instruction which zero-extends %rbp must be followed by add %r15,%rbp, lea (%rbp,%r15,1),%rbp, or lea 0x0(%rbp,%r15,1),%rbp.
              01: add %r15,%rbp, lea (%rbp,%r15,1),%rbp, or lea 0x0(%rbp,%r15,1),%rbp is used after instruction which does not zero-extend %rbp.
              10: Instruction which zero-extends %rsp must be followed by add %r15,%rsp or lea (%rsp,%r15,1),%rsp.
              11: add %r15,%rsp or lea (%rsp,%r15,1),%rsp is used after instruction which does not zero-extend %rsp.
  19        ia32 mode: reserved; amd64 mode: %rbp/%rsp sandboxing detected. Next two bits reveal details of the error.
  18        ia32 mode: reserved; amd64 mode: Index register is not zero-extended by previous instruction.
  17        ia32 mode: reserved; amd64 mode: Base register is not %rbp, %rsp, or %r15.
  16        Instruction is not supported for a given CPUID mask.
  15        Unaligned direct jump to address outside of given region.
  14        DFA error: invalid instruction. Validation then resumes from the next bundle.
  13        ia32 mode: reserved; amd64 mode: Instruction is valid, but it accesses memory using register which is zero-extended by previous instruction.
  12..8     ia32 mode: reserved; amd64 mode: Register, zero-extended by the instruction.
  7         Instruction has relative offset.
  6..5      Instruction displacement size.
  4         Instruction has two immediates.
  3..0      Cumulative size of anyfields.

The named masks VALIDATION_ERRORS_MASK, RESTRICTED_REGISTER_MASK, DISPLACEMENT_SIZE_MASK and IMMEDIATES_SIZE_MASK select the error bits, the zero-extended register field, the displacement size field and the anyfields size field respectively.
Note that half of the information does not make sense for ia32 mode and is not collected by ValidateChunkIA32.
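As a minimal sketch of how the %rbp/%rsp error bit and its two detail bits could be read from the collected information, the C fragment below decodes them into one of the four cases listed above. The bit positions and all EX_-prefixed names are placeholders invented for this illustration; the real values are defined by the validator's headers and are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit layout for illustration only; NOT the validator's real values. */
#define EX_RBP_RSP_ERROR        (1u << 0)  /* %rbp/%rsp sandboxing error detected */
#define EX_RBP_RSP_DETAIL_SHIFT 1          /* the two detail bits follow the error bit */
#define EX_RBP_RSP_DETAIL_MASK  (3u << EX_RBP_RSP_DETAIL_SHIFT)

/* The four detail codes, in the order of the list above (00, 01, 10, 11). */
enum ex_rbp_rsp_detail {
  EX_NO_RBP_RSP_ERROR = -1,
  EX_RESTRICTED_RBP_UNPROCESSED = 0,  /* %rbp zero-extended, no add/lea with %r15 after it */
  EX_UNRESTRICTED_RBP_PROCESSED = 1,  /* add/lea with %r15 on a %rbp that was not zero-extended */
  EX_RESTRICTED_RSP_UNPROCESSED = 2,  /* %rsp zero-extended, no add/lea with %r15 after it */
  EX_UNRESTRICTED_RSP_PROCESSED = 3   /* add/lea with %r15 on a %rsp that was not zero-extended */
};

/* The detail bits are only meaningful when the error bit itself is set. */
static enum ex_rbp_rsp_detail ex_decode_rbp_rsp(uint32_t info) {
  unsigned detail;
  if ((info & EX_RBP_RSP_ERROR) == 0)
    return EX_NO_RBP_RSP_ERROR;
  detail = (info & EX_RBP_RSP_DETAIL_MASK) >> EX_RBP_RSP_DETAIL_SHIFT;
  assert(detail <= 3);
  return (enum ex_rbp_rsp_detail)detail;
}
```

Checking the error bit first mirrors the "only if some %rbp/%rsp related error is detected" remark: detail bits without the error bit are ignored.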

Using this information you can determine whether a given instruction follows special rules (only naclcall and nacljmp in ia32 mode; many different instructions in amd64 mode: %rbp/%rsp modifications, string instructions, naclcall, and nacljmp), and whether it includes relative offsets (instructions like jcc, jmp, loopcc, or call), displacements (most instructions which access memory support displacements), or immediates (supported by many different instructions; they can be combined with a displacement if the instruction accesses memory). Tests may use the collected information to precisely separate the different anyfields (immediates, displacements, relative offsets), but in production only a few bits are used to determine whether the instruction can be changed: in ia32 mode only the special instructions naclcall and nacljmp cannot be changed, while in amd64 mode the situation is the opposite: only call and mov instructions can be changed, and only in their anyfields.

5.2.1. Replacement validation.

Code replacement is not performed by the ValidateChunk* functions directly. Instead it is done by a higher-level function in dfa_validate_*.c.

It calls ValidateChunk* with the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION option and compares the lengths of instructions in the two fragments in the callback. In ia32 mode the callback uses the SPECIAL_INSTRUCTION flag in its validation_info to determine whether an instruction can be changed (all non-special instructions are fair game), but in amd64 mode we only allow changes to a few hand-picked instructions (currently call and mov), which are marked with the MODIFIABLE_INSTRUCTION flag.
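The per-mode policy described above can be sketched as a single predicate. This is only an illustration: the flag values, the ex_mode enum, and the function name are hypothetical, and the length comparison between the two fragments is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values; the real flag values live in the validator's headers. */
#define EX_MODIFIABLE_INSTRUCTION (1u << 0)
#define EX_SPECIAL_INSTRUCTION    (1u << 1)

enum ex_mode { EX_IA32, EX_AMD64 };  /* hypothetical mode selector */

/* Returns 1 if the replacement policy allows this instruction to change. */
static int ex_may_change(enum ex_mode mode, uint32_t validation_info) {
  assert(mode == EX_IA32 || mode == EX_AMD64);
  if (mode == EX_IA32) {
    /* ia32: everything except special instructions is fair game. */
    return (validation_info & EX_SPECIAL_INSTRUCTION) == 0;
  }
  /* amd64: only explicitly marked instructions may change. */
  return (validation_info & EX_MODIFIABLE_INSTRUCTION) != 0;
}
```

The two branches make the asymmetry explicit: ia32 works as a deny-list keyed on one flag, while amd64 works as an allow-list keyed on another.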

One tricky thing here is the handling of relative jumps and calls: if a relative jump (or call) triggers DIRECT_JUMP_OUT_OF_RANGE but is bit-for-bit identical to the original instruction, it is accepted anyway, because this means that this particular jump (or call) targets some valid position outside of the given range. If such an instruction must be changed, you need to pass a bigger region to the ValidatorCodeReplacement_x86_* function so that the validator has a chance to check the landing place for validity (this is, of course, not needed if the landing point is bundle-aligned).
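A minimal sketch of that exception, assuming the callback can see both the old and the new bytes of the instruction. The flag values, the EX_OTHER_ERRORS mask, and the function name are placeholders made up for this example, and the function covers only the error handling, not the question of whether the instruction may be modified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values, not the validator's real ones. */
#define EX_DIRECT_JUMP_OUT_OF_RANGE (1u << 0)
#define EX_OTHER_ERRORS             (0xfu << 1)  /* stands in for every other error bit */

/* Returns 1 if the errors reported for this instruction are tolerable
 * during code replacement, 0 otherwise. */
static int ex_errors_tolerable(const uint8_t *old_insn, const uint8_t *new_insn,
                               size_t length, uint32_t validation_info) {
  assert(length > 0);
  if (validation_info & EX_OTHER_ERRORS)
    return 0;  /* any other error is always fatal */
  if ((validation_info & EX_DIRECT_JUMP_OUT_OF_RANGE) == 0)
    return 1;  /* no error at all */
  /* Out-of-range jump: tolerated only if the instruction is unchanged. */
  return memcmp(old_insn, new_insn, length) == 0;
}
```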

In ia32 mode the whole instruction can be changed, but in amd64 mode we don't allow arbitrary changes to the instruction: we only allow changes to anyfields (immediates, displacements, relative offsets). This is somewhat tricky: most instructions put them at the end, but some instructions use the last byte for something else: an opcode, a register number, or a register number combined with a two-bit immediate.

All these instructions set the LAST_BYTE_IS_NOT_IMMEDIATE flag; the last form can be distinguished because it sets the IMMEDIATE_2BIT flag (which actually includes the LAST_BYTE_IS_NOT_IMMEDIATE flag).
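The sketch below shows one way such a check could look. It assumes the total size of the anyfields is already known (the anyfield_size parameter is hypothetical), uses placeholder flag values, and conservatively requires the last byte to stay unchanged whenever LAST_BYTE_IS_NOT_IMMEDIATE is set, including the two-bit-immediate form. Because IMMEDIATE_2BIT includes LAST_BYTE_IS_NOT_IMMEDIATE, testing for it needs a full-mask comparison rather than a plain bit test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values; only the "includes" relationship is taken from the text. */
#define EX_LAST_BYTE_IS_NOT_IMMEDIATE (1u << 0)
#define EX_IMMEDIATE_2BIT             (EX_LAST_BYTE_IS_NOT_IMMEDIATE | (1u << 1))

/* A plain (info & EX_IMMEDIATE_2BIT) would also fire for the other two forms. */
static int ex_has_2bit_immediate(uint32_t info) {
  return (info & EX_IMMEDIATE_2BIT) == EX_IMMEDIATE_2BIT;
}

/* Returns 1 if old and new instruction differ only inside the anyfields.
 * Assumed layout: [fixed bytes][anyfields][optional non-immediate last byte]. */
static int ex_only_anyfields_changed(const uint8_t *old_insn, const uint8_t *new_insn,
                                     size_t length, size_t anyfield_size,
                                     uint32_t info) {
  size_t tail = (info & EX_LAST_BYTE_IS_NOT_IMMEDIATE) ? 1 : 0;
  size_t fixed;
  assert(anyfield_size + tail <= length);
  fixed = length - tail - anyfield_size;
  if (memcmp(old_insn, new_insn, fixed) != 0)
    return 0;  /* prefixes, opcode, or ModRM/SIB bytes changed */
  if (tail && old_insn[length - 1] != new_insn[length - 1])
    return 0;  /* the last byte is not an anyfield, so it must not change */
  return 1;
}
```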

5.2.2. Replacement copying.

This is done by a very simple function which uses the CALL_USER_CALLBACK_ON_EACH_INSTRUCTION mode to process instructions one after another.
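As a rough sketch of what such a per-instruction callback might do, the fragment below copies each changed instruction from the new code into the destination at the same offset. The callback signature, the ex_copy_ctx structure, and the assumption that the end pointer is one past the last byte of the instruction are all illustrative guesses; the sketch also ignores any ordering or atomicity requirements a real implementation may have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback data: where the new code starts and where to copy it. */
struct ex_copy_ctx {
  const uint8_t *new_code;  /* fragment being validated */
  uint8_t *dest;            /* existing code, same size as new_code */
};

/* Called once per instruction; returns 1 to let validation continue. */
static int ex_copy_instruction(const uint8_t *begin, const uint8_t *end,
                               uint32_t validation_info, void *data) {
  struct ex_copy_ctx *ctx = (struct ex_copy_ctx *)data;
  size_t offset, length;
  (void)validation_info;  /* not needed for plain copying */
  assert(end > begin);    /* assumption: end is one past the last byte */
  offset = (size_t)(begin - ctx->new_code);
  length = (size_t)(end - begin);
  /* Touch the destination only when the instruction actually differs. */
  if (memcmp(ctx->dest + offset, begin, length) != 0)
    memcpy(ctx->dest + offset, begin, length);
  return 1;
}
```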

6. Decoders.

The only remaining issue (but a big one) is the generation of the actual decoders ({decoder,validator}_x86_{32,64}_instruction.rl files). This is a big part of the whole package but, thankfully, it happens in a significantly less hostile environment: the decoder and validator must work even when they are processing a specially crafted file created by a clever adversary, while gen_dfa processes data files created by us and only needs to correctly process certain “good” files.

To understand how it works it's better to start with the decoders. Remember how we talked about “streamlined data structures”, “indispensable minimum of the information”, etc.? This approach produces a fast and [relatively] simple validator, but it makes that validator hard to test and debug. To facilitate testing and debugging we create separate decoders: these return all the information about all the instructions they can parse and in fact can produce output identical to objdump's output.

They are used to verify the descriptions of the instructions from the .def files, with special attention to the lengths of said instructions.

Decoders are created using the familiar process.


Gray elements are hand-written, white elements are generated and dark-gray are code generators.

There are a few big differences between standalone decoders and simplified decoders embedded in ValidateChunkIA32/ValidateChunkAMD64:

All these facts mean that standalone decoders are significantly larger and slower, but also much easier to understand. The simplified decoders use the exact same DFA with only some actions changed or omitted.