Project: x86-devirt

21 Sep 2018

Unpackme - x86 Virtualizer

Today, I am going to be going through how x86devirt works to disassemble and devirtualize the behaviour of code obfuscated using the x86virt virtual machine. I needed several tools to complete this task, the development of which will be covered in this article.

A code virtualizer protects code behaviour by retargeting some subroutines or sections of code from the x86/x64 platform (which is well understood and documented) into a (usually somewhat random) platform that we do not understand. Additionally, the tools we use to discover and analyze behaviour in executable code (such as radare2 or x64dbg) also do not understand it well. This makes identifying malicious behaviour, developing generic signatures or extracting other behavioral details from the code impossible without reversing the process in some way.

While it costs a large overhead in performance, obfuscation using code virtualization is very effective. Depending on the complexity of the protector/virtualizer, these packers can be very painful and tedious to work through.

Resources

You can see the final product (devirtualizer) of this article at the following GitHub URL: https://github.com/JeremyWildsmith/x86devirt

We are reverse engineering an unpackme by ReWolf at the following URL: https://tuts4you.com/download/1850/

This sample has been packed with an open-source application by ReWolf that is publicly posted on GitHub at the following URL: https://github.com/rwfpl/rewolf-x86-virtualizer

If you would like to look at a very simple example of a virtual machine, I have a project up on my GitHub that demonstrates the basic function of a VM Stub and you can see how exactly it works. The project is located at the following URL: https://github.com/JeremyWildsmith/StackLang

Tools / Knowledge

In this article I am going to use the following tools & knowledge:

  • The disassembler, written in the C++ programming language
  • x86 Intel Assembly
  • YARA Signatures
  • The udis86 library, used to disassemble x86 instructions
  • Python, used to automate x64dbg and the x64dbgpy plugin, as well as run Angr simulations to extract the jmp mappings
  • The x86virt unpackme sample by ReWolf on tuts4you.com (https://tuts4you.com/download/1850/)

How the x86virt VM Works

Before I dive into how x86devirt works, I am going to give a brief overview on my findings of how the x86virt VM works.

x86virt starts by taking the application to be protected and grabbing some subroutines that it has decided to protect. The x86 code for these subroutines is translated into an instruction set that uses the following format:

(Instruction Size)(Instruction Prefix)(Instruction Data)

  • Instruction Size - The size in bytes of the instruction
  • Instruction Prefix - This is 0xFFFF if the instruction is a VM instruction. Otherwise, the remaining bytes in the instruction data (and instruction prefix) should be interpreted as a x86 instruction
  • Instruction Data - If (Instruction Prefix) is 0xFFFF, this is the VM Opcode bytes followed by the operand bytes. Otherwise, these bytes are appended to the (Instruction Prefix) to form a valid x86 instruction.

For every VM Layer or virtualized target, x86virt randomizes the opcodes for that VM. So when we devirtualize, we need to map the opcodes to their respective VM instruction.

x86virt also takes instructions like the ones below:

mov eax, [esp + 0x8]

And translates them into something like this:

mov VMR, 0x8
add VMR, esp
mov eax, [VMR]

VMR is a register that only exists in the virtual machine, so during devirtualization, we need to interpret this and translate it back into its original form.

Finally, x86virt encrypts the entire instruction using an encryption algorithm that is, to some extent, randomized. Every time an instruction is executed, it is first decrypted and then interpreted. However, there is one consistency with instruction encryption between targets and VM layers: regardless of how the instruction was encrypted, in its encrypted form, the first byte XORed with the second byte will always give you the instruction length.

Another important note regarding instruction encryption is that the key to decrypt it is the address of that VM instruction relative to the start of the virtual function/code stub it belongs to. In other words, the key is the offset to that instruction in the function that has been virtualized. This means you cannot just blindly decrypt all the instructions in a function. You must do proper control flow analysis to determine where in memory the valid bytecode instructions are by identifying and following conditional or unconditional jumps.

So to disassemble an x86virt VM instruction, we must know:

  1. The size of the instruction
  2. The offset to that instruction
  3. The encryption algorithm (because it is somewhat random)

There is one last piece of the puzzle that we must consider when disassembling x86virt VM code. The conditional jumps for x86virt VM are encoded with a jump type operand. The part of the code that interprets the jump type operand is also somewhat random between VM layers and virtualized targets. We need to handle this case in our devirtualizer as well.

The x86devirt Disassembler

An important part to the x86-devirtualizer is the disassembler. The role of this module is to take a stub of virtualized, encrypted x86virt bytecode that has been extracted from the protected application and produce a NASM x86 Assembly translation of the encrypted bytecode. To do this, it needs a few pieces of information from the protected application:

  1. A dump of the decryption algorithm that the VM stub uses to decrypt VM instruction before interpreting them
  2. The mappings of opcodes to their respective behaviour, since the opcodes are randomized (i.e, 0x33 maps to add, 0x22 maps to mov)
  3. The mappings of jmp type operand to their respective jump behavior (i.e, jmp type 2 maps to je, 3 maps to jne, etc...)
  4. A dump of the function / stub of code to be devirtualized

We will see how this information is extracted by looking at the x86devirt.py x64dbg plugin later, but for now we will assume we have been provided this information.

The first step is to get the instruction length, decrypt the instruction and then identify whether we should interpret the instruction as an x86 instruction or a VM bytecode instruction. We see this being done in disassemblers' decodeVmInstruction method:

unsigned int decodeVmInstruction(vector<DecodedVmInstruction>& decodedBuffer, uint32_t vmRelativeIp, VmInfo& vmInfo) {

    uint32_t instrLength = getInstructionLength(vmInfo.getBaseAddress() + vmRelativeIp, vmInfo);

    //Read with offset 1, to trim off instr length byte
    unique_ptr<uint8_t[]> instrBuffer = vmInfo.readMemory(vmInfo.getBaseAddress() + vmRelativeIp + 1, instrLength);
    vmInfo.decryptMemory(instrBuffer.get(), instrLength, vmRelativeIp);

    DecodedInstructionType_t instrType = DecodedInstructionType_t::INSTR_UNKNOWN;

    if(*reinterpret_cast<unsigned short*>(instrBuffer.get()) == 0xFFFF) {
        //Offset by 2 which removes the 0xFFFF part of the instruction.

        //Map instructions correctly
        instrBuffer[2] = vmInfo.getOpcodeMapping(instrBuffer[2]);
        decodedBuffer = disassembleVmInstruction(instrBuffer.get() + 2, instrLength - 2, vmRelativeIp, vmInfo);
    } else {
        decodedBuffer = disassemble86Instruction(instrBuffer.get(), instrLength, vmInfo.getBaseAddress() + vmRelativeIp);
    }

    return instrLength + 1;
}

We see that the size is extracted using the getInstructionLength method (this will simply XOR the first two bytes to get the length). After that, the instruction is decrypted by using the dumped decryption subroutine extracted from the protected application (a more proper approach would be to emulate the code rather than directly executing it). Finally, we examine the first word to identify how to decode the instruction (as an x86 instruction or as a VM instruction). If the instruction is a VM bytecode instruction, we need to look up the opcode in the opcode mapping to determine what behaviour it maps to.

The way we disassemble x86 instructions is by using the udis86 library, and also keeping some basic information about the disassembled instruction. You can see how that is done below:

vector<DecodedVmInstruction> disassemble86Instruction(const uint8_t* instrBuffer, uint32_t instrLength, const uint32_t instrAddress) {
    DecodedVmInstruction result;
    result.isDecoded = false;
    result.address = instrAddress;
    result.controlDestination = 0;
    result.size = instrLength;

    memcpy(result.bytes, instrBuffer, instrLength);

    ud_set_input_buffer(&ud_obj, instrBuffer, instrLength);
    ud_set_pc(&ud_obj, instrAddress);
    unsigned int ret = ud_disassemble(&ud_obj);
    strcpy(result.disassembled, ud_insn_asm(&ud_obj));

    if(ret == 0)
        result.type = DecodedInstructionType_t::INSTR_UNKNOWN;
    else
        result.type = (!strncmp(result.disassembled, "ret", 3) ? DecodedInstructionType_t::INSTR_RETN : DecodedInstructionType_t::INSTR_MISC);

    vector<DecodedVmInstruction> resultSet;
    resultSet.push_back(result);

    return resultSet;
}

When it comes to disassembling x86virt bytecode instructions, we do that with a different subroutine, disassembleVmInstruction. The purpose of this subroutine is fairly straightforward so I won't bore you by reading the code line by line. However, some interesting cases are case 1 and case 2, which are essentially x86 instructions with VMR as the operand. It is also worth noting case 7 where the decoding of x86virt jump instructions are handled and case 16 which just signals for the VM to stop interpreting (and has no x86 equivalent)

Once we can disassemble the instructions, we need to properly identify where they are. As was previously mentioned, this requires some control flow analysis. During control flow analysis, the disassembler identifies the different blocks of code in a subroutine by using the getDisassembleRegions function. The getDisassembleRegions basically returns the regions of code in a method that can be reached using conditional or unconditional jumps inside of a virtualized function. We can see its behaviour below:

vector<DisassembledRegion> getDisassembleRegions(const uint32_t initialIp, VmInfo& vmInfo) {
    vector<DisassembledRegion> disassembledStubs;
    queue<uint32_t> stubsToDisassemble;
    stubsToDisassemble.push(initialIp);

    while(!stubsToDisassemble.empty()) {
        uint32_t vmRelativeIp = stubsToDisassemble.front() - vmInfo.getBaseAddress();
        stubsToDisassemble.pop();

        if(isInRegions(disassembledStubs, vmRelativeIp))
            continue;

        DisassembledRegion current;
        current.min = vmRelativeIp;

        bool continueDisassembling = true;
        while(vmRelativeIp <= vmInfo.getDumpSize() && continueDisassembling) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN) {
                    stringstream msg;
                    msg << "Unknown instruction encountered: 0x" << hex << ((unsigned long)instr.bytes[0]);
                    throw runtime_error(msg.str());
                }

                if(instr.type == DecodedInstructionType_t::INSTR_JUMP || instr.type == DecodedInstructionType_t::INSTR_CONDITIONAL_JUMP)
                    stubsToDisassemble.push(instr.controlDestination);

                if(instr.type == DecodedInstructionType_t::INSTR_STOP || instr.type == DecodedInstructionType_t::INSTR_RETN || instr.type == DecodedInstructionType_t::INSTR_JUMP)
                    continueDisassembling = false;
            }
        }

        current.max = vmRelativeIp;
        disassembledStubs.push_back(current);
    }

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))
            disassembledStubs.erase(it++);
        else
            it++;
    }

    return disassembledStubs;
}

The getDisassembleRegions performs the following functionality:

  1. Disassemble the virtualized subroutine from its start address and continues to do so until it encounters a jump (conditional or unconditional) or a return.
  2. If a conditional jump is encountered, its destination address is queued up to be the next region to be disassembled and the disassembler continues executing.
  3. If an unconditional jump is encountered, the destination address is queued up and disassembling of the current block ends
  4. If a ret is encountered, disassembling of the current block ends.
  5. Loops until there are no more regions to be disassembled.

The problem with the above algorithm is that it will identify code such as what is seen below:

labelD:
...
...
labelA:
...
...
jmp labelB
...
...
labelB:
...
...
jz labelD
...

As having overlapping blocks of code, which will result in redundant blocks of code and thus redundant disassembled output. This was solved by testing for and removing smaller overlapping regions:

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))
            disassembledStubs.erase(it++);
        else
            it++;
    }

    return disassembledStubs;

After we know the basic blocks of a subroutine that contain valid executable code, we can begin disassembling them. This is done in the disassembleStub routine:

bool disassembleStub(const uint32_t initialIp, VmInfo& vmInfo) {

    vector<DisassembledRegion> stubs = getDisassembleRegions(initialIp, vmInfo);

    //Needs to be sorted, otherwise (due to jump sizes) may not fit into original location
    //Sorting should match it with the way it was implemented.
    sort(stubs.begin(), stubs.end(), sortRegionsAscending);

    if(stubs.empty()) {
        printf(";No stubs detected to disassemble.. %d", stubs.size());
        return true;
    }

    vector<DecodedVmInstruction> instructions;
    for(auto& stub : stubs) {

        bool continueDisassembling = true;
        DecodedVmInstruction blockMarker;
        blockMarker.type = DecodedInstructionType_t::INSTR_COMMENT;
        strcpy(blockMarker.disassembled, "BLOCK");
        instructions.push_back(blockMarker);
        for(uint32_t vmRelativeIp = stub.min; continueDisassembling && vmRelativeIp < stub.max;) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN)
                    throw runtime_error("Unknown instruction encountered");

                if(instr.type == DecodedInstructionType_t::INSTR_STOP) {
                    continueDisassembling = false;
                    break;
                }

                instructions.push_back(instr);
            }

        }

        instructions.push_back(blockMarker);
    }

    for(auto& i : eliminateVmr(instructions)) {
        formatInstructionInfo(i);
    }

    return true;
}

An important note with this method is that it sorts the disassembled regions by their start address after doing the control flow analysis with getDisassembledRegions. This sorting must be done because the natural order was thrown out of whack by the queuing nature of the control flow analysis. Functionally, the order doesn't really make a difference in a normal application because at the end of the day, the code is still going to execute the same way regardless of where the instructions are. However, the way in which the blocks are organized will change the size of the code once it is assembled in NASM due to the way jump instructions are encoded on the x86 platform. Essentially, the distance between the jump instructions and their destination addresses will change depending on the order of the code blocks in the function, and the distance will influence the size of the jump instruction. If the devirtualized code is not the size of its original form (i.e, before it is was passed into x86virt to be virtualized) or smaller, then it will not fit back into where it was ripped from. While functionally, it doesn't matter that it isn't in the "proper" location, it does matter later when we encounter multiple VM layers because our signatures will not match partial handlers etc.

Other than that, there isn't anything too weird or noteworthy here until we encounter the call to eliminateVmr. Remember that I mentioned how x86virt creates a virtual register. We need to eliminate that because we cannot assemble that through NASM or produce valid x86 code while we have a virtual register. Below, we can see the behaviour of eliminateVmr:

vector<DecodedVmInstruction> eliminateVmr(vector<DecodedVmInstruction>& instructions) {
    auto itVmrStart = instructions.end();
    vector<DecodedVmInstruction> compactInstructionlist;

    for(auto it = instructions.begin(); it != instructions.end(); it++) {
        if(!strncmp("mov VMR,", it->disassembled, 8) && itVmrStart == instructions.end()) {
            itVmrStart = it;
        }else if(itVmrStart != instructions.end() && strstr(it->disassembled, "[VMR]") != 0)
        {
            for(auto listing = itVmrStart; listing != it+1; listing++) {
                DecodedVmInstruction comment = *listing;
                comment.type = INSTR_COMMENT;
                compactInstructionlist.push_back(comment);
            }
            compactInstructionlist.push_back(eliminateVmrFromSubset(itVmrStart, it + 1));
            itVmrStart = instructions.end();
        } else if (itVmrStart == instructions.end()) {
            compactInstructionlist.push_back(*it);
        }
    }

    return compactInstructionlist;
}

The way VMR is used in the virtualized code is fairly convenient. VMR is essentially used to calculate pointer addresses. For example, it only ever appears in a similar form to:

mov VMR, 0
add VMR, ecx
shl VMR, 2
add VMR, 15
mov eax, [VMR]

It always starts with operations on VMR and ends with VMR being dereferenced. This means that we can essentially replace all of those instructions with:

mov eax, [ecx * 2 + 15]

So eliminateVmr will look for a pattern where the destination operand is VMR and then some operations on VMR followed by a dereference on VMR. Everything between that pattern can always be simplified using the same algorithm. You can see the specifics of that algorithm in eliminateVmrFromSubset:

DecodedVmInstruction eliminateVmrFromSubset(vector<DecodedVmInstruction>::iterator start, vector<DecodedVmInstruction>::iterator end) {
    bool baseReg2Used = false;
    bool baseReg1Used = false;
    char baseReg1Buffer[10];
    char baseReg2Buffer[10];
    uint32_t multiplierReg1 = 1;
    uint32_t multiplierReg2 = 1;

    uint32_t offset = 0;

    for(auto it = start; it != end; it++) {
        char* dereferencePointer = 0;

        if(!strncmp(it->disassembled, "mov VMR, 0x", 11)) {
            offset = strtoul(&it->disassembled[11], NULL, 16);
            baseReg1Used = false;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
        } else if(!strncmp(it->disassembled, "mov VMR, ", 9)) {
            baseReg1Used = true;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
            offset = 0;
            strcpy(baseReg1Buffer, &it->disassembled[9]);
        } else if(!strncmp(it->disassembled, "add VMR, 0x", 11)) {
            offset += strtoul(&it->disassembled[11], NULL, 16);
        } else if(!strncmp(it->disassembled, "add VMR, ", 9)) {
            if(baseReg1Used) {
                baseReg2Used = true;
                strcpy(baseReg2Buffer, &it->disassembled[9]);
            } else {
                baseReg1Used = true;
                strcpy(baseReg1Buffer, &it->disassembled[9]);    
            }
        } else if(!strncmp(it->disassembled, "shl VMR, 0x", 11)) {
            uint32_t shift = strtoul(&it->disassembled[11], NULL, 16);
            offset = offset << shift;
            if(baseReg1Used) {
                multiplierReg1 = multiplierReg1 << shift;
            }
            if(baseReg2Used) {
                multiplierReg2 = multiplierReg2 << shift;
            }
        }
    }

    auto lastInstruction = end - 1;
    string reconstructInstr(lastInstruction->disassembled);
    stringstream reconstructed;

    reconstructed << "[";

    if(baseReg1Used) {
        if(multiplierReg1 != 1)
            reconstructed << "0x" << hex << multiplierReg1 << " * ";

        reconstructed << baseReg1Buffer;
    }

    if(baseReg2Used) {
        reconstructed << " + ";
        if(multiplierReg2 != 1)
            reconstructed << "0x" << hex << multiplierReg2 << " * ";

        reconstructed << baseReg2Buffer;
    }

    if(offset != 0 || !(baseReg1Used))
        reconstructed <<  " + 0x" << hex << offset;

    reconstructed << "]";

    reconstructInstr.replace(reconstructInstr.find("[VMR]"), 5, reconstructed.str());

    DecodedVmInstruction result;

    result.isDecoded = true;
    result.address = start->address;
    result.size = 0;
    result.type = lastInstruction->type;
    strcpy(result.disassembled, reconstructInstr.c_str());

    return result;
}

Once VMR is eliminated, all that is left to do is print out the disassembly, which we can see being done here in disassembleStub:

...
    for(auto& i : eliminateVmr(instructions)) {
        formatInstructionInfo(i);
    }
...

Generating a Jump Map with Angr

As I mentioned earlier, when it comes to the x86virt bytecode conditional / unconditional jump instruction handler, we need to extract the jump mappings (that is, which value in the jump type operand matches which type of jump). The way the x86virt jump handler works is that it takes the first operand (which is the jump type) and passes it into a somewhat randomly generated subroutine. This subroutine returns true if the EFLAGS are in a condition that permits jumping, or false otherwise.

Because this subroutine is not static and is a bit different for every VM Layer or virtualized target, we need some way of extracting these mappings out of that randomly generated subroutine. If you are not familiar with Angr or symbolic execution, I suggest you read a tiny bit on it before reading this section because the learning curve can be a bit steep.

The jump maps are extracted by running an Angr simulation on the jump decoder that was extracted from the protected application. The simulation is done in x86devirt_jmp.py and the dump of the decoder is provided by x86devirt.py (which we will get into later).

The way this was performed was first by creating a table of x86 jump types with EFLAG values that permit a jump to be taken for that jump type and EFLAG values that do not permit a jump to be taken. Additionally, all jump types in the table were prioritized.

Jump type priority worked by giving x86 jump types that test less flags a lower priority than jump types that test more flags. The reason jump types need to be prioritized is because there is overlap in the conditions that need to be checked for different jumps. An example of this is the JZ jump (ZF = 0) and the JA (ZF = 0 and CF = 0) jump. Essentially, if a set of x candidate x86 jump types can be mapped to a particular jump type y, then y should be mapped to the highest priority jump type in the set of x.

Below we see the list of possible jumps:

possibleJmps = [
    {
        "name": "jz",
        "must": [0x40],
        "not": [0x1, 0],
        "priority": 1
    },
    {
        "name": "jo",
        "must": [0x800],
        "not": [0],
        "priority": 1
    },
    ...
    ...

Below is the code responsible for mapping which emulated states permit jumping and which states do not, for all jump types (0-15):

def getJmpStatesMap(proj):
    statesMap = {}

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDA, avoid=0xDE, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        if(not statesMap.has_key(val)):
            statesMap[val] = {"must": [], "not": []}

        statesMap[val]["must"].append(state)

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDE, avoid=0xDA, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        statesMap[val]["not"].append(state)

    return statesMap

The method essentially performs the following:

  1. Iterate through all states that reach a positive/negative return (jump allowed or not allowed to be taken)
  2. Resolve the constraint on the jump type (Jump type is stored in EDX)
  3. Append that state to either the "must" set, which is states that reached a positive return permitting the jump to be taken (offset 0xDA in the dumped jump decoder code) or "not" (offset 0xDE in the dumped jump decoder code)

After we know which states permit jumping or restrict jumping for each jump type, we can begin testing the constraints on the EFLAGS register that allow for arriving at those states to determine which kind of x86 jump it maps to:

def decodeJumps(inputFile):
    proj = angr.Project(inputFile, main_opts={'backend': 'blob', 'custom_arch': 'i386'}, auto_load_libs=False)

    stateMap = getJmpStatesMap(proj)
    jumpMappings = {}
    for key, val in stateMap.iteritems():

        for jmp in possibleJmps:
            satisfiedMustsRemaining = len(jmp["must"])
            satisfiedNotsRemaining = len(jmp["not"])

            for state in val["must"]:
                for con in jmp["must"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedMustsRemaining -= 1;

            for state in val["not"]:
                for con in jmp["not"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedNotsRemaining -= 1;

            if(satisfiedMustsRemaining <= 0 and satisfiedNotsRemaining <= 0):
                if(not jumpMappings.has_key(key)):
                    jumpMappings[key] = []

                jumpMappings[key].append(jmp)

    finalMap = {}
    for key, val in jumpMappings.iteritems():
        maxPriority = 0;
        jmpName = "NOE FOUND"
        for j in val:
            if(j["priority"] > maxPriority):
                maxPriority = j["priority"]
                jmpName = j["name"]
        finalMap[jmpName] = key
        print("Mapped " + str(key) + " to " + jmpName)

    proj.terminate_execution()
    return finalMap

For each possible x86virt jump type, we test against each candidate x86 jump type to see if the x86 jump's "not" and "must" sets can be satisfied accordingly by the restrictions on the EFLAGS registers in each state. If all "not"s and "must"s are satisfied, then the x86 jump is added as a possible candidate for that jump type.

Later, we iterate through all candidate x86 jumps for each jump type and choose the one with the highest priority to be mapped to it.

Finding the Signatures in the Protected Binary & Dumping Required Data

Finally, on to the last module needed to devirtualize code. x86devirt.py is the x64dbgpy Python plugin that instructs x64dbg on how to devirtualize the target.

x86devirt uses YARA rules to locate sections of code in the protected application, including the VM Stub, the instruction handlers, etc... You can see these YARA rules in VmStub.yara, VmRef.yara and instructions.yara:

  1. VmStub.yara is the YARA signature of the Virtual Machine interpreter.
  2. VmRef.yara is the YARA signature to detect where the application passes control off to the interpreter to begin interpreting a section of x86virt bytecode.
  3. instructions.yara is a set of YARA signatures for the different x86 instruction handlers.

An important note with these signatures is that they must match the original VM Stub and the devirtualized code generated by x86devirt. For example, consider a target that has been virtualized using two VM layers. After the first layer has been devirtualized, there will be a new VM stub in plain x86 form (that is, the second layer of virtualization). These signatures need to detect that second layer. However, the second layer was assembled using a different environment than the first layer and thus we need to take care for some special x86 instruction encodings in our signature (some x86 instructions have more than one way of being encoded). For example, consider:

add eax, ebx ; Encoded as 03C3
add eax, ebx ; Encoded as 01D8

So, when developing our signatures, we need to keep in mind that NASM could choose either encoding. With YARA, I just masked out instructions like these. If you are ever developing a signature with these constraints, please look into this more. There are plenty of resources on this topic: https://www.strchr.com/machine_code_redundancy

When it comes to devirtualizing x86virt, we must perform the following:

  1. Locate all VM Stubs that are present in plain x86 form using the vmStub.yara YARA signature
  2. Extract the decryption routine from the VM Stub
  3. Locate all references to that VM Stub (i.e, all areas where the VM is invoked to begin virtualizing code)
  4. Through each reference, extract the address where the virtualized bytecode is, and where the original code was ripped from
  5. Emulate part of the VM Stub to locate all instruction handlers and their opcodes
  6. Apply the YARA signatures to identify which handler (and subsequently, opcode) maps to which instruction behaviour and produce the instruction / opcode mappings
  7. Locate the JXX instruction handler and dump the part of the handler responsible for testing whether, given the jump type and state of the EFLAGS, the jump is taken. This is passed to x86devirt_jmp.py to extract the jump mappings
  8. For each reference, dump the virtualized code around that references and, with the jmp mappings, instruction mappings and the decryption routine, feed it to the x86virt-disassembler to be disassembled
  9. Finally, run NASM on the disassembler output to produce a x86 binary blob that can be written into where the virtualized code was ripped from, therefore restoring the virtualized function to its original form
  10. Loop back and search for any newly unveiled VM stubs

This process, once completed, will leave the x64dbg debugger in a state that allows the application to be cleanly dumped without any need for the VM stub.

Conclusion

Thanks for reading my blog post on x86devirt. I welcome any questions or criticism, please contact me via my email on GitHub (https://github.com/JeremyWildsmith)