DEV Community


Debugging ARM Assembly With LLDB Part 1. Setting Break Points

In this tutorial I'm going to teach you how to debug ARM assembly language on macOS with the lldb debugger. Ultimately, my goal with these tutorials is to teach advanced debugging techniques that apply to both offensive and defensive information security. These tutorials are designed for novices with very little programming experience.

Let's get started by writing a simple C program like so:

// hello_world.c

#include <stdio.h>

int main(){
        printf("Hello, World !\n");
        return 1;
}

As I've said before, C is the Latin of programming languages: a verbose language that has influenced innumerable others. It gives the programmer a great deal of control over memory, and because it is a compiled language it runs very fast. Because of these traits, C was the language chosen to write most operating systems and many higher-level programming languages.

C is such a low-level language that in order to print something to STDOUT we must include the header for the Standard Input/Output library. That is exactly what we're doing when we write #include <stdio.h> at the top of our program; it gives us access to the printf function.

Every C program starts running in a main function. Languages like Ruby, Python and JavaScript abstract away the use and declaration of a main function, but if you look closely at the stack traces for those languages you'll see that it is there in the background. Functions run and evaluate to a value, and we must specify the type of that value so that the compiler knows how much memory to set aside for it.

In this short example we return an integer, which is a whole number that takes up 4 bytes of memory on this platform. A byte is a collection of 8 bits. A bit is a single unit of memory that can be in one of two states, either 1 or 0. This means that an integer is actually composed of 32 1's or 0's residing within memory. By convention, a successful run of a program is denoted by returning 0, while an error is denoted by a nonzero value. Our example returns 1, which a shell would read as a failure, but it does make the return value easy to spot in the disassembly later on.

Computers are smart, but they aren't smart enough to read and run a C program directly. C is just a convenient language for humans to write in. A compiler is what transforms our C code into a series of binary instructions that the computer can "understand" and execute.

On macOS the gcc command becomes available once you install the Xcode Command Line Tools, and it is actually a front end for Apple's clang compiler. It can do a lot; if you run gcc --help you'll see all of the different flags that the tool comes with. We're only compiling a relatively simple program, and we can do so with this command:

gcc -g -o hello_world hello_world.c

The -o flag allows us to name our executable; in this case it will be called hello_world. The -g flag tells our compiler to generate debugging symbols, which on macOS end up in a .dSYM directory. If you run ls in the directory where you compiled your program you'll notice that we now have a hello_world executable and a hello_world.dSYM directory. The -o flag is optional. If we chose to omit it our binary would be given the default name, a.out, and correspondingly an a.out.dSYM directory would be created.

Let's run our executable by typing ./hello_world into the terminal.

You didn't come here just to write "Hello, World!" programs. It's time to dive deep into the magical world of computation by debugging our executable with lldb, the native debugger on macOS. The debugger will bring us into a shell-like environment where we can run, pause and modify our program in real time. You can start your debugger with lldb hello_world.

You're now in the debugger's shell. If you ever get lost you can run help to see a list of commands. If you want to know more about a specific command, say the breakpoint command, you would type help breakpoint. Lastly, each command has sub-commands, and if you wanted to learn more about the breakpoint set command you'd type help breakpoint set.

Now that we're in our debugger shell we can view our original C code by running list main, which should produce the following:

File: /Users/corery/c_projects/hello_world.c
   1    #include <stdio.h>
   3    int main(){
   4        printf("Hello, World !\n");
   5        return 1;
   6    }

Let's pause our program right before the printf function executes by setting a breakpoint on line 4.

(lldb) breakpoint set -l 4
Breakpoint 1: where = hello_world`main + 24 at hello_world.c:4:2, address = 0x0000000100003f88

Line 4 of our source corresponds to the instruction at address 0x0000000100003f88 in the program's memory. This number is written in hexadecimal, which is a base 16 numeral system, as opposed to the base 10 system that you're used to. The digits of base 10 and base 16 line up as shown below:

base 10 | base 16
    0   |   0
    1   |   1
    2   |   2
    3   |   3
    4   |   4 
    5   |   5
    6   |   6
    7   |   7
    8   |   8
    9   |   9
    10  |   A
    11  |   B
    12  |   C
    13  |   D
    14  |   E
    15  |   F


Typically when debugging you don't need to convert hexadecimal numbers to base 10 to understand what's going on; that would be way too much work. You just need to be able to understand where different memory addresses are in relation to one another, that is to say, which addresses are higher and which are lower.

Now that we have some understanding of base 16 and we've set a breakpoint, it's time to run our program with the aptly named run command. Notice how the program stops just before the printf call. You should see the following:

(lldb) run
Process 3181 launched: '/Users/corery/c_projects/hello_world' (arm64)
Process 3181 stopped
* thread #1, queue = '', stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f88 hello_world`main at hello_world.c:4:2
   1    #include <stdio.h>
   3    int main(){
-> 4        printf("Hello, World !\n");
   5        return 1;
   6    }
Target 0: (hello_world) stopped.


Our debugger conveniently shows both the memory address of the paused program's code and the actual C line we're stopped at. Executables like the one we compiled are divided into 4 memory segments: the code, data, stack, and heap. Where each one sits in the address space depends on the operating system. We'll dive into each of these segments in more detail later. For now, all you need to understand is that the code segment stores the actual machine instructions for the executable, and as you can see our program is paused at the instruction located at 0x0000000100003f88.

As our program takes a break from running we can finally dive into the magical world of ARM assembly. Let's disassemble our program.

(lldb) disass
    0x100003f70 <+0>:  sub    sp, sp, #0x20
    0x100003f74 <+4>:  stp    x29, x30, [sp, #0x10]
    0x100003f78 <+8>:  add    x29, sp, #0x10
    0x100003f7c <+12>: stur   wzr, [x29, #-0x4]
    0x100003f80 <+16>: adrp   x0, 0
    0x100003f84 <+20>: add    x0, x0, #0xfa8         ; "Hello, World !\n"
->  0x100003f88 <+24>: bl     0x100003f9c            ; symbol stub for: printf
    0x100003f8c <+28>: mov    w0, #0x1
    0x100003f90 <+32>: ldp    x29, x30, [sp, #0x10]
    0x100003f94 <+36>: add    sp, sp, #0x20
    0x100003f98 <+40>: ret    

That's a lot to take in. As you can see, our 2 line C function produced 11 lines of assembly; imagine how much assembly a 100 line C program would produce. I used to think of C as a low-level programming language, mostly because I was used to programming with Ruby, but after realizing how verbose and complex assembly language is I came to appreciate the level of abstraction and utility of C.

You came here to understand ARM assembly, so let's break down the first line of code.

0x100003f70 <+0>: sub sp, sp, #0x20

The 0x100003f70 all the way to the left is the memory address of the instruction. The <+0> next to it is the instruction's offset in bytes from the start of main. The instruction itself is all the way to the right, i.e. sub sp, sp, #0x20. Like any language, ARM assembly has a grammar or syntax that its instructions must meet. In this case the format is <operation> <destination>, <source1>, <source2>, so this instruction subtracts 0x20 from sp and stores the result back in sp. Note that not all instructions make use of the second source operand, which is referred to as the flexible second operand. Flexible operands are a distinguishing feature between x86_64 and ARM assembly.

Before diving into what every line of this assembly means we need to understand two concepts: registers and operations. A processor register is a hardware variable; it is where our CPU holds the data it is working on as the program executes. Registers live inside the CPU itself rather than in RAM, so their number is fixed by the processor's design and does not grow when you add more memory. We'll go over registers in the next part of this series.
