There are many different types of code optimization when it comes to assembly or assembler code.
The most popular, of course, is speed optimization, which aims for the fastest possible code, often using MMX, SSE, or AVX instructions to process as much data as possible at once.
But there is one particular area of assembly programming that focuses on size optimization. I have used this knowledge in many of my software reverse engineering projects, either to modify compiled binaries where only a limited amount of space is available for the modified code, or to develop shellcode for 0-day exploits, where the size of the shellcode is again limited.
Programmers who write in assembly tend to think that their code is already optimized to the maximum (after all, it's assembly!), but as I've found out, there are many tricks that can be used to achieve even better results in terms of minimizing the code size.
mov eax,0 ; 5 bytes -> B8 00 00 00 00
xor eax,eax ; 2 bytes -> 33 C0
sub eax,eax ; 2 bytes -> 2B C0
and eax,0 ; 3 bytes -> 83 E0 00
As it turns out, even the simplest operation can take up to 5 bytes, but if we use the xor instruction instead, the same operation takes only 2 bytes in the resulting program code. The value 0 is often used as a default parameter for WinAPI functions.
push offset szSansSerif ; lpFace ; 5 bytes
push 0 ; pitch and family ; 2 bytes
push 0 ; output quality ; 2 bytes
push 0 ; clipping precision ; 2 bytes
push 0 ; output precision ; 2 bytes
push 1 ; char set identifier ; 2 bytes
push 0 ; strikeout attribute flag ; 2 bytes
push 1 ; underline attribute flag ; 2 bytes
push 0 ; italic attribute flag ; 2 bytes
push 400 ; font weight(normal) ; 5 bytes
push 0 ; base-line orientation angle ; 2 bytes
push 0 ; angle of escapement ; 2 bytes
push 0 ; logical average character ; 2 bytes
push 0Dh ; logical height of font ; 2 bytes
call CreateFontA
The instructions needed to push the parameters for the CreateFontA call take 34 bytes in total in this case.
sub eax,eax ; 2 bytes
push offset szSansSerif ; lpFace ; 5 bytes
push eax ; pitch and family ; 1 byte
push eax ; output quality ; 1 byte
push eax ; clipping precision ; 1 byte
push eax ; output precision ; 1 byte
push 1 ; char set identifier ; 2 bytes
push eax ; strikeout attribute flag ; 1 byte
push 1 ; underline attribute flag ; 2 bytes
push eax ; italic attribute flag ; 1 byte
push 400 ; font weight(normal) ; 5 bytes
push eax ; base-line orientation angle ; 1 byte
push eax ; angle of escapement ; 1 byte
push eax ; logical average character ; 1 byte
push 0Dh ; logical height of font ; 2 bytes
call CreateFontA
This time it's 27 bytes. It's a small gain compared to the previous version, but sometimes these few bytes can be useful for something else.
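The two totals can be double-checked with a little arithmetic. The Python sketch below is only a size model, not an assembler. It assumes 32-bit code, where pushing a register takes 1 byte, pushing an immediate that fits in a signed byte takes 2 bytes, and pushing any other immediate or an address takes 5 bytes. The helper name push_size and the argument lists are made up for this illustration.

```python
def push_size(operand):
    """Encoded size in bytes of a 32-bit `push` (simplified model)."""
    if operand == "eax":              # push r32          -> 1 byte
        return 1
    if isinstance(operand, int) and -128 <= operand <= 127:
        return 2                      # push imm8         -> 6A ib
    return 5                          # push imm32/offset -> 68 id

# CreateFontA arguments in push order; "offset" stands for the string address
plain = ["offset", 0, 0, 0, 0, 1, 0, 1, 0, 400, 0, 0, 0, 0x0D]
# Same call, with every 0 replaced by the zeroed eax register
with_eax = ["eax" if arg == 0 else arg for arg in plain]

plain_total = sum(push_size(arg) for arg in plain)
eax_total = 2 + sum(push_size(arg) for arg in with_eax)  # +2 for sub eax,eax

print(plain_total, eax_total)   # 34 27
```

The call instruction itself is left out of both totals, as in the listings above.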
If we need to pass the same value for several parameters in a row, it's usually done like this:
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
================================
= 14 bytes
Or more size optimized, like this:
sub eax,eax ; 2 bytes
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
===============================
= 9 bytes
But it can be further size optimized using a simple loop:
sub eax,eax ; 2 bytes
push 7 ; 2 bytes
pop ecx ; 1 byte
@save_args:
push eax ; 1 byte
loop @save_args ; 2 bytes
================================
= 8 bytes
I haven't seen this type of size optimization in either GCC or LLVM generated code (even with size optimizations enabled), so it's a trick strictly reserved for hand-optimized assembly code.
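To see from which argument count the loop starts to pay off, the three variants can be compared with a small size model. This is a rough Python sketch using only the byte counts from the listings above; the function names are invented for the illustration, and the loop size is assumed constant as long as the count fits in a signed byte.

```python
def size_push_imm(n):
    return 2 * n            # n x "push 0" (2 bytes each)

def size_push_reg(n):
    return 2 + n            # "sub eax,eax" + n x "push eax" (1 byte each)

def size_push_loop(n):
    # sub eax,eax (2) + push n (2) + pop ecx (1) + push eax (1) + loop (2);
    # constant as long as n fits in a signed byte (n <= 127)
    return 8

for n in range(1, 10):
    print(n, size_push_imm(n), size_push_reg(n), size_push_loop(n))

# Smallest argument count at which the loop is strictly the shortest variant
break_even = next(n for n in range(1, 128)
                  if size_push_loop(n) < min(size_push_imm(n), size_push_reg(n)))
```

Under these assumptions the loop only wins from seven identical arguments upward. It also leaves ecx at 0 when it finishes, so it is only an option when ecx is free to use.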
If we intend to zero the edx register, we normally do so with e.g. xor edx,edx, but it can be done in fewer bytes with the cdq instruction (it stands for Convert Doubleword to Quadword).
The cdq instruction fills the edx register with the sign bit of the eax register (the sign bit is the most significant bit of the register value, so in this case it's bit 31, counting from 0).
So if we know that in
eax we have e.g.
1, then execution of the
cdq instruction will cause
edx to be reset to zero.
If you are not sure about the content of the eax register (for example, after a function call), you shouldn't use it, because it can lead to errors:
eax = 80000001h = 10000000000000000000000000000001b
                  ^ most significant bit of the EAX register is set to 1
Executing cdq here will fill edx with the sign bit of eax, which is 1, so edx will end up holding 0FFFFFFFFh instead of 0.
The cdq instruction takes only one byte.
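The behaviour described above can be modelled in a few lines. The following Python function is only a sketch of what cdq does to edx (it is not how the CPU implements it), and the name cdq is reused here purely for readability.

```python
def cdq(eax):
    """Return the value cdq would leave in edx for a given 32-bit eax."""
    sign_bit = (eax >> 31) & 1
    return 0xFFFFFFFF if sign_bit else 0

print(hex(cdq(1)))            # 0x0
print(hex(cdq(0x7FFFFFFF)))   # 0x0
print(hex(cdq(0x80000001)))   # 0xffffffff
```

In other words, the trick is safe for any eax value below 80000000h, not just for 1.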
mov eax,7Fh ; 5 bytes -> B8 7F 00 00 00

sub eax,eax ; 4 bytes -> 2B C0
mov al,7Fh  ;            B0 7F

push 7Fh    ; 3 bytes -> 6A 7F
pop eax     ;            58
It is often necessary to load a value from the 0-255 range into a 32-bit register. We can do it like this:
mov eax,4 ; B8 04 00 00 00
This instruction takes 5 bytes, because the value 4 is encoded as a full 32-bit immediate that needs 4 bytes. The most size-efficient solution is to push this value onto the stack and pop it back into the CPU register:
push 4 ; 6A 04
pop eax ; 58
This time it takes only 3 bytes. Even though it takes up more space in the source code, it takes up fewer bytes on disk!
It should be mentioned that the assembler will emit the short form of the push instruction only if the value fits in a signed byte, which for positive values means the range 0-127.
If you want to use the short form of the push instruction for byte values in the range 128-255 as well, you need to emit the opcode bytes yourself, for example with a helper macro:
pushb macro byteval
db 06Ah,byteval
endm

pushb 080h ; push the byte value 128 (80h)
pop eax
After these instructions are completed, eax will hold a value of 0FFFFFF80h (-80h). But why not 80h?
Values in the range 128-255 in the short form of the push instruction are treated as negative numbers (that is, they are sign-extended).
The sign bit from the short encoded integer value is then copied to the upper bits of the CPU register:
00000000 00000000 00000000 10000000 = 00000080h
                           ^ integer sign bit
11111111 11111111 11111111 10000000 = FFFFFF80h
^ signed integer
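The sign extension can also be expressed as a short calculation. The Python function below is a minimal model of what ends up in a 32-bit register after the short push followed by a pop; the function name is hypothetical and chosen only for this example.

```python
def push_imm8_then_pop(byteval):
    """Value a 32-bit register holds after `push imm8` / `pop reg`."""
    assert 0 <= byteval <= 0xFF
    if byteval & 0x80:                      # sign bit of the byte is set
        return 0xFFFFFF00 | byteval         # upper 24 bits are filled with 1s
    return byteval                          # upper 24 bits stay 0

print(hex(push_imm8_then_pop(0x7F)))   # 0x7f
print(hex(push_imm8_then_pop(0x80)))   # 0xffffff80
```

So the short push followed by pop only reproduces the byte unchanged for values up to 7Fh.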
There is another trick to make the code a little shorter when you want to load a value in the range 128-255 into a full 32-bit register:
mov eax,255 ; 5 bytes
Size optimized way:
xor eax,eax ; 2 bytes
mov al,255 ; 2 bytes
This is another of the tricks often overlooked by HLL compilers.
Functions by definition return some values. In the case of WinAPI functions, the returned value is always stored in the eax register.
Depending on the function, the returned values differ: it could be 0, -1, a file handle, etc.
The CreateFileA function returns -1 in the eax register when we don't have access to the file we just wanted to open.
But another WinAPI function like CreateIcon returns 0 in eax if there is an error.
We can use those values to our advantage, once we have checked the MSDN documentation:
push ...
call LoadBitmapA
The documentation for LoadBitmapA says the function returns the handle to the bitmap on success and 0 on error.
push ..
call LoadBitmapA
cmp eax,0 ; 83 F8 00
jz @error
The cmp eax,0 instruction takes 3 bytes. Can't we do it better? Of course we can, by using logical operations like or and test:
call LoadBitmapA
or eax,eax ; 0B C0
jz @error

call LoadBitmapA
test eax,eax ; 85 C0
jz @error
Both the or and test instructions set the CPU zero flag if the eax register holds 0, which gives us the same result as the cmp eax,0 instruction but with 1 byte less in the output code.
We can optimize it even further by using the jecxz instruction:
call LoadBitmapA
xchg eax,ecx ; 1 byte (the returned value now lives in ecx)
jecxz @error ; jecxz instruction takes 2 bytes (the same as jxx short range branches)
The jecxz instruction jumps to the provided label if the ecx register is set to 0.
But there is a catch! The instruction is a short conditional branch only: the destination must lie within -128 to +127 bytes of the end of the instruction in the compiled code.
So if your destination, in our case the @error label, is further away than that, you will get an error message from the assembler.
Some assemblers, like the old school TASM, will automatically translate a jecxz whose destination is out of range to:
call LoadBitmapA
xchg eax,ecx
jecxz @dummy
jmp @next
@dummy:
jmp @error
@next:
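Whether a destination is reachable by a short branch such as jecxz comes down to whether the distance fits in a signed byte. The Python sketch below shows that check; it assumes a 2-byte branch instruction and counts the displacement from the end of that instruction. The function name and the addresses are invented for the example.

```python
def short_jump_fits(instr_addr, target_addr, instr_len=2):
    """True if target is reachable with an 8-bit relative displacement.

    The displacement is counted from the end of the jump instruction.
    """
    disp = target_addr - (instr_addr + instr_len)
    return -128 <= disp <= 127

print(short_jump_fits(0x1000, 0x1000 + 2 + 127))   # True
print(short_jump_fits(0x1000, 0x1000 + 2 + 128))   # False
```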
Many WinAPI functions return -1 (0FFFFFFFFh) on error. How can we check it? The simplest way is of course:
call CreateFileA
cmp eax,-1 ; 83 F8 FF
je @error
We can get the same result using much more size optimized code:
call CreateFileA
inc eax ; if -1 was returned, the inc instruction will set the EAX register to 0
je @error ; and we can detect it with a conditional JE/JZ instruction
dec eax ; if there wasn't an error, restore the originally returned value
In this case, the resulting code will be 1 byte smaller than the one using cmp eax,-1.
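The reason this works is the 32-bit wrap-around of inc. As a rough illustration, the Python function below models the three instructions on a 32-bit value; the function name and the "error"/"ok" return labels are made up for this sketch.

```python
MASK32 = 0xFFFFFFFF

def check_minus_one(eax):
    """Model of inc eax / je @error / dec eax on a 32-bit register."""
    eax = (eax + 1) & MASK32      # inc eax: 0FFFFFFFFh wraps around to 0
    if eax == 0:                  # je @error: taken when the zero flag is set
        return "error", eax
    eax = (eax - 1) & MASK32      # dec eax: restore the returned value
    return "ok", eax

print(check_minus_one(0xFFFFFFFF))   # ('error', 0)
print(check_minus_one(0x1234))       # ('ok', 4660)
```

One thing the model makes visible: on the error path eax holds 0 rather than -1, which matters only if the code at the error label still looks at eax.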
Say you have a value of
4 stored in the
eax register and a value of
98 stored in
edx register. How do we exchange those two registers?
We can do it like this:
push eax
push edx
pop eax
pop edx
This takes 4 bytes. We can use a temporary register like this:
mov ebx,eax
mov eax,edx
mov edx,ebx
But this one is even bigger with 6 bytes.
Or we can use this clever trick using the logical xor instruction:
xor edx,eax
xor eax,edx
xor edx,eax
Still 6 bytes in output code. But there is one overlooked instruction, not used by HLL compilers anymore:
xchg (from eXCHange). Its size is just 1 byte in output code and it does just what we need:
xchg eax,edx ; 92h
This is 1 byte in size, but:
xchg edx,esi ; 87h 0D6h
The xchg instruction takes only 1 byte in output code, but only if one of the exchanged registers is eax. Otherwise it's encoded as 2 bytes.
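For readers wondering why the three-xor sequence shown earlier swaps the registers at all, here is a small Python mirror of it. It is only an illustration of the arithmetic, with the register names reused as plain variables.

```python
def xor_swap(eax, edx):
    """Mirror of: xor edx,eax / xor eax,edx / xor edx,eax."""
    edx ^= eax      # edx = edx ^ eax
    eax ^= edx      # eax = eax ^ (edx ^ eax) = original edx
    edx ^= eax      # edx = (edx ^ eax) ^ original edx = original eax
    return eax, edx

print(xor_swap(4, 98))   # (98, 4)
```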
You will learn that many other instructions are smaller if you use the
eax register e.g.:
add edi,400000h ; 6 bytes -> 81 C7 00 00 40 00
add eax,400000h ; 5 bytes -> 05 00 00 40 00
So it's the same add instruction, but with a 32-bit immediate operand the output code is 1 byte smaller if eax is used. Keep that in mind.
There is a separate set of string instructions in x86 CPUs. They operate on the esi and edi registers only.
Some of those instructions are rarely used by modern compilers, but they have one advantage to us - the size of the output code.
Let's look at this example. We have a simple loop and after each iteration, we increase the value of the esi pointer by 4:
_loop_label:
...
...
...
add esi,4
loop _loop_label
Easy & simple. But the:
add esi,4 ; 83 C6 04
instruction takes 3 bytes. We can use the string instruction lodsd instead to make our code shorter, since it advances esi in the same way:
lodsd ; AD = add esi,4
lodsw ; 66 AD = add esi,2
lodsb ; AC = add esi,1
There are 3 variants of this instruction, operating on 32 bit, 16 bit and 8 bit values:
lodsd ; mov eax,dword ptr[esi] ; add esi,4
lodsw ; mov ax,word ptr[esi] ; add esi,2
lodsb ; mov al,byte ptr[esi] ; inc esi
So the optimized loop could look like this:
_loop_label:
...
...
...
lodsd ; mov eax,dword ptr[esi] ; add esi,4
loop _loop_label
So we can use it as a short version of the add esi,4 instruction. Just keep in mind that it reads memory through the pointer in the esi register (so esi cannot hold just any value, it must be a pointer to some readable data) and it writes to eax.
If you need to preserve the value of the eax register you can do it like this (although at 3 bytes in total this is no longer smaller than add esi,4):
_loop_label:
...
push eax
lodsd
pop eax
loop _loop_label
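The side effect on eax is easy to overlook, so here is a tiny Python model of lodsd with the direction flag clear. It is a sketch only: the buffer contents are made up, and esi is treated as an offset into that buffer rather than a real address.

```python
import struct

def lodsd(memory, esi):
    """Model of lodsd with DF=0: returns (new eax, new esi)."""
    (eax,) = struct.unpack_from("<I", memory, esi)   # mov eax,dword ptr[esi]
    return eax, esi + 4                              # add esi,4

# Hypothetical buffer holding three little-endian dwords
memory = struct.pack("<3I", 0x11111111, 0x22222222, 0x33333333)

esi = 0
eax, esi = lodsd(memory, esi)
print(hex(eax), esi)   # 0x11111111 4
eax, esi = lodsd(memory, esi)
print(hex(eax), esi)   # 0x22222222 8
```

Each call advances esi by 4 as intended, but it also replaces whatever was in eax with the dword that was read.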
There is also a scasX instruction. It compares the value in the eax register with the value pointed to by the edi register and then increases (if the direction flag DF is 0, which is what the cld instruction sets) or decreases (if DF is 1, which is what the std instruction sets) the value of the edi register. It also comes in 3 variants for 32-bit, 16-bit and 8-bit comparisons. In order to use it, you need to make sure the edi register points to a valid data buffer, so again it cannot be any number or value you want, because it will end with an exception (access violation) if you try that.
So if one of the registers you want to increase is edi, instead of this:
add edi,4 ; 83 C7 04
it's better to use:
scasd ; AF
scasw ; 66 AF
scasb ; AE
and it works like this:
scasd ; cmp eax,dword ptr[edi] ; add edi,4
scasw ; cmp ax,word ptr[edi] ; add edi,2
scasb ; cmp al,byte ptr[edi] ; inc edi
The CPU direction flag decides if the value of the
edi register is increased or decreased:
std ; set DF (Direction Flag), 1 byte
scasd ; cmp eax,dword ptr[edi] ; sub edi,4
Keep in mind that the direction flag (DF) is always clear when the application starts, at least for Windows PE executables, and it's also expected to be clear whenever a WinAPI function is called.
So if you ever set it with
std instruction, make sure to reset it back afterward with
cld otherwise you might end up with hard to find bugs related to this issue in other applications or OS components.
std ; set DF (Direction Flag), 1 byte
lodsd ; mov eax,dword ptr[esi] ; sub esi,4
...
...
cld ; restore DF to its expected default state
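How the direction flag changes the pointer update can be shown with a small model of scasd. The Python sketch below is an illustration under simplifying assumptions: only the zero flag is modelled, edi is an offset into a made-up buffer, and the df parameter stands in for the state that std and cld would set.

```python
import struct

def scasd(memory, edi, eax, df=0):
    """Model of scasd: returns (zero_flag, new edi)."""
    (value,) = struct.unpack_from("<I", memory, edi)
    zero_flag = (eax == value)            # ZF is set when eax equals [edi]
    step = -4 if df else 4                # DF=1 (std) walks backwards
    return zero_flag, edi + step

memory = struct.pack("<3I", 10, 20, 30)   # hypothetical buffer of three dwords

print(scasd(memory, 4, 20))          # (True, 8)
print(scasd(memory, 4, 99, df=1))    # (False, 0)
```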
It may seem that all this size optimization doesn't make sense nowadays, but it can come in handy if, for example, you write shellcode or need to patch compiled code using as few instructions as possible in a very modest amount of space.
If you want to learn more, you can read my free articles about programming (assembler, C/C++), malware analysis, and reverse engineering.