Aitor

How to create raw bytecode in Python

Introduction

Well, I first started writing this on github.io, but then a group of virtual gangsters beat me up and made me realize there was no point in using that when this fantastic website exists. So here I am. As the title says, this is a tutorial on raw Python bytecode. I hope you enjoy it (because there is a second part...).

Prerequisites

  • Basic knowledge of Python
  • Know what a bytes object is
  • Know the concept of a stack

What is Python?

Python is a multi-paradigm interpreted programming language. It supports polymorphism, object-oriented programming (OOP) and imperative programming.

How does it work?


Python, as already mentioned, is an interpreted language: your code passes through an interpreter that sits between what you write and what the computer does. Python does not generate machine code the way a C or C++ program does. Instead it works more or less like Java: the source is first translated into bytecode, and a virtual machine interprets that bytecode. The default interpreter is CPython, which is responsible for executing the bytecode on your computer. Here we are not going to use compilers; we are going to deal with language implementations, basically interpreters that interpret (forgive the redundancy) the written code after translating it into bytecode. There is a wide variety of these, e.g. IronPython (a C# implementation), Jython (a pure Java implementation) and MicroPython (a C implementation optimized to run on microcontrollers).
Here is a schematic of how Python works and the steps that the interpreter takes to run the code that you wrote.
[Image: schematic of the steps the interpreter takes to run your code]
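
To make the "translate to bytecode, then interpret" step a bit more concrete, here is a minimal sketch using the built-in compile() and exec() functions. The one-line source string and the "<example>" filename are made up for illustration.

```python
# Compile a tiny piece of source into a code object, then let the
# interpreter execute that code object.
source = "result = 2 + 3"  # hypothetical one-line program

code_obj = compile(source, "<example>", "exec")  # source -> code object
print(type(code_obj).__name__)  # the type name of a code object
print(type(code_obj.co_code))   # the raw bytecode is stored as bytes

namespace = {}
exec(code_obj, namespace)       # the virtual machine runs the bytecode
print(namespace["result"])      # 5
```

The co_code attribute printed above is the same kind of raw byte string that the rest of this tutorial builds by hand.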

How to create USABLE bytecode

Well, we have two things: first, stripped bytecode, that is, raw bytes representing opcodes and their arguments; and second, CodeType, a data type in Python that lets us build bytecode that is SUITABLE AND USABLE. Also, to assemble you have to know how to disassemble, so we are going to use the dis module, which disassembles functions, files and code.

import dis

def sum(x, y):
    return x + y
dis.dis(sum)

The output of that snippet of code is as follows

1. 4   0 LOAD_FAST    0 (x)
2.     2 LOAD_FAST    1 (y)
3.     4 BINARY_ADD
4.     6 RETURN_VALUE

As we can see, all of that is bytecode. Now for the explanation.

As you may have noticed, I numbered the lines of the output to make this explanation easier.
Each instruction in Python has a specific opcode (operation code). In this case we use three: LOAD_FAST, BINARY_ADD and RETURN_VALUE. Here is what each one does.

  • LOAD_FAST: Pushes a local variable onto the top of the stack (TOS, Top Of Stack).
  • BINARY_ADD: Pops the two values at the top of the stack, adds them, and pushes the result back onto the top of the stack.
  • RETURN_VALUE: Returns the value that is at TOS to the caller.
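
To see how these three opcodes cooperate on a stack, here is a toy evaluator. It is only a sketch for intuition: the run() helper and the tuple-based instruction format are hypothetical, and this is not how CPython itself is implemented.

```python
def run(instructions, local_values):
    """Toy evaluator (hypothetical helper) for the three opcodes above."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            # Push the local variable stored at index `arg`
            stack.append(local_values[arg])
        elif opname == "BINARY_ADD":
            # Pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            # Hand the top of the stack back to the caller
            return stack.pop()
        else:
            raise ValueError("unknown instruction: " + opname)

# The same sequence that dis printed for sum(x, y)
program = [
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(run(program, [213, 3]))  # 216
```

Note that the stack never holds more than two values here, which is worth remembering when the stack size of a code object comes up.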

Well, now that we've explained the opcodes, we can get an idea of how our code works internally, but there are still doubts, annoying but necessary doubts, like these: "What is the 4 on the left side, the 4 at the beginning of the first line?", "What are the numbers to the left of the opcodes?", "Why does a 0 appear to the right of LOAD_FAST? And the 1?", "Wouldn't we want to load x and y to add them, instead of 0 and 1?".

Well, I will answer in order.

  • The 4 is the line number in the source file of the code these instructions were compiled from (here, return x + y).
  • These numbers are the byte offsets of each instruction within the bytecode.
  • The 0 and the 1 are indexes: the names of the local variables are stored in a tuple (co_varnames), and the argument of the instruction is an index into it. The dis module shows which variable the index refers to in parentheses to the right of the number (hence 0 (x) and 1 (y)).
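
The offsets and indexes can also be read programmatically. The following is a small sketch using dis.get_instructions(); the function name add is chosen just for this example, and the exact instructions printed depend on your Python version.

```python
import dis

def add(x, y):
    return x + y

# The tuple that LOAD_FAST arguments index into
print(add.__code__.co_varnames)

# offset, opcode name, raw argument, and what the argument refers to
instructions = list(dis.get_instructions(add))
for ins in instructions:
    print(ins.offset, ins.opname, ins.arg, ins.argval)
```

On the author's Python 3.7 this should list the same four instructions as the dis.dis() output above; newer interpreters may add or merge instructions.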

How do we re-create our function from bytecode?

Well, the first thing we do is import CodeType and FunctionType (to turn the code object into a function) from the types module (https://docs.python.org/3/library/types.html#module-types).

import dis
from types import CodeType, FunctionType

def sum(x, y):
    return x + y

After this, we are going to create our code object:

import dis
from types import CodeType, FunctionType

def sum(x, y):
    return x + y

# This will be explained later, these are flags
CO_OPTIMIZED = 0x0001
CO_NEWLOCALS = 0x0002
CO_NOFREE = 0x0040

# Note: this positional signature is the one from Python 3.7 (the version I use);
# later versions add parameters to CodeType, so the call must be adapted there
my_code = CodeType(
    2,  # argcount
    0,  # kwonlyargcount
    2,  # nlocals
    2,  # stacksize
    (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE),  # flags
    bytes([124, 0, 124, 1, 23, 0, 83, 0]),  # codestring
    (0,),  # constants
    (),  # names (global and attribute names used by the code)
    ('x', 'y',),  # varnames (local variable names)
    'blog_no_name',  # filename
    'crafted_sum',  # name (code object / function name)
    9,  # firstlineno (first line where this code appears)
    b'',  # lnotab
    (),  # freevars
    (),  # cellvars
)

_sum = FunctionType(my_code, {})
result = _sum(213, 3)
print(result)

# Expected output
# 216
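
Since the byte values in the codestring are tied to a Python version, it can help to compare them with what your own interpreter generates for the same function before reusing them. This is only a sketch; the names add, article_bytes and native_bytes are made up for the example.

```python
import sys

def add(x, y):
    return x + y

# The codestring used in this article (written for Python 3.7)
article_bytes = bytes([124, 0, 124, 1, 23, 0, 83, 0])

# What the running interpreter actually compiled for the same function
native_bytes = add.__code__.co_code

matches = article_bytes == native_bytes
print("Python", sys.version_info[:2], "matches the article bytes:", matches)
if not matches:
    print("native bytecode:", list(native_bytes))
```

If the two differ, the hand-written bytes should not be reused as they are on that interpreter.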

Well, well... Many new things appear. We will explain these arguments right now.

CodeType: argcount, kwonlyargcount, nlocals, stacksize, flags, codestring, constants, names, varnames, filename, name, firstlineno, lnotab, freevars, cellvars

Argument | Description
argcount | Number of positional arguments
kwonlyargcount | Number of keyword-only arguments
nlocals | Number of local variables (in this case 2, x and y)
stacksize | Maximum number of entries (not bytes) that the stack will hold (in this case 2, because x and y both sit on the stack before they are added)
flags | The flags determine some properties of the bytecode; the inspect module documentation lists them. We are going to delve into flags in a more advanced tutorial.
codestring | A bytes object containing the sequence of opcodes and arguments; here 124 means LOAD_FAST, 23 BINARY_ADD and 83 RETURN_VALUE
constants | A tuple with the values of the constants used by the code (such as integers, strings, None, False, True ...)
names | A tuple containing the names used by the bytecode (globals, attributes, built-in functions ...)
varnames | A tuple with the names of the local variables, arguments first
filename | This string represents the name of the file; when there is no real file it can be any string
name | Name of the code object or function
firstlineno | The first line number of the code in its source file; relevant when the code comes from a real file, otherwise it can be any integer
lnotab | A mapping between bytecode offsets and source line numbers; if you are not interested in line information, you can use b''
freevars | Names of the variables that a closure takes from an enclosing scope; I will explain these in an advanced tutorial
cellvars | Names of the local variables that are referenced by nested functions (closures)
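
Most of these arguments can be read back from any existing function through the co_* attributes of its code object, which is a handy way to find sensible values. A minimal sketch, assuming a helper function named add:

```python
import inspect

def add(x, y):
    return x + y

code = add.__code__
print("argcount      :", code.co_argcount)
print("kwonlyargcount:", code.co_kwonlyargcount)
print("nlocals       :", code.co_nlocals)
print("stacksize     :", code.co_stacksize)
print("constants     :", code.co_consts)
print("names         :", code.co_names)
print("varnames      :", code.co_varnames)
print("filename      :", code.co_filename)
print("name          :", code.co_name)
print("firstlineno   :", code.co_firstlineno)

# Individual flags can be tested with a bitwise AND
flags = code.co_flags
print("CO_OPTIMIZED set:", bool(flags & inspect.CO_OPTIMIZED))
print("CO_NEWLOCALS set:", bool(flags & inspect.CO_NEWLOCALS))
```

Some values, such as the constants tuple and the full flags value, vary between Python versions, so treat the printed output as a description of your interpreter only.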

Two last things to note before moving on to FunctionType. The first is that the byte that follows each opcode (e.g. the 0 in [124, 0, ...]) is its argument; it is 0 when the opcode takes no argument. The second is that bytecode can vary from version to version. To find out what the codestring looks like on your version, you can use the following snippet:

def sum(x, y):
    return x + y

print(sum.__code__.co_code)

# Expected output in Python 3.7.9 (the version I use)
# b'|\x00|\x01\x17\x00S\x00'
# Bytes in the printable range are shown as characters, probably to make it more readable. (chr(124) gives the character |)
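
Another way to orient yourself is to ask the dis module for the opcode numbers of the interpreter you are running. This sketch uses dis.opmap (name to number) and dis.opname (number to name); the numbers dictionary is just a name picked for the example.

```python
import dis

# Look up the numbers used in this tutorial on the running interpreter
numbers = {}
for name in ("LOAD_FAST", "BINARY_ADD", "RETURN_VALUE"):
    numbers[name] = dis.opmap.get(name)  # None if the opcode does not exist
    if numbers[name] is None:
        print(name, "does not exist in this Python version")
    else:
        print(name, "=", numbers[name])

# The reverse direction: from a number back to its name
print(dis.opname[dis.opmap["RETURN_VALUE"]])
```

On Python 3.7 this should give 124, 23 and 83, the values used in the codestring above. Newer versions may renumber or drop opcodes, which is why the lookup is written defensively.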

"Crafting" the function

We are going to use FunctionType now.
FunctionType: code, globals, name, argdefs, closure

Argument | Description
code | Code object (that is, a CodeType)
globals | A dictionary containing the globals in the form `{"Name": Value}`; that way, Name becomes an identifier that the code can access as if it were a global variable
name | (Optional) Overrides the name stored in the code object
argdefs | (Optional) A tuple that specifies the values of the default arguments
closure | (Optional) A tuple that supplies the bindings for the freevars
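
Here is a short sketch of the first four arguments in action. To stay independent of the Python version it reuses the code object of a normal function instead of hand-written bytes; the names scale, BASE and crafted_scale are invented for the example.

```python
from types import FunctionType

def scale(x, factor=1):
    # BASE is not defined in this module; it is looked up in the
    # globals dictionary of whichever function object runs this code
    return x * factor * BASE

crafted = FunctionType(
    scale.__code__,    # code
    {"BASE": 10},      # globals: makes BASE visible to the code
    "crafted_scale",   # name: overrides the name from the code object
    (2,),              # argdefs: default value for the last parameter
)

print(crafted.__name__)  # crafted_scale
print(crafted(3))        # 3 * 2 * 10 = 60
print(crafted(3, 5))     # 3 * 5 * 10 = 150
```

The default factor=1 belongs to the original function object, not to its code object, which is why the crafted function uses the value given in argdefs.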

Well, once this is clear, all that is left is to create a FunctionType from our code object (my_code) and call it.

import dis
from types import CodeType, FunctionType

def sum(x, y):
    return x + y

# This I will explain later, they are flags
CO_OPTIMIZED = 0x0001
CO_NEWLOCALS = 0x0002
CO_NOFREE = 0x0040

# Note: this positional signature is the one from Python 3.7 (the version I use);
# later versions add parameters to CodeType, so the call must be adapted there
my_code = CodeType(
    2,  # argcount
    0,  # kwonlyargcount
    2,  # nlocals
    2,  # stacksize
    (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE),  # flags
    bytes([124, 0, 124, 1, 23, 0, 83, 0]),  # codestring
    (0,),  # constants
    (),  # names (global and attribute names used by the code)
    ('x', 'y',),  # varnames (local variable names)
    'blog_no_name',  # filename
    'crafted_sum',  # name (code object / function name)
    9,  # firstlineno (first line where this code appears)
    b'',  # lnotab
    (),  # freevars
    (),  # cellvars
)

_sum = FunctionType(my_code, {})
result = _sum(213, 3)
print(result)

# Expected output
# 216
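
If you are on Python 3.8 or later, where the CodeType constructor takes a different set of arguments, one alternative is the replace() method of code objects, which copies an existing code object and changes only the fields you name. This is a swapped-in technique, not the approach used above, and the names greet, crafted_greet and crafted are made up for the sketch.

```python
from types import FunctionType

def greet():
    return "original"

code = greet.__code__

# code.replace() exists from Python 3.8 onwards, so guard the call
if hasattr(code, "replace"):
    # Swap one constant and rename the code object, keep everything else
    new_consts = tuple(
        "patched" if const == "original" else const
        for const in code.co_consts
    )
    patched_code = code.replace(co_name="crafted_greet", co_consts=new_consts)

    crafted = FunctionType(patched_code, {})
    print(crafted.__name__)  # crafted_greet
    print(crafted())         # patched
else:
    print("code.replace() is not available on this Python version")
```

The bytecode itself is untouched here; only the constants tuple and the name change, so the instructions keep pointing at valid indexes.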

That is all for now. Later I will upload another tutorial explaining closures.
