DEV Community

Chris White


A Tour of CPython Compilation

For a programming language to work, something has to build and/or run it. Some languages have multiple implementations along with a reference implementation that dictates how the language works on a standardized level. For Python the reference implementation is CPython, which is written in the C programming language (along with some Python of course). This article looks at the process of going from Python code to an executable format using CPython.

What Is CPython?

CPython is the official reference implementation of Python, written in the C programming language. The historical design is documented in PEP 339. To better facilitate updates, however, design internals are now hosted on the devguide site. Source code for CPython is easily accessible via the official GitHub repository.

High Level Compiler Overview

Compilation is a multi-stage process whose details vary depending on the source language and the output. Python has roughly the following workflow:

  1. Generate a parser from a grammar file
  2. Run the parser on code that meets the grammar specifications
  3. Use the parser output along with ASDL definitions to generate an AST (Abstract Syntax Tree)
  4. Generate a control flow graph and bytecode from the AST

The more in-depth steps can be found on the devguide site.
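
The later stages of this workflow can be observed from Python itself. The following is a minimal sketch using only the standard library's `ast.parse`, `compile`, and `exec`. The sample source string and the `<example>` filename are placeholders chosen for illustration, and the parser generation step happens when CPython itself is built, so it isn't shown:

```python
import ast

# Sample source; any valid Python code would do.
source = 'greeting = "Hello, World!"\nprint(greeting)\n'

# Steps 2-3: parse the source text into an AST.
tree = ast.parse(source, filename="<example>")
print(type(tree).__name__)

# Step 4: compile the AST into a code object holding bytecode.
code = compile(tree, filename="<example>", mode="exec")
print(type(code).__name__)

# Finally, hand the code object to the interpreter for evaluation.
namespace = {}
exec(code, namespace)
```

Passing the tree to `compile` rather than the source string keeps the parsing and bytecode generation stages visibly separate.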

Grammar

As with spoken languages, programming languages have their own grammar. If you've read any RFCs for networking protocols you may have come across a variation of Backus–Naur form, commonly referred to as BNF. The HTTP protocol uses an augmented version of it for its standard. Other languages such as Ruby may even utilize a grammar file.

Python utilizes a parsing expression grammar (PEG), which can be found under the Grammar folder in the source code root. The grammar is turned into a C parser via the pegen tool, which can easily be run from a downloaded source directory as well:

$ cd Tools/peg_generator/
$ python3 -m pegen -q c ../../Grammar/python.gram ../../Grammar/Tokens -o parse.c

parse.c will now contain the generated parsing code, including the keywords mapped to their token values. A small sample:

static const int n_keyword_lists = 9;
static KeywordToken *reserved_keywords[] = {
    (KeywordToken[]) {{NULL, -1}},
    (KeywordToken[]) {{NULL, -1}},
    (KeywordToken[]) {
        {"if", 656},
        {"as", 654},
        {"in", 667},
        {"or", 581},
        {"is", 589},
        {NULL, -1},
    },
    (KeywordToken[]) {
        {"del", 613},
        {"def", 669},
        {"for", 666},
        {"try", 638},
        {"and", 582},
        {"not", 588},
        {NULL, -1},
    },
    (KeywordToken[]) {
        {"from", 618},
        {"pass", 504},
        {"with", 629},
        {"elif", 658},
        {"else", 659},
        {"None", 611},
        {"True", 610},
        {NULL, -1},
    },

Shown here are a few of the reserved keywords for the language. This is also the step where certain syntax errors can be found. For example:

def my_broken_function():
    foo(
    return "Never reaches here"

Attempting to run this will produce:

$ python3 invalid_token.py 
  File "parse_samples/invalid_token.py", line 2
    foo(
       ^
SyntaxError: '(' was never closed

The message comes from the parser's error handling:

static inline void
raise_unclosed_parentheses_error(Parser *p) {
       int error_lineno = p->tok->parenlinenostack[p->tok->level-1];
       int error_col = p->tok->parencolstack[p->tok->level-1];
       RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError,
                                  error_lineno, error_col, error_lineno, -1,
                                  "'%c' was never closed",
                                  p->tok->parenstack[p->tok->level-1]);
}
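
The same error can be triggered without a file by handing the broken source to the builtin `compile`. This is a small sketch rather than anything taken from CPython itself. The filename passed in is only a label, and the exact message and reported position vary between Python versions, so no particular output is assumed:

```python
source = 'def my_broken_function():\n    foo(\n    return "Never reaches here"\n'

try:
    compile(source, "invalid_token.py", "exec")
except SyntaxError as exc:
    # lineno, offset and msg carry the details a traceback would display.
    error = (exc.lineno, exc.offset, exc.msg)
    print(error)
else:
    error = None
```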

Python also has a standard library tokenize module that can be used to produce some basic structure information:

token_example.py

def say_hello():
    print("Hello, World!")

say_hello()

The tokenize module can be run to produce:

$ python3 -m tokenize token_example.py 
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,3:            NAME           'def'          
1,4-1,13:           NAME           'say_hello'    
1,13-1,14:          OP             '('            
1,14-1,15:          OP             ')'            
1,15-1,16:          OP             ':'            
1,16-1,17:          NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,9:            NAME           'print'        
2,9-2,10:           OP             '('            
2,10-2,25:          STRING         '"Hello, World!"'
2,25-2,26:          OP             ')'            
2,26-2,27:          NEWLINE        '\n'           
3,0-3,1:            NL             '\n'           
4,0-4,0:            DEDENT         ''             
4,0-4,9:            NAME           'say_hello'    
4,9-4,10:           OP             '('            
4,10-4,11:          OP             ')'            
4,11-4,12:          NEWLINE        ''             
5,0-5,0:            ENDMARKER      ''

There are a few problems with this form:

  • It includes punctuation and other labels which are entirely for structure but useless to calling methods
  • Labels are very generic, such as "NAME"
  • Changes require source code modification
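
The generic labels can be seen when using tokenize programmatically as well. A minimal sketch, assuming the same source as token_example.py held in a string (the variable names are just for illustration):

```python
import io
import tokenize

source = 'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'

# generate_tokens() expects a readline-style callable returning str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok.type is a numeric code; tok_name maps it to labels such as NAME or OP.
    print(f"{tokenize.tok_name[tok.type]:<10} {tok.string!r}")

# Keywords and identifiers share the same generic NAME label.
names = [tok.string for tok in tokens if tok.type == tokenize.NAME]
print(names)
```

Here `def`, `say_hello`, and `print` all come back as NAME, which shows why the token stream alone isn't enough for the compiler to work with.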

Abstract Syntax Tree (AST)

To get around these issues an AST can be generated. This will provide more context around various tokens. Token output is supplemented at this stage through Abstract Syntax Description Language (ASDL), which can be found in Parser/Python.asdl. If you're interested in some practical use of ASDL in a more condensed format, I recommend looking at the oil shell article on it. Python provides an ast module to get an idea of what an AST looks like for Python code. Much like tokenize it can be run via the command line:

$ python3 -m ast token_example.py 
Module(
   body=[
      FunctionDef(
         name='say_hello',
         args=arguments(
            posonlyargs=[],
            args=[],
            kwonlyargs=[],
            kw_defaults=[],
            defaults=[]),
         body=[
            Expr(
               value=Call(
                  func=Name(id='print', ctx=Load()),
                  args=[
                     Constant(value='Hello, World!')],
                  keywords=[]))],
         decorator_list=[]),
      Expr(
         value=Call(
            func=Name(id='say_hello', ctx=Load()),
            args=[],
            keywords=[]))],
   type_ignores=[])

This is a lot more structured than the token form. It also focuses solely on what's required to execute various methods. A more involved example:

expanded_ast_example.py

class MyClass(object):
    @staticmethod
    def say_hello(name: String = 'John Doe') -> None:
        print(name)

This includes decorators, default arguments, and type hinting. The output from ast:

$ python3 -m ast expanded_ast_example.py 
Module(
   body=[
      ClassDef(
         name='MyClass',
         bases=[
            Name(id='object', ctx=Load())],
         keywords=[],
         body=[
            FunctionDef(
               name='say_hello',
               args=arguments(
                  posonlyargs=[],
                  args=[
                     arg(
                        arg='name',
                        annotation=Name(id='String', ctx=Load()))],
                  kwonlyargs=[],
                  kw_defaults=[],
                  defaults=[
                     Constant(value='John Doe')]),
               body=[
                  Expr(
                     value=Call(
                        func=Name(id='print', ctx=Load()),
                        args=[
                           Name(id='name', ctx=Load())],
                        keywords=[]))],
               decorator_list=[
                  Name(id='staticmethod', ctx=Load())],
               returns=Constant(value=None))],
         decorator_list=[])],
   type_ignores=[])

This is now a lot more useful to the compiler. It takes the tokens and builds a structure describing how they relate to other components. Type annotations are included as part of the respective FunctionDef structure, and the class has its base inheritance (object) included. In the end though this is simply structure and nothing is actually executed.
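
Because the AST is structured, pulling specific details out of it takes very little code. The following sketch walks the tree for the class example and collects bases, decorators, and annotations. It assumes Python 3.9 or later for `ast.unparse`, and the `summary` list is just an illustrative container:

```python
import ast

source = '''
class MyClass(object):
    @staticmethod
    def say_hello(name: String = 'John Doe') -> None:
        print(name)
'''

tree = ast.parse(source)

summary = []
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        bases = [ast.unparse(base) for base in node.bases]
        summary.append(("class", node.name, bases))
    elif isinstance(node, ast.FunctionDef):
        decorators = [ast.unparse(d) for d in node.decorator_list]
        # Map each annotated argument name to the source text of its annotation.
        annotations = {a.arg: ast.unparse(a.annotation)
                       for a in node.args.args if a.annotation is not None}
        summary.append(("function", node.name, decorators, annotations))

for entry in summary:
    print(entry)
```

Note that `String` is never resolved here. Parsing only builds structure, so an undefined name in an annotation is not a problem at this stage.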

Bytecode

Python compiles source code into an intermediate instruction set known as bytecode. Such mechanisms are popular in languages such as Ruby, PHP, and JVM based languages. So why is this step important? In terms of portability consider a compiled C program:

  • A Linux compiled program won't run as-is on a Windows system
  • A program compiled on a 32 bit system may have issues on a 64 bit one
  • A Linux compiled program may have issues on another Linux system due to missing shared libraries
  • etc.

The benefit of bytecode is that as long as there is some kind of evaluation program such as a Virtual Machine on the target system, code execution is possible. Note that this just means the code can be run, not that the code being run is itself portable (a number of functions in the os module are only available on certain platforms, for example). Thankfully the Python standard library and popular modules often deal with operating system differences behind the scenes so you don't have to. Python uses a specific set of opcodes for this purpose:

def_op('CACHE', 0)
def_op('POP_TOP', 1)
def_op('PUSH_NULL', 2)
def_op('INTERPRETER_EXIT', 3)
def_op('END_FOR', 4)
def_op('END_SEND', 5)
def_op('TO_BOOL', 6)

There are two ways that opcodes can be generated for discovery purposes. The first is the dis (disassembler) module. It's meant to be used in the REPL or as part of a Python script:

import dis

def say_hello():
    print("Hello, World!")

dis.dis(say_hello)

This will produce:

  4           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello, World!')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE

The disassembly is scoped to the specific function given. Full compilation of an AST to bytecode is handled by the Python compiler, which can also be driven through the py_compile module:

$ python3 -m py_compile token_example.py
$ ls __pycache__/token_example.cpython-310.pyc 
__pycache__/token_example.cpython-310.pyc

Unlike the other runs this doesn't produce immediate standard output, and instead produces a binary .pyc file. Note that the Python implementation and version are included in the file name. While the bytecode is platform independent it's still VM dependent, since opcodes can change between versions. As an example the CALL_FUNCTION opcode doesn't even exist in the 3.11 version. This means if I copy the .pyc I generated on Linux over to a Windows system in a Python 3.10 virtual environment:

> python .\__pycache__\token_example.cpython-310.pyc
Hello, World!
> python --version
Python 3.10.0

Everything works just fine. This is what's meant by the "platform independent" part. On the other hand if I try a Python 3.9 virtual environment:

> python .\__pycache__\token_example.cpython-310.pyc
RuntimeError: Bad magic number in .pyc file

Note: Due to version dependency, from here on GitHub links will mostly link to the 3.10 branch.

This bytecode is actually a binary file, and the opcodes you see from dis are labels that map to respective numeric values. The "magic" named in the error is simply a fancy way of saying that some portion of the binary data is used to identify the format. The term originates from the magic patterns used by the file command to identify file types. The value is defined in Lib/importlib/_bootstrap_external.py (3.10 example):

MAGIC_NUMBER = (3439).to_bytes(2, 'little') + b'\r\n'

So 3439 is the unique tag for the header identifier. It ends up as a 2 byte little endian (the byte order most modern systems use) value, followed by the bytes \r\n. Reading the first two bytes of the .pyc file confirms this:

>>> fp = open('__pycache__/token_example.cpython-310.pyc', 'rb')
>>> magic_tag = fp.read(2)
>>> int.from_bytes(magic_tag, byteorder='little')
3439
>>> fp.close()

Now if the same source is py_compile'ed using Python 3.9:

>>> fp = open('__pycache__/token_example.cpython-39.pyc', 'rb')
>>> magic_tag = fp.read(2)
>>> int.from_bytes(magic_tag, byteorder='little')
3425
>>> fp.close()

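The value 3425 is the magic number for Python 3.9. As it differs from the 3439 used by 3.10, the 3.9 interpreter rejects the .pyc file produced by 3.10.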
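
To check the magic number for whichever interpreter is at hand, a freshly compiled file can be compared against `importlib.util.MAGIC_NUMBER`. This is a minimal sketch. The temporary directory, the one-line source, and the explicit `cfile` location are assumptions made to keep the example self-contained:

```python
import importlib.util
import py_compile
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "token_example.py"
    source.write_bytes(b'print("Hello, World!")\n')

    # Compile to an explicit location so the file is easy to find.
    pyc_path = py_compile.compile(str(source), cfile=str(Path(tmp) / "token_example.pyc"))
    header = Path(pyc_path).read_bytes()[:4]

# The first two bytes hold the version-specific number, the last two are b'\r\n'.
magic_value = int.from_bytes(header[:2], byteorder="little")
print(sys.version_info[:2], magic_value, header[2:])
print(header == importlib.util.MAGIC_NUMBER)
```

The printed number depends on the Python version running the script, which is exactly the point.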

pyc parsing

Beyond the magic header, the rest of a pyc file is interesting as well. PEP 552 gives a general idea of the current state of the pyc binary header:

The pyc header currently consists of 3 32-bit words. We will expand it to 4

So four 32 bit words, or 16 bytes. This change was introduced in Python 3.7 to support reproducible builds. Looking at the py_compile source we can see how the default invalidation mode is chosen:

def _get_default_invalidation_mode():
    if os.environ.get('SOURCE_DATE_EPOCH'):
        return PycInvalidationMode.CHECKED_HASH
    else:
        return PycInvalidationMode.TIMESTAMP

So a method using hash calculations is utilized if the SOURCE_DATE_EPOCH environment variable is set, as per the reproducible builds standard. A different pyc generation function is chosen depending on this mode:

    if invalidation_mode == PycInvalidationMode.TIMESTAMP:
        source_stats = loader.path_stats(file)
        bytecode = importlib._bootstrap_external._code_to_timestamp_pyc(
            code, source_stats['mtime'], source_stats['size'])
    else:
        source_hash = importlib.util.source_hash(source_bytes)
        bytecode = importlib._bootstrap_external._code_to_hash_pyc(
            code,
            source_hash,
            (invalidation_mode == PycInvalidationMode.CHECKED_HASH),
        )

This uses importlib, which handles a lot of the interesting details of bytecode compilation. Looking at the two functions we see:

def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    "Produce the data for a timestamp-based pyc."
    data = bytearray(MAGIC_NUMBER)
    data.extend(_pack_uint32(0))
    data.extend(_pack_uint32(mtime))
    data.extend(_pack_uint32(source_size))
    data.extend(marshal.dumps(code))
    return data


def _code_to_hash_pyc(code, source_hash, checked=True):
    "Produce the data for a hash-based pyc."
    data = bytearray(MAGIC_NUMBER)
    flags = 0b1 | checked << 1
    data.extend(_pack_uint32(flags))
    assert len(source_hash) == 8
    data.extend(source_hash)
    data.extend(marshal.dumps(code))
    return data

This correlates with the four 32 bit words mentioned:

Timestamp

  • 32 bit magic
  • 32 bit 0 padding
  • 32 bit modification time
  • 32 bit source code size

Hash

  • 32 bit magic
  • 32 bit flags
  • 64 bit hash of source based on SipHash
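
Both layouts can be decoded with struct. The sketch below compiles the same file in each invalidation mode and parses the first 16 bytes. The `describe_pyc_header` helper is a hypothetical name made up for this example, and the little endian unpacking is an assumption based on how the magic number itself is stored:

```python
import py_compile
import struct
import tempfile
from pathlib import Path

def describe_pyc_header(data):
    """Decode a 16-byte pyc header into a dict (illustrative helper)."""
    magic = data[:4]
    flags = struct.unpack("<I", data[4:8])[0]
    if flags & 0b1:
        # Hash-based pyc: the second bit says whether the hash is checked on import.
        return {"magic": magic, "kind": "hash",
                "checked": bool(flags & 0b10), "source_hash": data[8:16]}
    mtime, size = struct.unpack("<II", data[8:16])
    return {"magic": magic, "kind": "timestamp", "mtime": mtime, "source_size": size}

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "token_example.py"
    source.write_bytes(b'print("Hello, World!")\n')
    headers = {}
    for mode in (py_compile.PycInvalidationMode.TIMESTAMP,
                 py_compile.PycInvalidationMode.CHECKED_HASH):
        cfile = str(Path(tmp) / f"{mode.name.lower()}.pyc")
        py_compile.compile(str(source), cfile=cfile, invalidation_mode=mode)
        headers[mode.name] = describe_pyc_header(Path(cfile).read_bytes()[:16])

for name, header in headers.items():
    print(name, header)
```

Passing `invalidation_mode` explicitly sidesteps the SOURCE_DATE_EPOCH check, so both header variants can be produced on the same machine.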

The Hunt For PyCodeObject

The rest of the data is a marshal dump. marshal is a Python module used to serialize internal Python objects. Serialization is the process of taking data structures and presenting them in a way that allows for the same structures to be produced by another system. In this case the Python data structures are packaged in a byte format. Now looking at what's being passed to marshal:

    loader = importlib.machinery.SourceFileLoader('<py_compile>', file)
    source_bytes = loader.get_data(file)
    try:
        code = loader.source_to_code(source_bytes, dfile or file,
                                     _optimize=optimize)

It's using importlib.machinery.SourceFileLoader.source_to_code or at least, the parent SourceLoader implementation of it. What we'll do is make a more isolated use case with a breakpoint to traverse it using the pdb module:

from importlib.machinery import SourceFileLoader

loader = SourceFileLoader('<py_compile>', 'token_example.py')
source_bytes = loader.get_data('token_example.py')
breakpoint()
code = loader.source_to_code(source_bytes, 'token_example.py')

Once it's running I step through with the debugger a few times:

-> code = loader.source_to_code(source_bytes, 'token_example.py')
(Pdb) s
--Call--
> <frozen importlib._bootstrap_external>(942)source_to_code()
(Pdb) a
self = <_frozen_importlib_external.SourceFileLoader object at 0x7f4a64de3c10>
data = b'def say_hello():\n    print("Hello, World!")\n\nsay_hello()'
path = 'token_example.py'
_optimize = -1
(Pdb) s
> <frozen importlib._bootstrap_external>(947)source_to_code()
(Pdb) s
> <frozen importlib._bootstrap_external>(948)source_to_code()
(Pdb) s
> <frozen importlib._bootstrap_external>(947)source_to_code()
(Pdb) s
--Call--
> <frozen importlib._bootstrap>(233)_call_with_frames_removed()
(Pdb) a
f = <built-in function compile>
args = (b'def say_hello():\n    print("Hello, World!")\n\nsay_hello()', 'token_example.py', 'exec')
kwds = {'dont_inherit': True, 'optimize': -1}
(Pdb)

First, data is the actual source code read from the Python file as bytes. Next is path, which indicates where the code came from. Stepping in further we see _call_with_frames_removed, which calls the builtin compile with specific positional and keyword arguments. The devguide's section on exploring the internals notes:

For builtin functions, the typical layout is:
Python/bltinmodule.c
Lib/test/test_builtin.py
Doc/library/functions.rst

The builtin compile lives in Python/bltinmodule.c, and the compilation work itself is handled in Python/compile.c. That file is so huge that GitHub takes a decent amount of time to load a preview. The start of the file gives at least some hint of where to begin:

/*
 * This file compiles an abstract syntax tree (AST) into Python bytecode.
 *
 * The primary entry point is _PyAST_Compile(), which returns a
 * PyCodeObject.  The compiler makes several passes to build the code
 * object:
 *   1. Checks for future statements.  See future.c
 *   2. Builds a symbol table.  See symtable.c.
 *   3. Generate an instruction sequence. See compiler_mod() in this file.
 *   4. Generate a control flow graph and run optimizations on it.  See flowgraph.c.
 *   5. Assemble the basic blocks into final code.  See optimize_and_assemble() in
 *      this file, and assembler.c.
 *
 * Note that compiler_mod() suggests module, but the module ast type
 * (mod_ty) has cases for expressions and interactive statements.
 *
 * CAUTION: The VISIT_* macros abort the current function when they
 * encounter a problem. So don't invoke them when there is memory
 * which needs to be released. Code blocks are OK, as the compiler
 * structure takes care of releasing those.  Use the arena to manage
 * objects.
 */

Checking the _PyAST_Compile() function:

PyCodeObject *
_PyAST_Compile(mod_ty mod, PyObject *filename, PyCompilerFlags *flags,
               int optimize, PyArena *arena)
{
    struct compiler c;
    PyCodeObject *co = NULL;
    PyCompilerFlags local_flags = _PyCompilerFlags_INIT;
    int merged;

    if (!__doc__) {
        __doc__ = PyUnicode_InternFromString("__doc__");
        if (!__doc__)
            return NULL;
    }
    if (!__annotations__) {
        __annotations__ = PyUnicode_InternFromString("__annotations__");
        if (!__annotations__)
            return NULL;
    }
    if (!compiler_init(&c))
        return NULL;
    Py_INCREF(filename);
    c.c_filename = filename;
    c.c_arena = arena;
    c.c_future = _PyFuture_FromAST(mod, filename);
    if (c.c_future == NULL)
        goto finally;
    if (!flags) {
        flags = &local_flags;
    }
    merged = c.c_future->ff_features | flags->cf_flags;
    c.c_future->ff_features = merged;
    flags->cf_flags = merged;
    c.c_flags = flags;
    c.c_optimize = (optimize == -1) ? _Py_GetConfig()->optimization_level : optimize;
    c.c_nestlevel = 0;

    _PyASTOptimizeState state;
    state.optimize = c.c_optimize;
    state.ff_features = merged;

    if (!_PyAST_Optimize(mod, arena, &state)) {
        goto finally;
    }

    c.c_st = _PySymtable_Build(mod, filename, c.c_future);
    if (c.c_st == NULL) {
        if (!PyErr_Occurred())
            PyErr_SetString(PyExc_SystemError, "no symtable");
        goto finally;
    }

    co = compiler_mod(&c, mod);

 finally:
    compiler_free(&c);
    assert(co || PyErr_Occurred());
    return co;
}

It returns a PyCodeObject. The Python docs have some information on the PyCodeObject type. However they don't show the full picture. Thankfully the code for it has a list of supported properties:

static PyMemberDef code_memberlist[] = {
    {"co_argcount",     T_INT,          OFF(co_argcount),        READONLY},
    {"co_posonlyargcount",      T_INT,  OFF(co_posonlyargcount), READONLY},
    {"co_kwonlyargcount",       T_INT,  OFF(co_kwonlyargcount),  READONLY},
    {"co_nlocals",      T_INT,          OFF(co_nlocals),         READONLY},
    {"co_stacksize",T_INT,              OFF(co_stacksize),       READONLY},
    {"co_flags",        T_INT,          OFF(co_flags),           READONLY},
    {"co_code",         T_OBJECT,       OFF(co_code),            READONLY},
    {"co_consts",       T_OBJECT,       OFF(co_consts),          READONLY},
    {"co_names",        T_OBJECT,       OFF(co_names),           READONLY},
    {"co_varnames",     T_OBJECT,       OFF(co_varnames),        READONLY},
    {"co_freevars",     T_OBJECT,       OFF(co_freevars),        READONLY},
    {"co_cellvars",     T_OBJECT,       OFF(co_cellvars),        READONLY},
    {"co_filename",     T_OBJECT,       OFF(co_filename),        READONLY},
    {"co_name",         T_OBJECT,       OFF(co_name),            READONLY},
    {"co_firstlineno",  T_INT,          OFF(co_firstlineno),     READONLY},
    {"co_linetable",    T_OBJECT,       OFF(co_linetable),       READONLY},
    {NULL}      /* Sentinel */
};

Code Object Inspection

Now this isn't even the full picture of what's going on in the backend. Working out exactly how that maps to bytes in the pyc file is honestly probably not worth the effort, not to mention the underlying structure could change through modifications to either marshal or the PyCodeObject structure. The actual process of getting the code object from a pyc file involves skipping the first 16 bytes and using marshal on the rest. Thankfully pkgutil.read_code will essentially do that for us (with magic checking at the same time):

def read_code(stream):
    # This helper is needed in order for the PEP 302 emulation to
    # correctly handle compiled files
    import marshal

    magic = stream.read(4)
    if magic != importlib.util.MAGIC_NUMBER:
        return None

    stream.read(12) # Skip rest of the header
    return marshal.load(stream)
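
The same skip-and-unmarshal approach can be tried end to end without pkgutil. This is a rough sketch that compiles a small stand-in source into a temporary directory. The fixed 16 byte header length is an assumption that holds for Python 3.7 and later, and the file must be loaded by the same Python version that wrote it:

```python
import marshal
import py_compile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "token_example.py"
    source.write_bytes(
        b'def say_hello():\n    return "Hello, World!"\n\nresult = say_hello()\n'
    )
    cfile = py_compile.compile(str(source), cfile=str(Path(tmp) / "token_example.pyc"))
    data = Path(cfile).read_bytes()

# Skip the 16-byte header; the remainder is the marshalled code object.
code = marshal.loads(data[16:])
print(code.co_name, code.co_names)

# The recovered code object can be executed like any other.
namespace = {}
exec(code, namespace)
print(namespace["result"])
```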

So taking a look in the REPL:

>>> from pkgutil import read_code
>>> fp = open('__pycache__/token_example.cpython-310.pyc', 'rb')
>>> code = read_code(fp)
>>> code
<code object <module> at 0x7faf6f6a6290, file "token_example.py", line 1>

There's now a code object. I'll take a look at some of the more interesting properties that the code_memberlist showed:

>>> code.co_code
b'd\x00d\x01\x84\x00Z\x00e\x00\x83\x00\x01\x00d\x02S\x00'
>>> code.co_consts
(<code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>, 'say_hello', None)
>>> code.co_names
('say_hello',)
>>> code.co_filename
'token_example.py'
>>> code.co_name
'<module>'

So now looking at the code byte by byte:

>>> code.co_code[0]
100
>>> code.co_code[1]
0

The first byte is 100 and the second is 0. The 100 maps to the opcode LOAD_CONST in Python 3.10. The next byte is its argument, in this case the index of the first constant:

>>> code.co_consts[0]
<code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>

So this is loading the code object for the say_hello function (they weren't kidding when they said everything is an object). The next instruction loads the string constant 'say_hello':

>>> code.co_consts[0]
<code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>
>>> code.co_code[2]
100
>>> code.co_code[3]
1
>>> code.co_consts[1]
'say_hello'

The next opcode is different from the rest:

>>> code.co_code[4]
132
>>> code.co_code[5]
0

This is the MAKE_FUNCTION opcode. It and the other opcodes are documented as part of the dis module. Given that the online documentation covers the latest version and opcodes can change between versions, we'll need to use the documentation included in the source:

.. opcode:: MAKE_FUNCTION (flags)

   Pushes a new function object on the stack.  From bottom to top, the consumed
   stack must consist of values if the argument carries a specified flag value

   * ``0x01`` a tuple of default values for positional-only and
     positional-or-keyword parameters in positional order
   * ``0x02`` a dictionary of keyword-only parameters' default values
   * ``0x04`` a tuple of strings containing parameters' annotations
   * ``0x08`` a tuple containing cells for free variables, making a closure
   * the code associated with the function (at TOS1)
   * the :term:`qualified name` of the function (at TOS)

The argument ends up as 0 since there's nothing extra to pull in save the qualified name say_hello and the code itself. If there were, a bitwise AND against the argument would show which flags are active. For example, with a different code object whose MAKE_FUNCTION argument sits at index 17:

>>> code.co_code[17] & 0x01
1
>>> code.co_code[17] & 0x02
0
>>> code.co_code[17] & 0x04
4
>>> code.co_code[17] & 0x08
0

The function in that example has a default value (0x01) and a tuple of 4 values that establish annotations for type hinting (0x04). Back in token_example.py, a simple list can show the stack loading, with the qualified name being the TOS (Top of Stack) and the code being TOS1:

>>> stack = []
>>> stack.append(code.co_consts[0])
>>> stack.append(code.co_consts[1])
>>> stack
[<code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>, 'say_hello']
>>> stack.pop()
'say_hello'
>>> stack.pop()
<code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>

Next is a STORE_NAME instruction:

>>> code.co_code[6]
90
>>> code.co_code[7]
0

The documentation describes it as follows:

.. opcode:: STORE_NAME (namei)

   Implements ``name = TOS``. *namei* is the index of *name* in the attribute
   :attr:`co_names` of the code object. The compiler tries to use
   :opcode:`STORE_FAST` or :opcode:`STORE_GLOBAL` if possible.

We can check that index in the co_names attribute as the docs mention:

>>> code.co_names[0]
'say_hello'

Given that the previous MAKE_FUNCTION instruction pushed the function onto the stack, the name now points to it, making it callable. Once the function is defined it's immediately called, so LOAD_NAME comes next:

>>> code.co_code[8]
101
>>> code.co_code[9]
0

This pulls the value of say_hello, the same name that STORE_NAME used previously. As for the documentation:

.. opcode:: LOAD_NAME (namei)

   Pushes the value associated with ``co_names[namei]`` onto the stack.

The value associated with say_hello is the function object. As might be expected, the function is then called via CALL_FUNCTION:

>>> code.co_code[10]
131
>>> code.co_code[11]
0

The documentation for it:

.. opcode:: CALL_FUNCTION (argc)

   Calls a callable object with positional arguments.
   *argc* indicates the number of positional arguments.
   The top of the stack contains positional arguments, with the right-most
   argument on top.  Below the arguments is a callable object to call.
   ``CALL_FUNCTION`` pops all arguments and the callable object off the stack,
   calls the callable object with those arguments, and pushes the return value
   returned by the callable object.

There are no arguments in say_hello() so the argument ends up being 0. The return value gets pushed to the stack, where it is then discarded by POP_TOP:

>>> code.co_code[12]
1
>>> code.co_code[13]
0

The name is a pretty good giveaway of what POP_TOP does:

.. opcode:: POP_TOP

   Removes the top-of-stack (TOS) item.

Next the None constant is loaded, as the module level code has no explicit return value:

>>> code.co_code[14]
100
>>> code.co_code[15]
2
>>> print(code.co_consts[2])
None

Finally, to close everything up, RETURN_VALUE is executed for the main code:

>>> code.co_code[16]
83
>>> code.co_code[17]
0

According to the documentation:

.. opcode:: RETURN_VALUE

   Returns with TOS to the caller of the function

As the TOS is None this is what will end up getting returned. This completes the execution as there are no bytes left to work with:

>>> code.co_code[18]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index out of range
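
To tie the walkthrough together, the nine instructions can be replayed on a toy stack machine. This is purely an illustration and not how ceval is implemented. The `run` function, the hand-written `program` list, and the stand-in `_say_hello_body` function are all made up for this sketch, using the Python 3.10 opcode names seen above:

```python
import types

def _say_hello_body():
    print("Hello, World!")

# Hand-written stand-ins for co_consts, co_names, and the decoded co_code.
consts = (_say_hello_body.__code__, "say_hello", None)
names = ("say_hello",)
program = [
    ("LOAD_CONST", 0), ("LOAD_CONST", 1), ("MAKE_FUNCTION", 0),
    ("STORE_NAME", 0), ("LOAD_NAME", 0), ("CALL_FUNCTION", 0),
    ("POP_TOP", None), ("LOAD_CONST", 2), ("RETURN_VALUE", None),
]

def run(program, consts, names):
    stack, namespace = [], {}
    for opname, arg in program:
        if opname == "LOAD_CONST":
            stack.append(consts[arg])
        elif opname == "MAKE_FUNCTION":
            qualname = stack.pop()  # TOS
            code = stack.pop()      # TOS1
            stack.append(types.FunctionType(code, globals(), qualname))
        elif opname == "STORE_NAME":
            namespace[names[arg]] = stack.pop()
        elif opname == "LOAD_NAME":
            stack.append(namespace[names[arg]])
        elif opname == "CALL_FUNCTION":
            # Pop the positional arguments, then the callable beneath them.
            args = [stack.pop() for _ in range(arg)][::-1]
            func = stack.pop()
            stack.append(func(*args))
        elif opname == "POP_TOP":
            stack.pop()
        elif opname == "RETURN_VALUE":
            return stack.pop(), namespace
        else:
            raise ValueError(f"unsupported instruction: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")

result, namespace = run(program, consts, names)
print(result)
```

Only the flag-free form of MAKE_FUNCTION is handled here, which is all the module code in token_example.py needs.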

It turns out that dis also works with code objects, so if we pass our code to dis we can validate our bytecode observations:

>>> import dis
>>> dis.dis(code)
  1           0 LOAD_CONST               0 (<code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>)
              2 LOAD_CONST               1 ('say_hello')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (say_hello)

  4           8 LOAD_NAME                0 (say_hello)
             10 CALL_FUNCTION            0
             12 POP_TOP
             14 LOAD_CONST               2 (None)
             16 RETURN_VALUE

Disassembly of <code object say_hello at 0x7faf6f6a61e0, file "token_example.py", line 1>:
  2           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello, World!')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE

It even provides us with the disassembled bytecode for the say_hello function. The values referenced by the various LOAD_ instructions are resolved for us as well so we don't need to think about that. This is much more workable than trying to parse out the code bytes manually!
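
For comparison, the manual byte-by-byte walk can be automated in a few lines with `dis.opname`, which maps numeric opcodes back to their labels. A minimal sketch, compiling the source in-process instead of reading a pyc file. The instruction names and numbers printed depend on the Python version running it, and newer versions may also show CACHE entries between instructions:

```python
import dis

source = 'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'
code = compile(source, "token_example.py", "exec")

# Since Python 3.6 each instruction occupies two bytes: opcode, then argument.
raw = code.co_code
decoded = []
for offset in range(0, len(raw), 2):
    opcode, arg = raw[offset], raw[offset + 1]
    # dis.opname maps the numeric opcode back to its label.
    decoded.append((offset, dis.opname[opcode], arg))

for offset, name, arg in decoded:
    print(f"{offset:>4} {name:<20} {arg}")
```

Unlike dis.dis this doesn't resolve arguments to their constants or names, so it's best treated as a way to check your understanding rather than a replacement.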

Evaluation

The programmatic handling of bytecode functionality can be found in Python/ceval.c, in particular in this rather large switch statement:

        switch (opcode) {

        /* BEWARE!
           It is essential that any operation that fails must goto error
           and that all operation that succeed call DISPATCH() ! */

        case TARGET(NOP): {
            DISPATCH();
        }

        case TARGET(LOAD_FAST): {
            PyObject *value = GETLOCAL(oparg);
            if (value == NULL) {
                format_exc_check_arg(tstate, PyExc_UnboundLocalError,
                                     UNBOUNDLOCAL_ERROR_MSG,
                                     PyTuple_GetItem(co->co_varnames, oparg));
                goto error;
            }
            Py_INCREF(value);
            PUSH(value);
            DISPATCH();
        }

        case TARGET(LOAD_CONST): {
            PREDICTED(LOAD_CONST);
            PyObject *value = GETITEM(consts, oparg);
            Py_INCREF(value);
            PUSH(value);
            DISPATCH();
        }

In the latest Python branch this is made a bit more manageable by having the bytecode functionality as part of a generated source file:

#if !USE_COMPUTED_GOTOS
    dispatch_opcode:
        switch (opcode)
#endif
        {

#include "generated_cases.c.h"

The actual generated_cases.c.h is, as the name suggests, a series of generated case statements whose source is Python/bytecodes.c:

        inst(STORE_NAME, (v -- )) {
            PyObject *name = GETITEM(FRAME_CO_NAMES, oparg);
            PyObject *ns = LOCALS();
            int err;
            if (ns == NULL) {
                _PyErr_Format(tstate, PyExc_SystemError,
                              "no locals found when storing %R", name);
                DECREF_INPUTS();
                ERROR_IF(true, error);
            }
            if (PyDict_CheckExact(ns))
                err = PyDict_SetItem(ns, name, v);
            else
                err = PyObject_SetItem(ns, name, v);
            DECREF_INPUTS();
            ERROR_IF(err, error);
        }

The exec builtin is also able to execute code objects:

>>> exec(code)
Hello, World!

Its underlying code is shared with ceval's. Also worth a mention is that code object properties are read only:

>>> code.co_code = b'd\x00d\x01\x84\x00Z\x00e\x00\x83\x00\x01\x00d\x02S\x00'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: readonly attribute

So anything that manipulates code objects has to create a new code object rather than modify one in place, for example through the respective Python C API.
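
From Python itself, one way to get a modified code object is the `replace()` method available on code objects since Python 3.8, which returns a changed copy. The sketch below is an illustration only, with a made-up `greet` function, and swapping constants like this is not something to rely on in real code:

```python
def greet():
    return "Hello, World!"

original = greet.__code__

# Build a new constants tuple with the greeting swapped out.
new_consts = tuple(
    "Goodbye, World!" if const == "Hello, World!" else const
    for const in original.co_consts
)

# replace() returns a modified copy; the original code object is untouched.
patched = original.replace(co_consts=new_consts)

# Function objects allow their __code__ attribute to be reassigned.
greet.__code__ = patched
print(greet())
```

The original code object still holds the old constant afterwards, which is consistent with the attributes being read only.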

Conclusion

This concludes a rather short tour of parsing and a very long tour of bytecode. One major takeaway is that bytecode related functionality is dependent upon the Python implementation and version being used. This means you may have to utilize the documentation in the respective source code version branch if you're not using the latest Python. I hope this article proved useful for those who wanted to see what's under the hood, or maybe even dive in a little deeper!
