Pierre Gradot

Sharing strings between C and Python through byte buffers

If you use a buffer of bytes to share strings between C (or C++) and Python, you may end up with weird results, because the two languages do not represent strings the same way.

I got bitten recently, and here is what I have learnt.

C strings

If you have used C to handle strings even once in your lifetime, then you have run into one of the language's most painful features: null-terminated strings 😑

In C, a string is just a pointer: at the given address, you will find bytes that are interpreted as characters. How does the string end? Simply when a special character is found: '\0' (its numerical value being 0). Forget this character and your string ends... somewhere (if the program doesn't crash first).
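This stop-at-the-first-'\0' behaviour can actually be observed from Python, through the standard ctypes module (a small sketch of mine, not part of the original firmware):

```python
import ctypes

# A 20-byte buffer, like the BLE characteristic later in this post,
# containing a leftover fragment after the first NUL byte
buf = ctypes.create_string_buffer(b"New name\x00nal name", 20)

print(buf.value)  # .value reads it as a C string: it stops at the first NUL
print(buf.raw)    # .raw returns all 20 bytes, embedded NULs included
```

Here `buf.value` gives `b'New name'` while `buf.raw` still holds all 20 bytes: two views of the same memory, one with C semantics and one without.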

These C strings are available in C++ too. Even though you would probably prefer std::string in C++, there are valid reasons to stick to C strings. The most common one is when you cannot use dynamic memory allocation, which is almost always the case on embedded systems.

Python strings

Python is implemented in C. To be accurate, CPython, the reference implementation (the one almost everybody uses), is written in C. So you may think that Python strings are C strings.

They are not.

Take a look at stringobject.h from Python's SVN repository (this is a Python 2 header, but Python 3's string objects also store an explicit size).

On line 13, this comment says it all:

Type PyStringObject represents a character string. An extra zero byte is reserved at the end to ensure it is zero-terminated, but a size is present so strings with null bytes in them can be represented.

String representation in Python is different from, and more complex than, C's.
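The difference is easy to see from any Python 3 interpreter: a str knows its own length, and embedded null characters are ordinary characters, not terminators:

```python
# A Python string with an embedded null character
s = "New name\x00nal name"

print(len(s))       # 17: the length is stored, not deduced from a terminator
print("\x00" in s)  # True: the null character is just another character
```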

Why is it a problem when sharing a buffer of bytes?

I have an IoT device with Bluetooth Low Energy (BLE) connectivity. Bluetooth Low Energy devices transfer data back and forth using concepts called "services" and "characteristics". One of my characteristics is a buffer to share a string: 20 bytes with the name of my device.

The firmware of this device is written in C++, but because dynamic memory allocation is disabled, C strings are used. Such a string is therefore written to the buffer.

My computer runs a Python program that connects to the device. It reads the 20 bytes and decodes them as a string.

Because of the different string representations between C(++) and Python, I ran into funny bugs with garbage characters in Python.

Code to illustrate the issue

Here is some C++ code that reproduces the issue:

#include <array>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>

using Characteristic = std::array<char, 20>;

int main() {

    Characteristic device_name{};

    // Write the original name
    const char* original_name = "The original name";
    std::strcpy(device_name.data(), original_name);
    std::cout << device_name.data() << ' ' << std::strlen(device_name.data()) << '\n';

    // Write the new name
    // (this doesn't replace remaining characters with 0s in the buffer)
    const char* new_name = "New name";
    std::strcpy(device_name.data(), new_name);
    std::cout << device_name.data() << ' ' << std::strlen(device_name.data()) << '\n';

    // Mimic serialization over BLE
    // (in fact, this prints a byte literal for the Python code below)
    for (auto c : device_name) {
        // Cast via unsigned char so that negative char values don't sign-extend
        std::cout << "\\x" << std::setfill('0') << std::setw(2) << std::hex
                  << static_cast<std::uint32_t>(static_cast<unsigned char>(c));
    }
}

NOTE: this code is bad! strcpy will cause a buffer overrun if a name is longer than 19 characters (the 20th byte is needed for the terminating '\0').

The output is:

The original name 17
New name 8
\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00

As you can see, there is an intermediate byte with the value 0 in the buffer.

Let's take those bytes and decode them with Python:

# Mimic deserialization over BLE
# (just take what the C++ code prints)
raw = b'\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00'
assert len(raw) == 20

# Get an 'str' object
string = raw.decode('ascii')
print('Decoded: [{}] {}'.format(string, len(string)))

The output is:

Decoded: [New name nal name   ] 20

Oops! Python doesn't stop at the first '\0', so the device name is wrong in my Python software 😳

Code to solve the issue

The easiest and fastest solution is to make the byte decoding more robust in Python. split() can do the trick:

raw = b'\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00'

string = raw.decode('ascii').split('\0')[0]
print('Decoded and split: [{}] {}'.format(string, len(string)))

The device name is correct then:

Decoded and split: [New name] 8

Another solution would be to set all trailing bytes to 0 in the C++ program. The exact way to do this depends on your actual code. For the code above, a solution could be to use strncpy, keeping the last byte of the buffer untouched as a guaranteed terminator (strncpy does not null-terminate when the source is too long):

const char* new_name = "New name";
std::strncpy(device_name.data(), new_name, device_name.size() - 1);

The output becomes:

\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

strncpy has two advantages over strcpy. First, if new_name doesn't fit in the buffer, there is no buffer overrun. Second, if new_name is smaller than the size of the buffer, all trailing bytes are set to 0, as stated in the documentation:

If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.

All in all, the best solution is probably to make both the server and the client more robust 😅

EDIT from September 1st 2021

Almost one year later, I am running into another bug related to this topic. Because my actual C++ code is not like the code shown above, I am relying only on the Python code to solve this issue of intermediate \0.

But there is another potential issue: the bytes after the first \0 may not be valid ASCII characters and decode('ascii') will fail. Here is an example:

name = bytearray(b'This a long name\x00+\xed\xb8')
decoded = name.decode('ascii').split('\0')[0]

The error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 18: ordinal not in range(128)

I have to split the buffer before decoding. Here is a possible solution:

name = bytearray(b'This a long name\x00+\xed\xb8')

index = name.index(0)
decoded = name[0:index].decode('ascii')
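A variant I would consider (my own suggestion, not part of the original code): bytes.split also works on the raw buffer, and unlike index(0) it does not raise ValueError when the buffer contains no NUL at all:

```python
name = bytearray(b'This a long name\x00+\xed\xb8')

# Split on the raw bytes first, then decode only the part before the NUL.
# If the buffer contains no NUL, split() simply returns the whole buffer,
# whereas name.index(0) would raise ValueError.
decoded = name.split(b'\x00')[0].decode('ascii')
print(decoded)  # This a long name
```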
