If you use a buffer of bytes to share strings between C (or C++) and Python, you may end up with weird results, because the two languages do not represent strings the same way.
I got bitten recently and here is what I learnt.
C strings
If you have used C to handle strings once in your lifetime, then you have run into one of the most painful things in C: null-terminated strings 😑
A string is just a pointer in C: at the given address you will find bytes that are interpreted as characters. How does the string end? Simply when a special character is found: '\0' (its numerical value being 0). Forget this character and your string ends... somewhere (if the program doesn't crash before).
These C strings are available in C++ too. Even if you would probably prefer to use std::string in C++, there are valid reasons to stick to C strings. The most common reason is that you cannot use dynamic memory allocation, which is almost always the case on embedded systems.
Python strings
Python is implemented in C. To be accurate, CPython, the reference implementation and the one almost everybody uses, is written in C. So you may think that Python strings are C strings.
They are not.
Take a look at stringobject.h from Python's SVN. Line 13, this comment says it all:
Type PyStringObject represents a character string. An extra zero byte is reserved at the end to ensure it is zero-terminated, but a size is present so strings with null bytes in them can be represented.
String representation in Python is different and more complex than in C.
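To see the difference from the Python side, here is a minimal sketch (my own illustration, not taken from the CPython sources):

# A Python str stores its length explicitly, so it can contain '\0'
# anywhere without ending there
s = 'New name\0nal name'
print(len(s))      # 17
print('\0' in s)   # True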
Why is it a problem when sharing a buffer of bytes?
I have an IoT device with Bluetooth Low Energy (BLE) connectivity. Bluetooth Low Energy devices transfer data back and forth using concepts called "services" and "characteristics". One of my characteristics is a buffer to share a string: 20 bytes with the name of my device.
The firmware of this device is written in C++ but because dynamic memory allocation is disabled, C strings are used. One string of this kind is hence written to the buffer.
My computer runs a Python program that connects to the device. It reads the 20 bytes and decodes them as a string.
Because of the different string representations between C(++) and Python, I ran into funny bugs with garbage characters in Python.
Code to illustrate the issue
Here is some C++ code that reproduces the issue:
#include <array>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>

using Characteristic = std::array<char, 20>;

int main() {
    Characteristic device_name{};

    // Write the original name
    const char* original_name = "The original name";
    std::strcpy(device_name.data(), original_name);
    std::cout << device_name.data() << ' ' << std::strlen(device_name.data()) << '\n';

    // Write the new name
    // (this doesn't replace remaining characters with 0s in the buffer)
    const char* new_name = "New name";
    std::strcpy(device_name.data(), new_name);
    std::cout << device_name.data() << ' ' << std::strlen(device_name.data()) << '\n';

    // Mimic serialization over BLE
    // (in fact, this prints a byte literal for the Python code below)
    for (auto c : device_name) {
        // Cast through unsigned char so bytes above 0x7f don't print as ffffffxx
        std::cout << "\\x" << std::setfill('0') << std::setw(2) << std::hex
                  << static_cast<std::uint32_t>(static_cast<unsigned char>(c));
    }
}
NOTE: this code is bad! strcpy will cause a buffer overrun if a name is longer than 19 characters, because the terminating '\0' also needs a byte in the 20-byte buffer.
The output is:
The original name 17
New name 8
\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00
As you can see, there is an intermediate byte with the value 0 in the buffer.
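To double-check where those null bytes sit, a small Python sketch (not part of the actual firmware or client code) can list their positions:

raw = b'\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00'
# The first null byte is at index 8, right after "New name";
# the leftovers of "The original name" follow, then the final padding
print([i for i, b in enumerate(raw) if b == 0])  # [8, 17, 18, 19]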
Let's take those bytes and decode them with Python:
# Mimic deserialization over BLE
# (just take what the C++ code prints)
raw = b'\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00'
assert len(raw) == 20
# Get an 'str' object
string = raw.decode('ascii')
print('Decoded: [{}] {}'.format(string, len(string)))
The output is:
Decoded: [New name nal name ] 20
Oops! Python doesn't stop on the first '\0', so the device name is wrong in my Python software 😳
Code to solve the issue
The easiest and fastest solution is to make the byte decoding more robust in Python. split() can do the trick:
raw = b'\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x6e\x61\x6c\x20\x6e\x61\x6d\x65\x00\x00\x00'
string = raw.decode('ascii').split('\0')[0]
print('Decoded and split: [{}] {}'.format(string, len(string)))
The device name is correct then:
Decoded and split: [New name] 8
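As far as I can tell, this client-side fix also behaves sensibly when the buffer contains no '\0' at all, for example when a 20-character name fills the whole buffer (my own quick check, not something the device sends):

# A name that fills all 20 bytes, with no terminating '\0'
raw_full = b'A twenty char name!!'
assert len(raw_full) == 20
# split('\0') returns the whole string when there is nothing to split on
print(raw_full.decode('ascii').split('\0')[0])  # A twenty char name!!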
Another solution would be to set all trailing bytes to 0 in the C++ program. The exact way to do this depends on your actual code. For the code above, a solution could be to use strncpy:
const char* new_name = "New name";
// Copy at most 19 bytes: the last byte of the zero-initialized buffer is never
// overwritten, so the string stays null-terminated even if the name is too long
std::strncpy(device_name.data(), new_name, device_name.size() - 1);
The output becomes:
\x4e\x65\x77\x20\x6e\x61\x6d\x65\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
strncpy has two advantages over strcpy. First, if new_name doesn't fit in the buffer, there is no buffer overrun. Second, if new_name is smaller than the size of the buffer, all trailing bytes are set to 0, as stated in the documentation:
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.
All in all, the best solution is maybe to make both server and client more robust 😅
EDIT from September 1st 2021
Almost one year later, I am running into another bug related to this topic. Because my actual C++ code is not like the code shown above, I rely only on the Python code to solve this issue of intermediate \0 bytes.
But there is another potential issue: the bytes after the first \0 may not be valid ASCII characters, so decode('ascii') will fail. Here is an example:
name = bytearray(b'This a long name\x00+\xed\xb8')
decoded = name.decode('ascii').split('\0')[0]
The error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 18: ordinal not in range(128)
I have to split the buffer before decoding. Here is a possible solution:
name = bytearray(b'This a long name\x00+\xed\xb8')
# Find the first null byte and only decode the bytes before it
index = name.index(0)
decoded = name[0:index].decode('ascii')
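A possibly more defensive variant (an assumption on my side, not something the device requires) is bytearray.partition, which also copes with a buffer that contains no null byte at all, whereas index(0) would raise a ValueError in that case:

name = bytearray(b'This a long name\x00+\xed\xb8')
# Take everything before the first null byte; if there is none,
# partition() simply returns the whole buffer as the first element
decoded = name.partition(b'\x00')[0].decode('ascii')
print(decoded)  # This a long name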