Konstantin Grechishchev

Posted on Jul 30, 2022

7 ways to pass a string between 🦀 Rust and C

#rust #tutorial #ffi #memory

Interoperability with C is one of the most incredible things in Rust. The ability to call safe Rust code from C and to use well-known libraries with a C interface from Rust is a crucial reason for the fast adoption of Rust across the industry. It also allows us to distribute code more widely: by implementing a C interface for a rust crate, we make it usable from software written in any language capable of calling C.

Writing an FFI interface is also quite confusing and hard to get right on the first attempt. How do we deal with all these into_raw and as_ptr methods without leaking memory or introducing a security vulnerability? It scares people as well: using the unsafe keyword is nearly inevitable.

I'll try to shed some light on the memory aspects of FFI interfaces and provide a few hopefully useful patterns I've used in my projects.

NOTE: I will be using strings as an example here; however, the techniques described also apply to transferring byte arrays or pointers to structs stored on the heap in Box or Arc types.

Things to know before we jump into coding

I would like to talk about a few essential rules before we implement our first FFI function. It is important to keep them in mind during the design, as missing one of them would likely lead to a bug in the form of a crash or a memory leak.

Rule #1: One pointer - one allocator

You might think that allocating memory is nothing but a call to some magical operating system API. In reality, getting a chunk of memory to write your buffer to is a sophisticated (and expensive!) operation. Compiler and library developers apply all sorts of optimizations to it (like requesting bigger chunks of memory to avoid frequent calls to the operating system), and they implement them differently!

You should make no assumptions about the memory allocator used by the caller of your library. They don't have to use malloc and free from libc! In other words, memory allocated by rust code should be released by rust code, and pointers obtained over the FFI boundary should be returned back to be released. If you've allocated memory using malloc, do not try to convert it to a Box and drop it! Do not try calling free on a pointer obtained by calling Box::into_raw()!

Rule #2: Think about the ownership

Rust is a memory-safe language and it is explicit about ownership. When you see Box<dyn Any> in your code, you know that the memory storing the Any is released as soon as you drop the Box. In contrast, when you see void*, you cannot tell right away whether you should call free on it, whether someone else will do it, or whether it is even required (the pointer may point to the stack).

Rust has a naming convention for the methods that convert a struct to a raw pointer. Standard library structs like Box, Arc, CStr and CString provide as_ptr and the pair of into_raw and from_raw methods. Not every struct provides all three of them, which makes them even more confusing. They are so crucial that it is worth spending some time discussing them here.

Let's look into CString, as it has all three of the above. Both the as_ptr and into_raw methods give you a pointer to the same data. However, like the void* mentioned above, these pointers differ in terms of ownership.

The as_ptr method takes self by reference (&self). It means that the CString instance still exists after as_ptr returns and keeps the ownership of the data. In other words, the returned pointer points to data still owned by the CString instance. Dropping the instance leaves the pointer dangling (pointing to freed memory). You should never use this pointer after the CString instance is dropped. In safe rust, this property is represented by the lifetime of a reference (the analog of a pointer) and checked by the compiler, but with a raw pointer, all bets are off.

Unlike as_ptr, the into_raw method accepts self by value and consumes it. Wait, would consuming it not release the memory? It turns out that into_raw does not run the destructor! It creates an owning pointer and "leaks" the block of memory provided by the rust allocator out of the rust compiler's control. You introduce a memory leak if you simply discard this pointer without calling the from_raw method on it. However, it never dangles (unless you change it, or copy it before calling from_raw and keep using the copy afterwards).

You should use as_ptr if you would like to let C temporarily "borrow" the rust memory. This has a huge advantage, because the C code does not need to worry about releasing it, but it also limits the pointer's lifetime. It is probably a bad idea to save this pointer in some global struct or pass it to another thread. Returning such a pointer as the result of a function call is likely a bad idea as well!
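To make the "borrow" idea more tangible, here is a minimal sketch of how the borrow could be scoped on the rust side. The names with_c_string and c_strlen_like are made up for this illustration and are not part of the example project; c_strlen_like merely stands in for a C function that reads the string while it is being called.

```rust
use std::ffi::{CStr, CString, NulError};
use std::os::raw::c_char;

/// Hypothetical helper: lends `s` to `f` as a C string for the duration of the call only.
pub fn with_c_string<R>(s: &str, f: impl FnOnce(*const c_char) -> R) -> Result<R, NulError> {
    // The owner lives in this stack frame for the whole call to `f`.
    // CString::new fails if `s` contains a zero byte in the middle.
    let owner = CString::new(s)?;
    let result = f(owner.as_ptr());
    // `owner` is dropped when this function returns; the pointer is dangling after that.
    Ok(result)
}

/// Stand-in for a C function that only reads the string while it is being called.
///
/// # Safety
/// `ptr` must point to a valid zero-terminated string.
pub unsafe extern "C" fn c_strlen_like(ptr: *const c_char) -> usize {
    unsafe { CStr::from_ptr(ptr) }.to_bytes().len()
}
```

Calling with_c_string("hello", |p| unsafe { c_strlen_like(p) }) keeps the CString alive exactly as long as the callee needs it. The pitfall to avoid is calling as_ptr on a temporary, as in CString::new(s).unwrap().as_ptr(): the temporary is dropped at the end of that statement and the pointer dangles immediately.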

The into_raw method moves the ownership of the data to C. It gives a lot of freedom to keep the pointer around for as long as the code needs, but it is important to make sure that it is transferred back to rust to be released at some point!
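The ownership round trip can be sketched in pure rust as follows. The function names give_away and take_back are illustrative assumptions, not something from the example project, and the sketch panics on bad input where a real API should probably report an error.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Hands an owning pointer to the caller (illustrative name).
pub fn give_away(text: &str) -> *mut c_char {
    // Assumes `text` has no interior zero byte; a real API should report the error instead.
    CString::new(text).expect("interior zero byte").into_raw()
}

/// Takes the ownership back. The buffer is released by the rust allocator
/// when the returned String is dropped.
///
/// # Safety
/// `ptr` must come from `give_away` and must not be used after this call.
pub unsafe fn take_back(ptr: *mut c_char) -> String {
    let owner = unsafe { CString::from_raw(ptr) };
    owner.into_string().expect("valid UTF-8")
}
```

Between the two calls the pointer can be stored anywhere on the C side. Forgetting take_back leaks the buffer, and calling it twice on the same pointer is a double free.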

Memory representation of the String

Unfortunately, rust and C represent strings differently. A C string is usually a char* pointer to an array of char terminated by \0. Rust instead stores an array of UTF-8 bytes together with its length, with no terminator.

For this reason, you should not convert the rust String and str types directly to raw pointers and back. Use the intermediate CString and CStr types instead. Usually, CString is used to pass a rust string to C code, and CStr is used to convert a C string to a rust &str. Note that this conversion does not always copy the underlying data. For example, the &str obtained from a CStr keeps pointing to the C-allocated array, and its lifetime is bound to the lifetime of the pointer.

NOTE: CString::new called with a &str copies the data, but CStr::from_ptr does not!
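A small sketch can make the difference in representation visible. The helper names lengths and borrows_without_copy are invented for this illustration; they only use standard library calls.

```rust
use std::ffi::{CStr, CString};

/// Returns (rust length in bytes, C buffer length including the trailing zero byte).
pub fn lengths(text: &str) -> (usize, usize) {
    let c_string = CString::new(text).expect("interior zero byte");
    (text.len(), c_string.as_bytes_with_nul().len())
}

/// Checks that viewing a C buffer through `CStr` and `&str` does not copy it.
pub fn borrows_without_copy(c_string: &CString) -> bool {
    let ptr = c_string.as_ptr();
    // `CStr::from_ptr` only scans for the zero byte; it does not allocate.
    let view: &CStr = unsafe { CStr::from_ptr(ptr) };
    let as_str: &str = view.to_str().expect("valid UTF-8");
    // Same address means the &str still points into the original buffer.
    as_str.as_ptr() == ptr.cast::<u8>()
}
```

The C buffer is always one byte longer than the rust string because of the terminator. CString::new returns an error for input like "a\0b", since a zero byte in the middle would silently truncate the string on the C side.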

Project setup

How to link rust and C together

There are plenty of materials online devoted to building C code and linking it to a rust crate using the build.rs file; however, I found significantly fewer articles about adding rust code to a C project. In contrast, I would like to implement the main function of my example project in C and use CMake as the build system. I would like the CMake project to use the rust crate as a library and generate C header files based on the rust code. The source code of the complete project is on GitHub.

Running Cargo from CMake

I started by generating a simple CMake 3 console application project.

The first thing we need to do is to define the command to build the rust library and the location of the rust artifacts:

if (CMAKE_BUILD_TYPE STREQUAL "Debug")
    set(CARGO_CMD RUSTFLAGS=-Zsanitizer=address cargo build -Zbuild-std --target x86_64-unknown-linux-gnu)
    set(TARGET_DIR "x86_64-unknown-linux-gnu/debug")
else ()
    set(CARGO_CMD cargo build --release)
    set(TARGET_DIR "release")
endif ()
SET(LIB_FILE "${CMAKE_CURRENT_BINARY_DIR}/${TARGET_DIR}/librust_lib.a")

The command to build the debug version of the crate probably looks a bit awkward to people familiar with rust. It could very well be replaced with just cargo build; however, I would like to make use of the unstable rust address sanitizer feature (the -Z flags require a nightly toolchain) to check for memory leaks.

Second, we need to define a custom command and a custom target that depends on the command output. We can then define a static imported library called rust_lib and make it depend on that target:

add_custom_command(OUTPUT ${LIB_FILE}
        COMMENT "Compiling rust module"
        COMMAND CARGO_TARGET_DIR=${CMAKE_CURRENT_BINARY_DIR} ${CARGO_CMD}
        WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/rust_lib)

add_custom_target(rust_lib_target DEPENDS ${LIB_FILE})
add_library(rust_lib STATIC IMPORTED GLOBAL)
add_dependencies(rust_lib rust_lib_target)

Finally, we can link our binary together with a rust library (and other system libraries required). We also enable address sanitizer for C code:

target_compile_options(rust_c_interop PRIVATE -fno-omit-frame-pointer -fsanitize=address)
target_link_libraries(rust_c_interop PRIVATE Threads::Threads rust_lib ${CMAKE_DL_LIBS} -fno-omit-frame-pointer -fsanitize=address)

Running the CMake build will now build the rust crate automatically and link with it. However, we have no way to call rust methods from C yet.

Generating C headers and adding them to the CMake project

The easiest way to obtain the headers for the rust code is to use the cbindgen library.

We can then add the following code to the build.rs file of our crate to detect all extern "C" functions defined in rust and generate declarations for them in a header file under the include/ directory:

use std::env;
use std::path::PathBuf;

fn main() {
    let crate_dir = env::var("CARGO_MANIFEST_DIR").unwrap();
    let package_name = env::var("CARGO_PKG_NAME").unwrap();
    // Write the header to <crate>/include/<package name>.h
    let output_file = PathBuf::from(&crate_dir)
        .join("include")
        .join(format!("{}.h", package_name));

    // Scan the crate and generate the C header
    cbindgen::generate(&crate_dir)
        .unwrap()
        .write_to_file(output_file);
}

We should also create the cbindgen.toml file in the root folder of the rust crate and add the language = "C" line into it.

The only thing left to do is to ask CMake to look for the headers in the include folder of the rust crate:

SET(LIB_HEADER_FOLDER "${CMAKE_CURRENT_SOURCE_DIR}/rust_lib/include")
set_target_properties(rust_lib
        PROPERTIES
        IMPORTED_LOCATION ${LIB_FILE}
        INTERFACE_INCLUDE_DIRECTORIES ${LIB_HEADER_FOLDER})

5 ways to pass the Rust string to C

Finally, we are all set. Imagine we now want to get some string data from rust and use it in C (just to print it to the console). How can we do it safely and without leaking memory?

Option #1: Provide create and delete methods

Use this method when we don't know how long the C code would like to be able to access the string. A good indication of this is that we would like to return a pointer out of the rust method. We transfer the ownership to C by constructing a CString object and converting it to a pointer using into_raw. The free method just needs to reconstruct the CString and drop it to release the memory:

#[no_mangle]
pub extern fn create_string() -> *const c_char {
    let c_string = CString::new(STRING).expect("CString::new failed");
    c_string.into_raw() // Move ownership to C
}

/// # Safety
/// The ptr should be a valid pointer to the string allocated by rust
#[no_mangle]
pub unsafe extern fn free_string(ptr: *const c_char) {
    // Take the ownership back to rust and drop the owner
    let _ = CString::from_raw(ptr as *mut _);
}

It is crucial to call free_string at some point to avoid the leak:

const char* rust_string = create_string();
printf("1. Printed from C: %s\n", rust_string);
free_string(rust_string);

Do not call the libc free method on the string and do not try to modify the content pointed to by such a pointer!

The above works well, but what if we want to unload the rust library while we are still using the memory, or want to release it in a part of the code which is not aware of the rust library? We should be able to achieve this by using one of the 3 options below.

Option #2: Allocate the buffer and copy the data

Remember rule #1? If we would like to release the memory in C using, let's say, the free method, we should allocate it using malloc. But how would rust know about malloc? One solution is to "ask" rust how much memory it needs and then allocate a buffer for it:

size_t len = get_string_len();
char *buffer = malloc(len);
copy_string(buffer);
printf("2. Printed from C: %s\n", buffer);
free(buffer);

Rust just needs to tell the right buffer size and carefully copy the rust string into it (without missing the 0 byte!):

#[no_mangle]
pub extern fn get_string_len() -> usize {
    STRING.as_bytes().len() + 1
}

/// # Safety
/// The ptr should be a valid pointer to the buffer of required size
#[no_mangle]
pub unsafe extern fn copy_string(ptr: *mut c_char) {
    let bytes = STRING.as_bytes();
    let len = bytes.len();
    // Copy the string bytes into the buffer provided by the caller
    std::ptr::copy(bytes.as_ptr().cast(), ptr, len);
    // Append the terminating zero byte expected by C
    std::ptr::write(ptr.add(len), 0);
}

This is great, as we don't have to implement free_string and can just use free instead. Another great advantage is that the C code is allowed to modify the buffer as it wants (that is why we use *mut c_char and not *const c_char).

The problem is that we still need to implement the additional method get_string_len, and the solution still allocates a new block of memory and copies the data (but CString::new also does that).

We can also use this method if we would like to copy the rust string into a buffer allocated on the stack of the C function, but we should ensure it has enough space!
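One way to make the "enough space" requirement explicit is to let the caller pass the buffer capacity. The sketch below is an assumption of mine rather than code from the example project: copy_string_checked is a hypothetical variant of copy_string, and STRING is a stand-in for the constant used in the examples above. Add #[no_mangle] when exporting it from a real crate.

```rust
use std::os::raw::c_char;

// Stand-in for the STRING constant used in the examples above.
const STRING: &str = "Hello from rust";

/// Copies STRING plus the zero byte into `buf` only if `capacity` is large enough.
/// Always returns the number of bytes required (including the zero byte), so the
/// caller can pass a capacity of 0 first to learn the size.
///
/// # Safety
/// `buf` must be valid for writes of `capacity` bytes (it may be null if `capacity` is 0).
pub unsafe extern "C" fn copy_string_checked(buf: *mut c_char, capacity: usize) -> usize {
    let bytes = STRING.as_bytes();
    let required = bytes.len() + 1;
    if !buf.is_null() && capacity >= required {
        unsafe {
            // The static string and the caller's writable buffer cannot overlap.
            std::ptr::copy_nonoverlapping(bytes.as_ptr().cast::<c_char>(), buf, bytes.len());
            buf.add(bytes.len()).write(0);
        }
    }
    required
}
```

On the C side, a caller with a fixed stack buffer could compare the returned value with sizeof(buffer) and treat a larger value as "nothing was written", which also removes the need for a separate get_string_len call.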

Option #3: Pass the memory allocator method to rust

Ok, we remember rule #1, but can we avoid the get_string_len method and find some way to allocate the memory from rust instead? It turns out we can! One way is to simply pass the function that allocates memory to rust:

type Allocator = unsafe extern fn(usize) -> *mut c_void;

/// # Safety
/// The allocator function should return a pointer to a valid buffer
#[no_mangle]
pub unsafe extern fn get_string_with_allocator(allocator: Allocator) -> *mut c_char {
    let ptr: *mut c_char = allocator(get_string_len()).cast();
    copy_string(ptr);
    ptr
}

Here and below we use copy_string from the example above. We can now use get_string_with_allocator as follows:

char* rust_string_3 = get_string_with_allocator(malloc);
printf("3. Printed from C: %s\n", rust_string_3);
free(rust_string_3);

The solution is similar to option #2 and has the same pros and cons.

But we now have to pass an additional allocator parameter. We can probably optimize it a bit and, instead of passing it to every function, register it in some global variable.
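Registering the allocator in a global variable could look roughly like the sketch below. Everything named here (register_allocator, get_string_with_registered_allocator, the ALLOCATOR static) is hypothetical and not part of the example project, and STRING is again a stand-in constant. I use std::sync::OnceLock so that the allocator can be set only once; add #[no_mangle] when exporting the functions.

```rust
use std::os::raw::{c_char, c_void};
use std::sync::OnceLock;

type Allocator = unsafe extern "C" fn(usize) -> *mut c_void;

// Stand-in for the STRING constant used in the examples above.
const STRING: &str = "Hello from rust";

// Set once by the C side; function pointers are Send + Sync, so a OnceLock is enough.
static ALLOCATOR: OnceLock<Allocator> = OnceLock::new();

/// Hypothetical registration call; returns false if an allocator was already registered.
///
/// # Safety
/// `allocator` must return either null or a buffer of at least the requested size.
pub unsafe extern "C" fn register_allocator(allocator: Allocator) -> bool {
    ALLOCATOR.set(allocator).is_ok()
}

/// Returns a copy of STRING in memory obtained from the registered allocator,
/// or null if no allocator is registered or the allocation fails.
pub extern "C" fn get_string_with_registered_allocator() -> *mut c_char {
    let Some(allocator) = ALLOCATOR.get().copied() else {
        return std::ptr::null_mut();
    };
    let bytes = STRING.as_bytes();
    unsafe {
        let ptr: *mut c_char = allocator(bytes.len() + 1).cast();
        if ptr.is_null() {
            return ptr;
        }
        std::ptr::copy_nonoverlapping(bytes.as_ptr().cast::<c_char>(), ptr, bytes.len());
        ptr.add(bytes.len()).write(0);
        ptr
    }
}
```

The C code would call register_allocator(malloc) once at startup and release every returned string with free. The trade-off is hidden global state: the library now misbehaves (here, returns null) if the registration call is forgotten.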

Option #4: Call glibc from rust

Ok, what if we are sure that our C code uses only a given version of malloc/free to allocate memory? (Whether we can ever be sure about anything like that is out of the scope of this article.) Well, in this case we are brave enough to use the libc crate in our rust code:

#[no_mangle]
pub unsafe extern fn get_string_with_malloc() -> *mut c_char {
    let ptr: *mut c_char = libc::malloc(get_string_len()).cast();
    copy_string(ptr);
    ptr
}

The C code stays pretty much the same:

char* rust_string_4 = get_string_with_malloc();
printf("4. Printed from C: %s\n", rust_string_4);
free(rust_string_4);

This way we don't need to provide the allocator method, but we've significantly limited the C code as well. We had better document it very well and avoid using this option unless we are 100% sure it is safe!

Option #5: Borrow the rust string

So far we were always passing the ownership of the data to C. But what if we don't need to do it? An example of this situation is when the rust code needs to call some synchronous C method and pass some data to it. The as_ptr method of CString would help us:

type Callback = unsafe extern fn(*const c_char);

#[no_mangle]
pub unsafe extern fn get_string_in_callback(callback: Callback) {
    let c_string = CString::new(STRING).expect("CString::new failed");
    // as_ptr() keeps ownership in rust unlike into_raw()
    callback(c_string.as_ptr())
}

Unfortunately, even in this case CString::new will copy the data (as it needs to put the zero byte at the end).

The C code would look like this:

void callback(const char* string) {
    printf("5. Printed from C: %s\n", string);
}

int main() {
    get_string_in_callback(callback);
    return 0;
}

We should always prefer this way when we have a known lifetime of the C pointer as it guarantees the absence of memory leaks.

Two ways to pass the C string to Rust

Finally, I would like to talk about the reverse operation of converting a C string to rust types. There are 2 options available:

1. Convert it to &str without copying the data.
2. Copy the data and receive a String.

I'll use the same example for both, since they are very similar and, in fact, option #2 requires option #1 to be used first.

Here is our C code. We allocate the data on the heap, but we could also pass the pointer to the stack:

char *test = (char*) malloc(13*sizeof(char));
strcpy(test, "Hello from C");
print_c_string(test);
free(test);

The rust implementation looks as follows:

/// # Safety
/// The ptr should be a valid pointer to a zero-terminated C string
#[no_mangle]
pub unsafe extern fn print_c_string(ptr: *const c_char) {
    let c_str = CStr::from_ptr(ptr);
    let rust_str = c_str.to_str().expect("Bad encoding");
    // Calling libc::free(ptr as *mut _) here would cause a use-after-free below:
    // rust_str still points to the C buffer
    println!("1. Printed from rust: {}", rust_str);
    let owned = rust_str.to_owned();
    // Calling libc::free(ptr as *mut _) here would not invalidate `owned`: it holds its own copy
    println!("2. Printed from rust: {}", owned);
}

Note that we use CStr here instead of CString. Do not try calling the CString::from_raw method on a pointer not created by CString::into_raw!

It is also important to note here that the &str reference borrows from the c_str object, which in turn points into the memory allocated by C; it does not have a 'static lifetime in any meaningful sense. CStr::from_ptr returns a reference whose lifetime is inferred from its usage, so the compiler cannot fully check this for us. We should avoid returning the &str out of the method or moving it to a global variable or another thread, because the reference becomes invalid as soon as the C code frees the memory.

If you need to keep the ownership of the data in rust for a long time, simply call to_owned() to obtain a copy of the string! If we would like to avoid copying, we are free to carry the CStr around, but we should make sure that the C code does not free the memory while we are using it!
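For the copying option, a slightly more defensive conversion might look like the sketch below. The function name c_string_to_owned is my own, and replacing invalid UTF-8 with U+FFFD instead of panicking is a design choice of this sketch, not something the example project does.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Copies a C string into an owned rust String.
/// Returns None for a null pointer; invalid UTF-8 sequences are replaced with U+FFFD.
///
/// # Safety
/// A non-null `ptr` must point to a zero-terminated string that stays valid and
/// unmodified for the duration of this call.
pub unsafe fn c_string_to_owned(ptr: *const c_char) -> Option<String> {
    if ptr.is_null() {
        return None;
    }
    let c_str = unsafe { CStr::from_ptr(ptr) };
    // to_string_lossy borrows when the bytes are valid UTF-8; into_owned makes the copy.
    Some(c_str.to_string_lossy().into_owned())
}
```

Once the function returns, the String no longer depends on the C buffer, so the C code is free to release or reuse it.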

Summary

We've talked about rust and C interoperability today and considered different ways to pass the data across the FFI boundary. As mentioned above, the concepts could be applied to transfer other data types as well and to FFI bridges to other programming languages.

I really hope you've found the above tips beneficial and practical and welcome any questions and feedback.

Source code.
Please react to the article and star the repository if it was useful!

Top comments (8)

frag • Aug 14 '22

Very nice article! Thank you.
I have a question on the last use case, print_c_string.
What if C frees the string right before or during .to_owned() or .clone() in Rust?

Konstantin Grechishchev • Aug 14 '22

The result would be a use-after-free vulnerability, and the behavior in this case is considered undefined. If you are lucky, it would just crash.

However, it indicates a race condition in the C code. Consider this: there is some thread which has called print_c_string, and the control is passed to rust. This thread is now executing rust code.

So, when you say "C frees the string right before or during .to_owned() or .clone()", then you assume that there is another C thread that calls free on the pointer which is currently used by the thread calling print_c_string. At this point, rust/c FFI becomes irrelevant, as I could replace print_c_string with the usual c printf and the same problem will occur.

The key point is that we call .to_owned before print_c_string returns.

frag • Aug 17 '22

So no multithreading can be involved in to_owned() - print_c_string execution. That part must be sequential.
Unless one heavily documents the C API and claims that no free shall occur (and then eventually using the mechanism to free from Rust)