Basti Ortiz

Posted on Aug 22, 2021 • Edited on Sep 2, 2021

Optimizing Immutable Strings in Rust

#rust #performance #tips

An Aversion to Clones and Heap Allocations

Rust, Go, and C++ programmers alike have a particular aversion to heap allocations and cloned data. As much as possible, variables are allocated locally (in the stack) so that there is no need for extra dynamic memory.¹ In other words, we often go above and beyond to ensure that we only allocate when absolutely necessary.

In fact, C++ makes further optimizations by utilizing compile-time language features that allow instructions to be evaluated and inlined beforehand. This results in a program that is no longer required to execute certain instructions during runtime. A notable example is the small-string optimization, where "small strings" are inlined to avoid dynamic memory allocations altogether.

Similarly, cloning data is often a last resort. Consider an image which can be represented as some arbitrary array of bytes. If one were to clone said image, this would certainly be expensive. This leads us to prefer reference semantics over value semantics. That is, we often pass objects by reference rather than by value if ownership cannot be transferred. Indeed, copying a pointer/reference is surely cheaper than copying an entire array of bytes.

Thus, there is one function in particular that Rustaceans admittedly hate to see the most:

// Suppose we have an image...
let image: Vec<u8> = get_image_bytes_from_somewhere();

// Every Rustacean's kryptonite...
let cloned = image.clone();

That one line of code alone sends shivers down the spine. Although there are some cases when cloning is truly necessary, it is nevertheless an unpleasant sight to behold.

In this article, we will discuss certain ways to optimize the way we allocate and clone data in immutable contexts. Throughout this article, we will be using the String type as a particular example, but please do keep in mind that the concepts are generally applicable elsewhere.

An Example in Multithreaded Contexts

Suppose we find ourselves in a situation where we have to share an immutable string across multiple threads as part of a runtime configuration. For instance, perhaps this may be a file name, a username, or an argument flag. A ~~painfully~~ naive way to accomplish this is to simply clone the string for each thread.

use std::thread::{self, JoinHandle};

// Runtime configuration from somewhere...
let username: String = get_username();

// Spawn ten threads
let handles: Vec<_> = (0..10)
    .map(|id| {
        // NOTE: `username` is captured by reference here.
        // That is, `String::clone(&username)`. Observe that
        // obtaining a reference is sufficient for cloning
        // the **entire** string.
        let cloned = username.clone();

        // Here, we move the owned clone of the `username`.
        thread::spawn(move || {
            println!("Hello {} from Thread #{}!", cloned, id);
        })
    })
    .collect();

// Join all the threads
handles.into_iter()
    .try_for_each(JoinHandle::join)
    .expect("failed to join all threads");

Though, we can definitely do better than this. Instead of cloning the String and its contents, why not just use a std::sync::Arc instead? That way, we're only cloning the smart pointer to the String rather than the String contents itself. Indeed, a similar solution is given in the Rust Book's chapter on "Fearless Concurrency".

use std::{
    sync::Arc,
    thread::{self, JoinHandle},
};

// Observe that we now wrap the string in an `Arc`.
let username: Arc<String> = Arc::new(get_username());

// Spawn ten threads again
let handles: Vec<_> = (0..10)
    .map(|id| {
        // Here, we are explicitly cloning the smart pointer,
        // not the `String` itself. Note that it is possible to
        // write `username.clone()` instead, but that may be ambiguous
        // to readers of this code. It is best to explicitly
        // invoke the `Clone` implementation for `Arc`.
        let cloned = Arc::clone(&username);

        // We now move the cloned smart pointer to the thread.
        thread::spawn(move || {
            println!("Hello {} from Thread #{}!", cloned, id);
        })
    })
    .collect();

// Join all the threads again
handles.into_iter()
    .try_for_each(JoinHandle::join)
    .expect("failed to join all threads");

Some Subtle Inefficiencies

This improvement is much better, but there seems to be something strange about an Arc<String>. Since Arc is a smart pointer, we know that it implements std::ops::Deref. In this case, the Deref::Target is String. But, we know that String also implements Deref, where Deref::Target is the primitive type str. Therefore, in order to use the underlying str, we have to go through two levels of Deref indirection!

It must be noted, however, that the compiler may optimize the double indirection, but the same cannot be said about memory usage. Recall that all pointer types that hold Sized targets have the same size as a usize. In 64-bit targets, that is worth 8 bytes.

DISCLAIMER: From here on out, I will be assuming 64-bit targets when computing pointer sizes.

Since String is a Sized type, Arc<String> is actually just the size of a pointer. Indeed, at least in terms of size, the Arc smart pointer is a "zero-cost abstraction" over a regular pointer. But, let us not forget that Arc is actually a pointer to some heap allocation. In this case, it points to a String.

As of writing (Rust 1.54), we know that a String is essentially a Vec<u8>, where the bytes just happen to be valid UTF-8. Furthermore, a Vec<u8> is composed of two fields: an internal RawVec and the length of the stored data. A RawVec is just an internal wrapper over the capacity and the underlying pointer to the data buffer.

Therefore, a String is actually worth three pointers: an integer length (usize), an integer capacity (usize), and a data pointer (usize). In total, that is worth 24 bytes!

Putting it all together, an Arc<String> is an 8-byte smart pointer to a 24-byte heap-allocated String (of some capacity)—plus 8 bytes for the strong count (AtomicUsize) and 8 more bytes for the weak count (AtomicUsize)—which dereferences to a heap-allocated str (of some length <= capacity).

In other words, an Arc<String> is actually 8 bytes (on the stack) plus 40 bytes (on the heap) plus the underlying str allocation of capacity bytes (on another part of the heap). This is at least 48 bytes in total if we disregard the str!²

// The total memory used is at least `8 + 24 + 16 + 11 = 59` bytes!
// Note that I have disregarded alignment to keep the example simple.
use std::sync::Arc;
let text = Arc::new(String::from("Hello World"));

// Here, we dereference the variable to verify how much
// memory is used by each level of indirection.
use std::mem::size_of_val;
dbg!(size_of_val::<Arc<String>>(&text)); // 8
dbg!(size_of_val::<String>(&text));      // 24
dbg!(size_of_val::<str>(&text));         // 11

Taking Advantage of Immutability

Phew! That was a lot to take in, but I hope I have demonstrated that the double indirection and the Vec overhead is quite considerable for an immutable string. So... how do we fix this?

First and foremost, we have to address the double indirection. As demonstrated earlier, a pointer to a pointer to a str is not exactly the most efficient way to handle our use case.

The first key insight comes from the fact that we assumed that the str must be immutable. That is, we don't actually need the mutation methods provided by String. If we can somehow interact with the underlying str of the String instead, we can eliminate the double indirection entirely. Luckily, Arc allows us to do just that! Recall that Arc<str> implements From<String>.³ This conversion is sufficient for our use case.

use std::sync::Arc;

// Retrieve the string from somwhere...
let text: String = get_string_from_somwhere();

// Here, we "consume" the `String` wrapper
// as a smart pointer to a `str`. Alternatively,
// we may invoke the `Arc::from` syntax.
let owned_reference: Arc<_> = text.into();
todo!("spawn threads here");

And that's pretty much it! Now, when we clone the smart pointer across multiple threads, we're simply cloning the direct pointer to the underlying str—no more double indirection!

Analyzing the memory usage, we find that Arc<str> is worth 16 bytes. But wait... wasn't Arc<String> just 8 bytes? Yes, it was! This is because String is Sized. As mentioned earlier, all pointers to Sized types are worth 8 bytes (a single pointer).

Meanwhile, recall that a pointer to a dynamically sized type (such as a str) is actually a wide pointer. That is, they contain much more information than just a plain pointer. In this case, a &str, Box<str>, Rc<str>, Arc<str>, or any other pointer-like wrapper to a str is worth 16 bytes because they internally contain a direct pointer to the str allocation (usize) and its length (usize).

Yes, this does mean that subsequent clones of the wide pointer is now twice as large as those of a regular pointer, but it does allow us to work around the double indirection issue. Consequently, the wide pointer is all we need to refer to the str—no extra heap allocations required! Another neat side effect is that excess capacity is also removed, thus enabling more efficient memory utilization.

In summary, the Arc<str> is only worth 16 bytes on the stack. Meanwhile, on the heap, there are 8 bytes for the strong count (AtomicUsize) and 8 bytes for the weak count (AtomicUsize) in addition to the length bytes of the underlying str allocation. This brings the total to 32 + length bytes. If compared to Arc<String> which is worth 48 bytes (alone) plus the length of the str plus the excess capacity (if any), these are significant savings if I do say so myself!

NOTE: It is important to emphasize that Arc requires some overhead due to atomics. In particular, if it weren't for the strong count and the weak count, 16 bytes may be shaved in both cases. That is, if we had used a Box instead, we would attain the same results minus 16 bytes, regardless of whether it was an Arc<str> or an Arc<String>.

Therefore, if mutability is not necessary, then perhaps an owned str reference may be a worthwhile optimization. The same also applies for [u8], Path, CStr, and other slice-like types.

Double Indirection Illustration — This is not an accurate figure, but it's meant to roughly illustrate the optimization. The overhead due to the strong count and the weak count have been omitted for brevity.

An Applied Example with Hash Maps

Switching gears, let us consider a different example where we maintain a table of usernames and coins (represented by integers). The table is essentially a mapping between an immutable username and the amount of coins in their wallet.

A straightforward implementation makes use of std::collections::HashMap. Instinctively, we first reach out to HashMap<String, u16> as the initial type signature. But, since the usernames are immutable, it is possible to remove the String abstraction altogether. In code:

use std::iter::repeat_with;

fn generate_username() -> (String, u16) {
    todo!("generate string-coins pairs somehow")
}

// This results in a `HashMap<Box<str>, u16>`.
// Observe the the key-type is a `Box<str>` instead
// of the typical `String` wrapper.
let table: HashMap<_, _> = repeat_with(generate_username)
    .map(|(name, coins)| (name.into_boxed_str(), coins))
    .collect();

In fact, at some point, the Rust compiler used this optimization to efficiently keep track of a table of immutable symbols in code. Shown below is a code snippet where this occurs:

pub struct Interner {
    names: HashMap<Box<str>, Symbol>,
    strings: Vec<Box<str>>,
}

Conclusion

Immutability is a wonderful thing! It enables a multitude of valuable optimizations and assumptions about the code we write. In the first place, immutability is the only reason why it was possible to "dissolve" the String wrapper in the examples above. Really, the String wrapper only exists to provide an interface for "resizable" strings—that is, via String::push.

Otherwise, when mutability is not necessary, the String wrapper frankly provides nothing more than accidental overhead. For the relatively cheap cost of a mandatory wide pointer, the immutable Box<str> gains access to the following optimizations:

Direct ownership over the str primitive.
No more double indirection.
No more excess capacity.
No more extra heap allocations.
Lower memory usage overall.

The same can be said about Vec<T> (for some type T), which appropriately wraps a slice [T]. In fact, there exists a plethora of such wrapper types: namely those for Path, CStr, and many other slice-like types.

Perhaps the most valuable lesson to be learned here is to know your types. Sure, Rust claims "zero-cost abstractions", but this does not mean we should keep our guard down from possible ways to further optimize our programs.

As illustrated earlier, a closer look into the abstractions may reveal its unnecessary cost. Thus, it is paramount to keenly recognize and properly assess the constraints of our use cases. In doing so, we may even inadvertently stumble upon aggressive optimizations that let us make the most out of our computer resources.

Dynamic memory is inherently less predictable since its behavior can only be determined at runtime. An example would be user input. There is no way for the compiler to predict and optimize for the plethora of ways a user may input a string. The best a program can do is to allocate memory dynamically (based on the potential size of the input) and to resize accordingly. Such behavior cannot be done and optimized at compile-time. Though, it is possible to allocate a "fixed budget" for the string, which may allow for some optimizations. Generally speaking, however, this is not the most flexible option. ↩
Just to be extra sure, I verified this with Compiler Explorer using rustc 1.54. The compiler flags were --edition 2018 and -O (for optimization). Without loss of generality, I hard-coded the 11-byte string "Hello World". Indeed, the produced assembly allocates 11 bytes for the str, 24 bytes for the String, and 8 bytes for the Arc<String> smart pointer. @jakubdabek also provided a Godbolt example in the comments section. ↩
Technically, the From<String> implementation just dereferences the string and copies the underlying str into the Arc allocation. Therefore, the real magic happens at the From<&str> implementation. ↩

Top comments (3)

Jakub Dąbek • Sep 2 '21

How did you get to the conclusion that there are only as many bytes allocated by Arc<T> as T takes? That would make it have the same functionality as Box<T>. Just to be extra extra sure, I replicated the code shown in the article (working godbolt here), and I see 2 allocations: 11 bytes for the String data and 40 bytes for the Arc<String>, which contains the 24 for String and 2 AtomicUsizes for strong and weak count.

Interestingly, for Arc<str> there are 32 bytes allocated, because the allocation code aligns to 8 (16 for strong and weak counts, 11 for "Hello World" rounded up to 16).

Apropos note 3., while the From<&str> is interesting, the realest magic happens in copy_from_slice 😉

Other points from the article stand, as there's memory savings to be had and there's rarely any reason to keep immutable strings in String (although there are some niche uses for double indirection e.g. ArcSwap)

Basti Ortiz • Sep 2 '21

How did you get to the conclusion that there are only as many bytes allocated by Arc as T takes?

Ah, that is my mistake. I have totally missed that! My original conception of the article came from one of my projects when I realized that I could use a Box<str> over a String. Then, as I wrote this article, I wanted to apply the same optimization over multi-threaded contexts via Arc. It had not occurred to me (until you mentioned it) that Arc actually has some overhead due to atomics, which Box does not have. I had falsely assumed otherwise.

This led to my incorrect conclusion that Arc allocates as Box would. So to make a long story short, I had unintentionally disregarded the Arc overhead, thinking that it was a Box.

Thanks for pointing this out! I shall edit the article accordingly.

Diretnan Domnan • Aug 28 '21

Eye opening. Thanks a lot

DEV Community

Optimizing Immutable Strings in Rust

An Aversion to Clones and Heap Allocations

An Example in Multithreaded Contexts

Some Subtle Inefficiencies

Taking Advantage of Immutability

An Applied Example with Hash Maps

Conclusion

Top comments (3)

Read next

SQL Performance: A Deep Dive into Indexing with examples

How to Run Multiple Tokio Runtimes in a Rust Application

Easily automate Rust web service deployments on AWS without DevOps

What are Closures in Rust?