An Aversion to Clones and Heap Allocations
Rust, Go, and C++ programmers alike have a particular aversion to heap allocations and cloned data. As much as possible, variables are allocated locally (in the stack) so that there is no need for extra dynamic memory.1 In other words, we often go above and beyond to ensure that we only allocate when absolutely necessary.
In fact, C++ makes further optimizations by utilizing compile-time language features that allow instructions to be evaluated and inlined beforehand. This results in a program that is no longer required to execute certain instructions during runtime. A notable example is the small-string optimization, where "small strings" are inlined to avoid dynamic memory allocations altogether.
Similarly, cloning data is often a last resort. Consider an image which can be represented as some arbitrary array of bytes. If one were to clone said image, this would certainly be expensive. This leads us to prefer reference semantics over value semantics. That is, we often pass objects by reference rather than by value if ownership cannot be transferred. Indeed, copying a pointer/reference is surely cheaper than copying an entire array of bytes.
Thus, there is one function in particular that Rustaceans admittedly hate to see the most:
// Suppose we have an image...
let image: Vec<u8> = get_image_bytes_from_somewhere();
// Every Rustacean's kryptonite...
let cloned = image.clone();
That one line of code alone sends shivers down the spine. Although there are some cases when cloning is truly necessary, it is nevertheless an unpleasant sight to behold.
In this article, we will discuss certain ways to optimize the way we allocate and clone data in immutable contexts. Throughout this article, we will be using the String
type as a particular example, but please do keep in mind that the concepts are generally applicable elsewhere.
An Example in Multithreaded Contexts
Suppose we find ourselves in a situation where we have to share an immutable string across multiple threads as part of a runtime configuration. For instance, perhaps this may be a file name, a username, or an argument flag. A painfully naive way to accomplish this is to simply clone the string for each thread.
use std::thread::{self, JoinHandle};
// Runtime configuration from somewhere...
let username: String = get_username();
// Spawn ten threads
let handles: Vec<_> = (0..10)
.map(|id| {
// NOTE: `username` is captured by reference here.
// That is, `String::clone(&username)`. Observe that
// obtaining a reference is sufficient for cloning
// the **entire** string.
let cloned = username.clone();
// Here, we move the owned clone of the `username`.
thread::spawn(move || {
println!("Hello {} from Thread #{}!", cloned, id);
})
})
.collect();
// Join all the threads
handles.into_iter()
.try_for_each(JoinHandle::join)
.expect("failed to join all threads");
Though, we can definitely do better than this. Instead of cloning the String
and its contents, why not just use a std::sync::Arc
instead? That way, we're only cloning the smart pointer to the String
rather than the String
contents itself. Indeed, a similar solution is given in the Rust Book's chapter on "Fearless Concurrency".
use std::{
sync::Arc,
thread::{self, JoinHandle},
};
// Observe that we now wrap the string in an `Arc`.
let username: Arc<String> = Arc::new(get_username());
// Spawn ten threads again
let handles: Vec<_> = (0..10)
.map(|id| {
// Here, we are explicitly cloning the smart pointer,
// not the `String` itself. Note that it is possible to
// write `username.clone()` instead, but that may be ambiguous
// to readers of this code. It is best to explicitly
// invoke the `Clone` implementation for `Arc`.
let cloned = Arc::clone(&username);
// We now move the cloned smart pointer to the thread.
thread::spawn(move || {
println!("Hello {} from Thread #{}!", cloned, id);
})
})
.collect();
// Join all the threads again
handles.into_iter()
.try_for_each(JoinHandle::join)
.expect("failed to join all threads");
Some Subtle Inefficiencies
This improvement is much better, but there seems to be something strange about an Arc<String>
. Since Arc
is a smart pointer, we know that it implements std::ops::Deref
. In this case, the Deref::Target
is String
. But, we know that String
also implements Deref
, where Deref::Target
is the primitive type str
. Therefore, in order to use the underlying str
, we have to go through two levels of Deref
indirection!
It must be noted, however, that the compiler may optimize the double indirection, but the same cannot be said about memory usage. Recall that all pointer types that hold Sized
targets have the same size as a usize
. In 64-bit targets, that is worth 8 bytes.
DISCLAIMER: From here on out, I will be assuming 64-bit targets when computing pointer sizes.
Since String
is a Sized
type, Arc<String>
is actually just the size of a pointer. Indeed, at least in terms of size, the Arc
smart pointer is a "zero-cost abstraction" over a regular pointer. But, let us not forget that Arc
is actually a pointer to some heap allocation. In this case, it points to a String
.
As of writing (Rust 1.54), we know that a String
is essentially a Vec<u8>
, where the bytes just happen to be valid UTF-8. Furthermore, a Vec<u8>
is composed of two fields: an internal RawVec
and the length
of the stored data. A RawVec
is just an internal wrapper over the capacity
and the underlying pointer to the data buffer.
Therefore, a String
is actually worth three pointers: an integer length
(usize
), an integer capacity
(usize
), and a data pointer (usize
). In total, that is worth 24 bytes!
Putting it all together, an Arc<String>
is an 8-byte smart pointer to a 24-byte heap-allocated String
(of some capacity
)—plus 8 bytes for the strong count (AtomicUsize
) and 8 more bytes for the weak count (AtomicUsize
)—which dereferences to a heap-allocated str
(of some length <= capacity
).
In other words, an Arc<String>
is actually 8 bytes (on the stack) plus 40 bytes (on the heap) plus the underlying str
allocation of capacity
bytes (on another part of the heap). This is at least 48 bytes in total if we disregard the str
!2
// The total memory used is at least `8 + 24 + 16 + 11 = 59` bytes!
// Note that I have disregarded alignment to keep the example simple.
use std::sync::Arc;
let text = Arc::new(String::from("Hello World"));
// Here, we dereference the variable to verify how much
// memory is used by each level of indirection.
use std::mem::size_of_val;
dbg!(size_of_val::<Arc<String>>(&text)); // 8
dbg!(size_of_val::<String>(&text)); // 24
dbg!(size_of_val::<str>(&text)); // 11
Taking Advantage of Immutability
Phew! That was a lot to take in, but I hope I have demonstrated that the double indirection and the Vec
overhead is quite considerable for an immutable string. So... how do we fix this?
First and foremost, we have to address the double indirection. As demonstrated earlier, a pointer to a pointer to a str
is not exactly the most efficient way to handle our use case.
The first key insight comes from the fact that we assumed that the str
must be immutable. That is, we don't actually need the mutation methods provided by String
. If we can somehow interact with the underlying str
of the String
instead, we can eliminate the double indirection entirely. Luckily, Arc
allows us to do just that! Recall that Arc<str>
implements From<String>
.3 This conversion is sufficient for our use case.
use std::sync::Arc;
// Retrieve the string from somwhere...
let text: String = get_string_from_somwhere();
// Here, we "consume" the `String` wrapper
// as a smart pointer to a `str`. Alternatively,
// we may invoke the `Arc::from` syntax.
let owned_reference: Arc<_> = text.into();
todo!("spawn threads here");
And that's pretty much it! Now, when we clone the smart pointer across multiple threads, we're simply cloning the direct pointer to the underlying str
—no more double indirection!
Analyzing the memory usage, we find that Arc<str>
is worth 16 bytes. But wait... wasn't Arc<String>
just 8 bytes? Yes, it was! This is because String
is Sized
. As mentioned earlier, all pointers to Sized
types are worth 8 bytes (a single pointer).
Meanwhile, recall that a pointer to a dynamically sized type (such as a str
) is actually a wide pointer. That is, they contain much more information than just a plain pointer. In this case, a &str
, Box<str>
, Rc<str>
, Arc<str>
, or any other pointer-like wrapper to a str
is worth 16 bytes because they internally contain a direct pointer to the str
allocation (usize
) and its length (usize
).
Yes, this does mean that subsequent clones of the wide pointer is now twice as large as those of a regular pointer, but it does allow us to work around the double indirection issue. Consequently, the wide pointer is all we need to refer to the str
—no extra heap allocations required! Another neat side effect is that excess capacity is also removed, thus enabling more efficient memory utilization.
In summary, the Arc<str>
is only worth 16 bytes on the stack. Meanwhile, on the heap, there are 8 bytes for the strong count (AtomicUsize
) and 8 bytes for the weak count (AtomicUsize
) in addition to the length
bytes of the underlying str
allocation. This brings the total to 32 + length
bytes. If compared to Arc<String>
which is worth 48 bytes (alone) plus the length
of the str
plus the excess capacity
(if any), these are significant savings if I do say so myself!
NOTE: It is important to emphasize that
Arc
requires some overhead due to atomics. In particular, if it weren't for the strong count and the weak count, 16 bytes may be shaved in both cases. That is, if we had used aBox
instead, we would attain the same results minus 16 bytes, regardless of whether it was anArc<str>
or anArc<String>
.
Therefore, if mutability is not necessary, then perhaps an owned str
reference may be a worthwhile optimization. The same also applies for [u8]
, Path
, CStr
, and other slice-like types.
An Applied Example with Hash Maps
Switching gears, let us consider a different example where we maintain a table of usernames and coins (represented by integers). The table is essentially a mapping between an immutable username and the amount of coins in their wallet.
A straightforward implementation makes use of std::collections::HashMap
. Instinctively, we first reach out to HashMap<String, u16>
as the initial type signature. But, since the usernames are immutable, it is possible to remove the String
abstraction altogether. In code:
use std::iter::repeat_with;
fn generate_username() -> (String, u16) {
todo!("generate string-coins pairs somehow")
}
// This results in a `HashMap<Box<str>, u16>`.
// Observe the the key-type is a `Box<str>` instead
// of the typical `String` wrapper.
let table: HashMap<_, _> = repeat_with(generate_username)
.map(|(name, coins)| (name.into_boxed_str(), coins))
.collect();
In fact, at some point, the Rust compiler used this optimization to efficiently keep track of a table of immutable symbols in code. Shown below is a code snippet where this occurs:
pub struct Interner {
names: HashMap<Box<str>, Symbol>,
strings: Vec<Box<str>>,
}
Conclusion
Immutability is a wonderful thing! It enables a multitude of valuable optimizations and assumptions about the code we write. In the first place, immutability is the only reason why it was possible to "dissolve" the String
wrapper in the examples above. Really, the String
wrapper only exists to provide an interface for "resizable" strings—that is, via String::push
.
Otherwise, when mutability is not necessary, the String
wrapper frankly provides nothing more than accidental overhead. For the relatively cheap cost of a mandatory wide pointer, the immutable Box<str>
gains access to the following optimizations:
-
Direct ownership over the
str
primitive. - No more double indirection.
- No more excess capacity.
- No more extra heap allocations.
- Lower memory usage overall.
The same can be said about Vec<T>
(for some type T
), which appropriately wraps a slice [T]
. In fact, there exists a plethora of such wrapper types: namely those for Path
, CStr
, and many other slice-like types.
Perhaps the most valuable lesson to be learned here is to know your types. Sure, Rust claims "zero-cost abstractions", but this does not mean we should keep our guard down from possible ways to further optimize our programs.
As illustrated earlier, a closer look into the abstractions may reveal its unnecessary cost. Thus, it is paramount to keenly recognize and properly assess the constraints of our use cases. In doing so, we may even inadvertently stumble upon aggressive optimizations that let us make the most out of our computer resources.
-
Dynamic memory is inherently less predictable since its behavior can only be determined at runtime. An example would be user input. There is no way for the compiler to predict and optimize for the plethora of ways a user may input a string. The best a program can do is to allocate memory dynamically (based on the potential size of the input) and to resize accordingly. Such behavior cannot be done and optimized at compile-time. Though, it is possible to allocate a "fixed budget" for the string, which may allow for some optimizations. Generally speaking, however, this is not the most flexible option. ↩
-
Just to be extra sure, I verified this with Compiler Explorer using
rustc 1.54
. The compiler flags were--edition 2018
and-O
(for optimization). Without loss of generality, I hard-coded the 11-byte string"Hello World"
. Indeed, the produced assembly allocates 11 bytes for thestr
, 24 bytes for theString
, and 8 bytes for theArc<String>
smart pointer. @jakubdabek also provided a Godbolt example in the comments section. ↩ -
Technically, the
From<String>
implementation just dereferences the string and copies the underlyingstr
into theArc
allocation. Therefore, the real magic happens at theFrom<&str>
implementation. ↩
Top comments (3)
How did you get to the conclusion that there are only as many bytes allocated by
Arc<T>
asT
takes? That would make it have the same functionality asBox<T>
. Just to be extra extra sure, I replicated the code shown in the article (working godbolt here), and I see 2 allocations: 11 bytes for theString
data and 40 bytes for theArc<String>
, which contains the 24 forString
and 2AtomicUsize
s for strong and weak count.Interestingly, for
Arc<str>
there are 32 bytes allocated, because the allocation code aligns to 8 (16 for strong and weak counts, 11 for "Hello World" rounded up to 16).Apropos note 3., while the
From<&str>
is interesting, the realest magic happens incopy_from_slice
😉Other points from the article stand, as there's memory savings to be had and there's rarely any reason to keep immutable strings in
String
(although there are some niche uses for double indirection e.g.ArcSwap
)Ah, that is my mistake. I have totally missed that! My original conception of the article came from one of my projects when I realized that I could use a
Box<str>
over aString
. Then, as I wrote this article, I wanted to apply the same optimization over multi-threaded contexts viaArc
. It had not occurred to me (until you mentioned it) thatArc
actually has some overhead due to atomics, whichBox
does not have. I had falsely assumed otherwise.This led to my incorrect conclusion that
Arc
allocates asBox
would. So to make a long story short, I had unintentionally disregarded theArc
overhead, thinking that it was aBox
.Thanks for pointing this out! I shall edit the article accordingly.
Eye opening. Thanks a lot