Note: a video version of this tutorial, "Python: How simple string concatenation can kill your code performance", is available.
In general, there is no problem with string concatenation
String concatenation is one of the first and most used features in programming. You learn the basics of variables, integers, booleans, and strings. It's even one of the first things taught in the official Python documentation.
So no, string concatenation is not bad in itself; it would even be impossible to program without concatenating strings for one reason or another.
However, like almost any useful tool, when used the wrong way it can lead to really bad effects, so it's important to be aware of how to use it correctly.
When can string concatenation be dangerous?
You should pay attention to string concatenation when these two conditions are met:
- The feature is performance-critical. (Example: a text data analysis feature on a running web application, where a user is waiting in the interface for the result, so your code should be as fast as possible.)
- You're doing a lot of concatenations (roughly 10,000 or more), typically in a loop, and/or the strings you're concatenating are very long.
This can lead to significant overhead in terms of memory allocation and garbage collection, slow down your code, and in the worst case crash it in production 😬.
Conversely, it's totally safe to use string concatenation when these conditions are not met and you're sure they never will be. For instance:
firstname = "John"
lastname = "Joe"
fullname = firstname + lastname
is totally okay, because there is no way someone's name can become long enough to create a performance issue. At least, not in this world 😅.
Why is string concatenation dangerous in those cases?
In Python, strings are immutable objects, which means that every time a string is concatenated with another string, a new string object is created in memory.
To illustrate that, let's run this simple code:
my_str = "Hello"
print(f"my_str: {my_str}")
print(f"id = {id(my_str)}")
my_str += " World"
print(f"\n my_str: {my_str}")
print(f"id = {id(my_str)}")
Note: Python's id() function returns the identity of an object. The identity of an object is an integer which is guaranteed to be unique and constant for this object during its lifetime.
The output will be:
my_str: Hello
id = 140575060484848
my_str: Hello World
id = 140575060532336
As you can see, the id of the string before and after the concatenation is not the same, so Python has created a new string.
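For contrast, a mutable object such as a list keeps the same identity when it is modified in place. A quick sketch (the variable names here are mine, not from the code above):

```python
# Lists are mutable: appending modifies the object in place,
# so no new object is created and the id stays the same.
my_list = ["Hello"]
id_before = id(my_list)

my_list.append("World")
id_after = id(my_list)

print(id_before == id_after)  # True: still the very same list object
```

This is exactly the property that makes the list-based approach below cheap: we can grow the list without allocating a new container each time.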
The solution to this issue, and a comparison with string concatenation
To address this performance issue, you can use alternative techniques, such as collecting the individual string components in a list and then joining them together with the join method.
Python doesn't create a new list each time you append an element, and the join method is optimized to avoid creating a new string object for each string in the list it has to join.
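One way to see why appending is cheap is that lists over-allocate storage, so most appends don't need a new allocation at all. Note that this is a CPython implementation detail, not a language guarantee; the snippet below just makes it visible:

```python
import sys

lst = []
sizes = []
for _ in range(20):
    sizes.append(sys.getsizeof(lst))  # allocated size in bytes, not length
    lst.append('a')

# The size stays flat for several appends in a row, then jumps in steps
# whenever the spare capacity runs out:
print(sizes)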
import timeit

# Concatenate strings
start_time = timeit.default_timer()
s = ''
for i in range(10000):
    s += 'a'
elapsed_time_concat = timeit.default_timer() - start_time
print('Time taken for string concatenation:', elapsed_time_concat)

# Join strings
start_time = timeit.default_timer()
lst = []
for i in range(10000):
    lst.append('a')
s = ''.join(lst)
elapsed_time_join = timeit.default_timer() - start_time
print('Time taken for string join: ', elapsed_time_join)

join_performance = (elapsed_time_concat - elapsed_time_join) / elapsed_time_concat
print('Performance gained with join', join_performance * 100, '%')
In this example, we generate a string of length 10,000 by concatenating the string 'a' 10,000 times, using both plain concatenation and the join method. The output shows that the join method is significantly faster than the concatenation method:
Time taken for string concatenation: 0.0019936899916501716
Time taken for string join: 0.0011655330017674714
Performance gained with join 41.53890491255549 %
As you can see, by using join we reduced the running time by almost 42%, which is wonderful.
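In real code the same pattern applies: accumulate the pieces in a list, then join once at the end. A minimal sketch (build_csv_row is a hypothetical helper I made up for illustration, not something from the benchmark above):

```python
def build_csv_row(values):
    """Build one CSV line by joining the pieces once,
    instead of doing += inside the loop."""
    parts = []
    for value in values:
        parts.append(str(value))
    return ','.join(parts)

print(build_csv_row([1, 2, 3]))  # 1,2,3
```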
Diving deep into the why: math and a Big O explanation
This behavior can also be explained with Big O notation, specifically time complexity.
Let’s focus on the concatenation part of our previous code.
We have a string s that we concatenate M times with a string c of length N.
s = ''
c = 'c'
for i in range(10000):
    s += c
So here N = 1 and M = 10000.
Keep this in mind: the complexity of creating a string of length N is O(N).
Because concatenation creates a new string of length len(s + c), the complexity of the concatenation at each iteration is the length of s at the previous iteration plus the length of the string c.
So the total, overall complexity is 1 + 2 + 3 + 4 + 5 + … + M, which mathematically sums to M(M + 1) / 2.
By neglecting the lower-order terms, we get a time complexity of O(M²).
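You can check this sum directly. The snippet below just verifies the closed form for our value of M (with N = 1, each iteration i copies i characters):

```python
M = 10000

# Total characters copied across all M concatenations (N = 1):
total_copied = sum(range(1, M + 1))

print(total_copied)      # 50005000
print(M * (M + 1) // 2)  # 50005000, the same closed form
```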
Let's now look at the time complexity of the join approach:
lst = []
c = 'c'
for i in range(10000):
    lst.append(c)
s = ''.join(lst)
In this case, we have two steps:
- The loop to create the list:
for i in range(10000):
    lst.append(c)
The complexity here is O(M) (in our case, M = 10000), because Python has to execute lst.append(c) M times.
- The join:
s = ''.join(lst)
Here the complexity is O(T), where T is the total length of the final string after the join; in our case, T = N * M.
So the final time complexity is NM + M = M(N + 1), which we can approximate as O(MN).
That is far better than the O(M²) of plain concatenation.
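To make the gap concrete, we can count the characters copied by each approach for our values of M and N. This is a back-of-the-envelope model that ignores constant factors and any interpreter-level optimizations:

```python
M, N = 10000, 1

concat_work = N * M * (M + 1) // 2  # quadratic in M: characters copied by +=
join_work = N * M + M               # linear in M: appends plus the final join

print(concat_work)               # 50005000
print(join_work)                 # 20000
print(concat_work // join_work)  # 2500: join does ~2500x less work here
```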
That's it for this tutorial. I hope you now understand when and why string concatenation can be dangerous, so you can use join instead when it's needed.
Thanks for reading! Don't forget to check out the video version on our YouTube channel here, and see you in the next one.
Take care.