DEV Community

Cover image for Python: How simple string concatenation can kill your code performance

Posted on

Python: How simple string concatenation can kill your code performance

Note: a video version of this tutorial is available Python: How simple string concatenation can kill your code performance

In general, there is no problem with string concatenation

String concatenation is one of the first and most used features in programming. You learn variables basics, integers, booleans, and strings. It’s even one of the first things we learn in the python official Documentation:

String concatenation from Python offical documentation

So yes, string concatenation is not bad, and it might even be impossible to do programming without needing to concatenate strings for one reason or another.

Although, yes, like almost any useful tool or feature, when used the wrong way it can lead to really bad effects. So it’s important to be aware of how to use them correctly.

When string concatenation could be dangerous?

You should pay attention to string concatenation when those two conditions are met:

  1. The feature is a performance-critical one. (Example: a text data analysis feature on a web application that is running, and a user in the website interface is waiting to get back the result, so your code should be as fast as possible)
  2. You’re dealing with a lot of concatenation (starting from 10000) because of a loop maybe, and/or the strings you’re concatenating are really really long.

That can lead to significant overhead in terms of memory allocation and garbage collection, slow down your code, and in the worse case just make your code crash in production 😬.

And, that also means it's totally safe to use string concatenation when those conditions are not met, and you’re sure that, these conditions can never be met in the future. So for instance:

firstname = "John"
lastname = "Joe"
fullname = fistname + lastname
Enter fullscreen mode Exit fullscreen mode

is totally okay because there is no way someone’s name can become long enough to create a performance issue. Anyways, not in this world 😅.

Why string concatenation is dangerous in those cases?

In Python, strings are immutable objects, which means that every time a string is concatenated with another string, a new string object is created in memory.

To illustrate that let's run this simple code:

my_str = "Hello"
print(f"my_str: {my_str}")
print(f"id = {id(my_str)}")
my_str += " World"
print(f"\n my_str: {my_str}")
print(f"id = {id(my_str)}")
Enter fullscreen mode Exit fullscreen mode

Note: Python id() function returns the identity of an object. The identity of an object is an integer, which is guaranteed to be unique and constant for this object during its lifetime.

The output will be:

my_str: Hello
id = 140575060484848

my_str: Hello World
id = 140575060532336
Enter fullscreen mode Exit fullscreen mode

As you can see the id of the string before the concatenation and after are not the same so python has created a new string.

The solution to that issue and a comparison with string concatenation

To address this performance issue, you can use alternative techniques such as using a list to hold the individual string components and then joining them together using the join method.

Python doesn’t create a new list each time you append an element, and the join method is optimized to avoid creating a new string object for each string in the list it has to join.

import timeit

# Concatenate strings
start_time = timeit.default_timer()
s = ''
for i in range(10000):
    s += 'a'
elapsed_time_concat = timeit.default_timer() - start_time
print('Time taken for string concatenation:', elapsed_time_concat)

# Join strings
start_time = timeit.default_timer()
lst = []
for i in range(10000):
s = ''.join(lst)
elapsed_time_join = timeit.default_timer() - start_time
print('Time taken for string join:         ', elapsed_time_join)

join_performance = (elapsed_time_concat - elapsed_time_join) / elapsed_time_concat
print('Performance gained with join', join_performance)
Enter fullscreen mode Exit fullscreen mode

In this example, we generate a string of length 10,000 by concatenating the string a 10,000 times using both the string concatenation and the join method. The output shows that the join method is significantly faster than the concatenation method:

Time taken for string concatenation: 0.0019936899916501716
Time taken for string join:          0.0011655330017674714
Performance gained with join 41.53890491255549 %
Enter fullscreen mode Exit fullscreen mode

As you can see, by using join we have increased the time performance by almost 42% which is wonderful.

Diving deep into the why: Math and a Big(O) explanation

This behavior could also be explained using BigO notation especially, the time complexity.

Let’s focus on the concatenation part of our previous code.

We have a string s that we concatenate M times with a string c of length on length N.

s = ''
c = 'c'
for i in range(10000):
    s += c
Enter fullscreen mode Exit fullscreen mode

So here N=1 and M = 10000

Keep this in mind:

The complexity to create a string of length N is O(N).
So because concatenation creates a new string
of length len(s+c) , the complexity of the concatenation at each iteration is the length of s at the previous iteration + length of string c.

Table: Illustrate string concatenation time complexity
As you can see the total, or overall complexity is 1 + 2 + 3 + 4 + 5 +…+ M .
which we can mathematically approximate at (M(M+1)) / 2 .
And by neglecting the minority exponents we have:

String concatenation time complexity Big O

Let’s try to so see the time complexity with the join approach:

lst = []
c = 'c'
for i in range(10000):
s = ''.join(lst)
Enter fullscreen mode Exit fullscreen mode

In this case, we have two steps:

  1. The loop to create the list
for i in range(10000):
Enter fullscreen mode Exit fullscreen mode

The complexity here is O(M), and in our case M=1000 because python has to execute lst.append(c) M times.

  1. The join s = ''.join(lst) Here the complexity is O(T) where T = total length of the final string after the join and in our case T = N * M .

So the final time complexity is NM + M = M (N + 1) and we can approximate it as O(MN).

Which is far better than O(M**2) for the simple concatenation.

That is for this tutorial. I hope you understand when and why string concatenation could be dangerous so you can use join instead when it’s needed.

Thanks for reading, don’t forget to checkout the video version on our youtube channel here and I see you at the next one.

Take care.

Top comments (2)

mikhail_vasilyev_139c91f5 profile image
Mikhail Vasilyev • Edited

Well, oddly enough, but (at least in CPython) benign s+='s' is actually O(N) -

lashortlist profile image
La Shortlist

Really interesting. Thanks !