Once again, we’re back with another Python topic. Today, we’ll talk about how to compare strings in Python. Typically, I try to stay away from strings because they have a lot of complexity (e.g. different languages, implementations, etc.). That said, I decided to take a risk with this one. Hope you like it!
As a bit of a teaser, here’s what you can expect in this article. We’ll be looking at a few different comparison operators in Python, including `==`, `<`, `<=`, `>`, and `>=`, as well as `is`. In addition, we’ll talk about how these operators can be used to compare strings and when to use them. If you want to know more, you’ll have to keep reading.
Let’s imagine we’re building a simple search engine. For example, we have a bunch of files with text in them, and we want to be able to search through those documents for certain keywords. How would we do that?
At the core of this search engine, we’ll have to compare strings. For instance, if we search our system for something about the Pittsburgh Penguins (say, Sidney Crosby), we’ll have to look for documents that contain our keyword. Of course, how do we know whether or not we have a match?
Specifically, we want to know how we can compare two strings for equality. For example, is “Sidney Crosby” the same as “Sidney Crosby”? How about “sidney crosby”? Or even “SiDnEy CrOsBy”? In other words, what constitutes equality in Python?
Of course, equality isn’t the only way to compare strings. For example, how can we compare strings alphabetically/lexicographically? Does “Malkin” come before or after “Letang” in a list?
If any of these topics sound interesting, you’re in luck. We’ll cover all of them and more in this article.
In this section, we’ll take a look at a few different ways to compare strings. First, we’ll look at a brute force solution which involves looping over each character to check for matches. Then, we’ll introduce the comparison operators which abstract away the brute force solution. Finally, we’ll talk about identity.
Since strings are iterables, there’s nothing really stopping us from writing a loop to compare each character:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    is_same_player = True
    for a, b in zip(penguins_87, penguins_71):
        if a != b:
            is_same_player = False
            break
In this example, we zip both strings and loop over each pair of characters until we don’t find a match. If we break before we’re finished, we know we don’t have a match. Otherwise, our strings are “identical.”
While this gets the job done for some strings, it might fail in certain scenarios. For example, what happens if one of the strings is longer than the other?
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

As it turns out, `zip()` will actually truncate the longer string. To deal with that, we might consider doing a length check first:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"
    is_same_player = len(penguins_87) == len(penguins_59)
    if is_same_player:
        for a, b in zip(penguins_87, penguins_59):
            if a != b:
                is_same_player = False
                break
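To make the truncation concrete, here’s a small sketch of what `zip()` hands back for strings of different lengths. The helper `naive_equal` is a hypothetical name used only for illustration; it mirrors the loop above without the length check.

```python
def naive_equal(first, second):
    """Compare character pairs only; this is the loop above without a length check."""
    return all(a == b for a, b in zip(first, second))


# zip() stops at the shorter input, so the trailing "el" of "Guentzel" is never visited.
pairs = list(zip("Crosby", "Guentzel"))
print(len(pairs))   # 6
print(pairs[-1])    # ('y', 'z')

# Without the length check, a prefix is mistaken for a full match.
print(naive_equal("Cros", "Crosby"))  # True
```

That last line is exactly the false positive the length check is meant to prevent.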
Of course, even with the extra check, this solution is a bit of overkill and likely error-prone. In addition, this solution only works for equality. How do we check if a string is “less” than another lexicographically? Luckily, there are other solutions below.
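To get a feel for why extending the loop to ordering gets tedious, here’s a rough sketch of a hand-rolled “less than” check. The function name `is_less` is made up for this example, and the approach (comparing `ord()` values pair by pair, then falling back to length) is just one way to do it.

```python
def is_less(first: str, second: str) -> bool:
    """Hand-rolled lexicographic less-than based on character code points."""
    for a, b in zip(first, second):
        if a != b:
            # The first differing pair decides the ordering.
            return ord(a) < ord(b)
    # Every shared position matches, so the shorter string sorts first.
    return len(first) < len(second)


print(is_less("Crosby", "Malkin"))  # True
print(is_less("Cros", "Crosby"))    # True
print(is_less("Malkin", "Letang"))  # False
```

It works, but it’s a fair amount of code for something the language already handles.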
Fun fact: we don’t have to write our own string equality code to compare strings. As it turns out, there are several core operators that work with strings right out of the box: `==`, `<`, `<=`, `>`, and `>=`.
Using our Penguins players from above, we can try comparing them directly:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

    penguins_87 == penguins_87  # True
    penguins_87 == penguins_71  # False
    penguins_87 >= penguins_71  # False
    penguins_59 <= penguins_71  # True
Now, it’s important to note that these comparison operators work with the underlying numeric code point of each character (Unicode, which matches ASCII for plain English letters). As a result, seemingly equivalent strings might not appear to be the same:
    penguins_87 = "Crosby"
    penguins_87_small = "crosby"
    penguins_87 == penguins_87_small  # False

When we compare “Crosby” and “crosby”, we get `False` because “c” and “C” aren’t equivalent:

    ord('c')  # 99
    ord('C')  # 67
Naturally, this can lead to some strange behavior. For example, we might say “crosby” is less than “Malkin” because “crosby” comes before “Malkin” alphabetically. Unfortunately, that’s not how Python interprets that expression:
    penguins_87_small = "crosby"
    penguins_71 = "Malkin"
    penguins_87_small < penguins_71  # False
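If it helps, here’s a quick sketch of why that happens. Every uppercase ASCII letter has a smaller code point than every lowercase one, so a default sort pushes lowercase names to the end. The `players` list is just sample data for this example.

```python
import string

players = ["crosby", "Malkin", "Letang"]

# "M" has a smaller code point than "c", so "Malkin" sorts first.
print(ord("M"), ord("c"))  # 77 99

# The largest uppercase letter is still smaller than the smallest lowercase one.
print(max(string.ascii_uppercase) < min(string.ascii_lowercase))  # True

# A plain sort therefore puts the lowercase name last.
print(sorted(players))  # ['Letang', 'Malkin', 'crosby']
```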
In other words, while these comparison operators are convenient, they don’t actually perform a case-insensitive comparison. Luckily, there are all sorts of tricks we can employ like converting both strings to uppercase or lowercase:
    penguins_87_small = "crosby"
    penguins_71 = "Malkin"
    penguins_87_small.lower() < penguins_71.lower()  # True
    penguins_87_small.upper() < penguins_71.upper()  # True
Since strings in Python are immutable, as they are in many languages, these methods don’t actually manipulate the underlying strings. Instead, they return new ones.
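Here’s a short sketch that shows both ideas at once: the original string is left alone, and the converted copy can drive a case-insensitive comparison. Note that `equals_ignore_case` is a hypothetical helper, and it uses `str.casefold()` rather than the `lower()` call shown above. That’s a swap on my part; `casefold()` is a built-in string method intended for caseless matching.

```python
def equals_ignore_case(first: str, second: str) -> bool:
    """Compare two strings while ignoring case differences."""
    return first.casefold() == second.casefold()


name = "SiDnEy CrOsBy"
lowered = name.lower()

print(name)     # SiDnEy CrOsBy (the original is untouched)
print(lowered)  # sidney crosby

print(equals_ignore_case(name, "sidney crosby"))  # True

# The same idea works as a sort key.
print(sorted(["crosby", "Malkin", "Letang"], key=str.casefold))
# ['crosby', 'Letang', 'Malkin']
```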
All that said, strings are inherently complex. I say that as a bit of a warning because there are bound to be edge cases where the solutions in this article don’t work as expected. After all, we’ve only scratched the surface with ASCII characters. Try playing around with some strings that don’t include English characters (e.g. 🤐, 汉, etc.). You may be surprised by the results.
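As one example of a surprise, the same visible character can be stored in more than one way. The sketch below uses the standard library’s `unicodedata.normalize()` to bring two spellings of “é” into the same form before comparing. Whether normalization is appropriate depends on your data, so treat this as an illustration rather than a rule.

```python
import unicodedata

composed = "\u00e9"      # "é" stored as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# They look alike on screen but are different sequences of code points.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing both to the same form (NFC here) makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```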
Before we move on, I felt like it was important to mention another way of comparing strings: identity. In Python, `==` isn’t the only way to compare things; we can also use `is`. Take a look:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

    penguins_87 is penguins_87  # True
    penguins_87 is penguins_71  # False
Here, it’s tough to see any sort of difference between this solution and the previous one. After all, the output is the same. That said, there is a fundamental difference here. With equality (`==`), we compare the strings by their contents (i.e. letter by letter). With identity (`is`), we compare the strings by their location in memory (i.e. address/reference).
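The distinction is easiest to see with an object that is never shared behind the scenes. The sketch below uses lists instead of strings for that reason: two separate list literals always produce two separate objects, even when their contents match. The variable names are just placeholders.

```python
first = ["Crosby", "Malkin"]
second = ["Crosby", "Malkin"]  # same contents, separate object
alias = first                  # another name for the very same object

print(first == second)  # True: the contents match
print(first is second)  # False: two different objects in memory
print(first is alias)   # True: both names refer to one object

# `is` behaves like comparing the values returned by id().
print(id(first) == id(alias))  # True
```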
To see this in action, let’s create a few equivalent strings:
    penguins_87 = "Crosby"
    penguins_87_copy = "Crosby"
    penguins_87_clone = "Cros" + "by"
    penguins_8 = "Cros"
    penguins_7 = "by"
    penguins_87_dupe = penguins_8 + penguins_7

    id(penguins_87)        # 65564544
    id(penguins_87_copy)   # 65564544
    id(penguins_87_clone)  # 65564544
    id(penguins_87_dupe)   # 65639392 Uh Oh!
In the first three examples, the Python interpreter was able to tell that the constructed strings were the same, so the interpreter didn’t bother making space for the two clones. Instead, it gave the latter two, `penguins_87_copy` and `penguins_87_clone`, the same ID as the first. As a result, if we compare any of the first three strings with either `==` or `is`, we’ll get the same result:
    penguins_87 == penguins_87_copy == penguins_87_clone  # True
    penguins_87 is penguins_87_copy is penguins_87_clone  # True
When we get to the last string, `penguins_87_dupe`, we run into a bit of an issue. As far as I can tell, the interpreter isn’t able to know what the value of the expression is until runtime. As a result, it creates a new location for the resulting string, despite the fact that “Crosby” already exists. If we modify our comparison chains from above, we’ll see a different result:
    penguins_87 == penguins_87_copy == penguins_87_clone == penguins_87_dupe  # True
    penguins_87 is penguins_87_copy is penguins_87_clone is penguins_87_dupe  # False
The main takeaway here is to only use `==` when comparing strings for equality (and any object for that matter). After all, there’s no guarantee that the Python interpreter is going to properly identify equivalent strings and give them the same ID. That said, if you need to compare two strings for identity, `is` is the way to go.
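If you ever do want equal strings to share an identity, the standard library offers `sys.intern()`, which returns a single canonical copy of a string. This goes a bit beyond the examples above, so take it as a sketch rather than a recommendation. The variable names are placeholders.

```python
import sys

first = "Crosby"
prefix, suffix = "Cros", "by"
second = prefix + suffix  # built at runtime from two variables

# Equality only looks at contents, so this is reliable.
print(first == second)  # True

# Whether `first is second` holds is an implementation detail, so don't rely on it.
# Interning both strings yields the same canonical object either way.
print(sys.intern(first) is sys.intern(second))  # True
```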
Normally, I would check each solution for performance, but they’re not all that similar. Instead, I figured we could jump right to the challenge.
Now that we know how to compare strings in Python, I figured we could try using that knowledge to write a simple string sorting algorithm. For this challenge, you can assume ASCII strings and case sensitivity. However, you’re free to optimize your solutions as needed. All I care about is the use of the operators discussed in this article.
If you need a sample list to get started, here’s the current forward roster for the Pittsburgh Penguins (reverse sorted alphabetically):
    penguins_2019_2020 = [
        'Tanev',
        'Simon',
        'Rust',
        'McCann',
        'Malkin',
        'Lafferty',
        'Kahun',
        'Hornqvist',
        'Guentzel',
        'Galchenyuk',
        'Di Pauli',
        'Crosby',
        'Blueger',
        'Blandisi',
        'Bjugstad',
        'Aston-Reese'
    ]
When you’re finished, share your solution on Twitter using #RenegadePython. Here’s my sample solution to get you started!
Then, head on over to my article titled How to Sort a List of Strings in Python to see a few clever solutions.
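For reference, here’s one possible way to approach the challenge: a simple insertion sort built on the `<` operator. This is only a sketch of one approach, not necessarily the one I shared, and the function name `insertion_sort` and the shortened `roster` list are just for illustration. It assumes ASCII strings and case sensitivity, as the challenge allows.

```python
def insertion_sort(names):
    """Return a new list of strings sorted with the < operator."""
    result = list(names)  # copy so the input list is left untouched
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger items one slot right until current fits.
        while j >= 0 and current < result[j]:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


roster = ["Tanev", "Malkin", "Crosby", "Guentzel", "Aston-Reese"]
print(insertion_sort(roster))
# ['Aston-Reese', 'Crosby', 'Guentzel', 'Malkin', 'Tanev']
```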
And with that, we’re all done. Check out all the solutions here:
    penguins_87 = "Crosby"
    penguins_71 = "Malkin"
    penguins_59 = "Guentzel"

    # Brute force comparison (equality only)
    is_same_player = len(penguins_87) == len(penguins_59)
    if is_same_player:
        for a, b in zip(penguins_87, penguins_59):
            if a != b:
                is_same_player = False
                break

    # Direct comparison
    penguins_87 == penguins_59  # False
    penguins_87 > penguins_59   # False
    penguins_71 <= penguins_71  # True

    # Identity checking
    penguins_87 is penguins_87  # True
    penguins_71 is penguins_87  # False
- How to Sort a List of Strings in Python
- How to Sort a List of Dictionaries in Python
- How to Format a String in Python
If nothing else, thanks for taking some time to check out this article. See you next time!
The post How to Compare Strings in Python: Equality and Identity appeared first on The Renegade Coder.