I wanted to talk about sets, and four great operations they provide:
- Intersection: Elements two sets have in common.
- Union: All the elements from both sets.
- Difference: Elements present on one set, but not on the other.
- Symmetric Difference: Elements from both sets, that are not present on the other.
We are going to do this using Python (3.7), because it rules!
So, let's dive right into it.
Some basic understanding of sets.
Sets are data structures that, based on specific rules they define, provide certain cool operations and guarantees to make a lot of things easier when we use them.
Two basic rules we need to know are these:
- Items in a set are unique. We can not have a set that contains two items that are equal.
- Items in a set are not ordered. Depending on the implementation, items may be sorted using the hash of the value stored, but when you use sets you need to assume items are ordered in a random manner.
And a cool guarantee they provide us: Checking if an element is present on a set has a time complexity constant (O(1)), always. This means, checking if a set contains an element is super fast.
The most popular implementation of a set used to achieve this is the hashed set. Oversimplified, it basically stores the items in a key-value store, where the key is a hash of the value stored.
So, to create a set in Python we use the constructor set
which receives any iterable: my_pets = set(['dog', 'cat', 'parrot'])
(no, I don't have a parrot, nor a cat, dog person here).
Curly braces may be used too, which will avoid creating an intermediate list, but as they are also used for dictionaries, you can not create an empty set with curly braces. Also, someone can get confused, so I avoid them and stick to the constructor.
Sets are collections, therefore they are iterable:
my_pets = set(['dog', 'cat', 'parrot'])
for pet in my_pets:
print('Hello, {}!'.format(pet))
The above program serves as a demonstration of how elements in a set are ordered, below is the output of the program where you can see the order I defined was not respected by the interpreter:
Hello, dog!
Hello, parrot!
Hello, cat!
So, with this information, let's just dive into these cool operations.
Intersection
The intersection between two sets, results in a third set that contains the elements present in both.
For example, if I calculate the intersection between {1, 2, 3}
and {2, 3, 4}
, the output will be {2, 3}
.
If this helps you understand, here is a naive implementation in Python:
a = set([1, 2, 3])
b = set([2, 3, 4])
result = {element for element in a if element in b}
print(result)
Man, is Python awesome or what? Of course, this will print {2, 3}
as expected. Anyway, that was irrelevant because Python provides a function to do this as part of its standard library. The same can be achieved with the following:
a = set([1, 2, 3])
b = set([2, 3, 4])
result = a.intersection(b)
print(result)
Notice that, because of the nature of the operation, a.intersection(b)
and b.intersection(a)
is the same. It is a commutative operation.
Union
The union between two sets, results in a third set with all the elements from both sets.
For example, if I calculate the union between {1, 2, 3}
and {2, 3, 4}
, the output will be {1, 2, 3, 4}
. Kind of like a concatenation, but having in mind that sets can not have repeated elements.
Again, let's see a naive implementation in Python:
a = set([1, 2, 3])
b = set([2, 3, 4])
c = set()
for element in a:
c.add(element)
for element in b:
c.add(element)
print(c)
As expected, the output is {1, 2, 3, 4}
. Again, though, Python does provide a native function to do this:
a = set([1, 2, 3])
b = set([2, 3, 4])
c = a.union(b)
print(c)
Notice that, because of the nature of the operation, a.union(b)
and b.union(a)
is the same. It is a commutative operation.
Difference
The difference between two sets results in a third set with the element from the first, that are not present on the second. Right now I'm going to tell you, this operation is not commutative.
For example, if I have a set a = {1, 2, 3}
and a set b = {2, 3, 4}
, the difference between a and b
is {1}
and the difference between b and a
is {4}
.
Here's a naive implementation:
a = set([1, 2, 3])
b = set([2, 3, 4])
c = {element for element in a if element not in b}
print(c)
This outputs {1}
as expected, and will print {4}
if we invert a
and b
. Again, an implementation using Python's standard lib:
a = set([1, 2, 3])
b = set([2, 3, 4])
c = a.difference(b)
print(c)
Symmetric difference
The symmetric difference between two sets results in a third set with the elements, from both sets, that are not present on the other.
For example, if I calculate the symmetric difference between {1, 2, 3}
and {2, 3, 4}
, the output will be {1, 4}
. This operation is commutative.
Here's a naive implementation:
a = set([1, 2, 3])
b = set([2, 3, 4])
c = set()
for element in a:
if element not in b:
c.add(element)
for element in b:
if element not in a:
c.add(element)
print(c)
The above implementation will print {1, 4}
, as expected. Of course, Python has this on its standard libary too: symmetric_difference
.
a = set([1, 2, 3])
b = set([2, 3, 4])
c = a.symmetric_difference(b)
print(c)
Conclusion
Sets are really useful, and actually pretty fun. Give them a try. I lived a lot of time constrained to lists, and when I finally sat down and learned how to work with sets started finding better, more elegant solutions to a lot of problems.
Top comments (8)
Hi Sebastian, nice article! Sets are great and "set comprehensions" are so cool. I have a tip for construction:
You can use the set literal notation to avoid creating the intermediate list. Unfortunately for an empty set you still need the
set()
function:I'm not sure they make Python's code more readable (well, the difference operator maybe) but you can use a few operators in place of the functions you perfectly explained up here:
In the spirit of the Python zen (explicit is better than implicity) I don't think I've ever used the operators, aside from the
-
for difference which is self explanatory. The other three remind me too much of bitwise operators in C :DActually the bitwise operations is completely applicable to the sets (i.e. the wiki XOR article contains Venn's diagrams). That's happening because of the bitwise operations are applied not to the values but to their presence in the set.
Thanks for reminding me about that. I still prefer the explicit version in "day to day" code though, especially because they can be used dynamically if needed!
Them being bitwise operators makes them really easy to remember, IMO - after all, what's the fundamental difference between a bitmask and a set of booleans? :)
(Of course in that case
difference
should look like&~
and not-
...)Python is a high level language though, there's no builtin concept of masking bits so, taking into account that we can't assume C knowledge by the reader, I reckon that:
is more readable than
especially six months from now with a different programmer tasked to fix something 😆
(also remember than there are no other major contexts in Python itself where
a & b
means anything)I agree that the long names are more readable, but Python does provide operators for
&
and|
already; on integers they provide the bitwise logic, same as in C.Also some things make use of those operators for other purposes; for example, Peewee ORM uses them for its query generator.
Anyway, just because a language is high level doesn't mean it doesn't (and shouldn't) provide bitwise functionality. Bitwise operations are still really useful for a lot of purposes in a lot of fields and I absolutely would not discount how necessary they are.
Oh thanks fluffy, I totally forgot about those. I rarely see them anywhere that I probably forgot :) My bad!
Can't argue with that hehe
Hi, thanks a lot for the comment.
As stated in my post, yes, we can use curly braces to avoid the intermediate collection, but I think, unless performance is an issue, we should use the constructor, as curly braces are also used for dictionaries.
On the other hand, I did not know about those operators, so thanks a lot for bringing them up, although I agree with you, the usage of the functions is more readable.
Again, thanks for your coment.
Regards!