Three months have passed since I wrote my last post. I’ve finished the CS50 course on time, but work and life made me choose others than writing, but I wanted to recover the feeling of doing something for the dev community, so here we go again… Finishing the CS50 series with this post! I hope you enjoy it.
In past lessons, we’ve learned about memory, and we’ve been working with arrays. The problem with arrays to store a large amount of data is that they have fixed size and their values are stored in contiguous memory addresses, so if you want to add a new item you have to allocate a new block of memory with enough space for the new array, copy value by value including the new element, and lastly, delete the old array. Arrays offer constant time
O(1) for lookup thanks to random access, but they’re bad for inserting/deleting an element, especially when we are working with large arrays because copy value by value takes linear time
A linked list is a data structure that, unlike arrays, it allows us to store a list of values in different parts of memory, giving us a better time complexity for inserting new values.
With linked lists doesn’t matter where the data is stored, we can use small chunks of free memory for each node, stitching them together with pointers:
We have the tradeoff of needing to allocate more memory for each item (for its value and the pointer to the next element), in order to spend less time adding values, because you don’t need all the copy process arrays to do to add an element.
Linked lists are composed of nodes that store its value and the pointer to the next element of the list. We could define a node in C like this:
typedef struct node
struct node *next;
Linked lists usually have a property called
head, which is a pointer to the first node of the list. Each time we want to add a new node to the list, we just make the new node point to the current head element, and then make
head point to the new node. In this way we are inserting new nodes at the beginning of the list, which takes constant time
O(1) since the required steps don’t change doesn’t matter how many items the list has. On the other hand, inserting nodes into a sorted list will have a running time of
O(n), since we might need to insert our nodes at the end.
Unlike arrays, searching is not a powerful feature of linked lists, we have a running time of
O(n) for searching since we need to follow each node, one at a time, to search for a value. We won’t be able to use binary search either, since we can’t use indexes to access elements.
A tree is another data structure where each node points to other nodes. It allows us to store a list of values in different parts of memory, just like linked lists. But, preserves the searching power of arrays, allowing us to use Binary search (divide and conquer approach), since each node points to one to the left (with a smaller value) and one to the right (with a larger value).
Every left child is smaller than the root, and every right child is larger than the root, as we can see in the picture below:
Each node has at most two children, or nodes it is pointing to. And like a linked list, we’ll want to keep a
header pointer to the beginning of the list, the root of the tree. To search for a number, we’ll start at the root node, and be able to recursively search the left or right subtree.
We can define a node in C like this:
typedef struct node
struct node *left;
struct node *right;
If we add nodes in inefficient ways, though, our binary search tree might start to look like a linked list:
We can make the tree balanced, or more optimal, by making the node with value
2 the new root node.
With a balanced binary search tree, the running time for search and insert will be
O(log n). But if our tree isn’t balanced, it can devolve into a linked list, with running time for search and insert of
Binary trees are used for efficient searching and sorting.
A hash table is a data structure that allows us to associate keys with values. It looks like an array, where we can jump to each location by its index (key). Keys are hashed and converted into an index where we can find the value stored.
If we need to implement a phone book, we could think of each location as a letter from A through Z, and insert names into each location. If we have multiple names with the same first letter, we can add them with a linked list, like this. Using a linked list to store values is a method to solve hash collisions, and is called chaining:
To create the hash table in C, we might write:
Since a hash table is just an array of pointers to nodes, with
NUMBER_OF_BUCKETS as its size.
To decide which location in the array a value should be placed in, we use a hash function, which takes some input and produces an index.
It turns out that the worst case running time for searching a hash table is
O(n) since all of our values might be in the same location, devolving into a linked list as well. But, on average the running time is constant
Hash tables are used for database indexes, implementing caches, implementing set data structures, etc. And offer constant time for adding, removeing, and lookup elements. But they’re not an ideal data structure if sorting is the goal.
A trie is a tree with arrays as nodes. It gives us constant time lookup, even for massive datasets. Supposing we are looking to store names:
Each array will have locations that represent each letter, A-Z.
For each word, the first letter will point to an array, where the next valid letter will point to another array, and so on, until we reach a boolean value indicating the end of a valid word, marked in green:
With multiple names, we start seeing some of the space being reused for the first letters that are shared:
We might define a trie in code with a structure like this:
typedef struct node
struct node *children[SIZE_OF_ALPHABET];
At each node, we’ll have a boolean value that indicates if it’s a valid word. Then, we’ll have an array of
SIZE_OF_ALPHABET pointers to other nodes, called
The good thing is that the height of our trie is the length of the longest word we want to store.
And even if our data structure has lots of words, the maximum lookup time will be just the length of the word we’re looking for. Doesn’t matter how many names contain the trie, will always take 5 steps to find the name “Harry”.
Tries have a constant time
O(1), for searching and insertion. The cost for this, though, is that we need lots of memory to store mostly arrays with null pointers.
Here you have some sort of summary of each data structure seen on this post:
- Insertion is easy - just stack onto the front
- Deletion is easy - once you find the element
- Lookup is bad - have to rely on linear search
- Relatively difficult to sort - unless you’re willing to compromise on super-fast insertion and instead sort as you construct
- Relatively small size-wise (not as small as arrays)
- Insertion is a two-step process - hash, then add
- Deletion is easy - once you find the element
- Lookup is on average better than linked lists because you have the benefit of a real-world constant factor. You have smaller linked lists on each location instead of one entire linked list.
- Not an ideal data structure if sorting is the goal, just use an array
- Can run the gamut of size. It depends on the height of the hash table and the length of the linked lists on each location.
- Insertion is complex - A lot of dynamic memory allocation, but gets easier as you go
- Deletion is easy - Just free a node
- Lookup is fast - not quite as fast as an array, but almost
- Already sorted - sorts as you build in almost all situations
- Rapidly becomes huge, even with very little data present, not great if space is at a premium
If you wanna know more about time complexity of different data structures and sorting algorithms here you have a great cheatsheet: bigocheatsheet.com
There are more data structures like queues, stacks, doubly linked lists, b-trees, etc; but maybe we can see some of them in other posts. This is a really important topic to learn and understand, not just because they’re present in every tech interview, but because they’re really helpful and improve us as developers to take better decisions when storing data.
On each data structure, we always pay a price, we must choose to minimize space, or minimize time. It’s not really possible to get the best of both worlds. That’s why is so important to learn about data structures: to make smart decisions.
See you soon! 😉