This blog is going to ramble. That’s OK. I’m writing it for me, not for anyone else. But if someone wants to follow along, then I’m happy to have you. Welcome!
Once upon a time, I worked somewhere that needed to store a whole lot of semi-structured data that was extracted from documents. After investigating for some time, we decided that a graph database would be the best approach. However, at the time graph databases were rare and inefficient. So we made the mistake of building our own. This actually worked, and it’s still in use in a few places. But we made some decisions based on the technology of the day, and things have moved on.
More recently, I started building a new system that was specifically meant to run entirely in memory. This also worked, and it’s in use in my company today. But I’ve started making some moves to save the data durably, and this part isn’t needed for work. It’s “for fun”. It’s my hobby. I have a clear plan for it all, but there is still a lot of implementation to go.
Family life being what it is, I feel that a discussion of what I’m doing will serve to help keep me focused on my task, and simultaneously document what I’m doing. I also realize that I’m bringing together a number of threads that not everyone knows, so maybe there’s interest out there for someone else to read this as well.
So my upcoming posts will be building on several different ideas, some of which may seem a little unrelated at first glance.
We often hear how the only time we will ever need to know the ins and outs of data structures is during job interviews, while in reality, no one needs to know how they really work. Even if you do know how they work, then you should always use libraries, and no one should ever write their own. Right?
To a certain extent, this is indeed true. Interviewers who ask these questions are really just checking if CS graduates learned their data structures courses at college, or if non-graduates learned sufficient CS to have covered this material. Either way, it doesn’t make sense for an interviewer to ask.
If a CS graduate actually graduated, then they passed an exam with this material. Maybe they’ve forgotten it today, but a quick refresher will bring them back up to speed. Stop trying to catch them out. For anyone who didn’t do CS, then why do you think they should know this? It’s not like they’re going to use it 99% of the time. If they need it, they’ll learn it. There is a myriad of skills with a greater ROI for someone learning to code.
Despite this, there are still a few reasons to know about data structures.
- Someone has to write them. New languages are invented all the time, and to make the language useful these structures usually need to be reimplemented.
- Knowing the underlying structure is vital for knowing which one to use, and which algorithms work best with them. Shifting structure types can speed up an algorithm significantly, or allow a different better algorithm to be used.
- Every so often a particular type of data will have a property that allows a customized structure to store it more efficiently in terms of speed and/or space. You need to understand the rules of structures if you ever hope to break those rules safely.
- Durable storage. That’s what these blog posts are about.
What do I mean when I say that storage is durable? I’m saying that when your programs shut down or your machines go offline, then when you come back to it your data will still be there. The obvious example of this is a file. But there are also databases, S3 buckets, persistent RAM (when it is released), and so on.
As an aside, that brief list is a little disingenuous. In reality, everything gets stored on some kind of media as ones and zeros, with an addressing mechanism for finding specific ones and zeros. Abstractions then get layered over this. Hard drive controllers take blocks of data and an address, determine the physical location of that address and write the data. Operating systems take that abstraction and create the “file” abstraction over it. Databases take the file abstraction and then create their table abstraction over that. Or network file systems can layer a file abstraction over a network protocol. And these abstractions can then be layered again and again, always hiding the mechanism underlying them. It can get fancy and silly, but it always comes back to storing your data so you can come back to it later.
A simple way to store data is to just write it out to a file. Come back later, and you read it back again. This works for some applications, but not others. For instance, if I write an English dictionary to a file in alphabetical order and read it back later, then to look up a word I need only search back and forth alphabetically. But if I were to collect names and student numbers from people as they entered a school, and wrote it to a file in the order that the students walked in, then reading it back later is going to be hard. If I want Mary Smith’s student number I have to read the entire file right up until I find her. So we want some kind of structure to what we’re writing.
This brings me back to what needs to be stored to make an effective graph database. This is based on structures, and how those structures can be stored in various abstractions, not just files. I plan to start this story by describing some history of building structures and then build them to the point where they make sense for what we are going.
Part 2 can be found here.