This post is about a video from Hussein Nasser that I saw on YouTube. It is a summarized version of the video along with my key takeaways.
The discussion covers:
- How to handle data in a billion-row table
- What kinds of approaches can be taken
- How to redesign the table to handle 2 billion rows
The discussion starts with how a Twitter follower table might be designed. A simple approach: make a table recording which person follows whom, with rows of 2-3 columns. But at Twitter's scale it becomes a huge table over time. So what can be done?
- **Brute force** the data without any optimization: run plain queries, then throw multi-threading, multi-processing, and lots of machines at scanning the table.
- Add an **index** to the table and find the data through the indexed column instead of a full scan.
- Now you have billions of rows, so the index itself becomes huge; to search it, use database **partitioning** on the same disk. Use pair partitioning.
- To optimize further, use **sharding** in the system (multiple hosts). But it adds more **complexity**: the client needs to know the shard info before querying, then find the proper **partition** for the query, and only then make the actual query. That is another layer of logic on top of the business logic.
- Another way is to **redesign** the system: in the profile table, add 2 more columns to hold follower data, such as follower count and followers (as JSON). That way a profile row holds all the follower information for that profile. The problem then becomes how to write/edit this data, but that is a different kind of system-design question (query patterns, CQRS, event-based solutions). It solves the current issue.
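The indexing step above can be sketched with SQLite. The table and column names here are my own illustration, not from the video; the point is how the query plan changes once an index exists:

```python
import sqlite3

# A minimal sketch of the indexing idea using SQLite.
# The follows(follower_id, followee_id) schema is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [(u, u + f) for u in range(1000) for f in range(1, 5)],
)

query = "EXPLAIN QUERY PLAN SELECT followee_id FROM follows WHERE follower_id = 42"

# Without an index the planner has to scan every row.
plan_before = str(conn.execute(query).fetchall())

# With an index on the lookup column it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_follower ON follows (follower_id)")
plan_after = str(conn.execute(query).fetchall())

print(plan_before)  # the plan mentions a full table scan
print(plan_after)   # the plan mentions idx_follower
```

On a billion-row table this is the difference between touching every row and touching only the handful that match the key.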
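The partitioning idea can be sketched as splitting the one big table into many smaller tables by key range, so each partition (and its index) stays small. Everything here, from the table names to the range size, is a hypothetical illustration:

```python
import sqlite3

# Sketch of range partitioning within a single database: rows are routed to
# a physical table chosen by follower_id range. Names are hypothetical.
conn = sqlite3.connect(":memory:")
PARTITION_SIZE = 1000  # ids 0-999 -> follows_p0, 1000-1999 -> follows_p1, ...

def partition_table(follower_id: int) -> str:
    return f"follows_p{follower_id // PARTITION_SIZE}"

def insert_follow(follower_id: int, followee_id: int) -> None:
    table = partition_table(follower_id)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(follower_id INTEGER, followee_id INTEGER)"
    )
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (follower_id, followee_id))

insert_follow(5, 77)
insert_follow(2500, 9)  # lands in a different partition, follows_p2

# A lookup only touches the one small partition its key maps to.
rows = conn.execute(
    f"SELECT followee_id FROM {partition_table(5)} WHERE follower_id = 5"
).fetchall()
```

Real databases (e.g. PostgreSQL declarative partitioning) do this routing for you, but the mechanism is the same: the key decides which small piece of the table a query has to read.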
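The extra client-side complexity that sharding introduces can be sketched as a routing function the application must call before any SQL runs. The shard host names below are made up for illustration:

```python
import hashlib

# Sketch of the client-side routing layer sharding forces on you: before any
# query can run, the client must map the key to a host. Hostnames are made up.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: int) -> str:
    # Hash the partition key so users spread evenly across hosts.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query now starts with this extra routing step,
# a whole layer of logic sitting beside the business logic.
host = shard_for(42)
```

This is the layer the video warns about: it must be correct, deterministic, and kept in sync everywhere queries are issued, which is why sharding is usually a last resort.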
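The redesign bullet can be sketched as a denormalized profile table where follower data lives in the profile row itself as a count plus a JSON list. The column names are my own illustration, not from the video:

```python
import json
import sqlite3

# Sketch of the redesigned profile table: follower data is denormalized into
# the profile row as a count plus a JSON array. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile ("
    " user_id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " follower_count INTEGER,"
    " followers TEXT)"  # JSON array of follower ids
)
conn.execute(
    "INSERT INTO profile VALUES (?, ?, ?, ?)",
    (42, "alice", 3, json.dumps([7, 19, 101])),
)

# Reading a profile's followers is now a single-row primary-key lookup
# instead of a scan over a billion-row follows table.
count, followers_json = conn.execute(
    "SELECT follower_count, followers FROM profile WHERE user_id = 42"
).fetchone()
followers = json.loads(followers_json)
```

Reads become cheap, but as the post notes, writes get harder: updating one follow now means rewriting a JSON blob, which is where patterns like CQRS or event-based updates come in.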