DEV Community

Cover image for Best Practices Working with Billion-row Tables in Databases
Ashraful Islam Nixon
Ashraful Islam Nixon

Posted on

Best Practices Working with Billion-row Tables in Databases

This whole post is about a video from Hussein Nasser I saw from youtube. This is just the summarise version of the video along with my key takeaways from the video

Learning:

The discussion is on

  • How to handle data in billion row based table
  • What kind of approach can be taken
  • How to redesign the table to handle 2 billion rows based table

Summary:

Here the discussion is being started with how a twitter follower table has been designed. A simple approach, make a table with which person follows whom. A 2-3 columns based rows. But it will be a huge table in the long term in sense if we use this for twitter. So what can be done

  1. Do normal query without the concept of indexing. Just brute forcing the data without the concept of anything. Do multi threading, multi processing and find the data from the table using lots of machines (map reduce)
  2. Use indexing on the table and find the data from the table using the indexed data.
  3. Now you have billions of data so indexing is huge, to search that use database partitioning in the same disk. Use pair partitioning
  4. Now to optimize more use sharding in the system(multiple host). But it adds more complexity in the system. The client needs to be aware of the shard info before querying, then needs to find the proper partition for the query, then make an actual query. Which makes another layer of logics along with business logics.
  5. Another way is to redesign the system like in the profile table add 2 more rows to hold db columns like follower count, followers(in json). In that way a profile holds all information about followers of a profile. The problem might generate how to write/edit this data. But that is another kind of system design like querying, CQRS, event based solution. It solves current issues.

Discussion (0)