System Architecture (4 Part Series)
The goal of the project is to design a Url shortening service like bit.ly or tinyurl.com that is realtime scalable.
Url Shortener is a service that creates a short alias URL for a corresponding long URL. So, now whenever the user visits the short URL, he will be redirected to the original URL.
Example: Let the original URL be: Long URL: https://nlogn.in/horizontal-scaling-and-vertical-scaling/ Short URL is: https://tinyurl.com/tepyk8t Whenever you will click the second short URL, you will automatically be redirected to the page referred by the long URL. Try Now.
The URL shortening service should have the following features:
- Given a long URL it should be able to generate Unique Short URL.
- Given a short URL it should redirect to the original URL.
- The URL should expire after a timespan.
- The service should be REST API accessible.
- It should provide analytics features i.e, how many times the URL is visited.
- A user should be able to pick a custom URL.
The important thing to note is that the number of reading requests will be 1000 times more than the number of write requests.
Suppose, we are going to receive 1B(1 Billion) new URL's shorting requests per month, with a 100:1 read/write ratio. So, the total number of redirections:
100*1B = 100B redirections
If we are going to store, each URL(long and corresponding short URL) for almost 10 years, then the numbers of URLs we are going to store are:
1B*120 = 120B URLs
Now if on average every URL is going to be of 200 bytes, then total storage required:
200B*120byte = 24TB storages 😱
Let's starts by making two functions accessible through REST API:
1. create_shortURL( long_url, api_key, custom_url, expiry_date)
- long_url: A long URL that needs to be shortened.
- api_key: A unique API key provided to each user, to protect from the spammers, access and resource control for user etc.
- custom_url(optional): The custom short link URL, user want to use.
- expiry_date(optional): The date after which the short URL becomes invalid.
- Return Value: The short Url generated, or error code in case of the inappropriate parameter.
2. redirect_shortURL( short_url)
- short_url: The short URL generated from the above function.
- Return Value: The original long URL, or invalid URL error code.
The URL that we have to generate should be a short URL. But how much short? Also, what type of encoding should be done?🤔
The encoding can be Base36 (characters allowed [a-z][0-9]) or Base62 ( characters allowed [A-Z][a-z][0-9]) or Base64 (characters allowed [A-Z][a-z][0-9],'+','/'). Since all of the characters in Base64 are URL safe characters, we will choose Base64 as our encoding technique.
Using Base64 encoding if we choose the
URL with length 6, will give 64^6 = ~68.7 Billion URLs URL with length 7, will give 64^7 = ~5 Trillion URLs URL with length 8, will give 64^8 = ~281 Trillion URLs
We have already estimated that we would be storing 120B URLs, hence we can safely choose the short URL to be 7 characters long.
Now, to generate a unique short URL, we can calculate the MD5 hash of the long URL, which will produce a 128-bit hash value. Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.
(MD5 result = 32character output = 32*4bit = 128 bit output. Base64 encoding will use 6 bit to represent each character, MD5 -> Base64 give (128/6) ~ 22 character output)
For choosing the short URL, first, we can randomly swap some character of the Base64 encoding result and then pick any 7 characters randomly from the result.
Using the above method we can come across 2 potential issues:
- If multiple users enter the same Long URL, in the worst case they can get the same short URL.
- If the user enters the URL-encoded URL i.e, https://nlogn.in/system-design?rel=source (UTF-8 format) and https%3A%2F%2Fnlogn.in%2Fsystem-design%3Frel%3Dsource (URL-encoded format). Both URLs are the same but encoding is different. Check here.
We can append the user IP address to a the long URL and then do the shortening porcess, in this way we won't get redundency. Correct?🤔
No, two users can have same public IP addresses if they are connected via same router.
One solution is to add user-id or API key to the long URL and then do the shortening. This will work fine, but user have to be logged in to create short URL. Let's stick to this for now.
For every request recieved we will use UTF-8 encoding only. If a URL is URL-encoded, we will first convert it into UTF-8 format. For this we can have our own service or we can use some 3rd party services.
Let's see the data we need to store:
User ID: A unique user id or API key to make user globally distinguishable.
Name: The name of the user.
Email: The email id of the user
Password: Password of the user to facilitate login feature.
Creation Date: The date on which the user was registered.
Short Url: 7 character long unique short URL.
Original Url: The original long URL.
UserId: The unique user id or API key of the user who created the short URL.
Expiration Date: The date after which this short URL should become invalid.
- We will be storing Billions of links.
- The short link redirection should be fast.
- Our service is going to be read heavily.
- The only relation that is going to exist is which user created which URL and that too is going to accessed very less.
We have two different choices of databases: 1) Relational Databases(MySQL) 2) NoSQL Databases( Cassandra).
In general, Relational Databases are good if we have lots of complex queires involving joins, but they are slow. NoSQL databases are pathetic at handling the relationship queries but they are faster.
Now, we don't really need lots of relationship among data, but we do need the fast read and write speed. Hence we will choose NoSQL Database. The key for each row can be the shorturl, because it is going to be globally unique.
To scale out our database, we need to partition it into several machines or nodes, so that it can store information about billions of URLs. Hence now we can store more data in memory because of more machines or nodes. For database sharding we will use hash based partition technique.
In this technique, we will find the hash of the shorturl we are going to store, and determine the machine/shard in which we are going to store this particular URL. The hash function will randomly distribute the URLs into different partation or shard. We can decide the number of shards we are going to make and then we can choose appropriate hash function that random number representing the partation/shard number.( Ex if we have chosen 512 shards, then hash function will generate a number between [1-512] or [0-511] ).
We know that our database is going to be read heavily. Till now we have find the way to speed up the writing process, but the reads are still slow. So we have to find some way to speed up the reading process.
Caching is the solution. We can cache the URLs that are going to be accessed frequently. For example, a url that appears on the trending page of any social networking website. Hence many people are going to visit the url. We can use the caching servies like memcached.
Things we need to consider after adding the caching layer:
When the cache is full, we need to replace the URLs in cache with the treanding ones. For this we will use LRU(Least Recently Used) policy. The URL in the cace which have been refered the least number of times will be removed.
Synchronizing the cache with the original URL. If the user updates or deletes the original link, the corresponding changes has to be reflected in the cache too.
We can shared the cache too. This will help us store more data in memory because of the more machines. For deciding which thing go to which shard can be done using "Hashing" or "Consistant Hashing".
Now this is now we have speed up our read and write request, but still our system is prone to network bandwidth bottleneck and single point of failure.
To overcome the problem of limited network bandwidth and single point of failue, we will use Load Balancers. Load Balancer does its magic by diving the traffic among a group of server thus resulting in improved response and availability of a website or application. Read more here...
To distriubte the load among server we will use the Least Bandwidth Method. This algorithm will choose the server currently serving the least amount of traffic, measured in megabits per second (Mbps).
We can place the Load Balancers between:
- The client and the server.
- The server and the database.
This is how we will design a highly scalable URL Shortening Service that can work in realtime. 👨💻
This post was originally published at nlogn
Claim your page on DEV before someone else does
Level up every day