So first of all, there is Big Data. To make it short, Big Data is the way to handle (semi)-structured or unstructured datasets, which are too large to fit in the memory of a single/a few computer(s) bla bla bla.
And we talk about Big Data because today almost everything produces data: your phone, your fridge, your socks... everything. Then people use these datasets to understand their customers' behaviour, to monitor some kind of web-services or to do like fraud detection or financial things you know.
To illustrate my point, let's say you're the developer of a very popular multiplayer game and as soon as a player performs a key action, your server save it in a log file. So, you have like a log file that contains billion of records, and you'd like to extract some useful information such as the average number of quests completed, per server.
You got a basic computer and you want to do these computations, how does it look like? Well, firstly, your program has to take this dataset and cut it into several chunks via a partitioning algorithm. Then, it has to spin up a few tasks or some kind of workers/threads that will independently operate on a chunk, in parallel. Lastly, they will combine, if necessary, their result into one single output.
Great, if you run this, your computer either explodes or runs out of memory because that's really too much data to handle... You have to find another way. Maybe you could do some kind of vertical scaling by buying this overpriced $50,000 New Apple Mac Pro. It might probably help (I hope hehe xd) but that's not really scalable at all.
In fact, If your game becomes even more popular than it is, you'll get a larger volume of data, you'll have to upgrade your computer, spend a shitload of money and so on, until you get a file so large that you can't store it on any f**** machine.
No... you have to find a more effective way. Maybe, could you just distribute the problem across multiple machines? Yay!
So instead upgrading you computer in order to be able to do these computations, you'll use a cluster of very traditional machines to get the job done in a short time. It's about the same configuration as above but with some key differences.
The dataset is chopped up into pieces, each piece has a number of copies and they are stored over several nodes. That's what we do via a distributed file system (DFS). Then, we no longer have these kind of threads sharing the same memory but worker nodes operating independently of each other, in parallel, on a fragment of the dataset.
Alright, the more dataset grows, the more nodes you use... and you get the job done fast. You may use hundreds or thousands of machines, that's highly scalable.
I've mentioned that kind of distributed parallel computing tasks over several computers, but how do we do that effectively? Here comes MapReduce (heavy breathing intensifies).
MapReduce is a programming paradigm invented by Google which describes a way of structuring your job that allows it to be easily run on a many machines. What does that mean? Well, if you write your computation in a specific way then it's really easy for your computation to take advantage of hundreds or thousands of computers. And it's very easy to understand, sweet!
Let's say you have a huge file that holds billion of Amazon reviews, and you'd like to grab the average rating by product category. With MapReduce that's a walk in the park.
First of, there is the map phase. You have a bunch of computers and your dataset, which has been sliced into pieces, will be spread over them. These machines do not communicate with each other during the map phase, they focus on the given chunks/partition and they run the user-defined function on each element to produce key-value pairs.
So for each element we'll grab the product category as a key and the customer rating as a value (product_category, user_rating), we end up with that representation:
|(Books, 3.5)||(Phones, 4.5)||(Cooking, 4)|
|(Phones, 5)||(Cooking, 5)||(Books, 2)|
|(Phones, 3)||(Books, 3.5)||(Phones, 5)|
Then, there is a shuffle phase which is going to sort these keys and send them to the next phase so every reducers obtains all values associated with the same key.
|(Books, [3.5, 3.5])||(Cooking, [4, 5])||(Phones, [4.5, 3, 5])|
Finally, here comes the reduce phase which combines all the results from the map stage to aggregate one value as an output. In this example, we wants to get the average rating:
|(Books, 3.5)||(Cooking, 4.5)||(Phones, 4.17)|
That's all. And this can scale up to hundreds and thousands of machines, the process will be exactly the same. That's how big compagnies such as Google built this kind of search engine indexing.
If you're planning to build MapReduce-style jobs, there is Hadoop for that.
Hadoop is very popular open-source project under Apache which provides you a distributed file system called Hadoop Distributed File System (HDFS) to store your large datasets. And it provides also the famous data-processing framework: Hadoop MapReduce. If you've read so far, I hope you probably know how this works ;).
Just keep in mind that MapReduce may not fit your needs. In fact, you could solve basically any class of problem with MapReduce if you take your algorithm, twist it enough and shove it into that map-shuffle-reduce style structure. But the framework can be too restrictive for what you're trying to achieve. Then, MapReduce is... slow, and that's because it has to read/write the data frequently between stages and these operations are really expensive.
Finally, you can do batch-style processing only. That means, if you take our previous online game example, the game server will generate daily logs. So for instance, you'll get all the events from today but you'll have to wait tomorrow to be able to run your MapReduce job and get the results. But you want to get a much lower latency, you'd like to build a real-time analytics dashboard to be able to tweak your game based on your player's behaviour. That is streaming processing, that's another chapter of the book but that's a game changer in the industry.
But fortunately, thanks to Apache folks and Google's work, you have now some advanced, fast, fault tolerant, Big Data tools built on top of MapReduce which allows you to deal either with bounded or unbounded datasets with a higher level of abstraction.
You no longer just have these two functions
reduce() but more expressive APIs and more efficient runners. In a next post I'll talk about Apache Beam which is a portable, unified programming model which is a glue between multiple programming language, multiple runners and multiple data processing modes.