Here I want to present my book on advanced algorithms for data-intensive applications named "Probabilistic Data Structures and Algorithms in Big Data Applications" (ISBN: 9783748190486), that I published some months ago. The detailed information about the book you can find at https://pdsa.gakhov.com or Amazon and below I give you some introduction to the topic this book is about.
As you might already know, the Big Data is more than simply a matter of size. The datasets of Big Data are larger, more complex, and generated more rapidly than our current resources can handle. Such datasets are so extensive that traditional data processing software and algorithms just can't manage them. This is why Big Data doesn't refer to data, it refers to technology. Thus, to successfully approach such problems, we need to think about how to improve/change the technology we use. One of the examples is to use advanced algorithms with a completely different ideology than the classical ones.
The Probabilistic data structures and algorithms (PDSA) is a family of advanced techniques that are optimized to use fixed or sublinear memory and constant execution time; they often based on the hashing and have many other useful features. However, they also have some disadvantages such as they cannot provide the exact answers and have some probability of error (that, actually, can be controlled). The tradeoff between the error and the resources is another feature that distinguishes the algorithms and data structures of this family.
Such technologies have found very naturally their use-cases in Big Data, since there we also have the tradeoff - either left the whole data unprocessed or agree that some results are not entirely exact.
I guess, everyone has heard about Volume, Velocity, and Variety, The Three Vs of Big Data. These dimensions imply such problems as efficient membership querying, cardinality estimation, frequency and rank computation for data streams, and similarity calculation. Just as an example, we can point some algorithms from the PDSA family by addressing The 3 Vs of Big Data.
You might never have heard about most of them, but such data structures are not necessarily new. For example, the most known data structure is the Bloom filter, designed in the 70s. It efficiently solves the problem of performing membership queries (a task to decide whether some element belongs to the dataset or not) in a constant time without requirements to store all elements. This is an example of a probabilistic data structure, but there are much more that have been designed for various tasks in many domains.
Such data structures already implemented in many popular and well-known products such as Google Big Query, Amazon Redshift, Redis, Cassandra, CouchDB, Elasticsearch, and many others.
Andrii Gakhov@gakhovWhen you make a read request for some partition, Apache @cassandra has to merge in-RAM and on-disk data. To avoid loading each data file, it uses such well-known #probabilistic data structure as Bloom filter (with 0.1% false positive probability) #PDSABook cassandra.apache.org/doc/latest/ope…08:35 AM - 09 Apr 2019
For instance, you can use
approx_count_distinct command in Spark SQL or
PFCOUNT in Redis that actually powered by the HyperLogLog algorithm, another example of PDSA.
The book consists of six chapters, each preceded by an introduction and followed by a brief summary and bibliography for further reading relating to that chapter. Every chapter is dedicated to one particular problem in Big Data applications, it starts with an in-depth explanation of the problem and follows by introducing data structures and algorithms that can be used to solve it efficiently.
The first chapter gives a brief overview of popular hash functions and hash tables that are widely used in probabilistic data structures. Chapter 2 is devoted to approximate membership queries, the most well-known use case of such structures. In chapter 3 data structures that help to estimate the number of unique elements are discussed. Chapters 4 and 5 are dedicated to important frequency- and rank-related metrics computations in streaming applications. Chapter 6 consists of data structures and algorithms to solve similarity problems, particularly — the nearest neighbor search.
If you are interested in usage of such data structures from your code, there are plenty of implementation in almost all programming language. As a side effect of the book, I have created a Python library PDSA, https://github.com/gakhov/pdsa, that is implemented in Cython and can be easily used from any Python3-powered applications. By the way, everybody is welcome to contribute!