Free Big Data Resources (2 Part Series)
Data is becoming a cornerstone of software services. Whether it is the business model itself, a driver of revenue, or both, tech companies are flocking to this "free" resource to provide better services and outdo their competitors.
If you hold the "new sexy" job in computer science or you're doing research in this field, you will find the resources in this article extremely helpful, the same way they helped me. Frontier companies in this field like Google, Facebook, LinkedIn, and Twitter, as well as big universities, have released dozens of papers and articles on the subject outlining internal projects they worked on. Many of these projects were later released as open source and became staples in the field. To save you the time and pain of getting lost in the labyrinth of endless resources on the internet (the way I did), I compiled a categorized list here for your pleasure. I will try to update the list frequently to keep it up to date.
I divided the resources into 8 main categories. This first part includes the following four:
- Big Data Storage & NoSQL Databases
- Interactive Data Analytics
- Big Data Challenges and Ecosystems
- Resource Management
Bigtable: the terabyte-scale NoSQL database behind Google's cloud storage.
Cassandra: the Facebook column-oriented database.
Voldemort: Distributed database by LinkedIn.
Dynamo: Amazon's key-value store.
HBase: Column-oriented storage over HDFS, modeled on Bigtable and used heavily at Facebook.
Neo4j: the famous graph database.
Snowflake: A data warehouse for the cloud.
The Google File System: Google's distributed file system, whose design is the basis for distributed storage in Hadoop.
HDFS: The Hadoop Distributed File System.
RCFile: Data placement for data warehouses used in Apache Hive.
Parquet: columnar storage format.
Haystack: an object storage system optimized for Facebook’s Photos application.
Windows Azure Storage: Cloud Storage System from Microsoft.
Data management in cloud environments - NoSQL and NewSQL data stores: A paper surveying data stores beyond SQL, such as Redis, HBase, and others.
Dremel: analytics system by Google.
Impala: SQL engine for Hadoop by Cloudera.
Drill: An open source system inspired by Dremel.
Dryad: a framework to define dataflow graphs from Microsoft.
Tez: an open source implementation of Dryad from Hortonworks and Microsoft.
Kudu: A storage engine for fast analytics on big data by Cloudera.
Google: how the large-scale search engine was built.
The Lambda Architecture: an architecture for data pipelines.
The Kappa Architecture: an alternative architecture for data pipelines.
Summingbird: a framework for integrating batch and online computations.
Eventual Consistency: A look at how data consistency works in NoSQL database systems.
Paxos: a consensus algorithm for distributed systems.
Raft: an alternative consensus algorithm to Paxos.
ZooKeeper: Coordination and distributed configuration service by Yahoo!.
YARN: resource manager for Hadoop.
Borg: Cluster manager by Google.
Kubernetes: container-orchestration system for automating application deployment, scaling, and management by Google.
The second part will focus on algorithms, ML, and data processing systems, so stay tuned ;)