Free Big Data Resources (2 Part Series)
This is a continuation of the resources I listed in part 1.
This part includes the following four categories:
- Machine Learning & Algorithms in Big Data
- Data Processing Systems
- Real-time Processing
- Graph Processing
Recommending items to more than a billion people: An article about collaborative filtering at Facebook.
Machine Learning with Sparkling Water: using H2O, the machine learning framework, with Apache Spark.
MLlib: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.
TensorFlow: the famous large-scale machine learning library.
Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.
Airflow: a workflow management system by Airbnb.
Oozie: a workflow management system for Hadoop by Yahoo!.
BlinkDB: approximate query processing for analytics on large-scale data, from Berkeley.
FlumeJava: a library for developing parallel data pipelines from Google.
MapReduce: the Google framework behind Hadoop.
The Dataflow Model: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.
MillWheel: stream processing engine from Google.
Photon: A tool to join data streams at Google.
Kinesis: stream processing engine from Amazon.
Trill: incremental data analytics engine from Microsoft.
Kafka: the famous distributed messaging system from LinkedIn.
Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. Spark 2 introduced Structured Streaming, which is built on the Spark SQL engine; Spark SQL also allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.
SparkR: a Spark library for writing processing applications in R.
GraphFrames: distributed graph processing with Spark's DataFrames.
Storm: real-time data processing engine from Twitter.
Heron: Twitter's successor to Storm.
Pulsar: real-time data processing engine from eBay.
WTF: the Who to Follow service at Twitter.
GraphJet: real-time recommendation graph engine at Twitter.
Pregel: large-scale graph processing engine at Google.
Giraph: open-source implementation of Pregel, used at scale by Facebook.
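To give a feel for the vertex-centric style that Pregel and Giraph are known for, here is a toy, single-process sketch in plain Python. It propagates the maximum value through a graph in rounds ("supersteps"): each vertex reads its incoming messages, updates its value if it saw a larger one, and forwards the new value to its neighbors. The function name `pregel_max` and the dictionary-based graph format are my own assumptions for illustration; this is not the API of Pregel, Giraph, or any other library, and it ignores everything that makes the real systems distributed.

```python
def pregel_max(graph, values, max_supersteps=50):
    """Toy vertex-centric max propagation (illustrative only).

    graph:  dict mapping each vertex to a list of its out-neighbors
            (every neighbor must also be a key of the dict)
    values: dict mapping each vertex to its initial value
    """
    values = dict(values)  # work on a copy

    # Superstep 0: every vertex sends its value to its out-neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])

    supersteps = 1
    # Keep going while any vertex still has messages to process.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            best = max(messages)
            if best > values[v]:
                # Value changed: update and tell the neighbors.
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
            # Otherwise the vertex sends nothing (it "votes to halt").
        inbox = outbox
        supersteps += 1

    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
    values = {"a": 3, "b": 6, "c": 2, "d": 1}
    print(pregel_max(graph, values))
```

The loop stops once no messages are in flight, which mirrors the idea that a vertex-centric computation ends when every vertex has halted. In a real engine the vertices are partitioned across machines and the messages travel over the network between supersteps.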