This is a continuation of the resources I listed in part 1:
80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 1
Tariq Abughofa ・ Dec 22 '19 ・ 5 min read
This part includes the following four categories:
- Machine Learning & Algorithms in Big Data
- Data Processing Systems
- Real-time Processing
- Graph Processing
Machine Learning and Algorithms in Big Data
Recommending items to more than a billion people: An article about collaborative filtering at Facebook.
Machine Learning with Sparkling Water: using H2O, the machine learning framework, with Apache Spark.
MLlib: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.
TensorFlow: the famous large-scale machine learning library.
Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendations of Netflix movies.
Data Processing Systems
Airflow: a workflow management system by Airbnb.
Oozie: a workflow management system for Hadoop by Yahoo!.
BlinkDB: analytics on large-scale data from Berkeley.
FlumeJava: a library for developing parallel data pipelines from Google.
MapReduce: the Google framework that inspired Hadoop.
Pig: an engine that supports Pig Latin, a procedural dataflow language for Hadoop, from Yahoo!.
Hive (resource#2): A data warehouse on top of Hadoop.
The Dataflow Model: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.
MillWheel: stream processing engine from Google.
Photon: A tool to join data streams at Google.
Kinesis: stream processing engine from Amazon.
Apache Flink (resource#2): stream and batch processing engine from TU Berlin.
Trill: incremental data analytics engine from Microsoft.
Kafka: the famous distributed messaging system from LinkedIn.
Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. Since the release of Spark 2, it has moved to Structured Streaming (resource#2) (3) (4), and the Spark SQL library allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.
SparkR: a Spark library for writing processing applications in R.
GraphX (resource#2): distributed graph processing with Spark's RDDs.
GraphFrames: distributed graph processing with Spark's DataFrames.
SnappyData (resource#2): a transactional datastore on top of Spark.
Real-time Processing
Samza (resource#2) (3) (4): Stream processing engine from LinkedIn.
Storm: real-time data processing engine from Twitter.
Heron: the successor to Storm from Twitter.
Real-time data processing at Facebook.
Pulsar: real-time data processing engine from eBay.
Graph Processing
WTF: the "Who to Follow" service at Twitter.
GraphJet: real-time recommendation graph engine at Twitter.
Pregel: large-scale graph processing engine at Google.
Giraph: open-source implementation of Pregel by Facebook.