Tariq Abughofa

Posted on Dec 29, 2019 • Originally published at rabbitoncode.com

80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 2

#datascience #resources #distributedsystems #machinelearning

This is a continuation of the resources I listed in part 1:

80+ Free Big Data Resources to Satisfy Your Knowledge Appetite - part 1

This part includes the following four categories:

Machine Learning and Algorithms in Big Data
Data Processing Systems
Real-time Processing
Graph Processing

Machine Learning and Algorithms in Big Data

Recommending items to more than a billion people: An article about collaborative filtering at Facebook.

Machine Learning with Sparkling Water: using H2O, the machine learning framework, with Apache Spark.

MLlib: Scalable Machine Learning library on Apache Spark from Stanford/Databricks.

TensorFlow: the famous large-scale machine learning library.

Large-scale parallel collaborative filtering for the Netflix prize: an algorithm for large-scale recommendation of Netflix movies.

Data Processing Systems

Airflow: a workflow management system by Airbnb.

Oozie: a workflow management system for Hadoop by Yahoo!.

BlinkDB: approximate query analytics on large-scale data, from Berkeley.

FlumeJava: a library for developing parallel data pipelines from Google.

MapReduce: the Google framework behind Hadoop.

Pig: an engine that supports Pig Latin, a procedural dataflow language for Hadoop, from Yahoo.

Hive (resource#2): A data warehouse on top of Hadoop.

The Dataflow Model: the model behind Google Cloud Dataflow which provides simplified stream and batch processing.

MillWheel: stream processing engine from Google.

Photon: A tool to join data streams at Google.

Kinesis: stream processing engine from Amazon.

Apache Flink (resource#2): stream and batch processing engine from TU Berlin.

Trill: incremental data analytics engine from Microsoft.

Kafka: the famous distributed messaging system from LinkedIn.

Apache Spark: the famous stream and batch processing engine. It uses distributed memory abstractions: RDDs, DataFrames, and Datasets. Spark 2 introduced Structured Streaming (resource#2) (3) (4), and the Spark SQL library allows SQL queries over Spark DataFrames. The whole Databricks blog is a great resource for the project.

SparkR: a Spark library for writing processing applications in R.

GraphX (resource#2): distributed graph processing with Spark's RDDs.

GraphFrames: distributed graph processing with Spark's DataFrames.

SnappyData (resource#2): a transactional datastore on top of Spark.
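If you have not met the MapReduce model from the list above before, here is a rough single-machine sketch of its three phases (map, shuffle, reduce) in plain Python. It is only an illustration of the idea: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this post and are not the Hadoop or Google API, and nothing here is distributed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_phase(documents: Iterable[str], mapper: Callable[[str], List[Pair]]) -> List[Pair]:
    """Apply the mapper to every input record and collect the emitted (key, value) pairs."""
    pairs: List[Pair] = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all values by key, which is what the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]], reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the reducer once per key to combine that key's values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(doc: str) -> List[Pair]:
    # Emit (word, 1) for every whitespace-separated word, lowercased.
    return [(word.lower(), 1) for word in doc.split()]


def word_count_reducer(word: str, counts: List[int]) -> int:
    # Sum the ones emitted for this word.
    return sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper that chains the three phases for the classic word count."""
    return reduce_phase(shuffle(map_phase(documents, word_count_mapper)), word_count_reducer)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; the linked paper covers those details.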

Real-time Processing

Samza (resource#2) (3) (4): Stream processing engine from LinkedIn.

Storm: real-time data processing engine from Twitter.

Heron: the new Storm from Twitter.

Real-time data processing at Facebook.

Pulsar: real-time data processing engine from eBay.

Graph Processing

WTF: the Who to Follow service at Twitter.

GraphJet: real-time recommendation graph engine at Twitter.

Pregel: large-scale graph processing engine at Google.

Giraph: open source implementation of Pregel by Facebook.
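To give a feel for the vertex-centric style that Pregel and Giraph are known for, here is a small single-machine sketch in plain Python. It propagates the maximum vertex value through a graph in supersteps: a vertex that receives a larger value adopts it and forwards it, and the run stops when no messages are in flight. The function name `pregel_max` and the dictionary-based graph format are assumptions made for this illustration, not the API of either system.

```python
from typing import Dict, List


def pregel_max(graph: Dict[str, List[str]], values: Dict[str, int],
               max_supersteps: int = 50) -> Dict[str, int]:
    """Propagate the largest value along edges, one superstep at a time.

    `graph` maps every vertex to its out-neighbours; each neighbour must
    also appear as a key. `values` holds the initial value of each vertex.
    """
    values = dict(values)  # do not mutate the caller's dictionary

    # Superstep 0: every vertex sends its value to its out-neighbours.
    inbox: Dict[str, List[int]] = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])

    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no messages in flight: every vertex has halted

        outbox: Dict[str, List[int]] = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # vertex stays inactive this superstep
            best = max(messages)
            if best > values[vertex]:
                # The value improved, so tell the neighbours about it.
                values[vertex] = best
                for n in graph[vertex]:
                    outbox[n].append(best)
        inbox = outbox  # messages are delivered at the next superstep

    return values
```

For example, calling `pregel_max({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, {"a": 3, "b": 6, "c": 1})` leaves every vertex holding 6, since all three vertices can reach one another.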
