DEV Community

Amazon EMR Summary

What is Amazon EMR

EMR stands for Elastic MapReduce. It helps you create Hadoop clusters for big data processing on AWS.

This allows you to analyze and process vast amounts of data.
Any time you see anything related to big data clusters or Hadoop clusters, think Amazon EMR.
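
The "MapReduce" in the name refers to a programming model in which data is mapped to key/value pairs, grouped by key, and then reduced to a result per key. Here is a minimal single-machine sketch of that idea as a word count. It is only an illustration: the function names are made up for this example, and a real Hadoop or EMR job would distribute these phases across many nodes.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Sample input invented for this illustration.
lines = ["big data on AWS", "big clusters on EMR"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```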

These clusters have to be provisioned and they can be made of hundreds of EC2 instances.

Why would you use EMR?

EMR comes bundled with a lot of tools that big data specialists use.
Examples include Apache Spark, HBase, Presto, and Apache Flink.
These tools are very difficult to set up yourself, so Amazon EMR takes care of all the provisioning and configuration of these services for you.
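
To give a feel for how little you have to specify, here is a rough sketch of the application part of a cluster request, in the shape that boto3's `run_job_flow` call accepts through its `ReleaseLabel` and `Applications` parameters. The helper name and the release label are assumptions made for this example; check which release and application names are available for your account and region.

```python
# Assumed release label for illustration only; pick one that exists in your region.
DEFAULT_RELEASE = "emr-6.15.0"


def build_application_config(app_names, release_label=DEFAULT_RELEASE):
    """Hypothetical helper: return the ReleaseLabel and Applications fields
    of an EMR cluster request for the given application names."""
    if not app_names:
        raise ValueError("at least one application name is required")
    return {
        "ReleaseLabel": release_label,
        "Applications": [{"Name": name} for name in app_names],
    }


# The tools mentioned above, requested by name.
config = build_application_config(["Spark", "HBase", "Presto", "Flink"])
print(config)
```

This dictionary is only a fragment. A full request also needs a cluster name, instance settings, and IAM roles before it can be passed to `run_job_flow`.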

You can also auto-scale your entire cluster, and EMR is integrated with Spot Instances so you can benefit from price reductions.
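
As a sketch of what auto-scaling limits can look like, the snippet below builds a managed scaling policy in the shape boto3 expects for the `ManagedScalingPolicy` parameter (`ComputeLimits` with a unit type and minimum/maximum capacity). The helper name and the numbers are made up for this example, and this is just one way to configure scaling, not the only one.

```python
def build_managed_scaling_policy(min_units, max_units, max_on_demand_units):
    """Hypothetical helper: return a managed scaling policy dict.

    Capacity above max_on_demand_units (up to max_units) is left for
    Spot Instances to fill.
    """
    if not 0 < min_units <= max_units:
        raise ValueError("need 0 < min_units <= max_units")
    if not 0 <= max_on_demand_units <= max_units:
        raise ValueError("need 0 <= max_on_demand_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",  # count capacity in whole instances
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
        }
    }


# Illustrative numbers: scale between 2 and 10 instances, at most 4 On-Demand.
policy = build_managed_scaling_policy(2, 10, 4)
print(policy)
```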

The use cases of Amazon EMR

You can use Amazon EMR for data processing, machine learning, web indexing, and other big data workloads, all built on big data technologies such as Hadoop, Spark, HBase, Presto, Flink, and so on.

Amazon EMR Components

Amazon EMR is made of clusters of EC2 instances, and there are different kinds of nodes.

Master Node
The Master Node manages the cluster. It coordinates the other nodes and monitors their health, and it must be long-running.

Core Node
Core Nodes run tasks and also store data, and they must be long-running as well.

Task Node
Task Nodes are there just to run tasks. You can usually use Spot Instances for them, and using Task Nodes is optional.
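
The three node types above map onto instance groups in a cluster request. Here is a minimal sketch, assuming boto3's `InstanceGroups` layout (`InstanceRole` of `MASTER`, `CORE`, or `TASK`, and `Market` of `ON_DEMAND` or `SPOT`). The helper name, group names, instance type, and counts are placeholders chosen for this example.

```python
def build_instance_groups(core_count, task_count=0, instance_type="m5.xlarge"):
    """Hypothetical helper: describe master, core, and optional task groups."""
    if core_count < 1:
        raise ValueError("this sketch expects at least one core node")
    groups = [
        {   # Manages the cluster; long-running, so kept On-Demand.
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {   # Runs tasks and stores data; also long-running.
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {   # Only runs tasks, so Spot is a common choice; optional.
                "Name": "Task",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
print([(g["InstanceRole"], g["Market"], g["InstanceCount"]) for g in groups])
```

Leaving `task_count` at 0 drops the task group entirely, which matches the point that task nodes are optional.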


Top comments (1)

nowsathk

I have compiled a list of common errors and their corresponding solutions encountered during the setup of EMR Clusters with DynamoDB.
