Easy Distributed Processing Using Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce (EMR)

EMR is a managed service that makes it easy to run Hadoop, Hive, and Apache Spark on AWS, so that big data can be processed in a distributed fashion. Hadoop is a framework that distributes MapReduce processing across multiple machines. Hive is a mechanism that lets Hadoop be operated through SQL, while Spark is an alternative to the MapReduce layer that aims to speed things up by working in memory. Although it serves a different purpose, Amazon Redshift generally seems to be faster, and also more expensive.
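To make the MapReduce idea concrete, here is a minimal single-machine sketch of the word-count pattern that Hadoop distributes across many machines. The map, shuffle, and reduce phases below are plain Python functions written for illustration, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group values by key (Hadoop does this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data on aws", "big data with hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle_phase(pairs))
print(result["big"])  # 2
print(result["aws"])  # 1
```

In Hadoop the map and reduce functions run on different machines and the shuffle moves data between them over the network; the logic, however, is exactly this.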

Because it is a cloud service, a cluster of virtual servers can be launched in just a few minutes, and the number of servers that make up the cluster can be adjusted to match the demand for computing capacity. Since billing is hourly, even when you want to process a large amount of data quickly in a short period, you can respond with good cost performance.

In addition, EMR cooperates with other AWS services, and data stored in S3, RDS, and DynamoDB can be accessed from the cluster. In particular, S3 lets you keep data separate from HDFS, so data can be saved without having to be conscious of the cluster's lifetime.

Benefits of Using Elastic MapReduce (EMR)

No Need To Construct Environment – To use Hadoop, CDH, MapR, and so on, you normally have to install the software and work through various settings yourself. With EMR, by contrast, no installation is necessary because everything is provided in a well-prepared state. Just launch EMR from the AWS GUI console or the command line: EC2 instances are started behind the scenes, Java and Hadoop are installed, and the various settings are applied properly.
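Launching from code looks like the following minimal sketch, assuming the boto3 SDK, the default EMR roles, and a hypothetical log bucket. Only the configuration dictionary is built here; the actual API call is shown commented out because it requires AWS credentials.

```python
# A sketch of launching an EMR cluster via the AWS SDK (boto3).
# "example-bucket" is a hypothetical placeholder.
cluster_config = {
    "Name": "example-cluster",
    "ReleaseLabel": "emr-6.15.0",  # an EMR release; pick one current for you
    "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,  # one master plus two worker nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when work finishes
    },
    "LogUri": "s3://example-bucket/emr-logs/",  # hypothetical bucket
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With credentials configured, the cluster would be started like this:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**cluster_config)
# print(response["JobFlowId"])
```

Everything the article describes — starting EC2 instances, installing Hadoop, applying settings — happens behind this one call.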

No Operation Required – This can be said of AWS as a whole rather than of EMR alone, but freedom from the burden of operations is also a great merit. When you run Hadoop on your own servers, the possibility that a particular server malfunctions or an HDD fails is inevitable once many machines are in use. Each such problem normally has to be dealt with individually, but with EMR you can simply stop the affected node and set up a new one in its place.

No Need to Apply Fixes or Upgrades – EMR automatically verifies the latest fixes and new versions of Hadoop itself, so even without being conscious of it the user can keep the latest stable environment.

What happens when Elastic MapReduce (EMR) starts?

  • EC2 instances start up
  • The Hadoop execution environment is automatically installed
  • Input data placed in S3 is read and processing begins
  • (Some processing is executed)
  • Upon completion of processing, the result is automatically uploaded to S3
  • The EC2 instances terminate
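The lifecycle above can be sketched as an EMR "step" definition: input is read from S3, processed, and the result written back to S3. This sketch assumes boto3 and hypothetical bucket names; Hadoop streaming invoked through command-runner.jar is one common way to wire S3 input and output.

```python
# A step definition mirroring the lifecycle: S3 in, process, S3 out.
# "example-bucket" and the mapper/reducer scripts are hypothetical.
word_count_step = {
    "Name": "word-count",
    "ActionOnFailure": "TERMINATE_CLUSTER",  # shut the cluster down on failure
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-input", "s3://example-bucket/input/",    # data placed in S3
            "-output", "s3://example-bucket/output/",  # result uploaded to S3
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
        ],
    },
}

# With a running cluster, the step would be submitted like this:
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[word_count_step])
```

Because input and output both live in S3, the data outlives the cluster: the EC2 instances can terminate as soon as the step completes.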

Amazon EMR not only integrates Hadoop with AWS but also organizes distributed processing around concepts of its own: nodes and steps.

Three types of nodes are defined:

Master node – This is the node that manages the cluster, and only one exists in each cluster.
It monitors the status of each task and the state of the instance groups, and performs management so as to maintain the correct state.

Core node – This is a node mapped to a Hadoop slave node; it executes tasks and stores data using the Hadoop Distributed File System (HDFS).

Task node – This is a node mapped to a Hadoop slave node that runs tasks only; unlike a core node, it does not store HDFS data.
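The three node types correspond to the instance groups specified when creating a cluster. The following sketch shows that mapping; the instance types and counts are illustrative assumptions.

```python
# The three EMR node types expressed as instance groups.
# Instance types and counts here are illustrative, not recommendations.
instance_groups = [
    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
     "InstanceCount": 1},  # exactly one per cluster; manages the others
    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
     "InstanceCount": 2},  # run tasks and store HDFS data
    {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
     "InstanceCount": 4},  # run tasks only; no HDFS, so safe to add/remove
]

masters = [g for g in instance_groups if g["InstanceRole"] == "MASTER"]
print(len(masters))  # 1
```

Because task nodes hold no HDFS data, they are the natural group to grow or shrink when computing demand changes.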



EMR (Elastic MapReduce) makes it possible to set up the otherwise troublesome Hadoop environment and install each application with a single button, and distributed processing can then be executed easily, for example with SQL-like queries. Furthermore, an advantage of EMR is that you can run as many instances as necessary, when necessary, which is very advantageous for scaling.