
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs

EMR bootstrap actions provide a mechanism to configure the EC2 instances before our MapReduce computations run. Typical uses include supplying custom Hadoop configurations, installing dependent software, and distributing a common dataset. Amazon provides a set of predefined bootstrap actions and also allows us to write our own custom ones. EMR runs the bootstrap actions on each instance before the Hadoop cluster services are started.

In this recipe, we are going to use a stop words list to filter out the common words from our WordCount sample. We download the stop words list to the workers using a custom bootstrap action.
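To make the filtering idea concrete, here is a minimal sketch of a stop-word filter that could run over a stream of words, one per line. The function name filter_stopwords is hypothetical, and the sketch assumes the stop words file is a comma-separated list; check the format of the file you actually download before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of the recipe's WordCount code):
# reads words from stdin, one per line, and prints only those that
# do not appear in the stop words file given as the first argument.
# Assumption: the stop words file is a comma-separated list.
filter_stopwords() {
  stopwords_file="$1"
  awk -v list="$stopwords_file" '
    BEGIN {
      # Load every comma-separated entry into a lookup table.
      while ((getline line < list) > 0) {
        gsub(/\r/, "", line)              # tolerate Windows line endings
        n = split(line, w, ",")
        for (i = 1; i <= n; i++)
          if (w[i] != "") stop[tolower(w[i])] = 1
      }
    }
    # Print a word only when its lowercase form is not a stop word.
    !(tolower($0) in stop)
  '
}
```

A mapper script could then pipe its tokenized input through something like filter_stopwords /home/hadoop/stopwords/common-english-words-with-contractions.txt before emitting counts.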

How to do it...

The following steps show you how to download a file to all the EC2 instances of an EMR computation using a bootstrap script:

  1. Save the following script to a file named download-stopwords.sh and upload the file to a bucket in Amazon S3. This custom bootstrap script downloads a stop words list to each instance and moves it to a pre-designated directory inside the instance.
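    #!/bin/bash
    # Abort the bootstrap action if any command fails.
    set -e
    # Download the stop words list into the current directory.
    wget http://www.textfixer.com/resources/common-english-words-with-contractions.txt
    # Create the target directory (no error if it already exists).
    mkdir -p /home/hadoop/stopwords
    mv common-english-words-with-contractions.txt /home/hadoop/stopwords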
    
  2. Complete steps 1 to 10 of the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe in this chapter.
  3. Select the Add Bootstrap Actions option in the Bootstrap Actions tab. Select Custom Action in the Add Bootstrap Actions drop-down box. Click on Configure and add. Give a name to your action in the Name textbox and provide the S3 path of the location where you uploaded the download-stopwords.sh file in the S3 location textbox. Click on Add.
  4. Add Steps if needed.
  5. Click on the Create Cluster button to launch instances and to deploy the MapReduce cluster.
  6. Click on Refresh in the EMR console and go to your Cluster Details page to view the details of the cluster.
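The console steps above can also be approximated from the AWS CLI. The following is a sketch only: it prints an aws emr create-cluster command rather than running it, so that you can review it first. The function name, cluster name, release label, instance type, instance count, and action name are all placeholders that are not part of the original recipe; adjust them to your account and EMR version.

```shell
#!/bin/bash
# Print (do not run) an `aws emr create-cluster` command that registers
# download-stopwords.sh as a custom bootstrap action.
# Every value below is a placeholder chosen for illustration.
print_create_cluster_cmd() {
  script_s3_path="$1"   # S3 path where download-stopwords.sh was uploaded
  echo aws emr create-cluster \
    --name "stopwords-wordcount" \
    --release-label "emr-5.36.0" \
    --applications Name=Hadoop \
    --instance-type "m5.xlarge" \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions "Path=${script_s3_path},Name=DownloadStopWords"
}
```

Calling print_create_cluster_cmd with the S3 path of your uploaded script prints a single command line. Once the values look right, it can be pasted into a shell that has AWS credentials configured.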

There's more...

Amazon provides us with the following predefined bootstrap actions:

  • configure-daemons: This allows us to set Java Virtual Machine (JVM) options for the Hadoop daemons, such as the heap size and garbage collection behavior.
  • configure-hadoop: This allows us to modify the Hadoop configuration settings. Either we can upload a Hadoop configuration XML or we can specify individual configuration options as key-value pairs.
  • memory-intensive: This allows us to configure the Hadoop cluster for memory-intensive workloads.
  • run-if: This allows us to run a bootstrap action conditionally, based on a property of the instance. For example, it can be used to run a command only on the Hadoop master node.
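A custom bootstrap script can make a similar master-only decision itself. The sketch below assumes that EMR writes instance metadata to /mnt/var/lib/info/instance.json and that the file contains an isMaster field; verify both on your EMR version. The function name is_master_node is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) only on the master node.
# Assumption: instance metadata lives in /mnt/var/lib/info/instance.json
# and contains a line such as "isMaster": true.
is_master_node() {
  info_file="${1:-/mnt/var/lib/info/instance.json}"
  grep isMaster "$info_file" 2>/dev/null | grep -q true
}

# Example use inside a bootstrap script:
#   if is_master_node; then
#     echo "Running master-only setup"
#   fi
```

The optional argument exists only so that the check can be tried against a local copy of the metadata file; on a cluster it would normally be called without arguments.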

You can also create shutdown actions by writing scripts to a designated directory in the instance. Shutdown actions are executed after the job flow is terminated.
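As a rough illustration, a bootstrap script could register a shutdown action as follows. This sketch assumes the designated directory is /mnt/var/lib/instance-controller/public/shutdown-actions, which you should confirm in the EMR documentation for your version. The function name and the cleanup-stopwords.sh script are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: writes a small cleanup script into the shutdown
# actions directory so that it runs when the cluster shuts down.
# Assumption: the default directory below is the one EMR scans.
install_shutdown_action() {
  action_dir="${1:-/mnt/var/lib/instance-controller/public/shutdown-actions}"
  mkdir -p "$action_dir"
  cat > "$action_dir/cleanup-stopwords.sh" <<'EOF'
#!/bin/bash
# Remove the stop words downloaded by the bootstrap action.
rm -rf /home/hadoop/stopwords
EOF
  chmod +x "$action_dir/cleanup-stopwords.sh"
}
```

Calling install_shutdown_action at the end of download-stopwords.sh would pair the download with its cleanup; the optional directory argument is there only to allow a dry run in a scratch location.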

Refer to http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html for more information.
