Distcp usage

In Hadoop, we deal with large volumes of data, so a simple single-process copy is often not the best choice. Imagine copying a 1 TB file from one cluster to another, or to a different path within the same cluster, and the operation timing out when it is 50% complete. In that situation, the copy has to be restarted from the beginning.

Getting ready

This recipe shows the steps needed to copy files within a cluster and across clusters. Ensure that you have a running cluster with YARN configured to run MapReduce, as discussed in Chapter 1, Hadoop Architecture and Deployment.

For this recipe, no additional configuration is needed to run Distcp; just make sure HDFS and YARN are up and running.

How to do it...

  1. Connect over SSH to the NameNode or an edge node and execute the following command to copy the /projects directory to a new directory, /new:
    $ hadoop distcp /projects /new
    
  2. The preceding command will submit a MapReduce job to the cluster, and once the job finishes we can see the data copied at the destination.
  3. We can also perform an incremental copy, which transfers only the files that are missing or have changed at the destination, by adding Distcp's -update option to the command.
  4. The copy can be performed across clusters as a backup, or simply to move data from one cluster to another:
    $ hadoop distcp hdfs://master1.cyrus.com:9000/projects hdfs://nn1.cluster1.com:9000/projects
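
Step 3 describes an incremental copy. The following is a minimal sketch of what such a command could look like, assuming the same /projects and /new paths as in step 1. The helper name incremental_copy_cmd is hypothetical and only prints the command so that it can be reviewed before being run on a cluster node; -update is a standard Distcp option.

```shell
# Hypothetical helper for this example: prints the Distcp command for an
# incremental copy instead of running it, so it can be reviewed first.
incremental_copy_cmd() {
  src="$1"
  dst="$2"
  # -update copies only files that are missing at the destination
  # or that differ from the source.
  echo "hadoop distcp -update $src $dst"
}

# Preview the command for the paths used in step 1 (assumed here).
incremental_copy_cmd /projects /new
```

Note that with -update, Distcp copies the contents of the source directory into the destination rather than the source directory itself, so it is worth checking the resulting layout on a small directory first.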
    

How it works...

Distcp is similar to a sync tool, but it works in a distributed manner. Rather than using just one node, it uses multiple nodes in the cluster, each copying part of the data. Because the copy runs as a MapReduce job, failed tasks are retried automatically by the framework.
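
Because the copy is spread across map tasks, the degree of parallelism can be capped with Distcp's -m option. The sketch below is illustrative only: the helper name distcp_with_maps_cmd is hypothetical, the value 10 is an arbitrary assumption, and the helper prints the command rather than submitting a job.

```shell
# Hypothetical helper for this example: prints a Distcp command that caps
# the number of map tasks used for the copy.
distcp_with_maps_cmd() {
  maps="$1"
  src="$2"
  dst="$3"
  # -m sets the maximum number of map tasks (simultaneous copies).
  echo "hadoop distcp -m $maps $src $dst"
}

# Preview a copy limited to 10 map tasks; 10 is an assumed value.
distcp_with_maps_cmd 10 /projects /new
```

A lower value reduces the load the copy places on the cluster, at the cost of a longer run time.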
