- Hadoop Beginner's Guide
- Garry Turkington
- 2021-07-29 16:51:41
Using languages other than Java with Hadoop
We have mentioned previously that MapReduce programs don't have to be written in Java. Most programs are written in Java, but there are several reasons why you may want or need to write your map and reduce tasks in another language. Perhaps you have existing code to leverage or need to use third-party binaries—the reasons are varied and valid.
Hadoop provides a number of mechanisms to aid non-Java development. Primary among these are Hadoop Pipes, which provides a native C++ interface to Hadoop, and Hadoop Streaming, which allows any program that uses standard input and output to be used for map and reduce tasks. We will use Hadoop Streaming heavily in this chapter.
How Hadoop Streaming works
With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface but is by definition Java specific.
Hadoop Streaming takes a different approach. With Streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.
Any program that reads from standard input and writes to standard output can be used in Streaming, such as compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Ruby or Python.
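To make the standard input and output contract concrete, here is a minimal word-count sketch in Ruby. It is an illustration under stated assumptions rather than the chapter's own example: the function names run_mapper and run_reducer are hypothetical, and it relies on the Streaming defaults that each output line is split into key and value at the first tab and that the reduce task receives its input sorted by key.

```ruby
require 'stringio'

# Map step: read lines from the input stream and emit "word<TAB>1" per word.
# (run_mapper is a hypothetical name chosen for this sketch.)
def run_mapper(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reduce step: the input is assumed to arrive sorted by key, so all lines
# for one word are adjacent and can be summed in a single pass.
def run_reducer(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current
      # A new key starts: flush the total of the previous key, if any.
      output.puts "#{current}\t#{total}" unless current.nil?
      current = key
      total = 0
    end
    total += value.to_i
  end
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Local check without a cluster: StringIO stands in for the standard streams,
# and an in-memory sort imitates the sort that happens between map and reduce.
mapped = StringIO.new
run_mapper(StringIO.new("a b a\n"), mapped)
sorted = mapped.string.lines.sort.join
reduced = StringIO.new
run_reducer(StringIO.new(sorted), reduced)
puts reduced.string
```

In an actual Streaming job, each function would typically live in its own script and be called with $stdin and $stdout; passing the streams as arguments simply makes the logic easy to exercise locally.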
Why to use Hadoop Streaming
The biggest advantage to Streaming is that it can allow you to try ideas and iterate on them more quickly than using Java. Instead of a compile/jar/submit cycle, you just write the scripts and pass them as arguments to the Streaming jar file. Especially when doing initial analysis on a new dataset or trying out new ideas, this can significantly speed up development.
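As a rough sketch of what passing the scripts to the Streaming jar looks like, the Ruby helper below assembles the argument list for a job submission. The helper name streaming_command and every path in the example are hypothetical placeholders; in particular, the location of the Streaming jar depends on the Hadoop version and installation. The -file, -mapper, -reducer, -input, and -output options are standard Streaming options.

```ruby
# Build the argument list for submitting a Streaming job.
# streaming_command is a hypothetical helper written for this sketch.
def streaming_command(jar:, input:, output:, mapper:, reducer:)
  [
    'hadoop', 'jar', jar,
    # -file ships a local script to the cluster; the task then runs it by name.
    '-file', mapper,  '-mapper',  File.basename(mapper),
    '-file', reducer, '-reducer', File.basename(reducer),
    '-input', input,
    '-output', output
  ]
end

# All paths below are placeholders, not real locations.
cmd = streaming_command(
  jar: '/path/to/hadoop-streaming.jar',
  input: 'input_dir',
  output: 'output_dir',
  mapper: 'scripts/map.rb',
  reducer: 'scripts/reduce.rb'
)
puts cmd.join(' ')
```

Changing a script and resubmitting needs no compile or jar step. To actually submit the job, the list could be handed to Kernel#system as system(*cmd); the scripts would need to be executable and start with a suitable interpreter line.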
The classic debate regarding dynamic versus static languages balances the benefits of swift development against runtime performance and type checking. The downsides of dynamic languages, namely slower execution and the lack of compile-time type checking, also apply when using Streaming. Consequently, we favor Streaming for up-front analysis and Java for implementing jobs that will run on the production cluster.
We will use Ruby for Streaming examples in this chapter, but that is a personal preference. If you prefer shell scripting or another language, such as Python, then take the opportunity to convert the scripts used here into the language of your choice.