Hadoop MapReduce v2 Cookbook (Second Edition)
Latest chapter:
Index
If you are a Big Data enthusiast and wish to use Hadoop v2 to solve your problems, then this book is for you. This book is for Java programmers with little to moderate knowledge of Hadoop MapReduce. This is also a one-stop reference for developers and system admins who want to quickly get up to speed with using Hadoop v2. It would be helpful to have a basic knowledge of software development using Java and a basic working knowledge of Linux.
Table of Contents (129 chapters)
- coverpage
- Hadoop MapReduce v2 Cookbook Second Edition
- Credits
- About the Author
- Acknowledgments
- About the Author
- About the Reviewers
- www.PacktPub.com
- Support files, eBooks, discount offers, and more
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Chapter 1. Getting Started with Hadoop v2
- Introduction
- Setting up Hadoop v2 on your local machine
- Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
- Adding a combiner step to the WordCount MapReduce program
- Setting up HDFS
- Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
- Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
- HDFS command-line file operations
- Running the WordCount program in a distributed cluster environment
- Benchmarking HDFS using DFSIO
- Benchmarking Hadoop MapReduce using TeraSort
- Chapter 2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
- Introduction
- Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
- Saving money using Amazon EC2 Spot Instances to execute EMR job flows
- Executing a Pig script using EMR
- Executing a Hive script using EMR
- Creating an Amazon EMR job flow using the AWS Command Line Interface
- Deploying an Apache HBase cluster on Amazon EC2 using EMR
- Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
- Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
- Chapter 3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
- Introduction
- Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
- Shared user Hadoop clusters – using Fair and Capacity schedulers
- Setting classpath precedence to user-provided JARs
- Speculative execution of straggling tasks
- Unit testing Hadoop MapReduce applications using MRUnit
- Integration testing Hadoop MapReduce applications using MiniYarnCluster
- Adding a new DataNode
- Decommissioning DataNodes
- Using multiple disks/volumes and limiting HDFS disk usage
- Setting the HDFS block size
- Setting the file replication factor
- Using the HDFS Java API
- Chapter 4. Developing Complex Hadoop MapReduce Applications
- Introduction
- Choosing appropriate Hadoop data types
- Implementing a custom Hadoop Writable data type
- Implementing a custom Hadoop key type
- Emitting data of different value types from a Mapper
- Choosing a suitable Hadoop InputFormat for your input data format
- Adding support for new input data formats – implementing a custom InputFormat
- Formatting the results of MapReduce computations – using Hadoop OutputFormats
- Writing multiple outputs from a MapReduce computation
- Hadoop intermediate data partitioning
- Secondary sorting – sorting Reduce input values
- Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
- Using Hadoop with legacy applications – Hadoop streaming
- Adding dependencies between MapReduce jobs
- Hadoop counters to report custom metrics
- Chapter 5. Analytics
- Introduction
- Simple analytics using MapReduce
- Performing GROUP BY using MapReduce
- Calculating frequency distributions and sorting using MapReduce
- Plotting the Hadoop MapReduce results using gnuplot
- Calculating histograms using MapReduce
- Calculating Scatter plots using MapReduce
- Parsing a complex dataset with Hadoop
- Joining two datasets using MapReduce
- Chapter 6. Hadoop Ecosystem – Apache Hive
- Introduction
- Getting started with Apache Hive
- Creating databases and tables using Hive CLI
- Simple SQL-style data querying using Apache Hive
- Creating and populating Hive tables and views using Hive query results
- Utilizing different storage formats in Hive – storing table data using ORC files
- Using Hive built-in functions
- Hive batch mode – using a query file
- Performing a join with Hive
- Creating partitioned Hive tables
- Writing Hive User-defined Functions (UDF)
- HCatalog – performing Java MapReduce computations on data mapped to Hive tables
- HCatalog – writing data to Hive tables from Java MapReduce computations
- Chapter 7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
- Introduction
- Getting started with Apache Pig
- Joining two datasets using Pig
- Accessing Hive table data in Pig using HCatalog
- Getting started with Apache HBase
- Data random access using Java client APIs
- Running MapReduce jobs on HBase
- Using Hive to insert data into HBase tables
- Getting started with Apache Mahout
- Running K-means with Mahout
- Importing data to HDFS from a relational database using Apache Sqoop
- Exporting data from HDFS to a relational database using Apache Sqoop
- Chapter 8. Searching and Indexing
- Introduction
- Generating an inverted index using Hadoop MapReduce
- Intradomain web crawling using Apache Nutch
- Indexing and searching web documents using Apache Solr
- Configuring Apache HBase as the backend data store for Apache Nutch
- Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
- Elasticsearch for indexing and searching
- Generating the in-links graph for crawled web pages
- Chapter 9. Classifications, Recommendations, and Finding Relationships
- Introduction
- Performing content-based recommendations
- Classification using the naïve Bayes classifier
- Assigning advertisements to keywords using the Adwords balance algorithm
- Chapter 10. Mass Text Data Processing
- Introduction
- Data preprocessing using Hadoop streaming and Python
- De-duplicating data using Hadoop streaming
- Loading large datasets to an Apache HBase data store – importtsv and bulkload
- Creating TF and TF-IDF vectors for the text data
- Clustering text data using Apache Mahout
- Topic discovery using Latent Dirichlet Allocation (LDA)
- Document classification using Mahout Naive Bayes Classifier
- Index