Big Data Analytics with Hadoop 3
By Sridhar Alla
Updated: 2021-06-25 21:27:11
Latest chapter: Summary
Big Data Analytics with Hadoop 3 is for you if you are looking to build high-performance analytics solutions for your enterprise or business using Hadoop 3’s powerful features, or you’re new to big data analytics. A basic understanding of the Java programming language is required.
Brand: 中圖公司
Listed: 2021-06-25 20:48:41
Publisher: Packt Publishing
Digital rights for this book are provided by 中圖公司, which has authorized Shanghai Yuewen Information Technology Co., Ltd. to produce and distribute it.
- Cover
- Copyright information
- Packt Upsell
- Why subscribe?
- PacktPub.com
- Contributors
- About the author
- About the reviewers
- Packt is searching for authors like you
- Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Reviews
- Introduction to Hadoop
- Hadoop Distributed File System
- High availability
- Intra-DataNode balancer
- Erasure coding
- Port numbers
- MapReduce framework
- Task-level native optimization
- YARN
- Opportunistic containers
- Types of container execution
- YARN timeline service v.2
- Enhancing scalability and reliability
- Usability improvements
- Architecture
- Other changes
- Minimum required Java version
- Shell script rewrite
- Shaded-client JARs
- Installing Hadoop 3
- Prerequisites
- Downloading
- Installation
- Setup password-less ssh
- Setting up the NameNode
- Starting HDFS
- Setting up the YARN service
- Erasure Coding
- Intra-DataNode balancer
- Installing YARN timeline service v.2
- Setting up the HBase cluster
- Simple deployment for HBase
- Enabling the co-processor
- Enabling timeline service v.2
- Running timeline service v.2
- Enabling MapReduce to write to timeline service v.2
- Summary
- Overview of Big Data Analytics
- Introduction to data analytics
- Inside the data analytics process
- Introduction to big data
- Variety of data
- Velocity of data
- Volume of data
- Veracity of data
- Variability of data
- Visualization
- Value
- Distributed computing using Apache Hadoop
- The MapReduce framework
- Hive
- Downloading and extracting the Hive binaries
- Installing Derby
- Using Hive
- Creating a database
- Creating a table
- SELECT statement syntax
- WHERE clauses
- INSERT statement syntax
- Primitive types
- Complex types
- Built-in operators and functions
- Built-in operators
- Built-in functions
- Language capabilities
- A cheat sheet on retrieving information
- Apache Spark
- Visualization using Tableau
- Summary
- Big Data Processing with MapReduce
- The MapReduce framework
- Dataset
- Record reader
- Map
- Combiner
- Partitioner
- Shuffle and sort
- Reduce
- Output format
- MapReduce job types
- Single mapper job
- Single mapper reducer job
- Multiple mappers reducer job
- SingleMapperCombinerReducer job
- Scenario
- MapReduce patterns
- Aggregation patterns
- Average temperature by city
- Record count
- Min/max/count
- Average/median/standard deviation
- Filtering patterns
- Join patterns
- Inner join
- Left anti join
- Left outer join
- Right outer join
- Full outer join
- Left semi join
- Cross join
- Summary
- Scientific Computing and Big Data Analysis with Python and Hadoop
- Installation
- Installing standard Python
- Installing Anaconda
- Using Conda
- Data analysis
- Summary
- Statistical Big Data Computing with R and Hadoop
- Introduction
- Install R on workstations and connect to the data in Hadoop
- Install R on a shared server and connect to Hadoop
- Utilize Revolution R Open
- Execute R inside of MapReduce using RMR2
- Summary and outlook for pure open source options
- Methods of integrating R and Hadoop
- RHADOOP – install R on workstations and connect to data in Hadoop
- RHIPE – execute R inside Hadoop MapReduce
- R and Hadoop Streaming
- RHIVE – install R on workstations and connect to data in Hadoop
- ORCH – Oracle connector for Hadoop
- Data analytics
- Summary
- Batch Analytics with Apache Spark
- SparkSQL and DataFrames
- DataFrame APIs and the SQL API
- Pivots
- Filters
- User-defined functions
- Schema – structure of data
- Implicit schema
- Explicit schema
- Encoders
- Loading datasets
- Saving datasets
- Aggregations
- Aggregate functions
- count
- first
- last
- approx_count_distinct
- min
- max
- avg
- sum
- kurtosis
- skewness
- Variance
- Standard deviation
- Covariance
- groupBy
- Rollup
- Cube
- Window functions
- ntiles
- Joins
- Inner workings of join
- Shuffle join
- Broadcast join
- Join types
- Inner join
- Left outer join
- Right outer join
- Outer join
- Left anti join
- Left semi join
- Cross join
- Performance implications of join
- Summary
- Real-Time Analytics with Apache Spark
- Streaming
- At-least-once processing
- At-most-once processing
- Exactly-once processing
- Spark Streaming
- StreamingContext
- Creating StreamingContext
- Starting StreamingContext
- Stopping StreamingContext
- Input streams
- receiverStream
- socketTextStream
- rawSocketStream
- fileStream
- textFileStream
- binaryRecordsStream
- queueStream
- textFileStream example
- twitterStream example
- Discretized Streams
- Transformations
- Windows operations
- Stateful/stateless transformations
- Stateless transformations
- Stateful transformations
- Checkpointing
- Metadata checkpointing
- Data checkpointing
- Driver failure recovery
- Interoperability with streaming platforms (Apache Kafka)
- Receiver-based
- Direct Stream
- Structured Streaming
- Getting deeper into Structured Streaming
- Handling event time and late data
- Fault-tolerance semantics
- Summary
- Batch Analytics with Apache Flink
- Introduction to Apache Flink
- Continuous processing for unbounded datasets
- Flink, the streaming model, and bounded datasets
- Installing Flink
- Downloading Flink
- Installing Flink
- Starting a local Flink cluster
- Using the Flink cluster UI
- Batch analytics
- Reading file
- File-based
- Collection-based
- Generic
- Transformations
- GroupBy
- Aggregation
- Joins
- Inner join
- Left outer join
- Right outer join
- Full outer join
- Writing to a file
- Summary
- Stream Processing with Apache Flink
- Introduction to streaming execution model
- Data processing using the DataStream API
- Execution environment
- Data sources
- Socket-based
- File-based
- Transformations
- map
- flatMap
- filter
- keyBy
- reduce
- fold
- Aggregations
- window
- Global windows
- Tumbling windows
- Sliding windows
- Session windows
- windowAll
- union
- Window join
- split
- Select
- Project
- Physical partitioning
- Custom partitioning
- Random partitioning
- Rebalancing partitioning
- Rescaling
- Broadcasting
- Event time and watermarks
- Connectors
- Kafka connector
- Twitter connector
- RabbitMQ connector
- Elasticsearch connector
- Cassandra connector
- Summary
- Visualizing Big Data
- Introduction
- Tableau
- Chart types
- Line charts
- Pie chart
- Bar chart
- Heat map
- Using Python to visualize data
- Using R to visualize data
- Big data visualization tools
- Summary
- Introduction to Cloud Computing
- Concepts and terminology
- Cloud
- IT resource
- On-premise
- Cloud consumers and Cloud providers
- Scaling
- Types of scaling
- Horizontal scaling
- Vertical scaling
- Cloud service
- Cloud service consumer
- Goals and benefits
- Increased scalability
- Increased availability and reliability
- Risks and challenges
- Increased security vulnerabilities
- Reduced operational governance control
- Limited portability between Cloud providers
- Roles and boundaries
- Cloud provider
- Cloud consumer
- Cloud service owner
- Cloud resource administrator
- Additional roles
- Organizational boundary
- Trust boundary
- Cloud characteristics
- On-demand usage
- Ubiquitous access
- Multi-tenancy (and resource pooling)
- Elasticity
- Measured usage
- Resiliency
- Cloud delivery models
- Infrastructure as a Service
- Platform as a Service
- Software as a Service
- Combining Cloud delivery models
- IaaS + PaaS
- IaaS + PaaS + SaaS
- Cloud deployment models
- Public Clouds
- Community Clouds
- Private Clouds
- Hybrid Clouds
- Summary
- Using Amazon Web Services
- Amazon Elastic Compute Cloud
- Elastic web-scale computing
- Complete control of operations
- Flexible Cloud hosting services
- Integration
- High reliability
- Security
- Inexpensive
- Easy to start
- Instances and Amazon Machine Images
- Launching multiple instances of an AMI
- Instances
- AMIs
- Regions and availability zones
- Region and availability zone concepts
- Regions
- Availability zones
- Available regions
- Regions and endpoints
- Instance types
- Tag basics
- Amazon EC2 key pairs
- Amazon EC2 security groups for Linux instances
- Elastic IP addresses
- Amazon EC2 and Amazon Virtual Private Cloud
- Amazon Elastic Block Store
- Amazon EC2 instance store
- What is AWS Lambda?
- When should I use AWS Lambda?
- Introduction to Amazon S3
- Getting started with Amazon S3
- Comprehensive security and compliance capabilities
- Query in place
- Flexible management
- Most supported platform with the largest ecosystem
- Easy and flexible data transfer
- Backup and recovery
- Data archiving
- Data lakes and big data analytics
- Hybrid Cloud storage
- Cloud-native application data
- Disaster recovery
- Amazon DynamoDB
- Amazon Kinesis Data Streams
- What can I do with Kinesis Data Streams?
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing
- Benefits of using Kinesis Data Streams
- AWS Glue
- When should I use AWS Glue?
- Amazon EMR
- Practical AWS EMR cluster
- Summary