Learning Spark SQL
by Aurobindo Sarkar
Updated: 2021-07-02 18:24:31
If you are a developer, engineer, or architect who wants to learn how to use Apache Spark in a web-scale project, then this is the book for you. Prior knowledge of SQL querying is assumed; basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.
Brand: 中圖公司
Listed: 2021-07-02 18:09:29
Publisher: Packt Publishing
The digital rights to this book are provided by 中圖公司 and licensed to 上海閱文信息技術有限公司 for production and distribution.

Contents
- cover
- Title Page
- Copyright
- Learning Spark SQL
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- Getting Started with Spark SQL
- What is Spark SQL?
- Introducing SparkSession
- Understanding Spark SQL concepts
- Understanding Resilient Distributed Datasets (RDDs)
- Understanding DataFrames and Datasets
- Understanding the Catalyst optimizer
- Understanding Catalyst optimizations
- Understanding Catalyst transformations
- Introducing Project Tungsten
- Using Spark SQL in streaming applications
- Understanding Structured Streaming internals
- Summary
- Using Spark SQL for Processing Structured and Semistructured Data
- Understanding data sources in Spark applications
- Selecting Spark data sources
- Using Spark with relational databases
- Using Spark with MongoDB (NoSQL database)
- Using Spark with JSON data
- Using Spark with Avro files
- Using Spark with Parquet files
- Defining and using custom data sources in Spark
- Summary
- Using Spark SQL for Data Exploration
- Introducing Exploratory Data Analysis (EDA)
- Using Spark SQL for basic data analysis
- Identifying missing data
- Computing basic statistics
- Identifying data outliers
- Visualizing data with Apache Zeppelin
- Sampling data with Spark SQL APIs
- Sampling with the DataFrame/Dataset API
- Sampling with the RDD API
- Using Spark SQL for creating pivot tables
- Summary
- Using Spark SQL for Data Munging
- Introducing data munging
- Exploring data munging techniques
- Pre-processing of the household electric consumption Dataset
- Computing basic statistics and aggregations
- Augmenting the Dataset
- Executing other miscellaneous processing steps
- Pre-processing of the weather Dataset
- Analyzing missing data
- Combining data using a JOIN operation
- Munging textual data
- Processing multiple input data files
- Removing stop words
- Munging time series data
- Pre-processing of the time-series Dataset
- Processing date fields
- Persisting and loading data
- Defining a date-time index
- Using the TimeSeriesRDD object
- Handling missing time-series data
- Computing basic statistics
- Dealing with variable-length records
- Converting variable-length records to fixed-length records
- Extracting data from "messy" columns
- Preparing data for machine learning
- Pre-processing data for machine learning
- Creating and running a machine learning pipeline
- Summary
- Using Spark SQL in Streaming Applications
- Introducing streaming data applications
- Building Spark streaming applications
- Implementing sliding window-based functionality
- Joining a streaming Dataset with a static Dataset
- Using the Dataset API in Structured Streaming
- Using output sinks
- Using the Foreach Sink for arbitrary computations on output
- Using the Memory Sink to save output to a table
- Using the File Sink to save output to a partitioned table
- Monitoring streaming queries
- Using Kafka with Spark Structured Streaming
- Introducing Kafka concepts
- Introducing ZooKeeper concepts
- Introducing Kafka-Spark integration
- Introducing Kafka-Spark Structured Streaming
- Writing a receiver for a custom data source
- Summary
- Using Spark SQL in Machine Learning Applications
- Introducing machine learning applications
- Understanding Spark ML pipelines and their components
- Understanding the steps in a pipeline application development process
- Introducing feature engineering
- Creating new features from raw data
- Estimating the importance of a feature
- Understanding dimensionality reduction
- Deriving good features
- Implementing a Spark ML classification model
- Exploring the diabetes Dataset
- Pre-processing the data
- Building the Spark ML pipeline
- Using StringIndexer for indexing categorical features and labels
- Using VectorAssembler for assembling features into one column
- Using a Spark ML classifier
- Creating a Spark ML pipeline
- Creating the training and test Datasets
- Making predictions using the PipelineModel
- Selecting the best model
- Changing the ML algorithm in the pipeline
- Introducing Spark ML tools and utilities
- Using Principal Component Analysis to select features
- Using encoders
- Using Bucketizer
- Using VectorSlicer
- Using Chi-squared selector
- Using a Normalizer
- Retrieving our original labels
- Implementing a Spark ML clustering model
- Summary
- Using Spark SQL in Graph Applications
- Introducing large-scale graph applications
- Exploring graphs using GraphFrames
- Constructing a GraphFrame
- Basic graph queries and operations
- Motif analysis using GraphFrames
- Processing subgraphs
- Applying graph algorithms
- Saving and loading GraphFrames
- Analyzing JSON input modeled as a graph
- Processing graphs containing multiple types of relationships
- Understanding GraphFrame internals
- Viewing GraphFrame physical execution plan
- Understanding partitioning in GraphFrames
- Summary
- Using Spark SQL with SparkR
- Introducing SparkR
- Understanding the SparkR architecture
- Understanding SparkR DataFrames
- Using SparkR for EDA and data munging tasks
- Reading and writing Spark DataFrames
- Exploring structure and contents of Spark DataFrames
- Running basic operations on Spark DataFrames
- Executing SQL statements on Spark DataFrames
- Merging SparkR DataFrames
- Using User Defined Functions (UDFs)
- Using SparkR for computing summary statistics
- Using SparkR for data visualization
- Visualizing data on a map
- Visualizing graph nodes and edges
- Using SparkR for machine learning
- Summary
- Developing Applications with Spark SQL
- Introducing Spark SQL applications
- Understanding text analysis applications
- Using Spark SQL for textual analysis
- Preprocessing textual data
- Computing readability
- Using word lists
- Creating data preprocessing pipelines
- Understanding themes in document corpuses
- Using Naive Bayes classifiers
- Developing a machine learning application
- Summary
- Using Spark SQL in Deep Learning Applications
- Introducing neural networks
- Understanding deep learning
- Understanding representation learning
- Understanding stochastic gradient descent
- Introducing deep learning in Spark
- Introducing CaffeOnSpark
- Introducing DL4J
- Introducing TensorFrames
- Working with BigDL
- Tuning hyperparameters of deep learning models
- Introducing deep learning pipelines
- Understanding supervised learning
- Understanding convolutional neural networks
- Using neural networks for text classification
- Using deep neural networks for language processing
- Understanding Recurrent Neural Networks
- Introducing autoencoders
- Summary
- Tuning Spark SQL Components for Performance
- Introducing performance tuning in Spark SQL
- Understanding DataFrame/Dataset APIs
- Optimizing data serialization
- Understanding Catalyst optimizations
- Understanding the Dataset/DataFrame API
- Understanding Catalyst transformations
- Visualizing Spark application execution
- Exploring Spark application execution metrics
- Using external tools for performance tuning
- Cost-based optimizer in Apache Spark 2.2
- Understanding the CBO statistics collection
- Statistics collection functions
- Filter operator
- Join operator
- Build side selection
- Understanding multi-way JOIN ordering optimization
- Understanding performance improvements using whole-stage code generation
- Summary
- Spark SQL in Large-Scale Application Architectures
- Understanding Spark-based application architectures
- Using Apache Spark for batch processing
- Using Apache Spark for stream processing
- Understanding the Lambda architecture
- Understanding the Kappa architecture
- Design considerations for building scalable stream processing applications
- Building robust ETL pipelines using Spark SQL
- Choosing appropriate data formats
- Transforming data in ETL pipelines
- Addressing errors in ETL pipelines
- Implementing a scalable monitoring solution
- Deploying Spark machine learning pipelines
- Understanding the challenges in typical ML deployment environments
- Understanding types of model scoring architectures
- Using cluster managers
- Summary