Table of Contents (233 entries)
- Cover
- Title Page
- Copyright
- Learning Spark SQL
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- Getting Started with Spark SQL
- What is Spark SQL?
- Introducing SparkSession
- Understanding Spark SQL concepts
- Understanding Resilient Distributed Datasets (RDDs)
- Understanding DataFrames and Datasets
- Understanding the Catalyst optimizer
- Understanding Catalyst optimizations
- Understanding Catalyst transformations
- Introducing Project Tungsten
- Using Spark SQL in streaming applications
- Understanding Structured Streaming internals
- Summary
- Using Spark SQL for Processing Structured and Semistructured Data
- Understanding data sources in Spark applications
- Selecting Spark data sources
- Using Spark with relational databases
- Using Spark with MongoDB (NoSQL database)
- Using Spark with JSON data
- Using Spark with Avro files
- Using Spark with Parquet files
- Defining and using custom data sources in Spark
- Summary
- Using Spark SQL for Data Exploration
- Introducing Exploratory Data Analysis (EDA)
- Using Spark SQL for basic data analysis
- Identifying missing data
- Computing basic statistics
- Identifying data outliers
- Visualizing data with Apache Zeppelin
- Sampling data with Spark SQL APIs
- Sampling with the DataFrame/Dataset API
- Sampling with the RDD API
- Using Spark SQL for creating pivot tables
- Summary
- Using Spark SQL for Data Munging
- Introducing data munging
- Exploring data munging techniques
- Pre-processing of the household electric consumption Dataset
- Computing basic statistics and aggregations
- Augmenting the Dataset
- Executing other miscellaneous processing steps
- Pre-processing of the weather Dataset
- Analyzing missing data
- Combining data using a JOIN operation
- Munging textual data
- Processing multiple input data files
- Removing stop words
- Munging time series data
- Pre-processing of the time-series Dataset
- Processing date fields
- Persisting and loading data
- Defining a date-time index
- Using the TimeSeriesRDD object
- Handling missing time-series data
- Computing basic statistics
- Dealing with variable length records
- Converting variable-length records to fixed-length records
- Extracting data from "messy" columns
- Preparing data for machine learning
- Pre-processing data for machine learning
- Creating and running a machine learning pipeline
- Summary
- Using Spark SQL in Streaming Applications
- Introducing streaming data applications
- Building Spark streaming applications
- Implementing sliding window-based functionality
- Joining a streaming Dataset with a static Dataset
- Using the Dataset API in Structured Streaming
- Using output sinks
- Using the Foreach Sink for arbitrary computations on output
- Using the Memory Sink to save output to a table
- Using the File Sink to save output to a partitioned table
- Monitoring streaming queries
- Using Kafka with Spark Structured Streaming
- Introducing Kafka concepts
- Introducing ZooKeeper concepts
- Introducing Kafka-Spark integration
- Introducing Kafka-Spark Structured Streaming
- Writing a receiver for a custom data source
- Summary
- Using Spark SQL in Machine Learning Applications
- Introducing machine learning applications
- Understanding Spark ML pipelines and their components
- Understanding the steps in a pipeline application development process
- Introducing feature engineering
- Creating new features from raw data
- Estimating the importance of a feature
- Understanding dimensionality reduction
- Deriving good features
- Implementing a Spark ML classification model
- Exploring the diabetes Dataset
- Pre-processing the data
- Building the Spark ML pipeline
- Using StringIndexer for indexing categorical features and labels
- Using VectorAssembler for assembling features into one column
- Using a Spark ML classifier
- Creating a Spark ML pipeline
- Creating the training and test Datasets
- Making predictions using the PipelineModel
- Selecting the best model
- Changing the ML algorithm in the pipeline
- Introducing Spark ML tools and utilities
- Using Principal Component Analysis to select features
- Using encoders
- Using Bucketizer
- Using VectorSlicer
- Using Chi-squared selector
- Using a Normalizer
- Retrieving our original labels
- Implementing a Spark ML clustering model
- Summary
- Using Spark SQL in Graph Applications
- Introducing large-scale graph applications
- Exploring graphs using GraphFrames
- Constructing a GraphFrame
- Basic graph queries and operations
- Motif analysis using GraphFrames
- Processing subgraphs
- Applying graph algorithms
- Saving and loading GraphFrames
- Analyzing JSON input modeled as a graph
- Processing graphs containing multiple types of relationships
- Understanding GraphFrame internals
- Viewing GraphFrame physical execution plan
- Understanding partitioning in GraphFrames
- Summary
- Using Spark SQL with SparkR
- Introducing SparkR
- Understanding the SparkR architecture
- Understanding SparkR DataFrames
- Using SparkR for EDA and data munging tasks
- Reading and writing Spark DataFrames
- Exploring structure and contents of Spark DataFrames
- Running basic operations on Spark DataFrames
- Executing SQL statements on Spark DataFrames
- Merging SparkR DataFrames
- Using User Defined Functions (UDFs)
- Using SparkR for computing summary statistics
- Using SparkR for data visualization
- Visualizing data on a map
- Visualizing graph nodes and edges
- Using SparkR for machine learning
- Summary
- Developing Applications with Spark SQL
- Introducing Spark SQL applications
- Understanding text analysis applications
- Using Spark SQL for textual analysis
- Preprocessing textual data
- Computing readability
- Using word lists
- Creating data preprocessing pipelines
- Understanding themes in document corpuses
- Using Naive Bayes classifiers
- Developing a machine learning application
- Summary
- Using Spark SQL in Deep Learning Applications
- Introducing neural networks
- Understanding deep learning
- Understanding representation learning
- Understanding stochastic gradient descent
- Introducing deep learning in Spark
- Introducing CaffeOnSpark
- Introducing DL4J
- Introducing TensorFrames
- Working with BigDL
- Tuning hyperparameters of deep learning models
- Introducing deep learning pipelines
- Understanding Supervised learning
- Understanding convolutional neural networks
- Using neural networks for text classification
- Using deep neural networks for language processing
- Understanding Recurrent Neural Networks
- Introducing autoencoders
- Summary
- Tuning Spark SQL Components for Performance
- Introducing performance tuning in Spark SQL
- Understanding DataFrame/Dataset APIs
- Optimizing data serialization
- Understanding Catalyst optimizations
- Understanding the Dataset/DataFrame API
- Understanding Catalyst transformations
- Visualizing Spark application execution
- Exploring Spark application execution metrics
- Using external tools for performance tuning
- Cost-based optimizer in Apache Spark 2.2
- Understanding the CBO statistics collection
- Statistics collection functions
- Filter operator
- Join operator
- Build side selection
- Understanding multi-way JOIN ordering optimization
- Understanding performance improvements using whole-stage code generation
- Summary
- Spark SQL in Large-Scale Application Architectures
- Understanding Spark-based application architectures
- Using Apache Spark for batch processing
- Using Apache Spark for stream processing
- Understanding the Lambda architecture
- Understanding the Kappa Architecture
- Design considerations for building scalable stream processing applications
- Building robust ETL pipelines using Spark SQL
- Choosing appropriate data formats
- Transforming data in ETL pipelines
- Addressing errors in ETL pipelines
- Implementing a scalable monitoring solution
- Deploying Spark machine learning pipelines
- Understanding the challenges in typical ML deployment environments
- Understanding types of model scoring architectures
- Using cluster managers
- Summary