Table of Contents
- coverpage
- Title Page
- Credits
- About the Author
- About the Reviewer
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- A First Taste and What’s New in Apache Spark V2
- Spark machine learning
- Spark Streaming
- Spark SQL
- Spark graph processing
- Extended ecosystem
- What's new in Apache Spark V2?
- Cluster design
- Cluster management
- Local
- Standalone
- Apache YARN
- Apache Mesos
- Cloud-based deployments
- Performance
- The cluster structure
- Hadoop Distributed File System
- Data locality
- Memory
- Coding
- Cloud
- Summary
- Apache Spark SQL
- The SparkSession – your gateway to structured data processing
- Importing and saving data
- Processing the text files
- Processing JSON files
- Processing the Parquet files
- Understanding the DataSource API
- Implicit schema discovery
- Predicate push-down on smart data sources
- DataFrames
- Using SQL
- Defining schemas manually
- Using SQL subqueries
- Applying SQL table joins
- Using Datasets
- The Dataset API in action
- User-defined functions
- RDDs versus DataFrames versus Datasets
- Summary
- The Catalyst Optimizer
- Understanding the workings of the Catalyst Optimizer
- Managing temporary views with the catalog API
- The SQL abstract syntax tree
- How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan
- Internal class and object representations of LEPs
- How to optimize the Resolved Logical Execution Plan
- Physical Execution Plan generation and selection
- Code generation
- Practical examples
- Using the explain method to obtain the PEP
- How smart data sources work internally
- Summary
- Project Tungsten
- Memory management beyond the Java Virtual Machine Garbage Collector
- Understanding the UnsafeRow object
- The null bit set region
- The fixed length values region
- The variable length values region
- Understanding the BytesToBytesMap
- A practical example on memory usage and performance
- Cache-friendly layout of data in memory
- Cache eviction strategies and pre-fetching
- Code generation
- Understanding columnar storage
- Understanding whole stage code generation
- A practical example on whole stage code generation performance
- Operator fusing versus the volcano iterator model
- Summary
- Apache Spark Streaming
- Overview
- Errors and recovery
- Checkpointing
- Streaming sources
- TCP stream
- File streams
- Flume
- Summary
- Structured Streaming
- The concept of continuous applications
- True unification - same code, same engine
- Windowing
- How streaming engines use windowing
- How Apache Spark improves windowing
- Increased performance with good old friends
- How transparent fault tolerance and the exactly-once delivery guarantee are achieved
- Replayable sources can replay streams from a given offset
- Idempotent sinks prevent data duplication
- State versioning guarantees consistent results after reruns
- Example - connection to an MQTT message broker
- Controlling continuous applications
- More on stream life cycle management
- Summary
- Apache Spark MLlib
- Architecture
- The development environment
- Classification with Naive Bayes
- Theory on Classification
- Naive Bayes in practice
- Clustering with K-Means
- Theory on Clustering
- K-Means in practice
- Artificial neural networks
- ANN in practice
- Summary
- Apache SparkML
- What does the new API look like?
- The concept of pipelines
- Transformers
- String indexer
- OneHotEncoder
- VectorAssembler
- Pipelines
- Estimators
- RandomForestClassifier
- Model evaluation
- CrossValidation and hyperparameter tuning
- CrossValidation
- Hyperparameter tuning
- Winning a Kaggle competition with Apache SparkML
- Data preparation
- Feature engineering
- Testing the feature engineering pipeline
- Training the machine learning model
- Model evaluation
- CrossValidation and hyperparameter tuning
- Using the evaluator to assess the quality of the cross-validated and tuned model
- Summary
- Apache SystemML
- Why do we need just another library?
- Why on Apache Spark?
- The history of Apache SystemML
- A cost-based optimizer for machine learning algorithms
- An example - alternating least squares
- Apache SystemML architecture
- Language parsing
- High-level operators are generated
- How low-level operators are optimized
- Performance measurements
- Apache SystemML in action
- Summary
- Deep Learning on Apache Spark with DeepLearning4j and H2O
- H2O
- Overview
- The build environment
- Architecture
- Sourcing the data
- Data quality
- Performance tuning
- Deep Learning
- Example code – income
- The example code – MNIST
- H2O Flow
- Deeplearning4j
- ND4J - high-performance linear algebra for the JVM
- Deeplearning4j
- Example: an IoT real-time anomaly detector
- Mastering chaos: the Lorenz attractor model
- Deploying the test data generator
- Deploying the Node-RED IoT Starter Boilerplate to the IBM Cloud
- Deploying the test data generator flow
- Testing the test data generator
- Installing the Deeplearning4j example within Eclipse
- Running the examples in Eclipse
- Running the examples in Apache Spark
- Summary
- Apache Spark GraphX
- Overview
- Graph analytics/processing with GraphX
- The raw data
- Creating a graph
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark GraphFrames
- Architecture
- Graph-relational translation
- Materialized views
- Join elimination
- Join reordering
- Examples
- Example 1 – counting
- Example 2 – filtering
- Example 3 – PageRank
- Example 4 – triangle counting
- Example 5 – connected components
- Summary
- Apache Spark with Jupyter Notebooks on IBM DataScience Experience
- Why notebooks are the new standard
- Learning by example
- The IEEE PHM 2012 data challenge bearing dataset
- ETL with Scala
- Interactive exploratory analysis using Python and Pixiedust
- Real data science work with SparkR
- Summary
- Apache Spark on Kubernetes
- Bare metal virtual machines and containers
- Containerization
- Namespaces
- Control groups
- Linux containers
- Understanding the core concepts of Docker
- Understanding Kubernetes
- Using Kubernetes for provisioning containerized Spark applications
- Example – Apache Spark on Kubernetes
- Prerequisites
- Deploying the Apache Spark master
- Deploying the Apache Spark workers
- Deploying the Zeppelin notebooks
- Summary