Hands-On Big Data Analytics with PySpark
Apache Spark is an open source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs. You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark. By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively.
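As a taste of the workflow the book builds up (creating a SparkContext, loading data into an RDD, and aggregating it into a report, as in the map/reduce averaging chapter), here is a minimal sketch; the input path `data/sample.csv` and its `id,value` record layout are hypothetical placeholders, not files shipped with the book.

```python
# Minimal PySpark sketch: set up a local SparkContext, load a text
# file into an RDD, and compute an average with map and reduce.
# The path "data/sample.csv" and the "id,value" layout are
# hypothetical placeholders for illustration only.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("hands-on-pyspark").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Parse the second field of each "id,value" record as a float.
lines = sc.textFile("data/sample.csv")
values = lines.map(lambda line: float(line.split(",")[1]))

# Pair each value with a count of 1, then reduce to (total, count).
total, count = values.map(lambda v: (v, 1)).reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1])
)
print("average:", total / count)

sc.stop()
```

With PySpark installed, running this with a local Python interpreter (or via `spark-submit`) prints the average of the value column.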
Table of Contents (122 entries)
- Cover Page
- Title Page
- Copyright and Credits
- Hands-On Big Data Analytics with PySpark
- About Packt
- Why subscribe?
- Packt.com
- Contributors
- About the authors
- Packt is searching for authors like you
- Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Reviews
- Installing PySpark and Setting Up Your Development Environment
- An overview of PySpark
- Spark SQL
- Setting up Spark on Windows and PySpark
- Core concepts in Spark and PySpark
- SparkContext
- Spark shell
- SparkConf
- Summary
- Getting Your Big Data into the Spark Environment Using RDDs
- Loading data on to Spark RDDs
- The UCI machine learning repository
- Getting the data from the repository to Spark
- Getting data into Spark
- Parallelization with Spark RDDs
- What is parallelization?
- Basics of RDD operation
- Summary
- Big Data Cleaning and Wrangling with Spark Notebooks
- Using Spark Notebooks for quick iteration of ideas
- Sampling/filtering RDDs to pick out relevant data points
- Splitting datasets and creating some new combinations
- Summary
- Aggregating and Summarizing Data into Useful Reports
- Calculating averages with map and reduce
- Faster average computations with aggregate
- Pivot tabling with key-value paired data points
- Summary
- Powerful Exploratory Data Analysis with MLlib
- Computing summary statistics with MLlib
- Using Pearson and Spearman correlations to discover correlations
- The Pearson correlation
- The Spearman correlation
- Computing Pearson and Spearman correlations
- Testing our hypotheses on large datasets
- Summary
- Putting Structure on Your Big Data with SparkSQL
- Manipulating DataFrames with Spark SQL schemas
- Using Spark DSL to build queries
- Summary
- Transformations and Actions
- Using Spark transformations to defer computations to a later time
- Avoiding transformations
- Using the reduce and reduceByKey methods to calculate the results
- Performing actions that trigger computations
- Reusing the same RDD for different actions
- Summary
- Immutable Design
- Delving into the Spark RDD's parent/child chain
- Extending an RDD
- Chaining a new RDD with the parent
- Testing our custom RDD
- Using RDD in an immutable way
- Using DataFrame operations to transform
- Immutability in the highly concurrent environment
- Using the Dataset API in an immutable way
- Summary
- Avoiding Shuffle and Reducing Operational Expenses
- Detecting a shuffle in a process
- Testing operations that cause a shuffle in Apache Spark
- Changing the design of jobs with wide dependencies
- Using keyBy() operations to reduce shuffle
- Using a custom partitioner to reduce shuffle
- Summary
- Saving Data in the Correct Format
- Saving data in plain text format
- Leveraging JSON as a data format
- Tabular formats – CSV
- Using Avro with Spark
- Columnar formats – Parquet
- Summary
- Working with the Spark Key/Value API
- Available actions on key/value pairs
- Using aggregateByKey instead of groupBy()
- Actions on key/value pairs
- Available partitioners on key/value data
- Implementing a custom partitioner
- Summary
- Testing Apache Spark Jobs
- Separating logic from the Spark engine for unit testing
- Integration testing using SparkSession
- Mocking data sources using partial functions
- Using ScalaCheck for property-based testing
- Testing in different versions of Spark
- Summary
- Leveraging the Spark GraphX API
- Creating a graph from a data source
- Creating the loader component
- Revisiting the graph format
- Loading Spark from file
- Using the Vertex API
- Constructing a graph using the vertex
- Creating couple relationships
- Using the Edge API
- Constructing the graph using edge
- Calculating the degree of the vertex
- The in-degree
- The out-degree
- Calculating PageRank
- Loading and reloading data about users and followers
- Summary
- Other Books You May Enjoy
- Leave a review - let other readers know what you think