Frank Kane's Taming Big Data with Apache Spark and Python
by Frank Kane
Updated: 2021-07-02 21:12:56
If you are a data scientist or data analyst who wants to learn Big Data processing using Apache Spark and Python, this book is for you. If you have some programming experience in Python, and want to learn how to process large amounts of data using Apache Spark, Frank Kane's Taming Big Data with Apache Spark and Python will also help you.
Brand: 中圖公司
Listed on: 2021-07-02 18:36:45
Publisher: Packt Publishing
The digital rights to this book are provided by 中圖公司, which has licensed Shanghai Yuewen Information Technology Co., Ltd. to produce and distribute this edition.
- coverpage
- Title Page
- Credits
- About the Author
- www.PacktPub.com
- Why subscribe?
- Customer Feedback
- Preface
- What this book covers
- What you need for this book
- Who this book is for
- Conventions
- Reader feedback
- Customer support
- Downloading the example code
- Downloading the color images of this book
- Errata
- Piracy
- Questions
- Getting Started with Spark
- Getting set up - installing Python, a JDK, and Spark and its dependencies
- Installing Enthought Canopy
- Installing the Java Development Kit
- Installing Spark
- Running Spark code
- Installing the MovieLens movie rating dataset
- Run your first Spark program - the ratings histogram example
- Examining the ratings counter script
- Running the ratings counter script
- Summary
- Spark Basics and Spark Examples
- What is Spark?
- Spark is scalable
- Spark is fast
- Spark is hot
- Spark is not that hard
- Components of Spark
- Using Python with Spark
- The Resilient Distributed Dataset (RDD)
- What is the RDD?
- The SparkContext object
- Creating RDDs
- Transforming RDDs
- Map example
- RDD actions
- Ratings histogram walk-through
- Understanding the code
- Setting up the SparkContext object
- Loading the data
- Extract (MAP) the data we care about
- Perform an action - count by value
- Sort and display the results
- Looking at the ratings-counter script in Canopy
- Key/value RDDs and the average friends by age example
- Key/value concepts - RDDs can hold key/value pairs
- Creating a key/value RDD
- What Spark can do with key/value data?
- Mapping the values of a key/value RDD
- The friends by age example
- Parsing (mapping) the input data
- Counting up the sum of friends and number of entries per age
- Compute averages
- Collect and display the results
- Running the average friends by age example
- Examining the script
- Running the code
- Filtering RDDs and the minimum temperature by location example
- What is filter()
- The source data for the minimum temperature by location example
- Parse (map) the input data
- Filter out all but the TMIN entries
- Create (station ID, temperature) key/value pairs
- Find minimum temperature by station ID
- Collect and print results
- Running the minimum temperature example and modifying it for maximums
- Examining the min-temperatures script
- Running the script
- Running the maximum temperature by location example
- Counting word occurrences using flatmap()
- Map versus flatmap
- Map()
- Flatmap()
- Code sample - count the words in a book
- Improving the word-count script with regular expressions
- Text normalization
- Examining the use of regular expressions in the word-count script
- Running the code
- Sorting the word count results
- Step 1 - Implement countByValue() the hard way to create a new RDD
- Step 2 - Sort the new RDD
- Examining the script
- Running the code
- Find the total amount spent by customer
- Introducing the problem
- Strategy for solving the problem
- Useful snippets of code
- Check your results and sort them by the total amount spent
- Check your sorted implementation and results against mine
- Summary
- Advanced Examples of Spark Programs
- Finding the most popular movie
- Examining the popular-movies script
- Getting results
- Using broadcast variables to display movie names instead of ID numbers
- Introducing broadcast variables
- Examining the popular-movies-nicer.py script
- Getting results
- Finding the most popular superhero in a social graph
- Superhero social networks
- Input data format
- Strategy
- Running the script - discover who the most popular superhero is
- Mapping input data to (hero ID, number of co-occurrences) per line
- Adding up co-occurrence by hero ID
- Flipping the (map) RDD to (number, hero ID)
- Using max() and looking up the name of the winner
- Getting results
- Superhero degrees of separation - introducing the breadth-first search algorithm
- Degrees of separation
- How the breadth-first search algorithm works?
- The initial condition of our social graph
- First pass through the graph
- Second pass through the graph
- Third pass through the graph
- Final pass through the graph
- Accumulators and implementing BFS in Spark
- Convert the input file into structured data
- Writing code to convert Marvel-Graph.txt to BFS nodes
- Iteratively process the RDD
- Using a mapper and a reducer
- How do we know when we're done?
- Superhero degrees of separation - review the code and run it
- Setting up an accumulator and using the convert to BFS function
- Calling flatMap()
- Calling an action
- Calling reduceByKey
- Getting results
- Item-based collaborative filtering in Spark, cache() and persist()
- How does item-based collaborative filtering work?
- Making item-based collaborative filtering a Spark problem
- It's getting real
- Caching RDDs
- Running the similar-movies script using Spark's cluster manager
- Examining the script
- Getting results
- Improving the quality of the similar movies example
- Summary
- Running Spark on a Cluster
- Introducing Elastic MapReduce
- Why use Elastic MapReduce?
- Warning - Spark on EMR is not cheap
- Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
- Partitioning
- Using .partitionBy()
- Choosing a partition size
- Creating similar movies from one million ratings - part 1
- Changes to the script
- Creating similar movies from one million ratings - part 2
- Our strategy
- Specifying memory per executor
- Specifying a cluster manager
- Running on a cluster
- Setting up to run the movie-similarities-1m.py script on a cluster
- Preparing the script
- Creating a cluster
- Connecting to the master node using SSH
- Running the code
- Creating similar movies from one million ratings – part 3
- Assessing the results
- Terminating the cluster
- Troubleshooting Spark on a cluster
- More troubleshooting and managing dependencies
- Troubleshooting
- Managing dependencies
- Summary
- SparkSQL DataFrames and DataSets
- Introducing SparkSQL
- Using SparkSQL in Python
- More things you can do with DataFrames
- Differences between DataFrames and DataSets
- Shell access in SparkSQL
- User-defined functions (UDFs)
- Executing SQL commands and SQL-style functions on a DataFrame
- Using SQL-style functions instead of queries
- Using DataFrames instead of RDDs
- Summary
- Other Spark Technologies and Libraries
- Introducing MLlib
- MLlib capabilities
- Special MLlib data types
- For more information on machine learning
- Making movie recommendations
- Using MLlib to produce movie recommendations
- Examining the movie-recommendations-als.py script
- Analyzing the ALS recommendations results
- Why did we get bad results?
- Using DataFrames with MLlib
- Examining the spark-linear-regression.py script
- Getting results
- Spark Streaming and GraphX
- What is Spark Streaming?
- GraphX
- Summary
- Where to Go From Here? – Learning More About Spark and Data Science