- Hands-On Exploratory Data Analysis with R
- Radhika Datar, Harish Garg
- 2021-06-24 14:10:42
Getting data into R by scraping the web using the rvest package
In this section, we will focus on web scraping and how to implement it using the rvest package.
Web scraping is the process of extracting unstructured data from web pages and converting it into a structured format that can be easily accessed and analyzed. We will use R to scrape data on the most popular feature films from the IMDb website.
The following steps are implemented to get data into R using the rvest package:
- Install the rvest package. It is not part of base R, so it must be installed before first use:
> install.packages('rvest')
package 'rvest' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
        C:\Users\Radhika\AppData\Local\Temp\RtmpMvNUA5\downloaded_packages
- Load the installed package into the current R session:
> library(rvest)
- Let's start web scraping the IMDb website, which displays the most popular feature films in a given year:
> url <- 'https://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature'
> #Reading html code from mentioned url
> webpage <- read_html(url)
> webpage
{xml_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script type="text/ ...
[2] <body id="styleguide-v2" class="fixed">\n\n <img height="1" width="1" style="display: ...
- Elements of the page can be targeted with CSS selectors to scrape the required data. Here, the .text-primary class selects the ranking of each film:
> #Using CSS selectors to scrape the rankings section
> rank_data_html <- html_nodes(webpage,'.text-primary')
> rank_data_html
{xml_nodeset (100)}
 [1] <span class="lister-item-index unbold text-primary">1.</span>
 [2] <span class="lister-item-index unbold text-primary">2.</span>
 [3] <span class="lister-item-index unbold text-primary">3.</span>
 [4] <span class="lister-item-index unbold text-primary">4.</span>
 [5] <span class="lister-item-index unbold text-primary">5.</span>
 [6] <span class="lister-item-index unbold text-primary">6.</span>
 [7] <span class="lister-item-index unbold text-primary">7.</span>
 [8] <span class="lister-item-index unbold text-primary">8.</span>
 [9] <span class="lister-item-index unbold text-primary">9.</span>
[10] <span class="lister-item-index unbold text-primary">10.</span>
[11] <span class="lister-item-index unbold text-primary">11.</span>
[12] <span class="lister-item-index unbold text-primary">12.</span>
[13] <span class="lister-item-index unbold text-primary">13.</span>
[14] <span class="lister-item-index unbold text-primary">14.</span>
[15] <span class="lister-item-index unbold text-primary">15.</span>
[16] <span class="lister-item-index unbold text-primary">16.</span>
[17] <span class="lister-item-index unbold text-primary">17.</span>
[18] <span class="lister-item-index unbold text-primary">18.</span>
[19] <span class="lister-item-index unbold text-primary">19.</span>
[20] <span class="lister-item-index unbold text-primary">20.</span>
...
- Use the following code to extract the rank of each film as text from the selected nodes:
> rank_data <- html_text(rank_data_html)
> head(rank_data)
[1] "1." "2." "3." "4." "5." "6."
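The extracted ranks are still character strings such as "1.", so a natural follow-up is to convert them to numbers and place them in a data frame. The following is a minimal sketch of that idea, not part of the original walkthrough. It parses a small inline HTML snippet instead of the live IMDb page so that it runs without network access. The snippet, the film titles, and the .lister-item-header selector are hypothetical and used only for illustration; only the .text-primary selector comes from the steps above, and the live page's markup may differ.

```r
# Sketch only: sample_html is a hypothetical stand-in for the IMDb page,
# so the example runs offline.
library(rvest)

sample_html <- paste0(
  '<div><span class="lister-item-index unbold text-primary">1.</span>',
  '<h3 class="lister-item-header"><a href="#">Film A</a></h3></div>',
  '<div><span class="lister-item-index unbold text-primary">2.</span>',
  '<h3 class="lister-item-header"><a href="#">Film B</a></h3></div>',
  '<div><span class="lister-item-index unbold text-primary">3.</span>',
  '<h3 class="lister-item-header"><a href="#">Film C</a></h3></div>'
)

# read_html() also accepts literal HTML passed as a string
page <- read_html(sample_html)

# Select nodes by CSS selector, then pull out their text
rank_text  <- html_text(html_nodes(page, '.text-primary'))
# Assumed selector for titles; not confirmed for the live page
title_text <- html_text(html_nodes(page, '.lister-item-header a'))

# Drop the trailing dot ("1." becomes "1") and convert to integer
rank_num <- as.integer(sub('\\.$', '', rank_text))

# Combine the scraped vectors into a structured data frame
films <- data.frame(rank = rank_num, title = title_text,
                    stringsAsFactors = FALSE)
print(films)
```

Applied to the real page, the same conversion could be run on rank_data from the previous step. Before building the data frame, it is worth checking that every scraped vector has the same length, since a selector that misses some entries would otherwise misalign the columns.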
In the next section, we will focus more on importing the data into R from databases using the required package.