
Getting data into R by scraping the web using the rvest package

In this section, we will focus on web scraping and how to implement it using the rvest package.

Web scraping is the process of extracting data from web pages, which are typically unstructured HTML, and converting it into a structured format that can be easily accessed and analyzed. We will use R to scrape data on the most popular feature films from the IMDb website.
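To make the idea of turning unstructured HTML into structured data concrete, here is a minimal sketch that runs offline. The HTML fragment, the class names (film, title, year), and the film names are hypothetical placeholders, not taken from IMDb.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
html <- '<html><body>
  <div class="film"><span class="title">Film A</span><span class="year">2017</span></div>
  <div class="film"><span class="title">Film B</span><span class="year">2016</span></div>
</body></html>'

# Parse the HTML string into a document object
page <- read_html(html)

# Pull each field out with a CSS class selector
titles <- html_text(html_nodes(page, ".title"))
years  <- as.integer(html_text(html_nodes(page, ".year")))

# Combine the fields into a structured table
films <- data.frame(title = titles, year = years, stringsAsFactors = FALSE)
print(films)
```

Once the values are in a data frame, they can be filtered, sorted, and summarized like any other R data.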

Follow these steps to get data into R using the rvest package:

  1. Install the rvest package. This step is required because rvest is not part of the base R distribution:
> install.packages('rvest') 
package 'rvest' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
        C:\Users\Radhika\AppData\Local\Temp\RtmpMvNUA5\downloaded_packages
  2. Load the installed package into the current R session:
> library(rvest)
  3. Start scraping the IMDb page that lists the most popular feature films released in a given year (2017 in this example):
> url <- 'https://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature'
> # Read the HTML code from the URL
> webpage <- read_html(url)
> webpage
{xml_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script type="text/ ...
[2] <body id="styleguide-v2" class="fixed">\n\n <img height="1" width="1" style="display: ...
  4. The elements of the page carry CSS classes that can be used as selectors to extract the required data. For example, the .text-primary class selects the rank of each film:
> # Use a CSS selector to scrape the rankings section
> rank_data_html <- html_nodes(webpage,'.text-primary')
> rank_data_html
{xml_nodeset (100)}
 [1] <span class="lister-item-index unbold text-primary">1.</span>
 [2] <span class="lister-item-index unbold text-primary">2.</span>
 [3] <span class="lister-item-index unbold text-primary">3.</span>
 [4] <span class="lister-item-index unbold text-primary">4.</span>
 [5] <span class="lister-item-index unbold text-primary">5.</span>
 [6] <span class="lister-item-index unbold text-primary">6.</span>
 [7] <span class="lister-item-index unbold text-primary">7.</span>
 [8] <span class="lister-item-index unbold text-primary">8.</span>
 [9] <span class="lister-item-index unbold text-primary">9.</span>
[10] <span class="lister-item-index unbold text-primary">10.</span>
[11] <span class="lister-item-index unbold text-primary">11.</span>
[12] <span class="lister-item-index unbold text-primary">12.</span>
[13] <span class="lister-item-index unbold text-primary">13.</span>
[14] <span class="lister-item-index unbold text-primary">14.</span>
[15] <span class="lister-item-index unbold text-primary">15.</span>
[16] <span class="lister-item-index unbold text-primary">16.</span>
[17] <span class="lister-item-index unbold text-primary">17.</span>
[18] <span class="lister-item-index unbold text-primary">18.</span>
[19] <span class="lister-item-index unbold text-primary">19.</span>
[20] <span class="lister-item-index unbold text-primary">20.</span>
...
  5. Convert the selected nodes to text to get the rank of each film:
> rank_data <- html_text(rank_data_html)
> head(rank_data)
[1] "1." "2." "3." "4." "5." "6."
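The ranks come back as character strings such as "1.", so they usually need cleaning before analysis. Here is a minimal sketch of one way to do that. It runs offline on a small HTML fragment that copies the span structure shown in the output above; the variable name rank_numeric is an assumption introduced for illustration, and the live IMDb page may no longer use these class names.

```r
library(rvest)

# Offline fragment mimicking the rank spans from the output above
html <- '<html><body>
  <span class="lister-item-index unbold text-primary">1.</span>
  <span class="lister-item-index unbold text-primary">2.</span>
  <span class="lister-item-index unbold text-primary">3.</span>
</body></html>'

webpage <- read_html(html)

# Select the rank nodes by CSS class and extract their text
rank_data_html <- html_nodes(webpage, ".text-primary")
rank_data <- html_text(rank_data_html)

# Strip the trailing period, then convert the strings to integers
rank_numeric <- as.integer(sub("\\.$", "", rank_data))
print(rank_numeric)
```

In rvest 1.0.0 and later, html_elements() is the preferred name for html_nodes(); html_nodes() still works, so either call can be used here.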

In the next section, we will look at importing data into R from databases using the appropriate package.
