Crawling your first website

To scrape a website, we first need to download the web pages that contain the data of interest, a process known as crawling. A website can be crawled in several ways, and the appropriate choice depends on the structure of the target website. This chapter will explore how to download web pages safely, and then introduce the following three common approaches to crawling a website:

  • Crawling a sitemap
  • Iterating over each page using database IDs
  • Following web page links
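
The three approaches listed above can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the function names (`download`, `sitemap_urls`, `id_urls`, `link_urls`), the `example-crawler` user agent, the retry count, and the regular expressions are all assumptions made for this example, and a real crawler would likely want a proper HTML/XML parser instead of regular expressions. Each crawling function takes a `fetch` callable so that the download step can be swapped out.

```python
import itertools
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin


def download(url, user_agent="example-crawler", num_retries=2):
    """Download a page and return its HTML, or None if the download fails.

    Retries only on 5xx server errors, since those may be temporary.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as exc:
        code = getattr(exc, "code", None)  # only HTTPError carries a status code
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, user_agent, num_retries - 1)
        return None


def sitemap_urls(sitemap_xml):
    """Approach 1: extract the page URLs listed in a sitemap's <loc> tags."""
    return re.findall(r"<loc>(.*?)</loc>", sitemap_xml)


def id_urls(template, fetch=download, max_errors=5):
    """Approach 2: iterate database IDs until several downloads fail in a row."""
    errors = 0
    for page_id in itertools.count(1):
        url = template.format(page_id)
        if fetch(url) is None:
            errors += 1
            if errors == max_errors:
                break  # assume we have run past the last valid ID
        else:
            errors = 0
            yield url


def link_urls(seed_url, link_regex, fetch=download):
    """Approach 3: follow links whose absolute URL matches link_regex."""
    queue = [seed_url]
    seen = {seed_url}  # avoid downloading the same page twice
    while queue:
        url = queue.pop()
        html = fetch(url)
        if html is None:
            continue
        yield url
        for link in re.findall(r'<a[^>]+href=["\'](.*?)["\']', html):
            absolute = urljoin(url, link)
            if re.search(link_regex, absolute) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
```

Passing a dictionary's `get` method as `fetch` is a convenient way to try the crawling logic without touching the network; with the default `fetch=download`, the same functions would issue real HTTP requests.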

We have so far used the terms scraping and crawling interchangeably, but let's take a moment to define the similarities and differences between these two approaches.