
Scraping the Data

In the previous chapter, we built a crawler that follows links to download the web pages we want. This is interesting but not useful: the crawler downloads a web page and then discards the result. Now we need to make this crawler achieve something by extracting data from each web page, a process known as scraping.

We will first cover browser tools for examining a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extracting data from a web page: regular expressions, Beautiful Soup, and lxml. Finally, the chapter will conclude with a comparison of these three scraping alternatives.

In this chapter, we will cover the following topics:

  • Analyzing a web page
  • Approaches to scrape a web page
  • Using the console
  • XPath selectors
  • Scraping results
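
As a small preview of what scraping means in practice, the following sketch pulls values out of a page with a regular expression, the first of the three approaches named above. It is a minimal illustration rather than the chapter's own code: the HTML fragment, its class names, and the scrape_values helper are all hypothetical and chosen only to keep the example self-contained.

```python
import re

# Hypothetical HTML standing in for a page the crawler has downloaded.
# The markup and class names are assumptions made for this sketch.
SAMPLE_HTML = """
<html><body>
<table>
  <tr id="area_row"><td class="label">Area</td><td class="value">244,820 square kilometres</td></tr>
  <tr id="population_row"><td class="label">Population</td><td class="value">62,348,447</td></tr>
</table>
</body></html>
"""


def scrape_values(html):
    """Return a dict mapping each row's label text to its value text."""
    # Non-greedy groups capture the text inside each label/value cell pair.
    pattern = re.compile(
        r'<td class="label">(.*?)</td>\s*<td class="value">(.*?)</td>',
        re.DOTALL,
    )
    # findall returns a list of (label, value) tuples, one per matching row.
    return dict(pattern.findall(html))


if __name__ == "__main__":
    for label, value in scrape_values(SAMPLE_HTML).items():
        print(label, "->", value)
```

A pattern like this is tied closely to the exact markup, so it is likely to break if the page layout changes even slightly. That fragility is one reason to also consider parser-based tools such as Beautiful Soup and lxml.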