
Scraping Python.org with Scrapy

Scrapy is a very popular open source Python framework for extracting data. It was originally designed only for scraping, but it has also evolved into a powerful web crawling solution.

In our previous recipes, we used Requests and urllib2 to fetch data and Beautiful Soup to extract it. Scrapy provides all of this functionality, along with many other built-in modules and extensions. It is also our tool of choice when it comes to scraping with Python.

Scrapy offers a number of powerful features that are worth mentioning:

  • Built-in extensions to make HTTP requests and to handle compression, authentication, caching, user-agent manipulation, and HTTP headers
  • Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for using regular expressions to select content and links
  • Encoding support to deal with different languages and non-standard encoding declarations
  • Flexible APIs to reuse and write custom middleware and pipelines, which provide a clean and easy way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in storage such as file systems, S3, databases, and others