- Python Web Scraping Cookbook
- Michael Heydt
How it works
We will get into some details about Scrapy in later chapters, but let's just go through this code quickly to get a feel for how it accomplishes this scrape. Everything in Scrapy revolves around creating a spider. Spiders crawl through pages on the Internet based upon rules that we provide. This spider only processes one single page, so it's not really much of a spider. But it shows the pattern we will use throughout later Scrapy examples.
The spider is created with a class definition that derives from one of the Scrapy spider classes. Ours derives from the scrapy.Spider class.
class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
Every spider is given a name, and also one or more start_urls which tell it where to start the crawling.
This spider has a field to store all the events that we find:
    found_events = []
The spider then has a method named parse which will be called for every page the spider collects.
    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)
The implementation of this method uses an XPath selection to get the events from the page (XPath is the built-in means of navigating HTML in Scrapy). It then builds the event_details dictionary object similarly to the other examples, and then adds it to the found_events list.
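To see those XPath expressions in isolation, you can run them against a snippet of markup using a standalone Selector. The HTML below is only a simplified guess at the structure of the python.org events page, used here to illustrate how each field is picked out:

from scrapy import Selector

# Simplified, assumed markup mirroring the structure the spider expects
html = '''
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="/events/1/">PyCon Example</a></h3>
    <p>
      <time datetime="2021-05-12">12 May</time>
      <span class="event-location">Sample City, Earth</span>
    </p>
  </li>
</ul>
'''

sel = Selector(text=html)
for event in sel.xpath('//ul[contains(@class, "list-recent-events")]/li'):
    # The same relative XPath expressions used in parse()
    print(event.xpath('h3[@class="event-title"]/a/text()').extract_first())       # PyCon Example
    print(event.xpath('p/span[@class="event-location"]/text()').extract_first())  # Sample City, Earth
    print(event.xpath('p/time/text()').extract_first())                           # 12 May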
The remaining code does the programmatic execution of the Scrapy crawler.
process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
process.crawl(PythonEventsSpider)
spider = next(iter(process.crawlers)).spider
process.start()
It starts with the creation of a CrawlerProcess which does the actual crawling and a lot of other tasks. We pass it a LOG_LEVEL of ERROR to prevent the voluminous Scrapy output. Change this to DEBUG and re-run it to see the difference.
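For example, to see the full Scrapy logging, the only line that needs to change is the settings dictionary passed to the process:

process = CrawlerProcess({'LOG_LEVEL': 'DEBUG'})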
Next, we tell the crawler process to use our spider implementation. We get the actual spider object from that crawler so that we can get the items when the crawl is complete. And then we kick off the whole thing by calling process.start().
When the crawl is completed we can then iterate and print out the items that were found.
for event in spider.found_events: print(event)
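For reference, here is a minimal end-to-end sketch assembling the fragments above into a single runnable script; the two import lines are assumptions, since the recipe's preamble is not reproduced in this section:

import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        # One <li> per event in the "list-recent-events" list
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
    for event in spider.found_events:
        print(event)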