
How it works

We will get into some details about Scrapy in later chapters, but let's quickly walk through this code to get a feel for how it accomplishes the scrape.  Everything in Scrapy revolves around creating a spider.  Spiders crawl through pages on the Internet based upon rules that we provide.  This spider only processes a single page, so it's not really much of a spider.  But it shows the pattern we will use throughout the later Scrapy examples.

The spider is created with a class definition that derives from one of the Scrapy spider classes.  Ours is based on the scrapy.Spider class.

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]

Every spider is given a name, and also one or more start_urls, which tell it where to start crawling.
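
Under the hood, Scrapy's default start_requests method turns each entry in start_urls into a request and delivers the downloaded response to parse.  The following minimal sketch, using a hypothetical ExplicitStartSpider, shows the more explicit equivalent of relying on start_urls:

import scrapy

class ExplicitStartSpider(scrapy.Spider):
    # Hypothetical spider equivalent to declaring start_urls:
    # start_requests yields one request per URL, and each response
    # is handed to parse() as the default callback.
    name = 'explicitstartspider'

    def start_requests(self):
        urls = ['https://www.python.org/events/python-events/']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info('Downloaded %s', response.url)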

This spider has a field to store all the events that we find:

    found_events = []

The spider then has a method named parse, which will be called for every page the spider collects.

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

The implementation of this method uses an XPath selection to get the events from the page (XPath is the built-in means of navigating HTML in Scrapy). It then builds the event_details dictionary object similarly to the other examples, and then adds it to the found_events list.
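
To get a feel for how these XPath expressions behave, you can try them against a small HTML fragment using Scrapy's Selector class directly.  The fragment below is a made-up snippet that simply mirrors the structure of the events page:

from scrapy.selector import Selector

# A hypothetical fragment shaped like the python.org events list
html = '''
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="#">Example PyCon</a></h3>
    <p><span class="event-location">Sometown, Somecountry</span> <time>01 Jan.</time></p>
  </li>
</ul>
'''

for event in Selector(text=html).xpath('//ul[contains(@class, "list-recent-events")]/li'):
    # The same relative expressions used in parse()
    print(event.xpath('h3[@class="event-title"]/a/text()').extract_first())
    print(event.xpath('p/span[@class="event-location"]/text()').extract_first())
    print(event.xpath('p/time/text()').extract_first())

Running this prints the title, location, and time text from the fragment, which is exactly what parse() stores for each real event.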

The remaining code executes the Scrapy crawler programmatically.

process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
process.crawl(PythonEventsSpider)
spider = next(iter(process.crawlers)).spider
process.start()

It starts with the creation of a CrawlerProcess, which does the actual crawling and a lot of other tasks.  We pass it a LOG_LEVEL of ERROR to suppress Scrapy's voluminous output.  Change this to DEBUG and re-run it to see the difference.
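
As a sketch, that more verbose run only changes the setting in the dictionary passed to CrawlerProcess; any other Scrapy setting, such as the hypothetical user agent below, can be supplied the same way:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',                  # show Scrapy's full request/response logging
    'USER_AGENT': 'python-events-example'  # hypothetical identifying user agent
})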

Next we tell the crawler process to use our spider implementation.  We get the actual spider object from that crawler so that we can retrieve the items when the crawl is complete.  And then we kick off the whole thing by calling process.start().

When the crawl is complete, we can iterate over and print the items that were found.

    for event in spider.found_events: print(event)

This example really didn't touch any of the power of Scrapy.  We will look at some of its more advanced features later in the book.