
Checking robots.txt

Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only suggestions, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling to minimize the chance of being blocked, and to discover clues about the website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:

# section 1 
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml

In section 1, the robots.txt file asks a crawler with user agent BadCrawler not to crawl their website, but this is unlikely to help because a malicious crawler would not respect robots.txt anyway. A later example in this chapter will show you how to make your crawler follow robots.txt automatically.
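As a quick preview, Python's standard library can already interpret these rules for us: the urllib.robotparser module (called robotparser in Python 2) parses a robots.txt file and reports whether a given user agent may fetch a URL. The following sketch simply checks the example file above; the GoodCrawler name is an arbitrary placeholder for any well-behaved user agent.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

# BadCrawler is disallowed from the entire site by section 1
print(rp.can_fetch('BadCrawler', 'http://example.webscraping.com'))  # False
# any other user agent is allowed, as long as it avoids /trap
print(rp.can_fetch('GoodCrawler', 'http://example.webscraping.com'))  # True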

Section 2 specifies a crawl delay of 5 seconds between download requests for all user agents, which should be respected to avoid overloading the server(s). There is also a /trap link to try to block malicious crawlers that follow disallowed links. If you visit this link, the server will block your IP for one minute! A real website would block your IP for much longer, perhaps permanently, but then we could not continue with this example.
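One simple way to honor this crawl delay is to record when each domain was last requested and sleep before the next download if the delay has not yet elapsed. The sketch below is only an illustration of that idea, not necessarily the crawler developed later in the chapter; the Throttle name and its interface are assumptions for this example.

import time
from urllib.parse import urlparse

class Throttle:
    """Pause between downloads to the same domain."""
    def __init__(self, delay):
        self.delay = delay           # minimum seconds between requests
        self.last_accessed = {}      # maps domain -> time of last request

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if self.delay > 0 and last is not None:
            sleep_secs = self.delay - (time.time() - last)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.last_accessed[domain] = time.time()

# respect the 5 second delay from section 2 before each download
throttle = Throttle(5)
throttle.wait('http://example.webscraping.com/view/1')

Calling throttle.wait(url) before every download keeps requests to the same domain spaced at least five seconds apart, even when the crawl loop itself runs faster.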

Section 3 defines a Sitemap file, which will be examined in the next section.
