Examining the Sitemap

Sitemap files are provided by websites to help crawlers locate their updated content without needing to crawl every web page. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html. Many web publishing platforms can generate a sitemap automatically. Here is the content of the Sitemap file referenced in the robots.txt file:

<?xml version="1.0" encoding="UTF-8"?> 
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
<url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
...
</urlset>

This sitemap provides links to all the web pages, which will be used in the next section to build our first crawler. Sitemap files provide an efficient way to crawl a website, but need to be treated carefully because they can be missing, out-of-date, or incomplete.
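
As a rough illustration of how the links in such a file might be pulled out, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The function name `extract_sitemap_links` and the inline `sample` data are assumptions made for this example, not code from the site or from the crawler built later; the sample simply repeats the three entries shown above.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the sitemap standard; the "sm" prefix is an
# arbitrary label chosen for this sketch.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_links(sitemap_xml):
    """Return the URLs found in the <loc> elements of a sitemap document.

    Hypothetical helper: expects the raw sitemap as bytes (or str).
    """
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc> elements
    ]


# Sample data copied from the listing above (only the three visible entries).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
<url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>"""

links = extract_sitemap_links(sample)
for link in links:
    print(link)
```

Because a sitemap can be missing or malformed, a real crawler would likely wrap the parsing step in error handling (for example, catching `xml.etree.ElementTree.ParseError`) and fall back to another way of discovering pages.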