Finding the owner of a website

For some websites it may matter to us who the owner is. For example, if the owner is known to block web crawlers, it would be wise to use a more conservative download rate. To find out who owns a website, we can use the WHOIS protocol to look up the registered owner of the domain name. A Python wrapper for this protocol, documented at https://pypi.python.org/pypi/python-whois, can be installed via pip:

   pip install python-whois

Here is the most informative part of the WHOIS response when querying the appspot.com domain with this module:

   >>> import whois
   >>> print(whois.whois('appspot.com'))
   {
     ...
     "name_servers": [
       "NS1.GOOGLE.COM",
       "NS2.GOOGLE.COM",
       "NS3.GOOGLE.COM",
       "NS4.GOOGLE.COM",
       "ns4.google.com",
       "ns2.google.com",
       "ns1.google.com",
       "ns3.google.com"
     ],
     "org": "Google Inc.",
     "emails": [
       "abusecomplaints@markmonitor.com",
       "dns-admin@google.com"
     ]
   }

We can see here that this domain is owned by Google, which is correct: this domain is used by the Google App Engine service. Google often blocks web crawlers, despite being fundamentally a web crawling business itself. We would need to be careful when crawling this domain, because Google often blocks IPs that scrape its services too quickly, and you, or someone you live or work with, might need to use Google services. I have been asked to enter captchas to use Google services for short periods, even after running only simple search crawlers on Google domains.
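One way to act on this information is to pick a download delay from the WHOIS result. The following is a minimal sketch, not part of the python-whois module: the helper names (normalize, choose_delay, lookup_delay), the list of cautious owners, and the delay values are all hypothetical and chosen only for illustration. It assumes the object returned by whois.whois() behaves like a dictionary with an "org" field, as in the output above.

```python
def normalize(value):
    """Return a WHOIS field as a list of lowercase strings.

    Fields may be missing (None), a single string, or a list of strings.
    """
    if value is None:
        return []
    if isinstance(value, str):
        value = [value]
    return [str(item).lower() for item in value]


def choose_delay(record, cautious_owners=("google",),
                 default_delay=1.0, cautious_delay=10.0):
    """Pick a delay in seconds between downloads based on the "org" field.

    The owner list and both delay values are illustrative assumptions.
    """
    owners = normalize(record.get("org"))
    for owner in owners:
        if any(name in owner for name in cautious_owners):
            return cautious_delay
    return default_delay


def lookup_delay(domain):
    """Query WHOIS for the domain and choose a delay (needs network access)."""
    import whois  # provided by the python-whois package
    return choose_delay(whois.whois(domain))


if __name__ == "__main__":
    # Offline example using the fields shown in the response above.
    sample = {"org": "Google Inc.", "emails": ["dns-admin@google.com"]}
    print(choose_delay(sample))
```

Keeping the decision in choose_delay, separate from the network lookup, means the rule can be tried on saved WHOIS records without querying a server each time.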