- Python Web Scraping Cookbook
- Michael Heydt
- 2021-06-30 18:44:01
How to do it...
We start with a fresh IPython session and load the planets page:
In [1]: import requests
...: from bs4 import BeautifulSoup
...: html = requests.get("http://localhost:8080/planets.html").text
...: soup = BeautifulSoup(html, "lxml")
...:
In the previous recipe, to access all of the <tr> elements in the table, we used chained property syntax to get the table, then got its children and iterated over them. The problem with that approach is that the children can include nodes other than <tr> elements, such as the whitespace text between tags. A better way to get just the <tr> elements is to use findAll.
Let's start by finding the <table>:
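To see why iterating over children is fragile, here is a minimal sketch on a small inline table. The HTML snippet is a made-up stand-in for planets.html, and it uses the built-in html.parser instead of lxml so that it needs no extra parser. find_all is the current Beautiful Soup 4 spelling of findAll.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the planets page; not the real file.
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1"><td>Mercury</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# .children yields every direct child node, including the newline
# text nodes that sit between the rows.
children = list(table.children)
non_tags = [c for c in children if not isinstance(c, Tag)]

# find_all("tr") returns only <tr> tags, so no filtering is needed.
rows = table.find_all("tr")

print(len(children), "children,", len(non_tags), "of them are not tags")
print([tr["id"] for tr in rows])
```

Any code that loops over the children would have to skip the text nodes itself, whereas the list from find_all can be used directly.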
In [4]: table = soup.find("table")
...: str(table)[:100]
...:
Out[4]: '<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Nam'
This tells the soup object to find the first <table> element in the document. From this element we can find all of the <tr> elements that are descendants of the table with findAll:
In [8]: [str(tr)[:50] for tr in table.findAll("tr")]
Out[8]:
['<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n ',
'<tr class="planet" id="planet1" name="Mercury">\n<t',
'<tr class="planet" id="planet2" name="Venus">\n<td>',
'<tr class="planet" id="planet3" name="Earth">\n<td>',
'<tr class="planet" id="planet4" name="Mars">\n<td>\n',
'<tr class="planet" id="planet5" name="Jupiter">\n<t',
'<tr class="planet" id="planet6" name="Saturn">\n<td',
'<tr class="planet" id="planet7" name="Uranus">\n<td',
'<tr class="planet" id="planet8" name="Neptune">\n<t',
'<tr class="planet" id="planet9" name="Pluto">\n<td>']
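Because findAll searches all descendants, it would also pick up the rows of any table nested inside a cell. The planets page has no nested table as far as the output above shows, so the following is only a sketch on invented HTML. It shows how passing recursive=False restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Invented HTML with a nested table, used only for illustration.
html = """<table id="outer">
<tr id="row1"><td>
  <table id="inner"><tr id="nested"><td>x</td></tr></table>
</td></tr>
<tr id="row2"><td>y</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table", id="outer")

# Default search: every <tr> descendant, including the nested one.
all_rows = [tr["id"] for tr in outer.find_all("tr")]

# recursive=False: only <tr> tags that are direct children of outer.
direct_rows = [tr["id"] for tr in outer.find_all("tr", recursive=False)]

print(all_rows)
print(direct_rows)
```

Some parsers, such as html5lib, insert a <tbody> element, in which case the rows are no longer direct children of the table and recursive=False would have to be applied to the <tbody> instead.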
The list returned by findAll has a small issue if we want only the rows that contain planet data: the table header row is included as well. We can fix this by using the id attribute of the target rows. The following finds the row whose id is "planet3":
In [14]: table.find("tr", {"id": "planet3"})
...:
Out[14]:
<tr class="planet" id="planet3" name="Earth">
<td>
<img src="img/earth-150x150.png"/>
</td>
<td>
Earth
</td>
<td>
5.97
</td>
<td>
12756
</td>
<td>
The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,'
Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning
'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'
</td>
<td>
<a >Wikipedia</a>
</td>
</tr>
Awesome! This works because the page gives each row of planet data its own id (planet1 through planet9), distinct from the header row's planetHeader.
Now let's go one step further: select every row with the class planet, which excludes the header row, and store each planet's name and mass in a dictionary:
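The same id-based idea can select all data rows at once. The sketch below runs on a shortened, made-up copy of the table. It assumes the data row ids always follow the pattern planet plus a number, which matches the output shown earlier but is not guaranteed for other pages. Beautiful Soup accepts a compiled regular expression as an attribute filter, and id="planet3" as a keyword argument is equivalent to the {"id": "planet3"} dictionary form.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the planets table.
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
<tr class="planet" id="planet3" name="Earth"><td>Earth</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Keyword form of the attribute filter used in the recipe.
earth = table.find("tr", id="planet3")

# A regex filter keeps ids such as planet1 and rejects planetHeader.
data_rows = table.find_all("tr", id=re.compile(r"^planet\d+$"))

print(earth["name"])
print([tr["id"] for tr in data_rows])
```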
In [18]: items = dict()
...: planet_rows = table.findAll("tr", {"class": "planet"})
...: for i in planet_rows:
...:     tds = i.findAll("td")
...:     items[tds[1].text.strip()] = tds[2].text.strip()
...:
In [19]: items
Out[19]:
{'Earth': '5.97',
'Jupiter': '1898',
'Mars': '0.642',
'Mercury': '0.330',
'Neptune': '102',
'Pluto': '0.0146',
'Saturn': '568',
'Uranus': '86.8',
'Venus': '4.87'}
And just like that, we have built a useful data structure from the content embedded in the page.
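The dictionary built in the recipe stores the masses as strings. If numeric work is planned, they can be converted while the dictionary is being built. The following is a minimal sketch on a two-row, made-up excerpt of the table (the header text and image paths are placeholders). It assumes every mass cell holds a plain decimal number, which is true of the values shown in the recipe.

```python
from bs4 import BeautifulSoup

# Made-up excerpt: image cell, name cell, mass cell, as in the recipe.
html = """<table id="planetsTable">
<tr id="planetHeader"><th></th><th>Name</th><th>Mass</th></tr>
<tr class="planet" id="planet1"><td><img src="img/a.png"/></td>
<td>
 Mercury
</td><td>
 0.330
</td></tr>
<tr class="planet" id="planet3"><td><img src="img/b.png"/></td>
<td>
 Earth
</td><td>
 5.97
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

masses = {}
for row in soup.find_all("tr", class_="planet"):
    tds = row.find_all("td")
    # get_text(strip=True) removes the surrounding whitespace.
    name = tds[1].get_text(strip=True)
    masses[name] = float(tds[2].get_text(strip=True))

# With numbers instead of strings, comparisons behave as expected.
heaviest = max(masses, key=masses.get)
print(masses, heaviest)
```

Converting early avoids surprises such as '102' sorting before '5.97' when the values are compared as strings.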