- Python Web Scraping Cookbook
- Michael Heydt
- 267 words
- 2021-06-30 18:44:01
Querying the DOM with XPath and lxml
XPath is a query language for selecting nodes from an XML document, and it is essential for anyone doing web scraping. XPath offers several benefits over other selection tools:
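- It can easily navigate through the DOM tree
- It is more sophisticated and powerful than other selectors such as CSS selectors and regular expressions
- It has a rich set of built-in functions (the 200+ figure applies to later XPath versions; XPath 1.0, the version lxml implements, defines a smaller core library) and is extensible with custom functions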
- It is widely supported by parsing libraries and scraping platforms
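As a minimal sketch of the extensibility point above, lxml lets you register a Python function and call it from an XPath expression. The namespace URI, the `ex` prefix, the `shout` function, and the sample XML are all hypothetical names invented for this illustration.

```python
from lxml import etree

# Hypothetical namespace URI for our custom XPath functions (an assumption).
EX_NS = "http://example.com/xpath-functions"


def shout(context, text):
    """Custom XPath function: return the given string in upper case."""
    return text.upper()


# Register the function under the namespace so XPath expressions can call it.
functions = etree.FunctionNamespace(EX_NS)
functions["shout"] = shout

# Invented sample document.
root = etree.fromstring(b"<a><b>Haegar</b></a>")

# string(//b) converts the first <b> element to its text before it is passed on.
result = root.xpath("ex:shout(string(//b))", namespaces={"ex": EX_NS})
print(result)
```

Registering the function under its own namespace, rather than the default one, keeps it from clashing with the built-in XPath functions.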
The XPath data model defines seven kinds of nodes (we have seen some of them previously):
- root node (top-level parent node)
- element nodes (<a>..</a>)
- attribute nodes (href="example.html")
- text nodes ("this is a text")
- comment nodes (<!-- a comment -->)
- namespace nodes
- processing instruction nodes
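The node kinds listed above can be selected with lxml roughly as follows. This is a sketch over an invented one-line document, not an example from the book; namespace nodes are left out to keep it short.

```python
from lxml import etree

# Invented sample document containing several kinds of nodes.
XML = (
    b'<root><?render mode="fast"?><!-- a comment -->'
    b'<a href="example.html">this is a text</a></root>'
)
root = etree.fromstring(XML)

top = root.xpath("/*")                          # top-level element under the root node
elements = root.xpath("//a")                    # element nodes
attributes = root.xpath("//a/@href")            # attribute nodes (returned as strings)
texts = root.xpath("//a/text()")                # text nodes (returned as strings)
comments = root.xpath("//comment()")            # comment nodes
pis = root.xpath("//processing-instruction()")  # processing instruction nodes

print(top[0].tag, elements[0].tag)
print(attributes, texts)
print(comments[0].text, pis[0].target)
```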
XPath expressions can return different data types:
- strings
- booleans
- numbers
- node-sets (probably the most common case)
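To illustrate the four result types, here is a small sketch using lxml on an invented list of numbers. lxml maps XPath strings, booleans, and numbers to Python strings, `bool`, and `float`, and returns node-sets as Python lists.

```python
from lxml import etree

# Invented sample data.
root = etree.fromstring(b"<ul><li>1.5</li><li>2.5</li></ul>")

as_string = root.xpath("string(//li[1])")   # string value of the first <li>
as_boolean = root.xpath("count(//li) > 1")  # comparison yields a boolean
as_number = root.xpath("sum(//li)")         # numeric functions yield a float
as_nodes = root.xpath("//li")               # node-set comes back as a list

print(as_string, as_boolean, as_number, len(as_nodes))
```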
An XPath axis defines a node-set relative to the current (context) node. XPath defines a total of 13 axes, such as child, parent, ancestor, descendant, and following-sibling, which make it easy to search in different directions from the context node or from the root node.
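A few of the axes can be tried out with lxml as in the sketch below. The markup and the `first` id are invented for illustration; each expression is evaluated relative to the context node it is called on.

```python
from lxml import etree

# Invented sample markup.
MARKUP = b"<div><h1>Title</h1><p id='first'>One</p><p>Two</p><span>Three</span></div>"
root = etree.fromstring(MARKUP)

# Use the first paragraph as the context node.
first = root.xpath("//p[@id='first']")[0]

parent = first.xpath("parent::*")                # the enclosing <div>
following = first.xpath("following-sibling::*")  # siblings after the context node
preceding = first.xpath("preceding-sibling::*")  # siblings before the context node
lineage = first.xpath("ancestor-or-self::*")     # the context node and its ancestors
texts = root.xpath("descendant::text()")         # every text node below <div>

print([e.tag for e in following], [e.tag for e in preceding])
print(texts)
```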
lxml is a Python wrapper on top of the libxml2 XML parsing library, which is written in C. The C implementation helps make it faster than Beautiful Soup, but it can also make lxml harder to install on some computers. The latest installation instructions are available at: http://lxml.de/installation.html.
lxml supports XPath, which makes it considerably easier to work with complex XML and HTML documents. We will examine several techniques for using lxml and XPath together to navigate the DOM and access data.
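As a first taste of the combination, the sketch below parses a small HTML fragment with `lxml.html` and extracts data with XPath. The page content, the `events` id, and the `event` class are invented for illustration and are not taken from the book's examples.

```python
from lxml import html

# Invented HTML page; a real scraper would download this instead.
PAGE = """
<html><body>
  <h1>Events</h1>
  <ul id="events">
    <li class="event"><a href="/e/1">PyCon</a></li>
    <li class="event"><a href="/e/2">SciPy</a></li>
  </ul>
</body></html>
"""

tree = html.fromstring(PAGE)

# Select each list item, then query relative to it for the name and link.
rows = tree.xpath("//ul[@id='events']/li[@class='event']")
events = [
    {"name": str(row.xpath("string(a)")), "url": str(row.xpath("a/@href")[0])}
    for row in rows
]

for event in events:
    print(event["name"], event["url"])
```

Querying relative to each row, rather than running two separate document-wide queries, keeps each name paired with its own link even if some rows are missing a field.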