Querying the DOM with XPath and lxml

XPath is a query language for selecting nodes from an XML document, and it is well worth learning for anyone who does web scraping. It offers several advantages over other ways of locating content in a document:

  • It can easily navigate through the DOM tree in any direction
  • It is more expressive than CSS selectors and more robust than regular expressions
  • It has a large library of built-in functions (more than 200 in recent XPath versions; fewer in XPath 1.0, which lxml implements) and can be extended with custom functions
  • It is widely supported by parsing libraries and scraping platforms

The XPath data model defines seven kinds of nodes (we have seen some of them previously):

  • root node (top level parent node)
  • element nodes (<a>..</a>)
  • attribute nodes (href="example.html")
  • text nodes ("this is a text")
  • comment nodes (<!-- a comment -->)
  • namespace nodes 
  • processing instruction nodes
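
To make the node kinds above concrete, here is a minimal sketch that selects several of them with lxml. The sample document and variable names are made up for illustration, and the root and namespace nodes are left out to keep the example short.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = b"""<root>
  <?render mode="fast"?>
  <!-- a comment -->
  <a href="example.html">this is a text</a>
</root>"""

root = etree.fromstring(xml)

elements = root.xpath("//a")                    # element nodes
attributes = root.xpath("//a/@href")            # attribute nodes (lxml returns strings)
texts = root.xpath("//a/text()")                # text nodes (lxml returns strings)
comments = root.xpath("//comment()")            # comment nodes
pis = root.xpath("//processing-instruction()")  # processing instruction nodes

print(elements[0].tag)   # a
print(attributes)        # ['example.html']
print(texts)             # ['this is a text']
print(comments[0].text)  # the comment body, including surrounding spaces
print(pis[0].target)     # render
```

Note that lxml hands back attribute and text nodes as string-like objects rather than as separate node objects, which is usually what a scraper wants.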

XPath expressions can return different data types:

  • strings
  • booleans
  • numbers
  • node-sets (probably the most common case)
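
The four result types can be seen directly in lxml, which maps them to Python `str`, `bool`, `float`, and `list`. The following is a small sketch using a made-up document; the element names are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring("<items><item>3</item><item>4</item></items>")

as_string = root.xpath("string(//item[1])")   # string result
as_boolean = root.xpath("count(//item) > 1")  # boolean result
as_number = root.xpath("sum(//item)")         # number result (always a float)
as_nodeset = root.xpath("//item")             # node-set result (a Python list)

print(as_string)        # 3
print(as_boolean)       # True
print(as_number)        # 7.0
print(len(as_nodeset))  # 2
```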

An (XPath) axis defines a node-set relative to the current node. XPath defines 13 axes in total, such as child, parent, ancestor, and following-sibling, which make it easy to search in any direction through the tree, starting from the current context node or from the root node.
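
As a brief sketch of how axes work in practice, the example below picks one element as the context node and then moves away from it along a few axes. The document structure and names are hypothetical.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    "<library>"
    "<shelf id='s1'><book>A</book><book>B</book><book>C</book></shelf>"
    "</library>"
)

# Context node: the second <book> on the shelf.
middle = root.xpath("//book[2]")[0]

parent_id = middle.xpath("parent::shelf/@id")                    # up one level
ancestors = [e.tag for e in middle.xpath("ancestor::*")]         # all the way up
before = middle.xpath("preceding-sibling::book/text()")          # earlier siblings
after = middle.xpath("following-sibling::book/text()")           # later siblings
own_text = middle.xpath("self::book/text()")                     # the node itself

print(parent_id)  # ['s1']
print(before)     # ['A']
print(after)      # ['C']
print(own_text)   # ['B']
print(ancestors)  # contains 'library' and 'shelf'
```

Calling `xpath()` on an element rather than on the root is what sets the context node, so the same axis expression gives different results depending on where it is evaluated.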

lxml is a Python wrapper around the libxml2 XML parsing library, which is written in C. The C implementation makes lxml faster than Beautiful Soup, but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html.

lxml supports XPath, which makes it considerably easier to work with complex XML and HTML documents. We will examine several techniques for using lxml and XPath together to navigate the DOM and access data.
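
As a first taste of the combination, here is a minimal sketch that parses an HTML string with `lxml.html` and extracts data with XPath. The markup, the `id` and `class` values, and the URLs are invented for illustration and do not refer to any real site.

```python
from lxml import html

# Hypothetical HTML page, invented for this illustration.
page = """
<html><body>
  <ul id="countries">
    <li class="country"><a href="/view/1">Alpha</a></li>
    <li class="country"><a href="/view/2">Beta</a></li>
  </ul>
</body></html>
"""

tree = html.fromstring(page)

# Link text of every country entry in the list.
names = tree.xpath('//ul[@id="countries"]/li[@class="country"]/a/text()')

# href attribute of every link anywhere under the list.
links = tree.xpath('//ul[@id="countries"]//a/@href')

print(names)  # ['Alpha', 'Beta']
print(links)  # ['/view/1', '/view/2']
```

In a real scraper the `page` string would come from a downloaded response; the XPath expressions stay the same.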