
3.4 CSS Selectors

CSS stands for Cascading Style Sheets. Its selectors are a language for locating specific parts of an HTML document.

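The syntax of CSS selectors is somewhat simpler than that of XPath, but it is less powerful. In fact, when we call the css method of a Selector object, the Python library cssselect is used internally to translate the CSS selector expression into an XPath expression, and the xpath method of the Selector object is then called.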

Table 3-2 lists some of the basic syntax of CSS selectors.

Table 3-2 CSS selectors

As with XPath, we demonstrate the use of CSS selectors through a few examples.

First, create an HTML document and construct an HtmlResponse object:

    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse
    >>> body = '''
    ... <html>
    ...    <head>
    ...        <base />
    ...        <title>Example website</title>
    ...    </head>
    ...    <body>
    ...        <div id='images-1' style="width: 1230px; ">
    ...           <a href='image1.html'>Name: Image 1 <br/><img src='image1.jpg'/></a>
    ...           <a href='image2.html'>Name: Image 2 <br/><img src='image2.jpg'/></a>
    ...           <a href='image3.html'>Name: Image 3 <br/><img src='image3.jpg'/></a>
    ...        </div>
    ...
    ...        <div id='images-2' class='small'>
    ...           <a href='image4.html'>Name: Image 4 <br/><img src='image4.jpg'/></a>
    ...           <a href='image5.html'>Name: Image 5 <br/><img src='image5.jpg'/></a>
    ...        </div>
    ...    </body>
    ... </html>
    ... '''
    >>> response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')

● E: selects E elements.

        # Select all img elements
        >>> response.css('img')
        [<Selector xpath='descendant-or-self::img' data='<img src="image1.jpg">'>,
         <Selector xpath='descendant-or-self::img' data='<img src="image2.jpg">'>,
         <Selector xpath='descendant-or-self::img' data='<img src="image3.jpg">'>,
         <Selector xpath='descendant-or-self::img' data='<img src="image4.jpg">'>,
         <Selector xpath='descendant-or-self::img' data='<img src="image5.jpg">'>]

● E1, E2: selects both E1 and E2 elements.

        # Select all base and title elements
        >>> response.css('base, title')
        [<Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<base>'>,
         <Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<title>Example website</title>'>]

● E1 E2: selects E2 elements that are descendants of an E1 element.

        # img elements among the descendants of a div
        >>> response.css('div img')
        [<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image1.jpg">'>,
         <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image2.jpg">'>,
         <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image3.jpg">'>,
         <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image4.jpg">'>,
         <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="image5.jpg">'>]

● E1>E2: selects E2 elements that are direct children of an E1 element.

        # div elements that are children of body
        >>> response.css('body>div')
        [<Selector xpath='descendant-or-self::body/div' data='<div id="images-1" style="width: 1230px; '>,
         <Selector xpath='descendant-or-self::body/div' data='<div id="images-2" class="small">\n     '>]

● [ATTR]: selects elements that have an ATTR attribute.

        # Select elements that have a style attribute
        >>> response.css('[style]')
        [<Selector xpath='descendant-or-self::*[@style]' data='<div id="images-1" style="width: 1230px; '>]

● [ATTR=VALUE]: selects elements that have an ATTR attribute whose value is VALUE.

        # Select elements whose id attribute is images-1
        >>> response.css('[id=images-1]')
        [<Selector xpath="descendant-or-self::*[@id = 'images-1']" data='<div id="images-1" style="width: 1230px; '>]

● E:nth-child(n): selects an E element that is the n-th child of its parent.

        # Select the first a of each div
        >>> response.css('div>a:nth-child(1)')
        [<Selector xpath="descendant-or-self::div/*[name() = 'a' and (position() = 1)]" data='<a href="image1.html">Name: Image 1 <br>'>,
         <Selector xpath="descendant-or-self::div/*[name() = 'a' and (position() = 1)]" data='<a href="image4.html">Name: Image 4 <br>'>]

        # Select the first a of the second div
        >>> response.css('div:nth-child(2)>a:nth-child(1)')
        [<Selector xpath="descendant-or-self::*/*[name() = 'div' and (position() = 2)]/*[name() = 'a' and (position() = 1)]" data='<a href="image4.html">Name: Image 4 <br>'>]

● E:first-child: selects an E element that is the first child of its parent.

● E:last-child: selects an E element that is the last child of its parent.

        # Select the last a of the first div
        >>> response.css('div:first-child>a:last-child')
        [<Selector xpath="descendant-or-self::*/*[name() = 'div' and (position() = 1)]/*[name() = 'a' and (position() = last())]" data='<a href="image3.html">Name: Image 3 <br>'>]

● E::text: selects the text nodes of E elements.

        # Select the text of all a elements
        >>> sel = response.css('a::text')
        >>> sel
        [<Selector xpath='descendant-or-self::a/text()' data='Name: Image 1 '>,
         <Selector xpath='descendant-or-self::a/text()' data='Name: Image 2 '>,
         <Selector xpath='descendant-or-self::a/text()' data='Name: Image 3 '>,
         <Selector xpath='descendant-or-self::a/text()' data='Name: Image 4 '>,
         <Selector xpath='descendant-or-self::a/text()' data='Name: Image 5 '>]
        >>> sel.extract()
        ['Name: Image 1 ',
         'Name: Image 2 ',
         'Name: Image 3 ',
         'Name: Image 4 ',
         'Name: Image 5 ']

This concludes the introduction to CSS selectors. For more details, see the CSS selectors specification: https://www.w3.org/TR/css3-selectors/.
