Using XPath with lxml

I. What lxml does

lxml parses an HTML string into an element tree, so that XPath expressions can then be used to extract data from it.
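As a minimal sketch of that division of labour (the HTML fragment and variable names below are made up for illustration and are not part of the page analysed later), lxml does the parsing and XPath does the selecting:

```python
from lxml import etree

# A tiny, hypothetical HTML fragment used only for illustration.
snippet = '<ul><li><a href="/a">first</a></li><li><a href="/b">second</a></li></ul>'

# etree.HTML() parses the string and returns the root element of the tree.
root = etree.HTML(snippet)

# XPath expressions are evaluated against the parsed tree, not the raw string.
link_texts = root.xpath('//a/text()')
link_hrefs = root.xpath('//a/@href')

print(link_texts)
print(link_hrefs)
```

Note that etree.HTML() wraps a fragment in html and body elements, which is why expressions starting with // are the convenient way to reach the nodes.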

II. Extracting content from the following HTML page

text = \
"""
<ul class="ullist" padding="1" spacing="1"><li><div id="top"><span class="position" width="350">职位名称</span><span>职位类别</span><span>人数</span><span>地点</span><span>发布时间</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218">python开发工程师</a></span><span>技术类</span><span>2</span><span>上海</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=29938&amp;keywords=python&amp;tid=87&amp;lid=2218">python后端</a></span><span>技术类</span><span>2</span><span>上海</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=31236&amp;keywords=python&amp;tid=87&amp;lid=2218">高级Python开发工程师</a></span><span>技术类</span><span>2</span><span>上海</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=31235&amp;keywords=python&amp;tid=87&amp;lid=2218">python架构师</a></span><span>技术类</span><span>1</span><span>上海</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=34531&amp;keywords=python&amp;tid=87&amp;lid=2218">Python数据开发工程师</a></span><span>技术类</span><span>1</span><span>上海</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=34532&amp;keywords=python&amp;tid=87&amp;lid=2218">高级图像算法研发工程师</a></span><span>技术类</span><span>1</span><span>上海</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=31648&amp;keywords=python&amp;tid=87&amp;lid=2218">高级AI开发工程师</a></span><span>技术类</span><span>4</span><span>上海</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" 
href="position_detail.php?id=32218&amp;keywords=python&amp;tid=87&amp;lid=2218">后台开发工程师</a></span><span>技术类</span><span>1</span><span>上海</span><span>2018-10-23</span></div><div id="even"><span class="l square"><a target="_blank" href="position_detail.php?id=32217&amp;keywords=python&amp;tid=87&amp;lid=2218">Python开发(自动化运维方向)</a></span><span>技术类</span><span>1</span><span>上海</span><span>2018-10-23</span></div><div id="odd"><span class="l square"><a target="_blank" href="position_detail.php?id=34511&amp;keywords=python&amp;tid=87&amp;lid=2218">Python数据挖掘讲师 </a></span><span>技术类</span><span>1</span><span>上海</span><span>2018-10-23</span></div></li>
</ul>
"""

Before doing anything else, we parse the HTML string into an HTML document:

from lxml import etree
html = etree.HTML(text)

1. Get all div tags [node selection]

divs = html.xpath('//div')
print(divs)

The output is a list of Element objects rather than the markup of the div tags. Why?

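xpath() returns the matched nodes as Element objects, not as strings.
To see the markup of every div, we loop over the list and serialize each element back to a string with etree.tostring().
Passing encoding='utf8' and then decoding keeps the Chinese text readable:
for div in divs:
    # Serialize the element to UTF-8 bytes, then decode to str
    d = etree.tostring(div, encoding='utf8').decode('utf8')
    print(d)
    print("*" * 10)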

This prints the markup of every div tag.
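To make the difference between an Element and its serialized markup concrete, here is a small sketch on a hypothetical one-div fragment. It also shows why the encoding argument matters: without it, etree.tostring() produces ASCII bytes and escapes non-ASCII characters as numeric character references.

```python
from lxml import etree

# Hypothetical fragment, chosen only to contain some non-ASCII text.
sample = '<div id="top"><span>职位</span></div>'
div = etree.HTML(sample).xpath('//div')[0]

# An Element exposes its tag name and attributes directly.
tag_name = div.tag
div_id = div.get('id')

# Default serialization: ASCII bytes with character references.
escaped = etree.tostring(div)

# Explicit UTF-8 serialization, decoded back to a readable str.
readable = etree.tostring(div, encoding='utf8').decode('utf8')

print(tag_name, div_id)
print(escaped)
print(readable)
```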

2. Get one specific div tag [using predicates]

div = html.xpath('//div[1]')
print(etree.tostring(div,encoding='utf8').decode('utf8'))

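The code looks reasonable, yet running it raises an error, because etree.tostring() was handed a list instead of an element.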

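The point to remember: an xpath() call that selects nodes always returns a list, even when only one node matches, so we have to index into the result. The correct code is: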

div = html.xpath('//div[1]')[0]
print(etree.tostring(div,encoding='utf8').decode('utf8'))
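Two related pitfalls are worth a quick sketch. First, //div[1] means "every div that is the first div child of its parent", which only coincides with "the first div in the document" when all the divs share one parent, as they do in the page above. Second, indexing with [0] fails on an empty result. The fragment and the first_or_none helper below are hypothetical, for illustration only.

```python
from lxml import etree

# Hypothetical fragment with divs under two different parents.
sample = ('<ul><li><div id="a"></div><div id="b"></div></li>'
          '<li><div id="c"></div></li></ul>')
root = etree.HTML(sample)

# First div child of each parent: one match per li.
first_per_parent = root.xpath('//div[1]/@id')

# First div in the whole document: parenthesize before indexing.
first_overall = root.xpath('(//div)[1]/@id')


def first_or_none(nodes):
    """Hypothetical helper: return the first match, or None if there is none."""
    return nodes[0] if nodes else None


missing = first_or_none(root.xpath('//div[@id="missing"]'))

print(first_per_parent, first_overall, missing)
```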


3. Get all div tags with id="even"

divs = html.xpath('//div[@id="even"]')
for div in divs:
    # The predicate attaches directly to the node test: div[...], not div/[...]
    d = etree.tostring(div, encoding='utf8').decode('utf8')
    print(d)
    print('%' * 10)
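Attribute predicates are not limited to exact equality. The sketch below, on a hypothetical fragment that mirrors the id and class values used in the page above, shows three common forms: an exact match, a substring match with contains(), and a test for a missing attribute with not().

```python
from lxml import etree

# Hypothetical fragment reusing the id/class values from the page above.
sample = ('<div id="even"><span class="l square">A</span><span>B</span></div>'
          '<div id="odd"><span class="l square">C</span></div>')
root = etree.HTML(sample)

# Exact match on an attribute value.
even_spans = root.xpath('//div[@id="even"]/span/text()')

# Substring match, handy for multi-valued class attributes.
square_spans = root.xpath('//span[contains(@class, "square")]/text()')

# Elements that do not carry the attribute at all.
plain_spans = root.xpath('//span[not(@class)]/text()')

print(even_spans, square_spans, plain_spans)
```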

4. Get an attribute of a tag

(1) Get the value of the id attribute of every div

divs = html.xpath('//div/@id')
print(divs)

(2) Get the value of the href attribute of every a tag

hrefs = html.xpath('//a/@href')
print(hrefs)
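The href values come back as plain strings with the &amp; entities already decoded by the parser, so they can be passed straight to the standard library's urllib.parse. The sketch below pulls the id parameter out of one such link and builds an absolute URL; BASE_URL is a hypothetical placeholder, since the original page's address is not given.

```python
from urllib.parse import parse_qs, urljoin, urlparse

from lxml import etree

# One link copied from the page above.
sample = ('<a href="position_detail.php?id=33824&amp;keywords=python'
          '&amp;tid=87&amp;lid=2218">python</a>')
href = etree.HTML(sample).xpath('//a/@href')[0]

# Split the query string into a dict of lists.
query = parse_qs(urlparse(href).query)
job_id = query['id'][0]

# Hypothetical base address, used only to illustrate resolving a relative link.
BASE_URL = 'https://example.com/jobs/'
absolute_url = urljoin(BASE_URL, href)

print(href)
print(job_id)
print(absolute_url)
```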

5. Get all the job information inside the divs

The first div holds only the column headings, not job data, so we skip it by selecting the divs with position()>1.

To keep the results easy to read, we store each job as a dict and collect the dicts in a list.

divs = html.xpath('//div[position()>1]')
works = []
for div in divs:
    # Query each div (not the divs list); the leading "." keeps the path relative to it
    # href attribute of the a tag
    url = div.xpath('.//a/@href')[0]
    # Text of the a tag (the job title)
    position = div.xpath('.//a/text()')[0]
    # Job category
    work_type = div.xpath('.//span[2]/text()')[0]
    # Number of openings
    nums = div.xpath('.//span[3]/text()')[0]
    # Work location
    area = div.xpath('.//span[4]/text()')[0]
    # Publication date
    publish_time = div.xpath('.//span[5]/text()')[0]
    work = {"url": url, "position": position, "work_type": work_type, "nums": nums, "area": area, "time": publish_time}
    works.append(work)
print(works)
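As a self-contained variant of this step, the sketch below runs the same relative XPath queries on a shortened, hypothetical copy of the page (one heading row and one job row, with English placeholder values), then strips whitespace, converts the head count to an integer, and prints the result as JSON. The cleanup steps are optional additions, not something the original walk-through requires.

```python
import json

from lxml import etree

# Shortened, hypothetical stand-in for the page above.
sample = (
    '<ul><li>'
    '<div id="top"><span>Title</span><span>Category</span>'
    '<span>Openings</span><span>Location</span><span>Posted</span></div>'
    '<div id="even"><span><a href="detail.php?id=1"> Backend developer </a></span>'
    '<span>Engineering</span><span>2</span><span>Shanghai</span>'
    '<span>2018-10-23</span></div>'
    '</li></ul>'
)
root = etree.HTML(sample)

works = []
# Skip the first div, which holds only the column headings.
for row in root.xpath('//div[position()>1]'):
    works.append({
        # str() turns lxml's string results into plain Python strings.
        'url': str(row.xpath('.//a/@href')[0]),
        'position': str(row.xpath('.//a/text()')[0]).strip(),
        'work_type': str(row.xpath('.//span[2]/text()')[0]),
        'nums': int(row.xpath('.//span[3]/text()')[0]),
        'area': str(row.xpath('.//span[4]/text()')[0]),
        'time': str(row.xpath('.//span[5]/text()')[0]),
    })

# ensure_ascii=False keeps non-ASCII text readable in the output.
print(json.dumps(works, ensure_ascii=False, indent=2))
```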

In short, once you are comfortable with XPath syntax, extracting exactly the data you want becomes straightforward.
