Installing the libxml2 library and using XPath


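Python's libxml2 bindings support XPath. They are not bundled with Python, however, and must be installed separately.

The Win32 build of libxml2 can be downloaded from: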

http://xmlsoft.org/sources/win32/python/

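My Python version is 2.5, so I downloaded and installed libxml2-python-2.6.30.win32-py2.5.exe.

The installer places libxml2 in the default Python 2.5 directory (I installed ActivePython-2.5.2.2-win32-x86.msi, whose default installation path is C:\Python25).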
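
Once the installer has finished, a quick way to confirm that the bindings work is to evaluate an XPath expression with them. The following is only a minimal sketch: the helper name `xpath_contents` and the sample XML are made up for illustration, and the block simply skips the demo when `libxml2` cannot be imported.

```python
try:
    import libxml2  # the bindings installed above
except ImportError:
    libxml2 = None  # bindings not available on this interpreter


def xpath_contents(xml, expr):
    """Hypothetical helper: return the text content of each node matched by expr."""
    doc = libxml2.parseDoc(xml)
    ctxt = doc.xpathNewContext()
    try:
        return [node.content for node in ctxt.xpathEval(expr)]
    finally:
        # The bindings do not release these objects automatically,
        # so free the context and the document explicitly.
        ctxt.xpathFreeContext()
        doc.freeDoc()


if libxml2 is not None:
    # Sample XML invented for this sketch.
    sample = '<root><item id="a">first</item><item id="b">second</item></root>'
    print(xpath_contents(sample, "//item[@id='b']"))
```

The explicit `xpathFreeContext()` and `freeDoc()` calls are the main difference from lxml, which manages memory for you.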

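Another option is to use the easy_install tool, which works somewhat like yum on Linux, to install lxml, a Python binding built on libxml2 and libxslt.

For details, see: http://codespeak.net/lxml/installation.html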

Get the easy_install tool and run the following as super-user (or administrator):

easy_install lxml

On MS Windows, the above will install the binary builds that we provide. If there is no binary build of the latest release yet, please search PyPI for the last release that has them and pass that version to easy_install like this:
```
easy_install lxml==2.2.2
```
On Linux (and most other well-behaved operating systems), easy_install will manage to build the source distribution as long as libxml2 and libxslt are properly installed, including development packages, i.e. header files, etc. Use your package management tool to look for packages like libxml2-dev or libxslt-devel if the build fails, and make sure they are installed.
On MacOS-X, use the following to build the source distribution, and make sure you have a working Internet connection, as this will download libxml2 and libxslt in order to build them:
```
STATIC_DEPS=true easy_install lxml
```

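Attachment: setuptools-0.6c11.win32-py2.5.exe, which provides easy_install. Note that this installer is for Python 2.5 only.

setuptools-0.6c11.win32-py2.5.rar

Extract the archive and run the installer.
Then, at the command line, change to C:\Python25\Lib\site-packages and run easy_install lxml==2.2.2 to install lxml, which provides the libxml2-based XPath support used in the test program.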
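
To check that the easy_install step actually worked, it can help to ask lxml which versions it was built with. This is a small sketch, not part of the original post; the function name `lxml_versions` is hypothetical, while `LXML_VERSION` and `LIBXML_VERSION` are version tuples exposed by `lxml.etree`.

```python
def lxml_versions():
    """Hypothetical helper: return lxml/libxml2 version tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {"lxml": etree.LXML_VERSION, "libxml2": etree.LIBXML_VERSION}


versions = lxml_versions()
if versions is None:
    print("lxml is not installed")
else:
    # Join the version tuples into dotted strings for display.
    print("lxml %s, libxml2 %s" % (
        ".".join(map(str, versions["lxml"])),
        ".".join(map(str, versions["libxml2"]))))
```

If the script reports that lxml is not installed, the easy_install command most likely ran against a different Python installation.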

After installation, you can test it with the program below. Let's see how powerful XPath is!


#coding:utf-8

import codecs
import sys
# Wrap stdout in a UTF-8 writer; without this, printing non-ASCII text
# can raise UnicodeEncodeError.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

from lxml import etree

# A unicode literal, so that lxml does not have to guess the byte encoding.
html = u'''<div>
    <div>redice</div>
    <div id="email">redice@163.com</div>
    <div name="address">中国</div>
    <div>http://www.redicecn.com</div>
</div>'''

tree = etree.HTML(html)

# Get the email: it is in the div whose id is "email".
nodes = tree.xpath("//div[@id='email']")
print nodes[0].text

# Get the address: it is in the div whose name is "address".
nodes = tree.xpath("//div[@name='address']")
print nodes[0].text

# Get the blog URL: it is in the second sibling div after the email div.
nodes = tree.xpath("//div[@id='email']/following-sibling::div[2]")
print nodes[0].text
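
The test program indexes `nodes[0]` directly, which raises IndexError when an expression matches nothing. When scraping real pages, a small wrapper can make that case explicit. The sketch below is an illustration only: `first_text` is a hypothetical helper, and the HTML snippet is a shortened version of the one above. It uses `print(...)` with a single argument so that it runs on both Python 2 and Python 3.

```python
def first_text(tree, expr, default=None):
    """Hypothetical helper: text of the first node matched by expr, or default."""
    nodes = tree.xpath(expr)
    return nodes[0].text if nodes else default


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml missing: the demo below is skipped

if etree is not None:
    tree = etree.HTML(u'<div><div id="email">redice@163.com</div></div>')
    # This div exists, so its text is returned.
    print(first_text(tree, "//div[@id='email']"))
    # No div has id="phone", so the default is returned instead of raising.
    print(first_text(tree, "//div[@id='phone']", "n/a"))
    # Appending text() makes XPath return the strings themselves, not elements.
    print(tree.xpath("//div[@id='email']/text()"))
```

Returning a default keeps a scraping loop running when a page is missing a field, at the cost of having to check for that default later.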


redice's Blog

Currently focused on Web data scraping

redice's Blog is powered by DedeCms | Theme by Monkeii.Lee | Sitemap | Server kindly provided by 西安鲲之鹏网络信息技术有限公司
