How to deal with Unicode problems when screen scraping


# How to deal with Unicode problems when screen scraping
# by redice 2011.03.04


# The following example shows how to solve this problem

from webscraping import common, download, xpath

D = download.Download()

url = 'http://www.infobel.com/fr/belgium/mediterranea/schaerbeek/022306274/businessdetails.aspx'
html = D.get(url)

# Decode the HTML from Windows-1252, replace non-breaking spaces, then encode to the
# charset you need (this assumes your debug I/O encoding is UTF-8)
html = html.decode('Windows-1252').replace(u'\xa0', ' ').encode('utf8', 'replace')

s = xpath.get(html, '//div[@class="result-details"]')
print common.unescape(s)

The output is:

Chaussée de Louvain 446, 1030 Schaerbeek<ul class="result-data"><li>Téléphone:022306274</li><li>Fax:026095995</li><li>Email:mediterranea@skynet.be <a href="sendmail.aspx?qphone=022306274&pos=1&rc=1&sp=mediterraneaschaerbeek">Envoyer un message</a></li></ul>
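
The example above is written for Python 2 and needs network access. As a rough, offline sketch of the same conversion step in Python 3, the snippet below uses a hypothetical sample string (not the real page) and the standard library's html.unescape as a stand-in for common.unescape. The helper name to_text is an assumption made for illustration only.

```python
import html as html_lib

# Hypothetical sample standing in for the downloaded page: Windows-1252 bytes
# containing accented letters, a non-breaking space (0xA0) and an HTML entity.
raw = 'Chauss\xe9e de Louvain 446,\xa01030 Schaerbeek &amp; T\xe9l\xe9phone'.encode('windows-1252')


def to_text(raw_bytes, source_encoding='windows-1252'):
    """Decode raw bytes to text and replace non-breaking spaces with plain spaces."""
    text = raw_bytes.decode(source_encoding)
    return text.replace('\xa0', ' ')


text = to_text(raw)
clean = html_lib.unescape(text)  # stdlib stand-in for common.unescape

# Decoding the same bytes as UTF-8 fails, which is the usual symptom of the problem.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Encode only when the text leaves the program (here: printed as UTF-8 bytes).
print(clean.encode('utf-8'))
```

In Python 3 the decoded value is already a str, so the final encode is only needed when writing bytes to a file or socket.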

Summary:

Decode early
Unicode everywhere
Encode late
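
The three rules above can be sketched as a small pipeline. This is only an illustration in Python 3: the function names read_text, extract_phone and write_text, and the in-memory streams, are hypothetical and are not part of the webscraping library.

```python
import io


def read_text(stream, encoding):
    """Decode early: turn bytes into str as soon as they enter the program."""
    return stream.read().decode(encoding)


def extract_phone(text):
    """Unicode everywhere: all processing works on str, never on raw bytes."""
    label = 'T\xe9l\xe9phone:'
    start = text.index(label) + len(label)
    end = text.index('<', start)
    return text[start:end].strip()


def write_text(stream, text, encoding='utf-8'):
    """Encode late: convert back to bytes only at the output boundary."""
    stream.write(text.encode(encoding, 'replace'))


# Hypothetical input: a Windows-1252 byte stream imitating a fragment of the page.
source = io.BytesIO('<li>T\xe9l\xe9phone:022306274</li>'.encode('windows-1252'))
sink = io.BytesIO()

page = read_text(source, 'windows-1252')
phone = extract_phone(page)
write_text(sink, 'T\xe9l\xe9phone: ' + phone)
```

Keeping the encoding names confined to read_text and write_text means the middle of the program never has to know which charset the site used.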

You may find the following articles helpful:

http://farmdev.com/talks/unicode/

http://docs.python.org/howto/unicode.html

http://effbot.org/zone/unicode-objects.htm

http://boodebr.org/main/python/all-about-python-and-unicode

redice's Blog is powered by DedeCms | Theme by Monkeii.Lee | Sitemap | Server kindly provided by 西安鲲之鹏网络信息技术有限公司 (Xi'an Kunzhipeng Network Information Technology Co., Ltd.)
