
How to deal with Unicode problems when screen scraping

# How to deal with Unicode problems when screen scraping
# by redice 2011.03.04


# The following example shows one way to solve this problem

from webscraping import common, download, xpath

D = download.Download()

url = 'http://www.infobel.com/fr/belgium/mediterranea/schaerbeek/022306274/businessdetails.aspx'
html = D.get(url)

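# Convert the HTML from Windows-1252 to the charset your output needs.
# This assumes your Debug I/O Encoding is UTF-8. Non-breaking spaces (\xa0)
# are replaced with ordinary spaces before re-encoding.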
html = html.decode('Windows-1252').replace(u'\xa0', ' ').encode('utf8', 'replace')

s = xpath.get(html, '//div[@class="result-details"]')
print common.unescape(s)

 

The following is the output:

<strong>Chaussée de Louvain 446, 1030 Schaerbeek</strong><ul class="result-data"><li><span>Téléphone:</span><em>022306274</em></li><li><span>Fax:</span><em>026095995</em></li><li><span>Email:</span><em>mediterranea@skynet.be<br /><a href="sendmail.aspx?qphone=022306274&pos=1&rc=1&sp=mediterraneaschaerbeek">Envoyer un message</a></em></li></ul> 
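The conversion step can be tried without downloading anything. The sketch below is written for Python 3 (the example above is Python 2), and the sample bytes are made up for illustration; they only imitate what a Windows-1252 page might send.

```python
# Hypothetical sample: bytes as a Windows-1252 page might send them.
# \xe9 is "é" and \xa0 is a non-breaking space in Windows-1252.
raw = b'Chauss\xe9e de Louvain\xa0446'

# Decoding with the wrong charset fails: \xe9 followed by "e"
# is not a valid UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decode with the right charset, swap the non-breaking space
# for an ordinary one, then encode to UTF-8 for output.
text = raw.decode('windows-1252').replace('\xa0', ' ')
out = text.encode('utf-8')

print(utf8_ok)
print(text)
print(out)
```

If the page really is UTF-8, the first decode succeeds and the Windows-1252 step is not needed, so it is worth checking the charset the page declares before hard-coding one.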

 

summary:

  • Decode early
  • Unicode everywhere
  • Encode late
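The three rules above can be sketched as a small pipeline. This is a minimal Python 3 illustration, not part of the webscraping library: the function names decode_early, clean and encode_late are hypothetical, and the input bytes are invented sample data.

```python
def decode_early(raw, encoding='windows-1252'):
    """Decode early: turn incoming bytes into text at the boundary."""
    return raw.decode(encoding)


def clean(text):
    """Unicode everywhere: all processing happens on text, not bytes."""
    return text.replace('\xa0', ' ').strip()


def encode_late(text, encoding='utf-8'):
    """Encode late: go back to bytes only when writing output."""
    return text.encode(encoding, 'replace')


# Hypothetical input bytes in Windows-1252.
raw = b' T\xe9l\xe9phone:\xa0022306274 '

text = clean(decode_early(raw))
out = encode_late(text)

# Lengths differ because each "é" is one character but two UTF-8 bytes.
print(len(text), len(out))  # 20 22
```

Keeping the charset names in exactly two places, the input and output boundaries, means the code in between never has to know which encoding the page used.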

 

You may find the following articles helpful:

http://farmdev.com/talks/unicode/

http://docs.python.org/howto/unicode.html

http://effbot.org/zone/unicode-objects.htm

http://boodebr.org/main/python/all-about-python-and-unicode

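[Post information]

This post was published by redice on redice's Blog at 2011-03-04 20:19. Besides commenting, you may also repost "how to deal with unicode problem for screen scrape" on your own website or blog, but please keep the source URL and author information. Thank you!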