# How to deal with Unicode problems when screen scraping
# by redice 2011.03.04
# The following example shows how to solve this problem
from webscraping import common, download, xpath
D = download.Download()
url = 'http://www.infobel.com/fr/belgium/mediterranea/schaerbeek/022306274/businessdetails.aspx'
html = D.get(url)
# Convert the html from Windows-1252 into the needed charset, assuming your Debug I/O Encoding is utf8
html = html.decode('Windows-1252').replace(u'\xa0', ' ').encode('utf8', 'replace')
s = xpath.get(html, '//div[@class="result-details"]')
print common.unescape(s)
The following is the output:
<strong>Chaussée de Louvain 446, 1030 Schaerbeek</strong><ul class="result-data"><li><span>Téléphone:</span><em>022306274</em></li><li><span>Fax:</span><em>026095995</em></li><li><span>Email:</span><em>mediterranea@skynet.be<br /><a href="sendmail.aspx?qphone=022306274&pos=1&rc=1&sp=mediterraneaschaerbeek">Envoyer un message</a></em></li></ul>
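To see why the conversion step matters, here is a minimal sketch that uses only the standard library. The sample bytes are an assumption (a fragment of the address above, encoded the way a Windows-1252 page would send it), not data captured from the real site. It should run under both Python 2 and Python 3.

```python
# Hypothetical sample: part of the address above as Windows-1252 bytes,
# with a non-breaking space (0xA0) before the house number.
raw = u'Chauss\xe9e de Louvain\xa0446'.encode('Windows-1252')

# Treating these bytes as UTF-8 fails, because 0xE9 followed by 'e'
# is not a valid UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same conversion as in the post: decode with the page's real charset,
# replace the non-breaking space, then re-encode as UTF-8 for output.
text = raw.decode('Windows-1252').replace(u'\xa0', u' ')
converted = text.encode('utf-8', 'replace')
```

If the page's charset is guessed wrong, the decode either raises an error like the one caught here or silently produces garbled characters, so it is worth checking the charset the server actually declares.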
Summary:
- Decode early
- Unicode everywhere
- Encode late
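The three rules above can be sketched as three small functions. The function names and the sample bytes are hypothetical, chosen only for illustration; they are not part of the webscraping library.

```python
def decode_early(raw_bytes, source_encoding):
    """Turn incoming bytes into Unicode text as soon as they arrive."""
    return raw_bytes.decode(source_encoding)


def clean(text):
    """Work on Unicode text only: swap non-breaking spaces, trim the ends."""
    return text.replace(u'\xa0', u' ').strip()


def encode_late(text, target_encoding='utf-8'):
    """Turn text back into bytes only at the output boundary."""
    return text.encode(target_encoding, 'replace')


# Hypothetical Windows-1252 input, similar to a field from the page above.
raw = b' T\xe9l\xe9phone:\xa0022306274 '
text = clean(decode_early(raw, 'Windows-1252'))
out = encode_late(text)
```

Keeping all string handling in the middle step means only the two boundary functions need to know about charsets.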
You may find the following articles helpful:
http://farmdev.com/talks/unicode/
http://docs.python.org/howto/unicode.html