Garbled Chinese Text When Scraping Web Pages with Python

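I have recently been learning Python and practicing fetching and parsing web pages with it. While parsing a page that uses the gb2312 character set, the Chinese text came out garbled, along with this error: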

UnicodeEncodeError: 'gbk' codec can't encode character u'\xbb' in position 0: illegal multibyte sequence
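
The error can be reproduced offline. The sketch below assumes a likely cause: the page bytes were decoded with the wrong codec, which leaves Latin-1 characters such as U+00BB in the string, and the GBK codec (the console encoding on Chinese-language Windows) cannot encode them when the string is printed. The helper name can_encode is hypothetical and only serves the illustration.

```python
def can_encode(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# U+00BB is a Latin-1 character; it is the one named in the error message above.
print(can_encode('\xbb', 'gbk'))
# Ordinary Chinese text is representable in GBK.
print(can_encode('中文', 'gbk'))
```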

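Cause and fix: the text was decoded with the wrong character encoding. Re-encode it with the encoding that requests assumed, which recovers the original bytes, then decode those bytes with the page's real encoding: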

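# A simple helper that fetches a page's source via GET using the requests library
import requests

def getsource(url):
    html = requests.get(url)
    # Re-encode with the encoding requests assumed, recovering the raw bytes
    s = html.text.encode(html.encoding)
    # Decode the raw bytes as gb2312 to get a proper unicode string
    s = s.decode('gb2312', 'ignore')
    # print(s)
    return s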

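Of course, the # coding: utf-8 declaration at the top of the script is needed as well (Python 2 requires it whenever the source file contains non-ASCII characters).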
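
The round trip used in getsource can be checked without a network request. This is a minimal sketch, assuming the server omitted the charset so the bytes were first decoded as ISO-8859-1 (the fallback requests applies to text responses with no declared charset). The sample string and the helper name fix_mojibake are illustrative only.

```python
def fix_mojibake(wrong_text, assumed_encoding, real_encoding):
    """Undo a wrong decode: recover the raw bytes, then decode them correctly."""
    raw = wrong_text.encode(assumed_encoding)
    return raw.decode(real_encoding, 'ignore')


original = '中文网页'
# Bytes as a gb2312 server would send them.
page_bytes = original.encode('gb2312')
# What the client sees if it wrongly assumes ISO-8859-1.
wrong_text = page_bytes.decode('iso-8859-1')
# Re-encode with the assumed encoding, then decode with the real one.
fixed = fix_mojibake(wrong_text, 'iso-8859-1', 'gb2312')
print(fixed == original)
```

Because ISO-8859-1 maps every byte to exactly one character, this round trip loses nothing. With other assumed encodings, some bytes may already have been dropped or replaced during the first decode.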

Reference: Python编码unicode Gbk Utf8字符集转换的正确姿势 - 大星哥的博客 | BIGSINGER Blog

Specifying the Accept Header

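# Sometimes the server does not return the correct encoding, which garbles the text;
# in that case, try specifying 'Accept': 'text/html' in the headers.
import requests

def post(url, data, headers=None):
    h = headers
    if h is None:
        h = {
            'Accept': 'text/html',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        }
    r = requests.post(url, data=data, headers=h)
    # bytes.decode() with no argument decodes as UTF-8
    return r.content.decode()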
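
Note that r.content.decode() assumes UTF-8, so it will typically raise UnicodeDecodeError on a GB2312 page. Below is a minimal sketch of a more forgiving decoder, assuming the pages are either UTF-8 or one of the GB-family encodings. The helper name decode_html and the candidate list are assumptions, not part of the original post.

```python
def decode_html(raw, candidates=('utf-8', 'gb18030')):
    """Try each candidate encoding in order and return the first clean decode."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and
    # substitute U+FFFD for the bytes that cannot be decoded.
    return raw.decode(candidates[0], errors='replace')


print(decode_html('中文'.encode('utf-8')))
print(decode_html('中文'.encode('gb2312')))
```

gb18030 is used as the second candidate because it is a superset of both GB2312 and GBK, so one entry covers all three.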

Document Information

Author: zhupite
Permalink: https://zhupite.com/python/python%E7%88%AC%E5%8F%96%E4%B8%AD%E6%96%87%E7%BD%91%E9%A1%B5%E5%86%85%E5%AE%B9%E4%B9%B1%E7%A0%81.html
License: free to repost, non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)
