Python: Sanitize a string for unicode?
Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?
I have a string that I'm trying to make safe for the unicode() function:
>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?
Somewhat related to this question, although I was unable to solve my problem from it.
This also fails:
>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte
Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings," they're byte arrays. So your string: where did it come from, and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. I tried pasting it into a Python interpreter, and typing it on OS X with Option-[, and it didn't come through.
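To make the "byte array" point concrete, here is a small sketch that lists the raw byte values and picks out the ones outside the ASCII range. It assumes the string holds exactly the bytes shown in the question's second example; the names `raw`, `byte_values`, and `high_bytes` are made up for illustration. It is written to behave the same on Python 2.6+ and Python 3.

```python
# The question's string as raw bytes (copied from its second example).
raw = b' foo \x93bar bar \x94 weasel'

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(raw))

# Collect (position, hex value) for every byte outside the ASCII range 0-127.
high_bytes = [(i, hex(b)) for i, b in enumerate(byte_values) if b > 127]

print(high_bytes)  # [(5, '0x93'), (14, '0x94')]
```

Position 5 is the same offset that both tracebacks in the question complain about.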
Looking at your second example though, you have a byte with hex value 93. That can't be UTF-8: in UTF-8, any byte higher than 127 is part of a multibyte sequence, and 0x93 is a continuation byte, so it can't appear right after a plain space the way it does here. So I'm guessing it's supposed to be Latin-1. The problem is, \x93 isn't a printable character in the Latin-1 character set. The range from \x80 to \x9f in Latin-1 is reserved for control characters, so ordinary text shouldn't contain it. However, Microsoft saw that unused range and decided to put "curly quotes" (among other things) in there. In doing so they created this similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
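The three candidate encodings can be compared directly. This is only a sketch: `try_decode` is a hypothetical helper, not something from the question, and the byte string is assumed to match the question's second example. Note that Python's own latin-1 codec does not raise on 0x93, because it maps every byte to the code point with the same number; what you get back is an unprintable control character rather than a quote.

```python
# The question's string as raw bytes (copied from its second example).
raw = b' foo \x93bar bar \x94 weasel'

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# 0x93 is a UTF-8 continuation byte with no lead byte before it, so this fails.
as_utf8 = try_decode(raw, 'utf-8')

# Succeeds, but 0x93 and 0x94 become the control characters U+0093 and U+0094.
as_latin1 = try_decode(raw, 'latin-1')

# Succeeds, and 0x93 and 0x94 become the curly quotes U+201C and U+201D.
as_cp1252 = try_decode(raw, 'windows-1252')

print(as_utf8 is None)    # True
print(repr(as_cp1252))
```

Only the windows-1252 result contains the curly quotes the question started with, which supports the guess about where the bytes came from.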
So, let's assume it is windows-1252. What now? str.decode converts bytes into Unicode, so that's the one you want. (Your first example called encode on a byte string, which makes Python 2 decode it as ASCII first; that's where the 'ascii' UnicodeDecodeError came from.) Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:
>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>
That's correct, because the opening curly quote is Unicode U+201C and the closing one is U+201D. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite function, unicode.encode.
>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'
Curly quotes take 3 bytes each to encode in UTF-8. You could use UTF-16 and they'd only be two bytes each, although every ASCII character would then take two bytes as well. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
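The byte counts and the ASCII/Latin-1 limitation can be checked with a short sketch. The helper `can_encode` and the variable names are illustrative only. The last two lines use the standard `'replace'` and `'ignore'` error handlers, which are one lossy way to get what the question literally asked for (dropping the characters a target encoding can't represent); whether losing the quotes is acceptable depends on your data. The code is written to run on Python 2.6+ and Python 3.

```python
# The decoded text from the answer, written with escapes to keep the source ASCII.
uni = u'foo \u201cbar bar\u201d weasel'

utf8_bytes = uni.encode('utf-8')       # each curly quote becomes 3 bytes
utf16_bytes = uni.encode('utf-16-le')  # explicit byte order, so no BOM is added

def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

ascii_ok = can_encode(uni, 'ascii')            # False: no curly quotes in ASCII
latin1_ok = can_encode(uni, 'latin-1')         # False: none in Latin-1 either
cp1252_ok = can_encode(uni, 'windows-1252')    # True: back to 0x93 and 0x94

# Lossy fallbacks: substitute or drop whatever ASCII cannot represent.
replaced = uni.encode('ascii', 'replace')  # b'foo ?bar bar? weasel'
dropped = uni.encode('ascii', 'ignore')    # b'foo bar bar weasel'
```

For this 20-character string, UTF-8 produces 24 bytes and UTF-16 (without a byte order mark) produces 40, so the two-byte quotes do not make UTF-16 smaller overall for mostly-ASCII text.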