Unbaking mojibake
Problem description
When you have incorrectly decoded characters, how can you identify likely candidates for the original string?
Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png
I know for a fact that this image filename should have been some Japanese characters. But with various guesses at urllib quoting/unquoting, encode and decode iso8859-1, utf8, I haven't been able to unmunge and get the original filename.
Is the corruption reversible?
Solution
You could use chardet (install with pip). This first version is Python 2 code, where a str is a byte string:
import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
# chardet guesses the encoding of a byte string
detected_encoding = chardet.detect(your_str)["encoding"]
try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")
Result: 時間試験観点(アニメパス)_10秒 (no idea if this could be correct or not)
For Python 3 (source file encoded as utf8):
import chardet
import codecs

falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
try:
    # Undo the wrong decoding: recover the raw bytes via cp850
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    detected_encoding = chardet.detect(encoded_str)["encoding"]
    try:
        correct_str = encoded_str.decode(detected_encoding)
    except UnicodeDecodeError:
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # Write the result only if decoding succeeded
        with codecs.open("output.txt", "w", "utf-8-sig") as out:
            out.write(correct_str)
In summary:
>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'