Python returning the wrong length of string when using special characters
I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any possible way in Python to have a character like ë́ be represented as 1?
I'm using UTF-8 encoding for the actual code and web page it is being outputted to.
edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.
UTF-8 is a Unicode encoding which uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified at the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in a unicode object like you would in a str object (both inherit from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
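Note that test[0] yields only the base letter ë. The combining acute accent (U+0301) is a separate code point at test[1], which is why len() reports 6 rather than the 5 letters a reader sees. The following is a minimal sketch, written for Python 3, of one way to treat a base letter and its trailing combining marks as a single unit using the standard library's unicodedata.combining(). The helper name split_clusters is hypothetical, and the approach is a simplification of full grapheme-cluster segmentation; it assumes the text consists of base letters followed by combining marks.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    split_clusters is a hypothetical helper, not a standard-library function.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Written with escapes so the code points are unambiguous:
# e-diaeresis, combining acute, a, u-acute, l, t
word = "\u00eb\u0301a\u00falt"
clusters = split_clusters(word)

print(len(word))                       # 6 code points
print(len(clusters))                   # 5 perceived letters
print(clusters[0] == "\u00eb\u0301")   # True: first unit keeps its accent
```

Rewrite rules that depend on letter position could then index into the list of clusters instead of the raw string.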
If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
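The decode-on-input, encode-on-output principle might look roughly like this. This is only a sketch in Python 3 syntax (where the text type is called str and the byte type bytes); the function name handle_request and the upper-casing step are placeholders chosen for illustration, not part of any real framework.

```python
def handle_request(raw_bytes):
    """Decode at the input boundary, work on text, encode at the output boundary.

    handle_request is a hypothetical name used only for this illustration.
    """
    text = raw_bytes.decode("utf-8")   # bytes -> Unicode text
    result = text.upper()              # placeholder for the real processing
    return result.encode("utf-8")      # Unicode text -> bytes for output

raw = b"\xc3\xab\xcc\x81a\xc3\xbalt"   # the UTF-8 bytes of the example word
out = handle_request(raw)
print(out.decode("utf-8"))
```

All string manipulation happens between the decode and the encode, so the internal code never has to guess which encoding a value is in.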
PS: Please note that the str and unicode datatypes have changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings, which can't be mixed anymore. That should help to avoid common pitfalls with unicode handling...
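For comparison, here is a small sketch of how the length examples above could be written in Python 3, where str holds Unicode text and bytes holds the encoded form. The variable names are arbitrary.

```python
text = "\u00eb\u0301a\u00falt"   # str: Unicode text (the example word)
data = text.encode("utf-8")       # bytes: its UTF-8 encoded form

print(len(text))   # 6 code points
print(len(data))   # 9 bytes
print(data == b"\xc3\xab\xcc\x81a\xc3\xbalt")   # True

# Mixing the two types raises TypeError instead of decoding implicitly
try:
    text + data
except TypeError as exc:
    print("cannot mix str and bytes:", exc)
```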
Regards, Christoph