Cross-platform iteration of Unicode string (counting Graphemes using ICU)
Problem description
I want to iterate each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme).
The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D (the virama) and U+0947 (the vowel sign E) are combining marks.
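using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        const string s = "नमस्ते";
        Console.WriteLine(s.Length); // Outputs "6" (UTF-16 code units)

        // Count text elements (graphemes) instead of code units.
        var count = 0;
        var e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext()) count++;
        Console.WriteLine(count); // Outputs "4"
    }
}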
So there we have it in .NET. We also have Win32's CharNextW():
#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t* s = L"नमस्ते";
    std::cout << std::wstring(s).length() << std::endl; // Gives "6"

    // CharNextW returns its argument unchanged at the terminating null.
    int count = 0;
    for (const wchar_t* next = CharNextW(s); next != s; next = CharNextW(s))
    {
        s = next;
        ++count;
    }
    std::cout << count << std::endl; // Gives "4"
    return 0;
}
Question
Both ways I know of are specific to Microsoft. Is there a portable way to do it?
- I heard about ICU, but I couldn't quickly find anything related (UnicodeString(s).length() still gives 6). Pointing to the relevant function/module in ICU would be an acceptable answer.
- C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also make an acceptable answer.
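For context on why length() still reports 6: as far as I know, UnicodeString::length(), like std::wstring::length() on Windows and string.Length in .NET, counts UTF-16 code units. The sketch below uses only the standard library and shows the next level up, counting code points by folding each surrogate pair into one unit. count_code_points is a hypothetical helper name, not part of ICU or the standard library. It does not group combining marks, so it still reports 6 for "नमस्ते"; grapheme grouping needs the Unicode text segmentation rules that a library such as ICU implements.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate is counted as one unit;
// an unpaired surrogate is counted on its own.
// This is NOT a grapheme count: combining marks are still counted separately.
std::size_t count_code_points(const std::u16string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t unit = text[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool next_is_low = i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF;
        if (is_high && next_is_low)
            ++i; // skip the low half of the surrogate pair
        ++count;
    }
    return count;
}
```

For a string containing one character outside the Basic Multilingual Plane, such as u"a\U0001F600b", size() reports 4 code units while count_code_points reports 3.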
@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here is some example code to demonstrate its use (since examples are surprisingly rare):
#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/schriter.h>
#include <unicode/unistr.h>
#include <cassert>
#include <iostream>
#include <memory>

int main()
{
    // A UTF-16 literal avoids depending on the platform's wchar_t width
    // (assumes ICU 59 or later, where UChar is char16_t).
    const icu::UnicodeString str(u"नमस्ते");
    {
        // StringCharacterIterator steps through UTF-16 code units, not graphemes
        icu::StringCharacterIterator iter(str);
        int count = 0;
        while (iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }
    {
        // The character BreakIterator stops at grapheme boundaries
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> iter(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);
        int count = 0;
        while (iter->next() != icu::BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }
    return 0;
}
Recommended answer
You should be able to use the ICU BreakIterator for this (specifically the character instance, assuming it is feature-equivalent to the Java version).