Fast string splitting with multiple delimiters(使用多个分隔符快速拆分字符串)
问题描述
我在 StackOverflow 上调查了一段时间,以找到将具有多个分隔符的字符串拆分为 vector< 的好算法.字符串 >
.我也找到了一些方法:
I investigated some time here on StackOverflow to find good algorithms to split strings with multiple delimiters into a vector< string >
. I also found some methods:
提升方式:
boost::split(vector, string, boost::is_any_of(" "));
getline
方法:
std::stringstream ss(string);
std::string item;
while(std::getline(ss, item, ' ')) {
vector.push_back(item);
}
Boost 的标记化方式:
the tokenize way of Boost:
char_separator<char> sep(" ");
tokenizer<char_separator<char>> tokens(string, sep);
BOOST_FOREACH(string t, tokens)
{
vector.push_back(t);
}
和很酷的 STL 方式:
and the cool STL way:
istringstream iss(string);
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
back_inserter<vector<string> >(vector));
以及Shadow2531的方法(见链接主题).
and the method of Shadow2531 (see the linked topic).
其中大部分来自本主题.但不幸的是,它们并没有解决我的问题:
Most of them came from this topic. But they unfortunately don't solve my problem:
Boost 的拆分很容易使用,但是对于大数据(最好的情况下大约为 1.5*10^6 单个元素)和大约 10 个分隔符,我使用它的速度非常慢.
Boost's split is easy to use but with the big data (about 1.5*10^6 single elements in best cases) and about 10 delimiters I am using it's horrific slow.
getline
、STL 和Shadow2531 的方法有一个问题,我只能使用一个字符作为分隔符.我还需要一些.
The getline
, STL and Shadow2531's method have the problem that I can only use one single char as delimiter. I need a few more.
Boost 的 tokenize 在速度方面更可怕.使用 10 个分隔符将字符串拆分为 1.5*10^6 个元素需要 11 秒.
Boost's tokenize is even more horrific in the aspect of speed. It took 11 seconds with 10 delimiters to split a string into 1.5*10^6 elements.
所以我不知道该怎么办:我想要一个非常快速的带有多个分隔符的字符串拆分算法.
So I don't know what to do: I want to have a really fast string splitting algorithm with multiple delimiters.
Boost 的分割是最大的还是有办法更快?
Is Boost's split the maximum or is there a way to do it faster?
推荐答案
想到两件事:
- 使用字符串视图代替字符串作为拆分结果,节省了很多分配.
- 如果你知道你只是将使用字符(在[0,255] 范围),尝试使用bitset 来测试成员资格而不是
find
ing 进入分隔符字符.
- Use string views instead of strings as the split result, saves a lot of allocations.
- If you know you're only
going to be working with chars (in
the [0,255] range), try using a
bitset to test membership instead of
find
ing into the delimiter characters.
这是应用这些想法的快速尝试:
Here's a quick attempt at applying these ideas:
#include <vector>
#include <bitset>
#include <iostream>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/timer.hpp>
using namespace std;
size_t const N = 10000000;
template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
C output;
bitset<255> delims;
while( *d )
{
unsigned char code = *d++;
delims[code] = true;
}
typedef string::const_iterator iter;
iter beg;
bool in_token = false;
for( string::const_iterator it = s.begin(), end = s.end();
it != end; ++it )
{
if( delims[*it] )
{
if( in_token )
{
output.push_back(typename C::value_type(beg, it));
in_token = false;
}
}
else if( !in_token )
{
beg = it;
in_token = true;
}
}
if( in_token )
output.push_back(typename C::value_type(beg, s.end()));
output.swap(ret);
}
template<typename C>
void test_strpbrk(string const& s, char const* delims, C& ret)
{
C output;
char const* p = s.c_str();
char const* q = strpbrk(p+1, delims);
for( ; q != NULL; q = strpbrk(p, delims) )
{
output.push_back(typename C::value_type(p, q));
p = q + 1;
}
output.swap(ret);
}
template<typename C>
void test_boost(string const& s, char const* delims)
{
C output;
boost::split(output, s, boost::is_any_of(delims));
}
int main()
{
// Generate random text
string text(N, ' ');
for( size_t i = 0; i != N; ++i )
text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':' ');
char const* delims = " [],-'/\!"§$%&=()<>?";
// Output strings
boost::timer timer;
test_boost<vector<string> >(text, delims);
cout << "Time: " << timer.elapsed() << endl;
// Output string views
typedef string::const_iterator iter;
typedef boost::iterator_range<iter> string_view;
timer.restart();
test_boost<vector<string_view> >(text, delims);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string> vs;
test_custom(text, delims, vs);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string_view> vsv;
test_custom(text, delims, vsv);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string> vsp;
test_strpbrk(text, delims, vsp);
cout << "Time: " << timer.elapsed() << endl;
// Custom split
timer.restart();
vector<string_view> vsvp;
test_strpbrk(text, delims, vsvp);
cout << "Time: " << timer.elapsed() << endl;
return 0;
}
使用启用了 -O4
标志的 GCC 4.5.1 用 Boost 1.46.1 编译我得到:
Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4
flag enabled I get:
- 时间:5.951(Boost.Split + 向量)
- 时间:3.728(Boost.Split + 向量
- 时间:1.662(自定义分割 + 矢量)
- 时间:0.144(自定义分割 + 矢量)
- 时间:2.13(Strpbrk + 矢量)
- 时间:0.527(Strpbrk + 矢量)
注意:由于我的自定义函数删除了空令牌,因此输出略有不同.但是,如果您决定使用它,您可以根据自己的需要调整此代码.
NOTE: There's a slight difference in the output as empty tokens are dropped by my custom function. But you can adapt this code to your needs if you decide to use it.
这篇关于使用多个分隔符快速拆分字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:使用多个分隔符快速拆分字符串
- 如何对自定义类的向量使用std::find()? 2022-11-07
- Stroustrup 的 Simple_window.h 2022-01-01
- 静态初始化顺序失败 2022-01-01
- 与 int by int 相比,为什么执行 float by float 矩阵乘法更快? 2021-01-01
- 近似搜索的工作原理 2021-01-01
- C++ 协变模板 2021-01-01
- 从python回调到c++的选项 2022-11-16
- STL 中有 dereference_iterator 吗? 2022-01-01
- 使用/clr 时出现 LNK2022 错误 2022-01-01
- 一起使用 MPI 和 OpenCV 时出现分段错误 2022-01-01