How to generate random permutations with CUDA
Question
What parallel algorithms could I use to generate random permutations of a given set? Proposals or links to papers suited to CUDA would be especially helpful.
A sequential version of this would be the Fisher-Yates shuffle.
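For reference, here is a minimal host-side sketch of that sequential shuffle. The names `fisher_yates` and `random_permutation` are made up for illustration, and `std::mt19937` is just one convenient generator, not something the question prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sequential Fisher-Yates shuffle: walk from the last position down to the
// second, swapping each element with a uniformly chosen one at or before it.
// `fisher_yates` is an illustrative name, not taken from the question.
template <typename Rng>
void fisher_yates(std::vector<int>& v, Rng& rng) {
    for (std::size_t i = v.size(); i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(v[i - 1], v[pick(rng)]);
    }
}

// Build the source set {1, ..., n} and return one random permutation of it.
std::vector<int> random_permutation(int n, unsigned seed) {
    assert(n >= 0);
    std::vector<int> v(n);
    for (int i = 0; i < n; ++i) v[i] = i + 1;
    std::mt19937 rng(seed);
    fisher_yates(v, rng);
    return v;
}
```

Calling `random_permutation(7, seed)` returns one ordering of {1, ..., 7}; every element appears exactly once because the shuffle only ever swaps.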
Example:
Let S = {1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each source index exactly once, e.g. {7, 6, ..., 1}.
Recommended answer
Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements, with each worker deciding at random whether to perform its swap. On the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and on the last iteration 0<->4, 1<->5, 2<->6, 3<->7. Note that 12 random swap decisions can produce at most 2^12 = 4096 of the 8! = 40320 possible orderings, so this scheme does not sample permutations uniformly the way the sequential Fisher-Yates shuffle does.
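To check which index pairs each worker touches, the index arithmetic from the device code in this answer can be replayed on the host without the random coin flip. This is only a sketch; `swap_schedule` is a hypothetical helper name and plays no part in the actual shuffle.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Host-side replay of the index arithmetic each worker performs.
// Returns, for every round, the (pos, pos + shift) pair handled by each worker.
std::vector<std::vector<std::pair<unsigned, unsigned>>>
swap_schedule(unsigned workers) {
    assert(workers != 0 && (workers & (workers - 1)) == 0);  // power of two
    std::vector<unsigned> pos(workers);
    for (unsigned id = 0; id < workers; ++id) pos[id] = 2 * id;

    std::vector<std::vector<std::pair<unsigned, unsigned>>> rounds;
    for (unsigned shift = 1; shift <= workers; ) {
        std::vector<std::pair<unsigned, unsigned>> round;
        for (unsigned id = 0; id < workers; ++id)
            round.emplace_back(pos[id], pos[id] + shift);
        rounds.push_back(round);

        shift <<= 1;
        // Move the bit selected by the new shift one place down, so the
        // pairs (pos, pos + shift) are disjoint again in the next round.
        for (unsigned id = 0; id < workers; ++id)
            pos[id] = (pos[id] & ~shift) | ((pos[id] & shift) >> 1);
    }
    return rounds;
}
```

For `swap_schedule(4)` this gives three rounds: (0,1), (2,3), (4,5), (6,7); then (0,2), (1,3), (4,6), (5,7); then (0,4), (1,5), (2,6), (3,7).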
This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):
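    // Thread `id` starts out owning elements 2 * id and 2 * id + 1.
    const int id = threadIdx.x;
    __shared__ int perm_shared[2 * BLOCK_SIZE];
    // Start from the identity permutation.
    perm_shared[2 * id] = 2 * id;
    perm_shared[2 * id + 1] = 2 * id + 1;
    __syncthreads();

    unsigned int shift = 1;     // distance between the two elements of a pair
    unsigned int pos = id * 2;  // lower index of this thread's pair
    while (shift <= BLOCK_SIZE)
    {
        // Flip a coin to decide whether this thread swaps its pair.
        if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
        shift = shift << 1;
        // Regroup the indices so the pairs are disjoint for the new shift.
        pos = (pos & ~shift) | ((pos & shift) >> 1);
        __syncthreads();
    }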
Here the curand initialization code is omitted, and the method swap(int *p, int i, int j) exchanges the values p[i] and p[j].
Note that the code above makes the following assumptions:
- The length of the permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
- 2 * BLOCK_SIZE integers fit into the __shared__ memory of the CUDA device.
- BLOCK_SIZE is a valid size of a CUDA block (usually something between 32 and 512).
To generate more than one permutation, I would suggest using different CUDA blocks. If the goal is to permute 7 elements (as mentioned in the original question), then I believe it will be faster to do it in a single thread.
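For the 7-element case, one possible reading of "do it in a single thread" is to let every thread run its own sequential Fisher-Yates shuffle and write one permutation. The sketch below rests on several assumptions: `xorshift32` is a tiny stand-in generator so the example stays self-contained (in real device code a per-thread cuRAND state set up with `curand_init` would be the usual choice), the per-thread seeding is naive, and the names `shuffle_one`, `shuffle_many`, and `shuffle_many_host` are made up here. The CUDA parts are guarded so the same file also builds as plain C++.

```cpp
#include <cassert>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD  // plain C++ build: no CUDA qualifiers
#endif

constexpr int N = 7;  // permutation length from the question

// Tiny xorshift32 generator, used here only as a stand-in for cuRAND.
// The state must be non-zero.
HD inline unsigned xorshift32(unsigned& state) {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// One worker fills out[0..N-1] with a permutation of 1..N using a sequential
// Fisher-Yates shuffle. The modulo introduces a very small bias.
HD inline void shuffle_one(unsigned seed, int* out) {
    unsigned state = seed ? seed : 1u;
    for (int i = 0; i < N; ++i) out[i] = i + 1;
    for (int i = N - 1; i > 0; --i) {
        int j = static_cast<int>(xorshift32(state) % static_cast<unsigned>(i + 1));
        int tmp = out[i];
        out[i] = out[j];
        out[j] = tmp;
    }
}

#ifdef __CUDACC__
// Each thread writes one permutation into perms[t * N .. t * N + N - 1].
__global__ void shuffle_many(int* perms, int count, unsigned base_seed) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < count) shuffle_one(base_seed + 0x9E3779B9u * (t + 1), perms + t * N);
}
#endif

// Host reference that produces the same layout, handy for checking results.
inline void shuffle_many_host(int* perms, int count, unsigned base_seed) {
    assert(count >= 0);
    for (int t = 0; t < count; ++t)
        shuffle_one(base_seed + 0x9E3779B9u * (t + 1), perms + t * N);
}
```

With this layout, permutation t occupies `perms[t * N]` through `perms[t * N + N - 1]`, so n permutations need a buffer of n * 7 integers.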