Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size
C++17 added std::hardware_destructive_interference_size and std::hardware_constructive_interference_size. At first, I thought they were just a portable way to get the size of an L1 cache line, but that is an oversimplification.
Questions:
- How are these constants related to the L1 cache line size?
- Is there a good example that demonstrates their use cases?
- Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario, when you are not certain which machine your code will run on?
The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html
I'll quote a snippet of the rationale here for ease-of-reading:
[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.
Uses of cache-line size fall into two broad categories:
- Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
- Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.
The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]
We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:
- Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
- Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.
In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
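To make the two definitions concrete, here is a minimal sketch of how the constants might be combined with alignas(). The type names (per_thread_counters, hot_pair) and the 64-byte fallback are assumptions made for illustration; they do not come from the proposal.

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // declares the interference-size constants where the library provides them

// Assumption: fall back to 64 bytes when the standard library does not
// define the C++17 constants (signalled by the feature-test macro).
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t destructive_size  = std::hardware_destructive_interference_size;
constexpr std::size_t constructive_size = std::hardware_constructive_interference_size;
#else
constexpr std::size_t destructive_size  = 64;
constexpr std::size_t constructive_size = 64;
#endif

// Destructive interference: two counters written by different threads.
// Each member starts at its own multiple of destructive_size, so the two
// are unlikely to end up on the same cache line.
struct per_thread_counters {
    alignas(destructive_size) std::atomic<long> produced{0};
    alignas(destructive_size) std::atomic<long> consumed{0};
};

// Constructive interference: two fields that are read together.
// The alignment plus the size check keep them inside one block of
// constructive_size bytes.
struct alignas(constructive_size) hot_pair {
    int key;
    int value;
};
static_assert(sizeof(hot_pair) <= constructive_size,
              "hot_pair should fit within one constructive-interference block");
```

The destructive size is used as a minimum distance between objects, while the constructive size is used as a maximum footprint, which is why the two constants are allowed to differ.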
"How are these constants related to the L1 cache line size?"
In theory, pretty directly.
Assume the compiler knows exactly what architecture you'll be running on - then these would almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)
For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)
"Is there a good example that demonstrates their use cases?"
At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.
It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in the L1 cache-line, and in the other a single element takes up the whole L1 cache-line. In a tight loop, a single, fixed element is chosen from the array and updated repeatedly.
It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints within the pair do not fit in L1 cache-line size together, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.
Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.
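The four layouts might look roughly like the sketch below. Only the type names and their sizes and alignments come from the output reported in this answer; the member names and the assumed_line_size constant are assumptions, and the size checks hold only where int is 4 bytes.

```cpp
#include <cstddef>

// Assumption: a 64-byte line, standing in for the C++17 constants.
constexpr std::size_t assumed_line_size = 64;

// False-sharing case: 16 of these fit in one 64-byte line.
struct naive_int {
    int value;
};

// Padded case: alignas rounds the size up, so each element owns a full line.
struct alignas(assumed_line_size) cache_int {
    int value;
};

// The padding pushes the two ints more than a line apart, so they can
// never share a line.
struct bad_pair {
    int first;
    char padding[assumed_line_size];
    int second;
};

// Both ints are adjacent and usually land on the same line.
struct good_pair {
    int first;
    int second;
};

// These match the sizes and alignments printed by the benchmark.
static_assert(sizeof(naive_int) == 4 && alignof(naive_int) == 4, "naive_int layout");
static_assert(sizeof(cache_int) == 64 && alignof(cache_int) == 64, "cache_int layout");
static_assert(sizeof(bad_pair) == 72 && alignof(bad_pair) == 4, "bad_pair layout");
static_assert(sizeof(good_pair) == 8 && alignof(good_pair) == 4, "good_pair layout");
```

Note that good_pair is only 4-byte aligned, so an instance could still straddle a line boundary; aligning it to the constructive interference size would rule that out.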
I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).
Warning: the test will use all cores on your machine, and allocate ~256MB of memory. Don't forget to compile with optimizations!
On my machine, the output is:
Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457
I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.
"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"
This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.
This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.
In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:
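A minimal sketch of such a fallback chain follows. The macro KNOWN_L1_CACHE_LINE_SIZE is the one named in this answer; the function name cache_line_size, the padded_flag type, and the final 64-byte default (for libraries that lack the constant) are assumptions.

```cpp
#include <cstddef>
#include <new>  // declares the interference-size constants where the library provides them

// Pick the most precise cache-line size available at compile time.
constexpr std::size_t cache_line_size() {
#if defined(KNOWN_L1_CACHE_LINE_SIZE)
    // Supplied by the build, e.g. -DKNOWN_L1_CACHE_LINE_SIZE=128
    return KNOWN_L1_CACHE_LINE_SIZE;
#elif defined(__cpp_lib_hardware_interference_size)
    // The implementation's own conservative estimate.
    return std::hardware_destructive_interference_size;
#else
    // Assumption: last-resort default for libraries without the constant.
    return 64;
#endif
}

// The result is a constant expression, so it can be used with alignas.
struct alignas(cache_line_size()) padded_flag {
    bool set;
};
```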
During compilation, if you want to assume a cache-line size, just define KNOWN_L1_CACHE_LINE_SIZE.
Hope this helps!
Benchmark program:
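The full listing is not reproduced here. As a stand-in, the following is a much smaller sketch of the false-sharing half of the experiment, not the program that produced the numbers above. The name time_increments, the use of std::atomic to stop the compiler from collapsing the loop, and the 64-byte constant are all assumptions; it needs C++17 so that the over-aligned elements are allocated correctly.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: a 64-byte line, standing in for the C++17 constants.
constexpr std::size_t assumed_line_size = 64;

// Adjacent elements share a cache line.
struct naive_int {
    std::atomic<int> value{0};
};

// Each element occupies a full line of its own.
struct alignas(assumed_line_size) cache_int {
    std::atomic<int> value{0};
};

// Each thread repeatedly increments its own fixed array element.
// Returns the elapsed wall-clock time in seconds (thread start-up included).
template <typename Wrapper>
double time_increments(unsigned thread_count, int iterations) {
    std::vector<Wrapper> slots(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&slots, t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                slots[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    // No thread touched another thread's slot, so every count is exact.
    for (const auto& slot : slots) {
        assert(slot.value.load() == iterations);
    }
    return elapsed.count();
}
```

Calling time_increments<naive_int> and time_increments<cache_int> with the same thread count and a large iteration count, then comparing the two results, should show the effect on most multi-core machines. The exact ratio depends on the hardware and on how the threads are scheduled.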