Why don't compilers merge redundant std::atomic writes?
Question
I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:
#include <atomic>
std::atomic<int> y(0);
void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(1, order);
    y.store(1, order);
}
Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?
If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?
Here's the code in Compiler Explorer.
Recommended answer
The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data race bug, but only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables.) A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.
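A minimal sketch of such an observer, under stated assumptions: the names Observation and observe_three_stores are hypothetical and not from the original answer. The program is correct whether or not the observer ever sees 2, which is why a compiler may pick the ordering in which it never does.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; not part of the original answer.
struct Observation {
    bool saw_two;  // whether the observer ever loaded the intermediate value 2
    int last;      // the value that ended the spin loop
};

Observation observe_three_stores() {
    std::atomic<int> y(0);
    Observation result{false, 0};

    std::thread observer([&] {
        int v;
        // Spin until the final value becomes visible.
        while ((v = y.load(std::memory_order_relaxed)) != 3) {
            if (v == 2) result.saw_two = true;  // may or may not ever happen
        }
        result.last = v;
    });

    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(2, order);  // a coalescing compiler could legally drop this store
    y.store(3, order);

    observer.join();  // join() synchronizes, so reading result here is race-free
    assert(result.last == 3);
    return result;
}
```

Only the final value is guaranteed; saw_two can legitimately be true on one run and false on the next, even with today's compilers.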
It doesn't depend on the target architecture or hardware, just as compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.
It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.
The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
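The progress-bar pattern can be sketched roughly as follows. The function name run_with_progress and the polling "UI" thread are assumptions made for illustration. With current compilers each store is emitted separately, so the poller usually sees intermediate percentages. If the stores were sunk out of the loop and coalesced, it would see only 0 and then 100.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: runs `steps` units of work and returns the distinct
// progress values that a polling thread happened to observe.
std::vector<int> run_with_progress(int steps) {
    assert(steps > 0);
    std::atomic<int> progress(0);
    std::vector<int> samples;

    std::thread ui([&] {
        int p;
        do {
            p = progress.load(std::memory_order_relaxed);
            if (samples.empty() || samples.back() != p) samples.push_back(p);
        } while (p < 100);
    });

    for (int i = 1; i <= steps; ++i) {
        // ... one unit of work containing no other atomic operations ...
        progress.store(i * 100 / steps, std::memory_order_relaxed);
    }

    ui.join();  // after join(), samples can be read safely
    return samples;
}
```

The only things the language guarantees here are that the observed values never go backwards and that the poller eventually sees 100. How many intermediate values show up is the quality-of-implementation part.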
There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)
Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) In other words, coalescing would violate the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
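As a rough illustration of that shared_ptr case (sum_copying and sum_by_reference are hypothetical names, and the elements are assumed to be non-null): copying the pointer on each iteration does a reference-count increment and decrement that cancel out, and in a multithreaded program those are atomic operations the compiler currently won't remove. Taking a reference avoids them by hand.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Copies each shared_ptr: one ref-count increment and one decrement per
// iteration, a pair that a coalescing compiler could in principle eliminate.
long sum_copying(const std::vector<std::shared_ptr<int>>& items) {
    long total = 0;
    for (std::shared_ptr<int> p : items) {
        assert(p.use_count() >= 2);  // the vector's copy plus this local copy
        total += *p;
    }
    return total;
}

// Binds a reference instead: no ref-count traffic at all.
long sum_by_reference(const std::vector<std::shared_ptr<int>>& items) {
    long total = 0;
    for (const std::shared_ptr<int>& p : items) {
        total += *p;
    }
    return total;
}
```

Both functions return the same result. The difference is only in how many reference-count updates run.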
Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
- http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
- http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?
See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).
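Applied to the code from the question, that could look like the sketch below. The names y_volatile and store_three_times are hypothetical. std::atomic provides volatile-qualified overloads of store and load, so the call sites don't change.

```cpp
#include <atomic>
#include <cassert>

// Same variable as in the question, but volatile-qualified (hypothetical name).
volatile std::atomic<int> y_volatile(0);

int store_three_times() {
    auto order = std::memory_order_relaxed;
    // Each of these is a volatile access, so the compiler must perform all three.
    y_volatile.store(1, order);
    y_volatile.store(1, order);
    y_volatile.store(1, order);
    return y_volatile.load(order);
}
```

The result is the same value either way. What volatile changes is that all three stores must be performed.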
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].
wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
if(x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);  // release a lock before a long-running loop
    for() {...}  // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.
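To make the transformation concrete, here is a hedged, compilable sketch of the two forms. Everything in it is an assumption for illustration: lock_word stands in for y, and foo, bar and long_loop are placeholder bodies touching only ordinary non-atomic state. Both functions perform exactly one volatile store of 0 per call, so volatile alone does not distinguish them.

```cpp
#include <atomic>
#include <cassert>

volatile std::atomic<int> lock_word(1);  // stands in for y; 1 means "held"
int work_done = 0;                       // ordinary non-atomic, non-volatile state

// Placeholder bodies, invented for this sketch.
void foo() { work_done += 1; }
void bar() { work_done += 2; }
void long_loop() { for (int i = 0; i < 1000; ++i) work_done += i % 3; }

// What the programmer wrote: the lock is released before the long loop.
void as_written(bool x) {
    if (x) {
        foo();
        lock_word.store(0, std::memory_order_release);
    } else {
        bar();
        lock_word.store(0, std::memory_order_release);
        long_loop();
    }
}

// What a compiler could make of it: the store is sunk below the if/else,
// so on the else path the lock is released only after the long loop.
void after_sinking(bool x) {
    if (x) {
        foo();
    } else {
        bar();
        long_loop();
    }
    lock_word.store(0, std::memory_order_release);
}
```

The final state is identical in both versions. The difference is only in how long other threads would wait for the lock, which is a performance effect rather than a correctness one.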
volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). That reason alone is not a sufficient explanation, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).