-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix compiler optimize thread local variable access #2156
Conversation
能否自动根据当前编译器和平台来决定是否开启这个宏? |
done, 根据issue的反馈信息,目前仅针对 Clang/AppleClang 并且是 ARM/AArch 架构平台关闭优化。 |
CMakeLists.txt
Outdated
@@ -150,6 +150,11 @@ if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU") | |||
endif() | |||
endif() | |||
|
|||
if(CMAKE_CXX_COMPILER_ID MATCHES "(AppleClang)|(Clang)" AND CMAKE_SYSTEM_PROCESSOR MATCHES "(aarch64)|(arm64)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
在这里判断的话只对cmake生效,我们还有make和bazel两种编译方式
是不是在头文件里进行判断,普适性更好一些
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
LGTM |
* fix compiler optimize thread local variable access * change __thread to BAIDU_THREAD_LOCAL * Add NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS according to compiler and arch * move thread local access optimization condition to thread_local.h
* fix compiler optimize thread local variable access * change __thread to BAIDU_THREAD_LOCAL * Add NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS according to compiler and arch * move thread local access optimization condition to thread_local.h
是在哪个平台上出现呢?arm 还是 x86 ? |
x86, core在了其他地方,找了下调用栈,有地方是这么用的 我之前没用过brpc,如何能够构造一个,问题里提到到示例呢? |
如果使用 thread_local 变量时,前后两次访问该变量的中间阶段发生了 bthread 的切换(由于
这个得分情况:如果想要在 bthread 中使用类似 "thread_local" 的行为,即无论 bthread 被调度到哪,都获得相同的 thread_local 值,建议使用 如果代码就是想要获得当前被调度运行的 pthread 线程的 |
感觉还是 thread_local 读写的问题, 请问下除了这个patch外,后面还有其他关于这个问题的修复吗? task_runner 循环外的 tls,也加上 BAIDU_GET_VOLATILE_THREAD_LOCAL,后似乎问题不再出现了。
|
后续还有3个改动:
怀疑可能更跟 #2262 这个修改有关。 |
另外还有一个疑问是,函数 task_runner 和 tls_task_group 不都是在一个编译单元内吗?为什么是开启lto后才会触发这个问题呢。 |
不开启 |
目前从clang11 切换 到clang16 遇到了一些新的问题。和 #1776 提到的问题应该是一样的,不过我看1776对应的代码并没有合入, 是有做了其他修复吗? 对比了汇编,这样差异在于326-331 行这段代码(version 0.96), 并且把这段代码拆到一个函数里,加上禁用优化的标注,就没问题了。 // Clean tls variables, must be done before changing version_butex
// otherwise another thread just joined this thread may not see side
// effects of destructing tls variables.
KeyTable* kt = tls_bls.keytable;
if (kt != NULL) {
return_keytable(m->attr.keytable_pool, kt);
// After deletion: tls may be set during deletion.
tls_bls.keytable = NULL;
m->local_storage.keytable = NULL; // optional
} clang11 对于 tls_bls 成员变量的读写,每次都会经过一次call。而clang16会在 ;third_party/brpc-0.9.6/src/bthread/task_group.cpp:325
# KeyTable* kt = tls_bls.keytable;
a9e: 66 48 8d 3d 00 00 00 data16 lea 0x0(%rip),%rdi # aa6 <bthread::TaskGroup::task_runner(long)+0x196>
aa5: 00
aa6: 66 66 48 e8 00 00 00 data16 data16 callq aae <bthread::TaskGroup::task_runner(long)+0x19e>
aad: 00
aae: 48 8b 30 mov (%rax),%rsi
third_party/brpc-0.9.6/src/bthread/task_group.cpp:326
ab1: 48 85 f6 test %rsi,%rsi
ab4: 74 28 je ade <bthread::TaskGroup::task_runner(long)+0x1ce>
...
third_party/brpc-0.9.6/src/bthread/task_group.cpp:329
# tls_bls.keytable = NULL;
abf: 66 48 8d 3d 00 00 00 data16 lea 0x0(%rip),%rdi # ac7 <bthread::TaskGroup::task_runner(long)+0x1b7>
ac6: 00
ac7: 66 66 48 e8 00 00 00 data16 data16 callq acf <bthread::TaskGroup::task_runner(long)+0x1bf>
ace: 00
acf: 48 c7 00 00 00 00 00 movq $0x0,(%rax) clang16 ;third_party/brpc-0.9.6/src/bthread/task_group.cpp:265
838: 48 8b 40 30 mov 0x30(%rax),%rax
83c: 48 85 c0 test %rax,%rax
83f: 75 df jne 820 <bthread::TaskGroup::task_runner(long)+0x30>
841: 4c 89 65 d0 mov %r12,-0x30(%rbp)
845: 66 48 8d 3d 00 00 00 data16 lea 0x0(%rip),%rdi # 84d <bthread::TaskGroup::task_runner(long)+0x5d>
84c: 00
84d: 66 66 48 e8 00 00 00 data16 data16 callq 855 <bthread::TaskGroup::task_runner(long)+0x65>
854: 00
855: 48 89 c3 mov %rax,%rbx # do 循环外保存 tls_bls 的地址到rbx
...
;third_party/brpc-0.9.6/src/bthread/task_group.cpp:325
# KeyTable* kt = tls_bls.keytable;
96e: 48 8b 33 mov (%rbx),%rsi
;third_party/brpc-0.9.6/src/bthread/task_group.cpp:326
971: 48 85 f6 test %rsi,%rsi
974: 74 18 je 98e <bthread::TaskGroup::task_runner(long)+0x19e>
...
;third_party/brpc-0.9.6/src/bthread/task_group.cpp:329
# tls_bls.keytable = NULL;
97f: 48 c7 03 00 00 00 00 movq $0x0,(%rbx) |
这个差异找到了(定位不易,踩坑的场景比较特殊,不知道有没有人会遇到)。和clang对tls变量访问的选择有关。如果使用-fPIC编译 + 分布式 thinlto ,clang-16之前都会出现这种行为差异,具体原因。 |
What problem does this PR solve?
Issue Number:#1776, #1407, #845, #1860, apache/doris#13270
Problem Summary:
一些编译器会优化 thread_local 的变量的读写,在同一个函数中间如果再次获取相同变量的值时,会直接从缓存的地址里面进行加载。但是如果在函数过程中间,线程发生切换( bthread 切换到另一个线程执行),那么 thread_local 变量需要重新获当前运行线程的,而不应该使用之前的旧值。
经测试,Apple M1 Apple clang version 14.0.0, 编译依然存在该问题:
task_runner 函数的部分汇编代码:
1933 处首次读取了 tls_task_group,并放入了
[sp, #96]
位置,2043 行时,线程可能已经切换,但是依然从[sp, #96]
处中加载 tls_task_group ,导致读取到了上一个旧值。导致错误发生。gcc 通过开启 fno-gcse 选项禁止该优化,clang 未能找到对应控制选项。
参考:
What is changed and the side effects?
Changed:
NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS
用于控制是否允许编译器优化,如果为 true 是,对应的 thread local 变量必须通过 noinline 的函数调用方式来获取最新值,绕开编译器的优化。Side effects:
如果开启
NOT_ALLOW_OPTIMIZE_THREAD_LOCAL_ACCESS
, 会造成多一次的函数调用开销。无
Check List: