Fix a deadlock bug in EigenNonBlockingThreadPool.h #23098
Merged
snnn
commented
Dec 13, 2024
yuslepukhin
previously approved these changes
Dec 13, 2024
goldsteinn
reviewed
Dec 13, 2024
yuslepukhin
approved these changes
Dec 13, 2024
Does this unblock #21545?
guschmue
pushed a commit
that referenced
this pull request
Dec 16, 2024
### Description This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.
Sorry, I don't know much about that one.
guschmue
pushed a commit
that referenced
this pull request
Dec 20, 2024
### Description This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.
This was referenced Jan 6, 2025
tarekziade
pushed a commit
to tarekziade/onnxruntime
that referenced
this pull request
Jan 10, 2025
### Description This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with weakly ordered memory model, such as ARM64.
Description
In July 2024 an Intel engineer created a PR (#21545) that reduced the spin count of our Eigen thread pool. I then found that it triggered a deadlock bug in EigenNonBlockingThreadPool.h.
Here is how to reproduce the issue:
python3 tools/ci_build/build.py --skip_submodule_sync --parallel --config RelWithDebInfo --build_dir b1 --update --build
An "onnx_test_runner" process will then quickly get stuck in the loop: one CPU sits at 100% usage while all the other CPUs are idle. Sorry, I cannot make the model file public, but there is nothing special about it.
Furthermore, to confirm that you are seeing the same bug, attach gdb to the hung process and examine each worker thread (the threads that are idle). To examine the Nth thread, run the following gdb commands:
The last command prints two integers. If they differ, the thread's work queue is not empty, and the thread should not be waiting there idle. That's the bug I mean.
I have a hypothesis about the root cause, but unfortunately I couldn't fully prove it. I think something like Heisenberg's uncertainty principle is at work here, and it kept me from seeing what was actually happening. When two threads run simultaneously on multiple CPUs, I want to know the relative order of the actual executions, but I cannot get that information without adding extra synchronization to the threads, which in turn may change the real behavior. Anyway, when debugging the issue, my approach was to add a logical clock to each thread. The clock was just an atomic integer counter that can be read and written by multiple threads. Whoever reads it must also increase it by one at the same time. Let's assume there are two threads: a producer that produces tasks and a consumer that executes them. Then I believe I observed the following:
(I have low confidence in the above.)
It is very counterintuitive, because at step 4, before the consumer thread went to sleep, it should have seen that the queue was not empty. My explanation is that ARM has a weaker memory model than x86. We have been used to x86 for too long.
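The logical clock described above can be sketched roughly as follows. This is an illustration only, not the instrumentation used while debugging: the names `LogicalClock`, `Tick`, and `StampEvents` are hypothetical, and the use of `std::memory_order_seq_cst` is an assumption. The point is that a single `fetch_add` reads and advances the counter in one atomic step, so every event gets a unique stamp. As the description notes, the counter itself adds synchronization and may hide the very reordering being investigated.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical names; none of this comes from EigenNonBlockingThreadPool.h.
struct LogicalClock {
  std::atomic<std::uint64_t> ticks{0};

  // Read the clock and advance it by one in a single atomic
  // read-modify-write, so no two callers ever get the same stamp.
  std::uint64_t Tick() { return ticks.fetch_add(1, std::memory_order_seq_cst); }
};

// Each thread stamps `events_per_thread` events with the shared clock.
// Returns every stamp that was handed out, sorted in increasing order.
std::vector<std::uint64_t> StampEvents(int num_threads, int events_per_thread) {
  LogicalClock clock;
  std::vector<std::vector<std::uint64_t>> per_thread(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&clock, &per_thread, t, events_per_thread] {
      for (int i = 0; i < events_per_thread; ++i) {
        // A real probe would log the stamp next to the event it marks.
        per_thread[t].push_back(clock.Tick());
      }
    });
  }
  for (auto& th : threads) th.join();

  std::vector<std::uint64_t> all;
  for (const auto& stamps : per_thread) {
    all.insert(all.end(), stamps.begin(), stamps.end());
  }
  std::sort(all.begin(), all.end());
  return all;
}
```

Calling `StampEvents(4, 1000)` from a small `main` yields 4000 distinct stamps; sorting the per-thread logs by stamp gives one interleaving of the events that is consistent with the order in which the counter was updated.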
My fix is to enable an assert. I don't believe the assert will ever fire, but the updated code effectively inserts a memory barrier there to ensure total ordering. The exchange function of std::atomic is a read-modify-write operation, while store is write-only. I tried giving the store a stronger memory order, but that didn't fix the problem. Semantically, since we are about to read the queue, we need a read barrier here.
Still, I am not fully convinced. It would be better if I could add an assert there, find a contradiction, and abort the process. If you have ideas for how to prove it, please let me know.
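To make the store-versus-exchange distinction concrete, here is a minimal sketch of the classic "announce, then re-check" pattern that a sleeping worker and a producer typically follow. It is not the code from this PR: `Worker`, `Status`, `queued`, and all three function names are hypothetical stand-ins, and using `std::memory_order_seq_cst` on every operation is an assumption made so that the guarantee is easy to state. Under that assumption the C++ memory model forbids the lost wake-up, where the consumer sees an empty queue and the producer also sees an active worker. With relaxed operations that outcome is allowed.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins for a worker's status word and its queue size.
enum class Status : int { Active, Blocking };

struct Worker {
  std::atomic<Status> status{Status::Active};
  std::atomic<int> queued{0};
};

// Consumer side: announce the intent to block, then re-check the queue.
// exchange() is a read-modify-write, unlike store(), which only writes.
bool ConsumerSeesEmptyQueue(Worker& w) {
  Status prev = w.status.exchange(Status::Blocking, std::memory_order_seq_cst);
  (void)prev;  // A debug build could assert on the previous status here.
  return w.queued.load(std::memory_order_seq_cst) == 0;
}

// Producer side: publish a task, then check whether a wake-up is needed.
bool ProducerSeesActiveWorker(Worker& w) {
  w.queued.fetch_add(1, std::memory_order_seq_cst);
  return w.status.load(std::memory_order_seq_cst) == Status::Active;
}

// Runs the two sides concurrently `trials` times and counts lost wake-ups:
// trials where the consumer saw an empty queue (so it would go to sleep)
// and the producer saw an active worker (so it would not wake anyone).
int CountLostWakeups(int trials) {
  int lost = 0;
  for (int i = 0; i < trials; ++i) {
    Worker w;
    bool empty = false;
    bool active = false;
    std::thread consumer([&] { empty = ConsumerSeesEmptyQueue(w); });
    std::thread producer([&] { active = ProducerSeesActiveWorker(w); });
    consumer.join();
    producer.join();
    if (empty && active) ++lost;
  }
  return lost;
}
```

With sequentially consistent operations `CountLostWakeups` returns 0 for any number of trials. A run that finds no lost wake-ups with weaker orderings proves nothing, though, which matches the experience described here: many changes make the hang rarer without fixing it.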
Motivation and Context
Five weeks ago @goldsteinn suggested that I replace every std::memory_order_relaxed with std::memory_order_seq_cst. I didn't take his suggestion because any change to this file can make the bug non-reproducible, which does not mean the bug is fixed. I have found many different ways to make the bug disappear or become harder to hit. ("Harder" means I need to run the test process 10K or 100K times to get a hang instead of 1K.)
@yuslepukhin and @tlh20 also helped me a lot.