Fix a deadlock bug in EigenNonBlockingThreadPool.h #23098

Merged
snnn merged 3 commits into main from snnn-patch-8
Dec 16, 2024

@snnn snnn commented Dec 13, 2024

Description

In July 2024, an Intel engineer opened a PR (#21545) that reduced the spin count of our Eigen thread pool. I then found that the change triggers a deadlock bug in EigenNonBlockingThreadPool.h.

Here is how to reproduce the issue:

  1. Create an ARM64 VM in Azure with only 4 vCPUs. I used Standard_D4plds_v5. The more CPUs you have, the less likely you are to see the bug.
  2. Create a local branch and apply the following changes to EigenNonBlockingThreadPool.h:
-    constexpr int log2_spin = 20;
-    const int spin_count = allow_spinning_ ? (1ull << log2_spin) : 0;
-    const int steal_count = spin_count / 100;
+    //constexpr int log2_spin = 20;
+    const int spin_count = 10000;
+    const int steal_count = 100;
  3. Build the source code locally:
    python3 tools/ci_build/build.py --skip_submodule_sync --parallel --config RelWithDebInfo --build_dir b1 --update --build
  4. Run the following script:
#!/bin/bash
for i in {1..1000}
do
    ./onnx_test_runner -c 1 -j 1 -x  model.onnx
done

Very quickly, one "onnx_test_runner" process will hang inside the loop: one CPU sits at 100% usage while all the other CPUs are idle. Sorry, I cannot make the model file public, but there is nothing special about it.
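For reference, the numbers in the step 2 diff work out as shown in the sketch below. The function and constant names are hypothetical; only the expressions are taken from the diff.

```cpp
#include <cassert>

// Hypothetical helper names; only the expressions come from the diff in step 2.
constexpr int kLog2Spin = 20;

// Original expression: 2^20 spin iterations when spinning is allowed, else 0.
constexpr int OriginalSpinCount(bool allow_spinning) {
  return allow_spinning ? static_cast<int>(1ull << kLog2Spin) : 0;
}

// Original expression: spin_count / 100, using integer division.
constexpr int OriginalStealCount(bool allow_spinning) {
  return OriginalSpinCount(allow_spinning) / 100;
}

// Values hard-coded by the reproduction patch.
constexpr int kReproSpinCount = 10000;
constexpr int kReproStealCount = 100;

static_assert(OriginalSpinCount(true) == 1048576, "2^20");
static_assert(OriginalStealCount(true) == 10485, "1048576 / 100, truncated");
static_assert(OriginalSpinCount(true) / kReproSpinCount == 104, "about 100x fewer spins");
```

So the patch makes a worker spin roughly 100 times less before it tries to block, which presumably makes the blocking path, and any race in it, much easier to hit.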

To confirm that you are seeing the same bug, attach gdb to the hung process and examine each worker thread (the threads that are idle). To examine the Nth thread, run the following gdb commands:

thr N
f 8
p {this->queue.back_._M_i & (1024-1), this->queue.front_._M_i & (1024-1)}

The last command prints two integers. If they differ, the thread's worker queue is not empty, so the thread should not be waiting there idle. That is the bug I mean.
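The same check can be written as a small C++ sketch. The function name is hypothetical, and the mask is simply the `1024-1` from the gdb expression above; it only mirrors what the gdb command compares and is not the queue's own emptiness test.

```cpp
#include <cassert>

// Mask taken from the gdb expression above (1024 - 1).
constexpr unsigned kIndexMask = 1024 - 1;

// Hypothetical helper: true when the masked back and front indices differ,
// which is the condition the gdb command lets you check by eye.
constexpr bool MaskedIndicesDiffer(unsigned back, unsigned front) {
  return (back & kIndexMask) != (front & kIndexMask);
}
```

For example, `MaskedIndicesDiffer(7, 5)` is true, while `MaskedIndicesDiffer(3 + 1024, 3)` is false because the bits above the mask are ignored.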

I have a hypothesis about the root cause, but unfortunately I could not fully prove it. I think something like Heisenberg's uncertainty principle is at work here and kept me from seeing what was actually happening: when two threads run simultaneously on multiple CPUs, I want to know the relative order of their operations, but I cannot get that information without adding extra synchronization to the threads, which in turn may change the real behavior. When debugging the issue, my approach was to add a logical clock to the threads. The clock was just an atomic integer counter that multiple threads can read and write; whoever reads it must also increment it by one in the same operation. Assume there are two threads: a producer that produces tasks and a consumer that executes them. I believe I observed the following sequence:

  1. Producer thread: called PushBack, which inserted a new task into the consumer thread's worker queue.
  2. Consumer thread: called SetBlocked and entered the mutex-protected region.
  3. Producer thread: called EnsureAwake() and loaded the status.
  4. Consumer thread: went to sleep inside SetBlocked.
  5. Producer thread: in EnsureAwake, skipped the notification because it believed the consumer thread was still spinning.
  6. Since nobody woke up the consumer thread, it idled there forever even though its worker queue was not empty.
    (I have low confidence in the above.)
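The logical clock used to order these steps can be sketched roughly as below. This is a minimal illustration, not the instrumentation that was actually used: the class and struct names are hypothetical, and the relaxed memory order is an assumption chosen to perturb the threads as little as possible.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical name: one clock instance shared by all threads.
class LogicalClock {
 public:
  // Reads the current value and increments it in a single atomic
  // read-modify-write, so no two callers ever receive the same tick.
  std::uint64_t Tick() {
    return counter_.fetch_add(1, std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> counter_{0};
};

// Hypothetical record a thread could append to a thread-local log,
// e.g. {clock.Tick(), "PushBack done"}.
struct TraceEvent {
  std::uint64_t tick;
  const char* what;
};
```

Ticks are unique and totally ordered, but with a relaxed counter the tick order is not guaranteed to match the order in which the surrounding loads and stores became visible to other CPUs. Strengthening the order would add exactly the kind of synchronization that can hide the bug.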

This is very counterintuitive, because at step 4, before going to sleep, the consumer thread should have seen that the queue was not empty. My explanation is that ARM has a weaker memory model than x86, and we have been used to x86 for too long.

My fix is to enable an assert. I don't believe the assert will ever fire, but the updated code inserts a memory barrier at that point to ensure a total ordering. The exchange function of std::atomic is a read-modify-write operation, while store is write-only. I tried changing the store to use a stronger memory order, but that did not fix the problem. Semantically, since we are about to read the queue, we need a read barrier here.
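The general pattern behind this reasoning can be sketched in isolation. The code below is not the code from EigenNonBlockingThreadPool.h: all names are hypothetical, the mutex and condition variable are left out, and every operation is assumed to be sequentially consistent. It only shows the handshake in which each side writes its own flag and then reads the other side's.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for illustration; these are not the onnxruntime types.
enum class Status { kSpinning, kBlocked };

struct Handshake {
  std::atomic<int> queue_size{0};
  std::atomic<Status> status{Status::kSpinning};
};

// Runs one producer/consumer round. Returns true if the wake-up was lost,
// i.e. the consumer saw an empty queue AND the producer saw "spinning".
bool LostWakeupInOneRound() {
  Handshake h;
  bool consumer_saw_empty = false;
  bool producer_saw_spinning = false;

  std::thread consumer([&] {
    // Announce the intent to block with a seq_cst read-modify-write ...
    h.status.exchange(Status::kBlocked, std::memory_order_seq_cst);
    // ... and only then take the final look at the queue.
    consumer_saw_empty = h.queue_size.load(std::memory_order_seq_cst) == 0;
  });
  std::thread producer([&] {
    // Publish the task first ...
    h.queue_size.fetch_add(1, std::memory_order_seq_cst);
    // ... then decide whether the consumer needs a notification.
    producer_saw_spinning =
        h.status.load(std::memory_order_seq_cst) == Status::kSpinning;
  });
  consumer.join();
  producer.join();
  return consumer_saw_empty && producer_saw_spinning;
}

// Repeats the round and counts how often both sides missed each other.
int CountLostWakeups(int rounds) {
  int lost = 0;
  for (int i = 0; i < rounds; ++i) {
    if (LostWakeupInOneRound()) ++lost;
  }
  return lost;
}
```

With all four operations sequentially consistent, the C++ memory model places them in one total order, so at least one side must observe the other's write and `CountLostWakeups` returns 0. If the operations were relaxed instead, the outcome in which both sides miss each other would be permitted, which matches the symptom in steps 1 to 6.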

Still, I am not fully convinced. It would be better if I could add an assert there that detects a contradiction and aborts the process. If you have an idea for how to prove it, please let me know.

Motivation and Context

Five weeks ago @goldsteinn suggested that I replace every std::memory_order_relaxed with std::memory_order_seq_cst. I did not take his suggestion, because any change to this file can make the bug unreproducible, which does not mean the bug is fixed. I have found many different ways to make the bug disappear or become harder to hit. ("Harder" means I need to run the test process 10k or 100k times to get a hang instead of 1k.)

@yuslepukhin and @tlh20 also helped me a lot.

@snnn snnn requested a review from yuslepukhin December 13, 2024 06:42
yuslepukhin previously approved these changes Dec 13, 2024
@yuslepukhin yuslepukhin left a comment

:shipit:

@snnn snnn requested a review from yuslepukhin December 13, 2024 23:48
@yuslepukhin yuslepukhin left a comment

:shipit:

@snnn snnn merged commit 2ff66b8 into main Dec 16, 2024
95 checks passed
@snnn snnn deleted the snnn-patch-8 branch December 16, 2024 17:05
@goldsteinn

Does this unblock #21545?

guschmue pushed a commit that referenced this pull request Dec 16, 2024
### Description
This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with a weakly ordered memory model, such as ARM64.
@snnn
snnn commented Dec 16, 2024

Sorry, I don't know much about that one.

guschmue pushed a commit that referenced this pull request Dec 20, 2024
tarekziade pushed a commit to tarekziade/onnxruntime that referenced this pull request Jan 10, 2025