Fix race condition for new GC #20820

Open
wants to merge 1 commit into
base: master

schveiguy
Member

The new GC depends on thread_preSuspend to tell it whether druntime will provide scanning details (stack and TLS) for the current thread. If it returns true, the new GC assumes it can rely on druntime for those details. If it returns false, the new GC uses its own scanning mechanism.

The cost of this fix is taking the slock a bit earlier than before, so it shouldn't affect anything.

There is a note about lazy TLS allocation, but whatever that is must be long gone, because the setThis call is a simple assignment for all platforms. This may have been related to how OSX TLS was managed years ago.

Note that this race caused a failure in a real-world application. It's pretty specific to the new GC, and an internal detail of druntime, so there isn't really a test I can add, nor do I think we need a changelog entry for this.

threads. The `thread_preSuspend` hook should return true when druntime
has knowledge of a thread. But it's based on `sm_this` (the storage for
`Thread.getThis`) being set. Because the thread lock is not taken when
`sm_this` is set, a race exists if a thread is suspended between that
assignment and its addition to the thread list for scanning. Therefore,
`thread_preSuspend` can return true, but `thread_scanAll` will not
include that thread in the list of scannables.
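
The window described above can be illustrated with a small toy model. This is only a sketch: `ToyThread`, `toyThis`, `toyList`, `toyPreSuspend` and `toyInScanList` are hypothetical stand-ins for `sm_this`, the thread list and `thread_preSuspend`, not druntime's actual code.

```d
// Toy model only: every name here is an illustrative stand-in,
// not one of druntime's real internals.
struct ToyThread { int id; }

ToyThread* toyThis;              // thread-local; stands in for `sm_this`
__gshared ToyThread*[] toyList;  // stands in for the list of scannable threads

// Mirrors the described check: the thread counts as "known" once `toyThis` is set.
bool toyPreSuspend() { return toyThis !is null; }

// True if `t` is in the toy list of scannable threads.
bool toyInScanList(ToyThread* t)
{
    foreach (p; toyList)
        if (p is t) return true;
    return false;
}

void main()
{
    auto t = new ToyThread(1);

    toyThis = t;                 // step 1: analogous to `Thread.setThis(obj)`
    // A suspension landing here is told "registered", yet the thread
    // is not in the list that would be scanned.
    assert(toyPreSuspend() && !toyInScanList(t));

    toyList ~= t;                // step 2: analogous to `Thread.add(obj)`
    assert(toyPreSuspend() && toyInScanList(t));
}
```

The first assertion is the inconsistent state the race exposes; holding the thread lock across both steps is meant to keep a suspension from observing it.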
@dlang-bot
Contributor

Thanks for your pull request, @schveiguy!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#20820"

Comment on lines +2126 to +2134
// Ensure setting `sm_this` and adding the thread to the list of
// known threads is protected by the global thread lock. Otherwise,
// GCs that use `thread_preSuspend` to determine if a thread is
// registered might be told it is registered for scanning, but find
// out it is not.
Thread.slock.lock_nothrow();
Thread.setThis(obj);
Thread.add(obj);
Thread.slock.unlock_nothrow();
Contributor

If the same pattern appears twice and requires a big comment, I'd extract it into its own function.

Member Author

I'm not 100% sure it's actually needed for Windows in our case, because Windows does not pause threads via signals. I just did it here for consistency.

I contemplated just combining setThis and add into one function that locks for everything. Maybe setThisAndAdd?
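
A combined helper along those lines might look roughly like the sketch below. It is a toy model, not a proposed patch: `toyLock`, `toyThis`, `toyList` and `toySetThisAndAdd` are hypothetical stand-ins for `Thread.slock`, `sm_this`, the thread list and the suggested `setThisAndAdd`. It also assumes that whatever suspends threads acquires the same lock first.

```d
import core.sync.mutex : Mutex;

// Toy model only: illustrative stand-ins, not druntime's real internals.
struct ToyThread { int id; }

__gshared Mutex toyLock;         // stands in for `Thread.slock`
__gshared ToyThread*[] toyList;  // stands in for the list of known threads
ToyThread* toyThis;              // thread-local; stands in for `sm_this`

// Hypothetical combined helper: both steps happen under one lock, so code
// that takes the same lock never sees `toyThis` set without the list entry.
void toySetThisAndAdd(ToyThread* t)
{
    toyLock.lock_nothrow();
    scope (exit) toyLock.unlock_nothrow();
    toyThis = t;
    toyList ~= t;
}

void main()
{
    toyLock = new Mutex;
    auto t = new ToyThread(1);

    toySetThisAndAdd(t);

    assert(toyThis is t);
    assert(toyList.length == 1 && toyList[0] is t);
}
```

Folding the two steps into one function would also give the explanatory comment a single home instead of repeating it at each call site.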

In reality, I don't know how much of this is based on a faulty assumption of what it means to have sm_this set. I know we added the boolean return, but that may be thwarted too. What if code calls setThis itself, but doesn't register the thread?

Druntime decides which threads it might suspend by checking whether each one is in its list of managed threads. The new GC uses a different list of its own. If we migrated it to the druntime list instead, we probably wouldn't even need this change.

@thewilsonator thewilsonator added the Druntime:GC Issues relating the Garbage Collector label Feb 5, 2025
4 participants