Waive need for locking when calling FakeStream::{body,headers,trailers} #38167

Open: wants to merge 1 commit into main

krinkinmu
Contributor

@krinkinmu krinkinmu commented Jan 23, 2025

Commit Message:

FakeStream::body(), headers() and trailers() return references to internal members protected by a lock, so ideally all callers should acquire that lock before calling them.

However, tests that use FakeStream or one of its derivatives cannot actually acquire the right lock, because it is a protected member of the class.

We have gotten away with this so far for a few reasons:

  1. Clang thread safety analysis didn't detect this problematic pattern in the clang-14 that we are currently using (potentially because those methods do acquire locks, even though those locks don't actually protect much).
  2. The locks are really only needed to synchronize the waitForX methods; accessor methods like body(), headers() and trailers() are called in tests after the appropriate waitForX method has been called.

Disabling thread safety annotations for these methods does not make anything worse, because the existing implementations aren't thread safe anyway. Here are a few alternatives that I considered and rejected for now:

  1. Return copies of body, headers and trailers instead of references, and create those copies under a lock. That would be the easiest way to let the compiler know that the code is fine, but all three methods return abstract classes and there is currently no easy way to copy them (which is not to say that copying is impossible in principle).
  2. Expose the lock and require all callers to acquire it. This was my first idea for fixing the issue, but FakeStream (and its derivatives) is used quite a lot in tests, so this change would be quite invasive.

Because it does not seem like we really need to lock those methods in practice, and given that the alternatives to disabling thread safety analysis on them are quite invasive, I figured I can just silence the compiler in this case.
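
For readers unfamiliar with the pattern being waived, here is a minimal, self-contained sketch. It is not the actual FakeStream code: FakeStreamSketch, the small Mutex and MutexLock wrappers and the TSA macro are names invented for illustration (Envoy itself uses absl::Mutex and Abseil's annotation macros), and std::string stands in for the buffer type. The attribute names are Clang's thread safety attributes; on other compilers the macro expands to nothing.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical helper macro: expands to a Clang thread safety attribute,
// and to nothing on compilers that do not support these attributes.
#if defined(__clang__)
#define TSA(x) __attribute__((x))
#else
#define TSA(x)
#endif

// Minimal annotated mutex wrapper (illustrative stand-in for absl::Mutex).
class TSA(capability("mutex")) Mutex {
public:
  void lock() TSA(acquire_capability()) { mu_.lock(); }
  void unlock() TSA(release_capability()) { mu_.unlock(); }

private:
  std::mutex mu_;
};

// Minimal scoped lock (illustrative stand-in for absl::MutexLock).
class TSA(scoped_lockable) MutexLock {
public:
  explicit MutexLock(Mutex& mu) TSA(acquire_capability(mu)) : mu_(mu) { mu.lock(); }
  ~MutexLock() TSA(release_capability()) { mu_.unlock(); }

private:
  Mutex& mu_;
};

class FakeStreamSketch {
public:
  // Writer side: appends to the guarded member while holding the lock.
  void decodeData(const std::string& data) {
    MutexLock lock(lock_);
    body_ += data;
  }

  // Accessor: hands out a reference to a guarded member without holding the
  // lock. The attribute tells the analysis to skip this function; it adds no
  // synchronization, so callers must already be synchronized some other way.
  const std::string& body() const TSA(no_thread_safety_analysis) { return body_; }

private:
  Mutex lock_;
  std::string body_ TSA(guarded_by(lock_));
};
```

The attribute only silences the static analysis. Whether the unlocked read is safe still depends on callers reading the result after the writer has finished, which is the argument made in reason 2 above.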

Additional Description: Related to #37911 and fixes one of the issues in #38093
Risk Level: Low
Testing: bazel test //test/server/config_validation:config_fuzz_test --config=clang-libc++ (that's how I found the issue in the first place) + all the regular release gating tests.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

+cc @phlax


@krinkinmu krinkinmu force-pushed the fix-thread-safety-checks branch from 27e05ca to d43f17a Compare January 23, 2025 18:13
@krinkinmu krinkinmu changed the title Correct locking in FakeStream Waive need for locking when calling FakeStream::{body,headers,trailers} Jan 23, 2025
Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com>
@krinkinmu krinkinmu force-pushed the fix-thread-safety-checks branch from d43f17a to 55c6947 Compare January 23, 2025 18:26
@krinkinmu
Contributor Author

/retest flaky test

@krinkinmu krinkinmu marked this pull request as ready for review January 23, 2025 23:27
Contributor

@adisuissa adisuissa left a comment


Interesting! Thanks for pursuing that and paving the road towards a newer clang!
I agree that returning a reference to a data-member that is then not protected somewhat defeats the purpose of having a lock in the first place.
However, I think it may still be useful when the data-member is a pointer (so the pointer to the object isn't changed concurrently).

Could you clarify this a bit:

The locks are really only needed to synchronize the waitForX methods; accessor methods like body(), headers() and trailers() are called in tests after the appropriate waitForX method has been called.

I was under the impression that the locks sync between the test's thread and the fake-upstream thread, so they are needed, regardless of the waitForX methods.

I agree that there may be some race-conditions here (as the main thread may access the internals of these objects, while the fake-upstream updates them), and honestly I haven't looked deep enough to understand what's going on, so feel free to shed more light here.

RE

Return copies of body, headers and trailers instead of references, create those copies under a lock - that would be the easiest way to let compiler know that the code is fine, but all three methods return abstract classes and currently there is no easy way to copy them (that's not to say, that copying is impossible in principle);

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

@adisuissa adisuissa self-assigned this Jan 24, 2025
@krinkinmu
Contributor Author

I was under the impression that the locks sync between the test's thread and the fake-upstream thread, so they are needed, regardless of the waitForX methods.

Let me clarify what I meant here: we still do need to synchronize between threads, but by the time we call body(), trailers() or headers() the synchronization has already "happened". Here is a hypothetical example of how a test looks:

  1. do the test setup
  2. trigger some action that you want to test
  3. wait for a synchronization event with, say, waitForHeadersComplete
  4. call headers() to check the result

In this scenario, by the time headers() is called, we have already synchronized. Unless the headers get changed again somehow, we don't really need a lock (and if they do, the current setup still does not protect us).
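
The four steps above can be sketched in a self-contained way. This is a simplified model under stated assumptions, not Envoy's implementation: HeaderStreamSketch is a hypothetical class, std::string stands in for the header map, and waitForHeadersComplete here returns a plain bool rather than whatever the real helper returns.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for FakeStream; std::string stands in for headers.
class HeaderStreamSketch {
public:
  // Called by the writer (fake-upstream) thread when headers arrive.
  void decodeHeaders(std::string headers) {
    {
      std::lock_guard<std::mutex> guard(lock_);
      headers_ = std::move(headers);
      headers_complete_ = true;
    }
    cv_.notify_all();
  }

  // Called by the test thread (step 3). Returns false if the timeout
  // expired before headers were marked complete.
  bool waitForHeadersComplete(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> guard(lock_);
    return cv_.wait_for(guard, timeout, [this] { return headers_complete_; });
  }

  // Unlocked accessor (step 4). Reading is ordered after the write only if
  // the wait above succeeded and nothing writes headers_ again afterwards.
  const std::string& headers() const { return headers_; }

private:
  std::mutex lock_;
  std::condition_variable cv_;
  bool headers_complete_ = false;
  std::string headers_;
};
```

In this model the successful wait acquires the same mutex the writer released, which is what makes the later unlocked read ordered after the write. If the wait times out, or the writer touches headers_ a second time, that ordering no longer covers the read.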

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

I think it might work for some cases, but it would not protect against situations when body gets modified in place. In this case if we don't copy data we still will have a race condition - that seems like it defeats the purpose. Am I missing something?

Here:

void FakeStream::decodeData(Buffer::Instance& data, bool end_stream) {
  received_data_ = true;
  absl::MutexLock lock(&lock_);
  body_.add(data);
  setEndStream(end_stream);
}

decodeData can be called multiple times, appending to the existing buffer. I don't think we can replace the buffer every time we call decodeData without changing semantics, and if we don't replace it every time we modify it, then even if we return a pointer instead of a reference, accessing the body through the pointer still isn't thread safe.

I will double check, of course, but it seems to me that some copying is still needed here to make it thread safe.

@krinkinmu
Contributor Author

FWIW, I'm happy to spend more time on this and fix it properly rather than waive the locks. I only commented above because I didn't quite understand how we can avoid copying.

@adisuissa
Contributor

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

I think it might work for some cases, but it would not protect against situations when body gets modified in place. In this case if we don't copy data we still will have a race condition - that seems like it defeats the purpose. Am I missing something?

Here:

void FakeStream::decodeData(Buffer::Instance& data, bool end_stream) {
  received_data_ = true;
  absl::MutexLock lock(&lock_);
  body_.add(data);
  setEndStream(end_stream);
}

decodeData can be called multiple times, appending to the existing buffer. I don't think we can replace the buffer every time we call decodeData without changing semantics, and if we don't replace it every time we modify it, then even if we return a pointer instead of a reference, accessing the body through the pointer still isn't thread safe.
I will double check, of course, but it seems to me that some copying is still needed here to make it thread safe.

This is because the body isn't a pointer. If it were changed to a pointer to an OwnedImpl, then whenever it is fetched, that pointer is replaced with a new OwnedImpl, and the old one is returned to the caller.

@krinkinmu
Contributor Author

krinkinmu commented Jan 24, 2025

FWIW, I think that the right way to solve it is somewhat similar to what you proposed here. Specifically, have the data-members be pointers (just like the headers_ for example), move the pointer to a temp var, reset the data-member, and return the temp-var. The idea is to avoid copying the data, but there's still an overhead when getting the data, due to the creation of the new object.

I think it might work for some cases, but it would not protect against situations when body gets modified in place. In this case if we don't copy data we still will have a race condition - that seems like it defeats the purpose. Am I missing something?
Here:

void FakeStream::decodeData(Buffer::Instance& data, bool end_stream) {
  received_data_ = true;
  absl::MutexLock lock(&lock_);
  body_.add(data);
  setEndStream(end_stream);
}

decodeData can be called multiple times, appending to the existing buffer. I don't think we can replace the buffer every time we call decodeData without changing semantics, and if we don't replace it every time we modify it, then even if we return a pointer instead of a reference, accessing the body through the pointer still isn't thread safe.
I will double check, of course, but it seems to me that some copying is still needed here to make it thread safe.

This is because the body isn't a pointer. If it were changed to a pointer to an OwnedImpl, then whenever it is fetched, that pointer is replaced with a new OwnedImpl, and the old one is returned to the caller.

Imagine the following scenario:

  1. Thread 1 adds some data to the body
  2. Thread 2 thread-safely gets a shared pointer to the body
  3. Thread 1 adds more data to the body
  4. Thread 2 accesses the body through the pointer

In this case if we allow thread 1 to append data to the body bit-by-bit (and not create a new body every time it calls decodeData), we will have a race condition between steps 3 and 4. On the other hand, if we don't allow thread 1 to append data to body bit-by-bit, it avoids a race condition, but it does seem like a change in semantics (currently it is possible to append to the body by calling decodeData multiple times).

@adisuissa
Contributor

I was under the impression that the locks sync between the test's thread and the fake-upstream thread, so they are needed, regardless of the waitForX methods.

Let me clarify what I meant here: we still do need to synchronize between threads, but by the time we call body(), trailers() or headers() the synchronization has already "happened". Here is a hypothetical example of how a test looks:

  1. do the test setup
  2. trigger some action that you want to test
  3. wait for a synchronization event with, say, waitForHeadersComplete
  4. call headers() to check the result

In this scenario, by the time headers() is called, we have already synchronized. Unless the headers get changed again somehow, we don't really need a lock (and if they do, the current setup still does not protect us).

2 points to consider:

  1. wait has a timeout, so the call can exit before the headers are complete. In other words, the headers_ may be changed after the wait is called.
  2. technically (although probably not in practice), the call to decodeHeaders(), if called twice for example, may move the pointer, and make it invalid while the main test thread is invoking 4. I suggest not relying on std::move to be considered an atomic function.

Reading the decodeHeaders() code I now understand why having the lock seems to be sufficient (taking into account only the base class). The idea is that the headers are changed (under a lock), and a new one is created (by the move there).
I haven't looked at the derived classes, so not sure if it all works well there.

@adisuissa
Contributor

This is because the body isn't a pointer. If it were changed to a pointer to an OwnedImpl, then whenever it is fetched, that pointer is replaced with a new OwnedImpl, and the old one is returned to the caller.

Imagine the following scenario:

  1. Thread 1 adds some data to the body
  2. Thread 2 thread-safely gets a shared pointer to the body
  3. Thread 1 adds more data to the body
  4. Thread 2 accesses the body through the pointer

In this case if we allow thread 1 to append data to the body bit-by-bit (and not create a new body every time it calls decodeData), we will have a race condition between steps 3 and 4. On the other hand, if we don't allow thread 1 to append data to body bit-by-bit, it avoids a race condition, but it does seem like a change in semantics (currently it is possible to append to the body by calling decodeData multiple times).

Note that what I'm suggesting is to replace the owned impl - so instead of adding more data, it will return the current owned impl data, and create a new owned impl to replace the data member.
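
One possible reading of this suggestion, sketched in a self-contained form. Everything here is an assumption made for illustration: TakeBodySketch and takeBody are hypothetical names, and std::string stands in for Buffer::OwnedImpl. The writer keeps appending under the lock, while the accessor swaps in a fresh buffer and hands the old one to the caller, so the returned object is never touched by the writer again.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical sketch; std::string stands in for Buffer::OwnedImpl.
class TakeBodySketch {
public:
  TakeBodySketch() : body_(std::make_unique<std::string>()) {}

  // Writer side: appends to whichever buffer is currently installed.
  void decodeData(const std::string& data) {
    std::lock_guard<std::mutex> guard(lock_);
    body_->append(data);
  }

  // Reader side: installs an empty buffer and returns the old one.
  // No bytes are copied, but data already taken is gone from the stream.
  std::unique_ptr<std::string> takeBody() {
    auto fresh = std::make_unique<std::string>();
    std::lock_guard<std::mutex> guard(lock_);
    std::swap(body_, fresh);
    return fresh; // now holds the previously accumulated data
  }

private:
  std::mutex lock_;
  std::unique_ptr<std::string> body_;
};
```

The cost is a change in accessor semantics: each call returns only the data that arrived since the previous call, rather than everything received so far.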

@krinkinmu
Contributor Author

Note that what I'm suggesting is to replace the owned impl - so instead of adding more data, it will return the current owned impl data, and create a new owned impl to replace the data member.

Yes, I understand that you're suggesting to replace the body and I also understand that it avoids a race condition.

What I'm trying to point out, though, is that it also changes the semantics of the method. Basically, with this change we cannot append to the body anymore. Either a call to body() will reset the current value, or a call to decodeData will, but either way the old data is lost unless we copy it, and that's not what the current implementations of these methods do.

@krinkinmu
Contributor Author

Maybe let me try to demonstrate my concern a bit more specifically and using an example.

An implementation of your suggestion, the way I understand it, might look something like this:

std::shared_ptr<Buffer::Instance> body() const {
  std::shared_ptr<Buffer::Instance> result;
  {
    absl::MutexLock lock(&lock_);
    // copy the pointer (not the data) under the mutex
    result = body_;
  }
  return result;
}

void decodeData(Buffer::Instance& data, bool end_stream) {
  // build the replacement buffer before taking the lock
  std::shared_ptr<Buffer::Instance> new_body = std::make_shared<Buffer::OwnedImpl>(data);
  absl::MutexLock lock(&lock_);
  received_data_ = true;
  body_ = new_body;
  setEndStream(end_stream);
}

Now, here is the alternative implementation that does an additional copy:

std::shared_ptr<Buffer::Instance> body() const {
  std::shared_ptr<Buffer::Instance> result = std::make_shared<Buffer::OwnedImpl>();
  {
    absl::MutexLock lock(&lock_);
    // instead of copying the pointer, we copy the data under a mutex
    result->add(body_);
  }
  return result;
}

void decodeData(Buffer::Instance& data, bool end_stream) {
  absl::MutexLock lock(&lock_);
  received_data_ = true;
  body_.add(data);
  setEndStream(end_stream);
}

I think that both of these implementations avoid a race condition, but they offer different behaviors. Let's look at this example to illustrate the difference:

Buffer::OwnedImpl a("a"), b("b");
decodeData(a, false);
decodeData(b, true);
assert(body()->toString() == "ab");

For the first implementation, which does not copy, the assert will fail, while for the second one it will not. I think that, leaving the race condition aside, the current implementation of body and decodeData behaves like the second alternative, not the first.

So it seems to me that if we want to preserve the current behavior we do need to copy data at some point. Am I misunderstanding your suggestion somehow?
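
The behavioral difference can be reproduced outside Envoy with a small stand-alone model. This is only a sketch: ReplacingStream and CopyingStream are hypothetical names, std::string replaces Buffer::Instance, and the end_stream bookkeeping is left out.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>

// Variant 1 (hypothetical): decodeData replaces the buffer and body()
// returns the shared pointer, so no data is copied in the accessor.
class ReplacingStream {
public:
  void decodeData(const std::string& data) {
    auto fresh = std::make_shared<std::string>(data);
    std::lock_guard<std::mutex> guard(lock_);
    body_ = fresh; // earlier data is dropped
  }
  std::shared_ptr<const std::string> body() const {
    std::lock_guard<std::mutex> guard(lock_);
    return body_;
  }

private:
  mutable std::mutex lock_;
  std::shared_ptr<const std::string> body_ = std::make_shared<std::string>();
};

// Variant 2 (hypothetical): decodeData appends and body() returns a copy
// made under the lock.
class CopyingStream {
public:
  void decodeData(const std::string& data) {
    std::lock_guard<std::mutex> guard(lock_);
    body_ += data;
  }
  std::string body() const {
    std::lock_guard<std::mutex> guard(lock_);
    return body_;
  }

private:
  mutable std::mutex lock_;
  std::string body_;
};
```

After two decodeData calls with "a" and "b", the replacing variant reports only the last chunk while the copying variant reports the concatenation, which is the difference the assert above is meant to expose.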

@adisuissa
Contributor

Yeah, the body accessor semantics will need to change (but I'm not sure that this is a bad thing).
The alternative, if one wants to keep the semantics, is to hold 2 OwnedImpl objects, where decodeData() adds to object1, the accessor moves (not copies) the data from object1 and adds it to object2 (this is done under a lock), and object2 is returned.
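
A rough sketch of this two-object alternative, under loud assumptions: TwoBufferStream is a hypothetical name, std::string stands in for OwnedImpl, and the append-then-clear below copies bytes where a real buffer type could transfer ownership of its data instead. It also assumes a single reader thread, since the second buffer is only touched inside the accessor.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical sketch; std::string stands in for Buffer::OwnedImpl.
class TwoBufferStream {
public:
  // Writer side: appends to the first buffer (object1) under the lock.
  void decodeData(const std::string& data) {
    std::lock_guard<std::mutex> guard(lock_);
    incoming_ += data;
  }

  // Reader side: drains object1 into object2 under the lock and returns
  // object2. Assumes only one thread ever calls body().
  const std::string& body() {
    std::lock_guard<std::mutex> guard(lock_);
    drained_ += incoming_; // a real buffer could move the data here
    incoming_.clear();
    return drained_;
  }

private:
  std::mutex lock_;
  std::string incoming_; // object1: written by the writer thread
  std::string drained_;  // object2: modified only inside body()
};
```

This keeps the append semantics, since repeated calls still see everything received so far. In this std::string model a later body() call may reallocate drained_, so a reference obtained earlier should not be kept across calls.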

@krinkinmu
Contributor Author

Yeah, the body accessor semantics will need to change (but I'm not sure that this is a bad thing). The alternative, if one wants to keep the semantics, is to hold 2 OwnedImpl objects, where decodeData() adds to object1, the accessor moves (not copies) the data from object1 and adds it to object2 (this is done under a lock), and object2 is returned.

I see, I think I understand now what you mean.

Let me first try to implement a version with just the pointers and see how many tests (if any) actually depend on the current semantics. If none of them do, then it works. If some do depend on the current semantics and cannot be easily fixed, then I can fall back to the approach of moving data blocks between two OwnedImpls.
