Mir

Merge lp://qastaging/~afrantzis/mir/fix-1317801 into lp://qastaging/mir

Proposed by Alexandros Frantzis
Status: Rejected
Rejected by: Alexandros Frantzis
Proposed branch: lp://qastaging/~afrantzis/mir/fix-1317801
Merge into: lp://qastaging/mir
Diff against target: 162 lines (+76/-7)
2 files modified
src/server/compositor/buffer_queue.cpp (+17/-0)
tests/unit-tests/compositor/test_buffer_queue.cpp (+59/-7)
To merge this branch: bzr merge lp://qastaging/~afrantzis/mir/fix-1317801
Reviewer Review Type Date Requested Status
Alberto Aguirre (community) Needs Fixing
Daniel van Vugt Needs Fixing
Alan Griffiths Abstain
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+218996@code.qastaging.launchpad.net

Commit message

compositor: Fix race in BufferQueue (LP: #1317801)

Reserve the maximum size for the 'buffers' std::vector so that all references to
existing elements of 'buffers' remain valid when we add new elements to it
(i.e. in client_acquire()).

Also add BufferQueueTest.with_single_buffer_compositor_acquires_resized_frames
to guard against a seemingly correct, but ultimately wrong, alternative approach to
resolving this race, namely copying the value from the vector instead of using
a reference to it.

Description of the change

compositor: Fix race in BufferQueue

Reserve the maximum size for the 'buffers' std::vector so that all references to
existing elements of 'buffers' remain valid when we add new elements to it
(i.e. in client_acquire()).

Also add BufferQueueTest.with_single_buffer_compositor_acquires_resized_frames to guard against a seemingly correct, but ultimately wrong, alternative approach to resolving this race, namely copying the value from the vector instead of using a reference to it.

To post a comment you must log in.
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

~~~
108 + std::thread t{
109 + [&]
110 + {
111 + auto handle = client_acquire_async(q);
112 + client_acquire_requested.wake_up_everyone();
113 + handle->wait();
114 + }};
115 +
116 + client_acquire_requested.wait_for_at_most_seconds(3);
121 + t.join();
~~~
Can be replaced with:

auto handle = client_acquire_async(q);
handle->wait_for(std::chrono::seconds(3));
ASSERT_THAT(handle->has_acquired_buffer(), Eq(true));

~~~
153 + std::thread t{
154 + [&]
155 + {
156 + /* Use in conjuction with a 20ms delay in compositor_acquire() */
157 + std::this_thread::sleep_for(std::chrono::milliseconds{10});
158 +
159 + q.client_release(client_acquire_sync(q));
160 + q.client_release(client_acquire_sync(q));
161 + }};
~~~

Use AutoJoinThread so that thread is still joined in case compositor_acquire or compositor_release throws.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Can be replaced with:
>
> auto handle = client_acquire_async(q);
> handle->wait_for(std::chrono::seconds(3));
> ASSERT_THAT(handle->has_acquired_buffer(), Eq(true));

For this test we need the following sequence of events to occur:

1. client_acquire_async() request (which blocks because we have already client_acquired/released the only buffer previously)
2. compositor_acquire() which gets the single buffer and also releases the only buffer back to the client and unblocks it

So a single thread wouldn't work because it would block in handle->wait_for(std::chrono::seconds(3)); (and fail).

> Use AutoJoinThread so that thread is still joined in case compositor_acquire
> or compositor_release throws.

Will do.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Use AutoJoinThread so that thread is still joined in case compositor_acquire
> or compositor_release throws.

Done.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> ~~~
> 108 + std::thread t{
> 109 + [&]
> 110 + {
> 111 + auto handle = client_acquire_async(q);
> 112 + client_acquire_requested.wake_up_everyone();
> 113 + handle->wait();
> 114 + }};
> 115 +
> 116 + client_acquire_requested.wait_for_at_most_seconds(3);
> 121 + t.join();
> ~~~

Removed thread.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

* Needs Discussion *

130 + * This is a regression test for bug lp:1317801. This bug is a race and
131 + * very difficult to reproduce with pristine code. By carefully placing
132 + * a delay in the code, we can greatly increase the chances (100% for me)
133 + * that this test catches a regression. However these delays are not
134 + * acceptable for production use, so since the test and code in their
135 + * pristine state are highly unlikely to catch the issue, I have decided
136 + * to DISABLE the test to avoid giving a false sense of security.

I can understand the desire to preserve a recipe for reproducing the problem but I'm not sure that having a disabled test that also requires a patch to the production code is effective.

Would it be less questionable to put a call to a virtual function call into BufferQueue::compositor_acquire() "just before returning the acquired_buffer" with a default null implementation, but override this in a test class with the needed delay?

OTOH any tests that rely on sleep() are unreliable anyway.

review: Needs Information
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

LGTM

review: Approve
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Yay, a fix before Monday. Nice!

Although using std::vector::reserve() is technically a solution that should work (AFAIK the docs say it won't resize if enough is reserved), I think it's a little bit shaky relying on a technicality that is non-obvious and easy to regress by accident (deleting the single reserve call). I would prefer to see a slightly more robust solution of a fixed-length array.

This is the kind of bug the old array/index approach was designed to avoid. That and using a ring made it impossible to ever lose track of any sub-list.

In the past, I've found a useful way to catch bugs like this is to put macros in my code to temporarily disable any realloc() avoidance optimizations. Then running your tests would catch bugs like this one, but requires you write your own containers too :)

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Could it be less questionable to put a call to a virtual function call into BufferQueue::compositor_acquire() "just > before returning the acquired_buffer" with a default null implementation, but override this in a test class with the > needed delay?

I implemented this approach and benchmarked compositor_acquire() with callgrind. I got a 2.5% increase in estimated cycles when using the null virtual function vs not having it. It's not a huge deal, but I am not sure we want to pay that price for a regression test.

Perhaps we could guard the call with a conditional and/or use templates to pass the function to use (e.g. template <void (*before_compositor_acquire_return)()> BufferQueue {...};), in order to cut down the cost, but I am not convinced it's worth the trouble either.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

> > Could it be less questionable to put a call to a virtual function call into
> BufferQueue::compositor_acquire() "just > before returning the
> acquired_buffer" with a default null implementation, but override this in a
> test class with the > needed delay?
>
> I implemented this approach and benchmarked compositor_acquire() with
> callgrind. I got a 2.5% increase in estimated cycles when using the null
> virtual function vs not having it. It's not a huge deal, but I am not sure we
> want to pay that price for a regression test.

I'd prefer that to having a "regression test" that is DISABLED *and* requires an edit to the sourcecode. But I won't prolong the discussion looking for a better way to test this.

review: Abstain
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Oops, I just noticed there's still a potentially fatal race --> bug 1318632

We should fix that here because it's the same section of code. And then once you have removed the reference it might be safe to drop the reserve() call.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Oops, I just noticed there's still a potentially fatal race --> bug 1318632
>
> We should fix that here because it's the same section of code. And then once
> you have removed the reference it might be safe to drop the reserve() call.

Concerning that bug, my understanding is that std::shared_ptr can handle the scenario described in a thread-safe manner without any external synchronization (the control block of a shared_ptr which is what is shared between instances is thread safe).

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

Alexandros,

I think it's fine if the compositor_acquire lags the client during resizing - indeed this is the case for multiple-buffer queues as well.

Before getting to l32, the decision of which buffer to use has been made, so I think a shared_ptr copy is fine. Yeah there will be a transitive period where there are two buffers in play for a one-buffer queue - but that's ok).

So given that using a reference still has a potential race against surface resizing - a copy just makes more sense.

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Alexandros,
>
> I think it's fine if the compositor_acquire lags the client during resizing -
> indeed this is the case for multiple-buffer queues as well.
>
> Before getting to l32, the decision of which buffer to use has been made, so I
> think a shared_ptr copy is fine. Yeah there will be a transitive period where
> there are two buffers in play for a one-buffer queue - but that's ok).
>
> So given that using a reference still has a potential race against surface
> resizing - a copy just makes more sense.

Unfortunately, changing to copy the shared_ptr is not straighforward either. It introduces other problems, like for example:

client_acquire => A
client_release(A)
queue.resize(B_size)
client_acquire => no free buffer yet
compositor_acquire => returns A, current_compositor_buffer is B (resized) and client gets B
compositor_release(A) => now free{A} client{B}
client_acquire => gets A, which is wrong anyway, but furthermore A may have been deleted since the only shared_ptr to it was held by the compositor in the previous step... chaos ensues ☠

More investigation is needed. As first step I will look into potential races and solutions while keeping the reference, since it seems more straightforward than the alternative.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> More investigation is needed. As first step I will look into potential races and solutions while keeping the
> reference, since it seems more straightforward than the alternative.

Investigation didn't lead anywhere, so I went for the copy approach + some fixes needed for it to work:

https://code.launchpad.net/~afrantzis/mir/fix-1317801-alt/+merge/219356

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

Unmerged revisions

1618. By Alexandros Frantzis

tests: Simplify test

1617. By Alexandros Frantzis

tests: Use AutoJoinThread to ensure proper cleanup

1616. By Alexandros Frantzis

compositor: Fix race in BufferQueue

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.

Subscribers

People subscribed via source and target branches