Mir

Merge lp://qastaging/~afrantzis/mir/fix-1317801 into lp://qastaging/mir

fix-1317801
Merge into development-branch

Proposed by Alexandros Frantzis on 2014-05-09

Status:

Rejected

Rejected by:

Alexandros Frantzis on 2014-05-14

Proposed branch:

lp://qastaging/~afrantzis/mir/fix-1317801

Merge into:

lp://qastaging/mir

Diff against target:

162 lines (+76/-7)

2 files modified

src/server/compositor/buffer_queue.cpp (+17/-0)
tests/unit-tests/compositor/test_buffer_queue.cpp (+59/-7)

To merge this branch:

bzr merge lp://qastaging/~afrantzis/mir/fix-1317801

Related bugs:

Bug #1317801: [regression] [BufferQueue] BufferQueueTest.compositor_never_owns_client_buffers occasionally crashes with: what(): unexpected release: buffer was not given to compositor	Critical	Fix Released
Bug #1318632: [regression] [BufferQueue] Race condition in BufferQueue::compositor_acquire could underflow shared_ptr refcount and delete prematurely, crash.	High	Fix Released

Link a bug report

Reviewer	Review Type	Date Requested	Status
Alberto Aguirre (community)		2014-05-09	Needs Fixing on 2014-05-12
Daniel van Vugt			Needs Fixing on 2014-05-12
Alan Griffiths			Abstain on 2014-05-12
PS Jenkins bot (community)	continuous-integration		Approve on 2014-05-09
Review via email: mp+218996@code.qastaging.launchpad.net

Commit message

compositor: Fix race in BufferQueue (LP: #1317801)

Reserve the maximum size for the 'buffers' std::vector so that all references to
existing elements of 'buffers' remain valid when we add new elements to it
(i.e. in client_acquire()).

Also add BufferQueueTest.with_single_buffer_compositor_acquires_resized_frames
to guard against a seemingly correct, but ultimately wrong, alternative approach to
resolving this race, namely copying the value from the vector instead of using
a reference to it.

Description of the change

compositor: Fix race in BufferQueue

Reserve the maximum size for the 'buffers' std::vector so that all references to
existing elements of 'buffers' remain valid when we add new elements to it
(i.e. in client_acquire()).

Also add BufferQueueTest.with_single_buffer_compositor_acquires_resized_frames to guard against a seemingly correct, but ultimately wrong, alternative approach to resolving this race, namely copying the value from the vector instead of using a reference to it.

Revision history for this message

Alberto Aguirre (albaguirre) wrote on 2014-05-09:

~~~
108 + std::thread t{
109 + [&]
110 + {
111 + auto handle = client_acquire_async(q);
112 + client_acquire_requested.wake_up_everyone();
113 + handle->wait();
114 + }};
115 +
116 + client_acquire_requested.wait_for_at_most_seconds(3);
121 + t.join();
~~~
Can be replaced with:

auto handle = client_acquire_async(q);
handle->wait_for(std::chrono::seconds(3));
ASSERT_THAT(handle->has_acquired_buffer(), Eq(true));

~~~
153 + std::thread t{
154 + [&]
155 + {
156 + /* Use in conjuction with a 20ms delay in compositor_acquire() */
157 + std::this_thread::sleep_for(std::chrono::milliseconds{10});
158 +
159 + q.client_release(client_acquire_sync(q));
160 + q.client_release(client_acquire_sync(q));
161 + }};
~~~

Use AutoJoinThread so that thread is still joined in case compositor_acquire or compositor_release throws.

review: Needs Fixing

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-09:

> Can be replaced with:
>
> auto handle = client_acquire_async(q);
> handle->wait_for(std::chrono::seconds(3));
> ASSERT_THAT(handle->has_acquired_buffer(), Eq(true));

For this test we need the following sequence of events to occur:

1. client_acquire_async() request (which blocks because we have already client_acquired/released the only buffer previously)
2. compositor_acquire() which gets the single buffer and also releases the only buffer back to the client and unblocks it

So a single thread wouldn't work because it would block in handle->wait_for(std::chrono::seconds(3)); (and fail).

> Use AutoJoinThread so that thread is still joined in case compositor_acquire
> or compositor_release throws.

Will do.

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-09:

> Use AutoJoinThread so that thread is still joined in case compositor_acquire
> or compositor_release throws.

Done.

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-09:

> ~~~
> 108 + std::thread t{
> 109 + [&]
> 110 + {
> 111 + auto handle = client_acquire_async(q);
> 112 + client_acquire_requested.wake_up_everyone();
> 113 + handle->wait();
> 114 + }};
> 115 +
> 116 + client_acquire_requested.wait_for_at_most_seconds(3);
> 121 + t.join();
> ~~~

Removed thread.

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-05-09:

* Needs Discussion *

130 + * This is a regression test for bug lp:1317801. This bug is a race and
131 + * very difficult to reproduce with pristine code. By carefully placing
132 + * a delay in the code, we can greatly increase the chances (100% for me)
133 + * that this test catches a regression. However these delays are not
134 + * acceptable for production use, so since the test and code in their
135 + * pristine state are highly unlikely to catch the issue, I have decided
136 + * to DISABLE the test to avoid giving a false sense of security.

I can understand the desire to preserve a recipe for reproducing the problem but I'm not sure that having a disabled test that also requires a patch to the production code is effective.

Would it be less questionable to put a call to a virtual function call into BufferQueue::compositor_acquire() "just before returning the acquired_buffer" with a default null implementation, but override this in a test class with the needed delay?

OTOH any tests that rely on sleep() are unreliable anyway.

review: Needs Information

Revision history for this message

Alberto Aguirre (albaguirre) wrote on 2014-05-09:

LGTM

review: Approve

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-05-09:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/1582/rebuild

review: Approve (continuous-integration)

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-05-09:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/mir-team-mir-development-branch-ci/1584/rebuild

review: Approve (continuous-integration)

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-05-12:

Yay, a fix before Monday. Nice!

Although using std::vector::reserve() is technically a solution that should work (AFAIK the docs say it won't resize if enough is reserved), I think it's a little bit shaky relying on a technicality that is non-obvious and easy to regress by accident (deleting the single reserve call). I would prefer to see a slightly more robust solution of a fixed-length array.

This is the kind of bug the old array/index approach was designed to avoid. That and using a ring made it impossible to ever lose track of any sub-list.

In the past, I've found a useful way to catch bugs like this is to put macros in my code to temporarily disable any realloc() avoidance optimizations. Then running your tests would catch bugs like this one, but requires you write your own containers too :)

review: Approve

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-12:

> Could it be less questionable to put a call to a virtual function call into BufferQueue::compositor_acquire() "just > before returning the acquired_buffer" with a default null implementation, but override this in a test class with the > needed delay?

I implemented this approach and benchmarked compositor_acquire() with callgrind. I got a 2.5% increase in estimated cycles when using the null virtual function vs not having it. It's not a huge deal, but I am not sure we want to pay that price for a regression test.

Perhaps we could guard the call with a conditional and/or use templates to pass the function to use (e.g. template <void (*before_compositor_acquire_return)()> BufferQueue {...};), in order to cut down the cost, but I am not convinced it's worth the trouble either.

Revision history for this message

Alan Griffiths (alan-griffiths) wrote on 2014-05-12:

> > Could it be less questionable to put a call to a virtual function call into
> BufferQueue::compositor_acquire() "just > before returning the
> acquired_buffer" with a default null implementation, but override this in a
> test class with the > needed delay?
>
> I implemented this approach and benchmarked compositor_acquire() with
> callgrind. I got a 2.5% increase in estimated cycles when using the null
> virtual function vs not having it. It's not a huge deal, but I am not sure we
> want to pay that price for a regression test.

I'd prefer that to having a "regression test" that is DISABLED *and* requires an edit to the sourcecode. But I won't prolong the discussion looking for a better way to test this.

review: Abstain

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2014-05-12:

Oops, I just noticed there's still a potentially fatal race --> bug 1318632

We should fix that here because it's the same section of code. And then once you have removed the reference it might be safe to drop the reserve() call.

review: Needs Fixing

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-12:

> Oops, I just noticed there's still a potentially fatal race --> bug 1318632
>
> We should fix that here because it's the same section of code. And then once
> you have removed the reference it might be safe to drop the reserve() call.

Concerning that bug, my understanding is that std::shared_ptr can handle the scenario described in a thread-safe manner without any external synchronization (the control block of a shared_ptr which is what is shared between instances is thread safe).

Revision history for this message

Alberto Aguirre (albaguirre) wrote on 2014-05-12:

Alexandros,

I think it's fine if the compositor_acquire lags the client during resizing - indeed this is the case for multiple-buffer queues as well.

Before getting to l32, the decision of which buffer to use has been made, so I think a shared_ptr copy is fine. Yeah there will be a transitive period where there are two buffers in play for a one-buffer queue - but that's ok).

So given that using a reference still has a potential race against surface resizing - a copy just makes more sense.

review: Needs Fixing

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-13:

> Alexandros,
>
> I think it's fine if the compositor_acquire lags the client during resizing -
> indeed this is the case for multiple-buffer queues as well.
>
> Before getting to l32, the decision of which buffer to use has been made, so I
> think a shared_ptr copy is fine. Yeah there will be a transitive period where
> there are two buffers in play for a one-buffer queue - but that's ok).
>
> So given that using a reference still has a potential race against surface
> resizing - a copy just makes more sense.

Unfortunately, changing to copy the shared_ptr is not straighforward either. It introduces other problems, like for example:

client_acquire => A
client_release(A)
queue.resize(B_size)
client_acquire => no free buffer yet
compositor_acquire => returns A, current_compositor_buffer is B (resized) and client gets B
compositor_release(A) => now free{A} client{B}
client_acquire => gets A, which is wrong anyway, but furthermore A may have been deleted since the only shared_ptr to it was held by the compositor in the previous step... chaos ensues ☠

More investigation is needed. As first step I will look into potential races and solutions while keeping the reference, since it seems more straightforward than the alternative.

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-13:

> More investigation is needed. As first step I will look into potential races and solutions while keeping the
> reference, since it seems more straightforward than the alternative.

Investigation didn't lead anywhere, so I went for the copy approach + some fixes needed for it to work:

https://code.launchpad.net/~afrantzis/mir/fix-1317801-alt/+merge/219356

Revision history for this message

Alexandros Frantzis (afrantzis) wrote on 2014-05-14:

Rejecting in favor of https://code.launchpad.net/~afrantzis/mir/fix-1317801-alt/+merge/219356

Unmerged revisions

1618. By Alexandros Frantzis on 2014-05-09: tests: Simplify test
1617. By Alexandros Frantzis on 2014-05-09: tests: Use AutoJoinThread to ensure proper cleanup
1616. By Alexandros Frantzis on 2014-05-09: compositor: Fix race in BufferQueue