Mir

Merge lp://qastaging/~afrantzis/mir/fix-1350207-unresponsive-clients into lp://qastaging/mir

Proposed by Alexandros Frantzis
Status: Merged
Approved by: Daniel van Vugt
Approved revision: no longer in the source branch.
Merged at revision: 1920
Proposed branch: lp://qastaging/~afrantzis/mir/fix-1350207-unresponsive-clients
Merge into: lp://qastaging/mir
Diff against target: 293 lines (+158/-15)
9 files modified
src/server/frontend/socket_messenger.cpp (+21/-5)
tests/acceptance-tests/CMakeLists.txt (+1/-0)
tests/acceptance-tests/test_unresponsive_client.cpp (+122/-0)
tests/include/mir_test/cross_process_action.h (+2/-1)
tests/include/mir_test_framework/display_server_test_fixture.h (+1/-1)
tests/include/mir_test_framework/testing_process_manager.h (+1/-1)
tests/mir_test/cross_process_action.cpp (+2/-2)
tests/mir_test_framework/display_server_test_fixture.cpp (+2/-2)
tests/mir_test_framework/testing_process_manager.cpp (+6/-3)
To merge this branch: bzr merge lp://qastaging/~afrantzis/mir/fix-1350207-unresponsive-clients
Reviewer Review Type Date Requested Status
Daniel van Vugt Approve
Alberto Aguirre (community) Approve
Kevin DuBois (community) Approve
Chris Halse Rogers Approve
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+233934@code.qastaging.launchpad.net

Commit message

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

Description of the change

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

This MP makes the sockets used for communicating with clients non-blocking. When the server tries to write to a socket with a full send buffer it will fail instead of blocking, resulting in either dropping the event (when sending events), or disconnecting the client (when responding to a client rpc request).
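As an illustration of the write path this describes — a minimal sketch assuming a boost::asio unix-domain stream socket, with a hypothetical try_send() helper rather than the actual socket_messenger.cpp code:

#include <boost/asio.hpp>
#include <cstddef>

namespace ba = boost::asio;

// During connection setup the socket is put in non-blocking mode:
//   socket.non_blocking(true);

// Returns false when the client's send buffer is full, so the caller can
// drop the event or disconnect the client instead of blocking the server.
bool try_send(ba::local::stream_protocol::socket& socket,
              void const* data, std::size_t size)
{
    boost::system::error_code ec;
    ba::write(socket, ba::buffer(data, size), ec);

    if (ec == ba::error::would_block)
        return false; // a full implementation would also handle partial writes
    if (ec)
        throw boost::system::system_error{ec};
    return true;
}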

This MP also increases the socket send buffer to 64KiB (default is 16KiB) to give clients a little more breathing room for transient freezes. We can increase it even more if needed (max is 4MiB).
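For reference, the larger send buffer can be requested with asio's standard socket option (shown here only as a sketch of the mechanism, not the MP's exact call site):

// Equivalent to setsockopt(fd, SOL_SOCKET, SO_SNDBUF, ...). The kernel
// doubles the requested value for its own bookkeeping and caps it at
// /proc/sys/net/core/wmem_max.
socket.set_option(ba::socket_base::send_buffer_size{64 * 1024});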

Note that this is only a workaround, not a final fix. A proper fix would involve making the data writes asynchronous, essentially providing some more buffering on the server side (instead of in the kernel). Unfortunately, this is not easily achievable with asio in this particular scenario, for two reasons (a sketch of a partial workaround for the second follows the list):

1. Asio doesn't support sending ancillary data (e.g. file descriptors) asynchronously and in coordination with normal data.

2. There is no guaranteed ordering and atomicity of asio asynchronous writes, and messages may end up being interleaved without additional care.
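
To make the second limitation concrete, here is a sketch of the kind of workaround it would require: queueing whole messages and keeping a single async_write in flight. All names are hypothetical; this preserves ordering and atomicity, but does nothing for ancillary data (point 1), and the queue grows without bound for a blocked client:

#include <boost/asio.hpp>
#include <cstddef>
#include <deque>
#include <vector>

namespace ba = boost::asio;

class OrderedWriter
{
public:
    explicit OrderedWriter(ba::local::stream_protocol::socket& socket)
        : socket{socket}
    {
    }

    // Must be called from a single thread (or through an asio strand).
    void send(std::vector<char> message)
    {
        bool const idle = pending.empty();
        pending.push_back(std::move(message));
        if (idle)
            write_next();
    }

private:
    void write_next()
    {
        // async_write completes only when the whole front message has been
        // sent, and a second write never starts before the first finishes,
        // so messages cannot interleave.
        ba::async_write(
            socket, ba::buffer(pending.front()),
            [this](boost::system::error_code ec, std::size_t)
            {
                pending.pop_front();
                if (!ec && !pending.empty())
                    write_next();
            });
    }

    ba::local::stream_protocol::socket& socket;
    std::deque<std::vector<char>> pending;
};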

More discussion is needed in this area:
* what are the criteria for dropping a client (should the server decide, the shell, or both?)
* application-not-responding notification to the shell
* if we drop messages to clients, does it make sense for them to continue in a possibly inconsistent state? (it depends on the kind of dropped messages)

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

I haven't looked at the code in detail, but it sounds like, to actually fix the problem, we need to spawn a comms thread and not block the main thread.

Revision history for this message
Chris Halse Rogers (raof) wrote :

No, I don't think a comms thread helps particularly; that just changes the problem to “blocked clients cause unbounded memory consumption”.

I think that properly handling this would be (a rough sketch follows the list):
a) Mark the client as unresponsive in a way that's visible to shells
b) Register a callback for when the socket becomes writable
c) Drop any additional writes to the socket
d) When the socket becomes writable, send a full state-update message so that we know the client and server agree on what the current state is.
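
A rough sketch of steps (b) and (d), assuming a POSIX fd and hypothetical mark_unresponsive()/send_full_state_update() helpers (a real server would register the fd with its event loop rather than block in poll()):

#include <poll.h>
#include <cerrno>

// Wait for the client's socket to drain enough to become writable again.
bool wait_until_writable(int socket_fd, int timeout_ms)
{
    pollfd pfd{socket_fd, POLLOUT, 0};
    int ret;
    do
    {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret == -1 && errno == EINTR);
    return ret > 0 && (pfd.revents & POLLOUT);
}

// Hypothetical usage:
//   mark_unresponsive(client);                // step (a)
//   if (wait_until_writable(fd, timeout_ms))  // step (b); writes dropped meanwhile, step (c)
//       send_full_state_update(client);       // step (d): client and server re-agree
//   else
//       disconnect(client);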

For now bumping the buffer size and hoping that the client didn't really need the messages we dropped seems like a non-terrible improvement to the status quo. Except...

Doesn't your new while (nread != ba::buffer_size(buffer)) loop replace a wait in the kernel with a busy-wait in the server, or does transfer_exactly() not work on a non-blocking socket?

You'll certainly busy-wait until there's something to read, which seems bad.

Does anyone else find ‘ba::mutable_buffers_1 const&’ hilarious? :)

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Doesn't your new while (nread != ba::buffer_size) loop replace a wait in the kernel
> with a busy-wait in the server, or does transfer_exactly() not work on a
> non-blocking socket?
>
> You'll certainly busy-wait until there's something to read, which seems bad.

This is indeed a busy-wait in the server, but one that doesn't currently affect us because
the receive_msg() method is only used in this context:

if (message_receiver->available_bytes() >= body_size)
{
    on_new_message(message_receiver->receive_msg(ba::buffer(body)));
}

essentially guaranteeing that we never busy-loop. I introduced the loop to ensure we retain the blocking semantics this method had previously. It's too bad that we can't set (non-)blocking mode for reads and writes separately.
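
For reference, the retry loop under discussion looks roughly like this (a simplified sketch, not the exact socket_messenger.cpp code); each iteration reads into the unfilled tail of the buffer:

#include <boost/asio.hpp>
#include <cstddef>

namespace ba = boost::asio;

void receive_exactly(ba::local::stream_protocol::socket& socket,
                     ba::mutable_buffers_1 const& buffer)
{
    std::size_t nread = 0;
    while (nread != ba::buffer_size(buffer))
    {
        boost::system::error_code ec;
        // Read into the unfilled tail; on a non-blocking socket,
        // would_block simply means "retry".
        nread += socket.read_some(ba::mutable_buffers_1{buffer + nread}, ec);
        if (ec && ec != ba::error::would_block)
            throw boost::system::system_error{ec};
        // Spinning on would_block is safe only because callers first
        // check available_bytes() >= body_size.
    }
}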

Revision history for this message
Chris Halse Rogers (raof) wrote :

Yeah, that seems reasonable. I'm a bit surprised that boost::transfer_exactly() doesn't do the right thing, but eh.

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Tested. Bug fixed!

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Whoops. Now that I look more closely:

(1) This looks too much like an infinite loop waiting to happen. I think "!=" needs to change to "<" for safety:
  while (nread != ba::buffer_size(buffer))
Also you could incorporate the if statement by changing to a "do .. while".

(2) This worries me too:
  "* if we drop messages to clients does it make sense for the clients to continue in a possibly inconsistent state (it depends on the kind of dropped messages)"
If a busy client gets blocked doing other things just a bit too long, there's the real risk it won't be able to recover any more (rendering will freeze and never resume).

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) This looks too much like an infinite loop waiting to happen.
> I think "!=" needs to change to "<" for safety: while (nread != ba::buffer_size(buffer))

nread can't ever become larger than ba::buffer_size(buffer), since we are decreasing the size of the buffer we pass to ba::read() (ba::mutable_buffers_1{buffer + nread}).

> (2) This worries me too:
> "* if we drop messages to clients does it make sense for the clients to continue in a possibly
> inconsistent state (it depends on the kind of dropped messages)"
> If a busy client gets blocked doing other things just a bit too long, there's the real risk
> it won't be able to recover any more (rendering will freeze and never resume).

As things stand, if a client is blocked and the socket becomes full, then:

1. if the server tries to send an event and can't, the event will be dropped
2. if the server tries to reply to an RPC call (e.g. next_buffer) and can't, it will disconnect the client

So, at least we won't get a frozen client. Of course this is far from ideal, but it's a trade-off we need to decide whether we want to make: we either run the risk of dropping events or disconnecting clients if they stay busy for too long, or we run the risk of a blocked client hanging the whole server. Note that we can further increase the socket send buffer to give transiently frozen clients more breathing room if we find it necessary.

The idea behind this MP is to provide a reasonable workaround for the issue without changing too much code and risking destabilizing the frontend for rtm. I am open to other ideas.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, weakly "Approved". I recommend fixing still...

(1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.

(2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

Revision history for this message
Kevin DuBois (kdub) wrote :

Well, this looks okay from a code perspective. It seems that we're trading the 'blocked clients can hang the server' bug for a 'legitimately stopped clients will get forcibly disconnected' bug. I suppose it's the lesser of two evils, though.

review: Approve
Revision history for this message
Kevin DuBois (kdub) wrote :

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a
> new bug for the dropped response issue immediately?
Sounds appropriate to do

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

30 + while (nread != ba::buffer_size(buffer))

I would prefer a < as well...but not blocking.

+1 on submitting a bug for (2) and also for the TODO.

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.
> I would prefer a < as well...but not blocking.

Fixed.

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

I will open a new bug as soon as this gets merged.

Revision history for this message
Daniel van Vugt (vanvugt) :
review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I will open a new bug as soon as this gets merged.

https://bugs.launchpad.net/mir/+bug/1370117

