Mir

Merge lp://qastaging/~afrantzis/mir/fix-1350207-unresponsive-clients into lp://qastaging/mir

Proposed by Alexandros Frantzis
Status: Merged
Approved by: Daniel van Vugt
Approved revision: no longer in the source branch.
Merged at revision: 1920
Proposed branch: lp://qastaging/~afrantzis/mir/fix-1350207-unresponsive-clients
Merge into: lp://qastaging/mir
Diff against target: 293 lines (+158/-15)
9 files modified
src/server/frontend/socket_messenger.cpp (+21/-5)
tests/acceptance-tests/CMakeLists.txt (+1/-0)
tests/acceptance-tests/test_unresponsive_client.cpp (+122/-0)
tests/include/mir_test/cross_process_action.h (+2/-1)
tests/include/mir_test_framework/display_server_test_fixture.h (+1/-1)
tests/include/mir_test_framework/testing_process_manager.h (+1/-1)
tests/mir_test/cross_process_action.cpp (+2/-2)
tests/mir_test_framework/display_server_test_fixture.cpp (+2/-2)
tests/mir_test_framework/testing_process_manager.cpp (+6/-3)
To merge this branch: bzr merge lp://qastaging/~afrantzis/mir/fix-1350207-unresponsive-clients
Reviewer Review Type Date Requested Status
Daniel van Vugt Approve
Alberto Aguirre (community) Approve
Kevin DuBois (community) Approve
Chris Halse Rogers Approve
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+233934@code.qastaging.launchpad.net

Commit message

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

Description of the change

server: Work around unresponsive clients causing the server to hang (LP: #1350207)

This MP makes the sockets used for communicating with clients non-blocking. When the server tries to write to a socket with a full send buffer it will fail instead of blocking, resulting in either dropping the event (when sending events), or disconnecting the client (when responding to a client rpc request).
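As an illustration of the write path this describes — a minimal sketch assuming a boost::asio unix-domain stream socket, with a hypothetical try_send() helper rather than the actual socket_messenger.cpp code:

#include <boost/asio.hpp>
#include <cstddef>

namespace ba = boost::asio;

// During connection setup the socket is put in non-blocking mode:
//   socket.non_blocking(true);

// Returns false when the client's send buffer is full, so the caller can
// drop the event or disconnect the client instead of blocking the server.
bool try_send(ba::local::stream_protocol::socket& socket,
              void const* data, std::size_t size)
{
    boost::system::error_code ec;
    ba::write(socket, ba::buffer(data, size), ec);

    if (ec == ba::error::would_block)
        return false; // a full implementation would also handle partial writes
    if (ec)
        throw boost::system::system_error{ec};
    return true;
}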

This MP also increases the socket send buffer to 64KiB (default is 16KiB) to give clients a little more breathing room for transient freezes. We can increase it even more if needed (max is 4MiB).
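For reference, the larger send buffer can be requested with asio's standard socket option (shown here only as a sketch of the mechanism, not the MP's exact call site):

// Equivalent to setsockopt(fd, SOL_SOCKET, SO_SNDBUF, ...). The kernel
// doubles the requested value for its own bookkeeping and caps it at
// /proc/sys/net/core/wmem_max.
socket.set_option(ba::socket_base::send_buffer_size{64 * 1024});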

Note that this is only a workaround, not a final fix. A proper fix would involve making the data writes asynchronous, essentially providing some more buffering on the server side (instead of in the kernel). Unfortunately, this is not easily achievable with asio in this particular scenario, for two reasons (a sketch of a partial workaround for the second follows the list):

1. Asio doesn't support sending ancillary data (e.g. file descriptors) asynchronously and in coordination with normal data.

2. There is no guaranteed ordering and atomicity of asio asynchronous writes, and messages may end up being interleaved without additional care.
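
To make the second limitation concrete, here is a sketch of the kind of workaround it would require: queueing whole messages and keeping a single async_write in flight. All names are hypothetical; this preserves ordering and atomicity, but does nothing for ancillary data (point 1), and the queue grows without bound for a blocked client:

#include <boost/asio.hpp>
#include <cstddef>
#include <deque>
#include <vector>

namespace ba = boost::asio;

class OrderedWriter
{
public:
    explicit OrderedWriter(ba::local::stream_protocol::socket& socket)
        : socket{socket}
    {
    }

    // Must be called from a single thread (or through an asio strand).
    void send(std::vector<char> message)
    {
        bool const idle = pending.empty();
        pending.push_back(std::move(message));
        if (idle)
            write_next();
    }

private:
    void write_next()
    {
        // async_write completes only when the whole front message has been
        // sent, and a second write never starts before the first finishes,
        // so messages cannot interleave.
        ba::async_write(
            socket, ba::buffer(pending.front()),
            [this](boost::system::error_code ec, std::size_t)
            {
                pending.pop_front();
                if (!ec && !pending.empty())
                    write_next();
            });
    }

    ba::local::stream_protocol::socket& socket;
    std::deque<std::vector<char>> pending;
};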

More discussion is needed in this area:
* what are the criteria for dropping a client (should the server decide, the shell, or both?)
* application-not-responding notification to the shell
* if we drop messages to clients, does it make sense for them to continue in a possibly inconsistent state? (it depends on the kind of dropped messages)

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Needs Fixing (continuous-integration)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

I haven't looked at the code in detail, but it sounds like, to actually fix the problem, we need to spawn a comms thread and not block the main thread.

Revision history for this message
Chris Halse Rogers (raof) wrote :

No, I don't think a comms thread helps particularly; that just changes the problem to “blocked clients cause unbounded memory consumption”.

I think that properly handling this would be (a rough sketch follows the list):
a) Mark the client as unresponsive in a way that's visible to shells
b) Register a callback for when the socket becomes writable
c) Drop any additional writes to the socket
d) When the socket becomes writable, send a full state-update message so that we know the client and server agree on what the current state is.
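
A rough sketch of steps (b) and (d), assuming a POSIX fd and hypothetical mark_unresponsive()/send_full_state_update() helpers (a real server would register the fd with its event loop rather than block in poll()):

#include <poll.h>
#include <cerrno>

// Wait for the client's socket to drain enough to become writable again.
bool wait_until_writable(int socket_fd, int timeout_ms)
{
    pollfd pfd{socket_fd, POLLOUT, 0};
    int ret;
    do
    {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret == -1 && errno == EINTR);
    return ret > 0 && (pfd.revents & POLLOUT);
}

// Hypothetical usage:
//   mark_unresponsive(client);                // step (a)
//   if (wait_until_writable(fd, timeout_ms))  // step (b); writes dropped meanwhile, step (c)
//       send_full_state_update(client);       // step (d): client and server re-agree
//   else
//       disconnect(client);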

For now bumping the buffer size and hoping that the client didn't really need the messages we dropped seems like a non-terrible improvement to the status quo. Except...

Doesn't your new while (nread != ba::buffer_size(buffer)) loop replace a wait in the kernel with a busy-wait in the server, or does transfer_exactly() not work on a non-blocking socket?

You'll certainly busy-wait until there's something to read, which seems bad.

Does anyone else find ‘ba::mutable_buffers_1 const&’ hilarious? :)

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Doesn't your new while (nread != ba::buffer_size) loop replace a wait in the kernel
> with a busy-wait in the server, or does transfer_exactly() not work on a
> non-blocking socket?
>
> You'll certainly busy-wait until there's something to read, which seems bad.

This is indeed a busy-wait in the server, but one that doesn't currently affect us because
the receive_msg() method is only used in this context:

if (message_receiver->available_bytes() >= body_size)
{
    on_new_message(message_receiver->receive_msg(ba::buffer(body)));
}

essentially guaranteeing that we never busy-loop. I introduced the loop to ensure we retain the blocking semantics this method had previously. It's too bad that we can't set (non-)blocking mode for reads and writes separately.
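
For reference, the retry loop under discussion looks roughly like this (a simplified sketch, not the exact socket_messenger.cpp code); each iteration reads into the unfilled tail of the buffer:

#include <boost/asio.hpp>
#include <cstddef>

namespace ba = boost::asio;

void receive_exactly(ba::local::stream_protocol::socket& socket,
                     ba::mutable_buffers_1 const& buffer)
{
    std::size_t nread = 0;
    while (nread != ba::buffer_size(buffer))
    {
        boost::system::error_code ec;
        // Read into the unfilled tail; on a non-blocking socket,
        // would_block simply means "retry".
        nread += socket.read_some(ba::mutable_buffers_1{buffer + nread}, ec);
        if (ec && ec != ba::error::would_block)
            throw boost::system::system_error{ec};
        // Spinning on would_block is safe only because callers first
        // check available_bytes() >= body_size.
    }
}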

Revision history for this message
Chris Halse Rogers (raof) wrote :

Yeah, that seems reasonable. I'm a bit surprised that boost::transfer_exactly() doesn't do the right thing, but eh.

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Tested. Bug fixed!

review: Approve
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Whoops. Now that I look more closely:

(1) This looks too much like an infinite loop waiting to happen. I think "!=" needs to change to "<" for safety:
  while (nread != ba::buffer_size(buffer))
Also you could incorporate the if statement by changing to a "do .. while".

(2) This worries me too:
  "* if we drop messages to clients does it make sense for the clients to continue in a possibly inconsistent state (it depends on the kind of dropped messages)"
If a busy client gets blocked doing other things just a bit too long, there's the real risk it won't be able to recover any more (rendering will freeze and never resume).

review: Needs Fixing
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) This looks too much like an infinite loop waiting to happen.
> I think "!=" needs to change to "<" for safety: while (nread != ba::buffer_size(buffer))

nread can't ever become larger than ba::buffer_size(buffer), since we are decreasing the size of the buffer we pass to ba::read() (ba::mutable_buffers_1{buffer + nread}).

> (2) This worries me too:
> "* if we drop messages to clients does it make sense for the clients to continue in a possibly
> inconsistent state (it depends on the kind of dropped messages)"
> If a busy client gets blocked doing other things just a bit too long, there's the real risk
> it won't be able to recover any more (rendering will freeze and never resume).

As things stand, if a client is blocked and the socket becomes full, then:

1. if the server tries to send an event and can't, the event will be dropped
2. if the server tries to reply to an RPC call (e.g. next_buffer) and can't, it will disconnect the client

So, at least we won't get a frozen client. Of course this is far from ideal, but it's a trade-off we need to decide whether we want to make: we either run the risk of dropping events or disconnecting clients if they stay busy for too long, or we run the risk of a blocked client hanging the whole server. Note that we can further increase the socket send buffer to give transiently frozen clients more breathing room if we find it necessary.

The idea behind this MP is to provide a reasonable workaround for the issue without changing too much code and risking destabilizing the frontend for rtm. I am open to other ideas.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, weakly "Approved". I recommend fixing still...

(1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.

(2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

Revision history for this message
Kevin DuBois (kdub) wrote :

Well, this looks okay from a code perspective. It seems that we're trading the 'blocked clients can hang the server' bug for a 'legitimately stopped clients will get forcibly disconnected' bug. I suppose it's the lesser of two evils, though.

review: Approve
Revision history for this message
Kevin DuBois (kdub) wrote :

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a
> new bug for the dropped response issue immediately?
Sounds appropriate to do

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

30 + while (nread != ba::buffer_size(buffer))

I would prefer a < as well...but not blocking.

+1 on submitting a bug for (2) and also for the TODO.

review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> (1) Your logic may well be perfect. I still recommend using a less-than for safety of maintenance and readability.
> I would prefer a < as well...but not blocking.

Fixed.

> (2) Yeah I agree this is the lesser evil right now. I guess land it but open a new bug for the dropped response issue immediately?

I will open a new bug as soon as this gets merged.

Revision history for this message
Daniel van Vugt (vanvugt) :
review: Approve
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> I will open a new bug as soon as this gets merged.

https://bugs.launchpad.net/mir/+bug/1370117

