Mir

Merge lp://qastaging/~vanvugt/mir/ddouble into lp://qastaging/mir

Proposed by Daniel van Vugt
Status: Merged
Approved by: Daniel van Vugt
Approved revision: no longer in the source branch.
Merged at revision: 2673
Proposed branch: lp://qastaging/~vanvugt/mir/ddouble
Merge into: lp://qastaging/mir
Diff against target: 486 lines (+277/-45)
3 files modified
src/server/compositor/buffer_queue.cpp (+96/-19)
src/server/compositor/buffer_queue.h (+14/-0)
tests/unit-tests/compositor/test_buffer_queue.cpp (+167/-26)
To merge this branch: bzr merge lp://qastaging/~vanvugt/mir/ddouble
Reviewer Review Type Date Requested Status
Cemil Azizoglu (community) Approve
Alexandros Frantzis (community) Approve
Alberto Aguirre (community) Abstain
Kevin DuBois (community) Abstain
PS Jenkins bot (community) continuous-integration Approve
Alan Griffiths Abstain
Review via email: mp+258351@code.qastaging.launchpad.net

Commit message

Introducing dynamic buffer queue scaling to minimize lag.
(LP: #1240909)

Actually the size of the queue doesn't really change with this branch.
What it does do is reliably detect when a client is staying ahead of
the compositor and throttle it so that no more than one future frame
is ever pre-rendered. Hence you get the same single-frame latency as
double buffering. But the moment the client fails to keep up, it will
get to use all three buffers again.

Description of the change

Background reading and imagery to help you understand the improvement:
https://docs.google.com/document/d/116i4TC0rls4wKFmbaRrHL_UT_Jg22XI8YqpXGLLTVCc/edit#
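
To make the commit message concrete, here's a minimal sketch of the detect-and-throttle idea (illustrative only; the names and structure are invented here and do not match the actual diff):

    // Count consecutive frames where the client had a new buffer ready
    // before the compositor needed one. After a sustained streak, throttle
    // the client to double buffering; the first missed frame restores the
    // full queue depth.
    class QueueScalingHeuristic
    {
    public:
        explicit QueueScalingHeuristic(int scaling_delay)
            : scaling_delay{scaling_delay} {}

        // Called once per compositor frame; returns how many buffers the
        // client may use until the next frame.
        int usable_buffers(bool client_had_frame_ready, int nbuffers)
        {
            if (client_had_frame_ready)
                ++frames_kept_up;
            else
                frames_kept_up = 0;  // client fell behind: full queue again

            return frames_kept_up > scaling_delay ? 2 : nbuffers;
        }

    private:
        int const scaling_delay;
        int frames_kept_up = 0;
    };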

Fun facts:

* For manual testing you can see the switching in action if you set MIR_CLIENT_PERF_REPORT=log and then resize mir_demo_client_flicker large enough that it takes too long to render (>16ms). It will then get triple buffers. If you shrink it, the client can render quickly again and reports double buffers.

* Strange and wonderful things happen with input resampling. Event-driven clients that use the default 59Hz resampling rate (e.g. mir_demo_client_eglsquare) will generally not be detected as "keeping up" and so will always be on triple buffers. But surprisingly this is a good thing: the low input rate means the compositor is always consuming buffers faster than we're producing, and you get around one frame of lag as the consumer eats through the queue up to where the producer is (so the queue length is almost irrelevant). Other clients that reliably produce at full frame rate or higher, however, will get switched to double buffers (one frame of lag).

* Additional manual testing has been done on a couple of phones (I backported this branch). Everything remains smooth. Unity8 is fast enough on some devices to qualify for double buffering. Unity8-dash is generally not fast enough on any device to qualify. You can see the stats on a live phone with:
    restart unity8 MIR_CLIENT_PERF_REPORT=log
    tail -f ~/.cache/upstart/unity8.log

* In raw numbers this branch at least reduces end-to-end lag on the phone from 5 frames to 4 (when unity8 qualifies for double buffering). And if your app (not unity8-dash) is fast enough to keep up too, it's then only 3 frames of lag. So this branch reduces Unity8/USC total graphics latency by 20-40% (see the arithmetic after this list), though results consistently closer to 40% will depend on Unity8/dash performance improving in future. Also keep in mind that on common mobile devices the touchscreen hardware itself may represent most of the latency, so even a 100% reduction in graphics latency could still feel slightly laggy to the user. Thus a 40% improvement in graphics latency is less than a 40% reduction in the total lag you perceive.

* A "slow" client that gets triple buffers is actually one that's achieving between 50% and 100% of the compositor frame rate. 100% or more is obviously fast and gets double buffers. Less obvious is that really slow clients doing less than 50% frame rate (e.g. a clock) actually keep the system idle enough that the third buffer doesn't get touched so you might see really slow clients only report 1 or 2 buffers in action even though we allow them 3.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

I don't see a way of disabling the prediction.

review: Needs Fixing
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

While descriptions of manual testing scenarios are interesting, we all know that manual testing doesn't scale and we'll miss regressions.

Can we think of a way to automate any of this? Even if we need to define a new test category "test_needing_graphics_hardware"? (I know this problem existed before this MP but here's as good a place as any to discuss.)

It would be really nice to have something more than unit tests to prove the /system/ behavior is as intended, not just the buffer queue.

review: Needs Information
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

The new tests cover the new functionality completely. And what you don't see in the diff is the existing test `slow_client_framerate_matches_compositor' covering any potential performance regressions really well (it took quite some effort to make that test pass).

Of course more tests always help. But AFAIK we're at 100% functionality coverage here already, so it shouldn't be a blocking issue. All pre-existing tests still pass too.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

I agree that the new unit tests cover the new behavior of BufferQueue.

But I disagree that the proposed tests cover the new functionality of the system: there are no new tests of /the system as a whole/. The potential value of such tests is clear from the "Description of the Change" detailing how to /manually/ test the new functionality.

We should always be looking for ways to avoid relying on manual testing and this seems like an opportunity.

Revision history for this message
Kevin DuBois (kdub) wrote :

Seems like the idea can get a better latency/throughput balance. Some questions:

I can see some sort of policy (mc::BufferingPolicy?) being added to the constructor to try to keep the complexity of BufferQueue down and make the policies interchangeable (providing a null/disabled policy could also turn off the dynamism).
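
Hypothetically, something like this sketch (names invented for illustration; nothing like it exists in the branch):

    class BufferingPolicy
    {
    public:
        virtual ~BufferingPolicy() = default;
        // Decide how many buffers the client may use for the next frame,
        // given whether it kept up with the compositor on this one.
        virtual int target_depth(int nbuffers, bool client_kept_up) = 0;
    };

    // A null policy that turns the dynamism off entirely:
    class NullBufferingPolicy : public BufferingPolicy
    {
    public:
        int target_depth(int nbuffers, bool) override { return nbuffers; }
    };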

190+ void set_scaling_delay(int nframes);
191+ int scaling_delay() const;
these are just here for the benefit of the tests... is there a plan in the works so that something else in the system can change this?

249+ std::this_thread::sleep_for(throttled_compositor_rate);
Is this timing sensitive?

nits:
47+ else if (int(buffers.size()) < nbuffers)
c-cast

review: Needs Information
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

set_scaling_delay was only just added to satisfy Alberto. I preferred not having it at all because, in the absence of bugs (and none are known), it makes no sense to ever turn the scaling off. That just allows people to shoot themselves in the proverbial foot, performance-wise. Throughput is never compromised, so it should always be turned on. It does, however, help with a couple of tests that are invalid under queue scaling and would otherwise need to be deleted/disabled.

The sleep_for was not my first preference. I toyed a lot with timerless tests, but found they were nowhere near as effective at finding subtle bugs (when I intentionally introduced them) as the tests using real time. So I switched back to using real time because they're stronger tests, more sensitive to subtle mistakes -- just as the existing `slow_client_framerate_matches_compositor' is so effective it continues to surprise me.

Yeah that cast is there for readability. If I said:
  int x = buffers.size();
  if (x < nbuffers)
then no one would complain. It's all the same and not something we should bother blocking on. The current form is more readable.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Alan:
I only mentioned manual testing in the description because people should have some healthy suspicion of my claims and test for themselves. The suggestion to use manual testing is for healthy quality assurance and does not indicate any gap in the automated test coverage.

Revision history for this message
Kevin DuBois (kdub) wrote :

> Yeah that cast is there for readability. If I said:
> int x = buffers.size();
> if (x < nbuffers)
> then no one would complain. It's all the same and not something we should
> bother blocking on. The current form is more readable.

Just trying to stick to the style guide :)
http://unity.ubuntu.com/mir/cppguide/index.html?showone=Casting#Casting
We've been chatting about some of the style guide stuff in the EU/US standup (and probably will at the sprint next week too); maybe this is just another point of the style guide to bring up.

Revision history for this message
Kevin DuBois (kdub) wrote :

> The sleep_for was not my first preference. I toyed a lot with timerless tests
> but what I found was that they were nowhere near as effective at finding
> subtle bugs (when I intentionally introduced them) as the tests using real
> time. So I switched back to using real time because they're stronger tests
> more sensitive to subtle mistakes. Just as the existing
> `slow_client_framerate_matches_compositor' is so effective it continues to
> surprise me.

Obviously the worst possible thing is to have a bug or regression, so a strong test is good. Another bad thing, though, is to have tests break under loaded CI, or take a long time to run.

The tests seem robust enough, but I'm just leery of timing tests from past bugs.
The unit test time increases significantly with these tests. At least on my system, the old BufferQueueTests take 3-4s to run, but the new tests increase that to about 12-13s (log of run: http://pastebin.ubuntu.com/11100093/). Is there a way to decrease the test time while still guarding against bugs? It's difficult to know where to draw the line on how long a test can take to run, but this seems like an increase in test run time that should be avoided if at all possible.

Revision history for this message
Kevin DuBois (kdub) wrote :

> set_scaling_delay was only just added to satisfy Alberto. I preferred not
> having that at all because in the absence of bugs (and none are known) it
> makes no sense to ever turn it off. That just allows people to shoot
> themselves in the proverbial foot, performance wise. Throughput is never
> compromised so it should always be turned on. Although it does help with a
> couple of tests that are invalid with queue scaling so would need to be
> deleted/disabled in the absence of set_scaling_delay.

Side-stepping a bit the argument of whether it should be disabled or configurable [1]: the current code does have disable/configure abilities. So it makes more sense to have this be a policy interface (or even an int parameter) passed in via the BufferQueue constructor than to have dangling functions that we don't intend to use in production code.

[1] I buy that it's more of a nice-to-have (and easy-to-have) than something we're planning on making use of often.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Indeed, I think I can shorten the new tests further and maintain robustness... They're not "time sensitive" like other tests, though. The only timing I introduce is the sleep in throttled_compositor_thread() to ensure that the client loop (which is not throttled at all) runs faster than the compositor. That holds true even under load on a slow system, so no false positives are expected (though the probability is never quite zero).
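
Roughly, the throttled loop has this shape (a simplified sketch, not the verbatim test code; the acquire/release calls are assumed to match BufferQueue's compositor API):

    #include <atomic>
    #include <chrono>
    #include <thread>

    void throttled_compositor_thread(mc::BufferQueue& q,
                                     std::atomic<bool> const& done)
    {
        // Sleeping each iteration guarantees the unthrottled client loop
        // produces frames faster than this consumer eats them.
        std::chrono::milliseconds const throttled_compositor_rate{16};
        while (!done)
        {
            auto buffer = q.compositor_acquire(nullptr /* user id */);
            std::this_thread::sleep_for(throttled_compositor_rate);
            q.compositor_release(buffer);
        }
    }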

As for a "policy" interface. It's possible but would be messy, requiring more code (more classes) and end up less readable than "q.scaling_delay()" is right now. So I'd rather not.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

I'll leave it to those with stronger opinions

review: Abstain
Revision history for this message
Alberto Aguirre (albaguirre) wrote :

"set_scaling_delay was only just added to satisfy Alberto. "

It shouldn't just be me. ANY predictive algorithm will get some predictions wrong. And I would prefer something less cryptic than setting -1 on set_scaling_delay; just an enable_dynamic_depth or something like that should suffice.

47 + else if (int(buffers.size()) < nbuffers)

C-cast... no C-casts per the Mir style guide.

> As for a "policy" interface. It's possible but would be messy, requiring more
> code (more classes) and end up less readable than "q.scaling_delay()" is right
> now. So I'd rather not.

As opposed to the complex web of logic introduced in BufferQueue? The algorithm is not clear (clarity being one benefit of an interface with clear boundaries).

Also it's not obvious what the trade-offs are in this scheme.

Needs fixing, needs info.

review: Needs Fixing
Revision history for this message
Kevin DuBois (kdub) wrote :

@test shortening, thanks!

> As for a "policy" interface. It's possible but would be messy, requiring more
> code (more classes) and end up less readable than "q.scaling_delay()" is right
> now. So I'd rather not.

Sure, it requires some interfaces, but the expressiveness of the code improves by not having to figure out what a 'scaling delay' is. As the code stands now, we have to bake the policies and the swapping all into one BufferQueue object, which has always been pretty complex with just the swapping requirements. Splitting it out would also clarify the BufferQueue logic a bit, because we'd have a dividing line between 'this is the size/scaling/queueing policy' and 'this is the swapping queue'.

As Alberto points out, this is something that can be optimized in different ways, and I can see that using differing policies could be helpful. Defining the interfaces and relationships between the BufferQueue and the policy interface would stop the BufferQueue from mixing up the policy portion with the swapping portion, and hopefully lead to a more flexible system.

So, I guess 'needs-fixing' on the C-cast, and 'appreciates-further-discussion' on the interface (it's seeming like a good-vs-better situation).

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I'm glad people are suspicious. You should be, since we've been trying to achieve this for over a year.

Although please be wary of "rearranging deck chairs on the Titanic". Requests to change designs that contribute nothing functional and result in more complex code will be accommodated to some extent to avoid argument, but they quickly become a waste of time. Try not to be Vogons...

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

C cast:
Now removed. Not my preferred syntax but the hit to readability is minor.

What's the trade-off?
Although none has been observed with this current implementation (I never put it up for review until I believed all the corner cases had been solved), in theory there is a trade-off. It affects clients that are on the bleeding edge of keeping up most-of-but-not-quite-all the time. A client keeping up for 80% of frames and only missing every 5th, for example, would get scaled back to two buffers for 60% of the time, and would get three buffers for the other 40%. Depending on the specific client's behaviour, this could result in it keeping up only 60% of the time with this algorithm, instead of 80% of the time with always-triple-buffers.
  But this whole branch is an exercise in working around sub-optimal clients (most notably unity8-dash). I have worked on and off for months to ensure that the edge cases we actually see in Ubuntu, where our clients are too slow, are handled correctly by this algorithm. But in the long run we should of course aim to make all our graphics code fast (so double buffering can be used). It's not hard; it just requires some commitment.
  The algorithm presented here causes no visible stuttering on our phones, and provides a slightly visible reduction in lag.

Redesign with a policy object:
I'll make a proposal or two separately into this branch so you can see the changes more clearly...

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :
review: Approve (continuous-integration)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Correction:
A client keeping up for 80% of frames and only missing every 5th, for example, would get scaled back to two buffers for 20% of the time (on the fourth frame), and would get three buffers for the other 80% of the time. Depending on the specific client's behaviour, this could result in it keeping up only 60% of the time with this algorithm, instead of 80% of the time with always-triple-buffers.
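
To spell that out, assuming the switch to double buffering happens after three consecutive kept-up frames (an illustrative scaling delay of 3):

    frame in cycle:  1    2    3    4    5   (then repeats)
    kept up?         yes  yes  yes  yes  no
    streak length:   1    2    3    4    0
    buffers used:    3    3    3    2    3   -> double-buffered on 1 frame in 5 = 20%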

Revision history for this message
Kevin DuBois (kdub) wrote :

@the policy discussion,
It's more of a request to define the relationships between the objects and allow for object-oriented substitution of the algorithm. I suppose this can be done later, but (IMO) it would help us review better if we teased out what policy decisions are being made and what is being optimized for in the trade-offs. This seems like something to sort out next week; abstaining for now.

review: Abstain
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I will do it, but it might not be ready till next week.

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

@Daniel, thanks for the info.

Abstain for now until we have the buffer semantics discussion.

review: Abstain
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

We're talking about redesigning the buffer allocation architecture today, so I won't bother redesigning this proposal for now. It sounds like the effort might go to waste within a few months.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Ignore my most recent comments above.

Last week people were saying they'd be happy to land this as-is. Still true?

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

This branch is important and we need to not drop the ball. Unity8 needs every reduction in latency we can get, released as soon as possible.

If anyone truly wants it redesigned, please tell me ASAP.

Revision history for this message
Kevin DuBois (kdub) wrote :

I guess still (if I can introduce a new branch of arithmetic to expand on the nuances):

+1 that using some sort of heuristic to dynamically trade off between "double buffering" and "triple buffering" is beneficial.

-1 that the heuristic is not easily extractable from BufferQueue.

+? that it improves some situations, as described in the description, but it's difficult to see that we're making a good universal trade-off.

+? on how to prove that this new feature has transferred to 'the new buffer semantics'. If it's something in trunk, we don't want to remove the feature with 'the new buffer semantics', but it will look markedly different once we have server-initiated buffer transfers. It's difficult to see how to transfer the feature because we don't have an established way to test a variety of loads and pop a hard number out at the end to compare.

so I guess all that adds up to "0??", which (I suppose) translates to abstain.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

This branch is visibly beneficial. When used on the phone you can see shell elements at least are a bit more responsive.

And the heuristic is mostly not a heuristic. It's based on careful logic and reasoning, which is why it took me over 6 months to bring this version to reality, now 12 months after we first started trying to get back to double buffering.

I don't care to think about new buffer semantics right now. Future work may replace this branch, but it works today and solves 40% of the lag. So let's take advantage of it.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

As with most additions to BufferQueue, it's not easy to convince oneself that this is bug-free. However, it seems to work well in practice, and if we are going to have it, let's introduce it early in the cycle.

My main concern is that this branch would be largely irrelevant after the planned move to client-owned buffers (although we could provide a helper library that implements similar logic on the client side). However, we are not there yet, and since changing our buffer semantics is a long-term effort, I am OK "living for today" in this case.

There are a few things to improve (in the future):

As mentioned in previous comments, I would very much like to see tests that simulate our manual tests, i.e. tests that check the effects buffer queue changes have end-to-end.

This branch optimizes latency but leaves memory consumption intact (i.e. we maintain 3 buffers internally even if we don't need them). It would be nice to see memory consumption addressed too.

review: Approve
Revision history for this message
Cemil Azizoglu (cemil-azizoglu) wrote :

There is some healthy skepticism but no strong objections. I'm willing to try this out.

review: Approve

