OpenStack Compute (nova)

Merge lp://qastaging/~justin-fathomdb/nova/constraint-scheduler into lp://qastaging/~hudson-openstack/nova/trunk

constraint-scheduler
Merge into trunk

Proposed by justinsb on 2011-03-02

Status:	Work in progress
Proposed branch:	lp://qastaging/~justin-fathomdb/nova/constraint-scheduler
Merge into:	lp://qastaging/~hudson-openstack/nova/trunk
Diff against target:	1136 lines (+898/-85) 8 files modified nova/scheduler/constraint.py (+226/-0) nova/scheduler/constraint_lib.py (+254/-0) nova/scheduler/datastore.py (+148/-0) nova/scheduler/driver.py (+13/-3) nova/scheduler/simple.py (+24/-35) nova/scheduler/zone.py (+1/-2) nova/tests/test_constraintlib.py (+170/-0) nova/tests/test_scheduler.py (+62/-45)
To merge this branch:	bzr merge lp://qastaging/~justin-fathomdb/nova/constraint-scheduler
Related bugs:	Link a bug report
Related blueprints:	Resource partitioning (Undefined)

Reviewer	Date Requested	Status
Nachi Ueno (community)		Needs Fixing on 2011-03-10
Nova Core security contacts	2011-03-02	Pending
Review via email: mp+51857@code.qastaging.launchpad.net

Description of the change

Implementation of constraint-based scheduler.
Created a simple contraint solver and a scheduler that uses it; created constraints that mirrored the existing selection criteria; refactored DB access code to avoid duplication and support future constraints (Working towards 'proximity' allocation or 'specific zone' allocation)

Revision history for this message

Soren Hansen (soren) wrote on 2011-03-07:

I question the usefulness of best-match algorithms. I don't believe that the best match is really interesting at all. Finding best matches typically involves (and indeed that seems to be what happens in this implementation) looking at the entire set of candidates and ordering them according to some criteria. I don't believe this approach scales, and I don't believe it's necessary. It's not unlikely that the top XX% are all just fine candidates, so finding the very best offers no real advantage. Furthermore, the host that is the best match right now might be significantly worse two minutes from now.

Revision history for this message

Sandy Walsh (sandy-walsh) wrote on 2011-03-07:

Justin, nice work on a very interesting branch.

We're currently working on a similar approach in our Zones and Distributed Scheduler blueprints.

https://blueprints.launchpad.net/nova/+spec/multi-cluster-in-a-region
https://blueprints.launchpad.net/nova/+spec/bexar-distributed-scheduler

Ed Leafe (dabo) is working on a similar body of work to this, based on the current Rackspace/Slicehost implementation of Server Best Match.

Soren has a point that the search doesn't have to be exhaustive. We recently changed our approach from server-side to nearly fully db-side and saw massive performance improvements. Also, limiting the result set to XX% as Soren mentioned was possible once the problem was pushed off.

I think we need to review your branch in some depth before giving a pass/fail. I've sure there are things we can leverage to bring it in line with Distributed Scheduler.

Cheers! ... and stay tuned.

Revision history for this message

justinsb (justin-fathomdb) wrote on 2011-03-07:

Soren: It's true that an exhaustive search can be expensive, which is why I
didn't code an exhaustive search (other than in the unit tests) :-) There's
a framework for 'Criteria', which the 'Solver' tries to solve as best it
can. You can plug in whatever Solver technique you think best (which might
actually be exhaustive for small sets), but I believe the Solver I've got
here is likely to be reasonable in practice. The approach is that it
identifies the 'most selective criteria' and then steps through those
results in order to find one where all the other criteria are no-more
unhappy (mini-max). I'll document the magic better! I haven't coded
heuristic algorithms yet because - frankly - we're nowhere near the scales
where it becomes necessary and we have no way to take advantage of
heuristics when we're sourcing data from a relational DB, and because there
are difficult questions around behavior when the heuristics fail to find a
good solution or any solution at all.

Sandy: I think the work is non-overlapping with the distributed & multi
schedulers, but I'll check them out in more detail. My goal is to support
more constraints in the scheduler (in particular, co-placement of volumes
and servers), and I'm going to work on this constraint to help motivate this
patch.

I'm going to check out the other branches, and code up a co-placement
Criteria so this isn't just work in the abstract!

Revision history for this message

justinsb (justin-fathomdb) wrote on 2011-03-07:

I've got a branch up which makes use of the constraint scheduler:
lp:~justin-fathomdb/nova/schedule-compute-near-volume

It's now super-easy to add constraints, even conflicting constraints, and (hopefully) this will still yield reasonable decisions.

As the number of constraints grows, I don't believe that trying to satisfy the constraints manually will scale.

The dependent branch isn't yet ready for merge (I doubt it actually works); I need to write tests for it and that is at the end of a really long chain of merge requests!!

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2011-03-10:

As you pointed, I suppose your code needs more test code before merged.
In addtion unit tests for ConstraintLib fail.

======================================================================
ERROR: test_big (nova.tests.test_constraintlib.ConstraintLibTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/nati/workspace/constraint-scheduler/nova/tests/test_constraintlib.py", line 106, in test_big
    self._do_random_test(rnd, item_count, constraints_count)
  File "/home/nati/workspace/constraint-scheduler/nova/tests/test_constraintlib.py", line 122, in _do_random_test
    solver = constraint_lib.Solver(None, None)
TypeError: __init__() takes exactly 2 arguments (3 given)

======================================================================
ERROR: test_five_five__five (nova.tests.test_constraintlib.ConstraintLibTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/nati/workspace/constraint-scheduler/nova/tests/test_constraintlib.py", line 97, in test_five_five__five
    self._do_random_test(rnd, 5, 5)
  File "/home/nati/workspace/constraint-scheduler/nova/tests/test_constraintlib.py", line 122, in _do_random_test
    solver = constraint_lib.Solver(None, None)
TypeError: __init__() takes exactly 2 arguments (3 given)

review: Needs Fixing

Revision history for this message

justinsb (justin-fathomdb) wrote on 2011-03-10:

Fixed up the unit tests (I was working in a derived branch and tried to keep this one up to date - probably more trouble than it's worth with bazaar)

This branch has good test coverage; and the dependent one now does also.

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2011-03-11:

Thank you for your fix.
Would you add more test case for ConstraintDriverTestCase?

Revision history for this message

justinsb (justin-fathomdb) wrote on 2011-03-11:

It's easy to miss, but the ConstraintDriverTestCase derives from
_SchedulerBaseTestCase, so inherits basically the same code coverage as the
SimpleScheduler (although the SimpleScheduler supports some 'directed
placement' in a bit of a hacky way, which I don't support in the
ConstraintScheduler) But the ConstraintScheduler (which is never used
unless someone specifically requests it) behaves the same way as the
SimpleScheduler under the basic tests.

In addition, there are unit tests for the constraint solving library.

I believe that's reasonable test coverage. There can always be more tests,
but I think this is a reasonable start for something that is only
user-selectable. As we add use-cases, I'm sure we'll find issues and add
tests, both for bugs, and places where extra behaviour is needed. I did
find a bug when developing the derived directed-location branch, and I
back-ported that fix (which is how I broke the unit tests in the first
place!) That bug was when the min-scores were tied, it didn't fall-back to
consider the secondary criteria. The constraint solving library tests
didn't hit this because they were using randomized data, so were very
unlikely to get ties.

Is that OK?

Revision history for this message

Nachi Ueno (nati-ueno) wrote on 2011-03-14:

I'm sorry for late reply, because of earthquake in Japan.
As you said, it is hard to write sufficient test code.
However, is it possible to add Schedure test code which corresponds to constraint solving library tests? Test code can be used a document, so it shows how to use the Constraint Schedure.

> It's easy to miss, but the ConstraintDriverTestCase derives from
> _SchedulerBaseTestCase, so inherits basically the same code coverage as the
> SimpleScheduler (although the SimpleScheduler supports some 'directed
> placement' in a bit of a hacky way, which I don't support in the
> ConstraintScheduler) But the ConstraintScheduler (which is never used
> unless someone specifically requests it) behaves the same way as the
> SimpleScheduler under the basic tests.
>
> In addition, there are unit tests for the constraint solving library.
>
> I believe that's reasonable test coverage. There can always be more tests,
> but I think this is a reasonable start for something that is only
> user-selectable. As we add use-cases, I'm sure we'll find issues and add
> tests, both for bugs, and places where extra behaviour is needed. I did
> find a bug when developing the derived directed-location branch, and I
> back-ported that fix (which is how I broke the unit tests in the first
> place!) That bug was when the min-scores were tied, it didn't fall-back to
> consider the secondary criteria. The constraint solving library tests
> didn't hit this because they were using randomized data, so were very
> unlikely to get ties.
>
> Is that OK?

Revision history for this message

justinsb (justin-fathomdb) wrote on 2011-03-15:

Nachi: There are tests already for the constraint scheduler, but the only constraints implemented in this branch are ones that assign to the least-loaded machine. If you look in the derived branch, there are examples of more advanced constraints and better tests

In the derived branch, I add a constraint that favors putting a compute node as close as possible to a specified location, based on a topology:
https://code.launchpad.net/~justin-fathomdb/nova/schedule-compute-near-volume/+merge/52520

Revision history for this message

justinsb (justin-fathomdb) wrote on 2011-03-24:

Moving to WIP - we're going to discuss at Design Summit

Unmerged revisions

725. By justinsb on 2011-03-10: Back-ported fixes from derived branch
724. By justinsb on 2011-03-02: Added (derived) unit test for constraint scheduler
723. By justinsb on 2011-03-02: Fixed pep8, missing copyrights
722. By justinsb on 2011-03-02: Use datastore in simple scheduler for retrieval in non-forced case
721. By justinsb on 2011-03-02: Remove DB update code from schedulers; they deal with the datastore now
720. By justinsb on 2011-03-02: Refactoring data store so that we're not using DB objects
(Different constraints will likely need different queries, and the DB binding will be problematic)
719. By justinsb on 2011-03-02: Optimization for when we know we're not the worst constraint
718. By justinsb on 2011-03-02: Small cleanup of tests & pep8
717. By justinsb on 2011-03-02: Fix logical error, use a priority queue in the constraint solver
716. By justinsb on 2011-03-01: A few fixes (passes most simple scheduler tests)