Merge lp://qastaging/~tpatil/nova/bug708822 into lp://qastaging/~hudson-openstack/nova/trunk

Proposed by Tushar Patil
Status: Work in progress
Proposed branch: lp://qastaging/~tpatil/nova/bug708822
Merge into: lp://qastaging/~hudson-openstack/nova/trunk
Diff against target: 20 lines (+8/-2)
1 file modified
nova/api/ec2/cloud.py (+8/-2)
To merge this branch: bzr merge lp://qastaging/~tpatil/nova/bug708822
Reviewer Review Type Date Requested Status
Soren Hansen (community) Needs Information
Rick Harris Pending
Vish Ishaya Pending
Review via email: mp+47729@code.qastaging.launchpad.net

Commit message

Fix for LP Bug #708822

To post a comment you must log in.
Revision history for this message
Vish Ishaya (vishvananda) wrote :

Can you explain why this makes a difference? I'm not sure why this fix would work?

Vish

On Jan 27, 2011, at 1:09 PM, Tushar Patil wrote:

> Tushar Patil has proposed merging lp:~tpatil/nova/bug708822 into lp:nova.
>
> Requested reviews:
> Nova Core (nova-core)
> Related bugs:
> #708822 DescribeAvailabilityZones: services alive status is not consistent
> https://bugs.launchpad.net/bugs/708822
>
> For more details, see:
> https://code.launchpad.net/~tpatil/nova/bug708822/+merge/47729
> --
> https://code.launchpad.net/~tpatil/nova/bug708822/+merge/47729
> You are subscribed to branch lp:nova.
> === modified file 'nova/api/ec2/cloud.py'
> --- nova/api/ec2/cloud.py 2011-01-27 08:30:24 +0000
> +++ nova/api/ec2/cloud.py 2011-01-27 21:08:54 +0000
> @@ -236,7 +236,9 @@
> if service['host'] == host]
> for svc in hsvcs:
> delta = now - (svc['updated_at'] or svc['created_at'])
> - alive = (delta.seconds <= FLAGS.service_down_time)
> + alive = delta < datetime.timedelta(
> + seconds=FLAGS.service_down_time)
> +
> art = (alive and ":-)") or "XXX"
> active = 'enabled'
> if svc['disabled']:
>

Revision history for this message
Tushar Patil (tpatil) wrote :

Please check the following example

#test case: api server time is lacking by 5 seconds
    print "api server time is lacking by 5 seconds"
    now = datetime.datetime.utcnow() - datetime.timedelta(seconds=5);

    delta = now - datetime.datetime.utcnow();
    print delta

    #Modified code
    alive = delta < datetime.timedelta(seconds=60)
    #-5 < 60
    print alive

    #Current code"
    print delta.seconds
    alive = delta.seconds < 60
    #86395 < 60
    print alive

    print "api server time is ahead by 5 seconds"
    #test case: api server time is ahead by 5 seconds
    #API server time
    now = datetime.datetime.utcnow() + datetime.timedelta(seconds=5);
    delta = now - datetime.datetime.utcnow();
    print delta

    #Modified code
    alive = delta < datetime.timedelta(seconds=60)
    #5 < 60
    print alive

    print delta.seconds
    #5 < 60
    alive = delta.seconds < 60
    print alive

Output of the program
api server time is lacking by 5 seconds
-1 day, 23:59:55
True
86395
False
api server time is ahead by 5 seconds
0:00:05
True
5
True

Revision history for this message
Tushar Patil (tpatil) wrote :

If the time are sync on all the nodes including api server, then I don't think we need to find this issue.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

Yes, the real problem is that the nodes are out of sync. The negative value is especially scary because it will always succeed if the api node is significantly behind, even if the compute hasn't reported in in a long time. Perhaps a better fix would be to do a LOG.warn if the api is behind compute host.

Vish

On Jan 27, 2011, at 2:35 PM, Tushar Patil wrote:

> If the time are sync on all the nodes including api server, then I don't think we need to find this issue.
> --
> https://code.launchpad.net/~tpatil/nova/bug708822/+merge/47729
> You are subscribed to branch lp:nova.

Revision history for this message
Devin Carlen (devcamcar) wrote :

I agree that this is an error case but I'm not sure that a warning in the log is sufficient to deal with this. We should go ahead and add it but I think that if the API node is significantly behind that it should fail outright and that the error should propagate somehow.

Revision history for this message
Thierry Carrez (ttx) wrote :

Hmm, I tend to agree with Vish here. Cloud nodes should be run with synchronized clocks, and if we detect a delta we should WARN about it... We could even make ntpd a recommended dep of nova-common.

Revision history for this message
Devin Carlen (devcamcar) wrote :

You have a point. I suppose it's a question of use case. For a smaller or private cloud, this would not be a critical issue. I think this could be a critical issue for an enterprise scale installation. Perhaps we can add a flag to determine the behavior in case the time drifts too much?

Revision history for this message
Soren Hansen (soren) wrote :

Bexar GammaFreeze has kicked in, deferred to Cactus. Setting to "Work in Progress" to clear the review queue. Feel free to set back to "Needs Review" once Cactus opens up (of if this issue gets promoted to a release critical one).

review: Needs Resubmitting
Revision history for this message
Jesse Andrews (anotherjesse) wrote :

should we re-example this issue now?

Revision history for this message
Thierry Carrez (ttx) wrote :

Definitely, but Tushar needs to remerge with trunk now.

Revision history for this message
Tushar Patil (tpatil) wrote :

I will add warning message in DescribeAvailabilityZones API if the time difference between api server node and other nodes (last reported status) is not in the range of time defined in the flag "service_down_time". so this warning message will be logged in the nova-api.log even if the api node is ahead or it is lagging behind from other nodes by the value defined in the service_down_time flag.

If you all agree, I will make these changes and submit my branch for review.

Revision history for this message
Soren Hansen (soren) wrote :

as for the code itself, I find the logic confusing to read.

Why not just:

if now < (svc['updated_at'] or svc['created_at']):
    LOG.warn(_("Time skew detected"))

if abs(delta) > datetime.timedelta(seconds=FLAGS.service_down_time):
    art = "XXX"
else:
    art = ":-)"

That said, I'm not generally in favour of working around issues this way. Time skew is destined to cause problems eventually, and ntp is not an unreasonable requirement. As such, I'm not convinced "time in the future" is something we want to complicate the code base to support. Can you provide a use case where this would be a requirement?

review: Needs Information
Revision history for this message
Tushar Patil (tpatil) wrote :

your logic is better and not confusing.

I have added log warning message as per comments posted by Devin and Vishy.

My initial fix to this problem was just to change the line from

alive = (delta.seconds <= FLAGS.service_down_time)
to
alive = delta < datetime.timedelta(seconds=FLAGS.service_down_time)

The above change caters to fix following test case

#test case: api server time is lagging behind by 5 seconds from any of other host(for e.g. compute)
    now = datetime.datetime.utcnow() - datetime.timedelta(seconds=5);

    delta = now - datetime.datetime.utcnow();
    print delta

    #fixed code
    alive = delta < datetime.timedelta(seconds=60)
    #-5 < 60
    print alive

    #Current code in the trunk"
    print delta.seconds
    alive = delta.seconds < 60
    #86395 < 60
    print alive

So finally do we need log.warn?

Revision history for this message
Brian Lamar (blamar) wrote :

Just as a suggestion, I find long if statements difficult to read. Would it be possible to give these two values relevant names, something like:

min_delta = datetime.timedelta(seconds=-FLAGS.service_down_time)
max_delta = datetime.timedelta(seconds=FLAGS.service_down_time)

...which would turn the if into something like:

if min_delta < delta < max_delta:

Revision history for this message
Rick Harris (rconradharris) wrote :

> Time skew is destined to cause problems eventually, and ntp is not an unreasonable requirement. As such, I'm not convinced "time in the future" is something we want to complicate the code base to support.

++ on this, any reason why ntp can't be made a hard requirement?

That said, if we want to include this patch, I'd prefer Soren's structuring which seems a bit easier to follow.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

I do think ntp is important. That said, perhaps we should allow for a small amount of time skew. Allowing a full 60 seconds in the future seems like a bad idea, IMO

Revision history for this message
Devin Carlen (devcamcar) wrote :

I suggest we take this discussion to the list. I agree with Vish that NTP is important, but there seem to be other issues to discuss here.

Revision history for this message
Tushar Patil (tpatil) wrote :

I also agree with Vish that ntp solution is ideal and there are absolutely so issues if ntp is installed on all of the hosts.

Perhaps I don't like the idea of nova-api taking decision to declare host as live or dead based on time. nova-api should simply read the data from the db and send it as a response. There should be some health checker service which should monitor health of the services just like heartbeat and update the db.

Unmerged revisions

638. By Tushar Patil

Removed pep8 errors

637. By Tushar Patil

logic restructured as per review comments

636. By Tushar Patil

Merged with Trunk

635. By Tushar Patil

Fix for LP Bug #708822

634. By Tushar Patil

Merged with trunk

633. By Tushar Patil

Merged with Trunk

632. By Tushar Patil

Fix for LP Bug #708822

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
The diff is not available at this time. You can reload the page or download it.