Merge lp://qastaging/~johannes.erdfelt/nova/deferred-delete-instance into lp://qastaging/~hudson-openstack/nova/trunk

Proposed by Johannes Erdfelt
Status: Work in progress
Proposed branch: lp://qastaging/~johannes.erdfelt/nova/deferred-delete-instance
Merge into: lp://qastaging/~hudson-openstack/nova/trunk
Diff against target: 103 lines (+30/-24)
2 files modified
nova/compute/api.py (+17/-23)
nova/db/sqlalchemy/api.py (+13/-1)
To merge this branch: bzr merge lp://qastaging/~johannes.erdfelt/nova/deferred-delete-instance
Reviewer Review Type Date Requested Status
Chris Behrens (community) Needs Fixing
Paul Voccio (community) Approve
Matt Dietz (community) Approve
Brian Waldon (community) Needs Information
Brian Lamar (community) Needs Information
Review via email: mp+74530@code.qastaging.launchpad.net

Description of the change

Instance deletions in OpenStack are immediate. This can cause data to be lost accidentally.

This branch adds a new configuration flag, reclaim_instance_interval. The default of 0 preserves the behavior before this patch: immediate deletion of the instance. Any value greater than 0 results in the instance being powered off immediately and then reclaimed later.

New actions, restore and forceDelete, allow a previously deleted instance to be restored or reclaimed immediately.
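As a rough sketch of the intended state transitions (names and the dict-based instance are hypothetical simplifications; the real wiring goes through the OS API action extensions):

```python
# Hypothetical sketch of the state checks behind the new actions:
# restore brings a soft-deleted instance back, forceDelete reclaims it
# immediately instead of waiting out reclaim_instance_interval.
ACTIVE = 'active'
SOFT_DELETED = 'soft-deleted'
DELETED = 'deleted'

def soft_delete(instance):
    # Power off now; the actual reclaim happens after the interval.
    instance['state'] = SOFT_DELETED

def restore(instance):
    if instance['state'] != SOFT_DELETED:
        raise ValueError('only a soft-deleted instance can be restored')
    instance['state'] = ACTIVE

def force_delete(instance):
    if instance['state'] != SOFT_DELETED:
        raise ValueError('instance is not awaiting reclaim')
    instance['state'] = DELETED
```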

Revision history for this message
Brian Lamar (blamar) wrote :

55 + "Restore a previously deleted instance."
56 +

I think you need more quotes to make this a docstring, and line 56 should be removed. Same goes for:

62 + "Force delete of instance before deferred cleanup."
63 +

Maybe this is pedantic, but I think these are too far indented; 4 spaces should suffice. It doesn't trip pep8, though, so feel free to ignore me:

85 + actions = [
86 + extensions.ActionExtension("servers", "restore",
87 + self._restore),
88 + extensions.ActionExtension("servers", "forceDelete",
89 + self._force_delete),
90 + ]

I'd argue that instead of having:

vm_state = vm_states.DELETED
task_state = task_states.QUEUED_DELETE

...that we have something like this:

vm_state = vm_states.SOFT_DELETE

or

vm_state = vm_states.SCHEDULED_FOR_DELETION

I really go back and forth on this because I feel like task_states should be short-lived tasks and that vm_states are longer and potentially 'more permanent' if that makes sense. Having a vm_state of DELETED when it's not actually deleted seems confusing to me (although I understand for all intents and purposes that it's deleted).

My other question/comment is slightly larger, but wouldn't require a lot more work to implement. Would it be better to have a Compute API schedule_delete() method which can be called separately from delete()? The reason I ask is that I like methods to do one thing and do it well. If someone calls Compute API delete() they might expect the instance to be deleted. There is no indication in the function call that the instance will simply be shut down.

nova.compute.api:API.schedule_delete(self, context, instance_id, datetime/seconds_in_the_future)

might be more explicit.
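A hedged sketch of what that separate method might look like (everything here is hypothetical, including the dict-based instance and field names):

```python
import datetime

# Hypothetical sketch of the proposed schedule_delete(): delete() stays
# an immediate delete, while schedule_delete() powers the instance off
# and records when it becomes eligible for reclaim.
def schedule_delete(instance, seconds_in_the_future, now=None):
    now = now or datetime.datetime.utcnow()
    instance['power_state'] = 'shutdown'
    instance['reclaim_at'] = now + datetime.timedelta(
        seconds=seconds_in_the_future)
    return instance['reclaim_at']
```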

Overall I love the tests and code, just want to make sure the interface is correct.

review: Needs Fixing
Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

Docstrings are just string literals appearing as the first statement in a module, class or function.

>>> def foo():
... "This is a docstring"
... pass
...
>>> foo.__doc__
'This is a docstring'

So as long as it's a string literal, it'll be interpreted as a docstring. But the rest of the code in nova seems to use triple quotes consistently, so I've cleaned that up.

The recent state cleanups in nova were a *huge* benefit; they simplified this branch quite a bit when I merged trunk. But there's still some inconsistency internally and with the API. The blueprint didn't specify it, but returning DELETED via the API seemed like the most user-friendly option.

I was aiming for that in the code changes, but since there are already translations in place, we can make the internal state more consistent and still expose DELETED via the API.

I think your reasoning behind implementing a schedule_delete() method is sound. However, I'd like to avoid specifying the date/time. It would require another column in the database, and I can't think of a case where we wouldn't want to use just one interval. If the reclaim time is calculated, then a changed interval takes effect immediately on restart. Probably minor, but I think it's simpler.
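To illustrate the trade-off (purely hypothetical code, not the branch's actual implementation): the reclaim deadline is derived at check time from the stored deleted_at plus the configured interval, so nothing extra needs to be persisted and a changed flag value applies as soon as the service restarts.

```python
import datetime

# Hypothetical sketch: is a soft-deleted instance due for reclaim?
# The deadline is computed, not stored, so only deleted_at lives in
# the database and reclaim_instance_interval lives in configuration.
def reclaim_due(deleted_at, reclaim_instance_interval, now):
    deadline = deleted_at + datetime.timedelta(
        seconds=reclaim_instance_interval)
    return now >= deadline
```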

I'll make these changes and push them up soon.

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

I think I've got all of the feedback incorporated into the branch now.

Revision history for this message
Brian Lamar (blamar) wrote :

Awesomeness. Only a single question remains for me about the Compute API:

soft_delete
delete
restore
force_delete

These 4 methods seem to have a little overlap and I think delete/force_delete end up doing the exact same thing? Would you be up for just having:

delete
soft_delete
restore

Might be able to just check for _is_queued_delete or _is_able_to_shutdown in delete and it would remove the need for force_delete?

Great stuff.

review: Needs Information
Revision history for this message
Brian Waldon (bcwaldon) wrote :

Functionally tested and working. Code looks good. I only have two comments:

1) I would prefer not to see DELETED instances in a call to /servers, even if they were just soft deleted. Any reason to keep them in the output?

2) We should use a different flag than reclaim_instance_interval in the OSAPI. Maybe something like enable_soft_delete_instances? If there is a solution where I don't actually need a flag on the api node, that would be even better.

review: Needs Information
Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

blamar: the bodies of delete and force_delete are identical, but the preconditions are different; each requires a different state to perform the requested action. This ensures that deleting an instance twice doesn't hard delete it, and that the forceDelete action doesn't do a hard delete unless the instance is already in a soft-deleted state.

The logic can be pushed up into the OS API level, but that leaves the compute API level open to being called incorrectly.

An alternative would be to pass a flag to delete() to differentiate between the two. Or the common logic could be split out to a _delete() method leaving delete() and force_delete() to just check with the correct state.

After thinking about it, I'm leaning towards a flag to delete().
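As a sketch of that flag variant (hypothetical, simplified states; not the actual compute API code):

```python
# Hypothetical sketch of folding force_delete() into delete() with a
# flag: the shared teardown lives in _delete(), and only the
# precondition check differs between the two paths.
SOFT_DELETED = 'soft-deleted'

def _delete(instance):
    # Common hard-delete logic shared by both paths.
    instance['state'] = 'deleted'

def delete(instance, force=False):
    if force:
        # forceDelete only makes sense once the instance is soft-deleted.
        if instance['state'] != SOFT_DELETED:
            raise ValueError('instance is not soft-deleted')
    elif instance['state'] == SOFT_DELETED:
        # Deleting twice must not turn a soft delete into a hard delete.
        raise ValueError('instance is already queued for deletion')
    _delete(instance)
```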

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

bcwaldon: I left the soft deleted instances in the server list to allow the user to identify the instance id again to restore or force delete it. It just seemed like the more user-friendly thing to do. The API spec is contradictory about deleted instances: it says deleted servers are not included, but it also lists DELETED as a possible status.

I'm not sure I understand the problem with using the existing flag on the API server.

Revision history for this message
Brian Waldon (bcwaldon) wrote :

> bcwaldon: I left the soft deleted instances in the server list to allow the
> user to identify the instance id again to restore or force delete it. It just
> seemed like the more user friendly thing to do. The API is conflicting when it
> comes to deleted instances. It says deleted servers are not included, but then
> it says that DELETED is a possible status.

AFAIK, DELETED should only show up in output from a call with a changes-since param. I'm thinking we shouldn't show them to the user as they are effectively deleted. I'm not really arguing for which is better, I'm more just trying to follow the spec. Please correct me if it says otherwise.

> I'm not sure I understand the problem with using the existing flag on the API
> server.

In my setup, I have two nodes: api1 and compute1. I've got nova-api running on api1 and nova-compute on compute1. I configured nova.conf on compute1 and expected it to work, which it didn't. I had to look at the code to figure out that I had to define the flag on api1 as well. What's weird is that the value doesn't really matter to nova-api as long as it is not 0. How about we go to a two-flag setup where one enables the feature (enable_instance_soft_delete) and another defines how long to wait before reclamation (reclaim_instance_interval, possibly defaulting to 300)?
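Something like this, schematically (everything here is a hypothetical stand-in apart from the two flag names being proposed; nova's real flag handling is different):

```python
# Hypothetical sketch of the proposed two-flag setup:
# enable_instance_soft_delete switches the feature on (what nova-api
# cares about), and reclaim_instance_interval controls how long
# nova-compute waits before reclaiming.
DEFAULTS = {
    'enable_instance_soft_delete': False,
    'reclaim_instance_interval': 300,  # seconds
}

def soft_delete_enabled(conf):
    merged = dict(DEFAULTS)
    merged.update(conf)
    # Requiring a positive interval guards against the "enabled but
    # never reclaimed" misconfiguration discussed in this thread.
    return (merged['enable_instance_soft_delete']
            and merged['reclaim_instance_interval'] > 0)
```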

Revision history for this message
Brian Lamar (blamar) wrote :

> blamar: the rest of delete and force_delete are identical, but the
> preconditions are different. Both require different states to perform the
> action requested. This is to ensure that if you delete an instance twice, it
> doesn't get hard deleted. And that if the forceDelete action is called, it
> doesn't do a hard delete unless it's already in a soft delete state.
>
> The logic can be pushed up into the OS API level, but that leaves the compute
> API level open to being called incorrectly.
>
> An alternative would be to pass a flag to delete() to differentiate between
> the two. Or the common logic could be split out to a _delete() method leaving
> delete() and force_delete() to just check with the correct state.
>
> After thinking about it, I'm leaning towards a flag to delete().

I'm not a huge fan of binary flags so I'd lean towards the _delete() method, but either way the code cleanup would be welcome so it's your call.

I do agree with Waldon's suggestion of FLAGS.enable_instance_soft_delete & FLAGS.reclaim_instance_interval just because it is a little awkward to have reclaim_instance_interval = 0 disable soft deletes in my opinion.

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

I really think deploying a different configuration for API and compute is a bad idea. If you make a mistake, you can easily end up in a situation where instances are never deleted (soft delete enabled on API, but interval not set on compute).

Are there any other examples where a feature is enabled on API, but configured on compute?

Revision history for this message
Brian Waldon (bcwaldon) wrote :

> I really think deploying a different configuration for API and compute is a
> bad idea. If you make a mistake, you can easily end up in a situation where
> instances are never deleted (soft delete enabled on API, but interval not set
> on compute).

I understand the concern, but these are different services with very different jobs. I would *expect* people to use different configuration files. We can fix the case outlined above by defaulting the interval to 120 or something and requiring it to be above 0 if soft delete is enabled.

> Are there any other examples where a feature is enabled on API, but configured
> on compute?

Nothing jumps out at me. I don't think it is necessarily a bad thing... Let's get an outside opinion on this.

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

Why would you expect people to use different configuration files? I'm trying to understand why you think they *should* be different.

Nova works fine with a unified configuration file. I think people should do that to ensure that configuration options are used consistently.

Revision history for this message
Brian Waldon (bcwaldon) wrote :

> Nova works fine with a unified configuration file. I think people should do
> that to ensure that configuration options are used consistently.

Very true. I don't think I really have an argument against this. I really don't have any hard evidence here, I'm more just thinking out loud. What if you just need to update a compute-specific option? Would you risk restarting all the api services, too? What happens when an api node is compromised? Do you want to leak out your mysql connection string? Again, this is more of a brainstorm than argument, so I don't want to hold you up. Let's go with what you have unless somebody else has a real reason not to :)

Just so my comment doesn't get lost, I still think DELETED instances should not show up in GET /servers

Revision history for this message
Brian Waldon (bcwaldon) wrote :

> What happens when an api node is compromised? Do
> you want to leak out your mysql connection string?

Just realized this isn't a valid example...you get the point, though :)

Revision history for this message
Matt Dietz (cerberus) wrote :

Not exactly an issue at the moment, but the power_on calls might conflict weirdly with my https://code.launchpad.net/~cerberus/nova/lp853573/+merge/76272 branch (despite being nearly identical)

I like the branch, otherwise.

review: Approve
Revision history for this message
Paul Voccio (pvo) wrote :

lgtm.

Dietz, I'll take a peek at your branch today as well.

review: Approve
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

There are additional revisions which have not been approved in review. Please seek review and approval of these new revisions.

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

I should have been more aggressive setting this MP to WIP to prevent it from accidentally being merged.

I've made a couple of the discussed changes. delete() and force_delete() share common code now. Soft deleted instances are not included in /servers output anymore. They are treated like hard deleted instances.

Revision history for this message
Chris Behrens (cbehrens) wrote :

Looks good. I like the delete filtering implementation to solve the problem of deleted instances showing up. You could actually just do:

if filters.get('deleted', False):
    .. code for True
else:
    .. code for False

Removing 'deleted' from the list of direct match names causes it to be ignored below this code IIRC, so you don't really need to pop it out of the dict. Even if you wanted to pop it out, you could s/get/pop/ above.

But, I'm good with it how it is.

review: Approve
Revision history for this message
Chris Behrens (cbehrens) wrote :

Well, I just ran into an issue in trunk with the part of this merge prop that merged already.

openstack/api/servers.py references that reclaim_instance_interval flag that's defined in compute/manager.py.. but it doesn't import compute.manager

I'm guessing that at runtime it's probably imported elsewhere... but I have a test in a branch I'm working on that is failing because of this. It feels like this should be fixed properly in this branch. For now, in my branch, I've just added the import to servers.py even though we don't use manager at all there.

review: Needs Fixing
1557. By Johannes Erdfelt

Make sure context gets passed to new _delete() method

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

> Looks good. I like the delete filtering implementation to solve the deleted
> instances showing up problem. You could actually just do:
>
> if filters.get('deleted', False):
> .. code for True
> else:
> .. code for False

The check to see if it's been defined is necessary to match the current behavior.

> Removing 'deleted' from the list of direct match names causes it to be ignored
> below this code IIRC, so you don't really need to pop it out of the dict.
> Even if you wanted to pop it out, you could s/get/pop/ above.

The code further down was resulting in an extra check for deleted if I didn't remove it from the filters dict. Other code (like changes-since) didn't need to remove it because the name doesn't match a column. _regexp_filter_by_column will return True in that case.

The code is a little messy there as it is. I'm not sure why regexp_filter_funcs is still there for instance. It's checked, but no keys are ever set, so it's effectively a noop. Plus the aforementioned return True seems like a hack.
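Schematically, the special-casing discussed here looks something like this (a heavily simplified, hypothetical stand-in for the sqlalchemy code, not the actual query logic):

```python
# Hypothetical sketch: 'deleted' is popped out and handled specially,
# so the generic column-matching pass below never sees it; filter keys
# that correspond to no column simply fall through unmatched.
def filter_instances(rows, filters):
    filters = dict(filters)  # don't mutate the caller's dict
    want_deleted = filters.pop('deleted', False)
    rows = [r for r in rows
            if bool(r.get('deleted')) == bool(want_deleted)]
    for key, value in filters.items():
        # Unknown keys (e.g. 'changes-since') are ignored here.
        rows = [r for r in rows if key not in r or r[key] == value]
    return rows
```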

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

> Well, I just ran into an issue in trunk with the part of this merge prop that
> merged already.
>
> openstack/api/servers.py references that reclaim_instance_interval flag that's
> defined in compute/manager.py.. but it doesn't import compute.manager

Chris and I worked through this offline. The test suite runs fine if you run it in its entirety, because other code imports compute.manager. If just test_servers is run, then this problem will pop up.

Chris and I have a different idea I'm going to test out to fix this and resolve some other configuration option concerns.

Revision history for this message
Chris Behrens (cbehrens) wrote :

> > Looks good. I like the delete filtering implementation to solve the deleted
> > instances showing up problem. You could actually just do:
> >
> > if filters.get('deleted', False):
> > .. code for True
> > else:
> > .. code for False
>
> The check to see if it's been defined is necessary to match the current
> behavior.

Ah. I see why. Cool.

> The code is a little messy there as it is. I'm not sure why
> regexp_filter_funcs is still there for instance. It's checked, but no keys are
> ever set, so it's effectively a noop. Plus the aforementioned return True
> seems like a hack.

The idea there was to ignore unknown filtering options, but I know what you mean.

Revision history for this message
Chris Behrens (cbehrens) wrote :

> > The code is a little messy there as it is. I'm not sure why
> > regexp_filter_funcs is still there for instance. It's checked, but no keys
> are
> > ever set, so it's effectively a noop. Plus the aforementioned return True
> > seems like a hack.

I was thinking the same thing earlier, but after looking at it more closely, it loops through 'filters', not the empty dictionary. It'll still do regexp matching on 'metadata' and anything else left in 'filters' that matches a column name...

Revision history for this message
Vish Ishaya (vishvananda) wrote :

On Sep 21, 2011, at 9:56 PM, Chris Behrens wrote:

> openstack/api/servers.py references that reclaim_instance_interval flag that's defined in compute/manager.py.. but it doesn't import compute.manager
>

The correct fix for this is flags.DECLARE('reclaim_instance_interval', 'nova.compute.manager')

Unmerged revisions

1557. By Johannes Erdfelt

Make sure context gets passed to new _delete() method

1556. By Johannes Erdfelt

Merge with trunk

1555. By Johannes Erdfelt

Consolidate similar code

1554. By Johannes Erdfelt

Handle soft deleted instances identically to hard deleted instances

1553. By Johannes Erdfelt

Merge with trunk
