LAVA Dispatcher

Merge lp://qastaging/lava-dispatcher/multinode into lp://qastaging/lava-dispatcher

multinode
Merge into trunk

Proposed by Neil Williams on 2013-08-21

Status:	Merged
Approved by:	Neil Williams on 2013-08-28
Approved revision:	693
Merged at revision:	659
Proposed branch:	lp://qastaging/lava-dispatcher/multinode
Merge into:	lp://qastaging/lava-dispatcher
Diff against target:	4419 lines (+2791/-513) (has conflicts)
42 files modified
doc/conf.py (+2/-2)
doc/debugging.rst (+119/-0)
doc/index.rst (+1/-0)
doc/multinode-usecases.rst (+8/-0)
doc/multinode.rst (+291/-0)
doc/multinodeapi.rst (+302/-0)
doc/usecaseone.rst (+521/-0)
doc/usecasetwo.rst (+224/-0)
lava/dispatcher/commands.py (+8/-1)
lava/dispatcher/node.py (+411/-0)
lava_dispatcher/__init__.py (+1/-1)
lava_dispatcher/actions/deploy.py (+1/-0)
lava_dispatcher/actions/launch_control.py (+60/-3)
lava_dispatcher/actions/lava_test_shell.py (+69/-1)
lava_dispatcher/config.py (+0/-1)
lava_dispatcher/context.py (+20/-4)
lava_dispatcher/default-config/lava-dispatcher/device-types/aa9.conf (+16/-0)
lava_dispatcher/default-config/lava-dispatcher/device-types/capri.conf (+0/-46)
lava_dispatcher/default-config/lava-dispatcher/device-types/mx53loco.conf (+16/-6)
lava_dispatcher/default-config/lava-dispatcher/device-types/nexus10.conf (+0/-46)
lava_dispatcher/default-config/lava-dispatcher/device-types/rtsm_foundation-armv8.conf (+0/-20)
lava_dispatcher/default-config/lava-dispatcher/device-types/rtsm_ve-a15x1-a7x1.conf (+0/-117)
lava_dispatcher/default-config/lava-dispatcher/device-types/rtsm_ve-a15x4-a7x4.conf (+0/-117)
lava_dispatcher/default-config/lava-dispatcher/device-types/rtsm_ve-armv8.conf (+0/-128)
lava_dispatcher/device/master.py (+15/-10)
lava_dispatcher/downloader.py (+1/-1)
lava_dispatcher/job.py (+148/-4)
lava_dispatcher/signals/__init__.py (+88/-1)
lava_dispatcher/tests/test-config/bin/fake-qemu (+3/-0)
lava_dispatcher/tests/test_device_version.py (+24/-0)
lava_dispatcher/utils.py (+2/-2)
lava_test_shell/multi_node/lava-group (+19/-0)
lava_test_shell/multi_node/lava-multi-node.lib (+210/-0)
lava_test_shell/multi_node/lava-network (+104/-0)
lava_test_shell/multi_node/lava-role (+14/-0)
lava_test_shell/multi_node/lava-self (+9/-0)
lava_test_shell/multi_node/lava-send (+17/-0)
lava_test_shell/multi_node/lava-sync (+20/-0)
lava_test_shell/multi_node/lava-wait (+21/-0)
lava_test_shell/multi_node/lava-wait-all (+23/-0)
requirements.txt (+2/-1)
setup.py (+1/-1)
Text conflict in lava_dispatcher/actions/lava_test_shell.py
Text conflict in lava_dispatcher/context.py
Conflict adding file lava_dispatcher/default-config/lava-dispatcher/device-types/aa9.conf. Moved existing file to lava_dispatcher/default-config/lava-dispatcher/device-types/aa9.conf.moved.
Text conflict in lava_dispatcher/default-config/lava-dispatcher/device-types/mx53loco.conf
Text conflict in lava_dispatcher/device/master.py
Text conflict in lava_dispatcher/job.py
Conflict adding file lava_dispatcher/tests/test-config/bin. Moved existing file to lava_dispatcher/tests/test-config/bin.moved.
Text conflict in lava_dispatcher/tests/test_device_version.py
To merge this branch:	bzr merge lp://qastaging/lava-dispatcher/multinode

Reviewer	Date Requested	Status
Neil Williams		Approve on 2013-08-28
Antonio Terceiro	2013-08-21	Needs Fixing on 2013-08-23
Review via email: mp+181233@code.qastaging.launchpad.net

This proposal supersedes a proposal from 2013-08-20.

Description of the change

Landing MultiNode.

Handles the communication between jobs in a MultiNode group to deliver the LAVA MultiNode API with synchronisation primitives.

This branch applies with conflicts. The conflicts are proposed to be resolved as per this temporary branch: lp:~codehelp/lava-dispatcher/multinode-merge

lava-dispatcher will be merged after dashboard but before scheduler, so that MultiNode jobs can start as soon as the scheduler is ready.

Updated: Include missing changes from tip.

Antonio Terceiro (terceiro) wrote on 2013-08-23:

Hi guys,

This is great work! I like how we managed to minimize the impact on existing
code, so we are not risking breaking what already works! :-)

I have a fair number of comments below. I hope they help us get to even
better code.

On Wed, Aug 21, 2013 at 09:45:46AM -0000, Neil Williams wrote:
> Neil Williams has proposed merging lp:lava-dispatcher/multinode into lp:lava-dispatcher.
>
> Requested reviews:
> Linaro Validation Team (linaro-validation)
>
> For more details, see:
> https://code.launchpad.net/~linaro-automation/lava-dispatcher/multinode/+merge/181233
>
> Landing MultiNode.
>
> Handles the communication between jobs in a MultiNode group to deliver
> the LAVA MultiNode API with synchronisation primitives.
>
> This branch applies with conflicts. The conflicts are proposed to be
> resolved as per this temporary branch:
> lp:~codehelp/lava-dispatcher/multinode-merge

I don't understand why you didn't resolve the conflicts before making the
merge proposal ... especially because you already have that done somewhere else
:-)

> lava-dispatcher will be the merged after dashboard but before
> scheduler, so that MultiNode jobs can start as soon as the scheduler
> is ready.
>
> Updated: Include missing changes from tip.
[...]

There were a bunch of file renamings/removals in lava_dispatcher/default-config
in the diff; I don't think they should be here. I'm omitting those from the
review. Maybe the merge from trunk was not complete. Please make sure you
review the diff against trunk before going further with this.

> === modified file 'lava/dispatcher/commands.py'
> --- lava/dispatcher/commands.py	2013-07-16 15:58:16 +0000
> +++ lava/dispatcher/commands.py	2013-08-21 09:44:41 +0000
> @@ -7,7 +7,7 @@
>  from json_schema_validator.errors import ValidationError
>  from lava.tool.command import Command
>  from lava.tool.errors import CommandError
> -
> +from lava.dispatcher.node import NodeDispatcher
>  import lava_dispatcher.config
>  from lava_dispatcher.config import get_config, get_device_config, get_devices
>  from lava_dispatcher.job import LavaTestJob, validate_job_data
> @@ -93,6 +93,7 @@
>          # Set process id if job-id was passed to dispatcher
>          if self.args.job_id:
>              try:
> +                # noinspection PyUnresolvedReferences

what is this?

>                  from setproctitle import getproctitle, setproctitle
>              except ImportError:
>                  logging.warning(
> @@ -107,6 +108,14 @@
>              jobdata = stream.read()
>              json_jobdata = json.loads(jobdata)
>  
> +        # detect multinode and start a NodeDispatcher to work with the LAVA Coordinator.
> +        if not self.args.validate:
> +            if 'target_group' in json_jobdata:
> +                node = NodeDispatcher(json_jobdata, oob_file, self.args.output_dir)
> +                node.run()
> +                # the NodeDispatcher has started and closed.
> +                # FIXME: get any error state from nodeDispatcher!

is it OK to land with this issue here unsolved?

> +                exit(0)
>          if self.args.target is None:
>              if 'target' not in json_jobdata:
>                  logging.error("The job file does not specify a target device. "
> 
> === added file 'lava/dispatcher/node.py'
> --- lava/dispatcher/node.py	1970-01-01 00:00:00 +0000
> +++ lava/dispatcher/node.py	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,412 @@
> +#!/usr/bin/env python
> +# -*- coding: utf-8 -*-
> +#
> +#  node.py
> +#
> +#  Copyright 2013 Linaro Limited
> +#  Author Neil Williams <neil.williams@linaro.org>
> +#
> +#  This program is free software; you can redistribute it and/or modify
> +#  it under the terms of the GNU General Public License as published by
> +#  the Free Software Foundation; either version 2 of the License, or
> +#  (at your option) any later version.
> +#
> +#  This program is distributed in the hope that it will be useful,
> +#  but WITHOUT ANY WARRANTY; without even the implied warranty of
> +#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +#  GNU General Public License for more details.
> +#
> +#  You should have received a copy of the GNU General Public License
> +#  along with this program; if not, write to the Free Software
> +#  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
> +#  MA 02110-1301, USA.
> +#
> +#
> +
> +import socket
> +from socket import gethostname
> +import json
> +import logging
> +import os
> +import copy
> +import sys
> +import time
> +from lava_dispatcher.config import get_config
> +from lava_dispatcher.job import LavaTestJob
> +
> +
> +class Poller(object):
> +    """
> +    Blocking, synchronous socket poller which repeatedly tries to connect
> +    to the Coordinator, get a very fast response and then implement the
> +    wait.
> +    If the node needs to wait, it will get a {"response": "wait"}
> +    If the node should stop polling and send data back to the board, it will
> +    get a {"response": "ack", "message": "blah blah"}
> +    """
> +
> +    json_data = None
> +    polling = False

polling is only used inside poll(), and its value does not depend on anything
that is outside of that method.  make it a local variable instead.

> +    # starting value for the delay between polls
> +    delay = 1

ditto for delay
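
To illustrate what I mean, here is a rough sketch with the loop state kept local. The names poll_until, check and sleep are made up for this example and are not part of the branch; check() stands in for one round trip to the Coordinator.

```python
import time


def poll_until(check, step=1, timeout=10, sleep=time.sleep):
    """Call check() until it stops answering "wait" or the timeout passes.

    'polling', 'delay' and 'waited' are plain locals: nothing outside this
    function needs to read or reset them between calls.
    """
    polling = True
    delay = step
    waited = 0
    reply = None
    while polling:
        reply = check()
        if reply != "wait":
            polling = False
        elif waited >= timeout:
            # give up, mirroring the "nack" the patch returns on timeout
            reply = "nack"
            polling = False
        else:
            sleep(delay)
            waited += delay
    return reply
```

With locals, two consecutive poll() calls on the same object cannot inherit a stale delay from each other.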

> +    blocks = 4 * 1024
> +    # how long between polls (in seconds)
> +    step = 1

you might want to call this step_delay or delay_step?

> +    timeout = 0
> +
> +    def __init__(self, data_str):
> +        try:
> +            self.json_data = json.loads(data_str)

you are parsing json here, and dumping it when creating an instance of this
class. Why not just pass the data (i.e. a regular dictionary) directly?
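
Something along these lines, as a sketch only. PollerConfig is a hypothetical name, and I am assuming that a missing host, port or blocksize should be a hard error rather than just a log line.

```python
class PollerConfig(object):
    """Hypothetical constructor taking the settings dict directly."""

    REQUIRED = ("host", "port", "blocksize")

    def __init__(self, settings):
        missing = [key for key in self.REQUIRED if key not in settings]
        if missing:
            # assumption: misconfiguration should stop the job early
            raise ValueError("Misconfigured NodeDispatcher - missing: %s"
                             % ", ".join(missing))
        self.host = settings["host"]
        self.port = int(settings["port"])
        self.blocks = int(settings["blocksize"])
        # optional values keep the defaults used in the patch
        self.step = int(settings.get("poll_delay", 1))
        self.timeout = settings.get("timeout", 0)
```

The caller would then pass self.base_msg as it is, with no json.dumps() on one side and no json.loads() on the other.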

> +        except ValueError:
> +            logging.error("bad JSON")
> +            exit(1)
> +        if 'port' not in self.json_data:
> +            logging.error("Misconfigured NodeDispatcher - port not specified")
> +        if 'blocksize' not in self.json_data:
> +            logging.error("Misconfigured NodeDispatcher - blocksize not specified")
> +        self.blocks = int(self.json_data['blocksize'])
> +        if "poll_delay" in self.json_data:
> +            self.step = int(self.json_data["poll_delay"])
> +        if 'timeout' in self.json_data:
> +            self.timeout = self.json_data['timeout']
> +
> +    def poll(self, msg_str):

it seems that all the calls to this method are done like
poller.poll(json.dumps(msg)). Perhaps you should take the message directly,
then do the dumping inside this method
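
A minimal sketch of that shape, assuming the socket code is hidden behind a callable. MessagePoller and transport are invented names for the example, not anything in the branch.

```python
import json


class MessagePoller(object):
    """Hypothetical stand-in for Poller; 'transport' replaces the socket code."""

    def __init__(self, transport):
        # transport: callable taking a JSON string and returning a JSON string
        self.transport = transport

    def poll(self, message):
        # the caller hands over a plain dict; serialisation happens here, once
        msg_str = json.dumps(message)
        return self.transport(msg_str)
```

Call sites would then read poller.poll(init_msg) instead of poller.poll(json.dumps(init_msg)).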

> +        """
> +        Blocking, synchronous polling of the Coordinator on the configured port.
> +        Single send operations greater than 0xFFFF are rejected to prevent truncation.
> +        :param msg_str: The message to send to the Coordinator, as a JSON string.
> +        :return: a JSON string of the response to the poll
> +        """
> +        msg_len = len(msg_str)
> +        if msg_len > 0xFFFE:
> +            logging.error("Message was too long to send!")
> +            return

is just logging an error message enough here? I think the probability that the
rest of the test logic would depend on this message being sent is pretty high,
so if it's rejected, shouldn't we abort the job at this point?
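
One possible shape, sketched under the assumption that an exception is the right way to abort. MessageTooLongError and check_message_length are names I made up; the 0xFFFE limit is the one from the patch.

```python
MAX_MESSAGE_LENGTH = 0xFFFE  # same limit as the check in poll()


class MessageTooLongError(Exception):
    """Hypothetical error: the message cannot be sent in a single operation."""


def check_message_length(msg_str):
    """Return the message length, or raise instead of silently returning None."""
    msg_len = len(msg_str)
    if msg_len > MAX_MESSAGE_LENGTH:
        raise MessageTooLongError(
            "Message of %d bytes exceeds the %d byte limit"
            % (msg_len, MAX_MESSAGE_LENGTH))
    return msg_len
```

Whoever calls poll() can then decide whether to fail the job, rather than receiving None and failing later in json.loads().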

> +        self.polling = True
> +        c = 0
> +        response = None
> +        while self.polling:
> +            c += self.step
> +            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> +            try:
> +                s.connect((self.json_data['host'], self.json_data['port']))
> +                logging.debug("Connecting to LAVA Coordinator on %s:%s" % (self.json_data['host'], self.json_data['port']))
> +                self.delay = self.step
> +            except socket.error as e:
> +                logging.warn("socket error on connect: %d %s %s" %
> +                             (e.errno, self.json_data['host'], self.json_data['port']))
> +                time.sleep(self.delay)
> +                self.delay += 2
> +                s.close()
> +                continue

if there is a problem with the coordinator (e.g. it crashed) and/or with the
configuration (wrong host, wrong port etc) and a connection cannot be made, we
will loop here until the timeout. Can we be more specific in the exception
handling there, or does it just not matter?
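
As a sketch of what "more specific" could look like: classify_connect_error is a hypothetical helper, and which errno values count as retryable is my assumption (a refused connection may just mean the coordinator has not started yet, while an unresolvable hostname will not fix itself).

```python
import errno
import socket

# assumption: these are transient and worth retrying until the timeout
RETRYABLE_ERRNOS = frozenset([
    errno.ECONNREFUSED,
    errno.ECONNRESET,
    errno.ETIMEDOUT,
])


def classify_connect_error(exc):
    """Return "retry" for transient failures and "fatal" for everything else."""
    # check name-resolution failures first: gaierror is also a socket error
    if isinstance(exc, socket.gaierror):
        return "fatal"
    if isinstance(exc, socket.timeout):
        return "retry"
    if getattr(exc, "errno", None) in RETRYABLE_ERRNOS:
        return "retry"
    return "fatal"
```

The except block in poll() could then sleep and continue on "retry" and bail out with an error on "fatal".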

> +            logging.debug("sending message: %s" % msg_str[:42])
> +            # blocking synchronous call
> +            try:
> +                # send the length as 32bit hexadecimal
> +                ret_bytes = s.send("%08X" % msg_len)
> +                if ret_bytes == 0:
> +                    logging.debug("zero bytes sent for length - connection closed?")
> +                    continue
> +                ret_bytes = s.send(msg_str)
> +                if ret_bytes == 0:
> +                    logging.debug("zero bytes sent for message - connection closed?")
> +                    continue
> +            except socket.error as e:
> +                logging.warn("socket error '%d' on send" % e.message)
> +                s.close()
> +                continue
> +            s.shutdown(socket.SHUT_WR)
> +            try:
> +                header = s.recv(8)  # 32bit limit as a hexadecimal
> +                if not header or header == '':
> +                    logging.debug("empty header received?")
> +                    continue
> +                msg_count = int(header, 16)
> +                recv_count = 0
> +                response = ''
> +                while recv_count < msg_count:
> +                    response += s.recv(self.blocks)
> +                    recv_count += self.blocks
> +            except socket.error as e:
> +                logging.warn("socket error '%d' on response" % e.errno)
> +                s.close()
> +                continue
> +            s.close()
> +            if not response:
> +                time.sleep(self.delay)
> +                # if no response, wait and try again
> +                logging.debug("failed to get a response, setting a wait")
> +                response = json.dumps({"response": "wait"})
> +            try:
> +                json_data = json.loads(response)
> +            except ValueError:
> +                logging.error("response starting '%s' was not JSON" % response[:42])
> +                break
> +            if json_data['response'] != 'wait':
> +                self.polling = False
> +                break
> +            else:
> +                if not (c % int(10 * self.step)):
> +                    logging.info("Waiting ... %d of %d secs" % (c, self.timeout))
> +                time.sleep(self.delay)
> +            # apply the default timeout to each poll operation.
> +            if c > self.timeout:
> +                response = json.dumps({"response": "nack"})
> +                self.polling = False
> +                break
> +        return response

We probably already discussed this, but I feel we could be using something like
zeromq for this. (we do not have to do anything about this right now, just
saying :-))
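
For reference, the framing that poll() uses (an 8-character hexadecimal length followed by the payload) can be sketched on its own as below. encode_frame, recv_exact and read_frame are hypothetical helper names, and the sketch is written for Python 3 byte strings. Unlike the loop in the patch, it counts the bytes recv() actually returned rather than adding the block size on every pass.

```python
import socket


def encode_frame(payload):
    """Prefix the payload (bytes) with its length as 8 hexadecimal digits."""
    return ("%08X" % len(payload)).encode("ascii") + payload


def recv_exact(sock, count, blocksize=4096):
    """Read exactly 'count' bytes, since recv() may return fewer than asked."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(min(blocksize, remaining))
        if not chunk:
            raise ConnectionError(
                "connection closed with %d bytes outstanding" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(sock, blocksize=4096):
    """Read the 8-digit hexadecimal header, then the payload it announces."""
    header = recv_exact(sock, 8)
    return recv_exact(sock, int(header.decode("ascii"), 16), blocksize)
```

A quick way to try it without a coordinator is socket.socketpair(): send encode_frame(...) on one end and call read_frame() on the other.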

> +def readSettings(filename):
> +    """
> +    NodeDispatchers need to use the same port and blocksize as the Coordinator,
> +    so read the same conffile.
> +    The protocol header is hard-coded into the server & here.
> +    """
> +    settings = {
> +        "port": 3079,
> +        "blocksize": 4 * 1024,
> +        "poll_delay": 1,
> +        "coordinator_hostname": "localhost"
> +    }
> +    with open(filename) as stream:
> +        jobdata = stream.read()
> +        json_default = json.loads(jobdata)
> +    if "port" in json_default:
> +        settings['port'] = json_default['port']
> +    if "blocksize" in json_default:
> +        settings['blocksize'] = json_default["blocksize"]
> +    if "poll_delay" in json_default:
> +        settings['poll_delay'] = json_default['poll_delay']
> +    if "coordinator_hostname" in json_default:
> +        settings['coordinator_hostname'] = json_default['coordinator_hostname']
> +    return settings
> +
> +
> +class NodeDispatcher(object):
> +
> +    group_name = ''
> +    client_name = ''
> +    group_size = 0
> +    target = ''
> +    role = ''
> +    poller = None
> +    oob_file = sys.stderr
> +    output_dir = None
> +    base_msg = None
> +    json_data = None
> +
> +    def __init__(self, json_data, oob_file=sys.stderr, output_dir=None):
> +        """
> +        Parse the modified JSON to identify the group name,
> +        requested port for the group - node comms
> +        and get the designation for this node in the group.
> +        """
> +        settings = readSettings("/etc/lava-coordinator/lava-coordinator.conf")
> +        self.json_data = json_data
> +        # FIXME: do this with a schema once the API settles
> +        if 'target_group' not in json_data:
> +            raise ValueError("Invalid JSON to work with the MultiNode Coordinator: no target_group.")
> +        self.group_name = json_data['target_group']
> +        if 'group_size' not in json_data:
> +            raise ValueError("Invalid JSON to work with the Coordinator: no group_size")
> +        self.group_size = json_data["group_size"]
> +        if 'target' not in json_data:
> +            raise ValueError("Invalid JSON for a child node: no target designation.")
> +        self.target = json_data['target']
> +        if 'timeout' not in json_data:
> +            raise ValueError("Invalid JSON - no default timeout specified.")
> +        if "sub_id" not in json_data:
> +            logging.info("Error in JSON - no sub_id specified. Results cannot be aggregated.")
> +            json_data['sub_id'] = None
> +        if 'port' in json_data:
> +            # lava-coordinator provides a conffile for the port and blocksize.
> +            logging.debug("Port is no longer supported in the incoming JSON. Using %d" % settings["port"])
> +        if 'role' in json_data:
> +            self.role = json_data['role']
> +        # hostname of the server for the connection.
> +        if 'hostname' in json_data:
> +            # lava-coordinator provides a conffile for the group_hostname
> +            logging.debug("Coordinator hostname is no longer supported in the incoming JSON. Using %s"
> +                          % settings['coordinator_hostname'])
> +        self.base_msg = {"port": settings['port'],
> +                         "blocksize": settings['blocksize'],
> +                         "step": settings["poll_delay"],
> +                         "timeout": json_data['timeout'],
> +                         "host": settings['coordinator_hostname'],
> +                         "client_name": json_data['target'],
> +                         "group_name": json_data['target_group'],
> +                         # hostname here is the node hostname, not the server.
> +                         "hostname": gethostname(),
> +                         "role": self.role,
> +                         }
> +        self.client_name = json_data['target']
> +        self.poller = Poller(json.dumps(self.base_msg))

see my comment above in Poller.__init__ about dumping json and then parsing it
just after - we should probably drop the redundant dumping/parsing

> +        self.oob_file = oob_file
> +        self.output_dir = output_dir
> +
> +    def run(self):
> +        """
> +        Initialises the node into the group, registering the group if necessary
> +        (via group_size) and *waiting* until the rest of the group nodes also
> +        register before starting the actual job,
> +        """

> +        init_msg = {"request": "group_data", "group_size": self.group_size}
> +        init_msg.update(self.base_msg)
> +        logging.info("Starting Multi-Node communications for group '%s'" % self.group_name)
> +        logging.debug("init_msg %s" % json.dumps(init_msg))
> +        response = json.loads(self.poller.poll(json.dumps(init_msg)))
> +        logging.info("Starting the test run for %s in group %s" % (self.client_name, self.group_name))

> +        self.run_tests(self.json_data, response)

> +        # send a message to the GroupDispatcher to close the group (when all nodes have sent fin_msg)
> +        fin_msg = {"request": "clear_group", "group_size": self.group_size}
> +        fin_msg.update(self.base_msg)
> +        logging.debug("fin_msg %s" % json.dumps(fin_msg))
> +        self.poller.poll(json.dumps(fin_msg))

I would break the body of this method up as above (or even extract each block
into its own method) to make it more readable.
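
Roughly like this, as a sketch. GroupSession, join_group and leave_group are invented names, and poll here is assumed to take and return plain dicts, which is not what the branch does today.

```python
class GroupSession(object):
    """Hypothetical split of NodeDispatcher.run() into three small steps."""

    def __init__(self, poll, base_msg, group_size):
        self.poll = poll            # callable: dict -> dict
        self.base_msg = base_msg
        self.group_size = group_size

    def _group_request(self, request):
        # same construction as init_msg and fin_msg in the patch
        msg = {"request": request, "group_size": self.group_size}
        msg.update(self.base_msg)
        return self.poll(msg)

    def join_group(self):
        """Register with the group and wait for the other nodes."""
        return self._group_request("group_data")

    def leave_group(self):
        """Tell the coordinator that this node has finished."""
        return self._group_request("clear_group")

    def run(self, run_tests):
        group_data = self.join_group()
        run_tests(group_data)
        self.leave_group()
```

Each step then has a name, and run() reads as the three things it does.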

> +
> +    def __call__(self, args):
> +        """ Makes the NodeDispatcher callable so that the test shell can send messages just using the
> +        NodeDispatcher object.
> +        This function blocks until the specified API call returns. Some API calls may involve a
> +        substantial period of polling.
> +        :param args: JSON string of the arguments of the API call to make
> +        :return: A Python object containing the reply dict from the API call
> +        """
> +        try:
> +            return self._select(json.loads(args))

same here, maybe we could receive the args as a proper dictionary; it seems all
calls of this method first dump their message to JSON?

> +        except KeyError:
> +            logging.warn("Unable to handle request for: %s" % args)
> +
> +    def _select(self, json_data):

this method should probably be called _api_call instead?

> +        """ Determines which API call has been requested, makes the call, blocks and returns the reply.
> +        :param json_data: Python object of the API call
> +        :return: Python object containing the reply dict.
> +        """
> +        reply_str = ''
> +        if not json_data:
> +            logging.debug("Empty args")
> +            return
> +        if 'request' not in json_data:
> +            logging.debug("Bad call")
> +            return
> +        if json_data["request"] == "aggregate":
> +            # no message processing here, just the bundles.
> +            return self._aggregation(json_data)
> +        messageID = json_data['messageID']
> +        if json_data['request'] == "lava_sync":
> +            logging.info("requesting lava_sync '%s'" % messageID)
> +            reply_str = self.request_sync(messageID)
> +        elif json_data['request'] == 'lava_wait':
> +            logging.info("requesting lava_wait '%s'" % messageID)
> +            reply_str = self.request_wait(messageID)
> +        elif json_data['request'] == 'lava_wait_all':
> +            if 'role' in json_data and json_data['role'] is not None:
> +                reply_str = self.request_wait_all(messageID, json_data['role'])
> +                logging.info("requesting lava_wait_all '%s' '%s'" % (messageID, json_data['role']))
> +            else:
> +                logging.info("requesting lava_wait_all '%s'" % messageID)
> +                reply_str = self.request_wait_all(messageID)
> +        elif json_data['request'] == "lava_send":
> +            logging.info("requesting lava_send %s" % messageID)
> +            reply_str = self.request_send(messageID, json_data['message'])
> +        reply = json.loads(str(reply_str))
> +        if 'message' in reply:
> +            return reply['message']
> +        else:
> +            return reply['response']
> +
> +    def _aggregation(self, json_data):
> +        """ Internal call to send the bundle message to the coordinator so that the node
> +        with sub_id zero will get the complete bundle and everyone else a blank bundle.
> +        :param json_data: Arbitrary data from the job which will form the result bundle
> +        """
> +        if json_data["bundle"] is None:
> +            logging.info("Notifyng LAVA Controller of job completion")
> +        else:
> +            logging.info("Passing results bundle to LAVA Coordinator.")
> +        reply_str = self._send(json_data)
> +        reply = json.loads(str(reply_str))
> +        if 'message' in reply:
> +            return reply['message']
> +        else:
> +            return reply['response']
> +
> +    def _send(self, msg):
> +        """ Internal call to perform the API call via the Poller.
> +        :param msg: The call-specific message to be wrapped in the base_msg primitive.
> +        :return: Python object of the reply dict.
> +        """
> +        new_msg = copy.deepcopy(self.base_msg)
> +        new_msg.update(msg)
> +        if 'bundle' in new_msg:
> +            logging.debug("sending result bundle")
> +        else:
> +            logging.debug("sending Message %s" % json.dumps(new_msg))
> +        return self.poller.poll(json.dumps(new_msg))
> +
> +    def request_wait_all(self, messageID, role=None):
> +        """
> +        Asks the Coordinator to send back a particular messageID
> +        and blocks until that messageID is available for all nodes in
> +        this group or all nodes with the specified role in this group.
> +        """
> +        if role:
> +            return self._send({"request": "lava_wait_all",
> +                              "messageID": messageID,
> +                              "waitrole": role})
> +        else:
> +            return self._send({"request": "lava_wait_all",
> +                              "messageID": messageID})
> +
> +    def request_wait(self, messageID):
> +        """
> +        Asks the Coordinator to send back a particular messageID
> +        and blocks until that messageID is available for this node
> +        """
> +        # use self.target as the node ID
> +        wait_msg = {"request": "lava_wait",
> +                    "messageID": messageID,
> +                    "nodeID": self.target}
> +        return self._send(wait_msg)
> +
> +    def request_send(self, messageID, message):
> +        """
> +        Sends a message to the group via the Coordinator. The
> +        message is guaranteed to be available to all members of the
> +        group. The message is only picked up when a client in the group
> +        calls lava_wait or lava_wait_all.
> +        The message needs to be formatted JSON, not a simple string.
> +        { "messageID": "string", "message": { "key": "value"} }
> +        The message can consist of just the messageID:
> +        { "messageID": "string" }
> +        """
> +        send_msg = {"request": "lava_send",
> +                    "messageID": messageID,
> +                    "message": message}
> +        return self._send(send_msg)
> +
> +    def request_sync(self, msg):
> +        """
> +        Creates and send a message requesting lava_sync
> +        """
> +        sync_msg = {"request": "lava_sync", "messageID": msg}
> +        return self._send(sync_msg)
> +
> +    def run_tests(self, json_jobdata, group_data):
> +        if 'response' in group_data and group_data['response'] == 'nack':
> +            logging.error("Unable to initiliase a Multi-Node group - timed out waiting for other devices.")
> +            return
> +        config = get_config()
> +        if 'logging_level' in json_jobdata:
> +            logging.root.setLevel(json_jobdata["logging_level"])
> +        else:
> +            logging.root.setLevel(config.logging_level)
> +        if 'target' not in json_jobdata:
> +            logging.error("The job file does not specify a target device.")
> +            exit(1)
> +        jobdata = json.dumps(json_jobdata)
> +        if self.output_dir and not os.path.isdir(self.output_dir):
> +            os.makedirs(self.output_dir)
> +        job = LavaTestJob(jobdata, self.oob_file, config, self.output_dir)
> +        # pass this NodeDispatcher down so that the lava_test_shell can __call__ nodeTransport to write a message
> +        job.run(self, group_data)
> 
> === modified file 'lava_dispatcher/actions/lava_test_shell.py'
> --- lava_dispatcher/actions/lava_test_shell.py	2013-07-23 08:12:14 +0000
> +++ lava_dispatcher/actions/lava_test_shell.py	2013-08-21 09:44:41 +0000
> @@ -134,6 +134,16 @@
>  from lava_dispatcher.downloader import download_image
>  
>  LAVA_TEST_DIR = '%s/../../lava_test_shell' % os.path.dirname(__file__)
> +LAVA_MULTI_NODE_TEST_DIR = '%s/../../lava_test_shell/multi_node' % os.path.dirname(__file__)
> +
> +LAVA_GROUP_FILE = 'lava-group'
> +LAVA_ROLE_FILE = 'lava-role'
> +LAVA_SELF_FILE = 'lava-self'
> +LAVA_SEND_FILE = 'lava-send'
> +LAVA_SYNC_FILE = 'lava-sync'
> +LAVA_WAIT_FILE = 'lava-wait'
> +LAVA_WAIT_ALL_FILE = 'lava-wait-all'
> +LAVA_MULTI_NODE_CACHE_FILE = '/tmp/lava_multi_node_cache.txt'
>  
>  Target.android_deployment_data['distro'] = 'android'
>  Target.android_deployment_data['lava_test_sh_cmd'] = '/system/bin/mksh'
> @@ -508,6 +518,7 @@
>                                'items': {'type': 'object',
>                                          'properties':
>                                          {'git-repo': {'type': 'string',
> +<<<<<<< TREE
>                                                        'optional': True},
>                                           'bzr-repo': {'type': 'string',
>                                                        'optional': True},
> @@ -517,11 +528,23 @@
>                                                        'optional': True},
>                                           'testdef': {'type': 'string',
>                                                       'optional': True}
> +=======
> +                                                'optional': True},
> +                                        'bzr-repo': {'type': 'string',
> +                                                'optional': True},
> +                                        'tar-repo': {'type': 'string',
> +                                                'optional': True},
> +                                        'revision': {'type': 'string',
> +                                                'optional': True},
> +                                        'testdef': {'type': 'string',
> +                                                'optional': True}
> +>>>>>>> MERGE-SOURCE

conflict here

>                                           },
>                                          'additionalProperties': False},
>                                'optional': True
>                                },
>              'timeout': {'type': 'integer', 'optional': True},
> +            'role': {'type': 'string', 'optional': True},
>          },
>          'additionalProperties': False,
>      }
> @@ -544,6 +567,7 @@
>              if timeout == -1:
>                  timeout = runner._connection.timeout
>              initial_timeout = timeout
> +            signal_director.setConnection(runner._connection)
>              while self._keep_running(runner, timeout, signal_director):
>                  elapsed = time.time() - start
>                  timeout = int(initial_timeout - elapsed)
> @@ -556,6 +580,7 @@
>              pexpect.EOF,
>              pexpect.TIMEOUT,
>              '<LAVA_SIGNAL_(\S+) ([^>]+)>',
> +            '<LAVA_MULTI_NODE> <LAVA_(\S+) ([^>]+)>',
>          ]
>  
>          idx = runner._connection.expect(patterns, timeout=timeout)
> @@ -570,11 +595,21 @@
>              logging.debug("Received signal <%s>" % name)
>              params = params.split()
>              try:
> -                signal_director.signal(name, params)
> +                signal_director.signal(name, params, self.context)
>              except:
>                  logging.exception("on_signal failed")
>              runner._connection.sendline('echo LAVA_ACK')
>              return True
> +        elif idx == 4:
> +            name, params = runner._connection.match.groups()
> +            logging.debug("Received Multi_Node API <LAVA_%s>" % name)
> +            params = params.split()
> +            ret = False
> +            try:
> +                ret = signal_director.signal(name, params, self.context)
> +            except:
> +                logging.exception("on_signal(Multi_Node) failed")
> +            return ret
>  
>          return False
>  
> @@ -598,6 +633,37 @@
>                      fout.write(fin.read())
>                      os.fchmod(fout.fileno(), XMOD)
>  
> +    def _inject_multi_node_api(self, mntdir, target):
> +        shell = target.deployment_data['lava_test_sh_cmd']
> +
> +        # Generic scripts
> +        scripts_to_copy = glob(os.path.join(LAVA_MULTI_NODE_TEST_DIR, 'lava-*'))
> +
> +        for fname in scripts_to_copy:
> +            with open(fname, 'r') as fin:
> +                foutname = os.path.basename(fname)
> +                with open('%s/bin/%s' % (mntdir, foutname), 'w') as fout:
> +                    fout.write("#!%s\n\n" % shell)
> +                    # Target-specific scripts (add ENV to the generic ones)
> +                    if foutname == LAVA_GROUP_FILE:
> +                        fout.write('LAVA_GROUP="\n')
> +                        if 'roles' in self.context.group_data:
> +                            for client_name in self.context.group_data['roles']:
> +                                fout.write(r"\t%s\t%s\n" % (client_name, self.context.group_data['roles'][client_name]))
> +                        else:
> +                            logging.debug("group data MISSING")
> +                        fout.write('"\n')
> +                    elif foutname == LAVA_ROLE_FILE:
> +                        fout.write("TARGET_ROLE='%s'\n" % self.context.test_data.metadata['role'])
> +                    elif foutname == LAVA_SELF_FILE:
> +                        fout.write("HOSTNAME='%s'\n" % self.context.test_data.metadata['target.hostname'])
> +                    else:
> +                        fout.write("LAVA_TEST_BIN='%s/bin'\n" % target.deployment_data['lava_test_dir'])
> +                        fout.write("LAVA_MULTI_NODE_CACHE='%s'\n" % LAVA_MULTI_NODE_CACHE_FILE)
> +                        if self.context.test_data.metadata['logging_level'] == 'DEBUG':
> +                            fout.write("LAVA_MULTI_NODE_DEBUG='yes'\n")
> +                    fout.write(fin.read())
> +                    os.fchmod(fout.fileno(), XMOD)
>  
>      def _mk_runner_dirs(self, mntdir):
>          utils.ensure_directory('%s/bin' % mntdir)
> @@ -613,6 +679,8 @@
>          with target.file_system(results_part, 'lava') as d:
>              self._mk_runner_dirs(d)
>              self._copy_runner(d, target)
> +            if 'target_group' in self.context.test_data.metadata:
> +                self._inject_multi_node_api(d, target)
>  
>              testdef_loader = TestDefinitionLoader(self.context, target.scratch_dir)
>  
> 
> === modified file 'lava_dispatcher/context.py'
> --- lava_dispatcher/context.py	2013-07-24 16:56:18 +0000
> +++ lava_dispatcher/context.py	2013-08-21 09:44:41 +0000
> @@ -129,7 +129,7 @@
>      def run_command(self, command, failok=True):
>          """run command 'command' with output going to output-dir if specified"""
>          if isinstance(command, (str, unicode)):
> -            command = ['sh', '-c', command]
> +            command = ['nice', 'sh', '-c', command]
>          logging.debug("Executing on host : '%r'" % command)
>          output_args = {
>              'stdout': self.logfile_read,
> @@ -148,6 +148,22 @@
>          logging.debug("Executing on host : '%r'" % command)
>          return subprocess.check_output(command) 
>  
> -    def finish(self):
> -        self.client.finish()
> -
> +<<<<<<< TREE
> +    def finish(self):
> +        self.client.finish()
> +
> +=======
> +    def finish(self):
> +        self.client.finish()
> +
> +    def assign_transport(self, transport):
> +        self.transport = transport

this is probably not needed (see below)

> +
> +    def assign_group_data(self, group_data):
> +        """
> +        :param group_data: Arbitrary data related to the
> +        group configuration, passed in via the GroupDispatcher
> +        Used by lava-group
> +        """
> +        self.group_data = group_data
> +>>>>>>> MERGE-SOURCE

conflict

> === modified file 'lava_dispatcher/downloader.py'
> --- lava_dispatcher/downloader.py	2013-07-16 16:08:22 +0000
> +++ lava_dispatcher/downloader.py	2013-08-21 09:44:41 +0000
> @@ -41,7 +41,7 @@
>      process = None
>      try:
>          process = subprocess.Popen(
> -            ['ssh', url.netloc, 'cat', url.path],
> +            ['nice', 'ssh', url.netloc, 'cat', url.path],

Is this useful at all? Maybe ionice would be more helpful?
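
To make the nice vs. ionice question concrete, here is a rough sketch of how the command list could be built. `low_priority` is a hypothetical helper, not something in lava-dispatcher, and it assumes the util-linux `ionice` binary is available on the dispatcher host. Whether either prefix actually helps for an ssh transfer is still the open question.

```python
# Hypothetical helper, not part of lava-dispatcher: prefix a command list
# with nice and/or ionice. Assumes util-linux `ionice` exists on the host.
def low_priority(command, cpu=True, io=True):
    prefix = []
    if cpu:
        # lower CPU scheduling priority
        prefix += ['nice']
    if io:
        # -c 3 selects the "idle" I/O scheduling class
        prefix += ['ionice', '-c', '3']
    return prefix + list(command)


# Example arguments are placeholders, standing in for url.netloc / url.path.
cmd = low_priority(['ssh', 'example.org', 'cat', '/path/to/image'])
```

The resulting list can be passed to subprocess.Popen with shell=False, the same way the diff already does.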

>              shell=False,
>              stdout=subprocess.PIPE
>          )
> 
> === modified file 'lava_dispatcher/job.py'
> --- lava_dispatcher/job.py	2013-07-24 16:56:18 +0000
> +++ lava_dispatcher/job.py	2013-08-21 09:44:41 +0000
> @@ -23,6 +23,7 @@
>  import pexpect
>  import time
>  import traceback
> +import hashlib
>  
>  from json_schema_validator.schema import Schema
>  from json_schema_validator.validator import Validator
> @@ -64,6 +65,34 @@
>              'type': 'string',
>              'optional': True,
>          },
> +        'device_group': {
> +            'type': 'array',
> +            'additionalProperties': False,
> +            'optional': True,
> +            'items': {
> +                'type': 'object',
> +                'properties': {
> +                    'role': {
> +                        'optional': False,
> +                        'type': 'string',
> +                    },
> +                    'count': {
> +                        'optional': False,
> +                        'type': 'integer',
> +                    },
> +                    'device_type': {
> +                        'optional': False,
> +                        'type': 'string',
> +                    },
> +                    'tags': {
> +                        'type': 'array',
> +                        'uniqueItems': True,
> +                        'items': {'type': 'string'},
> +                        'optional': True,
> +                    },
> +                },
> +            },
> +        },
>          'job_name': {
>              'type': 'string',
>              'optional': True,
> @@ -76,6 +105,26 @@
>              'type': 'string',
>              'optional': True,
>          },
> +        'target_group': {
> +            'type': 'string',
> +            'optional': True,
> +        },

this should probably be called multinode_target_group

> +        'port': {
> +            'type': 'integer',
> +            'optional': True,
> +        },

this should probably be called multinode_coordinator_port

> +        'hostname': {
> +            'type': 'string',
> +            'optional': True,
> +        },

this should probably be called multinode_coordinator_hostname

> +        'role': {
> +            'type': 'string',
> +            'optional': True,
> +        },

this should probably be called multinode_role

> +        'group_size': {
> +            'type': 'integer',
> +            'optional': True,
> +        },

this should probably be called multinode_group_size

my feeling is that these generic names at the top level might be somewhat
confusing ... so I think we should either add the multinode_ prefix as
suggested above, or keep their names but move them inside an optional attribute
called "multinode_data".
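
A sketch of the second option, in the same schema style as the rest of job.py. Only the attribute name "multinode_data" comes from the comment above; the exact nesting and the `get_multinode` helper are assumptions for illustration.

```python
# Sketch only: the five field names are the ones added by this diff, but
# grouping them under one optional object is an assumption, not merged code.
multinode_schema_fragment = {
    'multinode_data': {
        'type': 'object',
        'optional': True,
        'additionalProperties': False,
        'properties': {
            'target_group': {'type': 'string', 'optional': True},
            'port': {'type': 'integer', 'optional': True},
            'hostname': {'type': 'string', 'optional': True},
            'role': {'type': 'string', 'optional': True},
            'group_size': {'type': 'integer', 'optional': True},
        },
    },
}


def get_multinode(job_data, key, default=None):
    """Hypothetical accessor: read a MultiNode field from the nested attribute."""
    return job_data.get('multinode_data', {}).get(key, default)
```

With this layout a check like `'target_group' in self.job_data` would become `'multinode_data' in self.job_data`, and single-node jobs keep a top level free of MultiNode-specific keys.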

>          'timeout': {
>              'type': 'integer',
>              'optional': False,
> @@ -136,7 +185,9 @@
>          except:
>              return None
>  
> -    def run(self):
> +    def run(self, transport=None, group_data=None):

> +        self.context.assign_transport(transport)

this is probably not needed (see below).

> +        self.context.assign_group_data(group_data)
>          validate_job_data(self.job_data)
>          self._set_logging_level()
>          lava_commands = get_all_cmds()
> @@ -157,6 +208,31 @@
>  
>          self.context.test_data.add_tags(self.tags)
>  
> +        if 'target' in self.job_data:
> +            metadata['target'] = self.job_data['target']
> +            self.context.test_data.add_metadata(metadata)
> +
> +        if 'logging_level' in self.job_data:
> +            metadata['logging_level'] = self.job_data['logging_level']
> +            self.context.test_data.add_metadata(metadata)
> +
> +        if 'target_group' in self.job_data:
> +            metadata['target_group'] = self.job_data['target_group']
> +            self.context.test_data.add_metadata(metadata)
> +
> +            if 'role' in self.job_data:
> +                metadata['role'] = self.job_data['role']
> +                self.context.test_data.add_metadata(metadata)
> +
> +            if 'group_size' in self.job_data:
> +                metadata['group_size'] = self.job_data['group_size']
> +                self.context.test_data.add_metadata(metadata)
> +
> +            logging.info("[ACTION-B] Multi Node test!")
> +            logging.info("[ACTION-B] target_group is (%s)." % self.context.test_data.metadata['target_group'])
> +        else:
> +            logging.info("[ACTION-B] Single node test!")
> +
>          try:
>              job_length = len(self.job_data['actions'])
>              job_num = 0
> @@ -177,6 +253,7 @@
>                      status = 'fail'
>                      action.run(**params)
>                  except ADBConnectError as err:
> +                    logging.info("ADBConnectError")
>                      if cmd.get('command') == 'boot_linaro_android_image':
>                          logging.warning(('[ACTION-E] %s failed to create the'
>                                           ' adb connection') % (cmd['command']))
> @@ -195,6 +272,7 @@
>                          ## mark it as pass if the second boot works
>                          status = 'pass'
>                  except TimeoutError as err:
> +                    logging.info("TimeoutError")
>                      if cmd.get('command').startswith('lava_android_test'):
>                          logging.warning("[ACTION-E] %s times out." %
>                                          (cmd['command']))
> @@ -214,15 +292,23 @@
>                              self.context.client.proc.sendline("")
>                              time.sleep(5)
>                              self.context.client.boot_linaro_android_image()
> +                    else:
> +                        logging.warn("Unhandled timeout condition")
> +                        continue
>                  except CriticalError as err:
> +                    logging.info("CriticalError")
>                      raise
>                  except (pexpect.TIMEOUT, GeneralError) as err:
> +                    logging.warn("pexpect timed out, pass with status %s" % status)
>                      pass
>                  except Exception as err:
> +                    logging.info("General Exception")
>                      raise
>                  else:
> +                    logging.info("setting status pass")
>                      status = 'pass'
>                  finally:
> +                    logging.info("finally status %s" % status)
>                      err_msg = ""
>                      if status == 'fail':
>                          # XXX mwhudson, 2013-01-17: I have no idea what this
> @@ -255,7 +341,10 @@
>              self.context.test_data.add_metadata({
>                  'target.device_version': device_version
>              })
> -            if submit_results:
> +            if 'target_group' in self.job_data:
> +                # all nodes call aggregate, even if there is no submit_results command
> +                self._aggregate_bundle(transport, lava_commands, submit_results)
> +            elif submit_results:
>                  params = submit_results.get('parameters', {})
>                  action = lava_commands[submit_results['command']](
>                      self.context)
> @@ -268,7 +357,62 @@
>                  except Exception as err:
>                      logging.error("Failed to submit the test result. Error = %s", err)
>                      raise
> -            self.context.finish()
> +<<<<<<< TREE
> +            self.context.finish()
> +=======
> +            self.context.finish()
> +
> +    def _aggregate_bundle(self, transport, lava_commands, submit_results):
> +        if "sub_id" not in self.job_data:
> +            raise ValueError("Invalid MultiNode JSON - missing sub_id")
> +        # all nodes call aggregate, even if there is no submit_results command
> +        base_msg = {
> +            "request": "aggregate",
> +            "bundle": None,
> +            "sub_id": self.job_data['sub_id']
> +        }
> +        if not submit_results:
> +            transport(json.dumps(base_msg))
> +            return
> +        # need to collate this bundle before submission, then send to the coordinator.
> +        params = submit_results.get('parameters', {})
> +        action = lava_commands[submit_results['command']](self.context)
> +        token = None
> +        group_name = self.job_data['target_group']
> +        if 'token' in params:
> +            token = params['token']
> +        # the transport layer knows the client_name for this bundle.
> +        bundle = action.collect_bundles(**params)
> +        # catch parse errors in bundles
> +        try:
> +            bundle_str = json.dumps(bundle)
> +        except Exception as e:
> +            logging.error("Unable to parse bundle '%s' - %s" % (bundle, e))
> +            transport(json.dumps(base_msg))
> +            return
> +        sha1 = hashlib.sha1()
> +        sha1.update(bundle_str)
> +        base_msg['bundle'] = sha1.hexdigest()
> +        reply = transport(json.dumps(base_msg))
> +        # if this is sub_id zero, this will wait until the last call to aggregate
> +        # and then the reply is the full list of bundle checksums.
> +        if reply == "ack":
> +            # coordinator has our checksum for this bundle, submit as pending to launch_control
> +            action.submit_pending(bundle, params['server'], token, group_name)
> +            logging.info("Result bundle %s has been submitted to Dashboard as pending." % base_msg['bundle'])
> +            return
> +        elif reply == "nack":
> +            logging.error("Unable to submit result bundle checksum to coordinator")
> +            return
> +        else:
> +            if self.job_data["sub_id"].endswith(".0"):
> +                # submit this bundle, add it to the pending list which is indexed by group_name and post the set
> +                logging.info("Submitting bundle '%s' and aggregating with pending group results." % base_msg['bundle'])
> +                action.submit_group_list(bundle, params['server'], params['stream'], token, group_name)
> +                return
> +            else:
> +                raise ValueError("API error - collated bundle has been sent to the wrong node.")
> +>>>>>>> MERGE-SOURCE
>  
>      def _set_logging_level(self):
>          # set logging level is optional
> 
> === modified file 'lava_dispatcher/signals/__init__.py'
> --- lava_dispatcher/signals/__init__.py	2013-07-16 16:06:51 +0000
> +++ lava_dispatcher/signals/__init__.py	2013-08-21 09:44:41 +0000
> @@ -21,6 +21,7 @@
>  import contextlib
>  import logging
>  import tempfile
> +import json
>  
>  from lava_dispatcher.utils import rmtree
>  
> @@ -123,6 +124,16 @@
>          pass
>  
>  
> +class FailedCall(Exception):
> +    """
> +    Just need a plain Exception to trigger the failure of the
> +    signal handler and set keep_running to False.
> +    """
> +
> +    def __init__(self, call):
> +        Exception.__init__(self, "%s call failed" % call)
> +
> +
>  class SignalDirector(object):
>  
>      def __init__(self, client, testdefs_by_uuid):
> @@ -130,8 +141,11 @@
>          self.testdefs_by_uuid = testdefs_by_uuid
>          self._test_run_data = []
>          self._cur_handler = None
> +        self.context = None

context is only used to query its transport attribute. Maybe you could just
pass in the transport instead?

> +        self.connection = None
>  
> -    def signal(self, name, params):
> +    def signal(self, name, params, context=None):
> +        self.context = context

I don't think you should assign self.context as a side effect of calling
signal() ... maybe this would be better if assigned during object
initialization (__init__)?

>          handler = getattr(self, '_on_' + name, None)
>          if not handler and self._cur_handler:
>              handler = self._cur_handler.custom_signal
> @@ -141,6 +155,11 @@
>                  handler(*params)
>              except:
>                  logging.exception("handling signal %s failed", name)
> +                return False
> +            return True
> +
> +    def setConnection(self, connection):
> +        self.connection = connection

I think in our naming style this should be set_connection instead.
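
Taken together, the three comments on SignalDirector so far could look roughly like the sketch below. This is a trimmed-down stand-in, not the real class: the `transport` constructor argument is an assumption, the error handling from the diff is left out, and `_on_SEND` is simplified (it splits each pair on the first "=" only).

```python
import json


class SignalDirector(object):
    # Sketch: `transport` is any callable that takes a JSON string and
    # returns the coordinator's reply. Passing it to __init__ is an
    # assumption made for this example.
    def __init__(self, client, testdefs_by_uuid, transport=None):
        self.client = client
        self.testdefs_by_uuid = testdefs_by_uuid
        self.transport = transport
        self.connection = None

    def set_connection(self, connection):
        self.connection = connection

    def signal(self, name, params):
        # no context argument any more: nothing is assigned as a side effect
        handler = getattr(self, '_on_' + name, None)
        if handler is None:
            return False
        handler(*params)
        return True

    def _on_SEND(self, message_id, *pairs):
        # simplified: collect key=value pairs, or None if there are none
        data = dict(p.split('=', 1) for p in pairs if '=' in p) or None
        msg = {"request": "lava_send", "messageID": message_id, "message": data}
        return self.transport(json.dumps(msg))
```

The handlers then call self.transport(...) directly, and LavaContext.assign_transport is no longer needed for this purpose.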

>  
>      def _on_STARTRUN(self, test_run_id, uuid):
>          self._cur_handler = None
> @@ -162,6 +181,87 @@
>          if self._cur_handler:
>              self._cur_handler.endtc(test_case_id)
>  
> +    def _on_SEND(self, *args):
> +        arg_length = len(args)
> +        if arg_length == 1:
> +            msg = {"request": "lava_send", "messageID": args[0], "message": None}
> +        else:
> +            message_id = args[0]
> +            remainder = args[1:arg_length]
> +            logging.debug("%d key value pair(s) to be sent." % int(len(remainder)))
> +            data = {}
> +            for message in remainder:
> +                detail = str.split(message, "=")
> +                if len(detail) == 2:
> +                    data[detail[0]] = detail[1]
> +            msg = {"request": "lava_send", "messageID": message_id, "message": data}
> +        logging.debug("Handling signal <LAVA_SEND %s>" % msg)
> +        reply = self.context.transport(json.dumps(msg))
> +        if reply == "nack":
> +            raise FailedCall("LAVA_SEND nack")
> +
> +    def _on_SYNC(self, message_id):
> +        if not self.connection:
> +            logging.error("No connection available for on_SYNC")
> +            return
> +        logging.debug("Handling signal <LAVA_SYNC %s>" % message_id)
> +        msg = {"request": "lava_sync", "messageID": message_id, "message": None}
> +        reply = self.context.transport(json.dumps(msg))
> +        message_str = ""
> +        if reply == "nack":
> +#            raise FailedCall("LAVA_SYNC nack")
> +            message_str = " nack"
> +#        elif reply == "TIMEOUT":
> +#            raise FailedCall("LAVA_SYNC TIMEOUT")
> +#            message_str = " TIMEOUT"

Remove these commented lines? We have version control for a reason :-)

> +        else:
> +            message_str = ""
> +        ret = self.connection.sendline("<LAVA_SYNC_COMPLETE%s>" % message_str)
> +        logging.debug("runner._connection.sendline wrote %d bytes" % ret)
> +
> +    def _on_WAIT(self, message_id):
> +        if not self.connection:
> +            logging.error("No connection available for on_WAIT")
> +            return
> +        logging.debug("Handling signal <LAVA_WAIT %s>" % message_id)
> +        msg = {"request": "lava_wait", "messageID": message_id, "message": None}
> +        reply = self.context.transport(json.dumps(msg))
> +        message_str = ""
> +        if reply == "nack":
> +#            raise FailedCall("LAVA_WAIT nack")
> +            message_str = " nack"
> +#        elif reply == "TIMEOUT":
> +#            raise FailedCall("LAVA_WAIT TIMEOUT")
> +#            message_str = " TIMEOUT"

ditto

> +        else:
> +            for target, messages in reply.items():
> +                for key, value in messages.items():
> +                    message_str += " %s:%s=%s" % (target, key, value)
> +        self.connection.sendline("<LAVA_WAIT_COMPLETE%s>" % message_str)
> +
> +    def _on_WAIT_ALL(self, message_id, role=None):
> +        if not self.connection:
> +            logging.error("No connection available for on_WAIT_ALL")
> +            return
> +        logging.debug("Handling signal <LAVA_WAIT_ALL %s>" % message_id)
> +        msg = {"request": "lava_wait_all", "messageID": message_id, "role": role}
> +        reply = self.context.transport(json.dumps(msg))
> +        message_str = ""
> +        if reply == "nack":
> +#            raise FailedCall("LAVA_WAIT_ALL nack")
> +            message_str = " nack"
> +#        elif reply == "TIMEOUT":
> +#            raise FailedCall("LAVA_WAIT_ALL TIMEOUT")
> +#            message_str = " TIMEOUT"

ditto

> +        else:
> +            #the reply format is like this :
> +            #"{target:{key1:value, key2:value2, key3:value3},
> +            #  target2:{key1:value, key2:value2, key3:value3}}"
> +            for target, messages in reply.items():
> +                for key, value in messages.items():
> +                    message_str += " %s:%s=%s" % (target, key, value)
> +        self.connection.sendline("<LAVA_WAIT_ALL_COMPLETE%s>" % message_str)
> +
>      def postprocess_bundle(self, bundle):
>          for test_run in bundle['test_runs']:
>              uuid = test_run['analyzer_assigned_uuid']
> 
> === modified file 'lava_dispatcher/tests/test_device_version.py'
> --- lava_dispatcher/tests/test_device_version.py	2013-08-08 09:18:23 +0000
> +++ lava_dispatcher/tests/test_device_version.py	2013-08-21 09:44:41 +0000
> @@ -18,9 +18,14 @@
>  # along with this program; if not, see <http://www.gnu.org/licenses>.
>  
>  import re
> +<<<<<<< TREE
>  import lava_dispatcher.config
>  from lava_dispatcher.tests.helper import LavaDispatcherTestCase, create_device_config, create_config
>  import os
> +=======
> +from lava_dispatcher.tests.helper import LavaDispatcherTestCase, create_device_config, create_config
> +import os
> +>>>>>>> MERGE-SOURCE

conflict here

>  
>  from lava_dispatcher.device.target import Target
>  from lava_dispatcher.device.qemu import QEMUTarget
> @@ -28,6 +33,7 @@
>  from lava_dispatcher.context import LavaContext
>  from lava_dispatcher.config import get_config
>  
> +
>  def _create_fastmodel_target():
>      config = create_device_config('fastmodel01', {'device_type': 'fastmodel',
>                                                    'simulator_binary': '/path/to/fastmodel',
> @@ -36,6 +42,7 @@
>      return target
>  
>  
> +<<<<<<< TREE
>  def _create_qemu_target(extra_device_config={}):
>      create_config('lava-dispatcher.conf', {})
>  
> @@ -44,6 +51,18 @@
>      device_config = create_device_config('qemu01', device_config_data)
>  
>      dispatcher_config = get_config()
> +=======
> +def _create_qemu_target(extra_device_config=None):
> +    if extra_device_config is None:
> +        extra_device_config = {}
> +    create_config('lava-dispatcher.conf', {})
> +
> +    device_config_data = {'device_type': 'qemu'}
> +    device_config_data.update(extra_device_config)
> +    device_config = create_device_config('qemu01', device_config_data)
> +
> +    dispatcher_config = get_config()
> +>>>>>>> MERGE-SOURCE

conflict

>  
>      context = LavaContext('qemu01', dispatcher_config, None, None, None)
>      return QEMUTarget(context, device_config)
> @@ -56,7 +75,13 @@
>          self.assertIsInstance(target.get_device_version(), str)
>  
>      def test_qemu(self):
> +<<<<<<< TREE
>          fake_qemu = os.path.join(os.path.dirname(__file__), 'test-config', 'bin', 'fake-qemu')
>          target = _create_qemu_target({ 'qemu_binary': fake_qemu })
> +=======
> +        # noinspection PyUnresolvedReferences

what is this? (2)

> +        fake_qemu = os.path.join(os.path.dirname(__file__), 'test-config', 'bin', 'fake-qemu')
> +        target = _create_qemu_target({'qemu_binary': fake_qemu})
> +>>>>>>> MERGE-SOURCE
>          device_version = target.get_device_version()
>          assert(re.search('^[0-9.]+', device_version))
>

conflict again

> === modified file 'lava_dispatcher/utils.py'
> --- lava_dispatcher/utils.py	2013-07-24 16:56:18 +0000
> +++ lava_dispatcher/utils.py	2013-08-21 09:44:41 +0000
> @@ -78,7 +78,7 @@
>      """
>      cmd = 'tar -C %s -czf %s %s' % (rootdir, tfname, basedir)
>      if asroot:
> -        cmd = 'sudo %s' % cmd
> +        cmd = 'nice sudo %s' % cmd

see comment above about nice

>      if logging_system(cmd):
>          raise CriticalError('Unable to make tarball of: %s' % rootdir)
>  
> @@ -99,7 +99,7 @@
>      a list of all the files (full path). This is being used to get around
>      issues that python's tarfile seems to have with unicode
>      """
> -    if logging_system('tar -C %s -xzf %s' % (tmpdir, tfname)):
> +    if logging_system('nice tar -C %s -xzf %s' % (tmpdir, tfname)):

see comment above about nice

Also, you guys mentioned that decompressing consumes a lot of resources, but
AFAICT it is being done inside the python process and cannot be niced

>          raise CriticalError('Unable to extract tarball: %s' % tfname)
>  
>      return _list_files(tmpdir)
> 
> === added directory 'lava_test_shell/multi_node'
> === added file 'lava_test_shell/multi_node/lava-group'
> --- lava_test_shell/multi_node/lava-group	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-group	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,19 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#This command will produce in its standard output a representation of the
> +#device group that is participating in the multi-node test job.
> +#
> +#Usage: ``lava-group``
> +#
> +#The output format contains one line per device, and each line contains
> +#the hostname and the role that device is playing in the test, separated
> +#by a TAB character::
> +#
> +#	panda01	client
> +#	highbank01	loadbalancer
> +#	highbank02	backend
> +#	highbank03	backend
> +
> +echo -e ${LAVA_GROUP}

you do not need -e here. BTW -e is not supported in a bunch of shells (e.g. dash,
the default /bin/sh on Debian and Ubuntu) ...

I would recommend running checkbashisms (from the devscripts package on
Debian/Ubuntu) against all new scripts.

Did you guys test these scripts on Android?
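
One note on dropping -e: in this diff _inject_multi_node_api writes the entries with a raw string (r"\t%s\t%s\n"), so the generated LAVA_GROUP value contains backslash escapes that only echo -e expands. A sketch of emitting literal TAB and newline characters from the Python side instead; `format_lava_group` is a hypothetical helper, and it assumes hostnames and roles contain no shell-special characters.

```python
def format_lava_group(roles):
    """Return the LAVA_GROUP assignment with literal TAB/newline characters.

    `roles` maps hostname -> role, like context.group_data['roles'] in the
    diff. Hypothetical helper: values are not shell-quoted.
    """
    lines = ["\t%s\t%s" % (name, role) for name, role in sorted(roles.items())]
    return 'LAVA_GROUP="\n%s\n"\n' % "\n".join(lines)
```

With the value written this way, the lava-group script should be able to print it with a plain echo "${LAVA_GROUP}" (quoted, so the newlines survive), with no -e involved.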

> === added file 'lava_test_shell/multi_node/lava-multi-node.lib'
> --- lava_test_shell/multi_node/lava-multi-node.lib	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-multi-node.lib	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,210 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +
> +MESSAGE_PREFIX="<LAVA_MULTI_NODE>"
> +MESSAGE_COMMAND="<${LAVA_MULTI_NODE_API}"
> +MESSAGE_HEAD="$MESSAGE_PREFIX $MESSAGE_COMMAND"
> +#MESSAGE_ID="<$1>"
> +MESSAGE_ACK="<${LAVA_MULTI_NODE_API}_ACK>"
> +
> +MESSAGE_REPLY="<${LAVA_MULTI_NODE_API}_COMPLETE"
> +MESSAGE_REPLY_ACK="<${LAVA_MULTI_NODE_API}_COMPLETE_ACK>"
> +
> +LAVA_MULTI_NODE_EXIT_ERROR=1
> +
> +_get_key_value_pattern () {
> +	echo $@|\
> +	tr ' ' '\n' |\
> +	sed -n '/\b\w\w*[=]\w\w*\b/p'|\
> +	tr '\n' ' '
> +}
> +
> +_lava_multi_node_debug () {
> +
> +if [ -n $LAVA_MULTI_NODE_DEBUG ] ; then
> +	echo "${MESSAGE_COMMAND}_DEBUG $@ $(date)>"
> +fi
> +
> +}
> +
> +_lava_multi_node_send () {
> +
> +_lava_multi_node_debug "$FUNCNAME started"
> +
> +result=$(echo $1 | grep "..*=..*")
> +
> +if [ -n "$1" -a "${result}x" == "x" ] ; then
> +	echo ${MESSAGE_HEAD} $@">"
> +else
> +	_lava_multi_node_debug "$FUNCNAME error messageID : " "$result"
> +	exit $LAVA_MULTI_NODE_EXIT_ERROR
> +fi
> +
> +_lava_multi_node_debug "$FUNCNAME finished"
> +
> +}
> +
> +_lava_multi_node_process_message () {
> +
> +_lava_multi_node_debug "$FUNCNAME save message to $LAVA_MULTI_NODE_CACHE"
> +#clean old cache file
> +rm $LAVA_MULTI_NODE_CACHE 2>/dev/null
> +
> +until [ -z "$1" ] ; do
> +	result=$(echo $1 | grep "..*=..*")
> +	if [ "${result}x" != "x" ] ; then
> +		echo $1 >> $LAVA_MULTI_NODE_CACHE
> +	elif [ "${1}x" == "nackx" ] ; then
> +		echo "Error:no-response $1, Exit from $LAVA_MULTI_NODE_API!"
> +		exit $LAVA_MULTI_NODE_EXIT_ERROR
> +	else
> +		echo "Warning:unrecognized message $1"
> +	fi
> +	shift
> +done
> +}
> +
> +lava_multi_node_send () {
> +
> +_lava_multi_node_debug "$FUNCNAME preparing"
> +
> +_lava_multi_node_send $@
> +
> +while [ -n "$MESSAGE_NEED_ACK" ] ; do
> +_lava_multi_node_debug "$FUNCNAME waiting for ack"
> +	read -t $MESSAGE_TIMEOUT line
> +	result=$(echo $line | grep "${MESSAGE_ACK}")
> +	if [ "${result}x" != "x" ] ; then
> +#		echo ${MESSAGE_ACK}
> +		break
> +	fi
> +	_lava_multi_node_send $@
> +done
> +
> +_lava_multi_node_debug "$FUNCNAME finished"
> +
> +}
> +
> +lava_multi_node_wait_for_signal () {
> +
> +_lava_multi_node_debug "$FUNCNAME starting to wait"
> +
> +while read line; do
> +	result=$(echo $line | grep "${MESSAGE_REPLY}>")
> +	if [ "${result}x" != "x" ] ; then
> +		if [ -n "$MESSAGE_NEED_ACK" ] ; then
> +			echo ${MESSAGE_REPLY_ACK}
> +		fi
> +		break
> +	fi
> +done
> +
> +_lava_multi_node_debug "$FUNCNAME waiting over"
> +
> +}
> +
> +lava_multi_node_wait_for_message () {
> +
> +_lava_multi_node_debug "$FUNCNAME starting to wait"
> +
> +if [ -n "$1" ] ; then
> +	export LAVA_MULTI_NODE_CACHE=$1
> +fi
> +
> +while read line; do
> +	result=$(echo $line | grep "${MESSAGE_REPLY}")
> +	if [ "${result}x" != "x" ] ; then
> +		line=${line##*${MESSAGE_REPLY}}
> +		_lava_multi_node_process_message ${line%%>*}
> +		if [ -n "$MESSAGE_NEED_ACK" ] ; then
> +			echo ${MESSAGE_REPLY_ACK}
> +		fi
> +		break
> +	fi
> +done
> +
> +_lava_multi_node_debug "$FUNCNAME waiting over"
> +
> +}
> +
> +lava_multi_node_get_network_info () {
> +
> +_NETWORK_INTERFACE=$1
> +_RAW_STREAM_V4=`ifconfig $_NETWORK_INTERFACE |grep "inet "`
> +_RAW_STREAM_V6=`ifconfig $_NETWORK_INTERFACE |grep "inet6 "`
> +_RAW_STREAM_MAC=`ifconfig $_NETWORK_INTERFACE |grep "ether "`
> +
> +#_IPV4_STREAM=`echo $_RAW_STREAM_V4 | awk '{print "ip="$2" netmask="$4" broadcast="$6}'`
> +#_IPV6_STREAM=`echo $_RAW_STREAM_V6 | awk '{print "ipv6="$2}'`
> +#_MAC_STREAM=`echo $_RAW_STREAM_MAC | awk '{print "mac="$2}'`
> +
> +_IPV4_STREAM_IP=`echo $_RAW_STREAM_V4 | cut -f2 -d" "`
> +_IPV4_STREAM_NM=`echo $_RAW_STREAM_V4 | cut -f4 -d" "`
> +_IPV4_STREAM_BC=`echo $_RAW_STREAM_V4 | cut -f6 -d" "`
> +_IPV4_STREAM="ipv4="$_IPV4_STREAM_IP" netmask="$_IPV4_STREAM_NM" broadcast="$_IPV4_STREAM_BC
> +
> +_IPV6_STREAM_IP=`echo $_RAW_STREAM_V6 | cut -f2 -d" "`
> +_IPV6_STREAM="ipv6="$_IPV6_STREAM_IP
> +
> +_MAC_STREAM="mac="`echo $_RAW_STREAM_MAC | cut -f2 -d" "`
> +
> +_HOSTNAME_STREAM="hostname="`hostname`
> +
> +_HOSTNAME_FULL_STREAM="hostname-full="`hostname -f`
> +
> +_DEF_GATEWAY_STREAM="default-gateway="`route -n |grep "UG "|  cut -f10 -d" "`
> +
> +#get DNS configure
> +let Number=1
> +for line in `cat /etc/resolv.conf | grep "nameserver"| cut -d " " -f 2` ; do
> +	export _DNS_${Number}_STREAM=$line
> +	let Number+=1
> +done
> +_DNS_STREAM="dns_1=${_DNS_1_STREAM} dns_2=${_DNS_2_STREAM} dns_3=${_DNS_3_STREAM}"
> +
> +_get_key_value_pattern $_IPV4_STREAM $_IPV6_STREAM $_MAC_STREAM $_HOSTNAME_STREAM $_HOSTNAME_FULL_STREAM $_DEF_GATEWAY_STREAM $_DNS_STREAM
> +
> +}
> +
> +lava_multi_node_check_cache () {
> +
> +if [ -n "$1" ] ; then
> +	export LAVA_MULTI_NODE_CACHE=$1
> +fi
> +
> +if [ ! -f $LAVA_MULTI_NODE_CACHE ] ; then
> +	_lava_multi_node_debug "$FUNCNAME not cache file $LAVA_MULTI_NODE_CACHE !"
> +	exit $LAVA_MULTI_NODE_EXIT_ERROR
> +fi
> +
> +}
> +
> +lava_multi_node_print_host_info () {
> +
> +_HOSTNAME=$1
> +_INFO=$2
> +_RAW_STREAM=`cat $LAVA_MULTI_NODE_NETWORK_CACHE |grep "$_HOSTNAME:$_INFO="`
> +
> +if [ -n "$_RAW_STREAM" ] ; then
> +	echo $_RAW_STREAM|cut -d'=' -f2
> +fi
> +
> +}
> +
> +lava_multi_node_make_hosts () {
> +
> +for line in `grep ":ipv4" $LAVA_MULTI_NODE_NETWORK_CACHE` ; do
> +	_IP_STREAM=`echo $line | cut -d'=' -f2`
> +	_TARGET_STREAM=`echo $line | cut -d':' -f1`
> +	_HOSTNAME_STREAM=`grep "$_TARGET_STREAM:hostname=" $LAVA_MULTI_NODE_NETWORK_CACHE | cut -d'=' -f2`
> +	if [ -n "$_HOSTNAME_STREAM" ]; then
> +		echo -e "$_IP_STREAM\t$_HOSTNAME_STREAM" >> $1
> +	else
> +		echo -e "$_IP_STREAM\t$_TARGET_STREAM" >> $1
> +	fi
> +done

ditto wrt -e option to echo.

The indentation in this script is completely broken. Please indent it properly
with the same style we use for python code (4 spaces, no tabs).

> +
> +}
> +
> 
> === added file 'lava_test_shell/multi_node/lava-network'
> --- lava_test_shell/multi_node/lava-network	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-network	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,104 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#lava-network
> +#-----------------
> +#Helper script to broadcast IP data from the test image, wait for data
> +#to be received by the rest of the group (or one role within the group)
> +#and then provide an interface to retrieve IP data about the group on
> +#the command line.
> +#
> +#Raising a suitable network interface is a job left for the designer of
> +#the test definition / image but once a network interface is available,
> +#lava-network can be asked to broadcast this information to the rest of
> +#the group. At a later stage of the test, before the IP details of the
> +#group need to be used, call lava-network collect to receive the same
> +#information about the rest of the group.
> +#
> +#All usage of lava-network needs to use a broadcast (which wraps a call
> +#to lava-send) and a collect (which wraps a call to lava-wait-all). As
> +#a wrapper around lava-wait-all, collect will block until the rest of
> +#the group (or devices in the group with the specified role) has made a
> +#broadcast.
> +#
> +#After the data has been collected, it can be queried for any board
> +#specified in the output of lava-group:
> +#
> +#lava-network query server
> +#192.168.3.56
> +#
> +#Usage:
> +#	broadcast network info:
> +#		lava-network broadcast [interface]
> +#	collect network info:
> +#		lava-network collect [interface] <role>
> +#	query specific host info:
> +#		lava-network query [hostname]
> +#	export hosts file:
> +#		lava-network hosts [path of hosts]
> +#
> +#So interface would be mandatory for broadcast and collect, hostname
> +#would be mandatory for query, "path of hosts" would be mandatory for
> +#hosts, role is optional for collect.
> +
> +
> +LAVA_MULTI_NODE_API="LAVA_NETWORK"
> +#MESSAGE_TIMEOUT=5
> +#MESSAGE_NEED_ACK=yes
> +
> +_LAVA_NETWORK_ID="network_info"
> +_LAVA_NETWORK_ARG_MIN=2
> +
> +source $LAVA_TEST_BIN/lava-multi-node.lib
> +
> +LAVA_MULTI_NODE_NETWORK_CACHE="/tmp/lava_multi_node_network_cache.txt"
> +
> +_lava_multi_node_debug "$LAVA_MULTI_NODE_API checking arguments..."
> +if [ $# -lt $_LAVA_NETWORK_ARG_MIN ]; then
> +	_lava_multi_node_debug "$FUNCNAME Not enough arguments."
> +	exit $LAVA_MULTI_NODE_EXIT_ERROR
> +fi
> +
> +_lava_multi_node_debug "$LAVA_MULTI_NODE_API handle sub-command..."
> +case "$1" in
> +	"broadcast")
> +	_lava_multi_node_debug "$LAVA_MULTI_NODE_API handle broadcast command..."
> +	LAVA_MULTI_NODE_API="LAVA_SEND"
> +	MESSAGE_COMMAND="<${LAVA_MULTI_NODE_API}"
> +	export MESSAGE_ACK="<${LAVA_MULTI_NODE_API}_ACK>"
> +	export MESSAGE_REPLY="<${LAVA_MULTI_NODE_API}_COMPLETE"
> +	export MESSAGE_REPLY_ACK="<${LAVA_MULTI_NODE_API}_COMPLETE_ACK>"
> +	export MESSAGE_HEAD="$MESSAGE_PREFIX $MESSAGE_COMMAND"
> +	NETWORK_INFO_STREAM=`lava_multi_node_get_network_info $2`
> +	lava_multi_node_send $_LAVA_NETWORK_ID $NETWORK_INFO_STREAM
> +	;;
> +
> +	"collect")
> +	_lava_multi_node_debug "$LAVA_MULTI_NODE_API handle collect command..."
> +	LAVA_MULTI_NODE_API="LAVA_WAIT_ALL"
> +	MESSAGE_COMMAND="<${LAVA_MULTI_NODE_API}"
> +	export MESSAGE_ACK="<${LAVA_MULTI_NODE_API}_ACK>"
> +	export MESSAGE_REPLY="<${LAVA_MULTI_NODE_API}_COMPLETE"
> +	export MESSAGE_REPLY_ACK="<${LAVA_MULTI_NODE_API}_COMPLETE_ACK>"
> +	export MESSAGE_HEAD="$MESSAGE_PREFIX $MESSAGE_COMMAND"
> +	lava_multi_node_send $_LAVA_NETWORK_ID $3 
> +	lava_multi_node_wait_for_message $LAVA_MULTI_NODE_NETWORK_CACHE
> +	;;
> +
> +	"query")
> +	_lava_multi_node_debug "$LAVA_MULTI_NODE_API handle query command..."
> +	lava_multi_node_check_cache $LAVA_MULTI_NODE_NETWORK_CACHE
> +	lava_multi_node_print_host_info $2 $3
> +	;;
> +
> +	"hosts")
> +	_lava_multi_node_debug "$LAVA_MULTI_NODE_API handle hosts command..."
> +	lava_multi_node_check_cache $LAVA_MULTI_NODE_NETWORK_CACHE
> +	lava_multi_node_make_hosts $2
> +	;;
> +
> +	*)
> +	_lava_multi_node_debug "$LAVA_MULTI_NODE_API command $1 is not supported."
> +	exit $LAVA_MULTI_NODE_EXIT_ERROR
> +	;;
> +esac

ditto wrt indentation

> === added file 'lava_test_shell/multi_node/lava-role'
> --- lava_test_shell/multi_node/lava-role	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-role	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,14 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#Prints the role the current device is playing in a multi-node job.
> +#
> +#Usage: ``lava-role``
> +#
> +#*Example.* In a directory with several scripts, one for each role
> +#involved in the test::
> +#
> +#    $ ./run-`lava-role`.sh
> +
> +echo ${TARGET_ROLE}
> 
> === added file 'lava_test_shell/multi_node/lava-self'
> --- lava_test_shell/multi_node/lava-self	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-self	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,9 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#Prints the name of the current device.
> +#
> +#Usage: ``lava-self``
> +
> +echo ${HOSTNAME}
> 
> === added file 'lava_test_shell/multi_node/lava-send'
> --- lava_test_shell/multi_node/lava-send	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-send	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,17 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#Sends a message to the group, optionally passing associated key-value
> +#data pairs. Sending a message is a non-blocking operation. The message
> +#is guaranteed to be available to all members of the group, but some of
> +#them might never retrieve it.
> +#
> +#Usage: ``lava-send <message-id> [key1=val1 [key2=val2] ...]``
> +LAVA_MULTI_NODE_API="LAVA_SEND"
> +#MESSAGE_TIMEOUT=5
> +#MESSAGE_NEED_ACK=yes
> +
> +source $LAVA_TEST_BIN/lava-multi-node.lib
> +#FIXME: need to match "key=val"

does this FIXME still apply? Do we need to act on it before landing multinode?

> +lava_multi_node_send $1 $(_get_key_value_pattern $@)
> 
> === added file 'lava_test_shell/multi_node/lava-sync'
> --- lava_test_shell/multi_node/lava-sync	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-sync	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,20 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#Global synchronization primitive. Sends a message, and waits for the
> +#same message from all of the other devices.
> +#
> +#Usage: ``lava-sync <message>``
> +#
> +#``lava-sync foo`` is effectively the same as ``lava-send foo`` followed
> +#by ``lava-wait-all foo``.
> +LAVA_MULTI_NODE_API="LAVA_SYNC"
> +#MESSAGE_TIMEOUT=5
> +#MESSAGE_NEED_ACK=yes
> +
> +source $LAVA_TEST_BIN/lava-multi-node.lib
> +
> +lava_multi_node_send $1
> +
> +lava_multi_node_wait_for_message

does this wait for *any* message? we should wait until we receive back the same
message that was passed to lava-sync. Am I missing something?

> === added file 'lava_test_shell/multi_node/lava-wait'
> --- lava_test_shell/multi_node/lava-wait	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-wait	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,21 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#Waits until any other device in the group sends a message with the given
> +#ID. This call will block until such message is sent.
> +#
> +#Usage: ``lava-wait <message-id>``
> +#
> +#If there was data passed in the message, the key-value pairs will be
> +#printed in the standard output, each in one line. If no key values were
> +#passed, nothing is printed.
> +LAVA_MULTI_NODE_API="LAVA_WAIT"
> +#MESSAGE_TIMEOUT=5
> +#MESSAGE_NEED_ACK=yes
> +
> +source $LAVA_TEST_BIN/lava-multi-node.lib
> +
> +lava_multi_node_send $1
> +
> +lava_multi_node_wait_for_message

lava-wait looks too similar to lava-sync ... the only difference I see is the
LAVA_MULTI_NODE_API assignment, which seems to be used only for debugging
purposes. lava-wait shouldn't send anything, just wait.

> === added file 'lava_test_shell/multi_node/lava-wait-all'
> --- lava_test_shell/multi_node/lava-wait-all	1970-01-01 00:00:00 +0000
> +++ lava_test_shell/multi_node/lava-wait-all	2013-08-21 09:44:41 +0000
> @@ -0,0 +1,23 @@
> +#!/bin/sh
> +#
> +#This file is for Multi-Node test
> +#
> +#Waits until **all** other devices in the group send a message with the
> +#given message ID. IF ``<role>`` is passed, only wait until all devices
> +#with that given role send a message.
> +#
> +#``lava-wait-all <message-id> [<role>]``
> +#
> +#If data was sent by the other devices with the message, the key-value
> +#pairs will be printed one per line, prefixed with the device name and
> +#whitespace.
> +LAVA_MULTI_NODE_API="LAVA_WAIT_ALL"
> +#MESSAGE_TIMEOUT=5
> +#MESSAGE_NEED_ACK=yes
> +
> +source $LAVA_TEST_BIN/lava-multi-node.lib
> +
> +lava_multi_node_send $1 $2
> +
> +lava_multi_node_wait_for_message

ditto

> === modified file 'requirements.txt'
> --- requirements.txt	2011-11-14 04:24:37 +0000
> +++ requirements.txt	2013-08-21 09:44:41 +0000
> @@ -1,6 +1,7 @@
>  django
>  django-openid-auth
> -linaro-django-jsonfield
> +pexpect
>  python-openid
>  lockfile
>  python-daemon
> +setproctitle
> 
> === modified file 'setup.py'
> --- setup.py	2013-07-17 09:16:24 +0000
> +++ setup.py	2013-08-21 09:44:41 +0000
> @@ -42,7 +42,7 @@
>              'lava_test_shell/lava-test-runner-android',
>              'lava_test_shell/lava-test-runner-ubuntu',
>              'lava_test_shell/lava-test-shell',
> -        ])
> +            ])

this does not look right

>      ],
>      install_requires=[
>          "json-schema-validator >= 2.3",
>

review needs-fixing

review: Needs Fixing


Neil Williams (codehelp) wrote on 2013-08-23:


On Fri, 23 Aug 2013 00:03:18 -0000
Antonio Terceiro <email address hidden> wrote:

> > This branch applies with conflicts. The conflicts are proposed to be
> > resolved as per this temporary branch:
> > lp:~codehelp/lava-dispatcher/multinode-merge
>
> I don't understand why you didn't resolve the conflict already before
> making the merge proposal ... especially because you already have that
> done somewhere else :-)

> > lava-dispatcher will be merged after dashboard but before
> > scheduler, so that MultiNode jobs can start as soon as the scheduler
> > is ready.
> >
> > Updated: Include missing changes from tip.
> [...]
>
> There were a bunch of file renamings/removals in
> lava_dispatcher/default-config in the diff, I don't think they should
> be here. I'm omitting those from the review. Maybe the merge from
> trunk was not complete, please make sure you review the diff wrt
> trunk before going further with this.

That's why the merge branch is there. The file removals have been fixed
but there are other conflicts which cannot be avoided as bazaar /
launchpad just don't seem to accept that there is no conflict.

> > === modified file 'lava/dispatcher/commands.py'
> > --- lava/dispatcher/commands.py 2013-07-16 15:58:16 +0000
> > +++ lava/dispatcher/commands.py 2013-08-21 09:44:41 +0000
> > @@ -7,7 +7,7 @@
> > from json_schema_validator.errors import ValidationError
> > from lava.tool.command import Command
> > from lava.tool.errors import CommandError
> > -
> > +from lava.dispatcher.node import NodeDispatcher
> > import lava_dispatcher.config
> > from lava_dispatcher.config import get_config, get_device_config, get_devices
> > from lava_dispatcher.job import LavaTestJob, validate_job_data
> > @@ -93,6 +93,7 @@
> > # Set process id if job-id was passed to dispatcher
> > if self.args.job_id:
> > try:
> > + # noinspection PyUnresolvedReferences
>
> what is this?

PyCharm code inspection flag. It allows PyCharm to indicate that the
file is clean when otherwise PyCharm itself would get confused by the
failure to import. It's harmless but I can remove it. (There are lots
of others which I keep in the git versions of these files but this one
slipped into the bazaar copy too.)

> > + node.run()
> > + # the NodeDispatcher has started and closed.
> > + # FIXME: get any error state from nodeDispatcher!
>
> is it OK to land with this issue here unsolved?

I'll remove the FIXME. There were error conditions when NodeDispatcher
used a different form of comms but these no longer apply. There is
nothing else for node.run to return.

> > + json_data = None
> > + polling = False
>
> polling is only used inside poll(), and its value does not depend on
> anything that is outside of that method. make it a local variable
> instead.

Replaced with True as the l...

On Fri, 23 Aug 2013 00:03:18 -0000
Antonio Terceiro <antonio.terceiro@linaro.org> wrote:

> > This branch applies with conflicts. The conflicts are proposed to be
> > resolved as per this temporary branch:
> > lp:~codehelp/lava-dispatcher/multinode-merge
> 
> I don't understand why you didn't resolve the conflict already before
> making the merge proposal ... especially because you already have that
> done somewhere else :-)

What I have somewhere else is just a copy of the merge after manual
resolution of the conflicts. Some of the conflicts are entirely
unavoidable - I'm inserting a new block into an existing function and
launchpad/bazaar consider this a conflict when there is no reason for
it to conflict.
 
> > lava-dispatcher will be merged after dashboard but before
> > scheduler, so that MultiNode jobs can start as soon as the scheduler
> > is ready.
> > 
> > Updated: Include missing changes from tip.
> [...]
> 
> There were a bunch of file renamings/removals in
> lava_dispatcher/default-config in the diff, I don't think they should
> be here. I'm omitting those from the review. Maybe the merge from
> trunk was not complete, please make sure you review the diff wrt
> trunk before going further with this.

> > === modified file 'lava/dispatcher/commands.py'
> > --- lava/dispatcher/commands.py	2013-07-16 15:58:16 +0000
> > +++ lava/dispatcher/commands.py	2013-08-21 09:44:41 +0000
> > @@ -7,7 +7,7 @@
> >  from json_schema_validator.errors import ValidationError
> >  from lava.tool.command import Command
> >  from lava.tool.errors import CommandError
> > -
> > +from lava.dispatcher.node import NodeDispatcher
> >  import lava_dispatcher.config
> >  from lava_dispatcher.config import get_config, get_device_config, get_devices
> >  from lava_dispatcher.job import LavaTestJob, validate_job_data
> > @@ -93,6 +93,7 @@
> >          # Set process id if job-id was passed to dispatcher
> >          if self.args.job_id:
> >              try:
> > +                # noinspection PyUnresolvedReferences
> 
> what is this?

> > +                node.run()
> > +                # the NodeDispatcher has started and closed.
> > +                # FIXME: get any error state from nodeDispatcher!
> 
> is it OK to land with this issue here unsolved?

I'll remove the FIXME. There were error conditions when NodeDispatcher
used a different form of comms but these no longer apply. There is
nothing else for node.run to return.
 
> > +    json_data = None
> > +    polling = False
> 
> polling is only used inside poll(), and its value does not depend on
> anything that is outside of that method.  make it a local variable
> instead.

Replaced with True as the loop exits with break.
 
> > +    # starting value for the delay between polls
> > +    delay = 1
> 
> ditto for delay

Made local.
 
> > +    blocks = 4 * 1024
> > +    # how long between polls (in seconds)
> > +    step = 1
> 
> you might want to call this step_delay or delay_step?

poll_delay - same as the config uses.

> > +    timeout = 0
> > +
> > +    def __init__(self, data_str):
> > +        try:
> > +            self.json_data = json.loads(data_str)
> 
> you are parsing json here, then dumping when creating an instance of
> this class. Why not just pass the data (i.e. a regular
> dictionary) directly?

There is a strict limit on the length of the message.

msg_len = len(msg_str)
        if msg_len > 0xFFFE:
            logging.error("Message was too long to send!")
            return
        c = 0

The string version of the message is used as well as the dict.

There's no saving in the number of times json.dumps is called.
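
As a rough illustration of the point about keeping both forms (a sketch only, not the code in node.py; `check_message_length` and `MAX_MESSAGE_LEN` are names made up for this example, and the limit simply mirrors the 0xFFFE check quoted above):

```python
import json

# Assumption: same limit as the 0xFFFE check quoted above.
MAX_MESSAGE_LEN = 0xFFFE


def check_message_length(msg):
    """Serialise msg once and report whether it fits within the limit.

    Returns (msg_str, ok) so the caller can reuse the string form
    alongside the original dict.
    """
    msg_str = json.dumps(msg)
    return msg_str, len(msg_str) <= MAX_MESSAGE_LEN
```

The length can only be checked on the serialised string, so the string has to exist somewhere, whichever form is passed in.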
 
> > +        """
> > +        Blocking, synchronous polling of the Coordinator on the configured port.
> > +        Single send operations greater than 0xFFFF are rejected to prevent truncation.
> > +        :param msg_str: The message to send to the Coordinator, as a JSON string.
> > +        :return: a JSON string of the response to the poll
> > +        """
> > +        msg_len = len(msg_str)
> > +        if msg_len > 0xFFFE:
> > +            logging.error("Message was too long to send!")
> > +            return
> 
> is just logging an error message enough here? I think the
> probability that the rest of the test logic would depend on this
> message being sent is pretty high, so if it's rejected, shouldn't we
> abort the job at this point?

There's no way for the NodeDispatcher to abort the entire MultiNode job
other than via the timeout. Neither a node nor the coordinator can
cause any other job to fail or abort because we cannot re-enter
job.run().

> > +        self.polling = True
> > +        c = 0
> > +        response = None
> > +        while self.polling:
> > +            c += self.step
> > +            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> > +            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> > +            try:
> > +                s.connect((self.json_data['host'], self.json_data['port']))
> > +                logging.debug("Connecting to LAVA Coordinator on %s:%s" % (self.json_data['host'], self.json_data['port']))
> > +                self.delay = self.step
> > +            except socket.error as e:
> > +                logging.warn("socket error on connect: %d %s %s" %
> > +                             (e.errno, self.json_data['host'], self.json_data['port']))
> > +                time.sleep(self.delay)
> > +                self.delay += 2
> > +                s.close()
> > +                continue
> 
> if there is a problem with the coordinator (e.g. it crashed) and/or
> with the configuration (wrong host, wrong port etc) and connection
> cannot be made, we will loop here until a timeout. Can we be more
> specific in the exception handling there, or it just does not matter?

It won't matter. The timeout is the only means we have of failing the
entire MultiNode job.
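
To make the behaviour concrete, here is a minimal sketch of a poll loop where the timeout is the only exit on failure. It is an illustration under assumptions, not the dispatcher code: `poll_with_timeout`, `attempt` and the injected `sleep` are hypothetical names, and, as in the quoted loop, the timeout is compared against a step counter rather than wall-clock time.

```python
import json


def poll_with_timeout(attempt, timeout, poll_delay=1, sleep=lambda seconds: None):
    """Call attempt() until it returns a reply or the timeout is exceeded.

    attempt: hypothetical callable returning a reply string, or None if
             the connection could not be made.
    sleep:   injected so the sketch runs without really waiting; pass
             time.sleep to get real delays.
    """
    elapsed = 0
    delay = poll_delay
    while True:
        elapsed += poll_delay
        reply = attempt()
        if reply is not None:
            return reply
        # Back off a little more after each failed attempt.
        sleep(delay)
        delay += 2
        if elapsed > timeout:
            # Nothing else can end the loop: report a nack to the caller.
            return json.dumps({"response": "nack"})
```

Whatever the cause of the failure (crashed coordinator, wrong host, wrong port), the caller sees the same nack once the timeout has passed.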

> > self.timeout))
> > +                time.sleep(self.delay)
> > +            # apply the default timeout to each poll operation.
> > +            if c > self.timeout:
> > +                response = json.dumps({"response": "nack"})
> > +                self.polling = False
> > +                break
> > +        return response
> 
> We probably already discussed this, but I feel we could be using
> something like zeromq for this. (we do not have to do anything about
> this right now, just saying :-))

The basic problem with things like zeromq is that those solutions would
require job.run() to be re-entrant. We don't have the ability to
interrupt a job and re-enter where it was interrupted.

> > +    def __call__(self, args):
> > +        """ Makes the NodeDispatcher callable so that the test shell can send messages just using the
> > +        NodeDispatcher object.
> > +        This function blocks until the specified API call returns. Some API calls may involve a
> > +        substantial period of polling.
> > +        :param args: JSON string of the arguments of the API call to make
> > +        :return: A Python object containing the reply dict from the API call
> > +        """
> > +        try:
> > +            return self._select(json.loads(args))
> 
> same here, maybe we could receive the args as a proper dictionary; it
> seems all calls of this method first dump their message to JSON?

That's too large a change at this stage - the origin of all of the data
for those calls is strings sent over serial. If we start changing
things like this, MultiNode will not meet this release cycle.

> > === modified file 'lava_dispatcher/downloader.py'
> > --- lava_dispatcher/downloader.py	2013-07-16 16:08:22 +0000
> > +++ lava_dispatcher/downloader.py	2013-08-21 09:44:41 +0000
> > @@ -41,7 +41,7 @@
> >      process = None
> >      try:
> >          process = subprocess.Popen(
> > -            ['ssh', url.netloc, 'cat', url.path],
> > +            ['nice', 'ssh', url.netloc, 'cat', url.path],
> 
> Is this useful at all? Maybe ionice would be more helpful?

It is massively useful. We had big problems with the load on servers
running MultiNode until selected calls were put under nice.
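
For reference, the pattern amounts to prefixing the argument list handed to subprocess. A small sketch, assuming a hypothetical helper called `niced` (the diff itself just adds 'nice' inline):

```python
def niced(command, enabled=True):
    """Return command prefixed with 'nice' so it runs at lower CPU priority."""
    return ['nice'] + list(command) if enabled else list(command)
```

The result would be passed to subprocess.Popen in place of the bare command, for example `niced(['ssh', url.netloc, 'cat', url.path])`.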

> > +        self.context.assign_transport(transport)
> 
> this is probably not needed (see below).
> 
> > @@ -130,8 +141,11 @@
> >          self.testdefs_by_uuid = testdefs_by_uuid
> >          self._test_run_data = []
> >          self._cur_handler = None
> > +        self.context = None
> 
> context is only used to query its transport attribute. Maybe you
> could just pass in the transport instead?

transport is used in various places, the common object to each is the
context.

> > +        self.connection = None
> >  
> > -    def signal(self, name, params):
> > +    def signal(self, name, params, context=None):
> > +        self.context = context
> 
> I don't think you should assign self.context as a side effect of
> calling signal() ... maybe this would be better if assigned during
> object initialization (__init__)?

Done
 
> >          handler = getattr(self, '_on_' + name, None)
> >          if not handler and self._cur_handler:
> >              handler = self._cur_handler.custom_signal
> > @@ -141,6 +155,11 @@
> >                  handler(*params)
> >              except:
> >                  logging.exception("handling signal %s failed", name)
> > +                return False
> > +            return True
> > +
> > +    def setConnection(self, connection):
> > +        self.connection = connection
> 
> I think in our naming style this should be set_connection instead.

Done
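
Roughly, the shape after these two changes is as below. This is a simplified sketch and not the class in signals/__init__.py: the class name is invented, the fallback to a custom handler is left out, and returning False for an unknown signal is an assumption of the example.

```python
import logging


class SignalDirectorSketch:
    """Illustrative only: not the class from signals/__init__.py."""

    def __init__(self, context=None):
        # The context is fixed at construction time, not on each signal.
        self.context = context
        self.connection = None

    def set_connection(self, connection):
        self.connection = connection

    def signal(self, name, params):
        # Look up a method named _on_<name>, as the quoted code does.
        handler = getattr(self, '_on_' + name, None)
        if handler is None:
            return False
        try:
            handler(*params)
        except Exception:
            logging.exception("handling signal %s failed", name)
            return False
        return True
```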

> > +        reply = self.context.transport(json.dumps(msg))
> > +        if reply == "nack":
> > +            raise FailedCall("LAVA_SEND nack")
> > +
> > +    def _on_SYNC(self, message_id):
> > +        if not self.connection:
> > +            logging.error("No connection available for on_SYNC")
> > +            return
> > +        logging.debug("Handling signal <LAVA_SYNC %s>" % message_id)
> > +        msg = {"request": "lava_sync", "messageID": message_id, "message": None}
> > +        reply = self.context.transport(json.dumps(msg))
> > +        message_str = ""
> > +        if reply == "nack":
> > +#            raise FailedCall("LAVA_SYNC nack")
> > +            message_str = " nack"
> > +#        elif reply == "TIMEOUT":
> > +#            raise FailedCall("LAVA_SYNC TIMEOUT")
> > +#            message_str = " TIMEOUT"
> 
> Remove these commented lines? We have version control for a reason :-)

Done.

> > 'test-config', 'bin', 'fake-qemu')
> >          target = _create_qemu_target({ 'qemu_binary': fake_qemu })
> > +=======
> > +        # noinspection PyUnresolvedReferences
> 
> what is this? (2)

Removed
 
> Also, you guys mentioned that decompressing consumes a lot of
> resources, but AFAICT it is being done inside the python process and
> cannot be niced

There is a lot of compression/decompression done using tar -z which is
niced. The point is that with nice, other operations (like apache and
lava-server) get a chance to do stuff too.

> > +#Usage: ``lava-group``
> > +#
> > +#The output format contains one line per device, and each line contains
> > +#the hostname and the role that device is playing in the test, separated
> > +#by a TAB character::
> > +#
> > +#	panda01	client
> > +#	highbank01	loadbalancer
> > +#	highbank02	backend
> > +#	highbank03	backend
> > +
> > +echo -e ${LAVA_GROUP}
> 
> you do not need -e here. BTW -e is not supported in a bunch of shells
> (e.g. dash, the default /bin/sh on Debian and Ubuntu) ...

Fu is looking at that.

> The indentation in this script is completely broken. Please indent it
> properly with the same style we use for python code (4 spaces, no
> tabs).

Done

> > +source $LAVA_TEST_BIN/lava-multi-node.lib
> > +#FIXME: need to match "key=val"
> 
> does this FIXME still apply? Do we need to act on it before landing
> multinode?

Removed - it is already fixed.
 
> > +lava_multi_node_send $1
> > +
> > +lava_multi_node_wait_for_message
> 
> does this wait for *any* message? we should wait until we receive
> back the same message that was passed to lava-sync. Am I missing
> something?
 
The shell code doesn't need to care about this, it's handled in the
python. The shell can only wait for one thing at a time, we are not
running an event loop inside the test image. The board is either
waiting or it's running. It's not possible for the board to be waiting
for two messages at the same time.

Two or more messages can be available for any one board at any one time
inside the coordinator but the board itself can only wait for one
message at a time. The message itself is matched according to $1 which
is passed directly to the signal handler, in the python. To pick up the
other messages, the test definition needs to issue a new lava-wait with
that messageID.
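
A toy model of that arrangement, purely for illustration (the real Coordinator is not shown in this diff; `MessageStore` and its methods are invented names): several messages can be held at once, but each wait asks for exactly one messageID.

```python
class MessageStore:
    """Toy stand-in for the Coordinator's per-group message storage."""

    def __init__(self):
        # messageID -> {sender: data}
        self._messages = {}

    def send(self, message_id, sender, data=None):
        self._messages.setdefault(message_id, {})[sender] = data or {}

    def wait(self, message_id):
        """Return the data for message_id, or None if nobody sent it yet."""
        return self._messages.get(message_id)
```

A board that wants the second message has to ask again with that messageID; nothing is delivered to it unprompted.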

> > values were +#passed, nothing is printed.
> > +LAVA_MULTI_NODE_API="LAVA_WAIT"
> > +#MESSAGE_TIMEOUT=5
> > +#MESSAGE_NEED_ACK=yes
> > +
> > +source $LAVA_TEST_BIN/lava-multi-node.lib
> > +
> > +lava_multi_node_send $1
> > +
> > +lava_multi_node_wait_for_message
> 
> lava-wait looks too similar to lava-sync ... the only difference I
> see is the LAVA_MULTI_NODE_API assignment, which seems to be used
> only for debugging purposes. lava-wait shouldn't send anything, just
> wait.

lava-wait has to send a messageID, that's the point above, but it's
handled in the python, not in the shell. lava-sync is just a lava-send
followed by a lava-wait for the same messageID.

> > -        ])
> > +            ])
> 
> this does not look right

pep8 indentation, yes, setup.py is right.

I'll update lava-dispatcher/multinode with the above changes.

Neil Williams
=============
http://www.linux.codehelp.co.uk/

lp://qastaging/lava-dispatcher/multinode updated on 2013-08-23

687. By Neil Williams on 2013-08-23: Fu Wei 2013-08-23 Fix the bug of disabling debug info in multi-node API
shell scripts.


Antonio Terceiro (terceiro) wrote on 2013-08-23:

> > > values were +#passed, nothing is printed.
> > > +LAVA_MULTI_NODE_API="LAVA_WAIT"
> > > +#MESSAGE_TIMEOUT=5
> > > +#MESSAGE_NEED_ACK=yes
> > > +
> > > +source $LAVA_TEST_BIN/lava-multi-node.lib
> > > +
> > > +lava_multi_node_send $1
> > > +
> > > +lava_multi_node_wait_for_message
> >
> > lava-wait looks too similar to lava-sync ... the only difference I
> > see is the LAVA_MULTI_NODE_API assignment, which seems to be used
> > only for debugging purposes. lava-wait shouldn't send anything, just
> > wait.
>
> lava-wait has to send a messageID, that's the point above, but it's
> handled in the python, not in the shell. lava-sync is just a lava-send
> followed by a lava-wait for the same messageID.

My point is that both lava-sync and lava-wait are at the moment
implemented in exactly the same way: first lava_multi_node_send $1, then
lava_multi_node_wait_for_message. In my understanding they do the very
same thing. Maybe I still don't understand the mechanics, or maybe this
means we don't actually need both.

lp://qastaging/lava-dispatcher/multinode updated on 2013-08-27

688. By Neil Williams on 2013-08-24: Neil Williams 2013-08-23 Drop PyCharm inspection comment.
Neil Williams 2013-08-23 Set the context once in __init__ instead of each
time in the handler.
Neil Williams 2013-08-23 Set the context once in __init__ instead of each
time in the handler. Rename setConnection to set_connection.
Neil Williams 2013-08-23 Drop PyCharm inspection comment.
Neil Williams 2013-08-23 Move delay into a local function variable.
689. By Neil Williams on 2013-08-24: Neil Williams 2013-08-23 Add the stream parameter to the pending call so
that it can be passed down to put_pending over xmlrpc.
Neil Williams 2013-08-23 Add the stream parameter to the pending call
so that it can be passed down to put_pending over xmlrpc.
Switch to simplejson for the bundle parsing which makes a better
job of result bundles with Decimal(0.0) output.
690. By Neil Williams on 2013-08-24: Neil Williams 2013-08-23 Add debugging help page.
Neil Williams 2013-08-23 Add details of how to use the API in the use case.
Neil Williams 2013-08-23 Add hrefs and a section on installing packages.
Neil Williams 2013-08-23 Add debugging page and section on balancing
timeouts.
Neil Williams 2013-08-23 Add internal hrefs to make it easier to link
from examples.
691. By Neil Williams on 2013-08-27: Merge from tip 657:
[Tyler Baker] Implemented deploy_linaro_kernel in qemu class. Refactored dispatcher code for easier reuse. Added lava-test-shell ability to bootloader targets.
692. By Neil Williams on 2013-08-27: Neil Williams 2013-08-27 Fix LNG result bundle aggregation error with
measurements.

Neil Williams (codehelp) wrote on 2013-08-28:

> > >
> > > lava-wait looks too similar to lava-sync ... the only difference I
> > > see is the LAVA_MULTI_NODE_API assignment, which seems to be used
> > > only for debugging purposes. lava-wait shouldn't send anything, just
> > > wait.
> >
> > lava-wait has to send a messageID, that's the point above, but it's
> > handled in the python, not in the shell. lava-sync is just a lava-send
> > followed by a lava-wait for the same messageID.
>
> My point is that both lava-sync and lava-wait are at the moment
> implemented in exactly the same way: first lava_multi_node_send $1, then
> lava_multi_node_wait_for_message. In my understanding they do the very
> same thing. Maybe I still don't understand the mechanics, or maybe this
> means we don't actually need both.

LAVA_SYNC and LAVA_WAIT (as LAVA_MULTI_NODE_API) are handled very differently inside the LAVA Coordinator. Wait sends an Ack as soon as any device in the group sends the requested messageID. Sync requires that *all* devices in the group send the requested messageID before the Ack is sent; until then it causes the NodeDispatcher to wait.

It's the difference between lava-wait and lava-wait-all.

lava-wait-all is needed because it handles roles but also because it is very likely that tests will want to issue a lava-send at some point and only use lava-wait a little bit later.

So whilst the handling inside signals/__init__.py can be the same apart from the API designation, that doesn't mean that the operation will do the same thing during a test.
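
To make the distinction concrete, here is a rough sketch of the two behaviours as described above. It is not the LAVA Coordinator code: ToyCoordinator, its send and poll methods and the "ack"/"wait" return values are all hypothetical names made up for illustration, and the model assumes that a sync request counts as that device's own send of the messageID.

```python
class ToyCoordinator:
    """Hypothetical in-memory stand-in for the coordinator (illustration only)."""

    def __init__(self, group):
        self.group = set(group)  # every device name in the group
        self.sent = {}           # messageID -> set of devices that have sent it

    def send(self, device, message_id):
        # Record that this device has sent the messageID.
        self.sent.setdefault(message_id, set()).add(device)

    def poll(self, api, device, message_id):
        """Return "ack" if the request can complete now, otherwise "wait"."""
        if api == "LAVA_SYNC":
            # Assumption: a sync is a send followed by a wait on the same ID,
            # and the Ack is held back until *all* devices have sent it.
            self.send(device, message_id)
            done = self.sent[message_id] == self.group
        elif api == "LAVA_WAIT":
            # Ack as soon as *any* device in the group has sent the ID.
            done = bool(self.sent.get(message_id))
        else:
            raise ValueError("unknown API: %s" % api)
        return "ack" if done else "wait"


coord = ToyCoordinator(["client", "server"])

# Wait: one sender is enough.
coord.send("server", "ready")
wait_result = coord.poll("LAVA_WAIT", "client", "ready")

# Sync: the first device has to keep waiting until the second one arrives.
first_sync = coord.poll("LAVA_SYNC", "client", "start")
second_sync = coord.poll("LAVA_SYNC", "server", "start")
retry_sync = coord.poll("LAVA_SYNC", "client", "start")
```

In this toy model the request looks identical on the device side in both cases; only the API designation changes how the coordinator decides when to release the Ack.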

Antonio Terceiro (terceiro) wrote on 2013-08-28:

On Wed, Aug 28, 2013 at 10:48:48AM -0000, Neil Williams wrote:
> LAVA_SYNC and LAVA_WAIT (as LAVA_MULTI_NODE_API) are handled very
> differently inside the LAVA Coordinator.

OK, this is the bit I was missing. Thanks for the explanation! :-)

lp://qastaging/lava-dispatcher/multinode updated on 2013-08-28

693. By Neil Williams on 2013-08-28: Fu Wei 2013-08-27 multinode review: Fix a bug about using 'printf'
Fu Wei 2013-08-26 multinode review: make 'waiting message ack' be
supported, only if the shell script is running in bash
Fu Wei 2013-08-26 multinode review: delete some bash-specific syntax
for multinode API shell scripts.