Merge lp://qastaging/~jameinel/bzr/1.17-gc-single-mem into lp://qastaging/~bzr/bzr/trunk-old

Proposed by John A Meinel
Status: Merged
Approved by: Ian Clatworthy
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp://qastaging/~jameinel/bzr/1.17-gc-single-mem
Merge into: lp://qastaging/~bzr/bzr/trunk-old
Diff against target: 136 lines
To merge this branch: bzr merge lp://qastaging/~jameinel/bzr/1.17-gc-single-mem
Reviewer          Review Type    Date Requested    Status
Andrew Bennetts                                    Approve
Review via email: mp+7768@code.qastaging.launchpad.net
John A Meinel (jameinel) wrote:

This patch finishes up the work I started with "commit" and memory consumption.

After this patch, when doing 'bzr commit' in a --2a format repository, we no longer hold more than one copy of the file text in memory.

So --2a now only holds:
  1 copy of the file text
  2 copies of the compressed bytes
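
The mechanism (visible in the groupcompress.py hunk in the diff below) is to hand the compressor's chunk list straight to the block and compress it with a streaming zlib.compressobj, rather than joining the chunks into one big string first. As a rough standalone sketch of that idea (the compress_chunks helper here is illustrative, not bzrlib API):

    import zlib

    def compress_chunks(chunks):
        # Feed each chunk to a streaming compressor instead of joining
        # them into one big string first, so we never hold a second
        # full copy of the text while compressing.
        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
        compressed = [compressor.compress(chunk) for chunk in chunks]
        compressed.append(compressor.flush())
        return b''.join(compressed)

    # The streamed result is byte-for-byte the same as compressing the
    # joined text in one shot; the new test in the diff checks the same
    # property at the block level.
    chunks = [b'this is some content\n', b'this content will be compressed\n']
    assert compress_chunks(chunks) == zlib.compress(b''.join(chunks))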

For committing a single large file with lots of small lines, this is:

  bzr.dev (last week)   457324 KB   6.164s
  bzr.dev (_add_text)   215996 KB   3.980s
  this code             123904 KB   3.863s

So this code at least isn't slower, and it cuts peak memory for a large-file commit by a good amount: roughly 43% below the _add_text version, and about 73% below last week's bzr.dev.

https://bugs.edge.launchpad.net/bzr/+bug/109114

I also checked the time for 'bzr pack' and couldn't see any major change, so the patch doesn't seem to make anything slower, while it decreases peak memory consumption during commit by quite a bit.

(Note that the memory consumed during 'bzr pack' itself is unaffected and is still considerably higher (536MiB). I'm guessing it is dominated by 'create_delta_index', which peaks at something like 2x the bytes while building and re-shaping the index.)

Andrew Bennetts (spiv) wrote:

John A Meinel wrote:
[...]
> After this patch, when doing 'bzr commit' in a --2a format repository, we no longer hold more than one copy of the file text in memory.
>
> So --2a now only holds:
> 1 file text
> 2 copies of the compressed bytes
>
> For committing a single large file with lots of small lines, this is:
>
> bzr.dev (last week) 457324 KB 6.164s
> bzr.dev (_add_text) 215996 KB 3.980s
> this code 123904 KB 3.863s
>
> So this code at least isn't slower, and it cuts the memory for a large file commit by a good amount.

Nice!

Simple patch, too.

 review approve

It would be nice to have some sort of automated test for memory consumption,
but perhaps usertest is a better place for that than bzr selftest.

-Andrew.

review: Approve
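
On Andrew's point about an automated memory check: something along the following lines could live in a usertest-style harness rather than bzr selftest. This is only an illustrative sketch, not part of bzr; the 'bzr commit' invocation and the 200 MB threshold are assumptions, and ru_maxrss units are platform-specific (KiB on Linux).

    import resource
    import subprocess
    import sys

    def peak_child_rss_kib(cmd):
        # Run cmd to completion, then ask the kernel for the largest
        # resident set size among waited-for children (KiB on Linux).
        subprocess.check_call(cmd)
        return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

    if __name__ == '__main__':
        # Hypothetical check: committing the large benchmark file should
        # stay well below the ~450 MB that last week's bzr.dev needed.
        peak = peak_child_rss_kib(['bzr', 'commit', '-m', 'large file'])
        print('peak RSS: %d KiB' % peak)
        sys.exit(0 if peak < 200 * 1024 else 1)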

Preview Diff

=== modified file 'NEWS'
--- NEWS 2009-06-22 21:55:37 +0000
+++ NEWS 2009-06-23 03:35:22 +0000
@@ -43,9 +43,10 @@
   (Martin Pool, #339385)
 
 * Reduced memory consumption during ``bzr commit`` of large files. For
-  pre 2a formats, should be down to ~3x the size of a file, and for
-  ``--2a`` formats should be down to exactly 2x the size. Related to bug
-  #109114. (John Arbash Meinel)
+  pre 2a formats, should be down to ~3x the size of a file.
+  For ``--2a`` format repositories, it is down to the size of the file
+  content plus the size of the compressed text. Related to bug #109114.
+  (John Arbash Meinel)
 
 * Repositories using CHK pages (which includes the new 2a format) will no
   longer error during commit or push operations when an autopack operation

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2009-06-22 15:47:25 +0000
+++ bzrlib/groupcompress.py 2009-06-23 03:35:23 +0000
@@ -108,6 +108,7 @@
         self._z_content_length = None
         self._content_length = None
         self._content = None
+        self._content_chunks = None
 
     def __len__(self):
         # This is the maximum number of bytes this object will reference if
@@ -137,6 +138,10 @@
                 % (num_bytes, self._content_length))
         # Expand the content if required
         if self._content is None:
+            if self._content_chunks is not None:
+                self._content = ''.join(self._content_chunks)
+                self._content_chunks = None
+        if self._content is None:
             if self._z_content is None:
                 raise AssertionError('No content to decompress')
             if self._z_content == '':
@@ -273,22 +278,55 @@
             bytes = apply_delta_to_source(self._content, content_start, end)
         return bytes
 
+    def set_chunked_content(self, content_chunks, length):
+        """Set the content of this block to the given chunks."""
+        # If we have lots of short lines, it is may be more efficient to join
+        # the content ahead of time. If the content is <10MiB, we don't really
+        # care about the extra memory consumption, so we can just pack it and
+        # be done. However, timing showed 18s => 17.9s for repacking 1k revs of
+        # mysql, which is below the noise margin
+        self._content_length = length
+        self._content_chunks = content_chunks
+        self._content = None
+        self._z_content = None
+
     def set_content(self, content):
         """Set the content of this block."""
         self._content_length = len(content)
         self._content = content
         self._z_content = None
 
+    def _create_z_content_using_lzma(self):
+        if self._content_chunks is not None:
+            self._content = ''.join(self._content_chunks)
+            self._content_chunks = None
+        if self._content is None:
+            raise AssertionError('Nothing to compress')
+        self._z_content = pylzma.compress(self._content)
+        self._z_content_length = len(self._z_content)
+
+    def _create_z_content_from_chunks(self):
+        compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION)
+        compressed_chunks = map(compressor.compress, self._content_chunks)
+        compressed_chunks.append(compressor.flush())
+        self._z_content = ''.join(compressed_chunks)
+        self._z_content_length = len(self._z_content)
+
+    def _create_z_content(self):
+        if self._z_content is not None:
+            return
+        if _USE_LZMA:
+            self._create_z_content_using_lzma()
+            return
+        if self._content_chunks is not None:
+            self._create_z_content_from_chunks()
+            return
+        self._z_content = zlib.compress(self._content)
+        self._z_content_length = len(self._z_content)
+
     def to_bytes(self):
         """Encode the information into a byte stream."""
-        compress = zlib.compress
-        if _USE_LZMA:
-            compress = pylzma.compress
-        if self._z_content is None:
-            if self._content is None:
-                raise AssertionError('Nothing to compress')
-            self._z_content = compress(self._content)
-            self._z_content_length = len(self._z_content)
+        self._create_z_content()
         if _USE_LZMA:
             header = self.GCB_LZ_HEADER
         else:
@@ -762,10 +800,9 @@
         #       for 'commit' down to ~1x the size of the largest file, at a
         #       cost of increased complexity within this code. 2x is still <<
         #       3x the size of the largest file, so we are doing ok.
-        content = ''.join(self.chunks)
+        self._block.set_chunked_content(self.chunks, self.endpoint)
         self.chunks = None
         self._delta_index = None
-        self._block.set_content(content)
         return self._block
 
     def pop_last(self):

=== modified file 'bzrlib/tests/test_groupcompress.py'
--- bzrlib/tests/test_groupcompress.py 2009-06-10 03:56:49 +0000
+++ bzrlib/tests/test_groupcompress.py 2009-06-23 03:35:23 +0000
@@ -363,6 +363,15 @@
         raw_bytes = zlib.decompress(remaining_bytes)
         self.assertEqual(content, raw_bytes)
 
+        # we should get the same results if using the chunked version
+        gcb = groupcompress.GroupCompressBlock()
+        gcb.set_chunked_content(['this is some content\n'
+                                 'this content will be compressed\n'],
+                                len(content))
+        old_bytes = bytes
+        bytes = gcb.to_bytes()
+        self.assertEqual(old_bytes, bytes)
+
     def test_partial_decomp(self):
         content_chunks = []
         # We need a sufficient amount of data so that zlib.decompress has