Merge into trunk-old : bug-317644 : Code : Bazaar

Status:

Rejected

Rejected by:

John A Meinel on 2009-10-06

Proposed branch:

lp://qastaging/~wjl/bzr/bug-317644

Merge into:

lp://qastaging/~bzr/bzr/trunk-old

Diff against target:

12 lines

To merge this branch:

bzr merge lp://qastaging/~wjl/bzr/bug-317644

Medium

Fix Released

Link a bug report

Reviewer	Review Type	Date Requested	Status
John A Meinel			Needs Resubmitting on 2009-06-02
Review via email: mp+6978@code.qastaging.launchpad.net

This proposal supersedes a proposal from 2009-03-03.

Revision history for this message

Wesley J. Landaker (wjl) wrote on 2009-03-03: Posted in a previous version of this proposal

#

This fix is a simple and obvious one-liner. It would be great to get this merged.

Revision history for this message

Robert Collins (lifeless) wrote on 2009-05-06: Posted in a previous version of this proposal

#

Seems ok to me; having a test demonstrating it would be better though.

review: Approve

Revision history for this message

John A Meinel (jameinel) wrote on 2009-06-02: Posted in a previous version of this proposal

#

Download full text (3.9 KiB)

I went to add a test case, to ensure that it really did work correctly, unfortunately I found that it does not.

Part of the problem is that on Windows, python uses UCS-2 internally (16-bit wide characters) while on Linux it generally uses UCS-4 (32-bit wide characters).

As such, len(u'\U0010FFFF') == 2 on Windows and 1 on Linux.

When I started adding simple tests, it turned out that _unicode_re.sub() does very strange things. The test I wanted to add to bzrlib/tests/test_xml.py was simple enough:

+ def test_non_bmp_unicode(self):
+ uni_str = u'\xb5\u0090\U0010abcd'
+ self.assertEqual('µ􊯍',
+ xml8._encode_and_escape(uni_str))

However, the issues with wide characters complicate things.

Specifically, if I just use the change proposed, I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\udfcd' in position 20: ordinal not in range(128)

Digging into it, I think it is actually a bug in python2.5's Unicode regex _match functionality. In that if you do:

import re
_unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')
def print_matches(match):
    val = match.group()
    print repr(val)
    return "&#%d;" % ord(val)

_unicode_re.sub(print_matches, u'\U0010abcd')

It prints out:
u'\udbea'
u'&#56298;\udfcd'

Notice that it *only* matched the first character, and didn't do *any* processing on the second character.

If you change it to:
_unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
_unicode_re.sub(print_matches, u'\U0010abcd')

You get:
u'\udbea'
u'\udfcd'
u'&#56298;&#57293;'

Note, however, that if I run the same tests on Linux I get:
u'\U0010abcd'
'􊯍'
versus
# No match
u'\U0010abcd'

My best guess is that we need to check the width of u'\U0010ffff' and respond accordingly. For example:

  if _unicode_re is None:
    if len(u'\U0010ffff') == 2:
      # this platform represents non BMP characters using 2 Py_UNICODE chars, so all
      # individual bytes are from 0 => ffff
      _unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
    else:
      # \U0010ffff is a single Py_UNICODE char, so match it as such
      _unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')

And then the test needs to check the validity of both forms:

+ def test_non_bmp_unicode(self):
+ uni_str = u'\U0010abcd'
+ if len(uni_str) == 2:
+ self.assertEqual('&#56298;&#57293;',
+ xml8._encode_and_escape(uni_str))
+ else:
+ self.assertEqual('􊯍',
+ xml8._encode_and_escape(uni_str))

However, this also implies that we need to make sure we can translate those
values *back* to their non-normalized form.

Even more worrisome, though, is that now 2 people generating the same identical
inventory will be generating different byte strings (based on whether the
inventory was generated from Windows or Linux.)

This has a fairly serious implication, as it means that line-based deltas won't
always cleanly apply if 2 people have different base texts.

This won't come into play often, because if *I* generate a text, and you *copy*
it from me, things will be fine. However, 'bzr upgrade' (for exam...

I went to add a test case, to ensure that it really did work correctly, unfortunately I found that it does not.

Part of the problem is that on Windows, python uses UCS-2 internally (16-bit wide characters) while on Linux it generally uses UCS-4 (32-bit wide characters).

As such, len(u'\U0010FFFF') == 2 on Windows and 1 on Linux.

When I started adding simple tests, it turned out that _unicode_re.sub() does very strange things. The test I wanted to add to bzrlib/tests/test_xml.py was simple enough:

+    def test_non_bmp_unicode(self):
+        uni_str = u'\xb5\u0090\U0010abcd'
+        self.assertEqual('&#181;&#144;&#1092557;',
+                         xml8._encode_and_escape(uni_str))

However, the issues with wide characters complicate things.

Specifically, if I just use the change proposed, I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\udfcd' in position 20: ordinal not in range(128)

Digging into it, I think it is actually a bug in python2.5's Unicode regex _match functionality. In that if you do:

import re
_unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')
def print_matches(match):
    val = match.group()
    print repr(val)
    return "&#%d;" % ord(val)

_unicode_re.sub(print_matches, u'\U0010abcd')

It prints out:
u'\udbea'
u'&#56298;\udfcd'

Notice that it *only* matched the first character, and didn't do *any* processing on the second character.

If you change it to:
_unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
_unicode_re.sub(print_matches, u'\U0010abcd')

You get:
u'\udbea'
u'\udfcd'
u'&#56298;&#57293;'

Note, however, that if I run the same tests on Linux I get:
u'\U0010abcd'
'&#1092557;'
versus
# No match
u'\U0010abcd'

My best guess is that we need to check the width of u'\U0010ffff' and respond accordingly. For example:

if _unicode_re is None:
    if len(u'\U0010ffff') == 2:
      # this platform represents non BMP characters using 2 Py_UNICODE chars, so all 
      # individual bytes are from 0 => ffff
      _unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
    else:
      # \U0010ffff is a single Py_UNICODE char, so match it as such
      _unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')

And then the test needs to check the validity of both forms:

+    def test_non_bmp_unicode(self):
+        uni_str = u'\U0010abcd'
+        if len(uni_str) == 2:
+           self.assertEqual('&#56298;&#57293;',
+                            xml8._encode_and_escape(uni_str))
+        else:
+           self.assertEqual('&#1092557;',
+                            xml8._encode_and_escape(uni_str))

However, this also implies that we need to make sure we can translate those
values *back* to their non-normalized form.

Even more worrisome, though, is that now 2 people generating the same identical
inventory will be generating different byte strings (based on whether the
inventory was generated from Windows or Linux.)

This has a fairly serious implication, as it means that line-based deltas won't
always cleanly apply if 2 people have different base texts.

This won't come into play often, because if *I* generate a text, and you *copy*
it from me, things will be fine. However, 'bzr upgrade' (for example) will
regenerate the inventory texts. And if one person upgrades on Windows, and
another upgrades on Linux, they will end up with repositories with slightly
different texts for a given revision.
And then if Windows fetches a delta inventory from Linux, it will find out that
it thinks it has a corrupt line delta, because the sha1 of the expanded text
won't match.

So as I did deeper, I think we just need a better method for escaping non BMP
characters. First we need a mechanism for detecting them, and then a way to
losslessly encode them.

Note that regex we used was copied from the ElementTree code, which also only
uses \u0080-\uffff, which means that we may not round trip the non BMP
characters anyway, since our underlying XML library probably gets it wrong.

review: Needs Fixing

Revision history for this message

John A Meinel (jameinel) wrote on 2009-06-02:

#

Sorry about messing up the request status. I was confused by Launchpad's Code Review.

I wanted to *mark* the request as needing to be resubmitted. But if you select "Resubmit" as the status, it actually just goes ahead and resubmits it....

Anyway, as I mentioned in my last post, the unfortunate truth is that the "simple and obvious one-liner" fails, because of Unicode cross-platform problems.

So for now, I'd rather we didn't support it at all, rather than open ourselves up to accidental corruption issues.

I'll also note that --development6-rich-root (what will likely become the default format for Bazaar 2.0) probably won't suffer the same issue. As it doesn't try to use XML for its inventory layer, and can use raw UTF-8 for those strings.

Perhaps it is just simply better to say "non BMP chars are not supported for formats < --development6".

review: Needs Resubmitting

Unmerged revisions

4288. By Wesley J. Landaker on 2009-04-13: Fix escaping to handle all Unicode characters, not just those in the BMP.
4287. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-11: (vila)(trivial) Cleanup test imports and use features to better track
skipped tests.
4286. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-11: (vila) Fix realm extraction for http basic authentication
4285. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-11: (Jelmer, vila) Support registration of fallback credential stores.
4284. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-10: (Jelmer) Prompt for user names for http if they are not in the
configuration.
4283. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-10: (vila) (trivial) Fix typo and a misleading signature
4282. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-10: (Jelmer) Allow empty list of authors if there is no committer set.
4281. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-09: (Jelmer) Add the dpush command.
4280. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-09: (jam) Change _fetch_uses_deltas = False for CHK repos until we can
write a better fix.
4279. By Canonical.com Patch Queue Manager <email address hidden> on 2009-04-09: (Matt Nordhoff) Fix a typo in the launchpad plugin's help

1	=== modified file 'bzrlib/xml8.py'
2	--- bzrlib/xml8.py 2009-07-07 04:32:13 +0000
3	+++ bzrlib/xml8.py 2009-08-31 04:38:05 +0000
4	@@ -52,7 +52,7 @@
5	if _utf8_re is None:
6	_utf8_re = re.compile('[&<>\'\"]\|[\x80-\xff]+')
7	if _unicode_re is None:
8	- _unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
9	+ _unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')
10
11
12	def _unicode_escape_replace(match, _map=_xml_escape_map):

Bazaar

Merge lp://qastaging/~wjl/bzr/bug-317644 into lp://qastaging/~bzr/bzr/trunk-old

Commit message

Description of the change

Unmerged revisions

Preview Diff

Subscribers