Merge lp://qastaging/~wjl/bzr/bug-317644 into lp://qastaging/~bzr/bzr/trunk-old

Proposed by John A Meinel
Status: Rejected
Rejected by: John A Meinel
Proposed branch: lp://qastaging/~wjl/bzr/bug-317644
Merge into: lp://qastaging/~bzr/bzr/trunk-old
Diff against target: 12 lines
To merge this branch: bzr merge lp://qastaging/~wjl/bzr/bug-317644
Reviewer Review Type Date Requested Status
John A Meinel Needs Resubmitting
Review via email: mp+6978@code.qastaging.launchpad.net

This proposal supersedes a proposal from 2009-03-03.

To post a comment you must log in.
Revision history for this message
Wesley J. Landaker (wjl) wrote : Posted in a previous version of this proposal

This fix is a simple and obvious one-liner. It would be great to get this merged.

Revision history for this message
Robert Collins (lifeless) wrote : Posted in a previous version of this proposal

Seems ok to me; having a test demonstrating it would be better though.

review: Approve
Revision history for this message
John A Meinel (jameinel) wrote : Posted in a previous version of this proposal
Download full text (3.9 KiB)

I went to add a test case, to ensure that it really did work correctly, unfortunately I found that it does not.

Part of the problem is that on Windows, python uses UCS-2 internally (16-bit wide characters) while on Linux it generally uses UCS-4 (32-bit wide characters).

As such, len(u'\U0010FFFF') == 2 on Windows and 1 on Linux.

When I started adding simple tests, it turned out that _unicode_re.sub() does very strange things. The test I wanted to add to bzrlib/tests/test_xml.py was simple enough:

+ def test_non_bmp_unicode(self):
+ uni_str = u'\xb5\u0090\U0010abcd'
+ self.assertEqual('µ􊯍',
+ xml8._encode_and_escape(uni_str))

However, the issues with wide characters complicate things.

Specifically, if I just use the change proposed, I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\udfcd' in position 20: ordinal not in range(128)

Digging into it, I think it is actually a bug in python2.5's Unicode regex _match functionality. In that if you do:

import re
_unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')
def print_matches(match):
    val = match.group()
    print repr(val)
    return "&#%d;" % ord(val)

_unicode_re.sub(print_matches, u'\U0010abcd')

It prints out:
u'\udbea'
u'&#56298;\udfcd'

Notice that it *only* matched the first character, and didn't do *any* processing on the second character.

If you change it to:
_unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
_unicode_re.sub(print_matches, u'\U0010abcd')

You get:
u'\udbea'
u'\udfcd'
u'&#56298;&#57293;'

Note, however, that if I run the same tests on Linux I get:
u'\U0010abcd'
'&#1092557;'
versus
# No match
u'\U0010abcd'

My best guess is that we need to check the width of u'\U0010ffff' and respond accordingly. For example:

  if _unicode_re is None:
    if len(u'\U0010ffff') == 2:
      # this platform represents non BMP characters using 2 Py_UNICODE chars, so all
      # individual bytes are from 0 => ffff
      _unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
    else:
      # \U0010ffff is a single Py_UNICODE char, so match it as such
      _unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')

And then the test needs to check the validity of both forms:

+ def test_non_bmp_unicode(self):
+ uni_str = u'\U0010abcd'
+ if len(uni_str) == 2:
+ self.assertEqual('&#56298;&#57293;',
+ xml8._encode_and_escape(uni_str))
+ else:
+ self.assertEqual('&#1092557;',
+ xml8._encode_and_escape(uni_str))

However, this also implies that we need to make sure we can translate those
values *back* to their non-normalized form.

Even more worrisome, though, is that now 2 people generating the same identical
inventory will be generating different byte strings (based on whether the
inventory was generated from Windows or Linux.)

This has a fairly serious implication, as it means that line-based deltas won't
always cleanly apply if 2 people have different base texts.

This won't come into play often, because if *I* generate a text, and you *copy*
it from me, things will be fine. However, 'bzr upgrade' (for exam...

Read more...

review: Needs Fixing
Revision history for this message
John A Meinel (jameinel) wrote :

Sorry about messing up the request status. I was confused by Launchpad's Code Review.

I wanted to *mark* the request as needing to be resubmitted. But if you select "Resubmit" as the status, it actually just goes ahead and resubmits it....

Anyway, as I mentioned in my last post, the unfortunate truth is that the "simple and obvious one-liner" fails, because of Unicode cross-platform problems.

So for now, I'd rather we didn't support it at all, rather than open ourselves up to accidental corruption issues.

I'll also note that --development6-rich-root (what will likely become the default format for Bazaar 2.0) probably won't suffer the same issue. As it doesn't try to use XML for its inventory layer, and can use raw UTF-8 for those strings.

Perhaps it is just simply better to say "non BMP chars are not supported for formats < --development6".

review: Needs Resubmitting

Unmerged revisions

4288. By Wesley J. Landaker

Fix escaping to handle all Unicode characters, not just those in the BMP.

4287. By Canonical.com Patch Queue Manager <email address hidden>

(vila)(trivial) Cleanup test imports and use features to better track
 skipped tests.

4286. By Canonical.com Patch Queue Manager <email address hidden>

(vila) Fix realm extraction for http basic authentication

4285. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer, vila) Support registration of fallback credential stores.

4284. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Prompt for user names for http if they are not in the
 configuration.

4283. By Canonical.com Patch Queue Manager <email address hidden>

(vila) (trivial) Fix typo and a misleading signature

4282. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Allow empty list of authors if there is no committer set.

4281. By Canonical.com Patch Queue Manager <email address hidden>

(Jelmer) Add the dpush command.

4280. By Canonical.com Patch Queue Manager <email address hidden>

(jam) Change _fetch_uses_deltas = False for CHK repos until we can
 write a better fix.

4279. By Canonical.com Patch Queue Manager <email address hidden>

(Matt Nordhoff) Fix a typo in the launchpad plugin's help

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/xml8.py'
2--- bzrlib/xml8.py 2009-07-07 04:32:13 +0000
3+++ bzrlib/xml8.py 2009-08-31 04:38:05 +0000
4@@ -52,7 +52,7 @@
5 if _utf8_re is None:
6 _utf8_re = re.compile('[&<>\'\"]|[\x80-\xff]+')
7 if _unicode_re is None:
8- _unicode_re = re.compile(u'[&<>\'\"\u0080-\uffff]')
9+ _unicode_re = re.compile(u'[&<>\'\"\u0080-\U0010ffff]')
10
11
12 def _unicode_escape_replace(match, _map=_xml_escape_map):