I've done some research into this. Officially, the only code points allowed are:
"[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */" (https://www.w3.org/TR/REC-xml/#charsets)
Going on this, our regex should be:
re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]')
We should probably filter out the other disallowed code points too (the surrogate blocks, #xFFFE and #xFFFF).
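As a rough, untested-in-context sketch of what filtering the remaining code points might look like: the pattern below matches everything outside the Char production quoted above. The names INVALID_XML_CHARS and strip_invalid_xml_chars are made up for illustration and are not in the branch. Note that, read strictly, the Char production permits #x7F-#x9F, so this sketch leaves them alone, whereas the regex above also strips them.

```python
import re

# Hypothetical name, for illustration only.
# Matches every code point outside the XML 1.0 Char production:
#   C0 controls except tab, LF and CR; the surrogate block; U+FFFE and U+FFFF.
INVALID_XML_CHARS = re.compile(
    r'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]'
)


def strip_invalid_xml_chars(text):
    """Return text with all code points that XML 1.0 disallows removed."""
    return INVALID_XML_CHARS.sub('', text)
```

If stripping the C1 range is still wanted, \x7F-\x9F could be added back into the character class.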
It's worth noting that REPLACMENT_CHARS_MAP replaces the vertical tab and form feed characters with "\n\n" before the CONTROL_CHARS regex filters them out!
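To illustrate why the order matters, here is a small sketch. VT_FF_MAP is a hypothetical stand-in that only covers the two characters mentioned, and I'm assuming the replacement is applied with str.translate; the real map and call site in the branch may well differ.

```python
import re

# Stand-ins for illustration only; not the definitions from the branch.
VT_FF_MAP = str.maketrans({'\x0b': '\n\n', '\x0c': '\n\n'})
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]')


def clean(text):
    """Replace VT/FF with blank lines first, then strip remaining controls."""
    return CONTROL_CHARS.sub('', text.translate(VT_FF_MAP))


def clean_wrong_order(text):
    """Strip controls first; VT/FF are gone before the map can replace them."""
    return CONTROL_CHARS.sub('', text).translate(VT_FF_MAP)
```

With the replacement applied first, keeping \x0B and \x0C in the regex is harmless, since they never reach it.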
So, in summary, please replace the regex with the one in the inline comment. (Note: it's not tested.)