
From: Burkart Lingner
Subject: [Octave-bug-tracker] [bug #35910] Incorrect regex matching of multi-byte UTF-8 characters
Date: Tue, 20 Mar 2012 16:29:26 +0000
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:11.0) Gecko/20100101 Firefox/11.0

URL:
  <http://savannah.gnu.org/bugs/?35910>

                 Summary: Incorrect regex matching of multi-byte UTF-8 characters
                 Project: GNU Octave
            Submitted by: burkart
            Submitted on: Tue 20 Mar 2012 04:29:25 PM GMT
                Category: Interpreter
                Severity: 3 - Normal
                Priority: 5 - Normal
              Item Group: Incorrect Result
                  Status: None
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: 3.6.1
        Operating System: GNU/Linux

    _______________________________________________________

Details:

When a pattern that matches a single character (such as ".") is applied at a
position holding a multi-byte UTF-8 character, only the first byte of that
character is matched. Depending on how the match is then processed, the result
can be invalid UTF-8. Example:


string = regexprep('§x', '^(.)', '$1;')
fprintf('%x\n', string)


yields


string = ?;?x
c2
3b
a7
78


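where "?" is the replacement character displayed for each invalid byte, and
the UTF-8 encodings of "§", ";", and "x" are "0xC2 0xA7", "0x3B", and "0x78",
respectively. The ";" has been inserted between the two bytes of "§", so
neither 0xC2 nor 0xA7 is valid on its own.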

The expected output would have been


string = §;x
c2
a7
3b
78
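
The difference between the two outputs can be reproduced outside Octave. The
following is only a sketch in Python, not Octave's implementation: it assumes
that the observed behaviour corresponds to the regex engine treating the
string as a sequence of bytes, and compares that with matching on decoded
characters. The helper name hex_bytes is made up for this illustration.

```python
import re

text = "§x"                 # "§" is U+00A7, encoded in UTF-8 as 0xC2 0xA7
raw = text.encode("utf-8")  # the three bytes C2 A7 78

# Byte-wise matching: "." consumes a single byte, so the ";" is
# inserted between the two bytes of "§" (the reported behaviour).
byte_result = re.sub(rb"^(.)", rb"\1;", raw)

# Character-wise matching: "." consumes the whole code point, so the
# ";" follows the complete "§" (the expected behaviour).
char_result = re.sub(r"^(.)", r"\1;", text).encode("utf-8")


def hex_bytes(data):
    """Format a bytes object as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


print(hex_bytes(byte_result))  # c2 3b a7 78
print(hex_bytes(char_result))  # c2 a7 3b 78
```

The first line printed corresponds to the byte sequence reported above, the
second to the expected one.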





    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?35910>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



