bug-diffutils

Re: [bug-diffutils] Bug#680990: Diff does not show BOM difference. (fwd)


From: jeanmichel . 123
Subject: Re: [bug-diffutils] Bug#680990: Diff does not show BOM difference. (fwd)
Date: Tue, 17 Jul 2012 02:23:59 +0200 (CEST)

I have not integrated it into diff, nor tested it there,
because I do not know the diff code and its build process well enough.

But since there appears to be no Unicode support in the diff tool, code
implementing an IGNORE_ALL_UNICODE_SPACE mode to complement IGNORE_ALL_SPACE,
covering for instance:
       toUTF8_1 ( 0x0009 ),
       toUTF8_1 ( 0x000A ),
       toUTF8_1 ( 0x000B ),
       toUTF8_1 ( 0x000C ),
       toUTF8_1 ( 0x000D ),
       toUTF8_1 ( 0x0020 ),
// c2
       toUTF8_2 ( 0x0085 ),
       toUTF8_2 ( 0x00A0 ),
// e1
       toUTF8_3 ( 0x1680 ),
       toUTF8_3 ( 0x180E ),
// e2 80
       toUTF8_3 ( 0x2000 ),
       toUTF8_3 ( 0x2001 ),
       toUTF8_3 ( 0x2002  ),
       toUTF8_3 ( 0x2003  ),
       toUTF8_3 ( 0x2004  ),
       toUTF8_3 ( 0x2005  ),
       toUTF8_3 ( 0x2006  ),
       toUTF8_3 ( 0x2007  ),
       toUTF8_3 ( 0x2008  ),
       toUTF8_3 ( 0x2009  ),
       toUTF8_3 ( 0x200A  ),
       toUTF8_3 ( 0x2028  ),
       toUTF8_3 ( 0x2029  ),
       toUTF8_3 ( 0x202F  ),
// e2 81
       toUTF8_3 ( 0x205F  ),
// e3
       toUTF8_3 ( 0x3000  ),
// ef
       toUTF8_3 ( 0xfeff  ),
(taken from http://en.wikipedia.org/wiki/Whitespace_character, with the BOM added)
might look like:


//////////////////////////////////////////////////////
#include <stdio.h>
#include <string.h>

typedef unsigned char byte;

/* Check whether this byte sequence starts with a UTF-8 space character.
   Return the number of bytes of the matching space character, or 0 if
   there is none.  The input must be NUL-terminated.  */
int isUtf8Space(const byte *input)
{
  switch (input[0])
  {
    case 0x09:
    case 0x0a:
    case 0x0b:
    case 0x0c:
    case 0x0d:
    case 0x20:
      return 1;
    case 0xc2:                  /* U+0085, U+00A0 */
      if ( (input[1]==0xa0) || (input[1]==0x85) )
        return 2;
      break;
    case 0xe1:                  /* U+1680, U+180E */
      if ( (input[1]==0x9a) && (input[2]==0x80) )
        return 3;
      if ( (input[1]==0xa0) && (input[2]==0x8e) )
        return 3;
      break;
    case 0xe2:
      switch (input[1])
        {
           case 0x80:           /* U+2000..U+200A, U+2028, U+2029, U+202F */
             if ( (input[2]>=0x80) && (input[2]<=0x8a) )
                 return 3;
             if ( input[2]==0xa8)
                 return 3;
             if ( input[2]==0xa9)
                 return 3;
             if ( input[2]==0xaf)
                 return 3;
             break;
           case 0x81:           /* U+205F */
             if ( input[2]==0x9f)
                 return 3;
             break;
        }
      break;
    case 0xe3:                  /* U+3000 */
      if ( (input[1]==0x80) && (input[2]==0x80) )
        return 3;
      break;
    case 0xef:                  /* U+FEFF, the BOM */
      if ( (input[1]==0xbb) && (input[2]==0xbf) )
        return 3;
      break;
  }
  return 0;
}

/* Test program: skip leading Unicode white space in both arguments,
   then compare what is left.  */
int main(int argc, char **argv)
{
  const byte *t1, *t2;
  int n;

  if (argc < 3)
    {
      fprintf (stderr, "usage: %s STRING1 STRING2\n", argv[0]);
      return 2;
    }
  t1 = (const byte *) argv[1];
  t2 = (const byte *) argv[2];

  /* case IGNORE_ALL_UNICODE_SPACE: */
  /* For -w, just skip past any white space.  */
  while ( (n = isUtf8Space (t1)) && *t1 != '\n')  t1 += n;
  while ( (n = isUtf8Space (t2)) && *t2 != '\n')  t2 += n;
  /* break; */

  printf ("<%s\n", (const char *) t1);
  printf (">%s\n", (const char *) t2);
  return strcmp ((const char *) t1, (const char *) t2) != 0;
}
//////////////////////////////////////////////////////
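
As a cross-check of the byte values used in the switch above, here is a
minimal sketch of a UTF-8 encoder, i.e. what the toUTF8_N notation in the
list is assumed to stand for. The names encode_utf8 and utf8_equals are
hypothetical; neither exists in diffutils.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Returns the number of bytes written to out (1 to 4), or 0 if cp is
   a surrogate or lies beyond U+10FFFF.  */
size_t encode_utf8(unsigned long cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: does cp encode to exactly the bytes of the
   NUL-terminated string expected?  */
int utf8_equals(unsigned long cp, const char *expected)
{
  unsigned char buf[4];
  size_t n = encode_utf8(cp, buf);
  return n != 0 && n == strlen(expected) && memcmp(buf, expected, n) == 0;
}
```

For example, utf8_equals(0x1680, "\xE1\x9A\x80") holds, which matches the
first test in the 0xe1 case, and utf8_equals(0xFEFF, "\xEF\xBB\xBF") holds
for the BOM.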

Unfortunately, isUtf8Space only handles UTF-8 and not UTF-16.
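
For the BOM alone, recognizing the UTF-16 forms as well could be sketched
as below. This is only an illustration under assumptions: bom_length is a
hypothetical name, it works on a buffer with an explicit length (UTF-16
text contains NUL bytes, so a NUL-terminated string will not do), and it
does not handle the rest of the UTF-16 white space.

```c
#include <stddef.h>

/* Hypothetical helper: return the length in bytes of a byte-order mark
   at the start of buf, or 0 if there is none.  Only UTF-8 and UTF-16
   are recognized; a UTF-32LE BOM (FF FE 00 00) would be reported as a
   UTF-16LE one.  */
size_t bom_length(const unsigned char *buf, size_t len)
{
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
    return 3;                   /* UTF-8 */
  if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
    return 2;                   /* UTF-16, big endian */
  if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
    return 2;                   /* UTF-16, little endian */
  return 0;
}
```

A caller would advance its buffer pointer by the returned count before
comparing, in the same way the test program above advances t1 and t2.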


----- Original message -----

It'd be reasonable to have 'diff' ignore byte-order-marks,
just as it already ignores things like white space,
when given a new option to do that.

Someone would have to write the code and documentation,
though.


