[wdiff-dev] [patch #7121] New, per-character diff, mode

From:	Martin von Gagern
Subject:	[wdiff-dev] [patch #7121] New, per-character diff, mode
Date:	Tue, 30 Mar 2010 08:01:42 +0000
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.2) Gecko/20100325 Gentoo Firefox/3.6.2

Update of patch #7121 (project wdiff):

                  Status:               Need Info => In Progress            

    _______________________________________________________

Follow-up Comment #4:

-1- see personal e-mail.

-3-
> I have started studying other encodings, such as UTF-16,
> the unicode routines available by glibc

I assume that the most GNUish way to handle different encodings would be to
use libiconv <http://www.gnu.org/software/libiconv/>; glibc ships its own
implementation of the same iconv interface. That way, you can write your app
to deal with Unicode only, and rely on iconv for input processing. The
downside is that some encodings might offer several byte representations for
the same Unicode string, in which case wdiff might lose some details about the
binary representation. That shouldn't worry us, though, as long as things look
the same. wdiff is more about looks than bytes, I think.
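
To make the idea concrete, here is a minimal sketch of converting an input
buffer to UTF-8 with the standard iconv interface. The helper name to_utf8 is
hypothetical and not part of wdiff; the output buffer size is a guess that is
generous for common encodings, and a robust version would grow the buffer
when iconv reports E2BIG.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert `len` bytes of `in`, encoded as `from`,
 * into a freshly allocated NUL-terminated UTF-8 string.
 * Returns NULL on any failure; the caller frees the result. */
char *to_utf8(const char *from, const char *in, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;                 /* unknown source encoding */

    /* Assumption: four output bytes per input byte is enough. */
    size_t cap = len * 4 + 1;
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;          /* iconv takes a non-const pointer */
    char *dst = out;
    size_t in_left = len;
    size_t out_left = cap - 1;       /* keep room for the terminator */

    if (iconv(cd, &src, &in_left, &dst, &out_left) == (size_t)-1) {
        free(out);                   /* invalid, truncated or oversized input */
        iconv_close(cd);
        return NULL;
    }

    *dst = '\0';
    iconv_close(cd);
    return out;
}
```

For example, to_utf8("ISO-8859-1", "caf\xe9", 4) should yield the five bytes
of "café" in UTF-8.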

> Calling `setlocale(LC_CTYPE, NULL)' at program's startup
> to get the default encoding (plus a commandline option to
> supply it explicitly)

Most apps don't provide special command line switches, but instead accept
customization of encodings via environment variables. I'd stick with this to
avoid excessive numbers of command line switches. Unless you want to control
encoding for both sides of a diff independently...
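
As a rough sketch of the environment-based approach: passing NULL to
setlocale only queries the current locale, whereas an empty string picks up
LC_ALL, LC_CTYPE or LANG from the environment. The POSIX call
nl_langinfo(CODESET) then names the resulting encoding. The function name
default_codeset is made up for illustration.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: return the name of the character encoding selected
 * by the user's environment (for example "UTF-8"). */
const char *default_codeset(void)
{
    /* "" means "use the environment". If this fails, the program stays in
     * the "C" locale and the codeset of that locale is reported. */
    setlocale(LC_CTYPE, "");

    return nl_langinfo(CODESET);
}
```

The returned name could be handed to iconv_open as the source encoding,
although whether every name reported here is accepted there is an assumption
that would need checking on each platform.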

> Splitting of words/chars will need some thought

By the way: I believe you want to split glyphs (grapheme clusters, in Unicode
terms), not chars, i.e. you want to keep combining characters together with
their base characters. This requires even more understanding of Unicode, and
applying it to other encodings manually would probably be very painful, so
conversion to Unicode feels ever more sensible.
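
A deliberately simplified sketch of what I mean, working on an array of
Unicode code points. It treats only the Combining Diacritical Marks block
(U+0300 to U+036F) as combining, which is an assumption made to keep the
example short; real grapheme segmentation follows much more elaborate rules.
The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplification: only U+0300..U+036F count as combining marks here. */
static int is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Length, in code points, of the unit starting at text[0]: one base
 * character plus any combining marks that directly follow it. */
size_t unit_length(const uint32_t *text, size_t n)
{
    if (n == 0)
        return 0;

    size_t i = 1;
    while (i < n && is_combining(text[i]))
        i++;
    return i;
}

/* Number of such units in the whole array. */
size_t count_units(const uint32_t *text, size_t n)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < n) {
        pos += unit_length(text + pos, n - pos);
        count++;
    }
    return count;
}
```

With the input { 'e', 0x0301, 'a' } this gives two units: the "e" together
with its combining acute accent, and then the "a".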

> Most probably it will require to change all getc/putc calls
> to the appropriate multi-byte versions.

libiconv, at least, works on byte buffers, not I/O streams, and it's probably
slow if you pass one char at a time. Hmmm... I wonder whether we should switch
from the one-char-at-a-time implementation to something that processes one
larger buffer of bytes at a time.
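
Roughly, a buffer-at-a-time loop might look like the sketch below. It reads
chunks with fread, converts each chunk to UTF-8 with iconv, and carries an
incomplete multi-byte sequence at the end of a chunk (reported as EINVAL)
over to the next read. The function name and buffer sizes are assumptions
for illustration, not a proposal for the actual wdiff code.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, converting from encoding `from`
 * to UTF-8 one chunk at a time. Returns 0 on success, -1 on error. */
int convert_stream(FILE *in, FILE *out, const char *from)
{
    char inbuf[4096];
    char outbuf[16384];
    size_t pending = 0;   /* bytes of an incomplete sequence carried over */

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -1;

    for (;;) {
        size_t got = fread(inbuf + pending, 1, sizeof inbuf - pending, in);
        if (got == 0)
            break;        /* end of input or read error */

        char *src = inbuf;
        size_t in_left = pending + got;

        while (in_left > 0) {
            char *dst = outbuf;
            size_t out_left = sizeof outbuf;

            size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
            int err = errno;          /* save before fwrite can change it */

            fwrite(outbuf, 1, sizeof outbuf - out_left, out);

            if (rc == (size_t)-1) {
                if (err == EINVAL)
                    break;            /* sequence continues in next chunk */
                if (err != E2BIG) {   /* E2BIG: output full, just loop */
                    iconv_close(cd);
                    return -1;        /* invalid input sequence */
                }
            }
        }

        /* Move any unconverted tail to the front for the next round. */
        memmove(inbuf, src, in_left);
        pending = in_left;
    }

    iconv_close(cd);

    /* Leftover bytes at this point mean the input ended mid-sequence. */
    return (pending == 0 && !ferror(in) && !ferror(out)) ? 0 : -1;
}
```

The splitting logic could then run on the converted buffer instead of on
individual getc results.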

> Doing more elaborate things such as truly supporting the
> -i, --ignore-case option in all encodings will most probably
> require quite a lot of code (actually, I do not know yet how
> much).

The ICU library has some documents that might be useful here:
http://userguide.icu-project.org/transforms/casemappings
http://userguide.icu-project.org/collation/concepts

> Thus, a pass-through implementation (just to break words/chars
> right in any encoding) is feasible IMO in a few months;
> I will try it.

That sounds great. I guess it would be useful if you could share a bzr branch
of your progress. You could do so either by applying for group membership here
on savannah, or using launchpad, or any other publicly accessible bzr or http
server.

> If you are aware of other GNU software that handles unicode
> point me to it; there may be suitable ready code to use for
> this purpose.

The only GNU thing I know about is libiconv. It does conversion, but knows
little else about Unicode.

The most powerful Unicode library I've worked with so far is ICU, the
International Components for Unicode, <http://site.icu-project.org/>. It also
has some good documentation, to which I've already pointed you above. One more
pointer: the White_Space and Word_Break properties mentioned in
http://userguide.icu-project.org/strings/properties might be useful.

ICU is not a GNU project, and its libraries take up over 17MiB on my setup. So
it might be a useful and powerful support library, but wdiff should not depend
on it, maybe not even for simple char splitting.

Looking at the ICU docs made me realize just how deep the waters are getting
here. For example, to do this correctly in a partially right-to-left script,
you would have to ensure that Unicode directional formatting characters nest
properly. I assume things can become really complicated. I wouldn't expect you
to address all this complexity before I consider your patch for inclusion; the
word-wise diff doesn't do so either. But it might pay to think about these
things, so that the implementation is designed in such a way that it can deal
with them in the long run. Just a thought.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/patch/?7121>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
