Re: [Monotone-devel] iconv diffs [Was: Why is utf8...]
From: Ethan Blanton
Subject: Re: [Monotone-devel] iconv diffs [Was: Why is utf8...]
Date: Sat, 17 Feb 2007 10:33:08 -0500
User-agent: Mutt/1.5.12-2006-07-14
Patrick Georgi spake unto us the following wisdom:
> but skipping a character should be possible:
> - build another iconv state that translates input encoding into input
> encoding (unless that enables a fast-path, which I'm not sure of -
> alternative might be some encoding that is the ultimate superset, if
> such an encoding exists)
> - push first unknown byte into it. if that creates a response already,
> discard (as it might be some header sequence) and restart with the same
> byte in the next step, otherwise start at the next byte
> - until iconv emits a response, push byte after byte into it
> - skip that many bytes in the input, replace with one "?"
This is more or less what we do in Gaim for some of our fallback
attempts. It can still lead to junk in your output, particularly
given that a) there are non-UTF-8 character sets whose byte streams
are also valid UTF-8 (e.g., the 7-bit ISO-2022-{JP,KR}), and b) there
are character sets which will accept any byte as valid, whether or not
it is the character that was meant (e.g., ISO-8859-*).
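A small sketch of both failure modes, using Python's built-in codecs as
a stand-in for iconv (that substitution is an assumption made for
illustration, as is the sample text). ISO-8859-1 is used for case b)
because it maps all 256 byte values.

```python
# Case a): ISO-2022-JP is a 7-bit encoding, so its bytes are plain
# ASCII values and therefore also decode as UTF-8 without any error.
text = "日本語"  # sample text, chosen only for illustration
raw = text.encode("iso2022_jp")
is_seven_bit = all(b < 0x80 for b in raw)

# The UTF-8 decode "succeeds", but the result is escape sequences and
# ASCII junk rather than the original text.
as_utf8 = raw.decode("utf-8")

# Case b): ISO-8859-1 assigns a character to every byte value, so a
# decode can never fail, whatever the data really was.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("iso-8859-1")
```

In neither case does the converter report an error, so a validity check
alone cannot tell you that the guess at the character set was wrong.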
The bottom line, though, is that if the user (or operating system) has
not successfully communicated the character set used for some chunk of
data, you _cannot_ do the right thing -- the best you can do is try
not to mess it up too much. For us, this basically means filtering out
anything that isn't UTF-8 before it gets to the user (normally
replacing invalid sequences with one or more '?' characters), as our
UI is guaranteed to be UTF-8 by design. With monotone you aren't
given this guarantee, but a similar approach seems reasonable: try to
convert it to whatever character set the locale (LC_CTYPE) specifies,
restarting one byte at a time and replacing any bytes which fail to
convert with '?'.
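A minimal sketch of that restart-and-replace loop, again with Python's
codecs standing in for iconv. The function name decode_lossy and its
parameters are hypothetical; this is not Gaim's or monotone's code.

```python
def decode_lossy(data: bytes, charset: str) -> str:
    """Decode data from charset, turning each unconvertible byte into '?'.

    Hypothetical helper for illustration only.
    """
    out = []
    pos = 0
    while pos < len(data):
        try:
            # Try to convert everything that is left in one go.
            out.append(data[pos:].decode(charset))
            break
        except UnicodeDecodeError as err:
            # err.start is the offset, within the slice just tried, of
            # the first byte that failed. Keep the good prefix...
            out.append(data[pos:pos + err.start].decode(charset))
            # ...replace the single offending byte...
            out.append("?")
            # ...and restart the conversion one byte further on.
            pos += err.start + 1
    return "".join(out)


cleaned = decode_lossy(b"caf\xe9 ok", "utf-8")  # 0xE9 is Latin-1, not UTF-8
```

Restarting after a single byte, rather than after the whole suspect
sequence, means a truncated multi-byte character yields one '?' per
byte. That is noisier, but it never swallows a valid character that
happens to follow the bad byte.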
Ethan
--
The laws that forbid the carrying of arms are laws [that have no remedy
for evils]. They disarm only those who are neither inclined nor
determined to commit crimes.
-- Cesare Beccaria, "On Crimes and Punishments", 1764