help-source-highlight

Re: [Help-source-highlight] Unicode files ?


From: Lorenzo Bettini
Subject: Re: [Help-source-highlight] Unicode files ?
Date: Sat, 03 Apr 2010 11:57:33 +0200
User-agent: Thunderbird 2.0.0.24 (X11/20100317)

Dario Teixeira wrote:
Hi,

the html might also declare a bad encoding in its head, but I
guess the problem is also due to the fact that source-highlight reads
two bytes, which in unicode represent a single character,
and interprets them as two characters instead of one. This is unicode,
am I right? Sorry for my ignorance, but with unicode in a text file
every character is represented by two bytes, right?

Nope. There is not one standard Unicode encoding, but several.  The most
common one is UTF-8, which is a variable length encoding where each Unicode
character can take from 1 to 4 bytes (originally it was up to 6, but that's
deprecated now).  Another variable-length encoding is UTF-16, where each
character occupies either 2 or 4 bytes.  The only fixed-length encoding
that covers all of Unicode is UTF-32 (UCS-4), where each character
requires 4 bytes.
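
To make the variable-length point concrete, here is a minimal C++ sketch that counts the characters (code points) in a UTF-8 string. The function name utf8_length is only illustrative, and the sketch assumes the input is already valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded string.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx),
// so counting the non-continuation bytes gives the number of characters.
// Assumption: the input is valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, the string "caff\xC3\xA8" (the word "caffè", where U+00E8 is stored as the two bytes 0xC3 0xA8) has a size() of 6 bytes, while utf8_length reports 5 characters.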

Oh, then I had got it completely wrong! O:)

I'd like to try with wstring and see whether this solves
something.

I haven't used C++ in a long time, but isn't wstring based on wchar_t,
which is 2 bytes long (at least on some platforms)?   If so, it won't
solve anything.  There is no Unicode encoding that covers every
character with a fixed length of 2 bytes!


again, I got it wrong... so now I'm wondering what wchar_t is for, but that's another issue...
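
As a side note on the wchar_t question, the C++ standard leaves its size up to the implementation: it is typically 4 bytes with glibc on GNU/Linux and 2 bytes with Microsoft's compilers. A small sketch to check the current platform follows; the two function names are made up for illustration.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Number of bits in a wchar_t on the platform this is compiled for.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when one wchar_t can hold any Unicode code point.
// The largest code point is U+10FFFF.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

Where wchar_holds_any_code_point() is false, a wstring is in practice a sequence of UTF-16 code units, so a single character can still span two elements and the original problem remains.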

Lorenzo, I think we can give you a hand in implementing this.  However,
if you read through this entire thread you will notice that the best
course of action is dependent on a crucial piece of information which
you are the most qualified person to provide: we need a list of the
manipulations that Source-highlight applies to strings.

well

1. it reads a line from the input source
2. uses the boost regex library to match pieces of the line against the regular expressions of the given language definition
3. possibly preprocesses some characters depending on the output format (e.g., in html '<' is translated into '&lt;')
4. if no regular expression matched, writes the line part to the output
5. if a regular expression matched, writes the line part to the output with some "decoration" according to the output format
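
The steps above can be sketched roughly as follows. This is not Source-highlight's actual code: it uses std::regex instead of the boost regex library to stay self-contained, and the Rule struct, escape_html and highlight_line are hypothetical names chosen for the example. Reading the line (step 1) is left to the caller.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical rule: a regular expression plus the CSS class used to
// decorate whatever it matches.
struct Rule {
    std::regex pattern;
    std::string css_class;
};

// Step 3: translate characters that are special in HTML.
std::string escape_html(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '&': out += "&amp;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Steps 2, 4 and 5 for a single input line.
std::string highlight_line(const std::string& line,
                           const std::vector<Rule>& rules) {
    std::string out;
    auto pos = line.cbegin();
    while (pos != line.cend()) {
        // Step 2: find the rule whose match starts earliest in the rest
        // of the line; empty matches are ignored so the loop advances.
        std::smatch best;
        const Rule* best_rule = nullptr;
        for (const Rule& rule : rules) {
            std::smatch m;
            if (std::regex_search(pos, line.cend(), m, rule.pattern) &&
                m.length(0) > 0 &&
                (best_rule == nullptr || m.position(0) < best.position(0))) {
                best = m;
                best_rule = &rule;
            }
        }
        if (best_rule == nullptr) {
            // Step 4: nothing matched, write the rest of the line as-is.
            out += escape_html(std::string(pos, line.cend()));
            break;
        }
        // Step 4 for the unmatched text before the match, then
        // step 5: write the matched part with its decoration.
        out += escape_html(best.prefix().str());
        out += "<span class=\"" + best_rule->css_class + "\">" +
               escape_html(best.str(0)) + "</span>";
        pos = best[0].second;
    }
    return out;
}
```

Note that this sketch works byte by byte. With UTF-8 input, the escaping step is still safe, because the bytes of a multi-byte UTF-8 sequence never coincide with ASCII characters such as '<'; the regular expressions, however, see bytes rather than characters.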

so I guess another big issue is whether boost regex library is able to handle unicode strings, right?

Moreover, as far as I understand, wchar_t is useless here, and a unicode library for C++ is required anyway...

Thanks Martin for the url, if anyone else can provide further links they are more than welcome :)

cheers
        Lorenzo

--
Lorenzo Bettini, PhD in Computer Science, DI, Univ. Torino
ICQ# lbetto, 16080134     (GNU/Linux User # 158233)
HOME: http://www.lorenzobettini.it MUSIC: http://www.purplesucker.com
http://www.myspace.com/supertrouperabba
BLOGS: http://tronprog.blogspot.com  http://longlivemusic.blogspot.com
http://www.gnu.org/software/src-highlite
http://www.gnu.org/software/gengetopt
http://www.gnu.org/software/gengen http://doublecpp.sourceforge.net



