[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] Wide characters
From: |
Ludovic Courtès |
Subject: |
Re: [PATCH] Wide characters |
Date: |
Mon, 23 Feb 2009 23:06:53 +0100 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/23.0.90 (gnu/linux) |
Hi,
Mike Gran <address@hidden> writes:
> I've been playing with this wide char stuff, and I have a patch that
> would move the encoding of characters to UCS-4.
Thanks for the good news!
> This is completely useless on its own, because, in this
> patch, the internal encoding of strings is still 8-bit chars, and,
> thus, there is no way to use the wide characters in strings.
Yes. I think the best thing will be to let you experiment in a
dedicated branch, so we can progressively see things take shape.
> It is all pretty simple. Since the internal representation of chars
> becomes UCS-4, I used scm_t_uint32 as the char type, and I removed the
> code that supported EBCDIC-encoded characters. I changed the tables
> of character names to deal with more names and discontiguous control
> characters. And, as a temporary kludge, I made a macro
> SCM_MAKE_8BIT_CHAR to cast the 8-bit characters used in strings to a
> scm_t_uint32. Also, I used functions from the Gnulib unicase and
> unictype modules for character properties, including a couple that
> Bruno Haible of Gnulib was kind enough to create for me.
That sounds good.
I only have minor comments at this point, see below.
IMO it'd be good to augment `chars.test' or `reader.test' to test some
of the new characters.
> The gnulib invocation for this was
This should appear as a diff of `m4/gnulib-cache.m4' and as the addition
of relevant Gnulib files since we now store them in the repository. The
best is probably to make this a separate commit.
> +#include "lib/unicase.h"
#include <unicase.h> should be enough.
> SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
> (SCM x, SCM y),
> "Return @code{#t} iff @var{x} is less than or equal to @var{y} in
> the\n"
> - "ASCII sequence, else @code{#f}.")
> + "Imocpde sequence, else @code{#f}.")
Typo? :-)
> SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
> (SCM x, SCM y),
> "Return @code{#t} iff @var{x} is the same character as @var{y}
> ignoring\n"
> - "case, else @code{#f}.")
> + "case, else @code{#f}. Case is computed in the Unicode locale.")
The phrase "Unicode locale" looks confusing to me. This function is
locale-independent, right?
> - return scm_from_bool
> (scm_c_upcase(SCM_CHAR(x))==scm_c_upcase(SCM_CHAR(y)));
> + return scm_from_bool (uc_toupper(SCM_CHAR(x))==uc_toupper(SCM_CHAR(y)));
Please leave a space before opening parentheses.
> -int
> -scm_c_upcase (unsigned int c)
> +scm_t_uint32
> +scm_c_upcase (scm_t_uint32 c)
This is an API change, but probably acceptable (and unavoidable).
> +char *const scm_charnames[] =
Could even be "const char *const scm_charnames[]".
> + {
> + /* C0 controls */
> + "nul", "soh", "stx", "etx", "eot", "enq", "ack", "bel",
> + "bs", "ht", "newline", "vt", "np", "cr", "so", "si",
> + "dle", "dc1", "dc2", "dc3", "dc4", "nak", "syn", "etb",
> + "can", "em", "sub", "esc", "fs", "gs", "rs", "us",
> + "del",
> + /* C1 controls */
> + "bph", "nbh", "ind", "nel", "ssa", "esa",
> + "hts", "htj", "vts", "pld", "plu", "ri" , "ss2", "ss3",
> + "dcs", "pu1", "pu2", "sts", "cch", "mw" , "spa", "epa",
> + "sos", "sci", "csi", "st", "osc", "pm", "apc"
> + };
Are the new names standard?
> int scm_n_charnames = sizeof (scm_charnames) / sizeof (char *);
Could be `const'.
> +SCM_API const scm_t_uint32 scm_charnums[];
> +SCM_API char *const scm_alt_charnames[];
> +SCM_API int scm_n_alt_charnames;
> +SCM_API const scm_t_uint32 scm_alt_charnums[];
This should all be marked `SCM_INTERNAL'.
Besides, instead of exposing these arrays, could we instead have two
functions in `chars.c', say:
scm_t_uint32 scm_i_lookup_character (const char *name);
const char *scm_i_character_name (scm_t_uint32 chr);
> + return (unsigned long)(scm_c_downcase(SCM_CHAR(obj))) % n;
The cast shouldn't be needed.
> - /* Dirk:FIXME:: This type of character syntax is not R5RS
> - * compliant. Further, it should be verified that the constant
> - * does only consist of octal digits. Finally, it should be
> - * checked whether the resulting fixnum is in the range of
> - * characters. */
> + /* FIXME:: This type of character syntax is not R5RS
> + * compliant. */
I think this comment remains valid, doesn't it?
> @@ -59,7 +59,7 @@ sf_flush (SCM port)
> {
> /* write the byte. */
> scm_call_1 (SCM_SIMPLE_VECTOR_REF (stream, 0),
> - SCM_MAKE_CHAR (*pt->write_buf));
> + SCM_MAKE_8BIT_CHAR (*pt->write_buf));
It's actually not a byte.
Thanks!
Ludo'.