guile-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Guile-commits] GNU Guile branch, master, updated. release_1-9-2-137-g3f


From: Michael Gran
Subject: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-137-g3f12aed
Date: Mon, 31 Aug 2009 00:01:27 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU Guile".

http://git.savannah.gnu.org/cgit/guile.git/commit/?id=3f12aedb50122209cf773174d215dc64309b9fe4

The branch, master has been updated
       via  3f12aedb50122209cf773174d215dc64309b9fe4 (commit)
       via  bda0d85f0c550eeb4182de0fb6051fc51d0b25ce (commit)
       via  5f5920e012d953def86a1c04a8e8d1a14b85f1a5 (commit)
       via  fac32b518ef9f456e8f6465c00e6c6f40a1123a8 (commit)
      from  f84c500d2e29c619e6a989d0d11911fea414d795 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 3f12aedb50122209cf773174d215dc64309b9fe4
Author: Michael Gran <address@hidden>
Date:   Sun Aug 30 15:58:32 2009 -0700

    Update docs for Unicode characters
    
    * NEWS: add note about Unicode characters
    
    * doc/ref/api-data.texi: update Characters subsection
    
    * libguile/chars.c: update docstrings to match manual

commit bda0d85f0c550eeb4182de0fb6051fc51d0b25ce
Author: Michael Gran <address@hidden>
Date:   Sun Aug 30 16:51:30 2009 -0700

    Tests for display and writing of characters
    
    * test-suite/tests/encoding-iso88591.test: tests for writing and display
      of characters
    
    * test-suite/tests/encoding-iso88597.test: tests for writing and display
      of characters
    
    * test-suite/tests/encoding-utf8.test: tests for writing and display
      of characters

commit 5f5920e012d953def86a1c04a8e8d1a14b85f1a5
Author: Michael Gran <address@hidden>
Date:   Sun Aug 30 16:48:03 2009 -0700

    Fix escape sequence normalization for wide strings
    
    * libguile/strings.c (scm_to_stringn): convert unistring escapes to
      guile escapes for both wide and narrow strings

commit fac32b518ef9f456e8f6465c00e6c6f40a1123a8
Author: Michael Gran <address@hidden>
Date:   Sun Aug 30 15:41:56 2009 -0700

    Fix encoding errors with strings returned by string ports
    
    String ports, being 8-bit, store strings using the character encoding
    of the port.  This fixes a bug where the default character encoding, and
    not the port's encoding, was being used to convert the string port data
    back to a string.
    
    * libguile/strports.c: extra comments
      (scm_strport_to_string):  use port's encoding when converting port data
      to a string
    
    * libguile/strings.c (scm_i_from_stringn): renamed from scm_from_stringn
      and made internal.  All callers changed.
      (scm_from_stringn): renamed to scm_i_from_stringn.
    
    * libguile/strings.h: declaration for scm_i_from_stringn

-----------------------------------------------------------------------

Summary of changes:
 NEWS                                    |    6 ++
 doc/ref/api-data.texi                   |  113 ++++++++++++++++++++----------
 libguile/chars.c                        |   64 +++++++++---------
 libguile/strings.c                      |   17 +++--
 libguile/strings.h                      |    4 +
 libguile/strports.c                     |   26 ++++++--
 test-suite/tests/encoding-iso88591.test |   61 +++++++++++++++++
 test-suite/tests/encoding-iso88597.test |   59 ++++++++++++++++
 test-suite/tests/encoding-utf8.test     |   82 ++++++++++++++++++++++
 9 files changed, 352 insertions(+), 80 deletions(-)

diff --git a/NEWS b/NEWS
index 4797b0c..97b55e9 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,12 @@ prerelease, and a full NEWS corresponding to 1.8 -> 2.0.)
 
 Changes in 1.9.3 (since the 1.9.2 prerelease):
 
+** Character functions operate on Unicode characters
+
+char-upcase and char-downcase use default Unicode casing rules.
+Character comparisons such as char<? and char-ci<? are now sorting
+based on Unicode code points.
+
 ** Removed deprecated uniform array procedures: scm_make_uve,
    scm_array_prototype, scm_list_to_uniform_array,
    scm_dimensions_to_uniform_array, scm_make_ra, scm_shap2ra, scm_cvref,
diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index 0fd4ee1..446ccd3 100755
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -1779,16 +1779,31 @@ another manual.
 @subsection Characters
 @tpindex Characters
 
+In Scheme, there is a data type to describe a single character.  
+
+Defining what exactly a character @emph{is} can be more complicated
+than it seems.  Guile follows the advice of R6RS and just uses The
+Unicode Standard to help define what a character is.  So, for Guile,
+a character is anything in the Unicode Character Database.
+
+Unicode assigns each character an unique integer representation: a
address@hidden point}.  Guile uses Unicode code points as the integer
+representation of characters.  Valid code points are in the ranges 0
+to @code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF}
+inclusive.
+
 In Scheme, a character literal is written as @address@hidden where
 @var{name} is the name of the character that you want.  Printable
 characters have their usual single character name; for example,
address@hidden is a lower case @code{a}.
address@hidden is a lower case @code{a}.  Many of the non-printing
+characters, such as whitespace characters and control characters, also
+have names.
 
-Most of the ``control characters'' (those below codepoint 32) in the
address@hidden character set, as well as the space, may be referred
-to by longer names: for example, @code{#\tab}, @code{#\esc},
address@hidden, and so on.  The following table describes the
address@hidden names for each character.
+The most commonly used non-printing chararacters are space and
+newline.  Their character names are @code{#\space} and
address@hidden  There are also names for all of the ``C0 control
+characters'' (those with code points below 32).  The following table
+describes the names for each character.
 
 @multitable @columnfractions .25 .25 .25 .25
 @item 0 = @code{#\nul}
@@ -1801,9 +1816,9 @@ to by longer names: for example, @code{#\tab}, 
@code{#\esc},
  @tab 7 = @code{#\bel}
 @item 8 = @code{#\bs}
  @tab 9 = @code{#\ht}
- @tab 10 = @code{#\nl}
+ @tab 10 = @code{#\lf} 
  @tab 11 = @code{#\vt}
address@hidden 12 = @code{#\np}
address@hidden 12 = @code{#\ff}
  @tab 13 = @code{#\cr}
  @tab 14 = @code{#\so}
  @tab 15 = @code{#\si}
@@ -1826,85 +1841,106 @@ to by longer names: for example, @code{#\tab}, 
@code{#\esc},
 @item 32 = @code{#\sp}
 @end multitable
 
-The ``delete'' character (octal 177) may be referred to with the name
address@hidden
+The ``delete'' character (code point 127) may be referred to with the
+name @code{#\del}.
 
-Several characters have more than one name:
+One might note that the space character has two names --
address@hidden and @code{#\sp} -- as does the newline character.
+Several other non-printing characters have more than one name, for the
+sake of compatibility with previous versions.
 
address@hidden address@hidden {Original}
address@hidden Alias @tab Original
address@hidden @code{#\space} @tab @code{#\sp}
address@hidden @code{#\newline} @tab @code{#\nl}
address@hidden address@hidden {Preferred}
address@hidden Alternate @tab Standard
address@hidden @code{#\sp} @tab @code{#\space}
address@hidden @code{#\nl} @tab @code{#\newline}
address@hidden @code{#\lf} @tab @code{#\newline}
 @item @code{#\tab} @tab @code{#\ht}
 @item @code{#\backspace} @tab @code{#\bs}
 @item @code{#\return} @tab @code{#\cr}
address@hidden @code{#\page} @tab @code{#\np}
address@hidden @code{#\page} @tab @code{#\ff}
address@hidden @code{#\np} @tab @code{#\ff}
 @item @code{#\null} @tab @code{#\nul}
 @end multitable
 
+Characters may also be referred to with an octal value, such as
address@hidden for @code{#\bs} or @code{#\177} for @code{#\del}.
+
 @rnindex char?
 @deffn {Scheme Procedure} char? x
 @deffnx {C Function} scm_char_p (x)
 Return @code{#t} iff @var{x} is a character, else @code{#f}.
 @end deffn
 
+Fundamentally, the character comparisons operations below are
+numeric comparisons of the character's code points.
+
 @rnindex char=?
 @deffn {Scheme Procedure} char=? x y
-Return @code{#t} iff @var{x} is the same character as @var{y}, else @code{#f}.
+Return @code{#t} iff code point of @var{x} is equal to the code point
+of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char<?
 @deffn {Scheme Procedure} char<? x y
-Return @code{#t} iff @var{x} is less than @var{y} in the @acronym{ASCII} 
sequence,
-else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is less than the code
+point of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char<=?
 @deffn {Scheme Procedure} char<=? x y
-Return @code{#t} iff @var{x} is less than or equal to @var{y} in the
address@hidden sequence, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is less than or equal
+to the code point of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char>?
 @deffn {Scheme Procedure} char>? x y
-Return @code{#t} iff @var{x} is greater than @var{y} in the @acronym{ASCII}
-sequence, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is greater than the
+code point of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char>=?
 @deffn {Scheme Procedure} char>=? x y
-Return @code{#t} iff @var{x} is greater than or equal to @var{y} in the
address@hidden sequence, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is greater than or
+equal to the code point of @var{y}, else @code{#f}.
 @end deffn
 
+Case-insensitive character comparisons of characters use @emph{Unicode
+case folding}.  In case folding comparisons, if a character is
+lowercase and has an uppercase form that can be expressed as a single
+character, it is converted to uppercase before comparison.  Unicode
+case folding is language independent: it uses rules that are generally
+true, but, it cannot cover all cases for all languages.
+
 @rnindex char-ci=?
 @deffn {Scheme Procedure} char-ci=? x y
-Return @code{#t} iff @var{x} is the same character as @var{y} ignoring
-case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is the same
+as the case-folded code point of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char-ci<?
 @deffn {Scheme Procedure} char-ci<? x y
-Return @code{#t} iff @var{x} is less than @var{y} in the @acronym{ASCII} 
sequence
-ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is less
+than the case-folded code point of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char-ci<=?
 @deffn {Scheme Procedure} char-ci<=? x y
-Return @code{#t} iff @var{x} is less than or equal to @var{y} in the
address@hidden sequence ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is less
+than or equal to the case-folded code point of @var{y}, else
address@hidden
 @end deffn
 
 @rnindex char-ci>?
 @deffn {Scheme Procedure} char-ci>? x y
-Return @code{#t} iff @var{x} is greater than @var{y} in the @acronym{ASCII}
-sequence ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is greater
+than the case-folded code point of @var{y}, else @code{#f}.
 @end deffn
 
 @rnindex char-ci>=?
 @deffn {Scheme Procedure} char-ci>=? x y
-Return @code{#t} iff @var{x} is greater than or equal to @var{y} in the
address@hidden sequence ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is greater
+than or equal to the case-folded code point of @var{y}, else
address@hidden
 @end deffn
 
 @rnindex char-alphabetic?
@@ -1946,14 +1982,15 @@ Return @code{#t} iff @var{chr} is either uppercase or 
lowercase, else
 @rnindex char->integer
 @deffn {Scheme Procedure} char->integer chr
 @deffnx {C Function} scm_char_to_integer (chr)
-Return the number corresponding to ordinal position of @var{chr} in the
address@hidden sequence.
+Return the code point of @var{chr}.
 @end deffn
 
 @rnindex integer->char
 @deffn {Scheme Procedure} integer->char n
 @deffnx {C Function} scm_integer_to_char (n)
-Return the character at position @var{n} in the @acronym{ASCII} sequence.
+Return the character that has code point @var{n}.  The integer @var{n}
+must be a valid code point.  Valid code points are in the ranges 0 to
address@hidden inclusive or @code{#xE000} to @code{#x10FFFF} inclusive.
 @end deffn
 
 @rnindex char-upcase
diff --git a/libguile/chars.c b/libguile/chars.c
index c7cb09c..c2feaa6 100644
--- a/libguile/chars.c
+++ b/libguile/chars.c
@@ -45,7 +45,8 @@ SCM_DEFINE (scm_char_p, "char?", 1, 0, 0,
 
 SCM_DEFINE1 (scm_char_eq_p, "char=?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff @var{x} is the same character as @var{y}, 
else @code{#f}.")
+             "Return @code{#t} iff code point of @var{x} is equal to the code 
point\n"
+             "of @var{y}, else @code{#f}.\n")
 #define FUNC_NAME s_scm_char_eq_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -57,8 +58,8 @@ SCM_DEFINE1 (scm_char_eq_p, "char=?", scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_less_p, "char<?", scm_tc7_rpsubr, 
              (SCM x, SCM y),
-            "Return @code{#t} iff @var{x} is less than @var{y} in the Unicode 
sequence,\n"
-            "else @code{#f}.")
+             "Return @code{#t} iff the code point of @var{x} is less than the 
code\n"
+             "point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_less_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -69,8 +70,8 @@ SCM_DEFINE1 (scm_char_less_p, "char<?", scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff @var{x} is less than or equal to @var{y} in 
the\n"
-            "Unicode sequence, else @code{#f}.")
+             "Return @code{#t} iff the code point of @var{x} is less than or 
equal\n"
+             "to the code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_leq_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -81,8 +82,8 @@ SCM_DEFINE1 (scm_char_leq_p, "char<=?", scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff @var{x} is greater than @var{y} in the 
Unicode\n"
-            "sequence, else @code{#f}.")
+             "Return @code{#t} iff the code point of @var{x} is greater than 
the\n"
+             "code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_gr_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -93,8 +94,8 @@ SCM_DEFINE1 (scm_char_gr_p, "char>?", scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff @var{x} is greater than or equal to @var{y} 
in the\n"
-            "Unicode sequence, else @code{#f}.")
+             "Return @code{#t} iff the code point of @var{x} is greater than 
or\n"
+             "equal to the code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_geq_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -103,10 +104,17 @@ SCM_DEFINE1 (scm_char_geq_p, "char>=?", scm_tc7_rpsubr,
 }
 #undef FUNC_NAME
 
+/* FIXME?: R6RS specifies that these comparisons are case-folded.
+   This is the same thing as comparing the uppercase characters in
+   practice, but, not in theory.  Unicode has table containing their
+   definition of case-folded character mappings.  A more correct
+   implementation would be to use that table and make a char-foldcase
+   function.  */
+
 SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff @var{x} is the same character as @var{y} 
ignoring\n"
-            "case, else @code{#f}.  Case is locale free and not context 
sensitive.")
+             "Return @code{#t} iff the case-folded code point of @var{x} is 
the same\n"
+             "as the case-folded code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_ci_eq_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -117,9 +125,8 @@ SCM_DEFINE1 (scm_char_ci_eq_p, "char-ci=?", scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff the Unicode uppercase form of @var{x} is 
less\n"
-            "than the Unicode uppercase form @var{y} in the Unicode 
sequence,\n"
-            "else @code{#f}.")
+             "Return @code{#t} iff the case-folded code point of @var{x} is 
less\n"
+             "than the case-folded code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_ci_less_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -130,9 +137,9 @@ SCM_DEFINE1 (scm_char_ci_less_p, "char-ci<?", 
scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_ci_leq_p, "char-ci<=?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff the Unicode uppercase form of @var{x} is 
less\n"
-            "than or equal to the Unicode uppercase form of @var{y} in the\n"
-            "Unicode  sequence, else @code{#f}.")
+             "Return @code{#t} iff the case-folded code point of @var{x} is 
less\n"
+             "than or equal to the case-folded code point of @var{y}, else\n"
+             "@code{#f}")
 #define FUNC_NAME s_scm_char_ci_leq_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -143,9 +150,8 @@ SCM_DEFINE1 (scm_char_ci_leq_p, "char-ci<=?", 
scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_ci_gr_p, "char-ci>?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff the Unicode uppercase form of @var{x} is 
greater\n"
-            "than the Unicode uppercase form of @var{y} in the Unicode\n"
-            "sequence, else @code{#f}.")
+             "Return @code{#t} iff the case-folded code point of @var{x} is 
greater\n"
+             "than the case-folded code point of @var{y}, else @code{#f}.")
 #define FUNC_NAME s_scm_char_ci_gr_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -156,9 +162,9 @@ SCM_DEFINE1 (scm_char_ci_gr_p, "char-ci>?", scm_tc7_rpsubr,
 
 SCM_DEFINE1 (scm_char_ci_geq_p, "char-ci>=?", scm_tc7_rpsubr,
              (SCM x, SCM y),
-            "Return @code{#t} iff the Unicode uppercase form of @var{x} is 
greater\n"
-            "than or equal to the Unicode uppercase form of @var{y} in the\n"
-            "Unicode sequence, else @code{#f}.")
+             "Return @code{#t} iff the case-folded code point of @var{x} is 
greater\n"
+             "than or equal to the case-folded code point of @var{y}, else\n"
+             "@code{#f}.")
 #define FUNC_NAME s_scm_char_ci_geq_p
 {
   SCM_VALIDATE_CHAR (1, x);
@@ -196,7 +202,6 @@ SCM_DEFINE (scm_char_whitespace_p, "char-whitespace?", 1, 
0, 0,
 #undef FUNC_NAME
 
 
-
 SCM_DEFINE (scm_char_upper_case_p, "char-upper-case?", 1, 0, 0, 
            (SCM chr),
            "Return @code{#t} iff @var{chr} is uppercase, else @code{#f}.\n")
@@ -217,7 +222,6 @@ SCM_DEFINE (scm_char_lower_case_p, "char-lower-case?", 1, 
0, 0,
 #undef FUNC_NAME
 
 
-
 SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0, 
             (SCM chr),
            "Return @code{#t} iff @var{chr} is either uppercase or lowercase, 
else @code{#f}.\n")
@@ -230,12 +234,9 @@ SCM_DEFINE (scm_char_is_both_p, "char-is-both?", 1, 0, 0,
 #undef FUNC_NAME
 
 
-
-
 SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0, 
             (SCM chr),
-           "Return the number corresponding to ordinal position of @var{chr} 
in the\n"
-           "ASCII sequence.")
+            "Return the code point of @var{chr}.")
 #define FUNC_NAME s_scm_char_to_integer
 {
   SCM_VALIDATE_CHAR (1, chr);
@@ -244,10 +245,11 @@ SCM_DEFINE (scm_char_to_integer, "char->integer", 1, 0, 0,
 #undef FUNC_NAME
 
 
-
 SCM_DEFINE (scm_integer_to_char, "integer->char", 1, 0, 0, 
            (SCM n),
-           "Return the character at position @var{n} in the ASCII sequence.")
+            "Return the character that has code point @var{n}.  The integer 
@var{n}\n"
+            "must be a valid code point.  Valid code points are in the ranges 
0 to\n"
+            "@code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF} 
inclusive.")
 #define FUNC_NAME s_scm_integer_to_char
 {
   scm_t_wchar cn;
diff --git a/libguile/strings.c b/libguile/strings.c
index 4a8390d..06e3359 100644
--- a/libguile/strings.c
+++ b/libguile/strings.c
@@ -1477,15 +1477,18 @@ scm_is_string (SCM obj)
   return IS_STRING (obj);
 }
 
-static SCM
-scm_from_stringn (const char *str, size_t len, const char *encoding,
-                  scm_t_string_failed_conversion_handler handler)
+SCM
+scm_i_from_stringn (const char *str, size_t len, const char *encoding,
+                    scm_t_string_failed_conversion_handler handler)
 {
   size_t u32len, i;
   scm_t_wchar *u32;
   int wide = 0;
   SCM res;
 
+  if (len == 0)
+    return scm_nullstr;
+
   if (encoding == NULL)
     {
       /* If encoding is null, use Latin-1.  */
@@ -1575,7 +1578,7 @@ scm_from_locale_stringn (const char *str, size_t len)
       hndl = SCM_FAILED_CONVERSION_ERROR;
     }
 
-  return scm_from_stringn (str, len, enc, hndl);
+  return scm_i_from_stringn (str, len, enc, hndl);
 }
 
 SCM
@@ -1590,7 +1593,7 @@ scm_from_locale_string (const char *str)
 SCM
 scm_i_from_utf8_string (const scm_t_uint8 *str)
 {
-  return scm_from_stringn ((const char *) str,
+  return scm_i_from_stringn ((const char *) str,
                            strlen ((char *) str), "UTF-8",
                            SCM_FAILED_CONVERSION_ERROR);
 }
@@ -1681,7 +1684,7 @@ unistring_escapes_to_guile_escapes (char **bufp, size_t 
*lenp)
 }
 
 char *
-scm_to_locale_stringn (SCM str, size_t * lenp)
+scm_to_locale_stringn (SCM str, size_t *lenp)
 {
   SCM outport;
   scm_t_port *pt;
@@ -1788,6 +1791,8 @@ scm_to_stringn (SCM str, size_t *lenp, const char 
*encoding,
                           scm_list_2 (scm_from_locale_string (enc),
                                       str));
         }
+      if (handler == SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE)
+        unistring_escapes_to_guile_escapes (&buf, &len);
     }
   if (lenp)
     *lenp = len;
diff --git a/libguile/strings.h b/libguile/strings.h
index 2393aae..c521926 100644
--- a/libguile/strings.h
+++ b/libguile/strings.h
@@ -111,6 +111,10 @@ SCM_API SCM scm_substring_shared (SCM str, SCM start, SCM 
end);
 SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end);
 SCM_API SCM scm_string_append (SCM args);
 
+SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, 
+                                     const char *encoding,
+                                     scm_t_string_failed_conversion_handler 
+                                     handler);
 SCM_API SCM scm_c_make_string (size_t len, SCM chr);
 SCM_API size_t scm_c_string_length (SCM str);
 SCM_API size_t scm_c_symbol_length (SCM sym);
diff --git a/libguile/strports.c b/libguile/strports.c
index 490a15f..5bfeaad 100644
--- a/libguile/strports.c
+++ b/libguile/strports.c
@@ -301,9 +301,9 @@ scm_i_mkstrport (SCM pos, const char *locale_str, size_t 
str_len, long modes, co
      to a locale representation for storage.  But, since string ports
      rely on string functionality for their memory management, we need
      to create a new string that has the 8-bit locale representation
-     of the underlying string.  This violates the guideline that the
-     internal encoding of characters in strings is in unicode
-     codepoints. */
+     of the underlying string.  
+
+     locale_str is already in the locale of the port.  */
   str = scm_i_make_string (str_len, &buf);
   memcpy (buf, locale_str, str_len);
 
@@ -348,13 +348,18 @@ scm_mkstrport (SCM pos, SCM str, long modes, const char 
*caller)
      of the underlying string.  This violates the guideline that the
      internal encoding of characters in strings is in unicode
      codepoints. */
+
+  /* Ports are initialized with the thread-default values for encoding and
+     invalid sequence handling.  */
   buf = scm_to_locale_stringn (str, &str_len);
   z = scm_i_mkstrport (pos, buf, str_len, modes, caller);
   free (buf);
   return z;
 }
 
-/* create a new string from a string port's buffer.  */
+/* Create a new string from a string port's buffer, converting from
+   the port's 8-bit locale-specific representation to the standard
+   string representation.  */
 SCM scm_strport_to_string (SCM port)
 {
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
@@ -363,7 +368,18 @@ SCM scm_strport_to_string (SCM port)
   if (pt->rw_active == SCM_PORT_WRITE)
     st_flush (port);
 
-  str = scm_from_locale_stringn ((char *)pt->read_buf, pt->read_buf_size);
+  if (pt->read_buf_size == 0)
+    return scm_nullstr;
+
+  if (pt->encoding == NULL)
+    {
+      char *buf;
+      str = scm_i_make_string (pt->read_buf_size, &buf);
+      memcpy (buf, pt->read_buf, pt->read_buf_size);
+    }
+  else
+    str = scm_i_from_stringn ((char *)pt->read_buf, pt->read_buf_size,
+                              pt->encoding, pt->ilseq_handler);
   scm_remember_upto_here_1 (port);
   return str;
 }
diff --git a/test-suite/tests/encoding-iso88591.test 
b/test-suite/tests/encoding-iso88591.test
index d4de5e5..8e85436 100644
--- a/test-suite/tests/encoding-iso88591.test
+++ b/test-suite/tests/encoding-iso88591.test
@@ -33,6 +33,67 @@
 (if (defined? 'setlocale)
     (set! oldlocale (setlocale LC_ALL "")))
 
+(define ascii-a (integer->char 65))     ; LATIN CAPITAL LETTER A
+(define a-acute (integer->char #x00c1)) ; LATIN CAPITAL LETTER A WITH ACUTE
+(define alpha (integer->char #x03b1))   ; GREEK SMALL LETTER ALPHA
+(define cherokee-a (integer->char #x13a0)) ; CHEROKEE LETTER A
+
+(with-test-prefix "characters"
+  (pass-if "input A"
+    (char=? ascii-a #\A))
+
+  (pass-if "input A acute"
+    (char=? a-acute #\Á))
+
+  (pass-if "display A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-1")
+             (set-port-conversion-strategy! pt 'escape)
+             (display ascii-a pt)
+             (string=? "A"
+                       (get-output-string pt))))
+
+  (pass-if "display A acute"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-1")
+             (set-port-conversion-strategy! pt 'escape)
+             (display a-acute pt)
+             (string=? "Á"
+                       (get-output-string pt))))
+
+  (pass-if "display alpha"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-1")
+             (set-port-conversion-strategy! pt 'escape)
+             (display alpha pt)
+             (string-ci=? "\\u03b1"
+                       (get-output-string pt))))
+
+  (pass-if "display Cherokee a"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-1")
+             (set-port-conversion-strategy! pt 'escape)
+             (display cherokee-a pt)
+             (string-ci=? "\\u13a0"
+                       (get-output-string pt))))
+
+  (pass-if "write A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-1")
+             (set-port-conversion-strategy! pt 'escape)
+             (write ascii-a pt)
+             (string=? "#\\A"
+                       (get-output-string pt))))
+
+  (pass-if "write A acute"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-1")
+             (set-port-conversion-strategy! pt 'escape)
+             (write a-acute pt)
+             (string=? "#\\Á"
+                       (get-output-string pt)))))
+
+
 (define s1 "última")
 (define s2 "cédula")
 (define s3 "años")
diff --git a/test-suite/tests/encoding-iso88597.test 
b/test-suite/tests/encoding-iso88597.test
index 2221269..9f278f1 100644
--- a/test-suite/tests/encoding-iso88597.test
+++ b/test-suite/tests/encoding-iso88597.test
@@ -31,6 +31,65 @@
 (define oldlocale #f)
 (if (defined? 'setlocale)
     (set! oldlocale (setlocale LC_ALL "")))
+(define ascii-a (integer->char 65))     ; LATIN CAPITAL LETTER A
+(define a-acute (integer->char #x00c1)) ; LATIN CAPITAL LETTER A WITH ACUTE
+(define alpha (integer->char #x03b1))   ; GREEK SMALL LETTER ALPHA
+(define cherokee-a (integer->char #x13a0)) ; CHEROKEE LETTER A
+
+(with-test-prefix "characters"
+  (pass-if "input A"
+    (char=? ascii-a #\A))
+
+  (pass-if "input alpha"
+    (char=? alpha #\á))
+
+  (pass-if "display A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-7")
+             (set-port-conversion-strategy! pt 'escape)
+             (display ascii-a pt)
+             (string=? "A"
+                       (get-output-string pt))))
+
+  (pass-if "display A acute"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-7")
+             (set-port-conversion-strategy! pt 'escape)
+             (display a-acute pt)
+             (string-ci=? "\\xc1"
+                       (get-output-string pt))))
+
+  (pass-if "display alpha"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-7")
+             (set-port-conversion-strategy! pt 'escape)
+             (display alpha pt)
+             (string-ci=? "á"
+                       (get-output-string pt))))
+
+  (pass-if "display Cherokee A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-7")
+             (set-port-conversion-strategy! pt 'escape)
+             (display cherokee-a pt)
+             (string-ci=? "\\u13a0"
+                       (get-output-string pt))))
+
+  (pass-if "write A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-7")
+             (set-port-conversion-strategy! pt 'escape)
+             (write ascii-a pt)
+             (string=? "#\\A"
+                       (get-output-string pt))))
+
+  (pass-if "write alpha"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "ISO-8859-7")
+             (set-port-conversion-strategy! pt 'escape)
+             (write alpha pt)
+             (string=? "#\\á"
+                       (get-output-string pt)))))
 
 (define s1 "Ðåñß")
 (define s2 "ôçò")
diff --git a/test-suite/tests/encoding-utf8.test 
b/test-suite/tests/encoding-utf8.test
index a2613f1..d2bdb69 100644
--- a/test-suite/tests/encoding-utf8.test
+++ b/test-suite/tests/encoding-utf8.test
@@ -32,6 +32,88 @@
 (if (defined? 'setlocale)
     (set! oldlocale (setlocale LC_ALL "")))
 
+(define ascii-a (integer->char 65))     ; LATIN CAPITAL LETTER A
+(define a-acute (integer->char #x00c1)) ; LATIN CAPITAL LETTER A WITH ACUTE
+(define alpha (integer->char #x03b1))   ; GREEK SMALL LETTER ALPHA
+(define cherokee-a (integer->char #x13a0)) ; CHEROKEE LETTER A
+
+(with-test-prefix "characters"
+  (pass-if "input A"
+    (char=? ascii-a #\A))
+
+  (pass-if "input A acute"
+    (char=? a-acute #\Á))
+ 
+  (pass-if "input alpha"
+    (char=? alpha #\α))
+ 
+  (pass-if "input Cherokee A"
+    (char=? cherokee-a #\Ꭰ))
+ 
+ (pass-if "display A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'substitute)
+             (display ascii-a pt)
+             (string=? "A"
+                       (get-output-string pt))))
+
+  (pass-if "display A acute"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'substitute)
+             (display a-acute pt)
+             (string=? "Á"
+                       (get-output-string pt))))
+
+  (pass-if "display alpha"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'substitute)
+             (display alpha pt)
+             (string-ci=? "α"
+                       (get-output-string pt))))
+
+  (pass-if "display Cherokee A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'substitute)
+             (display cherokee-a pt)
+             (string-ci=? "Ꭰ"
+                       (get-output-string pt))))
+
+  (pass-if "write A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'escape)
+             (write ascii-a pt)
+             (string=? "#\\A"
+                       (get-output-string pt))))
+
+  (pass-if "write A acute"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'escape)
+             (write a-acute pt)
+             (string=? "#\\Á"
+                       (get-output-string pt))))
+
+  (pass-if "write alpha"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'escape)
+             (write alpha pt)
+             (string=? "#\\α"
+                       (get-output-string pt))))
+
+  (pass-if "write Cherokee A"
+           (let ((pt (open-output-string)))
+             (set-port-encoding! pt "UTF-8")
+             (set-port-conversion-strategy! pt 'escape)
+             (write cherokee-a pt)
+             (string=? "#\\Ꭰ"
+                       (get-output-string pt)))))
+
 (define s1 "última")
 (define s2 "cédula")
 (define s3 "años")


hooks/post-receive
-- 
GNU Guile




reply via email to

[Prev in Thread] Current Thread [Next in Thread]