guile-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Guile-commits] GNU Guile branch, string_abstraction2, updated. 96664081


From: Michael Gran
Subject: [Guile-commits] GNU Guile branch, string_abstraction2, updated. 966640813a00a1cdbbea19918fcfd932edcbd59f
Date: Mon, 25 May 2009 19:10:21 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU Guile".

http://git.savannah.gnu.org/cgit/guile.git/commit/?id=966640813a00a1cdbbea19918fcfd932edcbd59f

The branch, string_abstraction2 has been updated
       via  966640813a00a1cdbbea19918fcfd932edcbd59f (commit)
       via  b2d407915c35e5a45d65be9c6c15abcda16b8c87 (commit)
       via  84df3e211d1eaab8b0735a6f1f27f56e4f579875 (commit)
      from  823e444052817ee120d87a3575acb4f767f17475 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 966640813a00a1cdbbea19918fcfd932edcbd59f
Merge: b2d407915c35e5a45d65be9c6c15abcda16b8c87 
823e444052817ee120d87a3575acb4f767f17475
Author: Michael Gran <address@hidden>
Date:   Mon May 25 12:09:36 2009 -0700

    Merge branch 'string_abstraction2' of ssh://address@hidden/srv/git/guile 
into string_abstraction2
    
    Conflicts:
        test-suite/tests/encoding_iso88597.test
        test-suite/tests/encoding_utf8.test

commit b2d407915c35e5a45d65be9c6c15abcda16b8c87
Author: Michael Gran <address@hidden>
Date:   Mon May 25 11:40:46 2009 -0700

    Read and write wide strings.  Allow wide symbols.   Interpret wide code.
    
        * r4rs.scm (with-output-to-locale-u8vector): add new function
        to gather output as a locale vector
    
        * strports.h: new *_locale_u8vector function declarations
    
        * strports.c: deal with the fact that ports are inherently 8-bit
        while strings are not.  Make string ports hold the locale
        representation of the string, not the codepoints.  Add helper
        funcs to allow gathering the internal representation of a string.
        (scm_strport_to_locale_u8vector): new func
        (scm_call_with_output_locale_u8vector): new func
        (scm_open_input_locale_u8vector): new func
        (scm_get_output_locale_u8vector): new func
    
        * strings.h (SCM_PORT_ENCODING): new macro to deal with
        scm_i_port_encding possibly being NULL
    
        * strings.c: Revert SCM_DEBUG_STRING to SCM_DEBUG.  Copy instead
        of using the const locale encoding string in scm_i_port_encoding.
        (scm_i_symbol_ref): Allow wide symbols
        (scm_i_symbol_ref_to_wchar): Allow wide symbols
        (scm_i_symbol_ref_to_char): Allow wide symbols
        (scm_i_symbol_ref_eq_char): Allow wide symbols
        (scm_i_symbol_strcmp): Allow wide symbols
        (scm_set_conversion_error_behavior_x): Change
        calling. 'question-mark to more correct 'substitute.  Fix bug
        where 'error was always chosen.
        (scm_string): Do some optimization for binary encodings.
        (unistring_escapes_to_guile_escapes): new function to convert
        between libunistring and guile conventions
        (scm_to_locale_stringn): clarify error messages
    
        * read.c (read_complete_token): new func that reads a locale-aware
        token.
        (scm_read_string): add new string escapes \uNNNN and \UNNNNNN for
        unicode characters
        (scm_read_number): use read_complete_token
        (scm_read_mixed_case_symbol): use read_complete_token
    
        * print.h: declare new func scm_charprint
    
        * print.c (scm_charprint): new func to print characters
        (iprin1): use scm_charprint
    
        * posix.c (scm_setlocale): setlocale updates the locale encoding
        used for ports
        (scm_setbinary): new func, turn off locale encoding
        (scm_setencoding): new func, explicitly set encoding irrespective
        of locale
    
        * ports.h (scm_getc): new declaration
    
        * ports.c (scm_getc): New func to assemble a complete codepoint.
        (scm_ungetc): Unget complete codepoints
        (scm_peek_char): Peek complete codepoints
    
        * inline.h (scm_raw_getc): Split the getting of a character into
        two parts, pulling the byte (scm_raw_getc) and assembling the
        codepoints (scm_getc)
    
        * chars.h: Add duplicate declaraction of scm_t_wchar here, so that
        numbers.h doesn't have to be included just for that one
        definition.

commit 84df3e211d1eaab8b0735a6f1f27f56e4f579875
Author: Michael Gran <address@hidden>
Date:   Sun May 24 17:15:10 2009 -0700

    add tests for encoding/decoding wide strings
    
        * test-suite/tests/encoding_utf8.test: new
        * test-suite/tests/encoding_iso88591.test: new
        * test-suite/tests/encoding_iso88597.test: new
        * test-suite/tests/encoding_escapes.test: new

-----------------------------------------------------------------------

Summary of changes:
 libguile/chars.h      |    6 +-
 libguile/inline.h     |   27 +----
 libguile/numbers.h    |    3 +
 libguile/ports.c      |  261 +++++++++++++++++++++++++++++++++++++---------
 libguile/ports.h      |    1 +
 libguile/posix.c      |   47 ++++++++-
 libguile/posix.h      |    1 +
 libguile/print.c      |   16 +++-
 libguile/print.h      |    2 +
 libguile/read.c       |  283 ++++++++++++++++++++++++-------------------------
 libguile/strings.c    |  205 +++++++++++++++++++++++++++---------
 libguile/strings.h    |    5 +-
 libguile/strports.c   |  137 +++++++++++++++++++-----
 libguile/strports.h   |    6 +
 module/ice-9/r4rs.scm |    6 +
 15 files changed, 704 insertions(+), 302 deletions(-)

diff --git a/libguile/chars.h b/libguile/chars.h
index 92e1bc0..42e4f7b 100644
--- a/libguile/chars.h
+++ b/libguile/chars.h
@@ -23,12 +23,16 @@
 
 
 #include "libguile/__scm.h"
-#include "libguile/numbers.h" /* for def'n of scm_t_wchar */
 
 
 /* Immediate Characters
  */
 
+#ifndef SCM_WCHAR_DEFINED
+typedef scm_t_int32 scm_t_wchar;
+#define SCM_WCHAR_DEFINED
+#endif
+
 #define SCM_CHARP(x) (SCM_ITAG8(x) == scm_tc8_char)
 #define SCM_CHAR(x) ((scm_t_wchar)SCM_ITAG8_DATA(x))
 #define SCM_MAKE_CHAR(x) (scm_i_make_char(x))
diff --git a/libguile/inline.h b/libguile/inline.h
index 8f4ff74..23598db 100644
--- a/libguile/inline.h
+++ b/libguile/inline.h
@@ -86,7 +86,7 @@ SCM_API void scm_array_handle_set (scm_t_array_handle *h, 
ssize_t pos, SCM val);
 
 SCM_API int scm_is_pair (SCM x);
 
-SCM_API scm_t_wchar scm_getc (SCM port);
+SCM_API int scm_raw_getc (SCM port);
 SCM_API void scm_putc (char c, SCM port);
 SCM_API void scm_puts (const char *str_data, SCM port);
 
@@ -289,8 +289,8 @@ scm_is_pair (SCM x)
 #ifndef SCM_INLINE_C_INCLUDING_INLINE_H
 SCM_C_EXTERN_INLINE
 #endif
-scm_t_wchar
-scm_getc (SCM port)
+int
+scm_raw_getc (SCM port)
 {
   int c;
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
@@ -310,27 +310,6 @@ scm_getc (SCM port)
 
   c = *(pt->read_pos++);
 
-  switch (c)
-    {
-      case '\a':
-        break;
-      case '\b':
-        SCM_DECCOL (port);
-        break;
-      case '\n':
-        SCM_INCLINE (port);
-        break;
-      case '\r':
-        SCM_ZEROCOL (port);
-        break;
-      case '\t':
-        SCM_TABCOL (port);
-        break;
-      default:
-        SCM_INCCOL (port);
-        break;
-    }
-
   return c;
 }
 
diff --git a/libguile/numbers.h b/libguile/numbers.h
index f349e2b..330248d 100644
--- a/libguile/numbers.h
+++ b/libguile/numbers.h
@@ -173,7 +173,10 @@ typedef struct scm_t_complex
   double imag;
 } scm_t_complex;
 
+#ifndef SCM_WCHAR_DEFINED
 typedef scm_t_int32 scm_t_wchar;
+#define SCM_WCHAR_DEFINED
+#endif
 
 
 
diff --git a/libguile/ports.c b/libguile/ports.c
index 275e044..4ed30c2 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -29,6 +29,8 @@
 #include <errno.h>
 #include <fcntl.h>  /* for chsize on mingw */
 #include <assert.h>
+#include <uniconv.h>
+#include <unistr.h>
 
 #include "libguile/_scm.h"
 #include "libguile/async.h"
@@ -940,6 +942,156 @@ SCM_DEFINE (scm_read_char, "read-char", 0, 1, 0,
 }
 #undef FUNC_NAME
 
+#define SCM_MBCHAR_BUF_SIZE (5)
+
+scm_t_wchar
+scm_getc (SCM port)
+{
+  int c;
+  unsigned int bufcount = 0;
+  char buf[SCM_MBCHAR_BUF_SIZE];
+  scm_t_wchar codepoint = 0;
+  scm_t_uint32 *u32;
+  size_t u32len;
+
+
+  c = scm_raw_getc (port);
+  if (c == EOF)
+    return (scm_t_wchar) EOF;
+
+  buf[0] = c;
+  bufcount++;
+
+  /* Handle most ASCII characters quickly.  */
+  if ((scm_i_port_encoding == NULL) 
+      || ('0' <= c && c <= '9') || ('a' <= c && c <= 'z') || ('A' <= c && c <= 
'Z'))
+    {
+      /* In all (?) encodings except UTF-16 and UTF-32, alnum characters
+        represent themselves. */
+      codepoint = buf[0];
+      goto success;
+    }
+
+  for (;;)
+    {
+      u32 = u32_conv_from_encoding (SCM_PORT_ENCODING, 
scm_conversion_strategy, 
+                                   buf, bufcount, NULL, NULL, &u32len);
+      if (u32 == NULL || u32len == 0)
+       {
+         if (errno == ENOMEM)
+           scm_memory_error ("Input decoding");
+         /* If errno is EILSEQ or EINVAL, new characters probably
+            won't help, but, we'll try up to SCM_MBCHAR_BUF_SIZE
+            bytes, just in case. */
+
+         /* Keep looping.  */
+       }
+      else 
+       {
+         /* Complete codepoint found. */
+         codepoint = u32[0];
+         free (u32);
+         goto success;
+       }
+
+      if (bufcount == SCM_MBCHAR_BUF_SIZE - 1)
+       {
+         /* We've read several bytes and didn't find a good
+            codepoint.  Give up.  */
+         goto failure;
+       }
+
+      c = scm_raw_getc (port);
+
+      if (c == EOF)
+       {
+         /* EOF before a complete character was read.  Push it all
+            back and return EOF. */
+         while (bufcount > 0)
+           {
+             /* FIXME: this will probably cause errors in the port column. */
+             scm_ungetc (buf[bufcount-1], port);
+             bufcount --;
+           }
+         return EOF;
+       }
+
+      buf[bufcount++] = c;
+    }
+
+ success:
+  switch (codepoint)
+    {
+    case '\a':
+      break;
+    case '\b':
+      SCM_DECCOL (port);
+      break;
+    case '\n':
+      SCM_INCLINE (port);
+        break;
+    case '\r':
+      SCM_ZEROCOL (port);
+      break;
+    case '\t':
+      SCM_TABCOL (port);
+      break;
+    default:
+      SCM_INCCOL (port);
+      break;
+    }
+
+  return codepoint;
+
+ failure:
+  if (errno == EILSEQ)
+    {
+      int z, pos = 0;
+      char encoded_buf[4*SCM_MBCHAR_BUF_SIZE+1];
+
+      /* This character sequence can't be converted, and extra
+        characters won't help.  Try hard to make a string for the
+        error message. */
+
+      if (scm_conversion_strategy != iconveh_escape_sequence)
+       {
+         u32 = u32_conv_from_encoding (SCM_PORT_ENCODING, 
iconveh_escape_sequence, 
+                                       buf, bufcount, NULL, NULL, &u32len);
+         if (u32)
+           {
+             SCM encoded_str;
+             scm_t_wchar *wbuf;
+             encoded_str = scm_i_make_wide_string (u32len, &wbuf);
+             u32_cpy ((scm_t_uint32 *) wbuf, (scm_t_uint32 *) u32, u32len);
+             free (u32);
+             
+             scm_misc_error (NULL, "input encoding error for ~s: ~s",
+                             scm_list_2 (scm_from_locale_string 
(SCM_PORT_ENCODING),
+                                         encoded_str));
+           }
+       }
+
+      for (z = 0; z < bufcount; z++)
+       {
+         if (buf[z]>=32 && buf[z]<=127)
+           pos += sprintf (encoded_buf+pos, "%c", buf[z]);
+         else
+           pos += sprintf (encoded_buf+pos, "\\x%02x", (unsigned char) buf[z]);
+       }
+      
+      scm_misc_error (NULL, "Input encoding error for ~s: ~s",
+                     scm_list_2 (scm_from_locale_string (SCM_PORT_ENCODING),
+                                 scm_from_locale_string (encoded_buf)));
+    }
+  else
+    scm_misc_error (NULL, "Input encoding error (invalid)\n", SCM_EOL);
+
+  /* Never gets here.  */
+  return 0;
+}
+
+
+
 /* this should only be called when the read buffer is empty.  it
    tries to refill the read buffer.  it returns the first char from
    the port, which is either EOF or *(pt->read_pos).  */
@@ -970,7 +1122,7 @@ scm_fill_input (SCM port)
  * column. */
 
 static void
-update_port_lf (char c, SCM port)
+update_port_lf (scm_t_wchar c, SCM port)
 {
   if (c == '\a') {
   }
@@ -1003,7 +1155,7 @@ scm_lfwrite (const char *ptr, size_t size, SCM port)
   ptob->write (port, ptr, size);
 
   for (; size; ptr++, size--) 
-    update_port_lf (*ptr, port);
+    update_port_lf ((scm_t_wchar)(unsigned char)*ptr, port);
 
   if (pt->rw_random)
     pt->rw_active = SCM_PORT_WRITE;
@@ -1237,59 +1389,72 @@ scm_ungetc (scm_t_wchar c, SCM port)
 #define FUNC_NAME "scm_ungetc"
 {
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  scm_t_wchar *wbuf;
+  SCM str = scm_i_make_wide_string (1, &wbuf);
+  char *buf;
+  size_t len;
+  int i;
 
-  if (pt->read_buf == pt->putback_buf)
-    /* already using the put-back buffer.  */
+  wbuf[0] = c;
+  buf = scm_to_locale_stringn (str, &len);
+    
+  for (i = len - 1; i >= 0; i--)
     {
-      /* enlarge putback_buf if necessary.  */
-      if (pt->read_end == pt->read_buf + pt->read_buf_size
-         && pt->read_buf == pt->read_pos)
-       {
-         size_t new_size = pt->read_buf_size * 2;
-         unsigned char *tmp = (unsigned char *)
-           scm_gc_realloc (pt->putback_buf, pt->read_buf_size, new_size,
-                           "putback buffer");
-
-         pt->read_pos = pt->read_buf = pt->putback_buf = tmp;
-         pt->read_end = pt->read_buf + pt->read_buf_size;
-         pt->read_buf_size = pt->putback_buf_size = new_size;
-       }
-
-      /* shift any existing bytes to buffer + 1.  */
-      if (pt->read_pos == pt->read_end)
-       pt->read_end = pt->read_buf + 1;
-      else if (pt->read_pos != pt->read_buf + 1)
+      if (pt->read_buf == pt->putback_buf)
+       /* already using the put-back buffer.  */
        {
-         int count = pt->read_end - pt->read_pos;
-
-         memmove (pt->read_buf + 1, pt->read_pos, count);
-         pt->read_end = pt->read_buf + 1 + count;
+         /* enlarge putback_buf if necessary.  */
+         if (pt->read_end == pt->read_buf + pt->read_buf_size
+             && pt->read_buf == pt->read_pos)
+           {
+             size_t new_size = pt->read_buf_size * 2;
+             unsigned char *tmp = (unsigned char *)
+               scm_gc_realloc (pt->putback_buf, pt->read_buf_size, new_size,
+                               "putback buffer");
+             
+             pt->read_pos = pt->read_buf = pt->putback_buf = tmp;
+             pt->read_end = pt->read_buf + pt->read_buf_size;
+             pt->read_buf_size = pt->putback_buf_size = new_size;
+           }
+         
+         /* shift any existing bytes to buffer + 1.  */
+         if (pt->read_pos == pt->read_end)
+           pt->read_end = pt->read_buf + 1;
+         else if (pt->read_pos != pt->read_buf + 1)
+           {
+             int count = pt->read_end - pt->read_pos;
+             
+             memmove (pt->read_buf + 1, pt->read_pos, count);
+             pt->read_end = pt->read_buf + 1 + count;
+           }
+         
+         pt->read_pos = pt->read_buf;
        }
-
-      pt->read_pos = pt->read_buf;
-    }
-  else
-    /* switch to the put-back buffer.  */
-    {
-      if (pt->putback_buf == NULL)
+      else
+       /* switch to the put-back buffer.  */
        {
-         pt->putback_buf
-           = (unsigned char *) scm_gc_malloc (SCM_INITIAL_PUTBACK_BUF_SIZE,
-                                              "putback buffer");
-         pt->putback_buf_size = SCM_INITIAL_PUTBACK_BUF_SIZE;
+         if (pt->putback_buf == NULL)
+           {
+             pt->putback_buf
+               = (unsigned char *) scm_gc_malloc (SCM_INITIAL_PUTBACK_BUF_SIZE,
+                                                  "putback buffer");
+             pt->putback_buf_size = SCM_INITIAL_PUTBACK_BUF_SIZE;
+           }
+         
+         pt->saved_read_buf = pt->read_buf;
+         pt->saved_read_pos = pt->read_pos;
+         pt->saved_read_end = pt->read_end;
+         pt->saved_read_buf_size = pt->read_buf_size;
+         
+         pt->read_pos = pt->read_buf = pt->putback_buf;
+         pt->read_end = pt->read_buf + 1;
+         pt->read_buf_size = pt->putback_buf_size;
        }
-
-      pt->saved_read_buf = pt->read_buf;
-      pt->saved_read_pos = pt->read_pos;
-      pt->saved_read_end = pt->read_end;
-      pt->saved_read_buf_size = pt->read_buf_size;
-
-      pt->read_pos = pt->read_buf = pt->putback_buf;
-      pt->read_end = pt->read_buf + 1;
-      pt->read_buf_size = pt->putback_buf_size;
+      
+      *pt->read_buf = buf[i];
     }
 
-  *pt->read_buf = c;
+  free (buf);
 
   if (pt->rw_random)
     pt->rw_active = SCM_PORT_READ;
@@ -1339,7 +1504,7 @@ SCM_DEFINE (scm_peek_char, "peek-char", 0, 1, 0,
            "to @code{read-char} would have hung.")
 #define FUNC_NAME s_scm_peek_char
 {
-  int c, column;
+  scm_t_wchar c, column;
   if (SCM_UNBNDP (port))
     port = scm_current_input_port ();
   else
diff --git a/libguile/ports.h b/libguile/ports.h
index 233d597..3cb139c 100644
--- a/libguile/ports.h
+++ b/libguile/ports.h
@@ -268,6 +268,7 @@ SCM_API SCM scm_eof_object_p (SCM x);
 SCM_API SCM scm_force_output (SCM port);
 SCM_API SCM scm_flush_all_ports (void);
 SCM_API SCM scm_read_char (SCM port);
+SCM_INTERNAL scm_t_wchar scm_getc (SCM port);
 SCM_API size_t scm_c_read (SCM port, void *buffer, size_t size);
 SCM_API void scm_c_write (SCM port, const void *buffer, size_t size);
 SCM_API void scm_lfwrite (const char *ptr, size_t size, SCM port);
diff --git a/libguile/posix.c b/libguile/posix.c
index fe5f8b3..8cfec5c 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1665,7 +1665,9 @@ SCM_DEFINE (scm_setlocale, "setlocale", 1, 1, 0,
       SCM_SYSERROR;
     }
 
-  scm_i_port_encoding = locale_charset ();
+  if (scm_i_port_encoding != NULL)
+    free (scm_i_port_encoding);
+  scm_i_port_encoding = strdup (locale_charset ());
   /* Recompute the standard SRFI-14 character sets in a locale-dependent
      (actually charset-dependent) way.  */
   scm_srfi_14_compute_char_sets ();
@@ -1676,7 +1678,42 @@ SCM_DEFINE (scm_setlocale, "setlocale", 1, 1, 0,
 #undef FUNC_NAME
 #endif /* HAVE_SETLOCALE */
 
-const char const scm_i_binary_locale[] = "ISO-8859-1";
+SCM_DEFINE (scm_setencoding, "setencoding", 1, 0, 0,
+           (SCM enc),
+           "Sets the character encoding that will be used to interpret all\n"
+           "port I/O.  Normally, one would set this using @code{setlocale},\n"
+           "so that the character encoding is in sync with the environment.\n"
+           "This command can be used to set it directly.\n"
+           "\n"
+           "Note that since the output to the current output and error ports\n"
+           "is also interpreted using this locale, non-ASCII strings may not\n"
+           "be correctly displayed by the terminal device that receives 
current\n"
+           "output and current error.\n"
+           "\n"
+           "Subsequent calls to @code{setlocale} will report the encoding to\n"
+           "be that proscribed by the locale.")
+#define FUNC_NAME s_scm_setencoding
+{
+  const char str[] = " ";
+  scm_t_uint32 *u32;
+  size_t u32len;
+  char *new_enc = scm_to_locale_string (enc);
+
+  u32 = u32_conv_from_encoding (new_enc, iconveh_error, str, 1,
+                               NULL, NULL, &u32len);
+  if (u32 == NULL)
+    scm_misc_error (FUNC_NAME, "invalid or unknown character encoding ~s",
+                   scm_list_1 (enc));
+  free (u32);
+  if (scm_i_port_encoding != NULL)
+    free (scm_i_port_encoding);
+
+  scm_i_port_encoding = new_enc;
+  
+
+  return SCM_UNSPECIFIED;
+}
+#undef FUNC_NAME
 
 SCM_DEFINE (scm_setbinary, "setbinary", 0, 0, 0,
            (void),
@@ -1693,9 +1730,11 @@ SCM_DEFINE (scm_setbinary, "setbinary", 0, 0, 0,
            "\n"
            "After @code{setbinary} has been called, subsequent calls to \n"
            "@code{setlocale} will, again, modify the character encoding.\n")
-#define FUNC_NAME s_scm_set_binary
+#define FUNC_NAME s_scm_setbinary
 {
-  scm_i_port_encoding = scm_i_binary_locale;
+  if (scm_i_port_encoding != NULL)
+    free (scm_i_port_encoding);
+  scm_i_port_encoding = NULL;
   return SCM_UNSPECIFIED;
 }
 #undef FUNC_NAME
diff --git a/libguile/posix.h b/libguile/posix.h
index b0873b6..f87a49a 100644
--- a/libguile/posix.h
+++ b/libguile/posix.h
@@ -73,6 +73,7 @@ SCM_API SCM scm_access (SCM path, SCM how);
 SCM_API SCM scm_getpid (void);
 SCM_API SCM scm_putenv (SCM str);
 SCM_API SCM scm_setlocale (SCM category, SCM locale);
+SCM_API SCM scm_setencoding (SCM encoding);
 SCM_API SCM scm_setbinary (void);
 SCM_API SCM scm_mknod (SCM path, SCM type, SCM perms, SCM dev);
 SCM_API SCM scm_nice (SCM incr);
diff --git a/libguile/print.c b/libguile/print.c
index f63952c..55ea440 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -455,10 +455,10 @@ iprin1 (SCM exp, SCM port, scm_print_state *pstate)
              else if (i < 0 || i > '\177')
                scm_intprint (i, 8, port);
              else
-               scm_putc (i, port);
+               scm_charprint (i, port);
            }
          else
-           scm_putc (i, port);
+           scm_charprint (i, port);
        }
       else if (SCM_IFLAGP (exp)
               && ((size_t) SCM_IFLAGNUM (exp) < (sizeof iflagnames / sizeof 
(char *))))
@@ -750,6 +750,18 @@ scm_prin1 (SCM exp, SCM port, int writingp)
 }
 
 
+/* Print a character.
+ */
+void
+scm_charprint (scm_t_wchar ch, SCM port)
+{
+  scm_t_wchar *wbuf;
+  SCM wstr = scm_i_make_wide_string (1, &wbuf);
+
+  wbuf[0] = ch;
+  scm_lfwrite_str (wstr, port);
+}
+
 /* Print an integer.
  */
 
diff --git a/libguile/print.h b/libguile/print.h
index a77eb6e..b6a18d7 100644
--- a/libguile/print.h
+++ b/libguile/print.h
@@ -24,6 +24,7 @@
 
 #include "libguile/__scm.h"
 
+#include "libguile/chars.h" 
 #include "libguile/options.h"
 
 
@@ -76,6 +77,7 @@ SCM_API SCM scm_print_options (SCM setting);
 SCM_API SCM scm_make_print_state (void);
 SCM_API void scm_free_print_state (SCM print_state);
 SCM_INTERNAL SCM scm_i_port_with_print_state (SCM port, SCM print_state);
+SCM_API void scm_charprint (scm_t_wchar c, SCM port);
 SCM_API void scm_intprint (scm_t_intmax n, int radix, SCM port);
 SCM_API void scm_uintprint (scm_t_uintmax n, int radix, SCM port);
 SCM_API void scm_ipruk (char *hdr, SCM ptr, SCM port);
diff --git a/libguile/read.c b/libguile/read.c
index 5f93ea6..ab7dbf3 100644
--- a/libguile/read.c
+++ b/libguile/read.c
@@ -26,6 +26,7 @@
 #include <stdio.h>
 #include <ctype.h>
 #include <string.h>
+#include <unicase.h>
 
 #include "libguile/_scm.h"
 #include "libguile/chars.h"
@@ -186,34 +187,56 @@ static inline SCM scm_read_scsh_block_comment (int chr, 
SCM port);
 /* Read from PORT until a delimiter (e.g., a whitespace) is read.  Return
    zero if the whole token fits in BUF, non-zero otherwise.  */
 static inline int
-read_token (SCM port, char *buf, size_t buf_size, size_t *read)
+read_token (SCM port, SCM buf, size_t *read)
 {
+  scm_t_wchar chr;
   *read = 0;
 
-  while (*read < buf_size)
+  while (*read < scm_i_string_length (buf))
     {
-      scm_t_wchar chr;
-
       chr = scm_getc (port);
-      chr = (SCM_CASE_INSENSITIVE_P ? CHAR_DOWNCASE (chr) : chr);
 
       if (chr == EOF)
        return 0;
-      else if (CHAR_IS_DELIMITER (chr))
+
+      chr = (SCM_CASE_INSENSITIVE_P ? uc_tolower (chr) : chr);
+
+      if (CHAR_IS_DELIMITER (chr))
        {
          scm_ungetc (chr, port);
          return 0;
        }
-      else
-       {
-         *buf = (char) chr;
-         buf++, (*read)++;
-       }
+
+      scm_i_string_set_from_wchar (buf, *read, chr);
+      (*read)++;
     }
 
   return 1;
 }
 
+static SCM
+read_complete_token (SCM port, size_t *read)
+{
+  SCM buffer, str = SCM_EOL;
+  size_t len;
+  int overflow;
+
+  buffer = scm_i_make_string (READER_BUFFER_SIZE, NULL); 
+  overflow = read_token (port, buffer, read);
+  if (!overflow)
+    return scm_i_substring (buffer, 0, *read);
+
+  str = scm_string_copy (buffer);
+  do
+    {
+      overflow = read_token (port, buffer, &len);
+      str = scm_string_append (scm_list_2 (str, buffer));
+      *read += len;
+    }
+  while (overflow);
+
+  return scm_i_substring (str, 0, *read);
+}
 
 /* Skip whitespace from PORT and return the first non-whitespace character
    read.  Raise an error on end-of-file.  */
@@ -286,7 +309,7 @@ static SCM recsexpr (SCM obj, long line, int column, SCM 
filename);
 
 
 static SCM
-scm_read_sexp (int chr, SCM port)
+scm_read_sexp (scm_t_wchar chr, SCM port)
 #define FUNC_NAME "scm_i_lreadparen"
 {
   register int c;
@@ -381,10 +404,11 @@ scm_read_string (int chr, SCM port)
      object (the string returned).  */
 
   SCM str = SCM_BOOL_F;
-  char c_str[READER_STRING_BUFFER_SIZE];
   unsigned c_str_len = 0;
   scm_t_wchar c;
 
+  str = scm_i_make_string (READER_STRING_BUFFER_SIZE, NULL);  
+
   while ('"' != (c = scm_getc (port)))
     {
       if (c == EOF)
@@ -392,18 +416,11 @@ scm_read_string (int chr, SCM port)
                                    "end of file in string constant",
                                    SCM_EOL);
 
-      if (c_str_len + 1 >= sizeof (c_str))
+      if (c_str_len + 1 >= scm_i_string_length (str))
        {
-         /* Flush the C buffer onto a Scheme string.  */
-         SCM addy;
-
-         if (str == SCM_BOOL_F)
-           str = scm_c_make_string (0, SCM_MAKE_CHAR ('X'));
-
-         addy = scm_from_locale_stringn (c_str, c_str_len);
-         str = scm_string_append_shared (scm_list_2 (str, addy));
-
-         c_str_len = 0;
+         SCM addy = scm_i_make_string (READER_STRING_BUFFER_SIZE, NULL);
+         
+         str = scm_string_append (scm_list_2 (str, addy));
        }
 
       if (c == '\\')
@@ -462,68 +479,74 @@ scm_read_string (int chr, SCM port)
              c = a * 16 + b;
              break;
            }
+         case 'u':
+           {
+             scm_t_wchar a;
+             int i;
+             c = 0;
+             for (i = 0; i < 4; i ++)
+               {
+                 a = scm_getc (port);
+                 if (a == EOF)
+                   goto str_eof;
+                 if      ('0' <= a && a <= '9') a -= '0';
+                 else if ('A' <= a && a <= 'F') a = a - 'A' + 10;
+                 else if ('a' <= a && a <= 'f') a = a - 'a' + 10;
+                 else goto bad_escaped;
+                 c = c * 16 + a;
+               }
+             break;
+           }
+         case 'U':
+           {
+             scm_t_wchar a;
+             int i;
+             c = 0;
+             for (i = 0; i < 6; i ++)
+               {
+                 a = scm_getc (port);
+                 if (a == EOF)
+                   goto str_eof;
+                 if      ('0' <= a && a <= '9') a -= '0';
+                 else if ('A' <= a && a <= 'F') a = a - 'A' + 10;
+                 else if ('a' <= a && a <= 'f') a = a - 'a' + 10;
+                 else goto bad_escaped;
+                 c = c * 16 + a;
+               }
+             break;
+           }
          default:
          bad_escaped:
            scm_i_input_error (FUNC_NAME, port,
                               "illegal character in escape sequence: ~S",
                               scm_list_1 (SCM_MAKE_CHAR (c)));
          }
-      c_str[c_str_len++] = c;
+      scm_i_string_set (str, c_str_len++, SCM_MAKE_CHAR (c));
     }
 
   if (c_str_len > 0)
     {
-      SCM addy;
-
-      addy = scm_from_locale_stringn (c_str, c_str_len);
-      if (str == SCM_BOOL_F)
-       str = addy;
-      else
-       str = scm_string_append_shared (scm_list_2 (str, addy));
+      return scm_i_substring_copy (str, 0, c_str_len);
     }
-  else
-    str = (str == SCM_BOOL_F) ? scm_nullstr : str;
-
-  return str;
+  
+  return scm_nullstr;
 }
 #undef FUNC_NAME
 
 
 static SCM
-scm_read_number (int chr, SCM port)
+scm_read_number (scm_t_wchar chr, SCM port)
 {
-  SCM result, str = SCM_EOL;
-  char buffer[READER_BUFFER_SIZE];
+  SCM result;
+  SCM buffer;
   size_t read;
-  int overflow = 0;
 
   scm_ungetc (chr, port);
-  do
-    {
-      overflow = read_token (port, buffer, sizeof (buffer), &read);
-
-      if ((overflow) || (scm_is_pair (str)))
-       str = scm_cons (scm_from_locale_stringn (buffer, read), str);
-    }
-  while (overflow);
-
-  if (scm_is_pair (str))
-    {
-      /* The slow path.  */
-
-      str = scm_string_concatenate (scm_reverse_x (str, SCM_EOL));
-      result = scm_string_to_number (str, SCM_UNDEFINED);
-      if (!scm_is_true (result))
-       /* Return a symbol instead of a number.  */
-       result = scm_string_to_symbol (str);
-    }
-  else
-    {
-      result = scm_c_locale_stringn_to_number (buffer, read, 10);
-      if (!scm_is_true (result))
-       /* Return a symbol instead of a number.  */
-       result = scm_from_locale_symboln (buffer, read);
-    }
+  buffer = read_complete_token (port, &read);
+  result = scm_string_to_number (buffer, SCM_UNDEFINED);
+  if (!scm_is_true (result))
+    /* Return a symbol instead of a number.  */
+    result = scm_string_to_symbol (buffer);
 
   return result;
 }
@@ -531,57 +554,33 @@ scm_read_number (int chr, SCM port)
 static SCM
 scm_read_mixed_case_symbol (int chr, SCM port)
 {
-  SCM result, str = SCM_EOL;
-  int overflow = 0, ends_with_colon = 0;
-  char buffer[READER_BUFFER_SIZE];
+  SCM result;
+  int ends_with_colon = 0;
+  SCM buffer;
   size_t read = 0;
   int postfix = scm_is_eq (SCM_PACK (SCM_KEYWORD_STYLE), scm_keyword_postfix);
 
   scm_ungetc (chr, port);
-  do
-    {
-      overflow = read_token (port, buffer, sizeof (buffer), &read);
-
-      if (read > 0)
-       ends_with_colon = (buffer[read - 1] == ':');
-
-      if ((overflow) || (scm_is_pair (str)))
-       str = scm_cons (scm_from_locale_stringn (buffer, read), str);
-    }
-  while (overflow);
+  buffer = read_complete_token (port, &read);
+  if (read > 0)
+    ends_with_colon = scm_i_string_ref_eq_char (buffer, read - 1, ':');
 
-  if (scm_is_pair (str))
-    {
-      str = scm_string_concatenate (scm_reverse_x (str, SCM_EOL));
-      result = scm_string_to_symbol (str);
-
-      /* Per SRFI-88, `:' alone is an identifier, not a keyword.  */
-      if (postfix && ends_with_colon && (scm_c_string_length (result) > 1))
-       result = scm_symbol_to_keyword (result);
-    }
+  if (postfix && ends_with_colon && (read > 1))
+    result = scm_symbol_to_keyword (scm_string_to_symbol (scm_i_substring 
(buffer, 0, read - 1)));
   else
-    {
-      /* For symbols smaller than `sizeof (buffer)', we don't need to recur
-        to Scheme strings.  Therefore, we only create one Scheme object (a
-        symbol) per symbol read.  */
-      if (postfix && ends_with_colon && (read > 1))
-       result = scm_from_locale_keywordn (buffer, read - 1);
-      else
-       result = scm_from_locale_symboln (buffer, read);
-    }
+    result = scm_string_to_symbol (buffer);
 
   return result;
 }
 
 static SCM
-scm_read_number_and_radix (int chr, SCM port)
+scm_read_number_and_radix (scm_t_wchar chr, SCM port)
 #define FUNC_NAME "scm_lreadr"
 {
-  SCM result, str = SCM_EOL;
+  SCM result;
   size_t read;
-  char buffer[READER_BUFFER_SIZE];
+  SCM buffer = scm_i_make_string (READER_BUFFER_SIZE, NULL);
   unsigned int radix;
-  int overflow = 0;
 
   switch (chr)
     {
@@ -611,22 +610,8 @@ scm_read_number_and_radix (int chr, SCM port)
       radix = 10;
     }
 
-  do
-    {
-      overflow = read_token (port, buffer, sizeof (buffer), &read);
-
-      if ((overflow) || (scm_is_pair (str)))
-       str = scm_cons (scm_from_locale_stringn (buffer, read), str);
-    }
-  while (overflow);
-
-  if (scm_is_pair (str))
-    {
-      str = scm_string_concatenate (scm_reverse_x (str, SCM_EOL));
-      result = scm_string_to_number (str, scm_from_uint (radix));
-    }
-  else
-    result = scm_c_locale_stringn_to_number (buffer, read, radix);
+  buffer = read_complete_token (port, &read);
+  result = scm_string_to_number (buffer, scm_from_uint (radix));
 
   if (scm_is_true (result))
     return result;
@@ -694,9 +679,12 @@ scm_read_quote (int chr, SCM port)
 static inline SCM
 scm_read_semicolon_comment (int chr, SCM port)
 {
-  scm_t_wchar c;
+  int c;
 
-  for (c = scm_getc (port);
+  /* We use the raw getc here because there is no need to get the
+     locale correct with comment input. This presumes that newline
+     always represents itself no matter what the encoding is.  */
+  for (c = scm_raw_getc (port);
        (c != EOF) && (c != '\n');
        c = scm_getc (port));
 
@@ -728,10 +716,15 @@ scm_read_character (scm_t_wchar chr, SCM port)
 #define FUNC_NAME "scm_lreadr"
 {
   SCM ch;
-  char charname[READER_CHAR_NAME_MAX_SIZE];
+  SCM charname = scm_i_make_string (READER_CHAR_NAME_MAX_SIZE, NULL);
   size_t charname_len;
+  scm_t_wchar cp;
+  int overflow;
 
-  if (read_token (port, charname, sizeof (charname), &charname_len))
+  overflow = read_token (port, charname, &charname_len);
+  charname = scm_c_substring (charname, 0, charname_len);
+
+  if (overflow)
     goto char_error;
 
   if (charname_len == 0)
@@ -746,28 +739,32 @@ scm_read_character (scm_t_wchar chr, SCM port)
     }
 
   if (charname_len == 1)
-    return SCM_MAKE_CHAR (charname[0]);
+    return scm_i_string_ref (charname, 0);
 
-  if (*charname >= '0' && *charname < '8')
+  cp = scm_i_string_ref_to_wchar (charname, 0);
+  if (cp >= '0' && cp < '8')
     {
       /* Dirk:FIXME::  This type of character syntax is not R5RS
        * compliant.  Further, it should be verified that the constant
        * does only consist of octal digits.  Finally, it should be
        * checked whether the resulting fixnum is in the range of
        * characters.  */
-      SCM p = scm_c_locale_stringn_to_number (charname, charname_len, 8);
+      SCM p = scm_string_to_number (charname, scm_from_uint (8));
       if (SCM_I_INUMP (p))
        return SCM_MAKE_CHAR (SCM_I_INUM (p));
     }
 
-  ch = scm_i_charname_to_char (charname, charname_len);
+  /* The names of characters should never have non-Latin1
+     characters.  */
+  if (scm_i_is_narrow_string (charname))
+    ch = scm_i_charname_to_char (scm_i_string_chars (charname), 
+                                charname_len);
   if (scm_is_true (ch))
     return ch;
 
  char_error:
   scm_i_input_error (FUNC_NAME, port, "unknown character name ~a",
-                    scm_list_1 (scm_from_locale_stringn (charname,
-                                                         charname_len)));
+                    scm_list_1 (charname));
 
   return SCM_UNSPECIFIED;
 }
@@ -833,9 +830,13 @@ scm_read_scsh_block_comment (scm_t_wchar chr, SCM port)
 {
   int bang_seen = 0;
 
+  /* We can use the raw getc here because there is no need to get the
+     locale correct when reading comments. This presumes that 
+     hash and exclamation points always represent themselves no
+     matter what the source encoding is.*/
   for (;;)
     {
-      int c = scm_getc (port);
+      int c = scm_raw_getc (port);
 
       if (c == EOF)
        scm_i_input_error ("skip_block_comment", port,
@@ -860,12 +861,9 @@ scm_read_extended_symbol (scm_t_wchar chr, SCM port)
        #{This is all a symbol name}#
 
      So here, CHR is expected to be `{'.  */
-  SCM result;
   int saw_brace = 0, finished = 0;
   size_t len = 0;
-  char buf[1024];
-
-  result = scm_c_make_string (0, SCM_MAKE_CHAR ('X'));
+  SCM buf = scm_i_make_string (1024, NULL);
 
   while ((chr = scm_getc (port)) != EOF)
     {
@@ -879,19 +877,19 @@ scm_read_extended_symbol (scm_t_wchar chr, SCM port)
          else
            {
              saw_brace = 0;
-             buf[len++] = '}';
-             buf[len++] = chr;
+             scm_i_string_set_from_wchar (buf, len++, '}');
+             scm_i_string_set_from_wchar (buf, len++, chr);
            }
        }
       else if (chr == '}')
        saw_brace = 1;
       else
-       buf[len++] = chr;
+       scm_i_string_set_from_wchar (buf, len++, chr);
 
-      if (len >= sizeof (buf) - 2)
+      if (len >= scm_i_string_length (buf) - 2)
        {
-         scm_string_append (scm_list_2 (result,
-                                        scm_from_locale_stringn (buf, len)));
+         SCM addy = scm_i_make_string (1024, NULL);
+         buf = scm_string_append (scm_list_2 (buf, addy));
          len = 0;
        }
 
@@ -899,12 +897,7 @@ scm_read_extended_symbol (scm_t_wchar chr, SCM port)
        break;
     }
 
-  if (len)
-    result = scm_string_append (scm_list_2
-                               (result,
-                                scm_from_locale_stringn (buf, len)));
-
-  return (scm_string_to_symbol (result));
+  return (scm_string_to_symbol (scm_c_substring (buf, 0, len)));
 }
 
 
diff --git a/libguile/strings.c b/libguile/strings.c
index 91ed0d7..3ca2902 100644
--- a/libguile/strings.c
+++ b/libguile/strings.c
@@ -22,6 +22,7 @@
 #endif
 
 #include <string.h>
+#include <ctype.h>
 #include <stdio.h>
 #include <uniconv.h>
 #include <unistr.h>
@@ -137,7 +138,7 @@ make_wide_stringbuf (size_t len)
 {
   scm_t_wchar *mem;
   
-#if SCM_DEBUG_STRINGS
+#if SCM_DEBUG
   if (len < 1000)
     lenhist[len]++;
   else
@@ -259,7 +260,7 @@ scm_i_pthread_mutex_t stringbuf_write_mutex = 
SCM_I_PTHREAD_MUTEX_INITIALIZER;
 enum iconv_ilseq_handler scm_conversion_strategy;
 
 /* This is the current locale encoding.  */
-const char *scm_i_port_encoding;
+char *scm_i_port_encoding;
 
 SCM
 scm_i_make_string (size_t len, char **charsp)
@@ -458,11 +459,6 @@ scm_i_string_wide_chars (SCM str)
   SCM buf;
   size_t start;
 
-#if SCM_DEBUG_STRINGS
-  if (!IS_STRING (str)) 
-    scm_wrong_type_arg ("scm_i_string_wide_chars", 0, str);
-#endif
-
   get_str_buf_start (&str, &buf, &start);
   if (!scm_i_is_narrow_string (str))
     return STRINGBUF_WIDE_CHARS (buf) + start;
@@ -520,11 +516,6 @@ scm_i_string_writable_wide_chars (SCM orig_str)
   SCM buf, str = orig_str;
   size_t start;
 
-#if SCM_DEBUG_STRINGS
-  if (!IS_STRING (orig_str)) 
-    scm_wrong_type_arg ("scm_i_string_writable_wide_chars", 0, orig_str);
-#endif
-
   get_str_buf_start (&str, &buf, &start);
   if (IS_RO_STRING (str))
     scm_misc_error (NULL, "string is read-only: ~s", scm_list_1 (orig_str));
@@ -613,11 +604,6 @@ scm_i_string_contains_char (SCM str, char ch)
   size_t i;
   size_t len = scm_i_string_length (str);
 
-#if SCM_DEBUG_STRINGS
-  if (!IS_STRING (str)) 
-    scm_wrong_type_arg ("scm_i_symbol_ref_eq_char", 0, str);
-#endif
-
   i = 0;
   if (scm_i_is_narrow_string (str))
     {
@@ -848,13 +834,19 @@ scm_i_symbol_substring (SCM sym, size_t start, size_t end)
 SCM
 scm_i_symbol_ref (SCM sym, size_t x)
 {
-  return SCM_MAKE_CHAR (scm_i_symbol_chars (sym)[x]);
+  if (scm_i_is_narrow_symbol (sym))
+    return SCM_MAKE_CHAR (scm_i_symbol_chars (sym)[x]);
+  else
+    return SCM_MAKE_CHAR (scm_i_symbol_wide_chars (sym)[x]);
 }
 
 scm_t_wchar
 scm_i_symbol_ref_to_wchar (SCM sym, size_t x)
 {
-  return (scm_t_wchar) (unsigned char) (scm_i_symbol_chars (sym)[x]);
+  if (scm_i_is_narrow_symbol (sym))
+    return (scm_t_wchar) (unsigned char) (scm_i_symbol_chars (sym)[x]);
+  else
+    return scm_i_symbol_wide_chars (sym)[x];
 }
 
 
@@ -863,19 +855,46 @@ scm_i_symbol_ref_to_wchar (SCM sym, size_t x)
 char
 scm_i_symbol_ref_to_char (SCM sym, size_t x, char undef)
 {
-  return (scm_i_symbol_chars (sym)[x]);
+  scm_t_wchar ch;
+  if (scm_i_is_narrow_symbol (sym))
+    {
+      return scm_i_symbol_chars (sym)[x];
+    }
+  else
+    {
+      ch = scm_i_symbol_wide_chars (sym)[x];
+      if (ch < 0xFF)
+       return (char) (unsigned char) ch;
+      else
+       return undef;
+    }
 }
 
 int
 scm_i_symbol_ref_eq_char (SCM sym, size_t x, char c)
 {
-  return (c == scm_i_symbol_chars (sym)[x]);
+  if (scm_i_is_narrow_symbol (sym))
+    return (c == scm_i_symbol_chars (sym)[x]);
+  else
+    return ((unsigned char) c == scm_i_symbol_wide_chars (sym)[x]);
 }
 
 int
 scm_i_symbol_strcmp (SCM sym, char *str)
 {
-  return strcmp (scm_i_symbol_chars (sym), str);
+  int i;
+  if (scm_i_is_narrow_symbol (sym))
+    return strcmp (scm_i_symbol_chars (sym), str);
+  else
+    {
+      if (strlen(str) != scm_i_symbol_length (sym))
+       return 1;
+      for (i = 0; i < scm_i_symbol_length (sym); i++)
+       if ((scm_t_uint32) scm_i_symbol_wide_chars (sym)[i] 
+           != (unsigned char) str[i])
+         return 1;
+      return 0;
+    }
 }
 
 /* Debugging
@@ -982,11 +1001,13 @@ SCM_DEFINE (scm_set_conversion_error_behavior_x, 
"set-conversion-error-behavior!
            (SCM sym),
            "Sets the behavior of the interpreter when outputting a character\n"
            "that is not representable in the current locale.\n"
-           "@var{sym} can be either @code{'error}, @code{'questionmark}, or\n"
+           "@var{sym} can be either @code{'error}, @code{'substitute}, or\n"
            "@code{'escape}.  If it is @code{'error}, an error will be thrown\n"
            "when an unconvertible character is encountered.  If it is\n"
-           "@code{'questionmark}, then unconvertiable characters will \n"
-           "appear as question marks on output.  If it is @code{'escape},\n"
+           "@code{'substitute}, then unconvertible characters will \n"
+           "be replaced with approximate characters, or with question marks\n"
+           "if no approximately correct character is available.\n"
+           "If it is @code{'escape},\n"
            "it will appear as a hex escape when output.\n")
 #define FUNC_NAME s_scm_set_conversion_error_behavior_x
 {
@@ -998,16 +1019,16 @@ SCM_DEFINE (scm_set_conversion_error_behavior_x, 
"set-conversion-error-behavior!
   if (first)
     {
       err = scm_from_locale_symbol ("error");
-      qm = scm_from_locale_symbol ("questionmark");
-      esc = scm_from_locale_symbol ("error");
+      qm = scm_from_locale_symbol ("substitute");
+      esc = scm_from_locale_symbol ("escape");
       first = 0;
     }
 
-  if (scm_eq_p (sym, err))
+  if (scm_is_true (scm_eq_p (sym, err)))
     scm_conversion_strategy = iconveh_error;
-  else if (scm_eq_p (sym, qm))
+  else if (scm_is_true (scm_eq_p (sym, qm)))
     scm_conversion_strategy = iconveh_question_mark;
-  else if (scm_eq_p (sym, esc))
+  else if (scm_is_true (scm_eq_p (sym, esc)))
     scm_conversion_strategy = iconveh_escape_sequence;
   else
     SCM_MISC_ERROR ("unknown conversion behavior ~s", scm_list_1 (sym));
@@ -1056,6 +1077,7 @@ SCM_DEFINE (scm_string, "string", 0, 0, 1,
       p++;
       chrs = SCM_CDR (chrs);
       len--;
+      scm_remember_upto_here_1 (elt);
     }
   if (len > 0)
     scm_misc_error (NULL, "list changed while constructing string", SCM_EOL);
@@ -1352,8 +1374,7 @@ scm_from_locale_stringn (const char *str, size_t len)
   if (len == 0)
     return scm_nullstr;
 
-  if (scm_i_port_encoding == NULL
-      || strcmp(scm_i_port_encoding, "ISO-8859-1") == 0)
+  if (strcmp (SCM_PORT_ENCODING, "ISO-8859-1") == 0)
     {
       res = scm_i_make_string (len, &dst);
       memcpy (dst, str, len);
@@ -1361,7 +1382,7 @@ scm_from_locale_stringn (const char *str, size_t len)
     }
 
   u32len = 0;
-  u32 = (scm_t_wchar *) u32_conv_from_encoding (scm_i_port_encoding,
+  u32 = (scm_t_wchar *) u32_conv_from_encoding (SCM_PORT_ENCODING,
                                                scm_conversion_strategy,
                                                str, len,
                                                NULL,
@@ -1381,7 +1402,7 @@ scm_from_locale_stringn (const char *str, size_t len)
          escaped_str = scm_i_make_string (len, &dst);
          memcpy (dst, str, len);
          scm_misc_error (NULL, "input locale conversion error from ~s: ~s",
-                         scm_list_2 (scm_from_locale_string 
(scm_i_port_encoding),
+                         scm_list_2 (scm_from_locale_string 
(SCM_PORT_ENCODING),
                                      escaped_str));
        }
     }
@@ -1453,33 +1474,109 @@ scm_take_locale_string (char *str)
   return scm_take_locale_stringn (str, -1);
 }
 
+static void
+unistring_escapes_to_guile_escapes (char **bufp, size_t *lenp)
+{
+  char *before, *after;
+  size_t i, j;
+
+  before = *bufp;
+  after = *bufp;
+  i = 0;
+  j = 0;
+  while (i < *lenp)
+    {
+      if ((i <= *lenp - 6)
+         && before[i] == '\\'
+         && before[i+1] == 'u'
+         && before[i+2] == '0'
+         && before[i+3] == '0')
+       {
+         /* Convert \u00NN to \xNN */
+         after[j] = '\\';
+         after[j+1] = 'x';
+         after[j+2] = tolower(before[i+4]);
+         after[j+3] = tolower(before[i+5]);
+         i += 6;
+         j += 4;
+       }
+      else if ((i <= *lenp - 10)
+              && before[i] == '\\'
+              && before[i+1] == 'U'
+              && before[i+2] == '0'
+              && before[i+3] == '0')
+       {
+         /* Convert \U00NNNNNN to \UNNNNNN */
+         after[j] = '\\';
+         after[j+1] = 'U';
+         after[j+2] = tolower(before[i+4]);
+         after[j+3] = tolower(before[i+5]);
+         after[j+4] = tolower(before[i+6]);
+         after[j+5] = tolower(before[i+7]);
+         after[j+6] = tolower(before[i+8]);
+         after[j+7] = tolower(before[i+9]);
+         i += 10;
+         j += 8;
+       }
+      else
+       {
+         after[j] = before[i];
+         i ++;
+         j ++;
+       }
+    }
+  *lenp = j;
+  after = scm_realloc (after, j);
+}
+
 char *
 scm_to_locale_stringn (SCM str, size_t *lenp)
 {
   char *buf;
-  size_t len, i;
+  size_t strlen, len, i;
   int ret;
 
   if (!scm_is_string (str))
     scm_wrong_type_arg_msg (NULL, 0, str, "string");
 
+  strlen = scm_i_string_length (str);
+
   if (lenp == NULL)
-    for (i = 0; i < scm_i_string_length (str); i++)
+    for (i = 0; i < strlen; i++)
       if (scm_i_string_ref_eq_char (str, i, '\0'))
          scm_misc_error (NULL,
                          "string contains #\\nul character: ~S",
                          scm_list_1 (str));
-         
+  
+  if (scm_i_is_narrow_string (str) && scm_i_port_encoding == NULL)
+    {
+      if (lenp)
+       {
+         buf = scm_malloc (strlen);
+         memcpy (buf, scm_i_string_chars (str), strlen);
+         *lenp = strlen;
+         return buf;
+       }
+      else
+       {
+         buf = scm_malloc (strlen + 1);
+         strcpy (buf, scm_i_string_chars (str));
+         return buf;
+       }
+    }
 
   buf = NULL;
   len = 0;
   if (scm_i_is_narrow_string (str))
     {
-      ret = mem_iconveh (scm_i_string_chars (str), scm_i_string_length (str),
-                        "ISO-8859-1", scm_i_port_encoding,
+      ret = mem_iconveh (scm_i_string_chars (str), strlen,
+                        "ISO-8859-1", SCM_PORT_ENCODING,
                         scm_conversion_strategy, NULL,
                         &buf, &len);
 
+      if (ret == 0 && scm_conversion_strategy == iconveh_escape_sequence)
+       unistring_escapes_to_guile_escapes (&buf, &len);
+
       if (ret != 0)
        {
          /* If there is a conversion error, try to generate an escaped
@@ -1487,27 +1584,30 @@ scm_to_locale_stringn (SCM str, size_t *lenp)
          SCM escape_str;
          if (scm_conversion_strategy != iconveh_escape_sequence)
            ret = mem_iconveh (scm_i_string_chars (str), 
-                              scm_i_string_length (str),
-                              "ISO-8859-1", scm_i_port_encoding,
+                              strlen,
+                              "ISO-8859-1", SCM_PORT_ENCODING,
                               iconveh_escape_sequence, NULL,
                               &buf, &len);
          if (ret == 0)
            {
-             /* scm_unistring_escapes_to_guile_escapes (&buf, &len) */
+             unistring_escapes_to_guile_escapes (&buf, &len);
              escape_str = scm_from_locale_stringn (buf, len);
              free (buf);
-             scm_misc_error (NULL, "output conversion error: ~s", scm_list_1 
(escape_str));
+             scm_misc_error (NULL, "cannot convert to output locale ~s: ~a", 
+                             scm_list_2 (scm_from_locale_string 
(SCM_PORT_ENCODING),
+                                         escape_str));
            }
          else
-           scm_misc_error (NULL, "output conversion error", SCM_EOL);
+           scm_misc_error (NULL, "cannot convert to output locale ~s", 
+                           scm_list_1 (scm_from_locale_string 
(SCM_PORT_ENCODING)));
        }
     }
   else
     {
-      buf = u32_conv_to_encoding (scm_i_port_encoding, 
+      buf = u32_conv_to_encoding (SCM_PORT_ENCODING, 
                                  scm_conversion_strategy,
                                  (scm_t_uint32 *) scm_i_string_wide_chars 
(str), 
-                                 scm_i_string_length (str),
+                                 strlen,
                                  NULL,
                                  NULL, &len);
       if (buf == NULL)
@@ -1516,10 +1616,10 @@ scm_to_locale_stringn (SCM str, size_t *lenp)
             version of the string for the error message.  */
          SCM escape_str;
          if (scm_conversion_strategy != iconveh_escape_sequence)
-           buf = u32_conv_to_encoding (scm_i_port_encoding, 
+           buf = u32_conv_to_encoding (SCM_PORT_ENCODING, 
                                        iconveh_escape_sequence,
                                        (scm_t_uint32 *) 
scm_i_string_wide_chars (str), 
-                                       scm_i_string_length (str),
+                                       strlen,
                                        NULL,
                                        NULL, &len);
          if (buf != NULL)
@@ -1527,10 +1627,13 @@ scm_to_locale_stringn (SCM str, size_t *lenp)
              /* scm_unistring_escapes_to_guile_escapes (&buf, &len) */
              escape_str = scm_from_locale_stringn (buf, len);
              free (buf);
-             scm_misc_error (NULL, "conversion error: ~s", scm_list_1 
(escape_str));
+             scm_misc_error (NULL, "cannot convert to output locale ~s: ~s", 
+                             scm_list_2 (scm_from_locale_string 
(SCM_PORT_ENCODING),
+                                         escape_str));
            }
          else
-           scm_misc_error (NULL, "conversion error", SCM_EOL);
+           scm_misc_error (NULL, "cannot convert to output locale ~s", 
+                           scm_list_1 (scm_from_locale_string 
(SCM_PORT_ENCODING)));
        }
     }
 
@@ -1717,7 +1820,7 @@ scm_init_strings ()
   /* Since, ISO-8859-1 is an 8-bit charset with the Unicode codepoints
      0 to 255, it is effectively the same as doing no conversion on
      input or output. */
-  scm_i_port_encoding = "ISO-8859-1"; 
+  scm_i_port_encoding = NULL; 
 #include "libguile/strings.x" 
 }
 
diff --git a/libguile/strings.h b/libguile/strings.h
index 08dd89c..a2ed19d 100644
--- a/libguile/strings.h
+++ b/libguile/strings.h
@@ -76,7 +76,10 @@
      an error for for strings that are not null-terminated.
 */
 
-extern const char *scm_i_port_encoding;
+extern char *scm_i_port_encoding;
+#define SCM_PORT_ENCODING ((scm_i_port_encoding==NULL) ? "ISO-8859-1" : 
scm_i_port_encoding)
+extern enum iconv_ilseq_handler scm_conversion_strategy;
+
 
 SCM_API SCM scm_set_conversion_error_behavior_x (SCM behavior);
 SCM_API SCM scm_string_p (SCM x);
diff --git a/libguile/strports.c b/libguile/strports.c
index bc3fd70..3fa41a6 100644
--- a/libguile/strports.c
+++ b/libguile/strports.c
@@ -38,6 +38,7 @@
 #include "libguile/modules.h"
 #include "libguile/validate.h"
 #include "libguile/deprecation.h"
+#include "libguile/srfi-4.h"
 
 #include "libguile/strports.h"
 
@@ -288,42 +289,33 @@ st_truncate (SCM port, off_t length)
 }
 
 SCM 
-scm_mkstrport (SCM pos, SCM str, long modes, const char *caller)
+scm_i_mkstrport (SCM pos, const char *locale_str, size_t str_len, long modes, 
const char *caller)
 {
-  SCM z;
+  SCM z, str;
   scm_t_port *pt;
-  size_t str_len, c_pos;
-
-  SCM_ASSERT (scm_is_string (str), str, SCM_ARG1, caller);
+  size_t c_pos;
+  char *buf;
+
+  /* Because ports are inherently 8-bit, strings need to be converted
+     to a locale representation for storage.  But, since string ports
+     rely on string functionality for their memory management, we need
+     to create a new string that has the 8-bit locale representation
+     of the underlying string.  This violates the guideline that the
+     internal encoding of characters in strings is in unicode
+     codepoints. */
+  str = scm_i_make_string (str_len, &buf);
+  memcpy (buf, locale_str, str_len);
 
-  str_len = scm_i_string_length (str);
   c_pos = scm_to_unsigned_integer (pos, 0, str_len);
 
   if (!((modes & SCM_WRTNG) || (modes & SCM_RDNG)))
     scm_misc_error ("scm_mkstrport", "port must read or write", SCM_EOL);
 
-  /* XXX
- 
-     Make a new string to isolate us from changes to the original.
-     This is done so that we can rely on scm_i_string_chars to stay in
-     place even across SCM_TICKs.
-
-     Additionally, when we are going to write to the string, we make a
-     copy so that we can write to it without having to use
-     scm_i_string_writable_chars.
-  */
-
-  if (modes & SCM_WRTNG)
-    str = scm_c_substring_copy (str, 0, str_len);
-  else
-    str = scm_c_substring (str, 0, str_len);
-
   scm_i_scm_pthread_mutex_lock (&scm_i_port_table_mutex);
   z = scm_new_port_table_entry (scm_tc16_strport);
   pt = SCM_PTAB_ENTRY(z);
   SCM_SETSTREAM (z, SCM_UNPACK (str));
   SCM_SET_CELL_TYPE(z, scm_tc16_strport|modes);
-  /* see above why we can use scm_i_string_chars here. */
   pt->write_buf = pt->read_buf = (unsigned char *) scm_i_string_chars (str);
   pt->read_pos = pt->write_pos = pt->read_buf + c_pos;
   pt->write_buf_size = pt->read_buf_size = str_len;
@@ -339,22 +331,60 @@ scm_mkstrport (SCM pos, SCM str, long modes, const char 
*caller)
   return z;
 }
 
+SCM 
+scm_mkstrport (SCM pos, SCM str, long modes, const char *caller)
+{
+  SCM z;
+  size_t str_len;
+  char *buf;
+
+  SCM_ASSERT (scm_is_string (str), str, SCM_ARG1, caller);
+
+  /* Because ports are inherently 8-bit, strings need to be converted
+     to a locale representation for storage.  But, since string ports
+     rely on string functionality for their memory management, we need
+     to create a new string that has the 8-bit locale representation
+     of the underlying string.  This violates the guideline that the
+     internal encoding of characters in strings is in unicode
+     codepoints. */
+  buf = scm_to_locale_stringn (str, &str_len);
+  z = scm_i_mkstrport (pos, buf, str_len, modes, caller);
+  free (buf);
+  return z;
+}
+
 /* create a new string from a string port's buffer.  */
 SCM scm_strport_to_string (SCM port)
 {
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
   SCM str;
-  char *dst;
   
   if (pt->rw_active == SCM_PORT_WRITE)
     st_flush (port);
 
-  str = scm_i_make_string (pt->read_buf_size, &dst);
-  memcpy (dst, (char *) pt->read_buf, pt->read_buf_size);
+  str = scm_from_locale_stringn ((char *)pt->read_buf, pt->read_buf_size);
   scm_remember_upto_here_1 (port);
   return str;
 }
 
+/* Create a vector containing the locale representation of the string in the
+   port's buffer.  */
+SCM scm_strport_to_locale_u8vector (SCM port)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+  SCM vec;
+  char *buf;
+  
+  if (pt->rw_active == SCM_PORT_WRITE)
+    st_flush (port);
+
+  buf = scm_malloc (pt->read_buf_size);
+  memcpy (buf, pt->read_buf, pt->read_buf_size);
+  vec = scm_take_u8vector ((unsigned char *) buf, pt->read_buf_size);
+  scm_remember_upto_here_1 (port);
+  return vec;
+}
+
 SCM_DEFINE (scm_object_to_string, "object->string", 1, 1, 0,
            (SCM obj, SCM printer),
            "Return a Scheme string obtained by printing @var{obj}.\n"
@@ -379,6 +409,25 @@ SCM_DEFINE (scm_object_to_string, "object->string", 1, 1, 
0,
 }
 #undef FUNC_NAME
 
+SCM_DEFINE (scm_call_with_output_locale_u8vector, 
"call-with-output-locale-u8vector", 1, 0, 0, 
+           (SCM proc),
+           "Calls the one-argument procedure @var{proc} with a newly created 
output\n"
+           "port.  When the function returns, a vector containing the bytes of 
a\n"
+           "locale representation of the characters written into the port is 
returned\n")
+#define FUNC_NAME s_scm_call_with_output_locale_u8vector
+{
+  SCM p;
+
+  p = scm_mkstrport (SCM_INUM0, 
+                    scm_make_string (SCM_INUM0, SCM_UNDEFINED),
+                    SCM_OPN | SCM_WRTNG,
+                     FUNC_NAME);
+  scm_call_1 (proc, p);
+
+  return scm_get_output_locale_u8vector (p);
+}
+#undef FUNC_NAME
+
 SCM_DEFINE (scm_call_with_output_string, "call-with-output-string", 1, 0, 0, 
            (SCM proc),
            "Calls the one-argument procedure @var{proc} with a newly created 
output\n"
@@ -423,6 +472,27 @@ SCM_DEFINE (scm_open_input_string, "open-input-string", 1, 
0, 0,
 }
 #undef FUNC_NAME
 
+SCM_DEFINE (scm_open_input_locale_u8vector, "open-input-locale-u8vector", 1, 
0, 0,
+           (SCM vec),
+           "Take a u8vector containing the bytes of a string encoded in the\n"
+           "current locale and return an input port that delivers characters\n"
+           "from the string. The port can be closed by\n"
+           "@code{close-input-port}, though its storage will be reclaimed\n"
+           "by the garbage collector if it becomes inaccessible.")
+#define FUNC_NAME s_scm_open_input_string
+{
+  scm_t_array_handle hnd;
+  ssize_t inc;
+  size_t len;
+  const scm_t_uint8 *buf;
+
+  buf = scm_u8vector_elements (vec, &hnd, &len, &inc);
+  SCM p = scm_i_mkstrport(SCM_INUM0, (const char *) buf, len, SCM_OPN | 
SCM_RDNG, FUNC_NAME);
+  scm_array_handle_release (&hnd);
+  return p;
+}
+#undef FUNC_NAME
+
 SCM_DEFINE (scm_open_output_string, "open-output-string", 0, 0, 0, 
            (void),
            "Return an output port that will accumulate characters for\n"
@@ -455,11 +525,26 @@ SCM_DEFINE (scm_get_output_string, "get-output-string", 
1, 0, 0,
 #undef FUNC_NAME
 
 
+SCM_DEFINE (scm_get_output_locale_u8vector, "get-output-locale-u8vector", 1, 
0, 0, 
+           (SCM port),
+           "Given an output port created by @code{open-output-string},\n"
+           "return a u8 vector containing the characters of the string\n"
+           "encoded in the current locale.")
+#define FUNC_NAME s_scm_get_output_locale_u8vector
+{
+  SCM_VALIDATE_OPOUTSTRPORT (1, port);
+  return scm_strport_to_locale_u8vector (port);
+}
+#undef FUNC_NAME
+
+
 /* Given a null-terminated string EXPR containing a Scheme expression
    read it, and return it as an SCM value. */
 SCM
 scm_c_read_string (const char *expr)
 {
+  /* FIXME: the c string gets packed into a string, only to get
+     immediately unpacked in scm_mkstrport.  */
   SCM port = scm_mkstrport (SCM_INUM0,
                            scm_from_locale_string (expr),
                            SCM_OPN | SCM_RDNG,
diff --git a/libguile/strports.h b/libguile/strports.h
index 58ca71f..080efdb 100644
--- a/libguile/strports.h
+++ b/libguile/strports.h
@@ -43,13 +43,19 @@ SCM_API scm_t_bits scm_tc16_strport;
 
 
 SCM_API SCM scm_mkstrport (SCM pos, SCM str, long modes, const char * caller);
+SCM_INTERNAL SCM scm_i_mkstrport (SCM pos, const char *locale_str, size_t 
str_len, 
+                                 long modes, const char *caller);
 SCM_API SCM scm_strport_to_string (SCM port);
+SCM_API SCM scm_strport_to_locale_u8vector (SCM port);
 SCM_API SCM scm_object_to_string (SCM obj, SCM printer);
 SCM_API SCM scm_call_with_output_string (SCM proc);
+SCM_API SCM scm_call_with_output_locale_u8vector (SCM proc);
 SCM_API SCM scm_call_with_input_string (SCM str, SCM proc);
 SCM_API SCM scm_open_input_string (SCM str);
+SCM_API SCM scm_open_input_locale_u8vector (SCM str);
 SCM_API SCM scm_open_output_string (void);
 SCM_API SCM scm_get_output_string (SCM port);
+SCM_API SCM scm_get_output_locale_u8vector (SCM port);
 SCM_API SCM scm_c_read_string (const char *expr);
 SCM_API SCM scm_c_eval_string (const char *expr);
 SCM_API SCM scm_c_eval_string_in_module (const char *expr, SCM module);
diff --git a/module/ice-9/r4rs.scm b/module/ice-9/r4rs.scm
index de2aeb2..b6489f5 100644
--- a/module/ice-9/r4rs.scm
+++ b/module/ice-9/r4rs.scm
@@ -180,6 +180,12 @@ procedures, their behavior is implementation dependent."
   (call-with-output-string
    (lambda (p) (with-output-to-port p thunk))))
 
+(define (with-output-to-locale-u8vector thunk)
+  "Calls THUNK and returns its output as a vector containing
+the bytes of the locale representation its output."
+  (call-with-output-locale-u8vector
+   (lambda (p) (with-output-to-port p thunk))))
+
 (define (with-error-to-string thunk)
   "Calls THUNK and returns its error output as a string."
   (call-with-output-string


hooks/post-receive
-- 
GNU Guile




reply via email to

[Prev in Thread] Current Thread [Next in Thread]