guile-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Guile-commits] GNU Guile branch, master, updated. release_1-9-14-82-g23


From: Ludovic Courtès
Subject: [Guile-commits] GNU Guile branch, master, updated. release_1-9-14-82-g23aad15
Date: Sat, 22 Jan 2011 23:37:40 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU Guile".

http://git.savannah.gnu.org/cgit/guile.git/commit/?id=23aad1505d6ae8d964febda6e7550337aa5a5773

The branch, master has been updated
       via  23aad1505d6ae8d964febda6e7550337aa5a5773 (commit)
       via  31d4d02be7288d32dce95ddc45d824b725b6712a (commit)
       via  f4bc4e5934ab6f9c6ab629ac5e33174233091c2a (commit)
      from  8e43ed5d0bd035fae0ba106b245f03559cf529ec (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit 23aad1505d6ae8d964febda6e7550337aa5a5773
Author: Ludovic Courtès <address@hidden>
Date:   Sun Jan 23 00:20:07 2011 +0100

    Augment `THANKS'.

commit 31d4d02be7288d32dce95ddc45d824b725b6712a
Author: Ludovic Courtès <address@hidden>
Date:   Sun Jan 23 00:13:27 2011 +0100

    Hide the string escaping hacks.
    
    * libguile/strings.c (scm_i_unistring_escapes_to_guile_escapes): Rename
      to...
      (unistring_escapes_to_guile_escapes): ... this.  Make `static'.
      (scm_i_unistring_escapes_to_r6rs_escapes): Rename to...
      (unistring_escapes_to_r6rs_escapes): ... this.  Make `static'.
    
    * libguile/strings.h (scm_i_unistring_escapes_to_guile_escapes,
      scm_i_unistring_escapes_to_r6rs_escapes): Remove declarations.

commit f4bc4e5934ab6f9c6ab629ac5e33174233091c2a
Author: Ludovic Courtès <address@hidden>
Date:   Sun Jan 23 00:18:27 2011 +0100

    Rewrite `read-char', `display', etc. using iconv calls instead of 
libunistring.
    
    Thanks to Bruno Haible for his suggestions.  See
    <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>,
    for details.
    
    * libguile/ports.c (register_finalizer_for_port): Always register a
      finalizer for PORT.
      (finalize_port): Close ENTRY->input_cd and ENTRY->output_cd.
      (scm_new_port_table_entry): Initialize the `input_cd' and `output_cd'
      fields.
      (utf8_to_codepoint): New function.
      (get_codepoint): Rewrite to use `iconv' instead of libunistring.
      (scm_i_set_port_encoding_x): Initialize the `input_cd' and `output_cd'
      fields.
      (update_port_lf): Move upward.  Use `switch' instead of `if's.
    
    * libguile/ports.h (scm_t_port)[input_cd, output_cd]: New fields.
    
    * libguile/print.c (codepoint_to_utf8, display_string): New functions.
      (display_character): Use `display_string'.
      (write_combining_character): Likewise.
      (iprin1): Use `display_string' instead of `scm_lfwrite_str', and
      `display_character' instead of `scm_putc'.
      (write_character): Likewise.
      (write_character_escaped): New function.
    
    * test-suite/tests/encoding-escapes.test ("display output
      escapes")["Rashomon"]: Use lower-case escapes.
      ["fake escape"]: New test.

-----------------------------------------------------------------------

Summary of changes:
 THANKS                                 |    1 +
 libguile/ports.c                       |  312 ++++++++++++++----------
 libguile/ports.h                       |    6 +-
 libguile/print.c                       |  431 ++++++++++++++++++++------------
 libguile/strings.c                     |   18 +-
 libguile/strings.h                     |    4 -
 test-suite/tests/encoding-escapes.test |   19 ++-
 7 files changed, 482 insertions(+), 309 deletions(-)

diff --git a/THANKS b/THANKS
index ac36eb4..bae2b1e 100644
--- a/THANKS
+++ b/THANKS
@@ -55,6 +55,7 @@ For fixes or providing information which led to a fix:
          Raimon Grau
          Szavai Gyula
          Roland Haeder
+         Bruno Haible
           Sven Hartrumpf
           Eric Hanchrow
           Judy Hawkins
diff --git a/libguile/ports.c b/libguile/ports.c
index 5983ff2..fa82961 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -1,5 +1,6 @@
-/* Copyright (C) 1995,1996,1997,1998,1999,2000,2001, 2003, 2004, 2006, 2007, 
2008, 2009, 2010 Free Software Foundation, Inc.
- * 
+/* Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2004,
+ *   2006, 2007, 2008, 2009, 2010, 2011 Free Software Foundation, Inc.
+ *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
  * as published by the Free Software Foundation; either version 3 of
@@ -30,6 +31,7 @@
 #include <errno.h>
 #include <fcntl.h>  /* for chsize on mingw */
 #include <assert.h>
+#include <iconv.h>
 #include <uniconv.h>
 #include <unistr.h>
 #include <striconveh.h>
@@ -515,22 +517,21 @@ scm_i_pthread_mutex_t scm_i_port_table_mutex = 
SCM_I_PTHREAD_MUTEX_INITIALIZER;
 
 static void finalize_port (GC_PTR, GC_PTR);
 
-/* Register a finalizer for PORT, if needed by its port type.  */
+/* Register a finalizer for PORT.  */
 static SCM_C_INLINE_KEYWORD void
 register_finalizer_for_port (SCM port)
 {
   long port_type;
+  GC_finalization_proc prev_finalizer;
+  GC_PTR prev_finalization_data;
 
   port_type = SCM_TC2PTOBNUM (SCM_CELL_TYPE (port));
-  if (scm_ptobs[port_type].free)
-    {
-      GC_finalization_proc prev_finalizer;
-      GC_PTR prev_finalization_data;
 
-      GC_REGISTER_FINALIZER_NO_ORDER (SCM2PTR (port), finalize_port, 0,
-                                     &prev_finalizer,
-                                     &prev_finalization_data);
-    }
+  /* Register a finalizer for PORT so that its iconv CDs get freed and
+     optionally its type's `free' function gets called.  */
+  GC_REGISTER_FINALIZER_NO_ORDER (SCM2PTR (port), finalize_port, 0,
+                                 &prev_finalizer,
+                                 &prev_finalization_data);
 }
 
 /* Finalize the object (a port) pointed to by PTR.  */
@@ -550,6 +551,8 @@ finalize_port (GC_PTR ptr, GC_PTR data)
        register_finalizer_for_port (port);
       else
        {
+         scm_t_port *entry;
+
          port_type = SCM_TC2PTOBNUM (SCM_CELL_TYPE (port));
          if (port_type >= scm_numptob)
            abort ();
@@ -559,6 +562,13 @@ finalize_port (GC_PTR ptr, GC_PTR data)
               is for explicit `close-port' by user.  */
            scm_ptobs[port_type].free (port);
 
+         entry = SCM_PTAB_ENTRY (port);
+
+         if (entry->input_cd != (iconv_t) -1)
+           iconv_close (entry->input_cd);
+         if (entry->output_cd != (iconv_t) -1)
+           iconv_close (entry->output_cd);
+
          SCM_SETSTREAM (port, 0);
          SCM_CLR_PORT_OPEN_FLAG (port);
 
@@ -594,6 +604,11 @@ scm_new_port_table_entry (scm_t_bits tag)
     entry->encoding = NULL;
   else
     entry->encoding = scm_gc_strdup (enc, "port");
+
+  /* The conversion descriptors will be opened lazily.  */
+  entry->input_cd = (iconv_t) -1;
+  entry->output_cd = (iconv_t) -1;
+
   entry->ilseq_handler = scm_i_get_conversion_strategy (SCM_BOOL_F);
 
   SCM_SET_CELL_TYPE (z, tag);
@@ -1028,8 +1043,71 @@ SCM_DEFINE (scm_read_char, "read-char", 0, 1, 0,
 }
 #undef FUNC_NAME
 
+/* Update the line and column number of PORT after consumption of C.  */
+static inline void
+update_port_lf (scm_t_wchar c, SCM port)
+{
+  switch (c)
+    {
+    case '\a':
+      break;
+    case '\b':
+      SCM_DECCOL (port);
+      break;
+    case '\n':
+      SCM_INCLINE (port);
+      break;
+    case '\r':
+      SCM_ZEROCOL (port);
+      break;
+    case '\t':
+      SCM_TABCOL (port);
+      break;
+    default:
+      SCM_INCCOL (port);
+      break;
+    }
+}
+
 #define SCM_MBCHAR_BUF_SIZE (4)
 
+/* Convert the SIZE-byte UTF-8 sequence in UTF8_BUF to a codepoint.
+   UTF8_BUF is assumed to contain a valid UTF-8 sequence.  */
+static scm_t_wchar
+utf8_to_codepoint (const scm_t_uint8 *utf8_buf, size_t size)
+{
+  scm_t_wchar codepoint;
+
+  if (utf8_buf[0] <= 0x7f)
+    {
+      assert (size == 1);
+      codepoint = utf8_buf[0];
+    }
+  else if ((utf8_buf[0] & 0xe0) == 0xc0)
+    {
+      assert (size == 2);
+      codepoint = ((scm_t_wchar) utf8_buf[0] & 0x1f) << 6UL
+       | (utf8_buf[1] & 0x3f);
+    }
+  else if ((utf8_buf[0] & 0xf0) == 0xe0)
+    {
+      assert (size == 3);
+      codepoint = ((scm_t_wchar) utf8_buf[0] & 0x0f) << 12UL
+       | ((scm_t_wchar) utf8_buf[1] & 0x3f) << 6UL
+       | (utf8_buf[2] & 0x3f);
+    }
+  else
+    {
+      assert (size == 4);
+      codepoint = ((scm_t_wchar) utf8_buf[0] & 0x07) << 18UL
+       | ((scm_t_wchar) utf8_buf[1] & 0x3f) << 12UL
+       | ((scm_t_wchar) utf8_buf[2] & 0x3f) << 6UL
+       | (utf8_buf[3] & 0x3f);
+    }
+
+  return codepoint;
+}
+
 /* Read a codepoint from PORT and return it.  Fill BUF with the byte
    representation of the codepoint in PORT's encoding, and set *LEN to
    the length in bytes of that representation.  Raise an error on
@@ -1037,122 +1115,72 @@ SCM_DEFINE (scm_read_char, "read-char", 0, 1, 0,
 static scm_t_wchar
 get_codepoint (SCM port, char buf[SCM_MBCHAR_BUF_SIZE], size_t *len)
 {
-  int c;
-  size_t bufcount = 0;
-  scm_t_uint32 result_buf;
-  scm_t_wchar codepoint = 0;
-  scm_t_uint32 *u32;
-  size_t u32len;
+  int err, byte_read;
+  size_t bytes_consumed, output_size;
+  scm_t_wchar codepoint;
+  char *output;
+  scm_t_uint8 utf8_buf[SCM_MBCHAR_BUF_SIZE];
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
 
-  c = scm_get_byte_or_eof (port);
-  if (c == EOF)
-    return (scm_t_wchar) EOF;
-
-  buf[0] = c;
-  bufcount++;
+  if (SCM_UNLIKELY (pt->input_cd == (iconv_t) -1))
+    /* Initialize the conversion descriptors.  */
+    scm_i_set_port_encoding_x (port, pt->encoding);
 
-  if (pt->encoding == NULL)
+  for (output_size = 0, output = (char *) utf8_buf,
+        bytes_consumed = 0, err = 0;
+       err == 0 && output_size == 0
+        && (bytes_consumed == 0 || byte_read != EOF);
+       bytes_consumed++)
     {
-      /* The encoding is Latin-1: bytes are characters.  */
-      codepoint = (unsigned char) buf[0];
-      goto success;
-    }
+      char *input;
+      size_t input_left, output_left, done;
 
-  for (;;)
-    {
-      u32len = sizeof (result_buf) / sizeof (scm_t_uint32);
-      u32 = u32_conv_from_encoding (pt->encoding,
-                                    (enum iconv_ilseq_handler) 
pt->ilseq_handler,
-                                   buf, bufcount, NULL, &result_buf, &u32len);
-      if (u32 == NULL || u32len == 0)
+      byte_read = scm_get_byte_or_eof (port);
+      if (byte_read == EOF)
        {
-         if (errno == ENOMEM)
-           scm_memory_error ("Input decoding");
-
-         /* Otherwise errno is EILSEQ or EINVAL, so perhaps more
-             bytes are needed.  Keep looping.  */
-       }
-      else
-       {
-         /* Complete codepoint found. */
-         codepoint = u32[0];
-
-         if (SCM_UNLIKELY (u32 != &result_buf))
-           /* libunistring up to 0.9.3 (included) would always heap-allocate
-              the result even when a large-enough RESULT_BUF is supplied, see
-              
<http://lists.gnu.org/archive/html/bug-libunistring/2010-07/msg00003.html>.  */
-           free (u32);
-
-         goto success;
+         if (bytes_consumed == 0)
+           return (scm_t_wchar) EOF;
+         else
+           continue;
        }
 
-      if (bufcount == SCM_MBCHAR_BUF_SIZE)
-       {
-         /* We've read several bytes and didn't find a good
-            codepoint.  Give up.  */
-         goto failure;
-       }
+      buf[bytes_consumed] = byte_read;
 
-      c = scm_get_byte_or_eof (port);
+      input = buf;
+      input_left = bytes_consumed + 1;
+      output_left = sizeof (utf8_buf);
 
-      if (c == EOF)
+      done = iconv (pt->input_cd, &input, &input_left,
+                   &output, &output_left);
+      if (done == (size_t) -1)
        {
-         /* EOF before a complete character was read.  Push it all
-            back and return EOF. */
-         while (bufcount > 0)
-           {
-             /* FIXME: this will probably cause errors in the port column. */
-             scm_unget_byte (buf[bufcount-1], port);
-             bufcount --;
-           }
-          return EOF;
+         err = errno;
+         if (err == EINVAL)
+           /* Missing input: keep trying.  */
+           err = 0;
        }
-      
-      if (c == '\n')
-       {
-          /* It is always invalid to have EOL in the middle of a
-             multibyte character.  */
-         scm_unget_byte ('\n', port);
-         goto failure;
-       }
-       
-      buf[bufcount++] = c;
+      else
+       output_size = sizeof (utf8_buf) - output_left;
     }
 
- success:
-  switch (codepoint)
-    {
-    case '\a':
-      break;
-    case '\b':
-      SCM_DECCOL (port);
-      break;
-    case '\n':
-      SCM_INCLINE (port);
-        break;
-    case '\r':
-      SCM_ZEROCOL (port);
-      break;
-    case '\t':
-      SCM_TABCOL (port);
-      break;
-    default:
-      SCM_INCCOL (port);
-      break;
-    }
+  if (err != 0)
+    goto failure;
+
+  /* Convert the UTF8_BUF sequence to a Unicode code point.  */
+  codepoint = utf8_to_codepoint (utf8_buf, output_size);
+  update_port_lf (codepoint, port);
 
-  *len = bufcount;
+  *len = bytes_consumed;
 
   return codepoint;
 
  failure:
   {
     char *err_buf;
-    SCM err_str = scm_i_make_string (bufcount, &err_buf);
-    memcpy (err_buf, buf, bufcount);
+    SCM err_str = scm_i_make_string (bytes_consumed, &err_buf);
+    memcpy (err_buf, buf, bytes_consumed);
 
-    if (errno == EILSEQ)
+    if (err == EILSEQ)
       scm_misc_error (NULL, "input encoding error for ~s: ~s",
                      scm_list_2 (scm_from_locale_string 
(scm_i_get_port_encoding (port)),
                                  err_str));
@@ -1205,23 +1233,6 @@ scm_fill_input (SCM port)
  * This function differs from scm_c_write; it updates port line and
  * column. */
 
-static void
-update_port_lf (scm_t_wchar c, SCM port)
-{
-  if (c == '\a')
-    ;                           /* Do nothing. */
-  else if (c == '\b')
-    SCM_DECCOL (port);
-  else if (c == '\n')
-    SCM_INCLINE (port);
-  else if (c == '\r')
-    SCM_ZEROCOL (port);
-  else if (c == '\t')
-    SCM_TABCOL (port);
-  else
-    SCM_INCCOL (port);
-}
-
 void
 scm_lfwrite (const char *ptr, size_t size, SCM port)
 {
@@ -1278,6 +1289,7 @@ scm_lfwrite_substr (SCM str, size_t start, size_t end, 
SCM port)
 }
 
 /* Write a scheme string STR to PORT.  */
+/* FIXME: Get rid of it.  */
 void
 scm_lfwrite_str (SCM str, SCM port)
 {
@@ -2060,12 +2072,7 @@ scm_i_set_port_encoding_x (SCM port, const char *enc)
     {
       valid_enc = find_valid_encoding (enc);
       if (valid_enc == NULL)
-        {
-          SCM err;
-          err = scm_from_locale_string (enc);
-          scm_misc_error (NULL, "invalid or unknown character encoding ~s",
-                          scm_list_1 (err));
-        }
+       goto invalid_encoding;
     }
 
   if (scm_is_false (port))
@@ -2087,13 +2094,62 @@ scm_i_set_port_encoding_x (SCM port, const char *enc)
     }
   else
     {
+      iconv_t new_input_cd, new_output_cd;
+
+      new_input_cd = (iconv_t) -1;
+      new_output_cd = (iconv_t) -1;
+
       /* Set the character encoding for this port.  */
       pt = SCM_PTAB_ENTRY (port);
       if (valid_enc == NULL)
         pt->encoding = NULL;
       else
         pt->encoding = scm_gc_strdup (valid_enc, "port");
+
+      if (valid_enc == NULL)
+       valid_enc = "ISO-8859-1";
+
+      if (SCM_CELL_WORD_0 (port) & SCM_RDNG)
+       {
+         /* Open an input iconv conversion descriptor, from VALID_ENC
+            to UTF-8.  We choose UTF-8, not UTF-32, because iconv
+            implementations can typically convert from anything to
+            UTF-8, but not to UTF-32 (see
+            
<http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00007.html>).  */
+         new_input_cd = iconv_open ("UTF-8", valid_enc);
+         if (new_input_cd == (iconv_t) -1)
+           goto invalid_encoding;
+       }
+
+      if (SCM_CELL_WORD_0 (port) & SCM_WRTNG)
+       {
+         new_output_cd = iconv_open (valid_enc, "UTF-8");
+         if (new_output_cd == (iconv_t) -1)
+           {
+             if (new_input_cd != (iconv_t) -1)
+               iconv_close (new_input_cd);
+             goto invalid_encoding;
+           }
+       }
+
+      if (pt->input_cd != (iconv_t) -1)
+       iconv_close (pt->input_cd);
+      if (pt->output_cd != (iconv_t) -1)
+       iconv_close (pt->output_cd);
+
+      pt->input_cd = new_input_cd;
+      pt->output_cd = new_output_cd;
     }
+
+  return;
+
+ invalid_encoding:
+  {
+    SCM err;
+    err = scm_from_locale_string (enc);
+    scm_misc_error (NULL, "invalid or unknown character encoding ~s",
+                   scm_list_1 (err));
+  }
 }
 
 SCM_DEFINE (scm_port_encoding, "port-encoding", 1, 0, 0,
diff --git a/libguile/ports.h b/libguile/ports.h
index 3af3441..43cb392 100644
--- a/libguile/ports.h
+++ b/libguile/ports.h
@@ -3,7 +3,7 @@
 #ifndef SCM_PORTS_H
 #define SCM_PORTS_H
 
-/* Copyright (C) 1995,1996,1997,1998,1999,2000,2001, 2003, 2004, 2006, 2008, 
2009 Free Software Foundation, Inc.
+/* Copyright (C) 1995,1996,1997,1998,1999,2000,2001, 2003, 2004, 2006, 2008, 
2009, 2010 Free Software Foundation, Inc.
  *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
@@ -109,6 +109,10 @@ typedef struct
   /* a buffer for un-read chars and strings.  */
   unsigned char *putback_buf;
   size_t putback_buf_size;        /* allocated size of putback_buf.  */
+
+  /* input/output iconv conversion descriptors */
+  void *input_cd;
+  void *output_cd;
 } scm_t_port;
 
 
diff --git a/libguile/print.c b/libguile/print.c
index 352ca94..679327a 100644
--- a/libguile/print.c
+++ b/libguile/print.c
@@ -1,5 +1,6 @@
-/* Copyright (C) 1995-1999,2000,2001, 2002, 2003, 2004, 2006, 2008, 2009, 
2010, 2011 Free Software Foundation, Inc.
- * 
+/* Copyright (C) 1995-1999, 2000, 2001, 2002, 2003, 2004, 2006, 2008,
+ *   2009, 2010, 2011 Free Software Foundation, Inc.
+ *
  * This library is free software; you can redistribute it and/or
  * modify it under the terms of the GNU Lesser General Public License
  * as published by the Free Software Foundation; either version 3 of
@@ -23,6 +24,10 @@
 #endif
 
 #include <errno.h>
+#include <iconv.h>
+#include <stdio.h>
+#include <assert.h>
+
 #include <uniconv.h>
 #include <unictype.h>
 
@@ -56,10 +61,16 @@
 
 /* Character printers.  */
 
+static size_t display_string (const void *, int, size_t, SCM,
+                             scm_t_string_failed_conversion_handler);
+
 static int display_character (scm_t_wchar, SCM,
                              scm_t_string_failed_conversion_handler);
+
 static void write_character (scm_t_wchar, SCM, int);
 
+static void write_character_escaped (scm_t_wchar, int, SCM);
+
 
 
 /* {Names of immediate symbols}
@@ -541,16 +552,31 @@ iprin1 (SCM exp, SCM port, scm_print_state *pstate)
             {
               size_t len, i;
 
-              scm_putc ('"', port);
+              display_character ('"', port, iconveh_question_mark);
               len = scm_i_string_length (exp);
               for (i = 0; i < len; ++i)
                write_character (scm_i_string_ref (exp, i), port, 1);
 
-              scm_putc ('"', port);
+              display_character ('"', port, iconveh_question_mark);
               scm_remember_upto_here_1 (exp);
             }
           else
-            scm_lfwrite_str (exp, port);
+           {
+             size_t len, printed;
+
+             len = scm_i_string_length (exp);
+             printed = display_string (scm_i_string_data (exp),
+                                       scm_i_is_narrow_string (exp),
+                                       len, port,
+                                       scm_i_get_conversion_strategy (port));
+             if (SCM_UNLIKELY (printed < len))
+               /* FIXME: Provide the error location.  */
+               scm_encoding_error (__func__, errno,
+                                   "cannot convert to output locale",
+                                   "UTF-32", scm_i_get_port_encoding (port),
+                                   exp);
+           }
+
           scm_remember_upto_here_1 (exp);
           break;
        case scm_tc7_symbol:
@@ -740,69 +766,162 @@ scm_prin1 (SCM exp, SCM port, int writingp)
     }
 }
 
-/* Attempt to display CH to PORT according to STRATEGY.  Return non-zero
-   if CH was successfully displayed, zero otherwise (e.g., if it was not
-   representable in PORT's encoding.)  */
-static int
-display_character (scm_t_wchar ch, SCM port,
-                  scm_t_string_failed_conversion_handler strategy)
+/* Convert codepoint CH to UTF-8 and store the result in UTF8.  Return
+   the number of bytes of the UTF-8-encoded string.  */
+static size_t
+codepoint_to_utf8 (scm_t_wchar ch, scm_t_uint8 utf8[4])
 {
-  int printed;
-  const char *encoding;
+  size_t len;
+  scm_t_uint32 codepoint;
+
+  codepoint = (scm_t_uint32) ch;
 
-  encoding = scm_i_get_port_encoding (port);
-  if (encoding == NULL)
+  if (codepoint <= 0x7f)
     {
-      if (ch <= 0xff)
-       {
-         scm_putc (ch, port);
-         printed = 1;
-       }
-      else
-       printed = 0;
+      len = 1;
+      utf8[0] = (scm_t_uint8) codepoint;
+    }
+  else if (codepoint <= 0x7ffUL)
+    {
+      len = 2;
+      utf8[0] = 0xc0 | (codepoint >> 6);
+      utf8[1] = 0x80 | (codepoint & 0x3f);
+    }
+  else if (codepoint <= 0xffffUL)
+    {
+      len = 3;
+      utf8[0] = 0xe0 | (codepoint >> 12);
+      utf8[1] = 0x80 | ((codepoint >> 6) & 0x3f);
+      utf8[2] = 0x80 | (codepoint & 0x3f);
     }
   else
     {
-      size_t len;
-      char locale_encoded[8 * sizeof (ch)], *result;
-
-      len = sizeof (locale_encoded);
-      result = u32_conv_to_encoding (encoding, strategy,
-                                    (scm_t_uint32 *) &ch, 1,
-                                    NULL, locale_encoded, &len);
-      if (result != NULL)
+      len = 4;
+      utf8[0] = 0xf0 | (codepoint >> 18);
+      utf8[1] = 0x80 | ((codepoint >> 12) & 0x3f);
+      utf8[2] = 0x80 | ((codepoint >> 6) & 0x3f);
+      utf8[3] = 0x80 | (codepoint & 0x3f);
+    }
+
+  return len;
+}
+
+/* Display the LEN codepoints in STR to PORT according to STRATEGY;
+   return the number of codepoints successfully displayed.  If NARROW_P,
+   then STR is interpreted as a sequence of `char', denoting a Latin-1
+   string; otherwise it's interpreted as a sequence of
+   `scm_t_wchar'.  */
+static size_t
+display_string (const void *str, int narrow_p,
+               size_t len, SCM port,
+               scm_t_string_failed_conversion_handler strategy)
+
+{
+#define STR_REF(s, x)                          \
+  (narrow_p                                    \
+   ? (scm_t_wchar) ((unsigned char *) (s))[x]  \
+   : ((scm_t_wchar *) (s))[x])
+
+  size_t printed;
+  scm_t_port *pt;
+
+  pt = SCM_PTAB_ENTRY (port);
+
+  if (SCM_UNLIKELY (pt->output_cd == (iconv_t) -1))
+    /* Initialize the conversion descriptors.  */
+    scm_i_set_port_encoding_x (port, pt->encoding);
+
+  printed = 0;
+
+  while (len > printed)
+    {
+      size_t done, utf8_len, input_left, output_left, i;
+      size_t codepoints_read, output_len;
+      char *input, *output;
+      char utf8_buf[256], encoded_output[256];
+      size_t offsets[256];
+
+      /* Convert STR to UTF-8.  */
+      for (i = printed, utf8_len = 0, input = utf8_buf;
+          i < len && utf8_len + 4 < sizeof (utf8_buf);
+          i++)
        {
-         /* CH is graphic; print it.  */
+         offsets[utf8_len] = i;
+         utf8_len += codepoint_to_utf8 (STR_REF (str, i),
+                                        (scm_t_uint8 *) input);
+         input = utf8_buf + utf8_len;
+       }
+
+      input = utf8_buf;
+      input_left = utf8_len;
+
+      output = encoded_output;
+      output_left = sizeof (encoded_output);
+
+      done = iconv (pt->output_cd, &input, &input_left,
+                   &output, &output_left);
 
-         if (strategy == SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE)
+      output_len = sizeof (encoded_output) - output_left;
+
+      if (SCM_UNLIKELY (done == (size_t) -1))
+       {
+         /* Reset the `iconv' state.  */
+         iconv (pt->output_cd, NULL, NULL, NULL, NULL);
+
+         if (errno == EILSEQ &&
+             strategy != SCM_FAILED_CONVERSION_ERROR)
            {
-             /* Apply the same escaping syntax as in `write_character'.  */
-             if (SCM_R6RS_ESCAPES_P)
+             /* Conversion failed somewhere in INPUT and we want to
+                escape or substitute the offending input character.  */
+
+             /* Print the OUTPUT_LEN bytes successfully converted.  */
+             scm_lfwrite (encoded_output, output_len, port);
+
+             /* See how many input codepoints these OUTPUT_LEN bytes
+                corresponds to.  */
+             codepoints_read = offsets[input - utf8_buf] - printed;
+             printed += codepoints_read;
+
+             if (strategy == SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE)
                {
-                 /* LOCALE_ENCODED is large enough to store an R6RS
-                    `\xNNNN;' escape sequence.  However, libunistring
-                    up to 0.9.3 (included) always returns a
-                    heap-allocated RESULT.  */
-                 if (SCM_UNLIKELY (result != locale_encoded))
-                   result = scm_realloc (result, len * 7);
-
-                 scm_i_unistring_escapes_to_r6rs_escapes (result, &len);
+                 scm_t_wchar ch;
+
+                 /* Find CH, the offending codepoint, and escape it.  */
+                 ch = STR_REF (str, offsets[input - utf8_buf]);
+                 write_character_escaped (ch, 1, port);
                }
              else
-               scm_i_unistring_escapes_to_guile_escapes (result, &len);
-           }
+               /* STRATEGY is `SCM_FAILED_CONVERSION_QUESTION_MARK'.  */
+               display_string ("?", 1, 1, port, strategy);
 
-         scm_lfwrite (result, len, port);
-         printed = 1;
-
-         if (SCM_UNLIKELY (result != locale_encoded))
-           free (result);
+             printed++;
+           }
+         else
+           /* Something bad happened that we can't handle: bail out.  */
+           break;
        }
       else
-       printed = 0;
+       {
+         /* INPUT was successfully converted, entirely; print the
+            result.  */
+         scm_lfwrite (encoded_output, output_len, port);
+         codepoints_read = i - printed;
+         printed += codepoints_read;
+       }
     }
 
   return printed;
+#undef STR_REF
+}
+
+/* Attempt to display CH to PORT according to STRATEGY.  Return non-zero
+   if CH was successfully displayed, zero otherwise (e.g., if it was not
+   representable in PORT's encoding.)  */
+static int
+display_character (scm_t_wchar ch, SCM port,
+                  scm_t_string_failed_conversion_handler strategy)
+{
+  return display_string (&ch, 0, 1, port, strategy) == 1;
 }
 
 /* Attempt to pretty-print CH, a combining character, to PORT.  Return
@@ -811,39 +930,101 @@ display_character (scm_t_wchar ch, SCM port,
 static int
 write_combining_character (scm_t_wchar ch, SCM port)
 {
-  int printed;
-  const char *encoding;
+  scm_t_wchar str[2];
+
+  str[0] = SCM_CODEPOINT_DOTTED_CIRCLE;
+  str[1] = ch;
+
+  return display_string (str, 0, 2, port, iconveh_error) == 2;
+}
 
-  encoding = scm_i_get_port_encoding (port);
-  if (encoding != NULL)
+/* Write CH to PORT in its escaped form, using the string escape syntax
+   if STRING_ESCAPES_P is non-zero.  */
+static void
+write_character_escaped (scm_t_wchar ch, int string_escapes_p, SCM port)
+{
+  if (string_escapes_p)
     {
-      scm_t_wchar str[2];
-      char locale_encoded[sizeof (str)], *result;
-      size_t len;
-
-      str[0] = SCM_CODEPOINT_DOTTED_CIRCLE;
-      str[1] = ch;
-
-      len = sizeof (locale_encoded);
-      result = u32_conv_to_encoding (encoding, iconveh_error,
-                                    (scm_t_uint32 *) str, 2,
-                                    NULL, locale_encoded, &len);
-      if (result != NULL)
+      /* Represent CH using the in-string escape syntax.  */
+
+      static const char hex[] = "0123456789abcdef";
+      static const char escapes[7] = "abtnvfr";
+      char buf[9];
+
+      if (ch >= 0x07 && ch <= 0x0D && ch != 0x0A)
        {
-         scm_lfwrite (result, len, port);
-         printed = 1;
-         if (SCM_UNLIKELY (result != locale_encoded))
-           free (result);
+         /* Use special escapes for some C0 controls.  */
+         buf[0] = '\\';
+         buf[1] = escapes[ch - 0x07];
+         scm_lfwrite (buf, 2, port);
+       }
+      else if (!SCM_R6RS_ESCAPES_P)
+       {
+         if (ch <= 0xFF)
+           {
+             buf[0] = '\\';
+             buf[1] = 'x';
+             buf[2] = hex[ch / 16];
+             buf[3] = hex[ch % 16];
+             scm_lfwrite (buf, 4, port);
+           }
+         else if (ch <= 0xFFFF)
+           {
+             buf[0] = '\\';
+             buf[1] = 'u';
+             buf[2] = hex[(ch & 0xF000) >> 12];
+             buf[3] = hex[(ch & 0xF00) >> 8];
+             buf[4] = hex[(ch & 0xF0) >> 4];
+             buf[5] = hex[(ch & 0xF)];
+             scm_lfwrite (buf, 6, port);
+           }
+         else if (ch > 0xFFFF)
+           {
+             buf[0] = '\\';
+             buf[1] = 'U';
+             buf[2] = hex[(ch & 0xF00000) >> 20];
+             buf[3] = hex[(ch & 0xF0000) >> 16];
+             buf[4] = hex[(ch & 0xF000) >> 12];
+             buf[5] = hex[(ch & 0xF00) >> 8];
+             buf[6] = hex[(ch & 0xF0) >> 4];
+             buf[7] = hex[(ch & 0xF)];
+             scm_lfwrite (buf, 8, port);
+           }
        }
       else
-       /* Can't write the result to PORT.  */
-       printed = 0;
+       {
+         /* Print an R6RS variable-length hex escape: "\xNNNN;".  */
+         scm_t_wchar ch2 = ch;
+
+         int i = 8;
+         buf[i] = ';';
+         i --;
+         if (ch == 0)
+           buf[i--] = '0';
+         else
+           while (ch2 > 0)
+             {
+               buf[i] = hex[ch2 & 0xF];
+               ch2 >>= 4;
+               i --;
+             }
+         buf[i] = 'x';
+         i --;
+         buf[i] = '\\';
+         scm_lfwrite (buf + i, 9 - i, port);
+       }
     }
   else
-    /* PORT is Latin-1-encoded and can't display the fancy things.  */
-    printed = 0;
+    {
+      /* Represent CH using the character escape syntax.  */
+      const char *name;
 
-  return printed;
+      name = scm_i_charname (SCM_MAKE_CHAR (ch));
+      if (name != NULL)
+       scm_puts (name, port);
+      else
+       PRINT_CHAR_ESCAPE (ch, port);
+    }
 }
 
 /* Write CH to PORT, escaping it if it's non-graphic or not
@@ -854,25 +1035,28 @@ static void
 write_character (scm_t_wchar ch, SCM port, int string_escapes_p)
 {
   int printed = 0;
+  scm_t_string_failed_conversion_handler strategy;
+
+  strategy = scm_i_get_conversion_strategy (port);
 
   if (string_escapes_p)
     {
       /* Check if CH deserves special treatment.  */
       if (ch == '"' || ch == '\\')
        {
-         scm_putc ('\\', port);
-         scm_putc (ch, port);
+         display_character ('\\', port, iconveh_question_mark);
+         display_character (ch, port, strategy);
          printed = 1;
        }
       else if (ch == ' ' || ch == '\n')
        {
-         scm_putc (ch, port);
+         display_character (ch, port, strategy);
          printed = 1;
        }
     }
   else
     {
-      scm_puts ("#\\", port);
+      display_string ("#\\", 1, 2, port, iconveh_question_mark);
 
       if (uc_combining_class (ch) != UC_CCC_NR)
        /* Character is a combining character, so attempt to
@@ -891,93 +1075,8 @@ write_character (scm_t_wchar ch, SCM port, int 
string_escapes_p)
     printed = display_character (ch, port, iconveh_error);
 
   if (!printed)
-    {
-      /* CH isn't graphic or cannot be represented in PORT's
-        encoding.  */
-
-      if (string_escapes_p)
-       {
-         /* Represent CH using the in-string escape syntax.  */
-
-         static const char hex[] = "0123456789abcdef";
-          static const char escapes[7] = "abtnvfr";
-         char buf[9];
-
-          if (ch >= 0x07 && ch <= 0x0D && ch != 0x0A)
-            {
-              /* Use special escapes for some C0 controls.  */
-              buf[0] = '\\';
-              buf[1] = escapes[ch - 0x07];
-              scm_lfwrite (buf, 2, port);
-            }
-          else if (!SCM_R6RS_ESCAPES_P)
-           {
-             if (ch <= 0xFF)
-               {
-                 buf[0] = '\\';
-                 buf[1] = 'x';
-                 buf[2] = hex[ch / 16];
-                 buf[3] = hex[ch % 16];
-                 scm_lfwrite (buf, 4, port);
-               }
-             else if (ch <= 0xFFFF)
-               {
-                 buf[0] = '\\';
-                 buf[1] = 'u';
-                 buf[2] = hex[(ch & 0xF000) >> 12];
-                 buf[3] = hex[(ch & 0xF00) >> 8];
-                 buf[4] = hex[(ch & 0xF0) >> 4];
-                 buf[5] = hex[(ch & 0xF)];
-                 scm_lfwrite (buf, 6, port);
-               }
-             else if (ch > 0xFFFF)
-               {
-                 buf[0] = '\\';
-                 buf[1] = 'U';
-                 buf[2] = hex[(ch & 0xF00000) >> 20];
-                 buf[3] = hex[(ch & 0xF0000) >> 16];
-                 buf[4] = hex[(ch & 0xF000) >> 12];
-                 buf[5] = hex[(ch & 0xF00) >> 8];
-                 buf[6] = hex[(ch & 0xF0) >> 4];
-                 buf[7] = hex[(ch & 0xF)];
-                 scm_lfwrite (buf, 8, port);
-               }
-           }
-         else
-           {
-             /* Print an R6RS variable-length hex escape: "\xNNNN;".  */
-             scm_t_wchar ch2 = ch;
-
-             int i = 8;
-             buf[i] = ';';
-             i --;
-             if (ch == 0)
-               buf[i--] = '0';
-             else
-               while (ch2 > 0)
-                 {
-                   buf[i] = hex[ch2 & 0xF];
-                   ch2 >>= 4;
-                   i --;
-                 }
-             buf[i] = 'x';
-             i --;
-             buf[i] = '\\';
-             scm_lfwrite (buf + i, 9 - i, port);
-           }
-       }
-      else
-       {
-         /* Represent CH using the character escape syntax.  */
-         const char *name;
-
-         name = scm_i_charname (SCM_MAKE_CHAR (ch));
-         if (name != NULL)
-           scm_puts (name, port);
-         else
-           PRINT_CHAR_ESCAPE (ch, port);
-       }
-    }
+    /* CH isn't graphic or cannot be represented in PORT's encoding.  */
+    write_character_escaped (ch, string_escapes_p, port);
 }
 
 /* Print an integer.
diff --git a/libguile/strings.c b/libguile/strings.c
index abe4a7b..93dd6a7 100644
--- a/libguile/strings.c
+++ b/libguile/strings.c
@@ -1584,9 +1584,13 @@ scm_take_locale_string (char *str)
 
 /* Change libunistring escapes (`\uXXXX' and `\UXXXXXXXX') in BUF, a
    *LENP-byte locale-encoded string, to `\xXX', `\uXXXX', or `\UXXXXXX'.
-   Set *LENP to the size of the resulting string.  */
-void
-scm_i_unistring_escapes_to_guile_escapes (char *buf, size_t *lenp)
+   Set *LENP to the size of the resulting string.
+
+   FIXME: This is a hack we should get rid of.  See
+   <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html>
+   for details.  */
+static void
+unistring_escapes_to_guile_escapes (char *buf, size_t *lenp)
 {
   char *before, *after;
   size_t i, j;
@@ -1642,8 +1646,8 @@ scm_i_unistring_escapes_to_guile_escapes (char *buf, 
size_t *lenp)
    of the resulting string.  BUF must be large enough to handle the
    worst case when `\uXXXX' escapes (6 characters) are replaced by
    `\xXXXX;' (7 characters).  */
-void
-scm_i_unistring_escapes_to_r6rs_escapes (char *buf, size_t *lenp)
+static void
+unistring_escapes_to_r6rs_escapes (char *buf, size_t *lenp)
 {
   char *before, *after;
   size_t i, j;
@@ -1857,10 +1861,10 @@ scm_to_stringn (SCM str, size_t *lenp, const char 
*encoding,
             (seven characters).  Make BUF large enough to hold
             that.  */
          buf = scm_realloc (buf, (len * 7) / 6 + 1);
-         scm_i_unistring_escapes_to_r6rs_escapes (buf, &len);
+         unistring_escapes_to_r6rs_escapes (buf, &len);
        }
       else
-        scm_i_unistring_escapes_to_guile_escapes (buf, &len);
+        unistring_escapes_to_guile_escapes (buf, &len);
 
       buf = scm_realloc (buf, len);
     }
diff --git a/libguile/strings.h b/libguile/strings.h
index 1a8ff7c..168fcb7 100644
--- a/libguile/strings.h
+++ b/libguile/strings.h
@@ -216,10 +216,6 @@ SCM_INTERNAL char **scm_i_allocate_string_pointers (SCM 
list);
 SCM_INTERNAL void scm_i_get_substring_spec (size_t len,
                                            SCM start, size_t *cstart,
                                            SCM end, size_t *cend);
-SCM_INTERNAL void scm_i_unistring_escapes_to_guile_escapes (char *buf,
-                                                           size_t *len);
-SCM_INTERNAL void scm_i_unistring_escapes_to_r6rs_escapes (char *buf,
-                                                          size_t *len);
 
 /* Debugging functions */
 
diff --git a/test-suite/tests/encoding-escapes.test 
b/test-suite/tests/encoding-escapes.test
index 01b2e20..ee3fba1 100644
--- a/test-suite/tests/encoding-escapes.test
+++ b/test-suite/tests/encoding-escapes.test
@@ -1,6 +1,6 @@
 ;;;; encoding-escapes.test --- test suite for Guile's string encodings -*- 
mode: scheme; coding: utf-8 -*-
 ;;;;
-;;;; Copyright (C) 2009, 2010 Free Software Foundation, Inc.
+;;;; Copyright (C) 2009, 2010, 2011 Free Software Foundation, Inc.
 ;;;;
 ;;;; This library is free software; you can redistribute it and/or
 ;;;; modify it under the terms of the GNU Lesser General Public
@@ -118,8 +118,21 @@
       (set-port-encoding! pt "ASCII")
       (set-port-conversion-strategy! pt 'escape)
       (display s4 pt)
-      (string=? "\\u7F85\\u751F\\u9580"
-                (get-output-string pt)))))
+      (string=? "\\u7f85\\u751f\\u9580"
+                (get-output-string pt))))
+
+  (pass-if "fake escape"
+           ;; The input string below contains something that looks like
+           ;; an escape in libunistring syntax, but which should be left
+           ;; as is in the output.  See
+           ;; 
<http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00004.html>
+           ;; for background info.
+           (let ((pt (open-output-string)))
+            (set-port-encoding! pt "ASCII")
+             (set-port-conversion-strategy! pt 'escape)
+             (display "λ -- \\u0012" pt)
+             (string=? "\\u03bb -- \\u0012"
+                       (get-output-string pt)))))
 
 (with-test-prefix "input escapes"
 


hooks/post-receive
-- 
GNU Guile



reply via email to

[Prev in Thread] Current Thread [Next in Thread]