bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#11187: [PATCH] Fix incorrect width handling of multibyte characters


From: Jim Meyering
Subject: bug#11187: [PATCH] Fix incorrect width handling of multibyte characters in fmt
Date: Thu, 05 Apr 2012 21:09:12 +0200

Vladimir "'φ-coder/phcoder'" Serbinenko wrote:
> Currently fmt assumes that 1 byte= 1 column which creates wrongly
> formatted strings. Attached patch fixes it

Hi Vlad,

Thank you for contributing.
This is a large enough change that we'll need an FSF copyright
assignment from you.  If you haven't already sent in the one for
gnulib, please just add coreutils to the list of affected projects.
(you can do up to 4 projects at a time)

Here are a few suggested adjustments:

It'd be great to have a test-suite addition that fails without
your patch, yet succeeds with it.  If you simply provide small
sample input/output pairs along with a selected locale name, we
can convert that to an actual test suite script for you.

Also, since this is a NEWS-worthy change, it is customary to
add an entry in the NEWS file, too.  I'd put this in a section
entitled "Improvements".

> diff --git a/src/fmt.c b/src/fmt.c
> index 89d13a6..56f7c0b 100644
> --- a/src/fmt.c
> +++ b/src/fmt.c
> @@ -20,6 +20,7 @@
>  #include <stdio.h>
>  #include <sys/types.h>
>  #include <getopt.h>
> +#include <wchar.h>
>
>  /* Redefine.  Otherwise, systems (Unicos for one) with headers that define
>     it to be a type get syntax errors for the variable declaration below.  */
> @@ -135,6 +136,7 @@ struct Word
>
>      const char *text;                /* the text of the word */
>      int length;                      /* length of this word */
> +    int width;

Please don't follow the bad example there.  This value is always
unsigned, so it is better to use size_t, to match the type of your
new get_display_width function.

>      int space;                       /* the size of the following space */
>      unsigned int paren:1;    /* starts with open paren */
>      unsigned int period:1;   /* ends in [.?!])* */
> @@ -259,6 +261,42 @@ static int next_prefix_indent;
>     paragraphs chosen by fmt_paragraph().  */
>  static int last_line_length;
>

Please add a comment saying what the function does/returns,
and naming/describing the arguments.

> +static size_t
> +get_display_width (const char *beg, const char *end)
> +{
> +  const char *ptr;
> +  size_t r = 0;
> +  mbstate_t ps;
> +
> +  memset (&ps, 0, sizeof (ps));

We prefer to initialize mbstate_t variables all on one line, like this:

    mbstate_t ps = { 0, };

> +  for (ptr = beg; *ptr && ptr < end; )
> +    {
> +      wchar_t wc;
> +      size_t s;

Oops.  You've used TABs for indentation.
Note how mixing TABs and spaces makes the indentation look invalid.
Please use only spaces instead.

> +      s = mbrtowc (&wc, ptr, end - ptr, &ps);
> +      if (s == (size_t) -1)
> +     break;
> +      if (s == (size_t) -2)
> +     {
> +       ptr++;
> +       r++;
> +       continue;
> +     }
> +      if (wc == '\e' && ptr + 3 < end
> +       && ptr[1] == '[' && (ptr[2] == '0' || ptr[2] == '1')
> +       && ptr[3] == 'm')
> +     {
> +       ptr += 4;
> +       continue;
> +     }
> +      r += wcwidth (wc);
> +      ptr += s;
> +    }
> +  return r;
> +}
> +
>  void
>  usage (int status)
>  {
> @@ -669,7 +707,9 @@ get_line (FILE *f, int c)
>            c = getc (f);
>          }
>        while (c != EOF && !isspace (c));
> -      in_column += word_limit->length = wptr - word_limit->text;
> +      word_limit->length = wptr - word_limit->text;
> +      in_column += word_limit->width = get_display_width (word_limit->text,
> +                                                       wptr);
>        check_punctuation (word_limit);
>
>        /* Scan inter-word space.  */
> @@ -871,13 +911,13 @@ fmt_paragraph (void)
>            if (w == word_limit)
>              break;
>
> -          len += (w - 1)->space + w->length; /* w > start >= word */
> +          len += (w - 1)->space + w->width;  /* w > start >= word */
>          }
>        while (len < max_width);
>        start->best_cost = best + base_cost (start);
>      }
>
> -  word_limit->length = saved_length;
> +  word_limit->width = saved_length;
>  }
>
>  /* Return the constant component of the cost of breaking before the
> @@ -902,13 +942,13 @@ base_cost (WORD *this)
>        else if ((this - 1)->punct)
>          cost -= PUNCT_BONUS;
>        else if (this > word + 1 && (this - 2)->final)
> -        cost += WIDOW_COST ((this - 1)->length);
> +        cost += WIDOW_COST ((this - 1)->width);
>      }
>
>    if (this->paren)
>      cost -= PAREN_BONUS;
>    else if (this->final)
> -    cost += ORPHAN_COST (this->length);
> +    cost += ORPHAN_COST (this->width);
>
>    return cost;
>  }
> @@ -983,7 +1023,7 @@ put_word (WORD *w)
>    s = w->text;
>    for (n = w->length; n != 0; n--)
>      putchar (*s++);
> -  out_column += w->length;
> +  out_column += w->width;
>  }
>
>  /* Output to stdout SPACE spaces, or equivalent tabs.  */





reply via email to

[Prev in Thread] Current Thread [Next in Thread]