Re: horrible utf-8 performace in wc
From: Bo Borgerson
Subject: Re: horrible utf-8 performace in wc
Date: Wed, 07 May 2008 18:29:22 -0400
User-agent: Thunderbird 2.0.0.12 (X11/20080227)

Pádraig Brady wrote:
> In the first 65535 code points there are also 404 chars which are
> not classed as combining in the unicode database, but are classed
> as zero width in the glibc locale data at least (zero-width space
> being one of them like you mentioned). I determined this with the
> attached progs:
>
> ./zw | python unidata.py | grep " 0 " | wc -l

Hi Pádraig,

Wow, I knew there were some stand-alone zero-width characters, but I had
no idea there were so many!

I poked around a little in gnulib and found a function for determining
the combining class of a Unicode character.

I think the attached patch does what you were intending to do: it skips
combining characters, but still counts all of the stand-alone zero-width
characters you found:

----
$ ./zw | python unidata.py | grep " 0 " | perl packu.pl | src/wc -m
404
$ src/wc -m 2char
2 2char
----

Please note that this requires a re-run of `./bootstrap', since the patch
pulls in an extra module (unictype/combining-class) from gnulib.

Hope that helps.

Bo

diff --git a/bootstrap.conf b/bootstrap.conf
index 8bde0ad..ef5a328 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -82,6 +82,7 @@ gnulib_modules="
   stpncpy
   strftime
   strpbrk strtoimax strtoumax strverscmp sys_stat timespec tzset
+  unictype/combining-class
   unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
   uptime
   useless-if-before-free
diff --git a/src/wc.c b/src/wc.c
index 61ab485..ed6630c 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -32,6 +32,8 @@
 #include "readtokens0.h"
 #include "safe-read.h"
 
+#include "unictype.h"
+
 #if !defined iswspace && !HAVE_ISWSPACE
 # define iswspace(wc) \
     ((wc) == to_uchar (wc) && isspace (to_uchar (wc)))
@@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
                             linepos += width;
                           if (iswspace (wide_char))
                             goto mb_word_separator;
+                          else if (uc_combining_class (wide_char) != 0)
+                            chars--; /* don't count combining chars */
                           in_word = true;
                         }
                       break;
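
For readers who want to try the counting rule without rebuilding coreutils, here is a minimal sketch in Python rather than C. It uses the standard library's `unicodedata.combining` as a stand-in for gnulib's `uc_combining_class`. The helper name `count_chars` and the sample strings are assumptions made for illustration; in particular, the sample is only a guess at the kind of input the `2char` test file holds.

```python
import unicodedata


def count_chars(text: str) -> int:
    """Count code points, skipping those with a nonzero combining class.

    This mirrors the idea of the patch (do not count combining
    characters); it is not the wc implementation itself.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    # Hypothetical sample: "e" + U+0301 COMBINING ACUTE ACCENT,
    # followed by the precomposed U+00E9.
    sample = "e\u0301\u00e9"
    print(len(sample))          # raw code points: 3
    print(count_chars(sample))  # combining mark skipped: 2

    # U+200B ZERO WIDTH SPACE has combining class 0, so it is still
    # counted, like the stand-alone zero-width characters in the thread.
    print(count_chars("\u200b"))  # 1
```

Filtering on the combining class instead of on display width is what keeps stand-alone zero-width characters in the count while dropping marks that attach to a preceding base character.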

Attachment: packu.pl (Perl program)