Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?
From: Ben Pfaff
Subject: Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?
Date: Tue, 2 Sep 2014 21:36:07 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

[I'm bringing this back on the list because it seemed likely to me that
it was taken off-list by mistake, and I don't see anything potentially
private.]
On Wed, Sep 03, 2014 at 12:29:38AM -0400, Andrew Boling wrote:
> > On the other hand: Are you using grapheme breaks or word breaks?
> > The u*_grapheme_breaks() functions, in particular, are very simple,
> > and use only public libunistring interfaces, so it would be very
> > easy for the library client to implement its own specialized version
> > that also returns the number of breaks found.
>
>
> Grapheme breaks. It would be trivial to implement my own loop that achieves
> the desired end result, but as someone who tries to avoid the strlen()
> mentality it struck me as a question worth asking. I brought up the
> u*_wordbreaks() functions as well because the workflows were very similar
> and they shared the same quirk (calculating a list of items but not
> returning the item count).
>
> At the end of the day, it's not a critical missing feature and the tools
> are there for anyone that this is important to. I'm just a stickler for
> counts being returned when they make sense. Needless scans are something I
> expect out of higher level languages.
OK.
I'm not sure I was clear about what I was suggesting. Here's
u8_grapheme_breaks():

void
u8_grapheme_breaks (const uint8_t *s, size_t n, char *p)
{
  ucs4_t prev;
  int mblen;

  prev = 0;
  for (; n > 0; s += mblen, p += mblen, n -= mblen)
    {
      ucs4_t next;
      int i;

      mblen = u8_mbtouc (&next, s, n);
      p[0] = uc_is_grapheme_break (prev, next);
      for (i = 1; i < mblen; i++)
        p[i] = 0;
      prev = next;
    }
}

You could easily write your own that also returns the number of breaks:

size_t
u8_grapheme_breaks_with_count (const uint8_t *s, size_t n, char *p)
{
  size_t n_breaks = 0;
  ucs4_t prev;
  int mblen;

  prev = 0;
  for (; n > 0; s += mblen, p += mblen, n -= mblen)
    {
      ucs4_t next;
      int i;

      mblen = u8_mbtouc (&next, s, n);
      p[0] = uc_is_grapheme_break (prev, next);
      /* Count the break while it is being recorded, in the same pass.  */
      n_breaks += p[0] != 0;
      for (i = 1; i < mblen; i++)
        p[i] = 0;
      prev = next;
    }

  return n_breaks;
}

It's not as nice as having it built-in, but it is not complicated and
does not do an extra scan.
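
For comparison, here is a minimal sketch of the extra scan that a caller of the stock u8_grapheme_breaks() would otherwise need. The name count_breaks is hypothetical and is not part of libunistring; the sketch assumes only that p is the n-byte output array, with a nonzero byte at each break position.

```c
#include <stddef.h>

/* Hypothetical helper, not a libunistring function: count the nonzero
   entries in the break array P of length N.  This is the second pass
   over the data that a combined fill-and-count loop makes unnecessary.  */
size_t
count_breaks (const char *p, size_t n)
{
  size_t n_breaks = 0;
  size_t i;

  for (i = 0; i < n; i++)
    n_breaks += p[i] != 0;

  return n_breaks;
}
```

Calling count_breaks (p, n) after u8_grapheme_breaks (s, n, p) should give the same number that the combined function returns, at the cost of walking the array a second time.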