Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?

From:	Ben Pfaff
Subject:	Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?
Date:	Tue, 2 Sep 2014 21:36:07 -0700
User-agent:	Mutt/1.5.21 (2010-09-15)

[I'm bringing this back on the list because it seemed likely to me that
it was taken off-list by mistake, and I don't see anything potentially
private.]

On Wed, Sep 03, 2014 at 12:29:38AM -0400, Andrew Boling wrote:
> > On the other hand: Are you using grapheme breaks or word breaks?
> > The u*_grapheme_breaks() functions, in particular, are very simple,
> > and use only public libunistring interfaces, so it would be very
> > easy for the library client to implement its own specialized version
> > that also returns the number of breaks found.
> 
> 
> Grapheme breaks. It would be trivial to implement my own loop that achieves
> the desired end result, but as someone who tries to avoid the strlen()
> mentality it struck me as a question worth asking. I brought up the
> u*_wordbreaks() functions as well because the workflows were very similar
> and they shared the same quirk. (calculating a list of items but not
> returning the item count)
> 
> At the end of the day, it's not a critical missing feature and the tools
> are there for anyone that this is important to. I'm just a stickler for
> counts being returned when they make sense. Needless scans are something I
> expect out of higher level languages.

OK.

I'm not sure I was clear about what I was suggesting.  Here's
u8_grapheme_breaks():

    void
    u8_grapheme_breaks (const uint8_t *s, size_t n, char *p)
    {
      ucs4_t prev;
      int mblen;

      prev = 0;
      for (; n > 0; s += mblen, p += mblen, n -= mblen)
        {
          ucs4_t next;
          int i;

          mblen = u8_mbtouc (&next, s, n);

          p[0] = uc_is_grapheme_break (prev, next);
          for (i = 1; i < mblen; i++)
            p[i] = 0;

          prev = next;
        }
    }

You could easily write your own that also returns the number of breaks:

    size_t
    u8_grapheme_breaks_with_count (const uint8_t *s, size_t n, char *p)
    {
      size_t n_breaks = 0;
      ucs4_t prev;
      int mblen;

      prev = 0;
      for (; n > 0; s += mblen, p += mblen, n -= mblen)
        {
          ucs4_t next;
          int i;

          mblen = u8_mbtouc (&next, s, n);

          p[0] = uc_is_grapheme_break (prev, next);
          n_breaks += p[0] != 0;
          for (i = 1; i < mblen; i++)
            p[i] = 0;

          prev = next;
        }
      return n_breaks;
    }

It's not as nice as having it built-in, but it is not complicated and
does not do an extra scan.
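
For comparison, here is a rough sketch of the extra scan that the
combined function avoids: call the stock u8_grapheme_breaks() first,
then walk the result array a second time. The name count_breaks is
hypothetical and not part of libunistring; it assumes only that p holds
n entries, nonzero wherever a break was marked.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libunistring: counts the nonzero
   entries in a break array P of length N, as filled in by a function
   such as u8_grapheme_breaks().  This is the second pass over the
   data that a combined function makes unnecessary.  */
size_t
count_breaks (const char *p, size_t n)
{
  size_t n_breaks = 0;
  size_t i;

  for (i = 0; i < n; i++)
    n_breaks += p[i] != 0;
  return n_breaks;
}
```

The second pass is cheap, but it touches every byte of the array
again, which is exactly the cost the version with the built-in counter
sidesteps.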
