Re: Bytewise u??_conv_from_encoding
From: Bruno Haible
Subject: Re: Bytewise u??_conv_from_encoding
Date: Wed, 05 Jan 2022 21:59:21 +0100
Hello Marc,
> > (A) Using a pipe at the shell level:
> > iconv -t UTF-8 | my-program
> >
> > (B) Using a programming language that has a coroutines concept.
> > This way, both the decoder and the consumer can be programmed in
> > a straightforward manner.
> >
> > (C) In C, with multiple threads.
> >
> > (D) In C, with a decoder programmed in a straightforward manner
> > and a consumer that is written as a callback with state.
> >
> > (E) In C, with a decoder written as a callback with state
> > and a consumer programmed in a straightforward manner.
> >
> > > Thus, I am wondering whether it makes sense to offer a stateful
> > > decoder that takes byte by byte and signals as soon as a decoded byte
> > > sequence is ready.
> >
> > It seems that you are thinking of approach (D).
>
> > I think (D) is the worst, because writing application code in a callback
> > style with state is hard and error-prone. I would favour (E) instead,
> > if (A) is not possible.
>
> If I understand your classification correctly, I meant something more
> like (E) than (D), I think. As an interface, I would propose
> something along the following lines:
>
> decoder_t d = decoder_create (iconveh_t *cd);
> switch (decoder_push (d, byte))
> {
> case DECODER_BYTE_READ:
> char *res = decoder_result (d);
> size_t len = decoder_length (d);
> ...
What does the programmer do here with res and len? This is where things
get complex.
> case DECODER_EOF:
> ...
> case DECODER_INCOMPLETE:
> ...
> case DECODER_ERROR:
> ...
> }
> ...
> decoder_destroy (d);
What you describe here is (D), in my view.
(E) would look like this:
extern decoder_t create_decoder_context (void);
extern void push_bytes_into_decoder (const char *p, size_t n, decoder_t);
extern void free_decoder_context (decoder_t);
> > (B) means to use a different programming language. I can't recommend C++
> > [1].
>
> The main problem I see with C++'s coroutines is that they are
> stackless coroutines; their expressiveness is tiny compared to
> languages with full coroutine support, to say nothing of programming
> languages like Scheme with its first-class continuations.
It doesn't surprise me. 'constexpr', another new addition to C++, similarly
does only a fraction of what would be useful.
> > (C) is possible, but complex. See e.g. gnulib's pipe-filter-ii.c or
> > pipe-filter-gi.c. Generally, threads are overkill when all you need are
> > coroutines.
>
> I agree. Unfortunately, POSIX's response to dropping makecontext and
> friends seems to be to use threads. It would be great if C had a
> lightweight context-swapping mechanism.
Maybe. I think setcontext() has a severe problem; see
<https://www.gnu.org/software/gnulib/manual/html_node/setcontext.html>.
> By the way, libunistring's u??_conv_from_encoding does not seem to be
> adapted to consuming buffers. The problem is that one doesn't know in
> advance where boundaries of multi-byte sequences are so
> u??_conv_from_encoding will likely signal a decoding error.
Yes, u??_conv_from_encoding is made for converting entire strings.
If you want to restart conversion after some bytes that are part of
a multibyte character, you need the low-level iconv().
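
To illustrate the low-level route, here is a minimal sketch of chunk-wise conversion with iconv(). It relies on the documented behaviour that iconv() fails with errno set to EINVAL when the input ends in the middle of a multibyte character, leaving the input pointer at the start of that character. The struct chunk_decoder type, the chunk_decoder_* functions and the two size limits are hypothetical names made up for this example; they are not part of libunistring or gnulib.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libunistring or gnulib.  */
enum { CHUNK_MAX = 4096, PENDING_MAX = 16 };

struct chunk_decoder
{
  iconv_t cd;                 /* descriptor obtained from iconv_open() */
  char pending[PENDING_MAX];  /* head of an unfinished multibyte character */
  size_t pending_len;
};

/* Returns 0 on success, -1 if iconv_open() rejects the encoding pair.  */
static int
chunk_decoder_open (struct chunk_decoder *dec, const char *to, const char *from)
{
  dec->cd = iconv_open (to, from);
  dec->pending_len = 0;
  return dec->cd == (iconv_t) -1 ? -1 : 0;
}

/* Converts one chunk of at most CHUNK_MAX bytes into OUT and stores the
   number of bytes produced in *OUT_LEN.  Bytes of a trailing incomplete
   character are kept for the next call.  Returns 0 on success, -1 with
   errno set otherwise (EILSEQ: invalid input, E2BIG: OUT too small).
   This sketch does not try to recover the state after an error.  */
static int
chunk_decoder_feed (struct chunk_decoder *dec, const char *in, size_t in_len,
                    char *out, size_t out_size, size_t *out_len)
{
  char buf[PENDING_MAX + CHUNK_MAX];
  char *inptr = buf;
  char *outptr = out;
  size_t inleft;
  size_t outleft = out_size;

  assert (in_len <= CHUNK_MAX);

  /* Prepend the bytes left over from the previous chunk.  */
  memcpy (buf, dec->pending, dec->pending_len);
  memcpy (buf + dec->pending_len, in, in_len);
  inleft = dec->pending_len + in_len;
  dec->pending_len = 0;

  if (iconv (dec->cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    {
      /* EINVAL means: incomplete character at the end of the input.  */
      if (errno != EINVAL || inleft > PENDING_MAX)
        return -1;
      memcpy (dec->pending, inptr, inleft);
      dec->pending_len = inleft;
    }
  *out_len = out_size - outleft;
  return 0;
}

/* Closes the descriptor.  Returns -1 if the input ended in the middle of
   a multibyte character, 0 otherwise.  */
static int
chunk_decoder_close (struct chunk_decoder *dec)
{
  int truncated = dec->pending_len != 0;
  iconv_close (dec->cd);
  return truncated ? -1 : 0;
}
```

A caller would open the decoder once, pass each buffer it reads to chunk_decoder_feed(), and treat a failing chunk_decoder_close() as truncated input. Copying every chunk into a temporary buffer keeps the sketch short; a real implementation might instead convert the pending bytes separately and then work on the caller's buffer in place.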
Bruno