Re: Bytewise u??_conv_from_encoding
From: Bruno Haible
Subject: Re: Bytewise u??_conv_from_encoding
Date: Sat, 01 Jan 2022 13:57:29 +0100
Hi Marc,
> The demand to read a file (in local encoding) and to decode it
> incrementally seems a typical one.
There are five ways to satisfy this demand.
(A) Using a pipe at the shell level:
iconv -t UTF-8 | my-program
(B) Using a programming language that has a coroutines concept.
This way, both the decoder and the consumer can be programmed in
a straightforward manner.
(C) In C, with multiple threads.
(D) In C, with a decoder programmed in a straightforward manner
and a consumer that is written as a callback with state.
(E) In C, with a decoder written as a callback with state
and a consumer programmed in a straightforward manner.
> Thus, I am wondering whether it makes sense to offer a stateful
> decoder that takes byte by byte and signals as soon as a decoded byte
> sequence is ready.
It seems that you are thinking of approach (D).
I think (D) is the worst, because writing application code in a callback
style with state is hard and error-prone. I would favour (E) instead,
if (A) is not possible.
(B) means using a different programming language. I can't recommend C++ [1].
(C) is possible, but complex. See e.g. gnulib's pipe-filter-ii.c or
pipe-filter-gi.c. Generally, threads are overkill when all you need are
coroutines.
Now, when implementing (E), it will be useful to have some kind of "abstract
input stream" data type. Such a thing does not exist in C, for historical
reasons. But it can be done similarly to the "abstract output stream" data
type that is at the heart of GNU libtextstyle [2][3][4].
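To make (E) more concrete, here is a minimal sketch of what such an "abstract input stream" could look like. All names (struct istream, mem_istream, latin1_istream) are hypothetical and are not part of gnulib or libtextstyle; ISO-8859-1 is used as the source encoding only because its conversion to UTF-8 fits in a few lines. The decoder keeps its state in a struct and is invoked through a function pointer, while the consumer simply calls read in a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical abstract input stream: read() stores up to n bytes in buf
   and returns the number of bytes stored, or 0 at end of input.  */
struct istream {
  size_t (*read) (struct istream *self, unsigned char *buf, size_t n);
};

/* A source stream that delivers the bytes of a memory block.  */
struct mem_istream {
  struct istream base;
  const unsigned char *data;
  size_t remaining;
};

static size_t
mem_read (struct istream *self, unsigned char *buf, size_t n)
{
  struct mem_istream *s = (struct mem_istream *) self;
  if (n > s->remaining)
    n = s->remaining;
  memcpy (buf, s->data, n);
  s->data += n;
  s->remaining -= n;
  return n;
}

static void
mem_istream_init (struct mem_istream *s, const void *data, size_t len)
{
  assert (data != NULL);
  s->base.read = mem_read;
  s->data = data;
  s->remaining = len;
}

/* A decoding stream: pulls ISO-8859-1 bytes from 'source' in chunks and
   delivers UTF-8.  All decoder state lives in this struct.  */
struct latin1_istream {
  struct istream base;
  struct istream *source;
  unsigned char in[1024];   /* buffered, not yet decoded input */
  size_t in_pos, in_len;
  int have_pending;         /* nonzero if 'pending' must be delivered first */
  unsigned char pending;    /* second UTF-8 byte that did not fit last time */
};

static size_t
latin1_read (struct istream *self, unsigned char *buf, size_t n)
{
  struct latin1_istream *s = (struct latin1_istream *) self;
  size_t out = 0;
  while (out < n)
    {
      if (s->have_pending)
        {
          buf[out++] = s->pending;
          s->have_pending = 0;
          continue;
        }
      if (s->in_pos == s->in_len)
        {
          /* Refill the input buffer from the underlying stream.  */
          s->in_len = s->source->read (s->source, s->in, sizeof s->in);
          s->in_pos = 0;
          if (s->in_len == 0)
            break;
        }
      unsigned char c = s->in[s->in_pos++];
      if (c < 0x80)
        buf[out++] = c;
      else
        {
          /* U+0080..U+00FF become two UTF-8 bytes.  */
          buf[out++] = 0xC0 | (c >> 6);
          s->pending = 0x80 | (c & 0x3F);
          s->have_pending = 1;
        }
    }
  return out;
}

static void
latin1_istream_init (struct latin1_istream *s, struct istream *source)
{
  assert (source != NULL);
  s->base.read = latin1_read;
  s->source = source;
  s->in_pos = s->in_len = 0;
  s->have_pending = 0;
}
```

A consumer would initialize a mem_istream (or a file-backed equivalent), wrap it in a latin1_istream, and then call dec.base.read (&dec.base, buf, sizeof buf) until it returns 0. The consumer needs no state machine of its own.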
> On top of that, a decoding Unicode mbfile interface can be built, say ucfile.
One of the problems of byte-by-byte decoding is that it's inefficient. It's
way more efficient to do the same task (decoding, consuming) on an entire
buffer of, say, at least 1 KiB. Buffering minimizes the context switches and
time spent in function entry/exit. That needs to be considered in the design.
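As an illustration of buffer-wise decoding, here is a sketch that uses the POSIX iconv(3) API on blocks of 4 KiB. The helper name decode_stream and the buffer sizes are my assumptions, not an existing gnulib function. An incomplete multibyte sequence at the end of a block (EINVAL) is carried over to the next round; for stateful target encodings one would additionally have to flush the shift state at the end, which is omitted here because the target is UTF-8.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode all of 'in' (encoded in 'fromcode') to UTF-8
   on 'out', one buffer at a time.  Returns 0 on success, -1 on error.  */
static int
decode_stream (FILE *in, FILE *out, const char *fromcode)
{
  assert (in != NULL && out != NULL && fromcode != NULL);

  iconv_t cd = iconv_open ("UTF-8", fromcode);
  if (cd == (iconv_t) -1)
    return -1;

  char inbuf[4096];
  char outbuf[4096];
  size_t inlen = 0;   /* bytes carried over from the previous round */
  int result = 0;

  for (;;)
    {
      size_t nread = fread (inbuf + inlen, 1, sizeof inbuf - inlen, in);
      int at_eof = (nread == 0);
      inlen += nread;

      char *inptr = inbuf;
      size_t inleft = inlen;
      while (inleft > 0)
        {
          char *outptr = outbuf;
          size_t outleft = sizeof outbuf;
          size_t r = iconv (cd, &inptr, &inleft, &outptr, &outleft);
          fwrite (outbuf, 1, (size_t) (outptr - outbuf), out);
          if (r == (size_t) -1)
            {
              if (errno == E2BIG)
                continue;   /* output buffer was full: convert the rest */
              if (errno == EINVAL && !at_eof)
                break;      /* incomplete sequence: wait for more input */
              result = -1;  /* invalid or truncated input */
              goto done;
            }
        }

      /* Move the unconverted tail to the front of the input buffer.  */
      memmove (inbuf, inptr, inleft);
      inlen = inleft;
      if (at_eof)
        break;
    }

 done:
  iconv_close (cd);
  return result;
}
```

Called as decode_stream (stdin, stdout, "ISO-8859-1"), this does roughly what the iconv pipe in (A) does, with one iconv call per block instead of one call per byte.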
Bruno
[1] https://en.cppreference.com/w/cpp/language/coroutines
[2] https://www.gnu.org/software/gettext/libtextstyle/manual/html_node/The-output-stream-hierarchy.html
[3] https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=libtextstyle/gnulib-local/lib/iconv-ostream.oo.h
[4] https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=libtextstyle/gnulib-local/lib/iconv-ostream.oo.c