coreutils

Re: seq feature: print letters


From: Assaf Gordon
Subject: Re: seq feature: print letters
Date: Tue, 08 Jul 2014 23:01:34 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

Hello,

On 06/30/2014 06:23 AM, address@hidden wrote:
> I'd like to suggest a patch to allow seq to generate letter sequences.

Attached is an improved implementation for the same functionality:
( http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html )

With this patch, 'seq' can print letters of alphabets in the current locale
(or user-specified language). Examples:

     # print all letters in the current alphabet
     seq --alphabet
     seq -a
     # print the first 10 letters in the current alphabet
     seq -a 10
     # print the letters of the Russian alphabet
     # (assuming the locale is installed)
     LC_ALL=ru_RU.utf-8 seq -a
     # print the letters of the Hebrew alphabet
     # (assuming the current locale supports UTF-8 or
     #  other encoding supported by gnulib/libunistring)
     seq --alphabet=he


The new data takes ~5100 bytes (instead of the previous >15KB).

It requires a one-time encoding of a textual 'database' file (included)
using a Perl script (included). Conceptually similar to the Unicode tables,
this only needs to be redone when an alphabet is updated.

The alphabets are encoded in 'src/alphabets_data.h'.
The decoder is in 'src/alphabets.{c,h}' .
The added functionality is in a few new functions in 'src/seq.c' .

===

If you think that this is an acceptable feature (at least conceptually),
then I'd be happy to discuss further details, such as which languages to
include, and implementation suggestions (for example, should this be moved
to gnulib?).

Are there any important encoding issues I might have missed? The code tries
to be as portable as possible: it stores UCS values internally, converts
them to UTF-8 with 'u8_uctomb()', then prints them with
'u8_strconv_to_locale()', so it makes no assumption about the active
encoding.

Should there be an interface for multi-letter output (e.g. "aa" after "z")?
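
To make the question concrete, here is a minimal sketch of one possible
interpretation of multi-letter output: spreadsheet-style labels, where 26
maps to "z" and 27 maps to "aa". The function name 'num_to_label' is
hypothetical and is not part of the attached patch. It handles only the
ASCII letters a-z.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: map a positive integer to a
# spreadsheet-style label (1 -> a, 26 -> z, 27 -> aa, 28 -> ab, ...).
letters=abcdefghijklmnopqrstuvwxyz

num_to_label() {
    n=$1
    label=
    while [ "$n" -gt 0 ]; do
        # Shift to 0-based so that 26 yields "z" rather than "a" plus a carry.
        n=$((n - 1))
        # cut -c is 1-based, so move the 0..25 remainder up by one.
        ch=$(printf '%s\n' "$letters" | cut -c "$((n % 26 + 1))")
        label=$ch$label
        n=$((n / 26))
    done
    printf '%s\n' "$label"
}
```

With this scheme, 'num_to_label 26' prints "z" and 'num_to_label 27' prints
"aa". Whether labels like these are what users would expect in a non-Latin
alphabet is a separate question.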

===

Regarding Bernhard's comment:

On 07/03/2014 02:18 AM, Bernhard Voelker wrote:
> The user could let the shell produce the input:
>    $ printf "%c" {a..z} | seq -s ' ' --alpha=- 2 2 6
>    b d f
> thus picking the Nth character from the input. ;-)

I don't think this example is portable: "{a..z}" is a shell extension that
is not in POSIX sh, so it can't be used in portable scripts.
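
For comparison, here is a rough sketch of producing the same letters using
only POSIX sh features. It assumes an ASCII-compatible encoding (codes
97..122), and the function names 'ascii_letters' and 'pick_letters' are made
up for illustration.

```shell
#!/bin/sh
# Print the lowercase ASCII letters a..z, one per line, without relying on
# brace expansion. Assumes an ASCII-compatible encoding (codes 97..122).
ascii_letters() {
    i=97
    while [ "$i" -le 122 ]; do
        # Turn the code into a 3-digit octal escape and let printf emit it.
        printf "\\$(printf '%03o' "$i")\n"
        i=$((i + 1))
    done
}

# Print every STEP-th letter between positions FIRST and LAST (1-based),
# mirroring the "FIRST INCREMENT LAST" argument order of seq.
pick_letters() {
    ascii_letters | awk -v first="$1" -v step="$2" -v last="$3" \
        'NR >= first && NR <= last && (NR - first) % step == 0'
}
```

Running 'pick_letters 2 2 6' prints b, d and f on separate lines. This works
for ASCII only, which is exactly the limitation discussed below.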

However, more generally, it's easy to generate ranges of Unicode symbols if
their values are known:

    # Arabic letters (Unicode range 0x627 - 0x64a)
    seq $((0x627)) $((0x64a)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf
    # Cyrillic letters (Unicode range 0x410 - 0x42f)
    seq $((0x410)) $((0x42f)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf

But the problem is that the official alphabet letters of each language are
very irregular:
For example, a few letters in the Arabic block aren't official ordinal
letters (they are valid alphabet symbols for a letter under certain
conditions).
Also, in some languages, a letter is actually two Unicode symbols (e.g. in
Czech, "Ch" is a single letter, in addition to the "C" and "H" letters).
In non-English Latin-based languages, besides the simple ASCII letters A-Z,
there are additional symbols whose Unicode values are not sequential.
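
To illustrate why an explicit table is needed, here is a small sketch that
treats each alphabet entry as a whole word rather than a code point. The
table below is only an illustrative fragment of the Czech alphabet (where
"Ch" sorts after "H"), not the data from the patch, and 'first_letters' is a
hypothetical helper.

```shell
#!/bin/sh
# Illustrative fragment only, not a complete Czech alphabet:
# "Ch" is a single letter and sorts after "H".
czech_fragment='G H Ch I J'

# Print the first COUNT entries of a space-separated letter table.
first_letters() {
    count=$1
    i=0
    # $2 is deliberately unquoted so the shell splits it into letters.
    for letter in $2; do
        if [ "$i" -ge "$count" ]; then
            break
        fi
        printf '%s\n' "$letter"
        i=$((i + 1))
    done
}
```

Here 'first_letters 3 "$czech_fragment"' prints G, H and Ch, counting "Ch"
as the third letter. A numeric range over code points cannot express that.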

Whether this feature is desired or not in coreutils is one question. But if
it is (for more languages than English), then I think simple "ranges" will
not suffice.


Comments are welcome,
 -gordon

Attachment: seq_alphabet.2014-07-08.patch.xz
Description: application/xz

