From: Paul Eggert
Subject: Re: [bug-libunistring] bug#34524: wc: word count incorrect when words separated only by no-break space
Date: Sun, 24 Feb 2019 09:47:02 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
Bruno Haible wrote:
> I would find it best to introduce an option '--unicode' to 'wc', that would produce Unicode compliant results, at the cost of - not following POSIX to the letter,
It'd make sense to have an option. How about a more-general option --words, that would let the user define what a word is? This option's operand could use ERE syntax, or a shorthand beginning with '+' for common combinations. For example, the command:
    wc --words='[[:alnum:]]+'

would say that a word consists of the longest contiguous sequence of alphanumeric characters. And

    wc --words='+unicode'

would use the Unicode definition of word, whatever it is.
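
To make the regex-based idea concrete, here is a rough sketch in Python of counting words as maximal matches of a pattern. It is only an illustration: --words is a proposal and not an existing wc option, the names count_words, ALNUM_WORD and ASCII_DELIMITED are made up for this example, and since Python's re module has no POSIX bracket classes, [^\W_]+ stands in as an approximation of [[:alnum:]]+.

```python
import re

# Approximation of the proposed --words='[[:alnum:]]+' operand (assumption:
# Python's re lacks POSIX classes, so "word characters minus underscore"
# is used instead).
ALNUM_WORD = re.compile(r"[^\W_]+")

# A word as a maximal run of characters other than ASCII whitespace.
# U+00A0 (no-break space) is not in this set, so it does not end a word.
ASCII_DELIMITED = re.compile(r"[^ \t\n\r\f\v]+")


def count_words(text, pattern=ALNUM_WORD):
    """Count the non-overlapping, leftmost-longest runs matching pattern."""
    return sum(1 for _ in pattern.finditer(text))


# "foo" and "bar" are separated only by a no-break space.
sample = "foo\u00a0bar baz"
by_alnum = count_words(sample)                   # foo, bar, baz
by_ascii = count_words(sample, ASCII_DELIMITED)  # foo<NBSP>bar, baz
```

With the alphanumeric pattern the no-break space ends a word, while the ASCII-whitespace pattern treats "foo" and "bar" as a single word, which is the discrepancy this bug report is about.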