Hi Ed.
I finally took a look at this. I don't see a need for major changes in the
doc. If you look at node "Fields" it says pretty clearly:
When @command{awk} reads an input record, the record is
automatically @dfn{parsed} or separated by the @command{awk}
utility into chunks called @dfn{fields}. By default, fields
are separated by @dfn{whitespace}, like words in a line.
Whitespace in @command{awk} means any string of one or more
spaces, TABs, or newlines; other characters that are considered
whitespace by other languages (such as formfeed, vertical tab,
etc.) are @emph{not} considered whitespace by @command{awk}.
Nowhere does the doc claim that this whitespace is related to the regex
character class [:space:] (and in fact it is not), so I think this was just
your confusion.
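If you want to check the behavior yourself, something along these lines
should do it. This is only a quick sketch: it assumes a POSIX shell with
gawk installed as "awk" (other awk implementations may split differently),
and the sample string and the variable name nfields are made up for
illustration.

```shell
# Made-up sample string: carriage return (\r), vertical tab (\v) and
# formfeed (\f) sit inside "words"; space, TAB (\t) and newline (\n)
# sit between them.
# split() with no third argument uses FS, which is still the default " "
# inside BEGIN, so this counts fields under default field splitting.
nfields=$(awk 'BEGIN { print split("a\rb c\vd\te\ff\ng", parts) }')

# With gawk this reports 4: only space, TAB and newline separated fields.
echo "fields: $nfields"
```

If \r, \v and \f counted as whitespace here, the count would be 7 instead.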
Thanks,
Arnold
Ed Morton <address@hidden> wrote:
I was just looking up which exact characters get included in the set of
field separators when FS is " " (the default value) and got confused by
this in the gawk documentation:
Class       Meaning
[:blank:]   Space and TAB characters
[:space:]   Space characters (these are: space, TAB, newline,
            carriage return, formfeed and vertical tab)
FS == " "
Fields are separated by runs of *whitespace*. Leading and
trailing whitespace are ignored. This is the default.
(bold added by me)
I took the last statement above to mean that, when FS is " ", the
separators would be the characters in the [:space:] character class. They
are not: carriage return (\r) and vertical tab (\v) do not separate fields
(I didn't bother checking the others). Nor is it the [:blank:] character
class, since newlines (\n) do separate fields. Instead it seems to be
[:blank:] plus newline, and that's supported by the POSIX spec if we
assume that by <blank> they mean [:blank:]:
...by default, a field is a string of non-<blank> non-<newline>
characters.
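For reference, this is roughly the kind of check I mean. It's only a
sketch: it assumes a POSIX shell with gawk installed as "awk" (other awks
may behave differently), and the sample string and the names s, p and
counts are made up. The bracket expressions spell out [:blank:] plus
newline and [:space:] character by character.

```shell
# Made-up sample: six letters joined by space, TAB, newline, CR and VT.
counts=$(awk 'BEGIN {
    s = "a b\tc\nd\re\vf"
    print split(s, p),                    # default FS (" ")
          split(s, p, "[ \t\n]+"),        # [:blank:] plus newline
          split(s, p, "[ \t\n\r\f\v]+")   # everything in [:space:]
}')

# With gawk I get "4 4 6": the default matches blank-plus-newline,
# not the full [:space:] set.
echo "$counts"
```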
But what does newline mean in all of the above? Is it always linefeed
(\n) on all platforms or is it LF (\n) on UNIX and CRLF (\r\n) on
Windows or something else? I really don't know.
So maybe you could update the documentation to say "Fields are
separated by runs of whitespace (i.e. [:blank:] plus linefeed
characters)" or similar? I couldn't find anywhere in the documentation
that states exactly which characters act as separators when FS is " ",
nor what exactly is meant by "whitespace" throughout the documentation. I
think that one tweak, giving a clear definition of the term "whitespace",
would clarify all of it.
Ed.