bug-coreutils

bug#26029: Problems with join


From: Reuti
Subject: bug#26029: Problems with join
Date: Thu, 9 Mar 2017 19:24:40 +0100

Hi,

> On 09.03.2017 at 18:20, Assaf Gordon <address@hidden> wrote:
> 
>> […]
>> Aha, I didn't check this. Then the "-j" option should be moved to a new 
>> section "Deprecated" in the man/info page of the coreutils version too. (And 
>> mention the special handling of -j1 resp. -j2, while -j3 … works as one 
>> expects.)
> 
> I would humbly suggest other wording: I'm not sure '-j' is deprecated.
> It is useful, and does work as expected in most cases.

It's only mentioned in the addendum here:


http://pubs.opengroup.org/onlinepubs/9699919799//utilities/join.html

"Earlier versions of this standard allowed -j, -j1, -j2 options, and a 
form of the -o option that allowed the list option-argument to be multiple 
arguments. These forms are no longer specified by POSIX.1-2008 but may be 
present in some implementations.
…
The obsolescent -j options and the multi-argument -o option are removed in this 
version."


Therefore I still favor moving "-j" to the end of the man page, into a separate 
section, also taking:

Q15: http://www.opengroup.org/austin/papers/posix_faq.html

into account.


> 
> But, it should be better documented to warn against this edge-case.
> 
> Reuti wrote:
>> -j FIELD equivalent to '-1 FIELD -2 FIELD'
>> does not work in all cases essentially.
> 
> It 'just works' in most cases, but indeed we should improve the documentation 
> about edge cases.
> 
> First,
> this is the relevant section that handles the '-j' parameter:
> https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/join.c#n1079

Yep, this I checked in the source too.


> 
> Second,
> Let's ensure '-jN' works in the common cases,
> when it is *not* followed by a number:
> 
> Two input files:
> 
>   $ cat a.txt
>   1 2 3 aaa
>   2 3 4 bbb
> 
>   $ cat b.txt
>   1 2 3 XXX
>   2 3 4 YYY
> 
> '-j1' alone is equivalent to '-1 1 -2 1':
> 
>   $ join -1 1 -2 1 a.txt b.txt
>   1 2 3 aaa 2 3 XXX
>   2 3 4 bbb 3 4 YYY
> 
>   $ join -j1 a.txt b.txt
>   1 2 3 aaa 2 3 XXX
>   2 3 4 bbb 3 4 YYY
> 
> '-j2' alone is equivalent to '-1 2 -2 2':
> 
>   $ join -1 2 -2 2 a.txt b.txt
>   2 1 3 aaa 1 3 XXX
>   3 2 4 bbb 2 4 YYY
> 
>   $ join -j2 a.txt b.txt
>   2 1 3 aaa 1 3 XXX
>   3 2 4 bbb 2 4 YYY
> 
> '-j3' alone is equivalent to '-1 3 -2 3':
> 
>   $ join -1 3 -2 3 a.txt b.txt
>   3 1 2 aaa 1 2 XXX
>   4 2 3 bbb 2 3 YYY
> 
>   $ join -j3 a.txt b.txt
>   3 1 2 aaa 1 2 XXX
>   4 2 3 bbb 2 3 YYY
> 
> So, in the most common cases, '-jN' works for all Ns
> (for "all" being 1,2,3 but really, who needs more than 3 numbers? :) ).
> This is perhaps not like BSD's join.
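
The equivalence shown above can also be checked mechanically. A minimal sketch, assuming GNU join and the two sample files from the quoted message; the scratch directory and the variable names are only for illustration:

```shell
#!/bin/sh
# Assumption: GNU join (coreutils). Recreate the sample inputs from the
# thread in a scratch directory.
export LC_ALL=C
dir=$(mktemp -d)
printf '1 2 3 aaa\n2 3 4 bbb\n' > "$dir/a.txt"
printf '1 2 3 XXX\n2 3 4 YYY\n' > "$dir/b.txt"

# For each N, compare the '-jN' form with the explicit '-1 N -2 N' form.
result=""
for n in 1 2 3; do
    short=$(join "-j$n" "$dir/a.txt" "$dir/b.txt")
    long=$(join -1 "$n" -2 "$n" "$dir/a.txt" "$dir/b.txt")
    if [ "$short" = "$long" ]; then
        result="${result}${result:+ }same"
    else
        result="${result}${result:+ }differ"
    fi
done
echo "$result"
rm -rf "$dir"
```

Each word of the printed line corresponds to one value of N, so a "differ" would point at the N for which the two spellings disagree.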
> 
> 
> Now comes the tricky part:
> If the '-j1' or '-j2' is followed by another parameter,
> and that parameter turns out *not* to be a valid field number,
> it is treated like '-j N' (i.e. '-1 N -2 N'), and join just "does the right 
> thing":
> 
>   $ join -j2 -i a.txt b.txt
>   2 1 3 aaa 1 3 XXX
>   3 2 4 bbb 2 4 YYY
> 
> This is implemented here:
> https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/join.c#n1171

Aha, I didn't spot this. That's really tricky. I only observed that the 
error message complaining about the remaining arguments changed depending on 
whether I removed or added an additional field number. And if the filename is 
just a number it gets even more convoluted, as the overall number of 
arguments then comes into play too.

$ join -j1 1 2

generates no error, although -j1 got a 1: join concludes that it must be the 
name of a file, as otherwise one argument would be missing on the command line, 
AFAICS.
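
To make the role of the argument count concrete, here is a small sketch. It assumes GNU join and creates files literally named "1" and "2" in a scratch directory (these files and the variable names are made up for the illustration, with the same contents as a.txt and b.txt from the thread):

```shell
#!/bin/sh
# Assumption: GNU join (coreutils). Files literally named "1" and "2" are
# created in a scratch directory to mimic 'join -j1 1 2'.
export LC_ALL=C
old=$PWD
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '1 2 3 aaa\n2 3 4 bbb\n' > 1
printf '1 2 3 XXX\n2 3 4 YYY\n' > 2

# Only two operands follow '-j1', so both are taken as file names and the
# join field is 1 for both files.
two_operands=$(join -j1 1 2)

# With a third operand, the first one ('2') is used as the field number
# for file 1, and the remaining two operands are the files.
three_operands=$(join -j1 2 1 2)

printf '%s\n---\n%s\n' "$two_operands" "$three_operands"
cd "$old" && rm -rf "$dir"
```

The part before the "---" separator should match a plain '-j1' join of the two files, while the part after it should match the '-1 2' behaviour described in the quoted message.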


> And the result is that most of the time, join "just works" (IMHO, but
> other opinions welcomed).
> 
> 
> If the '-j1' or '-j2' is followed by a number, this is where the unexpected 
> behaviour occurs, as it sets the key field for that file alone. E.g. '-j1 2' 
> is equivalent to '-1 2' (and the key for the second
> file is not set, thus defaults to 1):
> 
>   $ join -j1 2 a.txt b.txt
>   2 1 3 aaa 3 4 YYY
> 
>   $ join -1 2 a.txt b.txt
>   2 1 3 aaa 3 4 YYY
> 
> 
> Is the above a satisfactory explanation?

Yes, absolutely.
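
For scripts, one way to sidestep the edge case is to spell out both key fields. A minimal sketch, assuming GNU join and the sample files from the thread (the scratch directory and variable names are illustrative only):

```shell
#!/bin/sh
# Assumption: GNU join (coreutils), sample data taken from the thread.
export LC_ALL=C
dir=$(mktemp -d)
printf '1 2 3 aaa\n2 3 4 bbb\n' > "$dir/a.txt"
printf '1 2 3 XXX\n2 3 4 YYY\n' > "$dir/b.txt"

# Obsolete form: sets the key of the first file only (like '-1 2'),
# the key of the second file stays at its default of 1.
obsolete=$(join -j1 2 "$dir/a.txt" "$dir/b.txt")

# Explicit form: key field 2 in both files, no ambiguity.
explicit=$(join -1 2 -2 2 "$dir/a.txt" "$dir/b.txt")

printf 'obsolete:\n%s\n' "$obsolete"
printf 'explicit:\n%s\n' "$explicit"
rm -rf "$dir"
```

The two results differ on this data, which is exactly the surprise discussed here; the explicit '-1 N -2 N' spelling does not depend on what follows the option.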


> If so, it'll be more-or-less what I'll add to the manual.
> 
> I see that this has been implemented back in 2005, here:
> https://git.savannah.gnu.org/cgit/coreutils.git/commit/src/join.c?id=f9118c1c2e35b
> with the comment:
> "Parse obsolete options -j1 and -j2
>  so that it is a pure extension to POSIX 1003.1-2001."
> 
> I can perhaps guesstimate that since this usage is never
> mentioned anywhere, it is considered undocumented and discouraged usage
> (and indeed, I don't think I've ever encountered it, or previously
> saw a bug-report or question about it - so it's rather rare).
> 
> We could add a warning to the man page - what do others think?

+1

-- Reuti
