bug#6366: join can't join on numeric fields
From: Jim Meyering
Subject: bug#6366: join can't join on numeric fields
Date: Wed, 09 Jun 2010 08:56:07 +0200

Alex Shinn wrote:
> 2010/6/8 Pádraig Brady <address@hidden>:
>> On 07/06/10 06:19, Alex Shinn wrote:
>>>
>>> Ideally join should be able to handle files sorted in any order
>>> that sort provides, but as a bare minimum it should at least
>>> be able to join files sorted on numeric fields.
>>
>> Well if there were no aliases in the numbers, you could always
>> sort the output numerically after the join if it was important.
>
> By first sorting lexicographically, you mean?
> In the use case I had, the data was already sorted
> numerically. So whenever I want to join two files,
> currently I have to do:
>
> sort file1 > file1.tmp
> sort file2 > file2.tmp
> join file1.tmp file2.tmp | sort -n > out
> rm -f file1.tmp file2.tmp
>
> instead of just
>
> join -n file1 file2 > out
>
> In the small tools philosophy you want to avoid adding
> redundancy, but in this case join isn't doing the same
> thing as sort, it's just working with it better. Not to mention
> the fact that sort is an expensive operation to have to
> perform multiple times, not just an extra O(n) filter
> to throw in the middle of a pipeline.
>
>> However if you wanted to join "01" and "1" then your patch is required.
>> Are numeric aliases common enough to warrant this? I think so.
>
> Leading zeros may not be so common, but don't forget
> "1.0" and "1" or "1e2" and "100" and "100.0", etc.
>
>> I'd use -g, --general-numeric to correspond with `sort`.
>
> Yes, that's probably better.

There may be a fly in the ointment.
When comparing floating-point numbers, how would join measure equality?
Should it consider 1.000000000000001e2 to be equal to 100.0?
What if the maximum precision available does not
allow us to distinguish those two values?
What about -0 and 0? (With IEEE 754, they compare equal.)
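
To make those questions concrete, here is a rough sketch of what
"numeric equality" looks like when the keys are converted to C doubles.
It assumes an awk that stores numbers as doubles, which is what POSIX
specifies. num_eq and report are hypothetical helpers written only for
this illustration; they are not part of join, and a real implementation
might use a wider type and give different answers at the margins.

```shell
#!/bin/sh
# num_eq A B: succeed if A and B are equal when read as numbers.
# Hypothetical helper for illustration only; join has no such option.
# Assumes awk converts numeric strings to C doubles, as POSIX specifies.
num_eq() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !((a + 0) == (b + 0)) }'
}

# report A B: print whether the two spellings compare equal as numbers.
report() {
    if num_eq "$1" "$2"; then
        printf '%s == %s\n' "$1" "$2"
    else
        printf '%s != %s\n' "$1" "$2"
    fi
}

# Aliases mentioned earlier in the thread.
report 01 1
report 1e2 100.0

# Signed zero.
report -0 0

# Two keys that differ from 100.0 by less and by more than
# the spacing between adjacent doubles near 100.
report 1.00000000000000001e2 100.0
report 1.000000000000001e2 100.0
```

With IEEE 754 doubles, only the last pair should come out unequal:
1.000000000000001e2 is several representable steps away from 100.0,
while 1.00000000000000001e2 rounds to exactly 100.0. So whether two
keys join would depend on the precision of the type chosen, which is
the concern raised above.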