[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: 2 issues with binning
From: |
Erik Auerswald |
Subject: |
Re: 2 issues with binning |
Date: |
Wed, 22 Jun 2022 16:52:32 +0200 |
Hi Andreas,
On Sun, Jun 19, 2022 at 11:54:48PM +0200, Andreas Schamanek wrote:
> [...]
> ## Possible issue due to binary floating point arithmetics:
>
> $ printf '%s\n' 4.19 4.2 4.21 | datamash --full bin:0.1 1
> 4.19 4.1
> 4.2 4.1
> 4.21 4.2
>
> Of course, 4.2 should bin to 4.2, unless I mistaken.
I think that this is indeed a floating point issue:
$ printf '%s\n' 4.19 4.2000000000000000001 4.21 | env LC_NUMERIC=C datamash
--full bin:0.1 1 | column -t
4.19 4.1
4.2000000000000000001 4.2
4.21 4.2
$ printf '%s\n' 4.19 4.20000000000000000001 4.21 | env LC_NUMERIC=C datamash
--full bin:0.1 1 | column -t
4.19 4.1
4.20000000000000000001 4.1
4.21 4.2
> ## Possible issue with binning negative numbers:
>
> $ printf '%s\n' 0 1 2 | datamash --full bin:2 1
> 0 0
> 1 0
> 2 2
>
> For positive numbers, the bins are inclusive ("[") on the lower end,
> exclusive on the upper end (")"), i.e. here they are [0,2) and
> [2,4).
>
> I expected this type of binning to continue for negative numbers,
> i.e. that the bins left of [0,2) are [-2,0) and next one would be
> [-4,2). However:
>
> $ printf '%s\n' -2 -1 0 | datamash --full bin:2 1
> -2 -4
> -1 -2
> 0 0
>
> I was expecting -2 to map to -2. Maybe, bin:1 shows my concerns even
> better:
>
> $ printf '%s\n' -2 -1 0 | datamash --full bin:1 1
> -2 -3
> -1 -2
> 0 0
>
> Curious, what you think!
I, too, think that this is unexpected.
Using a bin size of 1 as example, GNU Datamash uses the following bins:
..., (-3,-2], (-2,-1], (-1,-0], [0,1), [1,2), [2,3), ...
It seems to me as if it were less suprising to use:
..., [-3,-2), [-2,-1), [-1,0), [0,1), [1,2), [2,3), ...
The current behavior is documented (with an example) at
<https://www.gnu.org/software/datamash/manual/html_node/Binning-numbers.html>
and checked with test "bin5" in file "tests/datamash-tests-2".
The following patch might adjust binning to use the less suprising bins
(hardly tested, may not work that well):
---------------- 8< ----------------
diff --git a/src/field-ops.c b/src/field-ops.c
index a426b90..e1588b5 100644
--- a/src/field-ops.c
+++ b/src/field-ops.c
@@ -566,10 +566,11 @@ field_op_collect (struct fieldop *op,
case OP_BIN_BUCKETS:
{
const long double val = num_value / op->params.bin_bucket_size;
- modfl (val, & op->value);
+ const long double frac = modfl (val, & op->value);
/* signbit will take care of negative-zero as well. */
- if (signbit (op->value))
+ if (signbit (op->value) && !is_zero (frac))
--op->value;
+ op->value = pos_zero (op->value);
op->value *= op->params.bin_bucket_size;
}
break;
---------------- >8 ----------------
The above patch is intended to illustrate a possible way to change the
implementation. It is not intended to be merged as-is.
The comment regarding negative-zero would need to be adjusted as well, the
new line with "pos_zero" is required to bin -0 into 0. The tests would
need adjustments as well, at least "bin5" from "tests/datamash-tests-2".
The documentation would need to be adjusted, too. The new behavior
would also need to be mentioned in the "NEWS" file.
HTH,
Erik