[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[gnuastro-commits] master f4bb1ff5: Library (type.h/table.h): default pr
From: |
Mohammad Akhlaghi |
Subject: |
[gnuastro-commits] master f4bb1ff5: Library (type.h/table.h): default printing of floats in exp mode |
Date: |
Sat, 12 Nov 2022 23:31:44 -0500 (EST) |
branch: master
commit f4bb1ff599ad47d5c4d85195289f2d762a7d067b
Author: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Commit: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Library (type.h/table.h): default printing of floats in exp mode
Until now, when any of the programs were asked to build a plain-text table
they would do so in fixed-point notation. But this could cause a loss of
information/precision (especially when the column contains values with a
very large dynamic range).
With this commit, the default mode for printing floating points (as single
numbers or as table columns) has been changed to exponential mode. This
fixes the problem of potential loss of information. Also to help clarify
the issue to a new user, a new sub-section has been added under the Table
program to describe all the intricacies with printing floating point
numbers and how to best manage their columns in the respective
types. Furthermore, the counting of significant digits when finding the
type of a floating point string input has also been corrected (previously
it wouldn't properly count the digits in an exponential notation).
---
NEWS | 7 +++
bin/table/asttable.conf | 8 ++--
doc/gnuastro.texi | 119 +++++++++++++++++++++++++++++++++++++++++++-----
lib/gnuastro/table.h | 2 +-
lib/type.c | 71 ++++++++++++++---------------
5 files changed, 154 insertions(+), 53 deletions(-)
diff --git a/NEWS b/NEWS
index d33eb8ba..4c0ec264 100644
--- a/NEWS
+++ b/NEWS
@@ -58,6 +58,13 @@ See the end of the file for license conditions.
** Changed features
Table:
+ - To avoid potential loss of information in floating point columns, when
+ printing the columns to standard output (in the terminal) or saving in
+ a plain-text file, the default floating point formats are now printed
+ in exponential (scientific) notation (e.g., 1.234e-5). Until now the
+ default notation was fixed-point notation (e.g., 0.00001234). A new
+ sub-section called "Printing floating point numbers" has been added to
+ the book (under the "Table" section) to clarify this important point.
-A: new short format for --txtf64format. The '-d' short format was
conflicting with the short option name for '--descending'.
diff --git a/bin/table/asttable.conf b/bin/table/asttable.conf
index cd90523a..f9697279 100644
--- a/bin/table/asttable.conf
+++ b/bin/table/asttable.conf
@@ -24,8 +24,8 @@
catrowhdu 1
catcolumnhdu 1
-# Options
- txtf32format fixed
+# Output
+ txtf32format exp
txtf32precision 6
- txtf64format fixed
- txtf64precision 12
+ txtf64format exp
+ txtf64precision 15
diff --git a/doc/gnuastro.texi b/doc/gnuastro.texi
index 9b8bc33d..a5f595aa 100644
--- a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ -468,6 +468,7 @@ Invoking ConvertType
Table
+* Printing floating point numbers:: Optimal storage of floating point types.
* Column arithmetic:: How to do operations on table columns.
* Operation precedence in Table:: Order of running options in Table.
* Invoking asttable:: Options and arguments to Table.
@@ -523,7 +524,7 @@ Arithmetic operators
* Elliptical shape operators:: Operations that are focused on an ellipse.
* Loading external columns:: Read a column from a table into the stack.
* Size and position operators:: Extracting image size and pixel positions.
-* Building new dataset and stack management:: How to construct an empty
dataset from scratch.
+* Building new dataset and stack management:: How to construct an empty
dataset from scratch.
* Operand storage in memory or a file:: Tools for complex operations in one
command.
Convolve
@@ -10998,6 +10999,7 @@ The ranges listed below are inclusive.
The maximum (minimum is its negative) possible value is
@mymath{3.402823\times10^{38}}.
Single-precision floating points can accurately represent a floating point
number up to @mymath{\sim7.2} significant decimals.
Given the heavy noise in astronomical data, this is usually more than
sufficient for storing results.
+For more, see @ref{Printing floating point numbers}.
@item f64
@itemx float64
@@ -11005,6 +11007,7 @@ Given the heavy noise in astronomical data, this is
usually more than sufficient
The maximum (minimum is its negative) possible value is @mymath{\sim10^{308}}.
Double-precision floating points can accurately represent a floating point
number @mymath{\sim15.9} significant decimals.
This is usually good for processing (mixing) the data internally, for example,
a sum of single precision data (and later storing the result as @code{float32}).
+For more, see @ref{Printing floating point numbers}.
@end table
@cartouche
@@ -14013,12 +14016,78 @@ This is a good place to get a general feeling of all
the things you can do with
Finally, in @ref{Invoking asttable}, we give some examples and describe each
option in Table.
@menu
+* Printing floating point numbers:: Optimal storage of floating point types.
* Column arithmetic:: How to do operations on table columns.
* Operation precedence in Table:: Order of running options in Table.
* Invoking asttable:: Options and arguments to Table.
@end menu
-@node Column arithmetic, Operation precedence in Table, Table, Table
+@node Printing floating point numbers, Column arithmetic, Table, Table
+@subsection Printing floating point numbers
+
+@cindex Floating point numbers
+@cindex Printing floating point numbers
+Many of the columns containing astronomical data will contain floating point
numbers (those that aren't an integer, like @mymath{1.23} or
@mymath{4.56\times10^{-7}}).
+However, printing (for human readability) of floating point numbers has some
intricacies that we will explain in this section.
+For a basic introduction to different types of integers or floating points,
see @ref{Numeric data types}.
+
+It may be tempting to simply use 64-bit floating points all the time and avoid
this section over all.
+But have in mind that compared to 32-bit floating point type, a 64-bit
floating point type will consume double the storage, double the RAM and will
take almost double the time for processing.
+So when the statistical precision of your numbers is less than that offered by
32-bit floating point precision, it is much better to store them in this format.
+
+Within almost all commonly used CPUs of today, numbers (including integers or
floating points) are stored in binary base-2 format (where the only digits that
can be used to represent the number are 0 and 1).
+However, we (humans) are use to numbers in base-10 (where we have 10 digits:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9).
+For integers, there is a one-to-one correspondance between a base-2 and
base-10 representation.
+Therefore, converting a base-10 integer (that you will be giving as an option
value when running a Gnuastro program, for example) to base-2 (that the
computer will store in memory), or vice-versa, will not cause any loss of
information for integers.
+
+The problem is that floating point numbers don't have such a one-to-one
correspondance between the two notations.
+The full discussion on how floating point numbers are stored in binary format
is beyond the scope of this book.
+But please have a look at the corresponding
@url{https://en.wikipedia.org/wiki/Floating-point_arithmetic, Wikipedia
article} to get a rough feeling about the complexity.
+Of course, if you are interested in the details, that Wikipedia article should
be a good starting point for further reading.
+
+@cindex IEEE 754 (floating point)
+The most common convention for storing floating point numbers in digital
storage is IEEE Standard for Floating-Point Arithmetic;
@url{https://en.wikipedia.org/wiki/IEEE_754, IEEE 754}.
+In short, the full width (in bits) assigned to that type (for example the 32
bits allocated for 32-bit floating point types) is divided into separate
components: The first bit is the ``sign'' (specifying if the number is negative
or positive).
+In 32-bit floats, the next 8 bits are the ``exponent'' and finally (again, in
32-bit floats), the ``fraction'' is stored in the next 23 bits.
+For example see
@url{https://commons.wikimedia.org/wiki/File:Float_example.svg, this image on
Wikipedia}.
+
+@cindex Decimal digits
+@cindex Precision of floats
+In IEEE 754, around zero, the base-2 and base-10 representations approximately
match.
+However, as we go away from 0, you will loose precision.
+The important concept in understanding the precision of floating point numbers
is ``decimal digits'', or the number of digits in the number, independent of
where the decimal point is.
+For example @mymath{1.23} has three decimal digits and
@mymath{4.5678\times10^9} has 5 decimal digits.
+According to IEEE
754@footnote{@url{https://en.wikipedia.org/wiki/IEEE_754#Basic_and_interchange_formats}},
32-bit and 64-bit floating point numbers can accurately (statistically)
represent a floating point with 7.22 and 15.95 decimal digits respectively.
+
+@cartouche
+@noindent
+@strong{Should I store my columns as 32-bit or 64-bit floating point type?} If
your floating point numbers have 7 decimal digits or less (for example noisy
image pixel values, measured star or galaxy magnitudes, and anything that is
derived from them like galaxy mass and etc), you can safely use 32-bit
precision (the statistical error on the measurements is usually significantly
larger than 7 digits!).
+However, some columns require more digits; thus 64-bit precision.
+For example, RA or Dec with more than one arcsecond accuracy: the degrees can
have 3 digits, and 1 arcsecond is @mymath{1/3600\sim0.0003} of a degree,
requiring 4 more digits).
+You can use the @ref{Numerical type conversion operators} of @ref{Column
arithmetic} to convert your columns to a certain type for storage.
+@end cartouche
+
+The discussion above was for the storage of floating point numbers.
+When printing floating point numbers in a human-friendly format (for example,
in a plain-text file or on standard output in the command-line), the computer
has to convert its internal base-2 representation to a base-10 representation.
+This second conversion may cause a small discrepancy between the stored and
printed values.
+
+@cartouche
+@noindent
+@strong{Use FITS tables as output of measurement programs:} When you are doing
a measurement to produce a catalog (for example with @ref{MakeCatalog}) set the
output to be a FITS table (for example @option{--output=mycatalog.fits}).
+A FITS binary table will store the same the base-2 number that was measured by
the CPU.
+However, if you choose to store the output table as a plain-text table, you
risk loosing information due to the human friendly base-10 floating point
conversion (which is necessary in a plain-text output).
+@end cartouche
+
+To customize how columns containing floating point values are printed (in a
plain-text output file, or in the standard output in your terminal), Table has
four options for the two different types: @option{--txtf32format},
@option{--txtf32precision}, @option{--txtf64format} and
@option{--txtf64precision}.
+They are fully described in @ref{Invoking asttable}.
+
+@cartouche
+@noindent
+@strong{Summary:} it is therefore recommended to always store your tables as
FITS (binary) tables.
+To view the contents of the table on the command-line or to feed it to a
program that doesn't recognize FITS tables, you can use the four options above
for a custom base-10 conversion that will not cause any loss of data.
+@end cartouche
+
+@node Column arithmetic, Operation precedence in Table, Printing floating
point numbers, Table
@subsection Column arithmetic
In many scenarios, you want to apply some kind of operation on the columns and
save them in another table or feed them into another program.
@@ -14449,7 +14518,7 @@ If any output file is explicitly requested (with
@option{--output}) the output t
When no output file is explicitly requested the output table will be written
to the standard output.
If the specified output is a FITS file, the type of FITS table (binary or
ASCII) will be determined from the @option{--tabletype} option.
If the output is not a FITS file, it will be printed as a plain text table
(with space characters between the columns).
-When the output is not binary (for example standard output or a plain-text),
the @option{--txtf32*} or @option{--txtf64*} options can be used for the
formatting of floating point columns.
+When the output is not binary (for example standard output or a plain-text),
the @option{--txtf32*} or @option{--txtf64*} options can be used for the
formatting of floating point columns (see @ref{Printing floating point
numbers}).
When the columns are accompanied by meta-data (like column name, units, or
comments), this information will also printed in the plain text file before the
table, as described in @ref{Gnuastro text table format}.
For the full list of options common to all Gnuastro programs please see
@ref{Common options}.
@@ -14858,39 +14927,65 @@ Finally, if you already have a FITS table by other
means (for example, by downlo
@item -f STR
@itemx --txtf32format=STR
-The plain-text format of 32-bit floating point columns when output is not
binary (this option is ignored for binary outputs like FITS tables).
+The plain-text format of 32-bit floating point columns when output is not
binary (this option is ignored for binary outputs like FITS tables, see
@ref{Printing floating point numbers}).
The acceptable values are listed below.
This is just the format of the plain-text outputs; see
@option{--txtf32precision} for customizing their precision.
@table @code
@item fixed
-Fixed-point notation (for example @code{1234567.89012})
+Fixed-point notation (for example @code{123.4567}).
@item exp
-Exponential notation (for example @code{1.2345689012e+06})
+Exponential notation (for example @code{1.234567e+02}).
@end table
+The default mode is @code{exp} since it is the most generic and will not cause
any loss of data.
+Be very cautious if you set it to @code{fixed}.
+As a rule of thumb, the fixed-point notation is only good if the numbers are
larger than 1.0, but not too large!
+Given that the total number of accurate decimal digits is fixed the more
digits you have on the left of the decimal point (integer part), the more
un-accurate digits will be printed on the right of the decimal point.
+
@item -p STR
@itemx --txtf32precision=INT
-Number of digits after the decimal point (precision) for columns with a 32-bit
floating point datatype.
+Number of digits after (to the right side of) the decimal point (precision)
for columns with a 32-bit floating point datatype (this option is ignored for
binary outputs like FITS tables, see @ref{Printing floating point numbers}).
This can take any positive integer (including 0).
When given a value of zero, the floating point number will be rounded to the
nearest integer.
+@cindex IEEE 754
+The default value to this option is 6.
+This is because according to IEEE 754, 32-bit floating point numbers can be
accurately presented to 7.22 decimal digits (see @ref{Printing floating point
numbers}).
+Since we only have an integer number of digits in a number, we'll round it to
7 decimal digits.
+Furthermore, the precision is only defined to the right side of the decimal
point.
+In exponential notation (default of @option{--txtf32format}), one decimal
digit will be printed on the left of the decimal point.
+So the default value to this option is @mymath{7-1=6}.
+
@item -A STR
@itemx --txtf64format=STR
-The plain-text format of 64-bit floating point columns when output is not
binary (this option is ignored for binary outputs like FITS tables).
+The plain-text format of 64-bit floating point columns when output is not
binary (this option is ignored for binary outputs like FITS tables, see
@ref{Printing floating point numbers}).
The acceptable values are listed below.
This is just the format of the plain-text outputs; see
@option{--txtf64precision} for customizing their precision.
@table @code
@item fixed
-Fixed-point notation (for example @code{1234567.89012})
+Fixed-point notation (for example @code{12345.6789012345}).
@item exp
-Exponential notation (for example @code{1.2345689012e+06})
+Exponential notation (for example @code{1.23456789012345e4}).
@end table
+The default mode is @code{exp} since it is the most generic and will not cause
any loss of data.
+Be very cautious if you set it to @code{fixed}.
+As a rule of thumb, the fixed-point notation is only good if the numbers are
larger than 1.0, but not too large!
+Given that the total number of accurate decimal digits is fixed the more
digits you have on the left of the decimal point (integer part), the more
un-accurate digits will be printed on the right of the decimal point.
+
@item -B STR
@itemx --txtf64precision=INT
-Number of digits after the decimal point (precision) for columns with a 64-bit
floating point datatype.
+Number of digits after the decimal point (precision) for columns with a 64-bit
floating point datatype (this option is ignored for binary outputs like FITS
tables, see @ref{Printing floating point numbers}).
This can take any positive integer (including 0).
When given a value of zero, the floating point number will be rounded to the
nearest integer.
+
+@cindex IEEE 754
+The default value to this option is 15.
+This is because according to IEEE 754, 64-bit floating point numbers can be
accurately presented to 15.95 decimal digits (see @ref{Printing floating point
numbers}).
+Since we only have an integer number of digits in a number, we'll round it to
16 decimal digits.
+Furthermore, the precision is only defined to the right side of the decimal
point.
+In exponential notation (default of @option{--txtf64format}), one decimal
digit will be printed on the left of the decimal point.
+So the default value to this option is @mymath{16-1=15}.
@end table
@@ -16343,7 +16438,7 @@ Reading NaN as a floating point number in Gnuastro is
not case-sensitive.
* Elliptical shape operators:: Operations that are focused on an ellipse.
* Loading external columns:: Read a column from a table into the stack.
* Size and position operators:: Extracting image size and pixel positions.
-* Building new dataset and stack management:: How to construct an empty
dataset from scratch.
+* Building new dataset and stack management:: How to construct an empty
dataset from scratch.
* Operand storage in memory or a file:: Tools for complex operations in one
command.
@end menu
diff --git a/lib/gnuastro/table.h b/lib/gnuastro/table.h
index f2193a3a..e9fc39d4 100644
--- a/lib/gnuastro/table.h
+++ b/lib/gnuastro/table.h
@@ -65,7 +65,7 @@ __BEGIN_C_DECLS /* From C++ preparations */
#define GAL_TABLE_DEF_PRECISION_INT 0
#define GAL_TABLE_DEF_PRECISION_FLT 6
-#define GAL_TABLE_DEF_PRECISION_DBL 14
+#define GAL_TABLE_DEF_PRECISION_DBL 15
diff --git a/lib/type.c b/lib/type.c
index ad58d107..21ac211a 100644
--- a/lib/type.c
+++ b/lib/type.c
@@ -393,8 +393,6 @@ gal_type_bit_string(void *in, size_t size)
char *
gal_type_to_string(void *ptr, uint8_t type, int quote_if_str_has_space)
{
- float *f=ptr;
- double *d=ptr;
char *c, *str=NULL;
switch(type)
{
@@ -434,17 +432,10 @@ gal_type_to_string(void *ptr, uint8_t type, int
quote_if_str_has_space)
statisically significant digits in some scenarios and its result is
generally not easily predictable: can be fixed-point or exponential
depending on printed length! But the printed length of a number can
- hide statisical significance. Therefore, the most conservative
- format is '%Nf' (where 'N' is the number of digits after the decimal
- point; depending on the precision of the type). But in this case,
- when the number is smaller than 1.0, '%.Nf' will loose precision!
- For example '0.0001234567' will be printed as '0.000123'! Thefore
- when the value is less than 1.0, we should use %Ne (with the same
- number of digits after the decimal point). */
- case GAL_TYPE_FLOAT32:
- TO_STRING( float, *f<1.0 ? "%.6e" : "%.6f" ); break;
- case GAL_TYPE_FLOAT64:
- TO_STRING( double, *d<1.0f ? "%.14e" : "%.14f"); break;
+ hide statisical significance. See the discussion in
+ 'bin/table/asttable.conf' for the values used here. */
+ case GAL_TYPE_FLOAT32: TO_STRING( float, "%.6e" ); break;
+ case GAL_TYPE_FLOAT64: TO_STRING( double, "%.14e"); break;
/* Unknown type! */
default:
@@ -578,8 +569,8 @@ void *
gal_type_string_to_number(char *string, uint8_t *type)
{
long int l;
+ size_t digits;
void *ptr, *out;
- int fnz=-1, lnz=0; /* 'F'irst (or 'L'ast) 'N'on-'Z'ero. */
uint8_t forcedfloat=0;
char *c, *tailptr, *cp;
@@ -673,38 +664,46 @@ gal_type_string_to_number(char *string, uint8_t *type)
}
else
{
- /* The maximum number of decimal digits to store in float or double
- precision floating point are:
-
- float: 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22
- double: 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95
-
- FLT_DIG (at least 6 in ISO C) keeps the number of digits (not zero
- before or after) that can be represented by a single precision
- floating point number. If there are more digits, then we should
- store the value as a double precision.
-
- Note that the number can have non-digit characters that we don't
- want, like: '.', 'e', 'E', ','. */
+ /* Start counting the number of digits from the start of the string
+ (while ignoring any '0's at the start) */
+ digits=0;
for(cp=string;*cp!='\0';++cp)
- if(isdigit(*cp) && *cp!='0' && fnz==-1)
- fnz=cp-string;
+ {
+ if(isdigit(*cp))
+ { if(!(digits==0 && *cp=='0')) ++digits; }
+ if(*cp=='e') break;
+ }
- /* In the previous loop, we went to the end of the string, so 'cp'
- now points to its '\0'. We just have to iterate backwards! */
+ /* In the previous loop, we went to the end of the string (or the 'e'
+ character in an exponential), so 'cp' now points to its end. We
+ just have to iterate backwards and stop when we hit a non-zero
+ character */
for(;cp!=string;--cp)
- if(isdigit(*cp) && *cp!='0')
+ if(isdigit(*cp))
{
- lnz=cp-string;
- break;
+ if(*cp=='0') --digits;
+ else break;
}
/* Calculate the number of decimal digits and decide if it the number
- should be a float or a double. */
- if( lnz-fnz > FLT_DIG || fabs(d)>FLT_MAX || fabs(d)<FLT_MIN )
+ should be a float or a double.
+
+ The maximum number of decimal digits to store 32-bit floating
+ point is 7.22 (see "Printing floating point numbers" section of
+ the book). We will round this to 7 to be on the safe side. If the
+ given number has more than 7 decimal digits, or is outside the
+ range of possible values for a 32-bit float, it should be saved as
+ a 64-bit float. */
+ if( digits > 7 || fabs(d)>FLT_MAX || fabs(d)<FLT_MIN )
{ ptr=&d; *type=GAL_TYPE_FLOAT64; }
else
{ f=d; ptr=&f; *type=GAL_TYPE_FLOAT32; }
+
+ /* For a check:
+ printf("%s:%s: %zu %s\n", __func__, string, digits,
+ gal_type_name(*type, 1));
+ printf("%s: GOOD\n", __func__); exit(0);
+ */
}
/* Allocate a one-element dataset, then copy the number into it. */
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [gnuastro-commits] master f4bb1ff5: Library (type.h/table.h): default printing of floats in exp mode,
Mohammad Akhlaghi <=