Re: NaN-toolbox much faster now

octave-maintainers

From:	Jaroslav Hajek
Subject:	Re: NaN-toolbox much faster now
Date:	Tue, 17 Mar 2009 19:08:44 +0100
On Tue, Mar 17, 2009 at 4:46 PM, Alois Schlögl <address@hidden> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Jaroslav Hajek wrote:
>> On Tue, Mar 17, 2009 at 9:25 AM, Alois Schlögl <address@hidden> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Jaroslav Hajek wrote:
>>>> On Mon, Mar 16, 2009 at 1:13 PM, Alois Schlögl <address@hidden> wrote:
>>>>
>>>>> Thanks for confirming the test. You asked for "opinions about skipping
>>>>> NaN's, and the Octave NA's support". Here are some thoughts on that issue.
>>>>>
>>>>> Concerning the question whether NaN's and NA should be handled separately.
>>>>> - - Just because R has NA's is not necessarily a good reason why Octave
>>>>> needs it, too. Possible advantages need to be explained.
>>>>>
>>>>> - - In statistical and probabilistic applications skipping both, NaN and
>>>>> NA, is a reasonable approach - there is no need to distinguish NaN from 
>>>>> NA.
>>>>>
>>>> In fact this seems to be what R actually does. It seems that in R, the
>>>> classification of NA/NaN is exactly the converse of Octave's, i.e.
>>>> isna(NaN) is true, while isnan(NA) is not, and that when you tell a
>>>> function to "skip NAs" (na.rm = TRUE), it indeed skips both NaNs and
>>>> NAs. So the question I'm raising here is: is Octave's support of NAs
>>>> actually a good idea, given that it is, in a good sense, actually
>>>> incompatible with R? Of course, "fixing" isnan to work like in R would
>>>> on the other hand break compatibility with C and Matlab and
>>>> everything.
>>>>
>>>>
>>>>> - - In case NaN's are used for error handling, the question is how is NA
>>>>> improving the error handling? The main advantage would be that less
>>>>> NaN's need to be handled, but NA come with additional costs of added
>>>>> complexity and possible confusion (causing more programming errors, slow
>>>>> down of development speed, as well as performance loss). Therefore, if
>>>>> NA's should get special support, the benefits of this concept should be
>>>>> made clear.
>>>>>
>>>>> - - The benefits of the NaN-toolbox over the traditional approach are:
>>>>> (i) functions do the right thing more often,
>>>>> (ii) applications are less likely to fail due to NaN-related issues,
>>>>> (iii) it's more likely that users unaware of the NaN issue get it right
>>>>> in the first place,
>>>>> (iv) no need to think about whether nanmean or mean is the right function;
>>>>> (v) of course always using nanmean(), etc. would also do, but it's nicer
>>>>> to write only mean(), etc.;
>>>>> Basically, the idea is to make the use of these functions easier. The
>>>>> use of NA in addition to NaN's is detrimental to this aim. So the
>>>>> advantage of using NA's is not clear.
>>>>>
>>>> These are points that we've discussed previously. I mostly agree
>>>> with them, unless functions are used in a non-statistical sense -
>>>> and I can only imagine that for "mean". After all, they're classified
>>>> as "statistics", so one could agree that "mean" should be understood
>>>> to be the statistical mean.
>>> That's how I see it, too.
>>>
>>>> Performance is another consideration. It seems that penalty for
>>>> removing NaNs ranges up to some 30%, which may be significant for some
>>>> uses. So maybe the functions should provide an option to turn off the
>>>> checking for NaNs, just for the case when data are guaranteed to be
>>>> NaN-free.
>>>>
>>>> cheers
>>>>
>>>
>>> Ok, in order to address that request, I've added the function
>>> flag_implicit_skip_nan.m. If you call flag_implicit_skip_nan(0), NaN's
>>> are not skipped anymore, and the traditional behavior is reproduced.
>>> This will affect all functions that are based on sumskipnan.m
>>> flag_implicit_skip_nan(1) will again turn on the NaN-skipping behavior.
>>>
>>
>> I think there's been a general agreement among most contributors that
>> global flags are "considered harmful".
>
> I guess this comes from the old days (around Octave 2.0 and 2.1) when
> far too many global flags were used much too often.
>
> However, a general ban on the use of global flags is going too far, IMHO.
>
> OTOH, if global variables within functions are really "harmful", it
> would be only consistent to enforce this rule by preventing the
> use of global variables within functions. ;-)
>
>
>> Using them from command line is fine, but in functions they become a
>> burden because you need to preserve their status, like this:
>>
>> old_flag_implicit_skip_nan = flag_implicit_skip_nan;
>> flag_implicit_skip_nan (1)
>>
>> unwind_protect
>>   ... function body ...
>> unwind_protect_cleanup
>>   flag_implicit_skip_nan (old_flag_implicit_skip_nan);
>> end_unwind_protect
>>
>> So far the trend in Octave was to eliminate global flags rather than
>> introduce new ones. Why not just use options?
>> Like std(x,0,2, "skipnans", false)
>>
>
> Actually, there is only a problem if the flag is modified within the
> unwind_protect part. If some "specialist" changes the default
> behavior within an unwind_protect part, (s)he should also be able to
> call flag_implicit_skip_nan(1) in the unwind_protect_cleanup part.
>
> The use of flag_implicit_skip_nan, or fun(...,"skipnans", false) is
> detrimental to clean programming and should really remain a thing for
> "specialists". Neither do I see a need for nor do I endorse a frequent
> use of this feature. Perhaps, it becomes obsolete, again.
>
> The present solution is simple and elegant because the flag controls
> only the behavior of some core functions like sumskipnan. All other
> functions like mean, var, std, meansq, etc. "inherit" this behaviour
> without changing the source code. The alternative of adding more input
> arguments would bloat the functions with checks on the input arguments.
>
> Although the present solution is sufficient for the task, it's only the
> second best solution. The best solution would be if the data has some
> attribute flag indicating whether it contains NaN's or not. (The
> attribute should be generated when the data is generated). Then, an
> automated approach for selecting the algorithm could be used, and
> flag_implicit_skip_nan() and fun(...., "skipnans", false) would become
> obsolete. Would not this be a nice idea?
>

Yes, at first sight it would be nice - it's not the first time
something like this has been proposed. The problem is that you'd need
to clutter all data copying code with checking for NaNs, which would
really be too much of a burden, especially when using things like
std::copy. Actually this is somewhat similar to what is done for
integer types - they raise global flags on operation or conversion
overflows. And this comes at the cost of encapsulating them in a
special class, which brings more problems... I don't see this as a
viable option for real types. A pessimistic NaN-flagging scheme could
also be considered, but the question then is whether it would actually
bring any good.
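
To make the pessimistic variant concrete, here is a minimal C++ sketch.
It is only one possible reading of the idea, not code from Octave or the
NaN-toolbox: the class FlaggedArray and the functions raw_write, rescan
and sum_skip_nan are hypothetical names invented for this illustration.
The flag means "may contain NaN"; any raw write access sets it, and only
an explicit scan can clear it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container (not an Octave class): values plus a
// conservative "may contain NaN" flag.
class FlaggedArray {
public:
  // Zero-initialized data is known to be NaN-free.
  explicit FlaggedArray(std::size_t n) : data_(n, 0.0), may_have_nan_(false) {}

  const double* begin() const { return data_.data(); }
  const double* end() const { return data_.data() + data_.size(); }

  // Any raw write access makes the flag pessimistic again, so bulk
  // writers such as std::copy need no per-element NaN check.
  double* raw_write() {
    may_have_nan_ = true;
    return data_.data();
  }

  // The only way to clear the flag: one explicit O(n) scan.
  void rescan() {
    may_have_nan_ = std::any_of(begin(), end(),
                                [](double v) { return std::isnan(v); });
  }

  bool may_have_nan() const { return may_have_nan_; }

private:
  std::vector<double> data_;
  bool may_have_nan_;
};

// Sum that pays for the NaN test only when the flag says NaNs are possible.
inline double sum_skip_nan(const FlaggedArray& a) {
  double s = 0.0;
  if (!a.may_have_nan()) {
    for (const double* p = a.begin(); p != a.end(); ++p)
      s += *p;                       // fast path: data certified NaN-free
  } else {
    for (const double* p = a.begin(); p != a.end(); ++p)
      if (!std::isnan(*p)) s += *p;  // checked path: skip NaNs
  }
  return s;
}
```

In this sketch the copying code stays clean, but the fast path is reached
only after someone has paid for rescan(), which costs about as much as
the check it is meant to avoid. That is the doubt raised above.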

regards

-- 
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz