emacs-devel
Re: Rationale for split-string?


From: Stephen J. Turnbull
Subject: Re: Rationale for split-string?
Date: Tue, 22 Apr 2003 22:19:31 +0900
User-agent: Gnus/5.090016 (Oort Gnus v0.16) XEmacs/21.5 (cabbage)

>>>>> "Luc" == Luc Teirlinck <address@hidden> writes:

    Luc> Miles Bader wrote:

    mb>    I think Stephen's formulation is very natural, in that you
    mb> usually want OMIT-NULLS to be t if you're splitting on a
    mb> non-whitespace string.

Miles, here you meant OMIT-NULLS to be nil, right?

I think Miles's proposal to default the one-argument form of
`split-string' to GNU behavior and have the two-argument form as
XEmacs's, with the three-argument form for precise control, is a good
compromise.  Add

    (defconst split-string-default-separators "[ \f\t\n\r\v]+"
      "The default value of separators for `split-string'.

    A regexp matching strings of whitespace.  May be locale-dependent
    \(as yet unimplemented).  Should not match non-breaking spaces.")

and the current XEmacs behavior is very naturally available with

    (split-string string split-string-default-separators)

(although the fact that that means something different from
`(split-string string)' is definitely a wart).
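
To make the three forms concrete, here is a minimal sketch of how they
would differ on one input.  It assumes an Emacs whose `split-string'
already implements the compromise (the one-argument form drops empty
strings, the two-argument form keeps them, and an optional third
argument OMIT-NULLS decides explicitly) and in which
`split-string-default-separators' is defined as a regexp matching runs
of whitespace.  The helper name `sketch-three-forms' is hypothetical.

```elisp
;; Sketch only: assumes `split-string' accepts an optional third
;; argument OMIT-NULLS and that `split-string-default-separators'
;; is a regexp matching runs of whitespace.
(defun sketch-three-forms (string)
  "Return the results of the three calling forms on STRING, as a list."
  (list
   ;; One argument: split on whitespace, drop empty strings.
   (split-string string)
   ;; Two arguments: same separators, but empty strings are kept.
   (split-string string split-string-default-separators)
   ;; Three arguments: the caller states the choice explicitly.
   (split-string string split-string-default-separators t)))

(sketch-three-forms "  two words ")
;; => (("two" "words") ("" "two" "words" "") ("two" "words"))
```

The first and second results differ even though the separators are the
same, which is the wart mentioned above.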

------------------------------------------------------------------------
Back to our regularly scheduled controversy on principles:

    Luc> First of all, I am not worried about Stephen's formulation
    Luc> being unnatural (although the original formulation actually
    Luc> would produce unnatural results in the default case), but
    Luc> about it breaking existing code.

GNU Emacs made the change (cf. cvs diff -r EMACS_20_2 -r EMACS_20_4
subr.el) without worrying sufficiently about breaking existing code
(see Stefan Reichör's post here <address@hidden>, or run
XEmacs's regression test suite on XEmacs 21.5).  I don't see why that
should be a barrier to reverting to the old, regular, behavior now.

Further, as far as GNU Emacs itself goes, I see your theory and raise
you a full-tree patch.  I volunteer to revise the code and fix the
callers in all GNU Emacs code distributed on the mainline.  (I've
already requested papers from rms.)

Sure, we can't guarantee that third party code won't get broken, but
Jerry James has anted an audit of all XEmacs code including the
packages, a significant fraction of 3rd party Emacs Lisp code.
Nothing there will break, although once we get this settled, many
packages can have their local versions of `split-string' either thrown
out or turned into trivial defsubsts around the core version.  Want to
match Jerry's effort with some facts here?  Find us some callers,
we'll send patches to their maintainers.

    Luc> I believe you are underestimating the level of generality of
    Luc> split-string and the wild heterogeneity of its applications.

Et tu, Luc.  You don't imagine using split-string to parse Makefiles
or Python code[1], to detect trailing whitespace (perhaps generated by
older auto-fill implementations to mark sentence breaks) that violates
coding standards, etc.  (Not surprising, since GNU Emacs 21.x can't do
those things using `split-string'.)
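
As a rough illustration of the trailing-whitespace case: when the
two-argument form keeps empty strings, a line that ends in a separator
yields a final "" piece, and that is enough to detect it.  The sketch
below assumes that null-keeping behavior; the name
`sketch-trailing-whitespace-p' and the separator regexp "[ \t]+" are
choices made for this example only.

```elisp
;; Sketch only: assumes the two-argument form of `split-string'
;; keeps leading and trailing empty strings.
(defun sketch-trailing-whitespace-p (line)
  "Return non-nil if LINE ends in spaces or tabs."
  (let ((pieces (split-string line "[ \t]+")))
    ;; A trailing separator leaves "" as the last piece.  The `cdr'
    ;; check keeps an empty LINE, which splits to (""), from matching.
    (and (cdr pieces)
         (string= (car (last pieces)) ""))))

(sketch-trailing-whitespace-p "foo bar  ")   ; => t
(sketch-trailing-whitespace-p "foo bar")     ; => nil
```

With a `split-string' that silently drops the trailing "", both calls
would return nil and the information would be lost.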

Since generality and heterogeneity are much better served by simple
regular interfaces, what you are really arguing is quite the opposite:
that there's only one important application (splitting into tokens
separated by non-significant whitespace).  And you want the
`split-string' API optimized for that and very similar applications by
default, even though that means that `split-string's non-default
behavior looks totally schizophrenic by comparison.  A lot of people
agree with you (including rms AFAICT), but others don't.  Many XEmacs
people disagree strongly.  (They prefer regularity.)

    Luc> It is by no means whatsoever true that except in the
    Luc> whitespace case you would want to keep all null matches.  If
    Luc> SEPARATORS is a "terminator character", say newline,

Note that Miles's proposal would actually give the behavior you want
in `(split-string string "\n")'.  (Admittedly, you'd like
`(split-string string "\n" 'end)' even better.)  Point for Miles!

But you are exactly right: sometimes one wants it one way, and
sometimes the other.  It is this _irreconcilable_ difference that
leads me to strongly prefer separate APIs, one which imposes
stream-of-token semantics, and one which merely splits strings.  I
think `split-string' is a more natural name for the latter.
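
A minimal sketch of what the two separate APIs could look like, built
on a three-argument `split-string' (STRING SEPARATORS OMIT-NULLS),
which is assumed here.  Both function names are hypothetical and are
not proposed names; the point is only that each entry point has one
fixed, predictable policy on empty strings.

```elisp
;; Sketch only: assumes `split-string' takes (STRING SEPARATORS OMIT-NULLS).
(defun sketch-split-tokens (string &optional separators)
  "Stream-of-tokens semantics: never return empty strings.
SEPARATORS defaults to runs of whitespace."
  (split-string string (or separators "[ \f\t\n\r\v]+") t))

(defun sketch-split-fields (string separators)
  "Plain splitting: every match of SEPARATORS delimits a field.
Empty fields, including leading and trailing ones, are kept."
  (split-string string separators nil))

(sketch-split-tokens "  foo  bar ")     ; => ("foo" "bar")
(sketch-split-fields "root:x::0" ":")   ; => ("root" "x" "" "0")
```

In the second call the empty third field is significant data, which is
why a single default cannot serve both kinds of caller.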

    Luc> The "however" is that we are not defining a *new* function
    Luc> but *re*defining an *existing* function, an often used and
    Luc> extremely general existing function.  That is all but
    Luc> guaranteed to produce a wild variety of bugs.

Please consider the history of the change.  You're inaccurate on all
counts.  We propose _reverting_ what is already a redefinition.
Because the redefined function is _less general_ than the original,
it's _used less often_ than it could be.  (Jerry James's audit of
XEmacs and package code demonstrates this.)  And it won't "produce"
bugs, it will _exchange_ a set of existing bugs, which everybody
agrees need to be fixed, for a new set of unknown bugs (which is likely
to be small everywhere except in code very specific to GNU Emacs 21).

So the question basically boils down to whether it makes sense to have
a regular, easily understood definition with exceptions restricted to
a few very clear cases with consensus support, or to aggressively make
"plausible" exceptions.  The last time GNU Emacs did the latter with
this function, it clearly screwed up.

    Luc> In fact let us assume, for the sake of argument, that Stephen
    Luc> and you are 100% right.  That would mean that any correct
    Luc> existing code, using the present Emacs split-string with a
    Luc> non-nil SEPARATORS, checks for empty matches at the beginning
    Luc> and end and adds any such matches to the split-string output
    Luc> to correct the "bug" in the present split-string.  After
    Luc> Stephen's change, any empty match at the beginning and end of
    Luc> the string will produce not one, but two empty strings.

That's silly; what anybody sane would do in the face of GNU Emacs's
demonstrated willingness to change semantics of such a fundamental
function is to copy the old definition into their own code.  It would
probably be shorter, and surely simpler and faster, than the gross
hack you propose.
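
For readers who want to see the scenario Luc describes, here is a
sketch of such a compensating caller.  It is hypothetical (no real
package is being quoted), and it is run against a `split-string' whose
two-argument form keeps empty strings, which is the behavior under
discussion.

```elisp
;; Sketch only: a hypothetical caller written for a `split-string'
;; that drops a leading empty string, so it adds one back by hand.
(defun sketch-compensating-split (string separator)
  "Split STRING on SEPARATOR, re-adding a leading \"\" when STRING starts with it."
  (let ((pieces (split-string string separator)))
    ;; "\\`" anchors the match at the very start of STRING.
    (if (string-match (concat "\\`\\(?:" separator "\\)") string)
        (cons "" pieces)
      pieces)))

;; Against a null-keeping `split-string' the workaround doubles up:
(sketch-compensating-split ":a:b" ":")   ; => ("" "" "a" "b")
```

The wrapper is already longer than a straightforward private copy of
the splitting loop would need to be.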


Footnotes: 
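[1] (defun python-parse-indentation (line)
      "Return (DEPTH . PIECES): indentation depth of LINE and its remaining pieces."
      ;; `python-single-indentation' is not defined here; it is assumed to be
      ;; a regexp matching one indentation unit.  Relies on `split-string'
      ;; keeping the leading empty strings.
      (let ((i 0)
            (line (split-string line python-single-indentation)))
        ;; Each leading empty string stands for one indentation unit.
        (while (string= (car line) "")
          (setq i (1+ i))
          (setq line (cdr line)))
        (cons i line)))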

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
