Re: "Font-lock is limited to text matching" is a myth


From: Steve Yegge
Subject: Re: "Font-lock is limited to text matching" is a myth
Date: Mon, 10 Aug 2009 23:47:37 -0700

Hello all,

Thanks for opening this can of, er, threads.  I was going to ask about
these things myself soon in any case, because it's clear that js2-mode
is not doing a very effective job of surfacing its rich information in
Emacs.  This is partly my fault, but it is also partly due to some
issues with font-lock that I'll describe in nauseating detail.

There are several important ideas being conflated in this thread that
I think need to be teased apart before we can talk responsibly about
any of them.  I've called out the top five conflations in sections
below delimited by roman numerals.

This is all in some sense an elaboration of what Eric Ludlam just
posted, to which I can only add my miserable +1.

Stephen Eilert wrote:
> I do not think that was done without a very good reason (and there's
  a lengthy post explaining it), unless the author is a complete
  masochist.

I don't think of myself that way.  Here, as requested, is a lengthy post
explaining my approach.  For the record, it could have been much
lengthier, and I have lengthy replies ready for all your objections and
concerns.  (Just in case you were wondering.)

I really do want to get this resolved, though.

I. Asynchronous parsing

js2-mode performs both syntactic and (some) semantic analysis.  It
knows, for instance, when you're using a symbol that's not defined in
its file.  js2-mode does not currently understand project structure,
but I'm doing some work in this area, and it may at some point gather
semantic information collected from several files.

Because this analysis requires parsing the entire file at least once
(see my discussion of partial/incremental parsing below), and it may
someday involve looking at symbol tables from other files, it seemed
best to run the parse asynchronously, so as not to interfere with the
user's editing.

One byproduct of having an accurate parser and symbol table is that
you can obtain style runs with relatively small effort, so js2-mode
does its own highlighting.  The downside is that this highlighting
information is unavailable at font-lock time, and it is not available
piecewise -- it's all-or-nothing.
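
For readers unfamiliar with the pattern, background parsing in Emacs is
commonly arranged with an idle timer that is reset on every buffer
change.  The following is only a minimal sketch of that idea.  All of
the `my-' names and the 0.5 second delay are illustrative assumptions,
not js2-mode's actual code.

```elisp
(defvar-local my-parse-timer nil
  "Pending idle timer for the next reparse, or nil.")

(defvar-local my-parse-count 0
  "Number of reparses so far (a stand-in for real parse results).")

(defun my-reparse (buffer)
  "Reparse BUFFER.  A real mode would build its AST and style runs here."
  (when (buffer-live-p buffer)
    (with-current-buffer buffer
      (setq my-parse-timer nil)
      (setq my-parse-count (1+ my-parse-count)))))

(defun my-schedule-reparse (&rest _)
  "Cancel any pending reparse and schedule a new one for idle time.
Accepts and ignores the arguments passed to `after-change-functions'."
  (when (timerp my-parse-timer)
    (cancel-timer my-parse-timer))
  (setq my-parse-timer
        (run-with-idle-timer 0.5 nil #'my-reparse (current-buffer))))
```

A mode would typically hook this up per buffer with
(add-hook 'after-change-functions #'my-schedule-reparse nil t).
Because the results only exist after the timer fires, nothing is
available yet at the moment font-lock asks for a region.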

There is a relatively simple alternative that might appease Daniel:
I could have js2-mode simply not do any highlighting by default,
except for errors and warnings.  We'd use whatever highlighting is
provided by espresso-mode, and users would be able to choose between
espresso-highlighting and js2-mode highlighting.  With the former,
they'd get "instantaneous" font-locking, albeit not as rich as what
js2-mode can provide.

This would be trivial to change.  I am actively maintaining js2-mode,
and the only reason I haven't checked in any changes since my initial
commit to the trunk is inexperience:  I'm trying to get a handle on how
many changes people tend to aggregate before checking in a change to
any given mode.  But I have several fixes (including some patches
contributed from users) that are ready to commit, and more on the way.

Errors and warnings would still need to be asynchronous (if they're
enabled).  So, too, would the imenu outline and my in-progress
buffer-based outline, which is somewhat nicer than the imenu one.

But I think the main objection to js2-mode revolves around its
highlighting, correct?  If so, AND if we can solve the font-lock
integration issues, AND if we can fix the multi-mode issues (II
below), then I'm hopeful that js2-mode might become a reasonable
choice as the default editing mode for JavaScript.

I think espresso-mode is a fine fallback position.  Anything but
java-mode!  The default today is java-mode, and I had no qualms about
replacing it as the default for JavaScript.

Note: diagnostic messages in js2-mode are highlighted using overlays.
I tried using overlays for all highlighting but it was unacceptably
slow and had a tendency to crash Emacs.  But there are usually not
prohibitively many errors and warnings, since the error-recovery
algorithm is somewhat coarse-grained.  So error-reporting works
independently of font-lock.
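
As an illustration of overlay-based diagnostics, here is a small
sketch.  The function names and the `my-diagnostic' overlay property
are hypothetical and chosen for the example.  js2-mode's real
implementation may differ.

```elisp
(defun my-mark-diagnostic (beg end message)
  "Highlight BEG..END as a problem and attach MESSAGE as a tooltip.
Return the new overlay."
  (let ((ov (make-overlay beg end)))
    (overlay-put ov 'face 'font-lock-warning-face)
    (overlay-put ov 'help-echo message)
    ;; Tag the overlay so that only our own overlays are removed later.
    (overlay-put ov 'my-diagnostic t)
    ov))

(defun my-clear-diagnostics ()
  "Remove every overlay created by `my-mark-diagnostic'."
  (remove-overlays (point-min) (point-max) 'my-diagnostic t))
```

Since overlays are separate from text properties, font-lock refontifying
the buffer does not disturb them.  This is why error reporting can stay
independent of font-lock.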

II. Multi-mode support

JavaScript is especially needful of mumamo (or equivalent) multi-mode
support, because much of the JavaScript in the wild is embedded in
HTML, in template files, even in strings in other languages.

js2-mode does not support mumamo (or mmm-mode, with which I am
currently more familiar) because its lexer would first need to support
ignoring parts of the buffer.  I do not think this would be very
hard to implement, but I have not done it yet.

If I don't get to it before the next version of Emacs launches, then I
think this should effectively disqualify js2-mode from being the
default JavaScript mode.  It would be an inconsistent user experience
to have one JavaScript mode in .js files and another mode for
JavaScript inside multi-mode-enabled files.

I'm ready to give it a try, though, and I'll ping Lennart offline about
integrating the two somehow.

III. Incremental and partial parsing

Lennart and others have asked whether it is possible for js2-mode to
support partial or incremental parsing.  The short answer is
"incremental: yes; partial: no".

nxml-mode, last I checked, does incremental parsing.  It parses ahead
in the buffer, but then stops and saves its state.  If you jump forward
in the buffer, it resumes and continues the parse until some point
beyond the section you're viewing.

js2-mode could do it this way without much additional effort.  I chose
not to because once you've decided to use background parsing, it
doesn't seem like an especially useful optimization.  But I could see
it being helpful in some cases, such as when you're editing near the
top of a large file -- as long as the whole file isn't encased in some
top-level expression, which unfortunately is often the case in JS.

Partial parsing is a different beast entirely.  The goal of a partial
parser is to re-parse the minimum amount necessary, given some region
that has changed.  I've dug into this a bit, because originally I
wanted to support it in js2-mode.  I even made some progress on an
implementation.

While a few production parsers (for Java and JavaScript) have
implemented partial parsing, the vast majority of them do not support
it -- instead, they re-parse from the top.  They do this because the
incremental benefit of partial parsing is debatable, assuming you're
time- and resource-constrained, as most of us are.

I took a close look at Eclipse and IntelliJ, and even asked some
of their users to characterize the highlighting behavior of the IDE.
Without exception, the IDE users had internalized a ~1000 ms delay
in highlighting and error reporting as part of their normal workflow,
and they uniformly described it as "instant{aneous}" until I made
them time it.

I've been an Emacs user for 20+ years now, and like many I found
the idea of a parsing delay to be somewhere between "undesirable"
and "sickening".  But the majority of programmers today have
apparently learned not to notice delays of ~1sec as long as it
never interferes with their typing or indentation (see IV below).

So after looking at my ~8000 lines of elisp devoted to parsing
JavaScript, I weighed it and decided not to support partial parsing.
It's certainly possible to support it, but I think my time would be
better spent on things that average users are more likely to notice.

YMMV, of course.

The upshot is that if I'm going to support mumamo, it will need
to work within js2-mode's existing full-reparse framework.  I can
think of various ways to make it work, though, and as I mentioned
I'll talk to Lennart about it.

IV.  Indentation

The indentation in js2-mode is broken.  I'll be the first to say it.

It is based on the indentation in Karl Landström's mode, which does a
better job for JavaScript than any indenter based on cc-engine, but
that doesn't mean it's a good job.  And it's essentially unconfigurable.

espresso-mode shares this problem, which means that for this
important use case it is not an improvement over js2-mode.

Daniel's objections to js2-mode's non-interaction with font-lock
apply equally to the non-interaction with cc-engine's indentation
configuration system.  The indent configuration for JavaScript should
share as many settings as practical with cc-mode.

I actually made a serious attempt to generate the `c-style-alist'
data structure for js2-mode using the parse tree, but ran into three
issues:

  1) it's much harder than I thought it would be, even with a full
     parse tree available.  I had some 2000 lines of elisp invested
     in it when I pooped out, to be perfectly frank.

  2) `c-style-alist' (like font-lock) does not have enough semantic
      variables to encompass the range of indentation contexts that
      JavaScript programmers care about.  I think we'd need to add
      5-10 more, although it's been 18 months since I looked into it.

  3) indentation in "normal" Emacs modes also runs synchronously as
     the user types.  Waiting 500-800 msec or more for the parse to
     finish is (I think) not acceptable for indentation.  For small
     files the parse time is acceptable, but it would not be generally
     scalable.

#3 is the reason I gave up on #1.  It didn't seem to be worth the
effort to produce an accurate but slow indenter.

I don't know exactly how to solve this problem.  I have lots of
ideas, but it appears there is little low-hanging fruit in this space.

V. Font Lock framework design problems

There seems to be a common misconception flitting about to the
effect that font-lock is perfect and will never need to change.

This is a somewhat paradoxical viewpoint in view of the corpses
littering the path to jit-lock, which include font-lock, fast-lock,
lazy-lock, and vapor-lock.  Each decade we've had a cadre of people
claiming that *-lock meets everyone's needs, and then it gets rewritten
anyway.

So it's hard to understand how it remains such a popular viewpoint.

I'll make yet another attempt to dispel it, since once we're past the
emotional stumbling blocks, font-lock may be able to evolve again.

Va) Inadequate/insufficient style names

There are not enough font-lock faces to represent all the semantic
style runs that are identifiable to "real" language analyzers.
js2-mode makes several semantic distinctions not available in most
Emacs modes, although such distinctions are available in JDEE and
other Cedet-enabled modes, so js2-mode is by no means alone in its
needs.

In addition to the autoloaded font-lock faces, which js2-mode uses
whenever possible, js2-mode defines several new faces, including:

  * function parameters
  * "class" instance members (in JS, prototype and instance props)
  * local variables
  * undeclared variables
  * private members (although I implemented it poorly -- see below)
  * html/xml tags, attr names and delimiters -- used both for html
    in jsdoc comments and for E4X literals
  * doc tags such as those typically found in javadoc/jsdoc comments
  * warnings, errors, and informational diagnostics

I do not expect that this set is all-inclusive -- over time as js2-mode
and similar modes get smarter, they will be able to make other
semantic distinctions that users may wish to customize independently.
Given that Emacs is the most configurable editor on the planet, I do
not see any reason to entertain arguments to the contrary.

Vb) Ad-hoc default faces that are not being autoloaded

There are some modes (e.g. sgml-mode, html-mode, nxml-mode) that
define their own versions of some of the xml/html faces, but it did
not seem right to make js2-mode 'require one of these modes just to
get at ad-hoc "standard" definitions for these faces.

We should define standard faces for xml/html tags and entities, and
for any other faces that are effectively defined by 2 or more modes.

Vc) Additional semantic styles not needed by JavaScript

I have other language modes in progress, and together they define an
ever larger set of semantic styles.  The set of available font-lock
names should try to encompass the _union_ of the needs of most
languages, not the intersection.  There should, for instance, be a
font-lock-symbol-face for languages with distinguished symbols such
as Lisp, Scheme and Ruby.

I think this is relatively easy to fix, provided a little thought
goes into choosing the new faces.  Vd and Ve below should help
clarify why it requires greater than zero thought.

Vd) Composable semantic styles

Some font-lock faces represent "primary" semantic roles, in a vague
way.  For instance, there is a font-lock-function-name-face, and
this is different from font-lock-variable-name-face.  While in some
languages (including JavaScript) the distinction is not necessarily
exact, they can usually be reconciled -- e.g. being a function is
a more "important" property of an identifier than being a variable.

Most of the font-lock faces represent very common primary roles:
strings, comments, keywords, types, preprocessor macros.  But not all.
font-lock-constant-face is actually orthogonal to the primary role.
A class or method or parameter can be const or non-const in some
languages.

The semantic notion of public/private/protected/package/friend
visibility is another example.  So is "abstract"/"pure virtual".

Emacs supports composable faces (a style run may have multiple
faces, and the attributes compose according to predefined rules),
but font-lock provides neither consistent nor adequate support for
this notion.
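
To make the composition idea concrete, here is a minimal sketch at the
text-property level, outside font-lock.  The function name is
hypothetical, and pairing "variable" with "constant" is just an example
of a primary role plus an orthogonal one.

```elisp
(defun my-style-const-variable (beg end)
  "Mark BEG..END as a variable that is also a constant."
  ;; Primary role first.
  (put-text-property beg end 'face 'font-lock-variable-name-face)
  ;; Layer the orthogonal role on top.  The added face takes
  ;; precedence for any attribute that both faces specify.
  (add-face-text-property beg end 'font-lock-constant-face))
```

After the call, the `face' property holds both faces rather than one.
font-lock keywords can request something similar through the `prepend'
and `append' override flags, but there is no agreed convention for
which roles are primary and which are modifiers.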

Ve) Ambiguous semantic styles

At least one of the face names is ambiguous -- it's not clear what
font-lock-builtin-face is actually supposed to highlight.  The result
is that different language modes use it for different kinds of
entities.  If you customize the face for one mode, you may wind up
with unsatisfying results in another mode due to the differences
in relative weighting/distribution of semantic types across languages.

As a hypothetical example, someone might enhance python-mode to
use font-lock-builtin-face to highlight True/False/None and possibly
"self", since they're not keywords but they are all handled specially
by the runtime.  (font-lock-type-face might be better for this, but
since they're not really classes, you could argue it either way).
These tokens appear relatively infrequently in Python.  If someone
else were to use it to highlight functions implemented in C in elisp,
there would be a lot more of that face appearing in elisp buffers,
and it might not be easy to choose one face that looks nice in both
situations.

Regardless of the fate of js2-mode, font-lock needs to add more
semantic faces.  By default these new faces might simply inherit face
attributes from their "syntactic parents" -- e.g. the faces for
locals, parameters, instance and static vars might all inherit the
settings for `font-lock-variable-name-face'.  But users should be
able to differentiate among them when the information is available.
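
A sketch of what such inheriting defaults could look like follows.
The two face names are hypothetical and are not existing font-lock
faces.

```elisp
(defface my-local-variable-face
  '((t :inherit font-lock-variable-name-face))
  "Face for local variables.
Looks like any other variable until the user customizes it."
  :group 'font-lock-faces)

(defface my-function-param-face
  '((t :inherit font-lock-variable-name-face))
  "Face for function parameters.
Looks like any other variable until the user customizes it."
  :group 'font-lock-faces)
```

Users who never customize these faces see no change, since both fall
back to the variable-name face.  Users who do care can restyle locals
and parameters independently.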

Vf) No font-lock interface for setting exact style runs

I could be mistaken here -- if so, please correct me.

My limited understanding of font-lock and its main entry-point
mechanisms such as font-lock-keywords and font-lock-apply-highlight,
all of which use the MATCH-HIGHLIGHT data structure, is that they
are not quite powerful enough for my needs in their current incarnation.

This issue is independent of asynchronous parsing -- I think that
even if my parser were instantaneous, I would still have this issue.

The problem is that I need a way, in a given font-lock redisplay, to
say "highlight the region from X to Y with text properties {Z}".

This use case does not seem like it should be inordinately difficult
to support, but it does not seem to be supported today.
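
For reference, this is roughly what the operation looks like when done
directly on the buffer, bypassing font-lock.  It is a minimal sketch:
the function name and the (BEG END FACE) run format are assumptions
made for illustration.

```elisp
(defun my-apply-style-runs (runs)
  "Apply RUNS to the current buffer.
RUNS is a list of (BEG END FACE) entries.  The buffer's modified
flag and undo history are left alone."
  (with-silent-modifications
    (dolist (run runs)
      (put-text-property (nth 0 run) (nth 1 run)
                         'face (nth 2 run)))))

;; Example call with two runs:
;; (my-apply-style-runs '((1 9 font-lock-keyword-face)
;;                        (10 11 font-lock-function-name-face)))
```

The catch is that font-lock owns the `face' property while it is
active, so properties applied this way can be wiped out the next time
a region is refontified.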

When I assert that it's not possible, I understand that it's
_theoretically_ possible.  Given a JavaScript file with 2500 style
runs, assuming I had that information available at font-lock time, I
could return a matcher that contains 2500 regular expressions, each
one of which is tailored to match one and exactly one region in the
buffer.

In practice, however, I am not aware of a way to do this that is
either clean or efficient.

If this simple feature were supported, I would have a great deal more
incentive to try to get my parsing to be fast enough to work within
the time constraints users expect from font-lock.

Vg) Lack of differentiation between mode- and minor-mode styles

One of the most common complaints from the thousands of users of
js2-mode, most of whom have exercised enough self-restraint to use the
term "work in progress" in preference to "abomination", is that
js2-mode has poor support for minor modes that do their work with
font-lock -- 80-column highlighters being a popular example, although
there are others.

The fundamental problem here is that the font-lock framework does not
differentiate between the mode's syntax highlighting and the keywords
installed by minor modes and by user code.  Instead, it merges them.

As far as I can tell, the officially supported mechanism for
adding additional font-lock patterns is `font-lock-add-keywords'.
This either appends or prepends the keywords to the defaults.
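
For example, an 80-column highlighter of the kind mentioned above
might install itself like this.  It is a sketch with hypothetical
names and is not taken from any particular minor mode.

```elisp
(defvar my-long-line-keywords
  '(("^.\\{80\\}\\(.+\\)$" 1 'font-lock-warning-face prepend))
  "Font-lock keyword that flags any text past column 80.")

(defun my-long-line-highlight ()
  "Add the long-line keyword to the current buffer's font-lock setup."
  ;; MODE is nil, so the keyword applies to the current buffer only.
  ;; A non-nil third argument appends instead of prepending.
  (font-lock-add-keywords nil my-long-line-keywords 'append))
```

Once added, the pattern sits in font-lock-keywords alongside the major
mode's own entries, with nothing recording who contributed it.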

It might be possible to reverse-engineer it, for instance by manually
diffing the buffer's font-lock-defaults and font-lock-keywords and
trying to figure out which ones were added by participants other than
the major mode.  Even if it's possible, it's not clear that it always
works now, and would always work in the future.

For one thing, it's possible (as Daniel observes) to bypass this
mechanism and call font-lock-apply-highlight directly, which makes
the reverse-engineering even more cumbersome and fragile.

(Vf) is the reason (Vg) is a problem for js2-mode.  font-lock-defaults
does not seem to be a very satisfactory way to apply 2000-10000
precise style runs to a buffer, so I do all my own highlighting,
and it doesn't include style-run contributions from minor modes.

I've made some halfhearted attempts to hack around the problem, but
they've proven fragile.  If font-lock were to support (Vf), then I
think (Vg) should "just work".

VI.  Summary

I've called out some of the main integration issues I've encountered.
I've penned several major and minor language modes, not just js2-mode,
and I've chosen to whine here about the problems that could best be
classified as "problem themes".

I'm around, and I'm available for nontrivial work.  If group consensus
is that js2-mode isn't ready yet, I'm happy to keep hacking on it and
taking user patches and feedback until Emacs 24 rolls around.

But it would be nice to have more direct support for modes like mine.
I'm willing to do my end of it, but I'm always oversubscribed, and I've
already signed up to support mouse-enter and mouse-left text props
as part of another js2-mode-related thread.

So a little help would go a long way.

-steve
