Re: "Font-lock is limited to text matching" is a myth

Hello all,

Thanks for opening this can of, er, threads. I was going to ask about
these things myself soon in any case, because it's clear that js2-mode
is not doing a very effective job of surfacing its rich information in
Emacs. This is partly my fault, but it is also partly due to some
issues with font-lock that I'll describe in nauseating detail.

There are several important ideas being conflated in this thread that
I think need to be teased apart before we can talk responsibly about
any of them. I've called out the top five conflations in sections
below delimited by roman numerals.

This is all in some sense an elaboration of what Eric Ludlam just
posted, to which I can only add my miserable +1.

Stephen Eilert wrote:

> I do not think that was done without a very good reason (and there's
> a lengthy post explaining it), unless the author is a complete
> masochist.

I don't think of myself that way. Here, as requested, is a lengthy post
explaining my approach. For the record, it could have been much
lengthier, and I have lengthy replies ready for all your objections and
concerns. (Just in case you were wondering.)

I really do want to get this resolved, though.

I. Asynchronous parsing

js2-mode performs both syntactic and (some) semantic analysis. It
knows, for instance, when you're using a symbol that's not defined in
its file. js2-mode does not currently understand project structure,
but I'm doing some work in this area, and it may at some point gather
semantic information collected from several files.

Because this analysis requires parsing the entire file at least once
(see my discussion of partial/incremental parsing below), and it may
someday involve looking at symbol tables from other files, it seemed
best to run the parse asynchronously, so as not to interfere with the
user's editing.
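
One common way to get this effect in Emacs is an idle timer: every
edit restarts a short countdown, and the full reparse runs only once
the user pauses. Strictly speaking the parse then runs during idle
time rather than in parallel. The following is a minimal sketch under
that assumption; `my-parse-buffer' and the other `my-' names are
hypothetical stand-ins, not js2-mode's actual functions, and the
0.5-second delay is arbitrary.

```elisp
;; Sketch only: every `my-' name is hypothetical, not part of js2-mode.

(defvar-local my-parse-timer nil
  "Pending idle timer for the next background reparse, or nil.")

(defvar-local my-parse-result nil
  "Result of the most recent full parse of this buffer.")

(defun my-parse-buffer ()
  "Stand-in for a real parser; here it only counts lines."
  (count-lines (point-min) (point-max)))

(defun my-reparse (buffer)
  "Reparse BUFFER from the top and store the result."
  (when (buffer-live-p buffer)
    (with-current-buffer buffer
      (setq my-parse-timer nil
            my-parse-result (my-parse-buffer)))))

(defun my-schedule-reparse (&rest _change)
  "Restart the idle timer; suitable for `after-change-functions'."
  (when (timerp my-parse-timer)
    (cancel-timer my-parse-timer))
  (setq my-parse-timer
        (run-with-idle-timer 0.5 nil #'my-reparse (current-buffer))))

(defun my-enable-background-parse ()
  "Reparse the current buffer whenever Emacs has been idle briefly."
  (add-hook 'after-change-functions #'my-schedule-reparse nil t)
  (my-schedule-reparse))
```

A mode would call `my-enable-background-parse' from its setup code.
Because the timer is restarted on every change, typing is never
blocked by a parse that is about to become stale anyway.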

One byproduct of having an accurate parser and symbol table is that
you can obtain style runs with relatively small effort, so js2-mode
does its own highlighting. The downside is that this highlighting
information is unavailable at font-lock time, and it is not available
piecewise -- it's all-or-nothing.

There is a relatively simple alternative that might appease Daniel:
I could have js2-mode simply not do any highlighting by default,
except for errors and warnings. We'd use whatever highlighting is
provided by espresso-mode, and users would be able to choose between
espresso-highlighting and js2-mode highlighting. With the former,
they'd get "instantaneous" font-locking, albeit not as rich as what
js2-mode can provide.

This would be trivial to change. I am actively maintaining js2-mode,
and the only reason I haven't checked in any changes since my initial
commit to the trunk is inexperience: I'm trying to get a handle on how
many changes people tend to aggregate before checking in a change to
any given mode. But I have several fixes (including some patches
contributed from users) that are ready to commit, and more on the way.

Errors and warnings would still need to be asynchronous (if they're
enabled). So, too, would the imenu outline and my in-progress
buffer-based outline, which is somewhat nicer than the imenu one.

But I think the main objection to js2-mode revolves around its
highlighting, correct? If so, AND if we can solve the font-lock
integration issues, AND if we can fix the multi-mode issues (II
below), then I'm hopeful that js2-mode might become a reasonable
choice as the default editing mode for JavaScript.

I think espresso-mode is a fine fallback position. Anything but
java-mode! The default today is java-mode, and I had no qualms about
replacing it as the default for JavaScript.

Note: diagnostic messages in js2-mode are highlighted using overlays.
I tried using overlays for all highlighting but it was unacceptably
slow and had a tendency to crash Emacs. But there are usually not
prohibitively many errors and warnings, since the error-recovery
algorithm is somewhat coarse-grained. So error-reporting works
independently of font-lock.
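
For readers unfamiliar with overlays, here is a minimal sketch of
marking diagnostics this way. It is not js2-mode's actual code: the
`my-' names, the face, and the (BEG END MESSAGE) list format are all
assumptions made for illustration.

```elisp
;; Sketch only: names and the diagnostics format are hypothetical.

(defface my-error-face
  '((t :underline "red"))
  "Face for the hypothetical error overlays below.")

(defun my-clear-diagnostics ()
  "Delete every overlay previously created by `my-show-diagnostics'."
  (remove-overlays (point-min) (point-max) 'my-diagnostic t))

(defun my-show-diagnostics (diagnostics)
  "Mark each (BEG END MESSAGE) entry of DIAGNOSTICS with an overlay."
  (my-clear-diagnostics)
  (dolist (d diagnostics)
    (let ((ov (make-overlay (nth 0 d) (nth 1 d))))
      (overlay-put ov 'my-diagnostic t)      ; tag so it can be found later
      (overlay-put ov 'face 'my-error-face)
      (overlay-put ov 'help-echo (nth 2 d)))))
```

Overlays sit on top of the buffer text rather than in its text
properties, so font-lock refontification does not remove them.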

II. Multi-mode support

JavaScript is especially needful of mumamo (or equivalent) multi-mode
support, because much of the JavaScript in the wild is embedded in
HTML, in template files, even in strings in other languages.

js2-mode does not support mumamo (or mmm-mode, with which I am
currently more familiar) because js2-mode's lexer needs to support
ignoring parts of the buffer. I do not think this would be very
hard to implement, but I have not done it yet.

If I don't get to it before the next version of Emacs launches, then I
think this should effectively disqualify js2-mode from being the
default JavaScript mode. It would be an inconsistent user experience
to have one JavaScript mode in .js files and another mode for
JavaScript inside multi-mode-enabled files.

I'm ready to give it a try, though, and I'll ping Lennart offline about
integrating the two somehow.

III. Incremental and partial parsing

Lennart and others have asked whether it is possible for js2-mode to
support partial or incremental parsing. The short answer is
"incremental: yes; partial: no".

nxml-mode, last I checked, does incremental parsing. It parses ahead
in the buffer, but then stops and saves its state. If you jump forward
in the buffer, it resumes and continues the parse until some point
beyond the section you're viewing.

js2-mode could do it this way without much additional effort. I chose
not to because once you've decided to use background parsing, it
doesn't seem like an especially useful optimization. But I could see
it being helpful in some cases, such as when you're editing near the
top of a large file -- as long as the whole file isn't encased in some
top-level expression, which unfortunately is often the case in JS.

Partial parsing is a different beast entirely. The goal of a partial
parser is to re-parse the minimum amount necessary, given some region
that has changed. I've dug into this a bit, because originally I
wanted to support it in js2-mode. I even made some progress on an
implementation.

While a few production parsers (for Java and JavaScript) have
implemented partial parsing, the vast majority of them do not support
it -- instead, they re-parse from the top. They do this because the
incremental benefit of partial parsing is debatable, assuming you're
time- and resource-constrained, as most of us are.

I took a close look at Eclipse and IntelliJ, and even asked some
of their users to characterize the highlighting behavior of the IDE.
Without exception, the IDE users had internalized a ~1000 ms delay
in highlighting and error reporting as part of their normal workflow,
and they uniformly described it as "instant{aneous}" until I made
them time it.

I've been an Emacs user for 20+ years now, and like many I found
the idea of a parsing delay to be somewhere between "undesirable"
and "sickening". But the majority of programmers today have
apparently learned not to notice delays of ~1sec as long as it
never interferes with their typing or indentation (see IV below).

So after looking at my ~8000 lines of elisp devoted to parsing
JavaScript, I weighed it and decided not to support partial parsing.
It's certainly possible to support it, but I think my time would be
better spent on things that average users are more likely to notice.
YMMV, of course.

The upshot is that if I'm going to support mumamo, it will need
to work within js2-mode's existing full-reparse framework. I can
think of various ways to make it work, though, and as I mentioned
I'll talk to Lennart about it.

IV. Indentation

The indentation in js2-mode is broken. I'll be the first to say it.
It is based on the indentation in Karl Landström's mode, which does a
better job for JavaScript than any indenter based on cc-engine, but
that doesn't mean it's a good job. And it's essentially unconfigurable.

espresso-mode shares this problem, which means that for this
important use case it is not an improvement over js2-mode.

Daniel's objections to js2-mode's non-interaction with font-lock
apply equally to the non-interaction with cc-engine's indentation
configuration system. The indent configuration for JavaScript should
share as many settings as practical with cc-mode.

I actually made a serious attempt to generate the `c-style-alist'
data structure for js2-mode using the parse tree, but ran into three
issues:

1) it's much harder than I thought it would be, even with a full
   parse tree available. I had some 2000 lines of elisp invested
   in it when I pooped out, to be perfectly frank.

2) `c-style-alist' (like font-lock) does not have enough semantic
   variables to encompass the range of indentation contexts that
   JavaScript programmers care about. I think we'd need to add
   5-10 more, although it's been 18 months since I looked into it.

3) indentation in "normal" Emacs modes also runs synchronously as
   the user types. Waiting 500-800 msec or more for the parse to
   finish is (I think) not acceptable for indentation. For small
   files the parse time is acceptable, but it would not be generally
   scalable.

#3 is the reason I gave up on #1. It didn't seem to be worth the
effort to produce an accurate but slow indenter.

I don't know exactly how to solve this problem. I have lots of
ideas, but it appears there are few low-hanging fruit in this space.

V. Font Lock framework design problems

There seems to be a common misconception flitting about to the
effect that font-lock is perfect and will never need to change.
This is a somewhat paradoxical viewpoint in view of the corpses
littering the path to jit-lock, which include font-lock, fast-lock,
lazy-lock, and vapor-lock. Each decade we've had a cadre of people
claiming that *-lock meets everyone's needs, and then it gets rewritten
anyway.

So it's hard to understand how it remains such a popular viewpoint.
I'll make yet another attempt to dispel it, since once we're past the
emotional stumbling blocks, font-lock may be able to evolve again.

Va) Inadequate/insufficient style names

There are not enough font-lock faces to represent all the semantic
style runs that are identifiable to "real" language analyzers.
js2-mode makes several semantic distinctions not available in most
Emacs modes, although such distinctions are available in JDEE and
other Cedet-enabled modes, so js2-mode is by no means alone in its
needs.

In addition to the autoloaded font-lock faces, which js2-mode uses
whenever possible, js2-mode defines several new faces, including:

* function parameters
* "class" instance members (in JS, prototype and instance props)
* local variables
* undeclared variables
* private members (although I implemented it poorly -- see below)
* html/xml tags, attr names and delimiters -- used both for html
  in jsdoc comments and for E4X literals
* doc tags such as those typically found in javadoc/jsdoc comments
* warnings, errors, and informational diagnostics

I do not expect that this set is all-inclusive -- over time as js2-mode
and similar modes get smarter, they will be able to make other
semantic distinctions that users may wish to customize independently.
Given that Emacs is the most configurable editor on the planet, I do
not see any reason to entertain arguments to the contrary.

Vb) Ad-hoc default faces that are not being autoloaded

There are some modes (e.g. sgml-mode, html-mode, nxml-mode) that
define their own versions of some of the xml/html faces, but it did
not seem right to make js2-mode 'require one of these modes just to
get at ad-hoc "standard" definitions for these faces.

We should define standard faces for xml/html tags and entities, and
for any other faces that are effectively defined by 2 or more modes.

Vc) Additional semantic styles not needed by JavaScript

I have other language modes in progress, and together they define an
ever larger set of semantic styles. The set of available font-lock
names should try to encompass the _union_ of the needs of most
languages, not the intersection. There should, for instance, be a
font-lock-symbol-face for languages with distinguished symbols such
as Lisp, Scheme and Ruby.

I think this is relatively easy to fix, provided a little thought
goes into choosing the new faces. Vd and Ve below should help
clarify why it requires greater than zero thought.

Vd) Composable semantic styles

Some font-lock faces represent "primary" semantic roles, in a vague
way. For instance, there is a font-lock-function-name-face, and
this is different from font-lock-variable-name-face. While in some
languages (including JavaScript) the distinction is not necessarily
exact, they can usually be reconciled -- e.g. being a function is
a more "important" property of an identifier than being a variable.

Most of the font-lock faces represent very common primary roles:
strings, comments, keywords, types, preprocessor macros. But not all.
font-lock-constant-face is actually orthogonal to the primary role.
A class or method or parameter can be const or non-const in some
languages.

The semantic notion of public/private/protected/package/friend
visibility is another example. So is "abstract"/"pure virtual".

Emacs supports composable faces (a style run may have multiple
faces, and the attributes compose according to predefined rules),
but font-lock provides neither consistent nor adequate support for
this notion.
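
As a sketch of what composition looks like at the text-property
level: the `face' property may hold a list of faces, and their
attributes are merged, with faces earlier in the list taking
precedence for any attribute they specify. The names
`my-constant-modifier-face' and `my-mark-const-variable' below are
hypothetical, chosen only to illustrate a "primary role plus
modifier" pairing.

```elisp
;; Sketch only: both `my-' names are hypothetical.

(defface my-constant-modifier-face
  '((t :weight bold))
  "Modifier face: says only that an identifier is const.
It deliberately sets no color, so the primary face still shows.")

(defun my-mark-const-variable (beg end)
  "Give the text from BEG to END a variable face plus a const modifier."
  (put-text-property beg end 'face
                     '(my-constant-modifier-face
                       font-lock-variable-name-face)))
```

Note that in a buffer with font-lock enabled, a later refontification
may overwrite a `face' property set by hand like this.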

Ve) Ambiguous semantic styles

At least one of the face names is ambiguous -- it's not clear what
font-lock-builtin-face is actually supposed to highlight. The result
is that different language modes use it for different kinds of
entities. If you customize the face for one mode, you may wind up
with unsatisfying results in another mode due to the differences
in relative weighting/distribution of semantic types across languages.

As a hypothetical example, someone might enhance python-mode to
use font-lock-builtin-face to highlight True/False/None and possibly
"self", since they're not keywords but they are all handled specially
by the runtime. (font-lock-type-face might be better for this, but
since they're not really classes, you could argue it either way).
These tokens appear relatively infrequently in Python. If someone
else were to use it to highlight functions implemented in C in elisp,
there would be a lot more of that face appearing in elisp buffers,
and it might not be easy to choose one face that looks nice in both
situations.

Regardless of the fate of js2-mode, font-lock needs to add more
semantic faces. By default these new faces might simply inherit face
attributes from their "syntactic parents" -- e.g. the faces for
locals, parameters, instance and static vars might all inherit the
settings for `font-lock-variable-name-face'. But users should be
able to differentiate among them when the information is available.
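
A minimal sketch of the inheritance idea, using the `:inherit' face
attribute. The three face names are hypothetical, not proposed
font-lock names; placing them in the `font-lock-faces' customization
group is likewise just an assumption for the example.

```elisp
;; Sketch only: these face names are hypothetical.

(defface my-local-variable-face
  '((t :inherit font-lock-variable-name-face))
  "Face for local variables; looks like any variable by default."
  :group 'font-lock-faces)

(defface my-parameter-face
  '((t :inherit font-lock-variable-name-face))
  "Face for function parameters; looks like any variable by default."
  :group 'font-lock-faces)

(defface my-instance-variable-face
  '((t :inherit font-lock-variable-name-face))
  "Face for instance members; looks like any variable by default."
  :group 'font-lock-faces)
```

Out of the box all three render exactly like
`font-lock-variable-name-face', so nothing changes for users who
don't care; those who do can customize each face separately.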

Vf) No font-lock interface for setting exact style runs

I could be mistaken here -- if so, please correct me.

My limited understanding of font-lock and its main entry-point
mechanisms such as font-lock-keywords and font-lock-apply-highlight,
all of which use the MATCH-HIGHLIGHT data structure, is that they
are not quite powerful enough for my needs in their current incarnation.

This issue is independent of asynchronous parsing -- I think that
even if my parser were instantaneous, I would still have this issue.

The problem is that I need a way, in a given font-lock redisplay, to
say "highlight the region from X to Y with text properties {Z}".
This use case does not seem like it should be inordinately difficult
to support, but it does not seem to be supported today.

When I assert that it's not possible, I understand that it's
_theoretically_ possible. Given a JavaScript file with 2500 style
runs, assuming I had that information available at font-lock time, I
could return a matcher that contains 2500 regular expressions, each
one of which is tailored to match one and exactly one region in the
buffer.

In practice, however, I am not aware of a way to do this that is
either clean or efficient.

If this simple feature were supported, I would have a great deal more
incentive to try to get my parsing to be fast enough to work within
the time constraints users expect from font-lock.
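
For concreteness, the operation being asked for amounts to something
like the following when done by hand, outside font-lock. This is a
sketch, not js2-mode's actual code: `my-apply-style-runs' and the
(BEG END FACE) run format are assumptions made for illustration.

```elisp
;; Sketch only: the function name and run format are hypothetical.

(defun my-apply-style-runs (runs)
  "Apply RUNS, a list of (BEG END FACE) entries, to the current buffer.
Existing `face' properties are cleared first.  The buffer's modified
flag and undo list are left untouched."
  (with-silent-modifications
    (remove-text-properties (point-min) (point-max) '(face nil))
    (dolist (run runs)
      (put-text-property (nth 0 run) (nth 1 run) 'face (nth 2 run)))))
```

For example, `(my-apply-style-runs '((1 9 font-lock-keyword-face)))'
paints the first eight characters as a keyword. The catch is the one
described above: this bypasses font-lock entirely, so it has to run
with font-lock out of the way, and nothing contributed through
font-lock ends up in the result.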

Vg) Lack of differentiation between mode- and minor-mode styles

One of the most common complaints from the thousands of users of
js2-mode, most of whom have exercised enough self-restraint to use the
term "work in progress" in preference to "abomination", is that
js2-mode has poor support for minor modes that do their work with
font-lock -- 80-column highlighters being a popular example, although
there are others.

The fundamental problem here is that the font-lock framework does not
differentiate between the mode's syntax highlighting and the keywords
installed by minor modes and by user code. Instead, it merges them.

As far as I can tell, the officially supported mechanism for
adding additional font-lock patterns is `font-lock-add-keywords'.
This either appends or prepends the keywords to the defaults.
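
Here is a minimal sketch of how an 80-column highlighter might hook
in through that mechanism. The `my-' names and the particular regexp
are assumptions for illustration, not taken from any specific minor
mode.

```elisp
;; Sketch only: the `my-' names and the regexp are hypothetical.

(defconst my-long-line-keywords
  '(("^.\\{80\\}\\(.+\\)$" 1 'font-lock-warning-face t))
  "Highlight whatever follows the 80th character of a line.
The trailing t lets this override faces applied by earlier keywords.")

(defun my-long-line-highlight-on ()
  "Append the long-line pattern to this buffer's font-lock keywords."
  (font-lock-add-keywords nil my-long-line-keywords 'append)
  ;; Ask font-lock to refontify, where that function is available.
  (when (fboundp 'font-lock-flush)
    (font-lock-flush)))
```

After this call the new entry simply sits in `font-lock-keywords'
alongside the major mode's own entries, with nothing recording who
added it.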

It might be possible to reverse-engineer it, for instance by manually
diffing the buffer's font-lock-defaults and font-lock-keywords and
trying to figure out which ones were added by participants other than
the major mode. Even if it's possible, it's not clear that it always
works now, and would always work in the future.

For one thing, it's possible (as Daniel observes) to bypass this
mechanism and call font-lock-apply-highlight directly, which makes
the reverse-engineering even more cumbersome and fragile.

(Vf) is the reason (Vg) is a problem for js2-mode. font-lock-defaults
does not seem to be a very satisfactory way to apply 2000-10000
precise style runs to a buffer, so I do all my own highlighting,
and it doesn't include style-run contributions from minor modes.

I've made some halfhearted attempts to hack around the problem, but
they've proven fragile. If font-lock were to support (Vf), then I
think (Vg) should "just work".

VI. Summary

I've called out some of the main integration issues I've encountered.
I've penned several major and minor language modes, not just js2-mode,
and I've chosen to whine here about the problems that could best be
classified as "problem themes".

I'm around, and I'm available for nontrivial work. If group consensus
is that js2-mode isn't ready yet, I'm happy to keep hacking on it and
taking user patches and feedback until Emacs 24 rolls around.

But it would be nice to have more direct support for modes like mine.
I'm willing to do my end of it, but I'm always oversubscribed, and I've
already signed up to support mouse-enter and mouse-left text props
as part of another js2-mode-related thread.

So a little help would go a long way.

-steve

From:	Steve Yegge
Subject:	Re: "Font-lock is limited to text matching" is a myth
Date:	Mon, 10 Aug 2009 23:47:37 -0700