Re: [O] Smart Quotes Exporting
From: Mark Shoulson
Subject: Re: [O] Smart Quotes Exporting
Date: Fri, 15 Jun 2012 16:20:43 +0000 (UTC)
User-agent: Loom/3.14 (http://gmane.org/)
Nicolas Goaziou <n.goaziou <at> gmail.com> writes:
>
> Hello,
>
> Mark Shoulson <mark <at> kli.org> writes:
>
> >> ASCII exporter also handle UTF-8. So it's good to have there too.
> >
> > Really? I would have thought ASCII meant ASCII, as in 7-bit clean
> > text.
>
> org-e-ascii.el (as old org-ascii.el) handles ASCII, Latin1 and UTF-8
> encodings.
I noticed that after writing my response. The name just threw me a little.
Yes, that exporter needs to handle it too.
> > It looked to me like your solution would essentially boil down to "do
> > string handling when there's a string, otherwise recur down and find
> > the strings," which essentially means apply it to all the
> > strings... and there were already functions out there applying things
> > to strings, so this can just ride along with them. Here, let's look
> > at your suggestion and see if we can find what I missed:
> >
....
> > So, if it's a string, use the regexps (if they can be smart enough to
> > look at beginning and end of the string, which they can--though I
> > haven't been using the :post-blank property so presumably something is
> > amiss), and if it isn't a string, recur down until you get to a
> > string... Ah, but only if it's in org-element-recursive-objects.
>
> You're missing an important part: the regexps cannot be smart enough for
> quotes at the beginning or the end of the string. There, you must look
> outside the string. Hence:
Well, wait; regexps can make some pretty darn good guesses at the beginnings
or ends of strings. Quotations don't normally end in spaces (in the
conventions used with ""; French typography is different, but if you're using
spaces around your quotes you have worse problems (line-breaks) to worry
about). So if a string ends in space(s) followed by a quote, it's very likely
that quote is an open-quote for some stuff that comes after. Conversely, if a
string starts with a quote followed by some spaces, it's very likely a close-
quote to what went on before.
This isn't quite it; beginning-of-string followed by quote, then punctuation
and then spaces is also a close-quote, etc... There is a lot of fine-tuning.
But even what I currently have was able to handle your
Caesar said, "/Alea Jacta est./"
example. Yes, there are edge-cases which this won't catch, and it remains to
be seen how pervasive and annoying those are. It may be that repeated
tweaking of regexps will handle enough of the ordinary cases. It may be that
after a few rounds of regexp-hacking someone will finally decide that regexp-
hacking just won't handle enough of the important cases. But I think even as
it stands now we'd probably handle 80-90% of the normal situations, which
really is as much as we reasonably can hope for.
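To make the heuristic above concrete, here is a minimal sketch in Python rather than Emacs Lisp. It is not the patch under discussion: the function name `smarten_string`, the punctuation set, and the exact rules are assumptions chosen for illustration. It sees one plain string at a time and guesses from local context only.

```python
import re

OPEN_QUOTE = "\u201c"   # left double quotation mark
CLOSE_QUOTE = "\u201d"  # right double quotation mark

# At the very start of a string, a quote followed only by optional
# punctuation and then whitespace (or nothing) is guessed to be closing.
_CLOSING_TAIL = re.compile(r"[.,;:!?)]*(\s|$)")


def smarten_string(text: str) -> str:
    """Replace straight double quotes using only this string's own context.

    Hypothetical helper for illustration; the rules are a guess at the
    kind of heuristic described in the message, not the actual patch.
    """
    def replace(match: re.Match) -> str:
        left = text[:match.start()]
        right = text[match.end():]
        if left:
            # Whitespace or an opening bracket before the quote: opening.
            # This also covers a string that ends in space(s) plus a quote.
            if left[-1].isspace() or left[-1] in "([{":
                return OPEN_QUOTE
            return CLOSE_QUOTE
        # Quote is the first character of the string: guess from what follows.
        return CLOSE_QUOTE if _CLOSING_TAIL.match(right) else OPEN_QUOTE

    return re.sub('"', replace, text)
```

Assuming the Caesar example splits into the strings `Caesar said, "` and `"` around the italic object, both come out right here. A string consisting of a lone quote that precedes a formatted object (for instance at the start of a paragraph) would be guessed as closing, which is the kind of edge case that needs information from outside the string.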
Could I trouble someone to apply my patch, try it out, and see just how
good or bad the results are? It seems to work okay for the cases I've been
trying, but maybe my test data isn't varied enough. Let's give it a test and
see how many actual cases in common usage it gets wrong, and maybe how much
can be fixed by tuning the regexps.
>
> > ] 1. If it has a quote as its first or last position, check for
> > ] objects before or after the string to guess its status. An
> > ] object never starts with a white space, but you may have to
> > ] check :post-blank property in order to know if previous object
> > ] had white spaces at its end.
>
> But you can only do that from the element containing the string, not
> from the string itself.
The case where a quote both sits at the edge of a string (i.e. at the border
of some element, formatting, etc) *and* does not have whitespace next to it,
with possible punctuation, does not seem to be a normal occurrence to me. If
I'm wrong, how common *is* it?
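For comparison, the sibling check quoted above can be modelled with a toy structure. This is a Python sketch under loud assumptions: `Obj` is a hypothetical stand-in for a non-string object, its `post_blank` field mirrors the :post-blank property, and the single rule used (a quote is opening when whitespace or nothing precedes it) is a simplification, not the real org-element API or either proposed implementation.

```python
from dataclasses import dataclass

OPEN_QUOTE = "\u201c"
CLOSE_QUOTE = "\u201d"


@dataclass
class Obj:
    """Hypothetical stand-in for a non-string object such as italic text."""
    kind: str
    post_blank: int = 0  # blanks after the object, mirroring :post-blank


def blank_precedes(siblings: list, index: int) -> bool:
    """True if whitespace, or nothing at all, comes before siblings[index]."""
    if index == 0:
        return True  # first child of the container
    prev = siblings[index - 1]
    if isinstance(prev, Obj):
        return prev.post_blank > 0
    return prev[-1:].isspace()


def smarten_siblings(siblings: list) -> list:
    """Smarten quotes in each string, consulting the previous sibling
    only when the quote is the first character of its string."""
    result = []
    for i, node in enumerate(siblings):
        if isinstance(node, Obj):
            result.append(node)
            continue
        chars = []
        for j, ch in enumerate(node):
            if ch != '"':
                chars.append(ch)
                continue
            if j > 0:
                opening = node[j - 1].isspace()
            else:
                opening = blank_precedes(siblings, i)
            chars.append(OPEN_QUOTE if opening else CLOSE_QUOTE)
        result.append("".join(chars))
    return result
```

With this model, a lone quote before an italic object at the start of a paragraph is guessed as opening, and the same lone quote directly after an italic object is guessed as closing. A function that sees only the string cannot tell those two apart.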
>
> > So the issue with the current state is that it
> > would wind up applying to too much? (it would hit code and verbatim
> > elements, for example, and that would be wrong.)
>
> No, you are not applying it too much (verbatim elements don't contain
> plain-text objects) but your function hasn't got access to enough
> information to be useful.
The on-screen version, of course, will have to be smarter and check for
the "face" formatting to make sure it doesn't happen in comments or verbatims;
I am pretty sure it does not do that yet.
> > wait, called on the top-level parsed tree object, recursively doing
> > its thing before(?) the transcoders of the individual objects get to
> > it.
>
> That's called a parse tree filter. That should be a possibility
> indeed. The function would be applied on the parse tree and would
> replace strings within elements containing plain text (that is
> paragraph, verse-block and table-row types). parse tree filters are
> applied very early in the export process.
>
> Another option would be to integrate it into
> `org-element-normalize-contents', but I think the previous way is
> better.
Maybe. I know it sounds like I'm fixated on the plain-text solution, but I'm
not convinced the envisioned problems are more than theoretical, or that they
will cause an unacceptable amount of error (keeping in mind that some error
*is* acceptable and unavoidable).
> > The on-screen one would still use the plain-string computation, as
> > you said, since the full parse isn't available.
>
> Yes.
>
> > It would also need to be tweaked not to act on verbatim/comment text,
> > etc.
>
> Yes. You may want to use `org-element-at-point' and `org-element-type'
> to tell if you're somewhere smart quotes are allowed (in table,
> table-row, paragraph, verse-block elements).
Probably. I think I saw some other package make these decisions by peeking at
the formatting and seeing if it is set in comment-face or something, but
checking the element at point is presumably more sensible.
~mark