emacs-orgmode

Re: [O] Smart Quotes Exporting


From: Mark Shoulson
Subject: Re: [O] Smart Quotes Exporting
Date: Fri, 15 Jun 2012 16:20:43 +0000 (UTC)
User-agent: Loom/3.14 (http://gmane.org/)

Nicolas Goaziou <n.goaziou <at> gmail.com> writes:

> 
> Hello,
> 
> Mark Shoulson <mark <at> kli.org> writes:
> 
> >> ASCII exporter also handle UTF-8. So it's good to have there too.
> >
> > Really?  I would have thought ASCII meant ASCII, as in 7-bit clean
> > text.
> 
> org-e-ascii.el (as old org-ascii.el) handles ASCII, Latin1 and UTF-8
> encodings.

I noticed that after writing my response.  The name just threw me a little.  
Yes, that exporter needs to handle it too.

> > It looked to me like your solution would essentially boil down to "do
> > string handling when there's a string, otherwise recur down and find
> > the strings," which essentially means apply it to all the
> > strings... and there were already functions out there applying things
> > to strings, so this can just ride along with them.  Here, let's look
> > at your suggestion and see if we can find what I missed:
> >
....
> > So, if it's a string, use the regexps (if they can be smart enough to look at
> > beginning and end of the string, which they can--though I haven't been using the
> > :post-blank property so presumably something is amiss), and if it isn't a
> > string, recur down until you get to a string... Ah, but only if it's in
> > org-element-recursive-objects.
> 
> You're missing an important part: the regexps cannot be smart enough for
> quotes at the beginning or the end of the string. There, you must look
> outside the string. Hence:

Well, wait; regexps can make some pretty darn good guesses at the beginnings 
or ends of strings.  Quotations don't normally end in spaces (in the 
conventions used with ""; French typography is different, but if you're using 
spaces around your quotes you have worse problems (line-breaks) to worry 
about).  So if a string ends in space(s) followed by a quote, it's very likely 
that quote is an open-quote for some stuff that comes after.  Conversely, if a 
string starts with a quote followed by some spaces, it's very likely a close-
quote to what went on before.
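
A minimal sketch of that edge heuristic, written in Python purely for illustration (the actual patch is Emacs Lisp and is not reproduced here). The function name `smarten_edge_quotes` and the choice of curly-quote characters are assumptions, not anything taken from the patch:

```python
import re

OPEN_QUOTE, CLOSE_QUOTE = "\u201c", "\u201d"  # assumed output characters


def smarten_edge_quotes(text):
    """Guess the direction of a straight quote at either edge of a string."""
    # String starts with a quote followed by whitespace: it probably closes
    # a quotation that began in whatever preceded this string.
    text = re.sub(r'\A"(?=\s)', CLOSE_QUOTE, text)
    # String ends with whitespace followed by a quote: it probably opens
    # a quotation that continues in whatever follows this string.
    text = re.sub(r'(?<=\s)"\Z', OPEN_QUOTE, text)
    return text
```

A quote at an edge with no whitespace next to it is left untouched, since the string alone gives no cue either way.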

This isn't quite it; beginning-of-string followed by quote, then punctuation 
and then spaces is also a close-quote, etc... There is a lot of fine-tuning.  
But even what I currently have was able to handle your 

Caesar said, "/Alea Jacta est./"

example.  Yes, there are edge-cases which this won't catch, and it remains to 
be seen how pervasive and annoying those are.  It may be that repeated 
tweaking of regexps will handle enough of the ordinary cases.  It may be that 
after a few rounds of regexp-hacking someone will finally decide that regexp-
hacking just won't handle enough of the important cases.  But I think even as 
it stands now we'd probably handle 80-90% of the normal situations, which 
really is as much as we reasonably can hope for.

Could I trouble someone to apply my patch, try it out, and see just how 
good or bad the results are?  It seems to work okay for the cases I've been 
trying, but maybe my dataset isn't robust enough.  Let's give it a test and 
see how many actual cases in common usage it gets wrong.  Maybe see how much 
can be fixed by tuning regexps.

> 
> > ]      1. If it has a quote as its first or last position, check for
> > ]         objects before or after the string to guess its status. An
> > ]         object never starts with a white space, but you may have to
> > ]         check :post-blank property in order to know if previous object
> > ]         had white spaces at its end.
> 
> But you can only do that from the element containing the string, not
> from the string itself.

The case where a quote both sits at the edge of a string (i.e. at the border 
of some element, formatting, etc) *and* does not have whitespace next to it, 
with possible punctuation, does not seem to be a normal occurrence to me.  If 
I'm wrong, how common *is* it?
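
One way to put a number on that question is to count, over the plain-text strings of a real document, how many edge quotes lack a whitespace cue. The following is a rough Python sketch; the helper names and the punctuation set are hypothetical, and the list of strings would have to come from an actual parse:

```python
import re

# Assumed set of punctuation allowed between a leading quote and whitespace.
PUNCT = r'[.,;:!?]*'


def edge_quote_is_ambiguous(text):
    """Return True if a quote at either edge of text has no whitespace cue."""
    starts_with_quote = text.startswith('"')
    ends_with_quote = text.endswith('"')
    # A leading quote is decidable if followed by optional punctuation,
    # then whitespace.
    start_ok = (not starts_with_quote
                or re.match(r'"' + PUNCT + r'\s', text) is not None)
    # A trailing quote is decidable if preceded by whitespace.
    end_ok = not ends_with_quote or re.search(r'\s"\Z', text) is not None
    return not (start_ok and end_ok)


def ambiguity_rate(strings):
    """Fraction of edge-quoted strings that the string alone cannot decide."""
    edged = [s for s in strings if s.startswith('"') or s.endswith('"')]
    if not edged:
        return 0.0
    return sum(edge_quote_is_ambiguous(s) for s in edged) / len(edged)
```

Running `ambiguity_rate` over the strings of a few ordinary documents would show whether the problematic case is rare or routine.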

> 
> > So the issue with the current state is that it
> > would wind up applying to too much? (it would hit code and verbatim elements,
> > for example, and that would be wrong.)
> 
> No, you are not applying it too much (verbatim elements don't contain
> plain-text objects) but your function hasn't got access to enough
> information to be useful.

The on-screen version, of course, will have to be smarter and check for 
the "face" formatting to make sure it doesn't happen in comments or verbatims; 
I am pretty sure it does not do that yet.
 
> > wait, called on the top-level parsed tree object, recursively doing
> > its thing before(?) the transcoders of the individual objects get to
> > it.
> 
> That's called a parse tree filter. That should be a possibility
> indeed. The function would be applied on the parse tree and would
> replace strings within elements containing plain text (that is
> paragraph, verse-block and table-row types). parse tree filters are
> applied very early in the export process.
> 
> Another option would be to integrate it into
> `org-element-normalize-contents', but I think the previous way is
> better.

Maybe.  I know it sounds like I'm fixated on the plain-text solution, but I'm 
not convinced the envisioned problems are more than theoretical, or that they 
will cause an unacceptable amount of error (keeping in mind that some error 
*is* acceptable and unavoidable).
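
For comparison, the parse-tree alternative quoted above can be sketched as follows. This is Python over a made-up tree shape (dicts with `type`, `post_blank` and `contents` keys, plain `str` for text); it is not the org-element API, and the rules for picking a quote direction are only one plausible reading of the suggestion:

```python
OPEN_QUOTE, CLOSE_QUOTE = "\u201c", "\u201d"  # assumed output characters
# Element types named in the thread as containing plain text.
TEXT_CONTAINERS = {"paragraph", "verse-block", "table-row"}


def resolve_edge_quotes(contents):
    """Resolve quotes at string edges by looking at neighbouring objects."""
    out = []
    for i, item in enumerate(contents):
        if isinstance(item, str):
            prev = contents[i - 1] if i > 0 else None
            nxt = contents[i + 1] if i + 1 < len(contents) else None
            if item.startswith('"') and isinstance(prev, dict):
                # No blanks after the previous object: the quote hugs it,
                # so treat it as a close-quote; otherwise as an open-quote.
                hugging = prev.get("post_blank", 0) == 0
                item = (CLOSE_QUOTE if hugging else OPEN_QUOTE) + item[1:]
            if item.endswith('"') and isinstance(nxt, dict):
                # An object never starts with white space, so a quote that
                # follows a space (or stands alone) opens what comes next.
                opens = len(item) == 1 or item[-2].isspace()
                item = item[:-1] + (OPEN_QUOTE if opens else CLOSE_QUOTE)
        out.append(item)
    return out


def smart_quote_filter(node):
    """Walk the tree and fix edge quotes inside text-bearing elements."""
    if isinstance(node, str):
        return node
    contents = [smart_quote_filter(c) for c in node.get("contents", [])]
    if node["type"] in TEXT_CONTAINERS:
        contents = resolve_edge_quotes(contents)
    return {**node, "contents": contents}
```

The sketch only resolves quotes at string edges directly inside the listed containers; quotes in the interior of a string would still go through the ordinary regexps.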

> > The on-screen one would still use the plain-string computation, as you said,
> > since the full parse isn't available.
> 
> Yes.
> 
> > It would also need to be tweaked not to act on verbatim/comment text,
> > etc.
> 
> Yes. You may want to use `org-element-at-point' and `org-element-type'
> to tell if you're somewhere smart quotes are allowed (in table,
> table-row, paragraph, verse-block elements).

Probably.  I think I saw some other package make these decisions by peeking at 
the formatting and seeing if it is set in comment-face or something, but 
checking the element at point is presumably more sensible.

~mark



