emacs-devel

Re: Bidirectional text and URLs


From: Eli Zaretskii
Subject: Re: Bidirectional text and URLs
Date: Sat, 29 Nov 2014 10:22:45 +0200

> From: "Stephen J. Turnbull" <address@hidden>
> Cc: address@hidden,
>     address@hidden
> Date: Sat, 29 Nov 2014 15:09:02 +0900
> 
>  > > but I would say that given that the UAX#9 bidi algorithm does what's
>  > > wanted 99.44% of the time, it makes sense to mark text reordered by
>  > > RTL markers with a warning face
>  > 
>  > That might be considered an annoyance by users of bidi scripts.
>  > There's any number of perfectly valid URLs that use the same
>  > formatting control characters.
> 
> Why?  Because many displays don't implement UAX#9?  Or is it because
> UAX#9 defines segments in a way that would reorder the components of a
> domain name or path?  That is, the logical URL
> 
>     http://www.example.com/ABC/DEF/
> 
> is expected by a bidi reader to appear as
> 
>     http://www.example.com/CBA/FED/
> 
> but UAX#9 would display it as
> 
>     http://www.example.com/FED/CBA/

Yes.  And there are worse examples (e.g., try an HTML link which
includes both a URL and a link text).

The problem here is that all those /, :, <, and > characters are
neutrals, so they take the direction of surrounding text, i.e. are
reversed for display when the surrounding text is RTL.  In addition, <
and > are mirrored in that case.  That can make quite a jumble.
(Unicode 6.3 added special handling for "paired-bracket" characters,
which makes the situation with < and > somewhat better, but we only
support that on master; Emacs 24.4 doesn't.)
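
To make the point about these characters concrete, here is a minimal
Python sketch that asks the Unicode database for the bidi class and
the mirroring property of the URL punctuation mentioned above.  The
helper name `describe` is made up for illustration, and the exact
class names printed depend on the Unicode version bundled with your
Python.  What matters is that none of these characters has a strong
direction of its own, so their placement is decided by the text
around them.

```python
import unicodedata

# Bidi classes that carry a direction of their own.
STRONG = {"L", "R", "AL"}

def describe(ch):
    """Return (bidi class, is_mirrored) for one character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

if __name__ == "__main__":
    for ch in "/:<>":
        bidi_class, mirrored = describe(ch)
        print(f"{ch!r}: class={bidi_class}, "
              f"strong={bidi_class in STRONG}, mirrored={mirrored}")
```

Only < and > report the mirrored property, which is why they flip
shape in RTL context while / and : merely move.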

> Whatever the reason, I'd have to say that's too bad for users of bidi
> languages, because that means *any* bidi URL is ambiguous, and
> therefore subject to being deliberately obfuscated by reflection
> and/or jumbling, regardless of the presence of directional controls.

I agree, but the issue discussed here is different: AFAIU, it's about
users of LTR scripts who can fall victim to the use of directional
controls that are by default (almost) invisible on the Emacs display.  I
think we would like to have at least that situation "handled" in some
way.  My point above was that the way we handle that should not unduly
punish users of bidi scripts, i.e. legitimate uses of these controls.

>  > What you suggest might be TRT when left-to-right text is enclosed
>  > within directional override controls (which is what Lars did in his
>  > example).  These controls assign right-to-left directionality to all
>  > the enclosed characters, which is indeed highly suspicious in URLs.
> 
> This isn't hard to detect.  But there is also the case where you have
> a word which is a different word when reflected.

If we have a dictionary, we can detect that, too.  If we don't, then
detecting only the enclosed-LTR case is better than nothing, I think.
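
As a rough illustration of the enclosed-LTR check, the following
Python sketch looks for text between RIGHT-TO-LEFT OVERRIDE (U+202E)
and POP DIRECTIONAL FORMATTING (U+202C) and reports the spans whose
content has strong LTR characters but no strong RTL ones.  The
function name is hypothetical, and the sketch assumes the overrides
are not nested and treats an unterminated override as running to the
end of the string.  It is not how Emacs does or would do this.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
RTL_STRONG = {"R", "AL"}

def suspicious_rlo_spans(text):
    """Return (start, end) pairs for RLO...PDF spans holding only LTR text.

    `start` is the index of the RLO; `end` is the index of the matching
    PDF, or len(text) if the override is never closed.
    """
    spans = []
    pos = 0
    while True:
        start = text.find(RLO, pos)
        if start == -1:
            break
        end = text.find(PDF, start + 1)
        if end == -1:
            end = len(text)
        enclosed = text[start + 1:end]
        has_ltr = any(unicodedata.bidirectional(c) == "L" for c in enclosed)
        has_rtl = any(unicodedata.bidirectional(c) in RTL_STRONG
                      for c in enclosed)
        if has_ltr and not has_rtl:
            spans.append((start, end))
        pos = end + 1
    return spans
```

An override wrapped around genuine RTL text is left alone by this
check, so legitimate uses are not flagged.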

Another possibility is to modify the way these control characters are
displayed by manipulating their entries in the glyphless-char-display
char-table.  It should probably be enough to display them as hex-code
in a box, to make the user aware of the possible problem.  This should
be done by applications that display URLs, like eww, Gnus, Rmail,
etc.; not globally.
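
The effect of showing the controls as hex codes can be imitated
outside Emacs.  The Python sketch below is only a stand-in for the
idea, it does not touch glyphless-char-display or any Emacs API; the
set of characters and the bracket notation are my own choices for
illustration.

```python
# Directional formatting characters that are normally invisible.
BIDI_CONTROLS = {
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def reveal_bidi_controls(text):
    """Replace each directional control with a visible [XXXX] hex code."""
    return "".join(
        f"[{ord(c):04X}]" if c in BIDI_CONTROLS else c
        for c in text
    )

if __name__ == "__main__":
    print(reveal_bidi_controls("a\u202eb\u202c"))  # prints a[202E]b[202C]
```

Since the control character itself is replaced by ordinary text, the
result also no longer reorders anything when displayed.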

> I assume that this is the case in bidi languages as well

Yes, but that would require RTL text enclosed in a left-to-right
override, which is easily detectable, like the opposite case that
started this thread.

> and of course any jumble is possible as a domain or path component
> which is an abbreviation.  And any useful jumble can probably be
> registered as a domain, and certainly incorporated in a path.

I doubt that a domain like this could be registered, as using such
characters in a domain name is AFAIU against the regulations, see
RFC3987.

>  > In addition to using a special face, another possibility is to present
>  > the directional overrides in these cases in percent-hex notation,
>  > which will disable their effect on the enclosed text.  Of course, this
>  > should be only done when the enclosed text is entirely made of LTR
>  > characters and neutrals.
> 
> Well, no.  I assume that bidi readers are as vulnerable to phishing
> and other frauds as non-bidi readers (hard as that may be to believe
> for you bidi readers).  That is not yet clear.

The easy cases with RTL text, as mentioned above, should also be
easily detectable, and I agree they should get the same treatment.
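
Here is one way the percent-hex presentation quoted above could look,
applied to both directions.  This is a Python sketch under my own
assumptions: overrides are not nested, each is closed by U+202C, and
the names `neutralize_overrides` and `_strong_directions` are
hypothetical.  An override pair is rewritten only when the enclosed
text has strong characters and all of them run opposite to the
override.

```python
import re
import unicodedata
from urllib.parse import quote

RLO, LRO, PDF = "\u202e", "\u202d", "\u202c"

# An override, then text free of further overrides, then the closing PDF.
OVERRIDE_SPAN = re.compile(f"([{RLO}{LRO}])([^{RLO}{LRO}{PDF}]*)({PDF})")

def _strong_directions(text):
    """Return the set of strong directions ('L', 'R') present in text."""
    found = set()
    for c in text:
        cls = unicodedata.bidirectional(c)
        if cls == "L":
            found.add("L")
        elif cls in ("R", "AL"):
            found.add("R")
    return found

def neutralize_overrides(url):
    """Percent-encode override pairs that only reverse opposite-direction text."""
    def repl(match):
        opener, body, closer = match.groups()
        forced = "R" if opener == RLO else "L"
        strong = _strong_directions(body)
        if strong and forced not in strong:
            # quote() emits the UTF-8 bytes of the control as %XX escapes.
            return quote(opener) + body + quote(closer)
        return match.group(0)
    return OVERRIDE_SPAN.sub(repl, url)
```

Overrides around mixed text, or around text that already runs in the
forced direction, pass through unchanged.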

>  > > You do need a way to turn it off, or to make it reasonably smart, in
>  > > the case of ASCII which is often mixed with other charsets.
>  > 
>  > Not sure what you mean here.
> 
> As above, where the domain name is ASCII and the path is RTL.  Or the
> path (or the domain) might be mixed.
> 
>  > "Turn off" how?
> 
> "We need to decide what we want to do, and then look for a mechanism."

OK, let me rephrase: what effect will "turning off" have on display?

>  > And how do you do that without unduly punishing perfectly valid
>  > URLs that need these controls to avoid visual "jumbles"?
> 
> I hate to tell you, but the phishers have *already* started punishing
> those perfectly valid URLs.  You have a choice of punishment, that's
> all: "jumbled display" vs. "defrauded users".

I very much hope we will find a sane middle ground, possibly subject
to user control.  I'd hate to see Emacs become another case of the TSA
disaster.

> Except that as I say above, apparently all bidi URLs must now be
> considered to offer suspicious display under some circumstances, so
> maybe you have no choice about the defrauded users.  In that case I
> suppose avoiding jumbles does take precedence.

Once we decide which cases we want to avoid or flag, we could be smart
there, by comparing the original and reordered strings, perhaps aided
by some dictionary lookup.  The infrastructure is either already there
or easy to add.  It's "just" a matter of deciding what to do and when.
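
To show what comparing the original and reordered strings means, here
is a deliberately simplified Python sketch.  It does not implement
UAX#9.  It assumes a left-to-right paragraph whose only reordering
comes from non-nested U+202E ... U+202C spans, reverses those spans,
and ignores mirroring.  All function names are made up for the
example.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def displayed_order(text):
    """Approximate what the reader sees: each RLO...PDF span is reversed."""
    out = []
    pos = 0
    while True:
        start = text.find(RLO, pos)
        if start == -1:
            out.append(text[pos:])
            break
        end = text.find(PDF, start + 1)
        if end == -1:
            end = len(text)  # unterminated override runs to the end
        out.append(text[pos:start])
        out.append(text[start + 1:end][::-1])
        pos = end + 1
    # Drop any stray PDF that did not close an override.
    return "".join(out).replace(PDF, "")

def logical_order(text):
    """The characters as stored, with the override controls removed."""
    return text.replace(RLO, "").replace(PDF, "")

def reordering_changes_text(text):
    """True when the displayed string differs from the stored one."""
    return displayed_order(text) != logical_order(text)

if __name__ == "__main__":
    name = "photo\u202egpj.exe\u202c"
    print(logical_order(name))    # prints photogpj.exe
    print(displayed_order(name))  # prints photoexe.jpg
```

A real implementation would take the reordered string from the
display engine instead of this approximation, and could then pass
both strings to whatever dictionary lookup is decided on.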

Someone(TM) should present a list of well-thought-out requirements, and
we can take it from there.


