Re: Decoding ACE created by libidn2...
From: Thomas Jacob
Subject: Re: Decoding ACE created by libidn2...
Date: Thu, 06 Jun 2013 15:21:56 +0200
On Wed, 2013-06-05 at 22:57 +0200, Simon Josefsson wrote:
> It is not trivial, and there may be multiple reasonable implementations.
> I have been meaning to write up one way to do it, and to implement that,
> in the hope that it could be established as a standard, but haven't
> found time. I recall sending a short summary of the steps required to
> the IDNA list (I think) a long time ago when I noticed this issue with
> IDNA2008.
I see...
> > Libidn2 doesn't seem to supply such a function yet, the
> > older Libidn (at least the cmd line tool) doesn't either
> > really, but I can manually split the punycode part from
> > the xn-- in each label and then use Libidn's punycode decoder
> > to reach my goal. Seems a bit of a hassle though.
>
> Yup, something like this is what a library could implement. There are
> aspects which are unclear (for example, how to split the domain? On
> ASCII dot '.' only, or the IDNA2003 domain separators? Should you split
> on escaped dots?).
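For illustration, the manual approach described above (strip the xn-- from each label, then Punycode-decode the rest) could be sketched roughly like this in Python 3. This is only a sketch under stated assumptions: it splits on the ASCII dot only, ignores the IDNA2003 separators and escaped dots, and does no IDNA validity checking on the result. The names decode_label and ace_to_unicode are made up for this example and are not part of Libidn or Libidn2; the decoding itself uses Python's built-in "punycode" codec.

```python
ACE_PREFIX = "xn--"

def decode_label(label):
    """Decode one label; labels without the ACE prefix pass through unchanged."""
    if label.lower().startswith(ACE_PREFIX):
        # Strip the prefix and hand the rest to Python's raw Punycode codec.
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def ace_to_unicode(domain):
    # Assumption: split on the ASCII dot only. No IDNA2003 separators,
    # no escaped dots, and no validity checks on the decoded labels.
    return ".".join(decode_label(label) for label in domain.split("."))

print(ace_to_unicode("www.xn--bue-6ka.de"))
```

Which separators to accept is exactly the open question here, so the split is kept in one place where it is easy to change.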
Hmm, I just noticed that the idnkit2.2 guys have actually implemented
their own interpretation of reverse conversion now. Here's some of
what they do:
python t.py | /usr/local/bin/idnconv2 -reverse
www.buße.de
www․buße․de
www‥buße‥de
www…buße…de
www⒈buße⒈de
www⒉buße⒉de
www⒊buße⒊de
www⒋buße⒋de
www⒌buße⒌de
www⒍buße⒍de
www⒎buße⒎de
www⒏buße⒏de
www⒐buße⒐de
www⒑buße⒑de
www⒒buße⒒de
www⒓buße⒓de
www⒔buße⒔de
www⒕buße⒕de
www⒖buße⒖de
www⒗buße⒗de
www⒘buße⒘de
www⒙buße⒙de
www⒚buße⒚de
www⒛buße⒛de
www㏂buße㏂de
www㏇buße㏇de
www㏘buße㏘de
www︙buße︙de
www︰buße︰de
www﹒buße﹒de
www.buße.de
www🄀buße🄀de
t.py:
for l in file('lst').readlines():
    if not l.startswith('U+'):
        continue
    ustr = l.split()[0].split('+')[1]
    u = unichr(int(ustr, 16))
    print (u'www%sxn--bue-6ka%sde' % (u,u)).encode('utf-8')
'lst' contains text copied and pasted from
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toNFKC=/\./:]
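A similar list can probably be generated locally instead of pasting it from the web page. The following is a rough Python 3 sketch that collects every code point whose NFKC form contains an ASCII dot, which is my reading of what the [:toNFKC=/\./:] query selects. The function name nfkc_dot_code_points is invented for this example, and the exact result depends on the Unicode version shipped with the Python interpreter, so it may not match the web page exactly.

```python
import sys
import unicodedata

def nfkc_dot_code_points():
    """Return the code points whose NFKC form contains an ASCII dot."""
    return [cp for cp in range(sys.maxunicode + 1)
            if not 0xD800 <= cp <= 0xDFFF  # skip surrogates
            and "." in unicodedata.normalize("NFKC", chr(cp))]

for cp in nfkc_dot_code_points():
    # Print in the same 'U+XXXX ...' shape that t.py expects in 'lst'.
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp), "?")))
```

Note that the IDNA2003 separator U+3002 (ideographic full stop) does not normalize to '.' under NFKC, so it is not part of this set.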
They don't interpret %2E however:
echo "www%2Exn--bue-6ka%2Ede" | /usr/local/bin/idnconv2 -reverse
www%2exn--bue-6ka%2ede
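If one did want escaped dots to count as label separators (which is one of the open questions, not something I claim is correct), a pre-processing step along these lines would be one option. It is a minimal Python 3 sketch; unescape_dots is a hypothetical helper, and it deliberately rewrites only %2E / %2e and leaves every other escape alone.

```python
def unescape_dots(domain):
    # Assumption: only the percent-escape for '.' is rewritten; any other
    # escape sequence is passed through untouched.
    return domain.replace("%2E", ".").replace("%2e", ".")

print(unescape_dots("www%2Exn--bue-6ka%2Ede"))
```

The catch is that an escaped dot may have been escaped precisely because it was not meant as a separator, so doing this unconditionally could change the meaning of the name.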
But to be honest, I don't really understand the intricacies
of IDNA2003/2008 or the Unicode character transformation
and classification rules, which is why I am happy to use
your libraries whenever possible ;=)
Regards,
Thomas