Re: [PATCH python-tests] gnu: python-2.7: Enable UCS-4 Unicode encoding.
From: Danny Milosavljevic
Subject: Re: [PATCH python-tests] gnu: python-2.7: Enable UCS-4 Unicode encoding.
Date: Tue, 24 Jan 2017 00:46:04 +0100
Hi Ludo,
> > Otherwise LGTM. I checked some other distros and they seem to have this
> > enabled. Thanks!
>
> That means that strings are internally UCS-4-encoded, right? What’s the
> rationale, and what happens when this flag is omitted?
The CPython C interface changes depending on this flag, and some Python
extensions don't work with a narrow (UTF-16) Unicode build, which is what
you get if you don't specify the flag.
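
To see which kind of build a given interpreter is, something like the following should do. It is only a sketch: the helper name unicode_build_kind is made up for illustration, and it relies on sys.maxunicode being 65535 on a narrow build and 1114111 on a wide one.

```python
import sys

def unicode_build_kind(max_codepoint=None):
    """Classify a CPython build as 'narrow' or 'wide'.

    Hypothetical helper: defaults to sys.maxunicode, the largest
    codepoint a single string element can hold in this interpreter.
    """
    if max_codepoint is None:
        max_codepoint = sys.maxunicode
    # Narrow builds stop at 0xFFFF; wide builds go up to 0x10FFFF.
    return "wide" if max_codepoint > 0xFFFF else "narrow"

print(unicode_build_kind())
```

An extension compiled against one kind of build cannot be loaded by the other, so this is a quick check to run when an import fails.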
The default, UTF-16, is basically just historical baggage from when the
Unicode standard had fewer than 65536 codepoints.
The highest codepoint allowed nowadays is 1114111 (U+10FFFF).
UCS-4 encoding means that just one 32-bit word encodes one Unicode codepoint
(it's 1:1). It's the most straightforward encoding if you don't care about size
wastage.
If you *do* care about size wastage, you use UTF-8.
You use UTF-16 or UCS-2 only if you are tied down by some kind of backward
compatibility constraint (UCS-2 has no way AT ALL to encode codepoints over
65535, whereas UTF-16 uses a variable-length encoding to represent them).
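
To make the size trade-off concrete, here is a small sketch. The helper names encoded_sizes and surrogate_pair are made up for illustration; the codecs are the standard ones shipped with Python, and "utf-32-le" stands in for UCS-4 here.

```python
def encoded_sizes(text):
    """Return the number of bytes text needs in each encoding (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

def surrogate_pair(codepoint):
    """Split a codepoint above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = codepoint - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(encoded_sizes(u"A"))           # ASCII letter
print(encoded_sizes(u"\U0001D11E"))  # MUSICAL SYMBOL G CLEF, above 0xFFFF
print([hex(unit) for unit in surrogate_pair(0x1D11E)])
```

The ASCII letter takes 1, 2 and 4 bytes respectively, while the codepoint above 65535 takes 4 bytes in all three; in UTF-16 those 4 bytes are the two surrogates, which UCS-2 has no equivalent for.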
Python builds on Microsoft Windows and Mac OS X usually use UTF-16 for
Unicode strings, while on GNU/Linux distributions we usually use UCS-4.
Python 3 does the obvious thing and has only one string class and switches the
internal string encoding depending on what codepoints are used. That way the
user is none the wiser and it still saves space.
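
You can observe this from the outside, roughly, by measuring how much a string grows per character. This is a sketch that assumes CPython 3.3 or later and leans on sys.getsizeof, which reports implementation details; the helper name bytes_per_char is made up for illustration.

```python
import sys

def bytes_per_char(ch):
    """Estimate the storage used per character for strings made of ch.

    Compares two strings that differ by 1000 characters so the fixed
    object header cancels out. CPython-specific, not a language guarantee.
    """
    short = ch * 1000
    long_ = ch * 2000
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // 1000

for ch in (u"a", u"\u20ac", u"\U0001D11E"):
    print(hex(ord(ch)), bytes_per_char(ch))
```

On such an interpreter I would expect 1, 2 and 4 bytes per character for the three samples, since the widest codepoint in a string determines the width used for the whole string.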
But Python 2.7 still has "strings" and "unicode strings" as two distinct
types, with no such optimization.
So this patch basically just makes sure that we do the same as other
distributions so that all the Python 2.7 extensions work.