May 11, 2011 9:37 pm
Harrismh777
hi folks,
I am puzzled by unicode generally, and within the context of python
specifically. For one thing, what do we mean that unicode is used in
python 3.x by default. (I know what default means, I mean, what changed?)
I think part of my problem is that I'm spoiled (American, ascii
heritage) and have been either stuck in ascii knowingly, or UTF-8
without knowing (just because the code points lined up). I am confused
by the implications for using 3.x, because I am reading that there are
significant things to be aware of... what?
On my installation 2.6 sys.maxunicode comes up with 1114111, and my
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6
was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that
the default compile option for 2.7 & 3.2 (I didn't change anything) is
set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much
correctly?
The books say that the .py sources are UTF-8 by default... and that
3.x is either UCS-2 or UCS-4. If I use the file handling capabilities
of Python in 3.x (by default) what encoding will be used, and how will
that affect the output?
If I do not specify any code points above ascii 0xFF does any of
this matter anyway?
Thanks.
kind regards,
m harris
May 11, 2011 10:09 pm
Ian Kelly
On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
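A minimal sketch of what the rename looks like from inside Python 3. The sample word is only an illustration, not something from the original post:

```python
# Python 3: text literals are str (Unicode), byte literals are bytes.
text = "caf\u00e9"            # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: a sequence of integers in 0..255

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 5
print(data[0])                          # indexing bytes yields an int: 99
```

The length differs because the accented letter is one code point but two bytes in UTF-8.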
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
Mainly, Python 3 no longer does implicit conversion between bytes and
unicode, so the programmer has to be explicit about such conversions.
If you have Python 2 code that is sloppy about this, you may get
TypeErrors or Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
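A small sketch of the difference, assuming Python 3. The variable names are made up for the example:

```python
text = "abc"
data = b"abc"

# Python 3 refuses to mix the two types implicitly.
try:
    combined = text + data
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

# The conversion has to be spelled out, naming the encoding.
as_text = text + data.decode("ascii")
as_bytes = text.encode("ascii") + data

print(repr(mixing_error))
print(as_text, as_bytes)
```

In Python 2 the concatenation would have silently decoded the byte string as ASCII, which works until the first non-ASCII byte shows up.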
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.
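One way to check which kind of build you have is sketched below. Note that the narrow/wide distinction applies to the interpreters discussed in this thread; from Python 3.3 on (PEP 393) sys.maxunicode is always 1114111, so the "narrow" branch will not be taken there:

```python
import sys

# 0xFFFF (65535) indicates a narrow build, 0x10FFFF (1114111) a wide one.
build = "wide (UCS-4)" if sys.maxunicode > 0xFFFF else "narrow (UCS-2)"
print(sys.maxunicode, build)

# A code point outside the Basic Multilingual Plane.
astral = "\U0001F600"

# On a wide build this is 1; on a narrow build the character is stored
# as a surrogate pair and len() reports 2.
print(len(astral))
```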
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
If you open a file in binary mode, the result is a non-decoded byte stream.
If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
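A rough illustration of the two modes, assuming Python 3. The file name and sample word are hypothetical; passing encoding= explicitly avoids depending on whatever the locale default happens to be:

```python
import locale
import os
import tempfile

# What open() falls back to in text mode when no encoding is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical scratch file

    # Text mode, explicit encoding: str goes in, UTF-8 bytes land on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write("na\u00efve")

    # Binary mode: raw bytes, no decoding.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: the bytes are decoded back into a str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(raw)
print(ascii(text))
```

The printed default varies from machine to machine, which is exactly why relying on it for files you exchange with others is risky.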
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?
You mean 0x7F, and probably, due to the need to explicitly encode and decode.
May 11, 2011 10:34 pm
Benjamin Kaplan
On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
>
Not really sure about that, but it doesn't matter anyway: even though
the string is stored internally as either UCS-2 or UCS-4, you never
see that. You just see the string as a sequence of characters. If you
want to turn it into a sequence of bytes, you have to use an encoding.
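To make that concrete, here is a sketch (Python 3, sample string chosen only for illustration) where the same five-character string turns into a different number of bytes depending on the encoding you pick:

```python
text = "h\u00e9llo"   # five characters, one of them non-ASCII

# Same characters, different byte sequences depending on the encoding.
sizes = {}
for encoding in ("latin-1", "utf-8", "utf-16-le", "utf-32-le"):
    sizes[encoding] = len(text.encode(encoding))

print(len(text))
print(sizes)
```

len(text) stays 5 throughout; only the byte counts change.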
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?
ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2, because UTF-8 uses at least 2 bytes (and up to 4) to
encode every code point from 128 up: the top bit of the first byte
signals that more bytes follow for this character.
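A quick sketch of how the UTF-8 length grows with the code point, assuming Python 3. The four sample code points are arbitrary picks, one from each size class:

```python
# Code point -> number of bytes in its UTF-8 encoding.
utf8_lengths = {}
for code_point in (0x41, 0xA0, 0x20AC, 0x1F600):
    utf8_lengths[code_point] = len(chr(code_point).encode("utf-8"))

for code_point, size in utf8_lengths.items():
    print("U+%04X" % code_point, size)
```

Anything below 0x80 is a single byte identical to ASCII, which is why plain ASCII data "just works" as UTF-8.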
If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html
And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.
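As an illustration of what goes wrong otherwise (a sketch in Python 3, with a made-up one-character input): the same bytes decoded with the wrong encoding give different text, so the usual approach is to decode once at the boundary with the right encoding and work with str from then on.

```python
raw = "\u00e9".encode("utf-8")   # the bytes someone hands you

right = raw.decode("utf-8")      # decoded with the encoding actually used
wrong = raw.decode("latin-1")    # decoded with a guess: two junk characters

print(ascii(right), len(right))
print(ascii(wrong), len(wrong))
```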
May 11, 2011 10:51 pm
Harrismh777
Ian Kelly wrote:
Ian, Benjamin, thanks much.
> The `unicode' class was renamed to `str', and a stripped-down version
> of the 2.X `str' class was renamed to `bytes'.
... thank you, this is very helpful.
>> If I do not specify any code points above ascii 0xFF does any of this
>> matter anyway?
> You mean 0x7F, and probably, due to the need to explicitly encode and decode.
Yes, actually, I did... and from Benjamin's reply it seems that
this matters only if I am working with bytes. Is it true that if I am
working without using bytes sequences that I will not need to care about
the encoding anyway, unless of course I need to specify a unicode code
point?
Thanks again.
kind regards,
m harris