Python forum for Python programmers

Unicode by default

May 11, 2011 9:37 pm
Harrismh777

hi folks,
I am puzzled by unicode generally, and within the context of python
specifically. For one thing, what do we mean that unicode is used in
python 3.x by default. (I know what default means, I mean, what changed?)

I think part of my problem is that I'm spoiled (American, ascii
heritage) and have been either stuck in ascii knowingly, or UTF-8
without knowing (just because the code points lined up). I am confused
by the implications for using 3.x, because I am reading that there are
significant things to be aware of... what?

On my installation 2.6 sys.maxunicode comes up with 1114111, and my
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6
was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that
the default compile option for 2.7 & 3.2 (I didn't change anything) is
set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much
correctly?

The books say that the .py sources are UTF-8 by default... and that
3.x is either UCS-2 or UCS-4. If I use the file handling capabilities
of Python in 3.x (by default) what encoding will be used, and how will
that affect the output?

If I do not specify any code points above ascii 0xFF does any of
this matter anyway?



Thanks.

kind regards,
m harris

May 11, 2011 10:09 pm
Ian Kelly
Re: unicode by default

On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)

The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
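
As a rough illustration of the renaming, here is a minimal sketch for Python 3 (the variable names are made up for the example): a plain string literal is a `str` holding Unicode code points, and encoding it gives a `bytes` object.

```python
# Python 3: a plain literal is str (Unicode text); bytes is a separate type.
text = "caf\u00e9"            # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: a sequence of integers in range 0-255

print(type(text).__name__)    # str
print(type(data).__name__)    # bytes
print(len(text), len(data))   # 4 code points, 5 bytes (the e-acute takes 2)
print(data.decode("utf-8") == text)
```

In Python 2 the same literal without a `u` prefix would have been a byte string.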

> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?

Mainly, Python 3 no longer does implicit conversion between bytes and
unicode, requiring the programmer to be explicit about such
conversions. If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
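
A small sketch of what "explicit" means in practice, assuming Python 3 (the names `greeting` and `name_bytes` are only for illustration): mixing `str` and `bytes` raises `TypeError` instead of silently converting, so the bytes have to be decoded with a named encoding first.

```python
greeting = "hello "
name_bytes = b"world"

# Python 3 refuses to guess an encoding when str and bytes are mixed.
try:
    greeting + name_bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The explicit version: decode the bytes first, naming the encoding.
combined = greeting + name_bytes.decode("ascii")
print(mixed_ok, combined)
```

Python 2 would have performed the concatenation by implicitly decoding the byte string as ASCII, which only fails once non-ASCII bytes show up.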

> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?

I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.
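
For anyone wanting to check their own interpreter, a quick sketch: `sys.maxunicode` reports the largest code point a single string element can hold. Note that from CPython 3.3 onward the narrow/wide distinction no longer exists and the value is always 1114111, so the two-way check below only matters on older builds such as the ones discussed here.

```python
import sys

# 0xFFFF (65535) indicates a narrow build; 0x10FFFF (1114111) a wide build.
# On CPython 3.3 and later the value is always 0x10FFFF.
is_wide = sys.maxunicode > 0xFFFF
print(hex(sys.maxunicode), "wide" if is_wide else "narrow")
```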

> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?

If you open a file in binary mode, the result is a non-decoded byte stream.

If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
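
The difference between the two modes can be sketched as follows, assuming Python 3. The temporary file and the name `sample.txt` are just for the example, and the encoding is passed explicitly so that the result does not depend on the platform default.

```python
import locale
import os
import tempfile

# The encoding a bare open(path) falls back to in text mode (platform dependent).
print(locale.getpreferredencoding())

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Naming the encoding makes the bytes on disk predictable.
with open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve")

with open(path, "rb") as f:               # binary mode: raw bytes, no decoding
    raw = f.read()

with open(path, encoding="utf-8") as f:   # text mode: bytes are decoded to str
    text = f.read()

print(type(raw).__name__, len(raw))       # bytes 6
print(type(text).__name__, len(text))     # str 5
```

Relying on the locale default is convenient, but the same file may then decode differently (or fail to decode) on another machine.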

> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

You mean 0x7F, and probably, due to the need to explicitly encode and decode.


May 11, 2011 10:34 pm
Benjamin Kaplan
Re: unicode by default

On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
>

Not really sure about that, but it doesn't matter anyway. Because even
though internally the string is stored as either a UCS-2 or a UCS-4
string, you never see that. You just see this string as a sequence of
characters. If you want to turn it into a sequence of bytes, you have
to use an encoding.

> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2, because UTF-8 uses two to four bytes to encode every code
point above 0x7F (the top bit of each byte is set to signal that it
belongs to a multi-byte sequence).
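
To see the character-versus-byte lengths directly, here is a small sketch. It assumes Python 3.3 or later, where every code point is a single string element even outside the Basic Multilingual Plane; the sample characters are arbitrary picks from each UTF-8 length class.

```python
# One character from each UTF-8 length class: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00a0", "\u20ac", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    # Each sample is one character (len(ch) == 1) regardless of its byte count.
    print("U+%04X -> %d byte(s) in UTF-8" % (ord(ch), n))
```

Only code points up to 0x7F keep their one-byte ASCII form, which is why pure-ASCII data looks identical in both encodings.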

If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.


May 11, 2011 10:51 pm
Harrismh777
Re: unicode by default

Ian Kelly wrote:

Ian, Benjamin, thanks much.

> The `unicode' class was renamed to `str', and a stripped-down version
> of the 2.X `str' class was renamed to `bytes'.

... thank you, this is very helpful.

>> > If I do not specify any code points above ascii 0xFF does any of this
>> > matter anyway?

> You mean 0x7F, and probably, due to the need to explicitly encode and decode.

Yes, actually, I did... and from Benjamin's reply it seems that
this matters only if I am working with bytes. Is it true that if I am
not working with byte sequences, I will not need to care about
the encoding anyway, unless of course I need to specify a unicode code
point?

Thanks again.

kind regards,
m harris




