Why I Switched to Python 3

Intro

It took me almost a year and a half of tinkering around in Python to make the jump from 2 to 3. I'd initially picked up Zed Shaw's Learn Python the Hard Way, which was written in 2.7 (but has since been updated to 3). From there, I read through Wes McKinney's Python for Data Analysis (which has since enjoyed a similar update). I learned 2, and so I worked in 2. Things were cool.

However, by the time I wanted to start branching out and doing some work with Anki, I had run into my first "library not supported for Python 2" issue.

So I reinstalled Python, begrudgingly started wrapping my print statements in parentheses, and started to poke around in some of the things I'd gain from making the jump.

If you're new and wondering "Which version of Python should I use?", the short answer is: just download 3. I've probably had a few-dozen conversations in the last year where I've said exactly that and asked people to trust me.

So before I share what ultimately brought me around to becoming pro-3, I wanted to take a minute to highlight a few things that helped me embrace my own cut-over.

A Couple Basics

print

One of the more annoying things I clung to was how deeply ingrained the Python 2 print syntax was in my muscle memory.

Not only that, but my state-of-the-art loop-debugging tool was going to break everywhere I'd placed it. Meaning this

In [2]:
# 2.X
if cond:
    # I'm a stubborn debugger
    print "This hit"
  File "<ipython-input-2-1bcbd3a7311d>", line 4
    print "This hit"
                   ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("This hit")?

would have to be replaced by

In [3]:
# 3.X
if cond:
    print('This hit')
This hit

everywhere it was relevant. Gah.

But what I didn't appreciate at the time was that treating print as a function meant new, interesting features could be baked into it. Let's look at the docstring of the new implementation.

In [4]:
# printing things about print
print(print.__doc__)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.

We can now print straight to a file instead of standard out.

In [34]:
fileObj = open('secrets.txt', 'w')
print('Secret message', file=fileObj)
In [35]:
!type "secrets.txt"
Secret message

That's kinda neat. Even cooler, though, is modifying the end argument

In [7]:
importantEx = 'Look how easy this makes printing the sassy clap hands'.split()

for word in importantEx:
    print(word, end=' 👏 ')
Look 👏 how 👏 easy 👏 this 👏 makes 👏 printing 👏 the 👏 sassy 👏 clap 👏 hands 👏 
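
The sep and flush arguments from that docstring are worth a look too. A quick sketch of each (mine, not from the notebook above):

# sep controls what gets placed between the values
print('2017', '10', '31', sep='-')    # 2017-10-31

# flush=True pushes the output through immediately -- handy for progress messages
print('Working...', end='', flush=True)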

Int and Float Division

Not a paralyzing gotcha, but it's nice to put almost 0 thought into doing an operation like

In [8]:
5 / 2
Out[8]:
2.5

In Python 2, that same expression returned 2: dividing two integers performed floor division, silently dropping the remainder instead of promoting the result to a float.

So as anyone coming from Python 2 already knows, the correct way around this was to make one or both of them floats

# Bad
>>> 5 / 2
2

# Good
>>> 5.0 / 2
2.5

# Good
>>> 5 / 2.0
2.5

This is trivial when doing simple calculations, but can really throw a wrench in something a bit less explicit. For instance

>>> scores = [1, 2, 3, 4, 5, 6]
>>> averageScore = sum(scores) / len(scores)
>>> print(averageScore)
3

Hey, it compiles. Ship it!

Never mind that the average is actually 3.5. And who wants to wrap len(scores) in float()? Not I. Python 3 has saved me ones of seconds for this reason alone.
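
For the record, here's a sketch of that same average under Python 3 -- no float() wrapping required, and // is still around for when you genuinely want floor division:

scores = [1, 2, 3, 4, 5, 6]

averageScore = sum(scores) / len(scores)    # true division is the default in 3
print(averageScore)                         # 3.5

print(sum(scores) // len(scores))           # floor division is still there if you want it: 3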

Iterators

A brief tangent before I get to what really sealed the deal for me. Let's take a 1,000-foot view of one of my favorite aspects of Python in general: iterators.

To demonstrate, let's look at the simplest of examples

In [9]:
for letter in ['a', 'b', 'c', 'd', 'e']:
    print(letter, end=' ') # callbacks!
a b c d e 

Easy, right? 101 stuff. But what's actually happening?

The for val in iterable syntax in Python fires off some pretty elegant operations behind the scenes.

Iterables. Iterables Everywhere.

Generally speaking, you can think of an iterable as "anything that can be scanned through."

So in the case of the for loop, Python passes our iterable object to the built-in iter() function, like so

In [10]:
iter(['a', 'b', 'c', 'd', 'e'])
Out[10]:
<list_iterator at 0x5b06fd0>

Notice the output here: list_iterator at some memory location. That suggests there are probably other kinds of iterators out there, yeah? Let's try.

In [11]:
ourString = 'Hello world'
iter(ourString)
Out[11]:
<str_iterator at 0x5b06f60>
In [12]:
ourSet = set(['a', 'a', 'b', 'c'])
iter(ourSet)
Out[12]:
<set_iterator at 0x5b15318>
In [13]:
ourDict = {'a': 1, 'b': 2}
iter(ourDict)
Out[13]:
<dict_keyiterator at 0x5b0c688>

Okay. But what is iter?

In [14]:
print(iter.__doc__)
iter(iterable) -> iterator
iter(callable, sentinel) -> iterator

Get an iterator from an object.  In the first form, the argument must
supply its own iterator, or be a sequence.
In the second form, the callable is called until it returns the sentinel.

Emphasis on "the argument must supply its own iterator." Let's see what that means with a quick peek under the hood.

__Mifflin__

A core component of the way Python works is its Double-Underscore (Dunder) Methods.

They're attached to nearly every object in Python and tell the language's functions and operators how to behave. For instance, if we wanted to add two numbers, it might look like

In [15]:
intA = 2
intB = 3

intA + intB
Out[15]:
5

But what's actually happening when we have

thing1 + thing2

is Python looks into thing1 and thing2 for instructions on how to use the + operator.

Internally, this looks closer to

In [16]:
intA.__add__(intB)
Out[16]:
5

Thankfully, we rarely need to call these directly. Whoever implemented integers in Python knew we'd want to use +, so __add__ is already wired up for us.
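
And when we write our own types, we get to make the same call. Here's a minimal sketch (a made-up Money class, not anything from above) that wires up + via __add__:

class Money:
    """Toy type that teaches the + operator what to do via __add__."""
    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        return Money(self.cents + other.cents)

    def __repr__(self):
        return 'Money({} cents)'.format(self.cents)

print(Money(150) + Money(75))    # Money(225 cents)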

But Dunder Methods also allow us to inspect equality

In [17]:
listA = [1, 2, 3]
listB = [3, 2, 1]

listA.__eq__(listB)
Out[17]:
False

Or make changes to existing items

In [38]:
vowels = ['A', 'E', 'I', 'O', 'U', 'Y']

import random

if random.random() > .5:
    vowels.__delitem__(5)

vowels
Out[38]:
['A', 'E', 'I', 'O', 'U']

Or a number of different things. There are a lot of them:

In [19]:
ourList = [1, 2, 3]
print([method for method in dir(ourList)
              if method.startswith('__')])
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__']

So What is an Iterator?

Narrowly defined, an iterator is something that:

  1. Has an implementation of the __next__() method that gets the next element in the sequence.
  2. Raises a StopIteration exception when it runs out of elements.

What makes iterators great is the way they execute. They don't have a corresponding __last__() method-- so they only know how to go forward. As a consequence of this, as soon as they pass over an item, they forget about it and keep moving. This makes them very cheap memory-wise.
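
To make that definition concrete, here's a minimal, hand-rolled iterator -- a made-up Countdown class that checks both boxes:

class Countdown:
    """Yields n, n-1, ..., 1, then raises StopIteration."""
    def __init__(self, n):
        self.current = n

    def __iter__(self):
        return self                # an iterator hands back itself

    def __next__(self):
        if self.current <= 0:
            raise StopIteration    # out of elements: sound the alarm
        value = self.current
        self.current -= 1          # forward only, no looking back
        return value

for num in Countdown(3):
    print(num, end=' ')            # 3 2 1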

Watch what happens when we make an iterator from our list

In [20]:
exampleIter = iter(['a', 'b', 'c', 'd', 'e'])
print(type(exampleIter))
<class 'list_iterator'>

As we saw earlier, this gives us a list_iterator object, which is basically a roadmap of "what value comes next?"


And we can step through it, getting the next element with, fittingly, the __next__() Dunder Method.

Dropping

In [21]:
exampleIter.__next__()
Out[21]:
'a'


each

In [22]:
exampleIter.__next__()
Out[22]:
'b'


value

In [23]:
exampleIter.__next__()
Out[23]:
'c'


along

In [24]:
exampleIter.__next__()
Out[24]:
'd'


the

In [25]:
exampleIter.__next__()
Out[25]:
'e'


way.

In [26]:
exampleIter.__next__()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-26-826e79fe9f71> in <module>()
----> 1 exampleIter.__next__()

StopIteration: 

As promised, once the iterator ran out of values to return via __next__(), it sounded the alarm with a StopIteration Exception.

But we don't have to worry about this when we do a for loop, because it's doing each of these steps for us, while also neatly suppressing the StopIteration exception.
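
Roughly speaking, the loop is doing something like this hand-rolled sketch behind the scenes (an approximation, not the actual CPython machinery):

letters = ['a', 'b', 'c', 'd', 'e']

letterIter = iter(letters)          # 1. ask the iterable for an iterator
while True:
    try:
        letter = next(letterIter)   # 2. grab the next value (letterIter.__next__())
    except StopIteration:           # 3. quietly swallow the alarm and stop
        break
    print(letter, end=' ')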

In [27]:
for letter in ['a', 'b', 'c', 'd', 'e']:
    print(letter, end=' ') # callbacks!
a b c d e 

Pretty seamless, I think.

But it Wasn't Always this Way

One of the biggest, and in my estimation most-compelling, differences between Python 2 and Python 3 comes when examining how this used to work.

Back in Python 2, if you wanted to scan over a bunch of elements, you'd first have to load them all into memory and then scan over everything. For instance, if you wanted to use the range() function to look at a bunch of consecutive values, Python would build all of those elements out into a list before even starting in on the scan.
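
You can see the memory side of this from Python 3 by poking at object sizes (a rough sketch; exact byte counts will vary by build):

import sys

lazyRange = range(2 * 10**8)          # lazy: just start/stop/step, no elements stored
print(sys.getsizeof(lazyRange))       # a few dozen bytes, no matter how long the range

eagerList = list(range(2 * 10**6))    # materialized (a smaller count, to be kind to RAM)
print(sys.getsizeof(eagerList))       # roughly 8 bytes of pointer per entry -- about 16 MB here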

This difference might seem trivial at first for smaller use-cases.

In [28]:
%%timeit

total = 0
for i in range(2*10):
    total += i
975 ns ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [29]:
%%timeit

total = 0
for i in list(range(2*10)):
    total += i
1.28 µs ± 4.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

But it makes a world of difference at scale.

In [30]:
%%timeit -r 1

total = 0
for i in range(2*10**8):
    total += i
11.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [31]:
%%timeit -r 1

total = 0
for i in list(range(2*10**8)):
    total += i
1min 23s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

This same operation took almost 7 times longer, packing every value into a list first.

And this same idea extends immediately to files. If you're only interested in the first 500 rows of an enormous .csv, consider how much easier it is to read from the top and close the file when you have what you need, versus reading every single line into memory only to throw away like 99% of it.
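
As a sketch of that idea (made-up file name, standard library only): pull the first 500 rows off the top and stop, rather than slurping the whole thing.

import csv
from itertools import islice

# hypothetical enormous file -- we only ever touch the first 500 rows
with open('enormous.csv', newline='') as f:
    reader = csv.reader(f)                 # csv.reader is itself lazy
    firstRows = list(islice(reader, 500))

print(len(firstRows))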

But to be Fair

Python 2 did technically support the iteration protocol, but you had to know to seek it out.

For instance, if you had a dictionary with a thousand keys

In [32]:
ourDict = {k: k for k in range(1000)}

you could scan through each key in neat, iterator fashion using the iterkeys() method

for key in ourDict.iterkeys():
    print key

Whereas in Python 3, we just use keys()

for key in ourDict.keys():
    print(key)

And that's it. It's just the default behavior in 3. You don't have to explain to a beginner working in 2 why they should be using the iterator version of each function, or soap-box about code scalability while they're more concerned with getting the loop to run correctly.
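
If you want to see that default laziness for yourself, keys() in Python 3 hands back a live view object rather than a materialized list. A quick sketch, reusing the ourDict from above:

ourDict = {k: k for k in range(1000)}

keyView = ourDict.keys()
print(type(keyView))      # <class 'dict_keys'> -- a view, not a list
print(len(keyView))       # 1000

ourDict[1000] = 1000      # the view tracks the dict as it changes
print(len(keyView))       # 1001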

Closing Thoughts

As far as the other features go, I'm certain that any Python veterans reading this have an opinion or two on some glaring exclusions I made (unicode, argument packing, and string formatting come to mind). Ultimately, though, this was what put the fear of Guido in me to start porting/evangelizing a move from 2 to 3.

I spend a good deal of my time teaching others how to get going in Python, and I care a lot about setting them up to write performant code. And so this shift to "slap an iterator on it" by default is certainly a welcome one, as it makes that job substantially easier on me.

To that end, I'd encourage anyone who found the "build a data road map, and leisurely scan through it" process interesting to go down a rabbit hole by Googling:

  • Functional programming
  • Lazy Evaluation
  • PySpark
  • itertools
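
As a quick taste of that last item, here's a sketch using itertools from the standard library -- an infinite, lazy stream that we only ever pull five values from:

from itertools import count, islice

squares = (n * n for n in count())    # lazily generates 0, 1, 4, 9, ... forever
print(list(islice(squares, 5)))       # [0, 1, 4, 9, 16]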

This stuff is basically magic.

Cheers,

-Nick