Why I Switched to Python 3
Intro¶
It took me almost a year and a half of tinkering around in Python to make the jump from 2 to 3. I'd initially picked up Zed Shaw's Learn Python the Hard Way, which was written in 2.7 (but has since been updated to 3). From there, I read through Wes McKinney's Python for Data Analysis (which is also enjoying similar updating). I learned 2 and so I worked in 2. Things were cool.
However, by the time I wanted to start branching out and doing some work with Anki, I had run into my first "library not supported for Python 2" issue.
So I reinstalled Python, begrudgingly started wrapping my print statements in parentheses, and started to poke around in some of the things I'd gain from making the jump.
But before I share what ultimately brought me around to becoming pro-3, there are two important things to note if you're new and considering "Which version of Python should I use?":
- If you're not getting dumped into an environment where you'll be supporting 2.X code, then you've already passed perhaps the biggest pro-2.X argument that exists.
- The days of Python 2 support are literally numbered.
I've probably had a few-dozen conversations in the last year where I've said "Just trust me and download 3." And so I wanted to take a minute to highlight a few things that helped me embrace my own cut-over.
A Couple Basics¶
print¶
One of the more annoying hang-ups was how deeply ingrained the print syntax was in my muscle memory.
Not only that, but my state-of-the-art, loop-debugging tool was going to break everywhere I'd placed it. Meaning this
# 2.X
if cond:
    # I'm a stubborn debugger
    print "This hit"
would have to be replaced by
# 3.X
if cond:
    print('This hit')
everywhere it was relevant. Gah.
But what I didn't appreciate at the time was that treating print as a function meant that there were new, interesting features baked into it now. Take a look at the docstring:
# printing things about print
print(print.__doc__)
We can now print straight to a file instead of standard out.
# the with block closes (and flushes) the file before we read it back
with open('secrets.txt', 'w') as fileObj:
    print('Secret message', file=fileObj)
!type "secrets.txt"
That's kinda neat. Even cooler, though, is modifying the end argument
importantEx = 'Look how easy this makes printing the sassy clap hands'.split()
for word in importantEx:
    print(word, end=' 👏 ')
Int and Float Division¶
Not a paralyzing gotcha, but it's nice to put almost 0 thought into doing an operation like
5 / 2
In Python 2, this took some thought because dividing one integer by another with / performed integer (floor) division and silently dropped the remainder.
So as anyone coming from Python 2 already knows, the correct way around this was to make one or both of them floats
# Bad
>>> 5 / 2
2
# Good
>>> 5.0 / 2
2.5
# Good
>>> 5 / 2.0
2.5
This is trivial when doing simple calculations, but can really throw a wrench in something a bit less explicit. For instance
>>> scores = [1, 2, 3, 4, 5, 6]
>>> averageScore = sum(scores) / len(scores)
>>> print(averageScore)
3
Hey, it compiles. Ship it!
Never mind that the average is actually 3.5. And who wants to wrap len(scores) in float()? Not I. Python 3 has saved me ones of seconds for this reason alone.
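For reference, here's a quick sketch of how the two division operators behave in Python 3. The variable names are mine, just for illustration.

```python
# Python 3: "/" is always true division, "//" is floor division
scores = [1, 2, 3, 4, 5, 6]

trueAverage = sum(scores) / len(scores)    # 21 / 6
floorAverage = sum(scores) // len(scores)  # 21 // 6, remainder dropped

print(trueAverage)   # 3.5
print(floorAverage)  # 3
```

If you're stuck on 2.X for a while, putting `from __future__ import division` at the top of a file opts that file into the same behavior.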
Iterators¶
A brief tangent before I get to what really sealed the deal for me. Let's get a 1,000-foot view on one of my favorite aspects of Python in general: iterators.
To demonstrate, let's look at the simplest of examples
for letter in ['a', 'b', 'c', 'd', 'e']:
    print(letter, end=' ') # callbacks!
Easy, right? 101 stuff. But what's actually happening?
The for val in iterable syntax in Python fires off some pretty elegant operations behind the scenes.
Iterables. Iterables Everywhere.¶
Generally-speaking, you can think of an iterable as "anything that can be scanned through."
So in the case of the for loop, Python passes our iterable object to the built-in iter function, like so
iter(['a', 'b', 'c', 'd', 'e'])
Notice the output here: a list_iterator at some memory location. That suggests there are probably other kinds of iterators out there, yeah? Let's try.
ourString = 'Hello world'
iter(ourString)
ourSet = set(['a', 'a', 'b', 'c'])
iter(ourSet)
ourDict = {'a': 1, 'b': 2}
iter(ourDict)
Okay. But what is iter?
print(iter.__doc__)
Emphasis on "the argument must supply its own iterator." Let's see what that means with a quick peek under the hood.
__Mifflin__¶
A core component of the way Python works is via Double-Underscore (Dunder) Methods.
These come affixed to nearly everything in Python and provide instructions to the functions and operators used everywhere. For instance, if we wanted to add two numbers, it might look like
intA = 2
intB = 3
intA + intB
But what's actually happening when we have
thing1 + thing2
is that Python looks into thing1 and thing2 for instructions on how to use the + operator.
Internally, this looks closer to
intA.__add__(intB)
Thankfully, we don't need to know this. Instead, whoever implemented integers in Python knew we'd want to use +, and this is the solution that arose.
But Dunder Methods also allow us to inspect equality
listA = [1, 2, 3]
listB = [3, 2, 1]
listA.__eq__(listB)
Or make changes to existing items
vowels = ['A', 'E', 'I', 'O', 'U', 'Y']
import random
if random.random() > .5:
    vowels.__delitem__(5)
vowels
Or a number of different things. There are a lot of them:
ourList = [1, 2, 3]
print([method for method in dir(ourList)
       if method.startswith('__')])
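These hooks aren't reserved for built-in types. Here's a minimal sketch of how defining a couple of Dunder Methods lets your own objects work with + and ==. It assumes a made-up Money class, invented for this example and not taken from any library.

```python
class Money:
    """Hypothetical toy class, only here to show Dunder Methods being picked up."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for: self + other
        return Money(self.cents + other.cents)

    def __eq__(self, other):
        # Called for: self == other
        return isinstance(other, Money) and self.cents == other.cents

    def __repr__(self):
        # Used when the object is printed or echoed in a notebook
        return 'Money({})'.format(self.cents)


total = Money(250) + Money(175)
print(total)                # Money(425)
print(total == Money(425))  # True
```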
So What is iter?¶
Narrowly-defined, an iterator is something that:
- Has an implementation of the __next__() method that gets the next element in sequence.
- Raises a StopIteration exception when it runs out of elements.
What makes iterators great is the way they execute. They don't have a corresponding __last__() method-- so they only know how to go forward. As a consequence of this, as soon as they pass over an item, they forget about it and keep moving. This makes them very cheap memory-wise.
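To make that definition concrete, here's a rough sketch of a hand-rolled iterator. The Countdown class is hypothetical, written only for this post. It also defines __iter__() returning itself, which is what lets it be dropped into list() or a for loop.

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator hands back itself when asked for an iterator
        return self

    def __next__(self):
        if self.current <= 0:
            # Nothing left, so signal that we're done
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


countdown = Countdown(3)
print(next(countdown))  # 3
print(list(countdown))  # [2, 1] -- the 3 is already gone
print(list(countdown))  # [] -- exhausted, and there's no going back
```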
Watch what happens when we make an iterator from our list
exampleIter = iter(['a', 'b', 'c', 'd', 'e'])
print(type(exampleIter))
As we saw earlier, this gives us a list_iterator object, which is basically a roadmap of "what value comes next?"
And we can step through it, getting the next element with, fittingly, the __next__() Dunder Method.
Dropping
exampleIter.__next__()
each
exampleIter.__next__()
value
exampleIter.__next__()
along
exampleIter.__next__()
the
exampleIter.__next__()
way.
exampleIter.__next__()
As promised, once the iterator ran out of values to return via __next__(), it sounded the alarm with a StopIteration exception.
But we don't have to worry about this when we do a for loop, because it's doing each of these steps for us, while also neatly suppressing the StopIteration message.
for letter in ['a', 'b', 'c', 'd', 'e']:
    print(letter, end=' ') # callbacks!
Pretty seamless, I think.
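If you want to see the seams, here's a rough hand-written equivalent of what that for loop does. It's a sketch of the protocol rather than the interpreter's actual implementation, and the seen list is only there so we can check the result afterwards.

```python
letters = ['a', 'b', 'c', 'd', 'e']
seen = []

letterIter = iter(letters)         # ask the list for its iterator
while True:
    try:
        letter = next(letterIter)  # same as letterIter.__next__()
    except StopIteration:
        break                      # out of values, so quietly leave the loop
    seen.append(letter)
    print(letter, end=' ')
```

The built-in next() is just the friendlier spelling of calling __next__() directly.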
But it Wasn't Always this Way¶
One of the biggest, and in my estimation most-compelling, differences between Python 2 and Python 3 comes when examining how this used to work.
Back in Python 2, if you wanted to scan over a bunch of elements, you'd first have to load them all into memory, and then scan over everything. For instance, if you wanted to use the range() function to look at a bunch of consecutive values, Python would build all of those elements out into a list before even starting in on the scan.
This difference might seem trivial at first for smaller use-cases.
%%timeit
total = 0
for i in range(2*10):
    total += i
%%timeit
total = 0
for i in list(range(2*10)):
    total += i
But it makes a world of difference at scale.
%%timeit -r 1
total = 0
for i in range(2*10**8):
    total += i
%%timeit -r 1
total = 0
for i in list(range(2*10**8)):
    total += i
This same operation took almost 7 times longer, packing every value into a list first.
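Timing aside, the memory footprint tells a similar story. Here's a rough sketch using sys.getsizeof. Exact byte counts depend on your Python build, so I've left them out.

```python
import sys

lazyRange = range(10**6)
builtList = list(lazyRange)

# A range object only needs to remember its start, stop, and step
print(sys.getsizeof(lazyRange))

# The list holds a reference to every one of its million elements
print(sys.getsizeof(builtList))
```

Note that getsizeof on the list doesn't even count the integer objects it points to, so the real gap is wider than the two numbers suggest.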
And this same idea extends immediately to files. If you're only interested in the first 500 rows of an enormous .csv, consider how much easier it is to read from the top and close the file when you have what you need, versus reading every single line into memory only to throw away like 99% of it.
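Here's a sketch of that .csv scenario using itertools.islice. The file name, column names, and row counts are made up, and the sketch writes its own throwaway file first so it can run anywhere.

```python
import csv
import itertools
import os
import tempfile

# Build a throwaway file (a stand-in for the enormous .csv)
path = os.path.join(tempfile.mkdtemp(), 'enormous.csv')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'value'])
    for i in range(10000):
        writer.writerow([i, i * 2])

# Read the header plus the first 500 data rows, then close the file
with open(path, newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    firstRows = list(itertools.islice(reader, 500))

print(header)          # ['id', 'value']
print(len(firstRows))  # 500
print(firstRows[-1])   # ['499', '998']
```

The reader is itself an iterator, so islice only ever pulls the rows it was asked for.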
But to be Fair¶
Python 2 did technically support the iteration protocol, but you had to know to seek it out.
For instance, if you had a dictionary with a few hundred keys
ourDict = {k: k for k in range(1000)}
you could scan through each key in neat, iteration fashion using the iterkeys() method
for key in ourDict.iterkeys():
    print key
Whereas in Python 3, we just use keys()
for key in ourDict.keys():
    print(key)
And that's it. It's just the default behavior in 3. You don't have to explain to a beginner working in 2 why they should be using the iterator version of each function-- to soap-box about code scalability while they're more concerned about getting the loop to run correctly.
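One nuance worth a quick sketch: in Python 3, keys() hands back a lightweight view object rather than an iterator proper. It's still iterable and doesn't copy anything up front. The 'new' key below is made up for the demo.

```python
ourDict = {k: k for k in range(1000)}

keys = ourDict.keys()
print(type(keys))     # <class 'dict_keys'>

# No copy of the keys was made; the view tracks the dictionary
ourDict['new'] = 'entry'
print(len(keys))      # 1001
print('new' in keys)  # True

# If you really do want the Python 2 style list, ask for it explicitly
keyList = list(ourDict.keys())
print(len(keyList))   # 1001
```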
Closing Thoughts¶
As far as the other features go, I'm certain that any Python veterans reading this have an opinion or two on some glaring exclusions I made (unicode, argument packing, and string formatting come to mind). Ultimately, though, this was what put the fear of Guido in me to start porting/evangelizing a move from 2 to 3.
I spend a good deal of my time teaching others how to get going in Python, and I care a lot about setting them up to write performant code. And so this shift to "slap an iterator on it" by default is certainly a welcome one, as it makes that job substantially easier on me.
To that end, I'd encourage anyone who found the "build a data road map, and leisurely scan through it" process interesting to go down a rabbit hole by Googling:
- Functional programming
- Lazy Evaluation
- PySpark
- itertools
This stuff is basically magic.
Cheers,
-Nick