Data Munging with Pandas
I’m almost embarassed I haven’t heard of pandas until now. pandas is a suite of python tools for data munging, and works more or less like a really fast Excel spreadsheet. I find it a pain to get data in a form I can use, and usually my first instinct is to just hack together something workable and then leave it. But that is sloppy in practice, and not very satisfying. So I’ll redo some parts of my last post on rating teams, so you can see it in action.
So before when I wanted to get a list of NFL teams for my ranking programs, I did something like this:
Not really intuitive what I did there, mostly a lot of hacking to get the list of teams in a usable fashion for the rest of the program. It returns a python dictionary, and it looks like this:
Let’s do it a better way.
Here is how we do it with
pandas (imported as
A mere four lines of code, each with a distinct (and clear) purpose! First line, open the webpage containing data. Second line, read in the data to a pandas
DataFrame object (think a spreadsheet). Third, name the columns, and finally return the object, which looks like so:
The first column is the index, which starts at zero in a pythonic manner. The second column is team ID, which is what the data itself uses. So don’t get confused by those numbers. Having a
DataFrame object allows us to easily index the columns and rows. It also easily allows us to add columns to it. So say we get a rating vector, and want to pair it up with our team names. Before I did it like so:
Again, kind of a hack to stack the lists (via
vstack), then sort them by rating, which involves transposing the array, using tuple keys…and if you’ve never done this before you have to look up all the options. And then you have to write to file in such a way as to link up team names with the ID in the vector…it’s messy and I don’t like it. So let’s try it with
pandas. You saw the team object before, right? Now we do:
Done. (Of course we can add different formatting options, too). Add a column titled
rating to our teams object, sort by the column
rating, in descending order, then write to file. SO EASY. And here we have it:
It’s just so simple, and I’m really glad that I found this module for my python programs!
And as a bonus for reading this far, here is a picture of a laughing panda.