Data Munging with Pandas

I’m almost embarassed I haven’t heard of pandas until now. pandas is a suite of python tools for data munging, and works more or less like a really fast Excel spreadsheet. I find it a pain to get data in a form I can use, and usually my first instinct is to just hack together something workable and then leave it. But that is sloppy in practice, and not very satisfying. So I’ll redo some parts of my last post on rating teams, so you can see it in action.

So before when I wanted to get a list of NFL teams for my ranking programs, I did something like this:

  
def get_teams():  
    """ Get team data from masseyratings.com,  
      parse it into a dictionary with team number  
      as the key, team name as the value  
    """  
    f = urllib.urlopen("http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&exhib=on&format=2")  
    s = f.read().split()  
    my_teamnames = {}  
    for i in range(0,len(s)/2):  
        my_teamnames.update({i : s[i*2 + 1]})  
    return my_teamnames

Not really intuitive what I did there, mostly a lot of hacking to get the list of teams in a usable fashion for the rest of the program. It returns a python dictionary, and it looks like this:

    {0: 'Arizona',
     1: 'Atlanta',
     2: 'Baltimore',
     ...
     29: 'Tampa_Bay',
     30: 'Tennessee',
     31: 'Washington'}

Let’s do it a better way.
Here is how we do it with pandas (imported as pd):

  
def get_teams():  
    team_data = urllib2.urlopen('http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&format=2')  
    teams = pd.read_csv(team_data, header=None)  
    teams.columns = ['Team_ID', 'Team_name']  
    return teams

A mere four lines of code, each with a distinct (and clear) purpose! First line, open the webpage containing data. Second line, read in the data to a pandas DataFrame object (think a spreadsheet). Third, name the columns, and finally return the object, which looks like so:

    Team_ID Team_name
    0 1 Arizona
    1 2 Atlanta
    2 3 Baltimore
    ... ... ...
    29 30 Tampa_Bay
    30 31 Tennessee
    31 32 Washington

The first column is the index, which starts at zero in a pythonic manner. The second column is team ID, which is what the data itself uses. So don’t get confused by those numbers. Having a DataFrame object allows us to easily index the columns and rows. It also easily allows us to add columns to it. So say we get a rating vector, and want to pair it up with our team names. Before I did it like so:

  
def write_sorted(teams,rating):  
    combine = np.vstack((rating,teams))  
    sort_combine = np.array(sorted(combine.T,key=tuple,reverse=True))  
    f = open(str(filename)+'.txt','w')  
    for i in range(0,len(teams)):  
        f.write(str(team_name[int(sort_combine[i,1])])+', '+str(sort_combine[i,0])+'\n')  
    f.close()

Again, kind of a hack to stack the lists (via vstack), then sort them by rating, which involves transposing the array, using tuple keys…and if you’ve never done this before you have to look up all the options. And then you have to write to file in such a way as to link up team names with the ID in the vector…it’s messy and I don’t like it. So let’s try it with pandas. You saw the team object before, right? Now we do:

  
def write_sorted(teams,rating,filename):  
    teams['rating'] = rating  
    teams = teams.sort(columns='rating',ascending=False)  
    teams.to_csv(filename)

Done. (Of course we can add different formatting options, too). Add a column titled rating to our teams object, sort by the column rating, in descending order, then write to file. SO EASY. And here we have it:

    Team_ID,Team_name,rating
    13, Houston,0.7603317475519333
    19, New_England,0.7582151079219482
    16, Kansas_City,0.7420115719733623
    ...
    15, Jacksonville,0.30074546774453426
    18, Minnesota,0.2838074821824212
    25, Pittsburgh,0.24660385713879754
    32, Washington,0.20859838193293445

It’s just so simple, and I’m really glad that I found this module for my python programs!

And as a bonus for reading this far, here is a picture of a laughing panda.

Colley Method for Ranking NFL Teams

The NFL season started this week, which means it is a perfect time to set up a program for ranking teams! My fiancee bought me a book last Christmas called Who’s #1, which is an awesome book on how ranking and rating methods work. Specifically the book is geared towards sports, but I was surprised to learn how many companies use ranking algorithms — for example, Google and Netflix. I am — by and large — ignorant of most things sports related, but this past year I had lived with five other guys who love sports and who had invited me to join their March Madness bracket group. Since I don’t really follow college basketball, but I wanted a shot at winning our groups bracket challenge (there was a cash prize!), I used the methods in the book to rank college basketball teams and fill out the brackets algorithmically. I ended up taking the second place prize out of a group of nearly 25 brackets in our group, and in the top 5% nationally. I took first in my bracket group I did with the other graduate students…but I didn’t put any money in that one! (Bummer…)

One of the first methods I learned is called the Colley Method, and you can read all about it here. You might recognize it as one of the key computer methods in college football’s Bowl Championship Series (no comment on that). The method takes the win-loss record of each team, assigns them a numerical rating based on who they beat (or lost to), and ranks them according to that rating. The big idea is that if you beat a very good team, it should count more than if you beat a poor team, and if you lose to a very good team, you should not be ‘penalized’ as much as if you lose to a terrible team. In a sense, it works by interconnecting teams based on who they won and lost against — kind of like a pseudo-social network. If you beat me, and I beat your friend, then we assume you would have beaten your friend, even if you never go to play them. Hopefully that makes sense!

Anyway, if you want to try it yourself, here is a link to the code that ranks both the 2012 and 2013 NFL teams. I included an option for preseason games, since there is (at the time of this writing) only one regular season game that has been played in the NFL. Note that you will need the numpy module for this to work.

To run it, type

python ranking.py -e -y 2013

This will call all the other python files to fetch the data from http://www.masseyratings.com and will then process it an implement the Colley method, giving you a colley.txt file containing the ratings of each team (sorted of course). The flag -y gives input for the year (2012 or 2013), and the -e flag, if present, includes exhibition matches as well (e.g. preseason games).

Now for the guts of the code (for those who are interested).

First we need a way to get the data from masseyratings.com, and put the games into an array as well as get our teams. Because the games use an integer to represent the teams, I create a python dictionary to map the number to the team name. So getdata.py does precisely this:

  
def get_teams():  
     """ Get team data from masseyratings.com,  
       parse it into a dictionary with team number  
       as the key, team name as the value  
     """  
     f = urllib.urlopen("http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&exhib=on&format=2")  
     s = f.read().split()  
     my_teamnames = {}  
     for i in range(0,len(s)/2):  
         my_teamnames.update({i : s[i*2 + 1]})  
     return my_teamnames

def get_games(year,exhibition):  
    """ Get game data from masseyratings.com,  
      parse it into a numpy array.  
    """  
    if year == 2013:  
        if exhibition == True:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&exhib=on&format=1')  
        elif exhibition == False:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&format=1')  
        else:  
            sys.exit('"exhibition" must be "True" or "False"')  
    elif year == 2012:  
        if exhibition == True:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=181613&sub=181613&all=1&mode=3&exhib=on&format=1')  
        elif exhibition == False:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=181613&sub=181613&all=1&mode=3&format=1')  
        else:  
            sys.exit('"exhibition" must be "True" or "False"')  
    else:  
        sys.exit('Not a valid year')  
    s = f.read()  
    if exhibition == False:  
        file_name = str('games_'+str(year)+'.txt')  
    elif exhibition == True:  
        file_name = str('games_'+str(year)+'_exhib.txt')  
    k = open(file_name,'w')  
    k.write(s)  
    k.close()  
    f.close()

    my_games = genfromtxt(file_name, dtype = None, delimiter=',')

    return my_games

Once we have our games and teams, we need to set up the Colley matrix (‘A’), and solve it against Colley’s ‘b’ vector. The colley.py takes care of this. We take data about the games, and set up a square matrix with the number of teams as the dimension. Where teams intersect contains data about how they fared against each other. The mathematical details can be found in the Who’s #1 and in the link to the Colley Method I gave. The end result of this code is a rating vector (called rating here). Anyway, here is that code:

  
def colley(num_teams,num_games,my_teams,my_games):

    A = np.zeros((num_teams,num_teams))  
    wins = np.zeros(num_teams)  
    losses = np.zeros(num_teams)  
    total = np.zeros(num_teams)
    
    for game in range(0,num_games):  
        w = my_games[game,2] - 1.0  
        l = my_games[game,5] - 1.0
    
        A[w,w] = A[w,w] + 1.0  
        A[l,l] = A[l,l] + 1.0  
        A[l,w] = A[l,w] - 1.0  
        A[w,l] = A[w,l] - 1.0
    
    for i in range(0,num_teams):  
        total[i] = total[i] + wins[i] + losses[i]
    
    b = np.zeros(num_teams)  
    for i in range(0,num_teams):  
        A[i,i] = 2 + total[i] + A[i,i]  
        b[i] = 1 + (wins[i] - losses[i])/2
    
    rating = np.linalg.solve(A,b)  
    return rating

Once we have these, we pair it up with our dictionary element containing the team names, and sort it. This is the finalsort.py code. I wrapped everything in the file ranking.py if you are checking this out in my repo. It takes care of making all the files play nicely together.

So now the big question is, how does it fare? Here is the list as of today (9/7/13):

    Seattle, 0.806187648116
    Detroit, 0.76229943304
    Washington, 0.734350377447
    Cleveland, 0.691797811483
    New_England, 0.666157176065
    New_Orleans, 0.657992419938
    NY_Jets, 0.652519130556
    San_Francisco, 0.648158042535
    Arizona, 0.647296343048
    Houston, 0.647271431013
    Denver, 0.614105025386
    Carolina, 0.600878984866
    Buffalo, 0.563322480133
    Cincinnati, 0.557969510878
    Philadelphia, 0.53328007273
    Indianapolis, 0.525416655862
    Chicago, 0.522801819437
    Kansas_City, 0.479567506516
    Dallas, 0.453378981006
    San_Diego, 0.437407308856
    Miami, 0.414388770684
    Oakland, 0.406726811416
    Green_Bay, 0.37888674304
    Tampa_Bay, 0.36195341794
    Minnesota, 0.357868414451
    Baltimore, 0.356824183447
    St_Louis, 0.340268960559
    NY_Giants, 0.339410132678
    Tennessee, 0.288458533025
    Jacksonville, 0.280125144891
    Pittsburgh, 0.192367833585
    Atlanta, 0.0805628953732

Not surprisingly, Seattle is at the top (Woot woot! Go Hawks!), having finished the preseason perfectly against tough teams. Atlanta is at the bottom, having lost every preseason game. Generally, the ratings stack up similar to how the teams did in the preseason.

Also not surprising? How poor of a representation I think it really is. Preseason games are the time where teams test out their second and third strings and avoid playing their top players. So I think, for example, Atlanta will fare a lot better than this preseason ranking suggests. While I like the Colley Method’s simplicity, I think it performs poorly as a predictive algorithm. The Massey method, on the other hand, is much better at predicting future match-ups. This was my experience with it in March Madness anyway. Google it if you are curious. If you are serious about doing any sports ranking, I also want to suggest that you take a look at weighting methods…for example, games played later in the season ‘count more’. I think that is a critical step to getting predictive algorithms. How to do that well, though, is more of an art :)

If you have any questions or comments, let me know!

Gaussian Matrix Parser

I share a lot of code here on my blog, and for most of the quantum chemistry programs to work, I need some inputs, such as one and two electron integrals, overlap matrices, and the like. To be honest, aside from working through the Szabo and Ostlund integral evaluation code, I have no idea how to compute these quantities in practice. I know there are plenty of good algorithms out there (PRISM comes to mind), but I’ve never taken the time to work them out for myself (future project maybe?). Instead, I parse the codes directly from Gaussian09 output files. It’s a little time consuming, to do that for each molecule, so I wanted to share my code with you.

Here is a link to it: gauss_parse.

The program, Gaussian Matrix Parser (gauss_parse), is a command line routine to extract integrals from Gaussian output files and write them to text files as a 2D array. gauss_parse takes the .log file as an argument, and then creates a directory with (named after the logfile), and fills it with the matrices you want, e.g:

python gauss_parse.py my_molecule.log

This program assumes you have included the extra print options for matrices, e.g for minimal basis water, our h2o.com would look like:

#P rhf/STO-3G scf(conventional) Iop(3/33=6) Extralinks=L316 Noraff Symm=Noint Iop(3/33=1) pop(full)

Minimal basis H2O

0 1  
H  
O 1 r1  
H 2 r1 1 a1

r1 1.1  
a1 104.0

Which has a suitable route section to print overlap, kinetic energy, two-electron repulsion, and potential energy integrals (and more)!

Say you have installed to a folder called gauss_parse. From the terminal, cd to that directory, and list the contents:

$ cd gauss_parse
$ ls
    LICENSE.txt README gauss_parse.py h2o.log
$ python gauss_parse.py h2o.log
$ ls
    LICENSE.txt README gauss_parse.py h2o.log h2o
$

Note how h2o has been added to the directory. cd into that directory, and you should see:

$ cd h2o
$ ls
    kinetic_energy.dat number_basis_functions.dat
    overlap.dat two_electron_ints.dat
    nuclear_repulsion.dat number_electrons.dat
    potential_energy.dat
$

Each .dat file contains the information extracted from the Gaussian .log file, which you can use for whatever you like! For example, say we want to see the overlap matrix for our water molecule:

$ cd h2o
$ vim overlap.dat

And we see in vim:

1.0000e+00 3.8405e-02 3.8613e-01 0.0000e+00 2.6843e-01 -2.0972e-01 1.8176e-01  
3.8405e-02 1.0000e+00 2.3670e-01 0.0000e+00 0.0000e+00 0.0000e+00 3.8405e-02  
3.8613e-01 2.3670e-01 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 3.8613e-01  
0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00  
2.6843e-01 0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00 -2.6843e-01  
-2.0972e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 -2.0972e-01  
1.8176e-01 3.8405e-02 3.8613e-01 0.0000e+00 -2.6843e-01 -2.0972e-01 1.0000e+00

With the ones along the diagonal — just as expected! (Actual precision in gauss_parse is much higher).

Davidson's Method

Many times in quantum chemistry you want the lowest few eigenvalues of very large matrices. For example, in my work with quantum dots, we want the first few excitation energies of our dots from a TDDFT calculation. Our smallest systems have 1000 basis functions with roughly 300 electrons, which requires creating and diagonalizing a 210,000 by 210,000 matrix (the dimension is number occupied orbitals times the number unoccupied orbitals). This is way too huge! For perspective, just storing this matrix on disk requires about 300 GB space.

Ernest Davidson ran into this problem decades ago working on diagonalizing the Configuration Interaction Hamiltonians, which are generally enormous. He came up with a method — the so-called Davidson method – which iteratively diagonalizes a subspace of the matrix instead of the whole thing, and gives you the first few lowest (or highest!) eigenvalues. It is much more cost-efficient, and actually doesn’t require you to create the whole matrix in the first place (it projects the matrix onto an appropriate subspace instead). The method turned out to work so well that quantum chemists adopted it and have used it ever since.

I wanted to try it out, so I implemented Davidson’s method on a Hermitian matrix (in Python, of course :)).

Let’s step through what I’ve done and see if it makes this method any clearer. The first bit simply creates our fake Hamiltonian.

  
sparsity = 0.000001  
A = np.zeros((n,n))  
for i in range(0,n):  
    A[i,i] = i + 1  
A = A + sparsity*np.random.randn(n,n)  
A = (A.T + A)/2

While it may look arbitrary, take a closer look at the structure. First, it is diagonally dominant. The diagonal is filled with increasing integers, while the off-diagonals are random numbers multiplied by a scaling factor to “mute” them somewhat. This adds sparsity. Davidson’s method really excels with sparse, diagonally dominant matrices. This is actually very similar to the Hamiltonians we encounter as quantum chemists. If you scale the sparsity down and approach zero, the Davidson method speeds up fast. (But don’t set it to exactly zero or it will crash — in this case the matrix is already diagonalized!)

Next we set up our subspace “trial vectors”:

  
k = 8 # number of initial guess vectors  
eig = 4 # number of eignvalues to solve  
t = np.eye(n,k) # set of k unit vectors as guess  
V = np.zeros((n,n)) # array of zeros to hold guess vec  
I = np.eye(n) # identity matrix same dimen as A

Since we are choosing to find the first four eigenvalues, we need at least four guess vectors k. In practice, we choose maybe twice to three times that, because we want to increase the span of our guess space. In other words, it helps us hone in on the appropriate eigenvectors faster. But don’t make the guess too big! If it gets too large, we basically end up diagonalizing the whole matrix — which we don’t want to do, since that is the whole point of Davidson’s method. Speaking of the span of the guess vectors, it is important to make a good initial guess. Because the matrix is diagonally dominant, I chose a set of unit vectors as my guess, which is a good since our matrix is so close to being scalar multiples of the identity matrix.

Finally we get to the meat of the main routine:

  
for m in range(k,mmax,k):
    if m <= k:
        for j in range(0,k):
            V[:,j] = t[:,j]/np.linalg.norm(t[:,j])
        theta_old = 1 
    elif m > k:
        theta_old = theta[:eig]
    V[:,:m],R = np.linalg.qr(V[:,:m])
    T = np.dot(V[:,:m].T,np.dot(A,V[:,:m]))
    THETA,S = np.linalg.eig(T)
    idx = THETA.argsort()
    theta = THETA[idx]
    s = S[:,idx]
    for j in range(0,k):
        w = np.dot((A - theta[j]*I),np.dot(V[:,:m],s[:,j])) 
        q = w/(theta[j]-A[j,j])
        V[:,(m+j)] = q
    norm = np.linalg.norm(theta[:eig] - theta_old)
    if norm < tol:
        break

(Thanks to Matthew Goldey for pointing out ways to speed up the code!)

We first check to see if this is our first iteration (e.g. m < k). If it is, we add our guess vectors to our set V, and set theta_old equal to one. theta_one equal to one is arbitrary, and is used to ensure we don’t “converge” on our first try (we can’t, since we have to compare at least two iterations to determine convergence). If this isn’t our first iteration, we set theta_old to our last iteration’s eigenvalues.

Next we ensure our vectors are orthonormal among each other with numpy’s QR decomposition routine. Vectors must be orthonormal or the routine will fail. Then we project our matrix A onto the subspace defined by our guess vectors, \(\mathbf{V}^{T}\mathbf{A}\mathbf{V}\). Then we diagonalize this matrix projection subspace, sort the eigenvalues to find the lowest few desired, and compute the residual r. The residual will be zero if the guess eigenvectors are the real eigenvectors. If the residual is lower than some criterion, then we quit. Else, we orthonormalize the eigenvectors from that iteration and add them to our guess vectors! In this way, our subspace grows each time we iterate…the hope is that each added vector will be in the span of the real eigenvectors. Once that happens, we will have solved the problem of getting the lowest few eigenvalues.

Here is some sample output:

davidson = [0.99999921 2.00000133 3.00000042 3.99999768] ; 0.6596 seconds  
numpy = [0.99999921 2.00000133 3.00000042 3.99999768] ; 1.7068 seconds

You can see it works! (And the eigenvalues are really similar to the integer values we put along the diagonal).

I’ve attached the full routine at the end. With a sparse enough matrix, I can beat numpy by about a second. Of course, comparisons aren’t really fair, since I make numpy compute all the eigenvalues. From what I know, this method has a lot of intricacies, and I am still learning many of them. If you know of any ways I can improve this code, let me know! Comments and messages are always appreciated.

#!/bin/python
from __future__ import division
from __future__ import print_function
import math
import numpy as np
import time

''' Block Davidson, Joshua Goings (2013)

    Block Davidson method for finding the first few
	lowest eigenvalues of a large, diagonally dominant,
    sparse Hermitian matrix (e.g. Hamiltonian)
'''

n = 1200					# Dimension of matrix
tol = 1e-8				# Convergence tolerance
mmax = n//2				# Maximum number of iterations	

''' Create sparse, diagonally dominant matrix A with 
	diagonal containing 1,2,3,...n. The eigenvalues
    should be very close to these values. You can 
    change the sparsity. A smaller number for sparsity
    increases the diagonal dominance. Larger values
    (e.g. sparsity = 1) create a dense matrix
'''

sparsity = 0.0001
A = np.zeros((n,n))
for i in range(0,n):
    A[i,i] = i + 1 
A = A + sparsity*np.random.randn(n,n) 
A = (A.T + A)/2 


k = 8					# number of initial guess vectors 
eig = 4					# number of eignvalues to solve 
t = np.eye(n,k)			# set of k unit vectors as guess
V = np.zeros((n,n))		# array of zeros to hold guess vec
I = np.eye(n)			# identity matrix same dimen as A

# Begin block Davidson routine

start_davidson = time.time()

for m in range(k,mmax,k):
    if m <= k:
        for j in range(0,k):
            V[:,j] = t[:,j]/np.linalg.norm(t[:,j])
        theta_old = 1 
    elif m > k:
        theta_old = theta[:eig]
    V[:,:m],R = np.linalg.qr(V[:,:m])
    T = np.dot(V[:,:m].T,np.dot(A,V[:,:m]))
    THETA,S = np.linalg.eig(T)
    idx = THETA.argsort()
    theta = THETA[idx]
    s = S[:,idx]
    for j in range(0,k):
        w = np.dot((A - theta[j]*I),np.dot(V[:,:m],s[:,j])) 
        q = w/(theta[j]-A[j,j])
        V[:,(m+j)] = q
    norm = np.linalg.norm(theta[:eig] - theta_old)
    if norm < tol:
        break

end_davidson = time.time()

# End of block Davidson. Print results.

print("davidson = ", theta[:eig],";",
    end_davidson - start_davidson, "seconds")

# Begin Numpy diagonalization of A

start_numpy = time.time()

E,Vec = np.linalg.eig(A)
E = np.sort(E)

end_numpy = time.time()

# End of Numpy diagonalization. Print results.

print("numpy = ", E[:eig],";",
     end_numpy - start_numpy, "seconds")

Monte Carlo Simulation of the "Bookshelf" Problem

I had a friend share with me this probability problem her professor had posed to the class — and, yes, it was actually posed this colorfully:

“During a massive earthquake, all twelve of your books — of which five are mathematical, four are literary, and three are reference — are cast haphazardly across your living room. A fireman graciously put them back on the shelf while your wounds are being tended; however, he simply decides to randomly arrange them. What is the probability that the books are in natural sets; i.e., all mathematical books are adjacent, and all reference books are adjacent?”

Okay. Now, probability is one of those things I love doing, but am absolutely horrid at. I had no idea how to approach the problem, so I thought it would be a good candidate for a Monte Carlo simulation instead. The idea behind a Monte Carlo simulation comes from the frequentist school of probability. The idea is that the probability of an event happening is the relative frequency of that event happening in the limit of a large number of trials. For example, if we were to say that flipping a coin has a 50% probability of landing ‘heads’ and a 50% probability of landing ‘tails’, this would be equivalent to saying that if we were to flip a coin two million times, we’d expect it to land heads 1 million of those times. Now flipping a coin two million times (or more!) is tedious and terrible work for people, but computers excel at doing the same thing over and over again. So with a suitable random number generator, computers can accurately reproduce probabilities of events, especially if the exact probability is not as obvious as flipping a coin. Like arranging a bookshelf, say.

In the bookshelf problem, we can program our computer to take the 12 books, arrange them at random, and then count up the times it formed a ‘natural set’. Simply divide the number of times the bookshelf was ordered by the number of times we arranged the bookshelf, and we get a probability for an ordered bookshelf! Here is an example of how I solved it in Python:

First, import the random and itertools modules, to help us generate random numbers and iterate among the lists. Then define two functions: one to check if two lists (or parts of lists) have all the same elements, and one to generate random numbers from 1 to 12 inclusive.

#!/bin/python

from __future__ import division  
import random  
import itertools  
from time import time

# returns True if lists are the same, False if not  
def checkEqual(lst):  
    return lst[1:] == lst[:-1]

# Get random integer from [1,12]  
def rnum():  
    return random.randint(0,12)

Then we build our bookshelf. You can see that — pre-earthquake — the books formed a so-called ‘natural set’. We initialize the total number of times we arrange our shelf, as well as the number of ordered arrangements, to zero. We haven’t done anything yet!

Then we begin the main loop. Each iteration of the loop does two big things: first, it randomly arranges the bookshelf, and two, it checks to see if it forms a natural set. The checking is where you see all the “if/elif/else” sections of code. Since there are only 6 ways we can form a natural set (3! ways to arrange 3 types of books), I just go ahead and check each case explicitly.

#!/bin/python

from __future__ import division
import random
import itertools
from time import time

# returns True if lists are the same, False if not
def checkEqual(lst):
  return lst[1:] == lst[:-1]

# Get random integer from [1,12]
def rnum():
  return random.randint(0,12)

# Build our bookshelf!
bookshelf = ['math','math','math','math','math','lit','lit','lit','lit','ref','ref','ref']
TOTAL = 0
ORDERED = 0
t0 = time()
for i in range(0,5000000):
  TOTAL += 1   # Count number of times we put the bookshelf back together
  pile = [] # Nothing in our pile yet!
  pile += bookshelf # Books falling off into our pile!
  new_shelf = [] #Now the shelf is empty!
  while len(pile) > 0:
    book = random.choice(pile) # Put the books on the shelf at random
    new_shelf.append(book)
    pile.remove(book) # remove book from pile once we put it on the shelf

  # At this point, we evaluate how the shelf is arranged. There are 3! ways
  # of arranging the shelf so that it it ordered, and we evaluate each one to see
  # if the new shelf fits the 'natural set' criteria.
  if checkEqual(new_shelf[:5]):
    if checkEqual(new_shelf[5:9]):
      if checkEqual(new_shelf[9:]):
        ORDERED +=1
    elif checkEqual(new_shelf[5:8]):
      if checkEqual(new_shelf[8:]):
        ORDERED +=1
  if checkEqual(new_shelf[:4]):
    if checkEqual(new_shelf[4:9]):
      if checkEqual(new_shelf[9:]):
        ORDERED +=1
    elif checkEqual(new_shelf[4:7]):
      if checkEqual(new_shelf[7:]):
        ORDERED +=1
  if checkEqual(new_shelf[:3]):
    if checkEqual(new_shelf[3:8]):
      if checkEqual(new_shelf[8:]):
        ORDERED +=1
    elif checkEqual(new_shelf[3:7]):
      if checkEqual(new_shelf[7:]):
        ORDERED +=1
t1 = time()
print 'ORDERED = ',ORDERED
print 'TOTAL = ',TOTAL
print 'PERCENT = ',(ORDERED/TOTAL)*100,'%'
print 'TIME = ',round(t1-t0,5),'seconds'

And we’re done! Running the code for a suitable number of iterations gives the probability that the books are ordered. More simulations gives a more accurate answer. We’d get the perfect answer if we ran this for an infinite amount of times. Running it 1 million times gave me a probability of 0.0228%. Of course, since they were random trials, this number may change a bit, and if you run the code, you may get a different number — though hopefully not too different!

So how does this compare? The exact answer is 0.02165%. We get this by treating the math books as one set, the literary books as one set, and the reference books as one set. For these sets, there are 3! ways to arrange them. Inside each set, there are 5! ways to arrange the math books, 4! ways to arrange the literary books, and 3! ways to arrange the reference books. Thus we have (3!5!4!3!) ways to arrange the books in a natural set. Ignoring order, since we have 12 books total, we have 12! ways to arrange the books in any order. This gives us our final probability of (3!5!4!3!)/(12!) or 0.02165%.

So our Monte Carlo simulation works, though it obviously does not get the exact answer. This is not surprising, considering we performed only 1 million trials — which seems like a lot, but for a computer, it isn’t. To really reproduce the data, we’d need the total number of trials to come close to the 12! we had in the exact denominator…about 479 million times. For many problems, a big challenge is determining how many trials you have to perform for confident results.

All said and done: don’t bet on that dashing firefighter putting your books back in order :)

UPDATE: I was still curious about how the simulation would run if we approached the total number of ways to arrange the books — that is, somewhere around 479 million trials. So I uploaded my program to our group’s computing clusters, and parallelized the run over 5 processors, each running 100 million trials. The results? Out of 500 million trials, 108084 of them were ordered. This gives us a probability of 0.021617%, which is accurate to +/- 0.0001%! Not bad at all. In case you were curious, each 100 million trials took about 45 minutes. Glad I ran them on separate nodes — otherwise I’d be waiting almost 4 hours! (Which admittedly is not very long in scientific computing…but considering the problem we are dealing with, even 45 minutes is excessive).

Older Newer

Joshua Goings Blog Publications

Data Munging with Pandas

Colley Method for Ranking NFL Teams

Gaussian Matrix Parser

Davidson's Method

Monte Carlo Simulation of the "Bookshelf" Problem