Thouless Theorem

This cat knows the importance of the exp(T1) operator.

Kind of a dry post today (hey, I put in a cat picture at least)…but I had to share a theorem – the Thouless theorem – that is so important in computational and quantum chemistry, and yet is really quite unknown. It ties together Hartree-Fock, Brillouin’s theorem, the successes of coupled cluster, and even some MCSCF ideas. It basically explains why single excitation operators are so important and useful in quantum chemistry, and what they are actually doing when we run our calculations and devise new methods. For someone working closely in electronic structure, it is hard to overstate the importance of this theorem. Also, the proof is kind of hard to a) find and b) understand. So I tried to make it simpler and more accessible!

The Thouless theorem (Thouless, 1960) says that the effect of the \({e^{\hat{T}_1}}\) operator is to transform any single determinant into any other (non-orthogonal!) single determinant. In other words, we can write
\(\vert \phi\rangle = [\hbox{exp}\left(\sum\limits_{i=1}^{N}\sum\limits_{a = N+1}^{\infty} t_i^a a^{\dagger}_{a}a_i\right)]\vert \phi_0\rangle \ \ \ \ \ (1)\)
or
\(\vert \phi\rangle= e^{\hat{T}_1} \vert \phi_0 \rangle \ \ \ \ \ (2)\)

In other words, we can generate any single determinant in terms of another determinant with virtual orbitals (or single excitations) mixed in with it. This is sometimes called “orbital rotation”. If we minimize the energy of the Slater determinant with respect to the rotation parameters (the \({t_i^a}\) amplitudes), we get the Hartree-Fock wavefunction out. This is the Brillouin condition, and it says that the Hartree-Fock reference will not interact (directly) with singly excited determinants. And of course it shouldn’t! We just variationally optimized with respect to those parameters! The Thouless theorem also explains why there is no such thing as a coupled cluster singles equation (it just gives Hartree-Fock back). Now, we do include single excitations in CCSD, but they work themselves in because of correlation generated by higher excitations (doubles, triples, etc.). The Thouless theorem also explains why CCSD and higher methods are so insensitive to the choice of reference: you generate the optimal reference through the \({\hat{T}_1}\) operator. I’m sure there are plenty more uses (you see them a lot in MCSCF, actually) in the quantum chemistry literature.
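For the record, Brillouin’s theorem in equation form just says that the optimized reference has a vanishing Hamiltonian matrix element with every singly excited determinant:

\(\langle \phi_{HF} \vert \hat{H} \vert \phi_i^a \rangle = 0\)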

Let’s say we have a single N-electron Slater determinant, written in second quantization as

\[\vert \phi_0 \rangle = a^{\dagger}_{1}a^{\dagger}_{2}\cdots a^{\dagger}_{N}\vert 0\rangle = \prod\limits_{i=1}^{N} a^{\dagger}_{i} \vert 0\rangle \ \ \ \ \ (3)\]

where \({N}\) is the number of electrons in our aptly named N-electron wavefunction, and \({\vert 0\rangle}\) is a vacuum. Let’s say we know what this wavefunction \({\vert \phi_0\rangle}\) is, which is to say it will be our reference determinant. This might be the result of a Hartree-Fock calculation, for example. Now say we want to make a new single determinant \({\vert \phi\rangle}\) in terms of our reference determinant (I’ll use determinant and wavefunction interchangeably). First we write our wavefunction:
\(\vert \phi \rangle = \tilde{a}^{\dagger}_{1}\tilde{a}^{\dagger}_{2}\cdots \tilde{a}^{\dagger}_{N}\vert 0\rangle = \prod\limits_{\alpha=1}^{N} \tilde{a}^{\dagger}_{\alpha} \vert 0\rangle \ \ \ \ \ (4)\)

The difference between \({a^{\dagger}}\) and \({\tilde{a}^{\dagger}}\) is simply that they create electrons in different (though not orthonormal) orbitals in our Slater determinant. We will assume that the sets of creation (and corresponding annihilation) operators form complete sets and are non-orthogonal with respect to each other. If this is the case we can write one set in terms of the other like so:
\(\tilde{a}^{\dagger}_{\alpha} = \sum\limits_{i}^{\infty} u_{\alpha i} a^{\dagger}_{i} \ \ \ \ \ (5)\)

This is just a change of basis operation. See, for example, Szabo and Ostlund p13. This suggests that we can write
\(\vert \phi\rangle = \prod\limits_{\alpha=1}^{N}\tilde{a}^{\dagger}_{\alpha} \vert 0\rangle = \prod\limits_{\alpha = 1}^{N} \left(\sum\limits_{i}^{\infty} u_{\alpha i} a^{\dagger}_{i}\right) \vert 0\rangle \ \ \ \ \ (6)\)

Following the presentation of Thouless, we then split the sum over occupied and virtual spaces (with respect to the reference):
\(\vert \phi\rangle = \prod\limits_{\alpha = 1}^{N} \left(\sum\limits_{i}^{\infty} u_{\alpha i} a^{\dagger}_{i}\right) \vert 0\rangle = \prod\limits_{\alpha = 1}^{N} \left(\sum\limits_{i}^{N} u_{\alpha i} a^{\dagger}_{i} + \sum\limits_{m= N + 1}^{\infty} u_{\alpha m} a^{\dagger}_{m}\right) \vert 0\rangle \ \ \ \ \ (7)\)

Alright. Since we assumed \({\vert \phi\rangle}\) and \({\vert \phi_0\rangle}\) were not orthogonal, we can choose an intermediate normalization, setting
\(\langle \phi_0 \vert \phi \rangle = 1 \ \ \ \ \ (8)\)

This further implies (again, check out Szabo and Ostlund p13) that the transformation matrix composed of \({u_{\alpha i}}\) is unitary when \({\alpha}\) and \({i}\) run from 1 to \({N}\) (remember, \({u}\) is actually a rectangular matrix according to our change-of-basis definition in Eq. (5)). This means the square \({N \times N}\) subsection of \({u}\) is invertible, and we represent its \({N \times N}\) inverse as \({U_{i\alpha}}\). Here’s another way to look at it, as well as a way to derive some of the properties of \({\mathbf{u}}\) and \({\mathbf{U}}\):
\(\mathbf{Uu} = \mathbf{U} \begin{bmatrix} \mathbf{u}_{(N\times N)} & \mathbf{u}_{N \times (N+1):\infty} \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{T} \end{bmatrix} \ \ \ \ \ (9)\)

where
\(\mathbf{U}\mathbf{u}_{(N\times N)} = \mathbf{I} \qquad \hbox{or} \qquad \sum\limits_{\alpha=1}^{N} U_{i \alpha} u_{\alpha j} = \delta_{i j} \ \ \ \ \ (10)\)

and conversely,
\(\mathbf{u}_{(N\times N)} \mathbf{U} = \mathbf{I} \qquad \hbox{or} \qquad \sum\limits_{i=1}^{N} u_{\alpha i} U_{i \beta} = \delta_{\alpha \beta} \ \ \ \ \ (11)\)

and that odd looking \({\mathbf{T}}\) matrix is the result of the parts of \({\mathbf{u}}\) that extend past the number of electrons \({N}\):
\(\mathbf{U}\mathbf{u}_{(N\times (N+1):\infty)} = \mathbf{T} \qquad \hbox{or} \qquad \sum\limits_{\alpha=1}^{N} U_{i \alpha} u_{\alpha m} = t_{mi} \ \ \ \ \ (12)\)

where \({i\leq N}\) and \({m > N}\) in the above equation. Now we put it all together. Because \({\mathbf{U}}\) is unitary, we can write \({N}\) linearly independent combinations of the creation operators, each of which we’ll call \({\bar{a}^{\dagger}_i}\). (Yes, we have now introduced three sets of creation operators, and yes this is essentially what Thouless does, but one of the sets is intermediate to our results and then we will go back to relating two sets. Trust me.)
\(\bar{a}_i^{\dagger} = \sum\limits_{\alpha =1}^{N} U_{i \alpha} \tilde{a}_{\alpha}^{\dagger} \ \ \ \ \ (13)\)

Now the fun happens: using Eqs. (5), (10), (11), and (12), we can rewrite the above in terms of just our reference creation and annihilation operators:

\[\bar{a}_i^{\dagger} = \sum\limits_{\alpha =1}^{N} U_{i \alpha} \sum\limits_{j=1}^{\infty} u_{\alpha j} a_j^{\dagger} \ \ \ \ \ (14)\]

\[\bar{a}_i^{\dagger} = \sum\limits_{\alpha =1}^{N} U_{i \alpha} \left( \sum\limits_{j=1}^{N} u_{\alpha j} a_j^{\dagger} + \sum\limits_{m=N+1}^{\infty} u_{\alpha m} a_m^{\dagger}\right) \ \ \ \ \ (15)\]

\[\bar{a}_i^{\dagger} = \sum\limits_{\alpha =1}^{N}\sum\limits_{j=1}^{N} U_{i \alpha} u_{\alpha j} a_j^{\dagger} + \sum\limits_{\alpha=1}^{N}\sum\limits_{m=N+1}^{\infty} U_{i \alpha} u_{\alpha m} a_m^{\dagger} \ \ \ \ \ (16)\]

\[\bar{a}_i^{\dagger} = \sum\limits_{j = 1}^{N} \delta_{i j} a_{j}^{\dagger} + \sum\limits_{m=N+1}^{\infty} t_{mi} a_{m}^{\dagger} \ \ \ \ \ (17)\]

\[\bar{a}_i^{\dagger} = a_{i}^{\dagger} + \sum\limits_{m=N+1}^{\infty} t_{mi} a_m^{\dagger} \ \ \ \ \ (18)\]

In other words, any new orbital may be generated by mixing in contributions from virtual orbitals. We apply this to generating any new single determinant below.
\(\vert \phi\rangle = \prod\limits_i^{N}\bar{a}_i^{\dagger} \vert 0\rangle \ \ \ \ \ (19)\)

\[\vert \phi\rangle= \prod\limits_i^{N}\left(a_{i}^{\dagger} + \sum\limits_{m=N+1}^{\infty}t_{mi}a_m^{\dagger}\right)\vert 0\rangle \ \ \ \ \ (20)\]

\[\vert \phi\rangle= \prod\limits_i^{N}\left(1 + \sum\limits_{m=N+1}^{\infty}t_{mi}a_m^{\dagger}a_i\right)a_i^{\dagger}\vert 0\rangle \ \ \ \ \ (21)\]

\[\vert \phi\rangle= \prod\limits_i^{N} \prod\limits_{m=N+1}^{\infty} \left(1 + t_{mi}a_m^{\dagger}a_i\right)\vert \phi_0\rangle \ \ \ \ \ (22)\]

\[\vert \phi\rangle = e^{\hat{T}_1}\vert \phi_0 \rangle \ \ \ \ \ (23)\]

We got to Eq. (22) from Eq. (21) by realizing that \({\prod\limits_i^{N} a_i^{\dagger} \vert 0\rangle = \vert \phi_0\rangle}\), and that \({a_ia_i\vert \phi_0\rangle = 0}\). The fact that every term in which the same creation operator appears more than once vanishes is also the reason we can write the infinite product as an exponential.
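To spell that last step out: each factor’s excitation operator is nilpotent, and excitation operators with different indices commute (the occupied and virtual index sets don’t overlap), so for every pair \({(i,m)}\)

\(\left(t_{mi} a^{\dagger}_{m} a_i\right)^2 = t_{mi}^2\, a^{\dagger}_{m} a_i a^{\dagger}_{m} a_i = -t_{mi}^2\, a^{\dagger}_{m} a^{\dagger}_{m} a_i a_i = 0 \qquad \Rightarrow \qquad e^{t_{mi} a^{\dagger}_{m} a_i} = 1 + t_{mi} a^{\dagger}_{m} a_i\)

and the product of all these exponentials is exactly \({e^{\hat{T}_1}}\).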

HT to Ismail Aydin and Matthias Degroote for pointing out some typos that have now been fixed!

Fragment generator for polypeptide mass spectrometry

Cobalt(II) Interface with ETR Important Residues

I have a friend here at the UW who is using mass spectrometry to identify proteins. Proteins, also called polypeptides, are made up of building blocks called amino acids. Because different polypeptides fragment differently, he can use the fragmentation patterns to determine what an unknown protein is. Kind of a cool idea, I think. The hope is to be able to use it for blood work analysis in medical settings, though I am sure there are plenty of other applications. In fact, C&EN just ran an article this past week on how mass spec is really getting used in biology. You can read it here (sorry, it’s behind a paywall…).

Anyway, he was trying to match combinations of amino acid residues to the different peaks he found. For example, if he found a peak at 500.38 m/z, he might assume he had a GLN-TRP-TRP fragment, which has a mass of 500.38 (or some other combination with that mass). The problem, then, is determining what different combinations of amino acids could fit the peaks! Now, there are really only 20 amino acids he could be looking at, so this turns out to be an interesting combinatorics problem: how many different ways can we combine amino acid residues to fit a given mass (or mass range)?

I realized it wouldn’t be too bad to write a program, and so I am sharing the script with you!

For this problem, we need a set of amino acids and their masses, so I just looked up the information and hard-coded it into the program. This is the set we will be drawing our combinations from, and it looks like this:

# Build list of amino acids  
amino_acids = ['ALA','ARG','ASP','ASN','CYS','GLU','GLN','GLY','HIS','ILE', \  
    'LEU','LYS','MET','PHE','PRO','SER','THR','TRP','TYR','VAL']  
amino_acid_MW = [71.09,156.19,114.11,115.09,103.15,129.12,128.14,57.05,137.14,113.16, \  
    113.16,128.17,131.19,147.18,97.12,87.08,101.11,186.12,163.18,99.14]  
# Combine them into one list, AA  
AA = zip(amino_acids,amino_acid_MW)

We zip them together to pair up the name of the residue (or amino acid) and its mass.
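In Python 2, zip returns a plain list of tuples (in Python 3 you would want list(zip(...)) here), so the first few entries of AA look like this:

    >>> AA[:3]
    [('ALA', 71.09), ('ARG', 156.19), ('ASP', 114.11)]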

Once we have this, we will start building our combinations. First we take all combinations of 1 amino acid, then 2 amino acids, then 3, and so on. We don’t need to explicitly build and check each combination if we know the chain will always be below the specified mass, or if the chain will always be too heavy for our desired target mass. For example, say we want all combinations of residues that have a mass between 500.3 and 500.5. In this case, no chain of 2 amino acids can reach this, because the largest residue we are working with has a mass of only 186.12! Similarly, if the lightest chain possible (a chain of glycines) is already too heavy, we had better stop right there, because no longer chain will ever fit either. So we have a check for whether the heaviest possible chain is too light (increment the chain length by one and continue) and whether the lightest possible chain is too heavy (break and exit). These bounds keep our computation relatively sane! Here it is in action:

for count in range(1,max_chain_length+1):  
    if count*57.05 > upper_bound_mass:      # even an all-glycine chain (lightest residue) is too heavy: stop entirely  
        break  
    elif count*186.12 < lower_bound_mass:   # even an all-tryptophan chain (heaviest residue) is too light: try a longer chain  
        continue  

Finally we need to generate our combinations. So inside the loop, we use itertools to generate all the unique combinations of length N from our set of amino acids. itertools has two tools for this: combinations and combinations_with_replacement. combinations won’t allow a residue to be used more than once, so we choose combinations_with_replacement. This allows for combinations such as a polyalanine chain, but it also allows for many more combinations, so it can get pretty slow! With our 20 amino acids, a chain of length 10 gives “20 choose 10” (about 185 thousand) possibilities with combinations, but “29 choose 10” (about 20 million) with combinations_with_replacement. Once we have our unique combinations, we sum the mass of each one, see if it matches the criteria, and if it does, we add it to a list that stores all the polypeptides we want. Here it is:

  
tot_combinations = itertools.combinations_with_replacement(AA,count)  
for combination in tot_combinations:  
    MWs = [x[1] for x in combination]  
    if sum(MWs) >= lower_bound_mass:  
        if sum(MWs) <= upper_bound_mass:  
            possible.append(combination) 
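To get a feel for how quickly the number of combinations blows up, here is a quick back-of-the-envelope count using the multiset formula (just for counting; it is not part of the fragment script):

from math import factorial

def num_chains(n_residues, length):
    # combinations with replacement: C(n + k - 1, k)
    n, k = n_residues, length
    return factorial(n + k - 1) // (factorial(k) * factorial(n - 1))

for k in (2, 5, 10):
    print k, num_chains(20, k)   # 210, 42504, and 20030010 candidate chains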

This function will then return a list of all the polypeptide chains that fit our mass criteria. All we have to do is print/write to file and go! This script gets slow quite quickly, as the counting above suggests. The number of candidate chains of length N drawn from our 20 residues grows as the binomial coefficient C(N+19, 19), which is already about 20 million at N = 10. I’m sure there is even more going on behind the scenes in the itertools module, but simply treating the creation of one unique combination as an “operation”, we have an incredibly slow scaling method. Do you know a better way to do this? Let me know! I’d love to hear how to make this more efficient, and I am certain that some computer scientist somewhere has explored this question before… Here is the whole script:

  
import random
import itertools
import csv
import time


# this is a function that returns a list of the possible combinations of amino acids that have a mass (in daltons) 
# greater than a lower bound, and lower than an upper bound, along with a maximum polypeptide chain length
def polypeptide_chain(max_chain_length,lower_bound_mass,upper_bound_mass):
    # Build list of amino acids
    amino_acids = ['ALA','ARG','ASP','ASN','CYS','GLU','GLN','GLY','HIS','ILE', \
                   'LEU','LYS','MET','PHE','PRO','SER','THR','TRP','TYR','VAL']
    amino_acid_MW = [71.09,156.19,114.11,115.09,103.15,129.12,128.14,57.05,137.14,113.16, \
                     113.16,128.17,131.19,147.18,97.12,87.08,101.11,186.12,163.18,99.14]
    # Combine them into one list, AA
    AA = zip(amino_acids,amino_acid_MW)
    possible = []
    for count in range(1,max_chain_length+1):
        if count*57.05 > upper_bound_mass:
            break
        elif count*186.12 < lower_bound_mass:
            continue
        else:
            tot_combinations = itertools.combinations_with_replacement(AA,count)
            for combination in tot_combinations:
                MWs = [x[1] for x in combination]
                if sum(MWs) >= lower_bound_mass:
                    if sum(MWs) <= upper_bound_mass:
                        possible.append(combination)
    return possible               

# here is an example of it in use: it looks for polypeptide chains no longer than 20 amino acids, with a combined mass between 500.3 Da and 500.5 Da

### ---  make your changes here in the arguments of the function
t0 = time.time()
poss = polypeptide_chain(20,500.3,500.5)   
t1 = time.time()
### --- end changes
# we write it to file, naming the file 'polypeptides.csv'
with open('polypeptides.csv','wb') as my_file:
    file_writer = csv.writer(my_file,delimiter=' ')
    for i in poss:
        aa,mass = zip(*i)        
        my_file.write(str(sum(mass))+'\t')
        file_writer.writerow(aa)

print "Time: ", t1-t0, " seconds"

Enjoy!

Data Munging with Pandas

I’m almost embarrassed I hadn’t heard of pandas until now. pandas is a suite of python tools for data munging, and it works more or less like a really fast Excel spreadsheet. I find it a pain to get data into a form I can use, and usually my first instinct is to just hack together something workable and then leave it. But that is sloppy in practice, and not very satisfying. So I’ll redo some parts of my last post on rating teams, so you can see it in action.

So before when I wanted to get a list of NFL teams for my ranking programs, I did something like this:

  
def get_teams():  
    """ Get team data from masseyratings.com,  
      parse it into a dictionary with team number  
      as the key, team name as the value  
    """  
    f = urllib.urlopen("http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&exhib=on&format=2")  
    s = f.read().split()  
    my_teamnames = {}  
    for i in range(0,len(s)/2):  
        my_teamnames.update({i : s[i*2 + 1]})  
    return my_teamnames  

Not really intuitive what I did there, mostly a lot of hacking to get the list of teams in a usable fashion for the rest of the program. It returns a python dictionary, and it looks like this:

    {0: 'Arizona',
     1: 'Atlanta',
     2: 'Baltimore',
     ...
     29: 'Tampa_Bay',
     30: 'Tennessee',
     31: 'Washington'}

Let’s do it a better way.
Here is how we do it with pandas (imported as pd):

  
def get_teams():  
    team_data = urllib2.urlopen('http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&format=2')  
    teams = pd.read_csv(team_data, header=None)  
    teams.columns = ['Team_ID', 'Team_name']  
    return teams  

A mere four lines of code, each with a distinct (and clear) purpose! First line, open the webpage containing data. Second line, read in the data to a pandas DataFrame object (think a spreadsheet). Third, name the columns, and finally return the object, which looks like so:

    Team_ID Team_name
    0 1 Arizona
    1 2 Atlanta
    2 3 Baltimore
    ... ... ...
    29 30 Tampa_Bay
    30 31 Tennessee
    31 32 Washington

The first column is the index, which starts at zero in a pythonic manner. The second column is team ID, which is what the data itself uses. So don’t get confused by those numbers. Having a DataFrame object allows us to easily index the columns and rows. It also easily allows us to add columns to it. So say we get a rating vector, and want to pair it up with our team names. Before I did it like so:

  
def write_sorted(teams,team_name,rating,filename):  
    # teams: array of team indices; team_name: the dictionary from get_teams()  
    combine = np.vstack((rating,teams))  
    sort_combine = np.array(sorted(combine.T,key=tuple,reverse=True))  
    f = open(str(filename)+'.txt','w')  
    for i in range(0,len(teams)):  
        f.write(str(team_name[int(sort_combine[i,1])])+', '+str(sort_combine[i,0])+'\n')  
    f.close()  

Again, kind of a hack to stack the lists (via vstack), then sort them by rating, which involves transposing the array, using tuple keys…and if you’ve never done this before you have to look up all the options. And then you have to write to file in such a way as to link up team names with the ID in the vector…it’s messy and I don’t like it. So let’s try it with pandas. You saw the team object before, right? Now we do:

  
def write_sorted(teams,rating,filename):  
    teams['rating'] = rating  
    # note: newer pandas versions replaced DataFrame.sort with sort_values(by='rating', ...)  
    teams = teams.sort(columns='rating',ascending=False)  
    teams.to_csv(filename)  

Done. (Of course we can add different formatting options, too). Add a column titled rating to our teams object, sort by the column rating, in descending order, then write to file. SO EASY. And here we have it:

    Team_ID,Team_name,rating
    13, Houston,0.7603317475519333
    19, New_England,0.7582151079219482
    16, Kansas_City,0.7420115719733623
    ...
    15, Jacksonville,0.30074546774453426
    18, Minnesota,0.2838074821824212
    25, Pittsburgh,0.24660385713879754
    32, Washington,0.20859838193293445
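And since everything lives in a DataFrame, pulling individual entries back out is just as painless. A couple of illustrative lines (using the teams object from get_teams above, before the rating column was added):

print teams['Team_name'][0]          # the first team name (Arizona)  
print teams[teams['Team_ID'] == 32]  # the full row for Washington  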

It’s just so simple, and I’m really glad that I found this module for my python programs!

And as a bonus for reading this far, here is a picture of a laughing panda.

Oh internet, you so FUNNY

Colley Method for Ranking NFL Teams

The NFL season started this week, which means it is a perfect time to set up a program for ranking teams! My fiancée bought me a book last Christmas called Who’s #1, which is an awesome book on how ranking and rating methods work. The book is geared towards sports specifically, but I was surprised to learn how many companies use ranking algorithms — for example, Google and Netflix. I am — by and large — ignorant of most things sports related, but this past year I lived with five other guys who love sports, and they invited me to join their March Madness bracket group. Since I don’t really follow college basketball but wanted a shot at winning our group’s bracket challenge (there was a cash prize!), I used the methods in the book to rank college basketball teams and fill out the brackets algorithmically. I ended up taking the second place prize out of nearly 25 brackets in our group, and finished in the top 5% nationally. I took first in the bracket group I did with the other graduate students…but I didn’t put any money in that one! (Bummer…)

One of the first methods I learned is called the Colley Method, and you can read all about it here. You might recognize it as one of the key computer methods in college football’s Bowl Championship Series (no comment on that). The method takes the win-loss record of each team, assigns each a numerical rating based on who they beat (or lost to), and ranks them according to that rating. The big idea is that if you beat a very good team, it should count more than if you beat a poor team, and if you lose to a very good team, you should not be ‘penalized’ as much as if you lose to a terrible team. In a sense, it works by interconnecting teams based on who they won and lost against — kind of like a pseudo-social network. If you beat me, and I beat your friend, then we assume you would have beaten your friend, even if you never got to play them. Hopefully that makes sense!

Anyway, if you want to try it yourself, here is a link to the code that ranks both the 2012 and 2013 NFL teams. I included an option for preseason games, since there is (at the time of this writing) only one regular season game that has been played in the NFL. Note that you will need the numpy module for this to work.

To run it, type

python ranking.py -e -y 2013

This will call all the other python files to fetch the data from http://www.masseyratings.com and will then process it and implement the Colley method, giving you a colley.txt file containing the ratings of each team (sorted, of course). The flag -y gives the year (2012 or 2013), and the -e flag, if present, includes exhibition matches as well (e.g. preseason games).

Now for the guts of the code (for those who are interested).

First we need a way to get the data from masseyratings.com, put the games into an array, and get our teams. Because the game data uses an integer to represent each team, I create a python dictionary to map the number to the team name. getdata.py does precisely this:

  
def get_teams():  
     """ Get team data from masseyratings.com,  
       parse it into a dictionary with team number  
       as the key, team name as the value  
     """  
     f = urllib.urlopen("http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&exhib=on&format=2")  
     s = f.read().split()  
     my_teamnames = {}  
     for i in range(0,len(s)/2):  
         my_teamnames.update({i : s[i*2 + 1]})  
     return my_teamnames

def get_games(year,exhibition):  
    """ Get game data from masseyratings.com,  
      parse it into a numpy array.  
    """  
    if year == 2013:  
        if exhibition == True:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&exhib=on&format=1')  
        elif exhibition == False:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=199229&sub=199229&all=1&mode=3&format=1')  
        else:  
            sys.exit('"exhibition" must be "True" or "False"')  
    elif year == 2012:  
        if exhibition == True:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=181613&sub=181613&all=1&mode=3&exhib=on&format=1')  
        elif exhibition == False:  
            f = urllib.urlopen('http://www.masseyratings.com/scores.php?s=181613&sub=181613&all=1&mode=3&format=1')  
        else:  
            sys.exit('"exhibition" must be "True" or "False"')  
    else:  
        sys.exit('Not a valid year')  
    s = f.read()  
    if exhibition == False:  
        file_name = str('games_'+str(year)+'.txt')  
    elif exhibition == True:  
        file_name = str('games_'+str(year)+'_exhib.txt')  
    k = open(file_name,'w')  
    k.write(s)  
    k.close()  
    f.close()

    my_games = genfromtxt(file_name, dtype = None, delimiter=',')

    return my_games  

Once we have our games and teams, we need to set up the Colley matrix (‘A’), and solve it against Colley’s ‘b’ vector. The file colley.py takes care of this. We take the data about the games and set up a square matrix whose dimension is the number of teams. The entry where two teams intersect holds information about how they fared against each other. The mathematical details can be found in Who’s #1 and in the link to the Colley Method I gave. The end result of this code is a rating vector (called rating here). Anyway, here is that code:

  
def colley(num_teams,num_games,my_teams,my_games):

    A = np.zeros((num_teams,num_teams))  
    wins = np.zeros(num_teams)  
    losses = np.zeros(num_teams)  
    total = np.zeros(num_teams)
    
    for game in range(0,num_games):  
        # winner and loser IDs are 1-based in the data, so shift to 0-based indices  
        w = int(my_games[game,2]) - 1  
        l = int(my_games[game,5]) - 1
    
        # keep track of each team's wins and losses (needed for the b vector)  
        wins[w] = wins[w] + 1.0  
        losses[l] = losses[l] + 1.0
    
        # each off-diagonal entry is minus the number of games played between the two teams  
        A[l,w] = A[l,w] - 1.0  
        A[w,l] = A[w,l] - 1.0
    
    # total games played by each team  
    for i in range(0,num_teams):  
        total[i] = total[i] + wins[i] + losses[i]
    
    b = np.zeros(num_teams)  
    for i in range(0,num_teams):  
        # Colley diagonal: 2 plus total games played by team i  
        A[i,i] = 2 + total[i] + A[i,i]  
        b[i] = 1 + (wins[i] - losses[i])/2
    
    rating = np.linalg.solve(A,b)  
    return rating  
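As a quick sanity check, you can feed colley() a tiny hand-built schedule (this toy array is made up and not part of the repo; columns 2 and 5 hold the 1-based winner and loser IDs, matching the parsed Massey data):

import numpy as np

# two made-up games among three teams: team 1 beats team 2, team 2 beats team 3
toy_games = np.zeros((2,8), dtype=int)
toy_games[0,2], toy_games[0,5] = 1, 2
toy_games[1,2], toy_games[1,5] = 2, 3

print colley(3, 2, None, toy_games)
# roughly [ 0.67  0.5   0.33 ]: team 1 rated highest, team 3 lowest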

Once we have the ratings, we pair them up with the dictionary containing the team names and sort. This is the finalsort.py code. I wrapped everything in the file ranking.py if you are checking this out in my repo. It takes care of making all the files play nicely together.
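Roughly, that pairing and sorting step looks something like this (a sketch with assumed variable names, not the actual finalsort.py):

# my_teamnames is the dict from get_teams(); rating is the vector from colley()
names = [my_teamnames[i] for i in range(len(rating))]
ranked = sorted(zip(rating, names), reverse=True)   # best rating first
for r, name in ranked:
    print name + ',', r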

So now the big question is, how does it fare? Here is the list as of today (9/7/13):

    Seattle, 0.806187648116
    Detroit, 0.76229943304
    Washington, 0.734350377447
    Cleveland, 0.691797811483
    New_England, 0.666157176065
    New_Orleans, 0.657992419938
    NY_Jets, 0.652519130556
    San_Francisco, 0.648158042535
    Arizona, 0.647296343048
    Houston, 0.647271431013
    Denver, 0.614105025386
    Carolina, 0.600878984866
    Buffalo, 0.563322480133
    Cincinnati, 0.557969510878
    Philadelphia, 0.53328007273
    Indianapolis, 0.525416655862
    Chicago, 0.522801819437
    Kansas_City, 0.479567506516
    Dallas, 0.453378981006
    San_Diego, 0.437407308856
    Miami, 0.414388770684
    Oakland, 0.406726811416
    Green_Bay, 0.37888674304
    Tampa_Bay, 0.36195341794
    Minnesota, 0.357868414451
    Baltimore, 0.356824183447
    St_Louis, 0.340268960559
    NY_Giants, 0.339410132678
    Tennessee, 0.288458533025
    Jacksonville, 0.280125144891
    Pittsburgh, 0.192367833585
    Atlanta, 0.0805628953732

Not surprisingly, Seattle is at the top (Woot woot! Go Hawks!), having finished the preseason perfectly against tough teams. Atlanta is at the bottom, having lost every preseason game. Generally, the ratings stack up similar to how the teams did in the preseason.

Also not surprising? How poor of a representation I think it really is. Preseason games are the time where teams test out their second and third strings and avoid playing their top players. So I think, for example, Atlanta will fare a lot better than this preseason ranking suggests. While I like the Colley Method’s simplicity, I think it performs poorly as a predictive algorithm. The Massey method, on the other hand, is much better at predicting future match-ups. This was my experience with it in March Madness anyway. Google it if you are curious. If you are serious about doing any sports ranking, I also want to suggest that you take a look at weighting methods…for example, games played later in the season ‘count more’. I think that is a critical step to getting predictive algorithms. How to do that well, though, is more of an art :)

If you have any questions or comments, let me know!

Gaussian Matrix Parser

I share a lot of code here on my blog, and for most of the quantum chemistry programs to work, I need some inputs, such as one- and two-electron integrals, overlap matrices, and the like. To be honest, aside from working through the Szabo and Ostlund integral evaluation code, I have no idea how to compute these quantities in practice. I know there are plenty of good algorithms out there (PRISM comes to mind), but I’ve never taken the time to work them out for myself (future project maybe?). Instead, I parse the matrices directly from Gaussian09 output files. It’s a little time consuming to do that for each molecule, so I wanted to share my code with you.

Here is a link to it: gauss_parse.

The program, Gaussian Matrix Parser (gauss_parse), is a command line routine to extract integrals from Gaussian output files and write them to text files as 2D arrays. gauss_parse takes the .log file as an argument, then creates a directory (named after the logfile) and fills it with the matrices you want, e.g.:

python gauss_parse.py my_molecule.log

This program assumes you have included the extra print options for matrices; e.g., for minimal-basis water, our h2o.com would look like:

#P rhf/STO-3G scf(conventional) Iop(3/33=6) Extralinks=L316 Noraff Symm=Noint Iop(3/33=1) pop(full)

Minimal basis H2O

0 1  
H  
O 1 r1  
H 2 r1 1 a1

r1 1.1  
a1 104.0  

This has a suitable route section to print the overlap, kinetic energy, two-electron repulsion, and potential energy integrals (and more)!

Say you have installed to a folder called gauss_parse. From the terminal, cd to that directory, and list the contents:

$ cd gauss_parse
$ ls
    LICENSE.txt README gauss_parse.py h2o.log
$ python gauss_parse.py h2o.log
$ ls
    LICENSE.txt README gauss_parse.py h2o.log h2o
$

Note how h2o has been added to the directory. cd into that directory, and you should see:

$ cd h2o
$ ls
    kinetic_energy.dat number_basis_functions.dat
    overlap.dat two_electron_ints.dat
    nuclear_repulsion.dat number_electrons.dat
    potential_energy.dat
$

Each .dat file contains the information extracted from the Gaussian .log file, which you can use for whatever you like! For example, say we want to see the overlap matrix for our water molecule:

$ cd h2o
$ vim overlap.dat

And we see in vim:

1.0000e+00 3.8405e-02 3.8613e-01 0.0000e+00 2.6843e-01 -2.0972e-01 1.8176e-01  
3.8405e-02 1.0000e+00 2.3670e-01 0.0000e+00 0.0000e+00 0.0000e+00 3.8405e-02  
3.8613e-01 2.3670e-01 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 3.8613e-01  
0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00  
2.6843e-01 0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00 -2.6843e-01  
-2.0972e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 -2.0972e-01  
1.8176e-01 3.8405e-02 3.8613e-01 0.0000e+00 -2.6843e-01 -2.0972e-01 1.0000e+00  

With the ones along the diagonal — just as expected! (Actual precision in gauss_parse is much higher).
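If you want to pull these matrices into your own scripts, numpy will read the .dat files straight back in. Here is a minimal sketch, assuming the h2o directory generated above:

import numpy as np

# the .dat files are whitespace-separated floats, so genfromtxt's defaults work
S = np.genfromtxt('h2o/overlap.dat')
T = np.genfromtxt('h2o/kinetic_energy.dat')

print S.shape               # (7, 7) for minimal-basis water
print np.allclose(S, S.T)   # the overlap matrix should be symmetric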