Fragment generator for polypeptide mass-spectroscopy
I have a friend here at the UW that is using mass spectroscopy to identify proteins. Proteins, also called polypeptides, are made up of building blocks called amino acids/peptides. Because polypeptides fragment differently, they can use the fragmentation patterns to determine what an unknown protein is. Kind of a cool idea, I think. The hope is to be able to use it for blood work analysis in medical settings, though I am sure there are plenty of other applications. In fact, C&EN just ran an article this past week on how mass spec was really getting used in biology. You can read it here (sorry, it’s behind a paywall…).
Anyway, he was trying to fit patterns of polypeptides together to fit the different peaks he found. Like if he found a peak at 500.38 M/Z, he would assume he had a GLN-TRP-TRP residue, which has a mass of 500.38 (or some combination thereof). The problem, then, is determining what different combinations of amino acids could fit the peaks! Now, there are really only 20 amino acids he could be looking at, so this turns out to be an interesting combinatorics problem: how many different ways can we combine amino acids residues to fit a given mass (or mass range?)
I realized it wouldn’t be too bad to write a program, and so I am sharing the script with you!
So for this problem, we need to have a set of amino acids, and their masses. So I just looked up the information and hard-coded it into the program. This is the set we will be drawing our combinations from, and it looks like this:
We zip them together to pair up the name of the residue (or amino acid) and its mass.
Once we have this, we will start building our combinations. First we take all combinations of 1 amino acid, then 2 amino acids, then 3 and so on. We don’t need to explicitly make and check each combination if we know will be below the mass specified, or if the chain will always be too high for our desired target mass. For example, say we want all combinations of residues that have a mass between 500.3 and 500.5. In this case, no chain of 2 amino acids alone can reach this, because the largest residue we are working with has a mass of 186.12! Similarly, if the lightest chain possible is too big (e.g. a chain of glycines), we better stop right there because there is no way we will find any more combinations of residues. So we have a check for if the mass is too small (increment chain length by one and continue) or if the mass is too big (break and exit). These bounds keep our computation relatively sane! Here it is in action:
Finally we need to generate our combinations. So inside the loop, we use
itertools to generate all the unique combinations of length N from our set of amino acids.
itertools has two tools to perform this:
combinations won’t allow a residue to be used more than once, so we choose to use
combinations_with_replacements. This allows for combinations such as a polyalanine chain, but it also allows for many more combinations. So it can get pretty slow! If we have a length 10 chain, we get 10! combinations with combinations, but 10^10 combinations with
combinations_with_replacements. Once we have our unique combinations, we sum the mass of each one, see if it matches the criteria, and if it does, we add it to a list that stores all the polypeptides we want. Here it is:
This function will then return a list of all the polypeptide chains that fit our mass criteria. All we have to do is print/write to file and go! This script gets pretty slow quite quickly, as we saw from the scaling criteria. We scale as N^N, where N is the length of our chain. I’m sure there is even more going on behind the scenes in the itertools module, but simply treating the creation of one unique combination as an “operation”, we have an incredibly slow scaling method. Do you know a better way to do this? Let me know! I’d love to hear how to make this more efficient, and I am certain that some computer scientist somewhere has explored this question before… Here is the whole script: