Embarassingly parallel tasks in Python
Recently, I found myself executing the same commands (or some variation thereof) at the command line.
Over and over and over and over.
Sometimes I just write a do loop (bash
) or for loop (python
). But the command I was executing was taking almost a minute to finish. Not terrible if you are doing this once, but several hundred (thousand?) times more and I get antsy.
If you are curious, I was generating .cube
files for a large quantum dot, of which I needed the molecular orbital density data to analyze. There was no way I was waiting half a day for these files to generate.
So instead, I decided to parallelize the for loop that was executing my commands. It was easier than I thought, so I am writing it here not only so I don’t forget how, but also because I’m sure there are others out there like me who (a) aren’t experts at writing parallel code, and (b) are lazy.
Most of the following came from following along here.
First, the package I used was the joblib package in python
. I’ll assume you have it installed, if not, you can use pip or something like that to get it on your system. You want to import Parallel
and delayed
.
So start off your code with
If you want to execute a system command, you’ll also need the call
function from the subprocess package. So you have
Once you have these imported, you have to structure your code (according to the joblib
people) like so:
So do you imports first (duh), then define the functions you want to do (in my case, execute a command on the command line), and then finally call that function in the main block.
I learn by example, so I’ll show you how I pieced the rest of it together.
Now, the command I was executing was the Gaussian “ cubegen” utility. So an example command looks like
Which makes a .cube
file (50.cube
) containing the volumetric data of molecular orbital 50 (MO=50) from the formatted checkpoint file (qd.fchk
). I wanted 120 points per side, and I wanted headers printed (120 h
).
Honestly, the command doesn’t matter. If you want to parallelize
over a for loop, you certainly could. That’s not my business.
What does matter is that we can execute these commands from a python script using the call
function that we imported from the subroutine
package.
So we replace our functions with the system calls
Now that we have the command(s) defined, we need to piece it together in the main block.
In the case of the makeCube
function, I want to feed it a list of molecular orbital (MO) numbers and let that define my for loop. So let’s start at MO #1 and go to, say, MO #500. This will define our inputs. I also want the cube resolution (npts
) as a variable (well, parameter really).
I’ll also use 8 processors, so I’ll define a variable num_cores
and set it to 8. Your mileage may vary. Parallel()
is smart enough to handle fairly dumb inputs.
(Also, if you do decide to use cubegen
, like I did, please make sure you have enough space on disk.)
Putting this in, our code looks like
Great. Almost done.
Now we need to call this function from within Parallel()
from joblib
.
The Parallel
function (object?) first takes the number of cores as an input. You could easily hard code this if you want, or let Python’s multiprocessing
package determine the number of CPUs available to you. Next we call the function using the delayed()
function. This is “a trick to create a tuple (function, args, kwargs) with a function-call syntax”.
It’s on the developer’s web page. I can’t make this stuff up.
Then we feed it the list defined by our start and end values.
If you wanted to list the contents of your directory 500 times and over 8 cores, it would look like (assuming you defined the function and inputs above)
Essentially we are making the equivalence that
is the same as
Does that make sense? It’s just
instead of
Okay. Enough already. Putting it all together we have:
There you have it!