Thursday, May 7, 2009

A trick for PBS scripts on the USC high-performance computing cluster (HPC)

I'm trying to run LDA for different numbers of topics. I wrote a
bunch of m-files with the different numbers of topics hard coded, but
then I realized that was dumb. It might have saved time this once,
but in the long run I wanted to figure out how the HPC PBS scripts
work. Basically, it's possible to pass variables from the shell into
the PBS script (you can't rely on the standard environment variables
b/c the job is forked to different machines). The idea I had was to
pass the qsub script the command I want to run (distribute) in the
variable CMD. This takes the responsibility out of the PBS script and
puts it back into a normal shell command, which I think is easier for
my purposes. All that's in the run.pbs script is:


#!/bin/bash

source /usr/usc/matlab/default/setup.sh  # puts matlab on the path
cd /auto/rcf-proj3/sn/kazemzad/machineLearningTest  # go to the dir I want to run in

echo $CMD  # assumes $CMD was passed in using qsub's -v switch
$CMD       # run $CMD


Here's an example of how to use it for a quick matlab test:


qsub -v CMD="matlab -nosplash -nodesktop -r \"2+2,exit\"" run.pbs


Here's an example putting it all together for a range of topic counts:


for x in 10 50 100 200 300 400 500 600 700 800 900 1000; do
  echo $x
  qsub -l walltime=23:59:59,nodes=1:ppn=2 \
       -v CMD="matlab -nosplash -nodesktop -r \"numTopicsExperiment($x),exit\"" \
       run.pbs
done


This just forks off all the different experiments with $x as the
parameter. Each run takes 7+ hours, so I'm not actually sure yet that
the whole thing works; at least it hasn't barfed so far. If I don't
post back, assume it worked and try it yourself if it suits your needs.
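
In the meantime, a couple of standard PBS commands are handy for keeping an eye on the jobs (nothing USC-specific here, just the usual PBS conventions for job listings and output files):


qstat -u $USER   # list your queued and running jobs
# by default each job's stdout/stderr come back as run.pbs.o<jobid>
# and run.pbs.e<jobid> in the directory you submitted from
tail run.pbs.o*  # peek at whatever output has landed so far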

1 comment:

abe said...

So this approach worked pretty well. The results for small numbers of topics have started coming in and appear to have run correctly. However, talking to Erik Bresch, I realized that this approach only allows 10 concurrent processes/nodes. If you put more stuff into the *.pbs scripts (using pbsdsh), it's possible to have up to 10 scripts running on as many as 100 concurrent nodes. His approach makes better use of the 1000's of nodes at HPCC, but it's a little messier in that it has to use a script to generate the pbs script.

He mentioned some other issues about using the cluster efficiently. For example, if you have 101 processes that each run for 1 hr, you can ask for 100 nodes for 2 hrs, but then for the last hour 99 of them sit idle. If you instead ask for 52 nodes for 2 hrs, the whole batch still finishes in 2 hours (52 nodes x 2 hrs = 104 node-hours, enough for 101 one-hour jobs) but eats much less of the quota.

One thing to try is to move the for loop into the *.pbs script and then use pbsdsh, instead of actually spelling out all the forked commands; a rough sketch of that idea follows.
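
Here's roughly what that could look like, untested and with some assumptions flagged: pbsdsh spawning one task per allocated slot and setting PBS_VNODENUM to 0, 1, 2, ... for each task is standard Torque behavior, but multi.pbs and run-task.sh are hypothetical names, and this isn't necessarily the exact setup Erik uses.

multi.pbs just asks for one slot per topic count and fans out:


#!/bin/bash
#PBS -l walltime=23:59:59,nodes=12
# spawn one copy of the wrapper per allocated slot
pbsdsh /auto/rcf-proj3/sn/kazemzad/machineLearningTest/run-task.sh


run-task.sh uses the task number to pick its topic count:


#!/bin/bash
# each pbsdsh task gets PBS_VNODENUM = 0, 1, 2, ...;
# use it to index into the list of topic counts
TOPICS=(10 50 100 200 300 400 500 600 700 800 900 1000)
source /usr/usc/matlab/default/setup.sh
cd /auto/rcf-proj3/sn/kazemzad/machineLearningTest
matlab -nosplash -nodesktop -r "numTopicsExperiment(${TOPICS[$PBS_VNODENUM]}),exit"


Then a single qsub multi.pbs replaces the whole for loop (run-task.sh needs to be executable, i.e. chmod +x).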