Thursday, March 6, 2014

GNU parallel --jobs option using multiple nodes on a cluster with multiple CPUs per node






I am using GNU parallel to launch code on a high-performance computing (HPC) cluster that has 2 CPUs per node. The cluster uses the TORQUE Portable Batch System (PBS). My question is to clarify how the --jobs option of GNU parallel works in this scenario.


When I run a PBS script calling GNU parallel without the --jobs option, like this:



#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40


it looks like it only uses one CPU per node, and it also emits the following on the error stream:



bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.


This looks like one pair of messages per node. I don't understand the first part (bash: parallel: command not found), but the second part tells me it's using only one CPU per node.
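For what it's worth, one way to investigate the "command not found" part is to check whether GNU parallel is on the PATH of a non-interactive shell on each remote node, since parallel logs in to every host listed in the node file to probe its CPU count, and a failed probe would explain the "Using 1" fallback. A rough sketch (the loop over $PBS_NODEFILE is my assumption about how to reach each allocated node):

```shell
# For each distinct node PBS allocated, ask a non-interactive remote
# shell whether it can find GNU parallel; a "command not found" here
# would explain both messages, since the failed remote CPU probe
# makes parallel fall back to assuming 1 CPU on that host.
while read -r node; do
  ssh "$node" 'command -v parallel || echo "parallel not on PATH"'
done < <(sort -u "$PBS_NODEFILE")
```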


When I add the option -j2 to the parallel call, the errors go away, and I believe it then uses two CPUs per node. I am still a newbie to HPC, so my way of checking this is to print date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:



  1. Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?

  2. It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way I can make this work properly?

  3. Is there any meaning to the bash: parallel: command not found message?
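To make question 1 concrete: with #PBS -lnodes=2:ppn=2, $PBS_NODEFILE lists each node name once per allocated CPU, so the per-node versus total distinction can be seen just by counting lines. A small sketch (the node names are only the ones from the error output above; the real file is whatever PBS generates):

```shell
# Simulate the node file PBS would generate for -lnodes=2:ppn=2:
# one line per CPU slot, so each node name appears ppn times.
printf 'galles087\ngalles087\ngalles108\ngalles108\n' > nodefile

total=$(grep -c . nodefile)           # total CPU slots across the job: 4
nodes=$(sort -u nodefile | grep -c .) # number of distinct nodes: 2
echo "nodes=$nodes cpus_per_node=$((total / nodes))"
```

Assuming -j counts jobs per remote host (which is how I read the man page), -j2 would mean two jobs on each node, i.e. four in total, rather than two jobs overall.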








