R programming - submitting jobs on a multi-node Linux cluster using PBS

I am running R on a multi-node Linux cluster. I would like to run my analysis in R using scripts or batch mode, without using parallel computing software such as MPI or snow.
I know this can be done by dividing the input data such that each node works on a different part of the data.
My question is how exactly do I go about this? I am not sure how I should code my scripts. An example would be very helpful!
So far I have been submitting my scripts using PBS, but they only seem to run on one node, since R is a single-threaded program. Hence, I need to figure out how to adjust my code so that the work is distributed across all of the nodes.
Here is what I have been doing so far:
1) command line:
qsub myjobs.pbs
2) myjobs.pbs:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64

pbsdsh -v $PBS_O_WORKDIR/myscript.sh
3) myscript.sh:
#!/bin/sh
cd $PBS_O_WORKDIR
R CMD BATCH --no-save my_script.R
4) my_script.R:
library(survival)
...
write.table(test, "TESTER.csv",
sep=",", row.names=F, quote=F)
Any suggestions will be appreciated! Thank you!
-CC

This is rather a PBS question; I usually make an R script (with the path to Rscript after #!) and have it read a parameter (using the commandArgs function) that controls which "part of the job" the current instance should handle. Because I use multicore a lot I usually only need 3-4 nodes, so I just submit a few jobs calling this R script, each with one of the possible control argument values.
On the other hand, your use of pbsdsh should do its job... the value of PBS_TASKNUM can then be used as the control parameter.
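For example, here is a minimal sketch of that pbsdsh route (assuming the data have already been split into files named chunk_1.csv, chunk_2.csv, ... and that my_script.R reads the file name via commandArgs; both the chunk naming and the switch to Rscript are my assumptions, not part of the answers above). myscript.sh can use PBS_TASKNUM to decide which chunk this particular copy works on:
#!/bin/sh
# myscript.sh - pbsdsh starts one copy of this per requested processor
# (12 copies for nodes=6:ppn=2)
cd $PBS_O_WORKDIR
# PBS_TASKNUM distinguishes the copies; on Torque the numbering usually
# starts at 1, but verify this on your own system
CHUNK=chunk_${PBS_TASKNUM}.csv
# each copy analyses only its own chunk and writes its own output file;
# my_script.R would pick up the file name with commandArgs()
Rscript my_script.R "$CHUNK" > result_${PBS_TASKNUM}.out 2>&1
The per-task result files can then be merged once the job has finished.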

This was an answer to a related question, but it answers the comment above as well.
For most of our work we run multiple R sessions in parallel using qsub (instead).
If it is for multiple files I normally do:
while read infile rest
do
qsub -v infile=$infile call_r.pbs
done < list_of_infiles.txt
call_r.pbs:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile=args[5]
outfile=paste(infile,".out",sep="")...
Then I combine all the output afterwards...
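As a hedged sketch of the elided parts (the resource requests, the walltime and the final merge step below are my assumptions, not part of the original answer), call_r.pbs might look like:
#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
# $infile is supplied by qsub -v infile=...
R --vanilla -f analyse_file.R $infile
and once all jobs have finished, the per-file outputs can be combined with something like
cat *.out > combined_results.txt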

This problem seems very well suited to GNU parallel, which has an excellent tutorial. I'm not familiar with pbsdsh, and I'm new to HPC, but to me it looks like pbsdsh serves a similar purpose to GNU parallel. I'm also not familiar with launching R from the command line with arguments, but here is my guess at how your PBS file would look:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64
...
parallel -j2 --wd $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
Rscript myscript.R {} :::: infilelist.txt
where infilelist.txt lists the data files you want to process, e.g.:
inputdata01.dat
inputdata02.dat
...
inputdata12.dat
Your myscript.R would access the command line argument to load and process the specified input file.
My main purpose with this answer is to point out the availability of GNU parallel, which came about after the original question was posted. Hopefully someone else can provide a more tangible example. Also, I am still wobbly with my usage of parallel; for example, I'm unsure of the -j2 option (see my related question).
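Two small practical notes on the sketch above, both hedged since the exact behaviour depends on the GNU parallel version installed: the file list can simply be generated from the shell, and with --sshloginfile the -j limit is applied per node, so -j2 lines up with the ppn=2 request.
# one input file per line, as expected by :::: infilelist.txt
ls inputdata*.dat > infilelist.txt
With 6 nodes and -j2, this would keep up to 12 Rscript processes running at any one time.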

Related

Best way to automatically create different process names for qsub

I am running my program on a high-performance computer, usually with different parameters as input. Those parameters are given to the program via a parameter file, so the qsub file looks like:
#!/bin/bash
#PBS -N <job-name>
#PBS -A <name>
#PBS -l select=1:ncpus=20:mpiprocs=20
#PBS -l walltime=80:00:00
#PBS -M <mail-address>
#PBS -m bea
module load foss
cd $PBS_O_WORKDIR
mpirun main parameters.prm
# Append the job statistics to the std out file
qstat -f $PBS_JOBID
Now I usually run the same program multiple times, more or less simultaneously, with different parameter.prm files. Nevertheless, they all show up in the job list with the same name, which makes it difficult (though not impossible) to correlate a job in the list with the parameters it uses.
Is there a way to change the job's name in the job list dynamically, depending on the input parameters used (ideally from within main)? Or is there another way to change the job name without having to edit the job file every time I run
qsub job_script.pbs
?
Would a solution be to create a shell script that reads the parameter file and then in turn creates and submits the job script? Or are there easier ways?
Simply use the -N option on the command line:
qsub -N job1 job_script.pbs
You can then use a for loop to iterate over the *.prm files, naming each job after its parameter file and passing the file name into the job script through an environment variable:
for prm in *.prm
do
prmbase=$(basename "$prm" .prm)
qsub -N "$prmbase" -v prmfile="$prm" job_script.pbs
done
This will name each job after its parameter file, sans the .prm suffix; inside job_script.pbs the chosen file is then available as $prmfile.
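A hedged sketch of how job_script.pbs from the question could pick that variable up ($prmfile is the name introduced in the loop above; everything else follows the original script):
#!/bin/bash
# no #PBS -N line here: the job name is set on the qsub command line
#PBS -A <name>
#PBS -l select=1:ncpus=20:mpiprocs=20
#PBS -l walltime=80:00:00
#PBS -M <mail-address>
#PBS -m bea
module load foss
cd $PBS_O_WORKDIR
# $prmfile is exported by qsub -v prmfile=..., so each submission
# runs against its own parameter file
mpirun main "$prmfile"
# Append the job statistics to the std out file
qstat -f $PBS_JOBID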

Calling Matlab on a linux based Cluster: matlab sessions stops before m file is completly executed

I'm running a bash script that submits some PBS jobs on a Linux-based cluster multiple times. Each submission calls MATLAB, reads some data, performs calculations, and writes the results back to my directory.
This process works fine with one exception. For some calculations the m-file starts, loads everything, then performs the calculation, but while printing the results to stdout the job terminates.
The PBS log file shows no error messages, and MATLAB shows no error messages.
The code runs perfectly on my own computer. I am out of ideas.
If anyone has an idea what I could do, I would appreciate it.
Thanks in advance
jbug
edit:
Is there a way to force MATLAB to reach the end of the file? Might that help?
edit #18:00:
As requested in the comment below by HBHB, here is the snippet that shows how MATLAB is called from an external *.sh file:
#PBS -l nodes=1:ppn=2
#PBS -l pmem=1gb
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
module load MATLAB/R2015b
cd $PBS_O_WORKDIR
matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
where $a and $b come from a loop within the calling bash file, and ./data/calc points to the directory where myFunc is located.
edit #18:34: If I perform the calculation manually, everything runs fine. So the given data is fine, and the problem seems to narrow down to PBS?
edit #21:27: I put an until loop around the MATLAB call that checks whether MATLAB returns the desired data; if not, it should restart MATLAB after some delay. But still, MATLAB stops after the calculation has finished, while printing the result (some matrices), and the job even finishes. The checking part of the restart loop is never reached.
What I don't understand: the job stays in the queue, as I planned with the small delay, so the sleep $w should be executed? But if I check the error files, they just show me the frozen MATLAB in its first round, recognizable by i. Here is that part of the code; maybe you can help me.
#w=w wait
i=1
until [[ -e ./temp/$b/As$a && -e ./temp/$b/Bs$a && -e ./temp/$b/Cs$a && -e ./temp/$b/lamb$a ]]
do
echo $i
matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
sleep $w
((i=i+1))
done
You are most likely choking your MATLAB process with limited memory. Your PBS file:
#PBS -l nodes=1:ppn=2
#PBS -l pmem=1gb
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
You are limiting physical memory to 1 GB. MATLAB on its own, before loading any files, uses around 900 MB of virtual memory. Try:
#PBS -l nodes=1:ppn=1
#PBS -l pvmem=5gb
#PBS -l walltime=00:05:00
Additionally, this is something you should contact your local system administrators about. Without the system logs, I can't tell you for sure why your job is cut short (my guess is resource limits). As an SA of an HPC center, I can tell you that they would be able to tell you in about 5 minutes why your job is not working correctly. Different HPC centers also use different PBS configurations, so mem might not even be recognized; this is something your local administrators can help you with much better than Stack Overflow.
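If you want some evidence to bring to them, one hedged diagnostic (an addition of mine, not part of the original scripts) is to record MATLAB's exit status right after the call in the job script:
matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
status=$?
# 0 means MATLAB finished normally; 137 (128 + SIGKILL) is typical of a
# process being killed for exceeding an enforced memory limit
echo "matlab exit status: $status"
If the echo line never shows up in the job's output, the scheduler most likely killed the whole job rather than just MATLAB, which points in the same direction.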

Redirect output of my java program under qsub

I am currently running multiple Java executable program using qsub.
I wrote two scripts: 1) qsub.sh, 2) run.sh
qsub.sh
#! /bin/bash
echo cd `pwd` \; "$@" | qsub
run.sh
#! /bin/bash
for param in 1 2 3
do
./qsub.sh java -jar myProgram.jar -param ${param}
done
Given the two scripts above, I submit jobs by
sh run.sh
I want to redirect the messages generated by myProgram.jar -param ${param} to a file.
So in run.sh, I replaced the 4th line with the following
./qsub.sh java -jar myProgram.jar -param ${param} > output-${param}.txt
but the message stored in output-${param}.txt is "Your job 730 ("STDIN") has been submitted", which is not what I intended.
I know that qsub has an option -o for specifying the location of output, but I cannot figure out how to use this option for my case.
Can anyone help me?
Thanks in advance.
The issue is that qsub doesn't return the output of your job, it returns the output of the qsub command itself, which is simply informing your resource manager / scheduler that you want that job to run.
You want to use the qsub -o option, but you need to remember that the output won't appear there until the job has run to completion. For Torque, you'd use qstat to check the status of your job, and all other resource managers / schedulers have similar commands.
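As a hedged sketch (the job and file names below are assumptions), the simplest change is to let qsub itself do the redirection by giving each submission its own name and output file, e.g. a run.sh along these lines instead of the qsub.sh/run.sh pair:
#! /bin/bash
# submit one job per parameter, each with its own name and output file
for param in 1 2 3
do
# -N sets the job name, -o the file that receives the job's stdout;
# stderr still goes to the usual <jobname>.e<jobid> file unless redirected too
echo "cd $(pwd); java -jar myProgram.jar -param ${param}" | qsub -N myprog-${param} -o output-${param}.txt
done
As noted above, output-${param}.txt only appears once the job has actually run to completion.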

Torque+MAUI PBS submitted job strange startup

I am using a Torque+MAUI cluster.
The cluster's utilization is currently ~10 of the 40 available nodes, with a lot of jobs queued that cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but I get this abnormal output from qstat -n:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the session ID is missing, and evidently the script does not run at all, i.e. the java program never shows any trace of having started.
After "running" strangely like this for ~5 minutes, the job is set back to Q(ueued) status and seemingly never runs again (I monitored this for ~1 week and it does not run even after being queued as the topmost job).
I tried submitting the same job 14 times and monitored the assigned nodes in qstat -n; 7 copies ran successfully on various nodes, but every job allocated to z0-1/* got stuck with this strange startup behavior.
Does anyone know a solution to this issue?
As a temporary workaround, how can I specify in the PBS script NOT to use those strange nodes?
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working: pbsnodes -o <node name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) and figure out what is wrong with them. Once you figure out and correct what is wrong with them, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
If you are a user, contact your administrator, and in the meantime run the job with this workaround (an example directive is sketched below):
1) Use pbsnodes to check for free and healthy nodes
2) Modify the PBS directive: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+...
3) Submit the job using qsub
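For example (a hedged illustration; z0-2 is a hypothetical healthy node name), steps 1 and 2 could look like:
pbsnodes -a | less
# inspect each node's state (free / offline / down) in the listing above, then
# pin the job to a specific healthy node in the PBS script:
#PBS -l nodes=z0-2:ppn=32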

Linux bash multithread/process small jobs

I have a script that runs some data processing command 10K times.
foreach f (folderName/input*.txt)
mycmd $f
end
I have timed the runtime for each "mycmd $f" to be 0.25 secs.
With 10K runs, it adds up to be more than 1 hr.
I'm running it on a 16-core Nehalem.
It's a huge waste not to use the remaining 15 cores.
I have tried & with sleep; somehow the script just dies with a warning or error around 3900 iterations (see below). The shorter the sleep, the faster it dies.
foreach f (folderName/input*.txt)
mycmd $f & ; sleep 0.1
end
There has got to be a better way.
Note: I would prefer shell script solutions, let's not wander into C/C++ land.
Thanks
Regards
Pipe the list of files to
xargs -n 1 -P 16 mycmd
For example:
echo folderName/input*.txt | xargs -n 1 -P 16 mycmd
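Here -n 1 hands one file name to each invocation of mycmd and -P 16 keeps 16 of them running at once. If any file names might contain whitespace, a hedged and slightly more robust variant (assuming GNU findutils) feeds xargs a NUL-separated list instead:
find folderName -name 'input*.txt' -print0 | xargs -0 -n 1 -P 16 mycmd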
There are a few other solutions possible using one of the following applications:
xjobs
Parallel
PPSS - Parallel Processing Shell Script
runpar.sh
Submit the jobs with batch; that should fix load balancing and resource starvation issues.
for f in folderName/input*.txt; do
batch <<____HERE
mycmd "$f"
____HERE
done
(Not 100% sure whether the quotes are correct and/or useful.)
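For context (hedged, since defaults vary between distributions): batch comes from the at package and only starts each queued command once the system load average drops below a threshold, and the pending queue can be inspected and pruned with the usual at tools:
atq        # list the commands still waiting in the at/batch queue
atrm 42    # remove a queued entry by its number (42 is a made-up example)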
With GNU Parallel you can do:
parallel mycmd ::: folderName/input*.txt
From: http://git.savannah.gnu.org/cgit/parallel.git/tree/README
= Full installation =
Full installation of GNU Parallel is as simple as:
./configure && make && make install
If you are not root you can add ~/bin to your path and install in
~/bin and ~/share:
./configure --prefix=$HOME && make && make install
Or if your system lacks 'make' you can simply copy src/parallel
src/sem src/niceload src/sql to a dir in your path.
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
