Linux bash: multithread/multiprocess small jobs

I have a script that runs some data processing command 10K times.
foreach f (folderName/input*.txt)
mycmd $f
end
I have timed the runtime for each "mycmd $f" to be 0.25 secs.
With 10K runs, it adds up to be more than 1 hr.
I'm running it on a 16-core Nehalem machine.
It's a huge waste to not run on the remaining 15 cores.
I have tried & with sleep, but the script just dies with a warning or error around 3900 iterations (see below). The shorter the sleep, the faster it dies.
foreach f (folderName/input*.txt)
mycmd $f & ; sleep 0.1
end
There has got to be a better way.
Note: I would prefer shell script solutions, let's not wander into C/C++ land.
Thanks
Regards

Pipe the list of files to
xargs -n 1 -P 16 mycmd
For example:
echo folderName/input*.txt | xargs -n 1 -P 16 mycmd
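If the file names might contain spaces or other unusual characters, a more robust variant (a sketch, assuming GNU findutils) feeds NUL-delimited names to xargs instead of relying on echo:
find folderName -maxdepth 1 -name 'input*.txt' -print0 | xargs -0 -n 1 -P 16 mycmd
-P 16 keeps 16 copies of mycmd running; xargs starts the next one as soon as any of them exits.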

There are a few other solutions possible using one of the following applications:
xjobs
Parallel
PPSS - Parallel Processing Shell Script
runpar.sh

Submit the jobs with batch; that should fix load balancing and resource starvation issues.
for f in folderName/input*.txt; do
batch <<____HERE
mycmd "$f"
____HERE
done
(Not 100% sure whether the quotes are correct and/or useful.)
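batch comes with the at package: queued jobs are started only when the system load average drops below a threshold (configurable via atd -l), so it throttles itself rather than pinning a fixed number of jobs to the 16 cores. Assuming at/atd is installed, you can watch the queue like this:
atq        # list jobs still waiting to run
atrm 42    # remove a queued job by number (42 is just an example)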

With GNU Parallel you can do:
parallel mycmd ::: folderName/input*.txt
From: http://git.savannah.gnu.org/cgit/parallel.git/tree/README
= Full installation =
Full installation of GNU Parallel is as simple as:
./configure && make && make install
If you are not root you can add ~/bin to your path and install in
~/bin and ~/share:
./configure --prefix=$HOME && make && make install
Or if your system lacks 'make' you can simply copy src/parallel
src/sem src/niceload src/sql to a dir in your path.
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
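By default GNU Parallel runs one job per CPU core, which already matches the 16-core machine here. Two options worth knowing (a sketch reusing the mycmd example; both flags are standard GNU Parallel options):
parallel -j 16 --eta mycmd ::: folderName/input*.txt
-j sets the number of simultaneous jobs and --eta prints an estimate of the remaining run time.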

Related

How to run a few nohup commands at the same time?

I have a tool that trims content from a file. I need to use it on 33 files, and one file takes 2 hours to process.
I want to run it 33 times at the same time, because one instance uses one core and my machine has 128 cores.
So I wrote script:
#!/bin/bash
FILES=/home/ab/raw/*
for f in $FILES
do
base = ${f##*/}
nohup /home/ab/trimmer -a /home/ab/trimmer/adapters.fa -o "OUT$base" $f
done
About the main line:
I run trimmer; -a is the file with patterns to delete, -o is the new output file (OUT + basename), and the last argument $f is the file to process.
My intention was that the script would run a separate task for each file.
But unfortunately, after running it, only one nohup is launched at a time. In htop only one core is working at 100%.
How can I fix it?
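The usual cause is that each nohup command runs in the foreground, so the loop waits for it to finish before starting the next file. A minimal sketch of a fix, keeping the original paths (note also that bash assignments must not have spaces around '='):
#!/bin/bash
FILES=/home/ab/raw/*
for f in $FILES
do
  base=${f##*/}                # no spaces around '='
  # '&' sends each trimmer run to the background so all 33 start at once
  nohup /home/ab/trimmer -a /home/ab/trimmer/adapters.fa -o "OUT$base" "$f" &
done
wait    # optional: block until all background jobs have finished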

Parallelism in bash script

I have a script that starts up some virtual machines. After the deployment I want to install a few things on the VMs. Because these installations can take up to 6 minutes per VM, it would be much more efficient to run them in parallel. In Java I would probably use threads, but in a bash script I am not sure how. My first approach was something like this:
function install {
plink -ssh -i /var/lib/one/Downloads/id_rsa_ubuntu_putty..ppk root#$1 wget https://www.dropbox.com/s/xdhnx/install.sh
plink -ssh -i /var/lib/one/Downloads/id_rsa_ubuntu_putty..ppk root#$1 chmod 4500 install.sh
plink -ssh -i /var/lib/one/Downloads/id_rsa_ubuntu_putty..ppk root#$1 ./install.sh
echo "$1 kicked off"
}
echo -------------------------------------------------
echo "All VMs successfully deployed"
for i in "${IParray[#]}"
do
install $i &
done
wait
I created a function and tried to run the function calls in the for loop in the background with "&", which should create subprocesses. But somehow this is not working properly. Can anybody help me out?
Maybe use GNU Parallel like this:
#!/bin/bash
IParray=(192.168.0.1 192.168.0.2)
function install {
echo $1
# plink...
}
# Make install() visible to GNU Parallel
export -f install
# Run a bunch of installs in parallel
parallel install ::: "${IParray[@]}"
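The original approach with & can also work: "${IParray[#]}" needs to be "${IParray[@]}" (the # looks like a transcription of @, same for root#$1 versus root@$1), and the loop should be followed by wait. A minimal sketch, reusing the install function from the question:
for i in "${IParray[@]}"
do
  install "$i" &
done
wait    # block until every install has finished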

Bash: Running the same program over multiple cores

I have access to a machine where I can use 10 of the cores -- and I would like to actually use them. What I am used to doing on my own machine would be something like this:
for f in *.fa; do
myProgram (options) "./$f" "./$f.tmp"
done
I have 10 files I'd like to do this on -- let's call them blah00.fa, blah01.fa, ... blah09.fa.
The problem with this approach is that myProgram only uses 1 core at a time, so on the multi-core machine I'd still be using 1 core at a time, 10 times over, and wouldn't be using the machine to its full capacity.
How could I change my script so that it runs all 10 of my .fa files at the same time? I looked at Run a looped process in bash across multiple cores but I couldn't get the command from that to do what I wanted exactly.
You could use
for f in *.fa; do
myProgram (options) "./$f" "./$f.tmp" &
done
wait
which would start all of your jobs in parallel, then wait until they all complete before moving on. In the case where you have more jobs than cores, you would start all of them and let your OS scheduler worry about swapping processes in and out.
One modification is to start 10 jobs at a time
count=0
for f in *.fa; do
  myProgram (options) "./$f" "./$f.tmp" &
  (( count++ ))
  if (( count == 10 )); then
    wait
    count=0
  fi
done
but this is inferior to using parallel, because you can't start new jobs as old ones finish, and you also can't detect if an older job finished before you manage to start 10 jobs. wait allows you to wait on a single particular process or on all background processes, but doesn't let you know when any one of an arbitrary set of background processes completes.
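If your bash is 4.3 or newer, wait -n lifts exactly this restriction: it returns as soon as any background job finishes, so you can keep a fixed number of slots busy without extra tools. A sketch along the lines of the loop above (myProgram and its options are placeholders from the question):
max_jobs=10
for f in *.fa; do
  myProgram "./$f" "./$f.tmp" &     # add your options here
  # once $max_jobs jobs are running, wait for any one of them to exit
  while (( $(jobs -rp | wc -l) >= max_jobs )); do
    wait -n
  done
done
wait    # wait for the remaining jobs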
With GNU Parallel you can do:
parallel myProgram (options) {} {.}.tmp ::: *.fa
From: http://git.savannah.gnu.org/cgit/parallel.git/tree/README
(The installation instructions are the same as those quoted from the README earlier on this page.)
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another approach, without extra tools, is a small polling function that counts running instances with ps and only starts a new one when fewer than N are running:
# Wait until fewer than $3 instances of $1 are running, then start one more and return.
function runParallel () {
    cmd=$1
    args=$2
    number=$3
    while true ; do
        currNumber=$(ps -e | grep -v "grep" | grep " $1$" | wc -l)
        if [ "$currNumber" -lt "$number" ] ; then
            break
        fi
        sleep 1
    done
    echo "run: $cmd $args"
    $cmd $args &
}

loop=0
# Run 12 sleep commands of 10 seconds each,
# with at most five of them running simultaneously.
while [ $loop -ne 12 ] ; do
    runParallel "sleep" 10 5
    loop=$(( loop + 1 ))
done

Perl or Bash threadpool script?

I have a script - a linear list of commands - that takes a long time to run sequentially. I would like to create a utility script (Perl, Bash or other available on Cygwin) that can read commands from any linear script and farm them out to a configurable number of parallel workers.
So if myscript is
command1
command2
command3
I can run:
threadpool -n 2 myscript
Two threads would be created, one commencing with command1 and the other command2. Whichever thread finishes its first job first would then run command3.
Before diving into Perl (it's been a long time) I thought I should ask the experts if something like this already exists. I'm sure there should be something like this because it would be incredibly useful both for exploiting multi-CPU machines and for parallel network transfers (wget or scp). I guess I don't know the right search terms. Thanks!
If you need the output not to be mixed up (which xargs -P risks doing), then you can use GNU Parallel:
parallel -j2 ::: command1 command2 command3
Or if the commands are in a file:
cat file | parallel -j2
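The cat is not strictly necessary; redirecting the file into parallel does the same thing:
parallel -j2 < file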
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to assign 8 jobs to each CPU up front; a CPU that finishes its 8 jobs early then sits idle. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
In Perl you can do this with Parallel::ForkManager:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new( 8 ); # number of jobs to run in parallel
open FILE, "<commands.txt" or die $!;
while ( my $cmd = <FILE> ) {
$pm->start and next;
system( $cmd );
$pm->finish;
}
close FILE or die $!;
$pm->wait_all_children;
There is xjobs, which is better at separating the output of individual jobs than xargs -P.
http://www.maier-komor.de/xjobs.html
You could also use make. Here is a very interesting article on how to use it creatively
Source: http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/7
# This is the commands.txt file
echo Hello world
echo Goodbye world
echo Goodbye cruel world
cat commands.txt | xargs -I CMD --max-procs=3 bash -c CMD

R programming - submitting jobs on a multiple node linux cluster using PBS

I am running R on a multiple node Linux cluster. I would like to run my analysis on R using scripts or batch mode without using parallel computing software such as MPI or snow.
I know this can be done by dividing the input data such that each node runs different parts of the data.
My question is how do I go about this exactly? I am not sure how I should code my scripts. An example would be very helpful!
I have been running my scripts so far using PBS but it only seems to run on one node as R is a single thread program. Hence, I need to figure out how to adjust my code so it distributes labor to all of the nodes.
Here is what I have been doing so far:
1) command line:
qsub myjobs.pbs
2) myjobs.pbs:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64

pbsdsh -v $PBS_O_WORKDIR/myscript.sh
3) myscript.sh:
#!/bin/sh
cd $PBS_O_WORKDIR
R CMD BATCH --no-save my_script.R
4) my_script.R:
library(survival)
...
write.table(test, "TESTER.csv", sep=",", row.names=F, quote=F)
Any suggestions will be appreciated! Thank you!
-CC
This is rather a PBS question; I usually make an R script (with the path to Rscript after #!) and have it read a parameter (using the commandArgs function) that controls which "part of the job" the current instance should handle. Because I use multicore a lot, I usually need only 3-4 nodes, so I just submit a few jobs calling this R script with each of the possible control argument values.
On the other hand your use of pbsdsh should do its job... Then the value of PBS_TASKNUM can be used as a control parameter.
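For example, myscript.sh could use PBS_TASKNUM to pick its own slice of the input files. This is a sketch only: it assumes the tasks are numbered 0 to ntasks-1 (check what your pbsdsh actually sets) and that my_script.R is changed to read the file name via commandArgs(), as analyse_file.R does below:
#!/bin/sh
cd $PBS_O_WORKDIR
ntasks=12                      # nodes=6:ppn=2 above gives 12 tasks
i=0
for f in inputdata*.dat; do    # hypothetical input file names
  if [ $(( i % ntasks )) -eq "$PBS_TASKNUM" ]; then
    R --vanilla -f my_script.R "$f"
  fi
  i=$(( i + 1 ))
done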
This was an answer to a related question - but it's an answer to the comment above (as well).
For most of our work we do run multiple R sessions in parallel using qsub (instead).
If it is for multiple files I normally do:
while read infile rest
do
qsub -v infile=$infile call_r.pbs
done < list_of_infiles.txt
call_r.pbs:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile=args[5]
outfile=paste(infile,".out",sep="")...
Then I combine all the output afterwards...
This problem seems very well suited to GNU parallel, which has an excellent tutorial (http://www.gnu.org/software/parallel/parallel_tutorial.html). I'm not familiar with pbsdsh, and I'm new to HPC, but it looks to me like pbsdsh serves a similar purpose to GNU parallel. I'm also not familiar with launching R from the command line with arguments, but here is my guess at how your PBS file would look:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64
...
parallel -j2 --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
Rscript myscript.R {} :::: infilelist.txt
where infilelist.txt lists the data files you want to process, e.g.:
inputdata01.dat
inputdata02.dat
...
inputdata12.dat
Your myscript.R would access the command line argument to load and process the specified input file.
My main purpose with this answer is to point out the availability of GNU parallel, which came about after the original question was posted. Hopefully someone else can provide a more tangible example. Also, I am still wobbly with my usage of parallel, for example, I'm unsure of the -j2 option. (See my related question.)
