Running bash scripts parallel in Linux

I am trying to run a script (1.sh)
spin -a /home/files/1/1.pml;
gcc -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c >log1.txt;
./pan -m100000 >log2.txt;
spin -p -s -r -X -v -n123 -l -g -k /home/files/1/1.pml.trail \
-u10000 /home/files/1/1.pml >log3.txt;
The command spin -a ... generates temporary files (pan.c, pan.h), which are used by the following gcc -O2 ... command. If I run the script in a terminal, it creates the temporary files in the same location.
I want to run multiple such scripts in parallel. I tried two things. The first was a script (parallel.sh) that runs them in a loop in the background:
for((i=1;i<1800;i++))
do
/home/files/$i/$i.sh &
done
The second was GNU parallel: parallel -j0 sh /home/files/{}/{}.sh ::: {1..1800}
Both methods created the temporary files in the directory they were called from instead of the script's directory.
For example, if I run parallel.sh from /home/files, the temporary files are created in /home/files instead of /home/files/1, /home/files/2, etc.
Please suggest a method so that the temporary files generated by 1.sh, 2.sh, ... are created in /home/files/1/, /home/files/2/, ... respectively when I run parallel.sh or GNU parallel in a terminal from /home.

The trick is to change the working directory for each command.
If your computer really can run 1800 such processes at the same time without heating up the climate (run this from /home/files):
for i in {1..1800}; do (cd "$i" && "./$i.sh") & done
If your processes are CPU-bound, running more of them at once than you have processors usually does not increase throughput. This runs at most 8 at a time:
seq 1 1800 | xargs -P8 -I% sh -c 'cd % && ./%.sh'
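The effect of changing directory per job can be checked without spin. The following is only a sketch under stated assumptions: the tree is built under a temporary directory (base, a hypothetical stand-in for /home/files), and each N.sh is a dummy that writes a file named pan.c into whatever its current directory is.

```shell
#!/usr/bin/env bash
# Stand-in for /home/files: a throwaway tree with 1/1.sh, 2/2.sh, 3/3.sh.
base=$(mktemp -d)
for i in 1 2 3; do
    mkdir -p "$base/$i"
    # Each dummy script writes pan.c into the current working directory.
    printf '#!/bin/sh\necho "generated by job %s" > pan.c\n' "$i" > "$base/$i/$i.sh"
done

# The subshell ( ... ) keeps each cd private to its own background job.
for i in 1 2 3; do
    (cd "$base/$i" && sh "./$i.sh") &
done
wait   # block until every background job has finished

ls "$base"/*/pan.c
```

Each pan.c ends up next to the script that produced it, and nothing is written to the directory the loop was started from.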

Try:
parallel 'cd /home/files/{}; sh {}.sh' ::: {1..1800}
It will run one job per core, which may be faster than -j0 (only testing can tell with certainty).
If your scripts differ only in the number, consider rewriting them as one general script or bash function that takes the number as an argument:
spinit() {
num=$1
spin -a /home/files/$num/$num.pml;
gcc -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c >log1.txt;
./pan -m100000 >log2.txt;
spin -p -s -r -X -v -n123 -l -g -k /home/files/$num/$num.pml.trail \
-u10000 /home/files/$num/$num.pml >log3.txt;
}
export -f spinit
parallel 'cd /home/files/{}; spinit {}' ::: {1..1800}
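If GNU Parallel is not installed, an exported function can also be driven by xargs. This is a minimal sketch, assuming bash and an xargs that supports -P; spinit_demo and the temporary base directory are hypothetical stand-ins that only record which directory each job ran in.

```shell
#!/usr/bin/env bash
base=$(mktemp -d)          # stand-in for /home/files
export base
mkdir -p "$base"/{1..4}

spinit_demo() {
    local num=$1
    cd "$base/$num" || return 1   # per-job working directory
    # A real job would call spin, gcc and ./pan here.
    echo "job $num ran in $(basename "$PWD")" > log1.txt
}
export -f spinit_demo      # make the function visible to child bash processes

# -P2: at most two jobs at a time; "_" fills $0 so the number lands in $1.
printf '%s\n' 1 2 3 4 | xargs -n1 -P2 bash -c 'spinit_demo "$1"' _

cat "$base"/*/log1.txt
```

Because every job runs in its own bash process, the cd inside the function cannot leak into other jobs.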

Related

qsub Job using GNU parallel not running

I am trying to execute a qsub job on multiple nodes (2) with a PPN of 20 using GNU parallel, but it shows an error.
#!/bin/bash
#PBS -l nodes=2:ppn=20
#PBS -l walltime=02:00:00
#PBS -N down
cd $PBS_O_WORKDIR
module load gnu-parallel
for cdr in /scratch/data/v/mt/Downscale/*;do
(cp /scratch/data/v/mt/DWN_FILE_NEW/* $cdr/)
(cd $cdr && parallel -j20 --sshloginfile $PBS_NODEFILE 'echo {} | ./vari_1st_imge' ::: *.DS0 )
done
When I run the above code I get the following error (all the paths have been checked, and the same code runs properly without qsub on a normal computer):
$ ./down
parallel: Error: Cannot open echo {} | ./vari_1st_imge.
And with $ qsub down, no output is created.
parallel --version reports:
GNU parallel 20140622
Please help me solve the problem.

First try adding --dryrun to parallel.
But my feeling is that $PBS_NODEFILE is not set for some reason, so GNU Parallel reads the command as the argument to --sshloginfile.
To test this:
echo $PBS_NODEFILE
(cd $cdr && parallel --sshloginfile $PBS_NODEFILE -j20 'echo {} | ./vari_1st_imge' ::: *.DS0 )
If GNU Parallel now tries to open -j20, then it is clear that $PBS_NODEFILE is empty.
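The mechanism behind that guess can be seen with plain shell word splitting. In this sketch, show_args, check_nodefile and NODEFILE_DEMO are hypothetical names; NODEFILE_DEMO stands in for an unset PBS_NODEFILE.

```shell
#!/usr/bin/env bash
# Print how many arguments were received, then the first two of them.
show_args() {
    echo "$# args: first=$1 second=$2"
}

unset NODEFILE_DEMO   # hypothetical stand-in for an unset PBS_NODEFILE

# Unquoted and empty: the variable disappears, so 'cmd' slides into its place.
show_args --sshloginfile $NODEFILE_DEMO 'cmd'

# Quoted: the empty string is kept as an argument of its own.
show_args --sshloginfile "$NODEFILE_DEMO" 'cmd'

# Guard for a job script: abort with a message if the variable is unset or empty.
check_nodefile() {
    : "${NODEFILE_DEMO:?is not set}"
}
```

Calling a guard like check_nodefile near the top of the job script would stop the job early instead of letting parallel misread its arguments.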

qsub array job delay

#!/bin/bash
#PBS -S /bin/bash
#PBS -N garunsmodel
#PBS -l mem=2g
#PBS -l walltime=1:00:00
#PBS -t 1-2
#PBS -e error/error.txt
#PBS -o error/output.txt
#PBS -A improveherds_my
#PBS -m ae
set -x
c=$PBS_ARRAYID
nodeDir=`mktemp -d /tmp/phuong.XXXXX`
cp -r /group/dairy/phuongho/garuns $nodeDir
cp /group/dairy/phuongho/jo/parity1/my/simplex.bin $nodeDir/garuns/simplex.bin
cp /group/dairy/phuongho/jo/parity1/nttp.txt $nodeDir/garuns/my.txt
cp /group/dairy/phuongho/jo/parity1/delay_input.txt $nodeDir/garuns/delay_input.txt
cd $nodeDir/garuns
module load gcc vle
XXX=`pwd`
sed -i "s|/group/dairy/phuongho/garuns/out|$XXX/out/|" exp/garuns.vpz
awk -v i="$c" 'NR == 1 || $8==i' my.txt > simplex-observed.txt
awk -v i="$c" 'NR == 1 || $7==i {print $6}' delay_input.txt > afm_param.txt
cp "/group/dairy/phuongho/garuns_param.txt" "$nodeDir/garuns/garuns_param.txt"
while true
do
./simplex.bin &
sleep 5m
done
awk 'NR >1' < simplex-optimum-output.csv>> /group/dairy/phuongho/jo/parity1/my/finalresuls${c}.csv
cp simplex-all-output.csv "/group/dairy/phuongho/jo/parity1/my/simplex-all-output${c}.csv"
#awk '$28==1{print $1, $12,$26,$28,c}' c=$c out/exp_tempfile.csv > /group/dairy/phuongho/jo/parity1/my/simulated_my${c}.csv
cp /out/exp_tempfile.csv /group/dairy/phuongho/jo/parity1/my/exp_tempfile${c}.csv
rm simplex-observed.txt
rm garuns_param.txt
I have the above bash script, which submits multiple jobs at the same time via PBS_ARRAYID. My issue is that my model (simplex.bin) writes something to my home directory when it executes. If one job runs at a time, or each job waits until the previous one has finished writing to home, it is fine. However, I want more than 1000 jobs running at a time, and when 1000 of them try to write the same thing to home, it crashes.
Is there a smart way to submit the second job only after the first has been running for a certain amount of time (say 5 minutes)?
I already checked and found two options: start the second job when the first has finished, or start it at a specific date/time.
Thanks

You can try something like the following:
while [ yes ]
do
./simplex.bin &
sleep 2
done
This endlessly starts a ./simplex.bin process in the background, waits 2 seconds, starts a new ./simplex.bin, and so on.
Note that, depending on your exact requirements, you may also need nohup and standard input/output redirection for ./simplex.bin.

If you are using Torque, you can set a limit on the number of jobs that can run concurrently:
# Only allow 100 jobs to concurrently execute from this job array
qsub myscript.sh -t 0-10000%100
I know this isn't exactly what you're looking for, but I'm guessing you can find a slot limit that'll make it run without crashing.
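Another option, offered here as an untested assumption rather than something from the answers above, is to let each array task sleep for a delay derived from its array index before it starts the model. stagger_delay is a hypothetical helper, and the 300-second step matches the five minutes mentioned in the question. The sleep counts against the job's walltime, so this only suits small arrays or short steps.

```shell
#!/usr/bin/env bash
# stagger_delay INDEX STEP: seconds that task INDEX (1-based) should wait,
# so that task 1 starts at once, task 2 after STEP seconds, and so on.
stagger_delay() {
    local index=$1 step=$2
    echo $(( (index - 1) * step ))
}

# Inside the job script this could be used as (PBS_ARRAYID is set by Torque):
#   sleep "$(stagger_delay "$PBS_ARRAYID" 300)"
#   ./simplex.bin
for id in 1 2 3; do
    echo "task $id waits $(stagger_delay "$id" 300) s"
done
```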

Does awk run parallelly?

TASK: SSH to 650 servers, fetch a few details from each, and write the names of the completed servers to a separate file. How can I do this faster? With plain ssh it takes 7 minutes, so I read about awk and wrote the following two pieces of code.
Could you please explain the difference between them?
Code 1 -
awk 'BEGIN{done_file="/home/sarafa/AWK_FASTER/done_status.txt"}
{
print "blah"|"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=1 -o ConnectionAttempts=1 "$0" uname >/dev/null 2>&1";
print "$0" >> done_file
}' /tmp/linux
Code 2 -
awk 'BEGIN{done_file="/home/sarafa/AWK_FASTER/done_status.txt"}
{
"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=1 -o ConnectionAttempts=1 "$0" uname 2>/dev/null"|getline output;
print output >> done_file
}' /tmp/linux
When I run these for 650 servers, Code 1 takes 30 seconds and Code 2 takes 7 minutes.
Why is there such a large time difference?
/tmp/linux is a list of the 650 servers.

Updated Answer - with thanks to @OleTange
This form is preferable to my suggestion:
parallel -j 0 --tag --slf /tmp/linux --nonall 'hostname;ls'
--tag Tag lines with arguments. Each output line will be prepended
with the arguments and TAB (\t). When combined with --onall or
--nonall the lines will be prepended with the sshlogin
instead.
--nonall --onall with no arguments. Run the command on all computers
given with --sshlogin but take no arguments. GNU parallel will
log into --jobs number of computers in parallel and run the
job on the computer. -j adjusts how many computers to log into
in parallel.
This is useful for running the same command (e.g. uptime) on a
list of servers.
Original Answer
I would recommend using GNU Parallel for this task, like this:
parallel -j 64 -k -a /tmp/linux 'echo ssh user@{} "hostname; ls"'
which will ssh into 64 hosts in parallel (you can change the number), run hostname and ls on each and then give you all the results in order (-k switch).
Obviously remove the echo when you see how it works.
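Neither form in the question makes awk itself parallel. A likely explanation for the timing gap is that print ... | cmd starts the command and moves on to the next input line, while cmd | getline waits for the command's output before continuing. The sketch below replaces ssh with sleep 1 (an assumption, so that it runs anywhere) and times both forms for three input lines.

```shell
#!/usr/bin/env bash
hosts=$(printf 'a\nb\nc\n')   # stand-in for /tmp/linux

# Form 1: output pipe. Each distinct command string starts its own process,
# and awk only waits for them when it exits, so the sleeps overlap.
start=$SECONDS
printf '%s\n' "$hosts" | awk '{ cmd = "sleep 1 # " $0; print "blah" | cmd; fflush(cmd) }'
t_pipe=$(( SECONDS - start ))

# Form 2: getline blocks until the command produces a line, one host at a time.
start=$SECONDS
printf '%s\n' "$hosts" | awk '{ cmd = "sleep 1; echo " $0; cmd | getline out; print out }'
t_getline=$(( SECONDS - start ))

echo "print|cmd: ${t_pipe}s, cmd|getline: ${t_getline}s"
```

The first form finishing sooner also means it never sees whether the command succeeded, which is worth keeping in mind before trusting a "done" list written that way.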

Processing data with inotify-tools as a daemon

I have a bash script that processes some data using inotify-tools to know when certain events took place on the filesystem. It works fine if run in the bash console, but when I try to run it as a daemon it fails. I think the reason is the fact that all the output from the inotifywait command call goes to a file, thus, the part after | while doesn't get called anymore. How can I fix that? Here is my script.
#!/bin/bash
inotifywait -d -r \
-o /dev/null \
-e close_write \
--exclude "^[\.+]|cgi-bin|recycle_bin" \
--format "%w:%&e:%f" \
$1|
while IFS=':' read directory event file
do
#doing my thing
done
So, -d tells inotifywait to run as daemon, -r to do it recursively and -o is the file in which to save the output. In my case the file is /dev/null because I don't really need the output except for processing the part after the command (| while...)

You don't want to run inotifywait as a daemon in this case, because you want to keep processing the output of the command. Replace the -d command-line option with -m, which tells inotifywait to keep monitoring the files and keep printing to stdout:
-m, --monitor
Instead of exiting after receiving a single event, execute
indefinitely. The default behaviour is to exit after the
first event occurs.
If you want things running in the background, you'll need to background the entire script.
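The reading side of that pipeline can be tried without inotify-tools installed. In this sketch, fake_inotifywait is a hypothetical stand-in that prints two made-up events in a directory:event:file layout, and handle_events is the while loop from the question wrapped in a function.

```shell
#!/usr/bin/env bash
# Stand-in for: inotifywait -m -r -e close_write --format "%w:%&e:%f" DIR
# (hypothetical sample events, one per line, as inotifywait -m would print).
fake_inotifywait() {
    printf '%s\n' '/data/in/:CLOSE_WRITE:report.csv' '/data/in/sub/:CLOSE_WRITE:a.txt'
}

handle_events() {
    # IFS=':' splits each line into the three fields named in --format.
    while IFS=':' read -r directory event file; do
        echo "event=$event path=${directory}${file}"
    done
}

fake_inotifywait | handle_events
```

With the real command in place of the stand-in, the whole script (not inotifywait alone) is what gets sent to the background.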

Here's a solution using nohup: (Note in my testing, if I specified the -o the while loop didn't seem to be evaluated)
nohup inotifywait -m -r \
-e close_write \
--exclude "^[\.+]|cgi-bin|recycle_bin" \
--format "%w:%&e:%f" \
$1 |
while IFS=':' read directory event file
do
#doing my thing
done >> /some/path/to/log 2>&1 &

Parallel download using Curl command line utility

I want to download some pages from a website and I did it successfully using curl but I was wondering if somehow curl downloads multiple pages at a time just like most of the download managers do, it will speed up things a little bit. Is it possible to do it in curl command line utility?
The current command I am using is
curl 'http://www...../?page=[1-10]' 2>&1 > 1.html
Here I am downloading pages from 1 to 10 and storing them in a file named 1.html.
Also, is it possible for curl to write output of each URL to separate file say URL.html, where URL is the actual URL of the page under process.

My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.
The one-liner I would use is, simply:
$ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s $url'
This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.
You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork bomb your machine. It's possible to do something similar with a for loop in the shell, but you end up doing process management yourself, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.
Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:
parallel --jobs 2 curl -O -s http://example.com/?page{}.html ::: {1..10}
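To see what -n1 -P2 does without touching the network, the curl call can be swapped for an echo. This is only a sketch: fetch_all is a hypothetical name, and the "_" argument fills $0 so that each number arrives as $1 rather than relying on $0.

```shell
#!/usr/bin/env bash
# Print the URL each worker would fetch instead of calling curl.
# "_" occupies $0, so each input number arrives as $1.
fetch_all() {
    seq 1 10 | xargs -n1 -P2 sh -c 'echo "would fetch http://example.com/?page$1.html"' _
}

# Output order can vary between runs because two workers run at once,
# so sort the lines before looking at them.
fetch_all | sort
```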

Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.
curl can use the filename part of the URL to generate the local file. Just use the -O option (man curl for details).
You could use something like the following
urls="http://example.com/?page1.html http://example.com?page2.html" # add more URLs here
for url in $urls; do
# run the curl job in the background so we can start another job
# and disable the progress bar (-s)
echo "fetching $url"
curl $url -O -s &
done
wait #wait for all background jobs to terminate

As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which should be much faster and more resource-efficient compared to xargs and background spawning, in most cases:
curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'
This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/

For launching parallel commands, why not use the venerable make command-line utility? It supports parallel execution, dependency tracking and whatnot.
How? In the directory where you are downloading the files, create a new file called Makefile with the following contents:
# which page numbers to fetch
numbers := $(shell seq 1 10)
# default target which depends on files 1.html .. 10.html
# (patsubst replaces % with %.html for each number)
all: $(patsubst %,%.html,$(numbers))
# the rule which tells how to generate a %.html dependency
# $@ is the target filename e.g. 1.html
%.html:
	curl -C - 'http://www...../?page='$(patsubst %.html,%,$@) -o $@.tmp
	mv $@.tmp $@
NOTE The last two lines should start with a TAB character (instead of 8 spaces) or make will not accept the file.
Now you just run:
make -k -j 5
The curl command I used will store the output in 1.html.tmp and only if the curl command succeeds then it will be renamed to 1.html (by the mv command on the next line). Thus if some download should fail, you can just re-run the same make command and it will resume/retry downloading the files that failed to download during the first time. Once all files have been successfully downloaded, make will report that there is nothing more to be done, so there is no harm in running it one extra time to be "safe".
(The -k switch tells make to keep downloading the rest of the files even if one single download should fail.)

Curl can also accelerate a download of a file by splitting it into parts:
$ man curl |grep -A2 '\--range'
-r/--range <range>
(HTTP/FTP/SFTP/FILE) Retrieve a byte range (i.e a partial docu-
ment) from a HTTP/1.1, FTP or SFTP server or a local FILE.
Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl
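Splitting by range comes down to computing inclusive byte offsets for each part. The following is a sketch of that arithmetic only, not the linked script's method; byte_ranges is a hypothetical helper and the size of 10 bytes is made up for illustration.

```shell
#!/usr/bin/env bash
# byte_ranges SIZE PARTS: print one "start-end" range per line (inclusive,
# zero-based, the form curl --range accepts); the last part takes the remainder.
byte_ranges() {
    local size=$1 parts=$2
    local chunk=$(( size / parts )) start end i
    for (( i = 0; i < parts; i++ )); do
        start=$(( i * chunk ))
        if (( i == parts - 1 )); then
            end=$(( size - 1 ))
        else
            end=$(( start + chunk - 1 ))
        fi
        echo "$start-$end"
    done
}

byte_ranges 10 3
```

Each printed range could then be handed to a separate curl -r process writing its own part file, with the parts concatenated in order afterwards. This only helps if the server honours range requests.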

curl has been able to fetch several URLs in parallel since 7.66.0; the --parallel-immediate option used here was added in 7.68.0. This example fetches the URLs listed in urls.txt with at most 3 parallel connections:
curl --parallel --parallel-immediate --parallel-max 3 --config urls.txt
urls.txt:
url = "example1.com"
output = "example1.html"
url = "example2.com"
output = "example2.html"
url = "example3.com"
output = "example3.html"
url = "example4.com"
output = "example4.html"
url = "example5.com"
output = "example5.html"

curl and wget cannot download a single file in parallel chunks, but there are alternatives:
aria2 (written in C++, available in Deb and Cygwin repo's)
aria2c -x 5 <url>
axel (written in C, available in Deb repo)
axel -n 5 <url>
wget2 (written in C, available in Deb repo)
wget2 --max-threads=5 <url>
lftp (written in C++, available in Deb repo)
lftp -e 'pget -n 5 <url>; exit'
hget (written in Go)
hget -n 5 <url>
pget (written in Go)
pget -p 5 <url>

Running a limited number of processes is easy if your system has a command like pidof or pgrep which, given a process name, returns its PIDs (the number of PIDs tells you how many are running).
Something like this:
#!/bin/sh
max=4
running_curl() {
set -- $(pidof curl)
echo $#
}
while [ $# -gt 0 ]; do
while [ $(running_curl) -ge $max ] ; do
sleep 1
done
curl "$1" --create-dirs -o "${1##*://}" &
shift
done
to call like this:
script.sh $(for i in `seq 1 10`; do printf "http://example/%s.html " "$i"; done)
The curl line of the script is untested.
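A variation on the same idea is to count the shell's own background jobs instead of asking pidof, so that curl processes started elsewhere on the machine are not counted. This is a minimal sketch, assuming bash; task is a hypothetical stand-in for one download, and out is a temporary directory.

```shell
#!/usr/bin/env bash
max=2
out=$(mktemp -d)

# Hypothetical stand-in for one download: pause briefly, then write a file.
task() {
    sleep 0.2
    echo "done $1" > "$out/$1.txt"
}

for n in 1 2 3 4 5 6; do
    # Wait while this shell already has $max background jobs running.
    while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
        sleep 0.1
    done
    task "$n" &
done
wait   # let the last jobs finish

ls "$out"
```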

I came up with a solution based on fmt and xargs. The idea is to specify multiple URLs inside braces, as in http://example.com/page{1,2,3}.html (curl expands the braces itself), and run them in parallel with xargs. The following starts downloading in 3 processes:
seq 1 50 | fmt -w40 | tr ' ' ',' \
| awk -v url="http://example.com/" '{print url "page{" $1 "}.html"}' \
| xargs -P3 -n1 curl -O
so 4 lines of URLs are generated and passed to xargs, which runs:
curl -O http://example.com/page{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}.html
curl -O http://example.com/page{17,18,19,20,21,22,23,24,25,26,27,28,29}.html
curl -O http://example.com/page{30,31,32,33,34,35,36,37,38,39,40,41,42}.html
curl -O http://example.com/page{43,44,45,46,47,48,49,50}.html

Bash 3 or above lets you populate an array with multiple values as it expands sequence expressions:
$ urls=( "" http://example.com?page={1..4} )
$ unset urls[0]
Note the [0] value, which was provided as shorthand to make the indices line up with page numbers, since bash arrays autonumber starting at zero. This strategy obviously might not always work. Anyway, you can unset it in this example.
Now you have an array, and you can verify the contents with declare -p:
$ declare -p urls
declare -a urls=([1]="http://example.com?page=1" [2]="http://example.com?page=2" [3]="http://example.com?page=3" [4]="http://example.com?page=4")
Now that you have a list of URLs in an array, expand the array into a curl command line:
$ curl $(for i in ${!urls[@]}; do echo "-o $i.html ${urls[$i]}"; done)
The curl command can take multiple URLs and fetch all of them, recycling the existing connection (HTTP/1.1) to a common server, but it needs the -o option before each one in order to download and save each target. Note that characters within some URLs may need to be escaped to avoid interacting with your shell.
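One way to sidestep the escaping concern is to build the argument list as a second array rather than through command substitution. A minimal sketch, assuming bash; args is a hypothetical name, and the command is printed rather than run.

```shell
#!/usr/bin/env bash
urls=( "" "http://example.com?page="{1..4} )
unset 'urls[0]'

# Build the argument list as an array so each URL stays a single word,
# even if it contains characters such as & or spaces.
args=()
for i in "${!urls[@]}"; do
    args+=( -o "$i.html" "${urls[$i]}" )
done

# Print the command instead of running it; drop "echo" to download for real.
echo curl "${args[@]}"
```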

I am not sure about curl, but you can do that using wget.
wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains website.org \
--no-parent \
www.website.org/tutorials/html/
