Multithreading 70 similar commands and receiving output from each

I need to parse 70 identically formatted files (different data), repeatedly, to process some information on demand from each file. I.e. (as a simplified example)...
find /dir -name "MYDATA.bam" | while read filename; do
    dir=$(echo ${filename} | awk -F"/" '{ print $(NF-1)}')
    ARRAY[$dir]=$(samtools view ${filename} | head -1)
done
Since it's 70 files, I wanted each samtools view command to run as an independent thread... so I wouldn't have to wait for each command to finish (each command takes around 1 second). Something like...
# $filename will = "/dir/DATA<id#>/MYDATA.bam"
# $dir then = "DATA<id#>" as the ARRAY key.
find /dir -name "MYDATA.bam" | while read filename; do
    dir=$(echo ${filename} | awk -F"/" '{ print $(NF-1)}')
    command="$(samtools view ${filename} | head -1)"
    ARRAY[$dir]=$command &
done
wait # To get the array loaded
(... do stuff with $ARRAY...)
But I can't seem to find the syntax to get all the commands running in the background while still having each array entry receive the (correct) output.
I'd be running this on a slurm cluster, so I WOULD actually have 70 cores available to run each command independently (theoretically making that step take 1-2 seconds concurrently, instead of 70 seconds consecutively).

You can do this simply with GNU Parallel like this:
#!/bin/bash
doit() {
    dir=$(echo "$1" | awk -F"/" '{print $(NF-1)}')
    result=$(samtools view "$1" | head -1)
    echo "$dir:$result"
}

# export doit() function for subshells of "parallel" to use
export -f doit

# find the files and pass, null-terminated, to GNU Parallel
find somewhere -name "MYDATA.bam" -print0 | parallel -0 doit {}
It will run one copy of samtools per CPU core you have available, but you can easily change that, with parallel -j 8 if you just want 8 at a time, for example.
If you want the outputs in order, use parallel -k ...
I am not familiar with slurm clusters, so you may have to read up on how to tell GNU Parallel about your nodes, or let it just run 8 at a time or however many cores your main node has.
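If you want the results back in a Bash associative array keyed by directory, as in the question, one possible sketch (my addition, not part of the answer; it assumes the directory name contains no colon) is to parse the dir:result lines in the parent shell:

declare -A ARRAY
while IFS=: read -r dir result; do
    ARRAY[$dir]=$result
done < <(find somewhere -name "MYDATA.bam" -print0 | parallel -0 doit {})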

Capturing the output of a process blocks the shell, even when the process is spawned in the background. Here is a small example:
echo "starting to sleep in the background"
sleep 2 &
echo "some printing in the foreground"
wait
echo "done sleeping"
This will produce the following output:
starting to sleep in the background
some printing in the foreground
<2 second wait>
done sleeping
If however you capture like this:
echo "starting to sleep in the background"
output=$(sleep 2 &)
echo "some printing in the foreground"
wait
echo "done sleeping"
The following happens:
starting to sleep in the background
<2 second wait>
some printing in the foreground
done sleeping
The actual waiting happens on the assignment of the output. By the time the wait statement is reached, there is no background process left and thus no waiting.
So one way would be to pipe the output into files and stitch them back together
after the wait. This is a bit awkward.
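For illustration, a rough sketch of that temp-file approach applied to the question (an assumption on my part, not tested against real BAM files):

tmpdir=$(mktemp -d)

# launch one samtools pipeline per file in the background,
# each writing its first line to its own temp file
while IFS= read -r filename; do
    dir=$(awk -F"/" '{ print $(NF-1) }' <<< "$filename")
    samtools view "$filename" | head -1 > "$tmpdir/$dir" &
done < <(find /dir -name "MYDATA.bam")

wait    # the background pipelines are children of this shell, so wait works

declare -A ARRAY
for f in "$tmpdir"/*; do
    ARRAY[$(basename "$f")]=$(<"$f")
done
rm -rf "$tmpdir"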
A simpler solution might be to use GNU Parallel, a tool that deals with
collecting the output of parallel processes. It works particularly well when the output is line based.

You should be able to do this with just Bash. This snippet shows how you can run each command in the background and write the results to stdout. The outer loop reads these results and adds them to your array. You'll probably have to tweak this to make it work.
declare -A ARRAY   # keys like "DATA<id#>" need an associative array

while read -r dir && read -r data; do
    ARRAY[$dir]="$data"
done < <(
    # sub shell level one
    find /dir -name "MYDATA.bam" | while read filename; do
        (
            # sub shell level two
            # run each task in parallel; output will be in the following format
            # "directory"
            # "result"
            # ...
            dir=$(awk -F"/" '{ print $(NF-1)}' <<< "$filename")
            printf "%s\n%s\n" \
                "$dir" "$(samtools view "$filename" | head -1)"
        ) &
    done
)
The key is that ( command; command ) & runs those commands in a new sub shell in the background, so the top-level shell can continue to the next task.
The < <(command) allows us to redirect the stdout of a subprocess to the stdin of another shell command. This is how we can read the results into our variable and have the variable be available later.
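As a tiny standalone illustration of the < <(command) form (my example, not part of the answer): the loop runs in the current shell, so variables set inside it survive after the loop:

COUNT=0
while IFS= read -r line; do
    COUNT=$((COUNT + 1))
done < <(printf 'a\nb\nc\n')
echo "$COUNT"   # prints 3; with "printf ... | while ..." it would print 0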

Related

Read stdin to run a script on certain keystroke without interfering in programs/games

My goal is to run a script every time I push the R key on my keyboard:
#!/bin/bash
tail -50 .local/share/binding\ of\ isaac\ afterbirth+/log.txt | grep "on seed" | awk '{print $(NF-1) $NF}' > /home/fawk/seed
I already tried to do this with a program called xbindkeys but it would interfere and pause the game.
Then I tried it with another bash script using the read command to execute the first script but my knowledge isn't that great so far. Reading and experimenting for hours didn't lead anywhere.
The purpose is to feed a game-seed to OBS. The easiest way seems to create a (text-)file containing the seed for OBS to pick up. All I want is that seed (1st script) to be written into a file.
I tried
#!/bin/bash
while :
do
    read -r 1 k <&1
    if [[ $k = r ]] ; then
        exec /home/fawk/seedobs.sh
    fi
done
and many other variations but didn't get closer to a solution.
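For reference, a minimal sketch of the kind of read loop usually used for single keypresses (assumptions: the script runs in its own terminal and only sees keys typed into that terminal, not into the game window; the path is taken from the question):

#!/bin/bash
while :; do
    IFS= read -rsn1 k           # -s: don't echo the key, -n1: one character
    if [[ $k == r ]]; then
        /home/fawk/seedobs.sh   # no exec, so the loop keeps running afterwards
    fi
done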

How to monitor CPU usage automatically and return results when it reaches a threshold

I am new to shell scripting. I want to write a script that monitors CPU usage and, when the CPU usage reaches a threshold, prints the output of the top command. Here is my script, which gives me a "bad number" error and also does not store any value in the log files:
while sleep 1;do if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]; then echo "$(top -n1)">>sys.log;fi;done
Your script HAS to be indented and stored in a file, especially if you are new to the shell!
#!/bin/sh
while sleep 1
do
    if [ "$(top -n1 | grep -i ^cpu | awk '{print $2}')">>sy.log - ge "$Threshold" ]
    then
        echo "$(top -n1)" >> sys.log
    fi
done
Your condition looks a bit odd. It may work, but it looks really complex. Store intermediate results in variables, and evaluate them.
Then, you will immediately see the syntax error on the “-ge”.
You HAVE to store logfiles within an absolute path for security reasons. Use variables to simplify the reading.
#!/bin/sh
LOGFILE=/absolute_path/sy.log
WHOLEFILE=/absolute_path/sys.log
Threshold=80

while sleep 1
do
    TOP="$(top -n1)"
    CPU="$(echo "$TOP" | grep -i ^cpu | awk '{print $2}')"
    echo "$CPU" >> "$LOGFILE"
    if [ "$CPU" -ge "$Threshold" ] ; then
        echo "$TOP" >> "$WHOLEFILE"
    fi
done
You have a couple of errors.
If you write output to sy.log with a redirection then that output is no longer available to the shell. You can work around this with tee.
There must be no space between the dash and ge: the operator is -ge, not - ge.
Also, a few stylistic remarks:
grep x | awk '{y}' is a useless use of grep; this can usefully and more economically (as well as more elegantly) be rewritten as awk '/x/{y}'
echo "$(command)" is a useless use of echo -- not a deal-breaker, but you simply want command; there is no need to capture what it prints to standard output just so you can print that text to standard output.
If you are going to capture the output of top -n 1 anyway, there is no need really to run it twice.
Further notes:
If you know the capitalization of the field you want to extract, maybe you don't need to search case-insensitively. (I could not find a version of top which prints a CPU prefix with the load in the second field -- is the expression really correct?)
The shell only supports integer arithmetic. Is that a problem for your use case? Maybe you want to use Awk (which has floating-point support) to perform the comparison instead? This also allows for a moderately tricky refactoring. We make Awk return an exit code of 1 if the comparison fails, and use that as the condition for the if.
#!/bin/sh
while sleep 1
do
    if top=$(top -n 1 |
        awk -v thres="$Threshold" '1;  # print every line
            tolower($1) ~ /^cpu/ { print $2 >>"sy.log";
                exitcode = ($2 >= thres ? 0 : 1) }
            END { exit exitcode }')
    then
        echo "$top" >>sys.log
    fi
done
Do you really mean to have two log files with nearly the same name, or is that a typo? Including a time stamp in the log might be useful both for troubleshooting and for actually using the log files.
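For example (a sketch only; it assumes a $CPU variable holding the sampled value, as in the earlier script, and a date that understands '+%F %T'):

# hypothetical timestamped log line
printf '%s %s\n' "$(date '+%F %T')" "$CPU" >> sy.log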

bash script to Ping 1000+ ip at the same time via loop

The LocationDevice array includes all the devices in the location (read from a text file).
What I want to do is save the ping statistics in the PingValue[$k] array and print them out in the next loop (it will be something other than printing, of course). I want to find a way to send the assignment to the background so the 1000 device pings take less than 10 seconds in total (-c 5 for each device), and then call my arrays and process them in the second loop.
I tried echo PingValue[$k]=`ping -c 5 -q ${LocationDevice[$k]}` but it didn't work: the array didn't have the value, and ALL the results printed after a couple of seconds.
for ((k=0; k<${#LocationDevice[@]}; k++)); do
    PingValue[$k]=`ping -c 5 -q ${LocationDevice[$k]}`
done

sleep 10

for ((i=0; i<${#LocationDevice[@]}; i++)); do
    echo "${PingValue[$i]}"
done
Combining the output from multiple parallel background processes into a single array is tricky because those processes can't easily access any shared memory. The best way, I think, is to build a large pipeline to synchronize the data and to use a null value to delineate the output from each process. The best way to launch such a pipeline is recursively. However, there won't be any way to quickly read 1000 results from the pipeline into the final array; that takes a while. Instead you can do your processing as you read each item from the pipeline (while building the array, if you still need it afterwards). Try this:
PingHosts() {
    [ "$1" ] || return
    host="$1"; shift
    rest=("$@")
    (
        r=$(ping -c 5 -q "$host")
        echo "$r"
        printf "\0"
        cat
    ) < <(PingHosts "${rest[@]}" &)
}

for ((i=0; i<${#LocationDevice[@]}; ++i)); do
    read -r -d $'\0' pingresult
    PingValue[$i]="$pingresult"
    echo "Processing result for $i:"
    echo "${PingValue[$i]}"
    echo "-----------------------"
done < <(PingHosts "${LocationDevice[@]}")
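A flatter alternative (just a sketch, not part of the answer above; whether 1000 simultaneous pings are acceptable on your network is an assumption to check) is one temp file per host, indexed by $k, then a single wait before reading the results back:

tmpdir=$(mktemp -d)

for ((k=0; k<${#LocationDevice[@]}; k++)); do
    ping -c 5 -q "${LocationDevice[$k]}" > "$tmpdir/$k" &
done
wait    # returns once the slowest ping has finished

for ((k=0; k<${#LocationDevice[@]}; k++)); do
    PingValue[$k]=$(<"$tmpdir/$k")
done
rm -rf "$tmpdir"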

Can I avoid using a FIFO file to join the end of a Bash pipeline to be stored in a variable in the current shell?

I have the following functions:
execIn ()
{
    local STORE_INvar="${1}" ; shift
    printf -v "${STORE_INvar}" '%s' "$( eval "$@" ; printf %s x ; )"
    printf -v "${STORE_INvar}" '%s' "${!STORE_INvar%x}"
}
and
getFifo ()
{
    local FIFOfile
    FIFOfile="/tmp/diamondLang-FIFO-$$-${RANDOM}"
    while [ -e "${FIFOfile}" ]
    do
        FIFOfile="/tmp/diamondLang-FIFO-$$-${RANDOM}"
    done
    mkfifo "${FIFOfile}"
    echo "${FIFOfile}"
}
I want to store the output of the end of a pipeline into a variable, as given to a function at the end of the pipeline; however, the only way I have found to do this that works in early versions of Bash is to use mkfifo to make a temp FIFO file. I was hoping to use file descriptors to avoid having to create temporary files. So, this works, but is not ideal:
Set Up: (before I can do this I need to have assigned a FIFO file to a var that can be used by the rest of the process)
$ FIFOfile="$( getFifo )"
The Pipeline I want to persist:
$ printf '\n\n123\n456\n524\n789\n\n\n' | grep 2 # for e.g.
The action: (I can now add) >${FIFOfile} &
$ printf '\n\n123\n456\n524\n789\n\n\n' | grep 2 >${FIFOfile} &
N.B. The need to background it with & - Problem 1: I get [1] <PID_NO> output to the screen.
The actual persist:
$ execIn SOME_VAR cat - <${FIFOfile}
Problem 2: I get more noise to the screen
[1]+ Done printf '\n\n123\n456\n524\n789\n\n\n' | grep 2 > ${FIFOfile}
Problem 3: I lose the blanks at the start of the stream rather than at the end, as I have experienced before.
So, am I doing this the right way? I am sure that there must be a way to avoid the need of a FIFO file that needs cleanup afterwards using file descriptors, but I cannot seem to do this as I cannot assign either side of the problem to a file descriptor that is not attached to a file or a FIFO file.
I can try and resolve the problems with what I have, although to make this work properly I guess I need to pre-establish a pool of FIFO files that can be pulled in to use or else I have a pre-req of establishing this file before the command. So, for many reasons this is far from ideal. If anyone can advise me of a better way you would make my day/week/month/life :)
Thanks in advance...
Process substitution has been available in bash since the ancient days. You absolutely do not have a version so ancient as to be unable to use it. Thus, there's no need to use a FIFO at all:
readToVar() { IFS= read -r -d '' "$1"; }
readToVar targetVar < <(printf '\n\n123\n456\n524\n789\n\n\n')
You'll observe that:
printf '%q\n' "$targetVar"
...correctly preserves the leading newlines as well as the trailing ones.
By contrast, in a use case where you can't afford to lose stdin:
readToVar() { IFS= read -r -d '' "$1" <"$2"; }
readToVar targetVar <(printf '\n\n123\n456\n524\n789\n\n\n')
If you really want to pipe to this command, are willing to require a very modern bash, and don't mind being incompatible with job control:
set +m # disable job control
shopt -s lastpipe # in a pipeline, parent shell becomes right-hand side
readToVar() { IFS= read -r -d '' "$1"; }
printf '\n\n123\n456\n524\n789\n\n\n' | grep 2 | readToVar targetVar
The issues you claim to run into with using a FIFO do not actually exist. Put this in a script, and run it:
#!/bin/bash
trap 'rm -rf "$tempdir"' 0 # cleanup on exit
tempdir=$(mktemp -d -t fifodir.XXXXXX)
mkfifo "$tempdir/fifo"
printf '\n\n123\n456\n524\n789\n\n\n' >"$tempdir/fifo" &
IFS= read -r -d '' content <"$tempdir/fifo"
printf '%q\n' "$content" # print content to console
You'll notice that, when run in a script, there is no "noise" printed to the screen, because all that status is explicitly tied to job control, which is disabled by default in scripts.
You'll also notice that both leading and trailing newlines are correctly represented.
One idea (tell me if I am crazy) might be to use the !! notation to grab the line just executed. If there is a command that can terminate a pipeline and stop it actually executing, while still being considered a successful execution as far as the shell is concerned (I am thinking of something like the true command), I could then use !! to grab that line and call my existing function to execute it with process substitution or something. I could then wrap this into an alias, something like: alias streamTo=' | true ; LAST_EXEC="!!" ; myNewCommandVariation <<<' which I think could be used something like: $ cmd1 | cmd2 | myNewCommandVariation THE_VAR_NAME_TO_SET. The <<< from the alias would pass the var name to the command as an arg or stdin; either way, the command would not be at the end of a pipeline. How mad is this idea?
Not a full answer, but rather a first point: is there some good reason not to use mktemp for creating a new file with a random name? As far as I can see, your getFifo() function doesn't do much more than that.
mktemp -u
will give you a fresh, unused name without creating anything; then you can use mkfifo with that name.
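A sketch of getFifo() along those lines (my assumption of what was meant, not code from the answer):

getFifo ()
{
    local FIFOfile
    # mktemp -u only generates an unused name; mkfifo then creates the FIFO
    FIFOfile=$(mktemp -u "/tmp/diamondLang-FIFO-$$-XXXXXX") || return
    mkfifo "$FIFOfile" || return
    echo "$FIFOfile"
}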

Use I/O redirection between two scripts without waiting for the first to finish

I have two scripts, let's say long.sh and simple.sh: one is very time consuming, the other is very simple. The output of the first script should be used as input of the second one.
As an example, the "long.sh" could be like this:
#!/bin/sh
for line in `cat LONGIFLE.dat` do;
    # read line;
    # do some complicated processing (time consuming);
    echo $line
done;
And the simple one is:
#!/bin/sh
while read a; do
    # simple processing;
    echo $a + "other stuff"
done;
I want to pipeline the two scripts like this:
sh long.sh | sh simple.sh
Using a pipeline, simple.sh has to wait for the end of the long script before it can start.
I would like to know if, in the bash shell, it is possible to see the output of simple.sh line by line, so that I can see at runtime which line is being processed at the moment.
I would prefer not to merge the two scripts together, nor to call the simple.sh inside long.sh.
Thank you very much.
stdout is normally buffered. You want line-buffered. Try
stdbuf -oL sh long.sh | sh simple.sh
Note that this loop
for line in `cat LONGIFLE.dat`; do # see where I put the semi-colon?
reads words from the file. If you only have one word per line, you're OK. Otherwise, to read by lines, use while IFS= read -r line; do ...; done < LONGFILE.dat
Always quote your variables (echo "$line") unless you know specifically when not to.
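Putting those two remarks together, a corrected long.sh might look something like this (a sketch; the processing step is still just a placeholder):

#!/bin/sh
while IFS= read -r line; do
    # do some complicated processing (time consuming)
    echo "$line"
done < LONGFILE.dat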
