bash: append to a file from multiple threads - multithreading

I'm working on big data and I'm trying to parallelize my processing functions.
I can use several threads and process every user in a different thread (I have 200k users).
Every thread should append the first n lines of the file it produces to an output file shared between all the threads.
I wrote a Java program in which every thread executes head -n 256 thread_processed.txt >> output.
I need the output file to be written atomically:
if thread A wrote lines 0 to 9 and thread B wrote lines 10 to 19, the output should be [0...9 10...19]. Lines must not interleave; it can't be something like [0 1 2 17 18 3 4 ...].
How can I manage concurrent write access to the output file in a bash script?

sem from GNU Parallel should be able to do it:
sem --id mylock "head -n 256 thread_processed.txt >> output"
It uses a mutex named mylock.
If you are concerned that someone might read output while the head is running:
sem --id mylock "cp output o2; head -n 256 thread_processed.txt >> o2; mv o2 output"

Related

Is a pipe in Linux asynchronous or not?

I thought that when I run the commands below
sleep 10 | sleep 2 | sleep 5
the Linux process list would be
86014 ttys002 0:00.03 bash
86146 ttys002 0:00.00 sleep 10
and the commands would run one after another: when sleep 10 ends -> sleep 2, and when sleep 2 ends -> sleep 5.
That is what I thought.
But in Linux bash, sleep 10, sleep 2 and sleep 5 all show up in ps at the same time.
The standard output of the sleep 10 process should be redirected to the sleep 5 process,
but in that case the sleep 5 process will finish before sleep 10.
I'm confused; are there any Google keywords or concepts for this phenomenon?
(I'm not good at English, so this text may be hard to understand 🥲. Thank you.)
I think you expect the commands to be run in sequence. But that is not what a pipe does.
To run two commands in sequence you use ;, which is called a command list, I think:
$ time ( sleep 1 ; sleep 2 )
real 0m3.004s
You can also do command lists with && (or ||) so that the sequence is interrupted if one command returns failure (or success).
But when you run two commands with | both are run in parallel, and the stdout of the first is connected to the stdin of the second. That way, the pipe acts as a synchronization object:
If the second command is faster and empties the pipe, when it reads it will wait for more data
If the first command is faster and writes too much data to the pipe, its buffer will fill up and it will block until some data is read.
Additionally, if the second command dies, as soon as the first one writes to stdout it will get a SIGPIPE, and possibly die.
(Note what would happen if your programs were not run concurrently: your first program could write megabytes of text to stdout and, with nobody reading it, the pipe buffer would fill up and the program would block forever.)
But since sleep does not read or write to the console, when you do sleep 1 | sleep 2 nothing special happens and both are run concurrently.
The same happens with 3 or any other number of commands in your pipe.
The net effect is that the total time is that of the longest sleep:
$ time ( sleep 1 | sleep 2 | sleep 3 | sleep 4 )
real 0m4.004s
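As a small illustration of the SIGPIPE point above (an added sketch, not part of the original answer): yes would write "y" lines forever, but head exits after reading one line, and the next write by yes then fails with SIGPIPE, so the whole pipeline finishes immediately:
$ yes | head -n 1
y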

Linux Commands Loop

I want to make a loop in CentOS SSH Terminal where it loops over certain commands. For example:
zmap -p22 -o mfu.txt -B100M -N 250000
Waits until that's finished
chmod 777 *
./update 1500
Stops task after 25 mins
perl wget.pl vuln.txt
repeat the process
Do you want to parallelize step 1? Use parallel.
Step 3's ./update is very broad: do you want to update the dataset or the underlying programs? Dump output?
For the time limit on step 3 and the repetition in step 5 you could use something like
How to have a bash script loop until a specific time
For step 5, do you want to repeat the process during a specific time (like step 3), or for a number of iterations? For the second option you can do something like (with N iterations):
for i in $(seq 1 N); do execute steps 1 to 4; done
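A minimal sketch of the whole loop, putting the pieces together (the commands are taken from the question; the iteration count and the use of coreutils timeout for the 25-minute limit are assumptions):
#!/bin/bash
for i in $(seq 1 10); do                  # 10 iterations; adjust as needed
  zmap -p22 -o mfu.txt -B100M -N 250000   # step 1, runs until it finishes
  chmod 777 *                             # step 2
  timeout 25m ./update 1500               # step 3, killed after 25 minutes
  perl wget.pl vuln.txt                   # step 4
done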

Linux: check if file descriptor is available for reading

Consider the following example, emulating a command which gives output after 10 seconds: exec 5< <(sleep 10; pwd)
In Solaris, if I check the file descriptor earlier than 10 seconds, I can see that it has a size of 0, which tells me it hasn't been populated with data yet. I can simply check every second until the file test succeeds (size different from 0) and then pull the data:
while true; do
  if [[ -s /proc/$$/fd/5 ]]; then
    variable=$(cat <&5)
    break
  fi
  sleep 1
done
But in Linux I can't do this (RedHat, Debian, etc.). All file descriptors appear with a size of 64 bytes whether or not they hold data. For various commands that take a variable amount of time to dump their output, I will not know when I should read the file descriptor. No, I don't want to just wait for cat <&5 to finish; I need to know when I should perform the cat in the first place, because I am using this mechanism to issue simultaneous commands and assign their output to corresponding file descriptors. As mentioned already, this works great in Solaris.
Here is the skeleton of an idea:
#!/bin/bash
exec 5< <(sleep 4; pwd)
while true
do
  if
    read -t 0 -u 5 dummy
  then
    echo Data available
    cat <&5
    break
  else
    echo No data
  fi
  sleep 1
done
From the Bash reference manual :
If timeout is 0, read returns immediately, without trying to read any data. The exit status is 0 if input is available on the specified file descriptor, non-zero otherwise.
The idea is to use read with -t 0 (to have zero timeout) and -u 5 (read from file descriptor 5) to instantly check for data availability.
Of course this is just a toy loop to demonstrate the concept.
The solution given by User Fred using only bash builtins works fine, but is a tiny bit non-optimal due to polling for the state of a file descriptor. If calling another interpreter (for example Python) is not a no-go, a non-polling version is possible:
#! /bin/bash
(
  sleep 4
  echo "This is the data coming now"
  echo "More data"
) | (
  python3 -c 'import select;select.select([0],[],[])'
  echo "Data is now available and can be processed"
  # Replace with more sophisticated real-world processing, of course:
  cat
)
The single line python3 -c 'import select;select.select([0],[],[])' waits until STDIN has data ready. It uses the standard select(2) system call, for which I have not found a direct shell equivalent or wrapper.

Launch the same program with different arguments in parallel via bash

I have a program that has very long computation times. I need to call it with different arguments. I want to run the instances on a server with a lot of processors, so I'd like to launch them in parallel in order to save time. (One program instance only uses one processor.)
I have tried my best to write a bash script which looks like this:
#!/bin/bash
# set maximal number of parallel jobs
MAXPAR=5
# fill the PID array with nonsense pid numbers
for (( PAR=1; PAR<=MAXPAR; PAR++ ))
do
  PID[$PAR]=-18
done
# loop over the arguments
for ARG in 50 60 70 90
do
  # endless loop that checks if one of the parallel jobs has finished
  while true
  do
    # check if PID[PAR] is still running, suppress error output of kill
    if ! kill -0 ${PID[PAR]} 2> /dev/null
    then
      # if PID[PAR] is not running, the next job
      # can run as parallel job number PAR
      break
    fi
    # if it is still running, check the next parallel job
    if [ $PAR -eq $MAXPAR ]
    then
      PAR=1
    else
      PAR=$[$PAR+1]
    fi
    # but sleep 10 seconds before going on
    sleep 10
  done
  # call to the actual program (here sleep for example)
  #./complicated_program $ARG &
  sleep $ARG &
  # get the pid of the process we just started and save it as PID[PAR]
  PID[$PAR]=$!
  # give some output, so we know where we are
  echo ARG=$ARG, par=$PAR, pid=${PID[PAR]}
done
Now, this script works, but I don't quite like it.
Is there any better way to deal with the beginning? (Setting PID[*]=-18 looks wrong to me)
How do I wait for the first job to finish without the ugly infinite loop and sleeping some seconds? I know there is wait, but I'm not sure how to use it here.
I'd be grateful for any comments on how to improve style and conciseness.
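One approach worth noting as a hedged sketch (it is not from the answers below): since bash 4.3, the builtin wait -n blocks until any one background job exits, which removes both the fake PID initialisation and the sleep-based polling. Using sleep as a stand-in for the real program, as the question does:
#!/bin/bash
MAXPAR=5
for ARG in 50 60 70 90
do
  # if MAXPAR jobs are already running, block until any one of them exits
  while (( $(jobs -rp | wc -l) >= MAXPAR ))
  do
    wait -n
  done
  # call to the actual program (here sleep for example)
  sleep $ARG &
  echo ARG=$ARG, pid=$!
done
# wait for the remaining jobs before exiting
wait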
I have a much more complicated code that, more or less, does the same thing.
The things you need to consider:
Does the user need to approve the spawning of a new thread
Does the user need to approve the killing of an old thread
Does the thread terminate on its own or does it need to be killed
Does the user want the script to run endlessly, as long as it has MAXPAR threads
If so, does the user need an escape sequence to stop further spawning
Here is some code for you:
spawn() #function that spawns a thread
{ #usage: spawn 1 ls -l
  i=$1 #save the thread index
  shift 1 #shift arguments to the left
  [ "${thread[$i]:-0}" -eq 0 ] && #if the thread is not already running
  [ ${#thread[@]} -lt $threads ] && #and if we didn't reach the maximum number of threads,
  "$@" & #run the command in the background, with all its arguments
  thread[$i]=$! #associate the thread id with the thread index
}
terminate() #function that terminates threads
{ #usage: terminate 1
  [ your condition ] && #if your condition is met,
  kill ${thread[$1]} && #kill the thread and, if that succeeds,
  thread[$1]=0 #mark the thread as terminated
}
Now, the rest of the code depends on your needs (the things to consider), so you will either loop through input arguments and call spawn, and then after some time loop through thread indexes and call terminate. Or, if the threads end on their own, loop through input arguments and call both spawn and terminate, but the condition for the terminate is then:
ps aux 2>/dev/null | grep -q " ${thread[$i]} "
#look for the thread id in the process list (note the spaces around the id)
Or something along those lines; you get the point.
Using the tips @theotherguy gave in the comments, I rewrote the script in a better way using the sem command that comes with GNU Parallel:
#!/bin/bash
# set maximal number of parallel jobs
MAXPAR=5
# loop over the arguments
for ARG in 50 60 70 90
do
  # call to the actual program (here sleep for example)
  # prefixed by sem -j $MAXPAR
  #sem -j $MAXPAR ./complicated_program $ARG
  sem -j $MAXPAR sleep $ARG
  # give some output, so we know where we are
  echo ARG=$ARG
done

linux batch jobs in parallel

I have seven licenses of a particular software, so I want to start 7 jobs simultaneously. I can do that using '&'. Now, the 'wait' command waits until all of those 7 processes are finished before spawning the next 7. I would like to write a shell script where, after I start the first seven, another job is started as soon as one completes. This is because some of those 7 jobs might take very long while others finish really quickly, and I don't want to waste time waiting for all of them. Is there a way to do this in Linux? Could you please help me?
Thanks.
GNU parallel is the way to go. It is designed for launching multiple instances of the same command, each with a different argument retrieved either from stdin or from an external file.
Let's say your licensed script is called myScript, each instance having the same options --arg1 --arg2 and taking a variable parameter --argVariable for each instance spawned, those parameters being stored in the file myParameters:
cat myParameters | parallel --halt 1 --jobs 7 ./myScript --arg1 --argVariable {} --arg2
Explanations:
--halt 1 tells parallel to halt all jobs if one fails
--jobs 7 will launch 7 instances of myScript
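For concreteness, here is what this does with a hypothetical three-line myParameters file (the file contents are invented for illustration):
$ cat myParameters
input_a.dat
input_b.dat
input_c.dat
parallel then substitutes each line for {} and runs, up to 7 at a time:
./myScript --arg1 --argVariable input_a.dat --arg2
./myScript --arg1 --argVariable input_b.dat --arg2
./myScript --arg1 --argVariable input_c.dat --arg2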
On a Debian-based Linux system, you can install parallel using:
sudo apt-get install parallel
As a bonus, if your licenses allow it, you can even tell parallel to launch these 7 instances amongst multiple computers.
You could check how many are currently running and start more if you have less than 7:
while true; do
  if [ "`ps ax -o comm | grep process-name | wc -l`" -lt 7 ]; then
    process-name &
  fi
  sleep 1
done
Write two scripts: one that restarts a job every time it finishes, and one that starts 7 instances of the first script.
Like:
script1:
./script2 job1 &
...
./script2 job7 &
and
script2 (the job to run is passed as its first argument):
while true; do
  ./"$1"
done
I found a fairly good solution using make, which is a part of the standard distributions. See here
