Use more than one core in bash - linux

I have a Linux tool that (greatly simplifying) cuts out the sequences specified in an Illumina sequencing file. I have 32 files to grind through. One file is processed in about 5 hours. I have a CentOS server with 128 cores.
I've found a few solutions, but each one works in a way that only uses one core. The last one seems to fire off 32 nohups, but it still pushes everything through a single core.
My question is: does anyone have any idea how to use the server's potential? Every file can be processed independently; there are no relations between them.
This is the current version of the script and I don't know why it only uses one core. I wrote it with the help of advice here on Stack Overflow and found on the Internet:
#!/bin/bash
FILES=/home/daw/raw/*
count=0
for f in $FILES
do
    base=${f##*/}
    echo "process $f file..."
    nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" "$f" &
    (( count++ ))
    if (( count = 31 )); then
        wait
        count=0
    fi
done
I'm explaining: FILES is a list of files from the raw folder.
The "core" line that executes nohup: the first path is the path to the tool, the -a path is the path to the file with the adapter patterns to cut, and -o saves the output under the same file name as the processed one with OUT prepended. The last parameter is the input file to be processed.
Here is the tool's readme:
https://github.com/vsbuffalo/scythe
Does anybody know how to handle this?
P.S. I also tried moving nohup before count, but it still uses only one core. I have no limitations on the server.

IMHO, the most likely solution is GNU Parallel, so you can run up to, say, 64 jobs in parallel with something like this:
parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{.} {} ::: /home/daw/raw/*
This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes. That is better than waiting potentially 4.9 hours for all 32 of your jobs to finish before starting the last one, which then takes a further 5 hours. Note that I arbitrarily chose 64 jobs here; if you don't specify otherwise, GNU Parallel will run 1 job per CPU core you have.
Useful additional parameters, combined in the example below, are:
parallel --bar ... gives a progress bar
parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything
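For example, a dry run of the same command above (same paths, so treat this as a sketch to adapt) prints each scythe invocation without starting any of them:
parallel --dry-run -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{.} {} ::: /home/daw/raw/*
Once the printed commands look right, drop --dry-run and add --bar to run for real with a progress bar.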
If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:
parallel -S server1,server2,server3 ...

Related

run multiple commands in parallel in unix

I have a few files and I have to cut a few columns from those files to generate new files in Unix.
I tried to do it in a loop, selecting files in a directory and generating new files, but as the directory has 100 such files it takes a lot of time to generate the new files.
Can anyone please help me select 10 files in parallel and generate 10 new files, and then the next set of 10 files, as it will reduce the time?
I need a sample Unix code block for this:
cut -b 1-10,25-50,65-79 file1.txt > file_cut1.txt
cut -b 1-10,25-50,65-79 file2.txt > file_cut2.txt
You can do that quite simply with GNU Parallel like this:
parallel 'cut -b 1-10,25-50,65-79 {} > {.}_cut.txt' ::: file*txt
where:
{} represents the current filename, and
{.} represents the current filename without its extension.
Make a backup of the files in your directory before trying this, or any unfamiliar commands.
It will process your files in parallel, doing N at a time, where N is the number of cores in your CPU. If you want it to do, say, 8 jobs at a time, use:
parallel -j 8 ...
If you want to see what it would do, without actually doing anything, use:
parallel --dry-run ...
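For instance, with file1.txt and file2.txt present, the dry run should print something like the following (note the outputs are named file1_cut.txt rather than the question's file_cut1.txt):
cut -b 1-10,25-50,65-79 file1.txt > file1_cut.txt
cut -b 1-10,25-50,65-79 file2.txt > file2_cut.txt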

Why are null characters randomly written into some of my output files?

I have some scripts on my RedHat server which run Micro Focus COBOL programs that generate a huge file of approximately 3 GB in about 3 hours on average. The programs write their output files directly into the directory /my_test/files/.
The problem is that sometimes (randomly) some generated files contain sections of null characters in the middle of the file. When I check them and re-execute the script (with the same input parameters), the output file is generated perfectly (it doesn't contain any null chars). I've checked this many times and I'm pretty sure it is not the fault of the COBOL programs (they use quite simple operations). The space in use in that folder is 40%.
Some programs update the database, and if they finish with return code 0 the changes are committed; I don't have any backup, which is why I'm doing this.
This is an example of a file declaration of one of the problematic COBOL programs:
FILE-CONTROL.
    SELECT MYFILE
        ASSIGN TO MYFILE
        ORGANIZATION IS SEQUENTIAL
        ACCESS MODE IS SEQUENTIAL
        FILE STATUS IS FILE-STATUS.
DATA DIVISION.
FILE SECTION.
FD MYFILE
    LABEL RECORD STANDARD
    RECORDING MODE F.
01 REG-OUTPUT PIC X(400).
I've also checked for nulls in the COBOL programs before the null files appear, but unfortunately no nulls were spotted.
Then I thought about creating a crontab entry which executes the following script every 5 seconds:
if [[ -f /tmp/sorry_im_working ]]; then
    exit
fi
trap 'rm -rf /tmp/sorry_im_working' EXIT
touch /tmp/sorry_im_working

lsof | awk 'BEGIN{
    sfiles="";
} {
    # Collect every path under /my_test/files/ that PROGRAM currently has open
    if($1=="PROGRAM" && $9~/my_test\/files/){
        sfiles=sfiles" "$9
    }
}END{
    # \x27 = single quote, \x22 = double quote, \x5C = backslash, \x3B = semicolon
    # (hex escapes avoid quoting clashes inside the awk program)
    comm="find "sfiles" -newermt \x27-2 seconds\x27 -exec env LC_ALL=C bash -c \x27grep -Pq \x22\x5Cx00{200}\x22 <(tail -c 1000 {}) && echo {}\x27 \x5C\x3B";
    while((comm | getline sout) > 0){
        print sout;
    };
    close(comm);
}' >> /home/ouhma/nullfiles.txt
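For reference, the null check being built inside that awk END block boils down to something like this on its own (a sketch assuming GNU find and grep, with the lsof filtering left out):
find /my_test/files -type f -newermt '-2 seconds' \
    -exec bash -c 'LC_ALL=C grep -Pq "\x00{200}" <(tail -c 1000 "$1") && echo "$1"' _ {} \; >> /home/ouhma/nullfiles.txt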
Therefore, I would like to ask you the following questions:
Any idea of what's going on here?
Do you have any other way to detect the latest modified files?
What other information of interest could I add to my log?
If you construct a file d containing only the literal text \x00:
hexdump -C d
00000000 5c 78 30 30 0a |\x00.|
00000005
and you run:
grep -Faq '\x00' d;echo $?
0
But there is no null character inside d; -F matched the literal four-character text \x00, not a null byte. So it is better to use grep -Paq '\x00', which matches an actual null character.
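To see the difference, here is a quick sketch with a file that really does contain a null byte (d2 is just a hypothetical scratch file):
printf 'abc\0def\n' > d2
grep -Paq '\x00' d2; echo $?    # prints 0: -P treats \x00 as a null byte and finds it
grep -Faq '\x00' d2; echo $?    # prints 1: -F looks for the four literal characters \x00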
Depending on the configuration and the record structure used for the file, MF (Micro Focus) will pad different characters with hex null.
Please copy the 'ASSIGN' clause and the 'FD' clause of the COBOL program.
BTW: if your COBOL programs take three hours to do some calculations and write 3 GB of data back, you should investigate the storage and/or get a COBOL programmer to check the programs; that sounds much too slow.
I suspect you have non-printable characters in your file; the null inserts can be controlled, so take a look at the INSERTNULL file configuration.

Need help to make my program automated

I just wanted to get some ideas on how I should approach this. I am trying to automate reporting back to a database with a bunch of commands like this (i.e., java -jar snet_client.jar -mode report -id 13528 -props /int2/contact/client0.properties &). Let's say I have hundreds of these commands, each with a unique number where 13528 appears. I need to put that in a loop so that I do not have to write or copy and paste those hundred commands over and over to execute them. Any suggestion would be helpful. It has to be in Unix.
This first bash script iterates over each line in the file textfile, assuming each of the id values is on its own line, starts the java process, and waits for it to complete before starting the next one.
# Queueing
# This one will only start the next process when the previous one completes.
OLD_IFS=$IFS
while IFS=$'\n' read -r line_data; do
    java -jar snet_client.jar -mode report -id ${line_data} -props /int2/contact/client0.properties &
    wait
done < /path/to/textfile
IFS=$OLD_IFS
Alternatively, this script does the same as far as getting id values from a text file, but doesn't wait for the first to complete before the next is started. This will likely cause problems if the snet_client.jar program is very resource intensive:
# Non-queueing
# This starts and runs all the processes
OLD_IFS=$IFS
while IFS=$'\n' read -r line_data; do
    java -jar snet_client.jar -mode report -id ${line_data} -props /int2/contact/client0.properties &
done < /path/to/textfile
IFS=$OLD_IFS
In both, we store the current IFS value before we begin so we can reset it after the loop runs, just in case we need it set back for something later in the script file.
I have not tested these (since I don't have the dependencies available), so you might have to make adjustments for your own environment.
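If GNU xargs is available (an assumption about your environment), a middle ground between the two scripts above is to keep a fixed number of java processes running at once, say 4 at a time:
# Hypothetical variant: at most 4 reports in parallel, one id per line of textfile
xargs -P 4 -I {} java -jar snet_client.jar -mode report -id {} -props /int2/contact/client0.properties < /path/to/textfile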

Update Bash commands every 2 seconds (without re-running the code every time)

For my first bash project I am developing a simple bash script that shows basic information about my system:
#!/bin/bash
UPTIME=$(w)
MHZ=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
TEMP=$(cat /sys/class/thermal/thermal_zone0/temp)
#UPTIME shows the uptime of the device
#MHZ shows the overclocked specs
#TEMP shows the current CPU Temperature
echo "$UPTIME" #displays uptime
echo "$MHZ" #displays overclocked specs
echo "$TEMP" #displays CPU Temperature
MY QUESTION: How can I code this so that the uptime and CPU temperature refresh every 2 seconds without regenerating the output by hand each time (I just want these two values to update without having to enter the file path again and re-run the whole script)?
This code already works fine on my system, but after it executes on the command line the information stops updating: the script has finished and the shell is standing by for the next command instead of refreshing variables such as UPTIME in real time.
I hope someone understands what I am trying to achieve; sorry about my poor wording of the idea.
Thank you in advance...
I think this will help you: you can use the watch command to refresh the output every two seconds, without writing a loop.
watch ./filename.sh
It will rerun that command and refresh its output every two seconds.
watch - execute a program periodically, showing output fullscreen
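watch defaults to a two-second interval; -n sets it explicitly and -d highlights whatever changed between refreshes (both are standard watch options):
watch -n 2 -d ./filename.sh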
Not sure I really understand the main goal, but here's an answer to the basic question "How can I code this so that the uptime and CPU temperature refresh every two seconds?":
#!/bin/bash
while :; do
    UPTIME=$(w)
    MHZ=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
    TEMP=$(cat /sys/class/thermal/thermal_zone0/temp)
    #UPTIME shows the uptime of the device
    #MHZ shows the overclocked specs
    #TEMP shows the current CPU Temperature
    echo "$UPTIME" #displays uptime
    echo "$MHZ" #displays overclocked specs
    echo "$TEMP" #displays CPU Temperature
    sleep 2
done
I may suggest some modifications.
For such a simple job I would recommend not using external utilities. So instead of $(cat file) you could use $(<file). This is a cheaper method, as bash does not have to launch cat.
On the other hand, if reading those devices returns only one line, you can use the bash built-in read, like: read ENV_VAR <single_line_file. That is even cheaper. If there are more lines and, for example, you want to read the 2nd line, you could use something like this: { read line_1; read line_2;} <file.
As I see it, w provides much more information and I assume you need only the header line. That is exactly what uptime prints. The external uptime utility reads the /proc/uptime pseudo file, so to avoid calling externals you can read this pseudo file directly.
The looping part also uses the external sleep(1) utility. The timeout feature of the read built-in can be used instead.
So in short the script would look like this:
while :; do
    # /proc/uptime has two fields: uptime and idle time
    read UPTIME IDLE </proc/uptime
    # Not having these pseudo files on my system, the whole line is read.
    # Maybe some formatting is needed. For MHZ, /proc/cpuinfo may be used.
    read MHZ </sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
    read TEMP </sys/class/thermal/thermal_zone0/temp
    # Bash supports only integer arithmetic, so chop off the fractional part
    UPTIME_SEC=${UPTIME%.*}
    UPTIME_HOURS=$((UPTIME_SEC/3600))
    echo "Uptime: $UPTIME_HOURS hours"
    echo "$MHZ"
    echo "$TEMP"
    # read -t reads stdin with a timeout, so pressing ENTER returns immediately
    read -t 2
done
This does not call any external utility and does not fork at all. So instead of executing 3 external utilities (with the expensive fork and execve system calls) every 2 seconds, it executes none. Far fewer system resources are used.
You could also use a while : loop with sleep 2.
You need the awesome power of loops! Something like this should be a good starting point:
while true ; do
    echo 'Uptime:'
    w 2>&1 | sed 's/^/ /'
    echo 'Clocking:'
    sed 's/^/ /' /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
    echo 'Temperature:'
    sed 's/^/ /' /sys/class/thermal/thermal_zone0/temp
    echo '=========='
    sleep 2
done
That should give you your three sections, with the data of each nicely indented.

Multi-threaded BASH programming - generalized method?

Ok, I was running POV-Ray on all the demos, but POV is still single-threaded and wouldn't utilize more than one core. So, I started thinking about a solution in BASH.
I wrote a general function that takes a list of commands and runs them in a designated number of sub-shells. This actually works, but I don't like the way it handles accessing the next command in a thread-safe, multi-process way:
It takes, as an argument, a file with commands (one per line).
To get the "next" command, each process ("thread") will:
Wait until it can create a lock file, with: ln $CMDFILE $LOCKFILE
Read the command from the file,
Modify $CMDFILE by removing the first line,
Remove the $LOCKFILE.
Is there a cleaner way to do this? I couldn't get the sub-shells to read a single line from a FIFO correctly.
Incidentally, the point of this is to enhance what I can do on a BASH command line, and not to find non-bash solutions. I tend to perform a lot of complicated tasks from the command line and want another tool in the toolbox.
Meanwhile, here's the function that handles getting the next line from the file. As you can see, it modifies an on-disk file each time it reads/removes a line. That is what seems hackish, but I'm not coming up with anything better, since FIFOs didn't work without setvbuf() in bash.
#
# Get/remove the first line from FILE, using LOCK as a semaphore (with
# short sleep for collisions). Returns the text on standard output,
# returns zero on success, non-zero when file is empty.
#
parallel__nextLine()
{
    local line rest file=$1 lock=$2

    # Wait for lock...
    until ln "${file}" "${lock}" 2>/dev/null
    do
        sleep 1
        [ -s "${file}" ] || return $?
    done

    # Open, read one "line", save the "rest" back to the file:
    exec 3<"$file"
    read line <&3 ; rest=$(cat<&3)
    exec 3<&-

    # After the last line, make sure the file is empty:
    ( [ -z "$rest" ] || echo "$rest" ) > "${file}"

    # Remove lock and 'return' the line read:
    rm -f "${lock}"
    [ -n "$line" ] && echo "$line"
}
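As a cleaner-locking variant, here is a sketch of the same idea using flock(1) from util-linux instead of the ln-based lock (the function name is hypothetical and this is untested against your setup):
parallel__nextLine_flock()
{
    local line file=$1 lock=$2
    {
        # Block until we hold an exclusive lock on fd 9 (released when the fd closes)
        flock -x 9
        # read returns non-zero when the file is empty
        IFS= read -r line < "$file" || return 1
        # Write everything after the first line back to the file
        tail -n +2 "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
    } 9>"$lock"
    echo "$line"
}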
#adjust these as required
args_per_proc=1 #1 is fine for long running tasks
procs_in_parallel=4
xargs -n$args_per_proc -P$procs_in_parallel povray < list
Note that the nproc command, now in coreutils, will auto-determine the number of available processing units, which can then be passed to -P.
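A sketch of that combination, assuming a coreutils recent enough to ship nproc and the same one-argument-per-job setup as above:
xargs -n 1 -P "$(nproc)" povray < list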
If you need real thread safety, I would recommend migrating to a better scripting system.
With Python, for example, you can create real threads with safe synchronization using semaphores/queues.
Sorry to bump this after so long, but I pieced together a fairly good solution for this, IMO.
It doesn't work perfectly, but it will limit the script to a certain number of child tasks running, and then wait for all the rest at the end.
#!/bin/bash

pids=()

thread() {
    local this
    # Once more than 6 pids are passed in, wait for the oldest ones
    while [ ${#} -gt 6 ]; do
        this=${1}
        wait "$this"
        shift
    done
    pids=($1 $2 $3 $4 $5 $6)
}

for i in 1 2 3 4 5 6 7 8 9 10
do
    sleep 5 &
    pids=( ${pids[@]-} $! )
    thread ${pids[@]}
done

for pid in ${pids[@]}
do
    wait "$pid"
done
It seems to work great for what I'm doing (handling parallel uploads of a bunch of files at once) and keeps it from overwhelming my server, while still making sure all the files get uploaded before the script finishes.
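On a newer bash (4.3 or later, which is an assumption about your shell), wait -n makes the same throttling a little simpler, because it returns as soon as any one background job finishes:
max_jobs=6
for i in 1 2 3 4 5 6 7 8 9 10
do
    # Block while max_jobs jobs are already running; resume when any one exits
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
    sleep 5 &
done
wait    # wait for whatever is still running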
I believe you're actually forking processes here, not threading. I would recommend looking for threading support in a different scripting language like Perl, Python, or Ruby.
