Speed up dig -x in bash script - linux

As an exercise at my university, I have to run a bash script that does a reverse lookup of all the DNS entries for a class B network block they own.
This is the fastest version I have got, but it still takes forever. Any help optimising this code?
#!/bin/bash
network="a.b"
CMD=/usr/bin/dig
for i in $(seq 1 254); do
for y in $(seq 1 254); do
answer=`$CMD -x $network.$i.$y +short`;
echo $network.$i.$y ' resolves to ' $answer >> hosts_a_b.txt;
done
done

Using GNU xargs to run 64 processes at a time might look like:
#!/usr/bin/env bash
lookupArgs() {
for arg; do
# echo entire line together to ensure atomicity
echo "$arg resolves to $(dig -x "$arg" +short)"
done
}
export -f lookupArgs
network="a.b"
for (( x=1; x<=254; x++ )); do
for (( y=1; y<=254; y++ )); do
printf '%s.%s.%s\0' "$network" "$x" "$y"
done
done | xargs -0 -P64 bash -c 'lookupArgs "$@"' _ >hosts_a_b.txt
Note that this doesn't guarantee order of output (and relies on the lookupArgs function doing one write() syscall per result) -- but output is sortable so you should be able to reorder. Otherwise, one could get ordered output (and ensure atomicity of results) by switching to GNU parallel -- a large perl script, vs GNU xargs' small, simple, relatively low-feature implementation.
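For comparison, a rough sketch of what that GNU parallel variant might look like (assuming GNU parallel is installed; -j64 runs 64 jobs at a time and -k keeps output in input order, so no re-sorting is needed):
#!/usr/bin/env bash
network="a.b"
for (( x=1; x<=254; x++ )); do
  for (( y=1; y<=254; y++ )); do
    printf '%s.%s.%s\n' "$network" "$x" "$y"
  done
done | parallel -j64 -k 'echo {} resolves to $(dig -x {} +short)' >hosts_a_b.txt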

Related

Bash script - execute commands parallely [duplicate]

I have a bash script similar to:
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
done
What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.
When I tried Charles Duffy's latest approach, running under bash -x produced the following trace:
+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line
... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.
bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.
#!/bin/bash
num_procs=$1
num_iters=$2
num_jobs="\j" # The prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
while (( ${num_jobs@P} >= num_procs )); do
wait -n
done
python foo.py "$i" arg2 &
done
GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, no bash versions or package installs required. Here's 4 processes at a time:
printf "%s\0" {1..10} | xargs -0 -I # -P 4 python foo.py # arg2
As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until only the next job exits, as opposed to waiting for all jobs):
#!/bin/bash
# ^^^^ - NOT /bin/sh!
num_procs=$1
num_iters=$2
declare -A pids=( )
for ((i=0; i<num_iters; i++)); do
while (( ${#pids[@]} >= num_procs )); do
wait -n
for pid in "${!pids[@]}"; do
kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
done
done
python foo.py "$i" arg2 & pids["$!"]=1
done
If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
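A sketch of that substitution (my guess at the intended shape; the inner loop above would become the following, trading blocking in wait -n for periodic polling):
while (( ${#pids[@]} >= num_procs )); do
  sleep 0.2  # poll every 1/5th of a second instead of blocking in wait -n
  for pid in "${!pids[@]}"; do
    kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
  done
done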
Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only lines where (linenum % N == threadnum):
num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
(
while read -r line; do
echo "Thread $i: processing $line"
done < <(awk -v num_procs="$num_procs" -v i="$i" \
'NR % num_procs == i { print }' <"$infile")
) &
done
wait # wait for all the $num_procs subprocesses to finish
A relatively simple way to accomplish this with only two additional lines of code. Explanation is inline.
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
python foo.py $i arg2 &
let 'i>=NUM_PROCS' && wait -n # wait for one process at a time once we've spawned $NUM_PROCS workers
done
wait # wait for all remaining workers
Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence it is a Perl script in one single file.
From the README:
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the
system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
seq $2 | parallel -j$1 python foo.py {} arg2
parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):
parallel --embed >newscript
Then edit the end of newscript.
This isn't the simplest solution, but if your version of bash doesn't have "wait -n" and you don't want to use other programs like parallel, awk etc, here is a solution using while and for loops.
num_iters=10
total_threads=4
iter=1
while [[ "$iter" -lt "$num_iters" ]]; do
iters_remainder=$(echo "(${num_iters}-${iter})+1" | bc)
if [[ "$iters_remainder" -lt "$total_threads" ]]; then
threads=$iters_remainder
else
threads=$total_threads
fi
for ((t=1; t<="$threads"; t++)); do
(
# do stuff
) &
((++iter))
done
wait
done

Read stdin in chunks in Bash pipe

I have some shell scripts that work with pipes, like so:
foo.sh | bar.sh
My bar.sh calls a command-line program that can only take a certain number of lines on stdin. Thus, I want foo.sh's large stdout to be chunked into groups of N lines, with bar.sh called once per chunk. Essentially, paginate foo.sh's stdout and run bar.sh on each page.
Is it possible? I am hoping for some magic in between the pipes like foo.sh | ??? | bar.sh. xargs -n doesn't quite get me what I want.
I am nowhere near a machine to test this, but you need GNU Parallel to make this easy - along the lines of:
foo.sh | parallel --pipe -N 10000 -k bar.sh
As an added bonus, that will run as many bar.sh in parallel as you have CPU cores.
Add -j 1 if you only want one bar.sh at a time.
Add --dry-run if you want to see what it would do without actually doing anything.
Use a while read loop.
foo.sh | while read line1 && read line2 && read line3; do
printf "%s\n%s\n%s\n" "$line1" "$line2" "$line3" | bar.sh
done
For large N, write a function that loops.
read_n_lines() {
read -r line || return 1
echo "$line"
n=$(($1 - 1))
while [[ $n -gt 0 ]] && read -r line; do
echo "$line"
n=$((n-1))
done
}
Then you can do:
n=20
foo.sh | while lines=$(read_n_lines $n); do
printf "%s\n" "$lines" | bar.sh
done

Replacing awk with bash built-ins

I have been told to write a bash script that adds up all the GroupIDs in the "/etc/passwd" file. This is my script:
#!/bin/sh
# script input should be (sh groupsum.sh /etc/passwd)
if [ -f $1 ] ; then
awk -F ':' '{print $4}' $1 > /tmp/numb
A=`awk '{s+=$1} END {print s}' /tmp/numb`
echo $A
else
echo "its not a file"
fi
The script works fine, but to make it faster I should use bash built-in commands instead of "awk". It would be great if someone could explain how to achieve this with built-ins.
You said "bash built-ins", but your script starts with #!/bin/sh -- which requests POSIX sh, not bash. I'll assume, though, that you really do want bash.
#!/bin/bash
[[ -f "$1" ]] || { echo "Not a file" >&2; exit 1; }
exec <"$1"
total=0
while IFS=':' read -r _ _ _ groupid _; do
(( total += groupid ))
done
echo "$total"
To explain the specific operations being used to replace components of your awk script: The read command iterates through lines (by default), splitting them by characters in IFS; so IFS=: read -r _ _ _ groupid _ discards the first three columns, puts the fourth in a variable named groupid, and discards the rest. (( )) is a math context in bash; inside it, C-style syntax is usable for integer arithmetic operations, hence the addition.
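A quick illustration of that splitting (the sample line here is made up):
line="alice:x:1000:1000:Alice:/home/alice:/bin/bash"
IFS=':' read -r _ _ _ groupid _ <<<"$line"
echo "$groupid"   # prints 1000, the fourth field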
By the way, reading /etc/passwd directly is a bad idea -- it won't work on systems using LDAP, or NIS, or any other alternate directory service. If you're on a Linux host, you can use the getent program to do a lookup that works with whatever your current directory service is:
$ yourscript <(getent passwd)
All that said, the premise for this question is a poor one -- though there's overhead for spawning any external program, awk included, once it's running awk is much, much faster than bash. If speed were your only priority, you'd do better to not use a shell at all, and have your script start with a shebang that runs the awk interpreter directly.
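For instance, a minimal sketch of such an awk-only script (the shebang path and the file name groupsum.awk are assumptions; adjust for your system):
#!/usr/bin/awk -f
# Sum the fourth ':'-separated field of every input line and print the total.
BEGIN { FS = ":" }
{ total += $4 }
END { print total }
You could then run it as ./groupsum.awk /etc/passwd (or, better, ./groupsum.awk <(getent passwd)).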

Redirecting output of bash for loop

I have a simple BASH command that looks like
for i in `seq 2`; do echo $i; done; > out.dat
When this runs, the output of seq 2 goes to the terminal and nothing is written to the data file (out.dat).
I am expecting standard output to be redirected to out.dat, just as it is when simply running the command seq 2 > out.dat.
Remove your semicolon.
for i in `seq 2`; do echo "$i"; done > out.dat
SUGGESTIONS
Also, as suggested by Fredrik Pihl, try not to invoke external binaries when they are not needed, or at least when it is practical to avoid them:
for i in {1..2}; do echo "$i"; done > out.dat
for (( i = 1; i <= 2; ++i )); do echo "$i"; done > out.dat
for i in 1 2; do echo "$i"; done > out.dat
Also, be careful with unquoted expansions whose results may undergo pathname expansion.
for a in $(echo '*'); do echo "$a"; done
Would show your files instead of just a literal *.
$() is also recommended as a clearer syntax for command substitution in Bash and POSIX shells than backticks (`), and it supports nesting.
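A small illustration of that nesting (the commands chosen are arbitrary):
parent=$(basename "$(dirname "$PWD")")   # the inner $() needs no extra escaping
echo "$parent"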
Cleaner solutions for reading command output into variables are:
while read var; do
...
done < <(do something)
And
read ... < <(do something) ## Could be done on a loop or with readarray.
for a in "${array[@]}"; do
:
done
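As a concrete sketch of that pattern (seq here stands in for whatever command you are reading from):
total=0
while read -r n; do
  (( total += n ))
done < <(seq 1 5)
echo "$total"   # prints 15; the variable survives because no subshell is involved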
Using printf can also be a simpler alternative for the intended purpose:
printf '%s\n' {1..2} > out.dat
Another possibility, for the sake of completeness: You can move the output inside the loop, using >> to append to the file, if it exists.
for i in `seq 2`; do echo $i >> out.dat; done;
Which one is better depends on the use case. Writing the file in one go is certainly better than appending to it a thousand times. Also, if the loop contains multiple echo statements, all of which should go to the file, doing done > out.dat is probably more readable and easier to maintain. The advantage of this solution, of course, is that it gives more flexibility.
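For example, a sketch with several echo statements sharing one redirection (the loop body is made up):
for i in 1 2; do
  echo "header for $i"
  echo "value: $i"
done > out.dat   # every echo from every iteration ends up in out.dat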
Try:
(for i in `seq 2`; do echo $i; done;) > out.dat

Bash Variable Maths Not Working

I have a simple bash script, which forms part of an in house web app that I've developed.
Its purpose is to automate deletion of thumbnails of images when the original image has been deleted by the user.
The script logs some basic status info to a file /var/log/images.log
#!/bin/bash
cd $thumbpath
filecount=0
# Purge extraneous thumbs
find . -type f | while read file
do
if [ ! -f "$imagepath/$file" ]
then
filecount=$[$filecount+1]
rm -f "$file"
fi
done
echo `date`: $filecount extraneous thumbs removed>>/var/log/images.log
Whilst the script correctly deletes thumbs, it doesn't correctly output the number of thumbs being purged; it always shows 0.
For example, having just manually created some orphaned thumbnails, and then running my script, the manually generated orphaned thumbs are deleted, but the log shows:
Thu Jun 9 23:30:12 BST 2011: 0 extraneous thumbs removed
What am I doing wrong that is stopping $filecount from showing a number other than zero when files are being deleted?
I've created the following bash script to test this, and this works perfectly, outputting 0 then 1:
#!/bin/bash
count=0
echo $count
count=$[$count+1]
echo $count
Edit:
Thanks for the answers, but why does the following work
$ x=3
$ x=$[$x+1]
$ echo $x
4
...and also the second example works, yet it doesn't work in the first script?
Second Edit:
This works
count=0
echo Initial Value $count
for i in `seq 1 5`
do
count=$[$count+1]
echo $count
done
echo Final Value $count
Initial Value 0
1
2
3
4
5
Final Value 5
as does replacing count=$[$count+1] with count=$((count+1)), but not in my initial script.
You're using the wrong operator. Try using $(( ... )) instead, e.g.:
$ x=4
$ y=$((x + 1))
$ echo $y
5
$
EDIT
The other problem you're bumping into is down to the pipe. I've bumped into this one before (with ksh, but it wouldn't surprise me to find that other shells have the same problem). The pipe forks another bash process, so when you do the increment, filecount is getting incremented in the subshell that's been forked after the pipe. This value isn't passed back to the calling shell, as the subshell has its own independent environment (environment variables are inherited by called processes, but a called process cannot modify the environment of the calling process).
As an example, this demonstrates that filecount gets incremented okay:
#!/bin/bash
filecount=0
ls /bin | while read x
do
filecount=$((filecount + 1))
echo $filecount
done
echo $filecount
...so you should see filecount increase in the loop, but the final filecount will be zero because that last echo belongs to the main shell, not to the forked subshell (which consists purely of the while loop).
One way you can get the value back is like this...
#!/bin/bash
filecount=0
filecount=`ls /bin | while read x
do
filecount=$((filecount + 1))
echo $filecount
done | tail -1`
echo $filecount
This will only work if you don't care about any other stdout output in the loop as this throws it all away apart from the last line we output (the final value of filecount). This works because we're using stdout and stdin to feed the data back to the parent shell.
Depending on your viewpoint this is either a nasty hack or a nifty bit of shell jiggery-pokery. I'll leave you to decide what you think it is :-)
If you remove the pipeline into the while construct, you remove bash's need to create a subshell.
Change this:
filecount=0
find . -type f | while read file; do
if [ ! -f "$imagepath/$file" ]; then
filecount=$[$filecount+1]
rm -f "$file"
fi
done
echo $filecount
to this:
filecount=0
while read file; do
if [ ! -f "$imagepath/$file" ]; then
rm -f "$file" && (( filecount++ ))
fi
done < <(find . -type f)
echo $filecount
That is harder to read because the find command is hidden at the end. Another possibility is:
files=$( find . -type f )
while ...; do
:
done <<< "$files"
Chris J is quite right that you are using the wrong operator and POSIX subshell variable scoping means you can't get a final count that way.
As a side note, when doing math operations you could also consider using the let shell builtin, like this:
$ filecount=4
$ let filecount=$filecount+1
$ echo $filecount
5
Also, if you want scoping to just work the way you expected it to in spite of that pipeline, you could use zsh instead of bash. In this case it should be a drop-in replacement and work as expected.
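A short sketch of why that works (assuming zsh is installed): zsh runs the last component of a pipeline in the current shell, so the counter incremented inside the loop survives it.
#!/usr/bin/env zsh
filecount=0
ls /bin | while read x; do
  (( filecount++ ))
done
echo $filecount   # prints the number of entries rather than 0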
