Read stdin in chunks in Bash pipe - linux

I have some shell scripts that work in a pipeline, like this:
foo.sh | bar.sh
My bar.sh calls a command-line program that can only accept a certain number of lines on stdin. I therefore want foo.sh's large stdout to be chunked into groups of N lines, with one bar.sh call per chunk. Essentially, paginate foo.sh's stdout and run bar.sh multiple times.
Is this possible? I am hoping for some magic between the pipes, like foo.sh | ??? | bar.sh. xargs -n doesn't quite do what I want.

I am nowhere near a machine to test this, but you need GNU Parallel to make this easy - along the lines of:
foo.sh | parallel --pipe -N 10000 -k bar.sh
As an added bonus, that will run as many bar.sh in parallel as you have CPU cores.
Add -j 1 if you only want one bar.sh at a time.
Add --dry-run if you want to see what it would do without actually doing anything.
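A quick way to sanity-check the chunking on any box with GNU Parallel installed (a hypothetical smoke test, with wc -l standing in for bar.sh):
seq 1 100 | parallel --pipe -N 25 -k wc -l
# prints "25" four times: each 25-line chunk becomes one invocation reading stdin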

Use a while read loop.
foo.sh | while read -r line1 && read -r line2 && read -r line3; do
    printf "%s\n%s\n%s\n" "$line1" "$line2" "$line3" | bar.sh
done
For large N, write a function that loops.
read_n_lines() {
    read -r line || return 1
    echo "$line"
    n=$(($1 - 1))
    while [[ $n -gt 0 ]] && read -r line; do
        echo "$line"
        n=$((n - 1))
    done
}
Then you can do:
n=20
foo.sh | while lines=$(read_n_lines "$n"); do
    printf "%s\n" "$lines" | bar.sh
done
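As a quick illustration of the chunking (a hypothetical test run, with seq standing in for foo.sh and cat for bar.sh):
seq 1 7 | while lines=$(read_n_lines 3); do
    echo '--- chunk ---'
    printf '%s\n' "$lines" | cat
done
# prints two chunks of three lines followed by a final chunk with the remaining line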

Related

Shell - iterate over content of file but do something only the first x lines

So guys,
I need your help identifying the fastest and most fault-tolerant solution to my problem.
I have a shell script which executes some functions based on a txt file that contains a list of files.
The list can contain from 1 file to X files.
What I would like to do is iterate over the content of the file and execute my scripts for only 4 items from the file at a time.
Once the functions have been executed for these 4 files, move on to the next 4 .... and keep doing so until all the files from the list have been "processed".
My code so far is as follows.
#!/bin/bash
number_of_files_in_folder=$(cat list.txt | wc -l)
max_number_of_files_to_process=4
Translated_files=/home/german_translated_files/

while IFS= read -r files
do
    while [[ $number_of_files_in_folder -gt 0 ]]; do
        i=1
        while [[ $i -le $max_number_of_files_to_process ]]; do
            my_first_function "$files" &    # I execute my translation function for each file, as it can only perform 1 file per execution
            find /home/german_translator/ -name '*.logs' -exec mv {} $Translated_files \;    # As there will be several files generated, I have them copied to another folder
            sed -i "/$files/d" list.txt    # We remove the processed file from within our list.txt file.
            my_second_function    # Without parameters as it will process all the files copied at step 2.
        done
        # here, I want to have all the files processed and not stop after the first iteration
    done
done < list.txt
Unfortunately, as I am not very good at shell scripting, I do not know how to structure it so that it does not waste resources and, most importantly, so that it actually "processes" everything from that file.
Do you have any advice on how to achieve this?
only 4 items out of the file. Once the functions have been executed for these 4 files, go over to the next 4
Seems to be quite easy with xargs.
your_function() {
    echo "Do something with $1 $2 $3 $4"
}
export -f your_function
xargs -d '\n' -n 4 bash -c 'your_function "$@"' _ < list.txt
xargs -d '\n' - treat each input line as a single argument
-n 4 - pass four arguments per command invocation
bash ... - run this command with those 4 arguments
_ - placeholder for $0; the syntax is bash -c <script> $0 $1 $2 etc..., see man bash.
"$@" - forward the arguments
export -f your_function - export your function to the environment so the child bash can pick it up.
I execute my translation function for each file
So you execute your translation function for each file, not once per 4 files. If the "translation function" really is per-file with no inter-file state, consider instead running 4 such processes in parallel with the same code, using just xargs -P 4, as sketched below.
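A minimal sketch of that idea (assuming my_first_function can be exported from your script):
export -f my_first_function
xargs -d '\n' -n 1 -P 4 bash -c 'my_first_function "$1"' _ < list.txt
# -P 4 keeps up to four translations running at once; my_second_function would still run separately afterwards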
If you have GNU Parallel it looks something like this:
doit() {
    my_first_function "$1"
    my_first_function "$2"
    my_first_function "$3"
    my_first_function "$4"
    my_second_function "$1" "$2" "$3" "$4"
}
export -f doit
cat list.txt | parallel -n4 doit

Is it possible to do watch logfile with tail -f and pipe updates/changes over netcat to another local system? [duplicate]

There is a file located at $filepath, which grows gradually. I want to print every line that starts with an exclamation mark:
while read -r line; do
    if [ -n "$(grep ^! <<< "$line")" ]; then
        echo "$line"
    fi
done < <(tail -F -n +1 "$filepath")
Then, I rearranged the code by moving the comparison expression into the process substitution to make the code more concise:
while read -r line; do
    echo "$line"
done < <(tail -F -n +1 "$filepath" | grep '^!')
Sadly, it doesn't work as expected; nothing is printed to the terminal (stdout).
I prefer to write grep ^\! after tail. Why doesn't the second code snippet work? Why does putting the pipe inside the process substitution make things different?
PS1. This is how I manually produce the gradually growing file by randomly executing one of the following commands:
echo ' something' >> "$filepath"
echo '!something' >> "$filepath"
PS2. Test under GNU bash, version 4.3.48(1)-release and tail (GNU coreutils) 8.25.
grep does not line-buffer its output when its stdout isn't connected to a tty, so it accumulates a full block (usually 4 KiB or 8 KiB or so) before writing anything out.
You need to tell grep to buffer its output by line. If you're using GNU grep, this works:
done < <(tail -F -n +1 "$filepath" | grep '^!' --line-buffered)
^^^^^^^^^^^^^^^
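If you would rather not touch the grep flags themselves, stdbuf from GNU coreutils is a commonly used alternative that asks the child process to line-buffer its stdout; a sketch under that assumption:
done < <(tail -F -n +1 "$filepath" | stdbuf -oL grep '^!')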

Speed up dig -x in bash script

As an exercise at my university, I have to run a bash script to reverse-lookup all the DNS entries for a class B network block they own.
This is the fastest version I have got, but it still takes forever. Any help optimising this code?
#!/bin/bash
network="a.b"
CMD=/usr/bin/dig
for i in $(seq 1 254); do
    for y in $(seq 1 254); do
        answer=`$CMD -x $network.$i.$y +short`;
        echo $network.$i.$y ' resolves to ' $answer >> hosts_a_b.txt;
    done
done
Using GNU xargs to run 64 processes at a time might look like:
#!/usr/bin/env bash

lookupArgs() {
    for arg; do
        # echo entire line together to ensure atomicity
        echo "$arg resolves to $(dig -x "$arg" +short)"
    done
}
export -f lookupArgs

network="a.b"
for (( x=1; x<=254; x++ )); do
    for (( y=1; y<=254; y++ )); do
        printf '%s.%s.%s\0' "$network" "$x" "$y"
    done
done | xargs -0 -P64 bash -c 'lookupArgs "$@"' _ >hosts_a_b.txt
Note that this doesn't guarantee output order (and relies on the lookupArgs function doing one write() syscall per result) -- but the output is sortable, so you should be able to reorder it. Otherwise, one could get ordered output (and guaranteed atomicity of results) by switching to GNU parallel -- a large perl script, versus GNU xargs' small, simple, relatively low-feature implementation; a sketch of that variant follows.
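A hypothetical GNU Parallel version of the same pipeline, assuming the same address-generating loop; -k keeps output in input order and -j64 matches the xargs -P64 concurrency:
for (( x=1; x<=254; x++ )); do
    for (( y=1; y<=254; y++ )); do
        printf '%s.%s.%s\0' "$network" "$x" "$y"
    done
done | parallel -0 -j64 -k 'echo "{} resolves to $(dig -x {} +short)"' > hosts_a_b.txt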

Removing forks in a shell script so that it runs well in Cygwin

I'm trying to run a shell script on Windows in Cygwin. The problem I'm having is that it runs extremely slowly in the following section of code. From a bit of googling, I believe it's due to the large number of fork() calls within the script; since Windows has to go through Cygwin's emulation of fork(), it slows to a crawl.
A typical scenario: on Linux the script completes in under 10 seconds (depending on file size), but on Windows under Cygwin the same file takes nearly 10 minutes.
So the question is, how can I remove some of these forks and still have the script return the same output? I'm not expecting miracles, but I'd like to cut that 10-minute wait time down a fair bit.
Thanks.
check_for_customization(){
    filename="$1"
    extended_class_file="$2"
    grep "extends" "$filename" | grep "class" | grep -v -e '^\s*<!--' | while read line; do
        classname="$(echo $line | perl -pe 's{^.*class\s*([^\s]+).*}{$1}')"
        extended_classname="$(echo $line | perl -pe 's{^.*extends\s*([^\s]+).*}{$1}')"
        case "$classname" in
            *"$extended_classname"*) echo "$filename"; echo "$extended_classname |$classname | $filename" >> "$extended_class_file";;
        esac
    done
}
Update: Changed the regex a bit and used a bit more perl:
check_for_customization(){
    filename="$1"
    extended_class_file="$2"
    grep "^\(class\|\(.*\s\)*class\)\s.*\sextends\s\S*\(.*$\)" "$filename" | grep -v -e '^\s*<!--' | perl -pe 's{^.*class\s*([^\s]+).*extends\s*([^\s]+).*}{$1 $2}' | while read classname extended_classname; do
        case "$classname" in
            *"$extended_classname"*) echo "$filename"; echo "$extended_classname | $classname | $filename" >> "$extended_class_file";;
        esac
    done
}
So, using the above code, the run time was reduced from about 8 minutes to 2.5 minutes. Quite an improvement.
If anybody can suggest any other changes I would appreciate it.
Put more of the commands into one perl script, e.g.:
check_for_customization(){
    filename="$1" extended_class_file="$2" perl -n - "$1" <<\EOF
next if /^\s*<!--/;
next unless /^.*class\s*([^\s]+).*/;   $classname = $1;
next unless /^.*extends\s*([^\s]+).*/; $extended_classname = $1;
# same test as the original case statement: $classname must contain $extended_classname
if (index($classname, $extended_classname) != -1)
{
    print "$ENV{filename}\n";
    open FILEOUT, ">>$ENV{extended_class_file}";
    print FILEOUT "$extended_classname |$classname | $ENV{filename}\n"
}
EOF
}
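Usage stays the same as before; for example (hypothetical file and output names):
check_for_customization SomeClass.java extended_classes.txt   # hypothetical names, substitute your own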

bash - errors trying to pipe commands to run to separate function

I'm trying to get this function, meant to make it easy to parallelize my bash scripts, working. The idea is simple: instead of running each command sequentially, I pipe the commands I want to run to this function; it reads them with while read line, runs the jobs in the background for me, and takes care of the logistics. It doesn't work, though. I added set -x near where the commands are executed, and it looks like I'm getting weird quotes around the stuff I want executed. What should I do?
runParallel () {
    while read line
    do
        while [ "`jobs | wc -l`" -eq 8 ]
        do
            sleep 2
        done
        {
            set -x
            ${line}
            set +x
        } &
    done
    while [ "`jobs | wc -l`" -gt 0 ]
    do
        sleep 1
        jobs >/dev/null 2>/dev/null
        echo sleeping
    done
}
for H in `ypcat hosts | grep fmez | grep -v mgmt | cut -d\  -f2 | sort -u`
do
    echo 'ping -q -c3 $H 2>/dev/null 1>/dev/null && echo $H - UP || echo $H - DOWN'
done | runParallel
When I run it, I get output like the following:
> ./myscript.sh
+ ping -q -c3 '$H' '2>/dev/null' '1>/dev/null' '&&' echo '$H' - UP '||' echo '$H' - DOWN
Usage: ping [-LRUbdfnqrvVaA] [-c count] [-i interval] [-w deadline]
[-p pattern] [-s packetsize] [-t ttl] [-I interface or address]
[-M mtu discovery hint] [-S sndbuf]
[ -T timestamp option ] [ -Q tos ] [hop1 ...] destination
+ set +x
sleeping
>
The quotes in the set -x output are not the problem; at most they are another symptom of it. The main problem is that ${line} is not the same as eval ${line}.
When a variable is expanded, the resulting words are not treated as shell syntax. This is expected; it means that, e.g.,
A="some text containing > ; && and other weird stuff"
echo $A
does not shout about invalid syntax but prints the variable value.
But in your function it means that all the words in ${line}, including 2>/dev/null and the like, are passed as arguments to ping, which the set -x output nicely shows, and so ping complains.
If you want to execute complicated command lines with redirections and conditionals stored in variables, you will have to use eval.
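A minimal sketch of that change inside the asker's function; only the line that runs the command differs:
{
    set -x
    eval "${line}"   # let the shell re-parse redirections, && and ||
    set +x
} &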
If I'm understanding this correctly, you probably don't want single quotes in your echo command. Single quotes produce literal strings and don't expand your bash variable $H.
Like many users of GNU Parallel you seem to have written your own parallelizer.
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you can do this:
cat hosts | parallel -j8 'ping -q -c3 {} 2>/dev/null 1>/dev/null && echo {} - UP || echo {} - DOWN'
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos for GNU Parallel to learn more:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
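Applied to the host list from the question, that might look something like this (assuming the same ypcat filtering as in the original loop):
ypcat hosts | grep fmez | grep -v mgmt | cut -d\  -f2 | sort -u |
    parallel -j8 'ping -q -c3 {} >/dev/null 2>&1 && echo {} - UP || echo {} - DOWN'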
Put your command in an array.
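A minimal sketch of that approach, using the asker's loop variable $H:
# build the command word by word instead of as a single string
cmd=(ping -q -c3 "$H")
# "${cmd[@]}" expands to exactly those words, so no eval or string re-parsing is needed;
# redirections and && / || stay literal in the script rather than inside a variable
"${cmd[@]}" >/dev/null 2>&1 && echo "$H - UP" || echo "$H - DOWN"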
