How to efficiently loop through the lines of a file in Bash? - linux

I have a file example.txt with about 3000 lines with a string in each line. A small file example would be:
>cat example.txt
saudifh
sometestPOIFJEJ
sometextASLKJND
saudifh
sometextASLKJND
IHFEW
foo
bar
I want to check all repeated lines in this file and output them. The desired output would be:
>checkRepetitions.sh
found two equal lines: index1=1 , index2=4 , value=saudifh
found two equal lines: index1=3 , index2=5 , value=sometextASLKJND
I made a script checkRepetitions.sh:
#!/bin/bash
size=$(cat example.txt | wc -l)
for i in $(seq 1 $size); do
    i_next=$((i+1))
    line1=$(cat example.txt | head -n$i | tail -n1)
    for j in $(seq $i_next $size); do
        line2=$(cat example.txt | head -n$j | tail -n1)
        if [ "$line1" = "$line2" ]; then
            echo "found two equal lines: index1=$i , index2=$j , value=$line1"
        fi
    done
done
However, this script is very slow; it takes more than 10 minutes to run. In Python it takes less than 5 seconds... I tried to store the file in memory by doing lines=$(cat example.txt) and line1=$(cat $lines | cut -d',' -f$i), but this is still very slow...

When you do not want to use awk (a good tool for the job, parsing the input only once),
you can run through the lines several times. Sorting is expensive, but this solution avoids the loops you tried.
grep -Fnxf <(uniq -d <(sort example.txt)) example.txt
With uniq -d <(sort example.txt) you find all lines that occur more than once. Next, grep will search for these lines (option -f), matching complete lines only (-x), without regular expressions (-F), and show the line numbers where they occur (-n).
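With the sample example.txt above, the duplicated lines are saudifh and sometextASLKJND, so that command should print every occurrence with its line number, something like:
1:saudifh
3:sometextASLKJND
4:saudifh
5:sometextASLKJND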

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.
$ cat tst.awk
{ val2hits[$0] = val2hits[$0] FS NR }
END {
    for (val in val2hits) {
        numHits = split(val2hits[val],hits)
        if ( numHits > 1 ) {
            printf "found %d equal lines:", numHits
            for ( hitNr=1; hitNr<=numHits; hitNr++ ) {
                printf " index%d=%d ,", hitNr, hits[hitNr]
            }
            print " value=" val
        }
    }
}
$ awk -f tst.awk file
found 2 equal lines: index1=1 , index2=4 , value=saudifh
found 2 equal lines: index1=3 , index2=5 , value=sometextASLKJND
To give you an idea of the performance difference between a bash script that's written to be as efficient as possible and an equivalent awk script:
bash:
$ cat tst.sh
#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac
# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0
while IFS= read -r line; do
    (( ++lineNum ))
    if [[ ${lines[$line]} ]]; then
        printf 'Content previously seen on line %s also seen on line %s: %s\n' \
            "${lines[$line]}" "$lineNum" "$line"
    fi
    lines[$line]=$lineNum
done < "$1"
$ time ./tst.sh file100k > ou.sh
real 0m15.631s
user 0m13.806s
sys 0m1.029s
awk:
$ cat tst.awk
lines[$0] {
printf "Content previously seen on line %s also seen on line %s: %s\n", \
lines[$0], NR, $0
}
{ lines[$0]=NR }
$ time awk -f tst.awk file100k > ou.awk
real 0m0.234s
user 0m0.218s
sys 0m0.016s
There are no differences between the outputs of the two scripts:
$ diff ou.sh ou.awk
$
The above uses third-run timing to avoid caching issues, tested against a file generated by the following awk script:
awk 'BEGIN{for (i=1; i<=10000; i++) for (j=1; j<=10; j++) print j}' > file100k
When the input file had zero duplicate lines (generated by seq 100000 > nodups100k), the bash script executed in about the same amount of time as it did above, while the awk script executed much faster than it did above:
$ time ./tst.sh nodups100k > ou.sh
real 0m15.179s
user 0m13.322s
sys 0m1.278s
$ time awk -f tst.awk nodups100k > ou.awk
real 0m0.078s
user 0m0.046s
sys 0m0.015s

To demonstrate a relatively efficient (within the limits of the language and runtime) native-bash approach, which you can see running in an online interpreter at https://ideone.com/iFpJr7:
#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac
# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0
while IFS= read -r line; do
    lineNum=$(( lineNum + 1 ))
    if [[ ${lines[$line]} ]]; then
        printf 'found two equal lines: index1=%s, index2=%s, value=%s\n' \
            "${lines[$line]}" "$lineNum" "$line"
    fi
    lines[$line]=$lineNum
done <example.txt
Note the use of while read to iterate line-by-line, as described in BashFAQ #1: How can I read a file line-by-line (or field-by-field)?; this permits us to open the file only once and read through it without needing any command substitutions (which fork off subshells) or external commands (which need to be individually started up by the operating system every time they're invoked, and are likewise expensive).
The other part of the improvement here is that we're reading the whole file only once -- implementing an O(n) algorithm -- as opposed to running O(n^2) comparisons as the original code did.
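Run against the sample example.txt from the question, this prints the requested pairs (the spacing around the commas comes from the printf format above):
found two equal lines: index1=1, index2=4, value=saudifh
found two equal lines: index1=3, index2=5, value=sometextASLKJND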

Related

How to search the full string in file which is passed as argument in shell script?

I am passing an argument, and I have to match that argument in a file and extract the information. Could you please tell me how I can get it?
Example:
I have below details in file-
iMedical_Refined_load_Procs_task_id=970113
HV_Rawlayer_Execution_Process=988835
iMedical_HV_Refined_Load=988836
DHS_RawLayer_Execution_Process=988833
iMedical_DHS_Refined_Load=988834
If I pass 'hv' as the argument, it should pick 'iMedical_HV_Refined_Load' and give the result '988836'.
If I pass 'dhs', it should pick 'iMedical_DHS_Refined_Load' and give the result '988834'.
I tried the logic below, but it's not giving the correct result. What changes do I need to make?
echo $1 | tr a-z A-Z
g=${1^^}
echo $g
echo $1
val=$(awk -F= -v s="$g" '$g ~ s{print $2}' /medaff/Scripts/Aggrify/sltconfig.cfg)
echo "TASK ID is $val"
Assuming your matching criterion is the first string after the delimiter _ and the output needed is the number after the = character, you can try this sed:
$ sed -n "/_$1/I{s/[^=]*=\(.*\)/\1/p}" input_file
$ read -r input
hv
$ sed -n "/_$input/I{s/[^=]*=\(.*\)/\1/p}" input_file
988836
$ read -r input
dhs
$ sed -n "/_$input/I{s/[^=]*=\(.*\)/\1/p}" input_file
988834
If I'm reading it right, 2 quick versions -
$: cat 1
awk -F= -v s="_${1^^}_" '$1~s{print $2}' file
$: cat 2
sed -En "/_${1^^}_/{s/^.*=//;p;}" file
Both basically the same logic.
In pure bash -
$: cat 3
while IFS='=' read key val; do [[ "$key" =~ "_${1^^}_" ]] && echo "$val"; done < file
That's a lot less efficient, though.
If you know for sure there will be only one hit, all these could be improved a bit by short-circuit exits, but on such a small sample it won't matter at all. If you have a larger dataset to read, then I strongly suggest you formalize your specs better than "in this set I should get...".
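For example, a minimal sketch of the short-circuit variant of the awk version above, assuming you only ever want the first hit:
awk -F= -v s="_${1^^}_" '$1~s{print $2; exit}' file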

How to find list of words (in thousands) in list of tsv files (hundreds), with output as number of match for each string in each file, in linux?

I have hundreds of tsv files with the following structure (example):
GH1 123 family1
GH2 23 family2
.
.
.
GH4 45 family4
GH6 34 family6
And I have a text file with a list of words (thousands):
GH1
GH2
GH3
.
.
.
GH1000
I want to get output which contains the number of times each word occurred in each file, like this:
GH1 GH2 GH3 ... GH1000
filename1 1 1 0... 4
.
.
.
filename2 2 3 1... 0
I tried this code but it gives me only zeros:
for file in *.tsv; do
    echo $file >> output.tsv
    cat fore.txt | while read line; do
        awk -F "\\t" '{print $1}' $file | grep -wc $line >>output.tsv
        echo "\\t">>output.tsv;
    done ;
done
Use the following script.
Just redirect stdout to an output.txt file.
#!/bin/bash
# print the header row: all the words from words.txt
while read p; do
    echo -n "$p "
done <words.txt
echo ""
# one row per file: the filename followed by the count of each word
for file in *.tsv; do
    echo -n "$file = "
    while read p; do
        COUNT=$(sed "s/$p/$p\n/g" "$file" | grep -c "$p")
        echo -n "$COUNT "
    done <words.txt
    echo ""
done
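For example, assuming you save the script above as count_words.sh (the name is only for illustration), you would run it as:
bash count_words.sh > output.txt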
Here is a simple Awk script which collects a list like the one you describe.
awk 'BEGIN { printf "\t" }
     NR==FNR { a[$1] = n = FNR;
               printf "\t%s", $1; next }
     FNR==1 {
         if (f) { printf "%s", f;
                  for (i=1; i<=n; i++)
                      printf "\t%s", 0+b[i] }
         printf "\n"
         delete b
         f = FILENAME }
     $1 in a { b[a[$1]]++ }' fore.txt *.tsv /etc/motd
To avoid repeating the big block in an END rule, we add a short sentinel file at the end whose only purpose is to supply one more file after the last real one; its own counts will not be reported.
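If you would rather not rely on a sentinel file, here is a sketch of the same script with the printing block duplicated in an END rule instead, so the last real file's row is flushed there:
awk 'BEGIN { printf "\t" }
     NR==FNR { a[$1] = n = FNR; printf "\t%s", $1; next }
     FNR==1 { if (f) { printf "%s", f
                       for (i=1; i<=n; i++) printf "\t%s", 0+b[i] }
              printf "\n"; delete b; f = FILENAME }
     $1 in a { b[a[$1]]++ }
     END { if (f) { printf "%s", f
                    for (i=1; i<=n; i++) printf "\t%s", 0+b[i]
                    printf "\n" } }' fore.txt *.tsv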
The shell's while read loop is slow and inefficient and somewhat error-prone (you basically always want read -r and handling incomplete text files is hairy); in addition, the brute-force method will require reading the word file once per iteration, which incurs a heavy I/O penalty.

How to use the pipe in a file with head-tail operation?

size=$(wc -l < "$1")
if [ "$size" -gt 0 ]
then
    tr "[:lower:]" "[:upper:]" < $1 > output
    for (( i=1; i <= "$size"; ++i ))
    do
        echo "Line " $i $(head -"$i" > output | tail -1 > output)
    done
fi
Hi, guys!
I have a problem with this little code. Everything works fine except the head-tail thing. What I want to do is just display line number "i" from a file.
The results that I receive are just the last line ($size).
I think maybe something is wrong with the input of tail. The head -"$i" doesn't go to the specified line. :(
Any thoughts?
Ohhhh... I just realised: as input for my tail I give the same input as for head.
The solution is to give tail the result from head. How do I do that? :-/
You don't need to redirect the output from head to a file; otherwise, the pipe does not get any input at all. Also, use >> to append results, otherwise you will just keep overwriting the file with the next iteration of the loop. But make sure to delete the output file before each new call to the script, or you will just keep appending to it indefinitely.
echo "Line " $i $(head -"$i" $infile | tail -1 >> output)
Use read to fetch a line of input from the file.
# Since `1` is always true, essentially count up forever
for ((i=1; 1; ++i)); do
    # break when a read fails to read a line
    IFS= read -r line || break
    echo "Line $i: $(tr '[:lower:]' '[:upper:]' <<< "$line")"
done < "$1" > output
A more standard approach is to iterate over the file and maintain i explicitly.
i=1
while IFS= read -r line; do
    echo "Line $i: $(tr '[:lower:]' '[:upper:]' <<< "$line")"
    ((i++))
done < "$1" > output
I think you're re-implementing cat -n with prefix "Line ". If so, awk to the rescue!
awk '{print "Line "NR, tolower($0)}'
I made it. :D
The trick is to send the output of head to another file that will be the input for tail, like this:
echo "Line " $i $(head -"$i" < output >outputF | tail -1 < outputF)
Your questions made me think differently. Thank you!

Retrieve string between characters and assign on new variable using awk in bash

I'm new to bash scripting and learning how commands work, and I stumbled on this problem.
I have a file /home/fedora/file.txt
Inside of the file is like this:
[apple] This is a fruit.
[ball] This is a sport's equipment.
[cat] This is an animal.
What I want is to retrieve the words between "[" and "]".
What I have tried so far is:
while IFS='' read -r line || [[ -n "$line" ]];
do
    echo $line | awk -F"[" '{print$2}' | awk -F"]" '{print$1}'
done < /home/fedora/file.txt
I can print the words between "[" and "]".
Then I want to put the echoed word into a variable, but I don't know how to.
Any help will be appreciated.
Try this:
variable="$(echo $line | awk -F"[" '{print$2}' | awk -F"]" '{print$1}')"
or
variable="$(awk -F'[\[\]]' '{print $2}' <<< "$line")"
or complete
while IFS='[]' read -r foo fruit rest; do echo $fruit; done < file
or with an array:
while IFS='[]' read -ra var; do echo "${var[1]}"; done < file
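If the goal is to keep the extracted words around rather than just echo them, here is a small sketch that collects them into an array (the array name words is only for illustration):
words=()
while IFS='[]' read -r _ name _; do
    words+=("$name")              # $name holds the text between [ and ]
done < /home/fedora/file.txt
printf '%s\n' "${words[@]}"       # prints apple, ball, cat, one per line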
In addition to using awk, you can use the native parameter expansion/substring extraction provided by bash. Below, # indicates a trim from the left, while % is used to trim from the right. (note: a single # or % removes the shortest match, while ## or %% removes the longest match):
#!/bin/bash
[ -r "$1" ] || { ## validate input is readable
printf "error: insufficient input. usage: %s filename\n" "${0##*/}"
exit 1
}
## read each line and separate label and value
while read -r line || [ -n "$line" ]; do
label=${line#[} # trim initial [ from left
label=${label%%]*} # trim through ] from right
value=${line##*] } # trim from left through '[ '
printf " %-8s -> '%s'\n" "$label" "$value"
done <"$1"
exit 0
Input
$ cat dat/labels.txt
[apple] This is a fruit.
[ball] This is a sport's equipment.
[cat] This is an animal.
Output
$ bash readlabel.sh dat/labels.txt
apple -> 'This is a fruit.'
ball -> 'This is a sport's equipment.'
cat -> 'This is an animal.'

How can I retain numbers for sorting them later?

I have a problem that sounds like this: Write a shell script that for each file from the command line will output the
number of words that are longer than the number k read from keyboard.
The output must be ordered by the number of words.
How can I retain the number of characters of each file, for sorting them later?
I tried something like this:
#!/bin/bash
if [ $# -ne 1 ]
then exit 1
fi
array[$#]=''
echo -n "Give the number>"
read k
for f in $@;
do
    n=`$f | wc -c`
    if [ $n -gt $k ];
    then
        i++
        array[i]=$n
    fi
done
echo ${array[@]} | sort -n
The challenge is:
Write a shell script that for each file from the command line will output the number of words that are longer than the number k read from keyboard. The output must be ordered by the number of words.
I decline to answer prompts — commands take arguments. I'll go with William Pursell's suggestion that the number is the first argument — it is a reasonable solution. An alternative uses an option like -l 23 for the length (and other options to tweak other actions).
The solutions I see so far are counting the number of words, but not the number of words longer than the given length. This is a problem. For that, I think awk is appropriate:
awk -v min=$k '{ for (i = 1; i <= NF; i++) if (length($i) >= min) print $i; }'
This generates the words of at least min characters, one per line, on the standard output. We'll do this one file at a time, at least in the first pass.
We can then count the number of such words with wc -l. Finally, we can sort the data numerically.
Putting it all together yields:
#!/bin/bash
case "$#" in
0|1) echo "Usage: $0 length file ..." >&2; exit 1;;
esac
k=${1:?"Cannot provide an empty length"}
shift
for file in "$#"
do
echo "$(awk -v min=$k '{ for (i = 1; i <= NF; i++)
if (length($i) >= min) print $i
}' "$file" |
wc -l) $file"
done | sort -n
This lists the files with the most long words last; that's convenient because the most interesting files are at the end of the list. If you want the high numbers first, add -r to the sort.
Of course, if we're using awk, we can improve things. It can count the number of long words in each file, and print the file name and the number, so there'd be just a single invocation of awk for all the files. It takes a little bit more programming, though:
#!/bin/sh
case "$#" in
0|1) echo "Usage: $0 length file ..." >&2; exit 1;;
esac
k=${1:?"Cannot provide an empty length"}
shift
awk -v min=$k '
FILENAME != oldfile { if (oldfile != "") { print longwords, oldfile }
oldfile = FILENAME; longwords = 0
}
{ for (i = 1; i <= NF; i++) if (length($i) >= min) longwords++ }
END { if (oldfile != "") { print longwords, oldfile } }
' "$#" |
sort -n
If you have GNU awk, there are even ways to sort the results built into awk.
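For instance, here is a sketch that relies only on GNU awk (PROCINFO["sorted_in"] controls the traversal order of the for-in loop), so no external sort is needed; it assumes the same min/k and "$@" file arguments as the script above:
gawk -v min=$k '
    FNR == 1 { longwords[FILENAME] += 0 }   # make sure every file shows up, even with zero hits
    { for (i = 1; i <= NF; i++) if (length($i) >= min) longwords[FILENAME]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_asc"   # iterate the files by ascending count
        for (file in longwords) print longwords[file], file
    }' "$@"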
You can simplify the script a bit:
#!/bin/bash
(( $# > 0 )) || exit
read -r -p 'Enter number > ' k
wc -w "$#" | sed '$d' | gawk -v k="$k" '$1>k{print $0}' | sort -nr
where
read -r -p ... prompts and reads the input
wc -w - counts the words of all the files you passed as arguments
sed ... - skips the last line (total...)
awk - keeps only the lines where the count is greater than $k
sort - for sorting the output
With the great help of @Tom Fench here it can be simplified to:
wc -w "$@" | awk -v k="$k" 'NR>1&&p>k{print p}{p=$1}' | sort -nr
or with filenames (based on @Wintermute's comment here)
wc -w "$@" | awk -v k="$k" 'p { print p; p="" } $1 > k { p = $0 }' | sort -nr
EDIT
Based on @Jonathan Leffler's comment, here is a variant that counts the words longer than the number k in each file.
#!/bin/bash
(( $# > 0 )) || exit
read -r -p 'Enter number > ' k
let k++
grep -HoP "\b\w{${k:-3},}\b" "$@" |\
awk -F: '{f[$1]++}END{for(n in f)print f[n],n}' |\
sort -nr
Where:
the grep ... searches for the words that are longer than the entered number (omit the let line if you want words of equal or greater length) and prints out lines like:
file1:word1
file1:word2
...
file2:wordx
file2:wordy
and the awk counts the frequency based on the 1st field, i.e. counts by filename.
