We are trying to execute the script below to find the number of occurrences of particular words in a log file, and need suggestions to optimize it.
Test.log size - approximately 500 to 600 MB
$ wc -l Test.log
16609852 Test.log
po_numbers - 11,000 to 12,000 PO numbers to search
$ more po_numbers
xxx1335
AB1085
SSS6205
UY3347
OP9111
....and so on
Current Execution Time - 2.45 hrs
while IFS= read -r po
do
check=$(grep -c "PO_NUMBER=$po" Test.log)
echo $po "-->" $check >>list3
if [ "$check" = "0" ]
then
echo $po >>po_to_server
#else break
fi
done < po_numbers
You are reading your big file once per PO when you execute
grep -c "PO_NUMBER=$po" Test.log
You can either split your big file into smaller ones, or write all your patterns to a file and make grep use it:
echo -e "PO_NUMBER=$po\n" >> patterns.txt
then
grep -f patterns.txt Test.log
$ grep -Fwf <(sed 's/.*/PO_NUMBER=&/' po_numbers) Test.log
This creates the lookup file from po_numbers via process substitution and checks for literal word matches in the log file. It assumes the searched PO_NUMBER=xxx is a separate word (if not, remove -w) and that the patterns are literal strings rather than regexes (if not, remove -F); note that removing either will slow the search down.
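A quick illustration of the -w effect (a made-up example; the longer PO number here is hypothetical): without -w, a shorter PO is also counted inside a longer one that shares its prefix.
$ printf 'PO_NUMBER=AB1085\nPO_NUMBER=AB10857\n' | grep -Fc 'PO_NUMBER=AB1085'
2
$ printf 'PO_NUMBER=AB1085\nPO_NUMBER=AB10857\n' | grep -Fwc 'PO_NUMBER=AB1085'
1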
Using Grep :
sed -e 's|^|PO_NUMBER=|' po_numbers | grep -o -F -f - Test.log | sed -e 's|^PO_NUMBER=||' | sort | uniq -c > list3
grep -o -F -f po_numbers list3 | grep -v -x -F -f - po_numbers > po_to_server
Using awk :
This awk program might work faster
awk '(NR==FNR){ po[$0]=0; next }
     {
       for(key in po) {
         str=$0
         po[key]+=gsub("PO_NUMBER="key,"",str)
       }
     }
     END {
       for(key in po) {
         if (po[key]==0) { print key >> "po_to_server" }
         else            { print key"-->"po[key] >> "list3" }
       }
     }' po_numbers Test.log
This does the following:
The first block loads the PO keys from the file po_numbers.
The second block parses each line of Test.log for occurrences of PO_NUMBER=key (gsub is a function which performs a substitution and returns the substitution count).
In the END block we print the requested output to the requested files.
The assumption here is that multiple patterns might occur, possibly multiple times, on a single line of Test.log.
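A tiny illustration of the gsub return value (my addition, reusing one of the sample PO numbers):
$ echo "PO_NUMBER=UY3347 foo PO_NUMBER=UY3347" | awk '{ n = gsub(/PO_NUMBER=UY3347/, ""); print n }'
2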
Comment: the original order of po_numbers will not be preserved.
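If the original order matters, one option (a sketch of my own, assuming list3 lines look like key-->count as produced above) is to reorder list3 afterwards:
awk -F'-->' 'NR==FNR { line[$1]=$0; next } ($0 in line) { print line[$0] }' list3 po_numbers > list3.ordered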
"finding out the occurrence"
Not sure if you mean to count the number of occurrences for each searched word or to output the lines in the log that contain at least one of the searched words. This is how you could solve it in the latter case:
(cat po_numbers; echo GO; cat Test.log) | \
perl -nle'$r?/$r/&&print:/GO/?($r=qr/@{[join"|",@s]}/):push@s,$_'
I am writing a function in a bash shell script that should return lines from CSV files with headers that have more commas than the header. This can happen because some values inside these files contain commas. For quality control, I must identify these lines to later clean them up. What I have currently:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is currently all the comma counts with all of the line numbers in all the files,
and I suspect this is because the input $1 (the folder name, i.e. products & users) conflicts with the reference to $1 in the call to awk (where I wish to grab the first column, i.e. the comma count for that line of the current file in the loop).
Is this the issue? And if so, would it be solvable by referencing the first column or the folder name by different variable names instead of both of them using $1?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
Tried variable declaration in the call to awk as well, with the same effect (as in the accepted answer to Awk field variable clash with function argument):
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}
You can use awk all the way to achieve that:
get_bad_lines () {
    find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
        awk '
            # header line: count its commas (gsub also strips them, so the header
            # is not re-counted by the /,/ rule below)
            NR==1 { num_comma=gsub(/,/, ""); }
            # any other line containing commas: count them without changing the line,
            # and remember the line if it has more commas than the header
            /,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0; }
            END {
                if (wrong > 0) {
                    print(FILENAME" has over "num_comma" commas in the following lines:");
                    for (i=0;i<wrong;i++) { print(wrong_array[i]); }
                }
            }' "${my_file}"
    done
}
As for why your original awk command failed to print only the lines with too many commas: you are using a shell variable, correct_no_of_commas, inside a single-quoted awk statement ('$1 > $correct_no_of_commas {print}'). The shell performs no substitution there, so awk reads $correct_no_of_commas literally and looks up an awk variable named correct_no_of_commas, which is undefined in the awk script and therefore an empty string. awk then evaluates the condition as $1 > $"", and since $"" is equivalent to $0, awk compares the count in $1 against the entire input line. With input of the form <spaces><count> <line_number>, that comparison is effectively always true, which is why every line is printed.
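The usual fix (a minimal sketch based on your own pipeline; limit is just an awk variable name picked for this example) is to pass the shell variable into awk with -v:
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v limit="$correct_no_of_commas" '$1 > limit {print}'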
You can identify all the bad lines with a single awk command
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv
I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
you can use something like this:
grep '>' *.fasta | while read -r line ; do
    new_name="$(echo "$line" | cut -d' ' -f 6)"
    old_name="$(echo "$line" | cut -d':' -f 1)"
    mv "$old_name" "$new_name.fasta"
done
It searches the *.fasta files and handles every matched line:
it splits each grep result by spaces and takes the 6th field as the new name
it splits each grep result by ':' and takes the first field as the old name
it moves/renames the old filename to the new filename
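For reference (using the sample files from the question), the grep output being parsed looks like this, which is why the 6th space-separated field is the sP token and the first ':'-separated field is the filename:
$ grep '>' *.fasta
File_1.fasta:>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds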
There are several things going on with the code in your question.
For a start, I actually don't get this particular error; that might be down to different grep versions.
It might come down to grep interpreting '>' the same as > if bash expansion goes wrong, so I would suggest trying "\>" instead.
Secondly:
fname=$("echo $match|awk '{print $6}'")
The quotes inside serve an unintended purpose. Your code should look like this, if anything:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, this should be your final code:
for match in "$(grep -Hr "\>")"; do
fname="$(echo "$match" | cut -d: -f1)"
new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' and not directly for 'sP', as in:
grep -Hro "sP[0-9]*" .
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
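With the sample files from the question, that direct search would produce something like:
$ grep -Hro "sP[0-9]*" .
./File_1.fasta:sP171215
./File_2.fasta:sP131957
./File_3.fasta:sP127952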
Quite straightforward in (g)awk :
create a file "script.awk":
FNR == 1 {
    for (i=1; i<=NF; i++) {
        if (index($i, "sP")==1) {
            print "mv", FILENAME, $i ".fasta"
            nextfile
        }
    }
}
use it :
awk -f script.awk *.fasta > cmmd.txt
check the content of the output.
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
if ok, launch the renames with . cmmd.txt
For all fasta files in the directory, search their first line for the first word starting with sP and rename them using that word as the basename.
Using a bash array:
for f in *.fasta; do
    arr=( $(head -1 "$f") )
    for word in "${arr[@]}"; do
        [[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
    done
done
or using grep:
for f in *.fasta; do
    word=$(head -1 "$f" | grep -o "\bsP\w*")
    [ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.
Using a bash script, I'm trying to iterate through a text file that only has around 700 words, line-by-line, and run a case-insensitive grep search in the current directory using that word on particular files. To break it down, I'm trying to output the following to a file:
Append a newline to a file, then the searched word, then another newline
Append the results of the grep command using that search
Repeat steps 1 and 2 until all words in the list are exhausted
So for example, if I had this list.txt:
search1
search2
I'd want the results.txt to be:
search1:
grep result here
search2:
grep result here
I've found some answers throughout the stack exchanges on how to do this and have come up with the following implementation:
#!/usr/bin/bash
while IFS = read -r line;
do
"\n$line:\n" >> "results.txt";
grep -i "$line" *.in >> "results.txt";
done < "list.txt"
For some reason, however, this (and the numerous variants I've tried) isn't working. It seems trivial, but it's been frustrating me beyond belief. Any help is appreciated.
Your script would work if you changed it to:
while IFS= read -r line; do
    printf '\n%s:\n' "$line"
    grep -i "$line" *.in
done < list.txt > results.txt
but it'd be extremely slow. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice for why you should think long and hard before writing a shell loop just to manipulate text. The standard UNIX tool for manipulating text is awk:
awk '
NR==FNR { words2matches[$0]; next }
{
    for (word in words2matches) {
        if ( index(tolower($0),tolower(word)) ) {
            words2matches[word] = words2matches[word] $0 ORS
        }
    }
}
END {
    for (word in words2matches) {
        print word ":" ORS words2matches[word]
    }
}
' list.txt *.in > results.txt
The above is untested of course since you didn't provide sample input/output we could test against.
Possible problems:
bash path - use /bin/bash path instead of /usr/bin/bash
blank spaces - remove the spaces around '=' after IFS (it must be IFS=, not IFS =)
echo - use -e option for handling escape characters (here: '\n')
semicolons - not required at end of line
Try following script:
#!/bin/bash
while IFS= read -r line; do
echo -e "$line:\n" >> "results.txt"
grep -i "$line" *.in >> "results.txt"
done < "list.txt"
You do not even need to write a bash script for this purpose:
INPUT FILES:
$ more file?.in
::::::::::::::
file1.in
::::::::::::::
abc
search1
def
search3
::::::::::::::
file2.in
::::::::::::::
search2
search1
abc
def
::::::::::::::
file3.in
::::::::::::::
abc
search1
search2
def
search3
PATTERN FILE:
$ more patterns
search1
search2
search3
CMD:
$ grep -inf patterns file*.in | sort -t':' -k3 | awk -F':' 'BEGIN{OFS=FS}{if($3==buffer){print $1,$2}else{print $3; print $1,$2}buffer=$3}'
OUTPUT:
search1
file1.in:2
file2.in:2
file3.in:2
search2
file2.in:1
file3.in:3
search3
file1.in:4
file3.in:5
EXPLANATIONS:
grep -inf patterns file*.in greps all the file*.in files for all the patterns located in the patterns file thanks to the -f option; -i forces case-insensitive matching and -n adds the line numbers
sort -t':' -k3 sorts the output on the 3rd column to group identical patterns together
awk -F':' 'BEGIN{OFS=FS}{if($3==buffer){print $1,$2}else{print $3; print $1,$2}buffer=$3}' then prints the display that you want, using ':' as field separator and output field separator; a buffer variable saves the pattern (3rd field), and the pattern is printed whenever it changes ($3!=buffer)
I have tried this :
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$#")
for dir in "${dirs[#]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
           else { @buff = $_ }
           $n = $F[0];
           END { print for @buff }'
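If you'd rather stay in awk for that last step, a rough equivalent (my own sketch, not part of the original answer) is to replace sort -n | tail -n1 with:
awk '$1 > max { max = $1; out = $0 ORS; next } $1 == max { out = out $0 ORS } END { printf "%s", out }'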
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$#"; do
if [ -s "$file" ]; then
awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
fi
done
The awk script first checks to see if the current line is stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
I am able to find the number of times a word occurs in a text file, like in Linux we can use:
cat filename|grep -c tom
My question is, how do I find the count of multiple words like "tom" and "joe" in a text file.
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression for joe or tom, but found that this did not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we need to account for names being on the same line.
I then used grep -o to print only the parts of matching lines that match the pattern, which gives the individual matches of tom or joe in the file, and piped the results to wc -l for the match count.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
You use uniq:
sort filename | uniq -c
Use awk:
{
    for (i=1;i<=NF;i++)
        count[$i]++
}
END {
    for (i in count)
        print count[i], i
}
This will produce a complete word frequency count for the input.
Pipe the output to grep to get the desired fields
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter, hence it's better to use
grep -c tom filename
if not, there is a strong possibility that people will start throwing Useless Use of Cat Award at you ;-)
The sample you gave does not search for the word "tom". It will count "atom" and "bottom" and many more.
Grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is
\<\(tom\|joe\)\>
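A quick demonstration of the difference, with made-up input:
$ printf 'atom\nbottom\ntom\n' | grep -c tom
3
$ printf 'atom\nbottom\ntom\n' | grep -c '\<\(tom\|joe\)\>'
1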
You could use a regexp:
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list contains, at index k, the line number and, at index k+1, the number
# of times the string occurs on that line
while IFS= read -r line
do
flag=0
while [[ "$line" == *$string* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
    # replace one occurrence of "$string" with nothing and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurence # : Line Number : Nos of Occurance in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
cat filename | grep -cf names
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[i] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
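With -w the same example no longer counts it:
$ echo tomorrow | grep -cw tom
0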
gawk -v RS='[^[:alpha:]]+' '{print}' filename | grep -Ec '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because POSIX awk doesn't allow a regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
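Using the earlier test.txt sample from this thread, that would look roughly like this:
$ gawk -v RS='[^[:alpha:]]+' '{print}' test.txt | grep -Ec '^(tom|joe)$'
3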
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | akw '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.