How should I count the duplicate lines in each file?

I have tried this :
dirs=$1
for dir in $dirs
do
ls -R $dir
done

Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
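If the counts are wanted per file rather than across all files, the same pipeline can be run once per file. This is only a sketch; the function name dups_per_file is made up for illustration:

```shell
# Hypothetical helper: for each file argument, print the lines that occur
# more than once in that file, each prefixed by its count.
dups_per_file() {
    for f in "$@"; do
        [ -f "$f" ] || continue          # skip anything that is not a regular file
        printf '== %s\n' "$f"            # header naming the file
        sort -- "$f" | uniq -c | awk '$1 > 1'
    done
}
```

Calling dups_per_file foo bar prints a header for each file followed by that file's repeated lines; a file without repeated lines gets only its header.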

Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$@")
for dir in "${dirs[@]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
else { @buff = $_ }
$n = $F[0];
END { print for @buff }'
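The same "all lines tied for the maximum" selection can also be sketched in awk instead of perl. The function name most_repeated is invented for this example, and it reads the lines from standard input rather than from directories:

```shell
# Hypothetical helper: read lines on stdin and print every line that is
# tied for the highest number of occurrences, prefixed by that count.
most_repeated() {
    sort | uniq -c | awk '
        $1 > max  { max = $1; n = 0 }      # new maximum: discard the old ties
        $1 == max { buf[++n] = $0 }        # remember every line at the maximum
        END { for (i = 1; i <= n; i++) print buf[i] }'
}
```

It could replace the final stages of the pipeline above, for example cat "$dir"/* | most_repeated. Unlike the perl version it does not need the counts to be sorted first.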

You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$@"; do
if [ -s "$file" ]; then
awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
fi
done
The awk script first checks whether the current line is already stored in the array a and, if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' appears three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
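If per-file totals are wanted from a single awk process instead of one awk call per file, the counter and the array can be reset whenever a new file starts. This is a sketch, not part of the answer above; dup_count_per_file is a made-up name, and empty files are silently skipped because no line ever starts them:

```shell
# Hypothetical helper: print "file: N" for each file argument, where N is
# the number of lines that already appeared earlier in the same file.
dup_count_per_file() {
    awk '
        FNR == 1 {                                   # first line of a new file
            if (NR > 1) printf "%s: %d\n", prev, c   # report the previous file
            c = 0
            split("", seen)                          # portable way to empty an array
            prev = FILENAME
        }
        $0 in seen { c++ }
        { seen[$0] }
        END { if (NR > 0) printf "%s: %d\n", prev, c }
    ' "$@"
}
```

With the two example files above, dup_count_per_file inp1.txt inp2.txt reports 1 for inp1.txt and 0 for inp2.txt, since 'bar' is only repeated within the first file.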

Related

Bash function with input fails awk command

I am writing a function in a BASH shell script, that should return lines from csv-files with headers, having more commas than the header. This can happen, as there are values inside these files, that could contain commas. For quality control, I must identify these lines to later clean them up. What I have currently:
#!/bin/bash
get_bad_lines () {
local correct_no_of_commas=$(head -n 1 $1/$1_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $1 | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$1/$1_0_${i}_0.csv" ]; then
echo "File: $1_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "$1_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$1/$1_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk '$1 > $correct_no_of_commas {print}'
done
}
get_bad_lines products
get_bad_lines users
The output of this program is now all the comma-counts with all of the line numbers in all the files,
and I suspect this is due to the input $1 (foldername, i.e. products & users) conflicting with the call to awk with reference to $1 as well (where I wish to grab the first column being the count of commas for that line in the current file in the loop).
Is this the issue? and if so, would it be solvable by either referencing the 1.st column or the folder name by different variable names instead of both of them using $1 ?
Example, current output:
5 6667
5 6668
5 6669
5 6670
(should only show lines for that file having more than 5 commas).
Tried variable declaration in call to awk as well, with same effect
(as in the accepted answer to Awk field variable clash with function argument)
:
get_bad_lines () {
local table_name=$1
local correct_no_of_commas=$(head -n 1 $table_name/${table_name}_0_0_0.csv | tr -cd , | wc -c)
local no_of_files=$(ls $table_name | wc -l)
for i in $(seq 0 $(( ${no_of_files}-1 )))
do
# Check that the file exist
if [ ! -f "$table_name/${table_name}_0_${i}_0.csv" ]; then
echo "File: ${table_name}_0_${i}_0.csv not found!"
continue
fi
# Search for error-lines inside the file and print them out
echo "${table_name}_0_${i}_0.csv has over $correct_no_of_commas commas in the following lines:"
grep -o -n '[,]' "$table_name/${table_name}_0_${i}_0.csv" | cut -d : -f 1 | uniq -c | awk -v table_name="$table_name" '$1 > $correct_no_of_commas {print}'
done
}

You can do the whole job in awk:
get_bad_lines () {
find "$1" -maxdepth 1 -name "$1_0_*_0.csv" | while read -r my_file ; do
awk -v table_name="$1" '
NR==1 { num_comma=gsub(/,/, ""); }
/,/ { if (gsub(/,/, ",", $0) > num_comma) wrong_array[wrong++]=NR":"$0;}
END { if (wrong > 0) {
print(FILENAME" has over "num_comma" commas in the following lines:");
for (i=0;i<wrong;i++) { print(wrong_array[i]); }
}
}' "${my_file}"
done
}
As for why your original awk command did not print only the lines with too many commas: you are using the shell variable correct_no_of_commas inside a single-quoted awk program ('$1 > $correct_no_of_commas {print}'). The shell performs no substitution inside single quotes, so awk sees $correct_no_of_commas literally. To awk, correct_no_of_commas is an undefined variable, i.e. an empty string, which evaluates to 0 when used as a field number. The condition therefore becomes $1 > $0: the count in $1 is compared with the full input line. Because uniq -c pads the count with leading blanks and the line also contains the line number, the full line does not look like a number, so awk falls back to a string comparison. In the usual byte-wise ordering a digit sorts after a blank, so $1 > $correct_no_of_commas is always true.
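A minimal sketch of passing the shell value into awk so that the comparison is numeric. The function name over_limit and the awk variable name limit are invented for this example:

```shell
# Hypothetical helper: read "count line_number" pairs on stdin and print
# those whose count exceeds the limit given as the first argument.
over_limit() {
    # -v copies the shell value into the awk variable "limit".
    # Inside the single quotes, $1 is the first awk field, not the
    # first argument of the shell function.
    awk -v limit="$1" '$1 > limit'
}
```

In the function from the question, the last stage of the pipeline would then become ... | uniq -c | over_limit "$correct_no_of_commas".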

You can identify all the bad lines with a single awk command (ENDFILE is a GNU awk extension, so this needs gawk):
awk -F, 'FNR==1{print FILENAME; headerCount=NF;} NF>headerCount{print} ENDFILE{print "#######\n"}' /path/here/*.csv
If you want the line number also to be printed, use this
awk -F, 'FNR==1{print FILENAME"\nLine#\tLine"; headerCount=NF;} NF>headerCount{print FNR"\t"$0} ENDFILE{print "#######\n"}' /path/here/*.csv

grep reverse with exact matching

I have a list file, which has id and number and am trying to get those lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
so far am trying this but not working as expected.
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help

Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master

Maybe this helps:
awk 'NR == FNR {id[$1]=1;next}
{
if (id[$1] == "") {
print $0
}
}' listfile masterfile
We accept 2 files as input above, first one is listfile, second is masterfile.
NR == FNR would be true while awk is going through listfile. In the associative array id[], all ids in listfile are made a key with value as 1.
When awk goes through masterfile, it only prints a line if $1, i.e. the id, is not a key in the array id.

The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This line will not work because, for every entry $i, you print all entries in master.txt that do not match "$i". As a consequence, you end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads the file master.txt multiple times. This is very inefficient.
The Unix tool grep can check multiple expressions stored in a file in a single pass. This is done using the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this matches the ids as words anywhere on the line, not only in column 1.
The awk solution presented by Kent is more flexible and allows the OP to define a more tuned match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP clearly states, I want to match column 1 of list with column 1 of master and I don't care about spaces or whatever is in column 2. The grep solution could still match entries in column 2.
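A small sketch of the column-1 comparison wrapped in a function. It uses the in operator, which tests for a key without creating an array entry for every id in the master file. The name not_in_list is made up, and the sketch assumes the list file is not empty (otherwise NR==FNR also holds for the master file and nothing is printed):

```shell
# Hypothetical helper: print the lines of the master file ($2) whose first
# column does not occur as a first column in the list file ($1).
not_in_list() {
    awk 'NR == FNR { seen[$1]; next }   # first file: remember each id
         !($1 in seen)                  # second file: print lines with unknown ids
        ' "$1" "$2"
}
```

Because only column 1 is compared, and compared as a whole string, an id such as nw_6 in the list does not hide nw_66 in the master file, and nothing in column 2 can cause a match.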

How to find list of words (in thousands) in list of tsv files (hundreds), with output as number of match for each string in each file, in linux?

I have hundreds of tsv file with following structure (example):
GH1 123 family1
GH2 23 family2
.
.
.
GH4 45 family4
GH6 34 family6
And i have a text file with list of words (thousands):
GH1
GH2
GH3
.
.
.
GH1000
I want to get output which contain number of each words occurred in each file like this
GH1 GH2 GH3 ... GH1000
filename1 1 1 0... 4
.
.
.
filename2 2 3 1... 0
I try this code but it gives me zero only
for file in *.tsv; do
echo $file >> output.tsv
cat fore.txt | while read line; do
awk -F "\\t" '{print $1}' $file | grep -wc $line >>output.tsv
echo "\\t">>output.tsv;
done ;
done

Use the following script.
Just redirect stdout to the output.txt file.
#!/bin/bash
while read p; do
echo -n "$p "
done <words.txt
echo ""
for file in *.tsv; do
echo -n "$file = "
while read p; do
COUNT=$(sed "s/$p/$p\n/g" "$file" | grep -c "$p")  # double quotes so the shell expands $p
echo -n "$COUNT "
done <words.txt
echo ""
done

Here is a simple Awk script which collects a list like the one you describe.
awk 'BEGIN { printf "\t" }
NR==FNR { a[$1] = n = FNR;      # map each word to its column number
printf "\t%s", $1; next }
FNR==1 {
if(f) { printf "%s", f;
for (i=1; i<=n; i++)
printf "\t%s", 0+b[i] }
printf "\n"
delete b
f = FILENAME }
$1 in a { b[a[$1]]++ }' fore.txt *.tsv /etc/motd
To avoid repeating the big block in END, we add a short sentinel file at the end. Its only purpose is to supply one more file after the last real one, so that the counts of the last real file get printed; the sentinel's own counts are never reported.
The shell's while read loop is slow and inefficient and somewhat error-prone (you basically always want read -r and handling incomplete text files is hairy); in addition, the brute-force method will require reading the word file once per iteration, which incurs a heavy I/O penalty.

bash count sequential files

I'm pretty new to bash scripting so some of the syntaxes may not be optimal. Please do point them out if you see one.
I have files in a directory named sequentially.
Example: prob01_01 prob01_03 prob01_07 prob02_01 prob02_03 ....
I am trying to have the script iterate through the current directory and count how many extensions each problem has. Then print the pre-extension name then count
Sample output for above would be:
prob01 3
prob02 2
This is my code:
#!/bin/bash
temp=$(mktemp)
element=''
count=0
for i in *
do
current=${i%_*}
if [[ $current == $element ]]
then
let "count+=1"
else
echo $element $count >> temp
element=$current
count=1
fi
done
echo 'heres the temp:'
cat temp
rm 'temp'
The Problem:
Current output:
prob1 3
Desired output:
prob1 3
prob2 2
The last count isn't appended because it's not seeing a different element after it
My Guess on possible solutions:
Have the last append occur at the end of the for loop?

Your code has 2 problems.
The first problem is not the one you asked about. You create a temporary file whose name is stored in $temp. You should use that file, not a file with the fixed name temp.
The second problem is that you only write results when you see a new problem/filename, so the last one is never printed.
Fixing only these problems will result in
results() {
if (( count == 0 )); then
return
fi
echo $element $count >> "${temp}"
}
temp=$(mktemp)
element=''
count=0
for i in prob*
do
current=${i%_*}
if [[ $current == $element ]]
then
let "count+=1" # Better is using ((count++))
else
results
element=$current
count=1
fi
done
results
echo 'heres the temp:'
cat "${temp}"
rm "${temp}"
You can do without the script with
ls prob* | cut -d"_" -f1 | sort | uniq -c
When you want to have the output displayed as given, you need one more step.
ls prob* | cut -d"_" -f1 | sort | uniq -c | awk '{print $2 " " $1}'

You may use printf + awk solution:
printf '%s\n' *_* | awk -F_ '{a[$1]++} END{for (i in a) print i, a[i]}'
prob01 3
prob02 2
We use printf to print each file that has at least one _
We use awk to get a count of each file's first element delimited by _ by using an associative array.

I would do it like this:
$ ls | awk -F_ '{print $1}' | sort | uniq -c | awk '{print $2 " " $1}'
prob01 3
prob02 2

Need to reduce the execution time

We are trying to execute below script for finding out the occurrence of a particular word in a log file
Need suggestions to optimize the script.
Test.log size - Approx to 500 to 600 MB
$wc -l Test.log
16609852 Test.log
po_numbers - 11 to 12k po's to search
$more po_numbers
xxx1335
AB1085
SSS6205
UY3347
OP9111
....and so on
Current Execution Time - 2.45 hrs
while IFS= read -r po
do
check=$(grep -c "PO_NUMBER=$po" Test.log)
echo $po "-->" $check >>list3
if [ "$check" = "0" ]
then
echo $po >>po_to_server
#else break
fi
done < po_numbers

You are reading your big file too many times when you execute
grep -c "PO_NUMBER=$po" Test.log
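You can try to split your big file into smaller ones, or write your patterns to a file once and make grep read them from it. Write each pattern without a trailing empty line, because an empty pattern matches every line:
echo "PO_NUMBER=$po" >> patterns.txt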
then
grep -f patterns.txt Test.log

$ grep -Fwf <(sed 's/.*/PO_NUMBER=&/' po_numbers) Test.log
This creates the lookup patterns from po_numbers (via process substitution) and checks the log file for literal word matches. It assumes that the searched PO_NUMBER=xxx is a separate word (if not, remove -w) and that the patterns are literal strings rather than regexes (if not, remove -F); removing either option will slow down the search.

Using Grep :
sed -e 's|^|PO_NUMBER=|' po_numbers | grep -o -F -f - Test.log | sed -e 's|^PO_NUMBER=||' | sort | uniq -c > list3
grep -o -F -f po_numbers list3 | grep -v -F -f - po_numbers > po_to_server
Using awk :
This awk program might work faster
awk '(NR==FNR){ po[$0]=0; next }
{ for(key in po) {
str=$0
po[key]+=gsub("PO_NUMBER="key,"",str)
}
}
END {
for(key in po) {
if (po[key]==0) {print key >> "po_to_server" }
else {print key"-->"po[key] >> "list3" }
}
}' po_numbers Test.log
This does the following:
The first block loads the PO keys from the file po_numbers.
The second block parses each line of Test.log for occurrences of PO_NUMBER=key. (gsub is a function which performs a substitution and returns the substitution count.)
At the end, we print the requested output to the requested files.
The assumption here is that several patterns may occur, possibly several times each, on a single line of Test.log.
Comment: the original order of po_numbers is not preserved.
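As an alternative sketch that avoids looping over every key for every log line, each PO_NUMBER=value token can be extracted with match() and looked up directly in the array. This rests on an assumption that is not stated in the question: a PO value consists only of letters and digits and ends at the first other character. The function name count_po is made up. Note that it counts occurrences rather than matching lines, and that it matches whole values only:

```shell
# Hypothetical helper: $1 = file with one PO per line, $2 = log file.
# Prints "PO-->count" for every PO, in the order of the PO file.
count_po() {
    awk '
        NR == FNR { count[$0] = 0; order[++n] = $0; next }
        {
            rest = $0
            # Assumption: a PO value is a run of letters and digits.
            while (match(rest, /PO_NUMBER=[A-Za-z0-9]+/)) {
                key = substr(rest, RSTART + 10, RLENGTH - 10)   # drop "PO_NUMBER="
                if (key in count) count[key]++
                rest = substr(rest, RSTART + RLENGTH)           # continue after the token
            }
        }
        END { for (i = 1; i <= n; i++) print order[i] "-->" count[order[i]] }
    ' "$1" "$2"
}
```

The output of count_po po_numbers Test.log has the same shape as list3, and the POs with a count of 0 are the candidates for po_to_server.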

"finding out the occurrence"
Not sure if you mean to count the number of occurrences for each searched word or to output the lines in the log that contain at least one of the searched words. This is how you could solve it in the latter case:
(cat po_numbers; echo GO; cat Test.log) | \
perl -nle'$r?/$r/&&print:/GO/?($r=qr/@{[join"|",@s]}/):push@s,$_'
