Find duplicate lines based on column and print both lines and their numbers with awk - linux

I have the following file:
userID PWD_HASH
test 1234
admin 1234
user 6789
abcd 5555
efgh 6666
root 1234
Using AWK, I need to find both the original lines and their duplicates, with row numbers, so that I get output like:
NR $0
1 test 1234
2 admin 1234
6 root 1234
I have tried the following, but it does not print the correct row number with NR:
awk 'n=x[$2]{print NR" "n;print NR" "$0;} {x[$2]=$0;}' file.txt
Any help would be appreciated!

$ awk '
($2 in a) {          # look for duplicates in $2
    if(a[$2]) {      # if found
        print a[$2]  # output the first, stored one
        a[$2]=""     # mark it as already printed
    }
    print NR,$0      # print the duplicated one
    next             # skip the storing part that follows
}
{
    a[$2]=NR OFS $0  # store the first of each with NR and full record
}' file
Output (with the header in file):
2 test 1234
3 admin 1234
7 root 1234

Using GAWK, you can do this with the construct below:
awk '
NR>1 {                    # skip the header line
    a[$2][NR-1 " " $0]    # gawk array of arrays: keyed by hash, then by "line-number line"
}
END {
    for (i in a)
        if (length(a[i]) > 1)
            for (j in a[i])
                print j
}
' Input_File.txt
Create a 2-dimensional (gawk array-of-arrays) array.
In the first dimension, store PWD_HASH ($2), and in the second dimension, store the line number (NR-1, to discount the header) concatenated with the whole line ($0).
To display only the duplicated ones, you can use the length(a[i]) > 1 condition.
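Since the order of for (j in a[i]) traversal is unspecified in awk, the rows may print in any order; for the sample file above, the expected set of lines is:
1 test 1234
2 admin 1234
6 root 1234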

Could you please try following.
awk '
FNR==NR{
  a[$2]++
  b[$2,FNR]=FNR==1?FNR:(FNR-1) OFS $0
  next
}
a[$2]>1{
  print b[$2,FNR]
}
' Input_file Input_file
Output will be as follows.
1 test 1234
2 admin 1234
6 root 1234
Explanation: Following is the explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR, which will be TRUE the first time Input_file is read.
  a[$2]++ ##Creating an array named a whose index is $2, incrementing its value by 1 each time the same index is seen.
  b[$2,FNR]=FNR==1?FNR:(FNR-1) OFS $0 ##Creating array b indexed by $2,FNR; it stores the line number (FNR-1, to discount the header) followed by the whole line.
  next ##Using next to skip all further statements on the first pass.
}
a[$2]>1{ ##Checking whether the value of a[$2] is greater than 1; this is executed when Input_file is read the 2nd time.
  print b[$2,FNR] ##Printing the value of array b whose index is $2,FNR here.
}
' Input_file Input_file ##Mentioning the Input_file name here 2 times.

Without using awk, but with GNU coreutils tools:
tail -n+2 file | nl | sort -k3n | uniq -D -f2
tail removes the first (header) line.
nl adds line numbers.
sort sorts numerically on the 3rd field (the hash).
uniq -D -f2 skips the first 2 fields when comparing and prints only the duplicated lines.
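With the sample file, the pipeline should print something like:
     1  test 1234
     2  admin 1234
     6  root 1234
(nl pads the numbers and separates them from the line with a tab by default.)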

Related

Merging two txt files based on a common column with different row numbers

I would like to merge two whitespace-delimited files on the "phenotype" column without sorting them first. File 1 contains the same phenotype several times, while file 2 has each phenotype only once. I need to match "phenotype" from file 1 to "Category" in file 2.
File 1:
chr pos pval_EAS phenotype FDR
1 1902906 0.234 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 1475898 0.221 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 568899 0.433 continuous-4566-both_sexes-irnt.tsv.gz 1
2 2435478 0.113 continuous-4566-both_sexes-irnt.tsv.gz 1
4 1223446 0.112 phecode-554-both_sexes-irnt.tsv.gz 0.345
4 3456573 0.0003 phecode-554-both_sexes-irnt.tsv.gz 0.989
File 2:
phenotype Category
biomarkers-30600-both_sexes-irnt.tsv.bgz Metabolic
continuous-4566-both_sexes-irnt.tsv.gz Neoplasms
phecode-554-both_sexes-irnt.tsv.gz Immunological
I tried the following, but I don't get the desired output:
awk -F' ' 'FNR==NR{a[$1]=$4; next} {print $0 a[$6]}' file2 file1 > file3
With your shown samples, please try the following.
awk 'FNR==NR{arr[$1]=$2;next} ($4 in arr){print $0,arr[$4]}' file2 file1
Explanation: adding a detailed explanation for the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file2 is being read.
arr[$1]=$2 ##Creating array arr with index of $1 and value is $2.
next ##next will skip all further statements from here.
}
($4 in arr){ ##Checking condition if 4th field is in arr then do following.
print $0,arr[$4] ##Printing current line along with value of arr with 4th field as index number.
}
' file2 file1 ##Mentioning Input_file names here.
Bonus solution: in case you also want to print the lines whose values do not match, with N/A appended, then do the following.
awk 'FNR==NR{arr[$1]=$2;next} {print $0,(($4 in arr)?arr[$4]:"N/A")}' file2 file1
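Note that in the shown samples, file2's biomarkers entry ends in .tsv.bgz while file1's phenotype column ends in .tsv.gz, so those rows do not match: the first command should omit them, and the bonus version should print:
chr pos pval_EAS phenotype FDR Category
1 1902906 0.234 biomarkers-30600-both_sexes-irnt.tsv.gz 1 N/A
2 1475898 0.221 biomarkers-30600-both_sexes-irnt.tsv.gz 1 N/A
2 568899 0.433 continuous-4566-both_sexes-irnt.tsv.gz 1 Neoplasms
2 2435478 0.113 continuous-4566-both_sexes-irnt.tsv.gz 1 Neoplasms
4 1223446 0.112 phecode-554-both_sexes-irnt.tsv.gz 0.345 Immunological
4 3456573 0.0003 phecode-554-both_sexes-irnt.tsv.gz 0.989 Immunological
(The header line picks up Category because file2's own header row is also stored in the array.)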

AWK: Comparing substrings from two files and write to third file

I'm trying to compare two different files, let's say "file1" and "file2", in this way: if the 5-character substring at positions 8 to 12 matches in both file1 and file2, then remove that matching row from file1. Finally, write the output to file3 (the output contains the remaining rows of file1 that do not match file2).
My desired output is the non-matching rows of file1:
Output (file3) = File1 - File2
File1
-----
aqcdfdf4555578782121
axcdfdf4555575782321
aecdfdf7555578782221
aqcdfdf9555578782121
File2
-----
aqcdfdf4555578782121
axcdfdf2555575782321
File3
-----
aecdfdf7555578782221
aqcdfdf9555578782121
I tried awk, but I need something which looks at a substring of the two files, since there are no delimiters in my files.
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2 > file3
Could you please try the following; it was written and tested with the shown samples in GNU awk. Once you are happy with the results on the terminal, redirect the output of the command to file3 (append > file3 to the command).
awk '{str=substr($0,8,5)} FNR==NR{a[str];next} !(str in a)' file2 file1
Explanation: adding a detailed explanation for the above.
awk ' ##Starting awk program from here.
{
str=substr($0,8,5) ##Creating str which has sub-string of current line from 8th to 12th character.
}
FNR==NR{ ##Checking condition FNR==NR which will run when Input_file2 is being read.
a[str] ##Creating array a with index of str here.
next ##next will skip all further statements from here.
}
!(str in a) ##Checking condition if str is NOT present in a then print that line from Input_file1.
' file2 file1 ##Mentioning Input_file names here.
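For completeness, the full command with the redirection applied is:
awk '{str=substr($0,8,5)} FNR==NR{a[str];next} !(str in a)' file2 file1 > file3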

Sum of 2nd and 3rd column for same value in 1st column

I want to sum the values in the 2nd and 3rd columns for rows with the same value in the 1st column:
1555971000 6 1
1555971000 0 2
1555971300 2 0
1555971300 3 0
Output would be like
1555971000 6 3
1555971300 5 0
I have tried the below command
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
but this seems to be for only one column.
Here is another way, reading the Input_file 2 times; it will provide output in the same sequence as the Input_file.
awk 'FNR==NR{a[$1]+=$2;b[$1]+=$3;next} ($1 in a){print $1,a[$1],b[$1];delete a[$1]}' Input_file Input_file
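With the shown samples, this should print:
1555971000 6 3
1555971300 5 0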
If the data in 'd' is already grouped by the first column (no sort needed), tried on GNU awk:
awk 'BEGIN{f=1} {if($1==a||f){b+=$2;c+=$3;f=0} else{print a,b,c;b=$2;c=$3} a=$1} END{print a,b,c}' d
With sorting, using GNU awk:
awk '{w[NR]=$0} END{asort(w);f=1;for(;i++<NR;){split(w[i],v);if(v[1]==a||f){f=0;b+=v[2];c+=v[3]} else{print a,b,c;b=v[2];c=v[3];} a=v[1]} print a,b,c;}' d
You can do it with awk by first saving the fields of the first record and then, for all subsequent records, checking whether the first field matches: if so, add the contents of fields two and three and continue. If the first field fails to match, output the first field and the running sums, e.g.
awk '{
    if ($1 == a) {
        b+=$2; c+=$3;
    }
    else {
        if (a != "") print a, b, c;
        a=$1; b=$2; c=$3;
    }
} END { print a, b, c; }' file
With your input in file, you can copy and paste the foregoing into your terminal and obtain the following:
Example Use/Output
$ awk '{
>     if ($1 == a) {
>         b+=$2; c+=$3;
>     }
>     else {
>         if (a != "") print a, b, c;
>         a=$1; b=$2; c=$3;
>     }
> } END { print a, b, c; }' file
1555971000 6 3
1555971300 5 0
Using awk Arrays
A shorter, more succinct alternative using arrays, which does not require your input to be in sorted order, would be:
awk '{a[$1]+=$2; b[$1]+=$3} END{ for (i in a) print i, a[i], b[i] }' file
(same output)
Using arrays allows the summing of columns for like values of field 1 to work equally well if your data file contained the following lines in random order, e.g.
1555971300 2 0
1555971000 0 2
1555971000 6 1
1555971300 3 0
Another awk that would work regardless of the order of the records, whether or not they are sorted:
awk '{r[$1]++}
r[$1]==1{o[++c]=$1}
{f[$1]+=$2;s[$1]+=$3}
END{for(i=1;i<=c;i++){print o[i],f[o[i]],s[o[i]]}}' file
Assuming when you wrote:
awk -F" " '{b[$2]+=$1} END { for (i in b) { print b[i],i } } '
you meant to write:
awk '{ b[$1]+=$2 } END{ for (i in b) print i,b[i] }'
It shouldn't be a huge leap to figure out:
$ awk '{ b[$1]+=$2; c[$1]+=$3 } END{ for (i in b) print i,b[i],c[i] }' file
1555971000 6 3
1555971300 5 0
Please get the book "Effective Awk Programming", 4th Edition, by Arnold Robbins and just read a paragraph or 2 about fields and arrays.

Using awk to extract data and count

How do I use awk on a file that looks like this:
abcd Z
efdg Z
aqbs F
edf F
aasd A
I want to extract the number of times each letter of the alphabet occurs in the second column, so output should be:
Z 2
F 2
A 1
Try: if you want the order of the output to be the same as the Input_file, then the following may help you.
awk 'FNR==NR{A[$2]++;next} A[$2]{print $2,A[$2];delete A[$2]}' Input_file Input_file
If you don't care about the order of $2, then the following may help you.
awk '{A[$2]++} END{for(i in A){print i,A[i]}}' Input_file
In the first solution, the Input_file is read twice, creating an array A whose index is $2 and whose value is incremented each time the same index is seen. Then, when the Input_file is read the second time, $2 and its count are printed (and the entry is deleted so each letter is printed only once).
In the second solution, an array A is created whose index is $2, incrementing its value. Then, in the END section, we go through array A and print each index and its value.
I would use sort | uniq for this purpose as these two utils are designed specifically for this kind of task:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{print $2}' | sort -r | uniq -c | awk '{printf "%s %d\n", $2, $1}'
This would produce exactly the desired output:
Z 2
F 2
A 1
Here awk '{print $2}' is used to get the second column from a document with fields separated by one or more whitespace characters. If we knew the width of the columns was fixed, we could use the faster cut utility instead.
sort -r | uniq -c does the main algorithmic part of the task: sorting the letters in reverse order and counting the number of occurrences of each letter.
awk '{printf "%s %d\n", $2, $1}' reformats the uniq -c output to match the required format exactly.
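For instance, assuming the two columns are separated by a single space and the data sits in a file named file (assumptions of this sketch, not part of the original answer), the cut variant might look like:
cut -d' ' -f2 file | sort -r | uniq -c | awk '{printf "%s %d\n", $2, $1}'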
Update: AWK has powerful array support so this can be done with awk alone:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{a[$2]++}
END {n=asorti(a,b,"@ind_str_desc");
for (k=1;k<=n;k++) {printf "%s %d\n", b[k], a[b[k]]} }'
We use the array a that is indexed with the letters found in the input stream, and on each line the element indexed by the corresponding letter gets incremented.
In the END clause we sort the indices in descending order and output the array.
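With the sample input, this should again print:
Z 2
F 2
A 1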

Subtract a constant number from a column

I have two large files (~10GB) as follows:
file1.csv
name,id,dob,year,age,score
Mike,1,2014-01-01,2016,2,20
Ellen,2, 2012-01-01,2016,4,35
.
.
file2.csv
id,course_name,course_id
1,math,101
1,physics,102
1,chemistry,103
2,math,101
2,physics,102
2,chemistry,103
.
.
I want to subtract 1 from the "id" columns of these files:
file1_updated.csv
name,id,dob,year,age,score
Mike,0,2014-01-01,2016,2,20
Ellen,1, 2012-01-01,2016,4,35
file2_updated.csv
id,course_name,course_id
0,math,101
0,physics,102
0,chemistry,103
1,math,101
1,physics,102
1,chemistry,103
I have tried awk '{print ($1 - 1) "," $0}' file2.csv, but did not get the correct result:
-1,id,course_name,course_id
0,1,math,101
0,1,physics,102
0,1,chemistry,103
1,2,math,101
1,2,physics,102
1,2,chemistry,103
You've added an extra column in your attempt. Instead, set your first field $1 to $1-1:
awk -F"," 'BEGIN{OFS=","} {$1=$1-1;print $0}' file2.csv
The semicolon separates the two commands. We set the input field separator to comma (-F",") and the Output Field Separator to comma (BEGIN{OFS=","}). The command that subtracts 1 from the first field executes first, and the print command executes second, so the entire record, $0, will contain the new $1 value when it is printed.
It might be helpful to only subtract 1 from records that are not your header. So you can add a condition to the first command:
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1-1} {print $0}' file2.csv
Now we only subtract when the record number (NR) is greater than 1. Then we just print the entire record.
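With file2.csv as shown, this should produce:
id,course_name,course_id
0,math,101
0,physics,102
0,chemistry,103
1,math,101
1,physics,102
1,chemistry,103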
