How to count the unique values in 2 fields without concatenating them? - linux

I am trying to use the basic shell in Unix to count the unique values in 2 fields. I have data with 5 columns but just want to count the unique values in the first 2 WITHOUT concatenating them.
So far I have successfully used cut -f 1 | sort | uniq | wc -l to count the unique values in column one, and I can do the same for column two. However, because some of the values appear in both column one and column two, I need a way to run this command treating columns 1 and 2 as one field. Can anyone help me please?

Your question can be interpreted in two ways, so I will answer both of them.
Given the input file:
2 1
2 1
1 2
1 1
2 2
If you want the result to output 4 because the unique pairs are 1 1, 1 2, 2 1 and 2 2, then you need:
cat test|cut -f1,2|sort|uniq|wc -l
What we do here: we keep only the first two columns (cut leaves the tab delimiter between them) and pass the result to sort|uniq, which does the job.
If you, on the other hand, want the result to output 2 because there are only two unique elements: 1 and 2, then you can tweak the above like this:
cat test|cut -f1,2|tr "\t" "\n"|sort|uniq|wc -l
This time, after selecting the first two columns, we split each line into two lines (one value per line) so that sort|uniq sees every value individually.
These work as long as the columns are separated by the TAB character, not spaces. Since you didn't pass the -d option to cut in your question, and cut uses tabs by default, I assumed your input uses tabs too.
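If the input turns out to be separated by runs of spaces rather than tabs, awk is more forgiving about the delimiter. A minimal sketch of both counts (using the same file name test as above; this is not part of the original answer):
# unique (col1, col2) pairs: prints 4 for the sample input
awk '!seen[$1, $2]++ { n++ } END { print n }' test
# unique values across both columns: prints 2 for the sample input
awk '!seen[$1]++ { n++ } !seen[$2]++ { n++ } END { print n }' test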

Related

How to count unique values from predefined range of rows from a 3rd column in a file

I have a tsv file containing 8 columns and 10000 rows. I need the count of unique values in column 7 for every 1000-row interval.
Please suggest the Linux commands to get the desired output. I tried the following command but it did not work:
awk >myfile 'NR>=1&&NR<=1000' | sort | uniq -c
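No answer is quoted here, but as a rough sketch (assuming GNU awk for length() on an array and delete of a whole array, a tab-separated myfile, and that one summary line per 1000-row block is acceptable), something like this reports the unique count of column 7 per block:
gawk -F'\t' '
    { seen[$7] }                      # remember every value of column 7 in the current block
    NR % 1000 == 0 {                  # end of a 1000-row block (10000 rows divide evenly)
        print "rows " (NR-999) "-" NR ": " length(seen) " unique values in column 7"
        delete seen                   # start fresh for the next block
    }' myfile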

Find if the first 10 digits of two columns in a csv file match, in bash

I have a file (names.csv) which contains two columns; the values are separated by a comma:
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
These columns have values with 10 digits, which are unique identifiers, and some extra junk in the values (-anything).
I want to see whether the two columns have matching 10-digit prefixes!
To inspect the values in the first and second columns I use:
cat /home/names.csv | parallel --colsep ',' echo column 1 = {1} column 2 = {2}
Which prints the values. Because the values are hex digits, it is cumbersome to verify them one by one just by reading. Is there any way to check whether the 10-digit prefixes of each column pair are exact matches? They might contain special characters!
Expected output (example, but anything that says the columns are matched or not can work):
Matches (including first line):
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
Non-matches
e123456777-anything,e123456999-anything
Here's one way using awk. It prints every line where the first 10 characters of the first two fields match.
% cat /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
% awk -F, 'substr($1,1,10)==substr($2,1,10)' /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
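If you also want the non-matching lines reported separately (the expected output above shows both groups), one sketch writes each group to its own file; the names matches.csv and nonmatches.csv are placeholders, not anything from the original question:
awk -F, '{
    if (substr($1,1,10) == substr($2,1,10))
        print > "matches.csv"        # 10-character prefixes agree
    else
        print > "nonmatches.csv"     # prefixes differ
}' /tmp/names.csv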

Compare 2 files using awk: if the 2nd field is the same, sum the 1st field and print it; if not, print the line as is (this applies to non-matching entries in both files)

I have two files -
File 1:
2 923000026531
1 923000031178
2 923000050000
1 923000050278
1 923000051178
1 923000060000
File 2:
2 923000050000
3 923000050278
1 923000051178
1 923000060000
4 923000026531
1 923335980059
I want to achieve the following using awk:
1- If the 2nd field is the same, sum the 1st field and print it.
2- If the 2nd field is not the same, print the line as it is. This has two cases:
2(a) the 2nd field does not match and the record belongs to the first file;
2(b) the 2nd field does not match and the record belongs to the second file.
I have achieved the following using this command:
Command: gawk 'FNR==NR{f1[$2]=$1;next}$2 in f1{print f1[$2]+$1,$2}!($2 in f1){print $0}' f1 f2
Result:
4 923000050000
4 923000050278
2 923000051178
2 923000060000
6 923000026531
1 923335980059
However, this doesn't contain the records from the first file whose second field didn't match anything in the second file, i.e. case 2(a). To be more specific, the following record is not present in the final output:
1 923000031178
I know there are multiple work around using extra commands but I am interested if this can be somehow done in the same command.
give this one-liner a try:
$ awk '{a[$2]+=$1}END{for(x in a)print a[x], x}' f1 f2
2 923000060000
2 923000051178
1 923000031178
6 923000026531
4 923000050278
4 923000050000
1 923335980059
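Note that for (x in a) visits the keys in no particular order, which is why the output above is scrambled. If you need the keys in the order they first appear in the files, a hedged variation is:
awk '!($2 in a) { order[++n] = $2 }   # remember the first appearance of each key
     { a[$2] += $1 }                  # sum field 1 per key across both files
     END { for (i = 1; i <= n; i++) print a[order[i]], order[i] }' f1 f2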

Delete repeated rows based on column 5 and keep the one with highest value in column 13

I have a spreadsheet (.csv) with 20 columns and 9000 rows. I want to delete rows that have the same ID in column 5, so I will end up with only one entry (or row) per ID number (unique ID). If there are 2 or more rows with the same ID in column 5, I want to keep the one that has the highest score in column 13. At the same time I want to keep all 20 columns for each row (all the information). Rows with repeated ID and lower score are not important, so I want to just remove those.
I was trying with awk and perl, but somehow I only managed to do it half way. Let me know if I need to provide more information. Thanks!
INPUT (delimiter=','):
geneID, Score, annotation, etc.
ENSG0123, 532.0, intergenic, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0123, 234.0, 5-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
OUTPUT:
geneID, Score, annotation, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
Since you didn't give a complete input/output example, I am guessing this is a generic problem. So here is the answer:
sort -t',' -k5,5n -k13,13nr file.csv|awk -F, '!a[$5]++'
Although awk can do it alone, with the help of sort the code is much simpler. What the above one-liner does:
sort the file by col5 (numerically) and, within equal col5, by col13 (numerically, descending)
pass the sorted result to awk to remove duplicates, based on col5.
Here is a little test of it; in the example, col1 plays the role of your col5, and col3 of your col13:
kent$ cat f
1,2,3
2,8,7
1,2,4
1,4,5
2,2,8
1,3,6
2,2,9
1,2,10
kent$ sort -t',' -k1,1n -k3,3nr f|awk -F, '!a[$1]++'
1,2,10
2,2,9
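One wrinkle with the question's actual input is its header line, which the sort above would mix in with the data. A sketch that keeps the header in place (assuming the file is called file.csv, and sorting the ID column as plain text since the IDs in the question are not numeric):
head -n 1 file.csv
tail -n +2 file.csv | sort -t',' -k5,5 -k13,13nr | awk -F, '!a[$5]++'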

How to sort by column and break ties randomly

I have a tab-delimited file with three columns like this:
joe W 4
bob A 1
ana F 1
roy J 3
sam S 0
don R 2
tim L 0
cyb M 0
I want to sort this file by decreasing values in the third column, but I do not want ties broken by some other column (i.e. rows with the same entry in the third column should not then be ordered by the first column).
Instead, I want rows with equal third-column entries to either keep their original order or be ordered randomly.
Is there a way to do this using the sort command in unix?
sort -k3 -r -s file
This should give you the required output.
-k3 sorts on the 3rd column (strictly, from field 3 to the end of the line, which here is just the third column), -r sorts in decreasing order, and -s makes the sort stable, so ties keep their original input order instead of being broken by a last-resort comparison of the whole line.
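For the random tie-breaking variant, one possible sketch (assuming GNU coreutils for shuf) shuffles the lines first and then lets the stable sort keep that random order among equal keys; -n is added here because the third column is numeric:
shuf file | sort -k3,3nr -s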
