Delete repeated rows based on column 5 and keep the one with the highest value in column 13 - excel

I have a spreadsheet (.csv) with 20 columns and 9000 rows. I want to delete rows that have the same ID in column 5, so I will end up with only one entry (or row) per ID number (unique ID). If there are 2 or more rows with the same ID in column 5, I want to keep the one that has the highest score in column 13. At the same time I want to keep all 20 columns for each row (all the information). Rows with repeated ID and lower score are not important, so I want to just remove those.
I was trying with awk and Perl, but somehow I only managed to do it halfway. Let me know if I need to provide more information. Thanks!
INPUT (delimiter=','):
geneID, Score, annotation, etc.
ENSG0123, 532.0, intergenic, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0123, 234.0, 5-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
OUTPUT:
geneID, Score, annotation, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.

Since you didn't give a complete input/output example, I'll treat it as a generic problem. Here is the answer:
sort -t',' -k5,5n -k13,13nr file.csv|awk -F, '!a[$5]++'
awk could do it alone, but with the help of sort the code is much simpler. What the above one-liner does:
sort the file by col5, then by col13 (numerically, descending); if your IDs are alphanumeric like ENSG0123, drop the n from -k5,5n — though the dedup works either way, since awk keys on the literal $5 string
pass the sorted result to awk to remove duplicates, based on col5.
Here is a little test of it; in the example, col1 plays the role of your col5, and col3 of your col13:
kent$ cat f
1,2,3
2,8,7
1,2,4
1,4,5
2,2,8
1,3,6
2,2,9
1,2,10
kent$ sort -t',' -k1,1n -k3,3nr f|awk -F, '!a[$1]++'
1,2,10
2,2,9
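For completeness, awk alone can also keep the best row per key, without sorting, by remembering the running maximum for each key. A sketch on the same test data (note the END loop prints the groups in no particular order):

```shell
# Same test data as above: col1 is the key, col3 the score
cat > f <<'EOF'
1,2,3
2,8,7
1,2,4
1,4,5
2,2,8
1,3,6
2,2,9
1,2,10
EOF

# Keep the whole line with the highest col3 per col1;
# first sight always wins, so negative scores work too
awk -F, '!($1 in max) || $3+0 > max[$1]+0 {max[$1] = $3; line[$1] = $0}
         END {for (k in line) print line[k]}' f
```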

Print whole line with highest value from one column

I have a little issue right now.
I have a file with 4 columns
test0000002,10030010330,c_,218
test0000002,10030010330,d_,202
test0000002,10030010330,b_,193
test0000002,10030010020,c_,178
test0000002,10030010020,b_,170
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030010020,d_,150
test0000002,10030070050,c_,119
test0000002,10030070050,b_,99
test0000002,10030070050,d_,79
test0000002,10030070050,a_,56
test0000002,10030010390,c_,55
test0000002,10030010390,b_,44
test0000002,10030010380,d_,41
test0000002,10030010380,a_,37
test0000002,10030010390,d_,35
test0000002,10030010380,c_,33
test0000002,10030010390,a_,31
test0000002,10030010320,c_,30
test0000002,10030010320,b_,27
test0000002,10030010380,b_,26
test0000002,10030010320,a_,23
test0000002,10030010320,d_,22
test0000002,10030010010,a_,6
and I want the highest value from the 4th column, grouped by the 2nd column.
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010390,a_,31
test0000002,10030010380,c_,33
test0000002,10030010390,d_,35
test0000002,10030010320,a_,23
test0000002,10030010380,b_,26
test0000002,10030010010,a_,6
It appears that your file is already sorted in descending order on the 4th column, so you just need to print lines where the 2nd column appears for the first time:
awk -F, '!seen[$2]++' file
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010010,a_,6
If your input file is not sorted on column 4, then
sort -t, -k4nr file | awk -F, '!seen[$2]++'
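A quick check of that pipeline on a hypothetical three-line file (not from the question):

```shell
cat > file <<'EOF'
id,A,x_,5
id,B,y_,9
id,A,z_,7
EOF

# Sort numerically, descending, on column 4, then keep the first
# line seen for each column-2 value (i.e. the one with the maximum)
sort -t, -k4nr file | awk -F, '!seen[$2]++'
# id,B,y_,9
# id,A,z_,7
```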
You can use two sorts:
sort -u -t, -k2,2 file | sort -t, -rnk4
The first removes duplicates in the second column; the second sorts the result numerically, descending, on the 4th column. (Caveat: when several lines share a column-2 value, sort -u keeps an arbitrary one of them, not necessarily the one with the largest 4th column, so the first approach is safer if you need the maximum.)

Linux group by, sum and count

From a directory listing, I have created an output that presents file size in column 1 and a section of the filename (it's a date) in column 2.
178694671 2017-10-14
175332227 2017-10-14
175021608 2017-10-14
174851281 2017-10-14
175316643 2017-10-14
What I now need to do is group by, sum, and count on this list: group and count the files by column 2, and sum the file sizes within each group.
The result of the above output would look like this:
879216430 2017-10-14 5
I tried this
awk '{sum[$1]+= $2;}END{for (date in sum){print sum[date], date;}}'
But it provides strange results and I don't really understand what it's doing.
Can anyone help?
Use another associative array to store the frequency of each date, as in:
awk '{++freq[$2]; sum[$2]+=$1}
END{for (date in sum) print sum[date], date, freq[date]}' file
879216430 2017-10-14 5
Also note that the key of your arrays should be $2 (the date), not $1.
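A self-contained re-run of that command on the sample lines from the question:

```shell
cat > file <<'EOF'
178694671 2017-10-14
175332227 2017-10-14
175021608 2017-10-14
174851281 2017-10-14
175316643 2017-10-14
EOF

# Count occurrences of each date and sum the sizes per date
awk '{++freq[$2]; sum[$2]+=$1}
     END{for (date in sum) print sum[date], date, freq[date]}' file
# 879216430 2017-10-14 5
```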

How to count the unique values in 2 fields without concatenating them?

I am trying to use the basic shell in Unix to count the unique values in 2 fields. I have data with 5 columns but just want to count the unique values in the first 2, WITHOUT concatenating them.
So far I have been successful in using cut -f 1 | sort | uniq | wc -l to count the unique values in column one, and I can do the same for column two. But because some of the values are the same in columns one and two, I need a way to run this command treating columns 1 and 2 as one field. Can anyone help me, please?
Your question can be interpreted in two ways, so I'll answer both of them.
Given the input file:
2 1
2 1
1 2
1 1
2 2
If you want the result to output 4 because the unique pairs are 1 1, 1 2, 2 1 and 2 2, then you need:
cat test|cut -f1,2|sort|uniq|wc -l
What we do here: we take only the first two columns (the delimiter between them is preserved) and pass them to sort|uniq, which does the job.
If you, on the other hand, want the result to output 2 because there are only two unique elements: 1 and 2, then you can tweak the above like this:
cat test|cut -f1,2|tr "\t" "\n"|sort|uniq|wc -l
This time, after selecting the first two columns, we split each line into two lines (one per field) so that sort|uniq sees each value individually.
These work as long as the columns are separated by a TAB character, not spaces. Since you didn't pass the -d option to cut in your question, and cut uses tabs by default, I assumed your input uses tabs too.
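An awk sketch of both counts that does not depend on the delimiter, since awk's default field splitting handles tabs and spaces alike. The pair count keys the array on both fields joined with awk's SUBSEP rather than plain string concatenation, so distinct pairs cannot collide:

```shell
cat > test <<'EOF'
2	1
2	1
1	2
1	1
2	2
EOF

# number of unique (col1, col2) pairs
awk '!seen[$1,$2]++ {n++} END {print n}' test
# number of distinct values appearing in either column
awk '{v[$1]; v[$2]} END {for (k in v) n++; print n}' test
```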

How to sort by column and break ties randomly

I have a tab-delimited file with three columns like this:
joe W 4
bob A 1
ana F 1
roy J 3
sam S 0
don R 2
tim L 0
cyb M 0
I want to sort this file by decreasing values in the third column, but to break ties I do not want to use some other column to do so (i.e. not use the first column to sort rows with the same entry in the third column).
Instead, I want rows with the same third column entries to either preserve the original order, or be sorted randomly.
Is there a way to do this using the sort command in unix?
sort -k3 -r -s file
This should give you the required output.
-k3 denotes the 3rd column, -r sorts in decreasing order, and -s forces a stable sort, which keeps lines with equal keys in their original input order instead of breaking ties on the rest of the line. (Without n the sort is lexicographic; that happens to match numeric order for these single-digit values, but for multi-digit numbers use -k3,3nr.)
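If you want ties broken randomly instead of preserved, one sketch (assuming awk's rand() is available) is to decorate each line with a random key, sort on the original third column first and the random key second, then strip the decoration:

```shell
cat > file <<'EOF'
joe	W	4
bob	A	1
ana	F	1
roy	J	3
sam	S	0
don	R	2
tim	L	0
cyb	M	0
EOF

# Prefix a random number; the original 3rd column is now field 4.
# Sort it numerically descending, break ties on the random prefix,
# then cut the prefix (everything after the first tab) back off.
awk 'BEGIN{srand()} {print rand() "\t" $0}' file |
  sort -k4,4nr -k1,1n |
  cut -f2-
```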

I have a file having some columns. I would like to sort by column 2 within groups of column 1 values

I have a file with some columns. I would like to sort by column 2 within groups of column 1 values.
See the example below.
Input File like:
NEW,RED,1
OLD,BLUE,2
NEW,BLUE,3
OLD,GREEN,4
Expected output file:
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
How can I achieve this? Please help. Thanks in advance!
$ sort -t, -k1,2 inputfile
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
-t is used to specify the field separator, and -k1,2 specifies a sort key that starts at field 1 and ends at field 2.
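The same result with explicit per-field keys, a variant worth knowing: -k1,2 compares fields 1 through 2 as one string span, while -k1,1 -k2,2 keeps the two comparisons independent, which matters if field lengths vary:

```shell
cat > inputfile <<'EOF'
NEW,RED,1
OLD,BLUE,2
NEW,BLUE,3
OLD,GREEN,4
EOF

# Sort by column 1, then by column 2 within each column-1 group
sort -t, -k1,1 -k2,2 inputfile
# NEW,BLUE,3
# NEW,RED,1
# OLD,BLUE,2
# OLD,GREEN,4
```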