How to sort by column and break ties randomly - Linux

I have a tab-delimited file with three columns like this:
joe W 4
bob A 1
ana F 1
roy J 3
sam S 0
don R 2
tim L 0
cyb M 0
I want to sort this file by decreasing values in the third column, but I do not want ties to be broken by some other column (i.e. the first column should not be used to order rows that share the same third-column value).
Instead, I want rows with the same third column entries to either preserve the original order, or be sorted randomly.
Is there a way to do this using the sort command in unix?

sort -s -k3,3nr file
This should give you the required output.
-k3,3 restricts the sort key to the third column, the n modifier compares it numerically, r reverses the order so the values are decreasing, and -s makes the sort stable, which disables the last-resort comparison so rows with equal third-column values keep their original input order.
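If you would rather break the ties randomly than keep the input order, one approach (a sketch, assuming GNU coreutils, which provides shuf) is to shuffle the file first and then run the same stable sort:
shuf file | sort -s -k3,3nr
The shuffle randomizes the line order, and the stable sort then preserves that random order among lines whose third-column values are equal.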

Related

Delete repeated rows based on column 5 and keep the one with highest value in column 13

I have a spreadsheet (.csv) with 20 columns and 9000 rows. I want to delete rows that have the same ID in column 5, so I will end up with only one entry (or row) per ID number (unique ID). If there are 2 or more rows with the same ID in column 5, I want to keep the one that has the highest score in column 13. At the same time I want to keep all 20 columns for each row (all the information). Rows with repeated ID and lower score are not important, so I want to just remove those.
I was trying with awk and Perl, but somehow I only managed to do it halfway. Let me know if I need to provide more information. Thanks!
INPUT (delimiter=','):
geneID, Score, annotation, etc.
ENSG0123, 532.0, intergenic, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0123, 234.0, 5-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
OUTPUT:
geneID, Score, annotation, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
Since you didn't give a complete input/output example, I'll treat it as a generic problem. Here is the answer:
sort -t',' -k5,5n -k13,13nr file.csv|awk -F, '!a[$5]++'
Although awk could do it alone, with the help of sort the code is much simpler. What the above one-liner does:
sort the file by column 5 (numerically, ascending) and by column 13 (numerically, descending);
pass the sorted result to awk, which keeps only the first row it sees for each column-5 value, i.e. the one with the highest score.
Here is a little test; in the example, column 1 plays the role of your column 5 and column 3 the role of your column 13:
kent$ cat f
1,2,3
2,8,7
1,2,4
1,4,5
2,2,8
1,3,6
2,2,9
1,2,10
kent$ sort -t',' -k1,1n -k3,3nr f|awk -F, '!a[$1]++'
1,2,10
2,2,9
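If you would rather not sort at all, for instance to keep the header line and avoid reordering the file, a single-pass awk sketch along these lines should also work (it assumes the real column positions, i.e. the ID in column 5 and a numeric score in column 13, and that the first line is a header):
awk -F, 'NR==1 {print; next} !($5 in best) || $13+0 > best[$5]+0 {best[$5]=$13; row[$5]=$0} END {for (id in row) print row[id]}' file.csv
Note that the END loop prints the surviving rows in arbitrary order; pipe the result through sort again if the order matters.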

How to count the unique values in 2 fields without concatenating them?

I am trying to use the basic shell in Unix to count the unique values in 2 fields. I have data with 5 columns but just want to count the unique values in the first 2 WITHOUT concatenating them.
So far I have been successful using cut -f 1 | sort | uniq | wc -l to count the unique values in column one, and I can do the same for column two. But because some of the values appear in both column one and column two, I need to run this command treating columns 1 and 2 as one field. Can anyone help me, please?
Your question can be interpreted in two ways, so I'll answer both.
Given the input file:
2 1
2 1
1 2
1 1
2 2
If you want the result to output 4 because the unique pairs are 1 1, 1 2, 2 1 and 2 2, then you need:
cat test|cut -f1,2|sort|uniq|wc -l
What we do here: cut keeps only the first two columns (and the delimiter between them) and passes them to sort | uniq, which does the job.
If you, on the other hand, want the result to output 2 because there are only two unique elements: 1 and 2, then you can tweak the above like this:
cat test|cut -f1,2|tr "\t" "\n"|sort|uniq|wc -l
This time, after selecting the first two columns, tr turns the tab into a newline so each row becomes two lines, and sort | uniq then counts the individual values.
These work as long as the columns are separated by a TAB character, not spaces. Since you didn't pass the -d option to cut in your question and cut uses tabs by default, I assumed your input uses tabs too.
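You can also skip the pipeline entirely with a small awk sketch (portable awk, default field splitting, which handles tabs and spaces alike; no GNU extensions assumed):
awk '{u[$1]; u[$2]} END {n=0; for (v in u) n++; print n}' test
prints 2 for the sample above (distinct values across both columns), and
awk '!seen[$1,$2]++ {n++} END {print n}' test
prints 4 (distinct pairs).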

I have a file with some columns. I would like to sort on column 2 while grouping by column 1 values

I have a file with some columns. I would like to sort on column 2 within each group of column 1 values.
See below example.
Input File like:
NEW,RED,1
OLD,BLUE,2
NEW,BLUE,3
OLD,GREEN,4
Expected output file:
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
How can I achieve this? Please help. Thanks in advance!
$ sort -t, -k1,2 inputfile
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
-t specifies the field separator, and -k1,2 defines a sort key that starts at field 1 and ends at field 2, so the lines are ordered by the first column and then by the second.
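An equivalent way to write it, which some find clearer, is to give each column its own key:
sort -t, -k1,1 -k2,2 inputfile
For this input the result should be identical; the only difference is that -k1,2 includes the separating comma in the key, while the two single-column keys do not.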

Fast removing duplicate rows between multiple files

I have 10k files with 80k rows each that I need to compare, and then either delete the duplicate lines or replace them with "0". It has to be ultrafast since I have to do it 1000+ times.
The following script is fast enough for files with fewer than 100 rows; it is currently tcsh:
foreach file ( `ls -1 *` )
split -l 1 ${file} ${file}.
end
find *.* -type f -print0 | xargs -0 sha512sum | awk '($1 in aa){print $2}(!($1 in aa)){aa[$1]=$2}' | xargs -I {} cp rowzero {}
cat ${file}.* > ${file}.filtered
where "rowzero" is just a file with a... zero. I have tried python but haven't found a fast way. I have tried pasting them and doing all nice fast things (awk, sed, above commands, etc.) but the i/o slows to incredible levels when the file has over more than e.g. 1000 columns. I need help, thanks a million hours!.
OK, this is so far the fastest code I could come up with; it works on transposed, concatenated input. As explained before, appending with cat (">>") works fine, but "paste" or "pr" becomes a nightmare when pasting another column into, say, 1GB+ files, which is why we need to transpose. E.g.
each original file looks like
1
2
3
4
...
if we transpose each file and cat the first file together with the others, the input for the code will look like:
1 2 3 4 ..
1 1 2 4 ..
1 1 1 4 ..
The code returns the original (i.e. re-transposed, pasted) format, with the minor detail that the rows come out shuffled:
1
1 2
1 2 3
2 3 4
..
The repeated rows are effectively removed. Below is the code.
HOWEVER, THE CODE IS NOT GENERAL! It only works with 1-digit integers, since awk array indexes are not iterated in sorted order. Could someone help to generalize it? Thanks!
{for(ii=1;ii<=NF;ii++){aa[ii,$ii]=$ii}}END{mm=1; for (n in aa) {split(n, bb, SUBSEP);
if (bb[1]==mm){cc=bb[2]; printf ( "%2s", cc)}else{if (mm!=bb[1]){printf "\n%2s", bb[2] }; mm=bb[1]}}}
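That said, if you can work on the original (untransposed) rows, awk can hash whole lines regardless of how many digits they contain, so neither the transpose nor the 1-digit trick is needed. A sketch of both variants mentioned in the question, treating all the listed files as one stream (splitting the output back into per-file results is left out):
awk '!seen[$0]++' file1 file2
drops every line that has already been seen, and
awk 'seen[$0]++ {print 0; next} {print}' file1 file2
keeps the line count, printing a 0 in place of each repeated line.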

Why is Linux sort not giving me the desired results?

I have a file a.csv with contents similar to below
a,b,c
a ,aa, a
a b, c, f
a , b, c
a b a b a,a,a
a,a,a
a aa ,a , t
I am trying to sort it by using sort -k1 -t, a.csv
But it is giving the following results:
a,a,a
a ,aa, a
a aa ,a , t
a b a b a,a,a
a , b, c
a,b,c
a b, c, f
That is not a proper sort on the 1st column. What am I doing wrong?
You have to specify the end position to be 1, too:
sort -k1,1 -t, a.csv
Give this a try: sort -t, -k1,1 a.csv
The man page says that when the end field is omitted, sort uses all characters from the start of field n to the end of the line:
-k POS1[,POS2]
The recommended, POSIX, option for specifying a sort field. The field consists of the part of the line between POS1 and POS2 (or the end of the line, if POS2 is omitted), inclusive. Fields and character positions are numbered starting with 1. So to sort on the second field, you'd use -k 2,2. See below for more examples.
Try this instead:
sort -k 1,1 -t , a.csv
sort reads -k 1 as "sort from the first field onwards", effectively defeating the point of passing the argument in the first place.
This is documented in the sort man page and warned about in the Examples section:
Sort numerically on the second field and resolve ties by sorting alphabetically on the third and fourth characters of field five. Use ':' as the field delimiter:
$ sort -t : -k 2,2n -k 5.3,5.4
Note that if you had written -k 2 instead of -k 2,2, sort would have used all characters beginning in the second field and extending to the end of the line as the primary numeric key. For the large majority of applications, treating keys spanning more than one field as numeric will not do what you expect.
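If your sort is from GNU coreutils (8.6 or newer), the --debug option is handy for this kind of problem; it underlines the part of each line that is actually used as the sort key:
sort --debug -t, -k1,1 a.csv
With -k1 instead of -k1,1 you would see the underline run to the end of the line, which is exactly the behaviour described above.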
