Sort a file on column 2 while grouping by column 1 values - linux

I have a file with several columns. I would like to sort on column 2 while grouping rows by their column 1 values.
See the example below.
Input file:
NEW,RED,1
OLD,BLUE,2
NEW,BLUE,3
OLD,GREEN,4
Expected output file:
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
How can I achieve this? Please help. Thanks in advance!

$ sort -t, -k1,2 inputfile
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
-t specifies the field separator, and -k1,2 specifies the starting and ending positions of the sort key (here: fields 1 through 2).
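If you prefer one key per column, an equivalent form of the same command uses two explicit keys; on this input it produces identical output:
$ sort -t, -k1,1 -k2,2 inputfile
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4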

Related

Print whole line with highest value from one column

I have a little issue right now.
I have a file with 4 columns:
test0000002,10030010330,c_,218
test0000002,10030010330,d_,202
test0000002,10030010330,b_,193
test0000002,10030010020,c_,178
test0000002,10030010020,b_,170
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030010020,d_,150
test0000002,10030070050,c_,119
test0000002,10030070050,b_,99
test0000002,10030070050,d_,79
test0000002,10030070050,a_,56
test0000002,10030010390,c_,55
test0000002,10030010390,b_,44
test0000002,10030010380,d_,41
test0000002,10030010380,a_,37
test0000002,10030010390,d_,35
test0000002,10030010380,c_,33
test0000002,10030010390,a_,31
test0000002,10030010320,c_,30
test0000002,10030010320,b_,27
test0000002,10030010380,b_,26
test0000002,10030010320,a_,23
test0000002,10030010320,d_,22
test0000002,10030010010,a_,6
and I want the highest value from the 4th column for each value of the 2nd column.
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010390,a_,31
test0000002,10030010380,c_,33
test0000002,10030010390,d_,35
test0000002,10030010320,a_,23
test0000002,10030010380,b_,26
test0000002,10030010010,a_,6
It appears that your file is already sorted in descending order on the 4th column, so you just need to print lines where the 2nd column appears for the first time:
awk -F, '!seen[$2]++' file
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010010,a_,6
If your input file is not sorted on column 4, then
sort -t, -k4nr file | awk -F, '!seen[$2]++'
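For clarity, the !seen[$2]++ condition is shorthand; a longhand equivalent with the same behavior spells the logic out:
$ awk -F, '{ if (!($2 in seen)) print; seen[$2] = 1 }' file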
You can use two sorts:
sort -u -t, -k2,2 file | sort -t, -rnk4
The first sort removes duplicates in the second column; the second sorts the result numerically, in reverse, on the 4th column.

Join two sorted files, using 2 fields from one file and 1 field from the 2nd file

I need help with a Linux command.
I have 2 files, StockSort and SalesSort. They are sorted and have 3 fields each. I know how to join on 1 field from the 1st file and 1 field from the 2nd file, but I can't get the right syntax for joining two fields from the 1st file with only 1 field from the second file. I also need to save the result in a new file.
So far I have this command, but it doesn't work. I think the mistake is in the "2,3" part, where I try to combine two fields from the 1st file.
join -1 2,3 -2 2 StockSort SalesSort >FinalReport
StockSort file
3976:diode:350
4105:resistor:750
4250:resistor:500
SalesSort file
3976:120:net
4105:250:chg
5500:100:pde
Output should be like this:
3976:350:120
4105:750:250
4250:500:100
You can try
join -t: -o 1.1,1.3,2.2 StockSort SalesSort
where
-t sets the field separator
-o is the output format (a comma-separated list of filenumber.fieldnumber)
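To also save the result in a new file, as asked, redirect the output. Note that join matches on the first field of each file by default, so with the sample data only IDs 3976 and 4105 appear in both files, and only those lines come out:
$ join -t: -o 1.1,1.3,2.2 StockSort SalesSort > FinalReport
$ cat FinalReport
3976:350:120
4105:750:250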
Here is an awk alternative:
$ awk 'BEGIN{ FS=OFS=":" }
       FNR==NR { Stock[$1]=$3; next }           # 1st file: remember field 3 for each ID
       $1 in Stock { print $1, Stock[$1], $2 }  # 2nd file: print ID, stock field, sales field
      ' StockSort SalesSort

Awk matching values of first two columns and printing in blank field

I have a CSV file that looks like this:
2212,A1,
2212,A1,128
2307,B1,
2307,B1,107
How can I copy a 3rd-column value into the rows where the 3rd column is missing, when the first two columns match? E.g., the first two columns of the first two rows are the same, so the value from the 3rd column of the second row should automatically be printed in the missing 3rd column of the first row.
Expected output:
2212,A1,128
2212,A1,128
2307,B1,107
2307,B1,107
Please help, as I couldn't even think of a solution and there are millions of values like this in my file.
If you first sort the file in reverse order, the rows with data precede the empty rows:
$ sort -r file
2307,B1,107
2307,B1,
2212,A1,128
2212,A1,
Then use the following awk to process the output of sort:
$ sort -r file | awk 'NR>1 && match(prev,$0) {$0=prev} {prev=$0} 1'
2307,B1,107
2307,B1,107
2212,A1,128
2212,A1,128
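If you need to keep the original row order (you mention millions of rows), a two-pass awk is an alternative; this is a sketch that reads the file twice, first remembering the non-empty 3rd-column value for each column-1/column-2 pair, then filling in the blanks:
$ awk -F, -v OFS=, '
    NR==FNR { if ($3 != "") fill[$1 FS $2] = $3; next }   # pass 1: collect values
    { if ($3 == "") $3 = fill[$1 FS $2]; print }          # pass 2: fill blanks
  ' file file
2212,A1,128
2212,A1,128
2307,B1,107
2307,B1,107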
awk -F, '{ a[$1 FS $2]++; b[$1 FS $2] = $NF }   # count rows per key; remember the last 3rd-column value seen
         END { for (i in b) for (j = 1; j <= a[i]; j++) print i FS b[i] }' file

Delete repeated rows based on column 5 and keep the one with highest value in column 13

I have a spreadsheet (.csv) with 20 columns and 9000 rows. I want to delete rows that have the same ID in column 5, so I will end up with only one entry (or row) per ID number (unique ID). If there are 2 or more rows with the same ID in column 5, I want to keep the one that has the highest score in column 13. At the same time I want to keep all 20 columns for each row (all the information). Rows with repeated ID and lower score are not important, so I want to just remove those.
I was trying with awk and Perl, but somehow I only managed to do it half way. Let me know if I need to provide more information. Thanks!
INPUT (delimiter=','):
geneID, Score, annotation, etc.
ENSG0123, 532.0, intergenic, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0123, 234.0, 5-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
OUTPUT:
geneID, Score, annotation, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
Since you didn't give a complete input/output example, I'll treat it as a generic problem. Here is the answer:
sort -t',' -k5,5n -k13,13nr file.csv | awk -F, '!a[$5]++'
awk could do it alone, but with the help of sort the code is much simpler. What the above one-liner does:
sort the file by col5 (numerically, ascending) and by col13 (numerically, descending)
pass the sorted result to awk to remove duplicates, based on col5.
Here is a little test; in the example, col1 plays the role of your col5, and col3 of your col13:
kent$ cat f
1,2,3
2,8,7
1,2,4
1,4,5
2,2,8
1,3,6
2,2,9
1,2,10
kent$ sort -t',' -k1,1n -k3,3nr f|awk -F, '!a[$1]++'
1,2,10
2,2,9
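As noted, awk can also do it alone; here is a minimal sketch of that approach, which additionally passes a header line through untouched (the group order of the output is unspecified, so re-sort afterwards if order matters):
awk -F, 'NR == 1 { print; next }                  # keep the header line
         !($5 in max) || $13+0 > max[$5]+0 {      # new ID, or a higher score
             max[$5] = $13; best[$5] = $0
         }
         END { for (id in best) print best[id] }' file.csv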

Why is Linux sort not giving me the desired results?

I have a file a.csv with contents similar to the following:
a,b,c
a ,aa, a
a b, c, f
a , b, c
a b a b a,a,a
a,a,a
a aa ,a , t
I am trying to sort it by using sort -k1 -t, a.csv
But it is giving the following results:
a,a,a
a ,aa, a
a aa ,a , t
a b a b a,a,a
a , b, c
a,b,c
a b, c, f
This is not an actual sort on the 1st column. What am I doing wrong?
You have to specify the end position to be 1, too:
sort -k1,1 -t, a.csv
Give this a try: sort -t, -k1,1 a.csv
The man page suggests that when the end field is omitted, sort sorts on all characters starting at field POS1 and extending to the end of the line:
`-k POS1[,POS2]'
    The recommended, POSIX, option for specifying a sort field. The
    field consists of the part of the line between POS1 and POS2 (or
    the end of the line, if POS2 is omitted), inclusive. Fields and
    character positions are numbered starting with 1. So to sort on
    the second field, you'd use `-k 2,2'. See below for more examples.
Try this instead:
sort -k 1,1 -t , a.csv
sort reads -k 1 as "sort from the first field onwards", thus effectively defeating the point of passing the argument in the first place.
This is documented in the sort man page and warned about in the Examples section:
Sort numerically on the second field and resolve ties by sorting
alphabetically on the third and fourth characters of field five.
Use `:' as the field delimiter:

$ sort -t : -k 2,2n -k 5.3,5.4

Note that if you had written -k 2 instead of -k 2,2, sort would have
used all characters beginning in the second field and extending to
the end of the line as the primary numeric key. For the large
majority of applications, treating keys spanning more than one field
as numeric will not do what you expect.
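Here is a tiny demonstration of the difference on made-up two-line data: with -k2 the key runs from field 2 to the end of the line, so the third field decides the order, while with -k2,2 the keys are equal and sort falls back to comparing the whole lines:
$ printf 'x,b,2\ny,b,1\n' | sort -t, -k2
y,b,1
x,b,2
$ printf 'x,b,2\ny,b,1\n' | sort -t, -k2,2
x,b,2
y,b,1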
