Using awk and sort last column in descending order in Linux

I have a file that contains both names and numbers, like this:
data.csv
2016,Bmw,M2,2 Total score:24
1998,Subaru,Legacy,23 Total score:62
2012,Volkswagen,Golf,59 Total score:28
2001,Dodge,Viper,42 Total score:8
2014,Honda,Accord,83 Total score:112
2015,Chevy,Camaro,0 Total score:0
2008,Honda,Accord,88 Total score:48
Total score is the last column. I did:
awk -F"," 'NR>1{{for(i=4;i<=6;++i)printf $i""FS }
{sum=0; for(g=8;g<=NF;g++)
sum+=$g
print $i,"Total score:"sum ; print ""}}' data.csv
When I try
awk -F"," 'NR>1{{for(i=4;i<=6;++i)printf $i""FS }
{sum=0; for(g=8;g<=NF;g++)
sum+=$g
print $i,"Total score:"sum ; print "" | "sort -k1,2n"}}' data.csv
It gave me an error. I only want to sort by the total score column. Is there anything I did wrong? Any help is appreciated.

First, assuming there really are no blank lines between the lines of data in data.csv, all you need is sort; you don't need awk at all. For example, since the only ':' in each line comes right before the total score you want to sort descending by, you can use:
sort -t: -k2,2rn data.csv
Where -t: tells sort to use ':' as the field separator, the KEYDEF -k2,2rn tells sort to use the 2nd field (what follows the ':') as the sort key, and the rn requests a reverse numeric sort on that field.
Example Use/Output
With your data (without blank lines) in data.csv, you would have:
$ sort -t: -k2,2rn data.csv
2014,Honda,Accord,83 Total score:112
1998,Subaru,Legacy,23 Total score:62
2008,Honda,Accord,88 Total score:48
2012,Volkswagen,Golf,59 Total score:28
2016,Bmw,M2,2 Total score:24
2001,Dodge,Viper,42 Total score:8
2015,Chevy,Camaro,0 Total score:0
Which is your sort by Total score in descending order. If you want ascending order, just remove the r from -k2,2rn.
If you do have blank lines, you can remove them before the sort with sed -i '/^$/d' data.csv.
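If you would rather not edit data.csv in place, a minimal sketch is to let awk drop the blank lines and hand everything else to sort. This is also how print can be piped to sort from inside awk. The sample below is just a few lines from the question with blank lines added (an assumed layout), held in a shell variable named data for illustration:

```shell
# A few records from the question, with blank lines in between (assumed layout).
data='2016,Bmw,M2,2 Total score:24

1998,Subaru,Legacy,23 Total score:62

2014,Honda,Accord,83 Total score:112

2015,Chevy,Camaro,0 Total score:0'

# NF is 0 on an empty line, so only non-blank records are printed.
# Every print goes to the same sort process, which emits its result
# once awk finishes and closes the pipe.
sorted=$(printf '%s\n' "$data" | awk 'NF { print | "sort -t: -k2,2rn" }')
printf '%s\n' "$sorted"
```

With a real file you would drop the printf and give awk the file name instead, e.g. awk 'NF { print | "sort -t: -k2,2rn" }' data.csv.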
Sorting by number Before "Total score"
If you want to sort by the number that begins the XX Total score: yy field (e.g. XX), you can use sort with ',' as the field separator. Your KEYDEF would then be -k4.1,4.3rn, which says to sort on characters 1 through 3 of the 4th field, again with a reverse numeric sort, e.g.
sort -t, -k4.1,4.3rn data.csv
Example Use/Output
In this case, sorting by the number before Total score in descending order would result in:
$ sort -t, -k4.1,4.3rn data.csv
2008,Honda,Accord,88 Total score:48
2014,Honda,Accord,83 Total score:112
2012,Volkswagen,Golf,59 Total score:28
2001,Dodge,Viper,42 Total score:8
1998,Subaru,Legacy,23 Total score:62
2016,Bmw,M2,2 Total score:24
2015,Chevy,Camaro,0 Total score:0
After posting the original solution I noticed it was ambiguous as to which of the numbers in the 4th field you intended to sort on. In either case, here are both solutions. Let me know if you have further questions.

Related

Shell | Sort Date and Month in Ascending order

I want to display/sort the file records in ascending order of date and month; if any of those values are equal, the records should be ordered by the very next column, also ascending.
Date & Month to sort: (current scenario)
ver.....03.02../ver>
ver.....19.01../ver>
ver.....02.02..ver>
File content:
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
How can I achieve the following result?
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
I tried using sort: (not working)
sort -n sortfile.txt
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
You can use sort, but you need to specify the field separator with -t '-' so that fields are split on '-', then give keydefs that sort on the 5th field starting at its 4th character (the month), then on the 5th field starting at its 1st character (the day), and finally do a version sort on field 6 if all else is equal. That would be:
sort -t '-' -k5.4n -k5.1n -k6V contents
Providing full start and stop characters within each keydef can be done as:
sort -t '-' -k5.4n,5.5 -k5.1n,5.2 -k6V contents
(though for this data the output isn't changed)
Example Use/Output
$ sort -t '-' -k5.4n -k5.1n -k6V contents
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>
ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
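To see what those keydefs end up comparing, here is a rough equivalent that pulls the two numbers out with awk, sorts on them as ordinary columns, and then strips them off again. It is only a sketch: it assumes, as in the sample, that the 5th '-'-separated field is always DD.MM and that the lines contain no spaces, and it leaves out the version tie-break on field 6. The variable names are made up for the example.

```shell
contents='ver>0.1.1-ABC-XYA-BR-03.02-v1.0-1-4d4f3dd/ver>
ver>0.1.1-XYZ-LOK-BR-19.01-v1.0-5-8a8d7dd/ver>
ver>0.1.1-DXD-UIJ-BR-02.02-v1.0-4-9o2k4wk/ver>'

# 1. awk prefixes each line with "month day" taken from field 5 (DD.MM).
# 2. sort orders numerically by month, then by day.
# 3. cut drops the two helper columns again.
ordered=$(printf '%s\n' "$contents" |
  awk -F- '{ split($5, d, "[.]"); print d[2], d[1], $0 }' |
  sort -k1,1n -k2,2n |
  cut -d" " -f3-)
printf '%s\n' "$ordered"
```

The single sort command with character offsets does the same job without the extra processes; the long form is mainly useful for checking which characters a keydef actually picks up.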

Print whole line with highest value from one column

I have a little issue right now.
I have a file with 4 columns
test0000002,10030010330,c_,218
test0000002,10030010330,d_,202
test0000002,10030010330,b_,193
test0000002,10030010020,c_,178
test0000002,10030010020,b_,170
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030010020,d_,150
test0000002,10030070050,c_,119
test0000002,10030070050,b_,99
test0000002,10030070050,d_,79
test0000002,10030070050,a_,56
test0000002,10030010390,c_,55
test0000002,10030010390,b_,44
test0000002,10030010380,d_,41
test0000002,10030010380,a_,37
test0000002,10030010390,d_,35
test0000002,10030010380,c_,33
test0000002,10030010390,a_,31
test0000002,10030010320,c_,30
test0000002,10030010320,b_,27
test0000002,10030010380,b_,26
test0000002,10030010320,a_,23
test0000002,10030010320,d_,22
test0000002,10030010010,a_,6
and I want the highest value from the 4th column for each value in the 2nd column.
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010390,a_,31
test0000002,10030010380,c_,33
test0000002,10030010390,d_,35
test0000002,10030010320,a_,23
test0000002,10030010380,b_,26
test0000002,10030010010,a_,6
It appears that your file is already sorted in descending order on the 4th column, so you just need to print lines where the 2nd column appears for the first time:
awk -F, '!seen[$2]++' file
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010010,a_,6
If your input file is not sorted on column 4, sort it first:
sort -t, -k4nr file | awk -F, '!seen[$2]++'
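In case the !seen[$2]++ shorthand is hard to read, here is the same idea written out long-hand on a few unsorted lines from the question. This is a sketch of the logic rather than a different method; the variable names input and best are made up for the example.

```shell
input='test0000002,10030010020,b_,170
test0000002,10030010330,c_,218
test0000002,10030010330,d_,202
test0000002,10030010020,c_,178'

# Sort descending on column 4, then print a line only the first time
# its column-2 value is seen; that line carries the highest column 4.
best=$(printf '%s\n' "$input" |
  sort -t, -k4,4nr |
  awk -F, '{
    if (!($2 in seen)) {   # first occurrence of this column-2 value
      seen[$2] = 1
      print
    }
  }')
printf '%s\n' "$best"
```

The one-liner works the same way: seen[$2]++ evaluates to 0 the first time a key is met, so the negation is true and awk's default action prints the line.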
You can use two sorts:
sort -u -t, -k2,2 file | sort -t, -rnk4
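The first sort removes lines with a duplicate second column; the second sorts what remains on the 4th column in descending numeric order. Note that the first sort picks which duplicate to keep without looking at column 4, so check that the surviving lines are the ones you want.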

Linux filtering a file by two columns and printing the output

I have a table that has 9 columns as shown below.
How would I first filter on the strand column so that only rows with a "+" are selected, and then, of those, select the ones that have 3 exons (in the exon count column)?
I have been trying to use grep for this as I understand I can pick out a word from a column, but I only get the particular column or just the total number.
Using awk:
awk -F "," ' $4=="+" && $9=="3" ' file.csv
If the file is not CSV, remove -F "," from this command.
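Since the table itself was not included, the column numbers above are a guess. If the file has a header row, one way around that is to look the columns up by name. The sketch below assumes a comma-separated file whose header uses the names strand and exonCount; both names and the sample rows are hypothetical, so substitute whatever your header really says.

```shell
# Hypothetical table; the real file and its header names were not shown.
table='name,chrom,strand,exonCount
geneA,chr1,+,3
geneB,chr1,-,3
geneC,chr2,+,5
geneD,chr2,+,3'

picked=$(printf '%s\n' "$table" | awk -F, '
  NR == 1 {                          # map header names to column numbers
    for (i = 1; i <= NF; i++) col[$i] = i
    next
  }
  $(col["strand"]) == "+" && $(col["exonCount"]) == 3
')
printf '%s\n' "$picked"
```

Both conditions are tested on the same line, so there is no need for a separate pass per column.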

sort and remove duplicate based on different columns in a file

I have a file in which there are three columns as (yyyy-mm-dd hh:mm:ss.000 12-digit number) :
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891
I want to first sort the file based on the date-time (first two columns) and then remove the rows having duplicate numbers (third column). So after this the above file will look like:
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890
I have used sort with a key and then awk (as below), but the results aren't correct. (I am not sure which entries are being removed, as the files I am processing are too big.)
Commands:
sort -k1 inputFile > sortedInputFile
awk '!seen[$3]++' sortedInputFile > outputFile
I am not sure how to do this.
If you want to keep the earliest instance of each 3rd column entry, you can sort twice; the first time to group duplicates and the second time to restore the sort by time, after duplicates are removed. (The following assumes a default sort works with both dates and values and that all lines have three columns with consistent whitespace.)
sort -k3 -k1,2 inputFile | uniq -f2 | sort > sortedFile
The -f2 option to uniq tells it to start the comparison at the end of the second field, so that the date fields are not considered.
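To make the three stages easier to follow, here is the same pipeline split up and run on the sample from the question. It is only an illustration under the assumptions already stated (three columns, consistent whitespace); the variable names are made up.

```shell
input='2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891'

# Stage 1: group by number (field 3), earliest timestamp first in each group.
grouped=$(printf '%s\n' "$input" | sort -k3 -k1,2)

# Stage 2: keep the first line of each group, ignoring the two time fields.
firsts=$(printf '%s\n' "$grouped" | uniq -f2)

# Stage 3: put the survivors back into time order.
result=$(printf '%s\n' "$firsts" | sort)
printf '%s\n' "$result"
```

For this input the result is the two lines the question asks for, the ...891 entry at 23:40:45.478 followed by the ...890 entry at 23:40:45.568.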
If the milliseconds don't matter, here is another approach that strips them and then sorts and removes duplicate lines:
awk '{print $1" "substr($2,1,index($2,".")-1)" "$3 }' file1.txt | sort | uniq
Here is one in awk. It groups on $3 and stores the earliest timestamp for each group, but the output order is unspecified, so the output should be piped to sort.
$ awk '
(a[$3] == "" || a[$3] > ($1 OFS $2)) && a[$3]=($1 OFS $2) { next }
END{ for(i in a) print a[i], i }
' file # | sort goes here
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891

Bash Script - Divide Column 2 by Column 3 in the middle but keep 1 and 4 on either side

I have a list that has an ID, population, area and province, that looks like this:
1:517000:405212:Newfoundland and Labrador
2:137900:5660:Prince Edward Island
3:751400:72908:New Brunswick
4:938134:55284:Nova Scotia
5:7560592:1542056:Quebec
6:12439755:1076359:Ontario
7:1170300:647797:Manitoba
8:996194:651036:Saskatchewan
9:3183312:661848:Alberta
10:4168123:944735:British Columbia
11:42800:1346106:Northwest Territories
12:31200:482443:Yukon Territories
13:29300:2093190:Nunavut
I need to display the names of the provinces with the lowest and highest population density (population/area). How can I divide column 2 by column 3 (to 2 decimal places) but keep the file information intact on either side (e.g. 1:1.28:Newfoundland and Labrador)? After that I figure I can just pump it into sort -t: -nk2 | head -n 1 and sort -t: -nrk2 | head -n 1 to pull them.
The recommended command given was grep.
Since you seem to have the sorting and extraction under control, here's an example awk script that should work for you:
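#!/usr/bin/awk -f
# Adjust the path above if awk lives elsewhere; "env awk -f" fails on Linux
# because the kernel passes "awk -f" to env as a single argument.
BEGIN {
    FS = ":"        # input fields are colon-separated
    OFS = ":"       # keep colons in the output
    OFMT = "%.2f"   # print non-integer numbers with 2 decimal places
}
{
    print $1, $2 / $3, $4   # ID, population/area, province
}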
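Putting the pieces together, a minimal end-to-end sketch might look like the following. It uses three of the provinces from the list as sample input and the sort | head idea from the question; the variable names are made up, and cut is added only to keep just the province name.

```shell
provinces='2:137900:5660:Prince Edward Island
6:12439755:1076359:Ontario
13:29300:2093190:Nunavut'

# Replace population and area with population/area, rounded to 2 decimals.
densities=$(printf '%s\n' "$provinces" |
  awk 'BEGIN { FS = OFS = ":"; OFMT = "%.2f" } { print $1, $2 / $3, $4 }')

# Sort numerically on the density column and take the first line each way.
lowest=$(printf '%s\n' "$densities" | sort -t: -k2,2n | head -n 1 | cut -d: -f3)
highest=$(printf '%s\n' "$densities" | sort -t: -k2,2nr | head -n 1 | cut -d: -f3)
printf 'lowest: %s\nhighest: %s\n' "$lowest" "$highest"
```

One thing to keep in mind is that rounding to two decimals before sorting can turn two close densities into a tie, so for real data you may prefer to sort on more digits and round only for display.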
