Print whole line with highest value from one column - linux

I have a little issue right now.
I have a file with 4 columns:
test0000002,10030010330,c_,218
test0000002,10030010330,d_,202
test0000002,10030010330,b_,193
test0000002,10030010020,c_,178
test0000002,10030010020,b_,170
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030010020,d_,150
test0000002,10030070050,c_,119
test0000002,10030070050,b_,99
test0000002,10030070050,d_,79
test0000002,10030070050,a_,56
test0000002,10030010390,c_,55
test0000002,10030010390,b_,44
test0000002,10030010380,d_,41
test0000002,10030010380,a_,37
test0000002,10030010390,d_,35
test0000002,10030010380,c_,33
test0000002,10030010390,a_,31
test0000002,10030010320,c_,30
test0000002,10030010320,b_,27
test0000002,10030010380,b_,26
test0000002,10030010320,a_,23
test0000002,10030010320,d_,22
test0000002,10030010010,a_,6
and I want the highest value from the 4th column for each value of the 2nd column.
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030010330,a_,166
test0000002,10030010020,a_,151
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010390,a_,31
test0000002,10030010380,c_,33
test0000002,10030010390,d_,35
test0000002,10030010320,a_,23
test0000002,10030010380,b_,26
test0000002,10030010010,a_,6

It appears that your file is already sorted in descending order on the 4th column, so you just need to print lines where the 2nd column appears for the first time:
awk -F, '!seen[$2]++' file
test0000002,10030010330,c_,218
test0000002,10030010020,c_,178
test0000002,10030070050,c_,119
test0000002,10030010390,c_,55
test0000002,10030010380,d_,41
test0000002,10030010320,c_,30
test0000002,10030010010,a_,6
If your input file is not sorted on column 4, then
sort -t, -k4nr file | awk -F, '!seen[$2]++'
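If you'd rather avoid the sort entirely, a one-pass awk sketch (my addition; it assumes the 4th column is numeric) can track the maximum per 2nd-column value directly:
awk -F, '!($2 in max) || $4+0 > max[$2] {max[$2]=$4+0; line[$2]=$0} END{for (k in line) print line[k]}' file
The output order of for (k in line) is unspecified, so pipe the result through sort -t, -k4,4nr if you want it ordered on the 4th column.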

You can use two sorts:
sort -u -t, -k2,2 file | sort -t, -rnk4
The first sort removes duplicates in the second column; the second sorts that result in reverse numeric order on the 4th column. Note that -u keeps only one line per 2nd-column value (with GNU sort, the first one in input order), so this yields the per-group maximum only because the file is already in descending order on column 4; otherwise, pre-sort on column 4 first as shown above.

Related

Using awk and sort last column in descending order in Linux

I have a file that contains both names and numbers, like:
data.csv
2016,Bmw,M2,2 Total score:24
1998,Subaru,Legacy,23 Total score:62
2012,Volkswagen,Golf,59 Total score:28
2001,Dodge,Viper,42 Total score:8
2014,Honda,Accord,83 Total score:112
2015,Chevy,Camaro,0 Total score:0
2008,Honda,Accord,88 Total score:48
Total score is the last column. I did:
awk -F"," 'NR>1{{for(i=4;i<=6;++i)printf $i""FS }
{sum=0; for(g=8;g<=NF;g++)
sum+=$g
print $i,"Total score:"sum ; print ""}}' data.csv
When I try:
awk -F"," 'NR>1{{for(i=4;i<=6;++i)printf $i""FS }
{sum=0; for(g=8;g<=NF;g++)
sum+=$g
print $i,"Total score:"sum ; print "" | "sort -k1,2n"}}' data.csv
It gave me an error. I only want to sort the Total score column; did I do anything wrong? Any help is appreciated.
First, assuming there really are no blank lines between the lines of data in data.csv, all you need is sort; you don't need awk at all. For example, since there is only one ':' on each line, right before the total score you want to sort descending by, you can use:
sort -t: -k2,2rn data.csv
Where -t: tells sort to use ':' as the field separator, and the KEYDEF -k2,2rn tells sort to use the 2nd field (what follows the ':') as the sort key; the rn says to use a reverse numeric sort on that field.
Example Use/Output
With your data (without blank lines) in data.csv, you would have:
$ sort -t: -k2,2rn data.csv
2014,Honda,Accord,83 Total score:112
1998,Subaru,Legacy,23 Total score:62
2008,Honda,Accord,88 Total score:48
2012,Volkswagen,Golf,59 Total score:28
2016,Bmw,M2,2 Total score:24
2001,Dodge,Viper,42 Total score:8
2015,Chevy,Camaro,0 Total score:0
This is your sort by Total score in descending order. If you want ascending order, just remove the r from -k2,2rn.
If you do have blank lines, you can remove them before the sort with sed -i '/^$/d' data.csv.
Sorting by the number before "Total score"
If you want to sort by the number that begins the 4th field (the XX in XX Total score:yy), you can use sort with ',' as the field separator; the KEYDEF -k4.1,4.3rn says to sort using characters 1 through 3 of the 4th field, again reverse numeric, e.g.
sort -t, -k4.1,4.3rn data.csv
Example Use/Output
In this case, sorting by the number before Total score in descending order would result in:
$ sort -t, -k4.1,4.3rn data.csv
2008,Honda,Accord,88 Total score:48
2014,Honda,Accord,83 Total score:112
2012,Volkswagen,Golf,59 Total score:28
2001,Dodge,Viper,42 Total score:8
1998,Subaru,Legacy,23 Total score:62
2016,Bmw,M2,2 Total score:24
2015,Chevy,Camaro,0 Total score:0
After posting the original solution I noticed it was ambiguous as to which of the numbers in the 4th field you intended to sort on. In either case, both solutions are shown above. Let me know if you have further questions.
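Since a numeric sort only reads the leading digits of its key, the character positions in -k4.1,4.3rn are not strictly required here; keying on the whole 4th field should give the same ordering (a simpler variant, same assumptions as above):
sort -t, -k4,4rn data.csv
Numeric comparison stops at the first non-digit, so a field like 83 Total score:112 sorts as 83.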

Uniqing a delimited file based on a subset of fields

I have data such as below:
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
Due to the nature of the last two columns, their values change throughout the day and are repeated regularly. By grouping in the way outlined in my desired output (below), I am able to see each time there was a change in their values (with the epoch time in the first column). Is there a way to achieve the desired output shown below:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
So I consolidate the data by the last two columns. However, the consolidation is not completely unique (as can be seen by 207.55,207.5 appearing more than once).
I have tried:
uniq -f 1
However, this outputs only the first line and does not continue through the list.
The awk solution below never outputs a combination that occurred previously, and so gives only the output shown below the awk code:
awk '!x[$2 $3]++'
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
I do not wish to sort the data by the last two columns. However, since the first column is epoch time, it may be sorted by the first column.
You can't set a delimiter with uniq; fields have to be separated by whitespace. With the help of tr you can:
tr ',' ' ' <file | uniq -f1 | tr ' ' ','
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
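(Note this round trip assumes the data itself contains no spaces, since every ',' is translated to a space and back.)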
You can use an awk statement like the one below,
awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file
which produces the output you need.
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The idea is to store the second and third column values in the variables s and t respectively, and print the line only when either column differs from the previous line.
I found an answer which is not as elegant as Inian's, but it satisfies my purpose.
Since my first column is always epoch time in microseconds and never grows or shrinks in character length, I can use the following uniq command:
uniq -s 17
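(That is 16 digits of timestamp plus the ',' delimiter, so the comparison starts at the second column.)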
You can try to manually (with a loop) compare the current line with the previous line.
prev_line=""
# start at the first line
i=1
# strip the first column, which does not need to be compared
sed 's#^[0-9][0-9]*,##' ./data_file > ./transform_data_file
# for every line of the file without its first column
while read -r current_line
do
  # if the previous line is the same as the current line
  if [ "$prev_line" = "$current_line" ]
  then
    # record the line number, to suppress it afterwards
    echo $i >> ./line_to_be_suppress
  fi
  # remember the current line as the previous line
  prev_line=$current_line
  # increment the current line number
  i=$(( i + 1 ))
done < ./transform_data_file
# suppress the recorded lines, last first, so line numbers stay valid
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i "${line_to_suppress}d" ./data_file ; done
rm ./line_to_be_suppress
rm ./transform_data_file
Since your first field seems to have a fixed length of 17 characters (including the , delimiter), you could use the -s option of uniq, which would be more efficient for larger files:
uniq -s 17 file
Gives this output:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
From man uniq:
-f num
Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated from adjacent fields by blanks.
Field numbers are one based, i.e., the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons.
If specified in conjunction with the -f option, the first chars characters after
the first num fields will be ignored. Character numbers are one based,
i.e., the first character is character one.

Join the original sorted files, include 2 fields in one file and 1 field in 2nd file

I need help with a Linux command.
I have 2 files, StockSort and SalesSort. They are sorted and each has 3 fields. I know how to join on 1 field in the 1st file and 1 field in the 2nd file, but I can't get the right syntax for including two fields from the 1st file and only 1 field from the 2nd file. I also need to save the result in a new file.
So far I have this command, but it doesn't work. I think the mistake is in the "2,3" part, where I need to combine two fields from the 1st file.
join -1 2,3 -2 2 StockSort SalesSort >FinalReport
StockSort file
3976:diode:350
4105:resistor:750
4250:resistor:500
SalesSort file
3976:120:net
4105:250:chg
5500:100:pde
Output should be like this:
3976:350:120
4105:750:250
4250:500:100
You can try
join -t: -o 1.1,1.3,2.2 StockSort SalesSort
where
-t sets the column separator
-o sets the output format (a comma-separated list of filenumber.fieldnumber)
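To save the result in a new file, as your original command attempted, just redirect the output:
join -t: -o 1.1,1.3,2.2 StockSort SalesSort > FinalReport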
Here is an awk alternative:
$ awk 'BEGIN{ FS=OFS=":"}
FNR==NR {Stock[$1]=$3; next}
$1 in Stock{ print $1,Stock[$1],$2}' StockSort SalesSort
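The first block caches the 3rd field of StockSort keyed by part number; the second block prints, for every SalesSort line whose part number was seen in StockSort, the part number, the cached stock value, and the sales figure.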

Awk matching values of first two columns and printing in blank field

I have a csv file which looks like this:
2212,A1,
2212,A1,128
2307,B1,
2307,B1,107
How can I copy the value of the 3rd column into the missing 3rd-column cells when the values of the first 2 columns are the same? E.g. the first two columns of the first two rows are equal, so the 3rd-column value of the second row should automatically be printed in place of the missing 3rd column of the first row.
expected output:
2212,A1,128
2212,A1,128
2307,B1,107
2307,B1,107
Please help, as I couldn't even think of a solution and there are millions of rows like this in my file.
If you first sort the file in reverse order, the rows with data precede the empty rows:
$ sort -r file
2307,B1,107
2307,B1,
2212,A1,128
2212,A1,
Then use the following awk to process the output of sort:
$ sort -r file | awk 'NR>1 && match(prev,$0) {$0=prev} {prev=$0} 1'
2307,B1,107
2307,B1,107
2212,A1,128
2212,A1,128
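This works because the reverse sort places the filled-in row of each pair first; match(prev,$0) then detects that the current line (the same row minus the trailing number) occurs within the previous line, and the line is replaced by that previous line.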
awk -F, '{a[$1FS$2]++;b[$1FS$2]=$NF}END{for (i in b) {for(j=1;j<=a[i];j++) print i FS b[i]}}' file
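This counts how many times each first-two-column pair occurs and remembers the last 3rd-column value seen for it, then reprints each pair that many times with that value. Note the for (i in b) output order is arbitrary.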

I have a file having some columns. I would like to do sort for column 2 by grouping column 1 values

I have a file having some columns. I would like to do sort for column 2 by grouping column 1 values.
See below example.
Input file:
NEW,RED,1
OLD,BLUE,2
NEW,BLUE,3
OLD,GREEN,4
Expected output file:
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
How can I achieve this? Please help. Thanks in advance!
$ sort -t, -k1,2 inputfile
NEW,BLUE,3
NEW,RED,1
OLD,BLUE,2
OLD,GREEN,4
-t is used to specify the field separator, and -k1,2 specifies the starting and ending key positions (sort on fields 1 through 2).
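Note that -k1,2 builds one key spanning fields 1 through 2, including the ',' between them; an equivalent and arguably more precise form names each key field separately:
sort -t, -k1,1 -k2,2 inputfile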
