Select rows with min value based on fourth column and group by first column in Linux

Can you please tell me how to select the rows with the minimum value in the fourth column, grouped by the first column, in Linux?
Original file:
x,y,z,w
1,a,b,0.22
1,a,b,0.35
1,a,b,0.45
2,c,d,0.06
2,c,d,0.20
2,c,d,0.46
3,e,f,0.002
3,e,f,0.98
3,e,f,1.0
The file I want is as below:
x,y,z,w
1,a,b,0.22
2,c,d,0.06
3,e,f,0.002
I tried the command below, but it does not work:
sort -k1,4 -u original_file.txt | awk '!a[$1] {a[$1] = $4} $4 == a[$1]' >> out.txt

You should sort by column 4 only, and numerically. You also need to store the entire line in the array, not just $4, and then print the stored lines at the end.
To keep the heading from getting mixed into the sort, I write it out separately and then process the rest of the file.
head -n 1 original_file > out
tail -n +2 original_file | sort -t, -k4,4n | awk -F, '
!a[$1] { a[$1] = $0 }
END { for (k in a) print a[k] }' | sort -t, -k1,1n >> out
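A single-pass awk alternative is sketched below; it keeps the first row carrying the minimum of column 4 for each value of column 1, prints to standard output, and then restores numeric order on column 1 (redirect as needed):
head -n 1 original_file
tail -n +2 original_file | awk -F, '
!($1 in min) || ($4 + 0) < (min[$1] + 0) { min[$1] = $4; line[$1] = $0 }
END { for (k in line) print line[k] }' | sort -t, -k1,1n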

Related

Using awk to extract data and count

How do I use awk on a file that looks like this:
abcd Z
efdg Z
aqbs F
edf F
aasd A
I want to extract the number of times each letter of the alphabet occurs in the second column, so output should be:
Z 2
F 2
A 1
Try the following. If you want the order of the output to be the same as in Input_file, this may help you:
awk 'FNR==NR{A[$2]++;next} A[$2]{print $2,A[$2];delete A[$2]}' Input_file Input_file
If you don't care about the order of $2, the following may help you:
awk '{A[$2]++} END{for(i in A){print i,A[i]}}' Input_file
In the first solution, the Input_file is read twice: the first pass builds an array A indexed by $2 with an incrementing count; during the second pass, $2 and its count are printed, and the entry is deleted so each value is printed only once.
In the second solution, an array A indexed by $2 is built with an incrementing count, and in the END section we go through A and print each index with its value.
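For example, with the five sample lines above stored in Input_file, the first (order-preserving) solution prints:
$ awk 'FNR==NR{A[$2]++;next} A[$2]{print $2,A[$2];delete A[$2]}' Input_file Input_file
Z 2
F 2
A 1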
I would use sort | uniq for this purpose as these two utils are designed specifically for this kind of task:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{print $2}' | sort -r | uniq -c | awk '{printf "%s %d\n", $2, $1}'
This would produce exactly the desired output:
Z 2
F 2
A 1
Here awk '{print $2}' is used to get the second column from a document with fields separated by one or more whitespace characters. If the column layout were fixed, we could use the faster cut utility instead; a cut-based variant is sketched after this explanation.
sort -r | uniq -c is doing the main algorithmic part of the task - sort the letters in reverse order and count the number of occurrences of each letter.
awk '{printf "%s %d\n", $2, $1}' does some reformatting of the uniq -c output to match the required format exactly.
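For the sample above, where the two fields are separated by a single space, the awk stage could be replaced with cut like this (a sketch; it breaks on tabs or runs of spaces, and cut -c with character positions would be the variant for truly fixed-width data):
cut -d' ' -f2 | sort -r | uniq -c | awk '{printf "%s %d\n", $2, $1}'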
Update: GNU awk has powerful array support (including the asorti function used below), so this can be done with awk alone:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{a[$2]++}
END {n=asorti(a,b,"@ind_str_desc");
for (k=1;k<=n;k++) {printf "%s %d\n", b[k], a[b[k]]} }'
We use the array a that is indexed with letters found in the input stream, and on each line the element indexed by the corresponding letter gets incremented.
In the END clause we sort the indices in descending order and print each letter with its count.
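Note that asorti with a sort specifier is a GNU awk extension. With a POSIX awk, a similar result can be sketched by letting sort order the letters in reverse (using Input_file as in the earlier answers):
awk '{ a[$2]++ } END { for (k in a) print k, a[k] }' Input_file | sort -k1,1r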

Linux sort numerically based on first column

I'm trying to numerically sort a long CSV file based on the number in the first column, using the command below:
-> head -1 file.csv ; tail -n +2 file.csv | sort -t , -k1n
(I'm combining head and tail to skip the first line of the file, as it's a header and contains strings.)
However, it doesn't return a fully sorted list. Half of it is sorted, the other half is like this:
9838,2361,8,947,2284
9842,2135,2,261,2511
9846,2710,1,176,2171
986,2689,32,123,2177
9888,2183,15,30,2790
989,2470,33,887,2345
Can somebody tell me what I'm doing wrong? I've also tried the command below, with the same result:
-> sort -k1n -t"," file.csv
tail -n +2 file.csv | sort -k1,2 -n -t"," should do the trick.
To perform a numeric sort by the first column use the following approach:
tail -n +2 file.csv | sort -n -t, -k1,1
The output:
986,2689,32,123,2177
989,2470,33,887,2345
9838,2361,8,947,2284
9842,2135,2,261,2511
9846,2710,1,176,2171
9888,2183,15,30,2790
-k pos1[,pos2]
Specify a sort field that consists of the part of the line between pos1 and pos2
(or the end of the line, if pos2 is omitted), inclusive.
In its simplest form pos specifies a field number (starting with 1) ...
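Combining the header handling from the question with the restricted sort key, something like the following should give a fully sorted file (a sketch; sorted.csv is a placeholder name):
( head -n 1 file.csv ; tail -n +2 file.csv | sort -t, -k1,1n ) > sorted.csv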

sort and remove duplicate based on different columns in a file

I have a file with three columns (a date yyyy-mm-dd, a time hh:mm:ss.000, and a numeric ID):
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.578 5001234567890
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.578 5001234567891
I want to first sort the file based on the date-time (the first two columns) and then remove the rows with duplicate numbers (third column). After this, the above file should look like:
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890
I have used sort with a key and an awk command (as below), but the results aren't correct. (I am not sure which entries are being removed, as the files I am processing are too big.)
Commands:
sort -k1 inputFile > sortedInputFile
awk '!seen[$3]++' sortedInputFile > outputFile
I am not sure how to do this.
If you want to keep the earliest instance of each 3rd column entry, you can sort twice; the first time to group duplicates and the second time to restore the sort by time, after duplicates are removed. (The following assumes a default sort works with both dates and values and that all lines have three columns with consistent whitespace.)
sort -k3 -k1,2 inputFile | uniq -f2 | sort > sortedFile
The -f2 option tells uniq to skip the first two fields when comparing lines, so that the date and time fields are not considered.
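On the sample input above this produces exactly the desired result:
$ sort -k3 -k1,2 inputFile | uniq -f2 | sort
2016-11-30 23:40:45.478 5001234567891
2016-11-30 23:40:45.568 5001234567890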
If milliseconds don't matter, the following is another approach, which removes the milliseconds and then performs sort and uniq:
awk '{print $1" "substr($2,1,index($2,".")-1)" "$3 }' file1.txt | sort | uniq
Here is one in awk. It groups on $3 and stores the earliest timestamp, but the output order is unspecified, so the output should be piped to sort.
$ awk '
a[$3] == "" || a[$3] > ($1 OFS $2) { a[$3] = $1 OFS $2 }
END{ for(i in a) print a[i], i }
' file # | sort goes here
2016-11-30 23:40:45.568 5001234567890
2016-11-30 23:40:45.478 5001234567891

Linux sort: how to sort numerically but leave empty cells to the end

I have this data to sort. The 1st column is the item ID. The 2nd column is the numerical value. Some items do not have a numerical value.
03875334 -4.27
03860156 -7.27
03830332
19594535 7.87
01542392 -5.74
01481815 11.45
04213946 -10.06
03812865 -8.67
03831625
01552174 -9.28
13540266 -8.27
03927870 -7.25
00968327 -8.09
I want to use the Linux sort command to sort the items numerically in ascending order of their value, but leave the empty items at the end. This is the expected output I want to obtain:
04213946 -10.06
01552174 -9.28
03812865 -8.67
13540266 -8.27
00968327 -8.09
03860156 -7.27
03927870 -7.25
01542392 -5.74
03875334 -4.27
19594535 7.87
01481815 11.45
03830332
03831625
I tried "sort -k2n" and "sort -k2g", but neither yielded the output I want. Any idea?
Here is a simple Schwartzian transform based on the assumption that all actual values are smaller than 123456789.
awk '{ printf "%s\t%s\n", ($2 == "" ? 123456789 : $2), $0 }' file |
sort -n | cut -f2- >output
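To see how the transform works on the sample data: each line is prefixed with a numeric sort key (the value itself, or the large sentinel when the value is missing), sort -n orders on that key, and cut -f2- strips it off again. The first three decorated lines look like this:
$ awk '{ printf "%s\t%s\n", ($2 == "" ? 123456789 : $2), $0 }' file | head -n 3
-4.27	03875334 -4.27
-7.27	03860156 -7.27
123456789	03830332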
Assuming the data is in d.txt and the blank lines have 4 spaces at the end:
egrep " $" d.txt > blanks.txt ; egrep -v " $" d.txt | sort -n -k2 | cat - blanks.txt
This should work:
awk '$2 ~ /[0-9]$/' d.txt | sort -k2g && awk '$2 !~ /[0-9]$/' d.txt

How to count number of unique values of a field in a tab-delimited text file?

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So the first column has colors, and I want to know how many different unique values there are in that column, and to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I want is a count of "each" of these unique values, not just how many unique values there are. For instance, for the first column I want to know how many Red, Blue, Green, etc. colored objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique values, you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
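To get a count of each distinct value instead (the second part of the question), replace wc -l with uniq -c:
cut -f 1 input_file | sort | uniq -c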
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
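If fields can themselves contain spaces, forcing a tab as the field separator keeps these field numbers stable (a sketch):
<test.tsv awk -F'\t' '{print $4}' | sort | uniq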
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2))
for ((i=1; i < cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
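A hypothetical invocation, assuming the script is saved as count_columns.sh and the data is in data.tsv (both names are placeholders):
bash count_columns.sh data.tsv
For each column it prints a Column N :: header followed by uniq -c style count/value pairs.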
This script outputs, for each column of a given file, every unique value together with its count. It assumes that the first line of the given file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter. (The arrays-of-arrays syntax it uses requires GNU awk 4.0 or later.)
Code
#!/bin/bash
awk -F'\t' '
(NR==1){
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66
