Linux sort numerically based on first column - linux

I'm trying to numerically sort a long list of csv file based on the number in the first column, using below command:
-> head -1 file.csv ; tail -n +2 file.csv | sort -t , -k1n
(I'm piping head/tail command to skip the first line of the file, as it's a header and contains string)
However, it doesn't return a fully sorted list. Half of it is sorted, the other half is like this:
9838,2361,8,947,2284
9842,2135,2,261,2511
9846,2710,1,176,2171
986,2689,32,123,2177
9888,2183,15,30,2790
989,2470,33,887,2345
Can somebody tell me what I'm doing wrong? I've also tried below with same result:
-> sort -k1n -t"," file.csv

tail -n +2 file.csv | sort -k1,2 -n -t"," should do the trick.

To perform a numeric sort by the first column use the following approach:
tail -n +2 /file.csv | sort -n -t, -k1,1
The output:
986,2689,32,123,2177
989,2470,33,887,2345
9838,2361,8,947,2284
9842,2135,2,261,2511
9846,2710,1,176,2171
9888,2183,15,30,2790
-k pos1[,pos2]
Specify a sort field that consists of the part of the line between pos1 and pos2
(or the end of the line, if pos2 is omitted), inclusive.
In its simplest form pos specifies a field number (starting with 1) ...

Related

can't make pipe operator function properly - linux

I'm trying to get the second column of a file, get the first 10 results and sort it in alphanumerical order but it doesn't seem to work.
cut -f2 file.txt | head -10 | sort -d
I get this output:
NM_000242
NM_000525
NM_001005850
NM_001136557
NM_001204426
NM_001204836
NM_001271762
NM_001287216
NM_006952
NM_007253
If I sort the file first and get the first 10 lines of the sorted file it works
cut -f2 refGene.txt | sort -d | head -10
I get this output:
NM_000014
NM_000015
NM_000016
NM_000017
NM_000018
NM_000019
NM_000020
NM_000021
NM_000022
NM_000023
I don't want to sort the file and get the sorted result, I'd like to get the first 10 lines first and then sort them in alphanumerical order. What did I miss here?
Thanks
Well, it works correctly NM_000525 is before NM_001005850, and the later is before NM_00695.
But if you need to sort the second part (after the _) numerically, then you can do:
cut -f2 file.txt | head -10 | sort -t_ -k1,1 | sort -s -t_ -k2 -n
-s is a stable sort
Assuming the format is the same in the whole file (two letters _ numbers)
EDIT: Even shorter version would be:
cut -f2 file.txt | head -10 | sort -t_ -k1,1 -k2n
Explanation:
-t_ use _ as separator of fields (for selection on which field to sort)
-k1,1 sort alphabetically from first field (without ,1 it would sort also the second field)
-k2n sort numerically on the second field
So first it will sort by first field (using alphanumeric sorting) and then using the second field (using numeric, so it will convert string to a number and sort that)

How do I sort a CSV file by a specific column?

I want to sort csv as follow, what I want is
sort by column 2
if column is the same, sort by column 3(numerically)
here is what I do:
$ sort -t"," -k2 -nk3 /tmp/test.csv
55b64670abb9c0663e77de84,525e3bfad07b4377dc142a24:9999,0.081032
5510b33ec720d80086865312,525e3bfad07b4377dc142a24:9999,0.081033
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
5514548ec720d80086bfec46,525e3bfad07b4377dc142a24:9999,0.081035
551d4e21c720d80086084f45,525e3bfad07b4377dc142a24:9999,0.081036
557bff5276bd54a8df83268a,525e3bfad07b4377dc142a24:9999,0.081036
this result is strange, it sorts by the column three first, then by column 2
This command seems to yield correct output:
sort -t"," -k2,2 -k3,3n /tmp/test.csv
I use comma to constrain order to that column only, and use the numeric (-n) switch to last character in the third column.
It yields:
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55b64670abb9c0663e77de84,525e3bfad07b4377dc142a24:9999,0.081032
5510b33ec720d80086865312,525e3bfad07b4377dc142a24:9999,0.081033
5514548ec720d80086bfec46,525e3bfad07b4377dc142a24:9999,0.081035
551d4e21c720d80086084f45,525e3bfad07b4377dc142a24:9999,0.081036
557bff5276bd54a8df83268a,525e3bfad07b4377dc142a24:9999,0.081036
Sort will work sorting data on csv & txt file , it will print the output on console
-t says columns are delimited by '|' , -k1 -k2 says that-- it will sort te data by column 1 & then by 2
$ sort -t '|' -k1 -k2 <INPUT_FILE>
For storing the result in output file use following command
$ sort -t '|' -k1 -k2 <INPUT_FILE> -o <OUTPUTFILE>
If you wann do it with ignoring header line then use following command
(head -n1 INPUT_FILE && sort <(tail -n+2 INPUT_FILE)) > OUTPUT_FILE
head -n1 INPUT_FILE which will print only the first line of your file i.e. header
&
This special tail syntax gets your file from second line up to EOF.
While the sort command has some hacks to partially handle CSV files, it won't handle all CSV format features. csvsort is a great option:
csvsort -c 2,3 /tmp/test.csv

Uniq skipping middle part of the line when comparing lines

Sample file
aa\bb\cc\dd\ee\ff\gg\hh\ii\jj
aa\bb\cc\dd\ee\ll\gg\hh\ii\jj
aa\bb\cc\dd\ee\ff\gg\hh\ii\jj
I want to skip 6th field 'ff' when comparing for an unique line, also I want the count of # of duplicate lines in front.
I tried this, without any luck:
sort -t'\' -k1,5 -k7 --unique xslin1 > xslout
Expected output
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj
$ awk -F'\' -v OFS='\' '{$6="*"} 1' xslin1 | sort | uniq -c
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj
Discussion
With --unique, sort outputs only unique lines but it does not count them. One needs uniq -c for that. Further, sort outputs all unique lines, not just those that sort to the same value.
The above solution does the simple approach of assigning the sixth field to *, as you wanted in the output, and then uses the standard pipeline, sort | uniq -c, to produce the count of unique lines.
You can do this in one awk:
awk 'BEGIN{FS=OFS="\\"} {$6="*"} uniq[$0]++{}
END {for (i in uniq) print uniq[i] "\t" i}' file
3 aa\bb\cc\dd\ee\*\gg\hh\ii\jj

Bash- sum values from an array in one line

I have this array:
array=(1 2 3 4 4 3 4 3)
I can get the largest number with:
echo "num: $(printf "%d\n" ${array[#]} | sort -nr | head -n 1)"
#outputs 4
But i want to get all 4's add sum them up, meaning I want it to output 12 (there are 3 occurrences of 4) instead. any ideas?
dc <<<"$(printf '%d\n' "${array[#]}" | sort -n | uniq -c | tail -n 1) * p"
sort to get max value at end
uniq -c to get only unique values, with a count of how many times they appear
tail to get only the last line (with the max value and its count)
dc to multiply the value by the count
I picked dc for the multiplication step because it's RPN, so you don't have to split up the uniq -c output and insert anything in the middle of it - just add stuff to the end.
Using awk:
$ printf "%d\n" "${array[#]}" | sort -nr | awk 'NR>1 && p!=$0{print x;exit;}{x+=$0;p=$0;}'
12
Using sort, the numbers are sorted(-n) in reverse(-r) order, and the awk keeps summing the numbers till it finds a number which is different from the previous one.
You can do this with awk:
awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[#]}"
Setting RS (record separator) to space allows you to read your array entries as separate records.
sum[$0]+=$0; means sum is a map of cumulative sums for each input value; if($0>max) max=$0 calculates the max number seen so far; END{print sum[max]} prints the sum for the larges number seen at the end.
<<<"${array[#]}" is a here-document that allows you to feed a string (in this case all elements of the array) as stdin into awk.
This way there is no piping or looping involved - a single command does all the work.
Using only bash:
echo $((${array// /+}))
Replace all spaces with plus, and evaluate using double-parentheses expression.

How to count number of unique values of a field in a tab-delimited text file?

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, its like the first column has colors, so I want to know how many different unique values are there in that column and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I wanted to do is to have a count of "each" of these unique values not know how many unique values are there. For instance, in the first column I want to know how many Red, Blue, Green etc coloured objects are there.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
gets unique values in field 1, replacing 1 by 2 will give you unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique occurences you can make use of wc command in the chain as:
cut -f 1 input_file | sort | uniq | wc -l
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
cols=$(sed -n 1p $FILE | tr -cd '\t' | wc -c)
cols=$((cols + 2 ))
i=0
for ((i=1; i < $cols; i++))
do
echo Column $i ::
cut -f $i < "$FILE" | sort | uniq -c
echo
done
This script outputs the number of unique values in each column of a given file. It assumes that first line of given file is header line. There is no need for defining number of fields. Simply save the script in a bash file (.sh) and provide the tab delimited file as a parameter to this script.
Code
#!/bin/bash
awk '
(NR==1){
for(fi=1; fi<=NF; fi++)
fname[fi]=$fi;
}
(NR!=1){
for(fi=1; fi<=NF; fi++)
arr[fname[fi]][$fi]++;
}
END{
for(fi=1; fi<=NF; fi++){
out=fname[fi];
for (item in arr[fname[fi]])
out=out"\t"item"_"arr[fname[fi]][item];
print(out);
}
}
' $1
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

Resources