Trying to understand the sort utility in Linux

I have a file named a.csv, which contains:
100008,3
10000,3
100010,5
100010,4
10001,6
100021,7
After running this command sort -k1 -d -t "," a.csv
The result is
10000,3
100008,3
100010,4
100010,5
10001,6
100021,7
This is unexpected, because 10001 should come before 100010.
I have been trying to understand why this happens for a long time, but couldn't find any answers.
$ sort --version
sort (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.

Some of the other responses have assumed this is a numeric sort vs dictionary sort problem. It isn't, as even sorting alphabetically the output given in the question is incorrect.
The answer
To get the correct sorting, you need to change -k1 to -k1,1:
$ sort -k1,1 -d -t "," a.csv
10000,3
100008,3
10001,6
100010,4
100010,5
100021,7
The reason
The -k option takes two numbers, the start and end fields to sort (i.e. -ks,e where s is the start and e is the end). By default, the end field is the end of the line. Hence, -k1 is the same as not giving the -k option at all. To show this, compare:
$ printf "1,a,1\n2,aa,2\n" | sort -k2 -t,
1,a,1
2,aa,2
with:
$ printf "1~a~1\n2~aa~2\n" | sort -k2 -t~
2~aa~2
1~a~1
The first sorts a,1 before aa,2, while the second sorts aa~2 before a~1 since, in ASCII, , < a < ~.
To get the desired behaviour, therefore, we need to sort only one field. In your case, that means using 1 as both the start and end field, so you specify -k1,1. If you try the two examples above with -k2,2 instead of -k2, you'll find you get the same (correct) ordering in both cases.
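As a rough check of the explanation above, the question's data can be piped through both forms of the key. This is only a sketch: the LC_ALL=C setting and the variable names are assumptions added here to make the byte-wise ordering reproducible; they are not part of the original command.

```shell
# Same rows as a.csv in the question, kept in a variable so no file is needed.
data='100008,3
10000,3
100010,5
100010,4
10001,6
100021,7'

# -k1: the key runs from field 1 to the end of the line. With -d the comma
# is ignored, so the row 10001,6 is effectively compared as 100016.
whole_line=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -d -k1)

# -k1,1: the key is the first field only.
first_field=$(printf '%s\n' "$data" | LC_ALL=C sort -t, -d -k1,1)

printf 'with -k1:\n%s\n\nwith -k1,1:\n%s\n' "$whole_line" "$first_field"
# with -k1   -> 10001,6 ends up fifth, as in the question
# with -k1,1 -> 10001,6 ends up third, right after 100008,3
```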
Many thanks to Eric and Assaf from the coreutils mailing list for pointing this out.

You have not found a bug in sort. Your usage bug is that you used '-k1' ("set the key to the first field through the end of the line") instead of '-k1,1' ("set the key to use only the first field"). If you use GNU sort, the --debug option will show you the difference. The delimiter is included in the key as long as the key extends beyond a single field.
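To see what --debug reports, something like the following can be tried. It is a sketch that assumes a GNU sort recent enough to have --debug; the variable names are made up for illustration, and the diagnostics that sort prints to stderr are discarded here.

```shell
# One sample row is enough to see which part of the line is used as the key.
# GNU sort --debug underlines the key with '_' characters below each line.
debug_whole=$(printf '10001,6\n' | LC_ALL=C sort --debug -t, -k1 2>/dev/null)
debug_field=$(printf '10001,6\n' | LC_ALL=C sort --debug -t, -k1,1 2>/dev/null)

printf '%s\n\n%s\n' "$debug_whole" "$debug_field"
# With -k1 the underline spans the whole line, comma included;
# with -k1,1 it spans only the five characters of 10001.
```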

It sorts alphabetically, not numerically, so "," is before "0", i.e. more like a dictionary

The -d option is for --dictionary-order:
-d, --dictionary-order
consider only blanks and alphanumeric characters
But I think you want to use -n (--numeric-sort) instead:
-n, --numeric-sort
compare according to string numerical value
So, change your command to look like this:
sort -k1 -n -t "," a.csv
http://man7.org/linux/man-pages/man1/sort.1.html

The sort is alphabetical, not numerical. Replace -d by -n in your option list to sort numerically.
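If numeric order on the first column is what is wanted, a minimal sketch could look like this. Restricting the key to -k1,1 and attaching n to it is my own variation on the commands above, and LC_ALL=C is an assumption to keep the tie-breaking predictable.

```shell
# Numeric sort on the first comma-separated field only.
numeric=$(printf '%s\n' 100008,3 10000,3 100010,5 100010,4 10001,6 100021,7 |
  LC_ALL=C sort -t, -k1,1n)

printf '%s\n' "$numeric"
# 10000,3
# 10001,6
# 100008,3
# 100010,4
# 100010,5
# 100021,7
```

The two 100010 rows have equal keys, so sort falls back to comparing the whole lines, which puts 100010,4 first.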

Related

Linux bash sorting in reverse order based on a column isn't working as expected

I'm trying to sort a text file in the descending order based on the last column. It just doesn't seem to work.
cat 1.txt | sort -r -n -k 4,4
ACHG,89.46,0.08,34200
UUKJL,0.85,-15.00,200
NIMJKY,34.35,0.09,17700
TBBNHW,10.24,0.00,4600
JJkLEYE,73.67,0.48,25400
I've tried removing spaces just in case, but that hasn't helped. I also tried sorting by the other fields just to see, but I have the same problem.
I just can't work out what is wrong with the command I've issued. Please could I request help with this one?
Your command is almost right, but it is missing the field separator option -t, which should set the comma as the field separator.
This should work for you:
sort -t, -rnk 4,4 1.txt
ACHG,89.46,0.08,34200
JJkLEYE,73.67,0.48,25400
NIMJKY,34.35,0.09,17700
TBBNHW,10.24,0.00,4600
UUKJL,0.85,-15.00,200
Note that there is no need to use cat | sort here.
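For anyone who wants to reproduce this without the file, here is a small sketch using the rows from the question. The variable name is hypothetical, and writing the options as -k4,4nr (attached to the key) is just an equivalent spelling I chose, not the answer's exact command.

```shell
# Reverse numeric sort on the 4th comma-separated field.
sorted=$(printf '%s\n' \
  'ACHG,89.46,0.08,34200' \
  'UUKJL,0.85,-15.00,200' \
  'NIMJKY,34.35,0.09,17700' \
  'TBBNHW,10.24,0.00,4600' \
  'JJkLEYE,73.67,0.48,25400' |
  LC_ALL=C sort -t, -k4,4nr)

printf '%s\n' "$sorted"
# First line: ACHG,89.46,0.08,34200   (largest value, 34200)
# Last line:  UUKJL,0.85,-15.00,200   (smallest value, 200)
```

Without -t, the whole line counts as one field (there are no blanks), so field 4 is empty on every line and the key cannot distinguish the rows.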

linux bash 'sort' in dictionary order

I am trying to sort a file, in ascending order. The file has both alphabets and numerical values.
aae-miR-1
aae-miR-10
aae-miR-100
aae-miR-1000
aae-miR-11-3p
aae-miR-11-5p
aae-miR-1174
aae-miR-1175-3p
aae-miR-1175-5p
aae-miR-12-3p
aae-miR-124
I want the output as
aae-miR-1
aae-miR-10
aae-miR-11-3p
aae-miR-11-5p
aae-miR-12-3p
aae-miR-100
aae-miR-124
aae-miR-1000
aae-miR-1174
aae-miR-1175-3p
aae-miR-1175-5p
I used,
sort -k1,1 -n <file>
to sort in numeric and alphabetical order, but the output is not what I expected. Please suggest how to use sort for this.
You should use sort -t"-" -k3n file.txt for this case.
Output received:
aae-miR-1
aae-miR-10
aae-miR-11-3p
aae-miR-11-5p
aae-miR-12-3p
aae-miR-100
aae-miR-124
aae-miR-1000
aae-miR-1174
aae-miR-1175-3p
aae-miR-1175-5p
This is more explicit. The '-t' option specifies the delimiter for delimited files. The '-k' option specifies the keys on which the sorting is done. The format of '-k' is -km[,n], where m is the starting field and n is the ending field; n is optional and is used only when required.
Try:
sort -n -t- -k3 <file>
-n will numerically sort.
-t- will use - as field separator.
-k3 will use third field to sort by.
Try this, with separator:
sort -t - -k3n file
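A sketch of the same idea with the keys bounded explicitly, using a shuffled subset of the names from the question. The second key (-k4,4) and the variable name are my own additions, assumed here so that the 3p/5p suffix is compared as its own field instead of relying on the whole-line fallback.

```shell
# Field 3 (the number) is compared numerically; field 4 (3p / 5p), when
# present, breaks ties as plain text.
mirna_sorted=$(printf '%s\n' \
  aae-miR-1175-5p aae-miR-100 aae-miR-11-5p aae-miR-1 \
  aae-miR-1000 aae-miR-11-3p aae-miR-12-3p aae-miR-10 |
  LC_ALL=C sort -t- -k3,3n -k4,4)

printf '%s\n' "$mirna_sorted"
# aae-miR-1
# aae-miR-10
# aae-miR-11-3p
# aae-miR-11-5p
# aae-miR-12-3p
# aae-miR-100
# aae-miR-1000
# aae-miR-1175-5p
```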

Linux sort -Help Wanted

I've been stuck on a problem for a few days. Here it is; maybe you've got bigger brains than me!
I've got a bunch of CSV files and I want them concatenated into a single .csv file, numerically sorted. OK, the first problem I encountered is with the ID name (I want to sort only by ID).
eg
sort -f *.csv > output.csv This would work if I had standard IDs like id001, id002, id010, id100
but my IDs are like id1, id2, id10, id100, and this makes my sort inaccurate.
Ok
sort -t, -V *.csv > output.csv - This works perfectly on my test machine (sort --version GNU coreutils 8.5.0), but my live machine at work has sort version 5.3.0 (which doesn't implement the -V option) and I cannot update it!
I feel like such a noob, and unlucky.
If you have a better idea, please bring it on.
my csv file looks like
cn41 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn42 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn43 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn44 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn45 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
This is actually copy / paste from a csv. So let's say, this is my first CSV. and the other one looks like
cn201 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn202 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn203 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn204 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn205 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
Looking forward to hearing from you!
Regards
You can use -kX.Y to start the key at character Y of column X, together with -n for numeric sorting:
sort -t, -k2.3 -n *csv
Given your sample file, it produces:
$ sort -t, -k2.3 -n file
,id1,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id2,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id10,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id40,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id101,aaaaaa,bbbbbbbbbb,cccccccccccc,ddddddd
,id201,aaaaaaaaa,bbbbbbbbbb,ccccccccccc,ddddddd
Update
For your given input, I would do:
$ cat *csv | sort -k1.3 -n
cn41 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn42 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn43 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn44 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn45 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
cn201 AQ34070YTW CDEAQ34070YTW 9C:B6:54:08:A3:C6 9C:B6:54:08:A3:C4
cn202 AQ34070YTY CDEAQ34070YTY 9C:B6:54:08:A4:22 9C:B6:54:08:A4:20
cn203 AQ34070YV1 CDEAQ34070YV1 9C:B6:54:08:9F:0E 9C:B6:54:08:9F:0C
cn204 AQ34070YV3 CDEAQ34070YV3 9C:B6:54:08:A3:7A 9C:B6:54:08:A3:78
cn205 AQ34070YW7 CDEAQ34070YW7 9C:B6:54:08:25:22 9C:B6:54:08:25:20
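A shortened sketch of the same -kX.Y idea, with rows cut down to two columns for readability. Ending the key at field 1 (-k1.3,1n) is an assumption of mine to keep the numeric key from running into the rest of the line; the row contents and variable name are illustrative only.

```shell
# The key starts at the 3rd character of field 1 (skipping "cn"),
# ends at the end of field 1, and is compared numerically.
cn_sorted=$(printf '%s\n' 'cn201 AQ34070YTW' 'cn41 AQ34070YV1' 'cn5 AQ34070YW7' |
  LC_ALL=C sort -k1.3,1n)

printf '%s\n' "$cn_sorted"
# cn5 AQ34070YW7
# cn41 AQ34070YV1
# cn201 AQ34070YTW
```

This only works while the prefix has a fixed width of two characters; a prefix of a different length would need a different offset.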
If your CSV format is fixed, you can use the shell equivalent of the decorate-sort-undecorate pattern:
cat *.csv | sed 's/^,id//' | sort -n | sed 's/^/,id/' >output.csv
The -n option is present even in ancient versions of sort.
UPDATE: the updated input contains a number with a different prefix, and at a different position in the line. Here is a version that handles both kinds of input, as well as other inputs that have a number somewhere in the line, sorting by the first number:
cat *.csv | sed 's/^\([^0-9]*\)\([0-9][0-9]*\)/\2 \1\2/' \
| sort -n \
| sed 's/^[^ ]* //' > output.csv
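The decorate-sort-undecorate pipeline above can be wrapped in a function and tried on mixed input. This is a sketch only: the function name sort_by_first_number and the sample rows are hypothetical, and it assumes every line contains at least one digit (a line without digits would not be decorated, so the last sed would strip its first word).

```shell
# Decorate: copy the first number on the line to the front, followed by a space.
# Sort:     numeric sort on that leading number.
# Undecorate: drop everything up to and including the first space.
sort_by_first_number() {
  sed 's/^\([^0-9]*\)\([0-9][0-9]*\)/\2 \1\2/' |
    LC_ALL=C sort -n |
    sed 's/^[^ ]* //'
}

dsu_sorted=$(printf '%s\n' \
  'cn201 AQ34070YTW' \
  ',id10,aaaaaa' \
  'cn41 AQ34070YV1' \
  ',id2,bbbbbb' | sort_by_first_number)

printf '%s\n' "$dsu_sorted"
# ,id2,bbbbbb
# ,id10,aaaaaa
# cn41 AQ34070YV1
# cn201 AQ34070YTW
```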
You could try the -g option:
sort -t, -k 2.3 -g fileName
-t separator
-k key/column
-g general numeric sort

Bash/Linux Sort by 3rd column using custom field separator

I can't seem to sort the following data as I would like;
find output/ -type f -name *.raw | sort
output/rtp.0.0.raw
output/rtp.0.10.raw
output/rtp.0.11.raw
output/rtp.0.12.raw
output/rtp.0.13.raw
output/rtp.0.14.raw
output/rtp.0.15.raw
output/rtp.0.16.raw
output/rtp.0.17.raw
output/rtp.0.18.raw
output/rtp.0.19.raw
output/rtp.0.1.raw
output/rtp.0.20.raw
output/rtp.0.2.raw
output/rtp.0.3.raw
output/rtp.0.4.raw
output/rtp.0.5.raw
output/rtp.0.6.raw
output/rtp.0.7.raw
output/rtp.0.8.raw
output/rtp.0.9.raw
In the above example I haven't passed any arguments to the sort command. No matter what options I used I can't get closer to my desired results. I would like the following output;
find output/ -type f -name *.raw | sort
output/rtp.0.0.raw
output/rtp.0.1.raw
output/rtp.0.2.raw
output/rtp.0.3.raw
output/rtp.0.4.raw
output/rtp.0.5.raw
output/rtp.0.6.raw
output/rtp.0.7.raw
output/rtp.0.8.raw
output/rtp.0.9.raw
output/rtp.0.10.raw
output/rtp.0.11.raw
output/rtp.0.12.raw
output/rtp.0.13.raw
output/rtp.0.14.raw
output/rtp.0.15.raw
output/rtp.0.16.raw
output/rtp.0.17.raw
output/rtp.0.18.raw
output/rtp.0.19.raw
output/rtp.0.20.raw
I have tried with -t . option to set a field separator to the full stop. Also I have experimented with the -k option to specify the field, and -g, -h, -n, but none of the options are helping. I can't see anything else in the man pages that would do as I require, unless I haven't understood the man pages correctly and overlooked my answer.
Can I produce the results I require with sort, and if so, how?
Additionally, it's very rare, but sometimes the 2nd column, which shows as '0' all the way down, may increment. Can that be factored into the sort?
This makes it:
$ sort -t'.' -n -k3 a
output/rtp.0.0.raw
output/rtp.0.1.raw
output/rtp.0.2.raw
output/rtp.0.3.raw
output/rtp.0.4.raw
output/rtp.0.5.raw
output/rtp.0.6.raw
output/rtp.0.7.raw
output/rtp.0.8.raw
output/rtp.0.9.raw
output/rtp.0.10.raw
output/rtp.0.11.raw
output/rtp.0.12.raw
output/rtp.0.13.raw
output/rtp.0.14.raw
output/rtp.0.15.raw
output/rtp.0.16.raw
output/rtp.0.17.raw
output/rtp.0.18.raw
output/rtp.0.19.raw
output/rtp.0.20.raw
As you see we need different options:
-t'.' to set the dot . as the field separator.
-n to make it numeric sort.
-k3 to check the 3rd column.
Update
This also makes it:
$ sort -t'.' -V -k2 a
output/rtp.0.0.raw
output/rtp.0.1.raw
output/rtp.0.2.raw
output/rtp.0.3.raw
output/rtp.0.4.raw
output/rtp.0.5.raw
output/rtp.0.6.raw
output/rtp.0.7.raw
output/rtp.0.8.raw
output/rtp.0.9.raw
output/rtp.0.10.raw
output/rtp.0.11.raw
output/rtp.0.12.raw
output/rtp.0.13.raw
output/rtp.0.14.raw
output/rtp.0.15.raw
output/rtp.0.16.raw
output/rtp.0.17.raw
output/rtp.0.18.raw
output/rtp.0.19.raw
output/rtp.0.20.raw
As you see we need different options:
-t'.' to set the dot . as the field separator.
-V to make it sort based on version.
-k2 to check the 2nd column.
Fedorqui's solution is good (+1), but not all versions of sort support -V. For those versions that do not, you need to do a little more work than fedorqui's original solution. This should suffice:
sort -t. -k2,2 -k3,3 -n
You get a slightly different sort (eg, '05' sorts before '1' instead of after) if you use:
sort -t. -k2g
(Note that -g is also non-standard, and not available in all versions of sort).
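To cover the case where the 2nd column also changes, here is a sketch of the two-key form without -V. The file names with a 1 in the second column are invented for the example, and attaching n to each key is simply an equivalent way of writing the global -n.

```shell
# Field 2 is compared numerically first, then field 3, both split on '.'.
rtp_sorted=$(printf '%s\n' \
  output/rtp.0.10.raw \
  output/rtp.1.2.raw \
  output/rtp.0.9.raw \
  output/rtp.1.10.raw \
  output/rtp.0.1.raw |
  LC_ALL=C sort -t. -k2,2n -k3,3n)

printf '%s\n' "$rtp_sorted"
# output/rtp.0.1.raw
# output/rtp.0.9.raw
# output/rtp.0.10.raw
# output/rtp.1.2.raw
# output/rtp.1.10.raw
```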
sort -t'.' -n -k3
sorts by the 3rd column from smallest to largest.
If you want to sort from largest to smallest, you can add the '-r' option:
sort -t'.' -n -r -k3

Bash/unix sort "YYYY-[N]NN"

I have a list like this -
2009-96 2010-100 2010-101 2010-97 2010-98 2010-99 2009-99a 2011-102
How do I sort the numbers in the right order, so that it's sorted by first 4 digits (year) if the year is different, otherwise it is sorted by the digit after -?
The right output which I want is -
2009-96 2009-99a 2010-97 2010-98 2010-99 2010-100 2010-101 2011-102
It depends on your version of sort, because the command line options may be different, but on my system, sort -t - -k 1,1n -k 2,2n <filename> works.
With GNU sort (std on Linux):
sort -t'-' -n
sort sorts lines, so convert your space delimiters to \n and back using tr, as shown in @dimba's answer.
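Putting the pieces together for the space-separated list, a sketch might look like the following. Using paste instead of a second tr to rejoin the lines is my own choice (it avoids a trailing space), and the variable names are hypothetical.

```shell
list='2009-96 2010-100 2010-101 2010-97 2010-98 2010-99 2009-99a 2011-102'

# One item per line, sort by year and then by the number after '-',
# both numerically, then join the lines back with single spaces.
year_sorted=$(printf '%s\n' "$list" |
  tr ' ' '\n' |
  LC_ALL=C sort -t- -k1,1n -k2,2n |
  paste -sd' ' -)

printf '%s\n' "$year_sorted"
# 2009-96 2009-99a 2010-97 2010-98 2010-99 2010-100 2010-101 2011-102
```

The trailing letter in 99a is ignored by the numeric comparison, which reads only the leading digits of the field.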

Resources