How to sort data by the numbers in the third column?

If I have a file consisting of data that looks as follows, how would I sort the data based on the numbers in the third column?
The first two columns are separated by a variable number of spaces, not tabs. The gap between the second and third columns varies with the size of the number.
Also note that some values in the second column contain spaces (for example, between ( and p in lp25( plasmid), while others do not (like chromosome).
HELIX lp25(plasmid 24437 bp RNA linear 29-AUG-2011
HELIX cp9(plasmid 9586 bp DNA helix 29-AUG-2011
HELIX lp28-1(plasmid 25455 bp DNA linear 29-AUG-2011
HELIX chromosome 911724 bp DNA plasmid 29-AUG-2011

Here you go:
sort -n -k 3 test.txt
From man sort:
-n, --numeric-sort compare according to string numerical value
-k, --key=KEYDEF sort via a key; KEYDEF gives location and type
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end. If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key. If no key is given, use
the entire line as the key.
and also interesting:
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
which tells us that the F fields are separated by whitespace.
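As a minimal sketch of the command above, the following uses the sample rows as they appear in the question. It assumes the second column contains no internal spaces in these rows (if it does, the number is no longer field 3), and it writes the key as -k3,3n so that only the third field is compared. The variable names data and sorted are illustrative only.

```shell
# Sample rows copied from the question. Assumption: the second column
# has no internal spaces here, so the number really is field 3.
data='HELIX lp25(plasmid 24437 bp RNA linear 29-AUG-2011
HELIX cp9(plasmid 9586 bp DNA helix 29-AUG-2011
HELIX lp28-1(plasmid 25455 bp DNA linear 29-AUG-2011
HELIX chromosome 911724 bp DNA plasmid 29-AUG-2011'

# -k3,3 limits the key to field 3; the n suffix compares it numerically.
sorted=$(printf '%s\n' "$data" | sort -k3,3n)
printf '%s\n' "$sorted"
```

Limiting the key with 3,3 is not strictly required for a numeric key, but it keeps the comparison from running on to the end of the line.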

Related

Find if the first 10 digits of two columns on csv file are matched in bash

I have a file (names.csv) that contains two columns, with values separated by a comma:
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
Each value starts with a 10-character unique identifier, followed by some extra junk (-anything).
I want to check whether the identifier prefixes in the two columns match.
To verify the values on first and second column I use:
cat /home/names.csv | parallel --colsep ',' echo column 1 = {1} column 2 = {2}
This prints the values. Because the values are hex digits, verifying them one by one by eye is cumbersome. Is there a way to check whether the first 10 characters of each column pair match exactly? They might contain special characters.
Expected output (example, but anything that says the columns are matched or not can work):
Matches (including first line):
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
Non-matches
e123456777-anything,e123456999-anything
Here's one way using awk. It prints every line where the first 10 characters of the first two fields match.
% cat /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
e123456777-anything,e123456999-anything
% awk -F, 'substr($1,1,10)==substr($2,1,10)' /tmp/names.csv
,
a123456789-anything,a123456789-anything
b123456789-anything,b123456789-anything
c123456789-anything,c123456789-anything
d123456789-anything,d123456789-anything
e123456789-anything,e123456789-anything
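Since the question also asks to see the non-matching lines, one possible variation on the awk command above is to label every line instead of filtering. This is only a sketch: the MATCH and DIFF labels, the temporary file, and the variable names csv and report are made up for illustration, and the data is a shortened copy of the sample.

```shell
# Hypothetical temporary file holding a shortened copy of the sample data.
csv=$(mktemp)
cat > "$csv" <<'EOF'
,
a123456789-anything,a123456789-anything
e123456777-anything,e123456999-anything
EOF

# Tag each line MATCH or DIFF depending on whether the first 10
# characters of fields 1 and 2 are identical (string comparison).
report=$(awk -F, '{
  tag = (substr($1, 1, 10) == substr($2, 1, 10)) ? "MATCH" : "DIFF"
  print tag "\t" $0
}' "$csv")
printf '%s\n' "$report"
rm -f "$csv"
```

Because substr compares plain characters, special characters in the prefix need no escaping.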

How to sort by column and break ties randomly

I have a tab-delimited file with three columns like this:
joe W 4
bob A 1
ana F 1
roy J 3
sam S 0
don R 2
tim L 0
cyb M 0
I want to sort this file by decreasing values in the third column, but to break ties I do not want to use some other column to do so (i.e. not use the first column to sort rows with the same entry in the third column).
Instead, I want rows with the same third column entries to either preserve the original order, or be sorted randomly.
Is there a way to do this using the sort command in unix?
sort -k3 -r -s file
This should give you the required output.
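-k3 selects the third column as the sort key, -r sorts in decreasing order, and -s makes the sort stable by disabling the last-resort comparison of whole lines, so rows with equal keys keep their original order. If the values can have more than one digit, make the key numeric as well (for example -k3,3nr).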
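The question also allows ties to be broken randomly. One way to sketch both behaviours, assuming GNU coreutils (shuf is not part of POSIX), is to shuffle the input first and then apply the same stable sort. The variable names rows, tab, stable and random are illustrative only.

```shell
# Sample rows from the question, rebuilt as tab-delimited lines.
rows=$(printf '%s\t%s\t%s\n' joe W 4 bob A 1 ana F 1 roy J 3 sam S 0 don R 2 tim L 0 cyb M 0)
tab=$(printf '\t')

# Ties keep their input order: numeric (n), reversed (r), stable (-s).
stable=$(printf '%s\n' "$rows" | sort -t "$tab" -k3,3nr -s)

# Ties in random order: shuffle first, then run the same stable sort.
random=$(printf '%s\n' "$rows" | shuf | sort -t "$tab" -k3,3nr -s)

printf '%s\n' "$stable"
```

The shuffle randomizes the whole input, and the stable sort then only reorders rows whose third column differs, so rows within each tie group stay in their shuffled order.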

Alphanumeric sorting of a string with variable width size

I am stuck in a small sorting step. I have a huge file with >300K entries and the file has to be sorted on a specific column containing alphanumeric identifiers as
Rpl12-8
Lrsam1-1
Rpl12-9
Lrsam1-2
Rpl12-10
Lrsam1-5
Rpl12-11
Lrsam1-101
Lrsam2-1
Act-1
Act-100
Act-101
Act-11
The problem is the variable width, so I cannot specify a character position for the second key (sort -k 1.8n). The sort should go first by the leading letters, then by the number next to them, and then by the number after "-". Can I sort specifically on the part after "-" by using it as a field delimiter, so that the width of the string does not matter?
Desired output would be :
Act-1
Act-11
Act-100
Act-101
Lrsam1-1
Lrsam1-2
Lrsam1-5
Lrsam1-101
Lrsam2-1
Rpl12-8
Rpl12-9
Rpl12-10
Rpl12-11
With the above data in input.txt:
sort -t- -k1,1 -k2n input.txt
You can change the field delimiter to - with -t, then sort on the first field only (as a string) with -k1,1, and finally the 2nd field (as a number) with -k2n.
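A small check of this approach against the sample identifiers might look as follows. It pins the locale with LC_ALL=C so that the string comparison of the first field is byte-wise (an assumption about what is wanted), and it writes the second key as -k2,2n to limit it to that field. The variable names ids and sorted are illustrative.

```shell
# Sample identifiers from the question, in their original order.
ids='Rpl12-8
Lrsam1-1
Rpl12-9
Lrsam1-2
Rpl12-10
Lrsam1-5
Rpl12-11
Lrsam1-101
Lrsam2-1
Act-1
Act-100
Act-101
Act-11'

# Field 1 (before "-") as a string, field 2 (after "-") as a number.
sorted=$(printf '%s\n' "$ids" | LC_ALL=C sort -t- -k1,1 -k2,2n)
printf '%s\n' "$sorted"
```

This only works as shown when each identifier contains a single "-"; identifiers with additional hyphens would split into more fields.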

Linux shell sort file according to the second column?

I have a file like this:
FirstName, FamilyName, Address, PhoneNumber
How can I sort it by FamilyName?
If this is UNIX:
sort -k 2 file.txt
You can use multiple -k flags to sort on more than one column. For example, to sort by family name then first name as a tie breaker:
sort -k 2,2 -k 1,1 file.txt
Relevant options from "man sort":
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
sort -nk2 file.txt
Accordingly you can change column number.
To sort by the second field only, so that lines with matching second fields stay in their original order instead of being sorted on other fields:
sort -k 2,2 -s orig_file > sorted_file
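The file in the question is comma-separated, so a variant with an explicit delimiter may be closer to what is needed. The sketch below uses made-up rows in the FirstName, FamilyName, Address, PhoneNumber layout and assumes there is no space after each comma; with ", " as the separator the leading space would become part of the field unless the b key option is added. The variable names people and by_family are illustrative.

```shell
# Hypothetical rows: FirstName,FamilyName,Address,PhoneNumber
people='Zoe,Adams,1 Elm St,555-0101
Al,Young,2 Oak St,555-0102
Bea,Adams,3 Fir St,555-0103'

# -t, splits on commas, -k2,2 sorts on FamilyName only, and -s keeps
# rows with the same family name in their original order.
by_family=$(printf '%s\n' "$people" | LC_ALL=C sort -t, -k2,2 -s)
printf '%s\n' "$by_family"
```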
FWIW, here is a sort command for showing which processes are using the most virtual memory.
memstat | sort -k 1 -t':' -g -r | less
The options select the first column as the key, use : as the column separator, and sort numerically in reverse order.

Why linux sort is not giving me desired results?

I have a file a.csv with contents similar to below
a,b,c
a ,aa, a
a b, c, f
a , b, c
a b a b a,a,a
a,a,a
a aa ,a , t
I am trying to sort it by using sort -k1 -t, a.csv
But it gives the following results:
a,a,a
a ,aa, a
a aa ,a , t
a b a b a,a,a
a , b, c
a,b,c
a b, c, f
This is not sorted on the first column. What am I doing wrong?
You have to specify the end position to be 1, too:
sort -k1,1 -t, a.csv
Give this a try: sort -t, -k1,1 a.csv
The man page says that if the end field is omitted, the key runs from field n to the end of the line:
-k POS1[,POS2]
The recommended, POSIX, option for specifying a sort field. The
field consists of the part of the line between POS1 and POS2 (or
the end of the line, if POS2 is omitted), _inclusive_. Fields and
character positions are numbered starting with 1. So to sort on
the second field, you'd use `-k 2,2' See below for more examples.
Try this instead:
sort -k 1,1 -t , a.csv
sort reads -k 1 as "sort from the first field onwards", which defeats the point of passing the argument in the first place.
This is documented in the sort man page and warned about in the Examples section:
Sort numerically on the second field and resolve ties by sorting alphabetically on the third and fourth characters of field five. Use `:' as the field delimiter:
$ sort -t : -k 2,2n -k 5.3,5.4
Note that if you had written -k 2 instead of -k 2,2, sort would have used all characters beginning in the second field and extending to the end of the line as the primary numeric key. For the large majority of applications, treating keys spanning more than one field as numeric will not do what you expect.
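To see the effect of closing the key on the data from the question, a sketch like the following can be run. It assumes the C locale (set with LC_ALL=C) so that spaces are compared byte-wise rather than ignored by locale collation rules. The temporary file and the variable names csv and first_only are illustrative.

```shell
# Hypothetical temporary copy of a.csv from the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
a,b,c
a ,aa, a
a b, c, f
a , b, c
a b a b a,a,a
a,a,a
a aa ,a , t
EOF

# Key is field 1 only; lines with equal keys fall back to a
# whole-line comparison because -s is not given.
first_only=$(LC_ALL=C sort -t, -k1,1 "$csv")
printf '%s\n' "$first_only"
rm -f "$csv"
```

With the key limited to the first field, the first-column values come out in order regardless of what follows the first comma.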
