Sorting of data in descending order - linux

Allow me to clarify my query:
I have a database with thousands of character strings, each followed by a value (based on a scoring matrix):
GKCHGYEGRGFQGRHYEGRSDGPNGQL 25
WGCGGYESRGFQGRHYEGGGDCPNGQG 56
GLCCGYEGRGFQCRHYEGGGDGPNDQL 43
GKGCGYEGRGFQGRHYEHGIDKDHFFR 24
PYGSGGNRARRSGCSWMLYEQVNYSGD 4
DFTEDLRCLQDVFAFNEIVSLNVLERL 3
REDYRRQSIYELSNYRCRQYLTDPSDY 18
Some of the values are equal. I am trying to sort the data in descending order using:
sort -n -r file.txt
But the data is still disarranged. I also tried adding the -k argument.
Is it possible to get the following result:
GKCHGYEGRGFQGRHYEGRSDGPNGQL 56
WGCGGYESRGFQGRHYEGGGDCPNGQG 56
GLCCGYEGRGFQCRHYEGGGDGPNDQL 56
GKGCGYEGRGFQGRHYEHGIDKDHFFR 43
PYGSGGNRARRSGCSWMLYEQVNYSGD 25
DFTEDLRCLQDVFAFNEIVSLNVLERL 25
REDYRRQSIYELSNYRCRQYLTDPSDY 24
and so on.
I am new to Linux. Any help will be appreciated.

sort -k 2 -nr file.txt
This numerically sorts on the 2nd field in reverse (descending) order and prints the result.
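To see the effect on the sample data, here is a small reproduction (file.txt is built from a few of the lines above):

```shell
# Build a small sample file like the one in the question.
printf '%s\n' \
  'GKCHGYEGRGFQGRHYEGRSDGPNGQL 25' \
  'WGCGGYESRGFQGRHYEGGGDCPNGQG 56' \
  'GLCCGYEGRGFQCRHYEGGGDGPNDQL 43' \
  'REDYRRQSIYELSNYRCRQYLTDPSDY 18' > file.txt

# Sort on field 2 only, numerically, descending.
sort -k2,2 -nr file.txt
```

Note that this reorders whole lines, so each string keeps its own score; the -k2,2 form restricts the sort key strictly to the second field.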

Illogical number priority in file names in BASH [duplicate]

It so happens that I wrote a script in Bash, part of which is supposed to take files from a specified directory in numerical order. The files in that directory are named 1, 2, 3, 4, 5, etc. While running this script with 10 files in the directory, I discovered something that seems quite illogical: the script takes the files in a strange order: 10, 1, 2, 3, etc.
How do I make it run from the minimum numeric file name to the maximum?
Also, I am using the following line of code to define loop and path:
for file in /dir/*
Don't know if it matters, but I'm using Fedora 33 as OS.
File names are sorted in alphabetical order, so "10" comes before "2".
If I list 20 files whose names correspond to the 20 first integers, I get:
1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9
I can pipe the names through sort -n to sort them numerically rather than alphabetically. The following command:
for i in $(ls | sort -n) ; do echo $i ; done
produces the following output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
i.e. your command:
for file in /dir/*
should be rewritten (the names produced by ls are relative to /dir, so prefix them with /dir/ inside the loop; this also assumes the names contain no whitespace):
for file in $(ls /dir | sort -n)
If you have GNU sort then use the -V flag.
for file in /dir/* ; do echo "$file" ; done | sort -V
Or store the data in an array.
files=(/dir/*); printf '%s\n' "${files[@]}" | sort -V
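For a quick check of what -V (version sort) does, GNU sort assumed:

```shell
# Version sort treats each run of digits as a number,
# so 10 sorts after 2 instead of before it.
printf '%s\n' 1 10 2 20 3 | sort -V
# prints: 1 2 3 10 20 (one per line)
```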
As an aside, if you have the option and work once ahead of time is preferable to sorting every time, you could also format the names of your directories with leading zeroes. This is frequently a better design when possible.
I made both for some comparisons.
$: echo [0-9][0-9]/ # perfect list based on default string sort
00/ 01/ 02/ 03/ 04/ 05/ 06/ 07/ 08/ 09/ 10/ 11/ 12/ 13/ 14/ 15/ 16/ 17/ 18/ 19/ 20/
That also filters out any non-numeric names, and any non-directories.
$: for d in [0-9][0-9]/; do echo "${d%/}"; done
00
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
If I show both single- and double-digit versions (I made both)
$: shopt -s extglob
$: echo @(?|??)
0 00 01 02 03 04 05 06 07 08 09 1 10 11 12 13 14 15 16 17 18 19 2 20 3 4 5 6 7 8 9
Only the single-digit versions without leading zeroes get out of order.
The shell sorts the names by the locale order (not necessarily the byte value) of each individual character. Anything that starts with 1 will go before anything that starts with 2, and so on.
There are two main ways to tackle your problem:
sort -n (numeric sort) the file list, and iterate that.
Rename or recreate the target files (if you can), so all numbers are the same length (in bytes/characters). Left pad shorter numbers with 0 (eg. 01). Then they'll expand like you want.
Using sort (properly):
mapfile -td '' myfiles < <(printf '%s\0' * | sort -zn)
for file in "${myfiles[@]}"; do
    # what you were going to do
done
sort -z for zero/NUL-terminated lines is common but not POSIX. It makes processing paths/data that contain newlines safe. Without -z:
mapfile -t myfiles < <(printf '%s\n' * | sort -n)
# Rest is the same.
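A self-contained sketch of the NUL-delimited variant (the scratch directory and file names here are only for demonstration; mapfile -d needs bash 4.4+, and sort -z is a GNU extension):

```shell
#!/bin/bash
# Scratch directory with files named 1..12 (gaps don't matter).
dir=$(mktemp -d)
cd "$dir" || exit 1
for i in 1 2 3 10 11 12; do : > "$i"; done

# Read the NUL-delimited, numerically sorted names into an array.
mapfile -td '' myfiles < <(printf '%s\0' * | sort -zn)

# Iterates 1 2 3 10 11 12, not 1 10 11 12 2 3.
for file in "${myfiles[@]}"; do
    echo "$file"
done
```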
Rename the target files:
#!/bin/bash
cd /path/to/the/number/files || exit 1
# Gets length of the highest number. Or you can just hardcode it.
length=$(printf '%s\n' * | sort -n | tail -n 1)
length=${#length}
for i in *; do
mv -n "$i" "$(printf "%.${length}d" "$i")"
done
Examples for making new files with zero padded numbers for names:
touch {000..100} # Or
for i in {000..100}; do
> "$i"
done
If it's your script that made the target files, something like $(printf "%.${N}d" "$i") can be used to left pad the names before you write to them. But you first need to know the length in digits of the highest number (N).
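For instance, with N=3 (a width chosen purely for illustration), the precision in %.3d zero-pads the number to three digits:

```shell
printf '%.3d\n' 7     # 007
printf '%.3d\n' 42    # 042
printf '%.3d\n' 100   # 100
```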

sort alphanumerically with priority for numbers in linux [duplicate]

I want to sort a file alphanumerically but with priority for the numbers in each file entry. Example: File is:
22 FAN
14 FTR
16 HHK
19 KOT
25 LMC
22 LOW
22 MOK
22 RAC
22 SHS
18 SHT
20 TAP
19 TAW
23 TWO
15 UNI
I want to sort it as:
25 LMC
23 TWO
22 FAN
22 LOW
22 MOK
22 RAC
22 SHS
20 TAP
19 KOT
19 TAW
18 SHT
16 HHK
15 UNI
14 FTR
So, basically, you're asking to sort the first field numerically in descending order, but when the numeric keys are equal, you want the second field in ascending alphabetical order.
I tried a few things, but here's the way I managed to make it work:
sort -nk2 file.txt | sort -snrk1
Explanation:
The first command sorts the whole file on the second, alphanumeric field, while the second command re-sorts that output on the first, numeric field, shows it in reverse order, and requests a "stable" sort.
-n is for numeric sort, as opposed to alphanumeric, in which 60 would come before 7.
-r is for reversed order, so from highest to lowest. If unspecified, ascending order is assumed.
-k selects which key, or field, to use for sorting.
-s is for stable ordering: it maintains the original relative order of records whose keys compare equal.
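Running the pipeline on a few lines of the sample data (the file name is from the question):

```shell
# A subset of the question's sample file.
printf '%s\n' '22 FAN' '14 FTR' '25 LMC' '22 LOW' '19 KOT' '19 TAW' > file.txt

# Pass 1 orders lines by the second field; pass 2's stable (-s)
# numeric descending sort on field 1 preserves that order for ties.
sort -nk2 file.txt | sort -snrk1
```

This prints 25 LMC first, then the 22s in alphabetical order (FAN, LOW), and so on down to 14 FTR.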
There is no need for a pipe, or the additional subshell it spawns. Simply using a keydef covering fields 1 and 2 will do:
$ sort -k1nr,2 file
Example/Output
$ sort -k1nr,2 file
25 LMC
23 TWO
22 FAN
22 LOW
22 MOK
22 RAC
22 SHS
20 TAP
19 KOT
19 TAW
18 SHT
16 HHK
15 UNI
14 FTR

how does linux store negative numbers in text files?

I made a file using ed and named it numeric. Its content is as follow:
-100
-10
0
99
11
-56
12
Then I executed this command on terminal:
sort numeric
And the result was:
0
-10
-100
11
12
-56
99
And of course this output was not at all expected!
sort needs to be asked to sort numerically (otherwise it defaults to lexicographic sorting):
$ sort -n numbers.dat
-100
-56
-10
0
11
12
99
Note the "-n" option (see the manual).
Text files are text files, they contain text. Your numbers are sorted alphabetically. If you want sort to sort based on numerical value, use sort -n.
Also, your sort result is strange; when I run the same test I get:
$ sort numeric
-10
-100
-56
0
11
12
99
Sorted alphabetically, as expected.
See https://glot.io/snippets/e555jjumx6
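The difference between the two outputs above is most likely the locale: many UTF-8 locales ignore punctuation such as "-" when collating, while the C locale compares bytes, so "-" sorts before the digits. A quick way to see both behaviours (a sketch; available locales vary by system):

```shell
printf '%s\n' -100 -10 0 99 11 -56 12 > numeric

# Byte-wise lexical sort: "-" sorts before digits, "-10" before "-100".
LC_ALL=C sort numeric

# Numeric sort: what was actually wanted.
sort -n numeric
```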
Use sort -n to make sort do numerical sorting instead of alphabetical
That's because by default, sort is alphabetical, not numeric; sort -n sorts numerically.
Otherwise you'll get
1
10
2
3
etc.

How to keep only those rows which are unique in a tab-delimited file in unix

Here, two rows are considered redundant if the second value is the same.
Is there any unix/linux command that can achieve the following?
1 aa
2 aa
1 ss
3 dd
4 dd
Result
1 aa
1 ss
3 dd
I generally use the following command, but it does not achieve what I want here.
sort -k2 /Users/fahim/Desktop/delnow2.csv | uniq
Edit:
My file had roughly 25 million lines:
Time when using the solution suggested by @Steve: 33 seconds.
$ date; awk -F '\t' '!a[$2]++' myfile.txt > outfile.txt; date
Wed Nov 27 18:00:16 EST 2013
Wed Nov 27 18:00:49 EST 2013
The sort-and-uniq approach was taking too much time; I quit after waiting for 5 minutes.
Perhaps this is what you're looking for:
awk -F "\t" '!a[$2]++' file
Results:
1 aa
1 ss
3 dd
I understand that you want a unique sorted file by the second field.
You need to add -u to sort to achieve this.
sort -u -k2 /Users/fahim/Desktop/delnow2.csv
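Note that the two answers differ in which row survives and in output order: awk '!a[$2]++' keeps the first occurrence of each key in the original line order, while sort -u -k2 emits rows ordered by the key (and which duplicate survives is not guaranteed). A small comparison, assuming a tab-delimited file:

```shell
# Tab-delimited sample from the question.
printf '1\taa\n2\taa\n1\tss\n3\tdd\n4\tdd\n' > delnow2.tsv

# First occurrence per second field, original order preserved.
awk -F '\t' '!a[$2]++' delnow2.tsv

# One row per unique key, ordered by the key instead.
sort -u -k2 delnow2.tsv
```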

merging two files based on two columns

I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the orders are the same, so there is no need to sort).
Example,
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And the output to look like
subjectid SubID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joann 31
When I use join, it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join or some other way, please? Thank you.
The join command doesn't have an option to use more than one field as the join key. Hence, you will have to add some intelligence into the mix. Assuming your files have a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$3" "$4" "$6}'
provided that the field counts are as given in your examples. Otherwise, you need to adjust the print list in the awk command by adding or removing fields.
If the orders are identical, you could still merge by a single column and specify the format of which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).
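A quick run of the -o form (the data here is a sorted two-row subset of the sample, since join expects its inputs ordered on the join field):

```shell
# file_a: subjectid subID2 name age
printf '%s\n' '12 121 Jane 16' '24 241 Kristen 90' > file_a
# file_b: subjectid subID2 prob_disease
printf '%s\n' '12 121 0.009' '24 241 0.738' > file_b

# Join on field 1; -o lists the columns to emit as FILE.FIELD.
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
```

Since subID2 is taken from file 1 (1.2), the duplicate copy from file 2 never appears in the output.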
