Sorting multiple keys with Unix sort -- Bug? - linux

I'm trying to sort my data by multiple keys with Unix sort, and I think I am getting a wrong result. My command is:
sort -t "_" -k4,4 -k2 -k1,1g < stdev.txt
And the result:
0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000110687417806 0.0346076270248
0.3_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000111161259827 0.0358869210331
0.321_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000134981044857 0.0457899948612
0.332_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 2.79712100925e-05 0.0049473335673
0.313_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 3.11625097814e-05 0.00588538959351
0.312_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 3.69066495111e-05 0.00819208397496
0.331_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 3.69774104969e-05 0.00824956236819
0.311_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 6.15395637079e-05 0.0173808578728
0.321_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.000138353320007 1.05986015585
0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.00017460061705 0.521775402243
0.311_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.000206502239096 0.149912367819
0.3_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.000237775594814 0.633350656766
0.332_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 3.1779126554e-05 0.0128586399133
0.313_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 4.33297503265e-05 0.0166438194725
0.312_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 7.21521358641e-05 0.0342760190842
0.331_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 7.52883193115e-05 0.0416052108611
...
0.3_rsrc:8_phi:0.5_abr:2_prof:plaw_diff:point.dat 0.000124446390455 0.00132402479772
0.3_rsrc:8_phi:0.5_abr:2_prof:unif_diff:lap2.dat 1.2638050496e-05 0.0289450596111
0.3_rsrc:8_phi:0.5_abr:2_prof:unif_diff:lap4.dat 0.000100909900236 0.170116521056
0.3_rsrc:8_phi:0.5_abr:2_prof:unif_diff:point.dat 0.000237686616486 0.00142895807647
The first key is read correctly (all abr:2s are at the end).
The second key is also read correctly (diff:lap2s are before diff:lap4s).
The last key, -k1,1g, is not read properly. According to another SO question, it should use only the first column (0.322, 0.3, etc.) with general numeric sort. That is not what happens (0.322 comes before 0.3 in the lap2 sector), and in the lap4 sector the ordering is completely different. The command
echo -e '0.3\n0.32\n0.28' | sort -g
gives the correct result.
Is it possible to change field separator -t for each sorting key -k?

-k2 uses all the characters from the beginning of the 2nd field to the end of the line, because you did not specify where the key ends. So the lines
0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000110687417806 0.0346076270248
0.3_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000111161259827 0.0358869210331
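are correctly sorted: in both lines the key begins with rsrc:15 and is identical up to the first numeric column, where 0.000110 sorts before 0.000111. Because these keys already differ, -k1,1g is never consulted. The key phrase in the manual page is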
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
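The effect of the missing stop position can be seen on a small made-up sample (the three lines below are hypothetical, not taken from stdev.txt). This is only a sketch, and it assumes a sort that supports -g, such as GNU sort:

```shell
# Hypothetical sample: "_" separates fields, as in the question.
data='0.322_a_x 1
0.3_a_x 2
0.31_a_x 3'

# -k2 has no stop position, so the key runs from field 2 to the end of the
# line ("a_x 1", "a_x 2", "a_x 3"). The keys all differ, so -k1,1g is never used.
unbounded=$(printf '%s\n' "$data" | LC_ALL=C sort -t _ -k2 -k1,1g)

# -k2,2 limits the key to field 2 ("a" on every line), so ties fall
# through to the general-numeric comparison of field 1.
bounded=$(printf '%s\n' "$data" | LC_ALL=C sort -t _ -k2,2 -k1,1g)

printf 'unbounded:\n%s\nbounded:\n%s\n' "$unbounded" "$bounded"
```

Note that in the question's data the last "_"-separated field still contains the two numeric columns after .dat, so a field-level stop position alone would not exclude them there.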

Related

Simple sorting in linux

I'm quite new to Linux, and I can't quite understand sorting. I need to sort a long file by column 4 and then column 5, ignoring the first line. The catch is that there are two separators, '.' and ',', and I don't know how to make the sort command handle both of them. I guess it has to be sorted by the column that has "3" in the first line, and then in the second sort by the column that has "5" in the second line. The second thing is that I don't know how to keep the first line intact. Note that I can't change all ',' into '.'; the file has to stay as it is. And I can't just remove the first line with tail or head; it has to stay.
This is the text:
d,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
6,5.4,3.9,1.7,0.4,Iris-setosa
7,4.6,3.4,1.4,0.3,Iris-setosa
8,5.0,3.4,1.5,0.2,Iris-setosa
9,4.4,2.9,1.4,0.2,Iris-setosa
10,4.9,3.1,1.5,0.1,Iris-setosa
11,5.4,3.7,1.5,0.2,Iris-setosa
12,4.8,3.4,1.6,0.2,Iris-setosa
13,4.8,3.0,1.4,0.1,Iris-setosa
14,4.3,3.0,1.1,0.1,Iris-setosa
15,5.8,4.0,1.2,0.2,Iris-setosa
16,5.7,4.4,1.5,0.4,Iris-setosa
17,5.4,3.9,1.3,0.4,Iris-setosa
18,5.1,3.5,1.4,0.3,Iris-setosa
19,5.7,3.8,1.7,0.3,Iris-setosa
20,5.1,3.8,1.5,0.3,Iris-setosa
21,5.4,3.4,1.7,0.2,Iris-setosa
22,5.1,3.7,1.5,0.4,Iris-setosa
23,4.6,3.6,1.0,0.2,Iris-setosa
24,5.1,3.3,1.7,0.5,Iris-setosa
25,4.8,3.4,1.9,0.2,Iris-setosa
26,5.0,3.0,1.6,0.2,Iris-setosa
27,5.0,3.4,1.6,0.4,Iris-setosa
28,5.2,3.5,1.5,0.2,Iris-setosa
29,5.2,3.4,1.4,0.2,Iris-setosa
30,4.7,3.2,1.6,0.2,Iris-setosa
31,4.8,3.1,1.6,0.2,Iris-setosa
32,5.4,3.4,1.5,0.4,Iris-setosa
33,5.2,4.1,1.5,0.1,Iris-setosa
34,5.5,4.2,1.4,0.2,Iris-setosa
35,4.9,3.1,1.5,0.1,Iris-setosa
36,5.0,3.2,1.2,0.2,Iris-setosa
37,5.5,3.5,1.3,0.2,Iris-setosa
38,4.9,3.1,1.5,0.1,Iris-setosa
39,4.4,3.0,1.3,0.2,Iris-setosa
40,5.1,3.4,1.5,0.2,Iris-setosa
41,5.0,3.5,1.3,0.3,Iris-setosa
42,4.5,2.3,1.3,0.3,Iris-setosa
43,4.4,3.2,1.3,0.2,Iris-setosa
44,5.0,3.5,1.6,0.6,Iris-setosa
45,5.1,3.8,1.9,0.4,Iris-setosa
46,4.8,3.0,1.4,0.3,Iris-setosa
47,5.1,3.8,1.6,0.2,Iris-setosa
48,4.6,3.2,1.4,0.2,Iris-setosa
49,5.3,3.7,1.5,0.2,Iris-setosa
50,5.0,3.3,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
54,5.5,2.3,4.0,1.3,Iris-versicolor
55,6.5,2.8,4.6,1.5,Iris-versicolor
56,5.7,2.8,4.5,1.3,Iris-versicolor
57,6.3,3.3,4.7,1.6,Iris-versicolor
58,4.9,2.4,3.3,1.0,Iris-versicolor
59,6.6,2.9,4.6,1.3,Iris-versicolor
60,5.2,2.7,3.9,1.4,Iris-versicolor
61,5.0,2.0,3.5,1.0,Iris-versicolor
62,5.9,3.0,4.2,1.5,Iris-versicolor
63,6.0,2.2,4.0,1.0,Iris-versicolor
64,6.1,2.9,4.7,1.4,Iris-versicolor
65,5.6,2.9,3.6,1.3,Iris-versicolor
66,6.7,3.1,4.4,1.4,Iris-versicolor
67,5.6,3.0,4.5,1.5,Iris-versicolor
68,5.8,2.7,4.1,1.0,Iris-versicolor
69,6.2,2.2,4.5,1.5,Iris-versicolor
70,5.6,2.5,3.9,1.1,Iris-versicolor
71,5.9,3.2,4.8,1.8,Iris-versicolor
72,6.1,2.8,4.0,1.3,Iris-versicolor
73,6.3,2.5,4.9,1.5,Iris-versicolor
74,6.1,2.8,4.7,1.2,Iris-versicolor
75,6.4,2.9,4.3,1.3,Iris-versicolor
76,6.6,3.0,4.4,1.4,Iris-versicolor
77,6.8,2.8,4.8,1.4,Iris-versicolor
78,6.7,3.0,5.0,1.7,Iris-versicolor
79,6.0,2.9,4.5,1.5,Iris-versicolor
80,5.7,2.6,3.5,1.0,Iris-versicolor
81,5.5,2.4,3.8,1.1,Iris-versicolor
82,5.5,2.4,3.7,1.0,Iris-versicolor
83,5.8,2.7,3.9,1.2,Iris-versicolor
84,6.0,2.7,5.1,1.6,Iris-versicolor
85,5.4,3.0,4.5,1.5,Iris-versicolor
86,6.0,3.4,4.5,1.6,Iris-versicolor
87,6.7,3.1,4.7,1.5,Iris-versicolor
88,6.3,2.3,4.4,1.3,Iris-versicolor
89,5.6,3.0,4.1,1.3,Iris-versicolor
90,5.5,2.5,4.0,1.3,Iris-versicolor
91,5.5,2.6,4.4,1.2,Iris-versicolor
92,6.1,3.0,4.6,1.4,Iris-versicolor
93,5.8,2.6,4.0,1.2,Iris-versicolor
94,5.0,2.3,3.3,1.0,Iris-versicolor
95,5.6,2.7,4.2,1.3,Iris-versicolor
96,5.7,3.0,4.2,1.2,Iris-versicolor
97,5.7,2.9,4.2,1.3,Iris-versicolor
98,6.2,2.9,4.3,1.3,Iris-versicolor
99,5.1,2.5,3.0,1.1,Iris-versicolor
100,5.7,2.8,4.1,1.3,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
103,7.1,3.0,5.9,2.1,Iris-virginica
104,6.3,2.9,5.6,1.8,Iris-virginica
105,6.5,3.0,5.8,2.2,Iris-virginica
106,7.6,3.0,6.6,2.1,Iris-virginica
107,4.9,2.5,4.5,1.7,Iris-virginica
108,7.3,2.9,6.3,1.8,Iris-virginica
109,6.7,2.5,5.8,1.8,Iris-virginica
110,7.2,3.6,6.1,2.5,Iris-virginica
111,6.5,3.2,5.1,2.0,Iris-virginica
112,6.4,2.7,5.3,1.9,Iris-virginica
113,6.8,3.0,5.5,2.1,Iris-virginica
114,5.7,2.5,5.0,2.0,Iris-virginica
115,5.8,2.8,5.1,2.4,Iris-virginica
116,6.4,3.2,5.3,2.3,Iris-virginica
117,6.5,3.0,5.5,1.8,Iris-virginica
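I'm not sure I understand the problem: to sort on column 4 and then on column 5 numerically, you can simply say:
sort -t, -k4,4n -k5,5n
-t, says to use a comma as the column separator
-k4,4n says to sort numerically on column 4 only
-k5,5n says to break ties numerically on column 5 only
(A single -k4,5 with -n would treat columns 4 and 5 as one key, and the numeric comparison reads only the leading number of that key, so column 5 would not be compared numerically.)
Remarks:
As far as the hyphen is concerned: as it is not part of the sorting columns, why bother about it?
As far as the column headers are concerned: with numeric sorting, a field that does not start with a number counts as 0, so the header line sorts ahead of the data as long as column 4 holds only positive values. So again, why bother about it?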
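If the header has to stay on the first line regardless of what the data looks like, one option is to read it off before sorting the rest. A minimal sketch, assuming a POSIX shell and numbers that use '.' as the decimal point (hence LC_ALL=C); the function name sort_keep_header is made up for illustration, and each key is restricted to a single column:

```shell
# Reads CSV on standard input, passes the first line through unchanged,
# and sorts the remaining lines numerically by column 4, then column 5.
sort_keep_header() {
    IFS= read -r header && printf '%s\n' "$header"
    LC_ALL=C sort -t, -k4,4n -k5,5n
}

# Hypothetical four-line input in the same layout as the question's file.
out=$(printf '%s\n' \
    'id,a,b,c,d,name' \
    '1,5.1,3.5,1.4,0.2,x' \
    '2,4.9,3.0,1.3,0.2,y' \
    '3,4.7,3.2,1.4,0.1,z' | sort_keep_header)
printf '%s\n' "$out"
```

With a real file this would be called as sort_keep_header < file.csv; the input file itself is left untouched.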

How to use grep to find a specific string of numbers and move that to a new test file

I am new to using linux and grep and I am looking for some direction in how to use grep. I am trying to get two specific numbers from a text file. I will need to do this for thousands of files so I believe using grep or some equivalent to be best for my mental health.
The text file I am working with looks as follows:
*Average spectrum energy: 0.00100 MeV
Average sampled energy : 0.00100 MeV [ -0.0000%]
K/phi = <E*mu_tr/rho> = 6.529719E+02 10^-12 Gy cm^2 [ 0.0008%]
Kcol/phi = <E*mu_tr/rho>*(1-<g>) = 6.529719E+02 10^-12 Gy cm^2 [ 0.0008%]
<g> = 1.0000E-15 [ 0.4264%]
1-<g> = 1.000000 [ 0.0000%]
<mu_tr/rho> = <E*mu_tr/rho>/Eave = 4.075530E+03 cm^2/g [ 0.0008%]
<mu_en/rho> = <E*mu_tr/rho>*(1-<g>)/Eave = 4.075530E+03 cm^2/g [ 0.0008%]
<E*mu_en/rho> = 4.075530E+00 MeV cm^2/g
The values I am looking to extract from this are "0.00100" and "4.075530E+00".
At the moment I am using grep -iE "Average spectrum energy|<E*mu_en/rho>" * which is allowing me to see the full lines, but I am not quite sure how to refine the search to only show me the numbers instead of just the whole line. Is this possible using grep?
As for moving the numbers into a new file, I believe the command is > newdata.txt. My question is when using this with grep can you change how it writes the data to the new text file? I am looking for the format of the numbers to be like this:
0.00100001 3.4877754595352117
0.00100367 3.4665273232204363
0.00100735 3.4453747056004884
0.00101104 3.4243696230289187
0.00101474 3.4035147003587718
Again, is that possible using grep > newdata.txt?
I really appreciate any help or direction people can give me. Thank you.
I'm not quite sure why it was giving the 4.075530E+03 value.
That's because * has the special meaning of a repetition of the previous item any number of times (including zero), so the pattern <E*mu_en/rho> does not match the text <E*mu_en/rho>, but rather < any number of E mu_en/rho>, i. e. especially <mu_en/rho>. To escape this special meaning and match a literal *, prepend a backslash, i. e. <E\*mu_en/rho>.
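The difference between the two patterns can be checked on two sample lines. This is a small sketch; the values after the "=" are placeholders, not data from the question:

```shell
# Two sample lines modelled on the file in the question.
lines='<mu_en/rho> = 1
<E*mu_en/rho> = 2'

# Unescaped: "E*" means "zero or more E", so only the line without "E*" matches.
unescaped=$(printf '%s\n' "$lines" | grep -E '<E*mu_en/rho>')

# Escaped: "\*" is a literal asterisk, so only the intended line matches.
escaped=$(printf '%s\n' "$lines" | grep -E '<E\*mu_en/rho>')

printf 'unescaped: %s\nescaped:   %s\n' "$unescaped" "$escaped"
```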
I am not quite sure how to refine the search to only show me the numbers instead of just the whole line. Is this possible using grep?
It is if PCRE (grep -P) is available in the system. To only (-o) show the numbers, we can use the feature of Resetting the match start with \K. Your modified grep command is then:
grep -hioP "(Average spectrum energy: *|<E\*mu_en/rho> *= )\K\S*" *
(option -h drops the file names, pattern item \S means not a white space).
when using this with grep can you change how it writes the data to the new text file?
grep by itself cannot change the format of numbers (except maybe cutting digits off). If you want this, we need another tool. Now, since we need another tool, I'd consider using a tool which is capable of doing the whole job, e. g. awk:
awk '
/Average spectrum energy/ { printf "%.8f ", $4 }
/<E\*mu_en\/rho>/ { printf "%.16f\n", $3 }
' * >newdata.txt

Sorting based on certain value in a string.

I have a file with contents like this :
666500872101_002.log
738500861101_003.log
738500861101_002.log
666500872101_001.log
741500881101_001.log
738500861101_001.log
741500881101_002.log
666500872101_003.log
741500881101_003.log
666500872101_004.log
I need to sort the rows first by the value in characters 5 to 8 (e.g. the 0088 in 741500881101_003.log) and then by the part number of the log (e.g. the 003 in 741500881101_003.log), to get something like this:
738500861101_001.log
738500861101_002.log
738500861101_003.log
666500872101_001.log
666500872101_002.log
666500872101_003.log
666500872101_004.log
741500881101_001.log
741500881101_002.log
741500881101_003.log
I can't get any good results using sort; please help.
You can use the sort command with the following options:
sort -n -k1.5,1.8 -n -k1.14,1.16 fileToSort.log
Options:
-n for numerical sorting
-k1.5,1.8 and -k1.14,1.16 to define your sorting keys
Example:
$ sort -n -k1.5,1.8 -n -k1.14,1.16 fileToSort
738500861101_001.log
738500861101_002.log
738500861101_003.log
666500872101_001.log
666500872101_002.log
666500872101_003.log
666500872101_004.log
741500881101_001.log
741500881101_002.log
741500881101_003.log
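To check that those character positions pick out the intended substrings, the same ranges can be extracted with cut. A quick sketch using one file name from the question:

```shell
name='738500861101_001.log'

# Characters 5 to 8: the value used as the first sort key.
group=$(printf '%s\n' "$name" | cut -c5-8)

# Characters 14 to 16: the part number used as the second sort key.
part=$(printf '%s\n' "$name" | cut -c14-16)

printf '%s %s\n' "$group" "$part"
```

If the file names ever change length, the positions in the -k options would have to change with them.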
I solved this problem as part of learning Spark. I am not a Unix shell programmer, so I thought of solving the problem using Spark:
val logList = Array("666500872101_002.log","738500861101_003.log","738500861101_002.log","666500872101_001.log","741500881101_001.log","738500861101_001.log","741500881101_002.log","666500872101_003.log","741500881101_003.log","666500872101_004.log")
val logListRDD = sc.parallelize(logList)
logListRDD.map(x=>((x.substring(4,8), x.slice(x.indexOfSlice("_") +1, x.indexOfSlice("."))),x)).sortByKey().values.collect.take(20)
Output:
Array[String] = Array(738500861101_001.log, 738500861101_002.log, 738500861101_003.log, 666500872101_001.log, 666500872101_002.log, 666500872101_003.log, 666500872101_004.log, 741500881101_001.log, 741500881101_002.log, 741500881101_003.log)
Explaining what I did
sc.parallelize(logList) - is the step to create an RDD which is the core component of spark.
map(x=>((x.substring(4,8), x.slice(x.indexOfSlice("_") +1, x.indexOfSlice("."))),x)) - This extracts the contents from the Array and generates a key-value pair. In our case, the value is the ***.log name and the key is a tuple of the two substrings we want to sort on (0086, 001). A key-value pair will look like ((0086,001),738500861101_001.log)
sortByKey() - Sorts the data based on the Key generated above
values - gets the value corresponding to the key
collect.take(20) - collects the results and shows up to 20 of them on screen

How to get ordered, defined or all columns except or after or before a given column

In BASH
I run the following one liner to get an individual column/field after splitting on a given character (one can use AWK as well if they want to split on more than one char i.e. on a word in any order, ok).
#This will give me first column i.e. 'lori' i.e. first column/field/value after splitting the line / string on a character '-' here
echo "lori-chuck-shenzi" | cut -d'-' -f1
# This will give me 'chuck'
echo "lori-chuck-shenzi" | cut -d'-' -f2
# This will give me 'shenzi'
echo "lori-chuck-shenzi" | cut -d'-' -f3
# This will give me 'chuck-shenzi' i.e. all columns after 2nd and onwards.
echo "lori-chuck-shenzi" | cut -d'-' -f2-
Notice the last command above. How can I do the same thing as that last cut command in Groovy?
For ex: if the contents are in a file and they look like:
1 - a
2 - b
3 - c
4 - d
5 - e
6 - lori-chuck shenzi
7 - columnValue1-columnValue2-columnValue3-ColumnValue4
I tried the following Groovy code, but it's not giving me lori-chuck shenzi. That is, after ignoring the 6th bullet and the first occurrence of the -, I want my output to be lori-chuck shenzi, but the following script returns just lori (which I know is the correct output for index [1], as used in the code below).
def file = "/path/to/my/file.txt"
File textfile= new File(file)
//now read each line from the file (using the file handle we created above)
textfile.eachLine { line ->
//list.add(line.split('-')[1])
println "Bullet entry full value is: " + line.split('-')[1]
}
// return list
Also, for the last line in the file above, is there an easy way in Groovy to change the order of the columns after they are split, i.e. reverse the order or slice them like we do in Python with [1:], [:1], [:-1] etc.?
I don't like this solution, but I did this to get it working. After getting the values at indexes [1..-1] (i.e. from the 1st index onwards, excluding the 0th index, which is the left-hand side of the first occurrence of the - character), I had to remove the [ and ] of the list using join(',') and then replace every , with a - to get the final result I was looking for.
list.add(line.split('-')[1..-1].join(',').replaceAll(',','-'))
I would still like to know what's a better solution and how can this work when we talk about cherry picking individual columns + in a given order (instead of me writing various Groovy statements to pick individual elements from the string/list per statement).
If I'm understanding your question correctly, what you want is:
line.split('-')[1..-1]
This will give you from position 1 to the last. You can do -2 (next to last) and so on, but just be aware that you can get an ArrayIndexOutOfBoundsException moving backwards too, if you go past the beginning of your array!
-- Original answer is above this line --
Adding to my answer, since comments don't allow code formatting. If all you want is to pick specific columns, and you want a string in the end, you could do something like:
def resultList = line.split('-')
def resultString = "${resultList[1]}-${resultList[2]} ${resultList[3]}"
and pick whatever columns you want that way. I thought you were looking for a more generic solution, but if not, specific columns are easy!
If you want the first value, a dash, then the rest joined by spaces, just use:
"${resultList[1]}-${resultList[2..-1].join(" ")}"
I don't know how to give you specific answers for every combination you might want, but basically once you have your values in a list, you can manipulate that however you want, and turn the results back into a string with GStrings or with .join(...).

how do I replace the string from the top of the file with the bottom string based on pattern

I have the following file:
petro,36262
artur,946034
alex,12345
alex,99999
artur,111111111
What I need is to replace earlier lines with later ones that share the same first column (used as the primary key), so in the end it would look like:
petro,36262
alex,99999
artur,111111111
The first column can contain a space or a number.
Thanks
Here is one method:
$ awk -F, '{a[$1]=$2} END{for (name in a)print name","a[name]}' file
alex,99999
artur,111111111
petro,36262
The lines come out in arbitrary order. If you have a preferred order, some additional code would be needed.
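If the order shown in the question is wanted (each key appearing at the position of its last occurrence), one possible sketch is to remember every line together with the last line number seen for its key, then print only those last occurrences in input order. The function name keep_last is made up for illustration, and the approach assumes the whole file fits in memory:

```shell
# Reads comma-separated lines on standard input and keeps, for each
# value of the first column, only the last line that carries it.
keep_last() {
    awk -F, '
        # Remember every line and the last line number seen for its key.
        { last[$1] = NR; line[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                split(line[i], f, ",")
                # Print a line only if it is the last one for its key.
                if (last[f[1]] == i) print line[i]
            }
        }'
}

out=$(printf '%s\n' 'petro,36262' 'artur,946034' 'alex,12345' 'alex,99999' 'artur,111111111' | keep_last)
printf '%s\n' "$out"
```

With a real file this would be called as keep_last < file.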
