Sorting based on certain value in a string. - linux

I have a file with contents like this :
666500872101_002.log
738500861101_003.log
738500861101_002.log
666500872101_001.log
741500881101_001.log
738500861101_001.log
741500881101_002.log
666500872101_003.log
741500881101_003.log
666500872101_004.log
I need to sort the rows first on the characters in positions 5 to 8 (for example, 0088 in 741500881101_003.log) and then on the part number of the log (for example, 003 in
741500881101_003.log), to get something like this:
738500861101_001.log
738500861101_002.log
738500861101_003.log
666500872101_001.log
666500872101_002.log
666500872101_003.log
666500872101_004.log
741500881101_001.log
741500881101_002.log
741500881101_003.log
I can't get good results using sort; please help.

You can use the sort command with the following options:
sort -n -k1.5,1.8 -n -k1.14,1.16 fileToSort.log
Options:
-n for numerical sorting
-k1.5,1.8 and -k1.14,1.16 to define your sorting keys
Example:
$ sort -n -k1.5,1.8 -n -k1.14,1.16 fileToSort
738500861101_001.log
738500861101_002.log
738500861101_003.log
666500872101_001.log
666500872101_002.log
666500872101_003.log
666500872101_004.log
741500881101_001.log
741500881101_002.log
741500881101_003.log
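For comparison, the same two-level ordering can be sketched in Python. This is only an illustration and not part of the sort answer: the helper name sort_key is made up here, and the sketch assumes every name has the shape <12 digits>_<part>.log.

```python
def sort_key(name):
    """Return (characters 5-8 as int, part number as int) for one file name."""
    stem = name.rsplit(".", 1)[0]          # drop the ".log" extension
    prefix, part = stem.split("_", 1)      # e.g. "738500861101", "003"
    return int(prefix[4:8]), int(part)     # 1-based characters 5..8 -> slice [4:8]


names = [
    "666500872101_002.log", "738500861101_003.log", "738500861101_002.log",
    "666500872101_001.log", "741500881101_001.log", "738500861101_001.log",
    "741500881101_002.log", "666500872101_003.log", "741500881101_003.log",
    "666500872101_004.log",
]

# Sort on the tuple: first by characters 5-8, then by part number.
for name in sorted(names, key=sort_key):
    print(name)
```

Because the key is a tuple, Python compares the first element and only falls back to the part number on ties, which mirrors giving sort two -k options.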

I solved this problem as part of learning Spark. I am not a UNIX shell programmer, so I thought of solving it with Spark instead:
val logList = Array("666500872101_002.log","738500861101_003.log","738500861101_002.log","666500872101_001.log","741500881101_001.log","738500861101_001.log","741500881101_002.log","666500872101_003.log","741500881101_003.log","666500872101_004.log")
val logListRDD = sc.parallelize(logList)
logListRDD.map(x=>((x.substring(4,8), x.slice(x.indexOfSlice("_") +1, x.indexOfSlice("."))),x)).sortByKey().values.collect.take(20)
Output:
Array[String] = Array(738500861101_001.log, 738500861101_002.log, 738500861101_003.log, 666500872101_001.log, 666500872101_002.log, 666500872101_003.log, 666500872101_004.log, 741500881101_001.log, 741500881101_002.log, 741500881101_003.log)
Explanation of each step:
sc.parallelize(logList) - creates an RDD, which is the core abstraction of Spark.
map(x=>((x.substring(4,8), x.slice(x.indexOfSlice("_") +1, x.indexOfSlice("."))),x)) - turns each element of the array into a key-value pair. The value is the original ***.log name and the key is a tuple of the two substrings we want to sort by, e.g. (0086, 001). A pair looks like ((0086, 001), 738500861101_001.log). Both parts of the key are strings, so they are compared lexicographically, which works here because they are zero-padded to a fixed width.
sortByKey() - sorts the pairs by the key generated above.
values - keeps only the value of each pair.
collect.take(20) - collects the sorted values to the driver and displays the first 20.

Related

Map repeated values in presto

I'm extracting data from JSON and mapping two arrays in Presto. It works fine when there are no repeated values in the array, but fails with the error "Duplicate map keys are not allowed" if any of the values are repeated. I need those values and cannot remove any of them from the array. Is there a workaround for this scenario?
Sample values:
array1 -- [Rewards,NEW,Rewards,NEW]
array2 -- [losg1,losg2,losg3,losg4]
Map key/value has to be generated like this [Rewards=>losg1,NEW=>losg2,Rewards=>losg3,NEW=>losg4]
Pairs of associations can be returned like this:
SELECT ARRAY[ROW('Rewards', 'losg1'), ROW('NEW', 'losg2'), ROW('Rewards', 'losg3')]
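The difference between a map and a list of pairs can be illustrated outside Presto. The Python sketch below is only an analogy, not Presto SQL, and the variable names are made up for the example. A Python dict, like a map, holds one value per key (it silently overwrites instead of raising an error), while a list of pairs keeps every association.

```python
array1 = ["Rewards", "NEW", "Rewards", "NEW"]
array2 = ["losg1", "losg2", "losg3", "losg4"]

# A list of (key, value) pairs keeps all four associations, duplicates included.
pairs = list(zip(array1, array2))
print(pairs)

# A dict keeps only one value per key, so the earlier values are overwritten.
as_map = dict(pairs)
print(as_map)
```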

Sorting multiple keys with Unix sort -- Bug?

I'm trying to sort my data by multiple keys with unix sort. I think that I get a wrong result. My command is
sort -t "_" -k4,4 -k2 -k1,1g < stdev.txt
And the result:
0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000110687417806 0.0346076270248
0.3_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000111161259827 0.0358869210331
0.321_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000134981044857 0.0457899948612
0.332_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 2.79712100925e-05 0.0049473335673
0.313_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 3.11625097814e-05 0.00588538959351
0.312_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 3.69066495111e-05 0.00819208397496
0.331_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 3.69774104969e-05 0.00824956236819
0.311_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 6.15395637079e-05 0.0173808578728
0.321_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.000138353320007 1.05986015585
0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.00017460061705 0.521775402243
0.311_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.000206502239096 0.149912367819
0.3_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 0.000237775594814 0.633350656766
0.332_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 3.1779126554e-05 0.0128586399133
0.313_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 4.33297503265e-05 0.0166438194725
0.312_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 7.21521358641e-05 0.0342760190842
0.331_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap4.dat 7.52883193115e-05 0.0416052108611
...
0.3_rsrc:8_phi:0.5_abr:2_prof:plaw_diff:point.dat 0.000124446390455 0.00132402479772
0.3_rsrc:8_phi:0.5_abr:2_prof:unif_diff:lap2.dat 1.2638050496e-05 0.0289450596111
0.3_rsrc:8_phi:0.5_abr:2_prof:unif_diff:lap4.dat 0.000100909900236 0.170116521056
0.3_rsrc:8_phi:0.5_abr:2_prof:unif_diff:point.dat 0.000237686616486 0.00142895807647
First key is read correctly (all abr:2s are at the end).
Second key is also read correctly (diff:lap2s are before diff:lap4s).
The last key, -k1,1g, is not applied as I expected. According to another SO question, it should use only the first column (0.322, 0.3, etc.) with a general numeric sort. That is not what happens (0.322 comes before 0.3 in the lap2 block), and in the lap4 block the ordering is different again. The command
echo -e '0.3\n0.32\n0.28' | sort -g
gives the correct result.
Is it possible to change field separator -t for each sorting key -k?
-k2 uses all the characters from the beginning of the 2nd field to the end of the line, because you did not specify where the key ends. So the lines
0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000110687417806 0.0346076270248
0.3_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000111161259827 0.0358869210331
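are correctly sorted: both keys begin with rsrc:15 (with -t, the separator itself is not part of the field) and are identical up to the second column, where 0.000110 sorts before 0.000111. Because the -k2 keys already differ, the later -k1,1g key is never consulted. The key phrase in the manual page is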
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
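The effect of a missing stop position can be sketched in Python. This is a simplified model, not how sort is implemented: the function extract_key is hypothetical, and it handles only whole field numbers with a separator, ignoring character offsets and per-key options.

```python
def extract_key(line, start, stop=None, sep="_"):
    """Mimic how a KEYDEF of F[,F] selects text when -t SEP is given.

    start and stop are 1-based field numbers; with no stop, the key runs
    to the end of the line.
    """
    fields = line.split(sep)
    end = len(fields) if stop is None else stop
    return sep.join(fields[start - 1:end])


line = "0.322_rsrc:15_phi:0.5_abr:1_prof:gauss_diff:lap2.dat 0.000110687417806 0.0346076270248"

print(extract_key(line, 2))      # like -k2: everything from field 2 onward
print(extract_key(line, 2, 2))   # like -k2,2: field 2 only
print(extract_key(line, 1, 1))   # like -k1,1: field 1 only
```

With the key bounded as -k2,2, the two lines quoted above would compare equal on that key, so sort would go on to compare them with -k1,1g.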

How to get ordered, defined or all columns except or after or before a given column

In BASH
I run the following one-liners to get an individual column/field after splitting on a given character (AWK can be used instead to split on more than one character, i.e. on a word).
#This will give me first column i.e. 'lori' i.e. first column/field/value after splitting the line / string on a character '-' here
echo "lori-chuck-shenzi" | cut -d'-' -f1
# This will give me 'chuck'
echo "lori-chuck-shenzi" | cut -d'-' -f2
# This will give me 'shenzi'
echo "lori-chuck-shenzi" | cut -d'-' -f3
# This will give me 'chuck-shenzi' i.e. all columns after 2nd and onwards.
echo "lori-chuck-shenzi" | cut -d'-' -f2-
Notice the last command above. How can I do the same thing as that last cut command in Groovy?
For ex: if the contents are in a file and they look like:
1 - a
2 - b
3 - c
4 - d
5 - e
6 - lori-chuck shenzi
7 - columnValue1-columnValue2-columnValue3-ColumnValue4
I tried the following Groovy code, but it does not give me lori-chuck shenzi for the 6th line. After ignoring the bullet number and the first occurrence of -, I want the output to be lori-chuck shenzi, but the script returns just lori. (I know that is the correct result for index [1] in the code below.)
def file = "/path/to/my/file.txt"
File textfile= new File(file)
//now read each line from the file (using the file handle we created above)
textfile.eachLine { line ->
//list.add(line.split('-')[1])
println "Bullet entry full value is: " + line.split('-')[1]
}
// return list
Also, for the last line in the file above, is there an easy way in Groovy to change the order of the columns after they are split, e.g. reverse them, the way slices such as [1:], [:1], [:-1] are used in Python?
I don't like this solution, but I did this to get it working. After taking the index range [1..-1] (i.e. from index 1 onward, excluding index 0, which is the text to the left of the first - character), I had to get rid of the [ and ] of the list by using join(',') and then replacing every , with a - to get the result I was looking for.
list.add(line.split('-')[1..-1].join(',').replaceAll(',','-'))
I would still like to know what's a better solution and how can this work when we talk about cherry picking individual columns + in a given order (instead of me writing various Groovy statements to pick individual elements from the string/list per statement).
If I'm understanding your question correctly, what you want is:
line.split('-')[1..-1]
This will give you from position 1 to the last. You can do -2 (next to last) and so on, but just be aware that you can get an ArrayIndexOutOfBoundsException moving backwards too, if you go past the beginning of your array!
-- Original answer is above this line --
Adding to my answer, since comments don't allow code formatting. If all you want is to pick specific columns, and you want a string in the end, you could do something like:
def resultList = line.split('-')
def resultString = "${resultList[1]}-${resultList[2]} ${resultList[3]}"
and pick whatever columns you want that way. I thought you were looking for a more generic solution, but if not, specific columns are easy!
If you want the first value, a dash, then the rest joined by spaces, just use:
"${resultList[1]}-${resultList[2..-1].join(" ")}"
I don't know how to give you specific answers for every combination you might want, but basically once you have your values in a list, you can manipulate that however you want, and turn the results back into a string with GStrings or with .join(...).
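Since the question compares this to Python slices, here is a minimal Python sketch of the same ideas: keep everything after the first separator, and pick or reorder columns. The sample line comes from the question; the variable names are illustrative only.

```python
line = "7 - columnValue1-columnValue2-columnValue3-ColumnValue4"

# Split only on the first '-' and keep the remainder intact (similar to cut -f2-).
rest = line.split("-", 1)[1].strip()
print(rest)

# Split on every '-' to pick or reorder individual columns.
columns = [part.strip() for part in line.split("-")]
print(columns[1:])                            # all columns after the first
print("-".join(columns[:0:-1]))               # columns after the first, reversed
print("-".join(columns[i] for i in (3, 1)))   # hand-picked columns, chosen order
```

Limiting the number of splits avoids the split-then-rejoin round trip. Java's String.split(regex, limit), which Groovy strings inherit, takes a similar limit argument, so line.split('-', 2)[1] may be worth trying in the Groovy script.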

How to sort data of a file in python(windows)

We are trying to implement a file-based student record program. We want to sort the file that contains the student details according to the roll number, which is at the first position of every line.
the file contains the following data:
1/rahul/cs
10/manish sharma/mba
5/jhon/ms
2/ram/bba
We want to sort the file's data according to the first field i.e. roll number.
Any help shall be great
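Use sorted. To sort whole lines as plain strings:
dataSorted = sorted(f.readlines())
That compares text, so 10 would come before 2. To sort numerically by the first field (the roll number), pass a key function:
dataSorted = sorted(f.readlines(), key=lambda line: int(line.split('/')[0]))
f is the file object.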
Further information: help(sorted)
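A fuller sketch of the read, sort, write-back cycle is shown below. It assumes every non-empty line starts with an integer roll number followed by /. The file name students.txt and the helper roll_number are made up for the example, and a temporary directory is used so the snippet runs on its own.

```python
import os
import tempfile


def roll_number(line):
    """Return the roll number (text before the first '/') as an integer."""
    return int(line.split("/", 1)[0])


# Write the sample records to a temporary file so the example is self-contained.
sample = "1/rahul/cs\n10/manish sharma/mba\n5/jhon/ms\n2/ram/bba\n"
path = os.path.join(tempfile.mkdtemp(), "students.txt")
with open(path, "w") as f:
    f.write(sample)

# Read the records, skipping blank lines, and sort numerically by roll number.
with open(path) as f:
    records = [line for line in f if line.strip()]
records.sort(key=roll_number)

# Write the sorted records back to the same file.
with open(path, "w") as f:
    f.writelines(records)

print("".join(records), end="")
```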

Split a string containing fixed length columns

I got data like this:
3LLO24MACT01 24MOB_6012010051700000020100510105010 123456
It contains different values for different columns when I import it.
Every column is fixed width:
Col#1 is the ID and just 1 long. Meaning it is "3" here.
Col#2 is 3 in length and here "LLO".
Col#3 is 9 in length and "24MACT01 " (notice that the missing ones gets filled up by blanks).
This goes on for 15 columns or so...
Is there a method to quickly cut it into different elements based on sequence length? I couldn't find any.
This can be done with RegEx matching, and creating an array of custom objects. Something like this:
$AllRecords = Get-Content C:\Path\To\File.txt | Where{$_ -match "^(.)(.{3})(.{9})"} | ForEach{
[PSCustomObject]@{
'Col1' = $Matches[1]
'Col2' = $Matches[2]
'Col3' = $Matches[3]
}
}
That will take each line, match by how many characters are specified, and then create an object based off those matches. It collects all objects in an array and could be exported to CSV or whatever. The 'Col1', 'Col2' etc are just generic column headers I suggested due to a lack of better information, and could be anything you wanted.
Edit: Thank you iCodez for showing me, perhaps inadvertently, that you can specify a language for your code samples!
[Regex]::Matches will do this rather easily. All you need to do is specify a Regex pattern that has . followed by the number of characters you want in curly braces. For example, to match a column of three characters, you would write .{3}. You then do this for all 15 columns.
To demonstrate, I will use a string that contains the first three columns of your example data (since I know their sizes):
PS > $data = '3LLO24MACT01 '
PS > $pattern = '(.{1})(.{3})(.{9})'
PS > ([Regex]::Matches($data, $pattern).Groups).Value
3LLO24MACT01
3
LLO
24MACT01
PS >
Note that the first value output will be the text matched by all of the capture groups together. If you do not need this, you can remove it with slicing:
$columns = ([Regex]::Matches($data, $pattern).Groups).Value
$columns = $columns[1..($columns.Length - 1)]
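The same fixed-width cut can also be done without a regex, by slicing at cumulative offsets. The Python sketch below is an illustration only, not PowerShell: the function split_fixed is hypothetical, and only the three column widths given in the question are used.

```python
from itertools import accumulate


def split_fixed(record, widths):
    """Cut record into consecutive pieces whose lengths are given by widths."""
    ends = list(accumulate(widths))          # e.g. [1, 4, 13]
    starts = [0] + ends[:-1]                 # e.g. [0, 1, 4]
    return [record[s:e] for s, e in zip(starts, ends)]


data = "3LLO24MACT01 "
print(split_fixed(data, [1, 3, 9]))
```

Extending the widths list to all 15 columns would cut the whole record in one call; padding blanks are kept, so they may need stripping afterwards.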
New-PSObjectFromMatches is a helper function for creating PS Objects from regex matches.
The -Debug option can help with the process of writing the regex.

Resources