I have this code, which compares two lists and finds the differences. It works fine for small lists. Now I'm testing with huge lists,
each containing more than 300000 maps, and it takes more than 5 hours to process them. Is that normal? How can I reduce the processing time?
def list1 = [
[cuInfo:"T12",service:"3",startDate:"14-01-16 13:22",appId:"G12355"],
[cuInfo:"T13",service:"3",startDate:"12-02-16 13:00",appId:"G12356"],
[cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12300"],
[cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
def list2 = [
[name:"testname1",cuInfo:"T12",service:"3",startDate:"14-02-16 10:00",appId:"G12351"],
[name:"testname1",cuInfo:"T13",service:"3",startDate:"14-01-16 13:00",appId:"G12352"],
[name:"testname1",cuInfo:"T16",service:"3",startDate:"14-01-16 13:00",appId:"G12353"],
[name:"testname2",cuInfo:"T14",service:"9",startDate:"10-01-16 11:20",appId:"G12301"],
[name:"testname3",cuInfo:"T15",service:"10",startDate:"26-02-16 10:20",appId:"G12999"],
[name:"testname3",cuInfo:"T18",service:"10",startDate:"26-02-16 10:20",appId:"G12999"]
]
def m1 = [:]
def m2 = [:]
def rows = list1.collect { me ->
[me, list2.find { it.cuInfo == me.cuInfo && it.service == me.service }]
}.findAll {
it[1]
}.findAll {
/*
* This is where the differences are identified.
* The 'name' attribute is excluded from the comparison,
* by including only the desired attributes.
*/
it[0] != it[1].subMap(['cuInfo', 'service', 'startDate', 'appId'])
}.collect {
/*
* At this point the list only contains the row pairs
* which are different. This step identifies which columns
* are different using asterisks.
*/
(m1, m2) = it
m1.keySet().each { key ->
if(m1[key] != m2[key]) {
m1[key] = "*${m1[key]}*"
m2[key] = "*${m2[key]}*"
}
}
[m1, m2]
}.collect {
[it[0].values(), it[1].values()].flatten() as String[]
}
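This might help a little. I have not had time to test it, but your code chains several collect and findAll calls, each of which builds a new intermediate list, and that can hurt performance on lists this large. The version below does the work in a single pass over list1: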
def results = []
list1.each{ lst1 ->
def list1WithDifferences = []
def list2WithDifferences = []
def add = false
def match = list2.find{ lst2 -> lst2.cuInfo == lst1.cuInfo && lst2.service == lst1.service }
match.each{k, v ->
if(k != 'name'){
if(v != lst1[k]){
add = true
list1WithDifferences << "*${lst1[k]}*"
list2WithDifferences << "*${v}*"
}else{
list1WithDifferences << v
list2WithDifferences << v
}
}else{
list2WithDifferences << v
}
}
if(add){
results << list1WithDifferences + list2WithDifferences
}
}
println(results)
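One likely reason for the long running time is that list2.find scans list2 from the start for every row of list1, which in the worst case is roughly 300000 x 300000 comparisons. Below is a minimal sketch of an alternative, assuming the (cuInfo, service) pair identifies a row and that only the first match in list2 matters (as with find): build a map index over list2 once, then look each row of list1 up in it. The names index, fields and mark are illustrative, the sample data is shortened, and the name column is left out of the output here.

```groovy
// Sample rows in the shape used in the question (shortened).
def list1 = [
    [cuInfo: 'T12', service: '3', startDate: '14-01-16 13:22', appId: 'G12355'],
    [cuInfo: 'T15', service: '10', startDate: '26-02-16 10:20', appId: 'G12999']
]
def list2 = [
    [name: 'testname1', cuInfo: 'T12', service: '3', startDate: '14-02-16 10:00', appId: 'G12351'],
    [name: 'testname3', cuInfo: 'T15', service: '10', startDate: '26-02-16 10:20', appId: 'G12999']
]

// Assumption: these are the only columns that take part in the comparison.
def fields = ['cuInfo', 'service', 'startDate', 'appId']

// Build the index once: [cuInfo, service] -> first row of list2 with that pair.
def index = [:]
list2.each { row -> index.putIfAbsent([row.cuInfo, row.service], row) }

def results = []
list1.each { row1 ->
    // Hash lookup instead of a linear scan of list2.
    def row2 = index.get([row1.cuInfo, row1.service])
    if (row2 != null && fields.any { row1[it] != row2[it] }) {
        // Wrap a value in asterisks when the two rows disagree on that column.
        def mark = { row, f -> row1[f] != row2[f] ? "*${row[f]}*".toString() : row[f] }
        results << (fields.collect { mark(row1, it) } + fields.collect { mark(row2, it) })
    }
}
println(results)
```

With this shape the cost grows with the sum of the two list sizes rather than their product, at the price of holding one extra map in memory.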
I am new to Groovy and am trying to find the indexes of all occurrences of a sublist in a list.
I tried to use something like Collections.indexOfSubList, as in Java, but it throws an exception saying that it applies to Lists and not ArrayLists.
So I am trying to define my own function. I find the indices of each element of the smaller list in the longer list and then subtract the two index arrays element by element. If a difference comes to 1, I consider that index the start of a sublist.
I know that my logic is a little twisted. Can somebody suggest a better and more efficient way of doing this?
Below is my code:
List list1 = [1,2,3,4,5,6,1,2,3]
List list2 = [1,2]
index1 = list1.findIndexValues {
it == list2[0];
}
index2 = list1.findIndexValues {
it == list2[1];
}
println index1
println index2
result = []
for (int i = 0; i < index1.size(); i++) {
result.add(index2[i]-index1[i]);
}
println result
Edit: no longer uses Collections due to new issue re: Elastic Search.
The following code walks along the source list, taking the sublist that starts at each index, and checks whether that sublist starts with the target list. See the asserts below (note that the indexes are 0-based):
def listStartsWithSubList = { source, target ->
def result = false
if (source.size() >= target.size()) {
result = true
target.eachWithIndex { item, index ->
result = result && (item == source[index])
}
}
result
}
def indexOfSubLists = { source, target ->
def results = []
source.eachWithIndex { item, index ->
def tmpList = source[index..source.size()-1]
if (listStartsWithSubList(tmpList, target)) {
results << index
}
}
results
}
assert [1] == indexOfSubLists([1,2,3], [2,3])
assert [2] == indexOfSubLists([1,2,3], [3])
assert [] == indexOfSubLists([1,2,3], [4])
assert [0,6] == indexOfSubLists([1,2,3,4,5,6,1,2,3], [1,2])
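The same window-by-window comparison can also be sketched more compactly with subList, which returns a view of the source list and so avoids copying the tail at every index. This is only a sketch: the name sublistIndexes is made up for illustration, and returning an empty list for an empty target is an assumption about the desired behaviour.

```groovy
// Hypothetical helper name, chosen so it does not clash with the closures above.
def sublistIndexes = { List source, List target ->
    int n = source.size()
    int m = target.size()
    // Assumption: an empty target, or one longer than the source, has no matches.
    if (m == 0 || m > n) return []
    // Compare each window of length m against the target.
    (0..(n - m)).findAll { i -> source.subList(i, i + m) == target }
}

assert sublistIndexes([1,2,3], [2,3]) == [1]
assert sublistIndexes([1,2,3,4,5,6,1,2,3], [1,2]) == [0, 6]
```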
I have been trying to implement a parallel merge sort in Scala, but even with 8 cores, using .sorted is still about twice as fast.
edit:
I rewrote most of the code to minimize object creation. Now it runs about as fast as .sorted.
Input file with 1.2M integers:
1.333580 seconds (my implementation)
1.439293 seconds (.sorted)
How should I parallelize this?
New implementation
import scala.reflect.ClassTag  // needed for the [A: ClassTag] context bound in mergesort
object Mergesort extends App
{
//=====================================================================================================================
// UTILITY
implicit object comp extends Ordering[Any] {
def compare(a: Any, b: Any) = {
(a, b) match {
case (a: Int, b: Int) => a compare b
case (a: String, b: String) => a compare b
case _ => 0
}
}
}
//=====================================================================================================================
// MERGESORT
val THRESHOLD = 30
def inssort[A](a: Array[A], left: Int, right: Int): Array[A] = {
for (i <- (left+1) until right) {
var j = i
val item = a(j)
while (j > left && comp.lt(item,a(j-1))) {
a(j) = a(j-1)
j -= 1
}
a(j) = item
}
a
}
def mergesort_merge[A](a: Array[A], temp: Array[A], left: Int, right: Int, mid: Int) : Array[A] = {
var i = left
var j = right
while (i < mid) { temp(i) = a(i); i+=1; }
while (j > mid) { temp(i) = a(j-1); i+=1; j-=1; }
i = left
j = right-1
var k = left
while (k < right) {
if (comp.lt(temp(i), temp(j))) { a(k) = temp(i); i+=1; k+=1; }
else { a(k) = temp(j); j-=1; k+=1; }
}
a
}
def mergesort_split[A](a: Array[A], temp: Array[A], left: Int, right: Int): Array[A] = {
if (right-left <= 1) a  // zero or one element: already sorted
else if ((right-left) > THRESHOLD) {
val mid = (left+right)/2
mergesort_split(a, temp, left, mid)
mergesort_split(a, temp, mid, right)
mergesort_merge(a, temp, left, right, mid)
}
else
inssort(a, left, right)
}
def mergesort[A: ClassTag](a: Array[A]): Array[A] = {
val temp = new Array[A](a.size)
mergesort_split(a, temp, 0, a.size)
}
}
Previous implementation
Input file with 1.2M integers:
4.269937 seconds (my implementation)
1.831767 seconds (.sorted)
What tricks are there to make it faster and cleaner?
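import java.io.PrintWriter
import scala.concurrent.{Await, ExecutionContext, Future}
// Assumption: the imports were omitted from the original post; the global execution
// context stands in for the commented-out thread pool further down.
import scala.concurrent.ExecutionContext.Implicits.global
import scala.io.Source
object Mergesort extends App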
{
//=====================================================================================================================
// UTILITY
val StartNano = System.nanoTime
def dbg(msg: String) = println("%05d DBG ".format(((System.nanoTime - StartNano)/1e6).toInt) + msg)
def time[T](work: =>T) = {
val start = System.nanoTime
val res = work
println("%f seconds".format((System.nanoTime - start)/1e9))
res
}
implicit object comp extends Ordering[Any] {
def compare(a: Any, b: Any) = {
(a, b) match {
case (a: Int, b: Int) => a compare b
case (a: String, b: String) => a compare b
case _ => 0
}
}
}
//=====================================================================================================================
// MERGESORT
def merge[A](left: List[A], right: List[A]): Stream[A] = (left, right) match {
case (x :: xs, y :: ys) if comp.lteq(x, y) => x #:: merge(xs, right)
case (x :: xs, y :: ys) => y #:: merge(left, ys)
case _ => if (left.isEmpty) right.toStream else left.toStream
}
def sort[A](input: List[A], length: Int): List[A] = {
if (length < 100) return input.sortWith(comp.lt)
input match {
case Nil | List(_) => input
case _ =>
val middle = length / 2
val (left, right) = input splitAt middle
merge(sort(left, middle), sort(right, middle + length%2)).toList
}
}
def msort[A](input: List[A]): List[A] = sort(input, input.length)
//=====================================================================================================================
// PARALLELIZATION
//val cores = Runtime.getRuntime.availableProcessors
//dbg("Detected %d cores.".format(cores))
//lazy implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(cores))
def futuremerge[A](fa: Future[List[A]], fb: Future[List[A]])(implicit order: Ordering[A], ec: ExecutionContext) =
{
for {
a <- fa
b <- fb
} yield merge(a, b).toList
}
def parallel_msort[A](input: List[A], length: Int)(implicit order: Ordering[A]): Future[List[A]] = {
val middle = length / 2
val (left, right) = input splitAt middle
if(length > 500) {
val fl = parallel_msort(left, middle)
val fr = parallel_msort(right, middle + length%2)
futuremerge(fl, fr)
}
else {
Future(msort(input))
}
}
//=====================================================================================================================
// MAIN
val results = time({
val src = Source.fromFile("in.txt").getLines
val header = src.next.split(" ").toVector
val lines = if (header(0) == "i") src.map(_.toInt).toList else src.toList
val f = parallel_msort(lines, lines.length)
Await.result(f, concurrent.duration.Duration.Inf)
})
println("Sorted as comparison...")
val sorted_src = Source.fromFile(input_folder+"in.txt").getLines
sorted_src.next
time(sorted_src.toList.sorted)
val writer = new PrintWriter("out.txt", "UTF-8")
try writer.print(results.mkString("\n"))
finally writer.close
}
My answer is probably going to be a bit long, but I hope it will be useful for both you and me.
So, the first question is: "How does Scala sort a List?" Let's have a look at the code from the Scala repo!
def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
val len = this.length
val b = newBuilder
if (len == 1) b ++= this
else if (len > 1) {
b.sizeHint(len)
val arr = new Array[AnyRef](len) // Previously used ArraySeq for more compact but slower code
var i = 0
for (x <- this) {
arr(i) = x.asInstanceOf[AnyRef]
i += 1
}
java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
i = 0
while (i < arr.length) {
b += arr(i).asInstanceOf[A]
i += 1
}
}
b.result()
}
So what is going on here? Long story short: the sorting is delegated to Java. Everything else is just size hinting and casting. This is the line that does the actual work:
java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
Let's go one level deeper into JDK sources:
public static <T> void sort(T[] a, Comparator<? super T> c) {
if (c == null) {
sort(a);
} else {
if (LegacyMergeSort.userRequested)
legacyMergeSort(a, c);
else
TimSort.sort(a, 0, a.length, c, null, 0, 0);
}
}
legacyMergeSort is nothing but a single-threaded implementation of the merge sort algorithm.
The next question is: "What is TimSort.sort and when do we use it?"
To the best of my knowledge, the default value of this property (LegacyMergeSort.userRequested, driven by the java.util.Arrays.useLegacyMergeSort system property) is false, which leads us to the TimSort.sort algorithm. A description can be found here. Why is it better? It needs fewer comparisons than a plain merge sort, according to the comments in the JDK sources.
Moreover, you should be aware that all of this is single-threaded, so there is no parallelization here.
Third question, "your code":
You create too many objects. When it comes to performance, mutation (sadly) is your friend.
"Premature optimization is the root of all evil" -- Donald Knuth. Before making any optimizations (like parallelism), implement a single-threaded version and compare the results.
Use something like JMH to measure the performance of your code.
You should probably not use the Stream class if you want the best performance, as it does additional caching.
I intentionally did not give you an answer like "a super-fast merge sort in Scala can be found here", but just some tips to apply to your code and coding practices.
I hope it helps.
NOTE: This can be done as a method call or an operator override pretty easily; I am looking for an intrinsic one-line solution that I don't have to carry around in a library.
When you combine (add) Maps, you get a result like this:
println([a:1,c:3] + [a:2])
// prints [a:2, c:3]
I seem to keep needing results more like:
[a:[1, 2], c:[3]]
In other words, something that combines all the values from identical keys in the Maps.
Is there an operator or simple function call that does this? Doing it myself always seems to break my stride a little. It seems like the * operator might do this nicely, but it doesn't.
Is there an easy way to do this?
Another alternative (adding it to the * operator on Maps)
def a = [ a:1, c:10 ]
def b = [ b:1, a:3 ]
Map.metaClass.multiply = { Map other ->
(delegate.keySet() + other.keySet()).inject( [:].withDefault { [] } ) { m, v ->
if (delegate[v] != null) { m[v] << delegate[v] }
if (other[v] != null) { m[v] << other[v] }
m
}
}
assert a * b == [a:[1, 3], c:[10], b:[1]]
Came up with this as well, but it's late and there are probably better, shorter ways
def a = [ a:1, c:10 ]
def b = [ b:1, a:3 ]
[a,b]*.collect {k,v -> [(k):v]}
.flatten()
.groupBy { it.keySet()[0]}
.inject([:].withDefault{[]}) {m,v->
m << [(v.key):v.value[v.key]]
}
Nothing came to my mind, so I start the bidding with this:
m1 = [a:1, c:666]; m2 = [a:2, b:42]
result = [:].withDefault{[]}
[m1,m2].each{ it.each{ result[it.key] << it.value } }
assert result == [a:[1,2], b:[42], c:[666]]
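Another way to sketch the same idea, without mutating a shared result map, is to flatten all entries into one list and group them by key. The closure name combine is made up for illustration and is not part of Groovy; it takes a list of maps so that more than two can be merged.

```groovy
// Hypothetical helper: merges any number of maps, collecting values per key.
def combine = { List<Map> maps ->
    // 1. collectMany: all entries of all maps, in order
    // 2. groupBy:     key -> list of entries that share that key
    // 3. collectEntries: key -> list of the values of those entries
    maps.collectMany { it.entrySet() }
        .groupBy { it.key }
        .collectEntries { k, entries -> [(k): entries*.value] }
}

assert combine([[a: 1, c: 3], [a: 2]]) == [a: [1, 2], c: [3]]
```

It is still a helper rather than a built-in operator, but the body is a single expression, so it can be pasted inline where needed.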
I have a collection
def list = [4,1,1,1,3,5,1,1]
and I need to remove numbers that are repeated three times in a row. As a result, I should get [4,3,5,1,1]. How can I do this in Groovy?
This can be done by copying the list while ensuring the two previous elements are not the same as the one to be copied. If they are, drop the two previous elements, otherwise copy as normal.
This can be implemented with inject like this:
def list = [4,1,1,1,3,5,1,1]
def result = list.drop(2).inject(list.take(2)) { result, element ->
def prefixSize = result.size() - 2
if ([element] * 2 == result.drop(prefixSize)) {
result.take(prefixSize)
} else {
result + element
}
}
assert result == [4,3,5,1,1]
You can count the unique values in each window of three consecutive elements and drop the window when that count is 1:
def list = [4,1,1,1,3,5,1,1]
assert removeTriplets(list) == [4,3,5,1,1]
def removeTriplets(list) {
def result = []
int index = 0
while (index < list.size()) {
// a window of three equal elements has exactly one unique value
if (index + 2 < list.size() && list[index..(index+2)].unique(false).size() == 1) {
index += 3 // skip the whole triplet
} else {
result << list[index]
index++
}
}
result
}
Another option is to use run-length encoding.
First, let's define a class that holds an object and the number of times it occurs in a row:
class RleUnit {
def object
int runLength
RleUnit( object ) {
this( object, 1 )
}
RleUnit( object, int runLength ) {
this.object = object
this.runLength = runLength
}
RleUnit inc() {
new RleUnit( object, runLength + 1 )
}
String toString() { "$object($runLength)" }
}
We can then define a method which will encode a List into a List of RleUnit objects:
List<RleUnit> rleEncode( List list ) {
list.inject( [] ) { r, v ->
if( r && r[ -1 ].object == v ) {
r.take( r.size() - 1 ) << r[ -1 ].inc()
}
else {
r << new RleUnit( v )
}
}
}
And a method that takes a List of RleUnit objects, and unpacks it back to the original list:
List rleDecode( List<RleUnit> rle ) {
rle.inject( [] ) { r, v ->
r.addAll( [ v.object ] * v.runLength )
r
}
}
We can then encode the original list:
def list = [ 4, 1, 1, 1, 3, 5, 1, 1 ]
rle = rleEncode( list )
And filter this RleUnit list with the Groovy findAll method:
// remove all elements with a runLength of 3
noThrees = rle.findAll { it.runLength != 3 }
unpackNoThrees = rleDecode( noThrees )
assert unpackNoThrees == [ 4, 3, 5, 1, 1 ]
// remove all elements with a runLength of less than 3
threeOrMore = rle.findAll { it.runLength >= 3 }
unpackThreeOrMore = rleDecode( threeOrMore )
assert unpackThreeOrMore == [ 1, 1, 1 ]
Let's say I have a string like this:
string = [+++[>>[--]]]abced
Now I want some way to return a list that contains [[--],[>>],[+++]], that is, the contents of the deepest [ nesting followed by the contents of each enclosing nesting. I came up with this solution:
def string = "[+++[>>[--]]]"
loop = []
temp = []
string.each {
bool = false
if(it == "["){
temp = []
bool = true
}
else if( it != "]")
temp << it
if(bool)
loop << temp
}
println loop.reverse()
But this also takes the abced text after the last ] and puts it into the result! What I want is only [[--],[>>],[+++]].
Is there a Groovy way of solving this?
You can use this, if you don't mind using recursion:
def sub(s , list){
if(!s.contains('[') || !s.contains(']'))
return list
def clipped = s.substring(s.lastIndexOf('[')+1, s.indexOf(']'))
list.add(clipped)
s = s - "[$clipped]"
sub(s , list)
}
Calling
sub('''[+++[>>[--]]]abced''' , [])
returns a list of all the substrings enclosed in brackets:
['--', '>>', '+++']
If your brackets are symmetrical, you could just introduce a counter variable that holds the depth of the bracket nesting. Only depth levels above 0 are allowed in the output:
def string = "[+++[>>[--]]]abc"
loop = []
temp = []
depth = 0;
string.each {
bool = false
if(it == "["){
temp = []
bool = true
depth++;
}
else if (it == "]"){
depth--;
}
else if (depth > 0){
temp << it
}
if(bool){
loop << temp
}
}
println loop.reverse()
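The depth counter can be taken one step further by keeping one buffer per open bracket on a stack, so that each bracket's contents are emitted when its closing ] is seen. The following is a minimal sketch under the same assumption of balanced brackets; the name bracketContents is made up for illustration, and it returns strings rather than lists of characters.

```groovy
// Hypothetical helper: contents of each bracket pair, innermost first.
def bracketContents = { String text ->
    def open = new ArrayDeque<StringBuilder>()   // one buffer per bracket that is still open
    def closed = []                              // finished contents, in closing order
    text.each { ch ->
        if (ch == '[') {
            open.push(new StringBuilder())
        } else if (ch == ']') {
            // Ignore a stray ']' that has no matching '['.
            if (!open.isEmpty()) closed << open.pop().toString()
        } else if (!open.isEmpty()) {
            // Characters outside any bracket (such as the trailing "abced") are skipped.
            open.peek().append(ch)
        }
    }
    closed
}

assert bracketContents('[+++[>>[--]]]abced') == ['--', '>>', '+++']
```

Because the innermost bracket closes first, the result already comes out deepest-first and no reverse() is needed.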
class Main {
private static final def pattern = ~/([^\[]*)\[(.+?)\][^\]]*/
static void main(String[] args) {
def string = "[+++[>>[--]]]abced"
def result = match(string)
println result
}
static def match(String val) {
def matcher = pattern.matcher(val);
if (matcher.matches()) {
return matcher.group(1) ? match(matcher.group(2)) + matcher.group(1) : match(matcher.group(2))
}
[val]
}
}
Output:
[--, >>, +++]
The capturing of the first group in the regex pattern could probably be improved. Right now the first group matches any run of characters other than [, so if there is nothing in front of the first [, the first group contains an empty string.