Collecting a GPars loop to a Map - groovy

I need to iterate over a List, run a time-expensive operation for every item, and then collect the results into a Map, something like this:
List<String> strings = ['foo', 'bar', 'baz']
Map<String, Object> result = strings.collectEntries { key ->
    [key, expensiveOperation(key)]
}
So that my result is then something like
[foo: <an object>, bar: <another object>, baz: <another object>]
Since the operations I need to run are pretty long and don't depend on each other, I wanted to investigate using GPars to run the loop in parallel.
However, GPars has a collectParallel method that loops through a collection in parallel and collects the results into a List, but no collectEntriesParallel that collects into a Map: what's the correct way to do this with GPars?

There is no collectEntriesParallel because, as Tim mentioned in the comments, it would have to produce the same result as:
collectParallel {}.collectEntries {}
It's hard to reduce a list of values into a map (or any other mutable container) deterministically other than by collecting the results to a list in parallel and then, at the end, turning them into map entries sequentially. Consider the following sequential example:
import groovyx.gpars.GParsPool

static def expensiveOperation(String key) {
    Thread.sleep(1000)
    return key.reverse()
}

List<String> strings = ['foo', 'bar', 'baz']

GParsPool.withPool {
    def result = strings.inject([:]) { seed, key ->
        println "[${Thread.currentThread().name}] (${System.currentTimeMillis()}) seed = ${seed}, key = ${key}"
        seed + [(key): expensiveOperation(key.toString())]
    }
    println result
}
In this example we use Collection.inject(initialValue, closure), which is the equivalent of the good old "fold left" operation: it starts with the initial value [:], iterates over all the items, and adds each one to the map as a key/value pair. Sequential execution takes approximately 3 seconds in this case (each expensiveOperation() sleeps for 1 second).
Console output:
[main] (1519925046610) seed = [:], key = foo
[main] (1519925047773) seed = [foo:oof], key = bar
[main] (1519925048774) seed = [foo:oof, bar:rab], key = baz
[foo:oof, bar:rab, baz:zab]
And this is basically what collectEntries() does: it is a kind of reduction where the initial value is an empty map.
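To make that concrete, here is a minimal sequential sketch (no GPars involved) showing that collectEntries behaves like a fold that starts from an empty map:
def keys = ['foo', 'bar', 'baz']

// Both produce [foo:oof, bar:rab, baz:zab]
def viaCollectEntries = keys.collectEntries { [(it): it.reverse()] }
def viaInject = keys.inject([:]) { map, key -> map + [(key): key.reverse()] }

assert viaCollectEntries == viaInject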
Now let's see what happens if we try to parallelize it: instead of inject we will use the injectParallel method:
GParsPool.withPool {
    def result = strings.injectParallel([:]) { seed, key ->
        println "[${Thread.currentThread().name}] (${System.currentTimeMillis()}) seed = ${seed}, key = ${key}"
        seed + [(key): expensiveOperation(key.toString())]
    }
    println result
}
Let's see what the result is:
[ForkJoinPool-1-worker-1] (1519925323803) seed = foo, key = bar
[ForkJoinPool-1-worker-2] (1519925323811) seed = baz, key = [:]
[ForkJoinPool-1-worker-1] (1519925324822) seed = foo[bar:rab], key = baz[[:]:]:[]
foo[bar:rab][baz[[:]:]:[]:][:]:]:[[zab]
As you can see, the parallel version of inject does not care about order (which is expected): for example, the first thread received foo as the seed variable and bar as the key. This is what can happen when reduction into a map (or any mutable object) is performed in parallel and without a specific order.
Solution
There are two ways to parallelize the process:
1. collectParallel + collectEntries combination
As Tim Yates mentioned in the comment, you can run the expensive operations in parallel and collect the results into a map sequentially at the end:
import groovyx.gpars.GParsPool

static def expensiveOperation(String key) {
    Thread.sleep(1000)
    return key.reverse()
}

List<String> strings = ['foo', 'bar', 'baz']

GParsPool.withPool {
    def result = strings.collectParallel { [it, expensiveOperation(it)] }.collectEntries { [(it[0]): it[1]] }
    println result
}
This example executes in approximately 1 second and produces the following output:
[foo:oof, bar:rab, baz:zab]
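If you need to cap how many of the expensive operations run at once, withPool also accepts an explicit pool size (a sketch reusing the strings and expensiveOperation from the example above; by default the pool size is based on the number of available processors):
GParsPool.withPool(3) {
    // at most 3 expensiveOperation() calls run concurrently
    def result = strings.collectParallel { [it, expensiveOperation(it)] }
                        .collectEntries { [(it[0]): it[1]] }
    println result
}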
2. Java's parallel stream
Alternatively, you can use a Java parallel stream with the Collectors.toMap() collector:
import java.util.function.Function
import java.util.stream.Collectors

static def expensiveOperation(String key) {
    Thread.sleep(1000)
    return key.reverse()
}

List<String> strings = ['foo', 'bar', 'baz']

def result = strings.parallelStream()
        .collect(Collectors.toMap(Function.identity(), { str -> expensiveOperation(str) }))

println result
This example also executes in approximately 1 second and produces output like this:
[bar:rab, foo:oof, baz:zab]
Hope it helps.

Related

Implicit class holding mutable variable in multithreaded environment

I need to implement a parallel method, which takes two computation blocks, a and b, and starts each of them in a new thread. The method must return a tuple with the result values of both computations. It should have the following signature:
def parallel[A, B](a: => A, b: => B): (A, B)
I managed to solve the exercise using a straightforward Java-like approach. Then I decided to come up with a solution using an implicit class. Here it is:
object ParallelApp extends App {

  implicit class ParallelOps[A](a: => A) {
    var result: A = _

    def spawn(): Unit = {
      val thread = new Thread {
        override def run(): Unit = {
          result = a
        }
      }
      thread.start()
      thread.join()
    }
  }

  def parallel[A, B](a: => A, b: => B): (A, B) = {
    a.spawn()
    b.spawn()
    (a.result, b.result)
  }

  println(parallel(1 + 2, "a" + "b"))
}
For some unknown reason, I get the output (null,null). Could you please point out where the problem is?
Spoiler alert: It's not complicated. It's funny, like a magic trick (if you consider reading the Java Memory Model documentation "funny", that is). If you haven't figured it out yet, I would highly recommend trying to figure it out yourself; otherwise it won't be funny. Someone should make a "division-by-zero proves 2 = 4"-style riddle out of it.
Consider the following shorter example:
implicit class Foo[A](a: A) {
  var result: String = "not initialized"
  def computeResult(): Unit = result = "Yay, result!"
}

val a = "a string"
a.computeResult()
println(a.result)
When run, it prints
not initialized
despite the fact that we invoked computeResult() and set result to "Yay, result!". The problem is that the two invocations a.computeResult() and a.result belong to two completely independent instances of Foo. The implicit conversion is performed twice, and the second implicitly created object doesn't know anything about the changes in the first implicitly created object. It has nothing to do with threads or JMM at all.
By the way: your code is not parallel. Calling join right after calling start doesn't buy you anything; your main thread will simply go idle and wait until the other thread finishes. At no point will there be two threads doing any useful work concurrently.
EDIT: Fixed a bug pointed out by Andrey Tyukin
One way to solve your problem is to use Scala Futures (the official documentation, the tutorial, and the useful Klang blog are all worth reading).
You'll typically need some combination of these imports:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.util.{Failure, Success}
import scala.concurrent.duration._
an asynchronous example:
def parallelAsync[A, B](a: => A, b: => B): Future[(A, B)] = {
  // As per Andrey Tyukin's comments, the line below runs the two futures
  // sequentially and we do not get any benefit from it. I will leave it here
  // so others will not fall into my trap.
  // for { i <- Future(a); j <- Future(b) } yield (i, j)
  Future(a) zip Future(b)
}
parallelAsync(1 + 2, "a" + "b").onComplete {
  case Success(x) => println(x)
  case Failure(e) => e.printStackTrace()
}
If you must block until both are complete, you can use this:
def parallelSync[A, B](a: => A, b: => B): (A, B) = {
  // see comment above
  // val f = for { i <- Future(a); j <- Future(b) } yield (i, j)
  val tuple = Future(a) zip Future(b)
  Await.result(tuple, 5.seconds)
}

println(parallelSync(3 + 4, "c" + "d"))
When running these little examples, don't forget to sleep a little bit at the end so the program won't end before the results come back:
Thread.sleep(3000)

Eager interpolation with just a closure behaves like a lazy one?

As part of learning Groovy, I'm trying to explore all intricate possibilities provided by string interpolation.
One of my little experiments gave results that don't make sense to me, and now I'm wondering whether I've completely misunderstood the basic concepts of lazy and eager interpolation in Groovy.
Here's the code I ran:
def myVar1 = 3
// An eager interpolation containing just a closure.
def myStr = "${{->myVar1}}"
print ("Just after the creation of myStr\n")
print (myStr as String)
myVar1 += 1 // Bump up myVar1.
print ("\nJust after incrementing myVar1\n")
print (myStr as String)
Here's the output I got:
Just after the creation of myStr
3
Just after incrementing myVar1
4
Clearly, the closure has been invoked a second time. And the only way the closure could have been re-executed is by the containing interpolation getting re-evaluated. But then, the containing interpolation is, by itself, not a closure, though it contains a closure. So then, why is it getting re-evaluated?
This is how the GString.toString() method is implemented. If you take a look at the source code of the GString class, you will find something like this:
public String toString() {
    StringWriter buffer = new StringWriter();
    try {
        writeTo(buffer);
    }
    catch (IOException e) {
        throw new StringWriterIOException(e);
    }
    return buffer.toString();
}

public Writer writeTo(Writer out) throws IOException {
    String[] s = getStrings();
    int numberOfValues = values.length;
    for (int i = 0, size = s.length; i < size; i++) {
        out.write(s[i]);
        if (i < numberOfValues) {
            final Object value = values[i];

            if (value instanceof Closure) {
                final Closure c = (Closure) value;

                if (c.getMaximumNumberOfParameters() == 0) {
                    InvokerHelper.write(out, c.call());
                } else if (c.getMaximumNumberOfParameters() == 1) {
                    c.call(out);
                } else {
                    throw new GroovyRuntimeException("Trying to evaluate a GString containing a Closure taking "
                            + c.getMaximumNumberOfParameters() + " parameters");
                }
            } else {
                InvokerHelper.write(out, value);
            }
        }
    }
    return out;
}
Notice that the writeTo method examines the values passed for interpolation and, in the case of a closure, invokes it. This is how GString handles lazy evaluation of interpolated values.
Now let's take a look at a few examples. Let's assume we want to print a GString and interpolate a value returned by some method call. This method will also print something to the console, so we can see if the method call was triggered eagerly or lazily.
Ex.1: Eager evaluation
class GStringLazyEvaluation {
    static void main(String[] args) {
        def var = 1
        def str = "${loadValue(var++)}"
        println "Starting the loop..."
        5.times {
            println str
        }
        println "Loop ended..."
    }

    static Integer loadValue(int val) {
        println "This method returns value $val"
        return val
    }
}
The output:
This method returns value 1
Starting the loop...
1
1
1
1
1
Loop ended...
The default eager behavior: the method loadValue() was invoked before we printed str to the console.
Ex.2: Lazy evaluation
class GStringLazyEvaluation {
    static void main(String[] args) {
        def var = 1
        def str = "${ -> loadValue(var++)}"
        println "Starting the loop..."
        5.times {
            println str
        }
        println "Loop ended..."
    }

    static Integer loadValue(int val) {
        println "This method returns value $val"
        return val
    }
}
The output:
Starting the loop...
This method returns value 1
1
This method returns value 2
2
This method returns value 3
3
This method returns value 4
4
This method returns value 5
5
Loop ended...
In the second example, we take advantage of lazy evaluation. We define str with a closure that invokes the loadValue() method, and this invocation happens when we explicitly print str to the console (to be more specific, when the GString.toString() method gets executed).
Ex.3: Lazy evaluation and closure memoization
class GStringLazyEvaluation {
    static void main(String[] args) {
        def var = 1
        def closure = { -> loadValue(var++) }
        def str = "${closure.memoize()}"
        println "Starting the loop..."
        5.times {
            println str
        }
        println "Loop ended..."
    }

    static Integer loadValue(int val) {
        println "This method returns value $val"
        return val
    }
}
The output:
Starting the loop...
This method returns value 1
1
1
1
1
1
Loop ended...
And here is the example you are most probably looking for. In this example, we still take advantage of lazy evaluation thanks to the closure parameter; however, in this case, we use the closure's memoization feature. Evaluation of the string is postponed until the first GString.toString() invocation, and the closure's result gets memoized, so the next time it is called it returns the cached result instead of re-evaluating the closure.
What is the difference between ${{->myVar1}} and ${->myVar1}?
As mentioned earlier, the GString.toString() method uses GString.writeTo(out), which checks whether a given placeholder stores a closure for lazy evaluation. Every GString instance stores its placeholder values in the GString.values array, which gets initialized when the GString is created. Let's consider the following example:
def str = "${myVar1} ... ${-> myVar1} ... ${{-> myVar1}}"
Now let's follow the initialization of the GString.values array:
${myVar1} --> evaluates the `myVar1` expression and copies its return value to the values array
${-> myVar1} --> sees that this is a closure expression, so it copies the closure to the values array
${{-> myVar1}} --> evaluates `{-> myVar1}`, which in this case is a closure definition expression, and copies its return value (a closure) to the values array
As you can see, in the 1st and 3rd examples it does exactly the same thing: it evaluates the expression and stores the result in the GString.values array of type Object[]. And here is the crucial part: an expression like {->something} is not a closure invocation expression. The expression that evaluates the closure is
{->myVar1}()
or
{->myVar1}.call()
It can be illustrated with the following example:
def str = "${println 'B'; 2 * 4} ${{ -> println 'C'; 2 * 5}} ${{ -> println 'A'; 2 * 6}.call()}"
println str
Values initialization is as follows:
${println 'B'; 2 * 4} ---> evaluates the expression, which prints 'B' and returns 8; this value is stored in the values array.
${{ -> println 'C'; 2 * 5}} ---> evaluates the expression, which is nothing more than the creation of a closure. This closure is stored in the values array.
${{ -> println 'A'; 2 * 6}.call()} ---> evaluates the expression, which creates a closure and then calls it explicitly. It prints 'A' and returns 12, which is stored in the values array at the last index.
That is why, after the GString object is initialized, we end up with a values array like:
[8, script$_main_closure1, 12]
Now, the creation of this GString caused a side effect: the following characters were printed to the console:
B
A
This is because evaluating the 1st and 3rd values invoked the println method.
Now, when we finally call println str, which invokes the GString.toString() method, all values get processed. When the interpolation process starts, it does the following:
values[0] --> 8 --> writes "8"
values[1] --> script$_main_closure1 --> invokes script$_main_closure1.call() --> prints 'C' --> returns 10 --> writes "10"
values[2] --> 12 --> writes "12"
That is why the final console output looks like this:
B
A
C
8 10 12
This is why, in practice, expressions like ${->myVar1} and ${{->myVar1}} behave in a similar way. In the first case, GString initialization does not evaluate the closure expression and puts the closure directly into the values array; in the second case, the placeholder gets evaluated, and the expression it evaluates creates and returns a closure, which then gets stored in the values array.
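Putting the two placeholders side by side makes this visible (a small sketch that assumes Groovy 2.x, since Groovy 3.x rejects the nested form, as described in the note below):
def myVar1 = 3
def lazy1 = "${-> myVar1}"      // the closure itself is stored in values[]
def lazy2 = "${{-> myVar1}}"    // the expression evaluates to a closure, which is stored

myVar1 = 42

// Both re-evaluate their stored closure on toString()
assert (lazy1 as String) == '42'
assert (lazy2 as String) == '42'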
Note on Groovy 3.x
If you try to execute the expression ${{->myVar1}} in Groovy 3.x you will end up with the following compiler error:
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.groovy.parser.antlr4.AstBuilder.lambda$visitGstring$28(AstBuilder.java:3579)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.apache.groovy.parser.antlr4.AstBuilder.visitGstring(AstBuilder.java:3591)
at org.apache.groovy.parser.antlr4.AstBuilder.visitGstring(AstBuilder.java:356)
at org.apache.groovy.parser.antlr4.GroovyParser$GstringContext.accept(GroovyParser.java:4182)
at groovyjarjarantlr4.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:20)
at org.apache.groovy.parser.antlr4.AstBuilder.visit(AstBuilder.java:4287)
.....
at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:565)
at org.codehaus.groovy.tools.FileSystemCompiler.compile(FileSystemCompiler.java:72)
at org.codehaus.groovy.tools.FileSystemCompiler.doCompilation(FileSystemCompiler.java:240)
at org.codehaus.groovy.tools.FileSystemCompiler.commandLineCompile(FileSystemCompiler.java:163)
at org.codehaus.groovy.tools.FileSystemCompiler.commandLineCompileWithErrorHandling(FileSystemCompiler.java:203)
at org.codehaus.groovy.tools.FileSystemCompiler.main(FileSystemCompiler.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:114)
at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:136)
1 error

Building a frequency count map in Groovy

I have an array and I want to build a map out of it recording the frequency of elements in the array. So for the example below, the map will look like [15:2, 16:1]. How do I do this in Groovy?
static void doSomething()
{
    def a = [15,16,15]
    def map = []
    a.each{
        k,v->
        if(map.contains(it))
            map.putAt k, v++
        else
            map.putAt k, 1;
    }
    println map
}
In Groovy 1.8 or higher,
assert [15, 16, 15].countBy { it } == [15: 2, 16: 1]
You could modify your code to be the following:
void doSomething() {
    def a = [15,16,15]
    def map = [:] //1
    a.each { //2
        if(map.containsKey(it)) map[it] = map[it] + 1 //3
        else map[it] = 1
    }
    println map
}
This fixes a few things:
map needs to be initialized with a colon between the braces, as noted by Bill James in the comments.
you can't use the 2-parameter version of each on an ArrayList
a postfix increment won't result in the incremented value being saved. Also, an explicit putAt call works, but it exists to provide the overloaded [key] = val syntax, which is more expressive.
All that said, I'm assuming this is a coding exercise to learn Groovy. doelleri's answer is more succinct and uses the tools provided, so in a real-world situation, I'd go with that.
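If you want a middle ground between the manual loop and countBy, a map with a default value removes the containsKey branch entirely (a sketch; Map.withDefault has been available since around Groovy 1.7):
def a = [15, 16, 15]
def freq = [:].withDefault { 0 }   // missing keys read as 0
a.each { freq[it]++ }

println freq   // [15:2, 16:1]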

Performance difference in toString.map and toString.toArray.map

While coding Euler problems, I ran across something I find bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String])
{
  def toDigit(num : Int) = num.toString.map(_ - 48)             //2137 ms
  def toDigitFast(num : Int) = num.toString.toArray.map(_ - 48) //592 ms

  val startTime = System.currentTimeMillis;
  (1 to 1200000).map(toDigit)
  println(System.currentTimeMillis - startTime)
}
Shouldn't the map method on String fall back to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes a stack overflow in the non-array case.)
Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this)
  for (x <- this) b += f(x)
  b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }

def append(x: Any): StringBuilder = {
  underlying append String.valueOf(x)
  this
}
The String.valueOf call for Any uses Java's Object.toString on the Char instances, which possibly get boxed first. These extra operations might be the cause of the speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess, though; one would have to measure.
Edit
After revisiting this, the general point still stands, but I referred to the wrong implicits, since the toDigit methods return an Int sequence (or similar), not a translated string as I had misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly puts the two methods on par:
object FastStringToArrayBuild {
  def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
    private def newBuilder = scala.collection.mutable.ArrayBuilder.make()
    def apply(from: String) = newBuilder
    def apply() = newBuilder
  }
}
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
  if (num==0) {
    val ans = new Array[Byte](math.max(1,iter))
    var i = 0
    var ac = accum
    while (i < ans.length) {
      ans(ans.length-i-1) = (ac & 0xF).toByte
      ac >>= 4
      i += 1
    }
    ans
  }
  else {
    val next = num/10
    toDigitReallyFast(next, (accum << 4) | (num-10*next), iter+1)
  }
}
which on my machine is about 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results into an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
  if (num==0) accum | (iter.toLong << 48)
  else {
    val next = num/10
    toDigitExtremelyFast(next, accum | ((num-10*next).toLong<<(4*iter)), iter+1)
  }
}

// loop, instead of 1 to N map, for the 1.2k number case
{
  var i = 10000
  val a = new Array[Long](1201)
  while (i<=11200) {
    a(i-10000) = toDigitExtremelyFast(i)
    i += 1
  }
  a
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.

Remove key/value from map while iterating

I'm creating a map like this:
def myMap = [:]
The map is basically an object for a key and an int for a value. When I iterate over the map, I decrement the value, and if it's 0, I remove it. I already tried myMap.remove(), but I get a ConcurrentModificationException, which is fair enough. So I moved on to using it.remove(), which is giving me weird results.
Basically, my code is this:
myMap.each {
    it.value--
    if( it.value <= 0 )
        it.remove()
}
Simple enough. My problem is that if I print myMap.size() before and after the remove, they're the same. If I call myMap.containsKey( key ), it gives me true; the key is still in there.
But, if I print out the map like this:
myMap.each { System.out.println( "$it.key: $it.value" ); }
I get nothing, and calling myMap.keySet() and myMap.values() return empty.
Anyone know what's going on?
This should be a bit more efficient than Tim's answer (because you only need to iterate over the map once). Unfortunately, it is also pretty verbose:
def map = [2:1, 3:4]
def iterator = map.entrySet().iterator()
while (iterator.hasNext()) {
    if (iterator.next().value - 1 <= 0) {
        iterator.remove()
    }
}
// test that it worked
assert map == [3:4]
Can you do something like this:
myMap = myMap.each { it.value-- }.findAll { it.value > 0 }
That will subtract one from every value, then return you a new map of only those entries where the value is greater than zero.
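As a quick sanity check of how that composes (a small sketch, not part of the original answer):
def myMap = [a: 1, b: 3]
myMap = myMap.each { it.value-- }.findAll { it.value > 0 }

assert myMap == [b: 2]   // 'a' dropped to 0 and was filtered out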
You shouldn't call the remove method on a Map.Entry; it is supposed to be a private method used internally by the Map (see line 325 of the Java 7 implementation), so calling it yourself gets the enclosing Map into all sorts of bother (it doesn't know it is losing entries, which is why size() and containsKey() still report the old state even though iteration comes back empty).
Groovy lets you call private methods, so you can do this sort of trickery behind the back of the Java classes.
Edit -- Iterator method
Another way would be:
myMap.iterator().with { iterator ->
    iterator.each { entry ->
        entry.value--
        if( entry.value <= 0 ) iterator.remove()
    }
}
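On Java 8+, the explicit iterator loop from the first answer can also be written with entrySet().removeIf (a sketch; like that answer, it only removes entries and does not decrement the values it keeps):
def map = [2: 1, 3: 4]

// The entry set is a live view, so removing from it removes from the map
map.entrySet().removeIf { it.value - 1 <= 0 }

assert map == [3: 4]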
