Scala - Executing every element until they all have finished - multithreading

I cannot figure out why my function invokeAll does not produce the correct output. Any solutions? (Futures and parallel collections are not allowed, and the return type must be Seq[Int].)
def invokeAll(work: Seq[() => Int]): Seq[Int] = {
  //this is what we should return as an output "return res.toSeq"
  //res cannot be changed!
  val res = new Array[Int](work.length)
  var list = mutable.Set[Int]()
  var n = res.size
  val procedure = (0 until n).map(work =>
    new Runnable {
      def run {
        //add the finished element/Int to list
        list += work
      }
    }
  )
  val threads = procedure.map(new Thread(_))
  threads.foreach(x => x.start())
  threads.foreach (x => (x.join()))
  res ++ list
  //this should be the final output ("return res.toSeq")
  return res.toSeq
}

OMG, I know a java programmer, when I see one :)
Don't do this, it's not java!
val results: Future[Seq[Int]] = Future.traverse(work)(task => Future(task()))
This is how you do it in scala.
This gives you a Future with the results of all executions, that will be satisfied when all work is finished. You can use .map, .flatMap etc. to access and transform those results. For example
val sumOfAll: Future[Int] = results.map(_.sum)
Or (in the worst case, when you just want to hand the result back to imperative code), you could block and wait on the future to get hold of the actual result (don't do this unless you are absolutely desperate): Await.result(results, 365.days), which needs import scala.concurrent.duration._
If you want the results as an array, results.map(_.toArray) will do that ... but you really should not: arrays aren't a good choice for the vast majority of use cases in Scala. Just stick with Seq.
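The snippets above leave out the imports and the execution context they rely on. A minimal self-contained sketch, assuming the global execution context is acceptable and using three made-up tasks and a 10-second timeout purely for illustration, might look like this:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical work items; any Seq[() => Int] would do.
val work: Seq[() => Int] = Seq(() => 1, () => 2, () => 3)

// Run every task in its own Future and gather the results in input order.
val results: Future[Seq[Int]] = Future.traverse(work)(task => Future(task()))

// Transform the results without blocking.
val sumOfAll: Future[Int] = results.map(_.sum)

// Block only at the very edge of the program (timeout chosen arbitrarily).
val total: Int = Await.result(sumOfAll, 10.seconds)
```

Future.traverse keeps the results in the same order as the input sequence, regardless of which task finishes first.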

The main problem in your code is that you are using a fixed-size array and trying to add elements to it with the ++ (concatenate) operator: res ++ list. That expression produces a new collection, but you never store it in a val, so it is discarded.
You could remove the last line, return res.toSeq, and see that res ++ list becomes the return value. It will be your work.length array of zeros (res) followed by the elements of list. Try reading more about Scala collections: most of them are immutable, and using immutable data structures is good practice. In Scala, ++ does not accumulate values into its left operand, and arrays are fixed size.
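One way to stay within the question's constraints (plain threads, no futures, res left as a fixed-size array) is to have each thread write its result into its own slot of res. The sketch below assumes the intent is that every thread runs exactly one task from work and stores the result at that task's index; that reading is an assumption, since the original code never calls the tasks.

```scala
def invokeAll(work: Seq[() => Int]): Seq[Int] = {
  val res = new Array[Int](work.length)

  // One thread per task; each thread writes only to its own index,
  // so no two threads ever touch the same array slot.
  val threads = work.zipWithIndex.map { case (task, i) =>
    new Thread(new Runnable {
      def run(): Unit = res(i) = task()
    })
  }

  threads.foreach(_.start())
  // join() waits for completion and makes each thread's write visible here.
  threads.foreach(_.join())

  res.toSeq
}
```

Because results are stored by index rather than collected into a Set, duplicates are kept and the output order matches the input order.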

Related

Injecting key/value into HashMap

I'm trying to generate a HashMap object whose properties and values are set from parsed text input. It works fine with simple assignment, but I wanted to make it more clever and use inject.
def result = new HashMap();
def buildLog = """
BuildDir:
MSBuildProjectFile:test.csproj
TargetName: test
Compile:
Reference:
""".trim().readLines()*.trim()
buildLog.each {
def (k,v) = it.tokenize(':')
result."${k.trim()}"=v?.trim()
}
println "\nResult:\n${result.collect { k,v -> "\t$k='$v'\n" }.join()}"
generates expected output:
Result:
Reference='null'
MSBuildProjectFile='test.csproj'
BuildDir='null'
TargetName='test'
Compile='null'
after replacing the insides of .each { } closure with injection:
it.tokenize(':').inject({ key, value -> result."${key}" = value?.trim()})
the results generated are missing unset values
Result:
MSBuildProjectFile='test.csproj'
TargetName='test'
Am I doing something wrong? I tried inject("", {...}), but it seems to push my keys into values.
inject is basically a reduce. The reducing function takes two arguments: the result of the previous iteration or the initial value (i.e. the accumulator) and the next value from the sequence. So it could be made to work, but since you only expect one sequence value, it just complicates the code.
I do see a great use for collectEntries here, as it lets you create a Map from either small key/value maps or lists of two elements. And the latter is what you have:
result = buildLog.collectEntries {
it.split(":",2)*.trim()
}
should work for your code instead of buildLog.each
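For readers more familiar with Scala, which most of the other threads on this page use, the same two ideas can be sketched there. This is an illustration rather than Groovy code: foldLeft plays the role of inject (accumulator first, next element second), and map followed by toMap plays the role of collectEntries. The helper name parse and the choice of Option for missing values are assumptions made for the example.

```scala
val buildLog = Seq(
  "BuildDir:",
  "MSBuildProjectFile:test.csproj",
  "TargetName: test",
  "Compile:",
  "Reference:"
)

// Split on the first ':' only; an empty right-hand side becomes None.
def parse(line: String): (String, Option[String]) = {
  val parts = line.split(":", 2).map(_.trim)
  val value = if (parts.length > 1 && parts(1).nonEmpty) Some(parts(1)) else None
  parts(0) -> value
}

// Reduce-style: the accumulator is the map built so far.
val viaFold = buildLog.foldLeft(Map.empty[String, Option[String]]) {
  (acc, line) => acc + parse(line)
}

// collectEntries-style: turn each line into a pair, then build the map.
val viaMap = buildLog.map(parse).toMap
```

Both versions keep the keys that have no value, which is the part the inject attempt in the question lost.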

How can you store the results from a forEach in Spark

DataSet#foreach(f) applies the function f to each row in the dataset. In a clustered environment, the data is split across the cluster. How can the results from each of these functions be collected?
For example, say the function would count the number of characters stored in each row. How can you create a DataSet or RDD that contains the results of each of these functions applied to each row?
The definition for foreach looks something like :
final def foreach(f: (A) ⇒ Unit): Unit
f : The function that is applied for its side-effect to every element.
The result of function f is discarded
foreach in Scala is generally used to denote the usage of a function that involves a side-effect, e.g. printing to STDOUT.
If you want to return something by applying a particular function, you'll have to use map
final def map[B](f: (A) ⇒ B): List[B]
I copied the syntax from the documentation for List but it'll be something similar for RDDs as well.
As you can see, it applies the function f to each element of type A and returns a collection of elements of type B, where A and B can also be the same type.
val rdd = sc.parallelize(Array(
"String1",
"String2",
"String3" ))
rdd.foreach(x => (x, x.length) )
// Nothing happens
rdd.map(x => (x, x.length) ).collect
// Array[(String, Int)] = Array((String1,7), (String2,7), (String3,7))

mapPartitionsWithIndex - how is output combined

I am trying to understand mapPartitionsWithIndex in Spark. I found that the following two examples produce vastly different output:
# Example 1: show uses yield
parallel = sc.parallelize(range(1,10),2)
def show(index, iterator):
    yield 'index: ' + str(index) + " values: " + str(list(iterator))
parallel.mapPartitionsWithIndex(show).collect()
# Example 2: show uses return
parallel = sc.parallelize(range(1,10),2)
def show(index, iterator):
    return 'index: ' + str(index) + " values: " + str(list(iterator))
parallel.mapPartitionsWithIndex(show).collect()
The only difference is whether the show function uses yield (making it a generator) or return.
I guess I do not understand how mapPartitionsWithIndex combines the results from the individual partitions.
Can you please explain to me how this behavior occurs?
mapPartitionsWithIndex(self, f, preservesPartitioning=False)
The parameter f must return an iterable object.
Normally an error should be raised if f does not return an iterable.
But in your second example the returned string is itself iterable, so it is mistakenly treated as a sequence of characters: the source calls iterator = iter(iterator) (pyspark/serializers.py, line 266), which walks the string one letter at a time.
Just return ["I'm String"] if you insist on using return.

Short-circuiting in functional Groovy?

"When you've found the treasure, stop digging!"
I want to use more functional programming in Groovy, and thought rewriting the following method would be good training. It's harder than it looks because Groovy doesn't appear to build short-circuiting into its more functional features.
Here's an imperative function to do the job:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']
String shortestUniqueName(String nameToShorten) {
def currentLevel = 1
String shortName = ''
def separator = '/'
while (fullyQualifiedNames.findAll { fqName ->
shortName = nameToShorten.tokenize(separator)[-currentLevel..-1].join(separator)
fqName.endsWith(shortName)
}.size() > 1) {
++currentLevel
}
return shortName
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
It scans a list of fully-qualified filenames and returns the shortest unique form. There are potentially hundreds of fully-qualified names.
As soon as the method finds a short name with only one match, that short name is the right answer, and the iteration can stop. There's no need to scan the rest of the name or do any more expensive list searches.
But turning to a more functional flow in Groovy, neither return nor break can drop you out of the iteration:
return simply returns from the present iteration, not from the whole .each so it doesn't short-circuit.
break isn't allowed outside of a loop, and .each {} and .eachWithIndex {} are not considered loop constructs.
I can't use .find() instead of .findAll() because my program logic requires that I scan all elements of the list, not just stop at the first.
There are plenty of reasons not to use try..catch blocks, but the best I've read is from here:
Exceptions are basically non-local goto statements with all the
consequences of the latter. Using exceptions for flow control
violates the principle of least astonishment, make programs hard to read
(remember that programs are written for programmers first).
Some of the usual ways around this problem are detailed here including a solution based on a new flavour of .each. This is the closest to a solution I've found so far, but I need to use .eachWithIndex() for my use case (in progress.)
Here's my own poor attempt at a short-circuiting functional solution:
fullyQualifiedNames = ['a/b/c/d/e', 'f/g/h/i/j', 'f/g/h/d/e']
def shortestUniqueName(String nameToShorten) {
def found = ''
def final separator = '/'
def nameComponents = nameToShorten.tokenize(separator).reverse()
nameComponents.eachWithIndex { String _, int i ->
if (!found) {
def candidate = nameComponents[0..i].reverse().join(separator)
def matches = fullyQualifiedNames.findAll { String fqName ->
fqName.endsWith candidate
}
if (matches.size() == 1) {
found = candidate
}
}
}
return found
}
println shortestUniqueName('a/b/c/d/e')
Result: c/d/e
Please shoot me down if there is a more idiomatic way to short-circuit in Groovy that I haven't thought of. Thank you!
There's probably a cleaner looking (and easier to read) solution, but you can do this sort of thing:
String shortestUniqueName(String nameToShorten) {
// Split the name to shorten, and make a list of all sequential combinations of elements
nameToShorten.split('/').reverse().inject([]) { agg, l ->
if(agg) agg + [agg[-1] + l] else agg << [l]
}
// Starting with the smallest element
.find { elements ->
fullyQualifiedNames.findAll { name ->
name.endsWith(elements.reverse().join('/'))
}.size() == 1
}
?.reverse()
?.join('/')
?: ''
}
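For comparison, here is how the same short-circuiting search might be sketched in Scala, the language most other threads on this page use. It is not a Groovy solution; it only illustrates the idea behind the answer above: generate candidates lazily, shortest first, and let find stop at the first unique one. The list contents come from the question; everything else is an assumption of the sketch.

```scala
val fullyQualifiedNames = Seq("a/b/c/d/e", "f/g/h/i/j", "f/g/h/d/e")

def shortestUniqueName(nameToShorten: String): String = {
  val parts = nameToShorten.split('/')

  // An Iterator is lazy: each candidate is built only when find asks for it,
  // so nothing past the first unique suffix is ever computed.
  (1 to parts.length).iterator
    .map(n => parts.takeRight(n).mkString("/"))
    .find(candidate => fullyQualifiedNames.count(_.endsWith(candidate)) == 1)
    .getOrElse("")
}

println(shortestUniqueName("a/b/c/d/e"))
```

Each candidate is still checked against the whole list, as the question requires; only the outer walk over name components short-circuits.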

takeRightWhile() method in scala

I might be missing something but recently I came across a task to get last symbols according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as Int.
Note: number of parts separated by _ is not defined.
If I had a method takeRightWhile(f: Char => Boolean) on a string it would be trivial: takeRightWhile(ch => ch != '_'). Moreover, it would be efficient: a straightforward implementation would find the last index of _ and then take a substring, while this method would skip the first step and give better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient, as they use additional O(n) space. To implement takeRightWhile efficiently you could iterate from the right, accumulating the result in a string builder or whatever else, and return the result. I am asking about this kind of method, not the implementation that was already described and rejected in the question itself.
Question: Does this kind of method exist in the Scala standard library? If not, is there a combination of standard library methods that achieves the same in a minimal number of lines?
Thanks in advance.
Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
case (item, acc) => item::acc
}
Here you need to check condition and stop adding items after condition met. For this you can pass a flag to accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
case (item, (false, list)) if item!='_' => (false, item::list)
case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even less efficient than the solution with the double reverse:
Implementation of foldRight uses combination of List reverse and foldLeft
You cannot break foldRight execution, so you need flag to skip all items after condition met
I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. It splits the String into an Array of Strings, reverses it, and takes the first element. Note that, because the reversing happens after the splitting, the order of the characters is correct.
I am not exactly sure about the problem you are facing. My understanding is that you have a string of the format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as an Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... let's say your requirement is to collect the right-hand suffix for as long as some boolean condition remains true.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but above is unsafe so
val yourIntOpt = Try(rightSuffix.toInt).toOption
Comment if your requirement is different from this.
You can use StringBuilder and lastIndexWhere to find the last separator and take everything after it.
val str = "this_is_separated_values_5"
val sb = new StringBuilder(str)
val lastIdx = sb.lastIndexWhere(ch => ch == '_')
val lastPart = str.substring(lastIdx + 1)
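A takeRightWhile helper of the kind the question asks for can be sketched on top of lastIndexWhere. The function below is hypothetical (it is not a standard library method, and its name and curried signature are choices made for this example): it scans from the right for the last character that fails the predicate and returns everything after it, so no reversed copy of the string is built.

```scala
// Hypothetical helper, not part of the Scala standard library.
def takeRightWhile(s: String)(p: Char => Boolean): String = {
  // Index of the last character that does NOT satisfy p, or -1 if all do.
  val cut = s.lastIndexWhere(c => !p(c))
  // With cut == -1 this is substring(0), i.e. the whole string.
  s.substring(cut + 1)
}

val str = "this_is_separated_values_5"
val lastPart = takeRightWhile(str)(_ != '_')
val asInt = lastPart.toInt
```

If the suffix might not be numeric, wrapping the conversion in scala.util.Try, as an earlier answer does, avoids the exception.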

Resources