Iterate every row of a Spark DataFrame without using collect - apache-spark

I want to iterate every row of a dataframe without using collect. Here is my current implementation:
val df = spark.read.csv("/tmp/s0v00fc/test_dir")

import scala.collection.mutable.Map

var m1 = Map[Int, Int]()
var m4 = Map[Int, Int]()
var j = 1

def Test(m: Int, n: Int): Unit = {
  if (!m1.contains(m)) {
    m1 += (m -> j)
    m4 += (j -> m)
    j += 1
  }
  if (!m1.contains(n)) {
    m1 += (n -> j)
    m4 += (j -> n)
    j += 1
  }
}

df.foreach { row => Test(row(0).toString.toInt, row(1).toString.toInt) }
This runs without any error, but m1 and m4 are still empty. I can get the result I am expecting if I do a df.collect, as shown below -
df.collect.foreach { row => Test(row(0).toString.toInt, row(1).toString.toInt) }
How do I execute the custom function Test on every row of the DataFrame without using collect?

According to the Spark documentation for foreach:
"Note: modifying variables other than Accumulators outside of the foreach()may result in undefined behavior. See Understanding closures for more details."
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
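Because each executor mutates its own copy of the closure, the maps changed inside foreach never make it back to the driver, which is why m1 and m4 stay empty. One way to build the same lookup without collecting every row is to collect only the distinct values; a minimal sketch (assuming the question's two-column CSV of integers):

// Build the index maps on the driver from the distinct values only
val distinctVals = df.rdd
  .flatMap(row => Seq(row(0).toString.toInt, row(1).toString.toInt))
  .distinct()
  .collect()

// Same shape as m1/m4: value -> index and index -> value, indices starting at 1
val m1 = distinctVals.zipWithIndex.map { case (v, i) => v -> (i + 1) }.toMap
val m4 = m1.map(_.swap)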

Related

Multithreading in Scala

I was recently given a challenge in school to create a simple program in Scala that does some calculations on a matrix. The thing is, I have to do these calculations using 5 threads. Since I had no prior knowledge of Scala, I am stuck. I searched online but did not find how to create the exact number of threads I want. This is the code:
import scala.math

object Test {
  def main(args: Array[String]) {
    val M1: Seq[Seq[Int]] = List(
      List(1, 2, 3),
      List(4, 5, 6),
      List(7, 8, 9)
    )
    var tempData: Float = 0
    var count: Int = 1
    var finalData: Int = 0
    for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
      count = 1
      tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
      finalData = math.ceil(tempData / count).toInt
      printf("%d ", finalData)
    }
    def calc(i: Int, j: Int): Int = {
      if ((i < 0) || (j < 0) || (i > M1.length - 1))
        return 0
      else {
        count += 1
        return M1(i)(j)
      }
    }
  }
}
I tried this:
for (a <- 0 until 10) {
  val thread = new Thread {
    override def run {
      for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
        count = 1
        tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
        finalData = math.ceil(tempData / count).toInt
        printf("%d ", finalData)
      }
    }
  }
  thread.start
}
but it only executed the same thing 10 times.
Here's the original core of the calculation.
for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  count = 1
  tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
  finalData = math.ceil(tempData / count).toInt
  printf("%d ", finalData)
}
Let's actually build a result array:
val R = Array.ofDim[Int](M1.length, M1(0).length)
var tempData: Float = 0
var count: Int = 1
var finalData: Int = 0
for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  count = 1
  tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
  R(i)(j) = math.ceil(tempData / count).toInt
}
Now, that mutable count, modified in one function and referenced in another, is a bit of a code smell. Let's remove it: change calc to return an Option, assemble a list of the things to average, and flatten to keep only the Some values.
val R = Array.ofDim[Int](M1.length, M1(0).length)
for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
  R(i)(j) = math.ceil(tempList.sum.toDouble / tempList.length).toInt
}

def calc(i: Int, j: Int): Option[Int] = {
  if ((i < 0) || (j < 0) || (i > M1.length - 1))
    None
  else {
    Some(M1(i)(j))
  }
}
Next, a side-effecting for is a bit of a code smell too. So let's have the inner loop yield each row's values and the outer loop the list of rows...
val R = for (i <- 0 to M1.length - 1) yield {
  for (j <- 0 to M1(0).length - 1) yield {
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }
}
Now, we read the Scala API and notice ParSeq and Seq.par, so we'd like to work with map and friends. So let's de-sugar the for comprehensions:
val R = (0 until M1.length).map { i =>
  (0 until M1(0).length).map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }
}
This is our MotionBlurSingleThread. To make it parallel, we simply do
val R = (0 until M1.length).par.map { i =>
  (0 until M1(0).length).par.map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }.seq
}.seq
And this is our MotionBlurMultiThread. And it is nicely functional too (no mutable values).
The limit of 5 or 10 threads isn't in the challenge on GitHub, but if you need it, you can look at how to set the degree of parallelism of Scala parallel collections and related questions. As a sketch (assuming Scala 2.12, where ForkJoinTaskSupport takes a java.util.concurrent.ForkJoinPool), you can cap the thread count like this:
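import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Cap the outer parallel collection at 5 worker threads
val rows = (0 until M1.length).par
rows.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(5))
val R = rows.map { i =>
  (0 until M1(0).length).map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }
}.seq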
I am not an expert on either Scala or concurrency.
Scala's approach to concurrency is through actors and messaging; you can read a little about that in Programming in Scala, chapter 30, "Actors and Concurrency" (the first edition is free, but it is outdated). As I said, that edition is outdated, and in the latest version of Scala (2.12) the actors library is no longer included; the recommendation is to use Akka, which you can read about here.
So, I would not recommend learning Scala, sbt, and Akka just for a challenge, but you can download an Akka quickstart here and customize the given example to your needs; it is nicely explained in the link. Each instance of an Actor has its own thread. You can read about actors and threads here, specifically the section about state.
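If you do go the Akka route, here is a minimal sketch of the classic actor API (the names are hypothetical, not from the challenge); each actor processes its messages sequentially, while different actors may run on different threads:

import akka.actor.{Actor, ActorSystem, Props}

// One worker per matrix row; prints the sum of the row it receives
class RowWorker extends Actor {
  def receive = {
    case row: Seq[Int] @unchecked => println(s"row sum: ${row.sum}")
  }
}

object Demo extends App {
  val system = ActorSystem("matrix-demo")
  val M1 = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
  val workers = M1.indices.map(i => system.actorOf(Props[RowWorker], s"worker-$i"))
  workers.zip(M1).foreach { case (w, row) => w ! row }
}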

I want to collect the DataFrame column values in an array list to conduct some computations. Is it possible?

I am loading data from Phoenix like this:
val tableDF = sqlContext.phoenixTableAsDataFrame("Hbtable", Array("ID", "distance"), conf = configuration)
and I want to carry out the following computation on the values of the column distance:
val list = Array(10, 20, 30, 40, 10, 20, 0, 10, 20, 30, 40, 50, 60) // list of values from the column distance
val first = list(0)
val last = list(list.length - 1)
var m = 0
for (a <- 0 to list.length - 2) {
  if (list(a + 1) < list(a) && list(a + 1) >= 0) {
    m = m + list(a)
  }
}
val totalDist = (m + last - first)
You can do something like this. It returns an Array[Any]:
val array = df.select("distance").rdd.map(r => r(0)).collect()
If you want the proper data type, you can cast the value instead. This returns an Array[Int]:
val array = df.select("distance").rdd.map(r => r(0).asInstanceOf[Int]).collect()
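Alternatively, a sketch using the typed Dataset API (assuming a SparkSession named spark is in scope and the distance column is integer-typed), which avoids the explicit cast:

import spark.implicits._

// The Int encoder gives a typed collect: Array[Int] instead of Array[Any]
val array: Array[Int] = df.select($"distance").as[Int].collect()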

Accessing rows outside of window while aggregating in Spark dataframe

In short, in the example below I want to pin 'b to be the value in the row that the result will appear in.
Given:
a,b
1,2
4,6
3,7 ==> 'special would be: (1-7 + 4-7 + 3-7) == -13 in this row
val baseWin = Window.partitionBy("something_I_forgot").orderBy("whatever")
val win = baseWin.rowsBetween(-2, 0)
frame.withColumn("special", sum('a - 'b).over(win))
Or another way to think of it is that I want to close over the row when I calculate the sum, so that I can pass in the value of 'b (in this case 7).
Update:
Here is what I want to accomplish as a UDF. In short, I used a foldLeft.
def mad(field: Column, numPeriods: Integer): Column = {
  val baseWin = Window.partitionBy("exchange", "symbol").orderBy("datetime")
  val win = baseWin.rowsBetween(numPeriods + 1, 0)
  val subFunc: (Seq[Double], Int) => Double = { (input: Seq[Double], numPeriods: Int) =>
    val agg = grizzled.math.stats.mean(input: _*)
    val fooBar = (1.0 / -numPeriods) * input.foldLeft(0.0)((a, b) => a + Math.abs(b - agg))
    fooBar
  }
  val myUdf = udf(subFunc)
  myUdf(collect_list(field.cast(DoubleType)).over(win), lit(numPeriods))
}
If I understood correctly what you're trying to do, I think you can refactor your logic a bit to achieve it. The way you have it right now, you're probably getting -7 instead of -13.
For the "special" column, (1-7 + 4-7 + 3-7), you can calculate it like (sum(a) - count(*) * b):
dfA.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b)
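To sanity-check the formula, here is a hypothetical reproduction of the question's three rows (the partition and order columns are invented, since the question elides them):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val frame = Seq((1, 2, "x", 1), (4, 6, "x", 2), (3, 7, "x", 3))
  .toDF("a", "b", "something_I_forgot", "whatever")
val win = Window.partitionBy("something_I_forgot").orderBy("whatever").rowsBetween(-2, 0)

frame.withColumn("special", sum('a).over(win) - count("*").over(win) * 'b).show()
// The last row gets (1 + 4 + 3) - 3 * 7 = -13, as required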

Parallel Merge Sort in Scala

I have been trying to implement parallel merge sort in Scala. But with 8 cores, using .sorted is still about twice as fast.
edit:
I rewrote most of the code to minimize object creation. Now it runs about as fast as .sorted.
Input file with 1.2M integers:
1.333580 seconds (my implementation)
1.439293 seconds (.sorted)
How should I parallelize this?
New implementation
import scala.reflect.ClassTag

object Mergesort extends App {
  //===================================================================================================================
  // UTILITY
  implicit object comp extends Ordering[Any] {
    def compare(a: Any, b: Any) = {
      (a, b) match {
        case (a: Int, b: Int)       => a compare b
        case (a: String, b: String) => a compare b
        case _                      => 0
      }
    }
  }

  //===================================================================================================================
  // MERGESORT
  val THRESHOLD = 30

  def inssort[A](a: Array[A], left: Int, right: Int): Array[A] = {
    for (i <- (left + 1) until right) {
      var j = i
      val item = a(j)
      while (j > left && comp.lt(item, a(j - 1))) {
        a(j) = a(j - 1)
        j -= 1
      }
      a(j) = item
    }
    a
  }

  def mergesort_merge[A](a: Array[A], temp: Array[A], left: Int, right: Int, mid: Int): Array[A] = {
    var i = left
    var j = right
    while (i < mid) { temp(i) = a(i); i += 1 }
    while (j > mid) { temp(i) = a(j - 1); i += 1; j -= 1 }
    i = left
    j = right - 1
    var k = left
    while (k < right) {
      if (comp.lt(temp(i), temp(j))) { a(k) = temp(i); i += 1; k += 1 }
      else { a(k) = temp(j); j -= 1; k += 1 }
    }
    a
  }

  def mergesort_split[A](a: Array[A], temp: Array[A], left: Int, right: Int): Array[A] = {
    if (right - left == 1) a
    if ((right - left) > THRESHOLD) {
      val mid = (left + right) / 2
      mergesort_split(a, temp, left, mid)
      mergesort_split(a, temp, mid, right)
      mergesort_merge(a, temp, left, right, mid)
    }
    else
      inssort(a, left, right)
  }

  def mergesort[A: ClassTag](a: Array[A]): Array[A] = {
    val temp = new Array[A](a.size)
    mergesort_split(a, temp, 0, a.size)
  }
}
Previous implementation
Input file with 1.2M integers:
4.269937 seconds (my implementation)
1.831767 seconds (.sorted)
What sort of tricks are there to make it faster and cleaner?
import java.io.PrintWriter
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.io.Source

object Mergesort extends App {
  //===================================================================================================================
  // UTILITY
  val StartNano = System.nanoTime

  def dbg(msg: String) = println("%05d DBG ".format(((System.nanoTime - StartNano) / 1e6).toInt) + msg)

  def time[T](work: => T) = {
    val start = System.nanoTime
    val res = work
    println("%f seconds".format((System.nanoTime - start) / 1e9))
    res
  }

  implicit object comp extends Ordering[Any] {
    def compare(a: Any, b: Any) = {
      (a, b) match {
        case (a: Int, b: Int)       => a compare b
        case (a: String, b: String) => a compare b
        case _                      => 0
      }
    }
  }

  //===================================================================================================================
  // MERGESORT
  def merge[A](left: List[A], right: List[A]): Stream[A] = (left, right) match {
    case (x :: xs, y :: ys) if comp.lteq(x, y) => x #:: merge(xs, right)
    case (x :: xs, y :: ys)                    => y #:: merge(left, ys)
    case _ => if (left.isEmpty) right.toStream else left.toStream
  }

  def sort[A](input: List[A], length: Int): List[A] = {
    if (length < 100) return input.sortWith(comp.lt)
    input match {
      case Nil | List(_) => input
      case _ =>
        val middle = length / 2
        val (left, right) = input splitAt middle
        merge(sort(left, middle), sort(right, middle + length % 2)).toList
    }
  }

  def msort[A](input: List[A]): List[A] = sort(input, input.length)

  //===================================================================================================================
  // PARALLELIZATION
  //val cores = Runtime.getRuntime.availableProcessors
  //dbg("Detected %d cores.".format(cores))
  //lazy implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(cores))

  def futuremerge[A](fa: Future[List[A]], fb: Future[List[A]])(implicit order: Ordering[A], ec: ExecutionContext) = {
    for {
      a <- fa
      b <- fb
    } yield merge(a, b).toList
  }

  def parallel_msort[A](input: List[A], length: Int)(implicit order: Ordering[A]): Future[List[A]] = {
    val middle = length / 2
    val (left, right) = input splitAt middle
    if (length > 500) {
      val fl = parallel_msort(left, middle)
      val fr = parallel_msort(right, middle + length % 2)
      futuremerge(fl, fr)
    }
    else {
      Future(msort(input))
    }
  }

  //===================================================================================================================
  // MAIN
  val results = time({
    val src = Source.fromFile("in.txt").getLines
    val header = src.next.split(" ").toVector
    val lines = if (header(0) == "i") src.map(_.toInt).toList else src.toList
    val f = parallel_msort(lines, lines.length)
    Await.result(f, concurrent.duration.Duration.Inf)
  })
  println("Sorted as comparison...")
  val sorted_src = Source.fromFile(input_folder + "in.txt").getLines
  sorted_src.next
  time(sorted_src.toList.sorted)
  val writer = new PrintWriter("out.txt", "UTF-8")
  try writer.print(results.mkString("\n"))
  finally writer.close
}
My answer is probably going to be a bit long, but I hope that it will be useful for both you and me.
So, the first question is: how does Scala sort a List? Let's have a look at the code from the Scala repo!
def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
  val len = this.length
  val b = newBuilder
  if (len == 1) b ++= this
  else if (len > 1) {
    b.sizeHint(len)
    val arr = new Array[AnyRef](len) // Previously used ArraySeq for more compact but slower code
    var i = 0
    for (x <- this) {
      arr(i) = x.asInstanceOf[AnyRef]
      i += 1
    }
    java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
    i = 0
    while (i < arr.length) {
      b += arr(i).asInstanceOf[A]
      i += 1
    }
  }
  b.result()
}
So what the hell is going on here? Long story short: it delegates the sorting to Java. Everything else is just size bookkeeping and casting. Basically, this is the line that defines it:
java.util.Arrays.sort(arr, ord.asInstanceOf[Ordering[Object]])
Let's go one level deeper into JDK sources:
public static <T> void sort(T[] a, Comparator<? super T> c) {
    if (c == null) {
        sort(a);
    } else {
        if (LegacyMergeSort.userRequested)
            legacyMergeSort(a, c);
        else
            TimSort.sort(a, 0, a.length, c, null, 0, 0);
    }
}
legacyMergeSort is nothing but a single-threaded implementation of the merge sort algorithm.
The next question is: "what is TimSort.sort, and when do we use it?"
To the best of my knowledge, the default value for the LegacyMergeSort.userRequested property is false, which leads us to the TimSort.sort algorithm. A description can be found here. Why is it better? Fewer comparisons than in merge sort, according to the comments in the JDK sources.
Moreover, you should be aware that it is all single-threaded, so there is no parallelization here.
Third question, "your code":
You create too many objects. When it comes to performance, mutation (sadly) is your friend.
"Premature optimization is the root of all evil" -- Donald Knuth. Before making any optimizations (like parallelism), try to implement a single-threaded version and compare the results.
Use something like JMH to test the performance of your code (see the sketch after these tips).
You should probably not use the Stream class if you want the best performance, as it does additional caching.
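A minimal JMH sketch (assuming the sbt-jmh plugin is set up; the class and data here are hypothetical):

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class SortBenchmark {
  // Roughly the input size from the question: 1.2M random integers
  val data: Array[Int] = Array.fill(1200000)(scala.util.Random.nextInt())

  @Benchmark
  def builtinSorted: Array[Int] = data.sorted

  @Benchmark
  def javaSort: Array[Int] = {
    val a = data.clone()
    java.util.Arrays.sort(a)
    a
  }
}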
I intentionally did not give you an answer like "a super-fast merge sort in Scala can be found here", but just some tips for you to apply to your code and coding practices.
Hope it will help you.

Groovier way of manipulating the list

I have two lists like this:
def a = [100,200,300]
def b = [30,60,90]
I want a Groovier way of manipulating a, like this:
1) The first element of a should be changed to a[0]-2*b[0]
2) The second element of a should be changed to a[1]-4*b[1]
3) The third element of a should be changed to a[2]-8*b[2]
(provided that both a and b have the same length of 3)
If the list is changed to a map, let's say:
def a1 = [100:30, 200:60, 300:90]
how could one do the same operation in this case?
Thanks in advance.
For List, I'd go with:
def result = []
a.eachWithIndex { item, index ->
    result << item - ((2**(index + 1)) * b[index])
}
For Map it's a bit easier, but it still requires external state:
int i = 1
def result = a1.collect { k, v -> k - ((2**i++) * v) }
A pity Groovy doesn't have an analog of zip; in this case, something like zipWithIndex or collectWithIndex would do.
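For comparison, in Scala (the language of the rest of this page), zip and zipWithIndex make this a one-liner; a small sketch:

val a = List(100, 200, 300)
val b = List(30, 60, 90)
// Pair corresponding elements, then use the index to pick the power of two
val result = a.zip(b).zipWithIndex.map { case ((x, y), i) => x - (1 << (i + 1)) * y }
// List(40, -40, -420)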
Using collect
In response to Victor in the comments, you can do this using a collect:
def a = [100, 200, 300]
def b = [30, 60, 90]

// Introduce a list `c` of the multipliers
def c = (1..a.size()).collect { 2**it }

// Transpose these lists together, and calculate
[a, b, c].transpose().collect { x, y, z ->
    x - y * z
}
Using inject
You can also use inject, passing in a map of multiplier and result, then fetching the result out at the end:
def result = [a, b].transpose().inject([mult: 2, result: []]) { acc, vals ->
    acc.result << vals.with { av, bv -> av - (acc.mult * bv) }
    acc.mult *= 2
    acc
}.result
And similarly, you can use inject for the map:
def result = a1.inject([mult: 2, result: []]) { acc, key, val ->
    acc.result << key - (acc.mult * val)
    acc.mult *= 2
    acc
}.result
Using inject has the advantage that you don't need external variables declared, but the disadvantage of being harder to read (and, as Victor points out in the comments, it makes static analysis of the code hard to impossible for IDEs and groovypp).
def a1 = [100:30, 200:60, 300:90]
a1.eachWithIndex { item, index ->
    println item.key - ((2**(index + 1)) * item.value)
}
