If I have code like this:
foreachRDD { rdd =>
  // operation 1
  val before = time.now()
  val result = rdd.map(r => /* some operation */)
  val finalTime = time.now() - before
  // operation 2
  val before2 = time.now()
  val result2 = result.map(r => /* some operation */)
  val finalTime2 = time.now() - before2
  ...
  // Some action
}
I think finalTime and finalTime2 are computed on the driver and give me the real time it takes to execute each of these operations. Am I right? Or where are these operations really executed?
I think you can use the time function, but it is only available since 2.1.0 (you can add it manually for lower versions).
val spark = SparkSession
  .builder()
  .appName("Spark test")
  .master("local[*]")
  .getOrCreate()
val df = ???
spark.time(df.show()) //some block of operation here
You can see its implementation here:
/**
* Executes some code block and prints to stdout the time taken to execute the block. This is
* available in Scala only and is used primarily for interactive testing and debugging.
*
 * @since 2.1.0
*/
def time[T](f: => T): T = {
  val start = System.nanoTime()
  val ret = f
  val end = System.nanoTime()
  // scalastyle:off println
  println(s"Time taken: ${(end - start) / 1000 / 1000} ms")
  // scalastyle:on println
  ret
}
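Note that spark.time only measures the block you pass to it, and RDD/DataFrame transformations are lazy, so to time the actual work you need to include an action inside the timed block. A minimal sketch (dstream stands for your input DStream, and the map body and count action are just placeholders):
val dstreamTiming = dstream.foreachRDD { rdd =>
  spark.time {
    val result = rdd.map(r => r /* some operation */)
    result.count() // an action forces evaluation, so the measured time covers the real work
  }
}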
I am quite new to Scala, but I am learning about threads and multithreading.
As the title says, I am trying to implement a way to divide the problem across a variable number of threads.
We are given this code:
/** Executes the provided function for each entry in the input sequence in parallel.
*
 * @param input the input sequence
 * @param parallelism the number of threads to use
 * @param f the function to run
*/
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = ???
I tried implementing it like this:
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = {
  if (parallelism < 1) {
    throw new IllegalArgumentException("a degree of parallelism < 1 is not allowed for parallel foreach")
  }
  val threads = (0 until parallelism).map { threadId =>
    val startIndex = threadId * input.size / parallelism
    val endIndex = (threadId + 1) * input.size / parallelism
    val task: Runnable = () => {
      (startIndex until endIndex).foreach { A =>
        val key = input.grouped(input.size / parallelism)
        val x: Unit = input.foreach(A => f(A))
        x
      }
    }
    new Thread(task)
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}
for this test:
test("parallel foreach should perform the given function once for each element in the sequence") {
  val counter = AtomicLong(0L)
  parallelForeach((1 to 100), 16, counter.addAndGet(_))
  assert(counter.get() == 5050)
}
But, as you can guess, it doesn't work this way as my result isn't 5050 but 505000.
Now here is my question: how do I implement this so that multithreading is used efficiently, with, for example, 16 different threads working at the same time?
Check your test: 1 to 100.
With your code, every thread goes through all 100 elements, which is why your result is 100 times too large.
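For illustration, here is a minimal sketch of one possible fix (not the poster's final code): each thread applies f only to its own slice of the input instead of the whole sequence.
def parallelForeach[A](input: IndexedSeq[A], parallelism: Int, f: A => Unit): Unit = {
  require(parallelism >= 1, "a degree of parallelism < 1 is not allowed for parallel foreach")
  val threads = (0 until parallelism).map { threadId =>
    // each thread covers its own contiguous index range [startIndex, endIndex)
    val startIndex = threadId * input.size / parallelism
    val endIndex = (threadId + 1) * input.size / parallelism
    new Thread(() => (startIndex until endIndex).foreach(i => f(input(i))))
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
}
With this change, the test above sums each of 1 to 100 exactly once and yields 5050.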
I don't understand the general approach one takes to determine the mergeExpressions function for non-trivial aggregators.
The mergeExpressions method for something like org.apache.spark.sql.catalyst.expressions.aggregate.Average is straightforward:
override lazy val mergeExpressions = Seq(
/* sum = */ sum.left + sum.right,
/* count = */ count.left + count.right
)
The mergeExpressions for CentralMomentAgg aggregators is a bit more involved.
What I would like to do is create a WeightedStddevSamp aggregator modeled after Spark's CentralMomentAgg.
I almost have it working, but the weighted standard deviations that it produces are still a little off from what I compute by hand.
I'm having trouble debugging it because I do not understand how to derive the exact logic for the mergeExpressions method.
Below is my code. The updateExpressions method is based on this weighted incremental algorithm, so I'm pretty sure that method is correct. I believe my problem is in the mergeExpressions method. Any hints would be appreciated.
abstract class WeightedCentralMomentAgg(child: Expression, weight: Expression) extends DeclarativeAggregate {

  override def children: Seq[Expression] = Seq(child, weight)
  override def nullable: Boolean = true
  override def dataType: DataType = DoubleType
  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)

  protected val wSum = AttributeReference("wSum", DoubleType, nullable = false)()
  protected val mean = AttributeReference("mean", DoubleType, nullable = false)()
  protected val s = AttributeReference("s", DoubleType, nullable = false)()

  override val aggBufferAttributes = Seq(wSum, mean, s)
  override val initialValues: Seq[Expression] = Array.fill(3)(Literal(0.0))

  // See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm
  override val updateExpressions: Seq[Expression] = {
    val newWSum = wSum + weight
    val newMean = mean + (weight / newWSum) * (child - mean)
    val newS = s + weight * (child - mean) * (child - newMean)
    Seq(
      If(IsNull(child), wSum, newWSum),
      If(IsNull(child), mean, newMean),
      If(IsNull(child), s, newS)
    )
  }

  override val mergeExpressions: Seq[Expression] = {
    val wSum1 = wSum.left
    val wSum2 = wSum.right
    val newWSum = wSum1 + wSum2
    val delta = mean.right - mean.left
    val deltaN = If(newWSum === Literal(0.0), Literal(0.0), delta / newWSum)
    val newMean = mean.left + wSum1 / newWSum * delta // ???
    val newS = s.left + s.right + wSum1 * wSum2 * delta * deltaN // ???
    Seq(newWSum, newMean, newS)
  }
}

// Compute the weighted sample standard deviation of a column
case class WeightedStddevSamp(child: Expression, weight: Expression)
    extends WeightedCentralMomentAgg(child, weight) {

  override val evaluateExpression: Expression = {
    If(wSum === Literal(0.0), Literal.create(null, DoubleType),
      If(wSum === Literal(1.0), Literal(Double.NaN),
        Sqrt(s / wSum)))
  }

  override def prettyName: String = "wtd_stddev_samp"
}
For any hash aggregation, the computation is divided into four steps:
1) Initialize the buffer (wSum, mean, s).
2) Within a partition, update the buffer for a key with all of its input rows (updateExpressions is applied for each input row).
3) After shuffling, merge all the buffers for the same key using mergeExpressions; wSum.left means wSum in the left buffer, wSum.right means wSum in the other buffer.
4) Get the final result from the buffer using evaluateExpression.
A plain-Scala sketch of these four steps is shown below.
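To make the four steps concrete, here is a plain-Scala analogue (not Catalyst code) using the Average buffer (sum, count) quoted earlier; it is only meant to illustrate what initialize/update/merge/evaluate and the left/right buffers correspond to.
// Plain-Scala illustration of the four steps for Average.
case class AvgBuffer(sum: Double, count: Long)

// 1) initialize the buffer
val initial = AvgBuffer(0.0, 0L)

// 2) update the buffer with one input row (within a partition)
def update(buf: AvgBuffer, value: Double): AvgBuffer =
  AvgBuffer(buf.sum + value, buf.count + 1)

// 3) merge two buffers for the same key after the shuffle
//    (this mirrors sum.left + sum.right and count.left + count.right)
def merge(left: AvgBuffer, right: AvgBuffer): AvgBuffer =
  AvgBuffer(left.sum + right.sum, left.count + right.count)

// 4) produce the final result from the merged buffer
def evaluate(buf: AvgBuffer): Double = buf.sum / buf.count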
I discovered how to write the mergeExpressions function for weighted standard deviation. I actually had it right, but was then using a population variance rather than a sample variance calculation in evaluateExpression. The implementation shown below gives the same result as above, but is easier to understand.
override val mergeExpressions: Seq[Expression] = {
  val newN = n.left + n.right
  val wSum1 = wSum.left
  val wSum2 = wSum.right
  val newWSum = wSum1 + wSum2
  val delta = mean.right - mean.left
  val deltaN = If(newWSum === Literal(0.0), Literal(0.0), delta / newWSum)
  val newMean = mean.left + deltaN * wSum2
  val newS = (((wSum1 * s.left) + (wSum2 * s.right)) / newWSum) + (wSum1 * wSum2 * deltaN * deltaN)
  Seq(newN, newWSum, newMean, newS)
}
Here are some references
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf
http://people.ds.cam.ac.uk/fanf2/hermes/doc/antiforgery/stats.pdf
https://blog.cordiner.net/2010/06/16/calculating-variance-and-mean-with-mapreduce-python/ (This last one gave me the clue I needed for the mergeExpressions function)
Davies' post gives an outline of the approach, but for many non-trivial aggregators, I think the mergeExpressions function can be quite complex and involve advanced math to determine a correct and efficient solution. Fortunately, in this case, I found someone who had worked it out.
This solution matches what I worked out by hand. It's important to note that evaluateExpression needs to be modified slightly (to s / ((n-1)*wSum/n)) if you want sample variance instead of population variance.
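For reference, a hedged sketch of what that adjustment could look like, using the buffer attributes from the merged expressions above (n, wSum, s) and omitting the null/NaN guards of the original evaluateExpression; this only illustrates the formula mentioned, it is not a tested implementation.
// Sample standard deviation: sqrt(s / ((n - 1) * wSum / n)), per the note above.
override val evaluateExpression: Expression =
  Sqrt(s / (((n - Literal(1.0)) * wSum) / n))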
I have a function that takes the neighbors of a node (for the neighbors I use a broadcast variable) and the id of the node itself, and it calculates the closeness centrality for that node. I map each node of the graph to the result of that function. When I open the task manager, the CPU is not utilized at all, as if the job is not running in parallel; the same goes for memory. Yet every node executes the function, the data is large, and it takes time to complete, so it is not as if it does not need the resources. Any help is truly appreciated, thank you.
For loading the graph I use val graph = GraphLoader.edgeListFile(sc, path).cache
object ClosenessCentrality {

  case class Vertex(id: VertexId)

  def run(graph: Graph[Int, Float], sc: SparkContext): Unit = {
    // Have to reverse edges and make the graph undirected because it is bipartite
    val neighbors = CollectNeighbors.collectWeightedNeighbors(graph).collectAsMap()
    val bNeighbors = sc.broadcast(neighbors)

    val result = graph.vertices.map(f => shortestPaths(f._1, bNeighbors.value))
    //result.coalesce(1)
    result.count()
  }

  def shortestPaths(source: VertexId, neighbors: Map[VertexId, Map[VertexId, Float]]): Double = {
    val predecessors = new mutable.HashMap[VertexId, ListBuffer[VertexId]]()
    val distances = new mutable.HashMap[VertexId, Double]()
    val q = new FibonacciHeap[Vertex]
    val nodes = new mutable.HashMap[VertexId, FibonacciHeap.Node[Vertex]]()

    distances.put(source, 0)
    for (w <- neighbors) {
      if (w._1 != source)
        distances.put(w._1, Int.MaxValue)
      predecessors.put(w._1, ListBuffer[VertexId]())
      val node = q.insert(Vertex(w._1), distances(w._1))
      nodes.put(w._1, node)
    }

    while (!q.isEmpty) {
      val u = q.minNode
      val node = u.data.id
      q.removeMin()
      // discover paths
      //println("Current node is:" + node + " " + neighbors(node).size)
      for (w <- neighbors(node).keys) {
        //print("Neighbor is" + w)
        val alt = distances(node) + neighbors(node)(w)
        // if (distances(w) > alt) {
        //   distances(w) = alt
        //   q.decreaseKey(nodes(w), alt)
        // }
        // if (distances(w) == alt)
        //   predecessors(w).+=(node)
        if (alt < distances(w)) {
          distances(w) = alt
          predecessors(w).+=(node)
          q.decreaseKey(nodes(w), alt)
        }
      } // for
    }

    val sum = distances.values.sum
    sum
  }
}
To provide somewhat of an answer to your original question, I suspect that your RDD only has a single partition and is therefore processed on a single core.
The edgeListFile method has an argument to specify the minimum number of partitions you want.
Also, you can use repartition to get more partitions.
You mentioned coalesce, but by default that only reduces the number of partitions; see this question: Spark Coalesce More Partitions
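A quick sketch of both options, mirroring the lines from your run method (the partition count of 16 is arbitrary; tune it to your cluster):
// Load the edge list with a minimum number of partitions
// (numEdgePartitions is a parameter of GraphLoader.edgeListFile).
val graph = GraphLoader.edgeListFile(sc, path, numEdgePartitions = 16).cache()

// Or repartition the vertices before mapping the expensive function over them.
val result = graph.vertices
  .repartition(16)
  .map(f => shortestPaths(f._1, bNeighbors.value))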
Hi, I wonder how to measure thread pool execution time in Scala.
Here is an example.
val pool = java.util.concurrent.Executors.newFixedThreadPool(2)
val start_time = System.nanoTime()
1 to 10 foreach { x =>
  pool.execute(
    new Runnable {
      def run {
        try {
          Thread.sleep(2000)
          println("n: %s, thread: %s".format(x, Thread.currentThread.getId))
        } finally {
          pool.shutdown()
        }
      }
    }
  )
}
val end_time = System.nanoTime()
println("time is "+(end_time - start_time)/(1e6 * 60 * 60))
But I think this is not working properly.
Is there any method to measure the time?
There are numerous threads in your snippet:
the main thread, where you create the fixed thread pool and execute the loop, and
the 10 tasks that sleep for 2 seconds and print some output.
Your main thread, as soon as it has submitted the 10 tasks, finishes its own work and prints the time. It does not wait for all the parallel tasks to complete.
What you have to do is await the results from all threads and only then compute the total time.
I would suggest you learn a bit about the concept of a Future, which will allow you to wait for results properly.
So your code might look like the following:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration.Duration
val start_time = System.nanoTime()
val zz = 1 to 10 map { x =>
Future {
Thread.sleep(2000)
println("n: %s, thread: %s".format(x, Thread.currentThread.getId))
}
}
Await.result(Future.sequence(zz), Duration.Inf)
val end_time = System.nanoTime()
println("time is " + (end_time - start_time) / (1e6 * 60 * 60))
I've used the default global Scala thread pool.
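Alternatively, if you want to keep the original ExecutorService, a minimal sketch is to shut the pool down after submitting all tasks and await its termination before taking the end timestamp (the one-minute timeout here is arbitrary):
import java.util.concurrent.{Executors, TimeUnit}

val pool = Executors.newFixedThreadPool(2)
val startTime = System.nanoTime()

(1 to 10).foreach { x =>
  pool.execute(new Runnable {
    def run(): Unit = {
      Thread.sleep(2000)
      println(s"n: $x, thread: ${Thread.currentThread.getId}")
    }
  })
}

pool.shutdown()                            // stop accepting new tasks
pool.awaitTermination(1, TimeUnit.MINUTES) // block until all submitted tasks finish
val endTime = System.nanoTime()
println("time is " + (endTime - startTime) / 1e6 + " ms")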
I want to launch two or more Futures/Promises in parallel and fail as soon as any one of them fails, without waiting for the rest to complete.
What is the most idiomatic way to compose this pipeline in Scala?
EDIT: more contextual information.
I have to launch two external processes, one writing to a fifo file and another reading from it. Say the writer process fails; the reader thread might then hang forever waiting for input from the file. So I want to launch both processes in parallel and fail fast if either Future/Promise fails, without waiting for the other to complete.
Below is some sample code to be more precise. The commands are not exactly cat and tail; I have used them for brevity.
val future1 = Future { executeShellCommand("cat file.txt > fifo.pipe") }
val future2 = Future { executeShellCommand("tail fifo.pipe") }
If I understand the question correctly, what we are looking for is a fail-fast sequence implementation, which is akin to a failure-biased version of firstCompletedOf.
Here, we eagerly register a failure callback in case one of the futures fails early on, ensuring that we fail as soon as any of the futures fail.
import scala.concurrent.{Future, Promise}
import scala.util.{Success, Failure}
import scala.concurrent.ExecutionContext.Implicits.global
def failFast[T](futures: Seq[Future[T]]): Future[Seq[T]] = {
  val promise = Promise[Seq[T]]
  // tryFailure avoids an IllegalStateException if more than one future fails
  futures.foreach { f => f.onFailure { case ex => promise.tryFailure(ex) } }
  val res = Future.sequence(futures)
  promise.completeWith(res).future
}
In contrast to Future.sequence, this implementation will fail as soon as any of the futures fail, regardless of ordering.
Let's show that with an example:
import scala.util.Try
// helper method to measure time
def resilientTime[T](t: => T): (Try[T], Long) = {
  val t0 = System.currentTimeMillis
  val res = Try(t)
  (res, System.currentTimeMillis - t0)
}
import scala.concurrent.duration._
import scala.concurrent.Await
The first future will fail (failure after 2 seconds):
val f1 = Future[Int]{Thread.sleep(2000); throw new Exception("boom")}
val f2 = Future[Int]{Thread.sleep(5000); 42}
val f3 = Future[Int]{Thread.sleep(10000); 101}
val res = failFast(Seq(f1,f2,f3))
resilientTime(Await.result(res, 10.seconds))
// res: (scala.util.Try[Seq[Int]], Long) = (Failure(java.lang.Exception: boom),1998)
The last future will fail, also after 2 seconds (note the order in the sequence construction):
val f1 = Future[Int]{Thread.sleep(2000); throw new Exception("boom")}
val f2 = Future[Int]{Thread.sleep(5000); 42}
val f3 = Future[Int]{Thread.sleep(10000); 101}
val res = failFast(Seq(f3,f2,f1))
resilientTime(Await.result(res, 10.seconds))
// res: (scala.util.Try[Seq[Int]], Long) = (Failure(java.lang.Exception: boom),1998)
Compare with Future.sequence, where the time to failure depends on the ordering (failure after 10 seconds):
val f1 = Future[Int]{Thread.sleep(2000); throw new Exception("boom")}
val f2 = Future[Int]{Thread.sleep(5000); 42}
val f3 = Future[Int]{Thread.sleep(10000); 101}
val seq = Seq(f3,f2,f1)
resilientTime(Await.result(Future.sequence(seq), 10.seconds))
//res: (scala.util.Try[Seq[Int]], Long) = (Failure(java.lang.Exception: boom),10000)
Use Future.sequence:
val both = Future.sequence(Seq(
  firstFuture,
  secondFuture))
This is the correct way to aggregate two or more futures where the failure of one fails the aggregated future and the aggregated future completes when all inner futures complete. An older version of this answer suggested a for-comprehension, which, while very common, would not reject immediately if one of the futures rejects, but rather wait for it.
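For reference, a sketch of the for-comprehension pattern referred to above, reusing the firstFuture and secondFuture names from the snippet: because the comprehension desugars to flatMap, a failure of secondFuture is only observed after firstFuture has completed.
// Both futures are assumed to be already running; the comprehension only
// chains their results sequentially, so it does not fail fast on secondFuture.
val both = for {
  a <- firstFuture
  b <- secondFuture
} yield (a, b)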
Zip the futures
val f1 = Future { doSomething() }
val f2 = Future { doSomething() }
val resultF = f1 zip f2
The resultF future fails if either f1 or f2 fails.
Time taken to resolve is min(f1time, f2time)
scala> import scala.util._
import scala.util._
scala> import scala.concurrent._
import scala.concurrent._
scala> import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.ExecutionContext.Implicits.global
scala> val f = Future { Thread.sleep(10000); throw new Exception("f") }
f: scala.concurrent.Future[Nothing] = scala.concurrent.impl.Promise$DefaultPromise#da1f03e
scala> val g = Future { Thread.sleep(20000); throw new Exception("g") }
g: scala.concurrent.Future[Nothing] = scala.concurrent.impl.Promise$DefaultPromise#634a98e3
scala> val x = f zip g
x: scala.concurrent.Future[(Nothing, Nothing)] = scala.concurrent.impl.Promise$DefaultPromise#3447e854
scala> x onComplete { case Success(x) => println(x) case Failure(th) => println(th)}
result: java.lang.Exception: f