How to create RDD from within Task? - apache-spark

Normally when creating an RDD from a List you can just use the SparkContext.parallelize method, but you can't use the SparkContext from within a task as it isn't serializable. I need to create an RDD from a list of Strings from within a task. Is there a way to do this?
I've tried creating a new SparkContext in the task, but it gives me an error about not supporting multiple Spark contexts in the same JVM and says that I need to set spark.driver.allowMultipleContexts = true. According to the Apache user group, however, that setting does not yet seem to be supported.

As far as I am concerned it is not possible, and it is hardly a matter of serialization or of support for multiple Spark contexts. The fundamental limitation is the core Spark architecture. Since the Spark context is maintained by the driver and tasks are executed on the workers, creating an RDD from inside a task would require pushing changes from workers to the driver. I am not saying it is technically impossible, but the whole idea seems rather cumbersome.
Creating a Spark context from inside tasks looks even worse. First of all it would mean that the context is created on the workers, which for all practical purposes don't communicate with each other. Each worker would get its own context which could operate only on data that is accessible on the given worker. Finally, preserving worker state is definitely not part of the contract, so any context created inside a task should simply be garbage collected after the task is finished.
If handling the problem using multiple jobs is not an option, you can try to use mapPartitions like this:
val rdd = sc.parallelize(1 to 100)

val tmp = rdd.mapPartitions(iter => {
  val results = Map(
    "odd"  -> scala.collection.mutable.ArrayBuffer.empty[Int],
    "even" -> scala.collection.mutable.ArrayBuffer.empty[Int]
  )

  for (i <- iter) {
    if (i % 2 != 0) results("odd") += i
    else results("even") += i
  }

  Iterator(results)
})

val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))
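One thing worth noting: tmp is traversed twice, once for odd and once for even, so without caching the mapPartitions work is recomputed for each downstream action. A minimal sketch of the same code with caching added (names as above):

```scala
// Cache the intermediate RDD so the partition-level bucketing
// runs only once rather than once per downstream action.
tmp.cache()

val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))

// The first action materialises the cache; the second reuses it.
val oddCount = odd.count()
val evenCount = even.count()
```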

Related

Why would preferredLocations not be enforced on an empty Spark cluster?

My Spark job consists of 3 workers, co-located with the data they need to read. I submit an RDD of metadata and the job's tasks turn that metadata into real data. For instance, the metadata could name a file on the local worker filesystem, and the first stage of the Spark job would read that file into an RDD partition.
In my environment the data may not be present on all 3 workers and it is way too expensive to read across workers (i.e. if the data is on worker1 then worker2 can not reach out and fetch it). For this reason I have to force partitions onto the appropriate worker for the data they are reading. I have a mechanism for achieving this where I check the worker against the expected worker in the metadata and fail the task with a descriptive error message if they don't match. Using blacklisting I can ensure that the task is rescheduled on a different node until the right one is found. This works fine but as an optimization I wanted to use preferredLocations to help the tasks get assigned to the right workers initially without having to go through the try/reschedule process.
I use makeRDD to create my initial RDD (of metadata) with the correct preferredLocations as per the answer here: How to control preferred locations of RDD partitions?, however it's not exhibiting the behaviour I expect. The call to makeRDD is below:
sc.makeRDD(taskAssigments)
where taskAssignments takes the form:
val taskAssignments = mutable.ArrayBuffer[(String, Seq[String])]()
metadataMappings.foreach { case (k, v) =>
  taskAssignments += (k + ":" + v.mkString(",") -> Seq(idHostnameMappings(k)))
}
idHostnameMappings is just a map of id -> hostName, and I've verified that it contains the correct information.
Given that my test Spark cluster is completely clean, with no other jobs running on it and no skew in the input RDD (it has 3 partitions to match the 3 workers), I would have expected the tasks to be assigned to their preferredLocations. Instead I still see the error messages indicating that tasks are going through the fail/reschedule process.
Is my assumption that tasks would be scheduled at their preferredLocations on a clean cluster correct and is there anything further I can do to force this?
Follow up:
I was also able to create a much simpler test case. My 3 Spark workers are named worker1, worker2 and worker3, and I run the following:
import scala.collection.mutable
val someData = mutable.ArrayBuffer[(String, Seq[String])]()
someData += ("1" -> Seq("worker1"))
someData += ("2" -> Seq("worker2"))
someData += ("3" -> Seq("worker3"))
val someRdd = sc.makeRDD(someData)
someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
I'd expect to see 1:worker1 etc but in fact see
1:worker3
2:worker1
3:worker2
can anyone explain this behaviour?
It turned out the issue was with my environment, not Spark. Just in case anyone else is experiencing this: the problem was that the Spark workers did not use the machine hostname by default. Setting the following environment variable on each worker rectified it: SPARK_LOCAL_HOSTNAME: "worker1"
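For reference, a sketch of what that looks like in each worker's conf/spark-env.sh (the hostname is per-machine; "worker1" here is a placeholder for your own node name):

```shell
# conf/spark-env.sh on each worker node.
# Without this, a worker may register under an IP or a name that
# differs from the hostname used in the RDD's preferredLocations,
# so the scheduler never matches them up.
SPARK_LOCAL_HOSTNAME="worker1"
```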

Can we execute map() on List, which is generated from collect() in spark

I am curious to understand where Transformations execute (inside the Executor or the Driver) when they are specified after an action. Suppose below is the rough flow of transformations and actions:
val R1 = Rdd.map(...);
val R2 = R1.filter(...);
val R3 = R2.flatMap(...);
Until the above, the instructions execute on the Executors in a distributed manner.
val lst = R3.collect(); --> collect() will also be executed in a distributed manner, I suppose (please correct me if I am wrong). It sends the output back to the Driver.
Now CAN we execute map() like below?
lst.map(...)
If we can, then where will this code execute? on Driver or Executor?
Spark follows lazy evaluation, which means it will start the job only when an action is applied.
In your example R3.collect() is the action which triggers your entire lineage to run. Yes, collect() will run in a distributed way and will return all the transformed data to the Driver node. Once it is done, your lst variable is an in-memory collection (an Array).
Since lst is an Array, lst.map(...) is possible, and it will run on the Driver.
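The boundary can be made explicit in the types; an illustrative sketch (assuming an active SparkContext sc):

```scala
// Transformations on the RDD are planned for the executors:
val rdd = sc.parallelize(1 to 10)   // distributed collection
val doubled = rdd.map(_ * 2)        // runs on executors when an action fires

// collect() materialises the result as a plain local Array on the driver:
val lst: Array[Int] = doubled.collect()

// From here on this is ordinary Scala; map runs in the driver JVM:
val plusOne = lst.map(_ + 1)
```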
An important point to note: if you are working with a really large data set, doing collect is bad practice, as it will bring the entire data into the Driver, which often causes an OOM exception.
Let me know if this helps.

Why does folding dataframes cause a NullPointerException? [duplicate]

sessionIdList is of type :
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run below code :
val x = sc.parallelize(List(1,2,3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
I receive exception :
14/05/21 16:20:46 ERROR Executor: Exception in task ID 80
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:261)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:38)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
However if I use :
val l = sc.parallelize(List("1","2"))
val kDistanceNeighbourhood = l.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
Then no exception is displayed
The difference between the two code snippets is that in first snippet sessionIdList is of type :
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
and in second snippet "l" is of type
scala> l
res13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:12
Why is this error occurring?
Do I need to convert sessionIdList to ParallelCollectionRDD in order to fix this ?
Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations.
In the first case, you're seeing a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on the driver and not the workers.
In the second case, my hunch is the job was run locally on the driver and worked purely by accident.
It's a reasonable question and I have heard it asked enough times. I'm going to take a stab at explaining why this is true, because it might help.
Nested RDDs will always throw an exception in production. Nested function calls as I think you are describing them here, if that means calling an RDD operation inside an RDD operation, will also cause failures, since it is actually the same thing. (RDDs are immutable, so performing an RDD operation such as a "map" is equivalent to creating a new RDD.) The inability to create nested RDDs is a necessary consequence of the way an RDD is defined and the way a Spark application is set up.
An RDD is a distributed collection of objects (split into partitions) that live on the Spark executors. Spark executors cannot communicate with each other, only with the Spark driver. The RDD operations are all computed in pieces on these partitions. Because the RDD's executor environment isn't recursive (i.e. you cannot configure a Spark driver to be on a Spark executor with sub-executors), neither can an RDD be.
In your program, you have created a distributed collection of partitions of integers. You are then performing a mapping operation. When the Spark driver sees a mapping operation, it sends the instructions to do the mapping to the executors, who perform the transformation on each partition in parallel. But your mapping cannot be done, because on each partition you are trying to call the "whole RDD" to perform another distributed operation. This can't be done, because each partition does not have access to the information on the other partitions; if it did, the computation couldn't run in parallel.
What you can do instead, because the data you need in the map is probably small (since you are doing a filter, and the filter does not require any information about sessionIdList), is to first filter the session ID list, then collect that list to the driver, and then broadcast it to the executors, where you can use it in the map. If the sessionID list is too large, you will probably need to do a join.
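A rough sketch of that collect-then-broadcast pattern, using sessionIdList from the question (the filter predicate and the final map body are placeholders):

```scala
// 1. Filter the small RDD, then pull the result to the driver.
val ids: Array[String] = sessionIdList.filter(_ != null).collect()

// 2. Broadcast the (small) local collection to every executor.
val idsBc = sc.broadcast(ids.toSet)

// 3. Use the broadcast value inside a normal, non-nested operation.
val x = sc.parallelize(List(1, 2, 3))
val result = x.cartesian(x).map { pair =>
  // idsBc.value is a plain Set[String], available on each executor;
  // no RDD is referenced inside this closure.
  (pair, idsBc.value.size)
}
```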

How to execute computation per executor in Spark

In my computation, I
first broadcast some data, say bc,
then compute some big data shared by all executor/partition:val shared = f(bc)
then run the distributed computing, using shared data
To avoid computing the shared data for every RDD item, I can use .mapPartitions, but I have many more partitions than executors, so the shared data is computed more times than necessary.
I found a simple method to compute the shared data only once per executor (which, as I understand it, is the JVM actually running the Spark tasks): using a lazy val on the broadcast data.
// class to be Broadcast
case class BC(input: WhatEver) {
  lazy val shared = f(input)
}

// in the spark code
val sc = ...                    // init SparkContext
val bc = sc.broadcast(BC(...))
val initRdd = sc.parallelize(1 to 10000, numSlices = 10000)
initRdd.map { i =>
  val shared = bc.value.shared
  ...                           // run computation using shared data
}
I think this does what I want, but
I am not sure; can someone confirm it?
I am not sure lazy val is the best way to manage concurrent access, especially with respect to Spark's internal distribution system. Is there a better way?
if computing shared fails, I think it will be recomputed for all RDD items, with possible retries, instead of simply stopping the whole job with a single error.
So, is there a better way?
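On the concurrency question: Scala initialises a lazy val under a lock, so concurrent tasks on the same executor should block until a single computation of shared finishes, and the broadcast value is deserialised once per executor JVM. An alternative sometimes suggested is an explicit per-JVM holder object; a sketch only, reusing the question's placeholder names (WhatEver, f) plus a hypothetical Shared result type:

```scala
// Sketch: an `object` exists at most once per executor JVM, so the
// expensive computation runs once per executor, guarded by a lock.
object SharedHolder {
  @volatile private var cached: Option[Shared] = None

  def get(input: WhatEver): Shared = synchronized {
    cached.getOrElse {
      val s = f(input)   // the expensive shared computation
      cached = Some(s)
      s
    }
  }
}

// In the job: tasks fetch (or reuse) the per-executor shared state.
initRdd.map { i =>
  val shared = SharedHolder.get(bc.value.input)
  // ... run computation using shared data ...
}
```

One practical difference from the lazy val approach is that a failure inside f here surfaces as an ordinary task exception each time get is called, so you can add your own retry or fail-fast logic in the holder.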

Why does foreach not bring anything to the driver program?

I wrote this program in spark shell
val array = sc.parallelize(List(1, 2, 3, 4))
array.foreach(x => println(x))
this prints some debug statements but not the actual numbers.
The code below works fine
for(num <- array.take(4)) {
println(num)
}
I do understand that take is an action and therefore will cause spark to trigger the lazy computation.
But foreach should have worked the same way... why did foreach not bring anything back from Spark and start doing the actual processing (get out of lazy mode)?
How can I make the foreach on the rdd work?
The RDD.foreach method in Spark runs on the cluster, so each worker which contains these records is running the operation in foreach. I.e. your code is running, but the numbers are printed to the Spark workers' stdout, not to the driver/your shell session. If you look at the stdout output for your Spark workers, you will see them printed to the console.
You can view the stdout on the workers by going to the web gui running for each running executor. An example URL is http://workerIp:workerPort/logPage/?appId=app-20150303023103-0043&executorId=1&logType=stdout
In this example Spark chooses to put all the records of the RDD in the same partition.
This makes sense if you think about it - look at the function signature for foreach - it doesn't return anything.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
This is really the purpose of foreach in Scala - it's used for side effects.
When you collect records, you bring them back into the driver, so the collect/take operations are just returning a Scala collection within the Spark driver - you can see the output because the Spark driver/Spark shell is what's printing to stdout in your session.
A use case of foreach may not seem immediately apparent. An example: if for each record in the RDD you wanted to do some external behaviour, like calling a REST API, you could do this in the foreach; then each Spark worker would submit a call to the API server with the value. If foreach did bring back records, you could easily blow out the memory in the driver/shell process. This way you avoid those issues and can apply side effects to all the items in an RDD across the cluster.
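A common refinement of that pattern is foreachPartition, so per-partition setup such as an HTTP client is built once per partition rather than once per record. A sketch only; ApiClient and its methods are hypothetical stand-ins for a real client library:

```scala
// Side-effecting on the executors with per-partition setup.
// None of this runs on the driver.
rdd.foreachPartition { records =>
  val client = new ApiClient("http://example.invalid/ingest") // once per partition
  try {
    records.foreach(rec => client.post(rec)) // one call per record
  } finally {
    client.close()
  }
}
```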
If you want to see what's in an RDD, I use:
array.collect.foreach(println)
//Instead of collect, use take(...) or takeSample(...) if the RDD is large
You can use RDD.toLocalIterator() to bring the data to the driver (one RDD partition at a time):
val array = sc.parallelize(List(1, 2, 3, 4))
for(rec <- array.toLocalIterator) { println(rec) }
See also
Spark: Best practice for retrieving big data from RDD to local machine
this blog post about toLocalIterator
