How to execute computation per executor in Spark

In my computation, I:
1. first broadcast some data, say bc,
2. then compute some big data shared by all executors/partitions: val shared = f(bc),
3. then run the distributed computation using that shared data.
To avoid computing the shared data once per RDD item, I can use .mapPartitions, but I have many more partitions than executors, so it still runs the shared computation more times than necessary.
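For reference, a minimal sketch of that mapPartitions variant (rdd and compute are placeholders; f and bc are the names used above):

rdd.mapPartitions { iter =>
  val shared = f(bc.value)           // still evaluated once per partition, not once per executor
  iter.map(i => compute(i, shared))  // per-item work using the shared data
}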
I found a simple method to compute the shared data only once per executor (which, as I understand it, is the JVM actually running the Spark tasks): using a lazy val on the broadcast data.
// class to be broadcast
case class BC(input: WhatEver) {
  lazy val shared = f(input)
}

// in the Spark code
val sc = ... // init SparkContext
val bc = sc.broadcast(BC(...))
val initRdd = sc.parallelize(1 to 10000, numSlices = 10000)
initRdd.map { i =>
  val shared = bc.value.shared // computed lazily on first access
  ... // run computation using the shared data
}
I think this does what I want, but:
1. I am not sure; can someone guarantee it?
2. I am not sure lazy val is the best way to manage concurrent access, especially with respect to Spark's internal distribution system. Is there a better way?
3. If computing shared fails, I think it will be recomputed for all RDD items, with possible retries, instead of simply stopping the whole job with a single error.
So, is there a better way?

Related

Why does Spark need to serialize data in an RDD for each task it runs?

Even with a .cache()d RDD, Spark still seems to serialize the data for each task run. Consider this code:
import java.io.{Externalizable, ObjectInput, ObjectOutput}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

class LoggingSerializable() extends Externalizable {
  override def writeExternal(out: ObjectOutput): Unit = {
    println("xxx serializing")
  }
  override def readExternal(in: ObjectInput): Unit = {
    println("xxx deserializing")
  }
}

object SparkSer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSer").setMaster("local")
    val spark = new SparkContext(conf)
    val rdd: RDD[LoggingSerializable] = spark.parallelize(Seq(new LoggingSerializable())).cache()
    println("xxx done loading")
    rdd.foreach(ConstantClosure)
    println("xxx done 1")
    rdd.foreach(ConstantClosure)
    println("xxx done 2")
    spark.stop()
  }
}

object ConstantClosure extends (LoggingSerializable => Unit) with Serializable {
  def apply(t: LoggingSerializable): Unit = {
    println("xxx closure ran")
  }
}
It prints
xxx done loading
xxx serializing
xxx deserializing
xxx closure ran
xxx done 1
xxx serializing
xxx deserializing
xxx closure ran
xxx done 2
Even though I called .cache() on rdd, Spark still serializes the data for each call to .foreach. The official docs say
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
and that MEMORY_ONLY means
Store RDD as deserialized Java objects in the JVM.
Note that Spark serializes the data when it's serializing the task, but ConstantClosure does not close over anything, so I don't understand why it would need to serialize any data.
I am asking because I would like to be able to run Spark in local mode without any performance loss, but having to serialize large elements in an RDD for each RDD action can be very costly. I am not sure if this problem is unique to local mode. It seems like Spark can't possibly send the data in an RDD over the wire to workers for every action, even when the RDD is cached.
I'm using spark-core 3.0.0.
This is because you are using parallelize. parallelize uses a special RDD, ParallelCollectionRDD, which puts the data into Partitions. A Partition defines a Spark task, and it will be sent to executors inside a Spark task (ShuffleMapTask or ResultTask). If you print the stack trace in readExternal and writeExternal, you should be able to see that it happens when serializing and deserializing a Spark task.
In other words, the data is part of the Spark task metadata for ParallelCollectionRDD, and Spark has to send tasks to the executors to run them; that's where the serialization happens.
Most other RDDs read their data from external systems (such as files), so they don't have this behavior.
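For contrast, a sketch of the same experiment backed by an external source rather than parallelize ("data.txt" is a placeholder path): here only the task and its small closure are serialized per task, and the elements themselves are read on the executor.

val lines = spark.textFile("data.txt").cache()   // "data.txt" is a placeholder
lines.foreach(line => println(s"xxx saw $line")) // no element data travels inside the task itself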
I agree that behavior looks surprising. Off the top of my head, I might guess that it's because caching the blocks is asynchronous, and all of this happens very fast. It's possible it simply does not wait around for the cached partition to become available and recomputes it the second time.
To test that hypothesis, introduce a lengthy wait before the second foreach just to see if that changes things.
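A minimal sketch of that test, dropped into the main method above (the five-second pause is arbitrary):

rdd.foreach(ConstantClosure)
println("xxx done 1")
Thread.sleep(5000) // give any asynchronous caching time to finish before the second action
rdd.foreach(ConstantClosure)
println("xxx done 2")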

Optimal (low-latency) spark settings for small datasets

I'm aware that Spark is designed for large datasets, for which it's great. But under certain circumstances I don't need this scalability, e.g. for unit tests or for data exploration on small datasets. Under these conditions Spark performs relatively badly compared to a pure Scala/Python/MATLAB/R implementation.
Note that I don't want to drop Spark entirely; I want to keep the framework for larger workloads without re-implementing everything.
How can I disable Spark's overhead as much as possible on small datasets (say tens to thousands of records)? I've tried using only 1 partition in local mode (setting spark.sql.shuffle.partitions=1 and spark.default.parallelism=1). Even with these settings, simple queries on 100 records take on the order of 1-2 seconds.
Note that I'm not trying to reduce the time for SparkSession instantiation, just the execution time given that the SparkSession already exists.
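For concreteness, a sketch of the kind of local, low-latency setup in question, using the settings mentioned above (disabling the UI is added as an extra, unverified guess, not something the question reports trying):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")                          // single local thread
  .appName("small-data")
  .config("spark.sql.shuffle.partitions", "1") // avoid 200 shuffle partitions on tiny data
  .config("spark.default.parallelism", "1")
  .config("spark.ui.enabled", "false")         // assumption: skipping the UI trims a little per-job overhead
  .getOrCreate()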
Operations in Spark have the same signatures as the Scala collections.
You could implement something like:
import org.apache.spark.rdd.RDD

val useSpark = false
val rdd: RDD[String] = ???   // the Spark-backed data, however you obtain it
val list: List[String] = Nil // the plain Scala data
def mapping: String => Int = s => s.length

val lengths: Seq[Int] =
  if (useSpark) rdd.map(mapping).collect().toSeq // distributed path
  else list.map(mapping)                         // local path, no Spark overhead
I think this code could be abstracted even more.
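One possible direction for that abstraction (a sketch, not a worked-out design): hide both backends behind a minimal trait so call sites never branch on useSpark themselves.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

trait MiniColl[A] {
  def map[B: ClassTag](f: A => B): MiniColl[B]
  def toSeq: Seq[A]
}

final case class LocalColl[A](xs: List[A]) extends MiniColl[A] {
  def map[B: ClassTag](f: A => B): MiniColl[B] = LocalColl(xs.map(f))
  def toSeq: Seq[A] = xs
}

final case class SparkColl[A](rdd: RDD[A]) extends MiniColl[A] {
  def map[B: ClassTag](f: A => B): MiniColl[B] = SparkColl(rdd.map(f))
  def toSeq: Seq[A] = rdd.collect().toSeq
}

// pick a backend once; downstream code stays identical
val data: MiniColl[String] = if (useSpark) SparkColl(rdd) else LocalColl(list)
val lens = data.map(mapping).toSeq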

Why does folding dataframes cause a NullPointerException? [duplicate]

sessionIdList is of type:
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run the code below:
val x = sc.parallelize(List(1, 2, 3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
  cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
I receive this exception:
14/05/21 16:20:46 ERROR Executor: Exception in task ID 80
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:261)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:38)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
However, if I use:
val l = sc.parallelize(List("1", "2"))
val kDistanceNeighbourhood = l.map(s => {
  cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
then no exception is displayed.
The difference between the two code snippets is that in the first snippet sessionIdList is of type:
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
and in the second snippet "l" is of type:
scala> l
res13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:12
Why is this error occurring?
Do I need to convert sessionIdList to a ParallelCollectionRDD in order to fix this?
Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations.
In the first case, you're seeing a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on the driver and not the workers.
In the second case, my hunch is the job was run locally on the driver and worked purely by accident.
It's a reasonable question, and I have heard it asked enough times that I'm going to take a stab at explaining why this is true, because it might help.
Nested RDDs will always throw an exception in production. Nested function calls, as I think you are describing them here, meaning calling an RDD operation inside an RDD operation, will also cause failures, since it is actually the same thing. (RDDs are immutable, so performing an RDD operation such as a "map" is equivalent to creating a new RDD.) The inability to create nested RDDs is a necessary consequence of the way an RDD is defined and the way a Spark application is set up.
An RDD is a distributed collection of objects, split into partitions that live on the Spark executors. Spark executors cannot communicate with each other, only with the Spark driver. The RDD operations are all computed in pieces on these partitions. Because the RDD's executor environment isn't recursive (i.e. you cannot configure a Spark driver to run on a Spark executor with its own sub-executors), an RDD cannot be nested either.
In your program, you have created a distributed collection of partitions of integers. You then perform a mapping operation. When the Spark driver sees a mapping operation, it sends the instructions to do the mapping to the executors, which perform the transformation on each partition in parallel. But your mapping cannot be done, because on each partition you are trying to call the "whole RDD" to perform another distributed operation. This cannot be done, because each partition does not have access to the information on the other partitions; if it did, the computation couldn't run in parallel.
What you can do instead, because the data you need in the map is probably small (since you are doing a filter, and the filter does not require any information about sessionIdList), is to first filter the session ID list, then collect that list to the driver, and then broadcast it to the executors, where you can use it in the map (sketched below). If the session ID list is too large, you will probably need to do a join.
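A rough sketch of that approach, using the question's names (keepSession is a placeholder for whatever per-session predicate applies; the per-pair logic is illustrative only):

val sessionIds: Set[String] = sessionIdList.filter(keepSession).collect().toSet // small after filtering, safe to collect
val sessionIdsBc = sc.broadcast(sessionIds)

val neighbourhood = x.cartesian(x).map { pair =>
  // executor-side code can read the broadcast session IDs directly, with no nested RDD
  (pair, sessionIdsBc.value.size)
}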

How to create RDD from within Task?

Normally, when creating an RDD from a List you can just use the SparkContext.parallelize method, but you cannot use the SparkContext from within a task as it's not serializable. I need to create an RDD from a list of Strings from within a task. Is there a way to do this?
I've tried creating a new SparkContext in the task, but it gives me an error about not supporting multiple Spark contexts in the same JVM and saying that I need to set spark.driver.allowMultipleContexts = true. According to the Apache user group, that setting does not yet seem to be supported, however.
As far as I am concerned, it is not possible, and it is hardly a matter of serialization or of support for multiple Spark contexts. The fundamental limitation is the core Spark architecture. Since the Spark context is maintained by the driver and tasks are executed on the workers, creating an RDD from inside a task would require pushing changes from the workers to the driver. I am not saying it is technically impossible, but the whole idea seems rather cumbersome.
Creating a Spark context from inside tasks looks even worse. First of all, it would mean that the context is created on the workers, which for all practical purposes don't communicate with each other. Each worker would get its own context, which could operate only on data that is accessible on that worker. Finally, preserving worker state is definitely not part of the contract, so any context created inside a task should simply be garbage collected after the task finishes.
If handling the problem using multiple jobs is not an option, you can try to use mapPartitions like this:
val rdd = sc.parallelize(1 to 100)
val tmp = rdd.mapPartitions(iter => {
  val results = Map(
    "odd"  -> scala.collection.mutable.ArrayBuffer.empty[Int],
    "even" -> scala.collection.mutable.ArrayBuffer.empty[Int]
  )
  for (i <- iter) {
    if (i % 2 != 0) results("odd") += i
    else results("even") += i
  }
  Iterator(results)
})
val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))
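Note that, as written, the per-partition work runs once when odd is evaluated and again for even, since each downstream action triggers its own computation; if that work is expensive, calling tmp.cache() first should avoid the recomputation.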

How to create Spark RDD from an iterator?

To make it clear, I am not looking for RDD from an array/list like
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7); // sample
JavaRDD<Integer> rdd = new JavaSparkContext().parallelize(list);
How can I create a spark RDD from a java iterator without completely buffering it in memory?
Iterator<Integer> iterator = Arrays.asList(1, 2, 3, 4).iterator(); //sample iterator for illustration
JavaRDD<Integer> rdd = new JavaSparkContext().what("?", iterator); //the Question
Additional Question:
Is it a requirement for the source to be re-readable (or capable of being read many times) in order to offer resilience for an RDD? In other words, since iterators are fundamentally read-once, is it even possible to create Resilient Distributed Datasets (RDDs) from iterators?
As somebody else said, you could do something with Spark Streaming, but as for pure Spark, you can't, and the reason is that what you're asking goes against Spark's model. Let me explain.
To distribute and parallelize work, Spark has to divide it into chunks. When reading from HDFS, that 'chunking' is done for Spark by HDFS, since HDFS files are organized in blocks. Spark will generally generate one task per block.
Now, iterators only provide sequential access to your data, so it's impossible for Spark to organize it into chunks without reading it all into memory.
It may be possible to build an RDD that has a single iterable partition, but even then, it is impossible to say whether the implementation of the Iterable could be sent to workers. When using sc.parallelize(), Spark creates partitions that implement Serializable so that each partition can be sent to a different worker. The iterable could be backed by a network connection or a file in the local FS, so it cannot be sent to the workers unless it is buffered in memory.
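To make the limitation concrete, the closest plain-Spark equivalent still has to drain the iterator on the driver first (a sketch in Scala; sc is assumed to be an existing SparkContext):

import scala.collection.JavaConverters._

val it: java.util.Iterator[Int] = java.util.Arrays.asList(1, 2, 3, 4).iterator()
val rdd = sc.parallelize(it.asScala.toList) // toList buffers the whole iterator in driver memory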
Super old question but I would just create the iterators in a flatMap after serialization.
var ranges = Arrays.asList(Pair.of(1,7), Pair.of(0,5));
JavaRDD<Integer> data = sparkContext.parallelize(ranges)
    .flatMap(pair -> Flux.range(pair.left(), pair.right()).toStream().iterator());
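The idea is that only the small range descriptors are parallelized; each executor then materializes its own iterator inside the flatMap, so the full data never has to be buffered on the driver. Note that this assumes the elements can be regenerated from a compact description, which is not quite the same as starting from an arbitrary, read-once iterator.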

Resources