Why does foreach not bring anything to the driver program? - apache-spark

I wrote this program in spark shell
val array = sc.parallelize(List(1, 2, 3, 4))
array.foreach(x => println(x))
This prints some debug statements but not the actual numbers.
The code below works fine:
for(num <- array.take(4)) {
println(num)
}
I understand that take is an action and therefore causes Spark to trigger the lazy computation.
But foreach should have worked the same way. Why did foreach not bring anything back from Spark and start the actual processing (that is, leave lazy mode)?
How can I make foreach on the RDD work?

The RDD.foreach method in Spark runs on the cluster, so each worker that holds some of these records runs the operations in foreach. In other words, your code is running, but it prints to the Spark workers' stdout, not to the driver/your shell session. If you look at the output (stdout) of your Spark workers, you will see the numbers printed there.
You can view the stdout on the workers by going to the web GUI for each running executor. An example URL is http://workerIp:workerPort/logPage/?appId=app-20150303023103-0043&executorId=1&logType=stdout
In this example Spark chooses to put all the records of the RDD in the same partition.
This makes sense if you think about it - look at the function signature for foreach - it doesn't return anything.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
This is really the purpose of foreach in Scala: it's used for side effects.
When you collect records, you bring them back into the driver, so logically collect/take operations are just running on a Scala collection within the Spark driver. You can see the output because the Spark driver/Spark shell is what's printing to stdout in your session.
A use case for foreach may not seem immediately apparent. As an example, if for each record in the RDD you wanted to trigger some external behaviour, like calling a REST API, you could do this in the foreach; each Spark worker would then submit a call to the API server with the value. If foreach did bring back records, you could easily blow out the memory in the driver/shell process. This way you avoid these issues and can perform side effects on all the items in an RDD across the cluster.
If you want to see what's in an RDD, I use:
array.collect.foreach(println)
//Instead of collect, use take(...) or takeSample(...) if the RDD is large
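To make the difference concrete, here is a minimal sketch, assuming Spark 2.0 or later (for longAccumulator) and a local master. The object and method names (ForeachVsTake, sumViaForeach, firstN) are made up for illustration. foreach returns Unit, so the only thing that reaches the driver is the accumulator value, whereas take returns the elements themselves.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names are illustrative, not part of Spark.
object ForeachVsTake {
  // Adds every element to an accumulator inside foreach (executor side)
  // and reads the total back on the driver.
  def sumViaForeach(sc: SparkContext, data: Seq[Int]): Long = {
    val rdd = sc.parallelize(data)
    val total = sc.longAccumulator("total")
    // This closure is shipped to the executors; a println here would go
    // to the executors' stdout, not to the driver's console.
    rdd.foreach(x => total.add(x))
    total.value
  }

  // Brings at most n elements back to the driver as a local Array.
  def firstN(sc: SparkContext, data: Seq[Int], n: Int): Array[Int] =
    sc.parallelize(data).take(n)

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so that the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("foreach-vs-take").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      println(sumViaForeach(sc, List(1, 2, 3, 4)))
      firstN(sc, List(1, 2, 3, 4), 4).foreach(println) // printed by the driver
    } finally sc.stop()
  }
}
```

Note that with a local master the executors share the driver's JVM, so a println inside foreach may still show up in your console; on a real cluster it would not.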

You can use RDD.toLocalIterator() to bring the data to the driver (one RDD partition at a time):
val array = sc.parallelize(List(1, 2, 3, 4))
for(rec <- array.toLocalIterator) { println(rec) }
See also
Spark: Best practice for retrieving big data from RDD to local machine
this blog post about toLocalIterator

Related

Can we execute map() on List, which is generated from collect() in spark

I am curious to understand where transformations execute (inside an executor or on the driver) when they are specified after an action. Suppose below is the rough flow of transformations and actions.
val R1 = Rdd.map(...);
val R2 = R1.filter(...);
val R3 = R2.flatMap(...);
Until the above, the instructions execute on the executors in a distributed manner.
val lst = R3.collect(); --> Collect will also be executed in a distributed manner, I suppose. Please correct me if I am wrong. It sends back the output to the driver.
Now, can we execute map() like below?
lst.map(...)
If we can, then where will this code execute: on the driver or on an executor?
Spark follows lazy evaluation, which means it starts the job only when an action is applied.
In your example, R3.collect() is the action that triggers your entire lineage to run. Yes, collect() runs in a distributed way and returns all the transformed data to the driver node. Once it is done, your lst variable is an in-memory collection (an array).
Since lst is an array, lst.map(...) is possible, and it runs on the driver.
An important point to note: if you are working with a really large data set, calling collect is bad practice, as it brings the entire data set into the driver, which often results in an OOM exception.
Let me know if this helps.
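The flow described above can be sketched as follows. This is only an illustration under assumptions: the input range and the three lambdas are invented, since the question elides them, and a local master is used so the sketch runs without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the transformations are placeholders for the
// elided map(...), filter(...) and flatMap(...) in the question.
object CollectThenMap {
  def run(sc: SparkContext): Array[String] = {
    val r1 = sc.parallelize(1 to 10).map(_ * 2)   // lazy, will run on executors
    val r2 = r1.filter(_ % 3 == 0)                 // lazy, will run on executors
    val r3 = r2.flatMap(x => Seq(x, x + 1))        // lazy, will run on executors
    val lst: Array[Int] = r3.collect()             // action: runs the job, result lands on the driver
    // lst is a plain Scala Array, so this map is ordinary Scala code
    // executed in the driver JVM; no Spark job is started for it.
    lst.map(n => s"value=$n")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("collect-then-map").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

If the collected data is large, keeping the last map on the RDD (before collect) lets the executors share that work instead of the driver.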

Why does folding dataframes cause a NullPointerException? [duplicate]

sessionIdList is of type:
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run the code below:
val x = sc.parallelize(List(1,2,3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
I receive this exception:
14/05/21 16:20:46 ERROR Executor: Exception in task ID 80
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:261)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:38)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
However, if I use:
val l = sc.parallelize(List("1","2"))
val kDistanceNeighbourhood = l.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
then no exception is displayed.
The difference between the two code snippets is that in the first snippet sessionIdList is of type:
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
and in the second snippet "l" is of type:
scala> l
res13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:12
Why is this error occurring?
Do I need to convert sessionIdList to a ParallelCollectionRDD in order to fix this?
Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations.
In the first case, you're seeing a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on the driver and not the workers.
In the second case, my hunch is the job was run locally on the driver and worked purely by accident.
It's a reasonable question, and I have heard it asked enough times that I'm going to take a stab at explaining why this is true, because it might help.
Nested RDDs will always throw an exception in production. Nested function calls as I think you are describing them here, if that means calling an RDD operation inside an RDD operation, will also cause failures, since it is actually the same thing. (RDDs are immutable, so performing an RDD operation such as a "map" is equivalent to creating a new RDD.) The inability to create nested RDDs is a necessary consequence of the way an RDD is defined and the way a Spark application is set up.
An RDD is a distributed collection of objects, split into pieces (called partitions) that live on the Spark executors. Spark executors cannot communicate with each other, only with the Spark driver. The RDD operations are all computed in pieces on these partitions. Because the RDD's executor environment isn't recursive (i.e. you cannot configure a Spark driver to run on a Spark executor with sub-executors), neither is an RDD.
In your program, you have created a distributed collection of partitions of integers. You are then performing a mapping operation. When the Spark driver sees a mapping operation, it sends the instructions to do the mapping to the executors, which perform the transformation on each partition in parallel. But your mapping cannot be done, because on each partition you are trying to call the "whole RDD" to perform another distributed operation. This cannot be done, because each partition does not have access to the information on the other partitions; if it did, the computation couldn't run in parallel.
What you can do instead, because the data you need in the map is probably small (since you are doing a filter, and the filter does not require any information about sessionIdList), is to first filter the session ID list. Then collect that list to the driver. Then broadcast it to the executors, where you can use it in the map. If the session ID list is too large, you will probably need to do a join.
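A minimal sketch of the collect-then-broadcast pattern, under assumptions: it broadcasts the small cartesian result from the question rather than the session ID list, the session IDs "s1" and "s2" are invented sample data, and the per-session computation (counting the broadcast pairs) is only a placeholder for whatever the real map needs to do.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names and sample data are illustrative only.
object BroadcastInsteadOfNesting {
  def run(sc: SparkContext): Array[(String, Int)] = {
    val x = sc.parallelize(List(1, 2, 3))
    // Run the inner RDD computation as its own job and bring the (small)
    // result back to the driver as a local array.
    val pairs: Array[(Int, Int)] = x.cartesian(x).filter(_ != null).collect()
    // Ship one read-only copy of the local array to each executor.
    val pairsBc = sc.broadcast(pairs)

    val sessionIdList = sc.parallelize(List("s1", "s2", "s1")).distinct()
    // Inside the map we only touch the broadcast value, never another RDD.
    sessionIdList.map(s => (s, pairsBc.value.length)).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-demo").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

This only works while the collected data fits comfortably in driver and executor memory; beyond that, a join between the two RDDs is the usual route.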

rdd.saveAsTextFile executes on driver or workers inside foreachRDD.

Below is my code snippet. I have a DStream which I am trying to save to HDFS. I just wanted to know an efficient way to do this with compression.
pairedDStream.foreachRDD { rdd =>
val time = Calendar.getInstance.getTimeInMillis;
val textOutputFolder = outputDir + "/output-" + time
if (args.length == 4) {
val compressionCodec = args(3)
rdd.saveAsTextFile(textOutputFolder, CommonUtils.getCompressionCodec(compressionCodec))
} else {
rdd.saveAsTextFile(textOutputFolder, CommonUtils.getCompressionCodec(null))
}
}
rdd.saveAsTextFile is executed on the worker nodes. The function passed to dstream.foreachRDD itself runs on the driver, but the RDD operations inside it are executed in parallel on the workers. The Spark documentation says we should use this DStream operation for pushing the data in each RDD to an external system.
foreachRDD(func): The most generic output operator that applies a
function, func, to each RDD generated from the stream. This function
should push the data in each RDD to an external system, such as saving
the RDD to files, or writing it over the network to a database. Note
that the function func is executed in the driver process running the
streaming application, and will usually have RDD actions in it that
will force the computation of the streaming RDDs.
The "Design Patterns for using foreachRDD" section also clearly states that dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. You can read that section further to learn how to optimize operations on the RDDs in a DStream.
Hope this helps!
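The split between driver and workers can be sketched as below. This is an illustration under assumptions, not the asker's pipeline: it uses a local master, a queueStream with a single test RDD in place of a real source, and count() in place of saveAsTextFile so that nothing is written to disk. The object name and the timeout parameter are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

// Hypothetical demo object; the names are illustrative only.
object ForeachRddPlacement {
  // Runs a tiny streaming job for up to timeoutMs and returns the per-batch counts.
  def run(timeoutMs: Long): List[Long] = {
    val conf = new SparkConf().setAppName("foreachRDD-placement").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // A one-element queue stands in for a real input source.
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.parallelize(1 to 10))
    // Plain driver-side state; updating it works because the foreachRDD
    // body below runs in the driver process.
    val countsOnDriver = mutable.ListBuffer.empty[Long]

    ssc.queueStream(queue).foreachRDD { rdd =>
      // Driver side: this body is invoked once per batch.
      val n = rdd.count() // an action: the counting itself runs on the executors
      countsOnDriver.synchronized { countsOnDriver += n }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(timeoutMs)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    countsOnDriver.synchronized { countsOnDriver.toList }
  }

  def main(args: Array[String]): Unit =
    println(run(5000L))
}
```

Batches after the first see an empty RDD, so the returned list typically holds one count of 10 followed by zeros; how many batches complete depends on timing.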

How to create RDD from within Task?

Normally when creating an RDD from a List you can just use the SparkContext.parallelize method, but you cannot use the Spark context from within a task as it's not serializable. I need to create an RDD from a list of Strings from within a task. Is there a way to do this?
I've tried creating a new SparkContext in the task, but it gives me an error about not supporting multiple Spark contexts in the same JVM and says that I need to set spark.driver.allowMultipleContexts = true. According to the Apache User Group, however, that setting does not yet seem to be supported.
As far as I am concerned, it is not possible, and it is hardly a matter of serialization or of support for multiple Spark contexts. The fundamental limitation is the core Spark architecture. Since the Spark context is maintained by the driver and tasks are executed on the workers, creating an RDD from inside a task would require pushing changes from the workers to the driver. I am not saying it is technically impossible, but the whole idea seems rather cumbersome.
Creating a Spark context from inside tasks looks even worse. First of all, it would mean that the context is created on the workers, which for all practical purposes don't communicate with each other. Each worker would get its own context, which could operate only on the data accessible on that worker. Finally, preserving worker state is definitely not part of the contract, so any context created inside a task would simply be garbage collected after the task finished.
If handling the problem using multiple jobs is not an option, you can try to use mapPartitions like this:
val rdd = sc.parallelize(1 to 100)
val tmp = rdd.mapPartitions(iter => {
val results = Map(
"odd" -> scala.collection.mutable.ArrayBuffer.empty[Int],
"even" -> scala.collection.mutable.ArrayBuffer.empty[Int]
)
for(i <- iter) {
if (i % 2 != 0) results("odd") += i
else results("even") += i
}
Iterator(results)
})
val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))

Spark SQL + Streaming issues

We are trying to implement a use case using Spark Streaming and Spark SQL that allows us to run user-defined rules against some data (see below for how the data is captured and used). The idea is to use SQL to specify the rules and return the results as alerts to the users. Executing the query for each incoming event batch seems to be very slow. We would appreciate it if anyone could suggest a better approach to implementing this use case. We would also like to know whether Spark executes the SQL on the driver or on the workers. Thanks in advance. Given below are the steps we perform in order to achieve this:
1) Load the initial dataset from an external database as a JDBCRDD
JDBCRDD<SomeState> initialRDD = JDBCRDD.create(...);
2) Create an incoming DStream (that captures updates to the initialized data)
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createStream(ssc, flumeAgentHost, flumeAgentPort);
JavaDStream<SomeState> incomingDStream = flumeStream.map(...);
3) Create a Pair DStream using the incoming DStream
JavaPairDStream<Object,SomeState> pairDStream =
incomingDStream.map(...);
4) Create a Stateful DStream from the pair DStream using the initialized RDD as the base state
JavaPairDStream<Object,SomeState> statefulDStream = pairDStream.updateStateByKey(...);
JavaRDD<SomeState> updatedStateRDD = statefulDStream.map(...);
5) Run a user-driven query against the updated state based on the values in the incoming stream
incomingDStream.foreachRDD(new Function<JavaRDD<SomeState>,Void>() {
@Override
public Void call(JavaRDD<SomeState> events) throws Exception {
updatedStateRDD.count();
SQLContext sqx = new SQLContext(events.context());
DataFrame schemaDf = sqx.createDataFrame(updatedStateRDD, SomeState.class);
schemaDf.registerTempTable("TEMP_TABLE");
sqx.sql("SELECT col1 from TEMP_TABLE where <condition1> and <condition2> ...");
//collect the results and process and send alerts
...
}
});
The first step should be to identify which step is taking most of the time.
Look at the Spark Master UI and identify which step/phase takes the longest.
Here are a few best practices and observations of my own that you can consider:
Use a singleton SQLContext. See this example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
updateStateByKey can be a memory-intensive operation when there is a large number of keys. Check the size of the data processed by the updateStateByKey function and whether it fits in the available memory.
How is your GC behaving?
Are you really using "initialRDD"? If not, do not load it. If it is a static dataset, cache it.
Check the time taken by your SQL query too.
Here are a few more questions/areas that can help you:
What is the StorageLevel for the DStreams?
What is the size and configuration of the cluster?
Which version of Spark are you using?
Lastly, foreachRDD is an output operation that executes the given function on the driver, but that function usually contains RDD actions, and those actions are executed on the worker nodes.
You may want to read this for a better explanation of output operations: http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
I am facing the same issue too. Could you please let me know if you found a solution? I have described my detailed use case in the post below.
Spark SQL + Window + Streming Issue - Spark SQL query is taking long to execute when running with spark streaming
