Local Java Data Structure Causing OOM Error in Spark Map Call - apache-spark

I'm trying to run a mapToPair function on a JavaPairRDD of about 1.5 million entries. Outside of the call, I have a Java Map that's locally defined. If I access the Map inside the mapToPair function, my program runs out of memory. If I don't access the Map, it executes successfully, even if I access the Map in the main loop of the code. Any thoughts on why this might be happening? My hypothesis is that accessing the Map inside the anonymous function is causing Spark to duplicate it many times.
I'm running Spark in Local mode with 16 threads. The issue occurs for anything from 16 to 4000 partitions of the data.
Code example:
Working Code:
JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
while (loop_condition) {
    Map<Integer, CustomObject> bigLocalMap = createMap();
    System.out.println(bigLocalMap.size());
    pairRDD = pairRDD.mapToPair(pair -> {
        return pair;
    });
}
Not Working Code:
JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
while (loop_condition) {
    Map<Integer, CustomObject> bigLocalMap = createMap();
    pairRDD = pairRDD.mapToPair(pair -> {
        System.out.println(bigLocalMap.size());
        return pair;
    });
}

How big is bigLocalMap? The way you are referencing it (via a closure) requires it to be serialized and sent to every executor for every core. Instead you should pass it around as a broadcast variable.
The general idea is that you can register data that you want to be accessible on all of the executors, and Spark will ensure that the data is efficiently transferred and stored only once per executor. With the closure approach you will end up with duplicates if you have configured executors with multiple cores.
Reference:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
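As a rough sketch (not your exact code), assuming your createRDD() and createMap() helpers and the CustomObject class from the question, plus a JavaSparkContext named sc, the broadcast version would look something like this:

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Broadcast the map once; each executor keeps a single read-only copy.
Broadcast<Map<Integer, CustomObject>> bigLocalMapBc = sc.broadcast(createMap());

JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
pairRDD = pairRDD.mapToPair(pair -> {
    // Read the shared copy instead of capturing the local map in the closure.
    Map<Integer, CustomObject> bigLocalMap = bigLocalMapBc.value();
    System.out.println(bigLocalMap.size());
    return pair;
});

If you rebuild the map on every loop iteration as in your snippet, broadcast the new map each time and call destroy() on the previous Broadcast so the old copies can be released.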
If you are still running out of memory, I would take a look at your memory settings. Some candidates for addressing it (a configuration sketch follows this list) would be to:
Reduce the number of cores per executor (fewer simultaneous tasks using memory)
Increase the number of partitions, either by setting spark.default.parallelism and spark.sql.shuffle.partitions (which will only take effect after the first shuffle) or by explicitly calling repartition. Smaller tasks will have less memory pressure.
If you have the resources, increase the amount of RAM you are giving to your executors with the spark.executor.memory setting
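For illustration, here is a minimal sketch of those settings in Java; the values are placeholders, not recommendations, and they assume a cluster deployment rather than the asker's local mode:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical values; tune them for your data volume and hardware.
SparkConf conf = new SparkConf()
        .setAppName("BigLocalMapJob")
        .set("spark.executor.memory", "4g")           // more RAM per executor
        .set("spark.executor.cores", "2")             // fewer concurrent tasks per executor
        .set("spark.default.parallelism", "400")      // more, smaller partitions
        .set("spark.sql.shuffle.partitions", "400");  // takes effect after the first shuffle

JavaSparkContext sc = new JavaSparkContext(conf);

// Or repartition an existing RDD explicitly:
// pairRDD = pairRDD.repartition(400);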

Related

Why would preferredLocations not be enforced on an empty Spark cluster?

My Spark job consists of 3 workers, co-located with the data they need to read. I submit an RDD with some metadata, and the job tasks turn that metadata into real data. For instance, the metadata could contain a file to read from the local worker filesystem, and the first stage of the Spark job would be to read that file into an RDD partition.
In my environment the data may not be present on all 3 workers and it is way too expensive to read across workers (i.e. if the data is on worker1 then worker2 can not reach out and fetch it). For this reason I have to force partitions onto the appropriate worker for the data they are reading. I have a mechanism for achieving this where I check the worker against the expected worker in the metadata and fail the task with a descriptive error message if they don't match. Using blacklisting I can ensure that the task is rescheduled on a different node until the right one is found. This works fine but as an optimization I wanted to use preferredLocations to help the tasks get assigned to the right workers initially without having to go through the try/reschedule process.
I use makeRDD to create my initial RDD (of metadata) with the correct preferredLocations as per the answer here: How to control preferred locations of RDD partitions?, however it's not exhibiting the behaviour I expect. The call to makeRDD is below:
sc.makeRDD(taskAssignments)
where taskAssignments takes the form:
val taskAssignments = mutable.ArrayBuffer[(String, Seq[String])]()
metadataMappings.foreach { case (k, v) =>
  taskAssignments += (k + ":" + v.mkString(",") -> Seq(idHostnameMappings(k)))
}
idHostnameMappings is just a map of id -> hostname, and I've verified that it contains the correct information.
Given that my test Spark cluster is completely clean with no other jobs running on it, and there is no skew in the input RDD (it has 3 partitions to match the 3 workers), I would have expected the tasks to be assigned to their preferredLocations. Instead I still see the error messages indicating that tasks are going through the fail/reschedule process.
Is my assumption that tasks would be scheduled at their preferredLocations on a clean cluster correct and is there anything further I can do to force this?
Follow up:
I was also able to create a much simpler test case. My 3 Spark workers are named worker1, worker2 and worker3, and I run the following:
import scala.collection.mutable
val someData = mutable.ArrayBuffer[(String, Seq[String])]()
someData += ("1" -> Seq("worker1"))
someData += ("2" -> Seq("worker2"))
someData += ("3" -> Seq("worker3"))
val someRdd = sc.makeRDD(someData)
someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
I'd expect to see 1:worker1 etc but in fact see
1:worker3
2:worker1
3:worker2
Can anyone explain this behaviour?
It turned out the issue was with my environment, not Spark. Just in case anyone else is experiencing this: the problem was that the Spark workers did not use the machine hostname by default. Setting the following environment variable on each worker rectified it: SPARK_LOCAL_HOSTNAME: "worker1"
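(For a standalone deployment this could go in conf/spark-env.sh on each worker, e.g. export SPARK_LOCAL_HOSTNAME=worker1 on worker1; adjust to however you actually launch your workers.)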

How many blocks (or partitions) does Spark load from HDFS the first time an action is performed?

I want to understand in detail how data flows from HDFS to a Spark job when the job is executed. As far as I know, when sc.textFile("...") is called and an action is executed, the data stored in HDFS is partitioned and then loaded on the Spark side. For example, if there are 30 GB of data in HDFS and an action is performed, then a number of partitions (maybe 30 GB / 128 MB, the default block size) are created, distributed to the worker nodes, and Spark processes the partitioned data.
When a worker node's memory is not sufficient (e.g. 500 MB or 1 GB per node), I think the partitions assigned to each worker node cannot all fit in memory at the same time. Therefore, I assume each node processes a few of the partitions it has to deal with, spills the processed partitions to disk, and then takes the next partitions. In this way, a worker node can process all of its partitions even when memory resources are not large.
With this process, I think Spark can deal with a lot of data even with small memory resources. However, when I experimented with the above concept on my data, I hit a Java heap space error. My code has other transformations besides sc.textFile("..."), but it only uses very basic operations, as shown below.
JavaRDD<String> data = sc.textFile("...");
JavaPairRDD<String, List<String>> retRDD = data.flatMapToPair(new PairFlatMapFunction<String, String, List<String>>() {
    @Override
    public Iterable<Tuple2<String, List<String>>> call(String s) throws Exception {
        List<Tuple2<String, List<String>>> rtn = new ArrayList<Tuple2<String, List<String>>>();
        String s1 = s;
        String[] splitted = s1.split("\t");
        long ckAsLong = Long.parseLong(splitted[2]);
        int batchId = (int) (ckAsLong / batchSize);
        List<String> temp = new ArrayList<String>();
        temp.add(splitted[2] + "\t" + splitted[3]);
        rtn.add(new Tuple2<String, List<String>>(
                splitted[0] + "\t" + splitted[1] + "\t" + batchId + "\t",
                temp
        ));
        return rtn;
    }
});
retRDD.count(); // action is performed
I have four nodes and gave each node 500 MB; the heap space error occurred at a very early step when I ran spark-submit. When giving 1 GB to each node, the program completed without error.
Based on this experience, I think the Spark job tries to load a specific subset of the total partitions and then processes them. For example, if 10 partitions are assigned to node1, then node1 processes 3 of the 10 first, moves them to disk, then loads the next partitions, and so on. Because the total size of that subset of partitions exceeds the executor's memory, the Java heap space error occurred.
My question is as below.
If my guess is right, how can I figure out which partitions Spark loads from HDFS first when memory is not enough? Otherwise, can you correct my assumptions so that I can understand the Spark concepts accurately?
Thanks!

How to execute computation per executor in Spark

In my computation, I
first broadcast some data, say bc,
then compute some big data shared by all executors/partitions: val shared = f(bc),
then run the distributed computation, using the shared data.
To avoid computing the shared data for every RDD item, I can use .mapPartitions, but I have many more partitions than executors, so it runs the computation of the shared data more times than necessary.
I found a simple method to compute the shared data only once per executor (which, as I understand it, is the JVM actually running the Spark tasks): using a lazy val on the broadcast data.
// class to be Broadcast
case class BC(input: WhatEver) {
  lazy val shared = f(input)
}

// in the Spark code
val sc = ... // init SparkContext
val bc = sc.broadcast(BC(...))
val initRdd = sc.parallelize(1 to 10000, numSlices = 10000)
initRdd.map { i =>
  val shared = bc.value.shared
  ... // run computation using shared data
}
I think this does what I want, but:
I am not sure; can someone guarantee it?
I am not sure a lazy val is the best way to manage concurrent access, especially with respect to Spark's internal distribution system. Is there a better way?
If computing shared fails, I think it will be recomputed for every RDD item (with possible retries), instead of simply stopping the whole job with a single error.
So, is there a better way?

Spark Streaming Dynamic Allocation ExecutorAllocationManager

We have a Spark 2.1 streaming application with a mapWithState, with spark.streaming.dynamicAllocation.enabled=true. The pipeline is as follows:
var rdd_out = ssc.textFileStream()
.map(convertToEvent(_))
.combineByKey(...., new HashPartitioner(partitions))
.mapWithState(stateSpec)
.map(s => sessionAnalysis(s))
.foreachRDD( rdd => rdd.toDF().....save(output));
The streaming app starts with 2 executors and after some time it creates new executors as the load increases, as expected. The problem is that the load is not shared with those new executors.
The number of partitions is big enough to spill over to the new executors, and the keys are equally distributed. I set it up with 40+ partitions, but I can see only 8 partitions (2 executors x 4 cores each) in the mapWithState storage. I expected that when new executors are allocated, those 8 partitions would get split and assigned to the new ones, but this never happens.
Please advise.
Thanks,
Apparently the answer was staring me in the face all along :). RDDs, as per the documentation below, should inherit the upstream partitioning.
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
The state inside mapWithState, however, does not have an upstream RDD, so it is set to the default parallelism unless you specify the number of partitions directly in the StateSpec, as in the example below.
val stateSpec = StateSpec.function(crediting.formSession _)
.timeout(timeout)
.numPartitions(partitions) // <----------
var rdd_out = ssc.textFileStream()
.map(convertToEvent(_))
.combineByKey(...., new HashPartitioner(partitions))
.mapWithState(stateSpec)
.map(s => sessionAnalysis(s))
.foreachRDD( rdd => rdd.toDF().....save(output));
I still need to figure out how to make the number of partitions dynamic, since with dynamic allocation this should change at runtime.

Can anyone explain RDD blocks in executors?

Can anyone explain why RDD blocks increase when I run the Spark code a second time, even though they were stored in Spark memory during the first run? I am giving input using a thread. What is the exact meaning of RDD blocks?
I have been researching this today, and it seems the "RDD Blocks" figure is actually the sum of RDD blocks and non-RDD blocks.
Check out the code at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
val rddBlocks = status.numBlocks
And if you go to the link below in the Apache Spark repo on GitHub:
https://github.com/apache/spark/blob/d5b1d5fc80153571c308130833d0c0774de62c92/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala
You will find below lines of code:
/**
 * Return the number of blocks stored in this block manager in O(RDDs) time.
 *
 * @note This is much faster than `this.blocks.size`, which is O(blocks) time.
 */
def numBlocks: Int = _nonRddBlocks.size + numRddBlocks
Non-RDD blocks are the ones created by broadcast variables, as they are stored as cached blocks in memory. The tasks are sent by the driver to the executors through broadcast variables.
Now these system-created broadcast variables are deleted by the ContextCleaner service, and consequently the corresponding non-RDD blocks are removed.
RDD blocks are unpersisted through rdd.unpersist().
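As an aside, here is a minimal Java sketch (the app name and input path are hypothetical) of explicitly caching and then unpersisting an RDD; the cached partitions appear as RDD blocks on the Executors page once an action materializes them, and unpersist() drops them again:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("RddBlocksDemo"));
JavaRDD<String> lines = sc.textFile("hdfs:///some/input"); // hypothetical path

// Caching materializes partitions as RDD blocks on the executors
// the first time an action runs over them.
lines.persist(StorageLevel.MEMORY_ONLY());
long total = lines.count();

// Dropping the cache removes the corresponding RDD blocks.
lines.unpersist();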
