Can anyone explain about rdd blocks in executors - apache-spark

Can anyone explain why the number of RDD blocks increases when I run the Spark code a second time, even though the blocks were stored in Spark memory during the first run? I am feeding the input from a thread. What is the exact meaning of RDD blocks?

I have been researching this today, and it seems the RDD blocks figure shown on the Executors page is the sum of RDD blocks and non-RDD blocks.
Check out the code at:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/exec/ExecutorsPage.scala
val rddBlocks = status.numBlocks
And if you go to the below link of Apache Spark Repo on Github:
https://github.com/apache/spark/blob/d5b1d5fc80153571c308130833d0c0774de62c92/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala
You will find below lines of code:
/**
* Return the number of blocks stored in this block manager in O(RDDs) time.
*
* @note This is much faster than `this.blocks.size`, which is O(blocks) time.
*/
def numBlocks: Int = _nonRddBlocks.size + numRddBlocks
Non-RDD blocks are the ones created by broadcast variables, since they are stored as cached blocks in memory. The driver sends tasks to the executors through broadcast variables.
These system-created broadcast variables are deleted by the ContextCleaner service, and the corresponding non-RDD blocks are removed as a result.
RDD blocks are unpersisted through rdd.unpersist().
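To see why the count can grow between runs even when no new RDD partition is cached, here is a minimal Python sketch of the counting rule quoted above. The class and method names are hypothetical and only mirror the numBlocks definition; this is a toy model, not Spark's implementation.

```python
class BlockCountModel:
    """Toy model of the quoted rule: numBlocks = non-RDD blocks + RDD blocks."""

    def __init__(self):
        self.rdd_blocks = {}         # rdd_id -> set of cached partition ids
        self.non_rdd_blocks = set()  # e.g. names of broadcast blocks

    def cache_partition(self, rdd_id, partition):
        self.rdd_blocks.setdefault(rdd_id, set()).add(partition)

    def add_broadcast(self, name):
        self.non_rdd_blocks.add(name)

    def remove_broadcast(self, name):
        # Stands in for the cleanup that ContextCleaner triggers.
        self.non_rdd_blocks.discard(name)

    def num_rdd_blocks(self):
        # One pass over the RDDs, not over every block.
        return sum(len(parts) for parts in self.rdd_blocks.values())

    def num_blocks(self):
        return len(self.non_rdd_blocks) + self.num_rdd_blocks()


model = BlockCountModel()
model.cache_partition(rdd_id=1, partition=0)
model.cache_partition(rdd_id=1, partition=1)
first_run = model.num_blocks()

# A second run ships a new task broadcast; no new RDD partition is cached.
model.add_broadcast("broadcast_0")
second_run = model.num_blocks()

print(first_run, second_run, model.num_rdd_blocks())
```

In this model the displayed total rises on the second run while the number of cached RDD partitions stays the same, and it drops again once the broadcast block is removed.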

Related

PySpark OOM for multiple data files

I want to process several independent CSV files of similar sizes (100 MB) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
from pyspark.sql import Window
import pyspark.sql.functions as f

logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving avg. I used 35 different moving averages.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... No Agg, no sort
logData.write.parquet("res.pr")
This works great. However, I had two issues with scaling this job:
When I tried to increase the number of window functions to 50, the job OOMs. I am not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I tried to run the job for 2 CSV files, it also OOMs. It is also not clear why it does not spill to disk, since the window functions are basically partitioned by CSV file, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent the OOM, or how I can hint Spark to do it.
If your machine cannot run all of these at once, you can process them in sequence and write the data of each batch of files before loading the next batch.
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to disk instead of keeping it in RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
Update if it helps
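The "in sequence" suggestion can be sketched as a small batching helper. This is only an illustration: the file names other than TypeA.csv and the batch size are hypothetical, and the actual read, window computation, and write steps are left as a comment.

```python
def batches(paths, batch_size):
    """Yield consecutive groups of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


# Hypothetical input list; only TypeA.csv appears in the question.
csv_files = ["TypeA.csv", "TypeB.csv", "TypeC.csv"]

plan = list(batches(csv_files, batch_size=2))
for batch in plan:
    # For each batch: read these files, compute the window columns,
    # and write the result to disk before starting the next batch.
    print("processing", batch)
```

Each batch finishes and is written out before the next one is loaded, so only one batch's data has to fit in memory at a time.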
In theory, you could process all these 600 files on one single machine. Spark should spill to disk when memory is not enough. But there are some points to consider:
As the logic involves a window aggregation, it results in a heavy shuffle operation. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of a file, then writes its shuffle output to files. Then the reduce phase needs to fetch this shuffle output from all map tasks. It's obvious that in your case you can't hold all map tasks running.
So it's highly likely that the OOM happened in the map phase. If this is the case, it means the memory per core can't process one single partition of a file. Please be aware that Spark makes a rough estimate of memory usage and spills if it thinks it should. As the estimate is not accurate, an OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, 128 MB of input needs a 2 GB heap with 4 GB of total executor memory, as
executor JVM heap execution memory (about 0.5 of total executor memory) =
(total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.fraction (0.6)
You can post all your configs in Spark UI for further investigation.
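As a quick check of the arithmetic above, here is a small Python sketch that evaluates the formula as the answer states it. The function name is hypothetical, and the formula is the answer's simplification (it ignores other reserved memory), so treat the result as a rough estimate only.

```python
def execution_memory_gb(total_executor_gb, overhead_fraction=0.1, memory_fraction=0.6):
    """Rough estimate following the formula above.

    overhead_fraction: share taken by executor memory overhead (0.1 in the answer).
    memory_fraction:   share of the remaining heap used by Spark (0.6 in the answer).
    """
    heap_gb = total_executor_gb * (1 - overhead_fraction)
    return heap_gb * memory_fraction


estimate = execution_memory_gb(4.0)
print(round(estimate, 2))  # about 2.16 GB, i.e. roughly half of 4 GB
```

With the 2g executor setting from the question, the same estimate comes out at roughly 1 GB.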

Spark executor out of memory on join

Hi, I am using Spark MLlib and doing an approxSimilarityJoin between a 1M dataset and a 1k dataset.
When I do it, I broadcast the 1k one.
What I see is that the job stops making progress at the second-to-last task.
All the executors are dead but one, which keeps running for a very long time until it runs out of memory.
I checked Ganglia, and it shows memory rising until it reaches the limit
and the disk space going down until it finishes:
The action I called is a write, but it does the same with count.
Now I wonder: is it possible that all the partitions in the cluster converge on only one node, creating this bottleneck?
Here is my code snippet:
var dfW = cookesWb.withColumn("n", monotonically_increasing_id())
var bunchDf = dfW.filter(col("n").geq(0) && col("n").lt(1000000) )
bunchDf.repartition(3000)
model.
approxSimilarityJoin(bunchDf,broadcast(cookesNextLimited),80,"EuclideanDistance").
withColumn("min_distance", min(col("EuclideanDistance")).over(Window.partitionBy(col("datasetA.uid")))
).
filter(col("EuclideanDistance") === col("min_distance")).
select(col("datasetA.uid").alias("weboId"),
col("datasetB.nextploraId").alias("nextId"),
col("EuclideanDistance")).write.format("parquet").mode("overwrite").save("approxJoin.parquet")
I'll try to answer as best as I can.
In Spark there are things called shuffle operations, and they do roughly what you thought: after some calculations, they transfer all the rows that belong together (for example, all rows with the same key) to a single node.
If you think about it, there's no other way for those operations to work without bringing that data together on a single node in the end.
Example for a join operation:
You have two partitions on 2 different nodes:
partition 1:
s, 1
partition 2:
s, k
and you want to join by s.
If you don't get both rows on a single machine, it will be impossible to work out that they need to be joined.
It is the same with count, reduce, and many other operations.
You can read about shuffle operations, or ask me if you want more clarification.
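The join example above can be sketched in plain Python. This is an illustrative model of key-based partitioning, not Spark's partitioner: the hash function (zlib.crc32), the function names, and the extra key "t" are all assumptions made for the sketch.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a target partition from the key alone (deterministic hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(rows, num_partitions):
    """Group (key, value) rows by target partition, as a shuffle would."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# ("s", "1") starts on node 1 and ("s", "k") on node 2, as in the example.
rows = [("s", "1"), ("s", "k"), ("t", "2")]
shuffled = shuffle(rows, num_partitions=4)

for partition_id, partition_rows in sorted(shuffled.items()):
    print(partition_id, partition_rows)
```

Because the target depends only on the key, both "s" rows land in the same partition and can be joined there. It also shows the risk: if most rows share one key, one partition receives most of the data.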
A possible solution for you:
Instead of only keeping data in memory, you can use something like:
dfW.persist(StorageLevel.MEMORY_AND_DISK_SER)
There are other options for persist, but basically what this does is save the partitions and data not only in memory but on disk as well, in a serialized form to save space.

Local Java Data Structure Causing OOM Error in Spark Map Call

I'm trying to run a mapToPair function on a javaPairRDD of about 1.5 million entries. Outside of the call, I have a Java Map that's locally defined. If I access the Map inside the mapToPair function then my program runs out of memory. If I don't access the Map, then it executes successfully, even if I access the map in the main loop of the code. Any thoughts on why this might be happening? My hypothesis is that accessing the Map inside the anonymous function is causing Spark to duplicate it a lot of times.
I'm running Spark in Local mode with 16 threads. The issue occurs for anything from 16 to 4000 partitions of the data.
Code example:
Working Code:
JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
while (loop_condition) {
    Map<Integer, CustomObject> bigLocalMap = createMap();
    System.out.println(bigLocalMap.size());
    pairRDD = pairRDD.mapToPair(pair -> {
        return pair;
    });
}
Not Working Code
JavaPairRDD<Integer, CustomObject> pairRDD = createRDD();
while (loop_condition) {
    Map<Integer, CustomObject> bigLocalMap = createMap();
    pairRDD = pairRDD.mapToPair(pair -> {
        System.out.println(bigLocalMap.size());
        return pair;
    });
}
How big is bigLocalMap? The way you are referencing it (via a closure) requires it to be serialized and sent to every executor for every core. Instead you should pass it around as a broadcast variable.
The general idea is that you can register data that you want to be accessible on all of the executors, and Spark will ensure that the data is efficiently transferred and only stored once per executor. With the closure method you will end up with duplicates if you have configured executors to have multiple cores.
Reference:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
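To put a rough number on the duplication described above, here is a back-of-the-envelope Python model. It assumes, as the answer states, one copy of the map per core with the closure approach and one copy per executor with a broadcast variable. The map contents are made up, and the pickled size is only a stand-in for the real JVM footprint.

```python
import pickle


def resident_bytes(local_map, copies):
    """Approximate memory held by `copies` copies of the map (pickled size as a proxy)."""
    return len(pickle.dumps(local_map)) * copies


# Stand-in for bigLocalMap; the real contents are not given in the question.
big_local_map = {i: "x" * 100 for i in range(10_000)}
cores_per_executor = 16  # the question runs local mode with 16 threads

closure_cost = resident_bytes(big_local_map, copies=cores_per_executor)
broadcast_cost = resident_bytes(big_local_map, copies=1)

print(closure_cost // broadcast_cost)  # ratio of copies under this model
```

Under these assumptions the closure approach holds 16 times as much data as the broadcast approach on a 16-core executor.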
If you are still running out of memory, I would take a look at your memory settings. Some candidates for addressing it would be to:
Reduce the number of cores per executor (fewer simultaneous tasks using memory).
Increase the number of partitions, either by setting spark.default.parallelism and spark.sql.shuffle.partitions (which will only take effect after the first shuffle) or by explicitly calling repartition. Smaller tasks will have less memory pressure.
If you have the resources, increase the amount of RAM you are giving to your executors with the spark.executor.memory setting.

Why does SparkContext.parallelize use memory of the driver?

Now I have to create a parallelized collection using sc.parallelize() in PySpark (Spark 2.1.0).
The collection in my driver program is big. When I parallelize it, I found it takes up a lot of memory on the master node.
It seems that the collection is still being kept in Spark's memory on the master node after I parallelize it to each worker node.
Here's an example of my code:
# my python code
sc = SparkContext()
a = [1.0] * 1000000000
rdd_a = sc.parallelize(a, 1000000)
sum = rdd_a.reduce(lambda x, y: x+y)
I've tried
del a
to destroy it, but it didn't work. The Spark process, which is a Java process, is still using a lot of memory.
After I create rdd_a, how can I destroy a to free the master node's memory?
Thanks!
The job of the master is to coordinate the workers and to give a worker a new task once it has completed its current task. In order to do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.
Now, if the input were a file, the task would simply look like "read file F from X to Y". But because the input was in memory to begin with, the task looks like 1,000 numbers. And given the master needs to keep track of all 1,000,000 tasks, that gets quite large.
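The numbers in the question make this concrete. The sketch below is plain arithmetic on the values from the question's code; the function name is hypothetical, and 8 bytes per element is an assumed lower bound (a raw 64-bit float, ignoring Python object and serialization overhead).

```python
def parallelize_footprint(num_elements, num_slices, bytes_per_element=8):
    """Return (elements carried by each task, lower bound on total data bytes)."""
    elements_per_task = num_elements // num_slices
    total_bytes = num_elements * bytes_per_element
    return elements_per_task, total_bytes


# Values from the question: a = [1.0] * 1000000000, parallelized into 1000000 slices.
per_task, total_bytes = parallelize_footprint(1_000_000_000, 1_000_000)

print(per_task)            # numbers embedded in each task
print(total_bytes / 1e9)   # lower bound in GB that has to be tracked
```

So each of the 1,000,000 tasks carries 1,000 numbers, and together they represent at least about 8 GB of data that starts out on the driver side.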
The collection in my driver program is big. when I parallelize it, I found it takes up a lot of memory in master node.
That's how it's supposed to be, and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.
Quoting the scaladoc of parallelize
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] Distribute a local Scala collection to form an RDD.
Note "a local Scala collection": that means the collection you want to map to an RDD (or create an RDD from) is already in the memory of the driver.
In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (that's already in the memory) is wrapped in this nice data abstraction called RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that. It's simply too late. But Spark plays nicely and pretends the data is as distributed as other datasets you could have processed using Spark.
That's why parallelize is only meant for small datasets to play around (and mainly for demos).
As in Jacek's answer, parallelize is only meant for demos with small datasets; you can access all variables defined in the driver within the parallelize block.

Spark Streaming not distributing task to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is my sample code, which demonstrates the process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc=new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream // batch of 500 ms as I would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair dstream
val stateStream = keyDStream.updateStateByKey // updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with input stream for further processing
val alertStream = withHistory.filter // decision to be taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes, or not distributing tasks to the other node, which causes high latency in response. My input load is around 100,000 tuples per second.
I have tried the things below, but nothing is working:
1) Set spark.locality.wait to 1 sec.
2) Reduced the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even when usage goes beyond the memory limit of the first node (m1), where the driver is also running.
3) Increased spark.streaming.concurrentJobs from 1 (default) to 3.
4) I have checked in the streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1.
If I run SparkPi 100000, then Spark is able to utilize the other node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed is that even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows Memory Serialized 1x Replicated.
Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data lies is better; this is the default configuration). But you can use repartition to distribute stream data and improve the parallelism. See http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI, not the master URI.
By default the value of the spark.default.parallelism property is "Local mode", so all the tasks will be executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.

Resources