Pyspark: is this object in driver memory? - apache-spark

Suppose I am writing code in pyspark.sql, creating pyspark DataFrames and other objects. Given an object (variable), how do I know whether it's sitting in driver memory or not? I know that methods like toPandas collect the data into the driver's memory, but looking at the result of foo = bar.toPandas(), is there a way to see from foo that it lives in the driver's memory?
This question is different from others on SO which ask for the difference between the driver and executor. It is also different from asking which actions bring dataframes into driver memory.
Thank you.
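One rough signal, sketched below under the assumption of a local SparkSession (names here are illustrative): the Python type of the variable. A pandas.DataFrame returned by toPandas() is an ordinary local object held in the driver process, while a pyspark.sql.DataFrame is only a handle to a distributed, lazily evaluated plan.

# Sketch: the Python type is one signal of where the data lives (illustrative only).
import pandas as pd
import pyspark.sql
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("where-does-it-live").getOrCreate()

bar = spark.range(10)    # pyspark.sql.DataFrame: a lazy handle, data stays on the executors
foo = bar.toPandas()     # pandas.DataFrame: an ordinary local object in the driver process

print(isinstance(bar, pyspark.sql.DataFrame))   # True -> a plan/handle, not driver-resident data
print(isinstance(foo, pd.DataFrame))            # True -> fully materialized in driver memory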

Related

DataFrame Lifespan in memory, Spark?

My question is more about memory management and GC in Spark internally.
If I create an RDD, how long will it live in my executor memory?
# Program starts
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("").master("yarn").getOrCreate()
df = spark.range(10)
df.show()
# other Operations
# Program end!!!
Will it be automatically deleted once my execution finishes? If yes, is there any way to delete it manually during program execution?
How and when is garbage collection called in Spark? Can we implement a custom GC like in a Java program and use it in Spark?
DataFrames are Java objects, so once no reference to your object remains it is eligible for garbage collection.
Spark - Scope, Data Frame, and memory management
Calling a custom GC is not possible.
Manually calling spark's garbage collection from pyspark
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
https://spark.apache.org/docs/2.2.0/tuning.html#memory-management-overview
"how long it will leave in my Executor memory."
In this particular case Spark will never materialize the full dataset; instead it will iterate through it one row at a time. Only a few operators materialize the full dataset: sorts, joins, group-bys, writes, etc.
"Will it be automatically deleted once my Execution finishes."
Spark automatically cleans up any temporary data.
"If Yes, Is there any way to delete it manually during program execution."
Spark only keeps that data around if it's in use or has been manually persisted. What are you trying to accomplish in particular?
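A minimal sketch of manually releasing cached data (the app name is a placeholder; persist, unpersist, and clearCache are standard PySpark calls):

# Sketch: explicitly releasing cached data during program execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-cleanup-sketch").getOrCreate()

df = spark.range(10)
df.persist()                    # pin the DataFrame in executor storage memory
df.count()                      # an action actually materializes the cache
df.unpersist()                  # release the cached blocks when no longer needed
spark.catalog.clearCache()      # or drop everything cached in this session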
"How and when Garbage collection called in Spark."
Spark runs on the JVM, and the JVM will automatically GC when certain metrics are hit.
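The collector itself is not pluggable from Spark, but you can choose the JVM collector and its flags through Spark's JVM-option settings. A sketch along the lines of the tuning guides linked above (flags and app name are illustrative; driver JVM options generally have to be set before the driver starts, e.g. via spark-submit or spark-defaults.conf):

# Sketch: picking the G1 collector for executors via JVM options.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gc-tuning-sketch")
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails")
         .getOrCreate())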

Apache Spark - Iterators and Memory consumption

I am a newbie to Spark and have a question regarding Spark's memory usage with iterators.
When using foreach() or mapPartitions() on Datasets (or even a direct call to the iterator() function of an RDD), does Spark need to load the entire partition into RAM first (assuming the partition is on disk), or can the data be lazily loaded as we iterate (meaning Spark can load only part of the partition's data, execute the task, and save the intermediate result to disk)?
The first difference between those two is that foreach() is an action while mapPartitions() is a transformation. It would be more meaningful to compare foreach with foreachPartition, since they are both actions and they both work on the final accumulated data on the driver. Refer here for a detailed discussion of those two. As for memory consumption, it really depends on how much data you return to the driver. As a rule of thumb, return results to the driver using methods like limit(), take(), first(), etc., and avoid collect() unless you are sure the data can fit in the driver's memory.
mapPartitions can be compared with the map or flatMap functions; all of them modify the dataset's data by applying some transformation. mapPartitions is more efficient since it executes the given function fewer times, whereas map will do the same for each item in the dataset. Refer here for more details about these two functions.
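A small illustrative sketch of the contrast (app name and data are placeholders):

# Sketch: map runs per element, mapPartitions runs per partition over a lazy iterator.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-mappartitions-sketch").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10), 2)

doubled = rdd.map(lambda x: x * 2)          # the function runs once per element

def double_partition(rows):
    # runs once per partition; `rows` is a lazy iterator, so the whole
    # partition does not need to be materialized at once
    for x in rows:
        yield x * 2

doubled_too = rdd.mapPartitions(double_partition)

print(doubled.take(5))      # bounded amount of data pulled back to the driver
# doubled.collect()         # would pull the entire dataset into driver memory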

what is driver memory and executor memory in spark? [duplicate]

This question already has answers here:
How to set Apache Spark Executor memory
I am new to the Spark framework and I would like to know: what are driver memory and executor memory? What is the effective way to get the maximum performance from both of them?
Spark needs a driver to coordinate the executors. So the best way to understand it is:
Driver
The driver is responsible for the main logic of your code: it requests resources from YARN, handles the allocation, and handles small amounts of data for certain kinds of logic. Driver memory relates to how much data you pull back to the driver to run that logic. If you retrieve too much data with rdd.collect(), your driver will run out of memory. The driver's memory is usually small; 2 GB to 4 GB is more than enough if you don't send too much data to it.
Worker
Here is where the magic happens: the workers are responsible for executing your job. The amount of memory depends on what you are going to do. If you are only doing a map that transforms the data with no aggregation, you usually don't need much memory. But if you are going to run big aggregations, many steps, and so on, you will usually need a good amount of memory, and it is related to the size of the files you will read.
The proper amount of memory for each case depends entirely on how your job works. You need to understand the impact of each function and monitor your jobs to tune the memory usage. Maybe 2 GB per worker is what you need, but sometimes 8 GB per worker is what you need.
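A hedged sizing sketch (the values and app name are placeholders, not recommendations; driver memory usually has to be fixed before the driver JVM starts, so in practice these are often passed to spark-submit instead):

# Sketch: common memory knobs.
# Often passed on the command line instead:
#   spark-submit --driver-memory 4g --executor-memory 8g my_job.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-sizing-sketch")
         .config("spark.executor.memory", "8g")   # heap available to each executor
         .config("spark.driver.memory", "4g")     # only effective if set before the driver JVM launches
         .getOrCreate())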

Why does SparkContext.parallelize use memory of the driver?

I have to create a parallelized collection using sc.parallelize() in PySpark (Spark 2.1.0).
The collection in my driver program is big. When I parallelize it, I find that it takes up a lot of memory on the master node.
It seems that the collection is still being kept in Spark's memory on the master node after I parallelize it out to the worker nodes.
Here's an example of my code:
# my python code
from pyspark import SparkContext

sc = SparkContext()
a = [1.0] * 1000000000
rdd_a = sc.parallelize(a, 1000000)
sum = rdd_a.reduce(lambda x, y: x+y)
I've tried
del a
to destroy it, but it didn't work. Spark, which runs as a Java process, is still using a lot of memory.
After I create rdd_a, how can I destroy a to free the master node's memory?
Thanks!
The job of the master is to coordinate the workers and to give a worker a new task once it has completed its current task. In order to do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.
Now, if the input were a file, each task would simply look like "read file F from X to Y". But because the input was in memory to begin with, each task carries its own slice of 1,000 numbers. And given that the master needs to keep track of all 1,000,000 tasks, that gets quite large.
The collection in my driver program is big. When I parallelize it, I find that it takes up a lot of memory on the master node.
That's how it is supposed to be, and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.
Quoting the scaladoc of parallelize
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] Distribute a local Scala collection to form an RDD.
Note "a local Scala collection" that means that the collection you want to map to a RDD (or create a RDD from) is already in the memory of the driver.
In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (which is already in memory) is wrapped in the data abstraction called an RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that; it's simply too late. But Spark plays nicely and pretends the data is as distributed as other datasets you could have processed using Spark.
That's why parallelize is only meant for small datasets to play around with (and mainly for demos).
As Jacek's answer says, parallelize is only meant as a demo for small datasets; note that you can access all variables defined in the driver from within the code that operates on the parallelized data.
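One way to illustrate the point (a sketch, under the assumption that the values can be generated, or read from storage, directly on the executors rather than built as a Python list on the driver first):

# Sketch: avoid materializing the big collection on the driver at all.
from pyspark import SparkContext

sc = SparkContext()
rdd_a = sc.range(0, 1000000000, numSlices=1000)        # rows are generated on the executors
total = rdd_a.map(lambda x: 1.0).reduce(lambda x, y: x + y)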

Spark partitionBy on write.save brings all data to driver?

So basically I have a Python Spark job that reads some simple JSON files and then tries to write them as ORC files partitioned by one field. The partitioning is not very balanced, as some keys are really big and others really small.
I had memory issues when doing something like this:
events.write.mode('append').partitionBy("type").save("s3n://mybucket/tofolder", format="orc")
Adding memory to the executors didn't seem to have any effect, but I solved it by increasing the driver memory. Does this mean that all the data is being sent to the driver for it to write? Can't each executor write its own partition? I'm using Spark 2.0.1.
Even if you partition a dataset and then write it to storage, records are not sent to the driver. You should look at the logs of the memory issues (whether they occur on the driver or on the executors) to figure out the exact reason for the failure.
Your driver probably has too little memory to handle this write because of previous computations. Try decreasing spark.ui.retainedJobs and spark.ui.retainedStages to save the memory spent on metadata for old jobs and stages. If this doesn't help, connect to the driver with jvisualvm to find the job/stage that consumes large heap fragments and try to optimize.
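A sketch of those two settings (the values and app name are illustrative; both default to 1000):

# Sketch: keeping less job/stage metadata on the driver for the Spark UI.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("trim-ui-metadata")
         .config("spark.ui.retainedJobs", "100")      # default 1000
         .config("spark.ui.retainedStages", "100")    # default 1000
         .getOrCreate())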
