Spark Streaming memory usage doubts - apache-spark

I am trying to figure out the memory used by executors for a Spark Streaming job. For data I am using the REST endpoint for Spark allexecutors,
computing totalDuration * spark.executor.memory for every executor and then emitting the final sum as the memory usage (see the sketch below).
But this comes out very small for an application that ran the whole day. Is something wrong with the logic? I am also using dynamic allocation, with executorIdleTimeout set to 5 seconds.
I am also assuming that if an executor was removed due to the idle timeout and later allocated to another task, its totalDuration would increase by the time the executor spent executing that new task.
What is unusual is that as the execution duration increases, the memory usage metric decreases.
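For reference, a minimal sketch of the calculation described above, assuming a hypothetical Spark UI base URL and application id, and an assumed spark.executor.memory of 4 GB:

    import requests

    BASE_URL = "http://localhost:4040/api/v1"    # hypothetical Spark UI endpoint
    APP_ID = "app-20240101120000-0001"           # hypothetical application id
    EXECUTOR_MEMORY_GB = 4                       # assumed spark.executor.memory

    resp = requests.get(f"{BASE_URL}/applications/{APP_ID}/allexecutors")
    resp.raise_for_status()

    usage_gb_ms = 0
    for executor in resp.json():
        if executor["id"] == "driver":           # skip the driver entry
            continue
        # totalDuration is cumulative task run time in ms, not executor uptime
        usage_gb_ms += executor["totalDuration"] * EXECUTOR_MEMORY_GB

    print(f"Approximate usage: {usage_gb_ms / 3_600_000:.1f} GB-hours")

Note that totalDuration only accumulates time spent running tasks, so time an executor sits idle before being removed by the idle timeout contributes nothing, which may be part of why the sum looks small for a day-long application.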

Related

GC and shuffle read is high (red) in databricks cluster, how to tune this?

I need a recommendation for cluster design in Databricks; we have an ETL batch load running every 20 minutes.
There are 25+ notebooks doing straight merges into silver layer tables (facts/dimensions).
The cluster config is as follows:
instance type - F64 - compute optimised
worker nodes - 3 - 128 GB memory x 64 cores
driver node - 1 - memory optimized - 64 GB x 8 cores
We need to minimize execution time and increase parallelism.
I am attaching a snapshot from the Spark UI of the Databricks cluster (executor page) for your reference.
[1]: https://i.stack.imgur.com/qMFyf.png
I see red flags for GC time and shuffle read; GC time turns out to be more than 10% of total task time.
How can we bring this down? We are missing our SLA for the load cycle.
Thanks
Try increasing the number of executors and nodes, and giving more memory per executor. You can also see which part of your code is taking long by looking at the logs.
Make sure you are not performing operations that involve a lot of data shuffling.
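As a rough illustration of the above (the values are placeholders, not a recommendation for this specific workload; on Databricks, executor sizing is normally chosen through the cluster configuration rather than inside a notebook):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # placeholder executor sizing; on Databricks this comes from the cluster spec
        .config("spark.executor.memory", "16g")
        .config("spark.executor.cores", "8")
        # more post-shuffle partitions for the merge-heavy notebooks
        .config("spark.sql.shuffle.partitions", "512")
        .getOrCreate()
    )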

How do you find out exactly what had caused the high GC time for the spark tasks in any given spark stage?

I have a Spark application in which one of the stages took most of the time (2.5 hrs+). I did a deep dive and found that for the majority of the tasks the GC time was pretty high: 60% of total task execution time.
The questions that I have are:
How do I correlate this Spark task with my code?
How do I identify which part of my Spark code, written using PySpark, caused the high GC time?
In general, what causes high GC time for any given Spark task?
High GC time means either frequent GC or GC runs that take a long time. A few suggestions, with the limited info I could gather from the screenshots:
One thing to check is whether you are caching big RDDs. Uncaching them as soon as they are no longer required will reduce memory pressure (a minimal unpersist sketch follows after this answer). Is stage 68 part of the first job? If so, uncache data from previous jobs that is no longer required.
To figure out which operation this is: use the DAG visualization link at the top of the stage and job pages to understand the flow. For SQL, use the SQL tab in the UI.
Also, there are 2000 tasks for ~40 GB of shuffle data, so each task handles about 20 MB, which is very small; it is better to have at least ~128 MB per task. Consider tuning this parallelism back to the default of 200.
If you can't optimize your code, then use more memory by adding more nodes or nodes with larger memory.
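A minimal sketch of uncaching data once it is no longer needed (the paths and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.parquet("/data/events")      # hypothetical input path
    events.cache()

    counts = events.groupBy("user_id").count()
    counts.write.mode("overwrite").parquet("/data/event_counts")

    # Release the cached blocks as soon as the last job that needs them has run,
    # so later stages are not starved of executor memory.
    events.unpersist()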
From experience, high GC time is caused by tasks requiring more than the available memory. High GC time is often also accompanied by the tasks spilling to disk (entries in the Memory Spill and Disk Spill columns).
Also, from Learning Spark:
A high GC time signals too many objects on the heap (your executors may be memory-starved).
Damji, Jules S.; Wenig, Brooke; Das, Tathagata; Lee, Denny.
In my experience, a good mitigation is to increase the number of partitions read by the given stage, which reduces the memory required by each individual task, e.g. by decreasing spark.sql.files.maxPartitionBytes when reading files, or increasing spark.sql.shuffle.partitions when joining DataFrames (see the sketch below).
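A sketch of those two settings (the values are placeholders, not recommendations):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # smaller input splits -> more, lighter file-scan tasks (default is 128 MB)
        .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
        # more post-shuffle partitions for joins/aggregations (default is 200)
        .config("spark.sql.shuffle.partitions", 400)
        .getOrCreate()
    )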

How Spark dynamic allocation clears queued tasks

I can open new VMs on the fly; that is the context for this question.
I am using Spark dynamic allocation. When I used
spark.dynamicAllocation.minExecutors=10, on a sudden burst of data Spark opened new executors very slowly, which resulted in long queues.
When I changed spark.dynamicAllocation.minExecutors to a larger number, 200, on a sudden burst it opens new executors very fast and the queue clears up.
My question is: do we have to set this to a high value for such situations?
Vipin,
When you set up dynamic allocation in Spark, as I can see, you enable it and set the minimum number of executors. But when you need 200 executors quickly, the allocation has a configuration called spark.dynamicAllocation.schedulerBacklogTimeout, which defaults to 1s.
This means that if tasks have been pending for 1s, Spark will request more executors.
The Spark documentation says:
Spark requests executors in rounds. The actual request is triggered when there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds thereafter if the queue of pending tasks persists. Additionally, the number of executors requested in each round increases exponentially from the previous round. For instance, an application will add 1 executor in the first round, and then 2, 4, 8 and so on executors in the subsequent rounds.
So in each round Spark requests exponentially more executors (1, 2, 4, 8, ...). To reach 200 executors you need to wait at least 8 rounds (roughly 8 seconds) of requests to YARN, plus a few more seconds for them to come up.
Maybe raising the number of cores will help you. But if you are already using all the cores of each node... well, there is no solution (a config sketch follows below).
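For reference, a hedged sketch of the settings discussed above; in practice they are usually passed via spark-submit --conf or spark-defaults.conf, and the values are illustrative only:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "10")
        .config("spark.dynamicAllocation.maxExecutors", "300")
        # request executors once tasks have been pending for 1s ...
        .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
        # ... and keep requesting (exponentially more) every 1s while the backlog persists
        .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
        .getOrCreate()
    )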

Max Executor failures in Spark dynamic allocation

I am using the dynamic allocation feature of Spark to run my Spark job. It allocates around 50-100 executors. For some reason a few executors are lost, resulting in the job shutting down. The log shows that this happened because the maximum number of executor failures was reached. It is set to 3 by default, hence when 3 executors are lost the job gets killed even if 40-50 other executors are still running.
I know that I can change the max executor failure limit (sketched below), but this seems like a workaround. Is there something else I can try? All suggestions are welcome.
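For reference, the limit mentioned above can be raised on YARN via spark.yarn.max.executor.failures; the value below is only a placeholder, and raising it is the workaround already described in the question rather than a fix for the underlying executor losses:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # placeholder: tolerate more executor failures before the application is killed
        .config("spark.yarn.max.executor.failures", "20")
        .getOrCreate()
    )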

SPARK: Can increasing DRIVER MEMORY decrease performance?

I am tuning an application running on Spark 1.5.2. I ran the exact same script twice, but with different driver.memory parameters.
First run: driver.memory = 15g / execution time: 6.1 h
Second run: driver.memory = 2g / execution time: 5.7 h
The script only joins the same table with a newer table and iterates on it, before saving it to a Hive table.
I thought the more memory we give, the better. But according to these tests that idea seems false... Is the driver memory really responsible for this? Or does the run time just vary more or less randomly?
Driver memory does not matter much if your driver is running on a standalone machine (where no executor is running). Increase driver memory if you are using collect/take actions; otherwise increase executor memory for better performance (see the sketch below).
If you are not using cache, try to increase spark.shuffle.memoryFraction.
See spark doc for more details: https://spark.apache.org/docs/1.5.2/configuration.html
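A sketch for the Spark 1.5.2 era, with placeholder values; note that driver memory normally has to be set before the driver JVM starts (e.g. via spark-submit --driver-memory), so the line below is illustrative only:

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("join-and-save")                 # hypothetical app name
        .set("spark.driver.memory", "2g")            # mainly matters for collect()/take() results
        .set("spark.executor.memory", "8g")          # where the join itself runs
        .set("spark.shuffle.memoryFraction", "0.4")  # legacy (pre-1.6) shuffle memory fraction
    )
    sc = SparkContext(conf=conf)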
