Why could SparkSession initialization take longer every iteration in a single application? - apache-spark

I use Spark for batch analysis.
I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core CPU, launched with spark-submit my_code.py and no additional configuration parameters.
In a while loop I start a SparkSession, analyze the data, and then stop the context; this process repeats every 10 seconds.
from pyspark.sql import SparkSession

while True:
    spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()
    sc = spark.sparkContext
    # ... process and analyze the data ...
    spark.stop()
When the program starts, it works perfectly, but after it has been running for many hours, Spark initialization starts taking a long time: 10 or 20 seconds just to initialize Spark.
So what is the problem?

You are using a single-JVM local run mode. I can't explain exactly what happens in your case, but it's not surprising that this single JVM comes under more and more memory pressure: it starts clean, and over time Spark leaves behind temporary objects that linger until they are garbage-collected.
I strongly recommend attaching jconsole to the JVM to monitor its memory and CPU usage.
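If the repeated setup itself turns out to be the bottleneck, one option is to create the SparkSession once, outside the loop, and reuse it for every cycle. Here is a minimal sketch of that idea; process_data and the 10-second sleep are placeholders assumed for illustration, not part of the original code:

import time
from pyspark.sql import SparkSession

def process_data(session):
    # placeholder for the actual analysis step
    pass

# Build the session once and reuse it instead of re-initializing Spark on every iteration.
spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()

try:
    while True:
        process_data(spark)
        time.sleep(10)  # the question mentions a ~10 second cycle
finally:
    spark.stop()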

Related

Spark long running jobs with dataset

I have Spark code that used to run batch jobs (each job's span varies from a few seconds to a few minutes). Now I want to take the same code and run it as a long-running job. To do this I thought of creating the Spark context only once, and then in a while loop I would wait for new config/tasks to come in and start executing them.
So far, whenever I try to run this code, my application stops running after 5-6 iterations without any exception or error printed. This long-running job has been assigned one executor with 10 GB of memory and a Spark driver with 4 GB of memory (which was good for our batch job). So my question is: what are the various things we need to do in the code itself to move from small batch jobs to long-running jobs? I have seen this useful link - http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/ - but it is mostly about Spark configurations that keep the job running for long.
Spark version - 2.3 (can move to Spark 2.4.1), running on a YARN cluster
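As a rough illustration of the pattern described above (a single long-lived session plus a polling loop), here is a minimal Python sketch. fetch_next_task and run_task are hypothetical placeholders, and catching per-task failures so one bad task does not kill the application is an assumption about what the long-running version needs, not something stated in the question:

import time
from pyspark.sql import SparkSession

def fetch_next_task():
    # Hypothetical: poll a queue or config store for the next task, or return None
    return None

def run_task(session, task):
    # Hypothetical: execute one unit of work with the shared session
    pass

spark = SparkSession.builder.appName("long_running_job").getOrCreate()

try:
    while True:
        task = fetch_next_task()
        if task is None:
            time.sleep(5)  # assumed idle wait between polls
            continue
        try:
            run_task(spark, task)
        except Exception as exc:  # keep the loop alive on per-task failures
            print(f"task failed: {exc}")
finally:
    spark.stop()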

Spark (yarn-client mode) not releasing memory after job/stage finishes

We are consistently observing this behavior with interactive Spark jobs in spark-shell, or when running Sparklyr in RStudio, etc.
Say I launched spark-shell in yarn-client mode and performed an action, which triggered several stages in a job and consumed x cores and y MB of memory. Once this job finishes and the corresponding Spark session is still active, the allocated cores and memory are not released even though the job is finished.
Is this normal behavior?
Until the corresponding Spark session is finished, ip:8088/ws/v1/cluster/apps/application_1536663543320_0040/ kept showing the allocated resources as y MB of memory, x cores, and z containers.
I would assume YARN would dynamically allocate these unused resources to other Spark jobs that are awaiting resources.
Please clarify if I am missing something here.
You need to play with the dynamic allocation configs (https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation):
Set spark.dynamicAllocation.executorIdleTimeout to a smaller value, say 10s. The default value of this parameter is 60s. This config tells Spark to release an executor only when it has been idle for that much time.
Check spark.dynamicAllocation.initialExecutors / spark.dynamicAllocation.minExecutors. Set these to a small number, say 1 or 2. The Spark application will never scale down below this number unless the SparkSession is closed.
Once you set these two configs, your application should release the extra executors once they have been idle for 10 seconds.
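For reference, here is a sketch of how those settings could be passed when the session is created; the values are illustrative, and on YARN the classic dynamic allocation path also expects the external shuffle service to be enabled:

from pyspark.sql import SparkSession

# Illustrative values only; tune them to your workload.
spark = (SparkSession.builder
         .appName("dynamic_allocation_example")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")             # expected by classic dynamic allocation on YARN
         .config("spark.dynamicAllocation.executorIdleTimeout", "10s")
         .config("spark.dynamicAllocation.initialExecutors", "1")
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "10")        # assumed cap, not from the original answer
         .getOrCreate())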
Yes, the resources stay allocated as long as the SparkSession is active. To handle this better you can use dynamic allocation.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html

Spark Memory Usage Concentrated on Driver / Master

I'm currently developing a Spark (v 2.2.0) Streaming application and am running into issues with the way Spark seems to be allocating work across the cluster. This application is submitted to AWS EMR using client mode, so there is a driver node and a couple of worker nodes. Here is a screenshot of Ganglia that shows memory usage in the last hour:
The left-most node is the "master" or "driver" node, and the other two are worker nodes. There are spikes in the memory usage for all three nodes that correspond to workloads coming in through the stream, but the spikes are not equal (even when scaled to % memory usage). When a large workload comes in, the driver node appears to be overworked, and the job will crash with an error regarding memory:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000053e980000, 674234368, 0) failed; error='Cannot allocate memory' (errno=12)
I've also run into this when the master runs out of memory:
Exception in thread "streaming-job-executor-10" java.lang.OutOfMemoryError: Java heap space
which is equally confusing, as my understanding is that "client" mode would not use the driver / master node as an executor.
Pertinent details:
As mentioned earlier, this application is submitted in client mode: spark-submit --deploy-mode client --master yarn ....
Nowhere in the program am I running collect or coalesce
Any work that I suspect of being run on a single node (jdbc reads mainly) is repartition'd after completion.
There are a couple of very, very small datasets persisted into memory.
1 x Driver specs: 4 cores, 16GB RAM (m4.xlarge instance)
2 x Worker specs: 4 cores, 30.5GB RAM (r3.xlarge instance)
I have tried both allowing Spark to choose executor size / cores and specifying them manually. Both cases behave the same. (I manually specified 6 executors, 1 core, 9GB RAM)
I'm certainly at a loss here. I'm not sure what is going on in the code to be triggering the driver to hog the workload like this.
The only suspect I can think of is a code snippet similar to the following:
val scoringAlgorithm = HelperFunctions.scoring(_: Row, batchTime)
val rawScored = dataToScore.map(scoringAlgorithm)
Here, a function is being loaded from a static object, and used to map over the Dataset. It is my understanding that Spark will serialize this function across the cluster (re: http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#passing-functions-to-spark). However perhaps I am mistaken and it is simply running this transformation on the driver.
If anyone has any insight to this issue, I would love to hear it!
I ended up solving this issue. Here's how I addressed it:
I made an incorrect assertion in stating the problem: there was a collect statement at the beginning of the Spark program.
I had a transaction that required collect() to run as it was designed. My assumption was that calling repartition(n) on the resulting data would split the data back amongst the executors in the cluster. From what I can tell, this strategy does not work. Once I re-wrote this line, Spark started behaving as I expected and farming jobs out to worker nodes.
My advice to any lost soul who stumbles across this issue: don't collect unless it's the very end of your Spark program; you cannot recover from it. Find another way to perform your task. (I ended up switching a SQL transaction from WHERE col IN (...) syntax to a join on the database.)
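As a rough illustration of that rewrite (the table and column names here are made up for the example, not taken from the author's code), the idea is to replace a collected list fed into an IN (...) filter with a join that stays distributed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect_vs_join").getOrCreate()

# Hypothetical datasets standing in for the original tables.
orders = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["customer_id", "item"])
active = spark.createDataFrame([(1,), (3,)], ["customer_id"])

# Anti-pattern: collect ids to the driver, then build an IN (...) filter.
ids = [row.customer_id for row in active.collect()]        # pulls data to the driver
filtered_on_driver = orders.where(orders.customer_id.isin(ids))

# Preferred: express the same filter as a join, which stays distributed.
filtered_with_join = orders.join(active, on="customer_id", how="inner")

filtered_with_join.show()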

SPARK: Can increasing DRIVER MEMORY decrease performance?

I am tuning an application running on Spark 1.5.2. I ran the exact same script twice, but with a different driver.memory parameter.
First run: driver.memory = 15g / Execution time: 6.1h
Second run: driver.memory = 2g / Execution time: 5.7h
The script only joins a table with a newer version of itself, iterating, before saving the result into a Hive table.
I thought the more memory we give, the better it is, but that idea seems to be false according to these tests... Is the driver memory really responsible for this, or does the process just run with more or less random timing?
Driver memory does not matter if your driver is running on a standalone machine (where no executor is running). Increase driver memory if you are using collect/take actions; otherwise increase executor memory for better performance.
If you are not using cache, try increasing spark.shuffle.memoryFraction.
See the Spark docs for more details: https://spark.apache.org/docs/1.5.2/configuration.html
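For example, these knobs could be set roughly as follows (a sketch with illustrative values; spark.shuffle.memoryFraction only applies to the legacy memory manager that Spark 1.5.x uses, and driver memory itself normally has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory):

from pyspark import SparkConf, SparkContext

# Illustrative values only (Spark 1.5.x style configuration).
conf = (SparkConf()
        .setAppName("tuning_example")
        .set("spark.executor.memory", "8g")            # give the executors the bulk of the memory
        .set("spark.shuffle.memoryFraction", "0.4"))   # legacy setting; the 1.5.x default was 0.2

sc = SparkContext(conf=conf)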

Spark Performance issue while adding more worker nodes

I am new to Spark. I am facing a performance issue when the number of worker nodes is increased. To investigate it, I tried some sample code in spark-shell.
I created an Amazon AWS EMR cluster with 2 worker nodes (m3.xlarge) and used the following code in spark-shell on the master node.
var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()
This code executed without any issues and took around 8 minutes. But when I added 2 more worker nodes (m3.xlarge) and executed the same code from spark-shell on the master node, the time increased to 10 minutes.
Here is the issue: I think the time should decrease (maybe not by half, but it should decrease). I have no idea why the same Spark job takes more time when worker nodes are added. Any idea why this is happening? Am I missing anything?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is heavy, doing it with 2 nodes will take more time than with one.
I would start by comparing some simpler transformations, running more asynchronous code without any sync points (such as group by key), and see if you get the same issue.
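A rough way to run that comparison in PySpark (the column names mirror the question's example; the smaller range and the simple time.time timing are assumptions for a quick test, not a rigorous benchmark):

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sync_point_comparison").getOrCreate()

df = (spark.range(0, 100000000)              # smaller than the question's 6 billion rows, for a quick test
          .withColumn("col1", F.rand(10))
          .withColumn("col2", F.rand(20)))

# No sync point: a purely narrow pipeline (projection and filter only).
start = time.time()
df.select((F.col("col1") + F.col("col2")).alias("s")).filter("s > 0.5").count()
print("narrow pipeline:", time.time() - start)

# Sync point: groupBy forces a shuffle across the cluster.
start = time.time()
df.withColumn("key", F.col("id") % 2).groupBy("key").agg(F.avg("col1"), F.avg("col2")).show()
print("with shuffle:", time.time() - start)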
@z-star, yes, an algorithm might be slower when distributed. I found the solution by using Spark dynamic allocation. This enables Spark to use only the required executors, while static allocation runs a job on all executors, which was increasing the execution time with more nodes.
