Spark long running jobs with dataset - apache-spark

I have a spark code that used to run batch jobs(each job span varies from few seconds to few minutes). Now I wanted to take this same code and run it long running. To do this I have thought to create spark context only once and then in a while loop I would wait for new config/tasks to come and will start executing them.
So far whenever I tried to run this code, my applications stops running after 5-6 iterations without any exception or error printed. This long running job has been assigned with 1 executor with 10GB of memory and a spark driver with 4GB of memory(which was good for our batch job). So my questions is what are various things that we need to do to move from small batch jobs to long running jobs within code itself. I have seen this useful link - http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/ but this link is mostly about spark configurations to keep them running for long.
Spark version - 2.3 (can move to spark 2.4.1) running over yarn cluster

Related

Spark not using all CPUs available

I am running a query using Hive on Spark which is exhibiting some strange behavior. I've run it multiple times and observed the same behavior. The query:
reads from a large Hive external table
Spark creates about ~990,000 tasks
runs in a YARN queue with > 2900 CPUs available
uses 700 executors with 4 CPUs per executor
All is well at the start of the job. After ~1.5 hours of 2800 CPUs cranking, the job is ~80% complete (800k/990k tasks). From there, things start to nosedive: Spark stops using all of the CPUs available to it to work on tasks. With ~190k tasks to go, Spark will gradually drop from using 2800 CPUs to double digits (usually bottoming out around 20 total CPUs). This makes the last 190k tasks take significantly longer to finish than the previous 800k.
I could see as the job got very close to completing that Spark would be unable to parallelize a small amount of remaining tasks across a large number of CPUs. But with 190k tasks left to be started, it seems way too early for that.
Things I've checked:
No other job is pre-empting its resources in YARN. (In addition, if this were the case, I would expect the job to randomly lose/regain resources, instead of predictably losing steam at the 80% mark).
This occurs whether dynamic allocation is enabled or disabled. If disabled, Spark has all 2800 CPUs available for the entire run time of the job - it just doesn't use them. If enabled, Spark does spin down executors as it decides it no longer needs them.
If data skew were the issue, I could see some tasks taking longer than others to finish. But it doesn't explain why Spark wouldn't be using idle CPUs to start on the backlog of tasks still to go.
Does anyone have any advice?
For posterity, this answer from Travis Hegner contained the answer.
Setting spark.locality.wait=0s fixes this issue. I'm also not sure why a 3 second wait causes such a pile up in Spark's ability to schedule tasks, but setting to 0 makes the job run extremely well.

Possible reasons that spark waits and does not schedule tasks to run?

This might be a very generic question but hope someone can point some hint. But I found that sometimes, my job spark seems to hit a "pause" many times:
The natural of the job is: read orc files (from a hive table), filter by certain columns, no join, then write out to another hive table.
There were total 64K tasks for my job / stage (FileScan orc, followed by Filter, Project).
The application has 500 executors, each has 4 cores. Initially, about 2000 tasks were running concurrently, things look good.
After a while, I noticed the number running tasks dropped all the way near 100. Many cores/executors were just waiting with nothing to do. (I checked the log from these waiting executors, there was no error. All assigned tasks were done on them, they were just waiting)
After about 3-5 minutes, then these waiting executors suddenly got tasks assigned and now were working happily.
Any particular reasons this can be? The application is running from spark-shell (--master yarn --deploy-mode client, with number of executors/sizes etc. specified)
Thanks!

Spark Streaming - Jobs run concurrently with default spark.streaming.concurrentJobs setting

I have come across a wierd behaviour in Spark Streaming Job.
We have used the default value for spark.streaming.concurrentJobs which is 1.
The same streaming job was running for more than a day properly with the batch interval being set to 10 Minutes.
Suddenly that same job has started running concurrently for all the batches that come in without putting them in Queue.
Has anyone faced this before?
This would be of great help!
This kind of behavior seems to be curious but I believe seems to happen when there is only 1 job running at a time and if batch processing time < batch interval, then the system seems to be stable.
Spark Streaming creator Tathagata hs mentioned about this: How jobs are assigned to executors in Spark Streaming?.

Why could SparkSession initialization take longer every iteration in a single application?

I use spark for batch analysis.
I run Spark on a standalone Ubuntu server with 128G memory and 32-core CPU. Run spark-sumbit my_code.py without any additional configuration parameters.
In a while loop I start SparkSession, analyze data and then stop the context and this process repeats every 10 seconds.
while True:
spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize' , '5g').getOrCreate()
sc = spark.sparkContext
#some process and analyze
spark.stop()
When program starts, it works perfectly.
but when it works for many hours. spark initialization take long time.
it makes 10 or 20 seconds for just initializing spark.
So what is the problem ?
You use a single-JVM local run mode. I can't explain exactly what happens in your case, but it's not surprising to see this single JVM being more and more under pressure for memory. It starts clean and over time Spark leaves some temporary objects before they get GCed.
I strongly recommend attaching jconsole to see the JVM metrics and monitor memory and CPU usage.

Spark Performance issue while adding more worker nodes

I am being new on Spark. I am facing performance issue when the number of worker nodes are increased. So to investigate that, I have tried some sample code on spark-shell.
I have created a Amazon AWS EMR with 2 worker nodes (m3.xlarge). I have used the following code on spark-shell on the master node.
var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()
This code executed without any issues and took around 8 mins. But when I have added 2 more worker nodes (m3.xlarge) and executed the same code using spark-shell on master node, the time increased to 10 mins.
Here is the issue, I think the time should be decreased, not by half, but I should decrease. I have no idea why on increasing worker node same spark job is taking more time. Any idea why this is happening? Am I missing any thing?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is a heavy one, doing that with 2 nodes will take more time then with one.
I would start by comparing some simpler transformations, running a more asynchronous code, as without any sync points (such as group by key), and see if you get the same issue.
#z-star, yes an algorithm might b slow when distributed. I found the solution by using Spark Dynamic Allocation. This enable spark to use only required executors. While the static allocation runs a job on all executors, which was increasing the execution time with more nodes.

Resources