Is there any JAR job timeout limit on Databricks, or can JAR jobs run without limits? Our application starts a long-running Spark job that runs for weeks, creating Spark sessions and "firing" sub-jobs, but in the August release notes I found that notebook job execution is limited to 2 days. Is it possible to run streaming jobs under such limits?
Set spark.executor.heartbeatInterval to 100000 and spark.network.timeout to 200000 in the spark-defaults.conf file.
Code:
.set("spark.executor.heartbeatInterval", "100000") \
.set("spark.network.timeout", "200000")
Related
I am running a Spark job, as an EMR step, that partitions data on two columns. The job has spark.sql.sources.partitionOverwriteMode set to dynamic and uses SaveMode.Overwrite.
I can see that the Spark job has finished execution by looking at the Spark UI, but the EMR step stays in the RUNNING state for more than an hour. I can also see the _SUCCESS file in the output root directory, with a timestamp in line with the Spark job's completion.
Any idea why the EMR step isn't completing, or any best practices to speed up the process?
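For context, a minimal PySpark sketch of the kind of write described above; the input/output paths and the partition column names are placeholders, not taken from the original job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Overwrite only the partitions present in the incoming data,
# instead of truncating the whole output path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.read.parquet("s3://bucket/input/")   # placeholder input

(df.write
   .mode("overwrite")                  # SaveMode.Overwrite
   .partitionBy("col_a", "col_b")      # the two partition columns (placeholders)
   .parquet("s3://bucket/output/"))    # placeholder output path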
I use Airflow to submit multiple hourly Spark jobs to an EMR cluster. In one hour I can have upwards of 30 spark-submits.
The EMR cluster has 1 master node and 4 core nodes, all c4.4xlarge.
My spark-submits use --master yarn and --deploy-mode client.
Every hour multiple Airflow DAGs SSH into the EMR cluster and spark-submit their jobs. Most of the jobs are small and finish within a few minutes, except for a few that take 10-15 minutes.
I have been hitting a recurring error logged by Airflow, and once one task receives it, it cascades down to the rest of them:
airflow.exceptions.AirflowException: SSH operator error: No existing session
This means Airflow was unable to SSH into the cluster. I even tried to SSH in from my own computer and it just hangs. Is it possible there are too many Spark jobs running? I wouldn't think so, because my cluster is pretty big for the jobs I have to run.
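For illustration, a rough sketch of the submission pattern described above, using Airflow's SSHOperator to run spark-submit on the EMR master. The connection id, DAG name, and job path are placeholders, and the import path shown is the Airflow 1.x contrib location:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator  # Airflow 1.x path

with DAG("hourly_spark_jobs",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@hourly",
         catchup=False) as dag:

    submit_job = SSHOperator(
        task_id="submit_spark_job",
        ssh_conn_id="emr_master",  # placeholder Airflow SSH connection
        command=(
            "spark-submit --master yarn --deploy-mode client "
            "/home/hadoop/jobs/my_job.py"  # placeholder job path
        ),
    )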
I have a custom, non-Ambari Spark 2.3.1 installation on an HDP 2.6.2 cluster. I have made all the necessary configuration changes as per the Spark and non-Ambari installation guides.
Now when I submit a Spark application in YARN cluster mode, I see a huge gap of 10-12 minutes between jobs, and I do not see any error or any operation being performed between them. The attached screenshot shows the delay of close to 10 minutes between jobs, which leads to unnecessary delay in completing the overall Spark job.
[Screenshot: Spark 2.3.1 job submitted in YARN cluster mode]
I have checked the YARN logs and the Spark UI, and I do not see any errors or any operations logged with timestamps in the gap between jobs.
Looking through the event timeline, I see the gap of 10+ minutes between jobs.
[Screenshot: event timeline gap between the jobs]
Any pointers on how to fix this issue and improve the performance of the job would be appreciated.
Regards,
Vish
I have a SparkR application with multiple jobs, but only one job is active at a time. How can I increase the number of active jobs? Here is my code; correct me if I am wrong. I am using Apache Spark 1.6.0, running on Amazon AWS on EMR-4.4.0.
[Screenshot: updated Spark UI]
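The application here is SparkR, but as a general illustration of the pattern (not the poster's code): within one application, Spark runs multiple jobs concurrently only when their actions are submitted from separate driver threads. A rough PySpark 2.x-style sketch of that idea, with placeholder datasets:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_a = spark.range(0, 10_000_000)  # placeholder dataset
df_b = spark.range(0, 10_000_000)  # placeholder dataset

# Actions submitted from separate driver threads can run as concurrent
# Spark jobs, subject to available executor cores.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_a = pool.submit(df_a.count)
    fut_b = pool.submit(df_b.count)
    print(fut_a.result(), fut_b.result())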
When submitting a long-running Spark batch job through Livy, the job defaults to five retries, which takes forever to finally fail. How can I change this so the job fails immediately?
My environment is Spark 1.6, running on Azure HDInsight (HDP).
Thanks!
This is a YARN configuration, not a Livy one.
Go to YARN's configuration page, under "Advanced yarn-site", and change yarn.resourcemanager.am.max-attempts from 5 to 1 if you want no retries.
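For reference, a minimal sketch of the property the answer refers to, as it would appear in yarn-site.xml (a value of 1 means the application master gets a single attempt, i.e. no retries):

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>1</value>
</property>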