I started a Spark Streaming job running on YARN. While it is running, I kill some executors with SparkContext.killExecutors and add some executors back.
After roughly half an hour to two hours, an exception occurs.
I have read in some posts that the external shuffle service is necessary for dynamic resource allocation in Spark Streaming.
My question is: is the external shuffle service also necessary for SparkContext.killExecutors?
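For reference, this is roughly what I am doing (a minimal sketch; the app name and executor ID are placeholders, and both methods are developer APIs on the underlying SparkContext):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class KillExecutorSketch {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(
                    new SparkConf().setAppName("kill-executor-sketch")); // placeholder name

            // "3" is a placeholder executor ID, taken from the Executors page of the UI.
            boolean killed = jsc.sc().killExecutor("3");
            boolean added = jsc.sc().requestExecutors(1); // ask the cluster manager for one more

            System.out.println("killed=" + killed + ", added=" + added);
            jsc.stop();
        }
    }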
We are migrating two Spark Streaming jobs that use Structured Streaming from on-prem to GCP.
One of them streams messages from Kafka and saves them in GCS; the other streams from GCS and saves to BigQuery.
Sometimes these jobs fail because of some problem, for example OutOfMemoryError, connection reset by peer, or Java heap space.
When we get an exception in the on-prem environment, YARN marks the job as FAILED, and we have a scheduler flow that raises the job again.
In GCP we built the same flow to raise the job again when it fails. But when we get an exception in Dataproc, YARN marks the job as SUCCEEDED and Dataproc keeps the status RUNNING.
You can see in this image the log with the StreamingQueryException while the status of the job is still Running ("Em execução" means "running" in Portuguese).
Dataproc job
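The driver is shaped roughly like the sketch below (simplified; the rate source stands in for the real Kafka-to-GCS query, and the app name is a placeholder). Our working theory is that if the StreamingQueryException is caught and the main method returns normally, YARN sees a clean exit, so one workaround we are evaluating is exiting non-zero from the driver:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class FailFastStreamJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("fail-fast-stream") // placeholder name
                    .getOrCreate();
            try {
                // Trivial rate source standing in for the real Kafka-to-GCS query.
                Dataset<Row> stream = spark.readStream().format("rate").load();
                StreamingQuery query = stream.writeStream()
                        .format("console")
                        .start();
                query.awaitTermination(); // throws StreamingQueryException if the query fails
            } catch (Exception e) {
                e.printStackTrace();
                System.exit(1); // exit non-zero so YARN marks the application FAILED
            }
        }
    }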
I have a custom non-Ambari installation of Spark 2.3.1 on HDP 2.6.2 running on a cluster. I have made all the necessary configuration changes per the Spark and non-Ambari installation guides.
Now, when I submit a Spark job in YARN cluster mode, I see a huge gap of 10-12 minutes between jobs, and I do not see any error or any operation being performed in between. The attached screenshot shows a delay of close to 10 minutes between jobs, which is adding unnecessary delay to completing the Spark job.
Spark 2.3.1 job submitted in Yarn Cluster mode
I have checked the YARN logs and the Spark UI, and I do not see any errors or any operations logged with a timestamp between the jobs.
Looking through the event timeline, I see the gap of 10+ minutes between the jobs.
Event timeline gap between the jobs
Any pointers on how to diagnose and fix this issue and improve the performance of the job would be appreciated.
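In the meantime, one thing I can try is raising the driver's log level so that scheduler activity during the gap shows up in the logs (a sketch, with a placeholder app name; DEBUG is very noisy, so only for investigation):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GapInvestigation {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("gap-investigation")); // placeholder name
            sc.setLogLevel("DEBUG"); // surfaces driver/scheduler activity between jobs
            // ... actual job logic ...
            sc.stop();
        }
    }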
Regards,
Vish
How can we avoid executor failures while Spark jobs are executing?
We are using Spark 1.6 as part of Cloudera CDH 5.10.
We normally get the error below:
ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 127100 ms
There can be various reasons for slow task execution that eventually hits the timeout; you need to drill down to find the root cause.
Sometimes tuning the default timeout configuration parameters also helps. Go to the Spark UI's Environment tab, find the current values of the parameters below, then increase the timeouts in spark-submit:
spark.worker.timeout
spark.network.timeout
spark.akka.timeout
Running the job with speculative execution (spark.speculation=true) also helps: if one or more tasks are running slowly in a stage, they will be re-launched.
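A minimal sketch of setting these programmatically; the values are illustrative, not recommendations, and the same keys can be passed with --conf on spark-submit:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TimeoutTuningSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("timeout-tuning-sketch")  // placeholder name
                    .set("spark.network.timeout", "600s") // general network timeout
                    .set("spark.akka.timeout", "600s")    // Akka RPC timeout (Spark 1.x only)
                    .set("spark.worker.timeout", "600")   // standalone master/worker heartbeat, seconds
                    .set("spark.speculation", "true");    // re-launch slow tasks
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job logic ...
            sc.stop();
        }
    }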
Explore the Spark 1.6.0 configuration properties for more details.
We are running a cluster with 1 namenode and 3 datanodes on Azure, and I run my Spark job on it in yarn-cluster mode.
We are using HDP 2.5, which ships with Spark 1.6.2. Now I have a very weird issue where the processing time of my job suddenly increases to 4 s.
This has happened quite a few times but does not follow a pattern; sometimes the 4 s processing time appears from the start of the job, sometimes in the middle, as shown below.
One thing to note is that no events are coming in to be processed, so technically the processing time should stay almost the same. Also, my Spark Streaming job has a batch duration of 1 s, so it can't be that.
I don't have any errors in the logs or anywhere else, and I am at a loss as to how to debug this issue.
Minor details about the job:
I am reading messages from a Kafka topic and then storing them in HBase tables using the Phoenix JDBC connector.
EDIT: More Information
In InsertTransactionsPerRDDPartitions, I open a connection and perform the write to HBase using Phoenix JDBC.
    updatedEventLinks.foreachRDD(rdd -> {
        if (!rdd.isEmpty()) {
            rdd.foreachPartition(new InsertTransactionsPerRDDPartitions(this.prop));
            rdd.foreachPartition(new DoSomethingElse(this.kafkaPublishingProps, this.prop));
        }
    });
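For reference, InsertTransactionsPerRDDPartitions is shaped roughly like the sketch below (simplified; the record type, table name, and the "phoenix.url" property key are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Iterator;
    import java.util.Properties;
    import org.apache.spark.api.java.function.VoidFunction;

    // One Phoenix JDBC connection is opened per partition, rows are upserted in a
    // batch, and the connection is committed and closed when the partition is done.
    public class InsertTransactionsPerRDDPartitions implements VoidFunction<Iterator<String>> {
        private final Properties prop; // assumed to carry the JDBC URL, e.g. jdbc:phoenix:<zk-quorum>

        public InsertTransactionsPerRDDPartitions(Properties prop) {
            this.prop = prop;
        }

        @Override
        public void call(Iterator<String> rows) throws Exception {
            try (Connection conn = DriverManager.getConnection(prop.getProperty("phoenix.url"));
                 PreparedStatement ps =
                         conn.prepareStatement("UPSERT INTO TRANSACTIONS (ID) VALUES (?)")) {
                while (rows.hasNext()) {
                    ps.setString(1, rows.next());
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit(); // Phoenix disables auto-commit by default
            }
        }
    }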
I'm having trouble evenly distributing streaming receivers among all the executors of a yarn-cluster.
I've got a yarn-cluster with 8 executors, and I create 8 custom streaming receivers; Spark is supposed to launch these receivers one per executor. However, this doesn't always happen, and sometimes all the receivers are launched on the same executor (here's the JIRA bug: https://issues.apache.org/jira/browse/SPARK-10730).
So my idea is to run a dummy job, get the executors that were involved in that job, and, if I got all the executors, create the streaming receivers; a sketch of that probe follows below.
To do that, though, I need a way to find out from Java/Scala code which executors were used for a job.
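This is the kind of probe I have in mind (a sketch; SparkEnv is a developer API, and executorId() returns the ID of the executor running the current task):

    import java.util.Collections;
    import java.util.List;
    import org.apache.spark.SparkEnv;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExecutorProbe {
        // Runs a dummy job with many small partitions and returns the distinct
        // executor IDs that actually ran a task.
        public static List<String> usedExecutors(JavaSparkContext sc, int numPartitions) {
            return sc.parallelize(Collections.nCopies(numPartitions, 0), numPartitions)
                     .map(x -> SparkEnv.get().executorId())
                     .distinct()
                     .collect();
        }
    }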
I believe it is possible to see which executors were doing which jobs by accessing the Spark UI and the Spark logs. From the official 1.5.0 documentation (here):
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information
Information about the running executors
In the following screen you can see which executors are active. If there are cores/nodes that are not being used, you can detect them by looking at which cores/nodes are actually active and running.
In addition, every executor displays information about the number of tasks running on it.