Spark streaming job on yarn keeps terminating after a few hours - apache-spark

I have a Spark Streaming job that consumes a Kafka topic and writes to a database. I submitted the job to YARN with the following parameters:
spark-submit \
--jars mongo-spark-connector_2.11-2.4.0.jar,mongo-java-driver-3.11.0.jar,spark-sql-kafka-0-10_2.11-2.4.5.jar \
--driver-class-path mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.executor.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.driver.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--class com.example.StreamingApp \
--driver-memory 2g \
--num-executors 6 --executor-cores 3 --executor-memory 3g \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.pid.minRate=10 \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--queue users.adminuser \
--conf spark.speculation=true \
StreamingApp-2-4.0.1-SNAPSHOT.jar
But it terminates after a few hours with the following message on the terminal:
21/02/14 04:05:14 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:14 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restart
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.adminuser
start time: 1613260105314
final status: UNDEFINED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:15 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:16 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:17 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:18 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:19 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:20 INFO yarn.Client: Application report for application_1613217899387_6697 (state: FINISHED)
21/02/14 04:05:20 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
ApplicationMaster host: XXXXXXXXXx
ApplicationMaster RPC port: 41848
queue: root.users.adminuser
start time: 1613260105314
final status: FAILED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:20 ERROR yarn.Client: Application diagnostics message: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
Exception in thread "main" org.apache.spark.SparkException: Application application_1613217899387_6697 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1155)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/14 04:05:20 INFO util.ShutdownHookManager: Shutdown hook called
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e51d75e7-f19b-4f2f-8d46-b91b1af064b3
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-16d23c16-30f5-4ab7-95b2-4ac4ad584905
I've relaunched it a few times, but the same thing keeps happening.
Spark version is 2.4.0-cdh6.2.1; ResourceManager version is 3.0.0-cdh6.2.1.
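For context, the job described above is a Kafka-to-MongoDB pipeline (spark-sql-kafka-0-10 source, mongo-spark-connector sink). Below is a minimal sketch of that kind of app, assuming Structured Streaming with foreachBatch; the broker, topic, URI, checkpoint path and column handling are hypothetical rather than taken from the actual StreamingApp:
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingApp").getOrCreate()

    // Kafka source (spark-sql-kafka-0-10); broker and topic are hypothetical
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // MongoDB sink via the mongo-spark-connector, one bulk write per micro-batch
    val query = stream.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/streaming-app") // hypothetical path
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .format("mongo")
          .mode("append")
          .option("uri", "mongodb://mongo-host:27017/db.collection") // hypothetical URI
          .save()
      }
      .start()

    query.awaitTermination()
  }
}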

Related

YARN - application not getting accepted, error code 125

I am trying to submit a Spark job to YARN with spark-submit, but the application first hangs in the ACCEPTED state, then fails with the following error:
22/11/23 17:58:24 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:24 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.my_user
start time: 1669222703023
final status: UNDEFINED
tracking URL: https://mask:8090/proxy/application_1668608030982_2921/
user: my_user
22/11/23 17:58:25 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:26 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:27 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:28 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:29 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:30 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:31 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:32 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:33 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:34 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:35 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:36 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:37 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:38 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:39 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:40 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:41 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:42 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:43 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:44 INFO yarn.Client: Application report for application_1668608030982_2921 (state: FAILED)
22/11/23 17:58:44 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1668608030982_2921 failed 2 times due to AM Container for appattempt_1668608030982_2921_000002 exited with exitCode: 125
Failing this attempt.Diagnostics: [2022-11-23 17:58:43.566]Exception from container-launch.
Container id: container_e172_1668608030982_2921_02_000001
Exit code: 125
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is my_user
main : requested yarn user is my_user
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /var/SP/data12/yarn/nm/nmPrivate/application_1668608030982_2921/container_e172_1668608030982_2921_02_000001/container_e172_1668608030982_2921_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
I cannot find any references to exit code 125 for YARN; any idea why this fails?
The deploy mode is cluster.
This is the spark-submit command with a mock class name and without the app params at the end (they are verified to be good parameters):
nohup spark-submit \
--class com.myClass \
--master yarn \
--deploy-mode $DEPLOY_MODE \
--num-executors $NUM_EXEC \
--executor-memory $EXEC_MEM \
--executor-cores $NUM_CORES \
--driver-memory "2g" \
--jars /opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j-driver.properties -Dvm.logging.level=$DRIVER_LOGLEVEL -Dvm.logging.name=$LOGGING_NAME" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:log4j-executor.properties -Dvm.logging.level=$EXECUTOR_LOGLEVEL -Dvm.logging.name=$LOGGING_NAME" \
--files "log4j-driver.properties,log4j-executor.properties" \
--conf spark.yarn.keytab=$KRB_KEYTAB \
--conf spark.yarn.principal=$KRB_PRINCIPAL \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.catalogImplementation=in-memory \
--conf spark.sql.files.ignoreCorruptFiles=true \
$JAR

How to end Spark Submit and State Accepted

I'm running a data cleaning job using Apache Griffin (https://griffin.apache.org/docs/quickstart.html),
and I submit the Spark job with:
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
/home/bigdata/apache-hive-2.2.0-bin/measure-0.4.0.jar \
/home/bigdata/apache-hive-2.2.0-bin/env.json /home/bigdata/apache-hive-2.2.0-bin/dq.json
My job then keeps reporting the following:
20/04/08 13:18:30 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:31 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:32 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:33 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:34 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:35 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:36 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:37 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:38 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:39 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:40 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
and it never stops.
When I check the status in YARN:
bigdata@dq2:~$ yarn application -status application_1586344612496_0231
20/04/08 13:16:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Application Report :
Application-Id : application_1586344612496_0231
Application-Name : batch_accu
Application-Type : SPARK
User : bigdata
Queue : default
Start-Time : 1586348775760
Finish-Time : 0
Progress : 0%
State : ACCEPTED
Final-State : UNDEFINED
Tracking-URL : N/A
RPC Port : -1
AM Host : N/A
Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
Diagnostics :
The job is not moving. Can anyone please help?
In my experience, there could be many causes for this issue, but the first checks you should do are the following:
Your firewall could be blocking some of the ports between the nodes inside your Hadoop cluster, so the computation never starts. Try temporarily disabling the firewall on the private interface and submit again to rule this out (if this is the problem, re-enable the firewall and work out which ports you need to open!).
Spark might be configured incorrectly, e.g. its resource requests may not fit what YARN can actually grant; see the rough numbers below.
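To put rough numbers on the resource point for this particular submit: --num-executors 2 --executor-memory 1g asks YARN for two executor containers of about 1 GB plus the default memory overhead of max(384 MB, 10% of executor memory), i.e. roughly 1.4 GB each, plus a container for the ApplicationMaster. If the default queue's limits or the NodeManagers' free memory cannot cover that, the application sits in ACCEPTED with 0% progress and zero aggregate resource allocation, exactly as in the report above. (These figures assume Spark's default overhead settings; check the ResourceManager UI for the queue's actual capacity.)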

Spark in Yarn Cluster Mode - Yarn client reports FAILED even when job completes successfully

I am experimenting with running Spark in yarn cluster mode (v2.3.0). We have traditionally been running in yarn client mode, but some jobs are submitted from .NET web services, so we have to keep a host process running in the background when using client mode (HostingEnvironment.QueueBackgroundWorkTime...). We are hoping we can execute these jobs in a more "fire and forget" style.
Our jobs continue to run successfully, but we see a curious entry in the logs where the yarn client that submits the job to the application manager is always reporting failure:
18/11/29 16:54:35 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:36 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:37 INFO yarn.Client: Application report for application_1539978346138_110818 (state: FINISHED)
18/11/29 16:54:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: <ip address>
ApplicationMaster RPC port: 0
queue: root.default
start time: 1543510402372
final status: FAILED
tracking URL: http://server.host.com:8088/proxy/application_1539978346138_110818/
user: p800s1
Exception in thread "main" org.apache.spark.SparkException: Application application_1539978346138_110818 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:892)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/29 16:54:37 INFO util.ShutdownHookManager: Shutdown hook called
We always create a SparkSession and always return sys.exit(0) (although that appears to be ignored by the Spark framework regardless of how we submit a job). We also have our own internal error logging that routes to Kafka/ElasticSearch. No errors are reported during the job run.
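In code, the driver pattern being described is roughly the following (a minimal sketch; the app name and job body are placeholders, and the explicit spark.stop() is just the usual pattern rather than something stated above):
import org.apache.spark.sql.SparkSession

object MainClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FireAndForgetJob").getOrCreate()
    try {
      // ... job logic; runs to completion without any errors being reported ...
    } finally {
      spark.stop()
    }
    sys.exit(0) // as noted above, this appears to be ignored by the Spark framework
  }
}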
Here's an example of the submit command: spark2-submit --keytab /etc/keytabs/p800s1.ktf --principal p800s1@OURDOMAIN.COM --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 4g --class com.path.to.MainClass /path/to/UberJar.jar arg1 arg2
This seems to be harmless noise, but I don't like noise that I don't understand. Has anyone experienced something similar?

ERROR : User did not initialize spark context

Log error:
TestSuccessfull
2018-08-20 04:52:15 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13
2018-08-20 04:52:15 ERROR ApplicationMaster:91 - Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
2018-08-20 04:52:15 INFO SparkContext:54 - Invoking stop() from shutdown hook
Error log on the console after the submit command:
2018-08-20 05:47:35 INFO Client:54 - Application report for application_1534690018301_0035 (state: ACCEPTED)
2018-08-20 05:47:36 INFO Client:54 - Application report for application_1534690018301_0035 (state: ACCEPTED)
2018-08-20 05:47:37 INFO Client:54 - Application report for application_1534690018301_0035 (state: FAILED)
2018-08-20 05:47:37 INFO Client:54 -
client token: N/A
diagnostics: Application application_1534690018301_0035 failed 2 times due to AM Container for appattempt_1534690018301_0035_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: [2018-08-20 05:47:36.454]Exception from container-launch.
Container id: container_1534690018301_0035_02_000001
Exit code: 13
My code:
val sparkConf = new SparkConf().setAppName("Gathering Data")
val sc = new SparkContext(sparkConf)
Submit command:
spark-submit --class spark_basic.Test_Local --master yarn --deploy-mode cluster /home/IdeaProjects/target/Spark-1.0-SNAPSHOT.jar
Description:
I have installed Spark on Hadoop in pseudo-distributed mode.
spark-shell works fine; the only problem is when I use cluster mode.
My code also works fine: I am able to print the output, but at the end it gives this error.
I presume your code has a line which sets the master to local:
SparkConf.setMaster("local[*]")
If so, try commenting that line out and running again, as you are already setting the master to yarn in your spark-submit command:
/usr/cdh/current/spark-client/bin/spark-submit --class com.test.sparkApp --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 4 --driver-memory 17g --executor-memory 22g --files /usr/cdh/current/spark-client/conf/hive-site.xml /home/user/sparkApp.jar
Finally I got it working with this spark-submit command:
/home/mahendra/Marvaland/SparkEcho/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --master yarn --class spark_basic.Test_Local /home/mahendra/IdeaProjects/SparkTraining/target/SparkTraining-1.0-SNAPSHOT.jar
and this Spark session:
val spark = SparkSession.builder()
.appName("DataETL")
.master("local[1]")
.enableHiveSupport()
.getOrCreate()
Thanks @cricket_007.
This error may occur if you are submitting the Spark job like this:
spark-submit --class some.path.com.Main --master yarn --deploy-mode cluster some_spark.jar (passing master and deploy-mode as arguments on the CLI) while at the same time having new SparkContext(...) in your code.
Either get the context with val sc = SparkContext.getOrCreate(), or do not pass the master and deploy-mode arguments to spark-submit if you want to create a new SparkContext yourself.
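A minimal sketch of the first approach, reusing the class and app names from the question (the job logic is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

object Test_Local {
  def main(args: Array[String]): Unit = {
    // No setMaster here: --master yarn --deploy-mode cluster is supplied by spark-submit
    val sparkConf = new SparkConf().setAppName("Gathering Data")
    // Returns the already-active SparkContext if one exists, otherwise creates one from this conf
    val sc = SparkContext.getOrCreate(sparkConf)
    // ... job logic ...
    sc.stop()
  }
}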

Spark Job Container exited with exitCode: -1000

I have been struggling to run a sample job with Spark 2.0.0 in yarn cluster mode; the job exits with exitCode: -1000 without any other clues. The same job runs properly in local mode.
Spark command:
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar $@
TestJob class:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TestJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf();
        JavaSparkContext jsc = new JavaSparkContext(conf);
        System.out.println(
            "Total count:" +
            jsc.parallelize(Arrays.asList(new Integer[]{1, 2, 3, 4})).count());
        jsc.stop();
    }
}
Error Log:
17/10/04 22:26:52 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:52 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.xyz
start time: 1507181210893
final status: UNDEFINED
tracking URL: http://xyzserver:8088/proxy/application_1506717704791_130756/
user: xyz
17/10/04 22:26:53 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:54 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:55 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:56 INFO Client: Application report for application_1506717704791_130756 (state: FAILED)
17/10/04 22:26:56 INFO Client:
client token: N/A
diagnostics: Application application_1506717704791_130756 failed 5 times due to AM Container for appattempt_1506717704791_130756_000005 exited with exitCode: -1000
For more detailed output, check application tracking page:http://xyzserver:8088/cluster/app/application_1506717704791_130756Then, click on links to logs of each attempt.
Diagnostics: Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.xyz
start time: 1507181210893
final status: FAILED
tracking URL: http://xyzserver:8088/cluster/app/application_1506717704791_130756
user: xyz
17/10/04 22:26:56 INFO Client: Deleted staging directory /xyz/spark/.sparkStaging/application_1506717704791_130756
Exception in thread "main" org.apache.spark.SparkException: Application application_1506717704791_130756 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
When I browse the page http://xyzserver:8088/cluster/app/application_1506717704791_130756 it doesn't exist.
No YARN application logs are found either:
$ yarn logs -applicationId application_1506717704791_130756
/apps/yarn/logs/xyz/logs/application_1506717704791_130756 does not have any log files.
What could possibly be the root cause of this error, and how can I get detailed error logs?
After spending nearly a whole day I found the root cause: when I remove spark.yarn.stagingDir it starts working, and I am still not sure why Spark was complaining about it.
Previous spark-submit:
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar $@
New spark-submit:
spark-submit \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar $@
