YARN - application not getting accepted, error code 125 - apache-spark

I am trying to run a spark-submit against YARN, but the application first hangs in the ACCEPTED state and then fails with the following error:
22/11/23 17:58:24 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:24 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.my_user
start time: 1669222703023
final status: UNDEFINED
tracking URL: https://mask:8090/proxy/application_1668608030982_2921/
user: my_user
22/11/23 17:58:25 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:26 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:27 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:28 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:29 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:30 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:31 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:32 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:33 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:34 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:35 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:36 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:37 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:38 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:39 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:40 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:41 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:42 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:43 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:44 INFO yarn.Client: Application report for application_1668608030982_2921 (state: FAILED)
22/11/23 17:58:44 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1668608030982_2921 failed 2 times due to AM Container for appattempt_1668608030982_2921_000002 exited with exitCode: 125
Failing this attempt.Diagnostics: [2022-11-23 17:58:43.566]Exception from container-launch.
Container id: container_e172_1668608030982_2921_02_000001
Exit code: 125
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is my_user
main : requested yarn user is my_user
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /var/SP/data12/yarn/nm/nmPrivate/application_1668608030982_2921/container_e172_1668608030982_2921_02_000001/container_e172_1668608030982_2921_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
I cannot find any references to exit code 125 for YARN; any idea why this fails?
The deploy mode is cluster.
This is the spark-submit command, with a mock class name and without the application parameters at the end (they are verified to be good parameters):
nohup spark-submit \
--class com.myClass \
--master yarn \
--deploy-mode $DEPLOY_MODE \
--num-executors $NUM_EXEC \
--executor-memory $EXEC_MEM \
--executor-cores $NUM_CORES \
--driver-memory "2g" \
--jars /opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j-driver.properties -Dvm.logging.level=$DRIVER_LOGLEVEL -Dvm.logging.name=$LOGGING_NAME" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:log4j-executor.properties -Dvm.logging.level=$EXECUTOR_LOGLEVEL -Dvm.logging.name=$LOGGING_NAME" \
--files "log4j-driver.properties,log4j-executor.properties" \
--conf spark.yarn.keytab=$KRB_KEYTAB \
--conf spark.yarn.principal=$KRB_PRINCIPAL \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.catalogImplementation=in-memory \
--conf spark.sql.files.ignoreCorruptFiles=true \
$JAR
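The container-executor reports "Launch container failed", so the failure happens while the NodeManager launches the AM container rather than inside the application code. A first diagnostic step (a generic sketch using the application id from the report above, not a definitive fix) is to pull the aggregated container logs and look for the underlying launch error:

# Pull aggregated YARN logs for the failed application and search for the real error
yarn logs -applicationId application_1668608030982_2921 > app_2921.log
grep -iE "error|exception|failed" app_2921.log | head -50
# Also check the NodeManager log on the host that launched
# container_e172_1668608030982_2921_02_000001, since the exit code comes from container launch.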

Related

Spark streaming job on yarn keeps terminating after a few hours

I have a spark streaming job that consumes a kafka topic and writes to a database. I submitted the job to yarn with the following parameters:
spark-submit \
--jars mongo-spark-connector_2.11-2.4.0.jar,mongo-java-driver-3.11.0.jar,spark-sql-kafka-0-10_2.11-2.4.5.jar \
--driver-class-path mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.executor.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.driver.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--class com.example.StreamingApp \
--driver-memory 2g \
--num-executors 6 --executor-cores 3 --executor-memory 3g \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.pid.minRate=10 \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--queue users.adminuser \
--conf spark.speculation=true \
StreamingApp-2-4.0.1-SNAPSHOT.jar
But it terminates after a few hours with the following message on the terminal:
21/02/14 04:05:14 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:14 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restart
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.adminuser
start time: 1613260105314
final status: UNDEFINED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:15 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:16 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:17 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:18 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:19 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:20 INFO yarn.Client: Application report for application_1613217899387_6697 (state: FINISHED)
21/02/14 04:05:20 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
ApplicationMaster host: XXXXXXXXXx
ApplicationMaster RPC port: 41848
queue: root.users.adminuser
start time: 1613260105314
final status: FAILED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:20 ERROR yarn.Client: Application diagnostics message: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
Exception in thread "main" org.apache.spark.SparkException: Application application_1613217899387_6697 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1155)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/14 04:05:20 INFO util.ShutdownHookManager: Shutdown hook called
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e51d75e7-f19b-4f2f-8d46-b91b1af064b3
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-16d23c16-30f5-4ab7-95b2-4ac4ad584905
I've relaunched a few times, but the same thing keeps happening.
Spark version is 2.4.0-cdh6.2.1, ResourceManager version is 3.0.0-cdh6.2.1
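The diagnostics say all nodes were blacklisted due to executor failures, so the blacklisting is a symptom rather than the cause. A hedged first step (sketch only, using the application id from the report above) is to pull the aggregated logs and see why the executors kept dying before touching any of the failure-tolerance settings:

# Inspect why the executors kept failing after the RM restart
yarn logs -applicationId application_1613217899387_6697 > streaming_app.log
grep -iE "exit code|killed|OutOfMemory|Container marked as failed" streaming_app.log | head -50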

How to end Spark Submit and State Accepted

I'm running a data cleaning job using Apache Griffin: https://griffin.apache.org/docs/quickstart.html
After submitting the Spark job with
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
/home/bigdata/apache-hive-2.2.0-bin/measure-0.4.0.jar \
/home/bigdata/apache-hive-2.2.0-bin/env.json /home/bigdata/apache-hive-2.2.0-bin/dq.json
the job just keeps reporting:
20/04/08 13:18:30 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:31 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:32 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:33 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:34 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:35 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:36 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:37 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:38 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:39 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:40 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
And it never stops.
When I check the status in YARN:
bigdata#dq2:~$ yarn application -status application_1586344612496_0231
20/04/08 13:16:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Application Report :
Application-Id : application_1586344612496_0231
Application-Name : batch_accu
Application-Type : SPARK
User : bigdata
Queue : default
Start-Time : 1586348775760
Finish-Time : 0
Progress : 0%
State : ACCEPTED
Final-State : UNDEFINED
Tracking-URL : N/A
RPC Port : -1
AM Host : N/A
Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
Diagnostics :
The job is not moving; can anyone please help?
In my experience, there can be many causes for this issue, but the first checks you should do are the following:
Your firewall could be blocking some of the ports between the nodes inside your Hadoop cluster, so the computation never starts. Try temporarily disabling the firewall on the private interface and submit again to rule this out (if this is the problem, re-enable the firewall and identify the ports you need to open!).
Spark might be configured incorrectly (e.g. the resource requirements); see the quick checks sketched below.
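For the second point, a quick way to sanity-check that the cluster actually has the capacity to schedule the AM and executors (a generic sketch; hostnames and ports are examples):

# Confirm the NodeManagers are RUNNING and have free memory/vcores
yarn node -list -all
# Then compare the requested resources against the queue limits in the ResourceManager UI,
# e.g. http://<rm-host>:8088/cluster/scheduler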

How to run spark-submit in virtualenv for pyspark?

Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a situation where a Python file uses python3 (and some specific libs) in a virtualenv, to isolate lib versions from the rest of the system. I would like to run this file with /bin/spark-submit, but attempting to do so I get...
[me#airflowetl tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
File "/bin/hdp-select", line 255
print "ERROR: Invalid package - " + name
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
at org.apache.spark.launcher.Main.main(Main.java:118)
I also tried...
(venv) [me#airflowetl tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
....
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
...or (from here https://www.hackingnote.com/en/spark/trouble-shooting/NoClassDefFoundError-ClientConfig)...
(venv) [airflow#airflowetl tests]$ spark-submit --master yarn --deploy-mode client --conf spark.hadoop.yarn.timeline-service.enabled=false sparksubmit.test.py
19/12/12 15:22:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 15:22:49 INFO spark.SparkContext: Running Spark version 2.4.4
19/12/12 15:22:49 INFO spark.SparkContext: Submitted application: hph_etl_TEST
19/12/12 15:22:49 INFO spark.SecurityManager: Changing view acls to: airflow
19/12/12 15:22:49 INFO spark.SecurityManager: Changing modify acls to: airflow
19/12/12 15:22:49 INFO spark.SecurityManager: Changing view acls groups to:
19/12/12 15:22:49 INFO spark.SecurityManager: Changing modify acls groups to:
19/12/12 15:22:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(airflow); groups with view permissions: Set(); users with modify permissions: Set(airflow); groups with modify permissions: Set()
19/12/12 15:22:49 INFO util.Utils: Successfully started service 'sparkDriver' on port 45232.
19/12/12 15:22:50 INFO spark.SparkEnv: Registering MapOutputTracker
19/12/12 15:22:50 INFO spark.SparkEnv: Registering BlockManagerMaster
19/12/12 15:22:50 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/12/12 15:22:50 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/12/12 15:22:50 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-320366b6-609a-497b-ac40-119d11682044
19/12/12 15:22:50 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
19/12/12 15:22:50 INFO spark.SparkEnv: Registering OutputCommitCoordinator
19/12/12 15:22:50 INFO util.log: Logging initialized #2663ms
19/12/12 15:22:50 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/12/12 15:22:50 INFO server.Server: Started #2763ms
19/12/12 15:22:50 INFO server.AbstractConnector: Started ServerConnector#50a3c656{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/12/12 15:22:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#306c15f1{/jobs,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#2b566f8d{/jobs/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#1b5ef515{/jobs/job,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#59f7a5e2{/jobs/job/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#41c58356{/stages,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#2d5f2026{/stages/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#324ca89a{/stages/stage,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#6f487c61{/stages/stage/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#3897116a{/stages/pool,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#68ab090f{/stages/pool/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#42ea3278{/storage,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#6eedf530{/storage/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#6e71a5c6{/storage/rdd,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#5e222a76{/storage/rdd/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#4dc8aa38{/environment,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#4c8d82c4{/environment/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#2fb15106{/executors,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#608faf1c{/executors/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#689e405f{/executors/threadDump,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#48a5742a{/executors/threadDump/json,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#6db93559{/static,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#4d7ed508{/,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#5510f12d{/api,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#6d87de7{/jobs/job/kill,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#62595660{/stages/stage/kill,null,AVAILABLE,#Spark}
19/12/12 15:22:50 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://airflowetl.local:4040
19/12/12 15:22:51 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/12/12 15:22:51 INFO client.RMProxy: Connecting to ResourceManager at hw001.local/172.18.4.46:8050
19/12/12 15:22:51 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
19/12/12 15:22:51 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (15360 MB per container)
19/12/12 15:22:51 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
19/12/12 15:22:51 INFO yarn.Client: Setting up container launch context for our AM
19/12/12 15:22:51 INFO yarn.Client: Setting up the launch environment for our AM container
19/12/12 15:22:51 INFO yarn.Client: Preparing resources for our AM container
19/12/12 15:22:51 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/12/12 15:22:53 INFO yarn.Client: Uploading resource file:/tmp/spark-4e600acd-2d34-4271-b01c-25f312906f93/__spark_libs__8368679994314392346.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/__spark_libs__8368679994314392346.zip
19/12/12 15:22:54 INFO yarn.Client: Uploading resource file:/home/airflow/projects/hph_etl_airflow/venv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/pyspark.zip
19/12/12 15:22:55 INFO yarn.Client: Uploading resource file:/home/airflow/projects/hph_etl_airflow/venv/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/py4j-0.10.7-src.zip
19/12/12 15:22:55 INFO yarn.Client: Uploading resource file:/tmp/spark-4e600acd-2d34-4271-b01c-25f312906f93/__spark_conf__5403285055443058510.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/__spark_conf__.zip
19/12/12 15:22:55 INFO spark.SecurityManager: Changing view acls to: airflow
19/12/12 15:22:55 INFO spark.SecurityManager: Changing modify acls to: airflow
19/12/12 15:22:55 INFO spark.SecurityManager: Changing view acls groups to:
19/12/12 15:22:55 INFO spark.SecurityManager: Changing modify acls groups to:
19/12/12 15:22:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(airflow); groups with view permissions: Set(); users with modify permissions: Set(airflow); groups with modify permissions: Set()
19/12/12 15:22:56 INFO yarn.Client: Submitting application application_1572898343646_0029 to ResourceManager
19/12/12 15:22:56 INFO impl.YarnClientImpl: Submitted application application_1572898343646_0029
19/12/12 15:22:56 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1572898343646_0029 and attemptId None
19/12/12 15:22:57 INFO yarn.Client: Application report for application_1572898343646_0029 (state: ACCEPTED)
19/12/12 15:22:57 INFO yarn.Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1576200176385
final status: UNDEFINED
tracking URL: http://hw001.local:8088/proxy/application_1572898343646_0029/
user: airflow
19/12/12 15:22:58 INFO yarn.Client: Application report for application_1572898343646_0029 (state: FAILED)
19/12/12 15:22:58 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1572898343646_0029 failed 2 times due to AM Container for appattempt_1572898343646_0029_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2019-12-12 15:22:58.214]Exception from container-launch.
Container id: container_e02_1572898343646_0029_02_000001
Exit code: 1
[2019-12-12 15:22:58.215]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/hadoop/yarn/local/usercache/airflow/appcache/application_1572898343646_0029/container_e02_1572898343646_0029_02_000001/launch_container.sh: line 38: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/3.1.0.0-78/hadoop/*:/usr/hdp/3.1.0.0-78/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
....
Not sure what to make of this or how to proceed further; I did not totally understand the error message even after googling it.
Does anyone with more experience have further debugging tips or fixes?
spark-submit is a bash script and uses Java classes to run, so using a virtualenv wouldn't necessarily help (although you can see in the logs that files were uploaded from the environment).
The first error is because hdp-select requires Python 2, but it looks like it ran with Python 3 (probably due to your venv).
If you want to carry your Python environment to the executors and the driver, you'd probably want to use the --py-files option instead, or set up the same Python environment on each Spark node.
Also, you seem to have Spark 2.4.4, not 2.3.2 as you say, which could explain the NoClassDefFoundError if you're mixing Spark versions (in particular, pyspark from pip doesn't download any scheduler-specific packages, such as the YARN timeline client).
That said, the submission itself went through, and you can find the real exception under the tracking URL:
http://hw001.local:8088/proxy/application_1572898343646_0029
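If the goal really is to run the job with the virtualenv's own Python on the cluster, one commonly used pattern (a sketch under assumptions: it relies on the venv-pack tool, and venv.tar.gz and the environment alias are placeholder names) is to pack the environment, ship it with --archives, and point PYSPARK_PYTHON at the unpacked interpreter:

# Pack the active virtualenv into a relocatable archive
venv-pack -o venv.tar.gz
# Ship it with the application; YARN unpacks it under the alias "environment"
spark-submit --master yarn --deploy-mode cluster \
  --archives venv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  sparksubmit.test.py

This avoids installing the libraries on every node, at the cost of a larger upload per submission.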

Spark Streaming failing on YARN Cluster

I have a cluster with 1 master and 2 slaves. I'm running a Spark streaming job on the master and I want to utilize all nodes in my cluster. I specified some parameters like driver memory and executor memory in my code. When I pass --deploy-mode cluster --master yarn-cluster to my spark-submit, it gives the following error.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/12 13:24:49 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/08/12 13:24:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/12 13:24:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/12 13:24:49 INFO Client: Setting up container launch context for our AM
15/08/12 13:24:49 INFO Client: Preparing resources for our AM container
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.5.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
15/08/12 13:24:49 INFO Client: Setting up the launch environment for our AM container
15/08/12 13:24:49 INFO SecurityManager: Changing view acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: Changing modify acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
15/08/12 13:24:49 INFO Client: Submitting application 3808 to ResourceManager
15/08/12 13:24:49 INFO YarnClientImpl: Submitted application application_1437639737006_3808
15/08/12 13:24:50 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:50 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1439385889600
final status: UNDEFINED
tracking URL: http://hostname:port/proxy/application_1437639737006_3808/
user: hdfs
15/08/12 13:24:51 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:52 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:53 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:54 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:55 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:56 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:57 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:58 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:59 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:00 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:01 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:02 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:03 INFO Client: Application report for application_1437639737006_3808 (state: FAILED)
15/08/12 13:25:03 INFO Client:
client token: N/A
diagnostics: Application application_1437639737006_3808 failed 2 times due to AM Container for appattempt_1437639737006_3808_000002 exited with exitCode: -1000 due to: File file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip does not exist
.Failing this attempt.. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1439385889600
final status: FAILED
tracking URL: http://hostname:port/cluster/app/application_1437639737006_3808
user: hdfs
Exception in thread "main" org.apache.spark.SparkException: Application application_1437639737006_3808 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:855)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
How can I fix this issue? Please tell me if I'm doing something wrong.
The file file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip that you submit does not exist.
When running in YARN cluster mode, you always need to specify the memory settings for the executors explicitly, and you always need to specify the driver details as well. For example:
Amazon EC2 Environment (Reserved already):
m3.xlarge | CORES : 4(1) | RAM : 15 (3.5) | HDD : 80 GB | Nodes : 3 Nodes
spark-submit --class <YourClassFollowedByPackage> --master yarn-cluster --num-executors 2 --driver-memory 8g --executor-memory 8g --executor-cores 1 <Your Jar with Full Path> <Jar Args>
Always remember to add any other third-party libraries or jars to your classpath on each of the task nodes; you can add them directly to your Spark or Hadoop classpath on each node.
Notes:
1) If you're using Amazon EMR, this can be achieved using Custom Bootstrap Actions and S3.
2) Remove conflicting jars too. Sometimes you'll see an unnecessary NullPointerException, and this could be one of the key reasons for it.
If possible, add your stack trace using
yarn logs -applicationId <HadoopAppId>
so that I can answer you in a more specific way.
I recently ran into the same issue. Here was my scenario:
a Cloudera-managed CDH 5.3.3 cluster with 7 nodes. I was submitting the job from one of the nodes and it used to fail in both yarn-cluster and yarn-client modes with the same issue.
If you look at the stack trace, you'll find these lines:
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
This is why the job fails: the resources are not being copied.
In my case, it was resolved by correcting the HADOOP_CONF_DIR path. It wasn't pointing to the exact folder that contains the core-site.xml and yarn-site.xml and other configuration files. Once this was fixed, the resources were copied during the initiation of the ApplicationMaster and the job ran correctly.
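A minimal way to verify that before resubmitting (a sketch; /etc/hadoop/conf is just a typical location, adjust it to your cluster):

# HADOOP_CONF_DIR must point at the directory holding core-site.xml and yarn-site.xml,
# otherwise spark-submit cannot stage the resources onto HDFS
echo $HADOOP_CONF_DIR
ls $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/yarn-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf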
I was able to solve this by providing the driver memory and executor memory at run time.
spark-submit --driver-memory 1g --executor-memory 1g --class com.package.App --master yarn --deploy-mode cluster /home/spark.jar

spark-submit yarn-client run failed

I'm using yarn-client mode to run my Spark program.
I've built the Spark-on-YARN environment.
The script is:
./bin/spark-submit --class WordCountTest \
--master yarn-client \
--num-executors 1 \
--executor-cores 1 \
--queue root.hadoop \
/root/Desktop/test2.jar \
10
When running it I get the following exception:
15/05/12 17:42:01 INFO spark.SparkContext: Running Spark version 1.3.1
15/05/12 17:42:01 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to ':/usr/local/hadoop/hadoop-2.5.2/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
15/05/12 17:42:01 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to ':/usr/local/hadoop/hadoop-2.5.2/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar' as a work-around.
15/05/12 17:42:01 WARN spark.SparkConf: Setting 'spark.driver.extraClassPath' to ':/usr/local/hadoop/hadoop-2.5.2/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar' as a work-around.
15/05/12 17:42:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/12 17:42:02 INFO spark.SecurityManager: Changing view acls to: root
15/05/12 17:42:02 INFO spark.SecurityManager: Changing modify acls to: root
15/05/12 17:42:02 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/12 17:42:02 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/12 17:42:02 INFO Remoting: Starting remoting
15/05/12 17:42:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#master:49338]
15/05/12 17:42:03 INFO util.Utils: Successfully started service 'sparkDriver' on port 49338.
15/05/12 17:42:03 INFO spark.SparkEnv: Registering MapOutputTracker
15/05/12 17:42:03 INFO spark.SparkEnv: Registering BlockManagerMaster
15/05/12 17:42:03 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-57f5fb29-784d-4730-92b8-c2e8be97c038/blockmgr-752988bc-b2d0-42f7-891d-5d3edbb4526d
15/05/12 17:42:03 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
15/05/12 17:42:04 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-2f2a46eb-9259-4c6e-b9af-7159efb0b3e9/httpd-3c50fe1e-430e-4077-9cd0-58246e182d98
15/05/12 17:42:04 INFO spark.HttpServer: Starting HTTP Server
15/05/12 17:42:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/05/12 17:42:04 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:41749
15/05/12 17:42:04 INFO util.Utils: Successfully started service 'HTTP file server' on port 41749.
15/05/12 17:42:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/05/12 17:42:05 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/05/12 17:42:05 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
15/05/12 17:42:05 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/05/12 17:42:05 INFO ui.SparkUI: Started SparkUI at http://master:4040
15/05/12 17:42:05 INFO spark.SparkContext: Added JAR file:/root/Desktop/test2.jar at http://192.168.147.201:41749/jars/test2.jar with timestamp 1431423725289
15/05/12 17:42:05 WARN cluster.YarnClientSchedulerBackend: NOTE: SPARK_WORKER_MEMORY is deprecated. Use SPARK_EXECUTOR_MEMORY or --executor-memory through spark-submit instead.
15/05/12 17:42:06 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.147.201:8032
15/05/12 17:42:06 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
15/05/12 17:42:06 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/05/12 17:42:06 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/05/12 17:42:06 INFO yarn.Client: Setting up container launch context for our AM
15/05/12 17:42:06 INFO yarn.Client: Preparing resources for our AM container
15/05/12 17:42:07 WARN yarn.Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
15/05/12 17:42:07 INFO yarn.Client: Uploading resource file:/usr/local/spark/spark-1.3.1-bin-hadoop2.5.0-cdh5.3.2/lib/spark-assembly-1.3.1-hadoop2.5.0-cdh5.3.2.jar -> hdfs://master:9000/user/root/.sparkStaging/application_1431423592173_0003/spark-assembly-1.3.1-hadoop2.5.0-cdh5.3.2.jar
15/05/12 17:42:11 INFO yarn.Client: Setting up the launch environment for our AM container
15/05/12 17:42:11 WARN yarn.Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
15/05/12 17:42:11 INFO spark.SecurityManager: Changing view acls to: root
15/05/12 17:42:11 INFO spark.SecurityManager: Changing modify acls to: root
15/05/12 17:42:11 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/12 17:42:11 INFO yarn.Client: Submitting application 3 to ResourceManager
15/05/12 17:42:11 INFO impl.YarnClientImpl: Submitted application application_1431423592173_0003
15/05/12 17:42:12 INFO yarn.Client: Application report for application_1431423592173_0003 (state: FAILED)
15/05/12 17:42:12 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1431423592173_0003 submitted by user root to unknown queue: root.hadoop
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hadoop
start time: 1431423731271
final status: FAILED
tracking URL: N/A
user: root
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:113)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:59)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:381)
at WordCountTest$.main(WordCountTest.scala:14)
at WordCountTest.main(WordCountTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
My code is very simple, as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WordCountTest {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    val sparkConf = new SparkConf().setAppName("WordCountTest Prog")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    // Word count over a test file, then persist the result
    val file = sc.textFile("/data/test/pom.xml")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    println(counts)
    counts.saveAsTextFile("/data/test/pom_count.txt")
  }
}
I've been debugging this problem for 2 days. Help! Thanks.
Try changing the queue name to hadoop.
In my case, changing "--queue thequeue" to "--queue default" worked.
Running:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue thequeue lib/spark-examples*.jar 10
reports the error above; simply change "--queue thequeue" to "--queue default" and it works.
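To find out which queues actually exist before picking one (a sketch; the command is available on newer Hadoop versions, and queue names depend on your scheduler configuration):

# Inspect the state of a queue; submitting to an unknown queue fails immediately, as in the report above
yarn queue -status default
# or check the scheduler page, e.g. http://<rm-host>:8088/cluster/scheduler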
