How to end a spark-submit job stuck in state ACCEPTED - apache-spark

I'm running a data cleaning job using Apache Griffin (https://griffin.apache.org/docs/quickstart.html), and I submit the Spark job as follows:
spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \
--driver-memory 1g --executor-memory 1g --num-executors 2 \
/home/bigdata/apache-hive-2.2.0-bin/measure-0.4.0.jar \
/home/bigdata/apache-hive-2.2.0-bin/env.json /home/bigdata/apache-hive-2.2.0-bin/dq.json
After submitting, the application report keeps showing:
20/04/08 13:18:30 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:31 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:32 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:33 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:34 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:35 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:36 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:37 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:38 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:39 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
20/04/08 13:18:40 INFO yarn.Client: Application report for application_1586344612496_0247 (state: ACCEPTED)
It never stops. When I check the status of the application in YARN:
bigdata@dq2:~$ yarn application -status application_1586344612496_0231
20/04/08 13:16:31 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Application Report :
Application-Id : application_1586344612496_0231
Application-Name : batch_accu
Application-Type : SPARK
User : bigdata
Queue : default
Start-Time : 1586348775760
Finish-Time : 0
Progress : 0%
State : ACCEPTED
Final-State : UNDEFINED
Tracking-URL : N/A
RPC Port : -1
AM Host : N/A
Aggregate Resource Allocation : 0 MB-seconds, 0 vcore-seconds
Diagnostics :
The job is not moving. Can anyone please help?

In my experience, there can be many causes for this issue, but the first checks you should do are the following (a quick way to verify the second point is sketched below):
Your firewall could be blocking some of the ports between the nodes of your Hadoop cluster, so the computation never starts. Try temporarily disabling the firewall on the private interface and submit again to rule this out (if this is the problem, reactivate the firewall and identify the ports you need to open!).
Spark might be configured incorrectly (e.g. its resource requirements exceed what YARN can offer).
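A quick, non-exhaustive sketch for the second check, assuming the default ResourceManager web port 8088 and a placeholder hostname (resourcemanager-host):

# Cluster-wide resource picture: if availableMB or availableVirtualCores is 0,
# YARN cannot schedule the ApplicationMaster and the app stays in ACCEPTED.
curl -s "http://resourcemanager-host:8088/ws/v1/cluster/metrics" | python -m json.tool
# NodeManagers and their state; LOST/UNHEALTHY nodes often point to network or firewall issues.
yarn node -list -all
# Other applications already holding the queue's resources.
yarn application -list -appStates RUNNING,ACCEPTED

If the cluster shows free resources and healthy nodes, compare the requested --driver-memory, --executor-memory and --num-executors against the limits of the queue you submit to.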

Related

YARN - application not getting accepted, error code 125

I am trying to spark-submit to YARN, but the application first hangs in the ACCEPTED state and then fails with the following error:
22/11/23 17:58:24 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:24 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.my_user
start time: 1669222703023
final status: UNDEFINED
tracking URL: https://mask:8090/proxy/application_1668608030982_2921/
user: my_user
22/11/23 17:58:25 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:26 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:27 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:28 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:29 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:30 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:31 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:32 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:33 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:34 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:35 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:36 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:37 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:38 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:39 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:40 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:41 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:42 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:43 INFO yarn.Client: Application report for application_1668608030982_2921 (state: ACCEPTED)
22/11/23 17:58:44 INFO yarn.Client: Application report for application_1668608030982_2921 (state: FAILED)
22/11/23 17:58:44 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1668608030982_2921 failed 2 times due to AM Container for appattempt_1668608030982_2921_000002 exited with exitCode: 125
Failing this attempt.Diagnostics: [2022-11-23 17:58:43.566]Exception from container-launch.
Container id: container_e172_1668608030982_2921_02_000001
Exit code: 125
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is my_user
main : requested yarn user is my_user
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /var/SP/data12/yarn/nm/nmPrivate/application_1668608030982_2921/container_e172_1668608030982_2921_02_000001/container_e172_1668608030982_2921_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
I cannot find any reference to exit code 125 for YARN; any idea why this fails?
The deploy mode is cluster.
This is the spark-submit command, with a mock class name and without the app params at the end (they are verified to be good parameters):
nohup spark-submit \
--class com.myClass \
--master yarn \
--deploy-mode $DEPLOY_MODE \
--num-executors $NUM_EXEC \
--executor-memory $EXEC_MEM \
--executor-cores $NUM_CORES \
--driver-memory "2g" \
--jars /opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar \
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j-driver.properties -Dvm.logging.level=$DRIVER_LOGLEVEL -Dvm.logging.name=$LOGGING_NAME" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:log4j-executor.properties -Dvm.logging.level=$EXECUTOR_LOGLEVEL -Dvm.logging.name=$LOGGING_NAME" \
--files "log4j-driver.properties,log4j-executor.properties" \
--conf spark.yarn.keytab=$KRB_KEYTAB \
--conf spark.yarn.principal=$KRB_PRINCIPAL \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.catalogImplementation=in-memory \
--conf spark.sql.files.ignoreCorruptFiles=true \
$JAR
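One way to dig further, assuming YARN log aggregation is enabled on the cluster, is to pull the full logs of the failed attempt and inspect the ApplicationMaster container output, together with the NodeManager log on the host that reported the container-launch exception:

yarn logs -applicationId application_1668608030982_2921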

Spark streaming job on yarn keeps terminating after a few hours

I have a Spark Streaming job that consumes a Kafka topic and writes to a database. I submitted the job to YARN with the following parameters:
spark-submit \
--jars mongo-spark-connector_2.11-2.4.0.jar,mongo-java-driver-3.11.0.jar,spark-sql-kafka-0-10_2.11-2.4.5.jar \
--driver-class-path mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.executor.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.driver.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--class com.example.StreamingApp \
--driver-memory 2g \
--num-executors 6 --executor-cores 3 --executor-memory 3g \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.pid.minRate=10 \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--queue users.adminuser \
--conf spark.speculation=true \
StreamingApp-2-4.0.1-SNAPSHOT.jar
But it terminates after a few hours with the following message on the terminal:
21/02/14 04:05:14 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:14 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restart
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.adminuser
start time: 1613260105314
final status: UNDEFINED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:15 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:16 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:17 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:18 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:19 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:20 INFO yarn.Client: Application report for application_1613217899387_6697 (state: FINISHED)
21/02/14 04:05:20 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
ApplicationMaster host: XXXXXXXXXx
ApplicationMaster RPC port: 41848
queue: root.users.adminuser
start time: 1613260105314
final status: FAILED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:20 ERROR yarn.Client: Application diagnostics message: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
Exception in thread "main" org.apache.spark.SparkException: Application application_1613217899387_6697 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1155)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/14 04:05:20 INFO util.ShutdownHookManager: Shutdown hook called
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e51d75e7-f19b-4f2f-8d46-b91b1af064b3
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-16d23c16-30f5-4ab7-95b2-4ac4ad584905
I've relaunched a few times, but the same thing keeps happening.
Spark version is 2.4.0-cdh6.2.1, ResourceManager version is 3.0.0-cdh6.2.1

Read spark stdout from driverLogUrl through livy batch API

Livy has a batch log endpoint: GET /batches/{batchId}/log, pointed out in How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow
As far as I can tell, these logs are the Livy logs and not the Spark driver logs. I have a print statement in a PySpark job which prints to the driver's stdout.
I am able to find the driver log URL via the describe-batch endpoint (https://livy.incubator.apache.org/docs/latest/rest-api.html#batch) by visiting the JSON response['appInfo']['driverLogUrl'] URL and clicking through to the logs.
The JSON response URL looks like http://ip-some-ip.emr.masternode:8042/node/containerlogs/container_1578061839438_0019_01_000001/livy/, and I can click through to an HTML page with the added URL leaf stdout/?start=-4096 to see the logs.
As it is, I can only get an HTML page of the stdout. Does a JSON-API-like version of this stdout (and preferably stderr too) exist in the YARN/EMR/Hadoop resource manager? Otherwise, is Livy able to retrieve these driver logs somehow?
Or is this an issue because I am using cluster mode instead of client? When I try to use client mode, I've been unable to use python3 via PYSPARK_PYTHON, which is maybe a different question, but if I can get the stdout of the driver using a different deployMode, that would work too.
If it matters, I'm running the cluster on EMR.
I ran into the same problem.
The short answer is that it will only work in client mode, not in cluster mode.
This is because Livy tries to get all logs from the master node, but the print output is local to the driver node.
When Spark runs in client mode, the driver node is your master node, so we get both the log info and the print output, since they are on the same physical machine.
However, things are different when Spark runs in cluster mode. In that case, the driver node is one of your worker nodes, not your master node, so the print output is lost, because Livy only gets information from the master node.
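A possible workaround for cluster mode, assuming YARN log aggregation is enabled on the cluster: once the application finishes, the driver's stdout ends up in the ApplicationMaster container log and can be pulled with the YARN CLI (the application id below is the one visible in the driverLogUrl from the question):

yarn logs -applicationId application_1578061839438_0019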
You can fetch all the logs, including stdout, stderr and the YARN diagnostics, with GET /batches/{batchId} (rather than only through the batch log endpoint).
Here is a code example:
import requests

# self.job is the URL of the batch session returned by `POST /batches`
job_response = requests.get(self.job, headers=self.headers).json()
self.job_status = job_response['state']
print(f"Job status: {self.job_status}")
for log in job_response['log']:
    print(log)
The printed logs look like this (note that these are the Spark job logs, not the Livy logs):
20/01/10 05:28:57 INFO Client: Application report for application_1578623516978_0024 (state: ACCEPTED)
20/01/10 05:28:58 INFO Client: Application report for application_1578623516978_0024 (state: ACCEPTED)
20/01/10 05:28:59 INFO Client: Application report for application_1578623516978_0024 (state: RUNNING)
20/01/10 05:28:59 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.2.100.6
ApplicationMaster RPC port: -1
queue: default
start time: 1578634135032
final status: UNDEFINED
tracking URL: http://ip-10-2-100-176.ap-northeast-2.compute.internal:20888/proxy/application_1578623516978_0024/
user: livy
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: Application application_1578623516978_0024 has started running.
20/01/10 05:28:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38087.
20/01/10 05:28:59 INFO NettyBlockTransferService: Server created on ip-10-2-100-176.ap-northeast-2.compute.internal:38087
20/01/10 05:28:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/01/10 05:28:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-2-100-176.ap-northeast-2.compute.internal:38087 with 5.4 GB RAM, BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManager: external shuffle service port = 7337
20/01/10 05:28:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-2-100-176.ap-northeast-2.compute.internal, PROXY_URI_BASES -> http://ip-10-2-100-176.ap-northeast-2.compute.internal:20888/proxy/application_1578623516978_0024), /proxy/application_1578623516978_0024
20/01/10 05:28:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/01/10 05:28:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/01/10 05:28:59 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/01/10 05:28:59 INFO EventLoggingListener: Logging events to hdfs:/var/log/spark/apps/application_1578623516978_0024
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/01/10 05:28:59 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
...
Please check the Livy REST API docs for further information.
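For reference, a minimal curl sketch of the same calls, assuming a Livy server at livy-host:8998 and a placeholder batch id of 42:

# Full batch object: state, appId, appInfo (including driverLogUrl) and recent log lines.
curl -s "http://livy-host:8998/batches/42" | python -m json.tool
# Paged log lines only, via the log endpoint mentioned in the question.
curl -s "http://livy-host:8998/batches/42/log?from=0&size=100" | python -m json.tool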

Property spark.yarn.jars - how to deal with it?

My knowledge of Spark is limited, as you will sense after reading this question. I have just one node, with Spark, Hadoop and YARN installed on it.
I was able to code and run a word-count problem in cluster mode with the command below:
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner \
--master yarn \
--deploy-mode cluster \
--driver-memory=2g \
--executor-memory 2g \
--executor-cores 1 \
--num-executors 1 \
SparkSimple-0.0.1-SNAPSHOT.jar \
hdfs://sanjeevd.br:9000/user/spark-test/word-count/input \
hdfs://sanjeevd.br:9000/user/spark-test/word-count/output
It works just fine.
Now I understand that 'Spark on YARN' requires the Spark jar files to be available on the cluster, and if I don't do anything, then every time I run my program it will copy hundreds of jar files from $SPARK_HOME to each node (in my case it's just one node). I see that the execution pauses for some time before it finishes copying. See below -
16/12/12 17:24:03 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/12/12 17:24:06 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_libs__11112433502351931.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_libs__11112433502351931.zip
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_conf__6716604236006329155.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_conf__.zip
Spark's documentation suggests setting the spark.yarn.jars property to avoid this copying, so I set the property below in the spark-defaults.conf file:
spark.yarn.jars hdfs://sanjeevd.br:9000//user/spark/share/lib
http://spark.apache.org/docs/latest/running-on-yarn.html#preparations
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Btw, I copied all the jar files from the local /opt/spark/jars to HDFS /user/spark/share/lib; there are 206 of them.
This makes my job fail. Below is the error -
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner --master yarn --deploy-mode cluster --driver-memory=2g --executor-memory 2g --executor-cores 1 --num-executors 1 SparkSimple-0.0.1-SNAPSHOT.jar hdfs://sanjeevd.br:9000/user/spark-test/word-count/input hdfs://sanjeevd.br:9000/user/spark-test/word-count/output
16/12/12 17:43:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/12 17:43:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/12/12 17:43:07 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/12/12 17:43:07 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
16/12/12 17:43:07 INFO yarn.Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
16/12/12 17:43:07 INFO yarn.Client: Setting up container launch context for our AM
16/12/12 17:43:07 INFO yarn.Client: Setting up the launch environment for our AM container
16/12/12 17:43:07 INFO yarn.Client: Preparing resources for our AM container
16/12/12 17:43:07 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:43:07 INFO yarn.Client: Uploading resource file:/tmp/spark-fae6a5ad-65d9-4b64-9ba6-65da1310ae9f/__spark_conf__7881471844385719101.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005/__spark_conf__.zip
16/12/12 17:43:08 INFO spark.SecurityManager: Changing view acls to: sanjeevd
16/12/12 17:43:08 INFO spark.SecurityManager: Changing modify acls to: sanjeevd
16/12/12 17:43:08 INFO spark.SecurityManager: Changing view acls groups to:
16/12/12 17:43:08 INFO spark.SecurityManager: Changing modify acls groups to:
16/12/12 17:43:08 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sanjeevd); groups with view permissions: Set(); users with modify permissions: Set(sanjeevd); groups with modify permissions: Set()
16/12/12 17:43:08 INFO yarn.Client: Submitting application application_1481592214176_0005 to ResourceManager
16/12/12 17:43:08 INFO impl.YarnClientImpl: Submitted application application_1481592214176_0005
16/12/12 17:43:09 INFO yarn.Client: Application report for application_1481592214176_0005 (state: ACCEPTED)
16/12/12 17:43:09 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1481593388442
final status: UNDEFINED
tracking URL: http://sanjeevd.br:8088/proxy/application_1481592214176_0005/
user: sanjeevd
16/12/12 17:43:10 INFO yarn.Client: Application report for application_1481592214176_0005 (state: FAILED)
16/12/12 17:43:10 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1481592214176_0005 failed 1 times due to AM Container for appattempt_1481592214176_0005_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://sanjeevd.br:8088/cluster/app/application_1481592214176_0005Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1481592214176_0005_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1481593388442
final status: FAILED
tracking URL: http://sanjeevd.br:8088/cluster/app/application_1481592214176_0005
user: sanjeevd
16/12/12 17:43:10 INFO yarn.Client: Deleting staging directory hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005
Exception in thread "main" org.apache.spark.SparkException: Application application_1481592214176_0005 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/12/12 17:43:10 INFO util.ShutdownHookManager: Shutdown hook called
16/12/12 17:43:10 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fae6a5ad-65d9-4b64-9ba6-65da1310ae9f
Do you know what I am doing wrong? The task's log says the following -
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I understand the error that the ApplicationMaster class is not found, but my question is why it is not found - where is this class supposed to be? I don't have an assembly jar, since I'm using Spark 2.0.1, which no longer bundles an assembly.
What does this have to do with the spark.yarn.jars property? This property is supposed to help Spark run on YARN, and that should be it. What else do I need to do when using spark.yarn.jars?
Thanks for reading this question, and thanks in advance for your help.
You could also use the spark.yarn.archive option and set it to the location of an archive (that you create) containing all the JARs in the $SPARK_HOME/jars/ folder, at the root level of the archive. For example (a consolidated sketch follows this list):
1. Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
2. Upload it to HDFS: hdfs dfs -put spark-libs.jar /some/path/.
2a. For a large cluster, increase the replication count of the Spark archive so that you reduce the number of times a NodeManager does a remote copy: hdfs dfs -setrep -w 10 hdfs:///some/path/spark-libs.jar (change the number of replicas in proportion to the total number of NodeManagers).
3. Set spark.yarn.archive to hdfs:///some/path/spark-libs.jar.
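Put together, and reusing the example path /some/path from the steps above (adjust paths to your cluster):

# 1. Build the archive from the local Spark jars (no compression, jars at the archive root).
jar cv0f spark-libs.jar -C "$SPARK_HOME/jars/" .
# 2. Upload it to HDFS and (optionally, 2a) raise its replication factor on large clusters.
hdfs dfs -mkdir -p /some/path
hdfs dfs -put spark-libs.jar /some/path/
hdfs dfs -setrep -w 10 /some/path/spark-libs.jar
# 3. Point Spark at it, either in spark-defaults.conf:
#      spark.yarn.archive hdfs:///some/path/spark-libs.jar
#    or per job with --conf spark.yarn.archive=hdfs:///some/path/spark-libs.jar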
I was finally able to make sense of this property. I found by trial and error that the correct syntax of this property is
spark.yarn.jars=hdfs://xx:9000/user/spark/share/lib/*.jar
I hadn't put *.jar at the end; my path simply ended with /lib. I tried pointing it at an actual jar like this - spark.yarn.jars=hdfs://sanjeevd.brickred:9000/user/spark/share/lib/spark-yarn_2.11-2.0.1.jar - but no luck; all it said was that it was unable to load the ApplicationMaster.
I posted my response to a similar question asked by someone at https://stackoverflow.com/a/41179608/2332121
If you look at the spark.yarn.jars documentation, it says the following:
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
This means that you are actually overriding SPARK_HOME/jars and telling YARN to pick up all the jars required to run the application from your path. If you set the spark.yarn.jars property, all the jars Spark depends on to run must be present in that path. If you look inside the spark-assembly.jar present in SPARK_HOME/lib, the org.apache.spark.deploy.yarn.ApplicationMaster class is there, so make sure that all the Spark dependencies are present in the HDFS path that you specify as spark.yarn.jars.
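A minimal sketch of the working setup, reusing the asker's own paths and class name (substitute your NameNode address and jar locations):

# In spark-defaults.conf -- note the trailing /*.jar glob:
#   spark.yarn.jars  hdfs://sanjeevd.br:9000/user/spark/share/lib/*.jar
# Or passed per submission (quoting the glob so the local shell does not expand it):
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.jars="hdfs://sanjeevd.br:9000/user/spark/share/lib/*.jar" \
  SparkSimple-0.0.1-SNAPSHOT.jar \
  hdfs://sanjeevd.br:9000/user/spark-test/word-count/input \
  hdfs://sanjeevd.br:9000/user/spark-test/word-count/output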

Spark Streaming failing on YARN Cluster

I have a cluster of 1 master and 2 slaves. I'm running a Spark Streaming job on the master and I want to utilize all nodes in my cluster. I had specified some parameters, like driver memory and executor memory, in my code. When I pass --deploy-mode cluster --master yarn-cluster to my spark-submit, it gives the following error:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/12 13:24:49 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/08/12 13:24:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/12 13:24:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/12 13:24:49 INFO Client: Setting up container launch context for our AM
15/08/12 13:24:49 INFO Client: Preparing resources for our AM container
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.5.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
15/08/12 13:24:49 INFO Client: Setting up the launch environment for our AM container
15/08/12 13:24:49 INFO SecurityManager: Changing view acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: Changing modify acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
15/08/12 13:24:49 INFO Client: Submitting application 3808 to ResourceManager
15/08/12 13:24:49 INFO YarnClientImpl: Submitted application application_1437639737006_3808
15/08/12 13:24:50 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:50 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1439385889600
final status: UNDEFINED
tracking URL: http://hostname:port/proxy/application_1437639737006_3808/
user: hdfs
15/08/12 13:24:51 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:52 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:53 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:54 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:55 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:56 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:57 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:58 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:59 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:00 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:01 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:02 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:03 INFO Client: Application report for application_1437639737006_3808 (state: FAILED)
15/08/12 13:25:03 INFO Client:
client token: N/A
diagnostics: Application application_1437639737006_3808 failed 2 times due to AM Container for appattempt_1437639737006_3808_000002 exited with exitCode: -1000 due to: File file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip does not exist
.Failing this attempt.. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1439385889600
final status: FAILED
tracking URL: http://hostname:port/cluster/app/application_1437639737006_3808
user: hdfs
Exception in thread "main" org.apache.spark.SparkException: Application application_1437639737006_3808 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:855)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
How can I fix this issue? Please let me know if I'm doing something wrong.
The file you submit, file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip, does not exist.
When running in YARN cluster mode, you always need to specify the memory settings for your executors (including their individual memory), plus the driver details as well. For example:
Amazon EC2 Environment (Reserved already):
m3.xlarge | CORES : 4(1) | RAM : 15 (3.5) | HDD : 80 GB | Nodes : 3 Nodes
spark-submit --class <YourClassFollowedByPackage> --master yarn-cluster --num-executors 2 --driver-memory 8g --executor-memory 8g --executor-cores 1 <Your Jar with Full Path> <Jar Args>
Always remember to add the other third-party libraries or jars to your classpath on each of the task nodes; you can add them directly to the Spark or Hadoop classpath on each node.
Notes:
1) If you're using Amazon EMR, this can be achieved using custom bootstrap actions and S3.
2) Remove any conflicting jars too. Sometimes you'll see an unexpected NullPointerException, and this can be one of the key reasons for it.
If possible, add your stack trace using
yarn logs -applicationId <HadoopAppId>
so that I can answer you in a more specific way.
I recently ran into the same issue. Here was my scenario: a Cloudera-managed CDH 5.3.3 cluster with 7 nodes. I was submitting the job from one of the nodes, and it used to fail in both yarn-cluster and yarn-client modes with the same issue.
If you look at the stack trace, you'll find these lines -
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
This is the reason the job fails: the resources are not copied.
In my case, it was resolved by correcting the HADOOP_CONF_DIR path. It wasn't pointing to the exact folder that contains core-site.xml, yarn-site.xml, and the other configuration files. Once this was fixed, the resources were copied during the initiation of the ApplicationMaster and the job ran correctly.
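A minimal sketch of that fix, assuming a typical client-config directory such as /etc/hadoop/conf (adjust to wherever your cluster keeps its configuration files):

# Point HADOOP_CONF_DIR at the directory that actually holds the cluster client configs.
export HADOOP_CONF_DIR=/etc/hadoop/conf
ls "$HADOOP_CONF_DIR"/core-site.xml "$HADOOP_CONF_DIR"/yarn-site.xml   # both should exist
# With the correct config dir, spark-submit uploads resources to HDFS instead of logging
# "Source and destination file systems are the same. Not copying ...".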
I was able to solve this by providing the driver memory and executor memory at run time.
spark-submit --driver-memory 1g --executor-memory 1g --class com.package.App --master yarn --deploy-mode cluster /home/spark.jar
