Converting spark-submit into Livy REST JSON protocol - apache-spark

I am trying to rewrite spark-submit which has arguments like packages, repositories, jars, files, arguments defined by users like this into Livy REST JSON Protocol. please find more details below.
spark-submit command:
spark-submit \
--packages com.hortonworks.shc:shc-core:1.1.0.3.1.6.5-3 \
--repositories http://repo.hortonworks.com/content/groups/public/ \
--jars /usr/hdp/current/phoenix-client/phoenix-server.jar \
--files x/y.yml,x/y1.yml $HOME/spark_apps/a/app.py \
--arg_name value \
--arg_name2 value
what I tried in Livy :
{
"conf": {"com.hortonworks.shc": "shc-core:1.1.0.3.1.6.5-3"},
"jars":["wasbs:///phoenix-server.jar"],
"file": "/home/admin/spark_apps/a/app.py",
"files": ["/home/admin/x/y.yml,/home/admin/x/y1.yml"],
"args": [
"--arg_name=value",
"--arg_name=value"]
}
And the error is :
ls: cannot access '/usr/hdp/current/hadoop/lib': No such file or directory
log4j:ERROR Could not find value for key log4j.appender.tcp
log4j:ERROR Could not instantiate appender named "tcp".
Warning: Ignoring non-spark config property: com.hortonworks.shc=shc-core:1.1.0.3.1.6.5-3
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
21/03/14 12:05:07 WARN NativeCodeLoader [main]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/03/14 12:05:08 WARN DependencyUtils [main]: Skip remote jar wasbs:///phoenix-server.jar.
21/03/14 12:05:09 INFO RequestHedgingRMFailoverProxyProvider [main]: Created wrapped proxy for [rm1, rm2]
21/03/14 12:05:09 INFO RequestHedgingRMFailoverProxyProvider [main]: Looking for the active RM in [rm1, rm2]...
21/03/14 12:05:09 INFO RequestHedgingRMFailoverProxyProvider [main]: Found active RM [rm2]
21/03/14 12:05:09 INFO Client [main]: Requesting a new application from cluster with 2 NodeManagers
21/03/14 12:05:09 INFO Configuration [main]: found resource resource-types.xml at file:/etc/hadoop/4.1.2.5/0/resource-types.xml
21/03/14 12:05:09 INFO Client [main]: Verifying our application has not requested more than the maximum memory capability of the cluster (51200 MB per container)
21/03/14 12:05:09 INFO Client [main]: Will allocate AM container, with 1408 MB memory including 384 MB overhead
21/03/14 12:05:09 INFO Client [main]: Setting up container launch context for our AM
21/03/14 12:05:10 INFO Client [main]: Setting up the launch environment for our AM container
21/03/14 12:05:10 INFO Client [main]: Preparing resources for our AM container
21/03/14 12:05:10 INFO Client [main]: Falling back to uploading libraries in this host
21/03/14 12:05:10 INFO Client [main]: Uploading resource file:/tmp/spark-c923cab1-6cf5-4fa8-9db3-73d156052819/__hive_libs__5946923461629036475.zip -> wasbs://container-spark-2021-01-12t10-28-51-042z#container.blob.core.windows.net/user/livy/.sparkStaging/application_1615371594106_0446/__hive_libs__5946923461629036475.zip
21/03/14 12:05:12 INFO Client [main]: Source and destination file systems are the same. Not copying wasbs:/phoenix-server.jar
21/03/14 12:05:12 WARN AzureFileSystemThreadPoolExecutor [main]: Disabling threads for Delete operation as thread count 0 is <= 1
21/03/14 12:05:12 INFO AzureFileSystemThreadPoolExecutor [main]: Time taken for Delete operation is: 11 ms with threads: 0
21/03/14 12:05:12 INFO Client [main]: Deleted staging directory wasbs://container-2021-01-12t10-28-51-042z#container.blob.core.windows.net/user/livy/.sparkStaging/application_1615371594106_0446
Exception in thread "main" java.io.FileNotFoundException: wasbs://container-2021-01-12t10-28-51-042z#container.blob.core.windows.net/phoenix-server.jar: No such file or directory.
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatusInternal(NativeAzureFileSystem.java:2716)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:2620)
at org.apache.spark.deploy.yarn.ClientDistributedCacheManager$$anonfun$1.apply(ClientDistributedCacheManager.scala:71)
at org.apache.spark.deploy.yarn.ClientDistributedCacheManager$$anonfun$1.apply(ClientDistributedCacheManager.scala:71)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:71)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:479)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$16$$anonfun$apply$6.apply(Client.scala:651)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$16$$anonfun$apply$6.apply(Client.scala:650)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$16.apply(Client.scala:650)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$16.apply(Client.scala:649)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:649)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:917)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1239)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:858)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:942)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/03/14 12:05:12 INFO ShutdownHookManager [shutdown-hook-0]: Shutdown hook called
21/03/14 12:05:12 INFO ShutdownHookManager [shutdown-hook-0]: Deleting directory /tmp/spark-c923cab1-6cf5-4fa8-9db3-73d156052819
21/03/14 12:05:12 INFO ShutdownHookManager [shutdown-hook-0]: Deleting directory /tmp/spark-424d5ba4-718c-470a-b110-9020578aef12
Could you please help- me to rewrite spark-submit into livy rest json please..?
thank you in advance.

Maybe you should check your JVM. To check the JVM config.
And the log shows that
ls: cannot access '/usr/hdp/current/hadoop/lib': No such file or directory. Also to check the path.
Maybe check the Spark User Guide .
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer toSpark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jarsand upload it to the distributed cache.

Related

Spark exit before status report

I have installed spark 2.3.2 on a cluster of 2 workers and 1 master.
Im executing
spark-submit \
--class com.group.Main \
--master spark:<publicIp>:6066 \
--deploy-mode cluster app.jar input.txt
and below is the output
2022-08-25 03:20:41 WARN Utils:66 - Your hostname, master resolves to a loopback address: 127.0.1.1; using <publicIp> instead (on interface eth1)
2022-08-25 03:20:41 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
Running Spark using the REST application submission protocol.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Submitting a request to launch an application in spark://<publicIp>:6066.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Submission successfully created as driver-20220825032052-0003. Polling submission state...
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Submitting a request for the status of submission driver-20220825032052-0003 in spark://<publicIp>:6066.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - State of driver driver-20220825032052-0003 is now RUNNING.
2022-08-25 03:20:52 INFO RestSubmissionClient:54 - Driver is running on worker worker-20220825031423-zz.zz.zz.zz-42773 at <publicIp>:42773.
2022-08-25 03:20:53 INFO RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20220825032052-0003",
"serverSparkVersion" : "2.3.2",
"submissionId" : "driver-20220825032052-0003",
"success" : true
}
2022-08-25 03:20:53 INFO ShutdownHookManager:54 - Shutdown hook called
2022-08-25 03:20:53 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-aaeb9959-ba1f-4a5b-b7c4-6d0d3eb6ca70
On spark UI the application is reported always as FAILED
Please note that on client mode the application runs
--Updated--
As I was looking in stderr driver logs that the issue might be relevant to the files input of the application
I have tried multiple ways having input directory/single file on the local system
And getting the bellow error
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
is there any way to achieve file input in spark without using hdfs or s3?

How do I use the portable runner and spark-submit to submit beams wordcount python example to a remote spark cluster on EMR running yarn?

I am trying to submit beams wordcount python example to a remote spark cluster on emr running yarn as its resource manager. According to the spark documentation this needs to be done using the portable runner.
Following the portable runner instructions, I have started the job service endpoint, and it appears to start correctly::
$ docker run --net=host apache/beam_spark_job_server:latest --spark-master-url=spark://*.***.***.***:7077
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: ArtifactStagingService started on localhost:8098
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: Java ExpansionService started on localhost:8097
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: JobService started on localhost:8099
20/08/31 12:13:08 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: Job server now running, terminate with Ctrl+C
Now I try to submit the job using spark-submit, input is a plain text version of Sherlock Holmes:
$ spark-submit --master=yarn --deploy-mode=cluster wordcount.py --input data/sherlock.txt --output output --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER --environment_config=apachebeam/python3.7_sdk
20/08/31 12:19:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/08/31 12:19:40 INFO RMProxy: Connecting to ResourceManager at ip-***-**-**-***.ec2.internal/***.**.**.***:8032
20/08/31 12:19:40 INFO Client: Requesting a new application from cluster with 2 NodeManagers
20/08/31 12:19:40 INFO Configuration: resource-types.xml not found
20/08/31 12:19:40 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/08/31 12:19:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
20/08/31 12:19:40 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/08/31 12:19:40 INFO Client: Setting up container launch context for our AM
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: /usr/lib/spark/python/lib/pyspark.zip not found; cannot run pyspark application in YARN mode.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.deploy.yarn.Client.$anonfun$findPySparkArchives$2(Client.scala:1167)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.deploy.yarn.Client.findPySparkArchives(Client.scala:1163)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:858)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:178)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1134)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/08/31 12:19:40 INFO ShutdownHookManager: Shutdown hook called
20/08/31 12:19:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-ee751413-e29d-4b1f-8a16-fb8650b1ca10
It appears to want pyspark to be installed, I am fairly new to submitting beam jobs to a spark cluster, is there a reason why pyspark would need to be installed when submitting a beam job? I have a feeling my spark-submit command is wrong, but I am having a hard time finding any more concrete documentation on how to do what I am trying to do.

Unable to see output or error messages for Spark on Kubernetes

Trying to run a simple Spark application using Kubernetes master. But I don't get the intended output/processing, neither do I see any error messages. The final pod phase is 'Failed' and the error code is 101. The pod logs show the usual log4j warnings, but nothing else.
Running minikube v1.0.1 on windows (amd64) on my office laptop using hyperv. Have already increased the #cpus and memory on minikube VM to 3 and 4 GB as recommended.
Made sure that the applications run fine with Spark Standalone. The first application 'Hello' is supposed to print a 'Hello' message. The second application 'Calculate Monthly Revenue' is supposed to read data from Teradata over JDBC, aggregate it and write the result back to Teradata table over JDBC.
Also made sure that 'hello minikube' works fine.
In all the code snippets below, ... indicates portions omitted for brevity, >>> indicates command prompt.
>>> spark-submit --master k8s://https://153.65.225.219:8443 --deploy-mode cluster --name Hello --class Hello --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=rahulvkulkarni/default:spark-td-run --conf spark.kubernetes.container.image.pullSecrets=regcred local://hello_2.12-0.1.0-SNAPSHOT.jar
log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/20 16:59:09 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: hello-1558351748442-driver
...
phase: Pending
status: []
...
19/05/20 16:59:13 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: hello-1558351748442-driver
...
phase: Failed
status: [ContainerStatus(containerID=docker://464c9c0e23d543f20954d373218c9cefefc31107711cbd2ada4d93bb31ce4d80, image=rahulvkulkarni/default:spark-td-run, imageID=docker-pullable://rahulvkulkarni/default#sha256:1de9951c4ac9f0b5f26efa3949e1effa779b0605066f2043738402ce20e8179b, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://464c9c0e23d543f20954d373218c9cefefc31107711cbd2ada4d93bb31ce4d80, exitCode=101, finishedAt=2019-05-17T18:26:41Z, message=null, reason=Error, signal=null, startedAt=2019-05-17T18:26:40Z, additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
19/05/20 16:59:13 INFO LoggingPodStatusWatcherImpl: Container final statuses:
Container name: spark-kubernetes-driver
Container image: rahulvkulkarni/default:spark-td-run
Container state: Terminated
Exit code: 101
19/05/20 16:59:13 INFO Client: Application Hello finished.
...
>>> kubectl logs hello-1558351748442-driver
++ id -u
...
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$#")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.5 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class Hello spark-internal
19/05/17 18:26:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.SparkSubmit$$anon$2).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
What does exit code 101 mean? How to find the actual error?
Then I tried to configure log4j for detailed logging as described in How to stop INFO messages displaying on spark console?. Renamed and used the log4j.properties template provided in the conf directory. But spark-submit is not able to find the log4j.properties file that I have already included in the docker build.
>>> spark-submit --master k8s://https://153.65.225.219:8443 --deploy-mode cluster --files /opt/spark/conf/log4j.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties" --name "Calculate Monthly Revenue" --class mthRev --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=rahulvkulkarni/default:spark-td-run --conf spark.kubernetes.container.image.pullSecrets=regcred local://mthrev_2.10-0.1-SNAPSHOT.jar <username> <password> <server name>
19/05/20 20:02:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/20 20:02:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: calculate-monthly-revenue-1558362771110-driver
...
Container name: spark-kubernetes-driver
Container image: rahulvkulkarni/default:spark-td-run
Container state: Terminated
Exit code: 1
>>> kubectl logs -c spark-kubernetes-driver calculate-monthly-revenue-1558362771110-driver
++ id -u
...
log4j:ERROR Could not read configuration file from URL [file:/opt/spark/conf/log4j.properties].
java.io.FileNotFoundException: /opt/spark/conf/log4j.properties (No such file or directory)
...
log4j:ERROR Ignoring configuration file [file:/opt/spark/conf/log4j.properties].
19/05/17 21:30:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 2: C:
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:192)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:145)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$4.apply(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$4.apply(SparkSubmit.scala:355)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 2: C:
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.failExpecting(URI.java:2854)
at java.net.URI$Parser.parse(URI.java:3057)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:202)
... 23 more
[INFO tini (1)] Main child exited normally (with status '1')
I tried several variations of specifying the log4j.properties file: local file on my Windows laptop (file:///C$/Users//spark-2.4.3-bin-hadoop2.7/conf/log4j.properties and file:///C:/Users//spark-2.4.3-bin-hadoop2.7/conf/log4j.properties), local file in the Linux container (file:///opt/spark/conf/log4j.properties). But I keep getting the message:
log4j:ERROR Could not read configuration file from URL [file:/C$/Users/<my-username>/spark-2.4.3-bin-hadoop2.7/conf/log4j.properties].
The IllegalArgumentException exception went away when I tried the path without the colon (C:), i.e. either the Linux path or the Windows path with C$.
But I still don't get the desired output of my program and don't know if/what is the error!
There was a typo in the spark-submit command in the specification of the application jar. I was using only two forward slashes: local://hello_2.12-0.1.0-SNAPSHOT.jar. Hence, Spark was not able to locate it and (I think) was ignoring it silently and then had no work to do. Hence, there was no message. I'd expect it to give a warning at least.
Changed it to three slashes and it moved ahead:
local:///hello_2.12-0.1.0-SNAPSHOT.jar
I now have another issue related to Kubernetes RBAC, which I will solve separately. The log4j issue still remains, but is not a concern for me now.
I solve this by deploying config file to blobs
https://$(container_name).blob.core.windows.net/jars/log4jconfig1
and give him config to spark-submit
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=https://<container_name>.blob.core.windows.net/jars/log4jconfig1" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=https://<container_name>.blob.core.windows.net/jars/log4jconfig1" \

Property spark.yarn.jars - how to deal with it?

My knowledge with Spark is limited and you would sense it after reading this question. I have just one node and spark, hadoop and yarn are installed on it.
I was able to code and run word-count problem in cluster mode by below command
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner
--master yarn
--deploy-mode cluster
--driver-memory=2g
--executor-memory 2g
--executor-cores 1
--num-executors 1
SparkSimple-0.0.1SNAPSHOT.jar
hdfs://sanjeevd.br:9000/user/spark-test/word-count/input
hdfs://sanjeevd.br:9000/user/spark-test/word-count/output
It works just fine.
Now I understood that 'spark on yarn' requires spark jar files available on the cluster and if I don't do anything then every time I run my program it will copy hundreds of jar files from $SPARK_HOME to each node (in my case it's just one node). I see that code's execution pauses for some time before it finishes copying. See below -
16/12/12 17:24:03 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/12/12 17:24:06 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_libs__11112433502351931.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_libs__11112433502351931.zip
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_conf__6716604236006329155.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_conf__.zip
Spark's documentation suggests to set spark.yarn.jars property to avoid this copying. So I set below below property in spark-defaults.conf file.
spark.yarn.jars hdfs://sanjeevd.br:9000//user/spark/share/lib
http://spark.apache.org/docs/latest/running-on-yarn.html#preparations
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Btw, I have all the jar files from LOCAL /opt/spark/jars to HDFS /user/spark/share/lib. They are 206 in number.
This makes my jar failed. Below is the error -
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner --master yarn --deploy-mode cluster --driver-memory=2g --executor-memory 2g --executor-cores 1 --num-executors 1 SparkSimple-0.0.1-SNAPSHOT.jar hdfs://sanjeevd.br:9000/user/spark-test/word-count/input hdfs://sanjeevd.br:9000/user/spark-test/word-count/output
16/12/12 17:43:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/12 17:43:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/12/12 17:43:07 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/12/12 17:43:07 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
16/12/12 17:43:07 INFO yarn.Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
16/12/12 17:43:07 INFO yarn.Client: Setting up container launch context for our AM
16/12/12 17:43:07 INFO yarn.Client: Setting up the launch environment for our AM container
16/12/12 17:43:07 INFO yarn.Client: Preparing resources for our AM container
16/12/12 17:43:07 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:43:07 INFO yarn.Client: Uploading resource file:/tmp/spark-fae6a5ad-65d9-4b64-9ba6-65da1310ae9f/__spark_conf__7881471844385719101.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005/__spark_conf__.zip
16/12/12 17:43:08 INFO spark.SecurityManager: Changing view acls to: sanjeevd
16/12/12 17:43:08 INFO spark.SecurityManager: Changing modify acls to: sanjeevd
16/12/12 17:43:08 INFO spark.SecurityManager: Changing view acls groups to:
16/12/12 17:43:08 INFO spark.SecurityManager: Changing modify acls groups to:
16/12/12 17:43:08 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sanjeevd); groups with view permissions: Set(); users with modify permissions: Set(sanjeevd); groups with modify permissions: Set()
16/12/12 17:43:08 INFO yarn.Client: Submitting application application_1481592214176_0005 to ResourceManager
16/12/12 17:43:08 INFO impl.YarnClientImpl: Submitted application application_1481592214176_0005
16/12/12 17:43:09 INFO yarn.Client: Application report for application_1481592214176_0005 (state: ACCEPTED)
16/12/12 17:43:09 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1481593388442
final status: UNDEFINED
tracking URL: http://sanjeevd.br:8088/proxy/application_1481592214176_0005/
user: sanjeevd
16/12/12 17:43:10 INFO yarn.Client: Application report for application_1481592214176_0005 (state: FAILED)
16/12/12 17:43:10 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1481592214176_0005 failed 1 times due to AM Container for appattempt_1481592214176_0005_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://sanjeevd.br:8088/cluster/app/application_1481592214176_0005Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1481592214176_0005_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1481593388442
final status: FAILED
tracking URL: http://sanjeevd.br:8088/cluster/app/application_1481592214176_0005
user: sanjeevd
16/12/12 17:43:10 INFO yarn.Client: Deleting staging directory hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005
Exception in thread "main" org.apache.spark.SparkException: Application application_1481592214176_0005 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/12/12 17:43:10 INFO util.ShutdownHookManager: Shutdown hook called
16/12/12 17:43:10 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fae6a5ad-65d9-4b64-9ba6-65da1310ae9f
Do you know what wrong am I doing? The task's log says below -
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I understand the error that ApplicationMaster class is not found but my question is why it is not found - where this class is supposed to be? I don't have assembly jar since I'm using spark 2.0.1 where there is no assembly comes bundled.
What this has to do with spark.yarn.jars property? This property is to help spark run on yarn, and that should be it. What additional I need to do when using spark.yarn.jars?
Thanks in reading this question and for your help in advance.
You could also use the spark.yarn.archive option and set that to the location of an archive (you create) containing all the JARs in the $SPARK_HOME/jars/ folder, at the root level of the archive. For example:
Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
Upload to HDFS: hdfs dfs -put spark-libs.jar /some/path/.
2a. For a large cluster, increase the replication count of the Spark archive so that you reduce the amount of times a NodeManager will do a remote copy. hdfs dfs –setrep -w 10 hdfs:///some/path/spark-libs.jar (Change the amount of replicas proportional to the number of total NodeManagers)
Set spark.yarn.archive to hdfs:///some/path/spark-libs.jar
I was finally able to make sense of this property. I found by hit-n-trial that correct syntax of this property is
spark.yarn.jars=hdfs://xx:9000/user/spark/share/lib/*.jar
I didn't put *.jar in the end and my path was just ended with /lib. I tried putting actual assembly jar like this - spark.yarn.jars=hdfs://sanjeevd.brickred:9000/user/spark/share/lib/spark-yarn_2.11-2.0.1.jar but no luck. All it said that unable to load ApplicationMaster.
I posted my response to a similar question asked by someone at https://stackoverflow.com/a/41179608/2332121
If you look at spark.yarn.jars documentation it says the following
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
This means that you are actually overriding the SPARK_HOME/jars and telling yarn to pick up all the jars required for the application run from your path,If you set spark.yarn.jars property, all the dependent jars for spark to run should be present in this path, If you go and look inside spark-assembly.jar present in SPARK_HOME/lib , org.apache.spark.deploy.yarn.ApplicationMaster class is present, so make sure that all the spark dependencies are present in the HDFS path that you specify as spark.yarn.jars.

Spark Streaming failing on YARN Cluster

I have a cluster of 1 master and 2 slaves. I'm running a spark streaming in master and I want to utilize all nodes in my cluster. i had specified some parameters like driver memory and executor memory in my code. when i give --deploy-mode cluster --master yarn-cluster in my spark-submit, it gives the following error.
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/12 13:24:49 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/08/12 13:24:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/12 13:24:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/12 13:24:49 INFO Client: Setting up container launch context for our AM
15/08/12 13:24:49 INFO Client: Preparing resources for our AM container
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.5.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
15/08/12 13:24:49 INFO Client: Setting up the launch environment for our AM container
15/08/12 13:24:49 INFO SecurityManager: Changing view acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: Changing modify acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
15/08/12 13:24:49 INFO Client: Submitting application 3808 to ResourceManager
15/08/12 13:24:49 INFO YarnClientImpl: Submitted application application_1437639737006_3808
15/08/12 13:24:50 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:50 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1439385889600
final status: UNDEFINED
tracking URL: http://hostname:port/proxy/application_1437639737006_3808/
user: hdfs
15/08/12 13:24:51 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:52 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:53 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:54 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:55 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:56 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:57 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:58 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:59 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:00 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:01 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:02 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:03 INFO Client: Application report for application_1437639737006_3808 (state: FAILED)
15/08/12 13:25:03 INFO Client:
client token: N/A
diagnostics: Application application_1437639737006_3808 failed 2 times due to AM Container for appattempt_1437639737006_3808_000002 exited with exitCode: -1000 due to: File file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip does not exist
.Failing this attempt.. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.hdfs
start time: 1439385889600
final status: FAILED
tracking URL: http://hostname:port/cluster/app/application_1437639737006_3808
user: hdfs
Exception in thread "main" org.apache.spark.SparkException: Application application_1437639737006_3808 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:855)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
How to fix this issue ? Please help me if i'm doing wrong.
The file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip you submit does not exist.
While running with the Yarn Cluster mode, you always need to specify the other Memory settings for your executors and there individually memory, Plus you always need to specify the driver details also. Now for Example
Amazon EC2 Environment (Reserved already):
m3.xlarge | CORES : 4(1) | RAM : 15 (3.5) | HDD : 80 GB | Nodes : 3 Nodes
spark-submit --class <YourClassFollowedByPackage> --master yarn-cluster --num-executors 2 --driver-memory 8g --executor-memory 8g --executor-cores 1 <Your Jar with Full Path> <Jar Args>
Always remember to add the other third-party libraries or jars to your Classpath in each of the Task Nodes, You can add them directly to your Spark or Hadoop Classpath on each of your Node.
Notes :
1) If you're using the Amazon EMR then It can be achieved using Custom Bootstrap Actions and S3.
2) Remove the conflicting jars too. Sometimes you'll see an unnecessary NullPointerException and this could be one of the key reason for it.
If possible add your stacktrace using
yarn logs -applicationId <HadoopAppId>
So that I can answer you in more specific way.
I recently ran into the same issue. Here was my scenario:
Cloudera Managed CDH 5.3.3 cluster with 7 nodes. I was submitting the job from one of the nodes and it used to fail in both yarn-cluster and yarn-master modes with the same issue.
If you look at the stacktrace, you'll find this line-
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
This is the reason why the job fails because resources are not copied.
In my case, it was resolved by correcting the HADOOP_CONF_DIR path. It wasn't pointing to the exact folder that contains the core-site.xml and yarn-site.xml and other configuration files. Once this was fixed, the resources were copied during the initiation of the ApplicationMaster and the job ran correctly.
I was able to solve this by providing the driver memory and executor memory at run time.
spark-submit --driver-memory 1g --executor-memory 1g --class com.package.App --master yarn --deploy-mode cluster /home/spark.jar

Resources