spark-submit fails giving FileNotFoundException - apache-spark

I am using Spark 1.6. Below is the spark-submit command I am executing. Variables prefixed with '$' are dynamic variables.
spark-submit --queue "queue1" --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./conf/log4j.properties -Dspark.yarn.keytab=$KEYTAB -Dspark.yarn.principal=$PRINCIPAL -Dkeytab=$KEYTAB -Dprincipal=$PRINCIPAL" --conf spark.yarn.am.cores=$am_cores --conf spark.yarn.am.memory="$am_memory" --conf spark.yarn.executor.memoryOverhead=$executor_memoryOverhead --conf spark.yarn.principal=$PRINCIPAL --conf spark.yarn.keytab=$KEYTAB --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.ui.enabled=true --conf spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/conf/ --conf spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/conf/ --conf spark.sql.shuffle.partitions=$shuffle_partitions --name $APPNAME --master yarn --num-executors $num_executors --executor-memory $executor_memory --executor-cores $executor_cores --driver-memory $driver_memory --deploy-mode cluster --principal $PRINCIPAL --keytab $KEYTAB --files ./conf/log4j.properties --jars $APPSCALALIBJARFILES --class $APPCLASS $APPLIB
I am running the spark-submit job on a YARN cluster. During submission the dependent jars are uploaded to the default HDFS staging directory, /user/<user id>/.sparkStaging/<yarn applicationId>/*.jar.
While the job is being submitted I can verify that the jar is uploaded, yet spark-submit fails with the error below. The file owner and group belong to the same id that performs the spark-submit. I also tried the configuration parameter spark.yarn.StagingDir, but even that didn't help.
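For reference, in later Spark versions (2.0+) the staging-directory override is spelled spark.yarn.stagingDir and is typically passed as below; whether Spark 1.6 honours this property is an assumption worth verifying, and <namenode host> is a placeholder for the active NameNode:
spark-submit \
  --conf spark.yarn.stagingDir=hdfs://<namenode host>:8020/user/<user id>/sparkStaging \
  ... (remaining options as in the command above)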
Any professional input on addressing this issue would be appreciated.
Error stack trace -
Diagnostics: File does not exist: hdfs://user/<user id>/.sparkStaging/<yarn application_id>/chill-java-0.5.0.jar
java.io.FileNotFoundException: File does not exist:
hdfs://user/<user id>/.sparkStaging/<yarn application_id>/chill-java-0.5.0.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1257)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Related

How to use custom jars in spark-submit --packages

I have a PySpark project doing Spark Structured Streaming. To get the query metrics I created a Java project that listens to the micro-batch events and logs the data to a log file. The logging works when I pass the jar as a file and read it in spark-submit using --jars.
However, this process involves manual work afterwards, since the jar has to be uploaded by hand. To solve that I
uploaded the jar to a JFrog repository. Now, when running the spark-submit command, I have added --repositories and also added the coordinate to --packages, which already includes a few packages like Kafka, Avro, etc. All the packages download from JFrog, but when it reaches my jar it throws the error below, even though if I try the repo URL from the log in a browser it actually downloads the jar and the pom as well!
:: problems summary ::
:::: WARNINGS
module not found: <myjar>;<version>
==== central: tried
https://<repo>/myjar.pom
-- artifact <myjar>.jar:
https://<repo>/myjar.jar
==== repo-1: tried
https://<repo>/myjar.pom
-- artifact <myjar>.jar:
https://<repo>/myjar.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.spark.extension#<myjar>;<version>: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.spark.extension#<myjar>;<version>: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1428)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:902)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
EDIT:
Taken from EMR (some of the URLs/names are omitted):
spark-submit --name "A Adapter" --deploy-mode cluster --master yarn --repositories https://<jfrog repo>/artifactory/all/ --packages com.spark.extension:spark-listeners:0.3.8,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.postgresql:postgresql:42.2.22,software.amazon.cloudwatchlogs:aws-embedded-metrics:2.0.0-beta-1 --driver-cores 2 --driver-memory 12g --executor-memory 12g --num-executors 1 --executor-cores 2 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=6 --files s3://<url>/log4j.properties,s3://<url>/logging.json --conf spark.yarn.dist.archives=s3://<url>/libs.zip#app-site-packages --conf spark.yarn.appMasterEnv.PYTHONPATH=app-site-packages --conf "spark.yarn.appMasterEnv.SPARK_APP_NAME=A Adapter" --conf spark.yarn.appMasterEnv.CLUSTER_STUB=dev-b1 --conf "spark.yarn.appMasterEnv.AWS_EMF_SERVICE_NAME=A Adapter" --conf spark.yarn.appMasterEnv.AWS_EMF_SERVICE_TYPE=dev-b1-emr --conf spark.yarn.appMasterEnv.AWS_EMF_LOG_GROUP_NAME=dev-b1-spark-structured-stream-logs --conf spark.yarn.appMasterEnv.AWS_EMF_LOG_STREAM_NAME=dev-b1-spark-structured-stream-logs --conf spark.yarn.appMasterEnv.AWS_EMF_AGENT_ENDPOINT=udp://127.0.0.1:25888 --conf spark.driver.extraJavaOptions= --conf spark.executor.extraJavaOptions= --conf spark.executorEnv.PYTHONPATH=app-site-packages --py-files s3://<url>/libs.zip,s3://<url>/jobs.zip,s3://<url>/.env s3://<url>/main.py --job acc
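If the JFrog repository requires authentication, one option worth trying (an assumption, not something shown in the logs above) is pointing Spark's Ivy resolution at a custom settings file through the documented spark.jars.ivySettings conf, keeping the rest of the command unchanged; /home/hadoop/ivysettings.xml is a placeholder path:
spark-submit \
  --conf spark.jars.ivySettings=/home/hadoop/ivysettings.xml \
  --repositories https://<jfrog repo>/artifactory/all/ \
  --packages com.spark.extension:spark-listeners:0.3.8,... \
  ... (remaining options as above)
The ivysettings.xml would hold the resolver URL and credentials. Also double-check that the groupId:artifactId:version in --packages exactly matches the path of the pom that downloads in the browser.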

Spark submission on DCOS cluster fails with java.net.UnknownHostException: hdfs

I am running a spark-submit in cluster/rest mode on a DCOS cluster:
$ ./spark-submit --deploy-mode cluster --master mesos://localhost:7077 --conf spark.master.rest.enabled=true --conf spark.mesos.uris=http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml,http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/core-site.xml --conf spark.mesos.executor.docker.image=someregistry:5000/someimage:2.0.0-rc3 --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://hdfs/history --conf spark.executor.extraClassPath=/opr/spark/dist/elasticsearch-spark-20_2.11-6.4.2.jar --conf spark.mesos.driverEnv.SPARK_HDFS_CONFIG_URL=http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints/hdfs-site.xml --conf spark.executor.memory=42G --conf spark.driver.memory=8G --conf spark.executor.cores=8 --driver-class-path /opt/spark/dist/elasticsearch-spark-20_2.11-6.4.2.jar http://hostname/somescript.py
The task fails as follows:
java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:668)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:604)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:68)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: hdfs
... 26 more
I have created a tunnel from my localhost:7077 to the master's 7077 because the internal URI was not directly reachable;
Edit: I think it could be related to my executors not being able to read core-site.xml, where the following declaration resides:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hdfs</value>
</property>
Do I need to point my local spark installation at these files somehow?
The problem is in the --conf spark.eventLog.dir=hdfs://hdfs/history
You need to change it to --conf spark.eventLog.dir=hdfs://HDFS_NAME_NODE_HOSTNAME:8020/hdfs/history
Note: Replace HDFS_NAME_NODE_HOSTNAME with the actual Namenode hostname.
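If you are unsure of the NameNode hostname, a quick sketch from any node that has the Hadoop client configuration (assuming the standard hdfs CLI is available there):
hdfs getconf -namenodes
hdfs getconf -confKey fs.defaultFS
If fs.defaultFS comes back as a logical nameservice such as hdfs://hdfs, the driver and executors also need the hdfs-site.xml that defines that nameservice; otherwise point spark.eventLog.dir at the physical namenode host:port as shown above.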

Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image

I am trying to run spark-submit on Minikube (Kubernetes) from my local machine's CLI with the command
spark-submit --master k8s://https://127.0.0.1:8001 --name cfe2
--deploy-mode cluster --class com.yyy.Test --conf spark.executor.instances=2 --conf spark.kubernetes.container.image docker.io/anantpukale/spark_app:1.1 local://spark-0.0.1-SNAPSHOT.jar
I have a simple Spark job jar built on version 2.3.0. I have also containerized it in Docker, and Minikube is up and running on VirtualBox.
Below is exception stack:
Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep.<init>(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.DriverConfigOrchestrator.getAllConfigurationSteps(DriverConfigOrchestrator.scala:82)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:229)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Shutdown hook called
2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Deleting directory C:\Users\anant\AppData\Local\Temp\spark-6da93408-88cb-4fc7-a2de-18ed166c3c66
It looks like a bug with the default value for the parameter spark.kubernetes.driver.container.image, which is supposed to fall back to spark.kubernetes.container.image. So try specifying the driver/executor container images directly:
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
From the source code, the only available conf options are:
spark.kubernetes.container.image
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
I also noticed that Spark 2.3.0 changed a lot in its Kubernetes implementation compared to 2.2.0. For example, instead of specifying the driver and executor images separately, the official getting-started guide uses a single image given to spark.kubernetes.container.image.
See if this works:
spark-submit \
--master k8s://http://127.0.0.1:8001 \
--name cfe2 \
--deploy-mode cluster \
--class com.oracle.Test \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=docker/anantpukale/spark_app:1.1 \
--conf spark.kubernetes.authenticate.submission.oauthToken=YOUR_TOKEN \
--conf spark.kubernetes.authenticate.submission.caCertFile=PATH_TO_YOUR_CERT \
local://spark-0.0.1-SNAPSHOT.jar
The token and cert can be found on the Kubernetes dashboard. Follow the instructions to build Spark 2.3.0 compatible Docker images.
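For the image-building step, Spark 2.3.0 ships a helper script in the distribution, bin/docker-image-tool.sh; a minimal sketch, where the repository and tag are placeholders borrowed from the question:
./bin/docker-image-tool.sh -r docker.io/anantpukale -t 1.1 build
./bin/docker-image-tool.sh -r docker.io/anantpukale -t 1.1 push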

Spark 2.1.1 with typesafeconfig

I'm trying to support an external configuration file for my Spark application using Typesafe Config.
I'm loading the application.conf file in my application code like this (driver):
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.load()
val myProp = config.getString("app.property")
val df = spark.read.avro(myProp) // .avro comes from the spark-avro package's implicits
application.conf looks like this:
app.property="some value"
spark-submit execution looks like this:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $HOME/conf/*.conf \
--files $HOME/conf/application.conf \
my-app-0.0.1-SNAPSHOT.jar
It doesn't seem to work, and I'm getting:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'app'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at com.paypal.cfs.fpti.Main$.main(Main.scala:42)
at com.paypal.cfs.fpti.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Looking at the logs, I do see that --files works; it seems like a classpath issue...
18/03/13 01:08:30 INFO SparkContext: Added file file:/home/user/conf/application.conf at file:/home/user/conf/application.conf with timestamp 1520928510820
18/03/13 01:08:30 INFO Utils: Copying /home/user/conf/application.conf to /tmp/spark-2938fde1-fa4a-47af-8dc6-1c54b5e89d48/userFiles-c2cec57f-18c8-491d-8679-df7e7da45e05/application.conf
Turns out I was pretty close to the answer to begin with... here is how it worked for me:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $APP_HOME/conf \
--files $APP_HOME/conf/application.conf \
$APP_HOME/my-app-0.0.1-SNAPSHOT.jar
then $APP_HOME will contain the below:
conf/application.conf
my-app-0.0.1-SNAPSHOT.jar
I guess you need to make sure application.conf is placed inside a folder; that is the trick. ConfigFactory.load() searches the classpath for application.conf, so it is the directory containing the file (added via --driver-class-path) that needs to be on the classpath, not the file itself.
In order to specify the config file path, you may pass it as an application argument, and then read it from the args variable of the main class.
This is how you would execute the spark-submit command. Note that I've specified the config file after the application jar.
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
my-app-0.0.1-SNAPSHOT.jar $HOME/conf/application.conf
And then, load the config file from the path specified in args(0):
import com.typesafe.config.ConfigFactory
import java.io.File
[...]
val dbconfig = ConfigFactory.parseFile(new File(args(0)))
Now you have access to the properties of your application.conf file.
val myProp = dbconfig.getString("app.property")
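If you also want the classpath defaults (application.conf/reference.conf) to keep filling in anything missing from the external file, a minimal Scala sketch assuming Typesafe Config 1.x:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

// the external file wins; anything it does not define falls back to the normal classpath chain
val external: Config = ConfigFactory.parseFile(new File(args(0)))
val config: Config = external.withFallback(ConfigFactory.load()).resolve()
val myProp = config.getString("app.property")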
Hope it helps.

spark throws java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition

When I use the spark-submit command in a Cloudera YARN environment, I get this kind of exception:
java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethods(Class.java:1975)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$listMethods$1(BeanIntrospector.scala:93)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findMethod$1(BeanIntrospector.scala:99)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$findGetter$1(BeanIntrospector.scala:124)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:177)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:173)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:173)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:172)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
...
The spark-submit command looks like this:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
Note that TopicAndPartition.class is in the shaded APP.jar.
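To rule out packaging issues, a quick hedged check that the class really made it into the shaded jar (assuming the JDK's jar tool is on the PATH):
jar tf APP.jar | grep TopicAndPartition
If the class is present yet the error persists, the NoClassDefFoundError usually points at a runtime version clash rather than a missing class at build time.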
Please try adding the Kafka jar using the --jars option as shown in the example below:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--jars /path/to/kafka.jar \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
After trying a few approaches, it turns out the issue is caused by a version incompatibility. As #user1050619 said, make sure the versions of Kafka, Spark, ZooKeeper, and Scala are compatible with each other.
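As a hedged illustration of keeping those versions aligned, a build.sbt sketch for the 0.8-style Kafka integration (the artifact that transitively brings in kafka.common.TopicAndPartition); the Scala, Spark, and Kafka versions below are assumptions and must match the Cloudera cluster:
// build.sbt sketch -- versions are placeholders, align them with the cluster
scalaVersion := "2.11.12"
val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming"           % sparkVersion % "provided",
  // pulls in org.apache.kafka:kafka_2.11, which contains kafka.common.TopicAndPartition
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion
)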
