submit spark Diagnostics: java.io.FileNotFoundException: - apache-spark

I'm using following command
bin/spark-submit --class com.my.application.XApp
--master yarn-cluster
--executor-memory 100m
--num-executors 50
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and getting java.io.FileNotFoundException: and I can see on my cluster Yarn the app status as FAILED.
The jar is available at the location. Is there any specific place I need to place this jar when use cluster mode spark submit ?
Exception:
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar does not exist
Failing this attempt. Failing the application.

You must pass the jar file to the execution nodes by adding it to the "--jar" argument of spark-submit. E.g.
bin/spark-submit --class com.my.application.XApp
--master yarn-cluster
--jars "/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar"
--executor-memory 100m
--num-executors 50
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000

Related

Why does spark-shell/pyspark gets less resources compared to spark-submit

I am running a PySpark code which runs on a single node due to code requirements.
But when I run the code using PySpark
pyspark --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g
I get Segment error and the code fails, when I checked in Yarn I saw that my job was not getting resources even when they were available.
But when I use spark-submit
spark-submit --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g code.py
the code gets all the resources it needs and the code runs perfectly.
Am I missing some fundamental aspect of spark or why is this happening?

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result I'm only getting driver metrics.
Also, I was trying to add metrics.properties to spark-submit command together with global spark metrics props, but that didn't help.
And finally, I tried conf in spark-submit and in java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit job in cluster mode - I'm getting the same metrics as in Spark History Server. But the jvm metrics are still absent.
Posting to a dated question, but maybe it will help.
Seems like executors do not have metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on yarn pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that files specified in the option will be shared to executors.
The spark.metrics.conf option specifies a custom file location for the metrics.
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html

spark-submit does not work with my jar located in hdfs

Here is my situation:
Apache spark version 2.4.4
Hadoop version 2.7.4
My application jar is located in hdfs.
My spark-submit looks like this:
/software/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
--class com.me.MyClass --master spark://host2.local:7077 \
--deploy-mode cluster \
hdfs://host2.local:9000/apps/myapps.jar
I get this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.tracing.SpanReceiverHost.get(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/hadoop/tracing/SpanReceiverHost;
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:634)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:139)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:139)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveAndDownloadJars$1.apply(DependencyUtils.scala:61)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveAndDownloadJars$1.apply(DependencyUtils.scala:64)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.DependencyUtils$.resolveAndDownloadJars(DependencyUtils.scala:60)
at org.apache.spark.deploy.worker.DriverWrapper$.setupDependencies(DriverWrapper.scala:96)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:60)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Any pointer how to solve this, please?
Thank you.
There is no need to transfer the jar into cluster, you can run your jar from your local id itself with executable permission.
Once your application is build transfer the .jar to your unix user account and give it executable permissions. Have a look at the below spark submit:-
spark-submit --master yarn --deploy-mode cluster --queue default
--files "full path of your properties file" --driver-memory 4G
--num-executors 8 --executor-cores 1 --executor-memory 4G
--class "main class name"
"full path of the jar which you have transferred to your local unix id"
You can use other spark submit configuration parameters if you want. Please note that in some version you have to use spark2-submit instead of spark-submit if there are multiple spark version involved.
--deploy-mode cluster will help in this case. taking the jars to cluster will be taken care by yarn cluster.

How to share files on master node to executors in Spark, How to use --files argument?

Can someone please explain, How can i ship my files in master to all executors using --files argument in spark-submit
/bin/spark-submit --master yarn --queue development --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=128G --files /keras/mnist.npz
But this gives me error. I am new to spark.
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
Obviously you didn't specify the application class on this command. Find more details on Running Spark On Yarn.

Erro spark-assembly-1.4.1-hadoop2.6.0.jar does not exist

I'm trying to submit a Spark app from local machine Terminal to my Cluster. I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File
file:/Users/nish1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar
does not exist
I can see in my service list ,
YARN + MapReduce2 2.7.1.2.3 Apache Hadoop NextGen MapReduce (YARN)
Spark 1.4.1.2.3 Apache Spark is a fast and general engine for
large-scale data processing.
already installed.
My spark-env.sh in local machine
export HADOOP_CONF_DIR=/Users/nish1013/Dev/hadoop-2.7.1/etc/hadoop
Has anyone encountered similar before ?
I think the right command to call is like following:
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 --conf spark.yarn.jars=hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
or you can add
spark.yarn.jars hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
in your spark.default.conf file

Resources