Can we Externalize parameters using Excel? - apache-spark

I have a lot of parameters to externalize. Can we use Excel for that? If yes, how do I reference an Excel file in spark-submit? If not, what is the ideal way to do it?
spark-submit
spark-submit --class "com.syntel.spark.sparkDVT" --master yarn --jars /root/config-1.3.1.jar,mysql-connector-java-5.1.42.jar --files /root/testex.xls --executor-memory 512m --executor-cores 1 --num-executors 5 /root/sparkdvtparametres_2.11-1.0.jar /root/files/output/extra_in_src /root/files/output/extra_in_dest /root/files/output/misMatch /root/files/output/DestHeader /root/files/output/SrcHeader /root/files/output/Summary /root/files/output/Summary_TimeTaken /root/files/output/TC01_ColumnStats /root/files/output/TC01_MisMatchHeaders dev
Error: Exception in thread "main" java.io.FileNotFoundException: --files (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
I have externalized the connection parameters in an Excel file, testex.xls, and pass it with --files /root/testex.xls. When I hard-code the path and run locally, it works fine. How do I reference the Excel file in spark-submit?

It's working fine after I added the Apache POI jars to the classpath.
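On the --files part of the question: on YARN, a file passed with --files is copied into each container's working directory under its base name, so a path hard-coded for local runs (such as /root/testex.xls) may not exist there. Below is a minimal Python sketch of a lookup that tries the working directory first and falls back to the configured path. The helper name resolve_shipped_file is hypothetical, and reading the workbook itself (for example with Apache POI in a JVM application) is not shown.

```python
import os


def resolve_shipped_file(configured_path, work_dir=None):
    """Return a readable path for a file that may have been shipped with --files.

    Hypothetical helper: looks for the file's base name in the working
    directory first (where YARN localizes --files entries), then falls
    back to the configured path used for local runs.
    """
    work_dir = work_dir or os.getcwd()
    shipped = os.path.join(work_dir, os.path.basename(configured_path))
    if os.path.isfile(shipped):
        return shipped
    if os.path.isfile(configured_path):
        return configured_path
    raise FileNotFoundError(
        "%s not found in %s or at its configured path" % (configured_path, work_dir)
    )
```

With this approach the same configured value, /root/testex.xls, can be used both for a local run and for a job submitted with --files /root/testex.xls.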

Related

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result, I'm only getting driver metrics.
I also tried passing metrics.properties in the spark-submit command together with the global Spark metrics properties, but that didn't help.
Finally, I tried setting the configuration both in spark-submit and in the Java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
Update:
When I submit the job in cluster mode, I get the same metrics as in the Spark History Server, but the JVM metrics are still absent.
Posting to a dated question, but maybe it will help.
It seems that the executors do not have the metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on YARN, pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ships the listed files to the executors, where they are placed in each executor's working directory.
The spark.metrics.conf option specifies a custom location for the metrics configuration file.
Another way to fix the issue is to place the file at $SPARK_HOME/conf/metrics.properties on the driver host and on every executor host before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html
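The two options above can be put together in a small script. The following Python sketch writes a metrics.properties file with the sink settings from the question and returns the matching spark-submit options. The function names are hypothetical. Passing only the bare file name to spark.metrics.conf assumes that the executors find the file in their working directory and that, in client mode, spark-submit is launched from the directory containing the file.

```python
import os

# Settings copied from the question; host and prefix are the asker's values.
METRICS_SETTINGS = [
    ("driver.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink"),
    ("executor.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink"),
    ("*.sink.graphite.host", "ap-test-m.c.gcp-ps.internal"),
    ("*.sink.graphite.port", "2003"),
    ("*.sink.graphite.period", "10"),
    ("*.sink.graphite.unit", "seconds"),
    ("*.sink.graphite.prefix", "app-test"),
    ("*.source.jvm.class", "org.apache.spark.metrics.source.JvmSource"),
]


def write_metrics_properties(path, settings=METRICS_SETTINGS):
    """Write one key=value line per setting, in Java properties style."""
    with open(path, "w") as handle:
        for key, value in settings:
            handle.write("%s=%s\n" % (key, value))
    return path


def metrics_submit_args(properties_path):
    """Return the --files and --conf options described in the answer."""
    return [
        "--files", properties_path,
        "--conf", "spark.metrics.conf=" + os.path.basename(properties_path),
    ]
```

The returned list can be spliced into the spark-submit command line before the application jar.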

spark-submit does not work with my jar located in hdfs

Here is my situation:
Apache spark version 2.4.4
Hadoop version 2.7.4
My application jar is located in hdfs.
My spark-submit looks like this:
/software/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
--class com.me.MyClass --master spark://host2.local:7077 \
--deploy-mode cluster \
hdfs://host2.local:9000/apps/myapps.jar
I get this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.tracing.SpanReceiverHost.get(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/hadoop/tracing/SpanReceiverHost;
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:634)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:139)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:139)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveAndDownloadJars$1.apply(DependencyUtils.scala:61)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveAndDownloadJars$1.apply(DependencyUtils.scala:64)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.DependencyUtils$.resolveAndDownloadJars(DependencyUtils.scala:60)
at org.apache.spark.deploy.worker.DriverWrapper$.setupDependencies(DriverWrapper.scala:96)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:60)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Any pointers on how to solve this, please?
Thank you.
There is no need to transfer the jar to the cluster; you can run it from your local account, provided it has executable permission.
Once your application is built, transfer the .jar to your Unix user account and give it executable permissions. Have a look at the spark-submit command below:
spark-submit --master yarn --deploy-mode cluster --queue default \
--files "full path of your properties file" --driver-memory 4G \
--num-executors 8 --executor-cores 1 --executor-memory 4G \
--class "main class name" \
"full path of the jar which you have transferred to your local unix id"
You can use other spark-submit configuration parameters if you want. Note that on some installations you have to use spark2-submit instead of spark-submit when multiple Spark versions are installed.
--deploy-mode cluster will help in this case; YARN takes care of shipping the jar to the cluster.

How to share files on master node to executors in Spark, How to use --files argument?

Can someone please explain how I can ship files on the master node to all executors using the --files argument in spark-submit?
/bin/spark-submit --master yarn --queue development --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=128G --files /keras/mnist.npz
But this gives me an error. I am new to Spark.
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
You didn't specify the application resource (the application jar or Python file) in this command; spark-submit expects it after the options. Find more details in the Running Spark on YARN documentation.
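The error is raised because no application jar or Python file follows the options. The ordering spark-submit expects is: options, then the application resource, then the application's own arguments. A minimal Python sketch of that ordering follows; build_spark_submit is a hypothetical helper and my_app.py is a placeholder, since the question does not name the application.

```python
import shlex


def build_spark_submit(app_resource, options=(), files=(), app_args=()):
    """Assemble the argument list: options, application resource, app arguments."""
    if not app_resource:
        raise ValueError("Missing application resource (application jar or .py file)")
    command = ["spark-submit"]
    command.extend(options)
    if files:
        # --files takes a single comma-separated list.
        command.extend(["--files", ",".join(files)])
    command.append(app_resource)
    command.extend(app_args)
    return command


command = build_spark_submit(
    "my_app.py",  # placeholder: the real application file is not given in the question
    options=["--master", "yarn", "--queue", "development"],
    files=["/keras/mnist.npz"],
)
# Print a shell-quoted version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

Anything placed after the application resource is passed to the application itself, so --files has to come before it.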

How to pass a local file as input in spark-submit

How do I pass a local file as input in spark-submit? I have tried the following:
spark-submit --jars /home/hduser/.ivy2/cache/com.typesafe/config/bundles/config-1.3.1.jar --class "retail.DataValidator" --master local[2] --executor-memory 2g --total-executor-cores 2 sample-spark-180417_2.11-1.0.jar file:///home/hduser/Downloads/Big_Data_Backup/ dev file:///home/hduser/spark-training/workspace/demos/output/destination file:///home/hduser/spark-training/workspace/demos/output/extrasrc file:///home/hduser/spark-training/workspace/demos/output/extradest
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/home/inputfile , expected: hdfs://hadoop:54310
I also tried the path without the "file://" prefix, but no luck. It's working fine in Eclipse.
Thank you!
If you want those files to be accessible by each executor, you need to use the --files option. Example:
spark-submit --files file1,file2,file3

submit spark Diagnostics: java.io.FileNotFoundException:

I'm using the following command:
bin/spark-submit --class com.my.application.XApp \
--master yarn-cluster \
--executor-memory 100m \
--num-executors 50 \
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar \
1000
and I get java.io.FileNotFoundException, and I can see the app status as FAILED on my YARN cluster.
The jar is available at that location. Is there a specific place where I need to put this jar when using spark-submit in cluster mode?
Exception:
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar does not exist
Failing this attempt. Failing the application.
You must pass the jar file to the execution nodes by adding it to the --jars argument of spark-submit, e.g.:
bin/spark-submit --class com.my.application.XApp \
--master yarn-cluster \
--jars "/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar" \
--executor-memory 100m \
--num-executors 50 \
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000
