How to pass a local file as input in spark-submit - apache-spark

How do I pass a local file as input to spark-submit? I tried the following:
spark-submit --jars /home/hduser/.ivy2/cache/com.typesafe/config/bundles/config-1.3.1.jar --class "retail.DataValidator" --master local[2] --executor-memory 2g --total-executor-cores 2 sample-spark-180417_2.11-1.0.jar file:///home/hduser/Downloads/Big_Data_Backup/ dev file:///home/hduser/spark-training/workspace/demos/output/destination file:///home/hduser/spark-training/workspace/demos/output/extrasrc file:///home/hduser/spark-training/workspace/demos/output/extradest
Error:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/home/inputfile , expected: hdfs://hadoop:54310
I also tried the path without the "file://" prefix, but no luck. It works fine in Eclipse.
Thank you.!

If you want those files to be accessible by each executor, you need to use the --files option. Example:
spark-submit --files file1,file2,file3

Related

Spark --jars option added jar are not working

I am trying to add the Redshift JAR using the spark-submit --jars option.
I am running this command on Spark 2.1.0:
spark-submit --class Test --master spark://xyz.local:7077 --executor-cores 4 --total-executor-cores 32 --executor-memory 6G --driver-memory 4G --driver-cores 2 --deploy-mode cluster -jars s3a://d11-batch-jobs-on-spark/jars/redshift-jdbc42-1.2.10.1009.jar,s3a://mybucket/jars/spark-redshift_2.11-3.0.0-preview1.jar s3a://mybucket/jars/app.jar
In the code I am reading from a Redshift table, but I get:
ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
What am I doing wrong?
I'm having issues using the --jars as well...
My advice is, for packages in the Maven repository, to use --packages instead of --jars, as it also resolves the other dependencies of those packages.
USAGE
spark-submit --packages <groupId>:<artifactId>:<version>
In your case, leaving out all other options and arguments, it would look like this:
spark-submit --packages com.amazon.redshift:redshift-jdbc42:1.2.10.1009
You can find the group ID, artifact ID, and version in the XML snippet that the Maven repository shows on the page for your desired version.
The accepted answer to this question provides more info on --jars and --packages.

How to share files on master node to executors in Spark, How to use --files argument?

Can someone please explain how I can ship files from the master to all executors using the --files argument of spark-submit?
/bin/spark-submit --master yarn --queue development --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=128G --files /keras/mnist.npz
But this gives me an error. I am new to Spark.
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
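The command does not name an application resource: the application JAR (or Python file) has to follow the options, and for a JAR the main class is normally given with --class. Find more details in Running Spark on YARN.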
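For illustration, a complete command might look like the sketch below. The class name com.example.MnistJob and the JAR path /path/to/mnist-job.jar are hypothetical placeholders, since the question names neither; the remaining options are copied from the command in the question. The function only prints the command, so nothing is submitted until the leading echo is removed.

```shell
# Hypothetical placeholders: the question names neither a class nor a JAR.
APP_CLASS="com.example.MnistJob"
APP_JAR="/path/to/mnist-job.jar"

# Dry run: print the command instead of executing it.
# Options come first; the application JAR is the first non-option argument.
print_submit_cmd() {
  echo spark-submit \
    --master yarn \
    --queue development \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=128G \
    --files /keras/mnist.npz \
    --class "$APP_CLASS" \
    "$APP_JAR"
}

print_submit_cmd
```

Anything placed after the JAR is handed to the application's main method as its own arguments rather than being parsed by spark-submit.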

Can we Externalize parameters using Excel?

I have a lot of parameters to externalize. Can we use Excel for that? If yes, how do I reference an Excel file in spark-submit? If not, what is the ideal way to do it?
spark-submit command:
spark-submit --class "com.syntel.spark.sparkDVT" --master yarn --jars /root/config-1.3.1.jar,mysql-connector-java-5.1.42.jar --files /root/testex.xls --executor-memory 512m --executor-cores 1 --num-executors 5 /root/sparkdvtparametres_2.11-1.0.jar /root/files/output/extra_in_src /root/files/output/extra_in_dest /root/files/output/misMatch /root/files/output/DestHeader /root/files/output/SrcHeader /root/files/output/Summary /root/files/output/Summary_TimeTaken /root/files/output/TC01_ColumnStats /root/files/output/TC01_MisMatchHeaders dev
Error: Exception in thread "main" java.io.FileNotFoundException: --files (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
I have externalized the connection parameters in an Excel file, testex.xls, and pass it with --files /root/testex.xls. How do I reference the Excel file in spark-submit? When I hard-code the path and run locally, it works fine.
It works fine after I added the Apache POI JARs to the classpath.

Spark-submit throwing error from yarn-cluster mode

How do we pass a config file to the executors when we submit a Spark job in yarn-cluster mode?
If I change the spark-submit command below to use --master yarn-client, it works fine and I get the expected output.
spark-submit \
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
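import java.io.File
import com.typesafe.config.ConfigFactory

object InitProcess {
  def main(args: Array[String]): Unit = {
    // First application argument: location of the config file
    val config_loc = args(0)
    val config = ConfigFactory.parseFile(new File(config_loc))
    val jobName = config.getString("job_name")
    // .....
  }
}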
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me run this command with --master yarn-cluster?
The difference between yarn-client and yarn-cluster is that in yarn-client mode the driver runs on the machine where the spark-submit command is executed.
In your case, the config file is at /home/cloudera/conf/omega.config, which can be found when you run in yarn-client mode because the driver runs on the current machine, which holds this full path to the file.
It cannot be accessed in yarn-cluster mode, because the driver runs on another host, which does not hold this full path to the file.
I'd suggest executing the command in the following format:
spark-submit \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
/home/cloudera/Omega.jar omega.config
Send the config file using --files with its full path, and pass only its file name (not the full path) as the argument to the JAR, because the file will be downloaded to an unknown location on the workers.
In your code, you can use SparkFiles.get(filename) to get the actual full path of the downloaded file on the worker.
The change in your code should be something like:
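As a rough sketch of that split between full path and bare file name, the shell snippet below derives the name with basename and prints the resulting command instead of running it (drop the leading echo to submit for real). The variable names CONFIG_PATH and CONFIG_NAME are illustrative only; the paths, class, and options are the ones from the question, and the application JAR is given positionally after the options.

```shell
# Full local path of the config file (from the question).
CONFIG_PATH="/home/cloudera/conf/omega.config"

# Bare file name, which is what the application receives as its argument.
CONFIG_NAME=$(basename "$CONFIG_PATH")

# Dry run: print the command instead of executing it.
echo spark-submit \
  --master yarn-cluster \
  --num-executors 7 \
  --executor-memory 1024M \
  --class com.mdm.InitProcess \
  --files "$CONFIG_PATH" \
  /home/cloudera/Omega.jar "$CONFIG_NAME"
```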
val config_loc = SparkFiles.get(args(0))
SparkFiles docs
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().

submit spark Diagnostics: java.io.FileNotFoundException:

I'm using the following command:
bin/spark-submit --class com.my.application.XApp \
--master yarn-cluster \
--executor-memory 100m \
--num-executors 50 \
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar \
1000
I am getting a java.io.FileNotFoundException, and on my YARN cluster I can see the app status as FAILED.
The JAR is available at that location. Is there a specific place where I need to put this JAR when using spark-submit in cluster mode?
Exception:
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar does not exist
Failing this attempt. Failing the application.
You must pass the JAR file to the execution nodes by adding it to the "--jars" argument of spark-submit. E.g.
bin/spark-submit --class com.my.application.XApp \
--master yarn-cluster \
--jars "/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar" \
--executor-memory 100m \
--num-executors 50 \
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000
