I am running the following code in Spark.
scala>import com.databricks.spark.xml.XmlInputFormat
scala>import org.apache.hadoop.io.{LongWritable, Text}
scala>val record1 = sc.newAPIHadoopFile("file:///home/myuser/myfile.xml", classOf[XmlInputFormat], classOf[LongWritable], classOf[Text])
When I run it by setting master to local it works fine.
spark2-shell --jars spark-xml_2.10-0.4.1.jar --master local[*]
But when I try to run it on YARN, it returns: File file:/home/myuser/myfile.xml does not exist.
spark2-shell --jars spark-xml_2.10-0.4.1.jar --master yarn
I tried adding --deploy-mode with both client and cluster, but neither worked.

The file file:///home/myuser/myfile.xml seems to be accessible only on your driver, not on your executors. You should either
A) manually put the file on HDFS and read it from there
or B) use the --files option of spark2-shell, which uploads the file to HDFS for you:
spark2-shell --files /home/myuser/myfile.xml --jars spark-xml_2.10-0.4.1.jar --master yarn
and then
val record1 = sc.newAPIHadoopFile(org.apache.spark.SparkFiles.get("myfile.xml"), classOf[XmlInputFormat], classOf[LongWritable],classOf[Text])


Spark Kafka Streaming not displaying data on spark-submit on EMR

I am trying to stream data from a Kafka topic. It works in the Spark shell, but if I create a .py file and run the same code with spark-submit, it fails:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("TestApp").enableHiveSupport().getOrCreate()
kafka_bootstrap_server = BOOTSTRAP_SERVERS
topic = 'ota-impactreportsync'
starting_offsets = 'earliest'
df = spark_session.readStream.format("kafka").option("kafka.bootstrap.servers", kafka_bootstrap_server).option(
"subscribe", topic).option("startingOffsets", starting_offsets).option("failOnDataLoss", "false").load()
Commands used:
pyspark --master local --packages,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 --conf "" --conf ""
spark-submit --master local --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,,org.apache.spark:spark-avro_2.12:3.3.1 --conf "" --conf ""
If I use batch streaming with spark-submit, it works.
The job fails within about 10 seconds every time, and the logs are not helpful either: they show no errors.

spark-submit command with --py-files fails if the driver class path or executor class path is not set

I have a main script as below
from pyspark.sql.session import SparkSession
import callmodule as cm  # imported from another PySpark script, which is in a separate file
When I submit the Spark command as below, it fails with Error: No module named Callmodule
spark-submit --master local --py-files C:\pyspark\scripts\
When I submit the Spark command with the driver class path (without the executor class path) as below, it runs successfully.
spark-submit --master local --py-files C:\pyspark\scripts\ --driver-class-path C:\pyspark\scripts\
spark-submit --master local --py-files C:\pyspark\scripts\ --conf spark.driver.extraClassPath=C:\pyspark\scripts\
spark-submit --master local --py-files C:\pyspark\scripts\ --conf spark.driver.extraLibraryPath=C:\pyspark\scripts\
When I submit the Spark command with the executor class path (without the driver class path), it also runs successfully.
spark-submit --master local --py-files C:\pyspark\scripts\ --conf spark.executor.extraClassPath=C:\pyspark\scripts\
spark-submit --master local --py-files C:\pyspark\scripts\ --conf spark.executor.extraLibraryPath=C:\pyspark\scripts\
Can you explain where the import statement below runs: on the driver or on the executor?
import callmodule as cm
Why is the code not failing with Error: No Module Named callmodule when only the driver classpath is set or only the executor classpath is set?
You are using --master local, so the driver and the executor run in the same JVM. Therefore, setting the classpath on either the driver or the executor produces the same behaviour, and neither would cause an error.
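As a rough illustration of the question "where does the import run": a top-level import is resolved by whichever Python process executes the script, using that process's sys.path. The sketch below is plain Python, not Spark. The temporary directory and the generated callmodule.py are hypothetical stand-ins for the questioner's scripts folder, and import_from_dir is a helper name made up for this example. How Spark itself fills the path from --py-files is not shown here.

```python
import importlib
import os
import sys
import tempfile


def import_from_dir(module_name, directory):
    """Import `module_name` after putting `directory` on this process's sys.path."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Hypothetical stand-in for the directory that holds callmodule.py.
scripts_dir = tempfile.mkdtemp()
with open(os.path.join(scripts_dir, "callmodule.py"), "w") as f:
    f.write("def greet():\n    return 'hello from callmodule'\n")

# Before the directory is on sys.path, the import fails in this process.
try:
    importlib.import_module("callmodule")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# After the directory is on sys.path, the same import succeeds.
cm = import_from_dir("callmodule", scripts_dir)
print(found_before, cm.greet())
```

The point of the sketch is only that the module has to be visible on the search path of the process that executes the import statement.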

spark-submit yarn

spark-submit --master yarn --deploy-mode cluster --jars sqljdbc42.jar
I get Error :
Everything works when I use --deploy-mode client and copy the jar file to /usr/hdp/current/spark2-client/jars/sqljdbc42.jar
Should I copy the sqljdbc42.jar to /usr/hdp/current/hadoop-yarn-client/lib/ on all data nodes?

How to pass a local file as input in spark-submit

How do I pass a local file as input to spark-submit? I have tried the following:
spark-submit --jars /home/hduser/.ivy2/cache/com.typesafe/config/bundles/config-1.3.1.jar --class "retail.DataValidator" --master local[2] --executor-memory 2g --total-executor-cores 2 sample-spark-180417_2.11-1.0.jar file:///home/hduser/Downloads/Big_Data_Backup/ dev file:///home/hduser/spark-training/workspace/demos/output/destination file:///home/hduser/spark-training/workspace/demos/output/extrasrc file:///home/hduser/spark-training/workspace/demos/output/extradest
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/home/inputfile , expected: hdfs://hadoop:54310
I also tried the path without the "file://" prefix, but no luck. It works fine in Eclipse.
Thank you!
If you want those files to be accessible to each executor, you need to use the --files option. Example:
spark-submit --files file1,file2,file3
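The "Wrong FS" message in the question comes from a check that compares the filesystem named by a path against the filesystem that is asked to open it. The following is a simplified imitation of that comparison in plain Python, not Hadoop's actual implementation: the function name check_path is made up for this sketch, it looks only at scheme and authority, and it treats a path without a scheme as belonging to the target filesystem.

```python
from urllib.parse import urlparse


def check_path(path, fs_uri):
    """Raise ValueError if `path` names a different filesystem than `fs_uri`.

    Simplified assumption: only scheme and authority are compared, and a
    path with no scheme is accepted as relative to `fs_uri`.
    """
    p = urlparse(path)
    fs = urlparse(fs_uri)
    if not p.scheme:
        return
    if (p.scheme, p.netloc) != (fs.scheme, fs.netloc):
        raise ValueError("Wrong FS: {}, expected: {}".format(path, fs_uri))


# Values taken from the question: an hdfs:// filesystem handed a file:// path.
try:
    check_path("file:///home/hduser/Downloads/Big_Data_Backup/", "hdfs://hadoop:54310")
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

A pre-flight check of this kind can make it obvious which arguments still point at the local disk before the job is submitted.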

Spark-submit throwing error from yarn-cluster mode

How do we pass a config file to executor when we submit a spark job on yarn-cluster?
If I change my spark-submit command below to --master yarn-client, it works fine and I get the expected output.
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
My Spark Code:
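import java.io.File
import com.typesafe.config.ConfigFactory
object InitProcess {
  def main(args: Array[String]): Unit = {
    val config_loc = args(0) // path to the config file, passed as the first argument
    val config = ConfigFactory.parseFile(new File(config_loc))
    val jobName = config.getString("job_name")
    // ...
  }
}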
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me run this command with --master yarn-cluster?
The difference between yarn-client and yarn-cluster is that in yarn-client mode the driver runs on the machine where the spark-submit command is executed.
In your case, the config file is at /home/cloudera/conf/omega.config. It can be found when you run as yarn-client, because the driver runs on the current machine, which holds this full/path/to/file.
But it can't be accessed in yarn-cluster mode, because the driver runs on another host, which doesn't hold this full/path/to/file. With default parse options, ConfigFactory.parseFile returns an empty config for a missing file instead of failing, which would explain why the error reports a missing key rather than a missing file.
I'd suggest executing the command in the following format:
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
/home/cloudera/Omega.jar omega.config
Send the config file using --files with its full path name, and pass it to the jar as a parameter by file name only (not the full path), because the file will be downloaded to an unknown location on the workers.
In your code, you can use SparkFiles.get(filename) to get the actual full path name of the downloaded file on the worker.
The change in your code should be something similar to:
val config_loc = SparkFiles.get(args(0))
SparkFiles docs
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().
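To make the "full path for --files, bare file name for the application" rule concrete, here is a small sketch in plain Python that only builds the argument list and does not run spark-submit. The helper name build_submit_args is hypothetical. The paths, class name, and master value are the ones from the question, and the --num-executors and --executor-memory flags are left out for brevity.

```python
import posixpath


def build_submit_args(config_path, app_jar, main_class, master="yarn-cluster"):
    """Build a spark-submit argument list that ships `config_path` via --files
    and hands only its base name to the application as its first argument."""
    return [
        "spark-submit",
        "--master", master,
        "--class", main_class,
        "--files", config_path,             # full local path, uploaded by spark-submit
        app_jar,                            # application jar is positional
        posixpath.basename(config_path),    # the application sees just the file name
    ]


args = build_submit_args(
    "/home/cloudera/conf/omega.config",
    "/home/cloudera/Omega.jar",
    "com.mdm.InitProcess",
)
print(" ".join(args))
```

Inside the application, that last argument is what would be passed to SparkFiles.get, as in the Scala line above.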
