spark-submit command with --py-files fails if the driver class path or executor class path is not set

I have a main script like the one below:
from pyspark.sql.session import SparkSession
..............
..............
..............
import callmodule as cm   # <<<--- imported from another pyspark script, packaged in callmod.zip
..............
..............
..............
When I submit the spark command as below, it fails with Error: No module named Callmodule:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the spark command with the driver class path (without the executor class path) as below, it runs successfully:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --driver-class-path C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the spark command with the executor class path (without the driver class path), it also runs successfully:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
Can you explain where the import statement below runs: on the driver or on the executor?
import callmodule as cm
Why does the code not fail with Error: No module named callmodule when only the driver classpath or only the executor classpath is set?

You are using --master local, so the driver and the executor run in the same single process on your machine. The top-level import callmodule as cm is executed on the driver when mainscript.py starts; an executor imports the module only when it runs a task that uses it. In local mode those are the same process, so setting the classpath on either the driver or the executor side produces the same behaviour, and neither causes an error.
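For contrast, on a real cluster the driver and the executors are separate processes on different hosts, so the module has to be shipped to all of them, which is exactly what --py-files does. A hypothetical submit of the same job to YARN, reusing the zip from above, would look like:
spark-submit --master yarn --deploy-mode cluster --py-files callmod.zip mainscript.py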

Related

Spark-submit fails with return code 13 for a wordCount example

My spark-submit command is:
spark-submit --class com.sundogsoftware.spark.WordCountBetterDataset --master yarn --deploy-mode cluster SparkCourse.jar
And for defining the SparkSession, I use this:
val spark = SparkSession
  .builder
  .master("spark://youness:7077")
  .appName("WordCount")
  .getOrCreate()
But in the end, my job fails with return code 13.
You need to leave the master unset in the code. It is preferable to set it later, when you issue spark-submit (spark-submit --master yarn-client ...), and you are already doing that above. The .master("spark://youness:7077") in your code conflicts with the --master yarn passed to spark-submit; just remove that line.
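A minimal sketch of the corrected builder, i.e. the question's code with the master line removed:
val spark = SparkSession
  .builder
  .appName("WordCount")
  .getOrCreate()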

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector (https://github.com/hortonworks-spark/spark-atlas-connector) in a spark-submit script.
Due to security restrictions, I can't put atlas-application.properties in the spark/conf directory.
I used these two options in the spark-submit:
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch the spark-submit, I encounter this issue:
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null
Please see the CDP Atlas configuration article:
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
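In both commands the -Datlas.conf system property tells the Atlas client which directory to search for atlas-application.properties: /tmp/ on the submitting machine in client mode, and the container's working directory (./) in cluster mode, where --files has shipped a copy of the file.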

Dependency is not added to Spark + Zeppelin

I can't add a custom dependency to the Spark classpath from Zeppelin.
Environment:
AWS EMR: Zeppelin 0.8.0, Spark 2.4.0
Extra configs for the Spark interpreter:
spark.jars.ivySettings /tmp/ivy-settings.xml
spark.jars.packages my-group-name:artifact_2.11:version
The files from my-group-name appeared in
spark.yarn.dist.jars
spark.yarn.secondary.jars
But they are not accessible from the Zeppelin notebook (checked with import my.lab._).
However, when I run the same configs with spark-shell, it works both on my local machine and over ssh on the EMR cluster, and the imports are available from spark-shell.
sun.java.command for Zeppelin:
org.apache.spark.deploy.SparkSubmit --master yarn-client ... --conf spark.jars.packages=my-group-name:artifact_2.11:version ... --conf spark.jars.ivySettings=/tmp/ivy-settings.xml ... --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/lib/zeppelin/interpreter/spark/spark-interpreter-0.8.0.jar <IP ADDRESS> 34717 :
Spark submit on emr:
spark-shell --master yarn-client --conf spark.jars.ivySettings="/tmp/ivy-settings.xml" --conf spark.jars.packages="my-group-name:artifact_2.11:version"
Any advice on where to look for the error?
You can try adding your jar directly to Zeppelin, in the interpreter settings:
http://zeppelin.apache.org/docs/0.8.0/usage/interpreter/dependency_management.html
Or add the jar to the Spark libs directory (in my case, /usr/hdp/current/spark2/jars/).
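For example, in the Spark interpreter's settings the dependency is declared under Dependencies as an artifact coordinate, the same one used above (the exclude field can stay empty):
artifact: my-group-name:artifact_2.11:version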

Spark/Yarn: FileNotFoundException

I am running the following code in Spark:
scala> import com.databricks.spark.xml.XmlInputFormat
scala> import org.apache.hadoop.io._
scala> sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY,"<mytag>")
scala> sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY,"</mytag>")
scala> sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY,"utf-8")
scala> val record1 = sc.newAPIHadoopFile("file:///home/myuser/myfile.xml", classOf[XmlInputFormat], classOf[LongWritable], classOf[Text])
When I run it with the master set to local, it works fine:
spark2-shell --jars spark-xml_2.10-0.4.1.jar --master local[*]
But when I try to run it on YARN, it returns java.io.FileNotFoundException: File file:/home/myuser/myfile.xml does not exist.
spark2-shell --jars spark-xml_2.10-0.4.1.jar --master yarn
I tried adding --deploy-mode as both client and cluster, but it didn't work.
Th file file:///home/myuser/myfile.xml" seems only accessible on your driver, but not on your executors. You should either
A) manually the file on HDFS and read from there
or B) use --files option of spark2-shell which does upload the file to HDFS for you:
spark2-shell --files /home/myuser/myfile.xml --jars spark-xml_2.10-0.4.1.jar --master yar
and then
val record1 = sc.newAPIHadoopFile(org.apache.spark.SparkFiles.get("myfile.xml"), classOf[XmlInputFormat], classOf[LongWritable],classOf[Text])
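SparkFiles.get resolves to the node-local copy that --files distributed, so the same lookup works on the driver and on every executor.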

Spark-submit throwing error from yarn-cluster mode

How do we pass a config file to executor when we submit a spark job on yarn-cluster?
If I change my spark-submit command below to --master yarn-client, it works fine and I get the expected output:
spark-submit \
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
import java.io.File
import com.typesafe.config.ConfigFactory

object InitProcess {
  def main(args: Array[String]): Unit = {
    val config_loc = args(0)
    val config = ConfigFactory.parseFile(new File(config_loc))
    val jobName = config.getString("job_name")
    .....
  }
}
I am getting the below error:
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me run this command in --master yarn-cluster mode?
The difference between yarn-client and yarn-cluster is that in yarn-client mode the driver runs on the machine issuing the spark-submit command.
In your case, the config file is at /home/cloudera/conf/omega.config, which can be found when you run as yarn-client, because the driver runs on the current machine, which holds this full/path/to/file.
But it can't be accessed in yarn-cluster mode, because the driver runs on another host, which doesn't hold this full/path/to/file.
I'd suggest executing the command in the following format:
spark-submit \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
/home/cloudera/Omega.jar omega.config
Send the config file using --files with its full path name, and pass it to the jar as a bare file name (not a full path), since the file will be downloaded to an unknown location on the workers.
In your code, you can use SparkFiles.get(filename) to obtain the actual full path of the downloaded file on the worker.
The change in your code should be something similar to:
val config_loc = SparkFiles.get(args(0))
From the SparkFiles docs:
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().
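Putting it together, a minimal sketch of the revised job (my composition of the pieces above, not the asker's exact code; note that SparkFiles.get looks the file up through the active SparkContext, so the session is created first):
import java.io.File
import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object InitProcess {
  def main(args: Array[String]): Unit = {
    // SparkFiles.get needs an active SparkContext, so build the session first
    val spark = SparkSession.builder.appName("InitProcess").getOrCreate()
    // args(0) is just the file name (omega.config), shipped via --files
    val config = ConfigFactory.parseFile(new File(SparkFiles.get(args(0))))
    val jobName = config.getString("job_name")
    // ...
    spark.stop()
  }
}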
