We're building a Spark application in Scala with a HOCON configuration file called application.conf.
If I add application.conf to my jar file and submit a job on Google Dataproc, it works correctly:
gcloud dataproc jobs submit spark \
--cluster <clustername> \
--jar=gs://<bucketname>/<filename>.jar \
--region=<myregion> \
-- \
<some options>
I don't want to bundle application.conf with my jar file but provide it separately, which I can't get working.
I've tried different things, e.g.:
1. Specifying the application.conf with --jars=gs://<bucketname>/application.conf (which should work according to this answer)
2. Using --files=gs://<bucketname>/application.conf
3. Same as 1 and 2 with the application.conf in /tmp/ on the master instance of the cluster, then specifying the local file with file:///tmp/application.conf
4. Defining extraClassPath for Spark using --properties=spark.driver.extraClassPath=gs://<bucketname>/application.conf (and likewise for the executors)
With all of these options I get an error: it can't find the key in the config:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: system properties: No configuration setting found for key 'xyz'
This error usually means either that there's an error in the HOCON itself (key xyz is not defined) or that application.conf is not on the classpath. Since the exact same config works when it's inside my jar file, I assume it's the latter.
Are there any other options to put the application.conf on the classpath?
If --jars doesn't work as suggested in this answer, you can try an init action. First upload your config to GCS, then write an init action that downloads it to the VMs, either putting it in a folder that is already on the classpath or updating spark-env.sh to include the path to the config.
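A minimal sketch of such an init action, assuming /etc/spark/conf is the Spark conf directory on the Dataproc VMs (which is on the driver classpath) and using the bucket name as a placeholder; you would register the script at cluster creation time, e.g. with gcloud dataproc clusters create ... --initialization-actions gs://<bucketname>/fetch-app-conf.sh (the script name here is just an example):
#!/bin/bash
# Hypothetical init action: runs on each cluster VM at creation time and copies
# the HOCON config from GCS into Spark's conf directory, which is on the classpath.
set -euxo pipefail
gsutil cp gs://<bucketname>/application.conf /etc/spark/conf/application.conf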
I am trying to run the spark-submit command on my Hadoop cluster. Here is a summary of my Hadoop cluster:
The cluster is built using 5 VirtualBox VMs connected on an internal network.
The cluster has 1 namenode and 4 datanodes.
All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
I am trying to run one of the Spark examples using the following spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
The environment variables HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file.
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh.
Added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ to spark-defaults.conf.
I have uploaded the jars into HDFS (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/).
The logs accessible via the web interface (i.e. http://hadoop-namenode:8042 ) do not provide any further details about the error.
This section of the Spark documentation seems relevant to the error, since the YARN libraries should be included by default, but only if you've installed the appropriate Spark version:
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn’s classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths on each of the ResourceManager servers where JARs are available for all YARN applications (spark.yarn.jars or extra packages get layered on top of this).
Another problem could be file permissions. You probably shouldn't put Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put them under hdfs:///apps/spark/<version>, then give that 744 HDFS permissions.
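For example, a hedged sketch of that layout (the version directory is illustrative; adjust paths to your cluster):
hadoop fs -mkdir -p /apps/spark/3.0.3
hadoop fs -put $SPARK_HOME/jars/* /apps/spark/3.0.3/
# 744 as suggested above; directories may need 755 so that other users can traverse them
hadoop fs -chmod -R 744 /apps/spark/3.0.3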
The Spark / YARN UI should show the complete classpath of the application for further debugging.
I figured out why I was getting this error. It turns out that I made a mistake while specifying spark.yarn.jars in spark-defaults.conf.
The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
That is, we need to point this property at the jar files themselves (via the wildcard), not at the folder containing them.
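With that fix applied, the relevant spark-defaults.conf entries from the question look like:
spark.master          yarn
spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging
spark.yarn.jars       hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*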
While analyzing the YARN launch_container.sh logs for a Spark job, I got confused by some parts of the log.
I will point out those questions step by step here.
When you submit a Spark job with spark-submit using --py-files and --files in cluster mode on YARN:
1. The config files passed with --files and the executable Python files passed with --py-files are uploaded into the .sparkStaging directory created under the user's HDFS home directory. Along with these files, pyspark.zip and py4j-<version>.zip from $SPARK_HOME/python/lib are also copied into that .sparkStaging directory.
2. After this, launch_container.sh is triggered by YARN, and it exports all required environment variables. If we have explicitly exported anything such as PYSPARK_PYTHON in .bash_profile, in the shell script that builds the spark-submit job, or in spark-env.sh, the default value is replaced by the value we provide.
This PYSPARK_PYTHON is a path on my edge node. Then how is a container launched on another node able to use this Python version?
The default Python version on the data nodes of my cluster is 2.7.5, so without setting PYSPARK_PYTHON, containers use 2.7.5. But when I set PYSPARK_PYTHON to 3.5.x, they use what I have given.
It also defines PWD='/data/complete-path'. Where does this PWD directory reside?
This directory is cleaned up after job completion. I have even tried running the job in one PuTTY session while keeping the /data folder open in another PuTTY session to see if any directories were created at run time, but couldn't find any.
It also sets PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-<version>.zip. Whenever I do a Python-specific operation in Spark code, it uses PYSPARK_PYTHON. So for what purpose is this PYTHONPATH being used?
3. After this, YARN creates soft links using ln -sf for all the files from step 1: soft links are created for pyspark.zip, py4j-<version>.zip, and all the Python files mentioned in step 1. These links again point to a '/data/different_directories' directory (and I am not sure where those reside). I know soft links can be used for accessing remote nodes, but why are the soft links created here?
Last but not least, will this launch_container.sh run for each container launch?
Then how is a container launched on another node able to use this Python version?
First of all, when we submit a Spark application, there are several ways to set its configuration, such as:
Setting spark-defaults.conf
Setting environment variables
Setting spark-submit options (spark-submit --help and --conf)
Setting a custom properties file (--properties-file)
Setting values in code (exposed in both SparkConf and SparkContext APIs)
Setting Hadoop configurations (HADOOP_CONF_DIR and spark.hadoop.*)
In my environment, the Hadoop configurations are placed in /etc/spark/conf/yarn-conf/, and spark-defaults.conf and spark-env.sh are in /etc/spark/conf/.
As for the order of precedence of configurations, this is the order that Spark will use:
Properties set on SparkConf or SparkContext in code
Arguments passed to spark-submit, spark-shell, or pyspark at run time
Properties set in /etc/spark/conf/spark-defaults.conf or in a specified properties file
Environment variables exported or set in scripts
So broadly speaking:
For properties that apply to all jobs, use spark-defaults.conf;
for properties that are constant and specific to a single or a few applications, use SparkConf or --properties-file;
for properties that change between runs, use command-line arguments.
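As a hedged illustration of combining those levels (my-app.properties, com.example.MyApp and my-app.jar are hypothetical names): cluster-wide defaults come from spark-defaults.conf, constant app-specific settings from a properties file passed with --properties-file, and per-run overrides from --conf:
spark-submit \
  --properties-file my-app.properties \
  --conf spark.executor.memory=4g \
  --class com.example.MyApp \
  my-app.jar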
Now, regarding the question:
In cluster mode, the Spark driver runs in a YARN container, and the Spark executors run in YARN containers as well.
In client mode, the Spark driver runs outside of the Hadoop cluster (outside YARN), while the executors are always in YARN.
So your question mostly relates to YARN.
When an application is submitted to YARN, there will first be an ApplicationMaster container, which negotiates with the NodeManagers and is responsible for controlling the application containers (in your case, the Spark executors).
Each NodeManager then creates a local temporary directory for each of the Spark executors in order to prepare to launch the containers (that's why launch_container.sh has such a name).
The location of this local temporary directory is set by the NodeManager's yarn.nodemanager.local-dirs property, defined in yarn-site.xml.
We can set yarn.nodemanager.delete.debug-delay-sec to 10 minutes to keep these directories around after the job and review the launch_container.sh script.
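For example, in yarn-site.xml (600 seconds = 10 minutes):
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>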
In my environment, yarn.nodemanager.local-dirs is /yarn/nm, so in this directory I can find the temporary directories of the Spark executor containers. They look like:
/yarn/nm/nm-local-dir/container_1603853670569_0001_01_000001
In this directory, I can find the launch_container.sh for this specific container and the other files needed to run this container.
Where does this PWD directory reside?
I think this is a special environment variable in Linux, so it's better not to modify it unless you know precisely how it works in your application.
As mentioned above, if you export this PWD environment variable at runtime, I think it is passed to Spark the same as any other environment variable.
I'm not sure how the PYSPARK_PYTHON environment variable is used in Spark's chain of launch scripts, but here you can find the instructions in the official documentation showing how to set the Python binary executable while using spark-submit:
spark-submit --conf spark.pyspark.python=/<PATH>/<TO>/<FILE>
As for the last question: yes, YARN will create a temporary directory for each of the containers, and launch_container.sh is included in that directory.
I'm executing a spark-submit script in an EMR step that runs the main class from my super JAR, like:
spark-submit \
....
--class ${MY_CLASS} "${SUPER_JAR_S3_PATH}"
... etc
but Spark by default loads the jar file:/usr/lib/spark/jars/guice-3.0.jar, which contains com.google.inject.internal.InjectorImpl, a class that's also in the Guice 4.x jar inside my super JAR. This results in a java.lang.IllegalAccessError when my service is booting up.
I've tried setting some Spark conf in the spark-submit to put my super JAR on the classpath, in hopes of it getting loaded first, before Spark loads guice-3.0.jar. It looks like:
--jars "${ASSEMBLY_JAR_S3_PATH}" \
--driver-class-path "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
--conf spark.executor.extraClassPath="/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:${SUPER_JAR_S3_PATH}" \
but this results in the same error.
Is there a way to remove that guice-3.0.jar from the default Spark classpath so my code can use the InjectorImpl that's packaged in the Guice 4.x JAR? I'm also running Spark in client mode, so I can't use spark.driver.userClassPathFirst or spark.executor.userClassPathFirst.
One way is to point to the lib directory where the old version of the Guice jar lives and then exclude it.
A sample shell script for spark-submit:
#!/bin/sh
# Path to the newer Guice jar you want on the classpath
export latestguicejar='your path to latest guice jar'

# Build a comma-separated list of all other dependent jars in OTHER_JARS,
# excluding the old Guice jar
JARS=$(find /usr/lib/spark/jars/ -name '*.jar')
OTHER_JARS=""
for eachjarinlib in $JARS ; do
  if [ "$(basename "$eachjarinlib")" != "guice-3.0.jar" ]; then
    OTHER_JARS="$eachjarinlib,$OTHER_JARS"
  fi
done
echo "---final list of jars: $OTHER_JARS"
echo "$CLASSPATH"

spark-submit --verbose --class <yourclass> \
  ... OTHER OPTIONS \
  --jars "${OTHER_JARS}${latestguicejar}",APPLICATIONJARTOBEADDEDSEPERATELY.JAR
Also see Holden's answer, and check what is available in your version of Spark.
As per the docs on the runtime environment, the userClassPathFirst properties are present in the latest version of Spark as of today:
spark.executor.userClassPathFirst
spark.driver.userClassPathFirst
To use these, you can build an uber jar with all application-level dependencies.
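A hedged sketch of what that can look like on the command line (the class and jar names are placeholders); note that the driver-side setting only takes effect in cluster mode:
spark-submit \
  --deploy-mode cluster \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class <yourclass> \
  your-uber-application.jar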
I'm struggling a bit trying to use multiple Typesafe config files (via include) in my Spark application, which I am submitting to a YARN queue in cluster mode. I basically have two config files, and the file layouts are provided below:
env.properties
application-txn.conf (this file uses an "include" to reference the above one)
Both of the above files are external to my application.jar, so I pass them to YARN using --files (as can be seen below).
I am using the Typesafe config library to parse my application-txn.conf, and in this main conf I am trying to use a property from the env.properties file via substitution, but the variable name does not get resolved and I'm not sure why.
env.properties
txn.hdfs.fs.home=hdfs://dev/1234/data
application-txn.conf:
# application-txn.conf
include required(file("env.properties"))
app {
raw-data-location = "${txn.hdfs.fs.home}/input/txn-raw"
}
Spark Application Code:
//propFile in the below block maps to "application-txn.conf" from the app's main method
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val config = loadConf("application-txn.conf")
  val spark = SparkSession.builder.getOrCreate()
  //Code fails here:
  val inputDF = spark.read.parquet(config.getString("app.raw-data-location"))
}

def loadConf(propFile: String): Config = {
  ConfigFactory.load()
  val cnf = ConfigFactory.parseResources(propFile)
  cnf.resolve()
}
Spark Submit Code (called from a shell script):
spark-submit --class com.nic.cage.app.Transaction \
--master yarn \
--queue QUEUE_1 \
--deploy-mode cluster \
--name MyTestApp \
--files application-txn.conf,env.properties \
--jars #Typesafe config 1.3.3 and my app.jar go here \
--executor-memory 2g \
--executor-cores 2 \
app.jar application-txn.conf
When I run the above, I am able to parse the config file, but my app fails when trying to read the files from HDFS because it cannot find a directory with the name:
${txn.hdfs.fs.home}/input/txn-raw
I believe that the config is actually able to read both files, or else it would fail on the "required" keyword. I verified this by adding another include statement with a dummy file name, and the application failed on parsing of the config. Really not sure what's going on right now.
Any ideas what could be causing this resolution to fail?
If it helps: when I run locally with multiple config files, the resolution works fine.
The syntax in application-txn.conf is wrong.
The variable should be outside the string, like so:
raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"
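So the corrected application-txn.conf, keeping the include from the question, would look like:
# application-txn.conf
include required(file("env.properties"))
app {
  raw-data-location = ${txn.hdfs.fs.home}"/input/txn-raw"
}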
I have a Spark driver program for which I'm trying to set the Alluxio user.
I read this post: How to pass -D parameter or environment variable to Spark job? and although helpful, none of the methods in there seem to do the trick.
My environment:
- Spark-2.2
- Alluxio-1.4
- packaged jar passed to spark-submit
The spark-submit job is being run as root (under supervisor), and Alluxio only recognizes this user.
Here's where I've tried adding "-Dalluxio.security.login.username=alluxio":
spark.driver.extraJavaOptions in spark-defaults.conf
on the command line for spark-submit (using --conf)
within the sparkservices conf file of my jar application
within a new file called "alluxio-site.properties" in my jar application
None of these set the user for Alluxio, though I'm easily able to set this property in a different (non-Spark) client application that also writes to Alluxio.
Anyone able to make this setting apply in spark-submit jobs?
If spark-submit is in client mode, you should use --driver-java-options instead of --conf spark.driver.extraJavaOptions=... in order for the driver JVM to be started with the desired options. Therefore your command would look something like:
./bin/spark-submit ... --driver-java-options "-Dalluxio.security.login.username=alluxio" ...
This should start the driver with the desired Java options.
If the Spark executors also need the option, you can set that with:
--conf "spark.executor.extraJavaOptions=-Dalluxio.security.login.username=alluxio"