Setting Environment variables in Spark Cluster Mode - apache-spark

I was going through this Apache Spark documentation, and it mentions that:
When running Spark on YARN in cluster mode, environment variables
need to be set using the
spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your
conf/spark-defaults.conf file.
I am running my EMR cluster on AWS Data Pipeline. I want to know where I have to edit this conf file. Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?

One way to do it is the following (the tricky part is that you may need to set the environment variables on both the executor and driver parameters):
spark-submit \
--driver-memory 2g \
--executor-memory 4g \
--conf spark.executor.instances=4 \
--conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--master yarn \
--deploy-mode cluster \
--class com.industry.class.name \
assembly-jar.jar
I have tested it on EMR in client mode, but it should work in cluster mode as well.
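One caveat worth keeping in mind: the -D flags above set JVM system properties (readable via sys.props in Scala or System.getProperty in Java), not OS environment variables. If you go the spark.yarn.appMasterEnv.* / spark.executorEnv.* route instead, the variable shows up as a regular environment variable in the driver/executor process. A minimal PySpark-side sketch (ENV_KEY is a placeholder name, not something the question defines):

```python
import os

# Hypothetical check inside driver or executor code: a variable passed via
# spark.yarn.appMasterEnv.ENV_KEY / spark.executorEnv.ENV_KEY appears as a
# plain OS environment variable, so os.environ sees it.
env_value = os.environ.get("ENV_KEY", "not_set")
print(env_value)
```

If the variable is missing, the fallback makes that visible instead of raising a KeyError.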

For future reference, you can directly pass the environment variables when creating the EMR cluster, using the Configurations parameter as described in the docs here.
Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.yarn.appMasterEnv.[EnvironmentVariableName]": "some_value",
    "spark.executorEnv.[EnvironmentVariableName]": "some_other_value"
  }
},
Here, spark.yarn.appMasterEnv.[EnvironmentVariableName] is used to pass a variable in cluster mode on YARN (here), and spark.executorEnv.[EnvironmentVariableName] to pass a variable to the executor processes (here).
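If the cluster is created programmatically (e.g. with boto3's run_job_flow), the entry above can be built as a plain Python structure and passed as the Configurations parameter. A sketch, where SOME_VAR and the values are placeholders:

```python
import json

# Hypothetical Configurations list for an EMR cluster; SOME_VAR and the
# values are placeholders. This list would be handed to the Configurations
# parameter of a cluster-creation call such as boto3's run_job_flow.
configurations = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.yarn.appMasterEnv.SOME_VAR": "some_value",
            "spark.executorEnv.SOME_VAR": "some_other_value",
        },
    }
]

# Serializing confirms the structure is valid JSON (keys and values must
# all be strings for EMR to accept them).
print(json.dumps(configurations, indent=2))
```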

Related

Passing custom log4j.properties file from s3

I'm trying to set custom logging configurations. If I add the log4j.properties file to the cluster and reference it in my spark-submit, the configuration takes effect. But if I try to ship the file using --files s3://..., it doesn't work.
Works (assuming I placed the file in the home dir):
spark-submit \
--master yarn \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
Doesn't work:
spark-submit \
--master yarn \
--files s3://my_path/log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
How can I use a config file in s3 to set the logging configuration?
You can't do it directly: Log4j always loads its configuration from the local filesystem.
However, you can put the config inside a JAR, and since Spark downloads JARs with your job, you can get it loaded indirectly: create a JAR containing only the log4j.properties file and tell Spark to load it with the job.
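Since a JAR is just a zip archive, the wrapper JAR can be built with nothing but stdlib tools. A minimal sketch (the properties content is a placeholder, not the asker's actual config):

```python
import zipfile

# Build a minimal JAR (a JAR is just a zip archive) containing only a
# log4j.properties file; the config content here is a placeholder.
log4j_conf = "log4j.rootLogger=INFO, console\n"
with zipfile.ZipFile("log4j-conf.jar", "w") as jar:
    jar.writestr("log4j.properties", log4j_conf)
```

The resulting JAR could then be shipped with --jars log4j-conf.jar, so that Log4j finds log4j.properties on the classpath of both driver and executors.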

Spark in AKS. Error: Could not find or load main class org.apache.spark.launcher.Main

Update 1: After adding the missing pieces and environment variables from Spark installation - Error: Could not find or load main class org.apache.spark.launcher.Main, the command no longer throws an error, but prints itself and does nothing else. This is the new result of running the command:
"C:\Program Files\Java\jdk1.8.0_271\bin\java" -cp "C:\Users\xxx\repos\spark/conf\;C:\Users\xxx\repos\spark\assembly\target\scala-2.12\jars\*" org.apache.spark.deploy.SparkSubmit --master k8s://http://127.0.0.1:8001 --deploy-mode cluster --conf "spark.kubernetes.container.image=xxx.azurecr.io/spark:spark2.4.5_scala2.12.12" --conf "spark.kubernetes.authenticate.driver.serviceAccountName=spark" --conf "spark.executor.instances=3" --class com.xxx.bigdata.xxx.XMain --name xxx_app https://storage.blob.core.windows.net/jars/xxx.jar
I have been following this guide for setting up Spark in AKS: https://learn.microsoft.com/en-us/azure/aks/spark-job. I am using Spark tag 2.4.5 with scala 2.12.12. I have done all the following steps:
created AKS with ACR and Azure storage, serviceaccount and role
built spark source
built docker image and push to ACR
built sample SparkPi jar and push to storage
proxied api-server (kubectl proxy) and executed spark-submit:
./bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name xxx_app \
--class com.xxx.bigdata.xxx.XMain \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=xxx.azurecr.io/spark:spark2.4.5_scala2.12.12 \
"https://storage.blob.core.windows.net/jars/xxx.jar"
All I am getting is Error: Could not find or load main class org.apache.spark.launcher.Main
Now, the funny thing is that it doesn't matter at all what I change in the command. I can mess up ACR address, spark image name, jar location, api-server address, anything, and I still get the same error.
I guess I must be making some silly mistake as it seems nothing can break the command more than it already is, but I can't really nail it down.
Does someone have some ideas what might be wrong?
Looks like it might be a problem on the machine where you are executing spark-submit: you might be missing some JARs on its classpath. Worth checking out Spark installation - Error: Could not find or load main class org.apache.spark.launcher.Main
Alright, so I managed to submit jobs with spark-submit.cmd instead. It works without any additional setup.
I didn't manage to get the bash script to work in the end, and I don't have the time to investigate it further at this moment. So, sorry for providing an answer that only partially resolves the original problem, but it is a solution nonetheless.
The below command works fine
bin\spark-submit.cmd --master k8s://http://127.0.0.1:8001 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=3 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.namespace=dev --conf spark.kubernetes.container.image=xxx.azurecr.io/spark:spark-2.4.5_scala-2.12_hadoop-2.7.7 https://xxx.blob.core.windows.net/jars/SparkPi-assembly-0.1.0-SNAPSHOT.jar

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 (BTW) on a YARN cluster.
I am trying to upload JARs to a YARN cluster, and to use them to replace the on-site (already in-place) Spark JARs.
I am trying to do so through spark-submit.
The question Add jars to a Spark Job - spark-submit - and the related answers - are full of interesting points.
One helpful answer is the following one:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" is for uploading jar on each node
"--driver-class-path" is for using uploaded jar for the driver.
"--conf spark.executor.extraClassPath" is for using uploaded jar for executors.
While I understand the file paths for --jars within a spark-submit command, what should the file paths of the uploaded JARs be when used in --driver-class-path, for example?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY ?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?
Thanks.
PS: I have tried
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (if I made no mistake)
And I have tried also:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path will overwrite original classpath, instead of prepending to it as one may expect.
--jars will automatically add the extra jars to the driver and executor classpath so you do not need to add its path manually.
It seems that in cluster mode --driver-class-path is ignored.
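For completeness: the bare-filename paths tried in the PS are exactly what YARN produces, since each JAR passed via --jars is copied into the container's working directory and is then visible by bare file name. That mapping can be sketched mechanically (the paths are the question's own placeholders; as the answer notes, with --jars you normally don't need to build this by hand):

```python
import os

# JARs passed via --jars on the submitting machine. YARN copies each one
# into the container's working directory, so on the executor side they are
# addressable by bare file name.
jars = ["/a/b/some1.jar", "/a/b/c/some2.jar"]
executor_classpath = ":".join(os.path.basename(j) for j in jars)
print(executor_classpath)  # some1.jar:some2.jar
```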

df.show() prints empty result while in hdfs it is not empty

I have a PySpark application which is submitted to YARN with multiple nodes, and it reads parquet from HDFS.
In my code, I have a dataframe which is read directly from HDFS:
df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")
When I use df.show(n=2) directly in my code after the line above, it outputs:
+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+
But when I manually go to the HDFS path, the data is not empty.
What I have tried:
1. At first I thought that I might have given too few cores and memory to my executor and driver, so I doubled them; nothing changed.
2. Then I thought that the path might be wrong, so I gave it a wrong HDFS path and it threw an error that the path does not exist.
What I am assuming:
1. This may have something to do with the driver and executors.
2. It may have something to do with YARN.
3. Or with the configs provided when using spark-submit.
Current config:
spark-submit \
--master yarn \
--queue my_queue_name \
--deploy-mode cluster \
--jars some_jars \
--conf spark.yarn.dist.files=some_files \
--conf spark.sql.catalogImplementation=in-memory \
--properties-file some_zip_file \
--py-files some_py_files \
main.py
What I am sure of:
The data is not empty. The same HDFS path is provided in another project, which works fine.
So the problem was with the JAR files I was providing.
The Hadoop version was 2.7.2; I changed it to 3.2.0 and it's working fine.
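Mismatched Hadoop client JARs silently returning empty results is hard to spot. A hypothetical sanity check is to compare the version embedded in the shipped hadoop-* JAR names against the cluster's Hadoop version; the JAR names and versions below are illustrative, not the asker's actual files:

```python
import re

# JAR file names that would be passed via --jars; purely illustrative.
jars = ["hadoop-common-2.7.2.jar", "parquet-hadoop-1.10.1.jar", "myapp.jar"]
cluster_hadoop_version = "3.2.0"

# Flag hadoop-* JARs whose embedded version differs from the cluster's.
pattern = re.compile(r"^hadoop-[a-z]+-(\d+(?:\.\d+)*)\.jar$")
conflicts = [
    jar for jar in jars
    if (m := pattern.match(jar)) and m.group(1) != cluster_hadoop_version
]
print(conflicts)  # ['hadoop-common-2.7.2.jar']
```

A non-empty conflicts list would have pointed straight at the 2.7.2 vs 3.2.0 mismatch that turned out to be the cause.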

How to use --num-executors option with spark-submit?

I am trying to override spark properties such as num-executors while submitting the application by spark-submit as below :
spark-submit --class WC.WordCount \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 3584M \
...../<myjar>.jar \
/public/blahblahblah /user/blahblah
However, it's running with the default number of executors, which is 2. But I am able to override the properties if I add
--master yarn
Can someone explain why that is? Interestingly, in my application code I am setting the master as yarn-client:
val conf = new SparkConf()
.setAppName("wordcount")
.setMaster("yarn-client")
.set("spark.ui.port","56487")
val sc = new SparkContext(conf)
Can someone throw some light on how the option --master works?
I am trying to override spark properties such as num-executors while submitting the application by spark-submit as below
It will not work (unless you override spark.master in conf/spark-defaults.conf file or similar so you don't have to specify it explicitly on the command line).
The reason is that the default Spark master is local[*] and the number of executors is exactly one, i.e. the driver. That's just the local deployment environment. See Master URLs.
As a matter of fact, num-executors is very YARN-dependent as you can see in the help:
$ ./bin/spark-submit --help
...
YARN-only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
That explains why it worked when you switched to YARN. It is supposed to work with YARN (regardless of the deploy mode, i.e. client or cluster which is about the driver alone not executors).
You may be wondering why it did not work with the master defined in your code, then. The reason is that it is too late: the master has already been assigned at launch time, when you started the application using spark-submit. That's exactly the reason why you should not specify deployment environment-specific properties in the code:
It may not always work (see the case with master)
It requires that the code be recompiled on every configuration change (which makes it a bit unwieldy)
That's why you should be always using spark-submit to submit your Spark applications (unless you've got reasons not to, but then you'd know why and could explain it with ease).
If you'd like to run the same application with different masters or different amounts of memory, Spark allows you to do that with an empty SparkConf, since properties you set directly on the SparkConf in code take the highest precedence for the application (check the properties precedence at the end).
Example:
val sc = new SparkContext(new SparkConf())
Then, you can supply configuration values at runtime. Note that --conf spark.master and --conf spark.executor.memory are alternates to the --master and --executor-memory flags (inline # comments after a line-continuation backslash would break the command, so they are omitted here):
./bin/spark-submit \
--name "My app" \
--deploy-mode "client" \
--conf spark.ui.port=56487 \
--conf spark.master=yarn \
--conf spark.executor.memory=4g \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--class WC.WordCount \
/<myjar>.jar \
/public/blahblahblah \
/user/blahblah
Properties precedence order (highest first):
1. Properties set directly on the SparkConf (in the code) take the highest precedence.
2. Then flags passed to spark-submit or spark-shell, like --master.
3. Then options in the spark-defaults.conf file.
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf.
A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source: Dynamically Loading Spark Properties
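The precedence order can be illustrated with a plain dict merge, where later sources override earlier ones (the values are illustrative; a real SparkConf resolves runtime properties the same way, although, as noted above, the master itself is fixed at launch time):

```python
# Illustrative precedence merge: later dict updates override earlier ones,
# mirroring spark-defaults.conf < spark-submit flags < SparkConf in code.
spark_defaults_conf = {"spark.master": "local[*]", "spark.executor.memory": "1g"}
submit_flags = {"spark.master": "yarn", "spark.executor.memory": "4g"}
sparkconf_in_code = {"spark.master": "yarn-client"}

effective = {}
for source in (spark_defaults_conf, submit_flags, sparkconf_in_code):
    effective.update(source)

print(effective["spark.master"])           # yarn-client (set in code wins)
print(effective["spark.executor.memory"])  # 4g (flag beats defaults file)
```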
