getting & setting spark.driver/executor.extraClassPath on EMR - apache-spark

As far as I can tell, when setting or using spark.driver.extraClassPath and spark.executor.extraClassPath on AWS EMR, whether in spark-defaults.conf or elsewhere as a flag, I would have to first get the existing value that [...].extraClassPath is set to, then append :/my/additional/classpath to it in order for it to work.
Is there a function in Spark that allows me to just append an additional class path entry while retaining/respecting the existing paths set by EMR in /etc/spark/conf/spark-defaults.conf?

No such "function" in Spark but:
On EMR AMI's you can write a bootstrap that will append/set whatever you want in spark-defaults, will of course affect all Spark jobs.
When EMR moved to the newer "release-label" this stopped working as bootstrap-steps were replaced with configuration JSONs and the manual bootstraps run before applications are installed ( At least when I tried it )
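For reference, the release-label way to set these at cluster creation time is a configuration JSON with the spark-defaults classification. A minimal sketch (the classpath value is illustrative, and other create-cluster options are omitted); note that this sets the property outright rather than appending to the value EMR itself would have generated, which is exactly the limitation the question is about:

cat > configurations.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/my/additional/classpath",
      "spark.executor.extraClassPath": "/my/additional/classpath"
    }
  }
]
EOF
aws emr create-cluster --release-label emr-6.9.0 --configurations file://configurations.json  # plus your other options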

We use SageMaker Studio connected to EMR clusters via Apache Livy, and none of the approaches suggested so far worked for me - we need to add a custom JAR to the classpath for a custom filesystem.
Following this guide:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
I made it work by adding an EMR step to the cluster setup with the following code (this is TypeScript CDK, passed into the steps parameter of the CfnCluster CDK construct):
// We are using this approach to add the CkpFS JAR (in the Docker image, under /additional-jars) onto the class path
// because the spark.jars configuration in spark-defaults is not respected by Apache Livy on EMR when it creates a Spark session.
// AWS guide followed: https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
private createStepToAddCkpFSJAROntoEMRSparkClassPath() {
  return {
    name: "AddCkpFSJAROntoEMRSparkClassPath",
    hadoopJarStep: {
      jar: "command-runner.jar",
      args: [
        "bash",
        "-c",
        "while [ ! -f /etc/spark/conf/spark-defaults.conf ]; do sleep 1; done;" +
          "sudo sed -i '/spark.*.extraClassPath/s/$/:\\/additional-jars\\/\\*/' /etc/spark/conf/spark-defaults.conf",
      ],
    },
    actionOnFailure: "CONTINUE",
  };
}
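For reference, what that sed expression does to a spark-defaults.conf line can be reproduced locally (the pre-existing classpath value below is made up; the double backslashes in the TypeScript string become single backslashes by the time bash sees the command):

printf 'spark.driver.extraClassPath /etc/hadoop/conf:/usr/lib/hadoop-lzo/lib/*\n' > /tmp/spark-defaults.conf
sed -i '/spark.*.extraClassPath/s/$/:\/additional-jars\/\*/' /tmp/spark-defaults.conf
cat /tmp/spark-defaults.conf
# spark.driver.extraClassPath /etc/hadoop/conf:/usr/lib/hadoop-lzo/lib/*:/additional-jars/*

It appends :/additional-jars/* to every line that already defines an extraClassPath, so the paths EMR generated are preserved.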

Related

Spark InProcessLauncher not picking up Hadoop config

I'm trying to submit a cluster-mode Spark 2 application from a Java Spring app using InProcessLauncher. I was previously using the SparkLauncher class, which worked, but it fires up a long-lived SparkSubmit Java process for each job, which was eating up too many resources with lots of jobs in play.
My code sets sparkLauncher.setMaster("yarn") and sparkLauncher.setDeployMode("cluster").
I set the HADOOP_CONF_DIR env variable to the directory containing my config (yarn-site.xml etc) before starting my Spring app, and it logs that it is picking up this variable:
INFO System Environment - HADOOP_CONF_DIR = /etc/hadoop/conf
Yet when it comes to submitting, I see INFO o.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032 - i.e. it is using the default 0.0.0.0 rather than the actual ResourceManager IP, and of course it fails. It seems not to be picking up the Hadoop config.
I can submit jobs from the same shell directly using spark-submit, and even by directly invoking java -cp /usr/hdp/current/spark2-client/conf/:/usr/hdp/current/spark2-client/jars/*:/etc/hadoop/conf/ org.apache.spark.deploy.SparkSubmit .... So I'm not sure why my Spring App isn't picking up the same config.
I managed to get my app to pick up the hadoop config by adding the conf folders to the classpath. This is something spark-submit does for you when launching as a separate process, but doesn't happen when using InProcessLauncher.
Because my Spring Boot app is launched using -jar xxx.jar, I couldn't use -cp on the command line (cannot be combined with -jar), but had to add it to the manifest in the jar. I did this by adding the following to build.gradle (which is using the Spring Boot gradle plugin):
bootJar {
  manifest {
    attributes 'Class-Path': '/usr/hdp/current/spark2-client/conf/ /etc/hadoop/conf/'
  }
}
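To confirm that the entry made it into the built artifact, you can inspect the manifest of the resulting jar (the jar name is a placeholder; manifest lines wrap at 72 characters, hence the extra context lines):

unzip -p build/libs/my-spring-app.jar META-INF/MANIFEST.MF | grep -A2 'Class-Path'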

How the PySpark environment setup is executed by YARN in launch_container.sh

While analyzing the YARN launch_container.sh logs for a Spark job, I got confused by some parts of the log.
I will go through those questions step by step here.
When you submit a Spark job with spark-submit having --py-files and --files in cluster mode on YARN:
The config files passed with --files and the executable Python files passed with --py-files are uploaded into the .sparkStaging directory created under the user's Hadoop home directory.
Along with these files, pyspark.zip and py4j-version_number.zip from $SPARK_HOME/python/lib are also copied into that same .sparkStaging directory.
After this, launch_container.sh is triggered by YARN, and it exports all the required environment variables.
If we have exported anything explicitly, such as PYSPARK_PYTHON in .bash_profile, in the shell script that builds the spark-submit command, or in spark-env.sh, the default value is replaced by the value we provide.
This PYSPARK_PYTHON is a path on my edge node.
Then how is a container launched on another node able to use this Python version?
The default Python version on the data nodes of my cluster is 2.7.5, so without setting PYSPARK_PYTHON the containers use 2.7.5.
But when I set PYSPARK_PYTHON to 3.5.x, they use what I have given.
It is defining PWD='/data/complete-path'.
Where does this PWD directory reside?
This directory is cleaned up after job completion.
I even ran the job in one PuTTY session and kept the /data folder open in another PuTTY session to see whether any directories were created at runtime, but couldn't find any.
It is also setting PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-version.zip.
Whenever I perform a Python-specific operation in the Spark code, it uses PYSPARK_PYTHON. So for what purpose is this PYTHONPATH being used?
After this, YARN creates soft links using ln -sf for all the files from step 1: soft links are created for pyspark.zip, py4j-<version>.zip, and all the Python files mentioned in step 1.
These links in turn point to a '/data/different_directories' directory (and I am not sure where those reside).
I know soft links can be used for accessing remote nodes, but why are the soft links created here?
Last but not least, will this launch_container.sh run for each container launch?
Then how is a container launched on another node able to use this Python version?
First of all, when we submit a Spark application, there are several ways to set the configuration for it, such as:
Setting spark-defaults.conf
Setting environment variables
Setting spark-submit options (spark-submit --help and --conf)
Setting a custom properties file (--properties-file)
Setting values in code (exposed in both the SparkConf and SparkContext APIs)
Setting Hadoop configurations (HADOOP_CONF_DIR and spark.hadoop.*)
In my environment, the Hadoop configurations are placed in /etc/spark/conf/yarn-conf/, and spark-defaults.conf and spark-env.sh are in /etc/spark/conf/.
As the order of precedence for configurations, this is the order that Spark will use:
Properties set on SparkConf or SparkContext in code
Arguments passed to spark-submit, spark-shell, or pyspark at run time
Properties set in /etc/spark/conf/spark-defaults.conf or in a specified properties file
Environment variables exported or set in scripts
So broadly speaking:
For properties that apply to all jobs, use spark-defaults.conf;
for properties that are constant and specific to one or a few applications, use SparkConf or --properties-file;
for properties that change between runs, use command-line arguments.
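As a concrete illustration of those layers (the application jar, properties file name, and memory value are placeholders), the same property could be supplied at any level, and the one higher in the precedence order wins:

# 1. In code (highest precedence), e.g. setting "spark.executor.memory" to "4g" on the SparkConf
# 2. At submit time, either directly or via a properties file:
spark-submit --conf spark.executor.memory=4g --properties-file my-overrides.conf my-app.jar
# 3. In /etc/spark/conf/spark-defaults.conf:
#    spark.executor.memory 2g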
Now, regarding the question:
In cluster mode, the Spark driver runs in a container in YARN, and the Spark executors also run in containers in YARN.
In client mode, the Spark driver runs outside of the Hadoop cluster (outside YARN), while the executors are always in YARN.
So your question mostly relates to YARN.
When an application is submitted to YARN, there will first be an ApplicationMaster container, which negotiates resources with the ResourceManager, works with the NodeManagers, and is responsible for controlling the application containers (in your case, the Spark executors).
The NodeManager will then create a local temporary directory for each of the Spark executors, in preparation for launching the containers (that's why launch_container.sh has such a name).
The location of this local temporary directory is set by the NodeManager's yarn.nodemanager.local-dirs property defined in yarn-site.xml.
You can set yarn.nodemanager.delete.debug-delay-sec to, say, 600 (10 minutes) and then review the launch_container.sh script.
In my environment, yarn.nodemanager.local-dirs is /yarn/nm, so in this directory I can find the temporary directories of the Spark executor containers; they look like:
/yarn/nm/nm-local-dir/container_1603853670569_0001_01_000001.
In this directory, I can find the launch_container.sh for this specific container, along with everything else needed to run the container.
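A quick way to inspect this yourself (the /yarn/nm path is from the environment above, yours will differ): set the retention delay on the NodeManagers, rerun the job, then search the local dir while the container directories are still retained:

# in yarn-site.xml on the NodeManagers (600 seconds = 10 minutes):
#   yarn.nodemanager.delete.debug-delay-sec = 600
find /yarn/nm -name launch_container.sh 2>/dev/null | head -5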
Where does this PWD directory reside?
I think this is a special environment variable in Linux, so it's better not to modify it unless you know precisely how it works in your application.
As above, if you export this PWD environment variable at runtime, I think it is passed to Spark just like any other environment variable.
I'm not sure how the PYSPARK_PYTHON environment variable is used in Spark's chain of launch scripts, but here you can find the instructions in the official documentation, showing how to set the Python binary executable when using spark-submit:
spark-submit --conf spark.pyspark.python=/<PATH>/<TO>/<FILE>
As for the last question: yes, YARN will create a temporary directory for each of the containers, and launch_container.sh is included in that directory.
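Putting it together, a sketch of a cluster-mode submit that pins the interpreter for the containers (the interpreter path, zip, conf and script names are illustrative); the interpreter must exist at that path on every NodeManager host, which is why the containers can honour a value chosen on the edge node:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.pyspark.python=/usr/bin/python3.5 \
  --py-files deps.zip \
  --files app.conf \
  my_job.py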

Add conf file to classpath in Google Dataproc

We're building a Spark application in Scala with a HOCON configuration; the config file is called application.conf.
If I add the application.conf to my jar file and start a job on Google Dataproc, it works correctly:
gcloud dataproc jobs submit spark \
--cluster <clustername> \
--jar=gs://<bucketname>/<filename>.jar \
--region=<myregion> \
-- \
<some options>
I don't want to bundle the application.conf with my jar file but provide it separately, which I can't get working.
I tried different things, e.g.:
Specifying the application.conf with --jars=gs://<bucketname>/application.conf (which should work according to this answer)
Using --files=gs://<bucketname>/application.conf
Same as 1. + 2. with the application conf in /tmp/ on the Master instance of the cluster, then specifying the local file with file:///tmp/application.conf
Defining extraClassPath for spark using --properties=spark.driver.extraClassPath=gs://<bucketname>/application.conf (and for executors)
With all these options I get an error, it can't find the key in the config:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: system properties: No configuration setting found for key 'xyz'
This error usually means that there's an error in the HOCON config (key xyz is not defined in HOCON) or that the application.conf is not in the classpath. Since the exact same config is working when inside my jar file, I assume it's the latter.
Are there any other options to put the application.conf on the classpath?
If --jars doesn't work as suggested in this answer, you can try an init action. First upload your config to GCS, then write an init action that downloads it to the VMs and either puts it into a folder on the classpath or updates spark-env.sh to include the path to the config.
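A rough sketch of such an init action (the target directory and the choice to append to spark-defaults.conf are assumptions; if extraClassPath is already set on the image, you would need to append to the existing value instead, e.g. with sed as in the EMR answer above):

#!/usr/bin/env bash
set -euxo pipefail
mkdir -p /etc/spark/extra-conf
gsutil cp gs://<bucketname>/application.conf /etc/spark/extra-conf/
# Put the directory (not the file) on the classpath so Typesafe Config can find application.conf.
cat >> /etc/spark/conf/spark-defaults.conf <<'EOF'
spark.driver.extraClassPath=/etc/spark/extra-conf
spark.executor.extraClassPath=/etc/spark/extra-conf
EOF

You would upload this script to GCS and pass it via --initialization-actions when creating the cluster.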

Spark Job-Server configuration in a standalone cluster

I am trying to set up a Spark JobServer (SJS) to execute jobs on a standalone Spark cluster. I am trying to deploy SJS on one of the non-master nodes of the Spark cluster. I am not using Docker, but am trying to do it manually.
I am confused by the help documents in the SJS GitHub repo, particularly the deployment section. Do I need to edit both local.conf and local.sh to run this?
Can someone point out the steps to set up the SJS in the spark cluster?
Thanks!
Kiran
Update:
I created a new environment to deploy the jobserver on one of the nodes of the cluster. Here are the details:
env1.sh:
DEPLOY_HOSTS="masked.mo.cpy.corp"
APP_USER=kiran
APP_GROUP=spark
INSTALL_DIR=/home/kiran/job-server
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=1.6.1
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/home/spark/spark-1.6.1-bin-hadoop2.6
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.6
env1.conf
spark {
  master = "local[1]"
  webUrlPort = 8080
  job-number-cpus = 2
  jobserver {
    port = 8090
    bind-address = "0.0.0.0"
    jar-store-rootdir = /tmp/jobserver/jars
    context-per-jvm = false
    jobdao = spark.jobserver.io.JobFileDAO
    filedao {
      rootdir = /tmp/spark-job-server/filedao/data
    }
    datadao {
      rootdir = /tmp/spark-jobserver/upload
    }
    result-chunk-size = 1m
  }
  context-settings {
    num-cpu-cores = 1
    memory-per-node = 1G
  }
  home = "/home/spark/spark-1.6.1-bin-hadoop2.6"
}
Why don't you set JOBSERVER_FG=1 and try running server_start.sh? This would run the process in the foreground and should display the error on stderr.
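That is, from the deployed JobServer directory on the remote machine:

JOBSERVER_FG=1 ./server_start.sh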
Yes, you have to edit both files, adapting them for your cluster.
The deploy steps are explained below:
Copy config/local.sh.template to <environment>.sh and edit as appropriate.
This file is mostly for environment variables that are used by the deployment script and by the server_start.sh script. The most important ones are: the deploy host (the IP or hostname where the jobserver will run), the user and group of execution, the JobServer memory (it will be the driver memory), the Spark version, and the Spark home.
Copy config/shiro.ini.template to shiro.ini and edit as appropriate. NOTE: only required when authentication = on
If you are going to use shiro authentication, then you need this step.
Copy config/local.conf.template to <environment>.conf and edit as appropriate.
This is the main configuration file for JobServer and for the contexts that JobServer will create. The full list of the properties you can set in this file can be seen on this link.
bin/server_deploy.sh <environment>
After editing the configuration files, you can deploy using this script. The parameter must be the name that you chose for your .conf and .sh files.
Once you run the script, JobServer will connect to the host entered in the .sh file and will create a new directory with some control files. Then, every time you need to change a configuration entry, you can do it directly on the remote machine: the .conf file will be there with the name you chose and the .sh file will be renamed to settings.sh.
Please note that, if you haven't configured an SSH key based connection between the machine where you run this script and the remote machine, you will be prompted for password during its execution.
If you have problems with the creation of directories on the remote machine, you can try and create them yourself with mkdir (they must match the INSTALL_DIR configuration entry of the .sh file) and change their owner user and group to match the ones entered in the .sh configuration file.
On the remote server, start it in the deployed directory with server_start.sh and stop it with server_stop.sh
This is straightforward: once you have done all the other steps, you can start the JobServer service on the remote machine by running server_start.sh, and you can stop it with server_stop.sh.
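In short, the cycle described above looks roughly like this (the environment name env1 matches the files from the question):

# on the machine you deploy from, inside the spark-jobserver checkout:
cp config/local.sh.template env1.sh      # edit DEPLOY_HOSTS, APP_USER, SPARK_HOME, ...
cp config/local.conf.template env1.conf  # edit master, jobserver and context settings
bin/server_deploy.sh env1                # copies everything to DEPLOY_HOSTS over SSH

# on the remote host, inside INSTALL_DIR (env1.sh is renamed to settings.sh there):
./server_start.sh
./server_stop.sh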

How to enable Spark mesos docker executor?

I'm working on integration between Mesos and Spark. For now, I can start SlaveMesosDispatcher in a Docker container, and I would also like to run the Spark executor in a Mesos Docker container. I did the following configuration for it, but I got an error; any suggestions?
Configuration:
Spark: conf/spark-defaults.conf
spark.mesos.executor.docker.image ubuntu
spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home /root/spark
#spark.executorEnv.SPARK_HOME /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib
NOTE: Spark is installed in /home/test/workshop/spark, and all dependencies are installed.
After submitting SparkPi to the dispatcher, the driver job starts but fails. The error message is:
I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
/bin/sh: 1: ./bin/spark-submit: not found
Does anyone know how to map/set the Spark home in Docker for this case?
I think the issue you're seeing here is that the current working directory of the container isn't where Spark is installed. When you specify a Docker image for Spark to use with Mesos, it expects the default working directory of the container to be inside $SPARK_HOME, where it can find ./bin/spark-submit.
You can see that logic here.
It doesn't look like you're able to configure the working directory through Spark configuration itself, which means you'll need to build a custom image on top of ubuntu that simply does a WORKDIR /root/spark.
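For example, a minimal sketch of such an image (the image tag is arbitrary; /root/spark matches spark.mesos.executor.home and the volume mapping from the question):

cat > Dockerfile <<'EOF'
FROM ubuntu
# Spark is mounted into the container at /root/spark via spark.mesos.executor.docker.volumes,
# so make that the working directory where Mesos expects to find ./bin/spark-submit.
WORKDIR /root/spark
EOF
docker build -t spark-mesos-executor .

Then point spark.mesos.executor.docker.image at spark-mesos-executor instead of ubuntu.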
