How to pass custom SPARK_CONF_DIR to slaves in standalone mode - apache-spark

I am in the process of installing Spark in a shared cluster environment. We've decided to go with spark standalone mode, and are using the "start-all.sh" command included in sbin to launch the Spark workers. Due to the shared architecture of the cluster, SPARK_HOME is in a common directory not writeable by users. Therefore, we're creating "run" directories in the user's scratch, into which SPARK_CONF_DIR, log directory, and work directories can be pointed.
The problem is that SPARK_CONF_DIR is never set on the worker nodes, so they default to $SPARK_HOME/conf, which has only the templates. What I want to do is pass through SPARK_CONF_DIR from the master node to the slave nodes. I've identified a solution, but it requires a patch to sbin/start-slaves.sh:
sbin/start_slaves.sh
46c46
< "${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; export SPARK_CONF_DIR=${SPARK_CONF_DIR} \; "$SPARK_HOME/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
---
> "${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT"
Are there are any better solutions here that do not require a patch to the Spark source code?
One solution, of course, would be to copy and rename start-all.sh and start-slaves.sh and use those instead of sbin/start-all.sh. But is there anything more elegant?
Thank you for your time.

If you want to run standalone mode, you can try to setup SPARK_CONF_DIR on your program. Take pyspark for example:
import os
from pyspark.sql import SparkSession
os.environ["SPARK_CONF_DIR"] = "/path/to/configs/conf1"
spark = SparkSession.builder.getOrCreate()

Related

How PYSPARK environmental setup is executed by YARN in launch_container.sh

While analyzing the yarn launch_container.sh logs for a spark job, I got confused by some part of log.
I will point out those asks step by step here
When you will submit a spark job with spark-submit having --pyfiles and --files on cluster mode on YARN:
The config files passed in --files , executable python files passed in --pyfiles are getting uploaded into .sparkStaging directory created under user hadoop home directory.
Along with these files pyspark.zip and py4j-version_number.zip from $SPARK_HOME/python/lib is also getting copied
into .sparkStaging directory created under user hadoop home directory
After this launch_container.sh is getting triggered by yarn and this will export all env variables required.
If we have exported anything explicitly such as PYSPARK_PYTHON in .bash_profile or at the time of building the spark-submit job in a shell script or in spark_env.sh , the default value will be replaced by the value which we
are providing
This PYSPARK_PYTHON is a path in my edge node.
Then how a container launched in another node will be able to use this python version ?
The default python version in data nodes of my cluster is 2.7.5.
So without setting this pyspark_python , containers are using 2.7.5.
But when I will set pyspark_python to 3.5.x , they are using what I have given.
It is defining PWD='/data/complete-path'
Where this PWD directory resides ?
This directory is getting cleaned up after job completion.
I have even tried to run the job in one session of putty
and kept the /data folder opened in another session of putty to see
if any directories are getting created on run time. but couldn't find any?
It is also setting the PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-version.zip
When ever I am doing a python specific operation
in spark code , its using PYSPARK_PYTHON. So for what purpose this PYTHONPATH is being used?
3.After this yarn is creating softlinks using ln -sf for all the files in step 1
soft links are created for for pyspark.zip , py4j-<version>.zip,
all python files mentioned in step 1.
Now these links are again pointing to '/data/different_directories'
directory (which I am not sure where they are present).
I know soft links can be used for accessing remote nodes ,
but here why the soft links are created ?
Last but not the least , whether this launch_container.sh will run for each container launch ?
Then how a container launched in another node will be able to use this python version ?
First of all, when we submit a Spark application, there are several ways to set the configurations for the Spark application.
Such as:
Setting spark-defaults.conf
Setting environment variables
Setting spark-submit options (spark-submit —help and —conf)
Setting a custom properties file (—properties-file)
Setting values in code (exposed in both SparkConf and SparkContext APIs)
Setting Hadoop configurations (HADOOP_CONF_DIR and spark.hadoop.*)
In my environment, the Hadoop configurations are placed in /etc/spark/conf/yarn-conf/, and the spark-defaults.conf and spark-env.sh is in /etc/spark/conf/.
As the order of precedence for configurations, this is the order that Spark will use:
Properties set on SparkConf or SparkContext in code
Arguments passed to spark-submit, spark-shell, or pyspark at run time
Properties set in /etc/spark/conf/spark-defaults.conf, a specified properties file
Environment variables exported or set in scripts
So broadly speaking:
For properties that apply to all jobs, use spark-defaults.conf,
for properties that are constant and specific to a single or a few applications use SparkConf or --properties-file,
for properties that change between runs use command line arguments.
Now, regarding the question:
In Cluster mode of Spark, the Spark driver is running in container in YARN, the Spark executors are running in container in YARN.
In Client mode of Spark, the Spark driver is running outside of the Hadoop cluster(out of YARN), and the executors are always in YARN.
So for your question, it is mostly relative with YARN.
When an application is submitted to YARN, first there will be an ApplicationMaster container, which nigotiates with NodeManager, and is responsible to control the application containers(in your case, they are Spark executors).
NodeManager will then create a local temporary directory for each of the Spark executors, to prepare to launch the containers(that's why the launch_container.sh has such a name).
We can find the location of the local temporary directory is set by NodeManager's ${yarn.nodemanager.local-dirs} defined in yarn-site.xml.
And we can set yarn.nodemanager.delete.debug-delay-sec to 10 minutes and review the launch_container.sh script.
In my environment, the ${yarn.nodemanager.local-dirs} is /yarn/nm, so in this directory, I can find the tempory directories of Spark executor containers, they looks like:
/yarn/nm/nm-local-dir/container_1603853670569_0001_01_000001.
And in this directory, I can find the launch_container.sh for this specific container and other stuffs for running this container.
Where this PWD directory resides ?
I think this is a special Environment Variable in Linux OS, so better not to modify it unless you know how it works percisely in your application.
As per above, if you export this PWD environment at the runtime, I think it is passed to Spark as same as any other Environment Variables.
I'm not sure how the PYSPARK_PYTHON Environment Variable is used in Spark's launch scripts chain, but here you can find the instruction in the official documentation, showing how to set Python binary executable while you are using spark-submit:
spark-submit --conf spark.pyspark.python=/<PATH>/<TO>/<FILE>
As for the last question, yes, YARN will create a temp dir for each of the containers, and the launch_container.sh is included in the dir.

google dataproc: which spark directory to use for setting $SPARK_HOME env var?

On google DataProc, $whereis spark returned two directories. $SPARK_HOME env wasn't set by default. Which one of the two directories should I set $SPARK_HOME to?
svermoli#cluster-bigdl-0p4-deploy-m:~/BigDL$ whereis spark
spark: /usr/lib/spark /etc/spark
svermoli#cluster-bigdl-0p4-deploy-m:~/BigDL$ ls /usr/lib/spark
bin data external LICENSE NOTICE R RELEASE work
conf examples jars licenses python README.md sbin yarn
svermoli#cluster-bigdl-0p4-deploy-m:~/BigDL$ ls /etc/spark
conf conf.dist
svermoli#cluster-bigdl-0p4-deploy-m:~/BigDL$ echo $SPARK_HOME
svermoli#cluster-bigdl-0p4-deploy-m:~/BigDL$
Use /usr/lib/spark/ as is where the runtime is located, etc/spark is for configuration files. Also, you can set it as an initialization action if you need to do it everytime you create a cluster as explained in Dennis Huo's answer here.

How to configure Executor in Spark Local Mode

In Short
I want to configure my application to use lz4 compression instead of snappy, what I did is:
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.getOrCreate();
but looking at the console output, it's still using snappy in the executor
org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
and
[Executor task launch worker-0] compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [.snappy]
According to this post, what I did here only configure the driver, but not the executor. The solution on the post is to change the spark-defaults.conf file, but I'm running spark in local mode, I don't have that file anywhere.
Some more detail:
I need to run the application in local mode (for the purpose of unit test). The tests works fine locally on my machine, but when I submit the test to a build engine(RHEL5_64), I got the error
snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found
I did some research and it seems the simplest fix is to use lz4 instead of snappy for codec, so I try the above solution.
I have been stuck in this issue for several hours, any help is appreciated, thank you.
what I did here only configure the driver, but not the executor.
In local mode there is only one JVM which hosts both driver and executor threads.
the spark-defaults.conf file, but I'm running spark in local mode, I don't have that file anywhere.
Mode is not relevant here. Spark in local mode uses the same configuration files. If you go to the directory where you store Spark binaries you should see conf directory:
spark-2.2.0-bin-hadoop2.7 $ ls
bin conf data examples jars LICENSE licenses NOTICE python R README.md RELEASE sbin yarn
In this directory there is a bunch of template files:
spark-2.2.0-bin-hadoop2.7 $ ls conf
docker.properties.template log4j.properties.template slaves.template spark-env.sh.template
fairscheduler.xml.template metrics.properties.template spark-defaults.conf.template
If you want to set configuration option copy spark-defaults.conf.template to spark-defaults.conf and edit it according to your requirements.
Posting my solution here, #user8371915 does answered the question, but did not solve my problem, because in my case I can't modified the property files.
What I end up doing is adding another configuration
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.config("spark.sql.parquet.compression.codec", "uncompressed")
.getOrCreate();

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at loss here (Spark newbie). I spun up an EC2 cluster, and submitted a Spark job which saves as text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the python file I'm submitting is /root. I cannot find the directory called september_2005, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The ec2 address is the master node where I'm ssh'ing to, but I don't have a folder /user/root.
Seems like Spark is creating the september_2015 directory somehwere, but find doesn't find it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist in the master node filesystem?
You're not saving it in the local file system, you're saving it in the hdfs cluster. Try eph*-hdfs/bin/hadoop fs -ls /, then you should see your file. See eph*-hdfs/bin/hadoop help for more commands, eg. -copyToLocal.

How to enable Spark mesos docker executor?

I'm working on integration between Mesos & Spark. For now, I can start SlaveMesosDispatcher in a docker; and I like to also run Spark executor in Mesos docker. I do the following configuration for it, but I got an error; any suggestion?
Configuration:
Spark: conf/spark-defaults.conf
spark.mesos.executor.docker.image ubuntu
spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home /root/spark
#spark.executorEnv.SPARK_HOME /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib
NOTE: The spark are installed in /home/test/workshop/spark, and all dependencies are installed.
After submit SparkPi to the dispatcher, the driver job is started but failed. The error messes is:
I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
/bin/sh: 1: ./bin/spark-submit: not found
Does any know how to map/set spark home in docker for this case?
I think the issue you're seeing here is a result of the current working directory of the container isn't where Spark is installed. When you specify a docker image for Spark to use with Mesos, it expects the default working directory of the container to be inside $SPARK_HOME where it can find ./bin/spark-submit.
You can see that logic here.
It doesn't look like you're able to configure the working directory through Spark configuration itself, which means you'll need to build a custom image on top of ubuntu that simply does a WORKDIR /root/spark.

Resources