spark standalone on a cluster - apache-spark

i installed pre_built version of spark on each node of my cluster, (just download it then unzip it)
Question 1 :
Do i have to copy into conf directory the files slaves.template and spark-env.sh.template then edit them to connect my machines to each other ? if yes how can i do it only by command
Question2:
i lunched master on one remote machine (and when i wanted to access to spark web UI from my local machine using
http://IPofRemoteMachine:8080
IP_address:8080 or IP_address:4040
nothing has displayed on my browser, why and what i am missing ,?
Question3:
if i have 6 nodes on my cluster and if i want to use only 4 for example, do i have to lunch the master , then lunch workers only in nodes that i want to use?

Answer 1 :
You need to rename files by removing .template from them as slaves & spark-env.sh.
Suppose there are two machines 10.1.1.11(A) & 10.1.1.12(B) and you want to run spark master on machine A and workers on both A & B then in slaves you should write all IPs on which workers will run:
sample slaves file
10.1.1.11
10.1.1.12
sample spark-env.sh file
export SPARK_MASTER_MEMORY=1024M
export SPARK_DRIVER_MEMORY=1024M
export SPARK_WORKER_INSTANCES=1
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_WORKER_MEMORY=1024M
export SPARK_EXECUTOR_MEMORY=1024M
export SPARK_WORKER_CORES=2
export SPARK_EXECUTOR_CORES=2
export SPARK_MASTER_IP=10.1.1.11
export SPARK_MASTER_WEBUI_PORT=8081
You can configure spark-env.sh (just a script file) with more options available here
Answer 2 :
You can change your spark web UI port
by editing spark-env.sh to include SPARK_MASTER_WEBUI_PORT=8081
Then you can acess spark web ui on 10.1.1.11:8081.
If you get Could not resolve hostname check my answer here.
Answer 3 :
You can change nodes on which worker will be running in slaves file.

Related

"Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher" when running spark-submit or PySpark

I am trying to run the spark-submit command on my Hadoop cluster Here is a summary of my Hadoop Cluster:
The cluster is built using 5 VirtualBox VM's connected on an internal network
There is 1 namenode and 4 datanodes created.
All the VM's were built from the Bitnami Hadoop Stack VirtualBox image
I am trying to run one of the spark examples using the following spark-submit command
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I get the following error:
[2022-07-25 13:32:39.253]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
I get the same error when trying to run a script with PySpark.
I have tried/verified the following:
environment variables: HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir hdfs://hadoop-namenode:8020/user/bitnami/sparkStaging and spark.yarn.jars hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ in spark-defaults.conf
I have uploaded the jars into hdfs (i.e. hadoop fs -put $SPARK_HOME/jars/* hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/ )
The logs accessible via the web interface (i.e. http://hadoop-namenode:8042 ) do not provide any further details about the error.
This section of the Spark documentation seems relevant to the error since the YARN libraries should be included, by default, but only if you've installed the appropriate Spark version
For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark. To override this behavior, you can set spark.yarn.populateHadoopClasspath=true. For no-hadoop Spark distribution, Spark will populate Yarn’s classpath by default in order to get Hadoop runtime. For with-hadoop Spark distribution, if your application depends on certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting the property mentioned above. If you run into jar conflict issue by doing so, you will need to turn it off and include this library in your application jar.
https://spark.apache.org/docs/latest/running-on-yarn.html#preparations
Otherwise, yarn.application.classpath in yarn-site.xml refers to local filesystem paths in each of ResourceManager servers where JARs are available for all YARN applications (spark.yarn.jars or extra packages should get layered onto this)
Another problem could be file permissions. You probably shouldn't put Spark jars into an HDFS user folder if they're meant to be used by all users. Typically, I'd put it under hdfs:///apps/spark/<version>, then give that 744 HDFS permissions
In the Spark / YARN UI, it should show the complete classpath of the application for further debugging
I figured out why I was getting this error. It turns out that I made an error while specifying spark.yarn.jars in spark-defaults.conf
The value of this property must be
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/*
instead of
hdfs://hadoop-namenode:8020/user/bitnami/spark/jars/
i.e. Basically, we need to specify the jar files as the value to this property and not the folder containing the jar files.

How to configure Spark spark_worker_opts for Jupyter notebooks

I use Pyspark with Spark 2.4 in the standalone mode on Linux for processing a lot of incoming data via Kafka using a Jupyter notebook (currently for testing). I want to add these options to this notebook in order to prevent the /tmp/ directory to be filled with dozens of gigabytes after few hours:
spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=120
But these conf locations do not work:
Spark’s default configuration (spark/conf/spark-env.sh) seems not be used by Juypter notebooks at all:
SPARK_WORKER_OPTS="spark.worker.cleanup.enabled=true
spark.worker.cleanup.appDataTtl=120"
So, I created a sperate kernel configuration in ~/.local/share/jupyter/kernels/python3-spark1/kernel.json that I can select in Jupyterhub and that is really used for the RAM adjustments what I can see in htop:
"env": {
"PYSPARK_SUBMIT_ARGS": "--master local[*]
--conf spark.worker.cleanup.enabled=true --conf=spark.worker.cleanup.appDataTtl=120 driver-memory 145g --executor-memory 50g pyspark-shell"
but the /tmp still fills with dozens of gigs.
I also tried the “magic” in a jupyter cell but it also did not work.
Do you know how to configure the Jupyter notebooks for this Spark adjustments properly?
Configuration properties that apply only to the worker in the form "-Dx=y"
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=60 -Dspark.worker.cleanup.appDataTtl=120"
If that not work you can try any of the below options.
Option-1: Updating default.conf
In Worker node set the following configuration option in the /spark/conf/spark-defaults.conf file:
spark.worker.cleanup.enabled: Enables periodic cleanup of worker and application directories. Disabled by default. Set to true to enable it. Note: that this only affects standalone mode, as YARN works differently.
spark.worker.cleanup.interval: The frequency, in seconds, that the worker cleans up old application work directories. The default is 30 minutes.
spark.worker.cleanup.appDataTtl: The number of seconds to retain application work directories on each worker. The default is 7 days.
Then stop and start the workers.
sbin/stop-worker.sh - Stops all worker instances on the machine the script is executed on.
sbin/start-worker.sh - Starts a worker instance on the machine the script is executed on.
Option-2: If you setup a spark cluster using docker-compose then set environment in Docker compose file
spark-worker-x:
image: spark-worker
container_name: spark-worker-x
environment:
- SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=60 -Dspark.worker.cleanup.appDataTtl=120"

How PYSPARK environmental setup is executed by YARN in launch_container.sh

While analyzing the yarn launch_container.sh logs for a spark job, I got confused by some part of log.
I will point out those asks step by step here
When you will submit a spark job with spark-submit having --pyfiles and --files on cluster mode on YARN:
The config files passed in --files , executable python files passed in --pyfiles are getting uploaded into .sparkStaging directory created under user hadoop home directory.
Along with these files pyspark.zip and py4j-version_number.zip from $SPARK_HOME/python/lib is also getting copied
into .sparkStaging directory created under user hadoop home directory
After this launch_container.sh is getting triggered by yarn and this will export all env variables required.
If we have exported anything explicitly such as PYSPARK_PYTHON in .bash_profile or at the time of building the spark-submit job in a shell script or in spark_env.sh , the default value will be replaced by the value which we
are providing
This PYSPARK_PYTHON is a path in my edge node.
Then how a container launched in another node will be able to use this python version ?
The default python version in data nodes of my cluster is 2.7.5.
So without setting this pyspark_python , containers are using 2.7.5.
But when I will set pyspark_python to 3.5.x , they are using what I have given.
It is defining PWD='/data/complete-path'
Where this PWD directory resides ?
This directory is getting cleaned up after job completion.
I have even tried to run the job in one session of putty
and kept the /data folder opened in another session of putty to see
if any directories are getting created on run time. but couldn't find any?
It is also setting the PYTHONPATH to $PWD/pyspark.zip:$PWD/py4j-version.zip
When ever I am doing a python specific operation
in spark code , its using PYSPARK_PYTHON. So for what purpose this PYTHONPATH is being used?
3.After this yarn is creating softlinks using ln -sf for all the files in step 1
soft links are created for for pyspark.zip , py4j-<version>.zip,
all python files mentioned in step 1.
Now these links are again pointing to '/data/different_directories'
directory (which I am not sure where they are present).
I know soft links can be used for accessing remote nodes ,
but here why the soft links are created ?
Last but not the least , whether this launch_container.sh will run for each container launch ?
Then how a container launched in another node will be able to use this python version ?
First of all, when we submit a Spark application, there are several ways to set the configurations for the Spark application.
Such as:
Setting spark-defaults.conf
Setting environment variables
Setting spark-submit options (spark-submit —help and —conf)
Setting a custom properties file (—properties-file)
Setting values in code (exposed in both SparkConf and SparkContext APIs)
Setting Hadoop configurations (HADOOP_CONF_DIR and spark.hadoop.*)
In my environment, the Hadoop configurations are placed in /etc/spark/conf/yarn-conf/, and the spark-defaults.conf and spark-env.sh is in /etc/spark/conf/.
As the order of precedence for configurations, this is the order that Spark will use:
Properties set on SparkConf or SparkContext in code
Arguments passed to spark-submit, spark-shell, or pyspark at run time
Properties set in /etc/spark/conf/spark-defaults.conf, a specified properties file
Environment variables exported or set in scripts
So broadly speaking:
For properties that apply to all jobs, use spark-defaults.conf,
for properties that are constant and specific to a single or a few applications use SparkConf or --properties-file,
for properties that change between runs use command line arguments.
Now, regarding the question:
In Cluster mode of Spark, the Spark driver is running in container in YARN, the Spark executors are running in container in YARN.
In Client mode of Spark, the Spark driver is running outside of the Hadoop cluster(out of YARN), and the executors are always in YARN.
So for your question, it is mostly relative with YARN.
When an application is submitted to YARN, first there will be an ApplicationMaster container, which nigotiates with NodeManager, and is responsible to control the application containers(in your case, they are Spark executors).
NodeManager will then create a local temporary directory for each of the Spark executors, to prepare to launch the containers(that's why the launch_container.sh has such a name).
We can find the location of the local temporary directory is set by NodeManager's ${yarn.nodemanager.local-dirs} defined in yarn-site.xml.
And we can set yarn.nodemanager.delete.debug-delay-sec to 10 minutes and review the launch_container.sh script.
In my environment, the ${yarn.nodemanager.local-dirs} is /yarn/nm, so in this directory, I can find the tempory directories of Spark executor containers, they looks like:
/yarn/nm/nm-local-dir/container_1603853670569_0001_01_000001.
And in this directory, I can find the launch_container.sh for this specific container and other stuffs for running this container.
Where this PWD directory resides ?
I think this is a special Environment Variable in Linux OS, so better not to modify it unless you know how it works percisely in your application.
As per above, if you export this PWD environment at the runtime, I think it is passed to Spark as same as any other Environment Variables.
I'm not sure how the PYSPARK_PYTHON Environment Variable is used in Spark's launch scripts chain, but here you can find the instruction in the official documentation, showing how to set Python binary executable while you are using spark-submit:
spark-submit --conf spark.pyspark.python=/<PATH>/<TO>/<FILE>
As for the last question, yes, YARN will create a temp dir for each of the containers, and the launch_container.sh is included in the dir.

Spark cleanup job not running

Whenever I do a dse spark-submit <jarname>,it copies the jar in SPARK_WORKER_DIR (in my case /var/lib/spark-worker/worker-0). I want to get the jar automatically deleted once the spark job is successfully completed/run. Using this, I changed my SPARK_WORKER_OPTS in spark-env.sh as follows :
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800"
But the jar is still not getting deleted. Am I doing something wrong? What should I do?
Adding this line to spark-env.sh and restarting the dse service worked for me:
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=172800 "
I restarted the dse service by
nodetool drain
sudo service dse restart
This deletes the log 2 days after the job is complete.

Spark Job-Server configuaration in StandAlone cluster

I am trying to set up a Spark JobServer (SJS) to execute jobs on a Standalone Spark cluster. I am trying to deploy SJS on one of the non-master nodes of SPARK cluster. I am not using the docker, but trying to do manually.
I am confused with the help documents in SJS github particulary the deployment section. Do I need to edit both local.conf and local.sh to run this?
Can someone point out the steps to set up the SJS in the spark cluster?
Thanks!
Kiran
Update:
I created a new environment to deploy jobserver in one of the nodes of the cluster: Here are the details of it:
env1.sh:
DEPLOY_HOSTS="masked.mo.cpy.corp"
APP_USER=kiran
APP_GROUP=spark
INSTALL_DIR=/home/kiran/job-server
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=1.6.1
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/home/spark/spark-1.6.1-bin-hadoop2.6
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.6
env1.conf
spark {
master = "local[1]"
webUrlPort = 8080
job-number-cpus = 2
jobserver {
port = 8090
bind-address = "0.0.0.0"
jar-store-rootdir = /tmp/jobserver/jars
context-per-jvm = false
jobdao = spark.jobserver.io.JobFileDAO
filedao {
rootdir = /tmp/spark-job-server/filedao/data
}
datadao {
rootdir = /tmp/spark-jobserver/upload
}
result-chunk-size = 1m
}
context-settings {
num-cpu-cores = 1
memory-per-node = 1G
}
home = "/home/spark/spark-1.6.1-bin-hadoop2.6"
}
Why don't you set JOBSERVER_FG=1 and try running server_start.sh, this would run the process in foreground and should display the error to stderr.
Yes, you have edit both files adapting them for your cluster.
The deploy steps are explained below:
Copy config/local.sh.template to <environment>.sh and edit as appropriate.
This file is mostly for environment variables that are used by the deployment script and by the server_start.sh script. The most important ones are: deploy host (it's the ip or hostname where the jobserver will be run), user and group of execution, JobServer memory (it will be the driver memory), spark version and spark home.
Copy config/shiro.ini.template to shiro.ini and edit as appropriate. NOTE: only required when authentication = on
If you are going to use shiro authentication, then you need this step.
Copy config/local.conf.template to <environment>.conf and edit as appropriate.
This is the main configuration file for JobServer and for the contexts that JobServer will create. The full list of the properties you can set in this file can be seen on this link.
bin/server_deploy.sh <environment>
After editing the configuration files, you can deploy using this script. The parameter must be the name that you chose for your .conf and .sh files.
Once you run the script, JobServer will connect to the host entered in the .sh file and will create a new directory with some control files. Then, every time you need to change a configuration entry, you can do it directly on the remote machine: the .conf file will be there with the name you chose and the .sh file will be renamed to settings.sh.
Please note that, if you haven't configured an SSH key based connection between the machine where you run this script and the remote machine, you will be prompted for password during its execution.
If you have problems with the creation of directories on the remote machine, you can try and create them yourself with mkdir (they must match the INSTALL_DIR configuration entry of the .sh file) and change their owner user and group to match the ones entered in the .sh configuration file.
On the remote server, start it in the deployed directory with server_start.sh and stop it with server_stop.sh
This is very informative. Once you have done all other steps, you can start JobServer service on the remote machine by running the script server_start.sh and you can stop it with server_stop.sh

Resources