Spark driver always fails to bind to the submit host in cluster mode - apache-spark

Hi, I'm trying to deploy a Spark Streaming job on a standalone cluster. All the jars are installed locally on each node, and I run spark-submit from one of the nodes. The driver then starts on a randomly chosen worker, but it always tries to bind to the node where I submitted the job, so whenever it lands on a different node the driver fails. I tried setting spark.driver.host to different values, but that didn't help.
Has anyone hit the same problem? Or is there a better way to submit Spark jobs, ideally on a standalone cluster?
spark-env.sh
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_HOSTNAME=local_host_name
export SPARK_LOG_DIR=/var/log/spark
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOCAL_DIRS=/var/run/spark/tmp
export STANDALONE_SPARK_MASTER_HOST=master_host_name
spark-defaults.conf
spark.master spark://master_host_name:6066
spark.io.compression.codec lz4
I run it with spark-submit --deploy-mode cluster --supervise
Thanks a lot
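For reference, a minimal sketch of what the full submit command described above might look like against the REST port configured in spark-defaults.conf; the application class and jar path are placeholders, not taken from the question:
# Hypothetical application class and jar path; master URL from spark-defaults.conf above
spark-submit \
  --master spark://master_host_name:6066 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.MyStreamingJob \
  /path/to/streaming-job.jar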

Related

AWS EMR - ModuleNotFoundError: No module named 'arrow'

I'm running into this issue when trying to upgrade to Python 3.9 for our EMR jobs using PySpark 3.0.1 / EMR release 6.2.1. I created the EMR cluster using a bootstrap script, and here are the Spark environment variables that were set:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip
I've installed all the application dependency libs using a shell script; they are located in /home/ec2-user. However, when I spark-submit a job as the hadoop user with the following command, I see the ModuleNotFoundError.
Spark-submit cmd:
/bin/sh -c "MYAPP_ENV=dev PYSPARK_PYTHON=/usr/local/bin/python3 PYTHONHASHSEED=0 SETUPTOOLS_USE_DISTUTILS=stdlib spark-submit --master yarn --deploy-mode client --jars /home/hadoop/ext_lib/*.jar --py-files /home/hadoop/myapp.zip --conf spark.sql.parquet.compression.codec=gzip --conf spark.executorEnv.MYAPP_ENV=dev /home/hadoop/myapp/oasis/etl/spark/daily/run_daily_etl.py '--lookback_days' '1' '--s3_file_system' 's3'"
Error: ModuleNotFoundError: No module named 'arrow'
However, the same command works on an EMR cluster with release label emr-5.28.0 and Spark 2.4.4.
Can someone help identify the cause? I'm completely stuck. I suspect it may be due to the hadoop user not having access to the ec2-user home folder.
Thanks
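A quick way to test the suspicion above, sketched as ordinary shell checks run on the master node (the only names taken from the question are the arrow module and the interpreter path):
# Can the hadoop user traverse the ec2-user home directory at all?
sudo -u hadoop ls -ld /home/ec2-user
# Is arrow importable by the interpreter PYSPARK_PYTHON points at?
sudo -u hadoop /usr/local/bin/python3 -c "import arrow; print(arrow.__version__)"
# Where did pip actually install arrow?
/usr/local/bin/python3 -m pip show arrow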

Spark job not showing up on standalone cluster GUI

I am playing with running Spark jobs in my lab and have a three-node standalone cluster. When I execute a new job on the master node via the CLI
spark-submit sparktest.py --master spark://myip:7077
while the job completes as expected, it does not show up at all on the cluster GUI. After some investigation, I added the --master flag to the submit command, but to no avail. During job execution, as well as after completion, when I navigate to http://mymasternodeip:8080/
none of these jobs are recognized under Running Jobs or Completed Jobs. Any thoughts as to why the jobs don't show up would be appreciated.
You should specify the --master flag before the application file, then the remaining flags/options. Otherwise everything after the application file is passed to the application as its own arguments, spark-submit never sees your --master, and the master is treated as local.
spark-submit --master spark://myip:7077 sparktest.py
Also make sure you don't override the master config in your code when creating the SparkSession object: either provide the same master URL there or don't set it at all.

Submitting application on Spark Cluster using spark submit

I am new to Spark.
I want to run a Spark Structured Streaming application on cluster.
Master and workers have the same configuration.
I have a few questions about submitting an app to the cluster using spark-submit:
You may find them comical or strange.
How can I give the path for third-party jars like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and the required jars to the workers?
Do I need to host the application on all the workers?
How can I know the status of my application while I am working from the console?
I am using the following script for spark-submit.
spark-submit
--class <class-name>
--master spark://master:7077
--deploy-mode cluster
--supervise
--conf spark.driver.extraClassPath <jar1, jar2..jarn>
--executor-memory 4G
--total-executor-cores 8
<running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following properties in the file SPARK_HOME_PATH/conf/spark-defaults.conf (create it if it does not exist):
Don't forget to use * at the end of the paths:
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpathto/jar/folder/*
Spark will pick up these properties from spark-defaults.conf when you use the spark-submit command.
Copy your jar files into that directory; when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded too.
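One hedged caveat not stated above: extraClassPath entries are plain local paths and Spark does not copy them to other machines, so on a cluster the folder also has to exist on every worker node. A rough sketch with placeholder hostnames:
# Make the jar folder available on each worker (hostnames are placeholders)
for host in worker1 worker2 worker3; do
  scp -r /fullpath/to/jar/folder "$host":/fullpath/to/
done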
spark.driver.extraClassPath: Extra classpath entries to prepend to the classpath of the driver. Note: in client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
--jars will transfer your jar files to the worker nodes and make them available on both the driver's and the executors' classpaths.
Please refer to the link below for more details; a rough sketch of the command follows it.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
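As a sketch, the submit command from the question with the third-party jars shipped via --jars might look like this (--jars takes a comma-separated list; the jar paths, class name, and application jar are placeholders carried over from the question):
# Same command as in the question, but shipping the third-party jars with --jars
spark-submit \
  --class <class-name> \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --jars /fullpath/to/jar/folder/jar1.jar,/fullpath/to/jar/folder/jar2.jar \
  --executor-memory 4G \
  --total-executor-cores 8 \
  <running-jar-file>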
Alternatively, you can build a fat jar containing all dependencies; the link below explains how.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html

Running Spark Job on Zeppelin

I have written a custom Spark library in Scala. I am able to run it successfully as a spark-submit step by spawning the cluster and running the following commands. First I fetch my two jars with:
aws s3 cp s3://jars/RedshiftJDBC42-1.2.10.1009.jar .
aws s3 cp s3://jars/CustomJar .
and then I run my Spark job as:
spark-submit --deploy-mode client --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0 --class com.activities.CustomObject CustomJar.jar
This runs my CustomObject successfully. I want to do the same thing in Zeppelin, but I do not know how to add the jars and then run a spark-submit step.
You can add these dependencies to the Spark interpreter within Zeppelin:
Go to "Interpreter"
Choose edit and add the jar file
Restart the interpreter
More info here
EDIT
You might also want to use a %dep paragraph in order to access the z variable (an implicit Zeppelin context) and do something like this:
%dep
z.load("/some_absolute_path/myjar.jar")
It depends on how you run Spark. Most of the time, the Zeppelin interpreter will embed the Spark driver.
The solution is to configure the Zeppelin interpreter instead (a sketch follows the list below):
ZEPPELIN_INTP_JAVA_OPTS will configure the Java options
SPARK_SUBMIT_OPTIONS will configure the Spark options
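A minimal sketch of the second option, assuming the jars and packages from the question and the default conf/zeppelin-env.sh location:
# conf/zeppelin-env.sh -- options passed to the spark-submit that Zeppelin runs internally
export SPARK_SUBMIT_OPTIONS="--jars /path/to/RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0"
# Restart the Spark interpreter afterwards so the options take effect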

Apache Spark Multi Node Clustering

I am currently working on log analysis using Apache Spark. I am new to Apache Spark. I have tried Spark's standalone mode. I can run my code by submitting the jar in client deploy mode, but I cannot run it on a multi-node cluster. The worker nodes are on different machines.
sh spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:7077 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
When I run this command, it shows the following error:
15/10/20 18:04:23 ERROR ClientEndpoint: Exception from cluster was: java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
at org.spark-project.guava.io.Files.copy(Files.java:436)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
My understanding is that the driver program sends the data and application code to the worker nodes, but I don't know whether that is correct. So please help me run the application on a cluster.
I have tried to run the jar on the cluster, and now there is no exception, but why is the task not assigned to the worker nodes?
I have tried it without clustering and it works fine, as shown in the following figure.
The image above shows tasks assigned to the worker nodes. But I have one more problem when analysing the log file: the log files are on the master node, in a folder (e.g. '/home/visva/log'), but the worker nodes search for the file on their own file systems.
I met the same problem.
My solution was to upload my .jar file to HDFS.
Then the submit command looks like this:
spark-submit --class com.example.RunRecommender --master spark://Hadoop-NameNode:7077 --deploy-mode cluster --executor-memory 6g --executor-cores 3 hdfs://Hadoop-NameNode:9000/spark-practise-assembly-1.0.jar
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
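A rough sketch of that approach adapted to the paths in the question; the NameNode address and target HDFS directory are assumptions, not values from the question:
# Put the application jar somewhere every node can read it (hypothetical HDFS path)
hdfs dfs -mkdir -p /jars
hdfs dfs -put /home/rishon/loganalyzer.jar /jars/
# Submit using the globally visible URL
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster \
  --master spark://rishon.server21:7077 \
  hdfs://namenode:9000/jars/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"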
If you use cluster mode in spark-submit, you need to use port 6066 (the default REST port in Spark):
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:6066 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
In my case, I uploaded the app's jar to every node in the cluster, because I do not know how spark-submit transfers the app automatically and I don't know how to designate a node as the driver node.
Note: the app's jar path must be a path that exists on every node of the cluster.
There are two deploy modes in Spark for running the script:
1. client (default): in client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster (on the master node).
2. cluster: if your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors.
Reference Spark Documentation For Submitting JAR
