How to get SparkContext if Spark runs on Yarn?

We have a program based on Spark standalone, and in this program we use SparkContext and SqlContext to do lots of queries.
Now we want to deploy the system on Spark running on YARN, but when we change spark.master to yarn-cluster, the application throws an exception saying this mode works only with spark-submit. When we switch to yarn-client, it no longer throws exceptions, but it doesn't work properly.
It seems that when running on YARN we can no longer drive everything through SparkContext and should instead use something like yarn.Client, but then we don't know how to change our code to achieve what we previously did with SparkContext and SqlContext.
Is there a good way to solve this? Can we get a SparkContext from yarn.Client, or should we change our code to use the yarn.Client interfaces?
Thank you!

When you run on a cluster, you need to launch the application with spark-submit, like this:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
--master will be yarn
--deploy-mode will be cluster
If your application has something like setMaster("local[*]"), remove it and rebuild the code. When you run spark-submit with --master yarn, YARN will launch the containers for you instead of the Spark standalone scheduler.
Your application code can then look like this, with no master set at all:
val conf = new SparkConf().setAppName("App Name")
val sc = new SparkContext(conf)
The yarn deploy mode client is used when you want to launch the driver on the same machine the code is submitted from. On a cluster the deploy mode should be cluster; this makes sure the driver is launched by YARN on one of the worker nodes.
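For illustration, here is a minimal sketch (the object name and app name are placeholders) of how existing SparkContext/SqlContext code can stay essentially unchanged once the hard-coded master is dropped and supplied by spark-submit instead:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster here; --master yarn comes from spark-submit
    val conf = new SparkConf().setAppName("App Name")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // ... run the existing queries with sc and sqlContext ...
    sc.stop()
  }
}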

Related

hadoop multi node with spark sample job

I have just configured Spark on my Hadoop cluster and I want to run the Spark sample job.
Before that, I want to understand what the job command below stands for.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
You can see all possible parameters for submitting a Spark job here. I summarized the ones in your submit script below:
spark-submit
--deploy-mode client # client/cluster, default: client. Whether to deploy your driver on the worker nodes (cluster) or locally (client)
--class org.apache.spark.examples.SparkPi # the entry point for your application
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10 # jar file path and application arguments
--master is another parameter usually defined in submit scripts. For my HDP cluster the default value of master is yarn. You can see all possible values for master in the Spark documentation.
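For context, here is a simplified sketch (not the exact source) of roughly what org.apache.spark.examples.SparkPi does; the trailing 10 in your command becomes args(0), the number of partitions to use for the Monte Carlo estimate of Pi:
import scala.util.Random
import org.apache.spark.sql.SparkSession

object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Spark Pi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000L * slices
    // Count random points that fall inside the unit circle
    val count = spark.sparkContext.parallelize(1L to n, slices).map { _ =>
      val x = Random.nextDouble() * 2 - 1
      val y = Random.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}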

How to create Pyspark application

My requirement is to read the data from HDFS using pyspark, keep only the required columns, remove the NULL values, and then write the processed data back to HDFS. Once these steps are completed, we need to delete the raw dirty data from HDFS. Here is my script for each operation.
Import the Libraries and dependencies
# Spark version => 2.4.0-cdh6.3.1
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
import pyspark.sql.functions as F
Read the Data from HDFS
df_load_1 = sparkSession.read.csv('hdfs:///cdrs/file_path/*.csv', sep = ";")
Select only the required columns
cols = ['_c0', '_c1', '_c2', '_c3', '_c5', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13', '_c22', '_c32', '_c34', '_c38', '_c40',
        '_c43', '_c46', '_c47', '_c50', '_c52', '_c53', '_c54', '_c56', '_c57', '_c59', '_c62', '_c63', '_c77', '_c81', '_c83']
df1 = df_load_1.select(*cols)
Check for NULL values and, if there are any, remove them
df_agg_1 = df1.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df1.columns])
df_agg_1.show()
df1 = df1.na.drop()
Writing the pre-processed data to HDFS, same cluster but different directory
df1.write.csv("hdfs://nm/pyspark_cleaned_data/py_in_gateway.csv")
Deleting the original raw data from HDFS
def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration()))
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
Execute it by passing the HDFS absolute path
delete_path(spark , '/cdrs//cdrs/file_path/')
I am able to do all the operations successfully from the pyspark prompt.
Now I want to develop the application and submit the job using spark-submit.
For example
spark-submit --master yarn --deploy-mode client project.py for local
spark-submit --master yarn --deploy-mode cluster project.py for cluster
At this point I am stuck. I am not sure what parameter I am supposed to pass in place of yarn in spark-submit, and I am not sure whether simply copying and pasting all the above commands into a .py file will help. I am very new to this technology.
Basically, your Spark job will run on a cluster. Spark 2.4.4 supports YARN, Kubernetes, Mesos, and Spark standalone clusters (doc).
--master yarn specifies that you are submitting your spark job to a yarn cluster.
--deploy-mode specifies whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
spark-submit --master yarn --deploy-mode client project.py for client mode
spark-submit --master yarn --deploy-mode cluster project.py for cluster mode
spark-submit --master local project.py for local mode
You can provide other arguments while submitting your Spark job, like --driver-memory, --executor-memory, --num-executors, etc.; check here.

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
My query is a simple Spark SQL query that reads parquet files, runs some simple queries, and writes out parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply says repeatedly
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I need feedback on when the process will finish. I have tried the following:
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query that progress bar is useless.
I have tried to get the progress bar through the YARN logs, but they are not aggregated until the job is complete. Even then there is no progress bar in the logs.
Is there a way to launch a Spark query from a jar on a cluster and have a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine but on the machine where the driver runs.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out and so there is "space" for ConsoleProgressBar).
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
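As a small sketch (assuming Spark 2.x and client deploy mode; the app and object names are placeholders), this is roughly the kind of driver-side setup under which the bar shows up on stderr, given the two conditions described above:
import org.apache.spark.sql.SparkSession

object ProgressBarDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("progress-bar-demo")
      .config("spark.ui.showConsoleProgress", "true") // make the setting explicit
      .getOrCreate()
    // A job with enough partitions/tasks that the progress bar has something to show
    spark.range(0L, 1000000000L, 1L, 200).selectExpr("sum(id)").show()
    spark.stop()
  }
}
Submit it with --master yarn --deploy-mode client (or run it from spark-shell) to see the bar on the local console.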

How to use --num-executors option with spark-submit?

I am trying to override Spark properties such as num-executors while submitting the application with spark-submit, as below:
spark-submit --class WC.WordCount \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 3584M \
...../<myjar>.jar \
/public/blahblahblah /user/blahblah
However, it runs with the default number of executors, which is 2. But I am able to override the properties if I add
--master yarn
Can someone explain why this is so? Interestingly, in my application code I am setting the master to yarn-client:
val conf = new SparkConf()
.setAppName("wordcount")
.setMaster("yarn-client")
.set("spark.ui.port","56487")
val sc = new SparkContext(conf)
Can someone shed some light on how the --master option works?
I am trying to override spark properties such as num-executors while submitting the application by spark-submit as below
It will not work (unless you override spark.master in conf/spark-defaults.conf file or similar so you don't have to specify it explicitly on the command line).
The reason is that the default Spark master is local[*] and the number of executors is exactly one, i.e. the driver. That's just the local deployment environment. See Master URLs.
As a matter of fact, num-executors is very YARN-dependent as you can see in the help:
$ ./bin/spark-submit --help
...
YARN-only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
That explains why it worked when you switched to YARN. It is supposed to work with YARN (regardless of the deploy mode, i.e. client or cluster which is about the driver alone not executors).
You may be wondering why it did not work with the master defined in your code then. The reason is that it is too late since the master has already been assigned on launch time when you started the application using spark-submit. That's exactly the reason why you should not specify deployment environment-specific properties in the code as:
It may not always work (see the case with master)
It requires recompiling the code on every configuration change (which makes it a bit unwieldy)
That's why you should be always using spark-submit to submit your Spark applications (unless you've got reasons not to, but then you'd know why and could explain it with ease).
If you'd like to run the same application with different masters or different amounts of memory, Spark allows you to do that with a default (empty) SparkConf. Properties you set on SparkConf in the code take the highest precedence for the application; check the properties precedence at the end.
Example:
val sc = new SparkContext(new SparkConf())
Then, you can supply configuration values at runtime:
./bin/spark-submit \
--name "My app" \
--deploy-mode "client" \
--conf spark.ui.port=56487 \
--conf spark.master=yarn \
--conf spark.executor.memory=4g \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--class WC.WordCount \
/<myjar>.jar \
/public/blahblahblah \
/user/blahblah
Here --conf spark.master=yarn is an alternative to --master yarn, and --conf spark.executor.memory=4g is an alternative to --executor-memory 4g.
Properties precedence order (highest first):
1. Properties set directly on the SparkConf (in the code) take the highest precedence.
2. Then flags passed to spark-submit or spark-shell, like --master.
3. Then options in the spark-defaults.conf file.
Any values specified as flags or in the properties file are passed on to the application and merged with those specified through SparkConf.
A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source: Dynamically Loading Spark Properties
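If it helps, here is a minimal sketch (the object name is hypothetical) that prints the effective configuration at runtime, so you can verify which values actually won after the spark-submit flags and spark-defaults.conf were merged in:
import org.apache.spark.{SparkConf, SparkContext}

object ConfCheck {
  def main(args: Array[String]): Unit = {
    // Empty SparkConf: master, memory, etc. come from spark-submit and the defaults file
    val sc = new SparkContext(new SparkConf().setAppName("conf-check"))
    sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
    sc.stop()
  }
}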

spark-submit classNotFoundException

I'm building a Spark app with Maven (with the shade plugin) and scp'ing it to a data node for execution with spark-submit --deploy-mode cluster (since launching right from the build system with --deploy-mode client doesn't work because of an asymmetric network that is not under my control).
Here's my launch command
spark-submit
--class Test
--master yarn
--deploy-mode cluster
--supervise
--verbose
jarName.jar
hdfs:///somePath/Test.txt
hdfs:///somePath/Test.out
The job quickly fails with a ClassNotFoundException for Test$1, one of the anonymous classes Java creates from my main class:
6/03/18 12:59:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
0.0 (TID 0, dataNode3): java.lang.ClassNotFoundException: Test$1
I've seen this error mentioned many times (google) and most recommendations boil down to calling conf.setJars(jarPaths) or similar.
I really don't see why this is needed when the missing class is definitely available in jarName.jar (I've checked), why specifying this at compile time is preferable to doing it at run time with --jars as a spark-submit argument, and, in either case, what path I should provide for the jar. I've been copying it to my home directory on the data node from target/jarName.jar on the build system, but it seems spark-submit copies it somewhere into HDFS that's hard to nail down into a hard-coded path name at either compile time or launch time.
And most of all, why isn't spark-submit handling this automatically based on the someJar.jar argument, and if not, what should I do to fix it?
Check the answer from here
spark submit java.lang.ClassNotFoundException
spark-submit --class Test --master yarn --deploy-mode cluster --supervise --verbose jarName.jar hdfs:///somePath/Test.txt hdfs:///somePath/Test.out
Try using the fully qualified class name; you can check the actual package path in your project:
--class com.myclass.Test
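For example, if Test lives in a package (the package name here is hypothetical), the object's package declaration and the --class value have to match:
// src/main/scala/com/myclass/Test.scala
package com.myclass

object Test {
  def main(args: Array[String]): Unit = {
    // ... build the SparkConf/SparkContext and run the job ...
  }
}
The job would then be submitted with --class com.myclass.Test.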
I had the same issue with my Scala Spark application when I tried to run it in "cluster" mode:
--master yarn --deploy-mode cluster
I found the solution on this page. Basically what I was missing (that is missing also in your command) is the "--jars" parameter that allows you to distribute the application jars to your cluster.
Suggestion: to troubleshoot this kind of error you can use the following command:
yarn logs --applicationId yourApplicationId
where yourApplicationId should appear in your YARN exception log.
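For reference, this is roughly what the conf.setJars(jarPaths) recommendation mentioned in the question looks like in code; the jar path below is only a placeholder and would normally point at your shaded build artifact:
import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Test")
      // Ship the application jar (including anonymous classes like Test$1) to the executors
      .setJars(Seq("/path/to/jarName.jar"))
    val sc = new SparkContext(conf)
    // ...
    sc.stop()
  }
}
Passing the same jar with --jars on the spark-submit command line, as the answer above suggests, is the equivalent at launch time.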
