hadoop multi node with spark sample job - apache-spark

I have just configured Spark on my Hadoop cluster and I want to run the Spark sample job.
Before that, I want to understand what the submit command below actually does.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10

You can see all possible parameters for submitting a Spark job in the spark-submit documentation. I have summarized the ones in your submit script below:
spark-submit
--deploy-mode client # client or cluster (default: client). Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client)
--class org.apache.spark.examples.SparkPi # The entry point (main class) of your application
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10 # path to the application jar and the application arguments (here, 10)
--master is another parameter usually defined in submit scripts. On my HDP cluster the default value of master is yarn. The Spark documentation lists all possible master values.
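For reference, the SparkPi example just performs a Monte Carlo estimate of pi. Here is a rough PySpark sketch of the same idea, as an illustration of what the job computes (this is not the bundled Scala example itself; it assumes the trailing 10 is used as the number of partitions):
import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkPi").getOrCreate()
partitions = 10                      # the trailing "10" argument in the submit command
n = 100000 * partitions

def inside(_):
    # throw a random dart into the unit square and check whether it lands inside the quarter circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()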

Related

How to create a PySpark application

My requirement is to read data from HDFS using PySpark, filter only the required columns, remove the NULL values, and then write the processed data back to HDFS. Once these steps are complete, we need to delete the raw dirty data from HDFS. Here is my script for each operation.
Import the libraries and dependencies
# Spark version => 2.4.0-cdh6.3.1
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
import pyspark.sql.functions as F
Read the Data from HDFS
df_load_1 = sparkSession.read.csv('hdfs:///cdrs/file_path/*.csv', sep = ";")
Select only the required columns
col = ['_c0', '_c1', '_c2', '_c3', '_c5', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13', '_c22', '_c32', '_c34', '_c38', '_c40',
       '_c43', '_c46', '_c47', '_c50', '_c52', '_c53', '_c54', '_c56', '_c57', '_c59', '_c62', '_c63', '_c77', '_c81', '_c83']
df1 = df_load_1.select(*col)
Check for NULL values and, if we have any, remove them
df_agg_1 = df1.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df1.columns])
df_agg_1.show()
df1 = df1.na.drop()
Write the pre-processed data to HDFS (same cluster, different directory)
df1.write.csv("hdfs://nm/pyspark_cleaned_data/py_in_gateway.csv")
Delete the original raw data from HDFS
def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration()))
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
Execute the function by passing the HDFS absolute path, as below
delete_path(spark , '/cdrs//cdrs/file_path/')
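As an optional aside (not part of the original script), the same JVM FileSystem handle can be reused to check that a path actually exists before deleting it:
def path_exists(spark, path):
    # returns True if the given HDFS path exists, using the same JVM gateway trick as delete_path
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))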
pyspark and HDFS commands
I am able to perform all of these operations successfully from the pyspark prompt.
Now I want to develop the application and submit the job using spark-submit.
For example
spark-submit --master yarn --deploy-mode client project.py for local
spark-submit --master yarn --deploy-mode cluster project.py for cluster
At this point I am stuck. I am not sure what parameter I am supposed to pass in place of yarn in spark-submit, and I am not sure whether simply copying and pasting all of the above commands into a .py file will work. I am very new to this technology.
Basically your Spark job will run on a cluster. Spark 2.4.4 supports YARN, Kubernetes, Mesos, and Spark standalone clusters (see the cluster overview documentation).
--master yarn specifies that you are submitting your spark job to a yarn cluster.
--deploy-mode specifies whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
spark-submit --master yarn --deploy-mode client project.py for client mode
spark-submit --master yarn --deploy-mode cluster project.py for cluster mode
spark-submit --master local project.py for local mode
You can provide other arguments while submitting your Spark job, such as --driver-memory, --executor-memory, --num-executors, etc.; see the spark-submit documentation for the full list.
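To connect this back to the question: the interactive pyspark prompt gives you spark for free, while a submitted script must build its own SparkSession. Here is a minimal sketch of what project.py could look like, reusing the paths and steps from the question (the column list is shortened here for brevity; this is an illustration, not a drop-in solution):
# project.py -- minimal sketch; paths and columns come from the question above
from pyspark.sql import SparkSession

def delete_path(spark, path):
    # same JVM FileSystem trick as in the question
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
    df = spark.read.csv('hdfs:///cdrs/file_path/*.csv', sep=";")
    cols = ['_c0', '_c1', '_c2', '_c3']            # shortened; use the full column list from above
    df = df.select(*cols).na.drop()                # keep required columns, drop rows with NULLs
    df.write.csv("hdfs://nm/pyspark_cleaned_data/py_in_gateway.csv")
    delete_path(spark, '/cdrs//cdrs/file_path/')   # path exactly as given in the question
    spark.stop()
The resulting file is then submitted exactly as in the commands above, e.g. spark-submit --master yarn --deploy-mode client project.py; nothing needs to be passed in place of yarn on a YARN cluster.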

Submitting an application on a Spark cluster using spark-submit

I am new to Spark.
I want to run a Spark Structured Streaming application on a cluster.
The master and workers have the same configuration.
I have a few questions about submitting the app on the cluster using spark-submit:
You may find them comical or strange.
How can I give the path for 3rd-party jars like lib/*? (The application has 30+ jars.)
Will Spark automatically distribute the application and the required jars to the workers?
Does the application need to be hosted on all the workers?
How can I know the status of my application, since I am working from the console?
I am using the following script for spark-submit:
spark-submit \
--class <class-name> \
--master spark://master:7077 \
--deploy-mode cluster \
--supervise \
--conf spark.driver.extraClassPath <jar1, jar2..jarn> \
--executor-memory 4G \
--total-executor-cores 8 \
<running-jar-file>
But the code is not running as expected.
Am I missing something?
To pass multiple jar files to spark-submit, you can set the following attributes in the file SPARK_HOME_PATH/conf/spark-defaults.conf (create it if it does not exist).
Don't forget to use * at the end of the paths:
spark.driver.extraClassPath /fullpath/to/jar/folder/*
spark.executor.extraClassPath /fullpathto/jar/folder/*
Spark reads the attributes from spark-defaults.conf when you run the spark-submit command.
Copy your jar files to that directory, and when you submit your Spark app to the cluster, the jar files in the specified paths will be loaded as well.
spark.driver.extraClassPath: Extra classpath entries to prepend
to the classpath of the driver. Note: In client mode, this config
must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead,
please set this through the --driver-class-path command line option or
in your default properties file.
--jars will transfer your jar files to the worker nodes and make them available on both the driver's and the executors' classpaths.
Please refer to the link below for more details.
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
You can make a fat jar containing all dependencies. The link below helps you understand that approach.
https://community.hortonworks.com/articles/43886/creating-fat-jars-for-spark-kafka-streaming-using.html

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
Inside my query file is a simple Spark SQL query where I read Parquet files, run simple queries, and write out Parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply says repeatedly
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I just need feedback on when the process will finish. I have tried the following:
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query, that progress bar is useless.
I have tried to get the progress through the YARN logs, but they are not aggregated until the job is complete, and even then there is no progress bar in the logs.
Is there a way to launch a Spark query in a jar on a cluster and still have a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine but on the machine where the driver runs.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out, so there is "space" for ConsoleProgressBar).
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
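If it helps, here is a small PySpark sketch (an illustration, assuming a client-mode submission where the driver's stderr is your console) that makes the two conditions above explicit:
from pyspark.sql import SparkSession

# minimal sketch: spark.ui.showConsoleProgress enabled (it usually is by default)
# and a quiet enough log level so the progress bar has room on stderr
spark = (SparkSession.builder
         .appName("progress-bar-demo")
         .config("spark.ui.showConsoleProgress", "true")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")

# ... run your Spark SQL / Parquet query here; the bar appears on the driver's stderr ...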

how to execute a spark program efficiently on a cluster

I have a 2-node Hadoop cluster, each node with 16 GB of RAM and a 512 GB hard disk.
I have written a Spark program like the one below.
Code:
val input = sc.wholeTextFiles("folderpath/*")
do some operations on input,
convert it to a DataFrame, register a temp table, and execute an insert command to insert the DataFrame values into a Hive table.
Then I open a terminal on host 1 (which is the namenode of my cluster) and run a spark-submit command like
spark-submit --class com.sample.parser --master yarn Parser.jar
But it takes more than 50 minutes to process 25 files totalling around 1 GB. And when I check the Spark UI, the executor list only contains my host 2; host 1 is listed as the driver.
So practically only one node (host 2) is executing the program. Why?
Is there a way I can have my driver node also execute the program, so that it runs a little faster? Am I doing something wrong? Basically, I want my driver node to also act as an executor (both machines have 8 cores).
Thanks in Advance.
spark-submit by default runs in client (local) mode; in order to submit the Spark job in cluster mode, use --deploy-mode as follows:
spark-submit \
--class com.sample.parser \
--master yarn \
--deploy-mode cluster \
Parser.jar
--deploy-mode: Whether to deploy your driver on the worker nodes
(cluster) or locally as an external client (client) (default: client)
Also, experiment with --num-executors <n> for different <n> values and see if it makes any difference in the performance of your app.
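As a small, optional sanity check (sketched here in PySpark, while the original program is Scala), you can print how many executor instances were requested and how much default parallelism the application actually received before tuning further:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-check").getOrCreate()
sc = spark.sparkContext
# report requested executor instances and the default parallelism the app ended up with
print("spark.executor.instances:", sc.getConf().get("spark.executor.instances", "not set"))
print("defaultParallelism:", sc.defaultParallelism)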

spark-submit in deploy mode client not reading all the jars

I'm trying to submit an application to my Spark cluster (standalone mode) through the spark-submit command. I'm following the
official Spark documentation, as well as relying on another guide. Now the problem is that I get strange behavior. My setup is the following:
I have a directory where all the dependency jars for my application are located, that is /home/myuser/jars
The jar of my application is in the same directory (/home/myuser/jars), and is called dat-test.jar
The entry point class in dat-test.jar is at the package path my.package.path.Test
Spark master is at spark://master:7077
Now, I submit the application directly on the master node, thus using the client deploy mode, running the command
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/*
and I receive an error such as
java.lang.ClassNotFoundException: my.package.path.Test
If I activate verbose mode, what I see is that the primaryResource selected as the jar containing the entry point is the first jar in alphabetical order in /home/myuser/jars/ (which is not dat-test.jar), leading (I suppose) to the ClassNotFoundException. All the other jars in the same directory are loaded as arguments anyway.
Of course if I run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/dat-test.jar
it finds the Test class, but it doesn't find other classes contained in other jars. Finally, if I use the --jars flag and run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 --jars /home/myuser/jars/* /home/myuser/jars/dat-test.jar
I obtain the same result as with the first option: the first jar in /home/myuser/jars/ is loaded as the primaryResource, leading to a ClassNotFoundException for my.package.path.Test. The same happens if I use --jars /home/myuser/jars/*.jar.
Important points are:
I do not want to build a single jar with all the dependencies, for development reasons.
There are many jars in /home/myuser/jars/. I'd like to know if there's a way to include them all instead of listing them with the comma-separated syntax.
If I try to run the same commands with --deploy-mode cluster on the master node, I don't get the error, but the computation fails for some other reason (that is another problem).
What, then, is the correct way of running spark-submit in client mode?
Thanks
There is no way to include all the jars with a wildcard in the --jars option; you will have to create a small script to enumerate them. This part is a bit sub-optimal.
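One possible shape for that small script, sketched in Python (the paths, class name, and jar name are the ones from the question; adjust to your layout):
# build_and_submit.py -- hypothetical helper that expands /home/myuser/jars into a comma-separated --jars list
import glob
import subprocess

# enumerate every dependency jar, excluding the application jar itself
jars = ",".join(sorted(j for j in glob.glob("/home/myuser/jars/*.jar")
                       if not j.endswith("dat-test.jar")))

subprocess.run([
    "spark-submit",
    "--class", "my.package.path.Test",
    "--master", "spark://master:7077",
    "--executor-memory", "5G",
    "--total-executor-cores", "10",
    "--jars", jars,                      # explicit comma-separated list instead of a wildcard
    "/home/myuser/jars/dat-test.jar",    # the application jar stays the primary resource
])
Sorting the glob result keeps the submit command reproducible, and excluding dat-test.jar avoids listing the application jar twice.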
