Spark job not showing up on standalone cluster GUI - apache-spark

I am experimenting with running Spark jobs in my lab and have a three-node standalone cluster. When I execute a new job on the master node via the CLI
spark-submit sparktest.py --master spark://myip:7077
the job completes as expected but it does not show up at all on the cluster GUI. After some investigation I added --master to the submit command, but to no avail. During job execution, as well as after completion, when I navigate to http://mymasternodeip:8080/
none of these jobs are recognized under Running Jobs or Completed Jobs. Any thoughts as to why the jobs don't show up would be appreciated.

You should specify the --master flag (and any other options) before the application file. Anything placed after sparktest.py is passed to your script as an application argument, so the --master you added was ignored and the job ran with the local master instead of registering with the standalone cluster.
spark-submit --master spark://myip:7077 sparktest.py
Also make sure you don't override the master in your code when creating the SparkSession object: either set the same master URL there, or better, don't set it at all and let spark-submit provide it.
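For instance, a minimal Scala sketch (the PySpark builder behaves the same way) that leaves the master unset so the --master passed to spark-submit takes effect:
import org.apache.spark.sql.SparkSession

// No .master(...) call here: the master comes from spark-submit's --master flag,
// so the application registers with spark://myip:7077 and shows up in the cluster UI.
val spark = SparkSession.builder()
  .appName("sparktest")
  .getOrCreate()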

Related

hadoop multi node with spark sample job

I have just configured Spark on my Hadoop cluster and I want to run the Spark sample job.
Before that, I want to understand what the submit command below does.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
You can see all the possible parameters for submitting a Spark job here. I have summarized the ones in your submit script below:
spark-submit
--deploy-mode client # client or cluster (default: client); whether to run the driver on a worker node (cluster) or locally where spark-submit is invoked (client)
--class org.apache.spark.examples.SparkPi # the entry point (main class) of your application
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10 # the application jar path and its argument(s)
--master is another parameter usually set in submit scripts. On my HDP cluster the default master is yarn. Again, you can see all the possible values for --master in the Spark documentation.
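As for what SparkPi itself computes: it is a Monte Carlo estimate of pi, roughly along these lines (a simplified sketch, not the exact shipped source; the trailing 10 becomes the number of slices/partitions):
import scala.math.random
import org.apache.spark.sql.SparkSession

object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Spark Pi").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2 // the "10" in the submit command
    val n = 100000 * slices
    // Count random points in the unit square that fall inside the unit circle
    val count = spark.sparkContext.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}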

Running Spark Job on Zeppelin

I have written a custom Spark library in Scala. I am able to run it successfully as a spark-submit step by spawning the cluster and running the following commands. First I fetch my two jars with:
aws s3 cp s3://jars/RedshiftJDBC42-1.2.10.1009.jar .
aws s3 cp s3://jars/CustomJar .
and then I run my Spark job as
spark-submit --deploy-mode client --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,com.databricks:spark-avro_2.11:3.2.0 --class com.activities.CustomObject CustomJar.jar
This runs my CustomObject successfully. I want to do the same thing in Zeppelin, but I do not know how to add the jars and then run the equivalent of a spark-submit step.
You can add these dependencies to the Spark interpreter within Zeppelin:
Go to "Interpreter"
Choose edit and add the jar file
Restart the interpreter
More info here
EDIT
You might also want to use a %dep paragraph, which gives you access to the z variable (an implicit Zeppelin context), in order to do something like this:
%dep
z.load("/some_absolute_path/myjar.jar")
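Extending that idea to the jars from the question, a hedged sketch of a %dep paragraph (the local path is hypothetical and must exist on the Zeppelin host; the paragraph has to run before the Spark interpreter is first used):
%dep
z.reset()
// Local driver jar: hypothetical path, adjust to wherever it was copied on the Zeppelin host
z.load("/home/zeppelin/jars/RedshiftJDBC42-1.2.10.1009.jar")
// Maven coordinates, the equivalent of --packages in the spark-submit command above
z.load("com.databricks:spark-redshift_2.11:3.0.0-preview1")
z.load("com.databricks:spark-avro_2.11:3.2.0")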
It depends on how you run Spark. Most of the time, the Zeppelin interpreter embeds the Spark driver.
The solution is to configure the Zeppelin interpreter instead:
ZEPPELIN_INTP_JAVA_OPTS configures the Java options
SPARK_SUBMIT_OPTIONS configures the Spark options (e.g. --jars, --packages)

How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:
spark-shell yarn --name myQuery -i ./my-query.scala
My query is a simple Spark SQL query that reads Parquet files, runs a few simple queries, and writes out Parquet files. When running these queries I get a nice progress bar like this:
[Stage7:===========> (14174 + 5) / 62500]
When I create a jar using the exact same query and run it with the following command-line:
spark-submit \
--master yarn-cluster \
--driver-memory 16G \
--queue default \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--name MyQuery \
--class com.data.MyQuery \
target/uber-my-query-0.1-SNAPSHOT.jar
I don't get any such progress bar. The command simply says repeatedly
17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)
The query works fine and the results are fine, but I just need some feedback on when the process will finish. I have tried the following.
The web page of RUNNING Hadoop applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query, that progress bar is useless.
I have tried to get the progress bar through the YARN logs, but they are not aggregated until the job is complete, and even then there is no progress bar in the logs.
Is there a way to launch a Spark query from a jar on a cluster and still get a progress bar?
When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.
The difference between these two seemingly similar Spark executions is the master URL.
In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.
In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.
With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine but on the machine where the driver runs.
A simple solution is to replace yarn-cluster with yarn.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.
The progress includes the stage id, the number of completed, active, and total tasks.
ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out, so there is "space" for ConsoleProgressBar).
You can find more information in Mastering Apache Spark 2's ConsoleProgressBar.
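A quick illustration of satisfying both conditions from the driver side (the app name is illustrative, and note the bar only ever appears where the driver runs, i.e. in client mode):
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyQuery")
  .config("spark.ui.showConsoleProgress", "true") // make sure the progress bar property is enabled
  .getOrCreate()

// Quieten SparkContext's INFO output so the stderr progress bar is not scrolled away
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)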

spark-submit classNotFoundException

I'm building a Spark app with Maven (with the shade plugin) and scp'ing it to a data node for execution with spark-submit --deploy-mode cluster (since launching right from the build system with --deploy-mode client doesn't work because of an asymmetric network not under my control).
Here's my launch command
spark-submit \
--class Test \
--master yarn \
--deploy-mode cluster \
--supervise \
--verbose \
jarName.jar \
hdfs:///somePath/Test.txt \
hdfs:///somePath/Test.out
The job quickly fails with a ClassNotFoundException for Test$1, one of the anonymous classes Java creates from my main class:
6/03/18 12:59:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
0.0 (TID 0, dataNode3): java.lang.ClassNotFoundException: Test$1
I've seen this error mentioned many times (google) and most recommendations boil down to calling conf.setJars(jarPaths) or similar.
I really don't see why this is needed when the missing class is definitely (I've checked) available in jarName.jar, why specifying this at compile time is preferable to doing it at run time with --jars as a spark-submit argument, and in either case, what path I should provide for the jar. I've been copying it to my home directory on the data node from target/jarName.jar on the build system, but it seems spark-submit copies it somewhere in HDFS that's hard to pin down as a hard-coded path at either compile time or launch time.
And most of all, why isn't spark-submit handling this automatically based on the someJar.jar argument, and if not, what should I do to fix it?
Check the answer here:
spark submit java.lang.ClassNotFoundException
spark-submit --class Test --master yarn --deploy-mode cluster --supervise --verbose jarName.jar hdfs:///somePath/Test.txt hdfs:///somePath/Test.out
Try using the fully qualified class name; check the full package path of your class in the project, for example:
--class com.myclass.Test
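In other words, if Test is declared inside a package, --class must be given the fully qualified name; a minimal hypothetical illustration (the package name is invented for the example):
package com.myclass

import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Test").getOrCreate()
    // ... read args(0), write args(1) ...
    spark.stop()
  }
}
// submitted with: spark-submit --class com.myclass.Test ... jarName.jar <in> <out>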
I had the same issue with my Scala Spark application when I tried to run it in "cluster" mode:
--master yarn --deploy-mode cluster
I found the solution on this page. Basically, what I was missing (and which is also missing in your command) is the --jars parameter, which allows you to distribute the application jars to your cluster.
Suggestion: to troubleshoot this kind of error you can use the following command:
yarn logs -applicationId yourApplicationId
where yourApplicationId should appear in your YARN exception log.
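For completeness, the conf.setJars approach mentioned in the question looks roughly like this in Scala (the jar path is illustrative; passing --jars to spark-submit achieves the same distribution without touching the code):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("Test")
  // Illustrative path: point this at the shaded jar so executors can fetch its classes
  .setJars(Seq("hdfs:///somePath/jarName.jar"))

val spark = SparkSession.builder().config(conf).getOrCreate()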

Apache Spark Multi Node Clustering

I am currently working on log analysis using Apache Spark. I am new to Apache Spark. I have tried Spark standalone mode. I can run my code by submitting the jar in client deploy mode, but I cannot run it on the multi-node cluster. The worker nodes are different machines.
sh spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:7077 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
When I run this command, it shows the following error:
15/10/20 18:04:23 ERROR ClientEndpoint: Exception from cluster was: java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
java.io.FileNotFoundException: /home/rishon/loganalyzer.jar (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
at org.spark-project.guava.io.Files.copy(Files.java:436)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
My understanding is that the driver program sends the data and application code to the worker nodes, but I don't know whether that is correct. Please help me run the application on the cluster.
I have tried to run the jar on the cluster and now there is no exception, but why is the task not assigned to the worker nodes?
I have tried it without clustering and it works fine, as shown in the following figure.
The image above shows tasks assigned to the worker nodes. But I have one more problem when analysing the log file: the log files are on the master node in a folder (e.g. '/home/visva/log'), but the worker nodes search for the file on their own file systems.
I met the same problem.
My solution was to upload my .jar file to HDFS.
Then submit like this:
spark-submit --class com.example.RunRecommender --master spark://Hadoop-NameNode:7077 --deploy-mode cluster --executor-memory 6g --executor-cores 3 hdfs://Hadoop-NameNode:9000/spark-practise-assembly-1.0.jar
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
If you use cluster mode with spark-submit, you need to use port 6066 (the default REST port in Spark):
spark-submit --class Spark.LogAnalyzer.App --deploy-mode cluster --master spark://rishon.server21:6066 /home/rishon/loganalyzer.jar "/home/rishon/apache-tomcat-7.0.63/LogAnalysisBackup/"
In my case, I uploaded the application jar to every node in the cluster, because I did not know how spark-submit transfers the application automatically and I did not know how to specify a node as the driver node.
Note: the application jar path must be valid on any node of the cluster.
There are two deploy modes in Spark for running a script.
1. client (default): the driver is launched directly within the spark-submit process, which acts as a client to the cluster (here, the master node).
2. cluster: the driver is launched on one of the worker nodes; if your application is submitted from a machine far from the worker machines, it is common to use cluster mode to minimize network latency between the driver and the executors.
Reference Spark Documentation For Submitting JAR
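Since in cluster mode the driver can end up on any worker, the input data also has to be visible from every node; a hedged sketch of reading the logs from HDFS instead of a master-local folder (paths illustrative):
import org.apache.spark.sql.SparkSession

object LogAnalyzerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LogAnalyzer").getOrCreate()
    // An hdfs:// path is visible from every node, unlike /home/visva/log on the master only
    val logs = spark.sparkContext.textFile("hdfs:///user/visva/log/*")
    println(s"Lines read: ${logs.count()}")
    spark.stop()
  }
}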
