pull out metrics from spark logs - apache-spark

How do I pull these metrics out of the Spark history logs? Is there some API I can pull them from?
I tried downloading the JSON event logs, but I can't grep for the numbers shown in the screenshot.

The Spark History Server keeps all of that information for you, and you can access it via a REST API.
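For example, here is a minimal Python sketch of pulling per-stage metrics through that REST API. It assumes the history server is reachable at its default port 18080, that the requests package is installed, and that the field names match your Spark version (check the Monitoring and Instrumentation docs for the full list):

import requests

# Placeholder host; use your history server's address.
HISTORY_SERVER = "http://master-public-dns-name:18080"

# List the applications the history server knows about.
apps = requests.get(f"{HISTORY_SERVER}/api/v1/applications").json()
app_id = apps[0]["id"]  # pick the application you care about

# Per-stage metrics: run time, input size, shuffle read, and so on.
stages = requests.get(f"{HISTORY_SERVER}/api/v1/applications/{app_id}/stages").json()
for stage in stages:
    print(stage["stageId"], stage["name"],
          stage["executorRunTime"], stage["inputBytes"], stage["shuffleReadBytes"])
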
If you are on EMR:
You can view the Spark web UIs by following the procedures to create
an SSH tunnel or create a proxy in the section called Connect to the
cluster in the Amazon EMR Management Guide and then navigating to the
YARN ResourceManager for your cluster. Choose the link under Tracking
UI for your application. If your application is running, you see
ApplicationMaster. This takes you to the application master's web UI
at port 20888 wherever the driver is located. The driver may be
located on the cluster's primary node if you run in YARN client mode.
If you are running an application in YARN cluster mode, the driver is
located in the ApplicationMaster for the application on the cluster.
If your application has finished, you see History, which takes you to
the Spark HistoryServer UI at port 18080 on the EMR cluster's
primary node. This is for applications that have already completed.
You can also navigate to the Spark HistoryServer UI directly at
http://master-public-dns-name:18080/.

Related

Determine where spark program is failing?

Is there any way to debug a Spark application that is running in cluster mode? I have a program that has been running successfully for a while, which processes a couple hundred GB at a time. Recently some data caused the run to fail due to executors being disconnected. From what I have read, this is likely a memory issue. I'm trying to determine which function/action is triggering the memory issue. I am using Spark on an EMR cluster (which uses YARN); what would be the best way to debug this issue?
For cluster mode you can go to the YARN ResourceManager UI and select the Tracking UI for your specific running application (which points to the Spark driver running on the ApplicationMaster within the YARN NodeManager) to open the Spark UI, which is the core developer interface for debugging Spark apps.
For client mode you can also go to the YARN RM UI as mentioned above, or hit the Spark UI directly at http://[driverHostname]:4040, where driverHostname is the master node in EMR and 4040 is the default port (this can be changed).
Additionally, you can access submitted and completed Spark apps via the Spark History Server at its default address: http://master-public-dns-name:18080/
These are the essential resources, with the Spark UI being the main toolkit for your request; a sketch of pulling executor metrics through the same UI's REST API follows the links below.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html
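As a rough sketch of how you could pull the executor numbers programmatically rather than clicking through the UI (the hostname and application ID below are placeholders; for a running app the same endpoints are served by the driver UI on port 4040, for a finished app by the history server on 18080):

import requests

BASE = "http://master-public-dns-name:18080/api/v1"   # or http://<driver-host>:4040/api/v1
APP_ID = "application_1234_0001"                      # placeholder application id

# Executor summaries include memory usage and failed-task counts, which helps
# narrow down which stage is losing executors.
executors = requests.get(f"{BASE}/applications/{APP_ID}/executors").json()
for ex in executors:
    print(ex["id"], "memoryUsed:", ex["memoryUsed"],
          "maxMemory:", ex["maxMemory"], "failedTasks:", ex["failedTasks"])
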

Spark jobs not showing up in Hadoop UI in Google Cloud

I created a cluster in Google Cloud and submitted a Spark job. Then I connected to the UI following these instructions: I created an SSH tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via SSH and run spark-shell, this "job" does show up in the Hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.
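If the job is written in PySpark, a minimal sketch of the fix looks like the following (the same idea applies to the Scala builder): build the session without .master(...) and let spark-submit or the cluster decide.

from pyspark.sql import SparkSession

# No .master("local[*]") here: the master/deploy mode comes from spark-submit
# or the cluster configuration (YARN on Dataproc), so the job shows up in YARN.
spark = (SparkSession.builder
         .appName("my-job")   # placeholder app name
         .getOrCreate())
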

How can I see the aggregated logs for a Spark standalone cluster

With Spark running on YARN, I could simply use yarn logs -applicationId <appId> to see the aggregated log after a Spark job is finished. What is the equivalent method for a Spark standalone cluster?
Via the Web Interface:
Spark’s standalone mode offers a web-based user interface to monitor
the cluster. The master and each worker has its own web UI that shows
cluster and job statistics. By default you can access the web UI for
the master at port 8080. The port can be changed either in the
configuration file or via command-line options.
In addition, detailed log output for each job is also written to the
work directory of each slave node (SPARK_HOME/work by default). You
will see two files for each job, stdout and stderr, with all output it
wrote to its console.
Please find more information in Monitoring and Instrumentation.
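For example, a small Python sketch that lists those per-application stdout/stderr files on a worker node (it assumes SPARK_HOME is set in the environment and the default work directory is in use):

import os

work_dir = os.path.join(os.environ.get("SPARK_HOME", "/opt/spark"), "work")

# Each application gets its own subdirectory under work/, with one directory
# per executor containing stdout and stderr.
for root, _dirs, files in os.walk(work_dir):
    for name in files:
        if name in ("stdout", "stderr"):
            print(os.path.join(root, name))
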

Why does spark UI at port 18080 say "cluster mode" when I launched in client mode in the config file?

In my spark-defaults.conf file I have the following line:
spark.master=yarn-client
Now, I launch a job, and I look at the spark UI (which is at <master ip address>:18080) and I see the following at the top of the page:
REST URL: spark://<master ip address>:6066 (cluster mode)
I restarted all of the spark workers and spark master, and distributed the spark-defaults.conf file to all of the spark workers/slaves.
I cannot tell if this is running in cluster mode or client mode? And why is my setting not getting picked up by the spark UI?
The Spark UI running on port 18080 is the Spark History Server. If you want to find out which mode a particular application ran in, go to the History Server at <master ip address>:18080 and click on any ID under App ID, which will take you to that application's Spark Jobs page.
On that page, click on the Environment tab. In that tab, look for the Spark Properties section; under it you will find the spark.master property, which tells you which mode that application ran in.
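If you prefer not to click through the UI, the history server's REST API exposes the same information. A minimal sketch, assuming a Spark version whose monitoring API includes the environment endpoint (the hostname and application id are placeholders):

import requests

BASE = "http://master-ip-address:18080/api/v1"   # your history server
APP_ID = "app-20160101000000-0000"               # placeholder application id

env = requests.get(f"{BASE}/applications/{APP_ID}/environment").json()
props = dict(env["sparkProperties"])   # list of [key, value] pairs
print(props.get("spark.master"))       # e.g. yarn-client, yarn-cluster, spark://...
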

What is the difference between web UIs on 4040 and 8080?

There are two different web UIs (one is for standalone mode only). Can I use the web UI on port 4040 when I am launching Spark in standalone mode? (For example: spark-class.cmd org.apache.spark.deploy.master.Master; the web UI on 8080 is working, but 4040 is not.) What is the main difference between these UIs?
Is it possible for me to launch Spark (without Hadoop, HDFS, YARN, etc.), keep it up, and submit my jars (classes) into it? I want to watch job statistics after a job finishes. I am trying something like this:
Server: Spark\bin>spark-class.cmd org.apache.spark.deploy.master.Master
Worker: Spark\bin>spark-class.cmd org.apache.spark.deploy.worker.Worker spark://169.254.8.45:7077 --cores 4 --memory 512M
Submit: Spark\bin>spark-submit.cmd --class demo.TreesSample --master spark://169.254.8.45:7077 file:///E:/spark-demo/target/demo.jar
It runs, and it brings up a new web UI on port 4040 for this task. I don't see anything in the Master's UI on 8080.
Currently I'm using Win7 x64 with spark-1.5.2-bin-hadoop2.6. I can switch to Linux if it matters.
You should be able to change the web UI port for standalone Master using spark.master.ui.port or SPARK_MASTER_WEBUI_PORT as described in Configuring Ports for Network Security / Standalone mode only.
The standalone Master's web UI is a management console for a cluster manager (one that happens to ship as part of Apache Spark, but could have been a separate product, like Hadoop YARN or Apache Mesos). That said, it can be confusing what the two web UIs have in common, and the answer is nothing.
The Spark driver's web UI is to show the progress of your computations (jobs, stages, storage for RDD persistence, broadcasts, accumulators) while standalone Master's web UI is to let you know the current state of your "operating environment" (aka the Spark Standalone cluster).
I leave the other part of your question about History server to #Sumit's answer.
Yes, you can launch Spark as a standalone server, without any Hadoop or HDFS. As soon as you submit your job to the master, it will show up either in the "Running jobs" or the "Jobs Completed" section of the Master's UI.
You can also enable the History Server to preserve the job statistics and analyze them at a later time:
./sbin/start-history-server.sh
Refer to the Spark Monitoring and Instrumentation documentation for more details on enabling the History Server.
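A minimal sketch of the settings involved, using the standalone master URL from the question; the event log directory below is a placeholder, and spark.history.fs.logDirectory must point at the same location when you run start-history-server.sh:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("TreesSample")                        # placeholder app name
        .setMaster("spark://169.254.8.45:7077")
        .set("spark.eventLog.enabled", "true")            # write event logs for the history server
        .set("spark.eventLog.dir", "file:///E:/spark-events"))  # placeholder log directory

sc = SparkContext(conf=conf)
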
