Spark UI not showing completed applications - apache-spark

I'm using a Spark standalone cluster, and the Spark UI is not showing completed applications even though the job ran successfully. Please suggest a fix.

The screenshot shows the Spark Standalone Master UI. It links to the Spark UIs of currently running applications/drivers, and it lists completed applications, but without links.
In order to see the Spark UIs of completed applications, you need the following configuration in spark-defaults.conf:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///path/to/event-log-folder
spark.eventLog.dir file:///path/to/event-log-folder
You also need to run sbin/start-history-server.sh to see the results. The History Server can also be used to look at running applications (via the "show incomplete applications" link in its UI), but on a highly loaded Spark Master the results will appear with some delay.
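As a minimal sketch of the configuration above: both spark.eventLog.dir and spark.history.fs.logDirectory should point at the same folder, so it can help to generate the three lines from one path. The helper name event_log_properties and the folder /tmp/spark-events are my own placeholders, not anything from Spark itself.

```python
from pathlib import PurePosixPath


def event_log_properties(log_dir: str) -> str:
    """Render the three spark-defaults.conf lines for one shared folder.

    Hypothetical helper: it only formats text, it does not touch Spark.
    """
    # as_uri() requires an absolute path and yields a file:/// URI.
    uri = PurePosixPath(log_dir).as_uri()
    lines = [
        "spark.eventLog.enabled true",
        f"spark.history.fs.logDirectory {uri}",
        f"spark.eventLog.dir {uri}",
    ]
    return "\n".join(lines) + "\n"


# /tmp/spark-events is a placeholder folder, not a value from the answer.
print(event_log_properties("/tmp/spark-events"), end="")
```

The output can be appended to conf/spark-defaults.conf. The folder must exist before applications write to it, and the History Server has to be started separately.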

Related

Retain spark node history

How can I retain Spark worker and master node history, such as completed applications and completed drivers, in a cluster? After a restart, all of this history is lost. Is there a specific config to enable for maintaining the history?
I enabled the Spark event log in spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file:////app/spark/logs/data/event_log_dir
But I am still unable to retain the history.
There is a built-in solution: the Spark History Server.
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The Spark UI is available only while the application is running.
The Spark History Server is a tool that lets you see the UI after the application has finished.
More information is in the Spark documentation:
Spark: Monitoring and Instrumentation - Viewing After the Fact

Spark jobs not showing up in Hadoop UI in Google Cloud

I created a cluster in Google Cloud and submitted a spark job. Then I connected to the UI following these instructions: I created an ssh tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via ssh and run spark-shell, this "job" does show up in the hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.
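A hard-coded local master is easy to overlook, so a small guard can help. This is only a sketch: is_local_master is a hypothetical helper of mine, and the pattern covers the local master forms I am aware of (local, local[N], local[*], local[N,F]).

```python
import re

# Matches "local", "local[4]", "local[*]" and "local[4,2]".
_LOCAL_MASTER = re.compile(r"^local(\[[^\]]*\])?$")


def is_local_master(master: str) -> bool:
    """Return True if the master URL runs Spark inside a single local JVM."""
    return bool(_LOCAL_MASTER.match(master.strip()))


for candidate in ("local[*]", "yarn", "spark://169.254.8.45:7077"):
    print(candidate, is_local_master(candidate))
```

In a PySpark job you could pass spark.sparkContext.master to this check and log a warning when it returns True on a cluster. Leaving .master(...) out of the code and passing --master to spark-submit avoids the problem altogether.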

Why are my Spark completed applications still using my worker's disk space?

My DataStax Spark completed applications are using my worker's disk space. As a result, Spark can't run because there is no disk space left.
This is my Spark worker directory. The applications underlined in blue take up 92GB in total, but they shouldn't even exist anymore, since they are completed applications. Thanks for the help; I don't know where the problem lies.
This is my Spark front UI:
Spark doesn't automatically clean up the jars transferred to the worker nodes. If you want it to, and you're running Spark Standalone (YARN handles this differently, so this setting won't apply there), you can set spark.worker.cleanup.enabled to true and set the cleanup interval via spark.worker.cleanup.interval. This allows Spark to clean up the data retained on your workers. You may also configure a default TTL for all application directories via spark.worker.cleanup.appDataTtl.
From the docs of spark.worker.cleanup.enabled:
Enable periodic cleanup of worker / application directories. Note that
this only affects standalone mode, as YARN works differently. Only the
directories of stopped applications are cleaned up.
For more, see Spark Configuration.
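As far as I know, in standalone mode these worker properties are passed as -D system properties through SPARK_WORKER_OPTS in conf/spark-env.sh. The sketch below only builds that line. worker_cleanup_opts is a hypothetical helper, and the default values (1800 seconds for the interval, 7 days for the TTL) are what I believe the Spark docs list, so treat them as assumptions.

```python
def worker_cleanup_opts(interval_s: int = 1800, ttl_s: int = 7 * 24 * 3600) -> str:
    """Build a spark-env.sh line that enables periodic worker cleanup.

    interval_s: how often the worker checks for old application directories.
    ttl_s: how long a stopped application's directory is retained.
    """
    opts = [
        "-Dspark.worker.cleanup.enabled=true",
        f"-Dspark.worker.cleanup.interval={interval_s}",
        f"-Dspark.worker.cleanup.appDataTtl={ttl_s}",
    ]
    return 'export SPARK_WORKER_OPTS="' + " ".join(opts) + '"'


# Example: check every 30 minutes, keep stopped applications for one day.
print(worker_cleanup_opts(interval_s=1800, ttl_s=24 * 3600))
```

The workers need a restart after spark-env.sh changes. Only directories of stopped applications are removed, as the quoted docs say.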

What is the difference between web UIs on 4040 and 8080?

There are two different web UIs (one is for standalone mode only). Can I use the web UI on port 4040 when I launch Spark in standalone mode? (Example: with spark-class.cmd org.apache.spark.deploy.master.Master, the web UI on 8080 works, but the one on 4040 does not.) What is the main difference between these UIs?
Is it possible to launch Spark (without Hadoop, HDFS, YARN, etc.), keep it up, and submit my jars (classes) to it? I want to watch job statistics after a job finishes. I am trying something like this:
Server: Spark\bin>spark-class.cmd org.apache.spark.deploy.master.Master
Worker: Spark\bin>spark-class.cmd org.apache.spark.deploy.worker.Worker spark://169.254.8.45:7077 --cores 4 --memory 512M
Submit: Spark\bin>spark-submit.cmd --class demo.TreesSample --master spark://169.254.8.45:7077 file:///E:/spark-demo/target/demo.jar
It runs. It brings up a new web UI on port 4040 for this task. I don't see anything in the Master's UI on 8080.
Currently I'm using win7 x64, spark-1.5.2-bin-hadoop2.6. I can switch into linux if it matters.
You should be able to change the web UI port for standalone Master using spark.master.ui.port or SPARK_MASTER_WEBUI_PORT as described in Configuring Ports for Network Security / Standalone mode only.
The standalone Master's web UI is the management console of a cluster manager (one that happens to be part of Apache Spark, but could have been a separate product like Hadoop YARN or Apache Mesos). It is often unclear what the two web UIs have in common, and the answer is nothing.
The Spark driver's web UI is to show the progress of your computations (jobs, stages, storage for RDD persistence, broadcasts, accumulators) while standalone Master's web UI is to let you know the current state of your "operating environment" (aka the Spark Standalone cluster).
I leave the other part of your question, about the History Server, to @Sumit's answer.
Yes, you can launch Spark as a standalone server, without any Hadoop or HDFS. As soon as you submit your job to the master, it will show up in the Master UI in either the "Running Applications" or the "Completed Applications" section.
You can also enable the History Server to preserve the job statistics and analyze them later:
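To tell the two UIs apart in practice, you can probe both ports. This is a rough sketch with helper names of my own (port_open, ui_status). It assumes the default ports from the question: 8080 for the standalone Master and 4040 for a running driver.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the UI as down.
        return False


def ui_status(host: str, ports=(("master", 8080), ("driver", 4040))) -> dict:
    """Map each UI name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, port in ports}
```

Calling ui_status("localhost") on the Master machine should report the master entry as up while the cluster is running. The driver entry is only up on the machine that runs the driver, and only while an application is active, which matches the description above.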
./sbin/start-history-server.sh
Refer here for more details on enabling the History Server.

View worker / executor logs in Spark UI since 1.0.0+

In 0.9.0, viewing worker logs was simple: they were one click away from the Spark UI home page.
Now (1.0.0+) I cannot find them. Furthermore, the Spark UI stops working when my job crashes! This is annoying: what is the point of a debugging tool that only works when your application does not need debugging? According to http://apache-spark-user-list.1001560.n3.nabble.com/Viewing-web-UI-after-fact-td12023.html I need to find out what my master URL is, but I don't know how to. Spark doesn't print this information at startup; all it says is:
... -Dspark.master=\"yarn-client\" ...
and obviously http://yarn-client:8080 doesn't work. Some sites talk about how now in YARN finding logs has been super obfuscated - rather than just being on the UI, you have to login to the boxes to find them. Surely this is a massive regression and there has to be a simpler way??
How am I supposed to find out what the master URL is? How can I find my worker (now called executor) logs?
Depending on your configuration of YARN NodeManager log aggregation, the Spark job logs are aggregated automatically. The runtime logs can usually be found in the following ways:
Spark Master Log
If you're running with yarn-cluster, go to the YARN Scheduler web UI. You can find the Spark Master log there: the "log" button on the job description page shows its content.
With yarn-client, the driver runs inside your spark-submit command. What you see there is the driver log, provided log4j.properties is configured to write to stderr or stdout.
Spark Executor Log
Search for "executorHostname" in driver logs. See comments for more detail.
These answers document how to find them from command line or UI
Where are logs in Spark on YARN?
For UI, on an edge node
Look in /etc/hadoop/conf/yarn-site.xml for the yarn resource manager URI (yarn.resourcemanager.webapp.address).
Or use command line:
yarn logs -applicationId <application ID> [OPTIONS]
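Both steps above can be scripted. The following is a sketch under assumptions: hadoop_property and yarn_logs_command are hypothetical helper names, and the application ID in the example is made up.

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text, name: str):
    """Return the value of a named property from Hadoop-style XML, or None.

    xml_text is the content of a file such as yarn-site.xml (str or bytes).
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def yarn_logs_command(application_id: str) -> list:
    """Build the argument list for fetching aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", application_id]


# application_1234567890123_0001 is a made-up ID for illustration.
print(" ".join(yarn_logs_command("application_1234567890123_0001")))
```

On an edge node you would read /etc/hadoop/conf/yarn-site.xml and pass its content to hadoop_property with the name yarn.resourcemanager.webapp.address. The command list can be handed to subprocess.run if the yarn CLI is on the PATH.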

Resources