Spark history for Standalone Cluster mode - apache-spark

I have seen this text on the Spark website. I am trying to view the Spark logs in the UI even after the application has ended or been killed.
Is there any way that I could view the logs in Standalone mode?
If Spark is run on Mesos or YARN, it is still possible to construct the UI of an application through Spark’s history server, provided that the application’s event logs exist. You can start the history server by executing:
./sbin/start-history-server.sh
This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represent an application’s event logs.
The Spark jobs themselves must be configured to log events, and to log them to the same shared, writable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
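To answer the Standalone-mode question: the history server works the same way outside Mesos/YARN, as long as the applications write event logs to a directory the history server can read. A minimal sketch of the history-server side, reusing the hdfs://namenode/shared/spark-logs path from the quote above (substitute your own shared or local directory):

# spark-defaults.conf on the machine that runs the history server
spark.history.fs.logDirectory  hdfs://namenode/shared/spark-logs

# then start the history server and open http://<server-url>:18080
./sbin/start-history-server.sh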

Related

Which directory do Spark applications on YARN output their logs to? spark.eventLog.dir or /var/log/ on each node?

I am building a log analysis platform to monitor Spark jobs on a YARN cluster, and I want to get a clear idea about Spark/YARN logging.
I have searched a lot about this, and these are the confusions I have:
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can we customize those logs through log4j.properties in the Spark conf?
By default, do all data nodes output their executor logs to a folder in /var/log/? With log aggregation enabled, can you get those executor logs into the spark.eventLog.dir location as well?
I've managed to set up a 3-node virtual Hadoop YARN cluster, with Spark installed on the master node. When I'm running Spark in client mode, I'm thinking this node becomes the application master node.
I'm a beginner to Big Data and appreciate any effort to help me out with these confusions.
Spark log4j logging is written to the YARN container stderr logs. The directory for these is controlled by the yarn.nodemanager.log-dirs configuration parameter (the default value on EMR is /var/log/hadoop-yarn/containers).
(spark.eventLog.dir is only used by the Spark History Server to display the web UI after a job has finished. Spark writes events there that encode the information displayed in the UI to persistent storage.)
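To make the distinction concrete, a rough sketch of how you would reach each kind of log (the application ID and the paths are placeholders):

# Executor/driver log4j output lives under yarn.nodemanager.log-dirs on each node while
# the application runs; with YARN log aggregation enabled it can be fetched afterwards with:
yarn logs -applicationId application_1234567890123_0001

# Spark event logs (read only by the history server, not by yarn logs) live under spark.eventLog.dir:
hdfs dfs -ls hdfs://namenode/shared/spark-logs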

Retain spark node history

How do I retain Spark worker and master node history, such as completed applications and completed drivers, in a cluster? When there is a restart, all this history is lost. Is there any specific config to enable for maintaining the history?
I enabled the Spark event log in spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file:////app/spark/logs/data/event_log_dir
But I am still unable to retain the history.
There is an out-of-the-box solution: the Spark History Server.
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The Spark UI is available only while the application is running.
The Spark History Server tool allows you to see the UI after the application has finished.
More information is in the Spark documentation:
Spark: Monitoring and Instrumentation - Viewing After the Fact
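A minimal sketch of the piece that is usually missing in this situation, assuming the same local event log path as above: point the history server at the directory the applications write to, then start it.

# spark-defaults.conf (or -D options in SPARK_HISTORY_OPTS) on the node serving the history UI
spark.history.fs.logDirectory  file:///app/spark/logs/data/event_log_dir

# start the history server; the UI is then available on port 18080 by default
./sbin/start-history-server.sh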

Apache Spark History Server Logs

My Apache Spark application handles giant RDDs and generates event logs for the History Server.
How can I export these logs and import them on another computer to view them through the History Server UI?
My cluster uses Windows 10, and for some reason, with this OS, the log files don't load if they weren't generated on the machine itself. Using another OS like Ubuntu, I was able to view the History Server's logs in the browser.
While running, Spark applications write events to spark.eventLog.dir (for example, HDFS: hdfs://namenode/shared/spark-logs) as configured in spark-defaults.conf.
These are then read by the Spark history server based on the spark.history.fs.logDirectory setting.
Both of these directories need to be the same, and the Spark history server process should have permission to read those files.
The event log directory contains one JSON file per application, and you can access those files using the appropriate filesystem commands.
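A rough sketch of the export/import step the question asks about, assuming HDFS as the source and a local directory on the target machine (the paths and the application ID are placeholders):

# On the source cluster: copy the event log file(s) out of spark.eventLog.dir
hdfs dfs -get hdfs://namenode/shared/spark-logs/app-20230101120000-0001 /tmp/exported-logs/

# On the target machine: put the file(s) into whatever spark.history.fs.logDirectory
# points to there, then (re)start that machine's history server
cp /tmp/exported-logs/app-20230101120000-0001 /path/to/local/spark-logs/
./sbin/start-history-server.sh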

Running spark application doesn't show up on spark history server

I am creating a long-running Spark application. After the Spark session has been created and the application starts to run, I am not able to see it after clicking on "Show incomplete applications" on the Spark history server. However, if I force my application to close, I can see it under the "Completed applications" page.
I have the Spark parameters configured correctly on both client and server, as follows:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://10.18.51.117:8020/history/ (an HDFS path on my Spark history server)
I also configured the same on the server side, so configuration shouldn't be the problem (completed applications do show up after I force my application to stop).
Does anyone have any thoughts on this behavior?
Looking at the HDFS files on the Spark history server, I see a very small .inprogress file associated with my running application (close to empty; see the picture below). It seems that the events get flushed to the file only when the application stops, which is not ideal for my long-running application. Is there any way or parameter we can tweak to force flushing of the log?
(Screenshot: a very small .inprogress file shown on HDFS while the application is running.)
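No answer is recorded for this question, but as a sketch of where to look rather than a confirmed fix: events are buffered before they reach the .inprogress file, and the history server only rescans the log directory periodically. Two settings that are usually relevant here, shown with their defaults; whether tuning them helps in this exact case is something to verify:

# Buffer size used when writing the event log; buffered events may not appear
# in the .inprogress file until the buffer fills or the application ends.
spark.eventLog.buffer.kb          100k

# How often the history server re-scans spark.history.fs.logDirectory for changes.
spark.history.fs.update.interval  10s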

How can I see the aggregated logs for a Spark standalone cluster

With Spark running over YARN, I could simply use yarn logs -applicationId <appId> to see the aggregated log after a Spark job is finished. What is the equivalent method for a Spark standalone cluster?
Via the Web Interface:
Spark’s standalone mode offers a web-based user interface to monitor the cluster. The master and each worker has its own web UI that shows cluster and job statistics. By default you can access the web UI for the master at port 8080. The port can be changed either in the configuration file or via command-line options.
In addition, detailed log output for each job is also written to the work directory of each slave node (SPARK_HOME/work by default). You will see two files for each job, stdout and stderr, with all output it wrote to its console.
Please find more information in Monitoring and Instrumentation.
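In other words, there is no single aggregated log in standalone mode; you collect the per-executor stdout/stderr files from each worker's work directory yourself. A rough sketch, assuming SSH access to the workers and the default SPARK_HOME/work layout (the host names, the /opt/spark path, and the application ID are placeholders):

# On each worker, the files for one application live under the work directory:
#   $SPARK_HOME/work/<app-id>/<executor-id>/stdout and .../stderr
ls "$SPARK_HOME"/work/app-20230101120000-0001/*/

# To aggregate them, copy them from every worker onto one machine, for example:
for host in worker1 worker2 worker3; do
  scp -r "$host:/opt/spark/work/app-20230101120000-0001" "./logs/$host/"
done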
