Retain spark node history - apache-spark

How do I retain Spark worker and master node history, such as completed applications and completed drivers, in a cluster? When there is a restart, all of this history is lost. Is there any specific config to enable for maintaining the history?
I enabled the Spark event log in spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file:////app/spark/logs/data/event_log_dir
But I am still unable to retain the history.

There is an out-of-the-box solution - the Spark History Server:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact

The Spark UI is available only while the application is running.
There is a Spark History Server tool that allows you to see the UI after the application has finished.
More information is in the Spark documentation:
Spark: Monitoring and Instrumentation - Viewing After the Fact
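For reference, a minimal sketch of that setup, reusing the event-log path from the question (the directory must already exist and be readable by the history server):
# spark-defaults.conf - the history server reads the same directory the jobs write to
spark.eventLog.enabled           true
spark.eventLog.dir               file:///app/spark/logs/data/event_log_dir
spark.history.fs.logDirectory    file:///app/spark/logs/data/event_log_dir

# start the history server on a node that can read that directory;
# by default its UI is served at http://<history-server-host>:18080
$SPARK_HOME/sbin/start-history-server.sh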

Related

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster was operational. The server is configured to write the logs to an external object store (MinIO, via the s3a protocol).
Now, whenever I submit Spark jobs, it seems like nothing is being written to the location I'm specifying.
I'm wondering about the following: How can the workers know I have started up the spark history server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to minIO: This shouldn't be the case as I'm running spark submit jobs that read/write files to the same minIO using the same settings
Logs folder does not exist: I was getting these errors before, but then I created a location for the files to be written to, and since then I'm not getting those errors
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are
I just found out the answer: the way your workers will know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in the spark-defaults.conf used by the driver program, which is likely why I couldn't find much info on this: I hadn't added them to my spark-defaults.conf.
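For illustration, a sketch of passing those configs per job; the bucket name and jar are placeholders, and the s3a credential settings from the question would be passed (or kept in spark-defaults.conf) in the same way:
# a sketch only - bucket, master host and jar are placeholders
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://spark-logs/events \
  --conf spark.history.fs.logDirectory=s3a://spark-logs/events \
  my-app.jar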

Which directory do Spark applications on YARN output their logs to: spark.eventLog.dir, or /var/log/ on each node?

I am building a log analysis platform to monitor Spark jobs on a YARN cluster, and I want to get a clear idea about Spark/YARN logging.
I have searched a lot about this, and these are the confusions I have.
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can we customize those logs through log4j.properties in the Spark conf?
By default, all data nodes output their executor logs to a folder under /var/log/. With log aggregation enabled, can you get those executor logs into the spark.eventLog.dir location as well?
I've managed to set up a 3-node virtual Hadoop YARN cluster, with Spark installed on the master node. When I'm running Spark in client mode, I'm thinking this node becomes the application master node.
I'm a beginner to Big data and appreciate any effort to help me out with these confusions.
Spark log4j logging is written to the Yarn container stderr logs. The directory for these is controlled by yarn.nodemanager.log-dirs configuration parameter (default value on EMR is /var/log/hadoop-yarn/containers).
(spark.eventLog.dir is only used by the Spark History Server to display the Web UI after a job has finished. Here, Spark writes events that encode the information displayed in the UI to persisted storage).
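To make the distinction concrete, a sketch with a placeholder application ID: container logs are fetched through YARN itself, while the event log is a separate per-application file under spark.eventLog.dir:
# aggregated container logs (driver and executor stdout/stderr) for a finished job
yarn logs -applicationId application_1234567890123_0042

# the Spark event log, by contrast, is one file per application under spark.eventLog.dir,
# e.g. <spark.eventLog.dir>/application_1234567890123_0042, read only by the History Server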

Spark UI not showing completed applications

I'm using a Spark standalone cluster, and the Spark UI is not showing completed applications even though the job ran successfully. Please suggest.
The current screenshot shows the Spark Standalone Master UI. It shows links to the Spark UIs of currently running applications/drivers, and the completed applications, though without links.
In order to see the Spark UIs of completed applications, you need to have the following configuration in spark-defaults.conf:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///path/to/event-log-folder
spark.eventLog.dir file:///path/to/event-log-folder
You also need to start sbin/start-history-server.sh to see the results. It can also be used for looking at running applications (via the "Show incomplete applications" link on its UI), but on a highly loaded Spark master you'd see some delay before results appear.
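Once the history server is up, one quick way to confirm that completed applications are being picked up is its REST API (the host is a placeholder, 18080 is the default port):
# list the completed applications the history server can see
curl "http://<history-server-host>:18080/api/v1/applications?status=completed"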

Spark history for Standalone Cluster mode

I have seen this text on the Spark website. I am trying to view Spark logs on the UI even after the application has ended or been killed.
Is there any way that I could view the logs in standalone mode?
If Spark is run on Mesos or YARN, it is still possible to construct the UI of an application through Spark's history server, provided that the application's event logs exist. You can start the history server by executing:
./sbin/start-history-server.sh
This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represents an application’s event logs.
The spark jobs themselves must be configured to log events, and to log them to the same shared, writeable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
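As a sketch for standalone mode, the history-server side of the docs' example can also be configured through SPARK_HISTORY_OPTS (in the shell or in conf/spark-env.sh) rather than spark-defaults.conf; the HDFS path is the one from the quoted docs:
# on the machine running the history server
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/shared/spark-logs"
$SPARK_HOME/sbin/start-history-server.sh
# completed standalone applications then appear at http://<server-url>:18080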

Application execution monitoring for Spark job on yarn

I can see the application execution information in detail on the Web UI in Spark standalone mode, but when it comes to YARN, it is gone. So, where can I see the execution information when the job is run on YARN?
You need to configure the Spark history server with YARN, and then start it.
In your spark-defaults.conf file, add the following properties:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://LOCATION/TO/SPARK/EVENT/LOG
spark.yarn.historyServer.address SPARK_HISTORY_SERVER_HOST
spark.history.ui.port SPARK_HISTORY_SERVER_PORT
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.fs.logDirectory hdfs://LOCATION/TO/SPARK/EVENT/LOG
and then start the Spark history server:
$/PATH/TO/SPARK/sbin/start-history-server.sh
P.S. I assume that Spark is already configured with Hadoop/YARN (i.e. you have set the location of the Hadoop configuration files in spark-env.sh).
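For illustration, a sketch of a submission once the above is in place (the class name and jar are placeholders); with spark.yarn.historyServer.address set, the ResourceManager UI's history link for the finished application should redirect to the Spark history server:
# a sketch; class name and jar are placeholders
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
# once the job finishes, its event log appears under hdfs://LOCATION/TO/SPARK/EVENT/LOG
# and the history server at SPARK_HISTORY_SERVER_HOST:SPARK_HISTORY_SERVER_PORT picks it up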
You can debug your application, but I guess there is no UI dedicated to that.
