spark history server in yarn mode - apache-spark

Can someone help in understanding the clear difference between
spark.eventLog.dir and spark.history.fs.logDirectory?
Also please relate these properties to yarn.nodemanager.remote-app-log-dir and mapreduce.jobhistory.done-dir
Please don't paste documentation as the response :)
I have already gone through the below link, but could not understand.
What's the difference between spark.eventLog.dir and spark.history.fs.logDirectory?

As the linked post says,
spark.eventLog.dir is to write logs while spark.history.fs.logDirectory is the place where Spark History Server reads log events.
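In certain scenarios these could be different: for example, an external periodic job could move the files being actively written into the history location.
The Spark event log directories should be kept separate from the YARN and MapReduce ones. yarn.nodemanager.remote-app-log-dir is where YARN log aggregation stores container logs, and mapreduce.jobhistory.done-dir is where the MapReduce JobHistory Server keeps the history files of finished MapReduce jobs. The Spark History Server reads neither of them.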

Related

Spark history-server stays empty

I set up Spark alongside Hadoop with YARN as the resource manager. I set both spark.history.fs.logDirectory and spark.eventLog.dir to the same path in my HDFS file system. Also, spark.eventLog.enabled is set to true, and I checked the history server's logs, but there are no errors (only INFO). So I assume my problem isn't caused by permission errors. I also verified that the application logs are actually created in the correct place. The history server's logs also indicate that it is looking in the correct folder.
I don't have any idea why there are no application logs shown in the history server. Maybe I'm missing something fundamental.
Here are all important files (if that helps)
Logs: https://pastebin.com/6TGE3NbQ
spark-defaults.conf: https://pastebin.com/ZRv4JWbV
ansible-playbook.yml: https://pastebin.com/dVqsGENk (Important lines: 166 - 192 and 370)
The ansible-playbook is used to set up the whole cluster.
Edit: The history server is even parsing files (see Logs) but it just refuses to display them.
To have a proper History server setup, you need 3 things (also documented here):
Your Spark applications need to write their logs to a certain directory. This can be done with the following configurations:
spark.eventLog.enabled true
spark.eventLog.dir someDirectory
Your History Server needs to be running. This can be done like so:
$SPARK_HOME/sbin/start-history-server.sh
Your History Server needs to read the logs from the correct directory. This can be configured like so:
spark.history.fs.logDirectory sameDirectoryAsTheOneAbove
So in your case, it seems like something is going wrong. There are a few things you can verify:
Are your Spark applications correctly writing event logs?
Go to spark.eventLog.dir and check whether there are entries in there. You should have one entry per Spark application that you ran.
Is your History Server running?
There are multiple ways to check this.
Type jps on the machine on which you're running the History Server. You should see a Java Process called HistoryServer running.
Visit port 18080 on that machine (if local, go to localhost:18080) to see if it's running
If your applications are writing to the correct location, and your history server is running but you still don't see any application entry on port 18080 from the point above, your history server might not be reading from the correct directory. Verify the value of spark.history.fs.logDirectory.
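The directory check from the list above can be scripted. The following is only a minimal sketch, assuming the settings live in a spark-defaults.conf file with one "key value" pair per line; the function names and the sample paths are made up for illustration and are not part of Spark.

```python
import re

# Hypothetical sample content; the HDFS paths are placeholders, not real defaults.
SAMPLE_CONF = """
# event logging
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-history
"""


def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict of key -> value."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Key and value are separated by whitespace (an '=' is tolerated too).
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def check_history_config(conf):
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    problems = []
    if conf.get("spark.eventLog.enabled", "false").lower() != "true":
        problems.append("spark.eventLog.enabled is not true")
    write_dir = conf.get("spark.eventLog.dir")
    read_dir = conf.get("spark.history.fs.logDirectory")
    # An unset key means Spark falls back to a built-in default; flag it so it gets noticed.
    if write_dir is None:
        problems.append("spark.eventLog.dir is not set explicitly")
    if read_dir is None:
        problems.append("spark.history.fs.logDirectory is not set explicitly")
    if write_dir and read_dir and write_dir.rstrip("/") != read_dir.rstrip("/"):
        problems.append(
            "applications write to %s but the History Server reads %s" % (write_dir, read_dir)
        )
    return problems


if __name__ == "__main__":
    for problem in check_history_config(parse_spark_defaults(SAMPLE_CONF)):
        print("WARNING:", problem)
```

Note that two different directories are not necessarily an error (something else may move the files between them), so treat the mismatch message as a hint rather than a verdict.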

Which directory do Spark applications on YARN write their logs to? spark.eventLog.dir or /var/log/ on each node?

I am building a log analysis platform to monitor Spark jobs on a YARN cluster and I want to get a clear idea about Spark/YARN logging.
I have searched a lot about this and these are the confusions I have.
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can we customize those logs through log4j.properties in the Spark conf?
By default, all data nodes write their executor logs to a folder under /var/log/. With log aggregation enabled, can you get those executor logs into the spark.eventLog.dir location as well?
I've managed to set up a 3 node virtual hadoop yarn cluster, spark installed in the master node. When I'm running spark in client mode I'm thinking this node becomes the application master node.
I'm a beginner to Big data and appreciate any effort to help me out with these confusions.
Spark log4j logging is written to the Yarn container stderr logs. The directory for these is controlled by yarn.nodemanager.log-dirs configuration parameter (default value on EMR is /var/log/hadoop-yarn/containers).
(spark.eventLog.dir is a different thing: it is where Spark persists the events that encode the information shown in the web UI. These event logs are only used by the Spark History Server to rebuild the UI after a job has finished.)
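To see that the files under spark.eventLog.dir are UI events rather than log4j output, you can tally the event types in one of them. This is a rough sketch, assuming an uncompressed event log with one JSON object per line, each carrying an "Event" field; the sample lines and the function name are invented for illustration.

```python
import json
from collections import Counter

# Hypothetical, heavily shortened sample of an event log file.
SAMPLE_LINES = [
    '{"Event":"SparkListenerApplicationStart","App Name":"demo"}',
    '{"Event":"SparkListenerJobStart","Job ID":0}',
    '{"Event":"SparkListenerJobEnd","Job ID":0}',
    '{"Event":"SparkListenerApplicationEnd"}',
    '{"Event":"SparkListenerJobSta',  # truncated line, as in a file still being written
]


def summarize_event_log(lines):
    """Count how often each event type occurs in an iterable of event log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore incomplete lines instead of failing
        if isinstance(event, dict):
            counts[event.get("Event", "unknown")] += 1
    return counts


if __name__ == "__main__":
    for name, count in sorted(summarize_event_log(SAMPLE_LINES).items()):
        print(name, count)
```

For a real file you would pass an open file handle instead of the sample list. If event log compression is turned on, the file has to be decompressed first.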

Spark Ui not showing completed applications

I'm using a Spark standalone cluster and the Spark UI is not showing completed applications even though the job ran successfully. Please suggest a fix.
Your screenshot shows the Spark Standalone Master UI. It shows links to the Spark UIs of currently running applications/drivers, and lists the applications that were completed, though without links.
In order to see the Spark UIs of completed applications, you need to have the following configuration in spark-defaults.conf:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///path/to/event-log-folder
spark.eventLog.dir file:///path/to/event-log-folder
You also need to run sbin/start-history-server.sh to see the results. The History Server can also be used to look at running applications (via the "Show incomplete applications" link in its UI), but on a highly loaded Spark Master the results will appear with some delay.
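Once the History Server is running, you can also ask it which applications it knows about through its REST endpoint under /api/v1/applications. The snippet below is only a sketch: it assumes the server listens on localhost:18080 and that each returned application has an "id" and a list of "attempts" with a "completed" flag; the helper names are my own.

```python
import json
from urllib.request import urlopen


def fetch_applications(base_url="http://localhost:18080", timeout=5):
    """Fetch the application list from a running History Server (assumed address)."""
    with urlopen(base_url + "/api/v1/applications", timeout=timeout) as resp:
        return json.load(resp)


def split_by_completion(apps):
    """Split application ids into (completed, incomplete) based on their attempts."""
    completed, incomplete = [], []
    for app in apps:
        attempts = app.get("attempts", [])
        # Treat an application as completed only if every attempt is marked completed.
        done = bool(attempts) and all(a.get("completed", False) for a in attempts)
        (completed if done else incomplete).append(app["id"])
    return completed, incomplete
```

Calling split_by_completion(fetch_applications()) on the History Server host gives two lists of application ids. If both are empty while event log files exist, the server is most likely reading a different directory than the one the applications write to.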

Apache Spark: Yarn logs Analysis

I have a Spark Streaming application, and I want to analyse the logs of the job using Elasticsearch and Kibana. My job runs on a YARN cluster, so the logs are written to HDFS, as I have set yarn.log-aggregation-enable to true. But when I try to do this:
hadoop fs -cat ${yarn.nodemanager.remote-app-log-dir}/${user.name}/logs/<application ID>
I am seeing some encrypted/compressed data. What file format is this? How can I read the logs from this file? Can I use logstash to read this?
Also, if there is a better approach to analyse Spark logs, I am open to your suggestions.
Thanks.
The format is called a TFile, and it is a compressed file format.
Yarn however chooses to write the application logs into a TFile!! For those of you who don’t know what a TFile is (and I bet a lot of you don’t), you can learn more about it here, but for now this basic definition should suffice “A TFile is a container of key-value pairs. Both keys and values are type-less bytes”.
Splunk / Hadoop Rant
There may be a way to edit YARN and Spark's log4j.properties to send messages to Logstash using SocketAppender
However, that method is being deprecated

View worker / executor logs in Spark UI since 1.0.0+

In 0.9.0 viewing worker logs was simple: they were one click away from the Spark UI home page.
Now (1.0.0+) I cannot find them. Furthermore, the Spark UI stops working when my job crashes! This is annoying: what is the point of a debugging tool that only works when your application does not need debugging? According to http://apache-spark-user-list.1001560.n3.nabble.com/Viewing-web-UI-after-fact-td12023.html I need to find out what my master-url is, but I don't know how to. Spark doesn't print this information at startup; all it says is:
... -Dspark.master=\"yarn-client\" ...
and obviously http://yarn-client:8080 doesn't work. Some sites talk about how now in YARN finding logs has been super obfuscated - rather than just being on the UI, you have to login to the boxes to find them. Surely this is a massive regression and there has to be a simpler way??
How am I supposed to find out what the master URL is? How can I find my worker (now called executor) logs?
Depending on how YARN NodeManager log aggregation is configured, the Spark job logs are aggregated automatically. Runtime logs can usually be found in the following ways:
Spark Master Log
If you're running with yarn-cluster, go to the YARN Scheduler web UI. You can find the Spark Master log there: the "log" button on the job description page shows its content.
With yarn-client, the driver runs in your spark-submit command. Then what you see is the driver log, if log4j.properties is configured to output in stderr or stdout.
Spark Executor Log
Search for "executorHostname" in driver logs. See comments for more detail.
These answers document how to find them from command line or UI
Where are logs in Spark on YARN?
For UI, on an edge node
Look in /etc/hadoop/conf/yarn-site.xml for the yarn resource manager URI (yarn.resourcemanager.webapp.address).
Or use command line:
yarn logs -applicationId <application ID> [OPTIONS]
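If you want to pull those aggregated logs from a script (for example to feed a log analysis pipeline), you can wrap the command. This is a small sketch, assuming the yarn CLI is on the PATH and log aggregation is enabled; the function names and the sample application id are placeholders.

```python
import subprocess


def yarn_logs_command(application_id, app_owner=None):
    """Build the argument list for 'yarn logs' for one application."""
    cmd = ["yarn", "logs", "-applicationId", application_id]
    if app_owner:
        # Needed when the logs belong to a different user than the caller.
        cmd += ["-appOwner", app_owner]
    return cmd


def fetch_yarn_logs(application_id, app_owner=None):
    """Run 'yarn logs' and return its output as text; raises if the command fails."""
    result = subprocess.run(
        yarn_logs_command(application_id, app_owner),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call fetch_yarn_logs() on a machine with a YARN client.
    print(" ".join(yarn_logs_command("application_0000000000000_0001")))
```

The command prints the aggregated container logs as plain text, so you don't have to decode the files in the remote log directory yourself.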
