Spark history-server stays empty - apache-spark

I set up Spark alongside Hadoop with YARN as the resource manager. I set both spark.history.fs.logDirectory and spark.eventLog.dir to the same path in my HDFS file system, and spark.eventLog.enabled is set to true. I also checked the history server's logs, but there are no errors (only INFO), so I assume my problem isn't caused by permission errors. I verified that the application logs are actually created in the correct place, and the history server's logs indicate that it is looking in the correct folder.
I don't have any idea why there are no application logs shown in the history server. Maybe I'm missing something fundamental.
Here are all important files (if that helps)
Logs: https://pastebin.com/6TGE3NbQ
spark-defaults.conf: https://pastebin.com/ZRv4JWbV
ansible-playbook.yml: https://pastebin.com/dVqsGENk (Important lines: 166 - 192 and 370)
The ansible-playbook is used to set up the whole cluster.
Edit: The history server is even parsing files (see Logs) but it just refuses to display them.

To have a proper History server setup, you need 3 things (also documented here):
1. Your Spark applications need to write their event logs to a specific directory. This can be done with the following configuration:
spark.eventLog.enabled true
spark.eventLog.dir someDirectory
2. Your History Server needs to be running. This can be done like so:
$SPARK_HOME/sbin/start-history-server.sh
3. Your History Server needs to read the logs from the correct directory. This can be configured like so:
spark.history.fs.logDirectory sameDirectoryAsTheOneAbove
So in your case, it seems like something is going wrong. There are a few things you can verify:
Are your Spark applications correctly writing event logs?
Go to the directory configured in spark.eventLog.dir and check whether there are entries in it. You should have one entry per Spark application that you ran.
Is your History Server running?
There are multiple ways to check this:
Type jps on the machine on which you're running the History Server. You should see a Java process called HistoryServer.
Visit port 18080 on that machine (if local, go to localhost:18080) to see whether it's running.
If your applications are writing to the correct location and your History Server is running, but you still don't see any application entries on port 18080, your History Server might not be reading from the correct directory. Verify the value of spark.history.fs.logDirectory.
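
The configuration part of these checks can be sketched in a few lines of Python. This is only a rough illustration, assuming the simple setup described above where both directories are meant to be identical; the helper names parse_spark_defaults and check_history_config and the sample paths are made up for the example.

```python
import re

def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text: one 'key value' or 'key=value' per line."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = re.match(r"([^\s=]+)\s*=?\s*(.*)", line)
        if match:
            conf[match.group(1)] = match.group(2).strip()
    return conf

def check_history_config(conf):
    """Return a list of problems; an empty list means the basic settings line up."""
    problems = []
    if conf.get("spark.eventLog.enabled", "false").lower() != "true":
        problems.append("spark.eventLog.enabled is not set to true")
    write_dir = conf.get("spark.eventLog.dir")
    read_dir = conf.get("spark.history.fs.logDirectory")
    if not write_dir:
        problems.append("spark.eventLog.dir is not set")
    if not read_dir:
        problems.append("spark.history.fs.logDirectory is not set")
    # Assumption: both settings should name the same directory (ignoring a trailing slash).
    if write_dir and read_dir and write_dir.rstrip("/") != read_dir.rstrip("/"):
        problems.append("event log dir and history log dir differ")
    return problems

# Hypothetical configuration with mismatched directories.
sample = """
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
spark.history.fs.logDirectory hdfs://namenode/shared/spark-history
"""
for problem in check_history_config(parse_spark_defaults(sample)):
    print(problem)
```

This only checks the settings themselves; it does not verify that the History Server process is running or that it can read the files.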

Related

Apache Spark History Server Logs

My Apache Spark application handles giant RDDs and generates EventLogs through the History Server.
How can I export these logs and import them to another computer to view them through History Server UI?
My cluster uses Windows 10 and for some reason, with this OS, the log files don't load if they aren't generated on the machine itself. Using another OS like Ubuntu, I was able to view History Server's logs on the browser.
While running, a Spark application writes events to spark.eventLog.dir (e.g. on HDFS: hdfs://namenode/shared/spark-logs), as configured in spark-defaults.conf.
These are then read by the Spark history server based on the spark.history.fs.logDirectory setting.
Both of these log directories need to be the same, and the Spark history server process must have permission to read those files.
So there will be JSON files in the event log directory for each application. You can access these using the appropriate filesystem commands.
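
Since the event logs are JSON, a copied file can be inspected before loading it into another History Server. A minimal sketch, assuming an uncompressed log with one JSON object per line and an "Event" field on each object; the sample lines and the helper name summarize_event_log are invented for illustration.

```python
import json
from collections import Counter

def summarize_event_log(lines):
    """Count events by type and pick up the application name, if present."""
    counts = Counter()
    app_name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("Event", "unknown")
        counts[kind] += 1
        if kind == "SparkListenerApplicationStart":
            app_name = event.get("App Name")
    return app_name, counts

# Made-up sample content standing in for a real event log file.
sample_lines = [
    '{"Event": "SparkListenerApplicationStart", "App Name": "demo-app"}',
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerJobStart", "Job ID": 1}',
    '{"Event": "SparkListenerApplicationEnd"}',
]

name, counts = summarize_event_log(sample_lines)
print(name, dict(counts))
```

For a real file, the lines could come from open(path) after copying the log out of the event log directory.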

Running spark application doesn't show up on spark history server

I am creating a long-running Spark application. After the Spark session has been created and the application starts to run, I am not able to see it after clicking "show incomplete applications" on the Spark history server. However, if I force my application to close, I can see it under the "completed applications" page.
I have the Spark parameters configured correctly on both client and server, as follows:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://10.18.51.117:8020/history/ (a hdfs path on my spark history server)
I also configured the same on the server side, so configuration shouldn't be a concern (completed applications do show up after I force my application to stop).
Do you guys have any thoughts on this behavior??
When I look at the HDFS files on the Spark history server, I see a very small .inprogress file associated with my running application (close to empty, see the picture below). It seems that the results get flushed to the file only when the application stops, which is not ideal for my long-running application. Is there any way, or any parameter we can tweak, to force flushing the log?
[Image: very small .inprogress file shown on HDFS while the application is running]
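
The event log of a running application carries the .inprogress suffix until the application ends, which is what the file in the question shows. A minimal sketch of splitting a directory listing by that suffix; the file names and the helper name split_by_state are made up.

```python
def split_by_state(file_names):
    """Separate event log names into still-running and finished applications."""
    in_progress = sorted(n for n in file_names if n.endswith(".inprogress"))
    completed = sorted(n for n in file_names if not n.endswith(".inprogress"))
    return in_progress, completed

# Hypothetical listing of an event log directory.
listing = [
    "application_0001_0003.inprogress",
    "application_0001_0001",
    "application_0001_0002",
]

running, finished = split_by_state(listing)
print("running:", running)
print("finished:", finished)
```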

spark history server in yarn mode

Can someone help in understanding the clear difference between spark.eventLog.dir and spark.history.fs.logDirectory?
Also please relate these properties to yarn.nodemanager.remote-app-log-dir and mapreduce.jobhistory.done-dir
Please don't paste documentation as the response :)
I have already gone through the below link, but could not understand.
What's the difference between spark.eventLog.dir and spark.history.fs.logDirectory?
As the linked post says,
spark.eventLog.dir is to write logs while spark.history.fs.logDirectory is the place where Spark History Server reads log events.
In certain scenarios, these could be different. For example, some external, periodic job could move the files being actively written into the history location.
The Spark log directories should be kept separate from those of the YARN and MapReduce history servers.

Spark history for Standalone Cluster mode

I have seen this text on the Spark website. I am trying to view Spark logs in the UI even after the application has ended or been killed.
Is there any way that I could view the logs in Standalone mode?
If Spark is run on Mesos or YARN, it is still possible to construct the UI of an application through Spark’s history server, provided that the application’s event logs exist. You can start the history server by executing:
./sbin/start-history-server.sh
This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represents an application’s event logs.
The spark jobs themselves must be configured to log events, and to log them to the same shared, writeable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
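
The same client-side options could also be passed per job on the command line instead of through spark-defaults.conf. A minimal sketch, assuming spark-submit's --conf key=value form; the helper name event_log_conf_args is hypothetical, and the directory is the one from the quoted example.

```python
def event_log_conf_args(log_dir):
    """Build spark-submit arguments that enable event logging to log_dir."""
    options = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }
    args = []
    for key, value in options.items():
        args += ["--conf", f"{key}={value}"]
    return args

# Print the arguments as they would appear on a spark-submit command line.
print(" ".join(event_log_conf_args("hdfs://namenode/shared/spark-logs")))
```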

How can I pass app-specific configuration to Spark workers?

I have a Spark app which uses many workers. I'd like to be able to pass simple configuration information to them easily (without having to recompile): e.g. USE_ALGO_A. If this was a local app, I'd just set the info in environment variables, and read them. I've tried doing something similar using spark-env.sh, but the variables don't seem to propagate properly.
How can I do simple runtime configuration of my code in the workers?
(PS I'm running a spark-ec2 type cluster)
You need to take care of configuring each worker.
From the Spark docs:
You can edit /root/spark/conf/spark-env.sh on each machine to set Spark configuration options, such as JVM options. This file needs to be copied to every machine to reflect the change.
If you use an Amazon EC2 cluster, there is a script that rsyncs a directory between the master and all workers.
The easiest way to do this is to use a script we provide called copy-dir. First edit your spark-env.sh file on the master, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC it to all the workers.
see https://spark.apache.org/docs/latest/ec2-scripts.html
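
Once a variable such as USE_ALGO_A is present in the environment of the process running your code, however it gets there, reading it could look like the following minimal sketch. The helper name use_algo_a and the set of accepted truthy values are assumptions for illustration, not part of Spark.

```python
import os

def use_algo_a(env=None):
    """Return True when USE_ALGO_A is set to a truthy value in the environment."""
    env = os.environ if env is None else env
    # Assumption: these strings count as "on"; anything else (or unset) is "off".
    return env.get("USE_ALGO_A", "").strip().lower() in {"1", "true", "yes"}

# Choose the code path at runtime, without recompiling.
algorithm = "A" if use_algo_a() else "B"
print("using algorithm", algorithm)
```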
