Bluemix Analytics for Apache Spark log file information required

I would like more information when debugging my spark notebook. I have found some log files:
!ls $HOME/notebook/logs/
The files are:
bootstrap-nnnnnnnn_nnnnnn.log
jupyter-nnnnnnnn_nnnnnn.log
kernel-pyspark-nnnnnnnn_nnnnnn.log
kernel-scala-nnnnnnnn_nnnnnn.log
logs-nnnnnnnn.tgz
monitor-nnnnnnnn_nnnnnn.log
spark160master-ego.log
Which applications log to these files and what information is written to each of these files?

When debugging notebooks, the kernel-*-*.log files are the ones you're looking for.
In logical order...
bootstrap-*.log is written when the service starts. There is one file for each start; the timestamp indicates when that happened. It contains output from the startup script, which initializes the user environment, creates kernel specs, prepares the Spark config, and the like.
bootstrap-*_allday.log has a record for each service start and stop on that day.
jupyter-*.log contains output from the Jupyter server. After the initializations from bootstrap-*.log are done, the Jupyter server is started. That's when this file is created. You'll see log entries when notebook kernels are started or stopped, and when a notebook is saved.
monitor-*.log contains output from a monitoring script that is started with the service. The monitoring script has to detect on which port the Jupyter server is listening. Afterwards, it keeps an eye on service activity and shuts down the service when it's been idle too long.
kernel-*-*.log contains output from notebook kernels. Every kernel gets a separate log file; the timestamp indicates when the kernel started, and the second word in the filename indicates the type of kernel.
spark*-ego.log contains output from Spark job scheduling. It's used by the monitoring script to detect whether Spark is active although the notebook kernels are idle.
logs-*.tgz contains archived logs of the respective day. They'll be deleted automatically after a few days.
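For example, to peek at the most recent PySpark kernel log from within a notebook, something along these lines should work (just a sketch; the exact filename depends on the timestamp in your logs directory):
!ls -t $HOME/notebook/logs/kernel-pyspark-*.log | head -1
!tail -n 50 $(ls -t $HOME/notebook/logs/kernel-pyspark-*.log | head -1)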

With the recently enabled "environment" feature in DSX, the logs have moved to directory /var/pod/logs/. You will still see the kernel-*-*.log and jupyter-*.log files for your current session. However, they're not useful for debugging.
In the Spark as a Service backend, each kernel has a Spark driver process which logs to the kernel-*-*.log file. The environment feature comes without Spark, and the kernel itself does not generate output for the log file.
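If you still want to look at them, you can list them from a notebook the same way as before:
!ls /var/pod/logs/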

Related

Spark history-server stays empty

I set up Spark alongside Hadoop with YARN as the resource manager. I set both spark.history.fs.logDirectory and spark.eventLog.dir to the same path in my HDFS file system. spark.eventLog.enabled is set to true, and I checked the history server's logs, but there are no errors (only INFO), so I assume my problem isn't caused by permission errors. I also verified that application logs are actually created in the correct place, which is indeed the case, and the history server's logs indicate that it is looking in the correct folder.
I don't have any idea why there are no application logs shown in the history server. Maybe I'm missing something fundamental.
Here are all the important files (in case that helps):
Logs: https://pastebin.com/6TGE3NbQ
spark-defaults.conf: https://pastebin.com/ZRv4JWbV
ansible-playbook.yml: https://pastebin.com/dVqsGENk (Important lines: 166 - 192 and 370)
The ansible-playbook is used to set up the whole cluster.
Edit: The history server is even parsing files (see Logs) but it just refuses to display them.
To have a proper History Server setup, you need three things (this is also covered in the Spark documentation):
Your Spark applications need to write their logs to a certain directory. This can be done with the following configurations:
spark.eventLog.enabled true
spark.eventLog.dir someDirectory
Your History Server needs to be running. This can be done like so:
$SPARK_HOME/sbin/start-history-server.sh
Your History Server needs to read the logs from the correct directory. This can be configured like so:
spark.history.fs.logDirectory sameDirectoryAsTheOneAbove
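Putting these together, a minimal spark-defaults.conf might look like this (the HDFS path is only a placeholder; use your own shared directory):
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8020/spark-logs
spark.history.fs.logDirectory    hdfs://namenode:8020/spark-logs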
So in your case, it seems like something is going wrong. There are a few things you can verify:
Are your Spark applications correctly writing event logs?
Go to spark.eventLog.dir and check whether there are entries in there. You should have an entry per Spark application that you ran.
Is my History Server running?
There are multiple ways to check this.
Type jps on the machine on which you're running the History Server. You should see a Java process called HistoryServer running.
Visit port 18080 on that machine (if local, go to localhost:18080) to see if it's running.
If your applications are writing to the correct location, and your history server is running but you still don't see any application entry on port 18080 from the point above, your history server might not be reading from the correct directory. Verify the value of spark.history.fs.logDirectory.
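As a rough sketch of those checks (assuming the event log directory is the placeholder HDFS path used above):
hdfs dfs -ls hdfs://namenode:8020/spark-logs    # one entry per application that ran
jps | grep HistoryServer                        # is the HistoryServer process up?
curl -s http://localhost:18080 | head           # does the UI respond on port 18080?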

Spark - Is there a way to cleanup orphaned RDD files and block manager folders (using pyspark)?

I am currently running/experimenting with Spark in a Windows environment and have noticed a large number of orphaned blockmgr folders and rdd files. These are being created when I have insufficient memory to cache a full data set.
I suspect that they are being left behind when processes fail.
At the moment, I am manually deleting them from time to time (when I run out of disk space...). I have also toyed around with a simple file operation script.
I was wondering, are there any pyspark functions or scripts available that would clean these up, or any way to check for them when a process is initiated?
Thanks
As per @cronoik, this was solved by setting the following property:
spark.worker.cleanup.enabled true
In my instance, using both 'local' and 'standalone' modes on a single-node Windows environment, I have set this within the spark-defaults.conf file.
Refer to the documentation for more information: Spark Standalone Mode
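For reference, the related standalone-mode cleanup settings in spark-defaults.conf look roughly like this (the interval and TTL values below are the documented defaults, in seconds, and can be tuned):
spark.worker.cleanup.enabled     true
spark.worker.cleanup.interval    1800
spark.worker.cleanup.appDataTtl  604800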

Apache Spark History Server Logs

My Apache Spark application handles giant RDDs and generates event logs for the History Server.
How can I export these logs and import them to another computer to view them through History Server UI?
My cluster uses Windows 10 and for some reason, with this OS, the log files don't load if they aren't generated on the machine itself. Using another OS like Ubuntu, I was able to view History Server's logs on the browser.
While an application is running, Spark writes events to spark.eventLog.dir (for example on HDFS: hdfs://namenode/shared/spark-logs), as configured in spark-defaults.conf.
These are then read by the Spark History Server based on the spark.history.fs.logDirectory setting.
Both of these settings need to point to the same directory, and the History Server process must have permission to read the files there.
The event log directory will then contain one JSON file per application, which you can access using the appropriate filesystem commands.
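A rough way to move them to another machine (all paths below are placeholders): copy the event log files out of HDFS, for example
hdfs dfs -get hdfs://namenode/shared/spark-logs /tmp/spark-logs
then on the target machine point the History Server at the local copy in spark-defaults.conf,
spark.history.fs.logDirectory file:///tmp/spark-logs
and start it with $SPARK_HOME/sbin/start-history-server.sh.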

Running spark application doesn't show up on spark history server

I am creating a long-running Spark application. After the Spark session has been created and the application starts to run, I am not able to see it after clicking "show incomplete applications" on the Spark History Server. However, if I force my application to close, I can see it under the "completed applications" page.
I have the Spark parameters configured correctly on both client and server, as follows:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://10.18.51.117:8020/history/ (an HDFS path on my Spark History Server)
I also configured the same on the server side, so configuration shouldn't be the concern (since completed applications do show up after I force my application to stop).
Does anyone have any thoughts on this behavior?
Looking at the HDFS files on the Spark History Server, I see a very small .inprogress file associated with my running application (close to empty, see below). It seems that the events get flushed to the file only when the application stops, which is not ideal for my long-running application... Is there any way or parameter we can tweak to force flushing the log?
(Screenshot: very small .inprogress file shown on HDFS while the application is running)

Output file is getting generated on slave machine in apache spark

I am facing an issue while running a Spark Java program that reads a file, does some manipulation, and then generates an output file at a given path.
Everything works fine when the master and slaves are on the same machine, i.e. in standalone cluster mode.
But the problem started when I deployed the same program on a multi-machine, multi-node cluster setup, with the master running at x.x.x.102 and a slave running on x.x.x.104.
Both master and slave have shared their SSH keys and are reachable from each other.
Initially the slave was not able to read the input file; for that I came to know I need to call sc.addFile() before sc.textFile(), and that solved the issue. But now I see the output is being generated on the slave machine in a _temporary folder under the output path, i.e. /tmp/emi/_temporary/0/task-xxxx/part-00000.
In local cluster mode it works fine and generates the output file in /tmp/emi/part-00000.
I came to know that I need to use SparkFiles.get(), but I am not able to understand how and where to use this method.
Till now I am using:
DataFrame dataObj = ...
dataObj.javaRDD().coalesce(1).saveAsTextFile("file:/tmp/emi");
Can anyone please let me know how to call SparkFiles.get()?
In short how can I tell slave to create output file in the machine where driver is running?
Please help.
Thanks a lot in advance.
There is nothing unexpected here. Each worker writes its own part of the data separately. Using the file scheme only means that the data is written to a file in the file system that is local from the worker's perspective.
Regarding SparkFiles, it is not applicable in this particular case. SparkFiles can be used to distribute common files to the worker machines, not to deal with the results.
If for some reason you want to perform writes on the machine used to run the driver code, you'll have to fetch the data to the driver machine first (either with collect, which requires enough memory to fit all the data, or with toLocalIterator, which collects one partition at a time and requires multiple jobs) and use standard tools to write the results to the local file system. In general, though, writing to the driver is not good practice and most of the time it is simply useless.
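To make the collect variant concrete, here is a minimal Java sketch (a hypothetical helper, not part of your program; it assumes the whole result fits in the driver's memory and that /tmp/emi already exists on the driver machine):
void writeToDriverLocalFile(org.apache.spark.sql.DataFrame dataObj) throws java.io.IOException {
    // Bring all rows to the driver; only safe when the result is small enough to fit in memory.
    java.util.List<String> lines = new java.util.ArrayList<>();
    for (org.apache.spark.sql.Row row : dataObj.collectAsList()) {
        lines.add(row.mkString(","));  // one comma-separated line per row
    }
    // Plain Java I/O on the driver writes a single local file.
    java.nio.file.Files.write(java.nio.file.Paths.get("/tmp/emi/part-00000"), lines);
}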

Resources