Which directory do Spark applications on YARN write their logs to: spark.eventLog.dir or /var/log/ on each node? - apache-spark

I am building a log analysis platform to monitor Spark jobs on a YARN cluster, and I want a clear picture of how Spark and YARN logging work.
I have searched a lot about this, and these are the points I am confused about:
Does the directory specified in spark.eventLog.dir or spark.history.fs.logDirectory store all the application master logs, and can those logs be customized through log4j.properties in the Spark conf directory?
By default, do all data nodes write their executor logs to a folder under /var/log/? With log aggregation enabled, can those executor logs also be collected in the spark.eventLog.dir location?
I've managed to set up a 3-node virtual Hadoop YARN cluster, with Spark installed on the master node. When I run Spark in client mode, I assume this node becomes the application master node.
I'm a beginner in big data and would appreciate any help with these points.

Spark log4j logging is written to the YARN container stderr logs. The directory for these is controlled by the yarn.nodemanager.log-dirs configuration parameter (the default value on EMR is /var/log/hadoop-yarn/containers).
(spark.eventLog.dir holds only the event logs that the Spark History Server uses to display the Web UI after a job has finished. Spark writes events that encode the information displayed in the UI to this persisted storage.)
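Since the log4j output ends up in YARN container logs rather than in spark.eventLog.dir, a log analysis tool would typically pull it through the `yarn logs` command. The following is a minimal sketch that only builds the command line; the helper name `yarn_logs_command` is made up for illustration, and the application ID is just an example value. For finished applications this assumes log aggregation is enabled on the cluster.

```python
import shlex


def yarn_logs_command(application_id, container_id=None):
    """Build the argv for `yarn logs`, which prints container logs.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    argv = ["yarn", "logs", "-applicationId", application_id]
    if container_id is not None:
        # Restrict the output to a single container of the application.
        argv += ["-containerId", container_id]
    return argv


# Example application ID (an assumption, substitute your own).
cmd = yarn_logs_command("application_1507907717252_15771")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` on a node where the `yarn` client is installed and configured.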

Related

Communicate to cluster that Spark History server is running

I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster became operational. The server is configured to write the logs to external storage (MinIO, using the s3a protocol).
Now, whenever I submit Spark jobs, nothing seems to be written to the location I specified.
I'm wondering about the following: How can the workers know I have started up the spark history server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to MinIO: this shouldn't be the case, as I'm running spark-submit jobs that read and write files on the same MinIO using the same settings.
Logs folder does not exist: I was getting these errors before, but then I created a location for the files to be written to, and since then I haven't had issues.
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are.
Just found out the answer: your application knows where to store the event logs only if you supply the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
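It is probably also enough to have these in the spark-defaults.conf used by the driver program. I hadn't added them to my spark-defaults.conf, which is probably why I couldn't find much information on this. Note that spark.history.fs.logDirectory is read by the History Server itself, while the spark.eventLog.* settings are the ones the submitted application uses.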
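As a rough illustration of passing the event log settings per job, here is a sketch that assembles a spark-submit command line with `--conf key=value` pairs. The helper name, the `s3a://spark-logs/events` location and `my_job.py` are placeholders, not values from the question. Only the spark.eventLog.* settings are passed, on the assumption that spark.history.fs.logDirectory is set on the History Server side.

```python
def event_log_submit_args(event_log_dir, app_resource):
    """Build a spark-submit argv that enables event logging for one job.

    Hypothetical helper: it only assembles the arguments.
    """
    confs = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }
    argv = ["spark-submit"]
    for key, value in confs.items():
        # Each setting becomes its own --conf key=value pair.
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_resource)
    return argv


# Placeholder bucket path and application file (assumptions).
print(" ".join(event_log_submit_args("s3a://spark-logs/events", "my_job.py")))
```

The same two keys can instead live in spark-defaults.conf, in which case no `--conf` flags are needed.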

Retain spark node history

How can I retain Spark worker and master node history, such as completed applications and completed drivers, in a cluster? After a restart, all of this history is lost. Is there a specific config to enable for maintaining the history?
I enabled the Spark event log in spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file:////app/spark/logs/data/event_log_dir
But I am still unable to retain the history.
There is a built-in solution: the Spark History Server
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
Spark UI is available only while application is running.
There is a Spark History Server tool, that allows you to see the UI after the application is finished.
More information is in Spark documentation:
Spark: Monitoring and Instrumentation - Viewing After the Fact
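The History Server rebuilds the UI from the files in the event log directory, so checking what is actually in that directory is a quick sanity test. The sketch below separates finished logs from logs of applications that are still running, relying on the `.inprogress` suffix Spark appends to event logs that are still being written. The function name and the sample file names are made up for illustration.

```python
def classify_event_logs(names):
    """Split event log file names into (completed, in_progress).

    Assumes in-progress logs carry the ".inprogress" suffix.
    """
    completed, in_progress = [], []
    for name in sorted(names):
        if name.endswith(".inprogress"):
            in_progress.append(name)
        else:
            completed.append(name)
    return completed, in_progress


# Hypothetical directory listing (for a local directory, something like
# os.listdir on the event log path would produce the input).
sample = ["app-20240101120000-0001", "app-20240101130000-0002.inprogress"]
done, running = classify_event_logs(sample)
print("completed:", done)
print("in progress:", running)
```

An empty directory after a job has finished usually means the application never had spark.eventLog.enabled set, not that the History Server lost anything.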

How does Spark prepare executors on Hadoop YARN?

I'm trying to understand the details of how Spark prepares the executors. In order to do this I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all YARN-autogenerated?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.
Ouch...you're asking for quite a lot of details on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (as there are others that Spark supports like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes that have their own ways to deal with Spark deployments).
By default, while submitting a Spark application using spark-submit, the Spark application (i.e. the SparkContext it uses actually) requests three YARN containers. One container is for that Spark application's ApplicationMaster that knows how to talk to YARN and request two other YARN containers for two Spark executors.
You could review the YARN official documentation's Apache Hadoop YARN and Hadoop: Writing YARN Applications to dig deeper into the YARN internals.
While submitting the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol" that requires that the request for the very first YARN container (container 0) uses ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
who delivers the files to each executor
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.
ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
What are the scripts? Are they all YARN-autogenerated?
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
You can find more information in the Spark official documentation's Running Spark on YARN and the Mastering Apache Spark 2 gitbook of mine.
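The container working directory quoted in the question follows YARN's local directory layout, with the user, application ID and container ID as path components. A small sketch for pulling those out of such a path is shown below; the regular expression is an assumption based only on the single path in the question, not a general specification of YARN's layout.

```python
import re

# Pattern inferred from the example path in the question (an assumption).
CONTAINER_DIR = re.compile(
    r"/usercache/(?P<user>[^/]+)"
    r"/appcache/(?P<application>application_\d+_\d+)"
    r"/(?P<container>container_[^/]+)"
)


def parse_container_dir(path):
    """Return user, application and container IDs from a container path, or None."""
    match = CONTAINER_DIR.search(path)
    return match.groupdict() if match else None


path = (
    "/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/"
    "application_1507907717252_15771/"
    "container_1507907717252_15771_01_000002/"
)
print(parse_container_dir(path))
```

The extracted application ID is the same one YARN shows in the ResourceManager UI, which makes it easy to correlate a working directory with its logs.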

spark-submit error: failed in initializing SparkContext for non driver program VMs

Cluster Specifications : Apache Spark on top of Mesos with 5 Vms and HDFS as storage.
spark-env.sh
export SPARK_LOCAL_IP=192.168.xx.xxx #to set the IP address Spark binds to on this node
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so" #to point to your libmesos.so if you use Mesos
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
HADOOP_CONF_DIR="/usr/local/tools/hadoop" #To point Spark towards Hadoop configuration files
spark-defaults.conf
spark.executor.uri hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz
spark.driver.host 192.168.xx.xxx
spark.rpc netty
spark.rpc.numRetries 5
spark.ui.port 48888
spark.driver.port 48889
spark.port.maxRetries 32
I did some experiments with submitting a word-count Scala application in cluster mode. I observed that it executes successfully only when the driver program (containing the main method) runs on the VM it was submitted from. As far as I know, scheduling of resources (VMs) is handled by Mesos. For example, if I submit my application from vm12 and Mesos coincidentally also schedules vm12 for executing the application, it executes successfully. In contrast, it fails if the Mesos scheduler decides to allocate, say, vm15. I checked the stderr logs in the Mesos UI and found this error:
16/09/27 11:15:49 ERROR SparkContext: Error initializing SparkContext.
I also looked through the configuration options of Spark at the following link:
http://spark.apache.org/docs/latest/configuration.html
I tried setting rpc, as it seemed necessary to keep the driver program near the worker nodes in the LAN, but couldn't get much insight.
I also tried uploading my application to HDFS and submitting the application jar file from HDFS. I observed the same behavior.
While connecting Apache Spark with Mesos according to the documentation at http://spark.apache.org/docs/latest/running-on-mesos.html, I also tried configuring spark-defaults.conf and spark-env.sh on other VMs to check whether it runs successfully from at least 2 VMs. That didn't work out either.
Am I missing some conceptual clarity here?
So how can I make my application run successfully regardless of which VM I submit from?

Application execution monitoring for Spark job on yarn

I can see the application execution information in detail on the Web UI in Spark standalone mode, but when it comes to YARN, it is gone. So where can I see the execution information when a job is run on YARN?
You need to configure the Spark History Server with YARN and then start it.
In your spark-defaults.conf file, add the following properties:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://LOCATION/TO/SPARK/EVENT/LOG
spark.yarn.historyServer.address SPARK_HISTORY_SERVER_HOST
spark.history.ui.port SPARK_HISTORY_SERVER_PORT
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.history.fs.logDirectory hdfs://LOCATION/TO/SPARK/EVENT/LOG
and then start the Spark History Server:
$ /PATH/TO/SPARK/sbin/start-history-server.sh
P.S. I assume that Spark is already configured with hadoop/yarn (so you have set the location of configuration files in spark-env.sh)
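Once the History Server is running, it also exposes a REST API under /api/v1, which is handy for checking that applications are being picked up. The sketch below only builds the URL; the host name is a placeholder, and 18080 is the default value of spark.history.ui.port, so adjust it if you set SPARK_HISTORY_SERVER_PORT to something else.

```python
from urllib.parse import urlunsplit


def history_api_url(host, port=18080, resource="applications"):
    """Build a History Server REST API URL such as .../api/v1/applications."""
    return urlunsplit(("http", f"{host}:{port}", f"/api/v1/{resource}", "", ""))


# "history-host" is a placeholder for SPARK_HISTORY_SERVER_HOST.
print(history_api_url("history-host"))
```

Fetching that URL (for example with `urllib.request.urlopen`) should return a JSON list of the applications whose event logs the server has found.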
You can debug your application, but I guess there is no UI dedicated to that.
