How to view logs for Spark in HDInsight after the app exits? - apache-spark

While the application is running, I am able to view the logs from the RM UI. But after the application exits, I get this message when trying to view the logs:
Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
I looked around my HDInsight storage but I could not find any log file.

If you are running Spark on YARN, you can use YARN's built-in log aggregation.
According to the official Spark documentation:
If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the “yarn logs” command.
HDInsight clusters support this type of logging. To access the logs, run the command below from a command line:
yarn logs -applicationId <app ID>
To identify the application ID, you might want to open the Hadoop user interface and look under the All Applications section.
Note: To save the entire log to a file, append > TextFile.txt to the above command.
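If you prefer the command line to the Hadoop UI, here is a quick sketch (the application ID shown is only a placeholder) that lists finished applications and saves one application's aggregated log to a file:
# list applications that have already finished, failed, or been killed
yarn application -list -appStates FINISHED,FAILED,KILLED
# dump the aggregated container logs for one of them to a file
yarn logs -applicationId application_1234567890123_0001 > TextFile.txt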

Related

Get Dataproc Logs to Stackdriver Logging

I am running Dataproc and submitting Spark Jobs using the default client-mode.
The logs for the jobs are visible in the GCP console and are available in the GCS bucket. However, I would like to see the logs in Stackdriver Logging.
Currently, the only way I found was to use cluster-mode instead.
Is there a way to push logs to Stackdriver when using client-mode?
This is something the Dataproc team is actively working on and should have a solution for you sometime soon. If you want to file a public feature request to track this, that is an option, but I will try to update this response when the feature is usable by you.
Digging into it a bit, the reason why you can see the logs when using cluster-mode is that we have Fluentd configurations that pick up YARN container logs (userlogs) by default. When running in cluster-mode the driver runs in a YARN container and those logs are picked up by that configuration.
Currently, output produced by the driver is forwarded directly to GCS by the Dataproc agent. In the future there will be an option to have all driver output sent to Stackdriver when starting a cluster.
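In the meantime, one way to locate that driver output in GCS is via the job resource itself; a sketch, assuming the gcloud CLI and a placeholder job ID (driverOutputResourceUri is the GCS prefix the agent writes the driver output under):
gcloud dataproc jobs describe <job-id> --format='value(driverOutputResourceUri)'
# then read the files under that GCS prefix, e.g.:
gsutil cat '<driver-output-uri>*'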
Update:
This feature is now in Beta and is stable to use. When creating a cluster, the property "dataproc:dataproc.logging.stackdriver.job.driver.enable" can be used to toggle whether the cluster will send Job driver logs to Stackdriver. Additionally, you can use the property "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable" to have the cluster associate YARN container logs with the Job they were created by instead of the Cluster they ran on.
Documentation is available here
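As a sketch of how those properties are set at cluster creation time (the cluster name and region are placeholders):
gcloud dataproc clusters create my-cluster --region <region> \
  --properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true'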

How can I see the aggregated logs for a Spark standalone cluster

With Spark running on YARN, I could simply use yarn logs -applicationId <appId> to see the aggregated log after a Spark job is finished. What is the equivalent method for a Spark standalone cluster?
Via the Web Interface:
Spark’s standalone mode offers a web-based user interface to monitor the cluster. The master and each worker has its own web UI that shows cluster and job statistics. By default you can access the web UI for the master at port 8080. The port can be changed either in the configuration file or via command-line options.
In addition, detailed log output for each job is also written to the work directory of each slave node (SPARK_HOME/work by default). You will see two files for each job, stdout and stderr, with all output it wrote to its console.
Please find more information in Monitoring and Instrumentation.
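For example, on any worker node you can list a job's log directory directly (the application and executor IDs below are illustrative; the layout is work/<app-id>/<executor-id>/):
ls $SPARK_HOME/work/app-20180412102530-0000/0/
# stderr  stdout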

Spark history for Standalone Cluster mode

I have seen this text on the Spark website. I am trying to view the Spark logs in the UI even after the application has ended or been killed.
Is there any way I can view the logs in standalone mode?
If Spark is run on Mesos or YARN, it is still possible to construct the UI of an application through Spark’s history server, provided that the application’s event logs exist. You can start the history server by executing:
./sbin/start-history-server.sh
This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represents an application’s event logs.
The spark jobs themselves must be configured to log events, and to log them to the same shared, writeable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
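A minimal sketch of the server side to go with that (the HDFS path is the same illustrative one as above): point the history server at the shared directory via spark.history.fs.logDirectory in conf/spark-defaults.conf, then start it.
spark.history.fs.logDirectory hdfs://namenode/shared/spark-logs
./sbin/start-history-server.sh
The history server UI at http://<server-url>:18080 then lists every application whose event logs are found under that directory.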

Spark client reconnect to YARN cluster

From the official spark documentation (http://spark.apache.org/docs/1.2.0/running-on-yarn.html):
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
Is there a way that a client reconnects back to the driver at some point later to collect the results?
No simple way that I know of.
Broadly, yarn-cluster mode makes sense for production jobs, while yarn-client mode makes sense for interactive and debugging uses where you want to see your application’s output immediately.
For a production job, the simplest approach is perhaps to have your driver ship the results somewhere once it has them (e.g. write them to HDFS, log them, ...).
Usually you can check the logs with
yarn logs -applicationId <app ID>
Check https://spark.apache.org/docs/2.2.0/running-on-yarn.html
If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.
yarn logs -applicationId <app ID>
will print out the contents of all log files from all containers from the given application.
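If you only need the output of a single container (for example, one executor), the same command can be narrowed down; the IDs below are placeholders, and on older Hadoop releases you may also have to pass -nodeAddress:
yarn logs -applicationId application_1234567890123_0042 -containerId container_1234567890123_0042_01_000002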

View worker / executor logs in Spark UI since 1.0.0+

In 0.9.0, viewing worker logs was simple: they were one click away from the Spark UI home page.
Now (1.0.0+) I cannot find them. Furthermore, the Spark UI stops working when my job crashes! This is annoying; what is the point of a debugging tool that only works when your application does not need debugging? According to http://apache-spark-user-list.1001560.n3.nabble.com/Viewing-web-UI-after-fact-td12023.html I need to find out what my master-url is, but I don't know how; Spark doesn't spit out this information at startup, and all it says is:
... -Dspark.master=\"yarn-client\" ...
and obviously http://yarn-client:8080 doesn't work. Some sites talk about how finding logs in YARN has become super obfuscated - rather than just being on the UI, you have to log in to the boxes to find them. Surely this is a massive regression and there has to be a simpler way?
How am I supposed to find out what the master URL is? How can I find my worker (now called executor) logs?
Depending on how your YARN NodeManager log aggregation is configured, the Spark job logs are aggregated automatically. Runtime logs can usually be found in the following ways:
Spark Master Log
If you're running with yarn-cluster, go to the YARN Scheduler web UI. You can find the Spark Master log there. The "log" button on the job description page shows the content.
With yarn-client, the driver runs in your spark-submit command, so what you see in that terminal is the driver log, provided log4j.properties is configured to output to stderr or stdout.
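As an illustration of that last point, a conf/log4j.properties along the lines of Spark's bundled template sends driver output to the console where spark-submit runs:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n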
Spark Executor Log
Search for "executorHostname" in the driver log to find the node each executor ran on; the executor logs live on that node.
These answers document how to find the logs from the command line or the UI:
Where are logs in Spark on YARN?
For the UI, on an edge node:
Look in /etc/hadoop/conf/yarn-site.xml for the yarn resource manager URI (yarn.resourcemanager.webapp.address).
Or use command line:
yarn logs -applicationId <app ID> [OPTIONS]
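For instance, on an edge node you can pull the ResourceManager address straight out of the config (the grep invocation is just a convenience; 8088 is the usual default web UI port):
grep -A1 'yarn.resourcemanager.webapp.address' /etc/hadoop/conf/yarn-site.xml
Then browse to http://<resourcemanager-host>:8088, or fall back to the CLI shown above.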
