Logging and Debugging on Qubole - apache-spark

How does one log on Qubole / access logs from Spark on Qubole? The setup I have:
java library (JAR)
Zeppelin Notebook (Scala), simply calling a method from the library
Spark, Yarn cluster
Log4j2 used in the library (configured to log to stdout)
How can I access my logs from the Log4j2 logger? What I have tried so far:
Looking into the 'Logs' section of my Interpreters
Going through Spark UI's stdout logs of each executor
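To make the setup concrete, here is a minimal sketch of the library-side logging, assuming the JAR depends on log4j-api and ships a log4j2.xml with a Console appender; the object and method names are invented for illustration:

import org.apache.logging.log4j.LogManager

object MyLibrary {  // hypothetical entry point of the library
  private val logger = LogManager.getLogger(getClass)  // Log4j2 logger backed by the Console (stdout) appender

  def process(input: String): String = {
    logger.info(s"processing input of length ${input.length}")  // written to stdout
    input.toUpperCase  // stand-in for the real work
  }
}

Because the Console appender writes to stdout, messages logged on the driver end up in the driver log (the AM stdout in yarn-cluster mode, or the client console in yarn-client mode), and messages logged inside executor-side closures end up in each executor's stdout, which is what the UIs described below expose.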

When a Spark job or application fails, you can use the Spark logs to analyze the failures.
The QDS UI provides links to the logs in the Application UI and Spark Application UI.
If you are running the Spark job or application from the Analyze page, you can access the logs via the Application UI and Spark Application UI.
If you are running the Spark job or application from the Notebooks page, you can access the logs via the Spark Application UI.
The two sections below, Accessing the Application UI and Accessing the Spark Application UI, describe how to do this. You can also access additional logs to identify the errors and exceptions in Spark job or application failures.
Accessing the Application UI
To access the logs via the Application UI from the Analyze page of the QDS UI:
Note the command id, which is unique to the Qubole job or command.
Click on the down arrow on the right of the search bar.
The Search History page appears as shown in the following figure.
../../_images/spark-debug1.png
Enter the command id in the Command Id field and click Apply.
Logs of any Spark job are displayed in Application UI and Spark Application UI, which are accessible in the Logs and Resources tabs. The information in these UIs can be used to trace any information related to command status.
The following figure shows an example of the Logs tab with links.
Click on the Application UI hyperlink in the Logs tab or Resources tab.
The Hadoop MR application UI is displayed as shown in the following figure.
../../_images/application-ui.png
The Hadoop MR application UI displays the following information:
MR application master logs
Total Mapper/Reducer tasks
Completed/Failed/Killed/Successful tasks
Note
The MR application master logs correspond to the Spark driver logs. For any Spark driver-related issues, you should verify the AM logs (driver logs).
If you want to check the exceptions of failed jobs, you can click on the logs link in the Hadoop MR application UI page. The Application Master (AM) logs page, which contains stdout, stderr, and syslog, is displayed.
Accessing the Spark Application UI
You can access the logs by using the Spark Application UI from the Analyze page and Notebooks page.
From the Analyze page
From the Home menu, navigate to the Analyze page.
Note the command id, which is unique to the Qubole job or command.
Click on the down arrow on the right of the search bar. The Search History page appears as shown in the following figure.
../../_images/spark-debug1.png
Enter the command id in the Command Id field and click Apply.
Click on the Logs tab or Resources tab.
Click on the Spark Application UI hyperlink.
From the Notebooks page
From the Home menu, navigate to the Notebooks page.
Click on the Spark widget on the top right and click on Spark UI as shown in the following figure.
../../_images/spark-ui.png
OR
Click on the i icon in the paragraph as shown in the following figure.
../../_images/spark-debug2.png
When you open the Spark UI from the Spark widget of the Notebooks page or from the Analyze page, the Spark Application UI is displayed in a separate tab as shown in the following figure.
../../_images/spark-application-ui.png
The Spark Application UI displays the following information:
Jobs: The Jobs tab shows the total number of completed, succeeded, and failed jobs. It also shows the number of stages that succeeded for each job.
Stages: The Stages tab shows the total number of completed and failed stages. If you want to check more details about the failed stages, click on the failed stage in the Description column. The details of the failed stages are displayed as shown in the following figure.
../../_images/spark-app-stage.png
The Errors column shows the detailed error message for the failed tasks. Note the executor ID and the hostname so that you can locate the corresponding container logs, which contain the full error stack trace.
Storage: The Storage tab displays the cached data if caching is enabled.
Environment: The Environment tab shows information about the JVM, Spark properties, system properties, and classpath entries, which helps you find the value of a property used by the Spark cluster at runtime (a programmatic sketch follows the note after this list). The following figure shows the Environment tab.
../../_images/spark-app-env.png
Executors: The Executors tab shows the container logs. You can map the container logs using the executor ID and the hostname, which are displayed in the Stages tab.
Spark on Qubole provides the following additional fields in the Executors tab:
Resident size/Container size: Displays the total physical memory used within the container (which is the executor’s java heap + off heap memory) as Resident size, and the configured yarn container size (which is executor memory + executor overhead) as Container size.
Heap used/committed/max: Displays values corresponding to the executor’s java heap.
The following figure shows the Executors tab.
../../_images/spark-app-exec.png
The Logs column shows the links to the container logs. Additionally, the number of tasks executed by each executor is displayed, with counts of active, failed, completed, and total tasks.
Note
For debugging container memory issues, you can check the statistics on container size, Heap used, the input size, and shuffle read/write.
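For reference, a rough sketch of how the same runtime properties and memory settings could be inspected programmatically, assuming a live SparkSession named spark (for example, in a notebook paragraph):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()  // reuse the session already created by the notebook
// Spark properties as shown on the Environment tab
spark.conf.getAll.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
// Settings that drive the container size shown on the Executors tab (property names may vary by Spark version)
println(spark.conf.get("spark.executor.memory", "not set"))
println(spark.conf.get("spark.executor.memoryOverhead", "not set"))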
Accessing Additional Spark Logs
Apart from accessing the logs from the QDS UI, you can also access the following logs, which reside on the cluster, to identify the errors and exceptions in Spark jobs failures:
Spark History Server Logs: The spark-yarn-org.apache.spark.deploy.history.HistoryServer-1-localhost.localdomain.log files are stored at /media/ephemeral0/logs/spark. The Spark history server logs are stored only on the master node of the cluster.
Spark Event Logs: The Spark event log files are stored at <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/spark-eventlogs, where:
scheme is the Cloud-specific URI scheme: s3:// for AWS; wasb://, adl://, or abfs[s]:// for Azure; oci:// for Oracle OCI.
defloc is the default storage location for the QDS account.
cluster_id is the cluster ID as shown on the Clusters page of the QDS UI.
cluster_inst_id is the cluster instance ID. You should contact Qubole Support to obtain the cluster instance ID.
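If it is more convenient to inspect the event log directory from a notebook than from the Cloud storage console, a rough sketch using the Hadoop FileSystem API might look like the following; the path is only a placeholder and must be replaced with your own scheme, defloc, cluster ID, and cluster instance ID:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// Placeholder: <scheme><defloc>/logs/hadoop/<cluster_id>/<cluster_inst_id>/spark-eventlogs
val eventLogDir = new Path("s3://your-defloc/logs/hadoop/<cluster_id>/<cluster_inst_id>/spark-eventlogs")
val fs = eventLogDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
// Files ending in .inprogress belong to applications that are still running
fs.listStatus(eventLogDir).foreach(s => println(s"${s.getPath.getName}  ${s.getLen} bytes"))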

Related

How to access Spark web UI in Yarn mode

I am running a pyspark job as an AWS EMR step and the script takes well over 15 minutes to run. I have 1 master and 3 core nodes in the EMR cluster. I want to find out why, and which part of my script is taking long. For that I wanted to see the Spark web UI. When I click on "Tracking URL: ApplicationMaster" in the Yarn UI (port 8088), my browser keeps spinning and is unable to display the Spark UI. The URL link in the browser is:
http://ip-172-31-x-x.ec2.internal:20888/proxy/application_1579701541309_1029/
This is obviously a private DNS name. How do I see the Spark UI, even if only on a temporary basis, so I can troubleshoot? I can change the AWS security group if needed. And later, how can this be handled when I am in a production environment?
Thanks

What is 'Active Jobs' in Spark History Server Spark UI Jobs section

I'm trying to understand Spark History server components.
I know that, History server shows completed Spark applications.
Nonetheless, I see 'Active Jobs' set to 1 for a completed Spark application. I'm trying to understand what 'Active Jobs' means in the Jobs section.
Also, the application completed within 30 minutes, but when I opened the History Server after 8 hours, 'Duration' shows 8.0h.
Please see the screenshot.
Could you please help me understand the 'Active Jobs', 'Duration', and 'Stages: Succeeded/Total' items in the above image?
Finally, after some research, I found the answer to my question.
A Spark application consists of a driver and one or more executors. The driver program instantiates SparkContext, which coordinates the executors to run the Spark application. This information is displayed in the 'Active Jobs' section of the Spark History Server web UI.
The executors run tasks assigned by the driver.
When a Spark application runs on YARN, it has its own implementation of the YARN client and YARN application master.
A YARN application has a YARN client, a YARN application master, and a list of containers running on the node managers.
In my case YARN is running in standalone mode, so the driver program runs as a thread of the YARN application master. The YARN client pulls status from the application master, and the application master coordinates the containers to run the tasks.
This running job could be monitored in YARN applications page in the Cloudera Manager Admin Console, while it is running.
If the application succeeds, the History Server shows the list of 'Completed Jobs' and the 'Active Jobs' section is removed.
If the application fails at the container level and YARN communicates this information to the driver, the History Server shows the list of 'Failed Jobs' and the 'Active Jobs' section is likewise removed.
Nonetheless, if the application fails at the container level and YARN cannot communicate that to the driver, the driver-instantiated job ends up in limbo: the driver thinks the job is still running and keeps waiting to hear from the YARN application master about the job status. Hence, in the History Server, it still shows up under 'Active Jobs' as running.
So my takeaway from this is:
To check the status of running job, go to YARN applications page in the Cloudera Manager Admin Console or use YARN CLI command.
After job completion or failure, open the Spark History Server to get more details on resource usage, the DAG, and execution timeline information.
Invoking an action (count, in your case) inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages.
A stage is a physical unit of the execution plan. In short, a stage is a set of parallel tasks, one task per partition. Each job is divided into these smaller sets of tasks, and the stages depend on each other, much like the map and reduce stages in MapReduce.
Each type of Spark stage in detail:
a. ShuffleMapStage in Spark
A ShuffleMapStage is an intermediate Spark stage in the physical execution of the DAG.
It produces data for other stage(s): its output serves as input for the stages that follow it in the DAG.
A ShuffleMapStage can contain any number of pipelined operations, such as map and filter, before the shuffle operation. Furthermore, a single ShuffleMapStage can be shared among different jobs.
b. ResultStage in Spark
A ResultStage is the final stage in a Spark job: it applies a function to one or more partitions of the target RDD to compute the result of an action invoked in the user program.
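To make the two stage types concrete, a small sketch, assuming sc is the notebook's SparkContext: the reduceByKey below introduces a shuffle, so the single count job is planned as one ShuffleMapStage (the map-side work) followed by one ResultStage (computing the count).

val rdd = sc.parallelize(1 to 1000, numSlices = 8)  // 8 partitions => 8 tasks per stage
val counts = rdd
  .map(n => (n % 10, 1))   // pipelined into the ShuffleMapStage along with the shuffle write
  .reduceByKey(_ + _)      // shuffle boundary between the two stages
println(counts.count())    // the action: launches one job with a ShuffleMapStage and a ResultStage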
Coming back to the question of active jobs on the History Server: there are some notes about this in the official history server docs, and there is also a JIRA issue, [SPARK-7889], about the same behavior. Follow that link for more details.
source-1

Running spark application doesn't show up on spark history server

I am creating a long-running Spark application. After the Spark session has been created and the application starts to run, I am not able to see it after clicking on "show incomplete applications" on the Spark History Server. However, if I force my application to close, I can see it under the "completed applications" page.
I have spark parameters configured correctly on both client and server, as follow:
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://10.18.51.117:8020/history/ (a hdfs path on my spark history server)
I also configured the same on the server side, so configuration shouldn't be the issue (since completed applications do show up after I force my application to stop).
Do you guys have any thoughts on this behavior??
When I look at the HDFS files on the Spark History Server, I see a very small .inprogress file associated with my running application (close to empty, see the picture below). It seems that the results get flushed to the file only when the application stops, which is not ideal for my long-running application... Is there any way or parameter we can tweak to force flushing the log?
A very small .inprogress file shown on HDFS while the application is running

Spark history for Standalone Cluster mode

I have seen this text on the Spark website. I am trying to view Spark logs in the UI even after the application has ended or been killed.
Is there any way that I could view the logs in Standalone mode?
If Spark is run on Mesos or YARN, it is still possible to construct the UI of an application through Spark's history server, provided that the application's event logs exist. You can start the history server by executing:
./sbin/start-history-server.sh
This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represents an application’s event logs.
The spark jobs themselves must be configured to log events, and to log them to the same shared, writeable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode/shared/spark-logs
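Equivalently, if the application builds its own session in code rather than relying on spark-defaults.conf, a minimal sketch (the application name is arbitrary; the directory is the shared log directory from the example above) would be:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("event-log-demo")  // arbitrary name for illustration
  .config("spark.eventLog.enabled", "true")  // write event logs for the history server
  .config("spark.eventLog.dir", "hdfs://namenode/shared/spark-logs")  // same directory the history server reads
  .getOrCreate()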

View worker / executor logs in Spark UI since 1.0.0+

In 0.9.0, viewing worker logs was simple: they were one click away from the Spark UI home page.
Now (1.0.0+) I cannot find them. Furthermore, the Spark UI stops working when my job crashes! This is annoying; what is the point of a debugging tool that only works when your application does not need debugging? According to http://apache-spark-user-list.1001560.n3.nabble.com/Viewing-web-UI-after-fact-td12023.html I need to find out what my master-url is, but I don't know how to: Spark doesn't spit out this information at startup, all it says is:
... -Dspark.master=\"yarn-client\" ...
and obviously http://yarn-client:8080 doesn't work. Some sites talk about how finding logs in YARN has become super obfuscated: rather than just being in the UI, you have to log in to the boxes to find them. Surely this is a massive regression and there has to be a simpler way?
How am I supposed to find out what the master URL is? How can I find my worker (now called executor) logs?
Depending on your configuration of YARN NodeManager log aggregation, the Spark job logs are aggregated automatically. Runtime logs can usually be found in the following ways:
Spark Master Log
If you're running with yarn-cluster, go to the YARN Scheduler web UI. You can find the Spark master log there; the "log" button on the job description page gives the content.
With yarn-client, the driver runs in your spark-submit command, so what you see is the driver log, provided log4j.properties is configured to output to stderr or stdout.
Spark Executor Log
Search for "executorHostname" in driver logs. See comments for more detail.
These answers document how to find them from the command line or the UI:
Where are logs in Spark on YARN?
For UI, on an edge node
Look in /etc/hadoop/conf/yarn-site.xml for the yarn resource manager URI (yarn.resourcemanager.webapp.address).
Or use command line:
yarn logs -applicationId [OPTIONS]
