Spark webUI - completed application details page - apache-spark

I use spark 1.1.0 on a standalone cluster with 3 nodes.
I want to see the detailed logs of Completed Applications so I've set in my program :
set("spark.eventLog.enabled","true")
set("spark.eventLog.dir","file:/tmp/spark-events")
but when I click on the application in the webui, I got a page with the message :
Application history not found (app-20150126000651-0331)
No event logs found for application xxx$ in file:/tmp/spark-events/xxx-1422227211500. Did you specify the correct logging directory?
despite the fact that the directory exist and contains 3 files :
APPLICATION_COMPLETE*, EVENT_LOG_1* and SPARK_VERSION_1.1.0*
Any suggestion to solve the problem ?
Thanks.

why is your application name xxx$ and then xxx in your error message ? Is that really what Spark reports ?
Permissions problem : check that the directory in which you log is readable and executable by the user under which you run Spark (and that the inside files a readable as well).
Check that you do specify master correctly, i.e. --master spark://<localhostname>:7077
Dig in the EVENT_LOG_1* file. The last event (on the last line) of the file should be an "Application Complete" event. If it doesn't, it's likely that your application did not call sc.stop(), though the logs should still show up nonetheless.

I had the same error "Did you specify the correct logging directory?" and for me the fix was to add a '/' at the end of the path for 'spark.eventLog.dir' i.e. /root/ephemeral-hdfs/spark-events/
>> cat spark/conf/spark-defaults.conf
spark.eventLog.dir /root/ephemeral-hdfs/spark-events/
spark.executor.memory 5929m

Related

Airflow can't reach logs from webserver due to 403 error

I use Apache Airflow for daily ETL jobs. I installed it in Azure Kubernetes Service using the provided Helm chart. It's been running fine for half a year, but since recently I'm unable to access the logs in the webserver (this used to always work fine).
I'm getting the following error:
*** Log file does not exist: /opt/airflow/logs/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** Fetching from: http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log'
For more information check: https://httpstatuses.com/403
What have I tried:
I've made sure that the log file exists (I can exec into the airflow-worker-0 pod and read the file on command line in the location specified in the error).
I've rolled back my deployment to an earlier commit from when I know for sure it was still working, but it made no difference.
I was using webserverSecretKeySecretName in the values.yaml configuration. I changed the secret to which that name was pointing (deleted it and created a new one, as described here: https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key) but it didn't work (no difference, same error).
I changed the config to use a webserverSecretKey instead (in plain text), no difference.
My thoughts/observations:
The error states that the log file doesn't exist, but that's not true. It probably just can't access it.
The time is the same in all pods (I double checked be exec-ing into them and typing date in the command line)
The webserver secret is the same in the worker, the scheduler, and the webserver (I double checked by exec-ing into them and finding the corresponding env variable)
Any ideas?
Turns out this was a known bug with the latest release (2.4.0) of the official Airflow Helm chart, reported here:
https://github.com/apache/airflow/discussions/26490
Should be resolved in version 2.4.1 which should be available in the next couple of days.

spark.worker.cleanup does not work, logs are not deleted

I want periodically cleanup the log files that stored in the ${SPARK_HOME}/logs for our spark cluster (1 master + 4 workers).
The default log directory for spark logs should be ${SPARK_HOME}/logs, since I didn't configure the SPARK_LOG_DIR in spark-env, so all logs are being stored there.
In order to test it, I have added the conf below (spark.worker.cleanup.enabled) in one of the worker node.
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.interval=300 -Dspark.worker.cleanup.appDataTtl=300"
And then executed stop-slave.sh to stop the worker node, and start the worker with start-slave.sh.
But those log files in ${SPARK_HOME}/logs are not being deleted after the configured interval time.
I would like to know am I doing the right step ? Or something more has to be done? I also put that spark.worker.cleanup conf in the master node's spark-env.sh. And I don't see any effect there also.
I think I was a bit confused on which folder to cleanup. In the spark document, it mentioned the spark.worker.cleanup.enabled is only cleanup worker "APPLICATION" directory.
And our application directory is located at "spark-2.3.3-bin-hadoop2.7/work", and that directory did get cleaned up .
So after changing the spark-env.sh, and then stop-slave, and then start-slave again. Everything working good.

stackdriver logging agent not showing logs read from a custom log file in stackdriver logging viewer on Google cloud platform

I decided to post this question because, I have ran out of debugging ideas, just ideas are golden since I know it can be difficult to help debugging a virtual instance through here (debugging code is hard enough jaja). Anyway, I have created a virtual machine in Compute engine , I created a logs file that I populate, for example, with this command in a python script, let's call it logging.py:
import logging
logging.basicConfig(filename= 'app.log' , level = logging.INFO , format = ' %(asctime)s - %(name) - %(levelname)s - %(message)s')
logging.info('Some message ' + str(type(variable)))
everytime I use python3 logging.py , the app.log is effectively populated. ( Logging.py and app.log are in the same directory the /home/username/ folder )
I want stackdriver to show this log in the logging viewer everytime it's written, so , I installed the stackdriver agent as follows, in the virtual machine command line:
$ curl -sSO https://dl.google.com/cloudagents/install-logging-agent.sh
$ sudo bash install-logging-agent.sh
No errors that I see are delivered here, in fact, you can see here the messages obtained
Messags on the stackdriver viewer:
After this, I proceed to create a .conf file that I create in /etc/google-fluentd/config.d/app.conf
with this parameters
<source>
type tail
format none
path /home/username/app.log
pos_file /var/lib/google-fluentd/pos/app.pos
read_from_head true
tag whatever-tag
</source>
After that is created, I launch sudo service google-fluentd restart.
Aftert I execute, python3 logging.py , no logs are added to stack drivers logging viewer.
So, where might Have I gone wrong?
Things I have tried/checked:
-Have more than 13 gygabytes of RAM available
-If I run logger "some message" on the command line, I effectively add a log with "some message" to the log viewer
-If I run
ps ax | grep fluentd
I obtain :
3033 ? Sl 0:09 /opt/google-fluentd/embedded/bin/ruby /usr/sbin/google-fluentd --log /var/log/google-fluentd/google-fluentd.log --no-supervisor
3309 pts/0 S+ 0:00 grep --color=auto fluentd
-Both my user, and the service account I use, have logger admin permission in IAM roles.
-This is the documentation I have based myself on:
https://cloud.google.com/logging/docs/agent/troubleshooting?hl=es-419
https://cloud.google.com/logging/docs/reference/v2/rest/v2/entries/list?hl=es-419
https://cloud.google.com/logging/docs/agent/configuration?hl=es-419
https://medium.com/google-cloud/how-to-log-your-application-on-google-compute-engine-6600d81e70e3
https://cloud.google.com/logging/docs/agent/installation
-If I run sudo service google-fluentd status , the agent appears active.
-My instance hass access, to all the apis. It's an n1-standard-4 (4 vCPUs, 15 GB of memory) using ubuntu linux 18:04
So, what else can I check to debug this? I'm out of ideas here , hope I'm not being an idiot here :(
Based on my understanding, I think that you looking for the following fluentd resource types:
generic_node
“A generic node identifies a machine or other computational resource for which no more specific resource type is applicable. The label values must uniquely identify the node.”
generic_task
“A generic task identifies an application process for which no more specific resource is applicable, such as a process scheduled by a custom orchestration system. The label values must uniquely identify the task.”
The source of my information has been found here
This document explain how to send logs from your application in different ways:
Cloud Logging API
Cloud Logging Agent
Generic fluentd
As you mentioned having installed fluentd, let me provide more focused documentation about Cloud Logging Agent. I also found some python Client Library documentation that you may be interested.
Finally, I found a nginx/apache use-case guide that you may use as reference.
For some reason, if I change the directory to which both the .conf file points, and the directory where the logg is to /var/logs/ , being the final path as /var/logs/app.logs, it does work correctly. Possibly there is a configuration issue, causing the logging agent to only capture logs in specific predetermined folders, or a permissions issue that stops it from working if the log is in the username directory.
I found this solution, however, by chance(random testing basically.
). Did not find anything in the main articles that are supposed to teach me how to configure the logging agent, that could point me in the right direction, being those articles this ones,
https://cloud.google.com/logging/docs/agent/troubleshooting?hl=es-419 https://cloud.google.com/logging/docs/reference/v2/rest/v2/entries/list?hl=es-419 https://cloud.google.com/logging/docs/agent/configuration?hl=es-419 https://medium.com/google-cloud/how-to-log-your-application-on-google-compute-engine-6600d81e70e3 https://cloud.google.com/logging/docs/agent/installation
If I needed it to work in my username directory, it's not clear just by checking this articles how to do it,what configuration file I would need to change or where to start, so I recommend to google to improve that aspect of the docs.
This documentation you have sent https://docs.fluentd.org/quickstart is pretty interesting, maybe I can find the explanation there, thank you for your help.

spark-submit in cluster deploy mode get application id to console

I am stuck in one problem which I need to resolve quickly. I have gone through many posts and tutorial about spark cluster deploy mode, but I am clueless about the approach as I am stuck for some days.
My use-case :- I have lots of spark jobs submitted using 'spark2-submit' command and I need to get the application id printed in the console once they are submitted. The spark jobs are submitted using cluster deploy mode. ( In normal client mode , its getting printed )
Points I need to consider while creating solution :- I am not supposed to change code ( as it would take long time, cause there are many applications running ), I can only provide log4j properties or some custom coding.
My approach:-
1) I have tried changing the log4j levels and various log4j parameters but the logging still goes to the centralized log directory.
Part from my log4j.properties:-
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried to add custom listener and I am able to get the spark application id after the applications finishes , but not to console.
Code logic :-
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
for (Thread t : Thread.getAllStackTraces().keySet())
{
if (t.getName().equals("main"))
{
System.out.println("The current state : "+t.getState());
Configuration config = new Configuration();
ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
// some logic to write to communicate with the main thread to print the app id to console.
}
}
}
3) I have enabled the spark.eventLog to true and specified a directory in HDFS to write the event logs from spark-submit command .
If anyone could help me in finding an approach to the solution, it would be really helpful. Or if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I was finally able to get a solution to my problem.
After going through the Spark Code for the cluster deploy mode and some blogs, few things got clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread from the machine from which the user is submitting. Actually I was passing the log4j configs to the driver and executors, but missed out on the part that the log 4j configs for the "Client" was missing.
So we need to use :-
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
To clarify:
client mode means the Spark driver is running on the same machine you ran spark submit from
cluster mode means the Spark driver is running out on the cluster somewhere
You mentioned that it is getting logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode you just can't see it because it is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone spark cluster? How can I log from executors? Logging from the driver is easy using the log4j logger that we can derive from spark-context.
But how can I log from within an RDD's foreach or a foreachPartition? Is there any way I can collect these logs and print?
The answer to this is to import python logging and to write the messages using logging and the logged messages will be in the work directory which is created under the spark installation location
There is nothing else which is needed
I went crazy modifying log4j.properties file and adding driver-java-option and spakrk.executor.extraJavaOptions
In your spark program, import logging add log messages straightaway as
logging.warning(whatever is your message and variable values you want to check)
Then if you navigate to the work directory - if i have installed spark at /home/vagrant/spark then we are talking about /home/vagrant/spark/work directory
There will be a directory for each application.And the workers used for the application will have numbers 0, 1, 2, 3 etc.You have to check in each worker.And whichever worker your executor was created to execute the task in the stderr you will see the logging messages
Hope this helps to see the user logged messages on the executor when using the spark standalone cluster mode

Resources