PySpark logging from the executor in a standalone cluster

This question has answers covering how to do this on a YARN cluster. But what if I am running a standalone Spark cluster? How can I log from the executors? Logging from the driver is easy using the log4j logger that we can derive from the SparkContext.
But how can I log from within an RDD's foreach or foreachPartition? Is there any way I can collect these logs and print them?

The answer is to import Python's logging module and write the messages using logging; the logged messages will be in the work directory, which is created under the Spark installation location.
Nothing else is needed.
I went crazy modifying the log4j.properties file and adding --driver-java-options and spark.executor.extraJavaOptions, and none of it was necessary.
In your Spark program, import logging and add log messages straight away, e.g.:
logging.warning("whatever your message is, with the variable values you want to check: %s", value)
Then navigate to the work directory: if I have installed Spark at /home/vagrant/spark, then we are talking about the /home/vagrant/spark/work directory.
There will be a directory for each application, and the workers used for the application will have numbered subdirectories (0, 1, 2, 3, etc.). You have to check each one; in the stderr of whichever worker your executor ran the task on, you will see the logging messages.
Hope this helps to see user-logged messages on the executors when using Spark standalone cluster mode.
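
For a concrete picture, here is a minimal sketch of what that looks like inside foreachPartition; the rdd is assumed to already exist, and the path in the comment follows the work-directory layout described above:

import logging

def process_partition(rows):
    # This function runs in the executor's Python worker process,
    # so basicConfig here configures logging on the executor side.
    logging.basicConfig(level=logging.WARNING)
    for row in rows:
        logging.warning("processing row: %s", row)

# Assuming an existing rdd; the messages end up in
# <spark install>/work/<app-id>/<worker-number>/stderr.
rdd.foreachPartition(process_partition)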

Related

Capture PySpark error logs and traceback in project logger

I have a complex logger in my Python project, with multiple handlers (stream, file, ES, etc.).
I want to capture the errors from pyspark's py4j and write them to the log file created by the package logger.
I'm running pyspark in client mode, so my driver is outside the cluster. I create the Spark context in my code and run it by triggering the script, not by doing spark-submit.
After some research, I learned that I can use a log4j.properties file to capture the logs, but I don't see how I can use it in the client-mode setup.
I create the Spark context and log like this:
from pyspark.sql import SparkSession
import logging

logger = logging.getLogger(__name__)
logger.info("creating spark context")
spark = SparkSession.builder.appName("myapp").getOrCreate()
logger.info("done")
# some pyspark-related code that throws an error
The project logger currently doesn't capture the logs from pyspark directly. I want to capture those logs as well, using log4j, in client mode.
How can I do that? Is there some Spark-related config I can tweak while creating the Spark context?
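
For illustration, here is a minimal, untested sketch of the kind of wiring I have in mind. It assumes py4j's Python side logs under the "py4j" logger name, and it would only capture Python-side records, not the JVM's log4j output:

import logging

# The project logger with one of its handlers (a file handler here).
logger = logging.getLogger("myproject")
file_handler = logging.FileHandler("project.log")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(file_handler)

# Attach the same handler to py4j's logger so its records land in
# the same file ("py4j" as the logger name is an assumption).
py4j_logger = logging.getLogger("py4j")
py4j_logger.setLevel(logging.ERROR)
py4j_logger.addHandler(file_handler)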

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically, when I want to analyze the performance of a job or understand why it failed, I look at the Spark history server for DAG visualizations, job errors, etc.
For example, if the job failed due to a heap error, a FetchFailed, etc., I can see it clearly specified in the Spark history server.
However, I can't seem to find such descriptions in the stderr log files that are written to the log URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level with:
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but as it's a bit too long to fit in a comment, I am posting it here as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that EMR saves to S3.
To get the Spark history server logs written to S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try adding something like this to your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
}
...
I found this interesting post describing how to set this up for a Kubernetes cluster; you may want to check it for further details.
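
For context, here is an untested sketch of where that configuration block sits if you create the cluster with boto3; the cluster name, release label, roles, and instance settings are all placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# The Configurations entry is the part that routes Spark event logs
# to S3 so a history server can read them; everything else here is a
# placeholder cluster definition.
emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.3.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.eventLog.dir": "s3a://your_bucket/spark_logs",
                "spark.history.fs.logDirectory": "s3a://your_bucket/spark_logs",
                "spark.eventLog.enabled": "true",
            },
        }
    ],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)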

Pipeline Transform Logging on Apache Beam in Dataproc

Recently, I deployed a very simple Apache Beam pipeline to get some insight into how it behaved executing in Dataproc as opposed to on my local machine. I quickly realized that after executing it, any DoFn or transform-level logging didn't appear within the job logs in the Google Cloud Console as I would have expected, and I'm not entirely sure what might be missing.
All of the high level logging messages are emitted as expected:
// This works
log.info("Testing logging operations...")

pipeline
    .apply(Create.of(...))
    .apply(ParDo.of(LoggingDoFn))
The LoggingDoFn class here is a very basic transform that emits each of the values that it encounters as seen below:
object LoggingDoFn : DoFn<String, ...>() {
    private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)

    @ProcessElement
    fun processElement(c: ProcessContext) {
        // This is never emitted within the logs
        log.info("Attempting to parse ${c.element()}")
    }
}
As detailed in the comments, I can see logging messages outside of the processElement() calls (presumably because those are being executed by the Spark runner), but is there a way to easily expose those within the inner transform as well? When viewing the logs related to this job, we can see the higher-level logging present, but no mention of an "Attempting to parse ..." message from the DoFn.
The job itself is being executed by the following gcloud command, which has the driver log levels explicitly defined, but perhaps there's another level of logging or configuration that needs to be added:
gcloud dataproc jobs submit spark --jar=gs://some_bucket/deployment/example.jar --project example-project --cluster example-cluster --region us-example --driver-log-levels com.example=DEBUG -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
To summarize, log messages are not being emitted to the Google Cloud Console for tasks that would generally be assigned to the Spark runner itself (e.g. processElement()). I'm unsure if it's a configuration-related issue or something else entirely.

spark.worker.cleanup does not work, logs are not deleted

I want to periodically clean up the log files stored in ${SPARK_HOME}/logs on our Spark cluster (1 master + 4 workers).
The default log directory for Spark logs should be ${SPARK_HOME}/logs, since I didn't configure SPARK_LOG_DIR in spark-env.sh, so all logs are stored there.
In order to test it, I added the configuration below (spark.worker.cleanup.enabled) on one of the worker nodes:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=300 \
  -Dspark.worker.cleanup.appDataTtl=300"
I then executed stop-slave.sh to stop the worker node, and started the worker again with start-slave.sh.
But the log files in ${SPARK_HOME}/logs are not deleted after the configured interval.
Am I doing the right steps, or does something more have to be done? I also put the spark.worker.cleanup conf in the master node's spark-env.sh, and I don't see any effect there either.
I think I was a bit confused about which folder gets cleaned up. The Spark documentation mentions that spark.worker.cleanup.enabled only cleans up the worker's APPLICATION directories.
Our application directory is located at spark-2.3.3-bin-hadoop2.7/work, and that directory did get cleaned up.
So after changing spark-env.sh, running stop-slave.sh, and then start-slave.sh again, everything worked fine.
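
Note that spark.worker.cleanup.* still leaves the daemon logs in ${SPARK_HOME}/logs alone, so those need separate rotation. A minimal sketch of one way to age them out, with the path and retention picked arbitrarily:

import os
import time

SPARK_LOG_DIR = "/opt/spark/logs"   # assumed ${SPARK_HOME}/logs location
MAX_AGE_SECONDS = 7 * 24 * 3600     # assumed retention: one week

now = time.time()
for name in os.listdir(SPARK_LOG_DIR):
    path = os.path.join(SPARK_LOG_DIR, name)
    # Remove plain files whose last modification is older than the cutoff.
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        os.remove(path)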

spark-submit in cluster deploy mode get application id to console

I am stuck on a problem that I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but I am still clueless about the approach, having been stuck for some days.
My use case: I have lots of Spark jobs submitted using the spark2-submit command, and I need the application id printed to the console once they are submitted. The jobs are submitted in cluster deploy mode. (In normal client mode, it gets printed.)
Points I need to consider while creating the solution: I am not supposed to change code (as that would take a long time, since there are many applications running); I can only provide log4j properties or some custom coding.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Excerpt from my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I also tried adding a custom listener, and I am able to get the Spark application id after the application finishes, but not print it to the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to the console
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified a directory in HDFS to write the event logs to from the spark-submit command.
If anyone could help me find an approach to the solution, it would be really helpful. Or, if I am doing something very wrong, any insights would help.
Thanks.
After being stuck in the same place for some days, I was finally able to find a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread from the machine the user is submitting from. I was actually passing the log4j configs to the driver and executors, but missed the fact that the log4j config for the "Client" was missing.
So we need to use :-
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
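
For reference, a minimal sketch of the kind of log4j.properties that could be passed this way, reusing the yarn.Client logger name from the attempts above (the layout pattern is just an assumption):

# Console appender for the submitting JVM (the "Client").
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Keep the root quiet, but let yarn.Client (which reports the
# application id during submission) reach the console.
log4j.rootLogger=WARN, console
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO, console
log4j.additivity.org.apache.spark.deploy.yarn.Client=false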
To clarify:
client mode means the Spark driver is running on the same machine you ran spark-submit from
cluster mode means the Spark driver is running out on the cluster somewhere
You mentioned that the application id gets logged when you run the app in client mode and you can see it in the console. Your output is also logged when you run in cluster mode; you just can't see it because it is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID (a rough sketch of this idea follows below).
Write the appIDs to some shared location like HDFS or a database. You might be able to use a log4j appender if you want to keep log4j.
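
If you go the parsing route without keeping log4j, a rough sketch of a wrapper that scrapes the application id from the submission output; the job file name is a placeholder, and the exact log line depends on your log4j config:

import re
import subprocess

# Run spark-submit in cluster mode and capture its output; on YARN the
# Client logs an id of the form application_<timestamp>_<number>.
proc = subprocess.run(
    ["spark2-submit", "--deploy-mode", "cluster", "myapp.jar"],
    capture_output=True,
    text=True,
)
match = re.search(r"application_\d+_\d+", proc.stdout + proc.stderr)
if match:
    print("Submitted as:", match.group(0))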
