How to turnoff Spark session builder logs - apache-spark

Check the below-attached screenshot and I have marked logs in red color. These are all the logs we get when the Spark session is created. I would like to disable the same.
How to turn off the spark session logs?

Have you tried setting the log level
import org.apache.log4j.Logger
Logger.getLogger(“org.apache”).setLevel(Level.ERROR);

Related

Capture PySpark error logs and traceback in project logger

I have a complex logger in my python project with multiple handlers (stream, file, ES etc.)
I want to capture the errors from pyspark py4j and write it in the log file created by the package logger.
I'm running pyspark in client mode hence my driver is outside the cluster. I create spark context in my code and run it by triggering the script and not by doing spark-submit.
After some research I came to know that I can use log4j.properties file to capture the logs, but I don't see how I can use it in the client mode solution.
I create spark context like this and do logs like this,
from pyspark.sql import SparkSession
import logging
logging.getLogger(__name__)
logger.info("creating spark context")
spark = SparkSession.builder.appName("myapp").getOrCreate()
logger.info("done")
# some pyspark related codes that throws an error
The project logger currently don't capture the logs from pyspark directly. I want to capture that logs as well using log4j in client mode.
How can I do that? Is there some spark related config that I can tweak while creating spark context?

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or to understand why it failed, I look at the spark history server for DAG visualizations, and job errors, etc.
For example, if the job failed due to heap error, or Fetchfailed, etc, I can see it clearly specified in the spark history server.
However, I can't seem to be able to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this but as it's a bit long to fit in a comment, I post it here as an answer.
Like pointed out in my comment, the logs you're viewing using Spark History Server UI aren't the same as the Spark driver logs that are saved to S3 from EMR.
To get the spark history server logs written into S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the section Monitoring and Instrumentation of Spark documentation.
In AWS EMR, you could try to add something like this into your cluster configuration:
...
{
'Classification': 'spark-defaults',
'Properties': {
'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
'spark.eventLog.enabled': 'true'
}
}
...
I found this interesting post which describes how to set this for Kubernetes cluster, you may want to check it for further details.

spark-submit in cluster deploy mode get application id to console

I am stuck in one problem which I need to resolve quickly. I have gone through many posts and tutorial about spark cluster deploy mode, but I am clueless about the approach as I am stuck for some days.
My use-case :- I have lots of spark jobs submitted using 'spark2-submit' command and I need to get the application id printed in the console once they are submitted. The spark jobs are submitted using cluster deploy mode. ( In normal client mode , its getting printed )
Points I need to consider while creating solution :- I am not supposed to change code ( as it would take long time, cause there are many applications running ), I can only provide log4j properties or some custom coding.
My approach:-
1) I have tried changing the log4j levels and various log4j parameters but the logging still goes to the centralized log directory.
Part from my log4j.properties:-
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried to add custom listener and I am able to get the spark application id after the applications finishes , but not to console.
Code logic :-
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
for (Thread t : Thread.getAllStackTraces().keySet())
{
if (t.getName().equals("main"))
{
System.out.println("The current state : "+t.getState());
Configuration config = new Configuration();
ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
// some logic to write to communicate with the main thread to print the app id to console.
}
}
}
3) I have enabled the spark.eventLog to true and specified a directory in HDFS to write the event logs from spark-submit command .
If anyone could help me in finding an approach to the solution, it would be really helpful. Or if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I was finally able to get a solution to my problem.
After going through the Spark Code for the cluster deploy mode and some blogs, few things got clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread from the machine from which the user is submitting. Actually I was passing the log4j configs to the driver and executors, but missed out on the part that the log 4j configs for the "Client" was missing.
So we need to use :-
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
To clarify:
client mode means the Spark driver is running on the same machine you ran spark submit from
cluster mode means the Spark driver is running out on the cluster somewhere
You mentioned that it is getting logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode you just can't see it because it is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.

Hive INFO logs are not getting suppressed in Spark job

There are two approaches to control logging. One is via log4j.properties and another via controlling it programmatically. I have tried both:
Via log4j.properties file:
# disable logging for spark libraries
log4j.additivity.org=false
log4j.additivity.org.apache=false
#log4j.logger.org.apache=ERROR, NOAPPENDER
log4j.logger.org=ERROR, NOAPPENDER
and via programmatically:
org.apache.log4j.Logger logger = LogManager.getLogger(pkgName);
logger.setLevel(Level.ERROR);
I was able to suppress other logs but there are few INFO logs which are still getting printed:
INFO metastore: Connected to metastore.
INFO Hive: Registering function addfunc ca.nextpathway.hive.UDFToDate
and
INFO ContextHandler: Started o.s.j.s.ServletContextHandler#17f9344b{/static,null,AVAILABLE}
I want to suppress all the INFO logs except for few specific packages. But I think I am nowhere near it. If anyone knows what could be the problem here please let me know.
Try using the below. This should work.
Logger.getLogger("org.apache.hadoop.hive").setLevel(Level.ERROR);
The code
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java has a bug. It creates the LOg as below:
Logger LOG = LoggerFactory.getLogger("hive.ql.metadata.Hive");
So the regular filter with org.apache.hadoop.hive does not work. Instead, you have to use "hive.ql.metadata.Hive". For example:
org.apache.log4j.Logger.getLogger("hive.ql.metadata.Hive").setLevel(Level.WARN);

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone spark cluster? How can I log from executors? Logging from the driver is easy using the log4j logger that we can derive from spark-context.
But how can I log from within an RDD's foreach or a foreachPartition? Is there any way I can collect these logs and print?
The answer to this is to import python logging and to write the messages using logging and the logged messages will be in the work directory which is created under the spark installation location
There is nothing else which is needed
I went crazy modifying log4j.properties file and adding driver-java-option and spakrk.executor.extraJavaOptions
In your spark program, import logging add log messages straightaway as
logging.warning(whatever is your message and variable values you want to check)
Then if you navigate to the work directory - if i have installed spark at /home/vagrant/spark then we are talking about /home/vagrant/spark/work directory
There will be a directory for each application.And the workers used for the application will have numbers 0, 1, 2, 3 etc.You have to check in each worker.And whichever worker your executor was created to execute the task in the stderr you will see the logging messages
Hope this helps to see the user logged messages on the executor when using the spark standalone cluster mode

Resources