Hive INFO logs are not getting suppressed in Spark job - apache-spark

There are two approaches to control logging. One is via log4j.properties and another via controlling it programmatically. I have tried both:
Via log4j.properties file:
# disable logging for spark libraries
log4j.additivity.org=false
log4j.additivity.org.apache=false
#log4j.logger.org.apache=ERROR, NOAPPENDER
log4j.logger.org=ERROR, NOAPPENDER
and programmatically:
org.apache.log4j.Logger logger = LogManager.getLogger(pkgName);
logger.setLevel(Level.ERROR);
I was able to suppress other logs, but there are a few INFO logs which are still being printed:
INFO metastore: Connected to metastore.
INFO Hive: Registering function addfunc ca.nextpathway.hive.UDFToDate
and
INFO ContextHandler: Started o.s.j.s.ServletContextHandler#17f9344b{/static,null,AVAILABLE}
I want to suppress all the INFO logs except for a few specific packages, but I think I am nowhere near it. If anyone knows what could be the problem here, please let me know.

Try using the below. This should work.
Logger.getLogger("org.apache.hadoop.hive").setLevel(Level.ERROR);

The code at
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java has a bug: it creates the logger as below:
Logger LOG = LoggerFactory.getLogger("hive.ql.metadata.Hive");
So the regular filter with org.apache.hadoop.hive does not work. Instead, you have to use "hive.ql.metadata.Hive". For example:
org.apache.log4j.Logger.getLogger("hive.ql.metadata.Hive").setLevel(Level.WARN);
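If you would rather keep this in log4j.properties, the equivalent entries (a minimal sketch, using the same log4j 1.x property syntax as in the question) would be:
# silence the oddly-named Hive logger and, more broadly, the whole "hive" category
log4j.logger.hive.ql.metadata.Hive=WARN
log4j.logger.hive=WARN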

Related

How to turn off Spark session builder logs

Check the attached screenshot; I have marked the logs in red. These are all the logs we get when the Spark session is created, and I would like to disable them.
How to turn off the spark session logs?
Have you tried setting the log level?
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org.apache").setLevel(Level.ERROR);
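Alternatively, you can raise the default level for everything in Spark's conf/log4j.properties (a sketch based on the log4j.properties.template shipped with Spark 2.x; the appender name may differ in your setup):
# raise the root level so only errors reach the console by default
log4j.rootCategory=ERROR, console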

Databricks Connect does not work from IntelliJ?

I am trying to use Databricks Connect to run a Spark job on a Databricks cluster from IntelliJ. I followed the documentation linked below:
https://docs.databricks.com/dev-tools/databricks-connect.html
However, I could not make it work with IntelliJ; it throws the exception below:
21/10/01 18:32:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/10/01 18:32:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
Exception in thread "main" java.lang.NoSuchFieldError: JAVA_9
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:95)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:443)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:384)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:432)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
I could not find a workaround for this, as the documentation does not say anything clearly. I cross-checked from IntelliJ that it points to the correct JAR directory returned by databricks-connect get-jar-dir. Any clue on this would be helpful.
Note: databricks-connect test returns success.

Avoid Google Dataproc logging

I'm performing millions of operations using Google Dataproc, with one problem: the logging data size.
I do not perform any show or any other kind of print, but the 7 lines of INFO, multiplied by millions of runs, add up to a really large logging volume.
Is there any way to stop Google Dataproc from logging these lines?
I have already tried the following in Dataproc, without success:
https://cloud.google.com/dataproc/docs/guides/driver-output#configuring_logging
These are the 7 lines I want to get rid of:
18/07/30 13:11:54 INFO org.spark_project.jetty.util.log: Logging initialized #...
18/07/30 13:11:55 INFO org.spark_project.jetty.server.Server: ....z-SNAPSHOT
18/07/30 13:11:55 INFO org.spark_project.jetty.server.Server: Started #...
18/07/30 13:11:55 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector#...
18/07/30 13:11:56 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: ...
18/07/30 13:11:57 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ...
18/07/30 13:12:01 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_...
What you are looking for is an exclusion filter: you need to browse from your Console to Stackdriver Logging > Logs ingestion > Exclusions and click on "Create exclusion". As explained there:
To create a logs exclusion, edit the filter on the left to only match
logs that you do not want to be included in Stackdriver Logging. After
an exclusion has been created, matched logs will no longer be
accessible in Stackdriver Logging.
In your case, the filter should be something like this:
resource.type="cloud_dataproc_cluster"
textPayload:"INFO org.spark_project.jetty.util.log: Logging initialized"
...
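A fuller filter covering all seven lines could OR the payload matches together (a sketch of the Cloud Logging filter syntax; adjust the substrings to match your exact messages):
resource.type="cloud_dataproc_cluster"
(textPayload:"INFO org.spark_project.jetty"
 OR textPayload:"INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase"
 OR textPayload:"INFO org.apache.hadoop.yarn.client")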

spark-submit in cluster deploy mode: get application id to console

I am stuck on one problem which I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but I am still clueless about the approach after being stuck for some days.
My use case: I have lots of Spark jobs submitted using the 'spark2-submit' command, and I need to get the application id printed to the console once they are submitted. The Spark jobs are submitted in cluster deploy mode. (In normal client mode, it gets printed.)
Points I need to consider while creating the solution: I am not supposed to change code (as it would take a long time, because there are many applications running); I can only provide log4j properties or some custom coding.
My approach:
1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.
Part of my log4j.properties:
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out
log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console
log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false
log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console
log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console
log4j.logger.org.apache.hadoop.ipc.Client=ALL
2) I have also tried adding a custom listener, and I am able to get the Spark application id after the application finishes, but not to the console.
Code logic:
public void onApplicationEnd(SparkListenerApplicationEnd arg0)
{
    for (Thread t : Thread.getAllStackTraces().keySet())
    {
        if (t.getName().equals("main"))
        {
            System.out.println("The current state : " + t.getState());
            Configuration config = new Configuration();
            ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);
            // some logic to communicate with the main thread to print the app id to the console
        }
    }
}
3) I have set spark.eventLog.enabled to true and specified a directory in HDFS for the event logs in the spark-submit command.
If anyone could help me in finding an approach to the solution, it would be really helpful. Or if I am doing something very wrong, any insights would help me.
Thanks.
After being stuck at the same place for some days, I was finally able to get a solution to my problem.
After going through the Spark code for cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.
In cluster deploy mode, the job is submitted via a Client thread from the machine the user submits from. I was actually passing the log4j configs to the driver and executors, but missed the part where the log4j config for the "Client" was still missing.
So we need to use:
SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
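For completeness, here is a sketch of passing the log4j config to the Client, driver, and executors in one submission (spark.driver.extraJavaOptions, spark.executor.extraJavaOptions, and --files are standard Spark options; /path/to/log4j.properties is a placeholder):
SPARK_SUBMIT_OPTS="-Dlog4j.configuration=file:/path/to/log4j.properties" \
spark-submit --deploy-mode cluster \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  <rest of the parameters>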
To clarify:
client mode means the Spark driver is running on the same machine you ran spark submit from
cluster mode means the Spark driver is running out on the cluster somewhere
You mentioned that it gets logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode; you just can't see it because it is running on a different machine.
Some ideas:
Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j.
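If you restore the Client logging to the console as described above, one hedged way to capture the id without changing application code is to scrape the submitter's own output, since the YARN Client prints a "Submitted application application_..." line (a sketch; the exact wording of the log line can vary between versions):
APP_ID=$(spark-submit --deploy-mode cluster <rest of the parameters> 2>&1 | grep -oE "application_[0-9]+_[0-9]+" | head -n 1)
echo "$APP_ID"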

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone spark cluster? How can I log from executors? Logging from the driver is easy using the log4j logger that we can derive from spark-context.
But how can I log from within an RDD's foreach or a foreachPartition? Is there any way I can collect these logs and print?
The answer to this is to import Python's logging module and write the messages using logging; the logged messages will end up in the work directory, which is created under the Spark installation location.
Nothing else is needed.
I went crazy modifying the log4j.properties file and adding --driver-java-options and spark.executor.extraJavaOptions.
In your Spark program, import logging and add log messages straight away, e.g.:
logging.warning("whatever your message is, plus the variable values you want to check")
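Here is a minimal end-to-end sketch of logging from inside foreachPartition (the function name and messages are made up for illustration; only PySpark and the standard logging module are assumed):
from pyspark.sql import SparkSession
import logging

def process_partition(rows):
    # Runs on the executor; by default logging.warning() goes to the executor's
    # stderr, which in standalone mode lands under the worker's work/ directory.
    logging.warning("processing a partition with %d rows", len(list(rows)))

spark = SparkSession.builder.appName("executor-logging-demo").getOrCreate()
spark.sparkContext.parallelize(range(100), 4).foreachPartition(process_partition)
spark.stop()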
Then navigate to the work directory; if Spark is installed at /home/vagrant/spark, we are talking about the /home/vagrant/spark/work directory.
There will be a directory for each application, and the workers used for the application will be numbered 0, 1, 2, 3, etc. You have to check each worker; in whichever worker your executor was created to execute the task, you will see the logging messages in its stderr.
Hopefully this helps you see user-logged messages on the executors when using Spark standalone cluster mode.
