Avoid Google Dataproc logging - apache-spark

I'm running millions of operations on Google Dataproc and have one problem: the size of the logging data.
I don't perform any show or any other kind of print, but those 7 lines of INFO, multiplied by millions of operations, add up to a really big log volume.
Is there any way to stop Google Dataproc from logging them?
Already tried without success in Dataproc:
https://cloud.google.com/dataproc/docs/guides/driver-output#configuring_logging
These are the 7 lines I want to get rid of:
18/07/30 13:11:54 INFO org.spark_project.jetty.util.log: Logging initialized #...
18/07/30 13:11:55 INFO org.spark_project.jetty.server.Server: ....z-SNAPSHOT
18/07/30 13:11:55 INFO org.spark_project.jetty.server.Server: Started #...
18/07/30 13:11:55 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector#...
18/07/30 13:11:56 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: ...
18/07/30 13:11:57 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ...
18/07/30 13:12:01 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_...

What you are looking for is an exclusion filter: in the Console, go to Stackdriver Logging > Logs ingestion > Exclusions and click "Create exclusion". As explained there:
To create a logs exclusion, edit the filter on the left to only match
logs that you do not want to be included in Stackdriver Logging. After
an exclusion has been created, matched logs will no longer be
accessible in Stackdriver Logging.
In your case, the filter should be something like this:
resource.type="cloud_dataproc_cluster"
textPayload:"INFO org.spark_project.jetty.util.log: Logging initialized"
...
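If you prefer to script the exclusion instead of clicking through the Console, a minimal sketch with the google-cloud-logging Python client could look like the following. The client class, the LogExclusion fields, the exclusion name and "my-project" are my assumptions and placeholders, not something from the question, so verify them against the library version you have installed:
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

# Assumed class/field names from the google-cloud-logging v2 API; verify locally.
client = ConfigServiceV2Client()
exclusion = LogExclusion(
    name="dataproc-spark-startup-noise",  # placeholder name
    description="Drop repetitive Spark/Jetty INFO lines from Dataproc jobs",
    filter=(
        'resource.type="cloud_dataproc_cluster" AND '
        'textPayload:"INFO org.spark_project.jetty.util.log: Logging initialized"'
    ),
)
client.create_exclusion(parent="projects/my-project", exclusion=exclusion)  # placeholder project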

Related

Spark application in incomplete section of spark-history even when completed

In my Spark-history some applications have been "incomplete" for a week now. I've tried to kill them, close the sparkContext(), and kill the main .py process, but nothing helped.
For example,
yarn application -status <id>
shows:
...
State: FINISHED
Final-State: SUCCEEDED
...
Log Aggregation Status: TIME_OUT
...
But in Spark-History I still see it in the incomplete section of my applications. If I open the application there, I can see 1 Active job with 1 Alive executor, but they have been doing nothing for a whole week. This looks like a logging bug, but as far as I know the problem only affects me; my coworkers don't have it.
This thread didn't help me, because I don't have access to start-history-server.sh.
I suppose this problem is because of
Log Aggregation Status: TIME_OUT
because my "completed" applications have
Log Aggregation Status: SUCCEEDED
What can I do to fix this? Right now I have 90+ incomplete applications.
I've found a clear description of my problem with the same setup (YARN, Spark, etc.), but there is no solution: What is 'Active Jobs' in Spark History Server Spark UI Jobs section
From Spark Monitoring and Instrumentation:
...
3. Applications which exited without registering themselves as completed will be listed as incomplete --even though they are no
longer running. This can happen if an application crashes.
...
Meaning:
History Server's UI shows only those Spark applications whose event logs it can find in its spark.eventLog.dir directory (a config typically set to /user/spark/applicationHistory in Hadoop). If a log doesn't end with the special ApplicationEnd event:
{"Event":"SparkListenerApplicationEnd","Timestamp":1667223930402}
...the application is considered incomplete (even if it is no longer running) and will be displayed on the Incomplete Applications page.
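To check whether a particular run is in this state, you can inspect its event log yourself. A minimal sketch, assuming an uncompressed event log under /user/spark/applicationHistory and a made-up application id (both are placeholders):
import json
import subprocess

# Placeholder path: adjust to your spark.eventLog.dir and application id.
EVENT_LOG = "/user/spark/applicationHistory/application_1234567890123_0001"

# Each line of the event log is one JSON-encoded Spark listener event.
raw = subprocess.run(["hdfs", "dfs", "-cat", EVENT_LOG],
                     capture_output=True, text=True, check=True).stdout
events = [json.loads(line)["Event"] for line in raw.splitlines() if line.strip()]

# If the final ApplicationEnd event is missing, the History Server keeps
# listing this run under "Incomplete Applications".
print("SparkListenerApplicationEnd" in events)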
For your question this means that "moving" the application to the Completed Apps page won't be trivial: it would require manually editing the event log and re-uploading it to the SHS directory in Hadoop. Moreover, it won't solve anything, since most likely your application keeps crashing before it can write that final message, and its next run will end up on the same Incomplete page again.
To diagnose the reason why it fails, perhaps you can look at the application driver logs for any clues -- errors or exception messages. Graceful shutdown looks different depending on what kind of resource manager and what deploy mode your app is using. For deploy-mode=cluster and YARN RM, it would look something like this:
22/10/31 11:11:11 INFO spark.SparkContext: Successfully stopped SparkContext
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
22/10/31 11:11:11 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://.../.../.sparkStaging/application_<appId>
22/10/31 11:11:11 INFO util.ShutdownHookManager: Shutdown hook called
22/10/31 11:11:11 INFO util.ShutdownHookManager: Deleting directory /.../.../appcache/application_<appId>/spark-<guid>

Databricks Connect does not work from IntelliJ?

I am trying to use Databricks Connect to run a Spark job on a Databricks cluster from IntelliJ. I followed the documentation at the link below:
https://docs.databricks.com/dev-tools/databricks-connect.html
However, I could not make it work with IntelliJ; it throws the exception below:
21/10/01 18:32:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/10/01 18:32:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
Exception in thread "main" java.lang.NoSuchFieldError: JAVA_9
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:95)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:443)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:384)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:432)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
I could not find a workaround for this, as the documentation does not say anything clearly. I cross-checked in IntelliJ that it points to the correct jar directory returned by databricks-connect get-jar-dir. Any clue on this would be helpful.
Note: databricks-connect test returns success.

IP address and port of log requests in Kibana

Can someone please tell me if it is possible to add the IP address and port as available fields in my Kibana, so I can see which logs belong to which application instance? Where do I configure this in order to enable the feature?
For example, I am sending log requests like this, and I have 4 applications with multiple instances of each:
2020-01-14 00:21:12.869 INFO [microservice1,48f1befc87d3f220,48f1befc87d3f220,false] 8278 --- [nio-8001-exec-7] c.s.m.c.Microservice1Controller : This is an INFO log
2020-01-14 00:21:12.869 ERROR [microservice1,48f1befc87d3f220,48f1befc87d3f220,false] 8278 --- [nio-8001-exec-7] c.s.m.c.Microservice1Controller : This is an ERROR log
Picture of my kibana UI with the available fields:
Kibana can only display fields that were indexed into Elasticsearch. Kibana is just a visual platform that lets you search your data graphically instead of using the REST API.
So if your documents don't contain any source.ip or source.port fields, how should Kibana display them?
Q: Where do I configure this in order to enable the feature?
A: There is no general setting that tracks the IPs and ports.
You would need to add these fields into your created logs, e.g.:
2020-01-14 00:21:12.869 INFO [microservice1,48f1befc87d3f220,48f1befc87d3f220,false] 192.168.19.100:4712 8278 --- [nio-8001-exec-7] c.s.m.c.Microservice1Controller : This is an INFO log
2020-01-14 00:21:12.869 ERROR [microservice1,48f1befc87d3f220,48f1befc87d3f220,false] 192.168.19.101:4812 8278 --- [nio-8001-exec-7] c.s.m.c.Microservice1Controller : This is an ERROR log
With that, you can extract the IPs and ports and index them as separate fields of your documents in Elasticsearch.
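For illustration, pulling those two values out of such a line before indexing could look like this. A minimal sketch with a hypothetical log line; the resulting dict is what you would then index (via Logstash/Filebeat or your own indexing code) so that source.ip and source.port become searchable fields:
import re

# Hypothetical log line in the format shown above, with IP:port added.
line = ("2020-01-14 00:21:12.869 INFO [microservice1,48f1befc87d3f220,"
        "48f1befc87d3f220,false] 192.168.19.100:4712 8278 --- "
        "[nio-8001-exec-7] c.s.m.c.Microservice1Controller : This is an INFO log")

match = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\b", line)
if match:
    # These become separate, searchable fields once indexed into Elasticsearch.
    doc = {"message": line,
           "source.ip": match.group(1),
           "source.port": int(match.group(2))}
    print(doc)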

Hive INFO logs are not getting suppressed in Spark job

There are two approaches to control logging. One is via log4j.properties and another via controlling it programmatically. I have tried both:
Via log4j.properties file:
# disable logging for spark libraries
log4j.additivity.org=false
log4j.additivity.org.apache=false
#log4j.logger.org.apache=ERROR, NOAPPENDER
log4j.logger.org=ERROR, NOAPPENDER
and via programmatically:
org.apache.log4j.Logger logger = LogManager.getLogger(pkgName);
logger.setLevel(Level.ERROR);
I was able to suppress other logs, but there are a few INFO logs which are still getting printed:
INFO metastore: Connected to metastore.
INFO Hive: Registering function addfunc ca.nextpathway.hive.UDFToDate
and
INFO ContextHandler: Started o.s.j.s.ServletContextHandler#17f9344b{/static,null,AVAILABLE}
I want to suppress all the INFO logs except for a few specific packages, but I think I am nowhere near it. If anyone knows what the problem could be here, please let me know.
Try using the below. This should work.
Logger.getLogger("org.apache.hadoop.hive").setLevel(Level.ERROR);
The code at https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java has a bug. It creates the log as below:
Logger LOG = LoggerFactory.getLogger("hive.ql.metadata.Hive");
So the regular filter with org.apache.hadoop.hive does not work. Instead, you have to use "hive.ql.metadata.Hive". For example:
org.apache.log4j.Logger.getLogger("hive.ql.metadata.Hive").setLevel(Level.WARN);
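If you are on PySpark rather than Java/Scala, the same misnamed logger can be reached through the driver's JVM gateway. A minimal sketch, assuming a log4j 1.x classpath; SparkContext._jvm is an internal handle, so treat this as a workaround rather than a stable API:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-hive").getOrCreate()

# Reach into the driver JVM and silence the misnamed Hive logger.
jvm = spark.sparkContext._jvm
jvm.org.apache.log4j.LogManager.getLogger("hive.ql.metadata.Hive") \
    .setLevel(jvm.org.apache.log4j.Level.WARN)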

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone Spark cluster? How can I log from the executors? Logging from the driver is easy using the log4j logger that we can derive from the Spark context.
But how can I log from within an RDD's foreach or foreachPartition? Is there any way I can collect these logs and print them?
The answer is to import Python's logging module and write your messages with it; the logged messages end up in the work directory created under the Spark installation location.
Nothing else is needed.
I went crazy modifying the log4j.properties file and adding driver-java-options and spark.executor.extraJavaOptions, but none of that was necessary.
In your Spark program, import logging and add log messages straight away, e.g.:
logging.warning(whatever is your message and variable values you want to check)
Then navigate to the work directory: if I have installed Spark at /home/vagrant/spark, then we are talking about the /home/vagrant/spark/work directory.
There will be a directory for each application, and the workers used for the application are numbered 0, 1, 2, 3, etc. You have to check each worker: in the stderr of whichever worker your executor was created on to execute the task, you will see the logging messages.
Hope this helps you see user-logged messages on the executors when using Spark standalone cluster mode.
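As a concrete illustration, here is a minimal sketch of the approach (app name and partition count are arbitrary); the warnings land in each worker's stderr under <spark install>/work/<app-id>/<executor-id>/:
import logging
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-logging-demo").getOrCreate()

def log_partition(rows):
    # Runs on the executor, not the driver; the message goes to that
    # worker's stderr file in the work directory, not to the driver console.
    logging.warning("this partition holds %d rows", len(list(rows)))

spark.sparkContext.parallelize(range(100), 4).foreachPartition(log_partition)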
