How to reduce logs for Apache Spark in EMR?

I have a question about an Apache Spark job running on AWS EMR. Each run generates a lot of logs, around 5-10 GB in my case, but about 80% of that is INFO-level output that is useless to me. How can I reduce these logs?
I used log4j2 to set Spark's log level to "warn" to suppress the unnecessary output, but the logs come from different components (some from Spark, some from YARN, some from EMR) and are merged together, so that alone did not help. How can I fix this? Has anyone dealt with this before? I would rather not reconfigure every node in the cluster.
I tried the solution below, but it does not seem to work on EMR:
Logger logger = LogManager.getLogger("sparklog");
logger.setlevel()
The XML configuration is below.
The string "sparklog" is used to match the logger in the log4j2.xml configuration file.
<Configuration status="WARN" monitorInterval="300"> <!-- reload the configuration file every 300 seconds -->
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n" /> <!-- controls the output format -->
</Console>
</Appenders>
<Loggers>
<Logger name="sparklog" level="warn" additivity="false"> <!-- sets the level of the "sparklog" logger -->
<AppenderRef ref="Console" />
</Logger>
<Root level="error">
<AppenderRef ref="Console" />
</Root>
</Loggers>
</Configuration>

Since no one answered, here is the solution I found myself.
1. Upload the configuration file to your master node:
scp -i ~/.ssh/emr_dev.pem /Users/x/log4j_files/log4j.properties hadoop@ec2-xxx-xxx-xxx.eu-west-1.compute.amazonaws.com:/usr/tmp/
2. In your submit script, add:
"--files": "/usr/tmp/log4j.properties"
This solution works for me.

See "Configuring Applications" in the Amazon EMR documentation.
When creating the EMR cluster, set the log level in config.json (INFO in this example):
...
[
{
"Classification": "spark-log4j",
"Properties": {
"log4j.rootCategory": "INFO, console"
}
}
]
...
Pass config.json when creating the cluster:
aws emr create-cluster --release-label emr-5.27.0 --applications Name=Spark \
--instance-type m4.large --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/config.json
PySpark example: use WARN by default and switch to DEBUG when troubleshooting
from pyspark.sql import SparkSession
# On EMR the master is normally supplied by spark-submit, so none is set here
spark = SparkSession.builder.getOrCreate()
# normal run
spark.sparkContext.setLogLevel("WARN")
# troubleshooting
spark.sparkContext.setLogLevel("DEBUG")

EMR 6.9.0 with Spark 3.3.0 uses log4j2, so that is what you need to configure to change the log output. You can do this for the entire Spark cluster and its steps by setting a custom "spark-log4j2" configuration. The example below sets a "KafkaConsumer" logger to "warn" level; any other valid log4j2 properties can be added as necessary.
In the EMR console you do this under "Software settings - optional" when creating the cluster, amending the JSON configuration to something like this:
[
{
"Classification": "spark-log4j2",
"Properties": {
"logger.KafkaConsumer.name": "org.apache.kafka.clients.consumer.KafkaConsumer",
"logger.KafkaConsumer.level": "warn"
}
}
]
If you're creating an EMR cluster from the command line then do something like this:
aws emr create-cluster \
--name "Your cluster name" \
--release-label "emr-6.9.0" \
--configurations '[{"Classification":"spark-log4j2","Properties":{"logger.KafkaConsumer.name":"org.apache.kafka.clients.consumer.KafkaConsumer","logger.KafkaConsumer.level":"warn"}}]' \
.... other arguments go here
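For more than a couple of loggers, the JSON above gets tedious to write by hand. Below is a minimal Python sketch that builds the same "spark-log4j2" classification from a mapping of logger names to levels. The helper name spark_log4j2_classification and the alias scheme (last segment of the logger name) are assumptions made for illustration; they are not part of EMR.

```python
import json


def spark_log4j2_classification(levels):
    """Build an EMR 'spark-log4j2' classification from {logger name: level}.

    Hypothetical helper: the alias after 'logger.' is taken from the last
    dot-separated segment of the logger name, mirroring the example above.
    """
    properties = {}
    for logger_name, level in levels.items():
        alias = logger_name.rsplit(".", 1)[-1]
        if f"logger.{alias}.name" in properties:
            raise ValueError(f"duplicate logger alias: {alias}")
        properties[f"logger.{alias}.name"] = logger_name
        properties[f"logger.{alias}.level"] = level.lower()
    return [{"Classification": "spark-log4j2", "Properties": properties}]


config = spark_log4j2_classification(
    {"org.apache.kafka.clients.consumer.KafkaConsumer": "warn"}
)
# Paste the output into "Software settings" or pass it to --configurations.
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file or inlined in the create-cluster command as in the example above.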

Related

Dataproc: limit log size for long-running / streaming Spark jobs

I have a Spark Structured Streaming job on GCP Dataproc that reads data from Kafka, processes it, and pushes the results back into Kafka topics.
A couple of questions:
Does Spark put all logs (including INFO, WARN, etc.) into stderr?
What I notice is that stdout is empty, while all the logging goes to stderr.
Is there a way to expire the data in stderr (i.e. drop the older logs)?
Since this is a long-running streaming job, stderr fills up over time and the nodes/VMs become unavailable.
Please advise.
Here is the output of the yarn logs command:
root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:25:34,876 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:25:35,144 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:25:35 +0000 2022
LogLength:43251469683
LogContents:
applianceName=usa-isn0784-rt01, tenantName=NOV, mstatsTimeBlock=1663507200, tenantId=2, vsnId=0, mstatsTotSentOctets=11596, mstatsTotRecvdOctets=24481, mstatsTotSessDuration=300000, mstatsTotSessCount=1, mstatsType=sdwan-acc-ckt-app-stats, appId=https, site=usa-isn0784-rt01, accCkt=WAN-DIA, siteId=442, accCktId=1, user=10.126.117.196, risk=3, productivity=3, family=general-internet, subFamily=web, bzTag=Unknown,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
***********************************************************************
root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:26:01,439 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:26:01,696 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:26:02 +0000 2022
LogLength:44309782124
LogContents:
, tenantId=3, vsnId=0, mstatsTotSentOctets=48210, mstatsTotRecvdOctets=242351, mstatsTotSessDuration=300000, mstatsTotSessCount=34, mstatsType=dest-stats, destIp=165.225.216.24, mstatsAttribs=,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
22/09/19 23:26:02 WARN org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
Update:
Based on @Dagang's note, I'm now using the RollingFileAppender in log4j.properties, and the new log file is being created. However, some data is still going to stderr.
Here is the updated code:
Job submission command:
gcloud dataproc jobs submit pyspark process-appstat.py \
--cluster $CLUSTER \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.dynamicAllocation.executorIdleTimeout=120s#spark.shuffle.service.enabled=true#spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-executor.properties#spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j-driver.properties \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.3.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar,gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar,gs://dataproc-spark-jars/bson-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar \
--files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani-noacl.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/alarm-compression-user-test.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/appstats-user-test.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/insights-user-test.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/intfutil-user-test.p12,gs://dataproc-spark-configs/metrics.properties,gs://dataproc-spark-configs/params.cfg,gs://kafka-certs/appstat-anomaly-user.p12,gs://kafka-certs/appstat-anomaly-user-test.p12,gs://kafka-certs/appstat-agg-user.p12,gs://kafka-certs/appstat-agg-user-test.p12,gs://kafka-certs/alarmblock-user.p12,gs://kafka-certs/alarmblock-user-test.p12,gs://kafka-certs/versa-alarmblock-test-user.p12,gs://kafka-certs/versa-bandwidth-test-user.p12,gs://kafka-certs/versa-appstat-test-user.p12,gs://kafka-certs/versa-alarmblock-user.p12,gs://kafka-certs/versa-bandwidth-user.p12,gs://kafka-certs/versa-appstat-user.p12,gs://dataproc-spark-configs/log4j-executor.properties,gs://dataproc-spark-configs/log4j-driver.properties \
--region $REGION \
--py-files streams.zip,utils.zip \
-- isdebug=$isdebug
log4j-executor.properties:
--------------------------
# Set everything to be logged to the console
# log4j.rootCategory=INFO, console
# log4j.appender.console=org.apache.log4j.ConsoleAppender
# log4j.appender.console.target=System.err
# log4j.appender.console.layout=org.apache.log4j.PatternLayout
# log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# logging to rolling_file, using RollingFileAppender
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-executor.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
log4j-driver.properties:
-------------------------
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-driver.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
Any ideas on what needs to be done for this?
A question on ${spark.yarn.app.container.log.dir}:
What location does this get translated to?
When I log on to a worker node and check this, I get the following:
karanalang@versa-structured-stream-v1-w-0:~$ echo $spark.yarn.app.container.log.dir
.yarn.app.container.log.dir
Here are the relevant configs in yarn-site.xml:
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/hadoop/yarn/nm-local-dir</value>
<description>
Directories on the local machine in which to application temp files.
</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>gs://dataproc-temp-us-east1-939354532596-4ln8c3y1/fe57047f-13d9-4b9b-8bce-baa4a911aa65/yarn-logs</value>
<description>
The remote path, on the default FS, to store logs.
</description>
</property>
However, the logs are in the location below:
root@versa-structured-stream-v1-w-0:/# find . -name versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0002/container_1664926662510_0002_01_000001/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000179/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000250/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000299/versa-ss-executor.log
Where is the location ./var/log/hadoop-yarn/userlogs taken from? (It is not in yarn-site.xml.)

Short answer:
You can use a custom log4j config with RollingFileAppender to limit the log size for long-running jobs.
Long answer:
The default log4j config for Spark on Dataproc is at /etc/spark/conf/log4j.properties. It configures the root logger to write to stderr at INFO level. At runtime, however, driver logs (in client mode) are directed by the Dataproc agent to GCS and streamed back to the client, while executor logs (and driver logs in cluster mode) are redirected by YARN to the stderr file in the container's YARN log dir. Consider using /etc/spark/conf/log4j.properties as the template for your custom config.
In your custom config, you can configure logs to be written to a RollingFileAppender, e.g.,
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/my_app.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
...
Note that for executors (and drivers in cluster mode), the value of log4j.appender.rolling_file.File needs to be a path under ${spark.yarn.app.container.log.dir}; see this question and this doc.
Upload your log4j config(s) to a GCS bucket. The driver and executors may or may not share the same config. In your case, you might want to update only the executor log4j config and keep the default for the driver.
Then submit the job with the custom log4j config in one of the following ways:
With --files alone, the file name must be log4j.properties, and the driver and executors share the same config:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/log4j.properties
With extraJavaOptions, the file name doesn't have to be log4j.properties, and the driver and executors can have different configs:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/my-log4j.properties \
--properties 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties'
The expectation is that rolling logs for the Spark executors will appear under the YARN container log dirs (configurable through yarn.nodemanager.log-dirs, with default value /var/log/hadoop-yarn/userlogs on Dataproc). They will be automatically aggregated and stored in GCS and Cloud Logging.
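To see why this bounds disk usage: log4j 1.x's RollingFileAppender keeps the active file plus at most MaxBackupIndex rolled files, so each container should hold roughly MaxFileSize x (MaxBackupIndex + 1) at most (a file can overshoot slightly, since rollover happens after the write that crosses the limit). Below is a small Python sketch of that arithmetic. The helper name max_log_bytes and its deliberately naive properties parser are assumptions for illustration only.

```python
import re

UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def max_log_bytes(properties_text, appender="rolling_file"):
    """Approximate upper bound on disk used by one RollingFileAppender."""
    props = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()

    prefix = f"log4j.appender.{appender}."
    # log4j 1.x defaults: MaxFileSize=10MB, MaxBackupIndex=1
    size = props.get(prefix + "MaxFileSize", "10MB")
    backups = int(props.get(prefix + "MaxBackupIndex", "1"))

    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)?", size, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised MaxFileSize: {size!r}")
    number, unit = match.groups()
    file_bytes = int(number) * (UNITS[unit.upper()] if unit else 1)
    # Active file plus the rolled backups.
    return file_bytes * (backups + 1)


example = """
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
"""
print(max_log_bytes(example) // 1024**2, "MiB per container")
```

With the settings from the example config (100MB, 10 backups) this comes to about 1100 MiB per container, so the total per node also depends on how many containers run there.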

Log4j not showing logs when Spark job is submitted on YARN cluster

No application logs are shown in the YARN logs when a job is submitted to the YARN cluster. Everything works in local mode.
The following error is shown:
StatusLogger Log4j2 could not find a logging implementation
I provide the following to spark-submit:
--driver-library-path /opt/spark-extras/apache-log4j-2.11.2-bin \
--conf spark.driver.extraLibraryPath=/opt/spark-extras/apache-log4j-2.11.2-bin \
--conf spark.executor.extraLibraryPath=/opt/spark-extras/apache-log4j-2.11.2-bin \
The spark-defaults.conf specifies the log4j2.xml:
spark.executor.extraJavaOptions ... -Dlog4j.configurationFile=log4j2.xml
The xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n" />
</Console>
</Appenders>
<Loggers>
<Root level="info">
<AppenderRef ref="Console" />
</Root>
</Loggers>
</Configuration>
Then my log on yarn shows only something like:
2019-06-05 16:45:28,495 [main] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:46,298 [Driver] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:46,314 [Driver] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:46,315 [Driver] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:48,324 [Driver] WARN org.apache.spark.scheduler.FairSchedulableBuilder - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
2019-06-05 16:45:49,977 [Reporter] INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for :
2019-06-05 16:45:50,654 [ContainerLauncher-2] INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - yarn.client.max-cached-nodemanagers-proxies : 0
2019-06-05 16:45:53,692 [Reporter] INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for :
How can I configure the spark-submit job so that it logs using log4j2, as it does in local mode?

Unable to schedule job in oozie. Getting Error while creating HiveContext

I am trying to run a Spark job from Oozie. Below is the code I am trying to run.
SparkConf conf = getConf(appName);
JavaSparkContext sc = new JavaSparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
I am getting the following error:
JOB[0000000-170808082825775-oozie-oozi-W] ACTION[0000000-170808082825775-oozie-oozi-W#Sample-node] Launcher exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
Here's my workflow XML file:
<workflow-app name="DataSampling" xmlns="uri:oozie:workflow:0.4">
<start to='Sample-node'/>
<action name="Sample-node">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>tez.lib.uris</name>
<value>/hdp/apps/2.5.3.0-37/tez/tez.tar.gz</value>
</property>
</configuration>
<master>${master}</master>
<mode>${mode}</mode>
<name>Sample class on Oozie - Sampling</name>
<class>Sampling</class>
<jar>/path/jarfile.jar</jar>
<arg>${numEventsPerPattern}</arg>
<arg>${eventdate}</arg>
<arg>${eventtype}</arg>
<arg>${user}</arg>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name='end'/>
</workflow-app>
I am using Hortonworks Data Platform 2.5. Can anyone please help if I am missing something in the classpath?
Thanks in advance.

It finally worked: Oozie is able to create the HiveContext.
The issue was with the classpath. Delete the folder /user/oozie/share/lib in HDFS.
In Ambari, update the following properties under core-site.xml, setting each of them to *:
hadoop.proxyuser.oozie.groups
hadoop.proxyuser.oozie.hosts
hadoop.proxyuser.root.groups
hadoop.proxyuser.root.hosts
Create a new shared library using the following command:
/usr/hdp/current/oozie-client/bin/oozie-setup.sh sharelib create -fs /user/oozie/share/lib
Restart the Oozie service.
The two steps above should be done as the oozie user.
Add the following tag to the workflow XML file:
<spark-opts>--num-executors 6 --driver-memory 8g --executor-memory 6g</spark-opts>
Run the Oozie job as the hdfs user.

Oozie spark action error: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]

I am currently setting up an Oozie workflow that uses a Spark action. The Spark code that I use works correctly, tested on both local and YARN. However, when running it as an Oozie workflow I am getting the following error:
Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]
Having read up on this error, I saw that the most common cause is a problem with the Oozie sharelibs. I have added all Spark jar files to /user/oozie/share/lib/spark on HDFS, restarted Oozie, and run sudo -u oozie oozie admin -oozie http://192.168.26.130:11000/oozie -sharelibupdate
to ensure the sharelibs are properly updated. Unfortunately, none of this has stopped the error from occurring.
My workflow is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.4' name='SparkBulkLoad'>
<start to = 'bulk-load-node'/>
<action name = 'bulk-load-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<name>BulkLoader</name>
<jar>${nameNode}/user/spark-test/BulkLoader.py</jar>
<spark-opts>
--num-executors 3 --executor-cores 1 --executor-memory 512m --driver-memory 512m\
</spark-opts>
</spark>
<ok to = 'end'/>
<error to = 'fail'/>
</action>
<kill name = 'fail'>
<message>
Error occurred while bulk loading files
</message>
</kill>
<end name = 'end'/>
</workflow-app>
and job.properties is as follows:
nameNode=hdfs://192.168.26.130:8020
jobTracker=http://192.168.26.130:8050
queueName=spark
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/spark-test/workflow.xml
workflowAppUri=${nameNode}/user/spark-test/BulkLoader.py
Any advice would be greatly appreciated.

I have also specified the libpath:
oozie.libpath=<path>/oozie/share/lib/lib_<timestamp>
It is the value you see after running the command you wrote:
sudo -u oozie oozie admin -oozie http://192.168.26.130:11000/oozie -sharelibupdate
Example:
[ShareLib update status]
sharelibDirOld = hdfs://nameservice1/user/oozie/share/lib/lib_20190328034943
host = http://vghd08hr.dc-ratingen.de:11000/oozie
sharelibDirNew = hdfs://nameservice1/user/oozie/share/lib/lib_20190328034943
status = Successful
Optional:
You can also specify the YARN configuration within the Cloudera folder:
oozie.launcher.yarn.app.mapreduce.am.env=/opt/SP/apps/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2
However, this might not solve the issue. The other hint I have: if you are using Spark 1.x, this file is necessary in your Oozie sharelib folder:
/user/oozie/share/lib/lib_20190328034943/spark2/oozie-sharelib-spark.jar
If you copy it into your spark2 folder, it solves the "missing SparkMain" issue but then asks for other dependencies (this might be a problem specific to my environment). I think it is worth a try: copy the lib, run your job, and check the logs.

How to stop INFO messages displaying on spark console?

I'd like to stop the various messages that appear in the Spark shell.
I tried editing the log4j.properties file to stop these messages.
Here are the contents of log4j.properties:
# Define the root logger with appender file
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
But messages are still getting displayed on the console.
Here are some example messages
15/01/05 15:11:45 INFO SparkEnv: Registering BlockManagerMaster
15/01/05 15:11:45 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150105151145-b1ba
15/01/05 15:11:45 INFO MemoryStore: MemoryStore started with capacity 0.0 B.
15/01/05 15:11:45 INFO ConnectionManager: Bound socket to port 44728 with id = ConnectionManagerId(192.168.100.85,44728)
15/01/05 15:11:45 INFO BlockManagerMaster: Trying to register BlockManager
15/01/05 15:11:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.100.85:44728 with 0.0 B RAM
15/01/05 15:11:45 INFO BlockManagerMaster: Registered BlockManager
15/01/05 15:11:45 INFO HttpServer: Starting HTTP Server
15/01/05 15:11:45 INFO HttpBroadcast: Broadcast server star
How do I stop these?

Edit your conf/log4j.properties file and change the following line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
Another approach:
Start spark-shell and type in the following:
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
You won't see any logs after that.
Other options for Level include: all, debug, error, fatal, info, off, trace, trace_int, warn
Details about each can be found in the documentation.
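The first approach above (changing the root level in conf/log4j.properties) can also be scripted, which helps when the same edit has to be repeated on several machines. This is a minimal Python sketch that works on the file's text; the function name set_root_level is hypothetical, and it assumes a log4j 1.x properties file with a rootCategory or rootLogger line.

```python
import re


def set_root_level(properties_text, level):
    """Replace the level in a log4j 1.x rootCategory/rootLogger line."""
    pattern = re.compile(
        r"^(\s*log4j\.root(?:Category|Logger)\s*=\s*)\w+", re.MULTILINE
    )
    # Keep the key and '=' (group 1) and swap only the level that follows.
    updated, count = pattern.subn(
        lambda m: m.group(1) + level.upper(), properties_text
    )
    if count == 0:
        raise ValueError("no log4j.rootCategory or log4j.rootLogger line found")
    return updated


before = "log4j.rootCategory=INFO, console\nlog4j.logger.org.eclipse.jetty=WARN\n"
print(set_root_level(before, "error"))
```

Commented-out lines are left alone, since the pattern only matches lines that start with the property name.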

Right after starting spark-shell, type:
sc.setLogLevel("ERROR")
You could put this in a preload file and use it like this:
spark-shell ... -I preload-file ...
In Spark 2.0 (Scala):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
API Docs : https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SparkSession
For Java:
SparkSession spark = SparkSession.builder().getOrCreate();
spark.sparkContext().setLogLevel("ERROR");

All the methods collected with examples
Intro
Actually, there are many ways to do it.
Some are harder than others, but it is up to you which one suits you best. I will try to showcase them all.
#1 Programmatically in your app
This seems to be the easiest, but you will need to recompile your app to change those settings. Personally, I don't like it, but it works fine.
Example:
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.spark_project").setLevel(Level.WARN)
You can achieve much more just using log4j API.
Source: [Log4J Configuration Docs, Configuration section]
#2 Pass log4j.properties during spark-submit
This one is very tricky, but not impossible. And my favorite.
During app startup, Log4J always looks for and loads the log4j.properties file from the classpath.
However, when using spark-submit, the Spark cluster's classpath takes precedence over the app's classpath! This is why putting this file in your fat jar will not override the cluster's settings!
Add -Dlog4j.configuration=<location of configuration file> to
spark.driver.extraJavaOptions (for the driver) or
spark.executor.extraJavaOptions (for executors).
Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
To satisfy the last condition, you can either upload the file to a location available to all nodes (like HDFS) or, in client deploy mode, have the driver access it locally. Otherwise:
upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
Source: Spark docs, Debugging
Steps:
Example log4j.properties:
# Blacklist all to warn level
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Whitelist our app to info :)
log4j.logger.com.github.atais=INFO
Executing spark-submit, for cluster mode:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
--files "/absolute/path/to/your/log4j.properties" \
--class com.github.atais.Main \
"SparkApp.jar"
Note that you must use --driver-java-options if using client mode. Spark docs, Runtime env
Executing spark-submit, for client mode:
spark-submit \
--master yarn \
--deploy-mode client \
--driver-java-options "-Dlog4j.configuration=file:/absolute/path/to/your/log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
--files "/absolute/path/to/your/log4j.properties" \
--class com.github.atais.Main \
"SparkApp.jar"
Notes:
Files uploaded to the cluster with --files are placed in each container's working directory, so there is no need to add any path in file:log4j.properties.
Files listed in --files must be provided with absolute path!
file: prefix in configuration URI is mandatory.
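The two spark-submit invocations above differ only in how the driver gets its log4j option. A minimal Python sketch that assembles the argument list for either mode is shown below; build_submit_args is a hypothetical helper name (not part of Spark), and it only builds the command, it does not run it.

```python
from pathlib import PurePosixPath


def build_submit_args(log4j_path, main_class, app_jar, deploy_mode="cluster"):
    """Return a spark-submit argument list that ships a custom log4j.properties.

    Hypothetical helper: it mirrors the two commands above, nothing more.
    """
    path = PurePosixPath(log4j_path)
    if not path.is_absolute():
        raise ValueError("--files entries must use an absolute path")

    # Files shipped with --files are referenced by bare file name.
    shipped = f"-Dlog4j.configuration=file:{path.name}"

    args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if deploy_mode == "client":
        # In client mode the driver runs on the submitting machine,
        # so it reads the local file through its absolute path.
        args += ["--driver-java-options", f"-Dlog4j.configuration=file:{path}"]
    else:
        args += ["--conf", f"spark.driver.extraJavaOptions={shipped}"]

    args += [
        "--conf", f"spark.executor.extraJavaOptions={shipped}",
        "--files", str(path),
        "--class", main_class,
        app_jar,
    ]
    return args
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting problems with the -D options.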
#3 Edit cluster's conf/log4j.properties
This changes global logging configuration file.
update the $SPARK_CONF_DIR/log4j.properties file and it will be
automatically uploaded along with the other configurations.
Source: Spark docs, Debugging
To find your SPARK_CONF_DIR you can use spark-shell:
atais@cluster:~$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
scala> System.getenv("SPARK_CONF_DIR")
res0: String = /var/lib/spark/latest/conf
Now just edit /var/lib/spark/latest/conf/log4j.properties (with example from method #2) and all your apps will share this configuration.
#4 Override configuration directory
If you like solution #3 but want to customize it per application, you can copy the conf folder, edit its contents, and specify it as the root configuration during spark-submit.
To specify a different configuration directory other than the default “SPARK_HOME/conf”, you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc) from this directory.
Source: Spark docs, Configuration
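The per-application configuration directory described above can also be prepared from a script. The following is a rough Python sketch, assuming the cluster's conf folder is readable from where you submit; prepare_conf_dir is a hypothetical helper name, and it only builds the directory and the environment, leaving the spark-submit call to you.

```python
import os
import shutil
import tempfile
from pathlib import Path


def prepare_conf_dir(cluster_conf, log4j_text, target=None):
    """Copy the cluster conf folder and override its log4j.properties.

    Returns (conf_dir, env) where env is a copy of the current environment
    with SPARK_CONF_DIR pointing at the new folder.
    """
    # Use a fresh temporary folder unless the caller names one.
    target = Path(target) if target else Path(tempfile.mkdtemp(prefix="spark-conf-"))

    # Keep spark-defaults.conf, spark-env.sh, etc. from the cluster's folder.
    shutil.copytree(cluster_conf, target, dirs_exist_ok=True)

    # Replace only the logging configuration.
    (target / "log4j.properties").write_text(log4j_text)

    env = dict(os.environ, SPARK_CONF_DIR=str(target.resolve()))
    return target, env
```

Passing the returned env to subprocess.run when launching spark-submit has the same effect as the export SPARK_CONF_DIR=... line, but scoped to that one process.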
Steps:
Copy cluster's conf folder (more info, method #3)
Edit log4j.properties in that folder (example in method #2)
Set SPARK_CONF_DIR to this folder, before executing spark-submit,
example:
export SPARK_CONF_DIR=/absolute/path/to/custom/conf
spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.github.atais.Main \
"SparkApp.jar"
Conclusion
I am not sure if there is any other method, but I hope this covers the topic from A to Z. If not, feel free to ping me in the comments!
Enjoy your way!

Thanks @AkhlD and @Sachin Janani for suggesting changes in the .conf file.
Following code solved my issue:
1) Added import org.apache.log4j.{Level, Logger} in import section
2) Added following line after creation of spark context object i.e. after val sc = new SparkContext(conf):
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)

Use the command below to change the log level while submitting an application with spark-submit or spark-sql:
spark-submit \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:<file path>/log4j.xml" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:<file path>/log4j.xml"
Note: replace <file path> with the directory where the log4j config file is stored.
Log4j.properties:
log4j.rootLogger=ERROR, console
# set the log level for these components
log4j.logger.com.test=DEBUG
log4j.logger.org=ERROR
log4j.logger.org.apache.spark=ERROR
log4j.logger.org.spark-project=ERROR
log4j.logger.org.apache.hadoop=ERROR
log4j.logger.io.netty=ERROR
log4j.logger.org.apache.zookeeper=ERROR
# add a ConsoleAppender to the logger stdout to write to the console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# use a simple message format
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
<appender name="console" class="org.apache.log4j.ConsoleAppender">
<param name="Target" value="System.out"/>
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n" />
</layout>
</appender>
<logger name="org.apache.spark">
<level value="error" />
</logger>
<logger name="org.spark-project">
<level value="error" />
</logger>
<logger name="org.apache.hadoop">
<level value="error" />
</logger>
<logger name="io.netty">
<level value="error" />
</logger>
<logger name="org.apache.zookeeper">
<level value="error" />
</logger>
<logger name="org">
<level value="error" />
</logger>
<root>
<priority value ="ERROR" />
<appender-ref ref="console" />
</root>
</log4j:configuration>
Switch to FileAppender in log4j.xml if you want to write logs to file instead of console. LOG_DIR is a variable for logs directory which you can supply using spark-submit --conf "spark.driver.extraJavaOptions=-D.
<appender name="file" class="org.apache.log4j.DailyRollingFileAppender">
<param name="file" value="${LOG_DIR}"/>
<param name="datePattern" value="'.'yyyy-MM-dd"/>
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d [%t] %-5p %c %x - %m%n"/>
</layout>
</appender>
Another important thing to understand: when a job is launched in distributed mode (deploy-mode cluster with master set to yarn or mesos), the log4j configuration file must exist on the driver and worker nodes (log4j.configuration=file:<file path>/log4j.xml); otherwise log4j initialization will complain:
log4j:ERROR Could not read configuration file [log4j.properties].
java.io.FileNotFoundException: log4j.properties (No such file or
directory)
Hints for solving this problem:
Keep the log4j config file in a distributed file system (HDFS or mesos) and load it as external configuration using log4j PropertyConfigurator,
or use sparkContext addFile to make it available on each node and then use log4j PropertyConfigurator to reload the configuration.

You can disable the logs by setting the level to OFF as follows:
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
or edit the log4j.properties file and set the log level to OFF by changing the following property:
log4j.rootCategory=OFF, console

I just add this line at the top of all my pyspark scripts, right below the import statements.
SparkSession.builder.getOrCreate().sparkContext.setLogLevel("ERROR")
example header of my pyspark scripts
from pyspark.sql import SparkSession, functions as fs
SparkSession.builder.getOrCreate().sparkContext.setLogLevel("ERROR")

Answers above are correct but didn't exactly help me as there was additional information I required.
I had just set up Spark, so the log4j file still had the '.template' suffix and wasn't being read. I believe that logging then defaults to Spark's core logging configuration.
So if you are like me and find that the answers above didn't help, then maybe you too have to remove the '.template' suffix from your log4j conf file and then the above works perfectly!
http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-td11278.html

In Python/Spark we can do:
def quiet_logs(sc):
    # Reach the JVM-side log4j package through the py4j gateway
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel(logger.Level.ERROR)
    logger.LogManager.getLogger("akka").setLevel(logger.Level.ERROR)
Then, after defining the SparkContext 'sc', call this function with: quiet_logs(sc)

tl;dr
For Spark Context you may use:
sc.setLogLevel(<logLevel>)
where loglevel can be ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE or
WARN.
Details-
Internally, setLogLevel calls org.apache.log4j.Level.toLevel(logLevel) and then passes the result to org.apache.log4j.LogManager.getRootLogger().setLevel(level).
You may directly set the logging levels to OFF using:
LogManager.getLogger("org").setLevel(Level.OFF)
You can set up the default logging for Spark shell in conf/log4j.properties. Use conf/log4j.properties.template as a starting point.
Setting Log Levels in Spark Applications
In standalone Spark applications or while in Spark Shell session, use the following:
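import org.apache.hadoop.yarn.util.RackResolver
import org.apache.log4j.{Level, Logger}
// Inspect the current level of a specific logger
Logger.getLogger(classOf[RackResolver]).getLevel
// Silence everything under the org and akka namespaces
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)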
Disabling logging(in log4j):
Use the following in conf/log4j.properties to disable logging for everything under the org namespace:
log4j.logger.org=OFF
Reference: Mastering Spark by Jacek Laskowski.

Simply add below param to your spark-shell OR spark-submit command
--conf "spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console"
Check exact property name (log4jspark.root.logger here) from log4j.properties file.
Hope this helps, cheers!
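To find out which property names a given log4j.properties actually exposes for this kind of -D override, you can scan it for ${...} placeholders. Below is a small Python sketch; overridable_properties is a hypothetical helper name, and it only understands simple key=value lines, not every syntax the Java properties format allows.

```python
import re

# Matches ${some.property} placeholders inside a property value.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def overridable_properties(properties_text):
    """Map each ${placeholder} name to the keys whose values reference it."""
    found = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        for name in _PLACEHOLDER.findall(value):
            found.setdefault(name, []).append(key.strip())
    return found
```

Any name it reports is a candidate for -D<name>=... on the command line; if the file contains no placeholders, this approach has nothing to override and one of the other methods is needed.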

Adding the following to the PySpark did the job for me:
self.spark.sparkContext.setLogLevel("ERROR")
self.spark is the spark session (self.spark = spark_builder.getOrCreate())

Simple to do on the command line...
spark2-submit --driver-java-options="-Droot.logger=ERROR,console" ..other options..

An interesting idea is to use a RollingFileAppender, as suggested here: http://shzhangji.com/blog/2015/05/31/spark-streaming-logging-configuration/
That way you don't pollute the console, but can still see the results under $YOUR_LOG_PATH_HERE/${dm.logging.name}.log.
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=$YOUR_LOG_PATH_HERE/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
Another method that addresses the cause is to observe what kinds of log output you usually get (coming from different modules and dependencies) and set the logging granularity for each, while quieting third-party loggers that are too verbose:
For instance,
# Silence akka remoting
log4j.logger.Remoting=ERROR
log4j.logger.akka.event.slf4j=ERROR
log4j.logger.org.spark-project.jetty.server=ERROR
log4j.logger.org.apache.spark=ERROR
log4j.logger.com.anjuke.dm=${dm.logging.level}
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
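If you maintain many of these per-logger overrides, generating the lines from a mapping keeps them consistent. Here is a minimal Python sketch; render_logger_levels is a hypothetical helper name, and it deliberately rejects anything that is not a plain log4j level name, so ${...} substitutions like the one above would have to be added by hand.

```python
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def _check(level):
    """Normalize a level name and reject unknown ones."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    return level


def render_logger_levels(levels, root="WARN", appender="console"):
    """Render log4j.properties lines for a root level plus per-logger levels."""
    lines = [f"log4j.rootLogger={_check(root)}, {appender}"]
    # Sort by logger name so the generated file is stable between runs.
    for name, level in sorted(levels.items()):
        lines.append(f"log4j.logger.{name}={_check(level)}")
    return "\n".join(lines) + "\n"
```

The output still needs the appender definition (class, layout, pattern) from one of the examples in this thread before log4j can use it.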

If you can't edit the Java code to insert the .setLogLevel() statements and you don't want yet more external files to deploy, you can solve this by brute force: just filter out the INFO lines using grep. Spark's default console appender writes to stderr, so redirect it to stdout first.
spark-submit --deploy-mode client --master local <rest-of-cmd> 2>&1 | grep -v -F "INFO"
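A plain substring match is blunt: it also drops any application output that happens to contain the text INFO. The same filtering can be sketched in Python by matching only the level field. This assumes the console layout shown earlier in this thread (%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n); drop_levels is a hypothetical helper name.

```python
import re

# Timestamp followed by the level field, e.g. "21/03/04 10:00:00 INFO ".
_LEVEL = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\w+) ")


def drop_levels(lines, hidden=("INFO",)):
    """Yield every line except log lines whose level is listed in hidden."""
    for line in lines:
        match = _LEVEL.match(line)
        if match and match.group(1) in hidden:
            continue
        # Lines that do not look like log lines pass through untouched.
        yield line
```

Feeding it sys.stdin and writing the result to sys.stdout gives a drop-in replacement for the grep stage of the pipeline.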

Adjust conf/log4j.properties as described by the other answers:
log4j.rootCategory=ERROR, console
Make sure that when executing your spark job you pass the --files flag with the log4j.properties file path.
If it still doesn't work, you might have a jar containing a log4j.properties that is being loaded before your new log4j.properties. Remove that log4j.properties from the jar (if appropriate).

sparkContext.setLogLevel("OFF")

In addition to all the above posts, here is what solved the issue for me.
Spark uses slf4j to bind to loggers. If log4j is not the first binding found, you can edit log4j.properties files all you want; the log4j loggers are not even used. For example, this could be a possible SLF4J output:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/Users/~/.m2/repository/org/slf4j/slf4j-simple/1.6.6/slf4j-simple-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/Users/~/.m2/repository/org/slf4j/slf4j-log4j12/1.7.19/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
So here the SimpleLoggerFactory was used, which does not care about log4j settings.
Excluding the slf4j-simple package from my project via
<dependency>
...
<exclusions>
...
<exclusion>
<artifactId>slf4j-simple</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
resolved the issue, as now the log4j logger binding is used and any setting in log4j.properties is adhered to.
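To check quickly whether this is your situation, you can capture the startup output and look at the "Actual binding" line. Below is a small Python sketch based only on the SLF4J message format shown above; actual_slf4j_binding and uses_log4j are hypothetical helper names, and the log4j check is a simple substring test on the factory class name.

```python
import re

# The line SLF4J prints after it has chosen a binding.
_ACTUAL = re.compile(r"SLF4J: Actual binding is of type \[(.+?)\]")


def actual_slf4j_binding(output):
    """Return the factory class SLF4J reports as bound, or None if not reported."""
    match = _ACTUAL.search(output)
    return match.group(1) if match else None


def uses_log4j(output):
    """True if the reported binding looks like a log4j one."""
    binding = actual_slf4j_binding(output)
    return binding is not None and "log4j" in binding.lower()
```

If uses_log4j returns False for your job's output, editing log4j.properties will have no effect until the competing binding is excluded.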
F.Y.I. my log4j properties file contains (besides the normal configuration)
log4j.rootLogger=WARN, stdout
...
log4j.category.org.apache.spark = WARN
log4j.category.org.apache.parquet.hadoop.ParquetRecordReader = FATAL
log4j.additivity.org.apache.parquet.hadoop.ParquetRecordReader=false
log4j.logger.org.apache.parquet.hadoop.ParquetRecordReader=OFF
Hope this helps!

This one worked for me.
To display only ERROR messages on stdout, the log4j.properties file may look like:
# Root logger option
log4j.rootLogger=ERROR, stdout
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
NOTE: Put the log4j.properties file in the src/main/resources folder for it to take effect.
If log4j.properties doesn't exist (meaning Spark is using the log4j-defaults.properties file), you can create it by going to SPARK_HOME/conf and running mv log4j.properties.template log4j.properties, and then proceed with the changes above.

If anyone else is stuck on this: none of the above worked for me.
I had to remove
implementation group: "ch.qos.logback", name: "logback-classic", version: "1.2.3"
implementation group: 'com.typesafe.scala-logging', name: "scala-logging_$scalaVersion", version: '3.9.2'
from my build.gradle for the logs to disappear. TLDR: Don't import any other logging frameworks, you should be fine just using org.apache.log4j.Logger

Another way of stopping logs completely is:
import org.apache.log4j.Appender;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.varia.NullAppender;
public class SomeClass {
public static void main(String[] args) {
Appender nullAppender = new NullAppender();
BasicConfigurator.configure(nullAppender);
{...more code here...}
}
}
This worked for me.
A NullAppender is "An Appender that ignores log events." (https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/NullAppender.html)
