Spark logs configuration - log4j - apache-spark

I have a Spark application whose code lives under the package com.myapplication.
My log4j configuration is:
# Root logger option
log4j.rootLogger=WARN, stdout
log4j.category.com.myapplication=INFO
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
When I run a script with this configuration, every log message appears twice in the console, e.g.:
2022-07-26 09:47:00 INFO Spark$:49 - Configuring Spark...
1969 [main] INFO com.commerzbank.cda.spark.Spark$ - Configuring Spark...
2022-07-26 09:47:00 WARN SparkConf:66 - The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
1971 [main] WARN org.apache.spark.SparkConf - The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
2022-07-26 09:47:00 WARN SparkConf:66 - The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
1971 [main] WARN org.apache.spark.SparkConf - The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
2022-07-26 09:47:00 INFO Spark$:82 - Pick up Spark property master...
1972 [main] INFO com.commerzbank.cda.spark.Spark$ - Pick up Spark property master...
Each message is printed twice: once with the date prefix and once with the [main] prefix. Is this caused by a wrong log4j configuration, or should I be looking somewhere in my code?
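A hedged note: the second line's format (elapsed milliseconds plus [thread]) matches log4j 1.x's BasicConfigurator default layout (%r [%t] %p %c %x - %m%n), so duplicated output in two different patterns usually means two log4j configurations or appenders are active at once, e.g. the file above plus another log4j.properties on the classpath or a stray BasicConfigurator.configure() call. One way to see which configuration files log4j 1.x actually loads is its debug flag:
# Sketch (assuming the app is launched through spark-submit): ask log4j 1.x to
# print which configuration file(s) it picks up for the driver and executors.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.debug=true" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.debug=true" \
  ...   # the rest of your usual arguments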

Related

Dataproc: limit log size for long-running / streaming Spark jobs

I have a Spark Structured Streaming job on GCP Dataproc which picks up data from Kafka, processes it, and pushes the results back into Kafka topics.
A couple of questions:
Does Spark put all the logs (incl. INFO, WARN, etc.) into stderr? What I notice is that stdout is empty, while all the logging goes to stderr.
Is there a way for me to expire the data in stderr (i.e. expire the older logs)? Since I have a long-running streaming job, stderr fills up over time and the nodes/VMs become unavailable.
Please advise.
Here is the output of the yarn logs command:
root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:25:34,876 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:25:35,144 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:25:35 +0000 2022
LogLength:43251469683
LogContents:
applianceName=usa-isn0784-rt01, tenantName=NOV, mstatsTimeBlock=1663507200, tenantId=2, vsnId=0, mstatsTotSentOctets=11596, mstatsTotRecvdOctets=24481, mstatsTotSessDuration=300000, mstatsTotSessCount=1, mstatsType=sdwan-acc-ckt-app-stats, appId=https, site=usa-isn0784-rt01, accCkt=WAN-DIA, siteId=442, accCktId=1, user=10.126.117.196, risk=3, productivity=3, family=general-internet, subFamily=web, bzTag=Unknown,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
***********************************************************************
root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:26:01,439 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:26:01,696 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:26:02 +0000 2022
LogLength:44309782124
LogContents:
, tenantId=3, vsnId=0, mstatsTotSentOctets=48210, mstatsTotRecvdOctets=242351, mstatsTotSessDuration=300000, mstatsTotSessCount=34, mstatsType=dest-stats, destIp=165.225.216.24, mstatsAttribs=,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa type(row) is -> <class 'str'>
22/09/19 23:26:02 WARN org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************
Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
Update:
Based on @Dagang's note, I'm using the RollingFileAppender in the log4j.properties, and the new log file is getting created. However, some data is still going into stderr.
Here is the updated code:
spark-submit
gcloud dataproc jobs submit pyspark process-appstat.py \
--cluster $CLUSTER \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.dynamicAllocation.executorIdleTimeout=120s#spark.shuffle.service.enabled=true#spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-executor.properties#spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j-driver.properties\
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.3.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar,gs://dataproc-spark-jars/mongo-spark-connector_2.12-3.0.2.jar,gs://dataproc-spark-jars/bson-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-sync-4.0.5.jar,gs://dataproc-spark-jars/mongodb-driver-core-4.0.5.jar \
--files=gs://kafka-certs/versa-kafka-gke-ca.p12,gs://kafka-certs/syslog-vani-noacl.p12,gs://kafka-certs/alarm-compression-user.p12,gs://kafka-certs/alarm-compression-user-test.p12,gs://kafka-certs/appstats-user.p12,gs://kafka-certs/appstats-user-test.p12,gs://kafka-certs/insights-user.p12,gs://kafka-certs/insights-user-test.p12,gs://kafka-certs/intfutil-user.p12,gs://kafka-certs/intfutil-user-test.p12,gs://dataproc-spark-configs/metrics.properties,gs://dataproc-spark-configs/params.cfg,gs://kafka-certs/appstat-anomaly-user.p12,gs://kafka-certs/appstat-anomaly-user-test.p12,gs://kafka-certs/appstat-agg-user.p12,gs://kafka-certs/appstat-agg-user-test.p12,gs://kafka-certs/alarmblock-user.p12,gs://kafka-certs/alarmblock-user-test.p12,gs://kafka-certs/versa-alarmblock-test-user.p12,gs://kafka-certs/versa-bandwidth-test-user.p12,gs://kafka-certs/versa-appstat-test-user.p12,gs://kafka-certs/versa-alarmblock-user.p12,gs://kafka-certs/versa-bandwidth-user.p12,gs://kafka-certs/versa-appstat-user.p12,gs://dataproc-spark-configs/log4j-executor.properties,gs://dataproc-spark-configs/log4j-driver.properties \
--region $REGION \
--py-files streams.zip,utils.zip \
-- isdebug=$isdebug
log4j-executor.properties:
--------------------------
# Set everything to be logged to the console
# log4j.rootCategory=INFO, console
# log4j.appender.console=org.apache.log4j.ConsoleAppender
# log4j.appender.console.target=System.err
# log4j.appender.console.layout=org.apache.log4j.PatternLayout
# log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# logging to rolling_file, using RolligFileAppender
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-executor.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
log4j-driver.properties:
-------------------------
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/versa-ss-driver.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.eclipse.jetty=WARN
# Allow INFO logging from Spark Env for EFM
log4j.logger.org.apache.spark.SparkEnv=INFO
# Spark 3.x
log4j.logger.org.sparkproject.jetty.server.handler.ContextHandler=WARN
# Spark 2.x
log4j.logger.org.spark_project.jetty.server.handler.ContextHandler=WARN
# Reduce verbosity for other spammy core classes
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=WARN
log4j.logger.org.apache.spark.ExecutorAllocationManager=ERROR
log4j.logger.org.apache.spark=WARN
Any ideas on what needs to be done for this?
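One hedged observation on the data still reaching stderr: the fragments shown there (e.g. "... type(row) is -> <class 'str'>") look like plain Python print() output, which log4j never sees, so no log4j.properties change will redirect it. On the driver, such messages could be routed through the JVM's log4j logger instead; spark._jvm is an internal PySpark handle and the names below are illustrative, and print() calls running inside executor-side functions will still go to the executors' stdout/stderr:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-streaming-app").getOrCreate()
# Driver-side only: fetch the JVM's log4j logger through the py4j gateway.
jvm_logger = spark._jvm.org.apache.log4j.LogManager.getLogger("my-streaming-app")
jvm_logger.info("stream configured")            # goes through log4j -> rolling_file appender
jvm_logger.warn("this replaces a driver-side print() call")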
Question on ${spark.yarn.app.container.log.dir}: what location does this get translated to?
When I log on to a worker node and check this, I get the following:
karanalang@versa-structured-stream-v1-w-0:~$ echo $spark.yarn.app.container.log.dir
.yarn.app.container.log.dir
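${spark.yarn.app.container.log.dir} is resolved by log4j from a JVM system property that Spark sets for each YARN container; it is not a shell environment variable, which is why the echo above simply drops the $spark part. A hedged way to see the actual value on a worker while a container is running:
# Sketch: look for the -D flag on the running executor / application master JVMs.
ps -ef | grep -o '\-Dspark.yarn.app.container.log.dir=[^ ]*' | sort -u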
In yarn-site.xml:
Here are the relevant configs:
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/hadoop/yarn/nm-local-dir</value>
  <description>
    Directories on the local machine in which to store application temp files.
  </description>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>gs://dataproc-temp-us-east1-939354532596-4ln8c3y1/fe57047f-13d9-4b9b-8bce-baa4a911aa65/yarn-logs</value>
  <description>
    The remote path, on the default FS, to store logs.
  </description>
</property>
However, the logs are in the location below:
root@versa-structured-stream-v1-w-0:/# find . -name versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0002/container_1664926662510_0002_01_000001/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000179/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000250/versa-ss-executor.log
./var/log/hadoop-yarn/userlogs/application_1664926662510_0003/container_1664926662510_0003_01_000299/versa-ss-executor.log
Where is the location ./var/log/hadoop-yarn/userlogs taken from (it is not in yarn-site.xml)?
Short answer:
You can use a custom log4j config with RollingFileAppender to limit the log size for long-running jobs.
Long answer:
The default log4j config for Spark on Dataproc is at /etc/spark/conf/log4j.properties. It configures the root logger to write to stderr at INFO level. But at runtime, driver logs (in client mode) will be directed by the Dataproc agent to GCS and streamed back to the client, and executor logs (and driver logs in cluster mode) will be redirected by YARN to the stderr file in the container's YARN log dir. Consider using /etc/spark/conf/log4j.properties as the template for your custom config.
In your custom config, you can configure logs to be written to a RollingFileAppender, e.g.,
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/my_app.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
...
Note that for executors (and drivers in cluster mode), the value of log4j.appender.rolling_file.File needs to be a path under ${spark.yarn.app.container.log.dir}, see this question and this doc.
Upload your log4j config(s) to a GCS bucket; the driver and executor may or may not share the same config. In your case, you might want to update only the executor log4j config and just use the default for the driver.
Then submit the job with the custom log4j config in one of the following ways:
The file name must be log4j.properties; the driver and executor will share the same config:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/log4j.properties
The file name doesn't have to be log4j.properties; the driver and executor can have different configs:
gcloud dataproc jobs submit spark ... \
--files gs://my-bucket/my-log4j.properties \
--properties 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties'
The expectation is that there will be rolling logs under the YARN container log dirs (configurable through yarn.nodemanager.log-dirs, with default value /var/log/hadoop-yarn/userlogs on Dataproc) for the Spark executors; they will be automatically aggregated and stored in GCS and Cloud Logging.
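For reference, this is roughly the yarn-site.xml property that controls where those per-container logs land on each node (a sketch; the value shown is the Dataproc default mentioned above, your cluster may differ):
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/log/hadoop-yarn/userlogs</value>
  <description>Directories on the local machine where YARN writes per-container logs.</description>
</property>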

Log4j not showing logs when spark job is submitted on yarn cluster

No application logs are shown in the YARN logs when a job is submitted to the YARN cluster. Everything works in local mode.
The following error is shown:
StatusLogger Log4j2 could not find a logging implementation
I provide the following to spark-submit:
--driver-library-path /opt/spark-extras/apache-log4j-2.11.2-bin \
--conf spark.driver.extraLibraryPath=/opt/spark-extras/apache-log4j-2.11.2-bin \
--conf spark.executor.extraLibraryPath=/opt/spark-extras/apache-log4j-2.11.2-bin \
The spark-defaults.conf specifies the log4j2.xml:
spark.executor.extraJavaOptions ... -Dlog4j.configurationFile=log4j2.xml
The xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n" />
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console" />
    </Root>
  </Loggers>
</Configuration>
Then my log on yarn shows only something like:
2019-06-05 16:45:28,495 [main] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:46,298 [Driver] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:46,314 [Driver] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:46,315 [Driver] WARN org.apache.spark.SparkConf - The configuration key 'spark.executor.port' has been deprecated as of Spark 2.0.0 and may be removed in the future. Not used anymore
2019-06-05 16:45:48,324 [Driver] WARN org.apache.spark.scheduler.FairSchedulableBuilder - Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
2019-06-05 16:45:49,977 [Reporter] INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for :
2019-06-05 16:45:50,654 [ContainerLauncher-2] INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - yarn.client.max-cached-nodemanagers-proxies : 0
2019-06-05 16:45:53,692 [Reporter] INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for :
How can I configure the spark-submit job so that it logs using log4j2, as it does in local mode?
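One hedged observation: "StatusLogger Log4j2 could not find a logging implementation" usually means log4j-core is missing from the application classpath, and spark.driver/executor.extraLibraryPath only sets the native library path (java.library.path), not the JVM classpath. A sketch of shipping the Log4j2 jars and the config explicitly (the jar names are the standard ones in the apache-log4j-2.11.2-bin distribution; adjust to your layout):
spark-submit \
  --master yarn --deploy-mode cluster \
  --jars /opt/spark-extras/apache-log4j-2.11.2-bin/log4j-api-2.11.2.jar,/opt/spark-extras/apache-log4j-2.11.2-bin/log4j-core-2.11.2.jar \
  --files log4j2.xml \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
  ...   # main class and application jar as before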

How to get driver-Id in spark submission

Spark cluster information:
Spark version: 2.2.0
Cluster contains a master node with 2 worker nodes
Cluster Manager Type: standalone
I submit a jar to the Spark cluster from one of the workers and want to receive the driver-Id from the submission so that I can use that id later to check the application status. The problem is that I am not getting any output in the console. I use port 6066 for submission and set the deploy mode to cluster.
By running
spark-submit --deploy-mode cluster --supervise --class "path/to/class" --master "spark://spark-master-headless:6066" path/to/app.jar
in the Spark log file I am able to see the JSON response of the submission below, which is exactly what I want:
[INFO] 2018-07-18 12:48:40,030 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Submitting a request to launch an application in spark://spark-master-headless:6066.
[INFO] 2018-07-18 12:48:41,074 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Submission successfully created as driver-20180718124840-0023. Polling submission state...
[INFO] 2018-07-18 12:48:41,077 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Submitting a request for the status of submission driver-20180718124840-0023 in spark://spark-master-headless:6066.
[INFO] 2018-07-18 12:48:41,092 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - State of driver driver-20180718124840-0023 is now RUNNING.
[INFO] 2018-07-18 12:48:41,093 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Driver is running on worker worker-20180707104934-<some-ip-was-here>-7078 at <some-ip-was-here>:7078.
[INFO] 2018-07-18 12:48:41,114 org.apache.spark.deploy.rest.RestSubmissionClient logInfo - Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20180718124840-0023",
"serverSparkVersion" : "2.2.0",
"submissionId" : "driver-20180718124840-0023",
"success" : true
}
[INFO] 2018-07-18 12:48:46,572 org.apache.spark.executor.CoarseGrainedExecutorBackend initDaemon - Started daemon with process name: 31605#spark-worker-662224983-4qpfw
[INFO] 2018-07-18 12:48:46,580 org.apache.spark.util.SignalUtils logInfo - Registered signal handler for TERM
[INFO] 2018-07-18 12:48:46,583 org.apache.spark.util.SignalUtils logInfo - Registered signal handler for HUP
[INFO] 2018-07-18 12:48:46,583 org.apache.spark.util.SignalUtils logInfo - Registered signal handler for INT
[WARN] 2018-07-18 12:48:47,293 org.apache.hadoop.util.NativeCodeLoader <clinit> - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] 2018-07-18 12:48:47,607 org.apache.spark.SecurityManager logInfo - Changing view acls to: root
[INFO] 2018-07-18 12:48:47,608 org.apache.spark.SecurityManager logInfo - Changing modify acls to: root
...
But I want to have this information in the console so that I can redirect it to a file separate from the Spark logs. I assume some messages should get printed when the above command is run. I even used --verbose mode in the command in case it helps, but the output in the console is still empty.
The only thing that gets printed to the console is
Running Spark using the REST application submission protocol. While in this page's question section, the user is able to see more than this.
I even tried to change the logger level in my application code, but that also didn't help (based on some ideas from here).
So the question is: why am I not getting any output in the console, and what can I do to get the info I want printed to the console?
P.S. I have developed and tweaked the cluster and the jar file quite a bit, so maybe I have something somewhere causing the output not to get printed. What are the possible places I can check to fix this?
Update:
I found out that the default log4j.properties of Spark has been edited. Here is the content:
# Set everything to be logged to the console
log4j.rootCategory=INFO, RollingAppender
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.RollingAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RollingAppender.File=/var/log/spark.log
log4j.appender.RollingAppender.DatePattern='.'yyyy-MM-dd
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.ConversionPattern=[%p] %d %c %M - %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=INFO
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=INFO
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
I assume this is what prevents the verbose output from showing up. How can I change it to get some output with --verbose?
As you are running the job in cluster mode, the driver can be on any node in the cluster, so whatever you print or redirect to the console may not come back to the client/edge/worker node where the console is open.
Try submitting the application in client mode.
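A hedged side note on the edited log4j.properties shown in the update: the root category there is routed only to RollingAppender (file /var/log/spark.log), so the RestSubmissionClient INFO lines, including the CreateSubmissionResponse with the submissionId, end up in that file rather than on the console. A sketch that keeps the rolling file but restores console output:
# Attach both appenders to the root category so the submission response is
# also written to the console appender (which targets System.err above).
log4j.rootCategory=INFO, console, RollingAppender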

PySpark WARN messages

How can I disable the following WARN messages when running PySpark code:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/08 21:04:55 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
18/06/08 21:04:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
I spent some time playing with log4j.properties, but cannot figure out exactly which class logs these.
Put this in the init of your Spark context; note that to hide WARN messages the level has to be above WARN, e.g.:
sc.setLogLevel("ERROR")

How should spark sql be configured to access hive metastore? [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I'm trying to use Spark SQL to read a table from the Hive metastore, but Spark gives a table-not-found error. I'm afraid that Spark SQL creates a whole new, empty metastore.
I submit the Spark task through this command:
spark-submit --class etl.EIServerSpark --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' --jars $HIVE_CLASSPATH --files /etc/hive/conf/hive-site.xml,/etc/hadoop/conf/yarn-site.xml --master yarn-client /root/etl.jar
This is the error:
2015-06-30 17:50:51,563 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Copying /etc/hive/conf/hive-site.xml to /tmp/spark-568de027-8b66-40fa-97a4-2ec50614f486/hive-site.xml
2015-06-30 17:50:51,568 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Added file file:/etc/hive/conf/hive-site.xml at http://10.136.149.126:43349/files/hive-site.xml with timestamp 1435683051561
2015-06-30 17:50:51,568 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Copying /etc/hadoop/conf/yarn-site.xml to /tmp/spark-568de027-8b66-40fa-97a4-2ec50614f486/yarn-site.xml
2015-06-30 17:50:51,570 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Added file file:/etc/hadoop/conf/yarn-site.xml at http://10.136.149.126:43349/files/yarn-site.xml with timestamp 1435683051568
2015-06-30 17:50:51,637 INFO [sparkDriver-akka.actor.default-dispatcher-5] util.AkkaUtils (Logging.scala:logInfo(59)) - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#gateway.edp.hadoop:52818/user/HeartbeatReceiver
2015-06-30 17:50:51,756 INFO [main] netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 40198
2015-06-30 17:50:51,757 INFO [main] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager
2015-06-30 17:50:51,759 INFO [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerMasterActor (Logging.scala:logInfo(59)) - Registering block manager localhost:40198 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 40198)
2015-06-30 17:50:51,761 INFO [main] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-06-30 17:50:52,840 INFO [main] parse.ParseDriver (ParseDriver.java:parse(185)) - Parsing command: SELECT id, name FROM eiserver.eismpt
2015-06-30 17:50:53,141 INFO [main] parse.ParseDriver (ParseDriver.java:parse(206)) - Parse Completed
2015-06-30 17:50:54,041 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(502)) - 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
2015-06-30 17:50:54,064 INFO [main] metastore.ObjectStore (ObjectStore.java:initialize(247)) - ObjectStore, initialize called
2015-06-30 17:50:54,227 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/datanucleus-rdbms-3.2.9.jar."
2015-06-30 17:50:54,268 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/datanucleus-api-jdo-3.2.6.jar."
2015-06-30 17:50:54,274 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/datanucleus-core-3.2.10.jar."
2015-06-30 17:50:54,314 INFO [main] DataNucleus.Persistence (Log4JLogger.java:info(77)) - Property datanucleus.cache.level2 unknown - will be ignored
2015-06-30 17:50:54,315 INFO [main] DataNucleus.Persistence (Log4JLogger.java:info(77)) - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
2015-06-30 17:50:56,109 INFO [main] metastore.ObjectStore (ObjectStore.java:getPMF(318)) - Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
2015-06-30 17:50:56,170 INFO [main] metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:<init>(110)) - MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "#" (64), after : "".
2015-06-30 17:50:57,315 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,316 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,688 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,688 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,842 INFO [main] DataNucleus.Query (Log4JLogger.java:info(77)) - Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery#0" since the connection used is closing
2015-06-30 17:50:57,844 INFO [main] metastore.ObjectStore (ObjectStore.java:setConf(230)) - Initialized ObjectStore
2015-06-30 17:50:58,113 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:createDefaultRoles(560)) - Added admin role in metastore
2015-06-30 17:50:58,115 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:createDefaultRoles(569)) - Added public role in metastore
2015-06-30 17:50:58,198 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:addAdminUsers(597)) - No user is added in admin role, since config is empty
2015-06-30 17:50:58,376 INFO [main] session.SessionState (SessionState.java:start(383)) - No Tez session required at this point. hive.execution.engine=mr.
2015-06-30 17:50:58,525 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(632)) - 0: get_table : db=eiserver tbl=eismpt
2015-06-30 17:50:58,525 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(314)) - ugi=root ip=unknown-ip-addr cmd=get_table : db=eiserver tbl=eismpt
2015-06-30 17:50:58,567 ERROR [main] metadata.Hive (Hive.java:getTable(1003)) - NoSuchObjectException(message:eiserver.eismpt table not found)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1569)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
How can I configure Spark SQL to access a Hive metastore deployed on Postgres? I'm using CDH 5.3.2.
Thank you.
Configure Spark to use the Hive metastore thriftserver:
Edit $SPARK_HOME/conf/hive-site.xml to remove the direct connection information and to add this property:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value> <!-- make sure to replace this with your hive-metastore service's thrift URL -->
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
If hive-site.xml is not already in $SPARK_HOME/conf, then to connect to the Hive metastore you need to copy the hive-site.xml file into the spark/conf directory. Run the following command after logging in as the root user:
cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/
Create Hive Context
At a scala> REPL prompt type the following:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Create Hive Table
hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")
Show Hive Tables
scala> hiveContext.hql("SHOW TABLES").collect().foreach(println)
Test out the configuration (optional)
Stop the Spark SQL thriftserver with cd $SPARK_HOME; sbin/stop-thriftserver.sh
Start the Hive metastore thriftserver with cd;./start-thriftserver.sh
Check the logs at $HIVE_HOME/logs/metastore.out for any errors.
The Spark SQL thriftserver won't start until it can make a successful connection to
this server, so it must be running.
Start the Spark SQL thriftserver
with cd $SPARK_HOME; sbin/start-thriftserver.sh
Check the log file that is indicated in the returned line.
You should see lines like this:
16/12/29 20:22:19 INFO metastore: Trying to connect to metastore with URI thrift://localhost:9083
16/12/29 20:22:19 INFO metastore: Connected to metastore.
Run $SPARK_HOME/bin/beeline -u 'jdbc:hive2://localhost:10000/' and try out the !tables command to make sure that you are able to list the metadata.
The doc says to put spark.sql.hive.metastore.sharedPrefixes=org.postgresql in the configuration file; did you try this?
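For completeness, a sketch of where that setting could go; spark.sql.hive.metastore.sharedPrefixes can be set in spark-defaults.conf or passed with --conf on spark-submit (value as suggested above):
# $SPARK_HOME/conf/spark-defaults.conf
spark.sql.hive.metastore.sharedPrefixes  org.postgresql

# or, equivalently, on the command line:
spark-submit --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql ...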
Make sure the $HIVE_HOME/conf/hive-site.xml configuration points to the complete path of the metastore.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hive/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
Place the hive-site.xml file in $SPARK_HOME/conf to point Spark to the same metastore as Hive.
Hope this solves your issue.
