HADOOP_CONF_DIR seems to overrule SPARK_CONF_DIR for log4j configuration - apache-spark

Hi, we're running a Spark driver program in pyspark in yarn-client mode.
Spark version: 3.2.1
We have the following environment variables set:
HADOOP_CONF_DIR points to a folder containing all Hadoop configuration files (hdfs-site.xml, hive-site.xml, etc.). It also contains a log4j.properties file.
SPARK_CONF_DIR points to a folder containing the spark-defaults file and the log4j2.properties file.
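For reference, the environment is set up roughly like this (paths are illustrative):
export HADOOP_CONF_DIR=/etc/hadoop/conf   # hdfs-site.xml, hive-site.xml, ..., log4j.properties
export SPARK_CONF_DIR=/spark_conf_dir     # spark-defaults.conf, log4j2.properties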
These are the contents of the log4j.properties file in the folder referred to by HADOOP_CONF_DIR:
log4j.rootLogger=${hadoop.root.logger}
hadoop.root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
These are the contents of the log4j2.properties file in the folder referred to by SPARK_CONF_DIR:
# Log files location
property.basePath = ${env:LOG_PATH}
# Set everything to be logged to the console
appender.rolling.type = RollingFile
appender.rolling.name = fileLogger
appender.rolling.fileName= ${basePath}/vdp-ingestion.log
appender.rolling.filePattern= ${basePath}/vdp-ingestion_%d{yyyyMMdd}.log.gz
# log in json-format -> based on LogstashJsonEventLayout
appender.rolling.layout.type = JsonTemplateLayout
appender.rolling.layout.eventTemplateUri = classpath:LogstashJsonEventLayoutV1.json
# overrule message -> by default treated as a string, however we want an object so we can use the native JSON format
# and use the underlying objects in kibana log filters
appender.rolling.layout.eventTemplateAdditionalField[0].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[0].key = message
appender.rolling.layout.eventTemplateAdditionalField[0].value = {"$resolver": "message", "fallbackKey": "message"}
appender.rolling.layout.eventTemplateAdditionalField[0].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[1].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[1].key = pid
appender.rolling.layout.eventTemplateAdditionalField[1].value = {"$resolver": "pattern", "pattern": "%pid"}
appender.rolling.layout.eventTemplateAdditionalField[1].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[2].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[2].key = tid
appender.rolling.layout.eventTemplateAdditionalField[2].value = {"$resolver": "pattern", "pattern": "%tid"}
appender.rolling.layout.eventTemplateAdditionalField[2].format = JSON
appender.rolling.policies.type = Policies
# RollingFileAppender rotation policy
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 10MB
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.time.modulate = true
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.delete.type = Delete
appender.rolling.strategy.delete.basePath = ${basePath}
appender.rolling.strategy.delete.maxDepth = 10
appender.rolling.strategy.delete.ifLastModified.type = IfLastModified
# Delete all files older than 30 days
appender.rolling.strategy.delete.ifLastModified.age = 30d
rootLogger.level = INFO
rootLogger.appenderRef.rolling.ref = fileLogger
logger.spark.name = org.apache.spark
logger.spark.level = WARN
logger.spark.additivity = false
logger.spark.appenderRef.stdout.ref = fileLogger
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
logger.spark.repl.Main.level = WARN
logger.spark.repl.SparkIMain$exprTyper.level = INFO
logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO
# Settings to quiet third party logs that are too verbose
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = WARN
logger.jetty.util.component.AbstractLifeCycle.level = ERROR
logger.parquet.name = org.apache.parquet
logger.parquet.level = ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.hadoop.name = org.apache.hadoop
logger.hadoop.level = WARN
logger.hadoop.hive.metastore.RetryingHMSHandler.level = FATAL
logger.hadoop.hive.ql.exec.FunctionRegistry.level = ERROR
logger.spark.sql.level = WARN
When we start up the pyspark program it finds the log4j2.properties file, and we can see that all non-root-level logs are captured in JSON for all dependencies.
However, for some reason the settings of log4j.properties apply to the Spark driver logs, and all of these are reported to the console instead. If we change the level or format in the log4j.properties file, those settings are applied to the driver log output.
Is there any reason why Spark would use the Hadoop log4j.properties file instead of the log4j2.properties file? Are we missing some setting here?
We also tried to provide the log4j2.properties file via the driver's extra Java options in spark-defaults:
spark.driver.extraJavaOptions -Djava.net.preferIPv4Stack=true -Djava.security.auth.login.config=conf/jaas_driver.conf -Djava.security.krb5.conf=conf/krb5_driver.conf -Dsun.security.krb5.debug=false -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties
where spark_conf_dir is the folder referred to by SPARK_CONF_DIR.
But this didn't work either. For some reason the system always applies the log4j.properties settings to the driver program, overruling the settings in the log4j2.properties file.
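One way to see which configuration file actually gets loaded is to enable the internal status output of both log4j generations by appending these flags to spark.driver.extraJavaOptions (log4j 1.x reports the URL of the configuration it picks up; log4j2 prints its own status log):
-Dlog4j.debug=true -Dlog4j2.debug=true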
This is on a virtual machine. If we remove the log4j.properties file from the HADOOP_CONF_DIR folder, nothing gets reported for the driver program at all (perhaps the default error output, but currently nothing shows up).
If we instead build a Docker image with the same program, based on a base Python image with pyspark, we don't have this issue: the log output of the driver program and the dependent Spark packages is delivered to the log file in JSON format.
I would expect that providing -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties in spark.driver.extraJavaOptions would solve the issue, or that SPARK_CONF_DIR would take precedence over HADOOP_CONF_DIR for the log4j configuration.
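A quick way to check which copy of log4j.properties is first on the driver's classpath, from within pyspark (a sketch via the py4j gateway, where sc is the running SparkContext):
# Ask the driver JVM which log4j.properties resource wins on its classpath
url = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().getResource("log4j.properties")
print(url.toString() if url else "not on classpath")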

Related

Suppress INFO-log in Spark by changing log4j.properties file

I read that to suppress the plethora of INFO-log messages in Spark I need to change the line
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
in my log4j.properties, which in my case I found in
/usr/local/Cellar/apache-spark/3.3.0/libexec/conf/log4j.properties
Yet the structure of my file seems to be different from what I found from other users:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=error
# Set everything to be logged to the console
rootLogger.level = error
rootLogger.appenderRef.stdout.ref = console
# In the pattern layout configuration below, we specify an explicit `%ex` conversion
# pattern for logging Throwables. If this was omitted, then (by default) Log4J would
# implicitly add an `%xEx` conversion pattern which logs stacktraces with additional
# class packaging information. That extra information can sometimes add a substantial
# performance overhead, so we disable it in our default logging config.
# For more information, see SPARK-39361.
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = warn
logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
logger.thriftserver.level = warn
# Settings to quiet third party logs that are too verbose
logger.jetty1.name = org.sparkproject.jetty
logger.jetty1.level = warn
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = error
logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.replexprTyper.level = info
logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.replSparkILoopInterpreter.level = info
logger.parquet1.name = org.apache.parquet
logger.parquet1.level = error
logger.parquet2.name = parquet
logger.parquet2.level = error
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.RetryingHMSHandler.level = fatal
logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.FunctionRegistry.level = error
# For deploying Spark ThriftServer
# SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
appender.console.filter.1.type = RegexFilter
appender.console.filter.1.regex = .*Thrift error occurred during processing of message.*
appender.console.filter.1.onMatch = deny
appender.console.filter.1.onMismatch = neutral
So I simply added said line to the file. This still doesn't change anything.
When I run spark-submit somepythonscript.py in my Mac terminal, right at the beginning of the output I read:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Unfortunately I have no idea where that "org" folder even is. This is confusing because, as stated above, my properties file is somewhere else entirely.
I further learnt here that I should add
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///usr/local/Cellar/apache-spark/3.3.0/libexec/conf/log4j.properties
to my spark-submit call. In full, that would be in my case:
spark-submit somepythonscript.py --conf
spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///usr/local/Cellar/apache-spark/3.3.0/libexec/conf/log4j.properties
Yet that does not do anything. The script works, but the INFO logging is still there. What am I doing wrong?
This answer was helpful for me while running Spark in local mode, with the only difference being that the file is log4j2.properties:
https://stackoverflow.com/a/52648367/11951587
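As an aside, spark-submit only treats flags placed before the application file as its own options; anything after somepythonscript.py is passed to the script as an argument. So the --conf needs to come first, e.g.:
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///usr/local/Cellar/apache-spark/3.3.0/libexec/conf/log4j.properties" \
  somepythonscript.py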

SonarQube( SAST SCAN) log injection hotspot issue

I have written code to add logs using the logging module in Python. When I run the code through SonarQube, it shows the following hotspot:
Make sure that this logger's configuration is safe.
Python code:
from logging.config import fileConfig
import logging

from alembic import context  # supplies the Alembic Config object used below

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
fileConfig(config.config_file_name)

logger = logging.getLogger("alembic.env")

class DefaultConfig:
    DEVELOPMENT = False
    DEBUG = False
    TESTING = False
    LOGGING_LEVEL = "DEBUG"
    CSRF_ENABLED = True
Please help me resolve this hotspot. One more question: is it mandatory to look into low-priority hotspots?
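For context, this SonarQube hotspot generally concerns user-controlled values flowing into log calls (log injection). A common mitigation, sketched here under that assumption, is to strip line breaks from any untrusted value before logging it:
import logging

logger = logging.getLogger("alembic.env")

def sanitize(value):
    # Remove CR/LF so user-supplied input cannot forge extra log lines
    return str(value).replace("\r", "").replace("\n", "")

user_input = "some value\ninjected line"
logger.info("received: %s", sanitize(user_input))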

Tomcat is generating logs in multiple places, one in the default path "/logs" and another in the custom directory that is specified externally

We are planning to rotate the logs generated by Tomcat using logrotate for volume maintenance. When I checked for the logs, I found two places where they were being generated: "../apache-tomcat-7.0.57/logs" and the path specified in "logging.properties". From the Tomcat documentation I understood that Tomcat uses the default path "/logs" if no path is specified externally in "logging.properties". I was not able to find whether I have missed any configuration.
logging.properties file:
handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
.handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler
############################################################
# Handler specific properties.
# Describes specific configuration info for Handlers.
############################################################
1catalina.org.apache.juli.FileHandler.level = FINE
1catalina.org.apache.juli.FileHandler.directory = <custom path>
1catalina.org.apache.juli.FileHandler.prefix = catalina.
2localhost.org.apache.juli.FileHandler.level = FINE
2localhost.org.apache.juli.FileHandler.directory = <custom path>
2localhost.org.apache.juli.FileHandler.prefix = localhost.
3manager.org.apache.juli.FileHandler.level = FINE
3manager.org.apache.juli.FileHandler.directory = <custom path>
3manager.org.apache.juli.FileHandler.prefix = manager.
4host-manager.org.apache.juli.FileHandler.level = FINE
4host-manager.org.apache.juli.FileHandler.directory = <custom path>
4host-manager.org.apache.juli.FileHandler.prefix = host-manager.
java.util.logging.ConsoleHandler.level = FINE
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
############################################################
# Facility specific properties.
# Provides extra control for each logger.
############################################################
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].level = INFO
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].handlers = 2localhost.org.apache.juli.FileHandler
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].level = INFO
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/manager].handlers = 3manager.org.apache.juli.FileHandler
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].level = INFO
org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/host-manager].handlers = 4host-manager.org.apache.juli.FileHandler
# For example, set the org.apache.catalina.util.LifecycleBase logger to log
# each component that extends LifecycleBase changing state:
#org.apache.catalina.util.LifecycleBase.level = FINE
# To see debug messages in TldLocationsCache, uncomment the following line:
#org.apache.jasper.compiler.TldLocationsCache.level = FINE
My question is: why are the logs generated in multiple places, and how can I make Tomcat log to just one directory for easier maintenance?
Reference link
https://tomcat.apache.org/tomcat-7.0-doc/logging.html
By default it will log to ${catalina.base}/logs, which is what you should see in ${catalina.base}/conf/logging.properties.
Additionally, standard output (e.g. what exception.printStackTrace() writes) goes by default into ${catalina.base}/logs/catalina.out.
${catalina.base}/logs/catalina.out can be redirected to a different file by setting the environment variable CATALINA_OUT or CATALINA_OUT_CMD. To see what CATALINA_OUT_CMD does, the easiest way is to read the comments in ${catalina.home}/bin/catalina.sh.
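For example, to point it somewhere else before starting Tomcat (a sketch; the path is illustrative):
# e.g. in ${catalina.base}/bin/setenv.sh
export CATALINA_OUT=/data/tomcat/logs/catalina.out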

How to use two databases: Postgres and Snowflake with alembic and sqlachemy?

I want to use two databases, Postgres and Snowflake, with the alembic migration tool in a single FastAPI app.
I am able to perform alembic migrations and alembic upgrade when using a single database, i.e. Postgres, but with multiple databases alembic runs into problems.
I tried using the Snowflake database independently in a different app, with only Snowflake as the database. There it works perfectly fine with revisions and alembic upgrade, but not in a single app with two databases.
Here is my directory structure, as suggested in a few articles:
main-project/
    postgres-migration/
        versions/
        env.py
        README
        script.py.mako
    snowflake-migration/
        versions/
        env.py
        README
        script.py.mako
Here is my alembic.ini file, generated for Postgres but manipulated by me to support Snowflake:
# A generic, single database configuration.
[alembic]
# path to migration scripts
script_location = db-migration
[A_SNOWFLAKE_SCHEMA]
# path to env.py and migration scripts for schema1
script_location = snowflake-db-migration
# template used to generate migration files
# file_template = %%(rev)s_%%(slug)s
# sys.path path, will be prepended to sys.path if present.
# defaults to the current working directory.
prepend_sys_path = .
# timezone to use when rendering the date
# within the migration file as well as the filename.
# string value is passed to dateutil.tz.gettz()
# leave blank for localtime
# timezone =
# max length of characters to apply to the
# "slug" field
# truncate_slug_length = 40
# set to 'true' to run the environment during
# the 'revision' command, regardless of autogenerate
# revision_environment = false
# set to 'true' to allow .pyc and .pyo files without
# a source .py file to be detected as revisions in the
# versions/ directory
# sourceless = false
# version location specification; this defaults
# to db-migration/versions. When using multiple version
# directories, initial revisions must be specified with --version-path
# version_locations = %(here)s/bar %(here)s/bat db-migration/versions
# the output encoding used when revision files
# are written from script.py.mako
# output_encoding = utf-8
# sqlalchemy.url = driver://user:pass@localhost/dbname
[post_write_hooks]
# post_write_hooks defines scripts or Python functions that are run
# on newly generated revision scripts. See the documentation for further
# detail and examples
# format using "black" - use the console_scripts runner, against the "black" entrypoint
# hooks = black
# black.type = console_scripts
# black.entrypoint = black
# black.options = -l 79 REVISION_SCRIPT_FILENAME
# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARN
handlers = console
qualname =
[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers =
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
The above file is able to create revision scripts inside the versions/ folder of snowflake-db-migration, but it cannot run the generated revision (it throws errors) with:
alembic upgrade head
How do I convert this .ini file into a multi-database .ini file?
I tried using the command:
alembic init --template multidb snowflake-db-migration
but I am not sure how to integrate the changes here.
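For reference, the multidb template drives everything from a databases key in alembic.ini plus one section per engine, roughly like this (section names and URLs are placeholders):
[alembic]
script_location = db-migration
databases = postgres, snowflake

[postgres]
sqlalchemy.url = postgresql://user:pass@localhost/dbname

[snowflake]
sqlalchemy.url = snowflake://user:pass@account/dbname
The env.py generated by the template then loops over the names listed in databases and runs the migrations once per engine.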
Also, I don't want to hardcode the sqlalchemy.url variable in the .ini file, since I need to use the Snowflake settings from a .toml file which contains all environment variables.
There are a few answers, but none of them are accepted or exactly match my use case.
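One standard way to avoid hardcoding the URL is to leave sqlalchemy.url unset in the .ini file and inject it from env.py instead (a sketch; the environment variable name is an assumption, e.g. populated from your .toml):
import os
from alembic import context

config = context.config
# Pull the connection URL from the environment rather than alembic.ini
config.set_main_option("sqlalchemy.url", os.environ["SNOWFLAKE_DATABASE_URL"])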

Which logger should I use to get my data in Cloud Logging

I am running a PySpark job using Cloud Dataproc, and want to log info using the logging module of Python. The goal is to then push these logs to Cloud Logging.
From this question, I learned that I can achieve this by adding a logfile to the fluentd configuration, which is located at /etc/google-fluentd/google-fluentd.conf.
However, when I look at the log files in /var/log, I cannot find the files that contain my logs. I've tried using the default Python logger and the 'py4j' logger.
logger = logging.getLogger()
logger = logging.getLogger('py4j')
Can anyone shed some light on which logger I should use, and which file should be added to the fluentd configuration?
Thanks
tl;dr
This is not currently supported natively, but it will be in a future version of Cloud Dataproc. That said, there is a manual workaround in the interim.
Workaround
First, make sure you are sending the Python logs to the correct log4j logger from the Spark context. To do this, declare your logger as:
import pyspark
sc = pyspark.SparkContext()
logger = sc._jvm.org.apache.log4j.Logger.getLogger(__name__)
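Messages written through this logger then follow Spark's log4j configuration, for example:
logger.info("pyspark job started")   # shows up alongside Spark's own INFO logs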
The second part involves a workaround that isn't natively supported yet. If you look at the Spark properties file under
/etc/spark/conf/log4j.properties
on the master of your cluster, you can see how log4j is configured for Spark. Currently it looks like the following:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
...
Note that this means log4j logs are sent only to the console. The Dataproc agent will pick up this output and return it as the job driver output. However, in order for fluentd to pick up the output and send it to Google Cloud Logging, log4j needs to write to a local file. Therefore you will need to modify the log4j properties as follows:
# Set everything to be logged to the console and a file
log4j.rootCategory=INFO, console, file
# Set up console appender.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Set up file appender.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-log4j.log
log4j.appender.file.MaxFileSize=512KB
log4j.appender.file.MaxBackupIndex=3
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.conversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
...
If you set the file to /var/log/spark/spark-log4j.log as shown above, the default fluentd configuration on your Dataproc cluster should pick it up. If you want to set the file to something else, you can follow the instructions in this question to get fluentd to pick up that file.
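For reference, a minimal tail source for such a file in the fluentd configuration looks roughly like this (the tag and pos_file path are illustrative):
<source>
  @type tail
  format none
  path /var/log/spark/spark-log4j.log
  pos_file /var/tmp/fluentd.spark-log4j.pos
  tag spark-log4j
</source>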
