How do I specify output log file during spark submit - apache-spark

I know how to provide the logger properties file to spark. So my logger properties file looks something like:
log4j.rootCategory=INFO,FILE
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/outfile.log
log4j.appender.FILE.MaxFileSize=1000MB
log4j.appender.FILE.MaxBackupIndex=2
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.appender.rolling.strategy.type = DefaultRolloverStrategy
And then I provide the logger properties file path to spark-submit via:
-Dlog4j.configuration=file:logger_file_path
However, I want to provide the log4j.appender.FILE.File value at spark-submit time. Is there a way I can do that?
As justification for this approach: I run spark-submit against multiple YARN queues. Since the Spark code base is the same, I just want a different log file for spark-submit on each queue.

In the log4j properties file, you can use expressions like this:
log4j.appender.FILE.File=${LOGGER_OUTPUT_FILE}
When parsed, the value for log4j.appender.FILE.File will be picked up from the system property LOGGER_OUTPUT_FILE.
As per this SO post, you can set the value for the system property by adding -DLOGGER_OUTPUT_FILE=/tmp/outfile.log when invoking the JVM.
So using spark-submit you may try this (I haven't tested it):
spark-submit --master yarn \
--files /path/to/my-custom-log4j.properties \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=my-custom-log4j.properties -DLOGGER_OUTPUT_FILE=/tmp/outfile.log" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-custom-log4j.properties -DLOGGER_OUTPUT_FILE=/tmp/outfile.log"

Related

spark-submit overriding default application.conf not working

I am building a jar which has application.conf under the src/main/resources folder. I am trying to override it when doing spark-submit, but it's not working.
Following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.
-Dconfig.file=path/to/config-file may not work due to the internal cache in ConfigFactory. The documentation suggests calling ConfigFactory.invalidateCaches().
The other way is the following, which merges the supplied properties with the existing ones:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

ConfigFactory.invalidateCaches()
val c = ConfigFactory.parseFile(new File(pathToFile + "/" + "application.conf")) // pathToFile: directory containing the file
val config: Config = c.withFallback(ConfigFactory.load()).resolve()
I think the best way to override the properties would be to supply them using -D. Typesafe Config gives the highest priority to system properties, so -D will override reference.conf and application.conf.
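For example, here is an untested sketch of overriding a single key at spark-submit time, reusing the variables from the command above; myapp.some.key stands in for a hypothetical key defined in application.conf:
# Override one Typesafe Config key on the driver JVM (hypothetical key and value)
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--conf "spark.driver.extraJavaOptions=-Dmyapp.some.key=overridden-value" \
$sandbox_jar flagFile/test.FLG
If the executors also read the configuration, pass the same -D via spark.executor.extraJavaOptions as well.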
There is another option that can serve the same purpose, assuming application.conf is a plain properties file.
Packaging the properties file inside the jar limits flexibility; keeping it separate from the jar means that whenever a property changes you just replace the properties file instead of rebuilding and redeploying the whole jar.
To do this, keep your properties in a properties file and prefix each property key with "spark.":
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
Get the properties in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This won't necessarily solve your problem, but it puts another approach on the table.

Turn off pyspark logging through a Python script

How can I turn off pyspark logging from a Python script?
Please note: I do not want to make any changes to the Spark logger properties file.
To remove (or modify) logging from a Python script:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.logConf', 'true') # necessary in order to be able to change log level
... # other stuff and configuration
# create the session
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName(app_name) \
    .getOrCreate()
# set the log level to one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("OFF")
See the Spark docs on configuration and on setLogLevel.
Hope this helps, good luck!
Edit: For earlier versions, e.g. 1.6, you can try something like the following, taken from here
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.OFF)
# or
logger.LogManager.getRootLogger().setLevel(logger.Level.OFF)
I haven't tested it unfortunately, please, let me know if it works.

Using log4j2 in Spark java application

I'm trying to use the log4j2 logger in my Spark job. Essential requirement: the log4j2 config is located outside the classpath, so I need to specify its location explicitly. When I run my code directly within the IDE without using spark-submit, log4j2 works well. However, when I submit the same code to the Spark cluster using spark-submit, it fails to find the log4j2 configuration and falls back to the default old log4j.
Launcher command
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
myapp-SNAPSHOT.jar
Log4j2 dependencies in Maven:
<dependencies>
  . . .
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>${log4j2.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>${log4j2.version}</version>
  </dependency>
  <!-- Bridge log4j to log4j2 -->
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>${log4j2.version}</version>
  </dependency>
  <!-- Bridge slf4j to log4j2 -->
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>${log4j2.version}</version>
  </dependency>
</dependencies>
Any ideas what I could miss?
Apparently, at the moment there is no official support for log4j2 in Spark. Here is a detailed discussion on the subject: https://issues.apache.org/jira/browse/SPARK-6305
On the practical side, that means:
If you have access to the Spark configs and jars and can modify them, you can still use log4j2 after manually adding the log4j2 jars to SPARK_CLASSPATH and providing a log4j2 configuration file to Spark.
If you run on a managed Spark cluster and have no access to the Spark jars/configs, you can still use log4j2, but its use will be limited to code executed on the driver side. Any code run by the executors will use the Spark executors' logger (which is the old log4j).
Spark falls back to log4j because it probably cannot initialize the logging system during startup (your application code is not yet added to the classpath).
If you are permitted to place new files on your cluster nodes, create a directory on all of them (for example /opt/spark_extras), place all the log4j2 jars there, and add two configuration options to spark-submit:
--conf spark.executor.extraClassPath=/opt/spark_extras/*
--conf spark.driver.extraClassPath=/opt/spark_extras/*
Then the libraries will be added to the classpath.
If you have no access to modify files on the cluster, you can try another approach. Add all the log4j2 jars to the spark-submit parameters using --jars. According to the documentation, all these libraries will be added to the driver's and executors' classpaths, so it should work in the same way.
Try using the --driver-java-options
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--driver-java-options "-Dlog4j.configuration=log4j2.xml" \
--jars log4j-api-2.8.jar,log4j-core-2.8.jar,log4j-1.2-api-2.8.jar \
myapp-SNAPSHOT.jar
If log4j2 is being used in one of your own dependencies, it's quite easy to bypass all configuration files and use programmatic configuration for one or two high-level loggers, if and only if no configuration file is found.
The code below does the trick. Just name the logger after your top-level logger.
// Imports assumed for this snippet (log4j2 core's ConfigurationBuilder API):
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.ConsoleAppender;
import org.apache.logging.log4j.core.config.builder.api.AppenderComponentBuilder;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilder;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFactory;
import org.apache.logging.log4j.core.config.builder.api.LayoutComponentBuilder;
import org.apache.logging.log4j.core.config.builder.impl.BuiltConfiguration;

private static boolean configured = false;

private static void buildLog()
{
    try
    {
        final LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
        System.out.println("Configuration found at " + ctx.getConfiguration().toString());
        if (ctx.getConfiguration().toString().contains(".config.DefaultConfiguration"))
        {
            System.out.println("\n\n\nNo log4j2 config available. Configuring programmatically\n\n");
            ConfigurationBuilder<BuiltConfiguration> builder = ConfigurationBuilderFactory
                    .newConfigurationBuilder();
            builder.setStatusLevel(Level.ERROR);
            builder.setConfigurationName("IkodaLogBuilder");

            // Console appender
            AppenderComponentBuilder appenderBuilder = builder.newAppender("Stdout", "CONSOLE")
                    .addAttribute("target", ConsoleAppender.Target.SYSTEM_OUT);
            appenderBuilder.add(builder.newLayout("PatternLayout").addAttribute("pattern",
                    "%d [%t] %msg%n%throwable"));
            builder.add(appenderBuilder);

            // File appender
            LayoutComponentBuilder layoutBuilder = builder.newLayout("PatternLayout").addAttribute("pattern",
                    "%d [%t] %-5level: %msg%n");
            appenderBuilder = builder.newAppender("file", "File").addAttribute("fileName", "./logs/ikoda.log")
                    .add(layoutBuilder);
            builder.add(appenderBuilder);

            // Named top-level logger and root logger, both writing to file and stdout
            builder.add(builder.newLogger("ikoda", Level.DEBUG)
                    .add(builder.newAppenderRef("file"))
                    .add(builder.newAppenderRef("Stdout"))
                    .addAttribute("additivity", false));
            builder.add(builder.newRootLogger(Level.DEBUG)
                    .add(builder.newAppenderRef("file"))
                    .add(builder.newAppenderRef("Stdout")));

            ((org.apache.logging.log4j.core.LoggerContext) LogManager.getContext(false)).start(builder.build());
            ctx.updateLoggers();
        }
        else
        {
            System.out.println("Configuration file found.");
        }
        configured = true;
    }
    catch (Exception e)
    {
        System.out.println("\n\n\n\nFAILED TO CONFIGURE LOG4J2" + e.getMessage());
        configured = true;
    }
}

Which logger should I use to get my data in Cloud Logging

I am running a PySpark job using Cloud Dataproc, and want to log info using the logging module of Python. The goal is to then push these logs to Cloud Logging.
From this question, I learned that I can achieve this by adding a logfile to the fluentd configuration, which is located at /etc/google-fluentd/google-fluentd.conf.
However, when I look at the log files in /var/log, I cannot find the files that contain my logs. I've tried using the default python logger and the 'py4j' logger.
logger = logging.getLogger()
logger = logging.getLogger('py4j')
Can anyone shed some light as to which logger I should use, and which file should be added to the fluentd configuration?
Thanks
tl;dr
This is not natively supported now but will be natively supported in a future version of Cloud Dataproc. That said, there is a manual workaround in the interim.
Workaround
First, make sure you are sending the python logs to the correct log4j logger from the spark context. To do this declare your logger as:
import pyspark
sc = pyspark.SparkContext()
logger = sc._jvm.org.apache.log4j.Logger.getLogger(__name__)
The second part involves a workaround that isn't natively supported yet. If you look at the spark properties file under
/etc/spark/conf/log4j.properties
on the master of your cluster, you can see how log4j is configured for spark. Currently it looks like the following:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
...
Note that this means log4j logs are sent only to the console. The Dataproc agent will pick up this output and return it as the job driver output. However, in order for fluentd to pick up the output and send it to Google Cloud Logging, you will need log4j to write to a local file. Therefore you will need to modify the log4j properties as follows:
# Set everything to be logged to the console and a file
log4j.rootCategory=INFO, console, file
# Set up console appender.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Set up file appender.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-log4j.log
log4j.appender.file.MaxFileSize=512KB
log4j.appender.file.MaxBackupIndex=3
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
...
If you set the file to /var/log/spark/spark-log4j.log as shown above, the default fluentd configuration on your Dataproc cluster should pick it up. If you want to set the file to something else you can follow the instructions in this question to get fluentd to pick up that file.

How to get rid of "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties" message?

I am trying to suppress the message
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
when I run my Spark app. I've redirected the INFO messages successfully, but this message keeps showing up. Any ideas would be greatly appreciated.
Even simpler: cd into SPARK_HOME/conf, then mv log4j.properties.template log4j.properties, then open log4j.properties and change all INFO to ERROR. Here SPARK_HOME is the root directory of your Spark installation.
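A shell sketch of those steps (untested; assumes SPARK_HOME is set in your environment and GNU sed; on macOS use sed -i ''):
cd "$SPARK_HOME"/conf
mv log4j.properties.template log4j.properties
sed -i 's/INFO/ERROR/g' log4j.properties   # change all INFO to ERROR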
Some may be using HDFS as their Spark storage backend and will find the logging messages are actually generated by HDFS. To alter this, go to the HADOOP_HOME/etc/hadoop/log4j.properties file and simply change hadoop.root.logger=INFO,console to hadoop.root.logger=ERROR,console. Once again, HADOOP_HOME is the root of your Hadoop installation; for me this was /usr/local/hadoop.
Okay, so I've figured out a way to do this. Basically, I initially had my own log4j.xml that was being used, and hence we were seeing this message. Once I had my own "log4j.properties" file, the message went away.
If you put a log4j.properties file under both the main/resources and the test/resources this also occurs. In this case, deleting the file from the test/resources and using only the file from the main/resources fixes the issue.
None of the answers above worked for me when using SBT. Turns out you need to explicitly define an appender in your log4j.properties, such as:
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.rootLogger=WARN, stdout
log4j.logger.org.apache.spark=WARN, stdout
log4j.logger.com.yourcompany=INFO, stdout
Put this in your resources directory and Bob's your uncle!
