Using log4j2 in Spark java application - apache-spark

I'm trying to use the log4j2 logger in my Spark job. Essential requirement: the log4j2 config is located outside the classpath, so I need to specify its location explicitly. When I run my code directly within the IDE without using spark-submit, log4j2 works well. However, when I submit the same code to a Spark cluster using spark-submit, it fails to find the log4j2 configuration and falls back to the default (old) log4j.
Launcher command
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
myapp-SNAPSHOT.jar
Log4j2 dependencies in maven
<dependencies>
. . .
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>${log4j2.version}</version>
</dependency>
<!-- Bridge log4j to log4j2 -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-1.2-api</artifactId>
<version>${log4j2.version}</version>
</dependency>
<!-- Bridge slf4j to log4j2 -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>${log4j2.version}</version>
</dependency>
</dependencies>
Any ideas what I could be missing?

Apparently, at the moment there is no official support for log4j2 in Spark. Here is a detailed discussion on the subject: https://issues.apache.org/jira/browse/SPARK-6305
On the practical side, that means:
If you have access to the Spark configs and jars and can modify them, you can still use log4j2 after manually adding the log4j2 jars to SPARK_CLASSPATH and providing the log4j2 configuration file to Spark.
If you run on a managed Spark cluster and have no access to the Spark jars/configs, you can still use log4j2, but its use will be limited to code executed on the driver side (see the sketch below). Any code run by the executors will use the Spark executors' logger, which is the old log4j.
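As an illustration of driver-side use, here is a minimal Scala sketch (not from the original answer; the object name and log message are made up, and it assumes the log4j2 jars and configuration are visible to the driver JVM):
import org.apache.logging.log4j.LogManager

object DriverSideLogging {
  // Resolved through log4j2 only because log4j-api/log4j-core are on the driver classpath
  private val log = LogManager.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    log.info("This message is handled by log4j2 on the driver")
    // Anything logged inside executor closures would still go through
    // Spark's bundled (old) log4j in this scenario.
  }
}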

Spark falls back to log4j because it probably cannot initialize the logging system during startup (your application code is not yet on the classpath).
If you are permitted to place new files on your cluster nodes, create a directory on all of them (for example /opt/spark_extras), place all the log4j2 jars there, and add two configuration options to spark-submit:
--conf spark.executor.extraClassPath=/opt/spark_extras/*
--conf spark.driver.extraClassPath=/opt/spark_extras/*
The libraries will then be added to the classpath.
If you have no access to modify files on the cluster, you can try another approach: add all the log4j2 jars to the spark-submit parameters using --jars. According to the documentation, all these libraries will be added to the driver's and executors' classpath, so it should work the same way.

Try using --driver-java-options:
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--driver-java-options "-Dlog4j.configuration=log4j2.xml" \
--jars log4j-api-2.8.jar,log4j-core-2.8.jar,log4j-1.2-api-2.8.jar \
myapp-SNAPSHOT.jar

If log4j2 is being used in one of your own dependencies, it's quite easy to bypass all configuration files and use programmatic configuration for one or two high-level loggers, if and only if no configuration file is found.
The code below does the trick; just set the logger name to your top-level logger.
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.ConsoleAppender;
import org.apache.logging.log4j.core.config.builder.api.AppenderComponentBuilder;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilder;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFactory;
import org.apache.logging.log4j.core.config.builder.api.LayoutComponentBuilder;
import org.apache.logging.log4j.core.config.builder.impl.BuiltConfiguration;

private static boolean configured = false;

private static void buildLog()
{
    try
    {
        final LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
        System.out.println("Configuration found at " + ctx.getConfiguration().toString());

        // The default configuration class name indicates that no config file was found
        if (ctx.getConfiguration().toString().contains(".config.DefaultConfiguration"))
        {
            System.out.println("\n\n\nNo log4j2 config available. Configuring programmatically\n\n");

            ConfigurationBuilder<BuiltConfiguration> builder =
                    ConfigurationBuilderFactory.newConfigurationBuilder();
            builder.setStatusLevel(Level.ERROR);
            builder.setConfigurationName("IkodaLogBuilder");

            // Console appender
            AppenderComponentBuilder appenderBuilder = builder.newAppender("Stdout", "CONSOLE")
                    .addAttribute("target", ConsoleAppender.Target.SYSTEM_OUT);
            appenderBuilder.add(builder.newLayout("PatternLayout")
                    .addAttribute("pattern", "%d [%t] %msg%n%throwable"));
            builder.add(appenderBuilder);

            // File appender
            LayoutComponentBuilder layoutBuilder = builder.newLayout("PatternLayout")
                    .addAttribute("pattern", "%d [%t] %-5level: %msg%n");
            appenderBuilder = builder.newAppender("file", "File")
                    .addAttribute("fileName", "./logs/ikoda.log")
                    .add(layoutBuilder);
            builder.add(appenderBuilder);

            // Named top-level logger plus root logger, both writing to file and stdout
            builder.add(builder.newLogger("ikoda", Level.DEBUG)
                    .add(builder.newAppenderRef("file"))
                    .add(builder.newAppenderRef("Stdout"))
                    .addAttribute("additivity", false));
            builder.add(builder.newRootLogger(Level.DEBUG)
                    .add(builder.newAppenderRef("file"))
                    .add(builder.newAppenderRef("Stdout")));

            ((org.apache.logging.log4j.core.LoggerContext) LogManager.getContext(false))
                    .start(builder.build());
            ctx.updateLoggers();
        }
        else
        {
            System.out.println("Configuration file found.");
        }
        configured = true;
    }
    catch (Exception e)
    {
        System.out.println("\n\n\n\nFAILED TO CONFIGURE LOG4J2 " + e.getMessage());
        configured = true;
    }
}

Related

How to rebuild Apache Livy with Scala 2.12

I'm using Spark 3.1.1, which uses Scala 2.12, and the pre-built Livy downloaded from here uses Scala 2.11 (one can find the folder named repl_2.11-jars/ after unzipping).
According to the comment made by Aliaksandr Sasnouskikh, Livy needs to be rebuilt, or it will throw the error {'msg': 'requirement failed: Cannot find Livy REPL jars.'} even when creating a session via POST.
The README.md mentions:
By default Livy is built against Apache Spark 2.4.5
If I'd like to rebuild Livy, how can I change the Spark version it is built against?
Thanks in advance.
You can rebuild Livy by passing the spark-3.0 profile to Maven to create a custom build for Spark 3, for example:
git clone https://github.com/apache/incubator-livy.git && \
cd incubator-livy && \
mvn clean package -B -V -e \
-Pspark-3.0 \
-Pthriftserver \
-DskipTests \
-DskipITs \
-Dmaven.javadoc.skip=true
This profile is defined in pom.xml; by default it builds against Spark 3.0.0. You can change it to use a different Spark version.
<profile>
<id>spark-3.0</id>
<activation>
<property>
<name>spark-3.0</name>
</property>
</activation>
<properties>
<spark.scala-2.12.version>3.0.0</spark.scala-2.12.version>
<spark.scala-2.11.version>2.4.5</spark.scala-2.11.version>
<spark.version>${spark.scala-2.11.version}</spark.version>
<netty.spark-2.12.version>4.1.47.Final</netty.spark-2.12.version>
<netty.spark-2.11.version>4.1.47.Final</netty.spark-2.11.version>
<netty.version>${netty.spark-2.11.version}</netty.version>
<java.version>1.8</java.version>
<py4j.version>0.10.9</py4j.version>
<json4s.spark-2.11.version>3.5.3</json4s.spark-2.11.version>
<json4s.spark-2.12.version>3.6.6</json4s.spark-2.12.version>
<json4s.version>${json4s.spark-2.11.version}</json4s.version>
<spark.bin.download.url>
https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
</spark.bin.download.url>
<spark.bin.name>spark-3.0.0-bin-hadoop2.7</spark.bin.name>
</properties>
</profile>
As far as I know, Livy supports Spark 3.0.x, but it's worth testing with 3.1.1, so let us know :)
I tried to build Livy for Spark 3.1.1 based on rmakoto's answer and it worked! I tinkered a lot and can't remember exactly what I edited in the pom.xml, so I am just going to attach my gist link here.
I also had to edit the python-api/pom.xml file to build with Python 3, since there are syntax errors when building with the default pom.xml. Here's the pom.xml gist for python-api.
After that, just build with:
mvn clean package -B -V -e \
-Pspark-3.0 \
-Pthriftserver \
-DskipTests \
-DskipITs \
-Dmaven.javadoc.skip=true
Based on @gamberooni's changes (but using 3.1.2 instead of 3.1.1 for the Spark version and Hadoop 3.2.0 instead of 3.2.1), this is the diff:
diff --git a/pom.xml b/pom.xml
index d2e535a..5c28ee6 100644
--- a/pom.xml
+++ b/pom.xml
@@ -79,12 +79,12 @@
<properties>
<asynchttpclient.version>2.10.1</asynchttpclient.version>
- <hadoop.version>2.7.3</hadoop.version>
+ <hadoop.version>3.2.0</hadoop.version>
<hadoop.scope>compile</hadoop.scope>
<spark.scala-2.11.version>2.4.5</spark.scala-2.11.version>
- <spark.scala-2.12.version>2.4.5</spark.scala-2.12.version>
- <spark.version>${spark.scala-2.11.version}</spark.version>
- <hive.version>3.0.0</hive.version>
+ <spark.scala-2.12.version>3.1.2</spark.scala-2.12.version>
+ <spark.version>${spark.scala-2.12.version}</spark.version>
+ <hive.version>3.1.2</hive.version>
<commons-codec.version>1.9</commons-codec.version>
<httpclient.version>4.5.3</httpclient.version>
<httpcore.version>4.4.4</httpcore.version>
@@ -1060,7 +1060,7 @@
</property>
</activation>
<properties>
- <spark.scala-2.12.version>3.0.0</spark.scala-2.12.version>
+ <spark.scala-2.12.version>3.1.2</spark.scala-2.12.version>
<spark.scala-2.11.version>2.4.5</spark.scala-2.11.version>
<spark.version>${spark.scala-2.11.version}</spark.version>
<netty.spark-2.12.version>4.1.47.Final</netty.spark-2.12.version>
@@ -1072,9 +1072,9 @@
<json4s.spark-2.12.version>3.6.6</json4s.spark-2.12.version>
<json4s.version>${json4s.spark-2.11.version}</json4s.version>
<spark.bin.download.url>
- https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
+ https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
</spark.bin.download.url>
- <spark.bin.name>spark-3.0.0-bin-hadoop2.7</spark.bin.name>
+ <spark.bin.name>spark-3.1.2-bin-hadoop3.2</spark.bin.name>
</properties>
</profile>
diff --git a/python-api/pom.xml b/python-api/pom.xml
index 8e5cdab..a8fb042 100644
--- a/python-api/pom.xml
+++ b/python-api/pom.xml
@@ -46,7 +46,7 @@
<goal>exec</goal>
</goals>
<configuration>
- <executable>python</executable>
+ <executable>python3</executable>
<arguments>
<argument>setup.py</argument>
<argument>sdist</argument>
@@ -60,7 +60,7 @@
<goal>exec</goal>
</goals>
<configuration>
- <executable>python</executable>
+ <executable>python3</executable>
<skip>${skipTests}</skip>
<arguments>
<argument>setup.py</argument>
My Spark version is 3.2.1 and my Scala version is 2.12.15. I have successfully built Livy and put it into use. Pull the master branch of Livy, modify the pom file accordingly, and finally execute the package command:
mvn clean package -B -V -e -Pspark-3.0 -Pthriftserver -DskipTests -DskipITs -Dmaven.javadoc.skip=true

How do I specify output log file during spark submit

I know how to provide the logger properties file to Spark. My logger properties file looks something like this:
log4j.rootCategory=INFO,FILE
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/outfile.log
log4j.appender.FILE.MaxFileSize=1000MB
log4j.appender.FILE.MaxBackupIndex=2
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
log4j.appender.rolling.strategy.type = DefaultRolloverStrategy
And then I provide the logger properties file path to spark-submit via:
-Dlog4j.configuration=file:logger_file_path
However, I want to provide the log4j.appender.FILE.File value at spark-submit time. Is there a way I can do that?
As for the justification for this approach: I run spark-submit on multiple YARN queues. Since the Spark code base is the same, I just want a different log file for each queue.
In the log4j properties file, you can use expressions like this:
log4j.appender.FILE.File=${LOGGER_OUTPUT_FILE}
When parsed, the value for log4j.appender.FILE.File will be picked from the system property LOGGER_OUTPUT_FILE.
As per this SO post, you can set the value for the system property by adding -DLOGGER_OUTPUT_FILE=/tmp/outfile.log when invoking the JVM.
So using spark-submit you may try this (I haven't tested it):
spark-submit --master yarn \
--files /path/to/my-custom-log4j.properties \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=my-custom-log4j.properties -DLOGGER_OUTPUT_FILE=/tmp/outfile.log" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my-custom-log4j.properties -DLOGGER_OUTPUT_FILE=/tmp/outfile.log"

spark-submit overriding default application.conf not working

I am building a jar which has application.conf under the src/main/resources folder. I am trying to override it while doing spark-submit, but it's not working.
The following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.
-Dconfig.file=path/to/config-file may not work due to the internal cache in ConfigFactory. The documentation suggests running ConfigFactory.invalidateCaches().
Another way is the following, which merges the supplied properties with the existing ones:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

ConfigFactory.invalidateCaches()
// pathToConfigDir is the directory that contains your external application.conf
val c = ConfigFactory.parseFile(new File(pathToConfigDir + "/" + "application.conf"))
val config: Config = c.withFallback(ConfigFactory.load()).resolve()
I think the best way to override properties is to supply them using -D. Typesafe Config gives the highest priority to system properties, so -D will override both reference.conf and application.conf (see the sketch below).
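As a minimal sketch of that priority order (assuming Typesafe Config on the classpath; the key my.app.inputPath is purely illustrative and not from the question):
import com.typesafe.config.ConfigFactory

object ConfigPriorityDemo {
  def main(args: Array[String]): Unit = {
    // System properties beat application.conf, which beats reference.conf,
    // so passing -Dmy.app.inputPath=/data/override (e.g. via
    // spark.driver.extraJavaOptions) wins over the value bundled in the jar.
    val config = ConfigFactory.load()
    println(config.getString("my.app.inputPath"))
  }
}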
Assuming application.conf is a properties file, there is another option that can serve the same purpose.
Packaging the properties file inside the jar may not give you much flexibility. Keeping the properties file separate from the jar does: whenever a property changes, you just replace the properties file instead of rebuilding and redeploying the whole jar.
This can be achieved as follows: keep your properties in a properties file and prefix each property key with "spark.", e.g.
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
And you read the properties in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This won't necessarily solve your problem, but it's another approach you can try.

Can Spark explode a jar of jars

I call my Spark job like below:
spark-submit --jar test1.jar,test2.jar \
--class org.mytest.Students \
--num-executors ${executors} \
--master yarn \
--deploy-mode cluster \
--queue ${mapreduce.job.queuename} \
--driver-memory ${driverMemory} \
--conf spark.executor.memory=${sparkExecutorMemory} \
--conf spark.rdd.compress=true \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=100" \
${SPARK_JAR} "${INPUT}" "${OUTPUT_PATH}"
Is it possible to pass a single jar which contains test1.jar and test2.jar, like --jars mainTest.jar (where mainTest.jar contains test1.jar and test2.jar)?
My question is basically: can Spark explode a jar of jars? I am using version 1.3.
Question: Can Spark explode a jar of jars?
Yes...
As T. Gaweda suggested, this can be achieved with the Maven Assembly Plugin.
I thought I'd put some other options here as well.
Option 1: There is another Maven way (which I find better than the Maven Assembly Plugin), i.e. the Apache Maven Shade Plugin.
(It is particularly useful as it merges the content of specific files instead of overwriting them. This is needed when resource files with the same name exist across the jars and the plugin has to package all of them.)
This plugin provides the capability to package the artifact in an
uber-jar, including its dependencies and to shade - i.e. rename - the
packages of some of the dependencies
Goals Overview
The Shade Plugin has a single goal:
shade:shade is bound to the package phase and is used to create a
shaded jar.
Option 2: the sbt way, if you are using sbt.
Source: creating-uber-jar-for-spark-project-using-sbt-assembly:
sbt-assembly is an sbt plugin to create a fat JAR of an sbt project with all of its dependencies.
Add the sbt-assembly plugin in project/plugin.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1")
Specify sbt-assembly.git as a dependency in project/project/build.scala:
import sbt._
object Plugins extends Build {
lazy val root = Project("root", file(".")) dependsOn(
uri("git://github.com/sbt/sbt-assembly.git#0.9.1")
)
}
In the build.sbt file, add the following contents:
import AssemblyKeys._ // put this at the top of the file, leave the next line blank
assemblySettings
Use full keys to configure the assembly plugin. For more details, refer to the plugin documentation. The relevant keys include: target, assembly-jar-name, test, assembly-option, main-class, full-classpath, dependency-classpath, assembly-excluded-files, assembly-excluded-jars.
If multiple files share the same relative path, the default strategy is to verify that all candidates have the same contents and to error out otherwise. This behaviour can be configured for Spark projects using assembly-merge-strategy as follows:
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
case PathList("org", "apache", xs @ _*) => MergeStrategy.last
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case "about.html" => MergeStrategy.rename
case x => old(x)
}
}
From the root folder, run
sbt/sbt assembly
The assembly plugin then packs the class files and all the dependencies into a single JAR file. (A sketch with newer sbt syntax follows below.)
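The keys and syntax above come from an old sbt-assembly release. For sbt 1.x the same setup looks roughly like the following sketch; the plugin, Scala and Spark versions are assumptions, not taken from the original answer:
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt
ThisBuild / scalaVersion := "2.12.15"

// Spark is marked "provided" so it is not packed into the fat JAR
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2" % "provided"

assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
Running sbt assembly then produces a single fat JAR under target/scala-2.12/.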
You can simply merge those jars into one shaded Jar. Please read this question: How can I create an executable JAR with dependencies using Maven?
You will have all classes in exactly one Jar. There will be no problem with nested Jars.

"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).
When running the job locally on my Mac machine, I am getting the following error:
5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs
I know that two things need to be done for gs:// paths to be supported. One is to install the GCS connector, and the other is to add the following setup to the core-site.xml of the Hadoop installation:
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>
The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
</description>
</property>
I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project I am using Maven, so I imported the Spark library as follows:
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.2.0</version>
<exclusions>
<exclusion> <!-- declare the exclusion here -->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</exclusion>
</exclusions>
</dependency>
and Hadoop 1.2.1 as follows:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>1.2.1</version>
</dependency>
The thing is, I am not sure where the Hadoop location is configured for Spark, and also where the Hadoop conf is configured. Therefore, I may be adding to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I can tell, there is no Hadoop service running on my machine.
In Scala, add the following config when setting your hadoopConfiguration:
val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
There are a couple of ways to help Spark pick up the relevant Hadoop configurations, both involving modifying ${SPARK_INSTALL_DIR}/conf:
Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:
ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
Older Spark docs explain that this automatically includes the xml files on Spark's classpath: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html
Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:
export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
Newer Spark docs seem to indicate this as the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html
I can't say what's wrong, but here's what I would try.
Try setting fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
Print sc.hadoopConfiguration.get("fs.gs.impl") to make sure your core-site.xml is getting loaded. Print it in the driver and also in the executors: println(x); rdd.foreachPartition { _ => println(x) }
Make sure the GCS jar is sent to the executors (sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?) but you never know.
Nothing but your program needs to be restarted; there is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO, I think. (A combined sketch of these checks is below.)
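Putting those suggestions together, here is a hedged Scala sketch of how you might set the GCS filesystem classes programmatically and verify them in local mode (the jar path, project id and bucket are placeholders, not from the original answer):
import org.apache.spark.{SparkConf, SparkContext}

object GcsLocalCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("gcs-check")
      // ship the GCS connector to executors; redundant in local mode, but harmless
      .setJars(Seq("/path/to/gcs-connector-shaded.jar"))
    val sc = new SparkContext(conf)

    // register the gs:// filesystem explicitly instead of relying on core-site.xml
    sc.hadoopConfiguration.set("fs.gs.impl",
      "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    sc.hadoopConfiguration.set("fs.AbstractFileSystem.gs.impl",
      "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    sc.hadoopConfiguration.set("fs.gs.project.id", "my-little-project")

    // confirm the setting is actually visible on the driver
    println(sc.hadoopConfiguration.get("fs.gs.impl"))

    // a read will fail fast if the gs:// scheme is still not registered
    sc.textFile("gs://mybucket/folder").take(5).foreach(println)
    sc.stop()
  }
}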
You can apply these settings directly on the Spark reader/writer as follows:
spark
.read
.option("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.option("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
.option("google.cloud.auth.service.account.enable", "true")
.option("google.cloud.auth.service.account.json.keyfile", "<path-to-json-keyfile.json>")
.option("header", true)
.csv("gs://<bucket>/<path-to-csv-file>")
.show(10, false)
Then add the relevant jar dependency to your build.sbt (or whichever build tool you use); check https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector for the latest version:
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.6" classifier "shaded"
See GCS Connector and Google Cloud Storage connector for non-dataproc clusters
