Can Spark explode a jar of jars? - apache-spark

I have my spark job called like below:
spark-submit --jars test1.jar,test2.jar \
--class org.mytest.Students \
--num-executors ${executors} \
--master yarn \
--deploy-mode cluster \
--queue ${mapreduce.job.queuename} \
--driver-memory ${driverMemory} \
--conf spark.executor.memory=${sparkExecutorMemory} \
--conf spark.rdd.compress=true \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -
XX:MaxGCPauseMillis=100
${SPARK_JAR} "${INPUT}" "${OUTPUT_PATH}"
Is it possible to pass a single jar which contains test1.jar and test2.jar, like --jars mainTest.jar (where mainTest.jar contains test1.jar and test2.jar)?
My question is basically: can Spark explode a jar of jars? I am using version 1.3.

Question: Can Spark explode a jar of jars?
Yes...
As T. Gaweda suggested, we can achieve this with the Maven Assembly Plugin.
I thought of putting the other options here as well.
Option 1: There is another Maven way (which I feel is better than the Maven Assembly Plugin), i.e. the Apache Maven Shade Plugin.
(It is particularly useful as it merges the content of specific files instead of overwriting them. This is needed when resource files have the same name across the jars and the plugin has to package all of them.)
This plugin provides the capability to package the artifact in an
uber-jar, including its dependencies and to shade - i.e. rename - the
packages of some of the dependencies
Goals Overview
The Shade Plugin has a single goal:
shade:shade is bound to the package phase and is used to create a
shaded jar.
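For illustration only, here is a sketch of the Shade route using the names from the question (the shaded artifact name mainTest.jar and the shortened option list are assumptions, not a definitive setup). Once shade:shade has run during the package phase, you submit the single shaded jar as the application jar instead of listing test1.jar and test2.jar:
# build the shaded (uber) jar; shade:shade is bound to the package phase
mvn clean package
# submit the single shaded jar instead of passing --jars test1.jar,test2.jar
spark-submit \
--class org.mytest.Students \
--master yarn \
--deploy-mode cluster \
target/mainTest.jar "${INPUT}" "${OUTPUT_PATH}"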
Option 2:
The SBT way, if you are using sbt:
Source: creating-uber-jar-for-spark-project-using-sbt-assembly:
sbt-assembly is an sbt plugin to create a fat JAR of sbt project
with all of its dependencies.
Add sbt-assembly plugin in project/plugin.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1")
Specify sbt-assembly.git as a dependency in project/project/build.scala
import sbt._

object Plugins extends Build {
  lazy val root = Project("root", file(".")) dependsOn(
    uri("git://github.com/sbt/sbt-assembly.git#0.9.1")
  )
}
In the build.sbt file, add the following contents:
import AssemblyKeys._ // put this at the top of the file, leave the next line blank
assemblySettings
Use full keys to configure the assembly plugin; for more details, refer to the sbt-assembly documentation. The configurable keys include:
target, assembly-jar-name, test,
assembly-option, main-class,
full-classpath, dependency-classpath, assembly-excluded-files,
assembly-excluded-jars
If multiple files share the same relative path, the default strategy is to verify that all candidates have the same contents and to error out otherwise. This behaviour can be configured for Spark projects using assembly-merge-strategy as follows.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
    case PathList("org", "apache", xs @ _*) => MergeStrategy.last
    case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
    case "about.html" => MergeStrategy.rename
    case x => old(x)
  }
}
From the root folder, run:
sbt/sbt assembly
The assembly plugin then packs the class files and all the dependencies into a single JAR file.
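As a sketch (the assembled jar's path and name depend on your project name, version, and Scala version; the one below is only an example), you would then submit that single fat jar instead of passing multiple --jars:
sbt/sbt assembly
spark-submit \
--class org.mytest.Students \
--master yarn \
--deploy-mode cluster \
target/scala-2.10/root-assembly-0.1.jar "${INPUT}" "${OUTPUT_PATH}"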

You can simply merge those jars into one shaded Jar. Please read this question: How can I create an executable JAR with dependencies using Maven?
You will have all classes in exactly one Jar. There will be no problem with nested Jars.

Related

spark-submit overriding default application.conf not working

I am building a jar which has application.conf under the src/main/resources folder. I am trying to override it while doing spark-submit, but it's not working.
Following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.
-Dconfig.file=path/to/config-file may not work due to the internal cache in ConfigFactory. The documentation suggests running ConfigFactory.invalidateCaches().
The other way is the following, which merges the supplied properties with the existing properties:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

ConfigFactory.invalidateCaches()
val c = ConfigFactory.parseFile(new File(pathToFile + "/" + "application.conf")) // pathToFile: directory containing application.conf
val config: Config = c.withFallback(ConfigFactory.load()).resolve
I think the best way to override the properties would be to supply them using -D. Typesafe Config gives the highest priority to system properties, so -D will override reference.conf and application.conf.
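For example (a sketch only, reusing the names from the question; in client mode the driver may need an absolute local path for -Dconfig.file), ship the file with --files and point -Dconfig.file at it on both the driver and the executors:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
--conf "spark.driver.extraJavaOptions=-Dconfig.file=application.conf" \
--conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
$sandbox_jar flagFile/test.FLG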
Considering that application.conf is a properties file, there is another option which can serve the same purpose as using a properties file.
Packaging the properties file with the jar might not provide much flexibility. Keeping the properties file separate from the jar packaging gives flexibility: whenever any property changes, you just replace the properties file instead of building and deploying the whole jar.
This can be achieved by keeping your properties in a properties file and prefixing each property key with "spark.".
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
Get the properties in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This may not necessarily solve your problem, but it offers another approach that may work for you.

Using log4j2 in Spark java application

I'm trying to use the log4j2 logger in my Spark job. Essential requirement: the log4j2 config is located outside the classpath, so I need to specify its location explicitly. When I run my code directly within the IDE without using spark-submit, log4j2 works well. However, when I submit the same code to the Spark cluster using spark-submit, it fails to find the log4j2 configuration and falls back to the default old log4j.
Launcher command
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
myapp-SNAPSHOT.jar
Log4j2 dependencies in maven
<dependencies>
. . .
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>${log4j2.version}</version>
</dependency>
<!-- Bridge log4j to log4j2 -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-1.2-api</artifactId>
<version>${log4j2.version}</version>
</dependency>
<!-- Bridge slf4j to log4j2 -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>${log4j2.version}</version>
</dependency>
</dependencies>
Any ideas what I could miss?
Apparently, at the moment there is no official support for log4j2 in Spark. Here is a detailed discussion on the subject: https://issues.apache.org/jira/browse/SPARK-6305
On the practical side that means:
If you have access to the Spark configs and jars and can modify them, you can still use log4j2 after manually adding the log4j2 jars to SPARK_CLASSPATH and providing the log4j2 configuration file to Spark.
If you run on a managed Spark cluster and have no access to the Spark jars/configs, then you can still use log4j2, but its use will be limited to the code executed on the driver side. Any code run by the executors will use the Spark executors' logger (which is the old log4j).
Spark falls back to log4j because it probably cannot initialize the logging system during startup (your application code is not yet added to the classpath).
If you are permitted to place new files on your cluster nodes, then create a directory on all of them (for example /opt/spark_extras), place all the log4j2 jars there, and add two configuration options to spark-submit:
--conf spark.executor.extraClassPath=/opt/spark_extras/*
--conf spark.driver.extraClassPath=/opt/spark_extras/*
Then the libraries will be added to the classpath.
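For illustration, here is a sketch of that setup; the host names are placeholders and the jar versions are examples that may differ in your environment:
# copy the log4j2 jars to every node (host names are placeholders)
for host in node1 node2 node3; do
  ssh "$host" mkdir -p /opt/spark_extras
  scp log4j-api-2.8.jar log4j-core-2.8.jar log4j-1.2-api-2.8.jar "$host:/opt/spark_extras/"
done
# then include the extra classpath options in the submit command
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--files "log4j2.xml" \
--conf spark.driver.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
--conf spark.executor.extraJavaOptions="-Dlog4j.configurationFile=log4j2.xml" \
--conf spark.driver.extraClassPath='/opt/spark_extras/*' \
--conf spark.executor.extraClassPath='/opt/spark_extras/*' \
myapp-SNAPSHOT.jar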
If you have no access to modify files on the cluster, you can try another approach. Add all the log4j2 jars to the spark-submit parameters using --jars. According to the documentation, all these libraries will be added to the driver's and executors' classpath, so it should work in the same way.
Try using the --driver-java-options
${SPARK_HOME}/bin/spark-submit \
--class my.app.JobDriver \
--verbose \
--master 'local[*]' \
--files "log4j2.xml" \
--driver-java-options "-Dlog4j.configurationFile=log4j2.xml" \
--jars log4j-api-2.8.jar,log4j-core-2.8.jar,log4j-1.2-api-2.8.jar \
myapp-SNAPSHOT.jar
If log4j2 is being used in one of your own dependencies, it's quite easy to bypass all configuration files and use programmatic configuration for one or two high-level loggers, if and only if the configuration file is not found.
The code below does the trick. Just set the logger name to your top-level logger.
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.ConsoleAppender;
import org.apache.logging.log4j.core.config.builder.api.AppenderComponentBuilder;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilder;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFactory;
import org.apache.logging.log4j.core.config.builder.api.LayoutComponentBuilder;
import org.apache.logging.log4j.core.config.builder.impl.BuiltConfiguration;

private static boolean configured = false;

private static void buildLog()
{
    try
    {
        final LoggerContext ctx = (LoggerContext) LogManager.getContext(false);
        System.out.println("Configuration found at " + ctx.getConfiguration().toString());
        if (ctx.getConfiguration().toString().contains(".config.DefaultConfiguration"))
        {
            System.out.println("\n\n\nNo log4j2 config available. Configuring programmatically\n\n");
            ConfigurationBuilder<BuiltConfiguration> builder = ConfigurationBuilderFactory
                    .newConfigurationBuilder();
            builder.setStatusLevel(Level.ERROR);
            builder.setConfigurationName("IkodaLogBuilder");

            // console appender
            AppenderComponentBuilder appenderBuilder = builder.newAppender("Stdout", "CONSOLE")
                    .addAttribute("target", ConsoleAppender.Target.SYSTEM_OUT);
            appenderBuilder.add(builder.newLayout("PatternLayout").addAttribute("pattern",
                    "%d [%t] %msg%n%throwable"));
            builder.add(appenderBuilder);

            // file appender
            LayoutComponentBuilder layoutBuilder = builder.newLayout("PatternLayout").addAttribute("pattern",
                    "%d [%t] %-5level: %msg%n");
            appenderBuilder = builder.newAppender("file", "File").addAttribute("fileName", "./logs/ikoda.log")
                    .add(layoutBuilder);
            builder.add(appenderBuilder);

            // named logger and root logger, both writing to file and console
            builder.add(builder.newLogger("ikoda", Level.DEBUG)
                    .add(builder.newAppenderRef("file"))
                    .add(builder.newAppenderRef("Stdout"))
                    .addAttribute("additivity", false));
            builder.add(builder.newRootLogger(Level.DEBUG)
                    .add(builder.newAppenderRef("file"))
                    .add(builder.newAppenderRef("Stdout")));

            ((org.apache.logging.log4j.core.LoggerContext) LogManager.getContext(false)).start(builder.build());
            ctx.updateLoggers();
        }
        else
        {
            System.out.println("Configuration file found.");
        }
        configured = true;
    }
    catch (Exception e)
    {
        System.out.println("\n\n\n\nFAILED TO CONFIGURE LOG4J2" + e.getMessage());
        configured = true;
    }
}
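Call buildLog() once on the driver (for example at the start of main or from a static initializer) before the first logger under your top-level logger ("ikoda" in the code above) is requested, so that the programmatic configuration is already installed whenever no config file was found.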

Which directory contains third party libraries for Spark

When we use
spark-submit
which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar.
Note: I did try adding to
$SPARK_HOME/lib_managed/jars
But the spark-submit still results in a ClassNotFoundException for classes included in the added library.
Hope these points will help you.
$SPARK_HOME/lib/ [contains the jar files ]
$SPARK_HOME/bin/ [contains the launch scripts - Spark-Submit, Spark-Class, pySpark, compute-classpath.sh, etc.]
Spark-Submit ---will call ---> Spark-Class.
Spark-class internally calls compute-Classpath.sh before executing / launching the job.
compute-classpath.sh will pick up the jars available in $SPARK_HOME/lib and add them to the CLASSPATH.
(execute ./compute-classpath.sh //returns jars in lib dir)
So try these options.
option-1 - Placing user-specific jars in $SPARK_HOME/lib/ will work
option-2 - Tweak compute-classpath.sh so that it is able to pick up your jars from a user-specific jars directory
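If editing compute-classpath.sh is not an option, a possible alternative (a sketch only; the directory, hosts, class, and jar names below are placeholders, and it reuses the extraClassPath settings shown elsewhere on this page) is to scp the libraries to a directory of your own on every slave and point the classpath options at it:
# copy the library to the same path on every slave (hosts/paths are examples)
scp mylib.jar user@slave1:/opt/spark-user-jars/
scp mylib.jar user@slave2:/opt/spark-user-jars/
# point both driver and executors at that directory (class/jar names are placeholders)
spark-submit \
--class com.example.MyApp \
--conf spark.driver.extraClassPath='/opt/spark-user-jars/*' \
--conf spark.executor.extraClassPath='/opt/spark-user-jars/*' \
myapp.jar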

"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).
When running the job locally on my Mac machine, I am getting the following error:
5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs
I know that two things need to be done in order for gs paths to be supported. One is to install the GCS connector, and the other is to have the following setup in the core-site.xml of the Hadoop installation:
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>
The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
</description>
</property>
I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project, I am using Maven, and so I imported the Spark library as follows:
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.2.0</version>
<exclusions>
<exclusion> <!-- declare the exclusion here -->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</exclusion>
</exclusions>
</dependency>
and Hadoop 1.2.1 as follows:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>1.2.1</version>
</dependency>
The thing is, I am not sure where the hadoop location is configured for Spark, and also where the hadoop conf is configured. Therefore, I may be adding to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I saw, there is no Hadoop service running on my machine.
In Scala, add the following config when setting your hadoopConfiguration:
val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
There are a couple ways to help Spark pick up the relevant Hadoop configurations, both involving modifying ${SPARK_INSTALL_DIR}/conf:
Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:
ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
Older Spark docs explain that this makes the xml files included in Spark's classpath automatically: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html
Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:
export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
Newer Spark docs seem to indicate this as the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html
I can't say what's wrong, but here's what I would try.
Try setting fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
Print sc.hadoopConfiguration.get("fs.gs.impl") to make sure your core-site.xml is getting loaded. Print it in the driver and also in the executor: println(x); rdd.foreachPartition { _ => println(x) }
Make sure the GCS jar is sent to the executors (sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?) but you never know.
Nothing but your program needs to be restarted. There is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO I think.
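If the missing piece is simply the GCS connector classes on the classpath when submitting locally, one sketch (the connector jar path/version and the application jar name are placeholders) is to hand the connector jar to spark-submit explicitly and keep the fs.gs settings in the Hadoop configuration as shown above:
# connector jar and application jar names are placeholders; use the connector build matching your Hadoop version
spark-submit \
--master 'local[*]' \
--jars /path/to/gcs-connector-hadoop1-latest.jar \
--class com.doit.customer.dataconverter.Phase1 \
myjob.jar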
You can apply these settings directly on the spark reader/writer as follows:
spark
.read
.option("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
.option("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
.option("google.cloud.auth.service.account.enable", "true")
.option("google.cloud.auth.service.account.json.keyfile", "<path-to-json-keyfile.json>")
.option("header", true)
.csv("gs://<bucket>/<path-to-csv-file>")
.show(10, false)
Also add the relevant jar dependency to your build.sbt (or whichever build tool you use); check https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector for the latest version:
"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.6" classifier "shaded"
See GCS Connector and Google Cloud Storage connector for non-dataproc clusters

Groovy - How to Build a Jar

I've written a Groovy script which has a dependency on a SQL Server driver (sqljdbc4.jar). I can use the GroovyWrapper (link below) to compile it into a JAR, however how can I get dependencies into the Jar? I'm looking for a "best practice" sort of thing.
https://github.com/sdanzan/groovy-wrapper
Both of the replies below have been helpful, but how can I do this for signed Jar files? For instance:
Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
In the groovy wrapper script, you'll see this line near the bottom:
// add more jars here
That's where you can add your dependencies. If the jar file is in the same directory you're building from, add a line like this:
zipgroupfileset( dir: '.', includes: 'sqljdbc4.jar' )
Then rerun the script and your jar will include the classes from sqljdbc4.jar.
Edit:
If the jar file you depend on is signed and you need to maintain the signature, you'll have to keep the external jar. You can't include jar files inside other jar files without using a custom classloader. You can, however, specify the dependency in the manifest to avoid having to set the classpath, i.e. your jar is still executable with java -jar myjar.jar. Update the manifest section in the wrapping script to:
manifest {
attribute( name: 'Main-Class', value: mainClass )
attribute( name: 'Class-Path', value: 'sqljdbc4.jar' )
}
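With that Class-Path entry, the dependency only needs to sit next to the built jar at run time; for example (jar names as used above):
# sqljdbc4.jar stays external (and still signed) in the same directory as your jar
ls
# myjar.jar  sqljdbc4.jar
java -jar myjar.jar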
From your link, if you look at the source of the GroovyWrapper script, there are these lines:
zipgroupfileset( dir: GROOVY_HOME, includes: 'embeddable/groovy-all-*.jar' )
zipgroupfileset( dir: GROOVY_HOME, includes: 'lib/commons*.jar' )
// add more jars here
I'd explicitly add it there.
