Which directory contains third-party libraries for Spark?

When we use spark-submit, which directory contains the third-party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each slave instead of shipping the contents in the application uber-jar.
Note: I did try adding jars to $SPARK_HOME/lib_managed/jars, but spark-submit still results in a ClassNotFoundException for classes included in the added library.

I hope these points help.
$SPARK_HOME/lib/ contains the jar files.
$SPARK_HOME/bin/ contains the launch scripts: spark-submit, spark-class, pyspark, compute-classpath.sh, etc.
spark-submit calls spark-class, and spark-class internally runs compute-classpath.sh before launching the job. compute-classpath.sh adds the jars available in $SPARK_HOME/lib to the CLASSPATH (run ./compute-classpath.sh yourself to see the jars it returns from the lib dir).
So try these options:
Option 1: placing your user-specific jars in $SPARK_HOME/lib/ works.
Option 2: tweak compute-classpath.sh so that it also picks up jars from a user-specific jar directory, as sketched below.
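A minimal sketch of option 2, assuming a hypothetical user jar directory /opt/spark-user-jars (substitute your own path), placed before compute-classpath.sh prints its final CLASSPATH:
# Hypothetical tweak: append every jar from a user-specific directory
for jar in /opt/spark-user-jars/*.jar; do
  CLASSPATH="$CLASSPATH:$jar"
done
Note that spark-submit also accepts a --jars option (a comma-separated list) that ships the listed jars to the executors without touching the Spark installation at all.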

Related

spark-submit overriding default application.conf not working

I am building a jar which has application.conf under the src/main/resources folder. I am trying to override it while doing spark-submit, but it's not working. The following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.
-Dconfig.file=path/to/config-file may not work because of the internal cache in ConfigFactory. The documentation suggests running ConfigFactory.invalidateCaches().
The other way is the following, which merges the supplied properties with the properties already available:
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}
ConfigFactory.invalidateCaches()
val c = ConfigFactory.parseFile(new File(pathToFile + "/application.conf")) // pathToFile: dir holding the external conf
val config: Config = c.withFallback(ConfigFactory.load()).resolve()
I think the best way to override the properties is to supply them with -D. Typesafe Config gives the highest priority to system properties, so -D will override reference.conf and application.conf.
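For what it's worth, a hedged sketch of passing that -D flag through spark-submit (my-app.jar stands in for your actual jar; note that JVM flags belong in spark.driver.extraJavaOptions / spark.executor.extraJavaOptions, not in extraClassPath):
spark-submit \
  --class com.abc.xyz.MYClass \
  --files application.conf \
  --conf "spark.driver.extraJavaOptions=-Dconfig.file=./application.conf" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=./application.conf" \
  my-app.jar
--files puts the file into each executor's working directory, which is why the relative path ./application.conf works on the executor side.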
If application.conf is really just a properties file, there is another option that serves the same purpose. Packaging the properties file inside the jar arguably reduces flexibility; keeping it separate from the jar means that whenever a property changes you just replace the file instead of rebuilding and redeploying the whole jar.
To do this, keep your properties in a properties file and prefix each property key with "spark.":
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like this:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
Read the properties back in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This won't necessarily solve your problem, but it puts another approach on the table.

Configure log4j for an external jar

In my project (myProject) I use an external jar (external.jar). Both of them log with log4j.jar. With the help of a log4j.properties file (located in myProject) I can configure the logging from myProject. How can I configure the log levels of the logging from external.jar without changing that jar file?
Simply adding the package from external.jar (say, org.external) to the properties file as
log4j.logger.org.external=ERROR does not make any difference.
Here I have found the solution.
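For reference, a minimal log4j.properties sketch of that idea (the appender setup is an assumption for illustration; org.external stands in for the actual package inside external.jar):
log4j.rootLogger=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
# Raise the threshold for the external jar's package only
log4j.logger.org.external=ERROR
If a line like this has no effect, check whether external.jar bundles its own log4j.properties: log4j uses the first copy it finds on the classpath, so a copy inside the jar can shadow yours.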

How to find the HADOOP_HOME path on Linux?

I am trying to run the command below on a Hadoop server:
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
but I am not able to locate ${HADOOP_HOME}. I tried hadoop -classpath, but it gives the output below:
/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-0.20-mapreduce/./:/usr/lib/hadoop-0.20-mapreduce/lib/*:/usr/lib/hadoop-0.20-mapreduce/.//*
Does anyone have any idea about this?
Navigate to the path where Hadoop is installed and locate ${HADOOP_HOME}/etc/hadoop, e.g.
/usr/lib/hadoop-2.2.0/etc/hadoop
When you run ls on this folder you should see all of these files:
capacity-scheduler.xml httpfs-site.xml
configuration.xsl log4j.properties
container-executor.cfg mapred-env.cmd
core-site.xml mapred-env.sh
core-site.xml~ mapred-queues.xml.template
hadoop-env.cmd mapred-site.xml
hadoop-env.sh mapred-site.xml~
hadoop-env.sh~ mapred-site.xml.template
hadoop-metrics2.properties slaves
hadoop-metrics.properties ssl-client.xml.example
hadoop-policy.xml ssl-server.xml.example
hdfs-site.xml yarn-env.cmd
hdfs-site.xml~ yarn-env.sh
httpfs-env.sh yarn-site.xml
httpfs-log4j.properties yarn-site.xml~
httpfs-signature.secret
The core environment settings live in hadoop-env.sh.
You can see the classpath settings in this file; I have copied a sample here for your reference:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
# The jsvc implementation to use. Jsvc is required to run secure datanodes.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR}
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  export HADOOP_CLASSPATH=${HADOOP_CLASSPATH+$HADOOP_CLASSPATH:}$f
done
Hope this helps!
The hadoop-core jar file is in the ${HADOOP_HOME}/share/hadoop/common directory, not directly in ${HADOOP_HOME}.
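A quick way to confirm where it lives (the glob assumes a stock tarball layout; adjust if your distribution differs):
ls ${HADOOP_HOME}/share/hadoop/common/hadoop-common-*.jar
# or search the whole installation if the layout differs:
find ${HADOOP_HOME} -name 'hadoop-*core*.jar' -o -name 'hadoop-common*.jar'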
You can set the environment variable in your .bashrc file.
vim ~/.bashrc
Then add the following line to the end of the .bashrc file:
export HADOOP_HOME=/your/hadoop/installation/directory
Just replace the path with your hadoop installation path.
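After saving, reload the environment and verify (standard shell commands):
source ~/.bashrc     # reload the updated environment
echo $HADOOP_HOME    # should print your installation directory
If you don't know where Hadoop is installed in the first place, readlink -f $(which hadoop) resolves the real location of the hadoop launcher, which usually sits under the installation's bin/ directory.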

Groovy - How to Build a Jar

I've written a Groovy script which has a dependency on a SQL Server driver (sqljdbc4.jar). I can use the GroovyWrapper (link below) to compile it into a JAR, but how can I get the dependencies into the jar? I'm looking for a "best practice" sort of thing.
https://github.com/sdanzan/groovy-wrapper
Both of the replies below have been helpful, but how can I do this for signed Jar files? For instance:
Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
In the groovy wrapper script, you'll see this line near the bottom:
// add more jars here
That's where you can add your dependencies. If the jar file is in the same directory you're building from, add a line like this:
zipgroupfileset( dir: '.', includes: 'sqljdbc4.jar' )
Then rerun the script and your jar will include the classes from sqljdbc4.jar.
Edit:
If the jar file you depend on is signed and you need to keep the signature, you'll have to keep the external jar: you can't include jar files inside other jar files without using a custom classloader. You can, however, declare the dependency in the manifest to avoid having to set the classpath, i.e. your jar stays executable with java -jar myjar.jar. Update the manifest section in the wrapping script to:
manifest {
    attribute( name: 'Main-Class', value: mainClass )
    attribute( name: 'Class-Path', value: 'sqljdbc4.jar' )
}
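Alternatively, if you don't actually need the signature, a common workaround (not something the wrapper script does for you) is to strip the signature entries from the merged jar so verification is skipped:
# Remove signature files from the merged jar; only do this when
# signature verification is genuinely not required
zip -d myjar.jar 'META-INF/*.SF' 'META-INF/*.DSA' 'META-INF/*.RSA'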
From your link, if you look at the source of the GroovyWrapper script, there are these lines:
zipgroupfileset( dir: GROOVY_HOME, includes: 'embeddable/groovy-all-*.jar' )
zipgroupfileset( dir: GROOVY_HOME, includes: 'lib/commons*.jar' )
// add more jars here
I'd explicitly add it there.

Warbler config.java_classes and log4j.properties

I'm packaging up a Rails app with warbler and I want app-specific logging. I've added the log4j and commons-logging jars to the WEB-INF/lib directory, and I want to add log4j.properties to the WEB-INF/classes directory. The problem is, I also want environment-specific logging, so that staging/production use different properties (i.e. INFO instead of DEBUG) than development. I can't just do:
config.java_classes = FileList["lib/log4j-#{RAILS_ENV}.properties"]
because Tomcat seems to look for the specific file log4j.properties. Is there any way to get warbler to rename this file to just log4j.properties? Or is there a better mechanism for app-specific, environment-specific logging?
And for the final answer: RAILS_ENV doesn't seem to work in warbler, but looking through the docs on the warble config, there's a webxml attribute that contains rails.env. Modifying my code to pull the file like this:
config.java_classes = FileList["lib/properties/log4j.properties.#{config.webxml.rails.env}"]
Worked like a charm!
Guess I should have just read further down in the warble file itself. You can configure pathmaps for the java_classes. Here's what I used:
config.java_classes = FileList["lib/properties/log4j.properties.#{RAILS_ENV}"]
config.pathmaps.java_classes << "%n"
The only problem I've found is that this doesn't actually put log4j.properties in the WEB-INF/classes directory anymore; it now puts it in the root of the archive. That seems odd, given that the docs specifically say:
One or more pathmaps defining how the java classes should be copied into WEB-INF/classes
I wouldn't have thought I'd need to add the WEB-INF/classes path manually, but I did. So finally, this worked:
config.java_classes = FileList["lib/properties/log4j.properties.#{RAILS_ENV}"]
config.pathmaps.java_classes << "WEB-INF/classes/%n"
with the files named log4j.properties.#{RAILS_ENV} kept in the lib/properties directory.
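As a closing sanity check (warble is warbler's standard build command; myapp.war stands in for your generated archive):
warble                                      # rebuild the war with the new config
jar tf myapp.war | grep log4j.properties    # confirm it landed in WEB-INF/classes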
