I'm using a standalone Spark cluster: one master and 2 workers.
I don't really understand how to use SPARK_CLASSPATH or SparkContext.addJar properly. I tried both, and it looks like addJar doesn't work the way I believed it did.
In my case I tried to use some joda-time functions, inside closures and outside them. If I set SPARK_CLASSPATH with a path to the joda-time jar, everything works fine. But if I remove SPARK_CLASSPATH and instead add the following in my program:
JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "name", "path-to-spark-home", "path-to-the-job-jar");
sc.addJar("path-to-joda-jar");
It doesn't work anymore, although in the logs I can see:
14/03/17 15:32:57 INFO SparkContext: Added JAR /home/hduser/projects/joda-time-2.1.jar at http://127.0.0.1:46388/jars/joda-time-2.1.jar with timestamp 1395066777041
and immediately after:
Caused by: java.lang.NoClassDefFoundError: org/joda/time/DateTime
at com.xxx.sparkjava1.SimpleApp.main(SimpleApp.java:57)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.joda.time.DateTime
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
I used to assume that SPARK_CLASSPATH set the classpath for the driver part of the job and that SparkContext.addJar set the classpath for the executors, but that no longer seems right.
Does anyone know better than I do?
SparkContext.addJar is broken in 0.9, as is the ADD_JARS environment variable. It used to work as documented in 0.8.x, and the fix has already been committed to master, so it's expected in the next release. For now you can either use the workaround described in the Jira issue or build a patched Spark.
See relevant mailing list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3C5234E529519F4320A322B80FBCF5BDA6#gmail.com%3E
Jira issue: https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1089
SPARK_CLASSPATH has been deprecated since Spark 1.0. You can add jars to the classpath programmatically, in the spark-defaults.conf file, or with spark-submit flags.
Add jars to a Spark Job - spark-submit
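For example, a minimal sketch of the non-programmatic options (the application jar name is a placeholder; the joda-time path and master URL are taken from the question above):
spark-submit --master spark://localhost:7077 --jars /home/hduser/projects/joda-time-2.1.jar --class com.xxx.sparkjava1.SimpleApp your-app.jar
or, equivalently, in spark-defaults.conf:
spark.jars /home/hduser/projects/joda-time-2.1.jar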
I am using Spark 3.0 and I am setting parameters in code.
My parameters:
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
spark.conf.set("spark.executor.instances", 4)
spark.conf.set("spark.executor.memory", 3)
Error:
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.executor.instances
I DON'T want to pass these through spark-submit, as this is a pytest case that I am writing.
How do I get around this?
According to the official Spark documentation, the spark.executor.instances property may not be affected when set programmatically through SparkConf at runtime, so it is suggested to set it through a configuration file or spark-submit command-line options.
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
You can try adding those options to PYSPARK_SUBMIT_ARGS before initializing the SparkContext; its syntax is similar to spark-submit's, as sketched below.
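A minimal sketch of that approach for a pytest setup, assuming placeholder values for the app name, master, and option values (the trailing pyspark-shell token is required by PySpark's gateway launcher):

import os
from pyspark.sql import SparkSession

# Deploy-related properties must be set before the JVM is launched,
# so pass them via PYSPARK_SUBMIT_ARGS instead of spark.conf.set.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--conf spark.executor.instances=4 "
    "--conf spark.executor.memory=3g "
    "pyspark-shell"
)

spark = (
    SparkSession.builder
    .appName("pytest-spark")       # placeholder app name
    .master("local[2]")            # placeholder; use your cluster's master URL in a real run
    # Runtime-control properties can still be set here or via spark.conf.set
    .config("spark.sql.files.maxPartitionBytes", 134217728)
    .getOrCreate()
)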
hduser@Neha-PC:/usr/local/geomesa-tutorials$ java -cp geomesa-tutorials-accumulo/geomesa-tutorials-accumulo-quickstart/target/geomesa-tutorials-accumulo-quickstart-2.3.0-SNAPSHOT.jar org.geomesa.example.accumulo.AccumuloQuickStart --accumulo.instance.id accumulo --accumulo.zookeepers localhost:2184 --accumulo.user root --accumulo.password PASS1234 --accumulo.catalog table1
Picked up JAVA_TOOL_OPTIONS: -Dgeomesa.hbase.coprocessor.path=hdfs://localhost:8020/hbase/lib/geomesa-hbase-distributed-runtime_2.11-2.2.0.jar
Loading datastore
java.lang.IncompatibleClassChangeError: Method org.locationtech.geomesa.security.AuthorizationsProvider.apply(Ljava/util/Map;Ljava/util/List;)Lorg/locationtech/geomesa/security/AuthorizationsProvider; must be InterfaceMethodref constant
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory$.buildAuthsProvider(AccumuloDataStoreFactory.scala:234)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory$.buildConfig(AccumuloDataStoreFactory.scala:162)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:48)
at org.locationtech.geomesa.accumulo.data.AccumuloDataStoreFactory.createDataStore(AccumuloDataStoreFactory.scala:36)
at org.geotools.data.DataAccessFinder.getDataStore(DataAccessFinder.java:121)
at org.geotools.data.DataStoreFinder.getDataStore(DataStoreFinder.java:71)
at org.geomesa.example.quickstart.GeoMesaQuickStart.createDataStore(GeoMesaQuickStart.java:103)
at org.geomesa.example.quickstart.GeoMesaQuickStart.run(GeoMesaQuickStart.java:77)
at org.geomesa.example.accumulo.AccumuloQuickStart.main(AccumuloQuickStart.java:25)
You need to ensure that all versions of GeoMesa on the classpath are the same. Just from your command, it seems you are at least mixing 2.3.0-SNAPSHOT with 2.2.0. Try checking out the git tag of the tutorials project that corresponds to the GeoMesa version you want, as described here. If you want to use a SNAPSHOT version, you need to make sure that you have pulled the latest changes for each project. One quick way to see which versions end up on the classpath is sketched below.
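As a rough check, assuming the tutorials are built with Maven, something like this lists every GeoMesa artifact version your build pulls in:
mvn dependency:tree -Dincludes=org.locationtech.geomesa
Any mixture of 2.2.x and 2.3.0-SNAPSHOT entries in that output points at the mismatch.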
I am setting an extra library path for Spark's executors (in order to run a UDF based on a C++ library).
When providing the extra path via spark.executor.extraLibraryPath, I see that the existing library path is being overridden instead of appended to.
Here is an example showing it:
spark-shell -Dmaster=yarn-client --conf "spark.executor.extraLibraryPath=/path/to/mylib"
And inside the Spark shell, invoking the following shows that the executor's LD_LIBRARY_PATH is indeed not set properly:
scala> sc.parallelize(1 to 1).map(x => System.getenv("LD_LIBRARY_PATH")).collect
res0: Array[String] = Array(/path/to/mylib:)
It seems that there was some fix in SPARK-1719, but I am not sure that fix is correct.
Is there a better way to append a library path to the executor's runtime?
Follow-up from here.
I've added a custom Source and Sink in my application jar and found a way to get a static, fixed metrics.properties onto the standalone cluster nodes. When I want to launch my application, I give the static path: spark.metrics.conf="/fixed-path/to/metrics.properties". Despite my custom source/sink being in my code/fat-jar, I get a ClassNotFoundException on CustomSink.
My fat-jar (with Custom Source/Sink code in it) is on hdfs with read access to all.
So here is everything I've already tried setting (since the executors can't find the custom Source/Sink in my application fat-jar):
spark.executor.extraClassPath = hdfs://path/to/fat-jar
spark.executor.extraClassPath = fat-jar-name.jar
spark.executor.extraClassPath = ./fat-jar-name.jar
spark.executor.extraClassPath = ./
spark.executor.extraClassPath = /dir/on/cluster/* (although * does not reach down to the file level; there are more nested directories, and I have no way of knowing the random application-id or driver-id in order to give an absolute path before launching the app)
It seems like this is how the executors get initialized in this case (please correct me if I am wrong):
The driver tells them the jar location - hdfs://../fat-jar.jar - and some properties like spark.executor.memory, etc.
N executors spin up on the cluster (depending on configuration).
They start downloading hdfs://../fat-jar.jar but initialize the metrics system in the meantime (? - not sure of this step).
The metrics system looks for the custom Sink/Source classes - since they're mentioned in metrics.properties - even before it has finished downloading the fat-jar (which actually contains all those classes) (this is my hypothesis).
ClassNotFoundException - CustomSink not found!
Is my understanding correct? Moreover, is there anything else I can try? If anyone has experience with custom source/sinks, any help would be appreciated.
I stumbled upon the same ClassNotFoundException when I needed to extend the existing GraphiteSink class, and here's how I was able to solve it.
First, I created a CustomGraphiteSink class in org.apache.spark.metrics.sink package:
package org.apache.spark.metrics.sink;
public class CustomGraphiteSink extends GraphiteSink {}
Then I specified the class in metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.CustomGraphiteSink
And passed this file to spark-submit via:
--conf spark.metrics.conf=metrics.properties
In order to use a custom source/sink, one has to distribute it using spark-submit --files and set it via spark.executor.extraClassPath, for example as sketched below.
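A rough sketch of such an invocation, assuming the custom sink lives in a jar named custom-metrics-sink.jar (all names here are placeholders):
spark-submit \
  --files metrics.properties,custom-metrics-sink.jar \
  --conf spark.metrics.conf=metrics.properties \
  --conf spark.executor.extraClassPath=custom-metrics-sink.jar \
  --class com.example.MyApp my-app.jar
Because --files copies each file into the executors' working directory, the relative extraClassPath entry should resolve there.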
I'm trying to start Spark Streaming in standalone mode (macOS) and I get the following error no matter what:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.storage.DiskBlockManager.addShutdownHook(DiskBlockManager.scala:147)
at org.apache.spark.storage.DiskBlockManager.<init>(DiskBlockManager.scala:54)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:75)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:173)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:347)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:450)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:566)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:90)
at org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:78)
at io.ascolta.pcap.PcapOfflineReceiver.main(PcapOfflineReceiver.java:103)
Caused by: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY
at java.lang.Class.getField(Class.java:1584)
at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:220)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:189)
at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
... 13 more
This symptom is discussed in relation to EC2 at https://forums.databricks.com/questions/2227/shutdown-hook-priority-javalangnosuchfieldexceptio.html as a Hadoop 2 dependency issue. But I'm running locally (for now), and I'm using the spark-1.5.2-bin-hadoop2.6.tgz binary from https://spark.apache.org/downloads.html, which I'd hoped would rule that out.
I've pruned my code down to essentially nothing, like this:
SparkConf conf = new SparkConf()
.setAppName(appName)
.setMaster(master);
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
I've permuted the Maven dependencies to ensure all the Spark artifacts are consistently at version 1.5.2, yet the ssc initialization above fails no matter what. So I thought it was time to ask for help.
The build environment is Eclipse and Maven with the shade plugin. Launch/run is from the Eclipse debugger, not spark-submit, for now.
I hit this issue today. It is because I have two jars in my pom, hadoop-common-2.7.2.jar and hadoop-core-1.6.1.jar, and both of them provide hadoop.fs.FileSystem.
But the FileSystem class in 1.6.1 has no SHUTDOWN_HOOK_PRIORITY field, while the one in 2.7.2 does, and it seems my code resolves FileSystem to the 1.6.1 class. That is why this issue arises.
The solution is also simple: delete hadoop-core-1.6.1 from the pom. In other words, we need to check that every FileSystem on the project's classpath comes from Hadoop 2.x or above. A quick way to verify this is sketched below.
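A rough check, assuming a Maven build, to list which Hadoop artifacts (and therefore which FileSystem class) the project pulls in:
mvn dependency:tree -Dincludes=org.apache.hadoop
If anything from hadoop-core 1.x still shows up, remove or exclude it so that only 2.x Hadoop artifacts remain.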