Spark streaming: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY - apache-spark

I'm trying to start Spark Streaming in standalone mode (Mac OS X) and am getting the following error no matter what:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.storage.DiskBlockManager.addShutdownHook(DiskBlockManager.scala:147)
at org.apache.spark.storage.DiskBlockManager.<init>(DiskBlockManager.scala:54)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:75)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:173)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:347)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:450)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:566)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:90)
at org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:78)
at io.ascolta.pcap.PcapOfflineReceiver.main(PcapOfflineReceiver.java:103)
Caused by: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY
at java.lang.Class.getField(Class.java:1584)
at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:220)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:189)
at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
... 13 more
This symptom is discussed in relation to EC2 at https://forums.databricks.com/questions/2227/shutdown-hook-priority-javalangnosuchfieldexceptio.html as a Hadoop 2 dependency issue. But I'm running locally (for now), and am using the spark-1.5.2-bin-hadoop2.6.tgz binary from https://spark.apache.org/downloads.html, which I'd hoped would eliminate this possibility.
I've pruned my code down to essentially nothing; like this:
SparkConf conf = new SparkConf()
    .setAppName(appName)
    .setMaster(master);
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
I've permuted my Maven dependencies to ensure all Spark artifacts are consistent at version 1.5.2. Yet the ssc initialization above fails no matter what, so I thought it was time to ask for help.
The build environment is Eclipse and Maven with the shade plugin. Launch/run is from the Eclipse debugger, not spark-submit, for now.

I hit this issue today. It was because I had two jars in my pom: hadoop-common-2.7.2.jar and hadoop-core-1.6.1.jar, and both provide hadoop.fs.FileSystem.
But the FileSystem class in hadoop-core 1.6.1 has no SHUTDOWN_HOOK_PRIORITY field, while the one in hadoop-common 2.7.2 does, and it seems my code resolved FileSystem from the 1.6.1 jar. That is why the issue was raised.
The solution is also simple: delete hadoop-core-1.6.1 from the pom. In other words, we need to check that every FileSystem dependency in the project is 2.x or above.
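If it is unclear which jar is actually winning on the classpath, one quick check (this snippet is only an illustrative sketch, not part of the original answer) is to print where the FileSystem class was loaded from:
// Prints the jar that org.apache.hadoop.fs.FileSystem was loaded from,
// e.g. .../hadoop-core-1.6.1.jar vs .../hadoop-common-2.7.2.jar
System.out.println(org.apache.hadoop.fs.FileSystem.class
        .getProtectionDomain().getCodeSource().getLocation());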

Related

Hikari NoSuchMethodError on AWS EMR/Spark

I am trying to upgrade EMR from 5.13 to 5.35 using spark-2.4.8. The jar I'm trying to use has a dependency on HikariCP:4.0.3, which is called to set the db pool config via setKeepaliveTime. While I can run my job fine on my local machine, it bombs out in EMR-5.35 with the following error:
java.lang.NoSuchMethodError: com.zaxxer.hikari.HikariConfig.setKeepaliveTime(J)
The problem is that, at runtime, HikariConfig is being loaded from file:/usr/lib/spark/jars/HikariCP-java7-2.4.12.jar instead of from the dependency provided in my custom/fat jar. The workaround right now is to remove that jar, but is there an elegant way to know where that jar is coming from on the EMR cluster, and how could we remove it on start-up?
Just in case anyone else faces this: the fix was shading (a process that renames packages inside the uber jar). I basically had to make sure the dependency I use doesn't get overridden by the stale one in EMR-5.35.0. It looked something like the below:
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("com.zaxxer.hikari.**" -> "x_hikari_conf.@1")
    .inLibrary("x" % "y" % "z")
    .inProject
)
And that was pretty much it; after the above lines were put in and the new jar was created, it worked like a charm.
More on shading can be read here
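As for the other part of the question, finding where a conflicting class comes from on the EMR nodes, one way (a sketch, assuming the Spark jars live under /usr/lib/spark/jars as in the error above and that unzip is available) is to scan the bundled jars from a shell on the master node:
for j in /usr/lib/spark/jars/*.jar; do unzip -l "$j" | grep -q HikariConfig.class && echo "$j"; done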

Spark saveAsTextFile method is really strange in the Java API, it just does not work right in my program

I am new to Spark and hit this problem when running my test program. I installed Spark on a Linux server, and it has just one master node and one worker node. Then I wrote a test program on my laptop, with code like this:
JavaSparkContext ct = new JavaSparkContext("spark://192.168.90.74:7077", "test", "/home/webuser/spark/spark-1.5.2-bin-hadoop2.4", new String[0]);
ct.addJar("/home/webuser/java.spark.test-0.0.1-SNAPSHOT-jar-with-dependencies.jar");
List<Integer> list = new ArrayList<>();
list.add(1);
list.add(6);
list.add(9);
JavaRDD<Integer> rdd = ct.parallelize(list);
System.out.println(rdd.collect());
rdd.saveAsTextFile("/home/webuser/temp");
ct.close();
I supposed I would get /home/webuser/temp on my server, but in fact this program creates c://home/webuser/temp on my laptop, whose OS is Windows 8. I don't understand this;
shouldn't saveAsTextFile() run on Spark's worker node? Why does it just run on my laptop, which is Spark's driver, I suppose?
It depends on which filesystem is the default for your Spark installation. From what you're describing, the default filesystem for you is file:///, which is indeed the default. To change this, you need to modify the fs.defaultFS property in core-site.xml of your Hadoop configuration. Otherwise, you can simply change your code and specify the filesystem URL explicitly, i.e.:
rdd.saveAsTextFile("hdfs://192.168.90.74/home/webuser/temp");
assuming 192.168.90.74 is your NameNode.
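For reference, a minimal core-site.xml entry for this would look like the following (the port is an assumption; use your NameNode's actual RPC port):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.90.74:8020</value>
</property>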

Running spark code locally on eclipse with spark installed on remote server

I have configured Eclipse for Scala, created a Maven project, and written a simple word count Spark job on Windows. Now my Spark and Hadoop are installed on a Linux server. How can I launch my Spark code from Eclipse onto the Spark cluster (which is on Linux)?
Any suggestions?
Actually the answer is not as simple as you would expect.
I will make many assumptions: first, that you use sbt; second, that you are working on a Linux-based computer; third, that you have two classes in your project, let's say RunMe and Globals; and last, that you want to set up the settings inside the program. Thus, somewhere in your runnable code you must have something like this:
object RunMe {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("mesos://master:5050") // if you use Mesos, and if your network resolves the hostname "master" to its IP
      .setAppName("my-app")
      .set("spark.executor.memory", "10g")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // your code comes here
  }
}
The steps you must follow are:
Compile the project, from its root directory, by using:
$ sbt assembly
Send the job to the master node. This is the interesting part (assuming your project has the directory target/scala/, containing a .jar file that corresponds to the compiled project):
$ spark-submit --class RunMe target/scala/app.jar
Notice that, because I assumed the project has two or more classes, you have to identify which class you want to run. Furthermore, I'd bet the approaches for YARN and Mesos are very similar.
If you are developing a project on Windows and want to deploy it in a Linux environment, then you would want to create an executable JAR file, copy it to the home directory on your Linux machine, and point your Spark launch script (on your terminal) at it. This is all possible because of the beauty of the Java Virtual Machine. Let me know if you need more help.
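For illustration (the host, paths, and class name here are placeholders, not from the original answer), that hand-off could look like:
$ scp target/my-app-jar-with-dependencies.jar user@linux-server:~/
$ spark-submit --master spark://linux-server:7077 --class your.main.Class ~/my-app-jar-with-dependencies.jar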
To achieve what you want, you would need:
First: Build the jar (if you use gradle -> fatJar or shadowJar)
Second: In your code, when you create the SparkConf, you need to specify the master address, spark.driver.host, and the jar location, something like:
SparkConf conf = new SparkConf()
    .setMaster("spark://SPARK-MASTER-ADDRESS:7077")
    .set("spark.driver.host", "IP address of your local machine")
    .setJars(new String[]{"path\\to\\your\\jar file.jar"})
    .setAppName("APP-NAME");
And third: just right-click and run from your IDE. That's it!
What you are looking for is the master where the SparkContext should be created.
You need to set your master to be the cluster you want to use.
I invite you to read the Spark Programming Guide or follow an introductory course to understand these basic concepts. Spark is not a tool you can begin working with overnight; it takes some time.
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark

Spark 1.4 image for Google Cloud?

With bdutil, the latest version of the tarball I can find is Spark 1.3.1:
gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz
There are a few new DataFrame features in Spark 1.4 that I want to use. Any chance the Spark 1.4 image will be made available for bdutil, or is there any workaround?
UPDATE:
Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz, and the deployment went well; however, I ran into an error when calling SqlContext.parquetFile(). I cannot explain why this exception is possible; GoogleHadoopFileSystem should be a subclass of org.apache.hadoop.fs.FileSystem. I will continue investigating this.
Caused by: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2595)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hive.metastore.Warehouse.getFs(Warehouse.java:112)
at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:144)
at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:159)
at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:177)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:504)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:356)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:171)
Asked a separate question about the exception here
UPDATE:
The error turned out to be a Spark defect; a resolution/workaround is provided in the question linked above.
Thanks!
Haiying
If a local workaround is acceptable, you can copy the spark-1.4.1-bin-hadoop2.6.tgz from an apache mirror into a bucket that you control. You can then edit extensions/spark/spark-env.sh and change SPARK_HADOOP2_TARBALL_URI='<your copy of spark 1.4.1>' (make certain that the service account running your VMs has permission to read the tarball).
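For illustration (the bucket name is a placeholder, and the mirror URL is just an example), the workaround amounts to:
$ wget https://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
$ gsutil cp spark-1.4.1-bin-hadoop2.6.tgz gs://your-bucket/spark-1.4.1-bin-hadoop2.6.tgz
and then, in extensions/spark/spark-env.sh:
SPARK_HADOOP2_TARBALL_URI='gs://your-bucket/spark-1.4.1-bin-hadoop2.6.tgz'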
Note that I haven't done any testing to see if Spark 1.4.1 works out of the box right now, but I'd be interested in hearing your experience if you decide to give it a go.

When to use SPARK_CLASSPATH or SparkContext.addJar

I'm using a standalone spark cluster, one master and 2 workers.
I really don't understand how to use SPARK_CLASSPATH or SparkContext.addJar wisely. I tried both, and it looks like addJar doesn't work the way I believed it did.
In my case I tried to use some joda-time functions, in the closures or outside them. If I set SPARK_CLASSPATH with a path to the joda-time jar, everything works OK. But if I remove SPARK_CLASSPATH and instead add this to my program:
JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "name", "path-to-spark-home", "path-to-the-job-jar");
sc.addJar("path-to-joda-jar");
it doesn't work anymore, although in the logs I can see:
14/03/17 15:32:57 INFO SparkContext: Added JAR /home/hduser/projects/joda-time-2.1.jar at http://127.0.0.1:46388/jars/joda-time-2.1.jar with timestamp 1395066777041
and immediately after:
Caused by: java.lang.NoClassDefFoundError: org/joda/time/DateTime
at com.xxx.sparkjava1.SimpleApp.main(SimpleApp.java:57)
... 6 more
Caused by: java.lang.ClassNotFoundException: org.joda.time.DateTime
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
I used to assume that SPARK_CLASSPATH set the classpath for the driver part of the job, and SparkContext.addJar set the classpath for the executors, but that no longer seems right.
Does anyone know better than me?
SparkContext.addJar is broken in 0.9, as is the ADD_JARS environment variable. They used to work as documented in 0.8.x, and the fix has already been committed to master, so it's expected in the next release. For now, you can either use the workaround described in the Jira issue or build a patched Spark.
See relevant mailing list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3C5234E529519F4320A322B80FBCF5BDA6#gmail.com%3E
Jira issue: https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1089
SPARK_CLASSPATH has been deprecated since Spark 1.0. You can add jars to the classpath programmatically, in spark-defaults.conf, or with spark-submit flags.
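For example (the jar paths and the application jar name are placeholders), the spark-submit and spark-defaults.conf variants could look like:
$ spark-submit --jars /path/to/joda-time-2.1.jar --class com.xxx.sparkjava1.SimpleApp your-app.jar
or, in conf/spark-defaults.conf:
spark.driver.extraClassPath   /path/to/joda-time-2.1.jar
spark.executor.extraClassPath /path/to/joda-time-2.1.jar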
Add jars to a Spark Job - spark-submit
