spark kinesis failing on cloudera with java.lang.AbstractMethodError - apache-spark

below is my POM file. I am writing a spark streaming with aws kinesis
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-client</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
<version>1.6.0</version>
</dependency>
I am facing below exception during run of spark of spark program on Cloudera 5.10
17/04/27 05:34:04 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 58.0 (TID 179, hadoop1.local, executor 5): java.lang.AbstractMethodError
at org.apache.spark.Logging$class.log(Logging.scala:50)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.log(KinesisCheckpointer.scala:39)
at org.apache.spark.Logging$class.logDebug(Logging.scala:62)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.logDebug(KinesisCheckpointer.scala:39)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.startCheckpointerThread(KinesisCheckpointer.scala:119)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.<init>(KinesisCheckpointer.scala:50)
at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2000)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This runs perfectly fine on EMR4.4 However CDH fails. Any suggestion

The underlying problem seems to be the use of org.apache.spark.Logging:
NOTE: DO NOT USE this class outside of Spark.
It is intended as an internal utility.
This will likely be changed or removed in future releases.
http://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/Logging.html
This is fixed in 2.0.0 as mentioned in https://issues.apache.org/jira/browse/SPARK-9307.

Related

Spark error with google/guava library: java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.refreshAfterWrite

I have a simple spark project - in which in the pom.xml the dependencies are only the basic scala, scalatest/junit, and spark:
<dependency>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.binary.version}</artifactId>
<version>3.0.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>compile</scope>
</dependency>
</dependencies>
When attempting to run a basic spark program the SparkSession init fails on this line:
SparkSession.builder.master(master).appName("sparkApp").getOrCreate
Here is the output / error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/04/07 18:06:15 INFO SparkContext: Running Spark version 2.2.1
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder
.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)
Lcom/google/common/cache/CacheBuilder;
at org.apache.hadoop.security.Groups.<init>(Groups.java:96)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
I have run spark locally many dozens of times on other projects, what might be wrong with this simple one? Is there a dependency on $HADOOP_HOME environment variable or similar?
Update By downgrading the spark version to 2.0.1 I was able to compile. That does not fix the problem (we need newer) version. But it helps point out the source of the problem
Another update In a different project the hack to downgrade to 2.0.1 does help - i.e. execution proceeds further : but then when writing out to parquet a similar exception does happen.
8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0 (TID 2618)
java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
at org.apache.hadoop.io.compress.CodecPool.<clinit>(CodecPool.java:74)
at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.<init>(CodecFactory.java:92)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
This error occurs due to version mismatch between Google's guava library and Spark.
Spark shades guava but many libraries use guava. You can try Shading the Guava dependencies as per this post.
Apache-Spark-User-List
Adding shade plugin to your pom file and relocating google package can resolve this issue.
More information can found here and here
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google.common</pattern>
<shadedPattern>shade.com.google.common</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
If this also doesn't help then adding guava library of version 15.0 works nicely. The reason of this work around is in dependencyManagement. The nice SO answer is here
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>15.0</version>
</dependency>
</dependencies>
</dependencyManagement>
I am getting this error in spring boot : java.lang.TypeNotPresentException: Type com.google.common.cache.CacheBuilderSpec
com.google.common.cache.CacheBuilder.build()Lcom/google/common/cache/Cache
The issue is due to "com.google.guava:guava" api. In springboot this api comes under some other api might be "spring-boot-starter-web" or "springfox-swagger2" api so we need to first exclude guava api from springfox-swagger2 jar and need to add updated version of guava api.spring-data-mongodb
Solution:
1. add guava dependency on the top of all the dependency so that springboot can ge the latest version:
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0</version>
</dependency>
Find out the spring boot dependecy where artifactId: "guava" included then exlude "guava" artifact from that dependency and then add the guava dependency like above.

Getting java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder Exception while streaming kafka from Spark streaming

I am trying to read the kafka streaming data from spark streaming application; while in the process of reading data I am getting following exception:
16/12/24 11:09:05 INFO storage.BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.inndata.RSVPSDataStreaming.KafkaToSparkStreaming.main(KafkaToSparkStreaming.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 10 more
Here is my versions info:
spark: 1.6.2
kafka: 0.8.2
Here is pom.xml:
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.8.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8-assembly_2.10</artifactId>
<version>2.0.0-preview</version>
</dependency>
Seems like implicit String Encoder is needed
Try to apply this
import org.apache.spark.sql.Encoder
implicit val stringpEncoder = org.apache.spark.sql.Encoders.kryo[String]
you can find more about Encoders here and from official Document here
You use incompatible and duplicated artifact versions. Remember that when using Spark:
all Scala artifacts have to use the same major Scala version (2.10, 2.11).
all Spark artifacts have to use the same major Spark version (1.6, 2.0).
In you build definition you mix spark-streaming 1.6 with spark-core 2.0 and you include duplicated spark-streaming-kafka for Scala 2.10 while remaining dependencies are for Scala 2.11.

Logging issue in apache spark streaming

Getting below exception while creating KafkaUtils.createStream();
Below is my spark dependency.same thing was working in spark streaming older version 1.5.2
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.2</version>
</dependency>
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:110)
at org.apache.spark.streaming.kafka.KafkaUtils.createStream(KafkaUtils.scala)
at com.tcs.iux.core.config.RealtimeProcessing.startSpark(RealtimeProcessing.java:78)
at com.tcs.iux.core.processor.StartRealTimeProcessing.main(StartRealTimeProcessing.java:32)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
Logging was made private in Spark 2.0
use 2.0.0 version of spark-streaming-kafka_2.11 to match it with your spark version.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>2.0.0</version>
</dependency>
You're mixing between Spark versions, which is never a good idea. You need to be aligned both for Spark Streaming and it's external Kafka driver.
The dependencies have changed a bit. There's now experimental support for Kafka 0.10.0.x. If you're using Kafka 0.8.2.x, you'll need:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.0.0</version>
</dependency>
If you want to try out Kafka 0.10.0.x, you'll need:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.0</version>
</dependency>

Deploying Storm topology with Spark dependency

I'm trying to deploy a Storm topology (version 1.0.0) with a Spark dependency (version 1.6.1). That topology works normally using Local Cluster, but not submitting it to a cluster. I know that Spark and Storm needs the libs related with log4j. So, if the pom file is modified to:
<!-- Apache Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
This error occurs:
java.lang.NoSuchMethodError: org.apache.log4j.Logger.setLevel(Lorg/apache/log4j/Level;)V
at org.apache.spark.util.AkkaUtils$$anonfun$org$apache$spark$util$AkkaUtils$$doCreateActorSystem$1.apply(AkkaUtils.scala:75) ~[stormjar.jar:?]
at org.apache.spark.util.AkkaUtils$$anonfun$org$apache$spark$util$AkkaUtils$$doCreateActorSystem$1.apply(AkkaUtils.scala:75) ~[stormjar.jar:?]
at scala.Option.map(Option.scala:145) ~[stormjar.jar:?]
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:75) ~[stormjar.jar:?]
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53) ~[stormjar.jar:?]
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52) ~[stormjar.jar:?]
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988) ~[stormjar.jar:?]
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) ~[stormjar.jar:?]
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979) ~[stormjar.jar:?]
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55) ~[stormjar.jar:?]
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266) ~[stormjar.jar:?]
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193) ~[stormjar.jar:?]
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288) ~[stormjar.jar:?]
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457) ~[stormjar.jar:?]
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59) ~[stormjar.jar:?]
at ufrn.imd.engsoft.storm.SentimentAnalyserBolt.prepare(SentimentAnalyserBolt.java:106) ~[stormjar.jar:?]
at org.apache.storm.daemon.executor$fn__8226$fn__8239.invoke(executor.clj:795) ~[storm-core-1.0.0.jar:1.0.0]
at org.apache.storm.util$async_loop$fn__554.invoke(util.clj:482) [storm-core-1.0.0.jar:1.0.0]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
2016-04-22 00:29:06.285 o.a.s.util [ERROR] Halting process: ("Worker died")
And without any exclusions in the Spark dependency this error occurs:
java.lang.ExceptionInInitializerError
at org.apache.log4j.Logger.getLogger(Logger.java:39) ~[log4j-over-slf4j-1.6.6.jar:1.6.6]
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:75) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59) ~[spark-assembly-1.6.1-hadoop2.6.0.jar:1.6.1]
at ufrn.imd.engsoft.storm.SentimentAnalyserBolt.prepare(SentimentAnalyserBolt.java:106) ~[stormjar.jar:?]
at org.apache.storm.daemon.executor$fn__8226$fn__8239.invoke(executor.clj:795) ~[storm-core-1.0.0.jar:1.0.0]
at org.apache.storm.util$async_loop$fn__554.invoke(util.clj:482) [storm-core-1.0.0.jar:1.0.0]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_65]
Caused by: java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting StackOverflowError. See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
at org.apache.log4j.Log4jLoggerFactory.<clinit>(Log4jLoggerFactory.java:49) ~[log4j-over-slf4j-1.6.6.jar:1.6.6]
... 18 more
In both cases above the Storm dependency is the following:
<!-- Apache Storm -->
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>${storm.version}</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
The Storm folder contains these jars:
- log4-api-2.1
- log4j-core-2.1
- log4j-over-sl4j-1.6.6
- log4j-sl4j-impl-2.1
- sl4j-api-1.7.7
- sl4j-log4j12
Maybe the Spark being able to get the jars above could solve that problem, but I didn't find clues to do that. Somebody could help me with any idea about how to resolve that problem?
Thanks (:!
log4j only defines an logging interface and log4j-over-slf4j and slf4j-log4j12 and concrete implementations of this interface. As Storm finds both implementations, is does not know which one to use (in the case you do not exclude anything).
In your "exclude" case however, you exclude both, the interface and concrete implementation, so you get the error about the missing interface. Try to exclude only the implementation but not the interface.

Apache Spark: Streaming without HDFS checkpoint

I'm implementing a Spark job which makes use of reduceByKeyAndWindow, therefore I need to add checkpointing.
From Spark's website I see that:
Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved.
My application is just for academic purposes, thus I don't want to set an HDFS for checkpointing but just a local file. Doing so in MacOS works fine (setting a temporary dir as checkpoint dir), the problem comes when doing it in Windows, which throws an exception for permissions.
I already tried starting eclipse as administrator and creating the directory manually setting setWritable, setReadable and setExecutable to true. Any hint on how to overcome the problem in Windows?
Thanks!
Update Here's my code and exception. Just to clarify again, it works fine in Mac but not in Windows.
SparkConf conf = new SparkConf().setAppName("testApp").setMaster("local[2]");
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaStreamingContext jsc = new JavaStreamingContext(ctx, new Duration(1000));
jsc.checkpoint(Files.createTempDir().getAbsolutePath());
Exception:
Exception in thread "pool-7-thread-3" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772)
at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:135)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Solved by adding the latest Hadoop libraries to my project.
If using Maven, the following set of dependencies do the trick.
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0</version>
</dependency>
On Windows, you can solve this problem as following
Download winutils.exe to a folder say MY_UTILS/bin
Create an environment variable HADOOP_HOME and point it to MY_UTILS

Resources