Why is running mahout spark-itemsimilarity giving an error? - apache-spark

I get the following stack trace error when I run
./mahout spark-itemsimilarity --input input-file
--output /output_dir
--master spark://url_to_master
--filter1 purchase
--filter2 view
--itemIDColumn 2
--rowIDColumn 0
--filterColumn 1
in a Linux terminal.
I cloned the Mahout project from GitHub (branch spark-1.2) and ran
mvn install
in the Mahout source code directory, and then cd mahout/bin/.
java.lang.NoClassDefFoundError: com/google/common/collect/HashBiMap
at org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator.registerClasses(MahoutKryoRegistrator.scala:39)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:104)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:104)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:104)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:159)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:121)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:214)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.HashBiMap
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 22 more
Please help!
Thanks.

Mahout 0.10.0 supports Spark 1.1.1 or lower. If you build from source and change the Spark version number in the main pom at mahout/pom.xml, you can build for Spark 1.2, but you will have to use the workaround described below. The jar with "dependency-reduced" in its name will be in mahout/spark/target. A Spark 1.2 branch is being worked on, so the above fix will not be needed once it lands; it is maybe a week from being ready to try.
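For reference, a rough sketch of that rebuild (the -DskipTests flag is optional and just speeds things up; the Spark version is edited by hand in the pom):
git clone https://github.com/apache/mahout.git
cd mahout
# edit pom.xml and change the Spark version to 1.2.x, then rebuild
mvn clean install -DskipTests
# the dependency-reduced jar should now be under spark/target/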
There is a bug from Spark 1.2 onward; I'm not sure if it's fixed in 1.3.
See it here: https://issues.apache.org/jira/browse/SPARK-6069
What worked for me is to put the jar that bundles Guava (it will be called mahout-spark_2.10-0.11.0-SNAPSHOT-dependency-reduced.jar or something like that) on all workers and then pass that location to the Mahout job using:
spark-itemsimilarity -D:spark.executor.extraClassPath=/path/to/mahout/spark/target/mahout-spark_2.10-0.11-dependency-reduced.jar
The path must contain the jar on all workers.
The code workaround will go into the spark-1.2 branch in the next week or so, which will make the -D:spark.executor.extraClassPath=/path/to/mahout... option unneeded.
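As a hedged sketch of that setup, assuming three workers reachable as worker1..worker3 and /opt/mahout as the target directory (both are placeholders), you could copy the dependency-reduced jar to the same path on every worker and point the executors at it:
for host in worker1 worker2 worker3; do
  scp mahout/spark/target/mahout-spark_2.10-0.11.0-SNAPSHOT-dependency-reduced.jar \
      $host:/opt/mahout/
done
./mahout spark-itemsimilarity --input input-file --output /output_dir --master spark://url_to_master \
  -D:spark.executor.extraClassPath=/opt/mahout/mahout-spark_2.10-0.11.0-SNAPSHOT-dependency-reduced.jar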

Related

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException

I've set up a 3-node cluster (1 master & 2 workers) running Hadoop with YARN, along with Spark.
My PySpark scripts need org.elasticsearch.spark in order to write to Elasticsearch. I'm providing this with the parameter --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.4.1 when executing my PySpark script with spark-submit.
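For context, the spark-submit invocation looks roughly like this (the script name is a placeholder):
spark-submit --master yarn \
  --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.4.1 \
  my_pyspark_script.py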
I'm stuck with this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
What I have tried:
I have tried adding all the paths listed in this answer - https://stackoverflow.com/a/25393369/6490744 - it doesn't work.
I had Hadoop 3.1.1; after checking https://github.com/apache/incubator-kyuubi/issues/2904 (they mention that the issue is resolved in Hadoop 3.3.3) I upgraded to 3.3.3, but the issue still persists.
I have also tried manually downloading the jar into my spark/jars directory using wget -U "Any User Agent" https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/8.4.1/elasticsearch-spark-30_2.12-8.4.1.jar, then doing spark-submit without passing --packages (since the jar is already on the path).
All of this has been giving me the same error.
After 2 hours of struggle, I got the clue from https://github.com/apache/incubator-kyuubi/issues/2904#issuecomment-1158643036:
I had yarn.timeline-service.enabled set to true in my /etc/hadoop/yarn-site.xml. After updating it to false, the error is gone.
Now I wonder how to set up the YARN timeline server.
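For anyone else hitting this, the relevant yarn-site.xml entry looks like this (a minimal sketch; your file will contain other properties as well):
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>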

Apache Spark: How to add default and job-specific dependencies?

My spark-defaults.conf:
# a package I need every time
spark.jars.packages org.influxdb:influxdb-java:2.14
When I launch a job:
spark-shell --master yarn --num-executors 6 --packages "a random package that I need only for this job specifically"
I get this error:
java.lang.NoClassDefFoundError: org/influxdb/InfluxDBFactory
at ch.cern.sparkmeasure.InfluxDBSink.<init>(influxdbsink.scala:53)
at ch.cern.sparkmeasure.InfluxDBSinkExtended.<init>(influxdbsink.scala:232)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2688)
at org.apache.spark.util.Utils$$anonfun$loadExtensions$1.apply(Utils.scala:2680)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2680)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2387)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2386)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2386)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:555)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
... 62 elided
Caused by: java.lang.ClassNotFoundException: org.influxdb.InfluxDBFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 88 more
I think --packages is overriding spark.jars.packages from spark-defaults.conf, so I get this error.
I don't want to include the influxdb package every time.
I want to have fixed packages in spark-defaults.conf and a dynamic package when I launch a job with --packages, but the two seem incompatible.
Any ideas?
That class is indeed part of influxdb-java, so you shouldn't be getting that error, especially if your code isn't even trying to use Influx packages.
However, if you have packages specific to a certain application, you should actually package them as part of that application. Then others will be able to run it as well without having to remember the specific spark-submit options.
If you are using a build system for your code, look into creating an uber jar.
I'm not familiar with the semantics of spark-defaults or whether it gets overridden, but if you do have packages that always need to be part of your Spark applications, you should just download the JARs directly into the Spark workers' classpaths rather than making applications download them every time.
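If --packages really does replace spark.jars.packages rather than merging with it, one possible workaround (a sketch, not verified against every Spark version) is to list both coordinates explicitly at launch, since --packages accepts a comma-separated list; groupId:artifactId:version stands in for the job-specific package:
spark-shell --master yarn --num-executors 6 \
  --packages org.influxdb:influxdb-java:2.14,groupId:artifactId:version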

Spark 2.3 java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric

Spark 2.3 is throwing the following exception. Can anyone please help? I tried adding the JARs.
308 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User class threw exception: java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllocatorMetric;
java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllocatorMetric;
at org.apache.spark.network.util.NettyMemoryMetrics.registerMetrics(NettyMemoryMetrics.java:80)
at org.apache.spark.network.util.NettyMemoryMetrics.<init>(NettyMemoryMetrics.java:76)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:109)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:71)
at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:461)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:249)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:256)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:423)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at com.voicebase.etl.HBasePhoenixPerformance2.main(HBasePhoenixPerformance2.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
315 [main] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:486)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
This is because the Hadoop binaries were compiled against an older Netty version, and we just need to replace those jars. I haven't faced any issues with Hadoop after replacing them.
You need to replace netty-3.6.2.Final.jar and netty-all-4.0.23.Final.jar under $HADOOP_HOME/share/hadoop with netty-all-4.1.17.Final.jar and netty-3.9.9.Final.jar.
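A hedged shell sketch of that swap (the common/lib subdirectory and the exact jar versions are illustrative; Hadoop keeps copies in several lib folders, so check your installation):
# find the Netty jars that ship with Hadoop
find $HADOOP_HOME/share/hadoop -name 'netty*.jar'
# move the old ones out of the way and drop in the newer versions
mv $HADOOP_HOME/share/hadoop/common/lib/netty-3.6.2.Final.jar /tmp/
mv $HADOOP_HOME/share/hadoop/common/lib/netty-all-4.0.23.Final.jar /tmp/
cp netty-3.9.9.Final.jar netty-all-4.1.17.Final.jar $HADOOP_HOME/share/hadoop/common/lib/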
This issue arises from a mismatch between the Netty versions that Hadoop and Spark are compiled against, so you can proceed as follows.
A similar issue was solved by manually compiling Spark against a specific version of Netty.
The other option, as recommended by Suhas, is to copy the contents of the SPARK_HOME/jars folder into the various lib folders (or only the yarn one) inside HADOOP_HOME/share/hadoop; that also solves the problem, but it's a dirty fix, so it may be better to use the latest versions of both or to compile them manually.
An older version of Netty was required by the aws-java-sdk. Deleting all the netty jars and removing the aws-java-sdk from the project solved the problem.
The issue was resolved by adding the Netty jars below to the dependencies,
"io.netty" % "netty-all" % "4.1.68.Final"
"io.netty" % "netty-buffer" % "4.1.68.Final"
and by excluding all existing Netty jars with an excludeAll rule:
val excludeNettyBufferBinding = ExclusionRule(organization = "io.netty.buffer")
excludeAll(excludeNettyBufferBinding)

Apache Zeppelin configuration with Spark

I have been trying to configure Apache Zeppelin with Spark 2.0. I managed to install both on a Linux OS, with Spark on port 8080 and the Zeppelin server on port 8082.
In Zeppelin's zeppelin-env.sh file I set the SPARK_HOME variable to the location of the Spark folder.
However, when I try to create a new note nothing compiles properly. It seems I didn't configure the interpreters, as the interpreter tab is missing from the home page.
Any help would be much appreciated.
EDIT: E.g. when I try to run the Zeppelin tutorial's 'Load data into table' step, I receive the following error:
java.lang.ClassNotFoundException: org.apache.spark.repl.SparkCommandLine
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:400)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I don't think it's possible to use Spark 2.0 without building Zeppelin from source, since some relatively big changes have happened with this release.
You can clone the Zeppelin git repo and build using the Spark 2.0 profile, as mentioned in the README on GitHub: https://github.com/apache/zeppelin.
I've tried it and it works.
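A rough sketch of that build (the profile names here are assumptions based on the README of that era and may differ by Zeppelin version, so check the README first):
git clone https://github.com/apache/zeppelin.git
cd zeppelin
mvn clean package -DskipTests -Pspark-2.0 -Pscala-2.11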

Hadoop HDFS test running issue - org.apache.hadoop.conf.Configuration NoClassDefFoundError

I'm working with Hadoop 0.21.0 and trying to run the hdfs_test application that comes with the C API library. After many problems I was able to compile hdfs_test. Now when I run it:
./hdfs_test
I'm getting the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
... 1 more
Can't construct instance of class org.apache.hadoop.conf.Configuration
Oops! Failed to connect to hdfs!
Any help is appreciated, thanks.
Like any other Java program, you need the dependencies on the classpath or inside the jar. Hadoop also has a HADOOP_CLASSPATH variable to tell the cluster where to find dependencies in map-reduce tasks. Also see How to run a Hadoop program?
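A minimal sketch of setting the classpath before launching hdfs_test (it assumes libhdfs picks the JVM classpath up from the CLASSPATH environment variable and that the Hadoop and commons-logging jars live under $HADOOP_HOME; adjust the paths to your layout):
# put every jar in the Hadoop distribution, including lib/ dependencies such as commons-logging, on the classpath
export CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | tr '\n' ':')
./hdfs_test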

Resources