Spark ClassNotFoundException with the simplest rdd.map - apache-spark

Edit: this question is not a duplicate of Resolving dependency problems in Apache Spark:
the answers there are generalities about sharing code, and I don't see how they apply to my issue
here I do not use spark-submit but simply call Spark from an app running on my computer
there is no dependency whatsoever; the only code to share is the trivial function i => i
if it were a problem of Spark/Java/Scala versions, I don't understand why it would produce a ClassNotFoundException for the function in question
I run this code:
import org.apache.spark.{SparkConf, SparkContext}

object Tmp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("Spark Test")
      .setMaster("spark://[ip-of-spark-master]:7077")
    val sc = new SparkContext(sparkConf)
    sc.parallelize(0 until 10000).map(i => i).count()
  }
}
Which fails with java.lang.ClassNotFoundException: Tmp$$anonfun$main$1 when I simply run it on my computer, without sending any jar. It works if I don't do the map(i => i) (i.e. it is the function i => i that is not found).
To be clear:
I do not use spark-submit
my computer is the driver and is not on the spark cluster
I use the exact same spark version on my computer that is used in the cluster (2.0.2)
Now I thought that for a simple function, Spark would serialize it and there would be no need to send the jar. I am pretty sure I have done it before.
Did I miss something?
Looking at the stack trace, it fails on the driver, so on my side:
Driver stacktrace:
[error] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
[error] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
[error] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
[error] Caused by: java.lang.ClassNotFoundException: Tmp$$anonfun$1
[error] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[error] at java.lang.Class.forName0(Native Method)
[error] at java.lang.Class.forName(Class.java:348)
[error] at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
...
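For reference, one of the related answers below notes that Spark normally ships the application's compiled classes to the executors as a jar, and the anonymous function class (Tmp$$anonfun$main$1) lives in that jar. Here is a minimal, hedged sketch of doing that explicitly from a driver running outside the cluster; the jar path is a placeholder, not something from the original question:

import org.apache.spark.{SparkConf, SparkContext}

object Tmp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("Spark Test")
      .setMaster("spark://[ip-of-spark-master]:7077")
      // Hypothetical path: the jar produced by the local build, which contains
      // Tmp and its anonymous function classes, shipped to the executors.
      .setJars(Seq("target/scala-2.11/tmp-app.jar"))
    val sc = new SparkContext(sparkConf)
    sc.parallelize(0 until 10000).map(i => i).count()
  }
}

Equivalently, sc.addJar("target/scala-2.11/tmp-app.jar") can be called after the context is created.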

Related

java.io.EOFException when using spark-submit with yarn as master on a cluster

I'm trying to run a jar file with this spark-submit command:
spark-submit --master yarn --deploy-mode cluster --executor-memory 3g --class my.package.Main my-jar-file.jar
The class Main is the jar's main class; here are its contents (all in Scala):
import java.net.InetSocketAddress
import com.sun.net.httpserver.HttpServer

object Main {
  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress("master", 8000), 0)
    val backend = new MainProcess()
    val handlerRoot = new RootHandler()
    handlerRoot.initProcess(backend)
    server.createContext("/", handlerRoot)
    server.setExecutor(null)
    server.start()
    println("Server is started at " + server.getAddress().getHostString() + ":" + server.getAddress().getPort())
  }
}
The class MainProcess is where I do the work with Spark and the Spark GraphX library, using files obtained from HDFS. This is how I configure the SparkContext in the MainProcess class:
class MainProcess {
  val config = new SparkConf()
  config.setAppName("Final GraphX App - Main")
  val sc = new SparkContext(config)
  ...
}
The app seems to run okay and the final status returns success, but the app simply exits instead of running continuously as a server is supposed to. I can only open the link master:8000 once; when I try refreshing the page it is back to "unable to connect". Here's the log from running the app:
18/04/06 15:45:59 ERROR yarn.YarnAllocator: Failed to launch executor 2 on container container_1522920902032_0027_01_000003
org.apache.spark.SparkException: Exception while starting container container_1522920902032_0027_01_000003 on host slave2
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:125)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:523)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: java.io.EOFException; Host Details : local host is: "master/10.100.69.207"; destination host is: "slave2":57914;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy20.startContainers(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:122)
... 5 more
Caused by: java.io.IOException: java.io.EOFException
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 18 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:729)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725)
... 21 more
This app is basically a web app made using the Java HTTP Server (com.sun.net.httpserver.HttpServer), and it uses Spark to process big data. Requests are accepted by the handler class, and a new thread is created to run the Spark job in the background. The user can send another request to check whether the Spark job is finished, so that the result can be shown on the web page. The problem is that the server is "killed" every time Spark claims to have finished a job (or, in this case, failed a job).
I'm using Spark 2.2.0 built for Hadoop 2.7 and Hadoop 2.7.1. All data files are in HDFS.
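As a side note, here is a minimal sketch of the handler-plus-background-thread pattern described above (the object name and the polling flag are assumptions, not the asker's actual code): the Spark work runs on a non-daemon thread and a later request checks a flag to see whether it has finished.

import java.util.concurrent.atomic.AtomicBoolean

object BackgroundJob {
  // Flag that the "is it done yet?" request handler can read.
  val finished = new AtomicBoolean(false)

  def start(work: () => Unit): Unit = {
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        work()               // e.g. the GraphX processing done in MainProcess
        finished.set(true)
      }
    })
    worker.setDaemon(false)  // a non-daemon thread keeps the JVM alive after main() returns
    worker.start()
  }
}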

AWS EMR no host: hdfs:///var/log/spark/apps

I am trying to use AWS EMR (emr-4.3.0) with Spark 1.6.0 and Hadoop 2.7.0.
I created an EMR cluster and added a Step (in the AWS EMR web console) with my sample jar.
It's a Spring Boot application written in Java (1.8); I installed JDK 8 on the box.
It runs with the following command:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class org.springframework.boot.loader.JarLauncher s3://my-test/SparkForSpring-S1.2014.jar
I created the SparkContext with the following code:
SparkConf conf = new SparkConf().setAppName("SparkForSpring");
return new JavaSparkContext(conf);
but it fails with the following error. I feel it's something not related to my application, though I am new to Spark and YARN.
Caused by: org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public org.apache.spark.api.java.JavaSparkContext com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext()] threw exception; nested exception is java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:188)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:586)
... 49 more
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:547)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext(SparkConfig.java:35)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.CGLIB$javaSparkContext$0(<generated>)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b$$FastClassBySpringCGLIB$$10b15a77.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:312)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.javaSparkContext(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:166)
... 50 more
I read some documentation but am not quite sure what I should do to fix this error. A hint would be greatly helpful.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
I solved this problem by not using Spring Boot's executable jar; instead I used the Maven Shade plugin to package only the Spring-related jar files into one jar, using the system classloader. Here is the full pom.xml.
I got a hint from this question's answer
apache-spark 1.3.0 and yarn integration and spring-boot as a container

why the map function of sc.cassandraTable("test", "users").select("username") can not work?

Following spark-cassandra-connector's demo and Installing the Cassandra / Spark OSS Stack, under spark-shell, I tried the following snippet:
sc.stop
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "172.21.0.131")
.set("spark.cassandra.auth.username", "adminxx")
.set("spark.cassandra.auth.password", "adminxx")
val sc = new SparkContext("172.21.0.131", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "users").select("username")
Many operations on rdd work fine, such as:
rdd.first
rdd.count
But when I use map:
val result = rdd.map(x => 1) // just to keep it simple
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at map at <console>:32
Then, I run:
result.first
I got the following errors:
15/12/11 15:09:00 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 104, 124.250.36.124): java.lang.ClassNotFoundException:
$line346.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
Caused by: java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
I don't know why I get such an error. Any advice will be appreciated!
UPDATED:
According to @RussSpitzer's answer to CassandraRdd.map( row => row.getInt("id") ) does not work, java.lang.ClassNotFoundException happened!, I resolved this error as follows: instead of calling sc.stop and creating a new SparkContext, I start spark-shell with these options:
bin/spark-shell --conf spark.cassandra.connection.host=172.21.0.131 --conf spark.cassandra.auth.username=adminxx --conf spark.cassandra.auth.password=adminxx
And then all steps are the same and work fine.
Russell Spitzer's answer from the spark-connector-user list:
I'm pretty sure the main problem here is that you start a context with --jars, then kill that context, and then start another one. Try simplifying your code: instead of setting all of those Spark conf options and creating a new context, run your shell like this. Also, the jar that you want on the classpath is the connector assembly jar, not a custom build of a Scala script you want to run.
./spark-shell --conf spark.cassandra.connection.host=10.129.20.80 ...
You should not need to modify the ack.wait.timeout or the executor.extraClasspath.
Spark applications normally send their compiled code as jar files to the executors. This way the function that you map is present on the executors.
The situation is trickier in spark-shell. It has to compile and broadcast the code for every line you type, interactively. There is not even a class you're operating inside; it creates these fake $$iwC$$ classes to solve this.
Normally this works out well, but you may have hit a spark-shell bug. You can try to work around it by putting your code inside a named object in spark-shell:
object Obj { val mapper = { x: String => 1 } }
val result = rdd.map(Obj.mapper)
But it is probably safest to implement your code as an application instead of just writing it in spark-shell.
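For example, here is a minimal sketch (assuming the spark-cassandra-connector assembly and this application jar are passed to spark-submit; the object name is arbitrary) of the same logic as a standalone application, where the mapped function lives in the application jar rather than in a REPL-generated $$iwC$$ class:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds sc.cassandraTable

object CassandraMapTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("Cassandra Connector Test")
      .set("spark.cassandra.connection.host", "172.21.0.131")
      .set("spark.cassandra.auth.username", "adminxx")
      .set("spark.cassandra.auth.password", "adminxx")
    val sc = new SparkContext(conf)
    // The closure passed to map is compiled into this jar, which spark-submit
    // ships to the executors, so no ClassNotFoundException for REPL classes.
    val result = sc.cassandraTable("test", "users").select("username").map(_ => 1).first()
    println(result)
    sc.stop()
  }
}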

Unable to query a Spark RDD using the HiveThriftServer2.startWithContext functionality

I am trying to query a Spark RDD using the HiveThriftServer2.startWithContext functionality and getting the following Exception:
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy27.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:237)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:392)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1373)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1358)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1259)
at org.apache.hive.service.cli.log.LogManager.createNewOperationLog(LogManager.java:101)
at org.apache.hive.service.cli.log.LogManager.getOperationLogByOperation(LogManager.java:156)
at org.apache.hive.service.cli.log.LogManager.registerCurrentThread(LogManager.java:120)
at org.apache.hive.service.cli.session.HiveSessionImpl.runOperationWithLogCapture(HiveSessionImpl.java:714)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:370)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
... 19 more
Code:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.hive._
object Test1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Test1")
    val sc = new SparkContext(sparkConf)
    ...
    val hoursAug = sqlContext.sql("SELECT H.Col1, H.Col2, U.Col3, U.Col4 " +
      "FROM HOURS H " +
      "JOIN USERS U " +
      "ON H.User = U.USERNAME")
    hoursAug.registerTempTable("HOURS_AUGM")

    import org.apache.spark.sql.hive.thriftserver._
    HiveThriftServer2.startWithContext(sqlContext)
  }
}
Environment:
CDH 5.3
Spark 1.3.0 (upgraded from the default Spark 1.2.0 on CDH 5.3)
Hive Metastore is in MySQL
Configuration steps:
Rebuilt Spark with Hive support using the command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Replaced Spark Assembly jar with the result of the build.
Placed hive-site.xml into Spark conf directory.
Using Beeline to work with Spark Thrift Server.
The connect command passes successfully, but any select or show tables command results in the Null Pointer Exception with the stack trace as shown above. However, when starting Spark Thrift Server from command line using /usr/lib/spark/sbin/start-thriftserver.sh, I am able to see and query Hive tables.
Can you please help me to resolve this issue?
I got something similar when I tried to load a file as table content.
Please try saving the RDD/DataFrame as a permanent table, which works in my scenario: Beeline's "show tables" can display permanent tables, but not temporary ones.
I'm not sure of the cause yet...
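For illustration, a minimal sketch of that suggestion using the DataFrame from the question (the permanent table name is an arbitrary choice; saveAsTable is the Spark 1.3 DataFrame API):

// Temporary table: only visible to the SQLContext/HiveContext that registered it.
hoursAug.registerTempTable("HOURS_AUGM")
// Permanent table: written through the Hive metastore, so Beeline's "show tables" can see it.
hoursAug.saveAsTable("HOURS_AUGM_PERM")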

Hadoop HDFS test running issue - org.apache.hadoop.conf.Configuration NoClassDefFoundError

I'm working with Hadoop 0.21.0 and trying to run the hdfs_test application that comes alongside the C API library. After many problems I was able to compile hdfs_test. Now when I run it:
./hdfs_test
I'm getting the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
... 1 more
Can't construct instance of class org.apache.hadoop.conf.Configuration
Oops! Failed to connect to hdfs!
Any help is appreciated, thanks.
Like any other Java program, you need the dependencies on the classpath or inside the jar. Hadoop also has a HADOOP_CLASSPATH variable to tell the cluster where to find dependencies in map-reduce tasks. Also see How to run a Hadoop program?
