Unable to query a Spark RDD using the HiveThriftServer2.startWithContext functionality - apache-spark

I am trying to query a Spark RDD using the HiveThriftServer2.startWithContext functionality and am getting the following exception:
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy27.executeStatementAsync(Unknown Source)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:237)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:392)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1373)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1358)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1259)
at org.apache.hive.service.cli.log.LogManager.createNewOperationLog(LogManager.java:101)
at org.apache.hive.service.cli.log.LogManager.getOperationLogByOperation(LogManager.java:156)
at org.apache.hive.service.cli.log.LogManager.registerCurrentThread(LogManager.java:120)
at org.apache.hive.service.cli.session.HiveSessionImpl.runOperationWithLogCapture(HiveSessionImpl.java:714)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:370)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
... 19 more
Code:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.hive._
object Test1 {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Test1")
    val sc = new SparkContext(sparkConf)
    ...
    val hoursAug = sqlContext.sql("SELECT H.Col1, H.Col2, U.Col3, U.Col4 " +
      "FROM HOURS H " +
      "JOIN USERS U " +
      "ON H.User = U.USERNAME")
    hoursAug.registerTempTable("HOURS_AUGM")

    import org.apache.spark.sql.hive.thriftserver._
    HiveThriftServer2.startWithContext(sqlContext)
  }
}
Environment:
CDH 5.3
Spark 1.3.0 (upgraded from the default Spark 1.2.0 on CDH 5.3)
Hive Metastore is in MySQL
Configuration steps:
Rebuilt Spark with Hive support using the command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Replaced Spark Assembly jar with the result of the build.
Placed hive-site.xml into Spark conf directory.
Using Beeline to work with the Spark Thrift Server (an example connect command is shown below).
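For reference, a minimal Beeline connect command for a Thrift server started this way might look like the following; the host and the default port 10000 are assumptions about this setup, not values from the question:
beeline -u jdbc:hive2://localhost:10000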
The connect command succeeds, but any select or show tables command results in the NullPointerException with the stack trace shown above. However, when starting the Spark Thrift Server from the command line using /usr/lib/spark/sbin/start-thriftserver.sh, I am able to see and query Hive tables.
Can you please help me to resolve this issue?

I got a similar error when trying to load a file as table content.
Please try to save the RDD/DataFrame as a permanent table, which works in my scenario -- Beeline's "show tables" displays permanent tables, but not temporary ones.
I'm not sure of the cause yet...
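A minimal sketch of that workaround applied to the question's code, assuming sqlContext is a HiveContext backed by the Hive metastore (Spark 1.3 exposes saveAsTable directly on a DataFrame; Spark 1.4+ uses df.write.saveAsTable):
// Persist the joined DataFrame as a permanent Hive table instead of a temp table.
// "HOURS_AUGM_PERM" is a hypothetical table name chosen for this example.
hoursAug.saveAsTable("HOURS_AUGM_PERM")          // Spark 1.3.x API
// hoursAug.write.saveAsTable("HOURS_AUGM_PERM") // Spark 1.4+ API

import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)   // the permanent table should then show up in Beeline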

Related

spark with hive : exception while connecting to local metaStore

I have a problem using Spark (version 3.2.1) to connect to a local Hive (version 3.1.2) metastore on my Mac (Catalina 10.15.7). My Hadoop and Hive run on my Mac in local mode and both work well (I can insert records into a Hive table and select them back out).
Here is my problem description.
First, I start the metastore service with
hive --service metastore
Second, I run this code in spark-shell:
val spark = SparkSession.builder()
  .config("hive.metastore.uris", s"thrift://hiveHost:9083")
  .enableHiveSupport()
  .getOrCreate()
Finally, I get the exceptions below (there are more in my terminal; I copied only the last several 'Caused by' sections here). By the way, my MySQL is version 8.0.27; I don't know whether that is related to the exception. Thanks for your help!
Caused by: com.zaxxer.hikari.pool.HikariPool$PoolInitializationException: Failed to initialize pool: Could not create connection to database server.
at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:512)
at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:105)
at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:71)
at org.datanucleus.store.rdbms.connectionpool.HikariCPConnectionPoolFactory.createConnectionPool(HikariCPConnectionPoolFactory.java:176)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:213)
... 198 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Could not create connection to database server.
at sun.reflect.GeneratedConstructorAccessor150.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.Util.getInstance(Util.java:386)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1015)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:989)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:975)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:920)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2575)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2311)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:834)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.GeneratedConstructorAccessor146.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:416)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:347)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:95)
at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:101)
at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:341)
at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:506)
... 202 more
Caused by: java.lang.NullPointerException
at com.mysql.jdbc.ConnectionImpl.getServerCharacterEncoding(ConnectionImpl.java:3286)
at com.mysql.jdbc.MysqlIO.sendConnectionAttributes(MysqlIO.java:1987)
at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1913)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1290)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2493)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2526)
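No answer accompanies this question, but two hedged observations may help. In spark-shell a SparkSession already exists, so SparkSession.builder().config(...).getOrCreate() may return the existing session without applying hive.metastore.uris; passing the setting on the command line avoids that. Separately, the trace shows the Spark process opening its own JDBC connection to MySQL (DataNucleus/HikariCP), and the NullPointerException in ConnectionImpl.getServerCharacterEncoding is commonly reported when an old MySQL Connector/J 5.1.x driver talks to a MySQL 8 server; a newer driver jar can be supplied with --jars. A sketch (the jar path and version are placeholders, not values from the question):
# Hedged sketch: set the metastore URI at launch and supply a newer JDBC driver.
spark-shell \
  --conf spark.hadoop.hive.metastore.uris=thrift://hiveHost:9083 \
  --jars /path/to/mysql-connector-java-8.0.27.jar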

Spark ClassNotFoundException with the simplest rdd.map

Edit:
This question is not a duplicate of Resolving dependency problems in Apache Spark:
the answers there are generalities about sharing code; I don't see how they apply to my issue
here I do not use spark-submit, but simply call Spark from an app running on my computer
there is no dependency whatsoever; the only code to share is the trivial function i => i
if it were a problem of Spark/Java/Scala versions, I don't understand why it would produce a ClassNotFoundException for the function in question
I run this code
import org.apache.spark.{SparkConf, SparkContext}

object Tmp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("Spark Test")
      .setMaster("spark://[ip-of-spark-master]:7077")
    val sc = new SparkContext(sparkConf)
    sc.parallelize(0 until 10000).map(i => i).count()
  }
}
This fails with java.lang.ClassNotFoundException: Tmp$$anonfun$main$1 when I simply run it on my computer, without sending any jar. It works if I don't do the map(i => i) (i.e. it is the function i => i that is not found).
To be clear:
I do not use spark-submit
my computer is the driver and is not on the spark cluster
I use the exact same spark version on my computer that is used in the cluster (2.0.2)
Now I thought that for a simple function Spark would serialize it and there would be no need to send a jar. I am pretty sure I have done this before.
Did I miss something ?
Looking at the stack trace, it fails on the driver, so on my side:
Driver stacktrace:
[error] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
[error] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
[error] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
[error] Caused by: java.lang.ClassNotFoundException: Tmp$$anonfun$1
[error] at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[error] at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[error] at java.lang.Class.forName0(Native Method)
[error] at java.lang.Class.forName(Class.java:348)
[error] at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
...
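No accepted answer is included here, but a common explanation is that Spark's JavaSerializer ships only the serialized closure instance, not the class bytes, so the executors still need the compiled class Tmp$$anonfun$main$1 on their classpath even when the driver runs locally. A hedged sketch of one way to ship it without spark-submit, using SparkConf.setJars (the jar path is a placeholder for your build output, e.g. from sbt package):
import org.apache.spark.{SparkConf, SparkContext}

object Tmp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("Spark Test")
      .setMaster("spark://[ip-of-spark-master]:7077")
      // Ship the application's compiled classes to the executors so that
      // anonymous-function classes like Tmp$$anonfun$main$1 can be loaded there.
      // The path below is a placeholder for the jar produced by your build.
      .setJars(Seq("target/scala-2.11/spark-test_2.11-0.1.jar"))
    val sc = new SparkContext(sparkConf)
    sc.parallelize(0 until 10000).map(i => i).count()
  }
}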

Spark submit throws error while using Hive tables

I have a strange error: I am trying to write data to Hive, and it works well in spark-shell, but when I use spark-submit it throws a "database/table not found in default" error.
Following is the code I am trying to run via spark-submit; I am using a custom build of Spark 2.0.0.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.table("spark_schema.iris_ori")
Following is the command I am using:
/home/ec2-user/Spark_Source_Code/spark/bin/spark-submit --class TreeClassifiersModels --master local[*] /home/ec2-user/Spark_Snapshots/Spark_2.6/TreeClassifiersModels/target/scala-2.11/treeclassifiersmodels_2.11-1.0.3.jar /user/ec2-user/Input_Files/defPath/iris_spark SPECIES~LBL+PETAL_LENGTH+PETAL_WIDTH RAN_FOREST 0.7 123 12
Following is the error:
16/05/20 09:05:18 INFO SparkSqlParser: Parsing command: spark_schema.measures_20160520090502
Exception in thread "main" org.apache.spark.sql.AnalysisException: Database 'spark_schema' does not exist;
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:37)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:195)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:360)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:464)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:458)
at TreeClassifiersModels$.main(TreeClassifiersModels.scala:71)
at TreeClassifiersModels.main(TreeClassifiersModels.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The issue was caused by a deprecation in Spark 2.0.0: HiveContext was deprecated there. To read/write Hive tables on Spark 2.0.0 we need to use a SparkSession, as follows.
val sparkSession = SparkSession.withHiveSupport(sc)
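For reference, the documented way to obtain a Hive-enabled session in released Spark 2.x versions is the builder API; a minimal sketch using the table from the question (the app name is arbitrary):
import org.apache.spark.sql.SparkSession

// Hive-enabled session; picks up hive-site.xml / the configured metastore.
val spark = SparkSession.builder()
  .appName("TreeClassifiersModels")
  .enableHiveSupport()
  .getOrCreate()

// Hive tables are then resolved through the session's catalog.
val df = spark.table("spark_schema.iris_ori")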

Spark Streaming 1.6.0 EMR using Python : ClassNotFoundException: org.apache.spark.streaming.kinesis.KinesisUtilsPythonHelper

I'm running an out-of-the-box EMR cluster using Spark 1.6.0 and Zeppelin 0.5.6 on AWS. My goal is to get a simple Spark Streaming context initialized and pointed to an internal Kinesis stream just as a proof-of-concept. However, when I run it I get :
Py4JJavaError: An error occurred while calling o89.loadClass. :
java.lang.ClassNotFoundException: org.apache.spark.streaming.kinesis.KinesisUtilsPythonHelper
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
The code I'm running (via Zeppelin) is simply:
%pyspark
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
ssc = StreamingContext(sc, 1)
appName = '{my-app-name}'
streamName = '{my-stream-name}'
endpointUrl = '{my-endpoint}'
regionName = '{my-region}'
lines = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)
When I ran into this locally, I made sure to build spark-streaming-kinesis-asl from source and include those jars in my Spark config:
spark.driver.extraClassPath /path/to/kinesis/asl/assembly/jars/*
However, I can't seem to get this to work on EMR. To be safe, I included it in all of the following, to no avail:
spark.driver.extraClassPath
spark.driver.extraLibraryPath
spark.executor.extraClassPath
spark.executor.extraLibraryPath
Has anyone run into this before? I'm printing out the spark config when I restart the context to confirm that these changes are being picked up. Maybe this needs to be done on the slave nodes as well? Or perhaps another config option/key altogether?
Add the dependency to the Zeppelin context "z". Here's an example of adding the spark-csv package:
%dep
z.load("com.databricks:spark-csv_2.11:1.3.0")

why the map function of sc.cassandraTable("test", "users").select("username") can not work?

Following spark-cassandra-connector's demo and Installing the Cassandra / Spark OSS Stack, under spark-shell, I tried the following snippet:
sc.stop
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "172.21.0.131")
.set("spark.cassandra.auth.username", "adminxx")
.set("spark.cassandra.auth.password", "adminxx")
val sc = new SparkContext("172.21.0.131", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "users").select("username")
Many operations on rdd work fine, such as:
rdd.first
rdd.count
But when I use map:
val result = rdd.map(x => 1) // just a simple example
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at map at <console>:32
Then, I run:
result.first
I got the following errors:
15/12/11 15:09:00 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 104, 124.250.36.124): java.lang.ClassNotFoundException:
$line346.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
Caused by: java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
I don't know why I get such an error. Any advice would be appreciated!
UPDATED:
Following @RussSpitzer's answer to CassandraRdd.map( row => row.getInt("id")) does not work, java.lang.ClassNotFoundException happened!, I resolved this error as follows: instead of using sc.stop and creating a new SparkContext, I start spark-shell with these options:
bin/spark-shell --conf spark.cassandra.connection.host=172.21.0.131 --conf spark.cassandra.auth.username=adminxx --conf spark.cassandra.auth.password=adminxx
And then all steps are the same and work fine.
Russell Spitzer's answer from the spark-connector-user list:
I'm pretty sure the main problem here is that you start a context with --jars, kill that context, and then start another one. Try simplifying your code: instead of setting all of those Spark conf options and creating a new context, run your shell like this. Also, the jar that you want on the classpath is the connector assembly jar, not a custom build of a Scala script you want to run.
./spark-shell --conf spark.cassandra.connection.host=10.129.20.80 ...
You should not need to modify the ack.wait.timeout or the executor.extraClasspath.
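A fuller sketch of that invocation (the assembly jar path and version are placeholders, and the auth options are taken from the question rather than from the quoted answer):
./spark-shell \
  --jars /path/to/spark-cassandra-connector-assembly.jar \
  --conf spark.cassandra.connection.host=172.21.0.131 \
  --conf spark.cassandra.auth.username=adminxx \
  --conf spark.cassandra.auth.password=adminxx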
Spark applications normally send their compiled code as jar files to the executors. This way the function that you map is present on the executors.
The situation is trickier in spark-shell: it has to compile and broadcast the code for every line you type interactively, and there is not even a class you are operating inside. It creates these fake $$iwC$$ classes to solve this.
Normally this works out well, but you may have hit a spark-shell bug. You can try to work around it by putting your code inside an object in spark-shell:
object Obj { val mapper = { x: Any => 1 } } // the rdd elements are CassandraRow, so the mapper must accept them
val result = rdd.map(Obj.mapper)
But it is probably safest to implement your code as an application instead of just writing it in spark-shell.
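A minimal sketch of such a standalone application; the connection settings come from the question, while the object name is arbitrary, and the app would be packaged into a jar and launched with spark-submit together with the connector dependency:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

object CassandraCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setAppName("Cassandra Connector Test")
      .set("spark.cassandra.connection.host", "172.21.0.131")
      .set("spark.cassandra.auth.username", "adminxx")
      .set("spark.cassandra.auth.password", "adminxx")
    val sc = new SparkContext(conf)

    // The mapped function lives in a compiled class inside the application jar,
    // so the executors can load it without spark-shell's generated $$iwC$$ classes.
    val count = sc.cassandraTable("test", "users").select("username").map(_ => 1).count()
    println(count)

    sc.stop()
  }
}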
