Not able to run Hive SQL via Spark - apache-spark

I am trying to execute Hive SQL via Spark code, but it throws the error mentioned below. I can only select data from the Hive table.
My Spark version is 1.6.1.
My Hive version is 1.2.1.
Command used to run spark-submit:
spark-submit --master local[8] --files /srv/data/app/spark/conf/hive-site.xml test_hive.py
Code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
sc=SparkContext()
sqlContext = SQLContext(sc)
HiveContext = HiveContext(sc)
#HiveContext.setConf("yarn.timeline-service.enabled","false")
#HiveContext.sql("SET spark.sql.crossJoin.enabled=false")
HiveContext.sql("use default")
HiveContext.sql("TRUNCATE TABLE default.test_table")
HiveContext.sql("LOAD DATA LOCAL INPATH '/srv/data/data_files/*' OVERWRITE INTO TABLE default.test_table")
df = HiveContext.sql("select * from version")
for x in df.collect():
    print x
Error:
17386 [Thread-3] ERROR org.apache.spark.sql.hive.client.ClientWrapper -
======================
HIVE FAILURE OUTPUT
======================
SET spark.sql.inMemoryColumnarStorage.compressed=true
SET spark.sql.thriftServer.incrementalCollect=true
SET spark.sql.hive.convertMetastoreParquet=false
SET spark.sql.broadcastTimeout=800
SET spark.sql.hive.thriftServer.singleSession=true
SET spark.sql.inMemoryColumnarStorage.partitionPruning=true
SET spark.sql.crossJoin.enabled=true
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.crossJoin.enabled=false
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ClassCastException: attempting to castjar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.classtojar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class
======================
END HIVE FAILURE OUTPUT
======================
Traceback (most recent call last):
File "/home/iip/hist_load.py", line 10, in <module>
HiveContext.sql("TRUNCATE TABLE default.tbl_wmt_pos_file_test")
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ClassCastException: attempting to castjar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.classtojar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class

I can only select data from hive table.

This is perfectly normal and expected behavior. Spark SQL is not intended to be fully compatible with HiveQL or to implement the full set of features provided by Hive.
Overall, some compatibility is preserved, but it is not guaranteed to be kept in the future as Spark SQL converges towards the SQL:2003 standard.

From the post here:

The Spark job fails with a ClassCastException because of a conflict between different versions of the same class in the YARN and Spark jars.

Set the property below on the HiveContext:
hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.setConf("yarn.timeline-service.enabled", "false")

Related

Unable to fetch data from external Hive cluster in a Spark job running on an EMR cluster

I am executing a Spark job on an EMR cluster to connect to and fetch data from tables residing on an external Hive server which is protected with Kerberos authentication. As a pre-requisite, I am downloading the hive-site.xml of the remote Hive server into the $SPARK_HOME/conf/ folder on the cluster. I am also copying a keytab and authenticating using kinit.
Authentication and connection to the Hive metastore are established, and I'm able to get a list of databases with the following code snippet:
from pyspark import SparkContext, HiveContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Hive_check")\
.config("spark.hadoop.hive.metastore.uris", "thrift://ip-xx-xx-xx-xx.ec2.internal:9083")\
.config("spark.security.credentials.hadoopfs.enabled", "true")\
.enableHiveSupport().getOrCreate()
sc = spark_session.sparkContext
sc.setSystemProperty("hive.metastore.kerberos.principal", "hive/ip-xx-xx-xx-xx.ec2.internal#MY.REALM.COM")
sc.setSystemProperty("hive.metastore.kerberos.keytab.file", "/path/to/hive.service.keytab")
sc.setSystemProperty("javax.security.auth.useSubjectCredsOnly","false")
sc.setSystemProperty("hive.metastore.execute.setugi", "true")
hc = HiveContext(sc)
db_list = hc.sql("Show databases")
db_list.show()
However, when I try to run select queries on some existing tables (I can query those tables from HUE), I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 433, in sql
return self.sparkSession.sql(sqlQuery)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/home/hadoop/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/hadoop/anaconda3/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.sql.
: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I tried copying core-site.xml and yarn-site.xml into $SPARK_HOME/conf/ to resolve this issue, but then I'm unable to create the Spark session at all: the program gets stuck for over 15 minutes with no logs at the SparkSession....getOrCreate() step.
Is there any other configuration I'm missing? What could be the reason that I can connect to the Hive metastore but cannot fetch data from the tables?
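One hedged idea rather than a confirmed fix: instead of relying only on kinit, the principal and keytab can be handed to Spark on YARN directly, together with the remote filesystem it needs delegation tokens for. The property names below are the Spark 2.x ones, and the principal, keytab path, and namenode URI are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Hive_check") \
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://ip-xx-xx-xx-xx.ec2.internal:9083") \
    .config("spark.yarn.principal", "hive/ip-xx-xx-xx-xx.ec2.internal@MY.REALM.COM") \
    .config("spark.yarn.keytab", "/path/to/hive.service.keytab") \
    .config("spark.yarn.access.hadoopFileSystems", "hdfs://remote-namenode:8020") \
    .enableHiveSupport() \
    .getOrCreate()
# With these set, Spark should request delegation tokens for the remote,
# Kerberized filesystem instead of falling back to SIMPLE authentication.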

How to do streaming Kafka->Zeppelin->Spark with current versions

I have a Kafka 2.3 message broker and want to do some processing on the message data within Spark. To begin with, I want to use the Spark 2.4.0 that is integrated in Zeppelin 0.8.1 and use the Zeppelin notebooks for rapid prototyping.
For this streaming task I need "spark-streaming-kafka-0-10" for Spark > 2.3 according to https://spark.apache.org/docs/latest/streaming-kafka-integration.html, which only supports Java and Scala (and not Python). But there are no default Java or Scala interpreters in Zeppelin.
If I try this code (taken from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/):
%spark.pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
I get the following error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the spark-submit command as
   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
   Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0.
   Then, include the jar in the spark-submit command as
   $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Fail to execute line 1: kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8982542851842620568.py", line 380, in
    exec(code, _zcUserQueryNameSpace)
  File "", line 1, in
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
    helper = KafkaUtils._get_helper(ssc._sc)
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
    return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
So I wonder how to tackle the task:
Should I really use spark-streaming-kafka-0-8, despite it having been deprecated for some months (a hedged sketch of this option follows after this list)? But spark-streaming-kafka-0-10 seems to be in the default Zeppelin jar directory.
Configure/create an interpreter in Zeppelin for Java/Scala, since spark-streaming-kafka-0-10 only supports these languages?
Ignore Zeppelin and do it on the console using "spark-submit"?
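For what it's worth, the error message above already points at the first option: put the 0-8 connector on the Spark interpreter's classpath. Below is a minimal sketch, assuming the deprecated spark-streaming-kafka-0-8 connector and a Scala 2.11 build of Spark 2.4.0; the package coordinates come from the error text, not from a verified Zeppelin setup:
# In Zeppelin's Spark interpreter settings, add the property
#   spark.jars.packages = org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0
# (the same coordinates the error suggests for spark-submit --packages),
# restart the interpreter, then run this paragraph.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 60)  # sc is provided by the %spark.pyspark interpreter
# Note: createStream expects the ZooKeeper quorum (usually port 2181),
# not the Kafka broker address.
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'test': 1})
kafkaStream.pprint()
ssc.start()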

Spark 2.1 - Error While instantiating HiveSessionState

With a fresh install of Spark 2.1, I am getting an error when executing the pyspark command.
Traceback (most recent call last):
File "/usr/local/spark/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
I have Hadoop and Hive on the same machine. Hive is configured to use MySQL for the metastore. I did not get this error with Spark 2.0.2.
Can someone please point me in the right direction?
I was getting the same error in a Windows environment, and the trick below worked for me.
In shell.py the Spark session is defined with .enableHiveSupport():
spark = SparkSession.builder\
    .enableHiveSupport()\
    .getOrCreate()
Remove the Hive support and redefine the Spark session as below:
spark = SparkSession.builder\
    .getOrCreate()
You can find shell.py in your Spark installation folder.
For me it's in "C:\spark-2.1.1-bin-hadoop2.7\python\pyspark".
Hope this helps.
I had the same problem. Some of the suggested answers, such as sudo chmod -R 777 /tmp/hive/ or downgrading Spark with Hadoop to 2.6, didn't work for me.
I realized that what caused this problem for me is that I was doing SQL queries using the sqlContext instead of using the sparkSession.
sparkSession = SparkSession.builder.master("local[*]") \
    .appName("appName") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse") \
    .getOrCreate()
sqlCtx.registerDataFrameAsTable(..)
df = sparkSession.sql("SELECT ...")
This works perfectly for me now.
Spark 2.1.0: when I run it in yarn-client mode I don't see this issue, but yarn-cluster mode gives "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':".
Still looking for an answer.
The issue for me was solved by unsetting the HADOOP_CONF_DIR environment variable. It was pointing to the Hadoop configuration directory, and while starting the pyspark shell the variable caused Spark to look for a Hadoop cluster that hadn't been started.
So if you have the HADOOP_CONF_DIR variable set, you either have to start the Hadoop cluster before using the Spark shells, or unset the variable.
You are missing the spark-hive jar.
For example, if you are running Spark 2.1 on Scala 2.11, you can use this jar:
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.11/2.1.0
I saw this error on a new (2018) Mac, which came with Java 10. The fix was to set JAVA_HOME to Java 8:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
I too was struggling in cluster mode. I added the hive-site.xml from the Spark conf directory; if you have an HDP cluster, it should be at /usr/hdp/current/spark2-client/conf. It's working for me.
I was getting this error trying to run pyspark and spark-shell when my HDFS wasn't started.
I removed ".enableHiveSupport()\" from the shell.py file and it's working perfectly.
# Before
spark = SparkSession.builder\
    .enableHiveSupport()\
    .getOrCreate()
# After
spark = SparkSession.builder\
    .getOrCreate()
Project location and file permissions could be the issue. I observed this error happening in spite of changes to my pom file. Then I moved my project to a user directory where I have full permissions, and this solved my issue.

hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
spark-submit --master yarn-cluster test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
I have seen a number of similar posts for other Hadoop distributions, but not for BigInsights on Cloud.
The solution to this error was to add the jars:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
  --master yarn-cluster \
  --deploy-mode cluster \
  --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
  test_pokes.py
However, I then get a different error:
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
I've added the other question here: Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The final solution is captured here: https://stackoverflow.com/a/41272260/1033422

PySpark HBase/Phoenix integration

I'm supposed to read Phoenix data into pyspark.
edit:
I'm using Spark HBase converters:
Here is a code snippet:
port="2181"
host="zookeperserver"
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "camel", "hbase.mapreduce.scan.columns": "data:a"}
sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat","org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result",keyConverter=keyConv,valueConverter=valueConv,conf=cmdata_conf)
Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
jconf, batchSize)
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)
Any help would be much appreciated.
Thank you!
/Tina
Using the Spark Phoenix plugin is the recommended approach.
Please find details about the Phoenix Spark plugin here.
Environment: tested with AWS EMR 5.10, PySpark.
Following are the steps:
Create a table in Phoenix (https://phoenix.apache.org/language/):
Open the Phoenix shell:
/usr/lib/phoenix/bin/sqlline.py
DROP TABLE IF EXISTS TableName;
CREATE TABLE TableName (DOMAIN VARCHAR primary key);
UPSERT INTO TableName (DOMAIN) VALUES('foo');
Download the Spark Phoenix plugin jar:
Download the Spark Phoenix plugin jar from https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core/4.11.0-HBase-1.3
You need the phoenix-<version>-HBase-<version>-client.jar; I used phoenix-4.11.0-HBase-1.3-client.jar as per my Phoenix and HBase versions.
From your Hadoop home directory, set up the following variable:
phoenix_jars=/home/user/apache-phoenix-4.11.0-HBase-1.3-bin/phoenix-4.11.0-HBase-1.3-client.jar
Start the PySpark shell and add the dependency to the driver and executor classpaths:
pyspark --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars}
Create the ZooKeeper URL; replace it with your cluster's ZooKeeper quorum (you can check it in hbase-site.xml):
emrMaster = "ZooKeeper URL"
df = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.load()
df.show()
df.columns
df.printSchema()
df1=df.replace(['foo'], ['foo1'], 'DOMAIN')
df1.show()
df1.write \
.format("org.apache.phoenix.spark") \
.mode("overwrite") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.save()
There are two ways to do this:
1) As Phoenix has a JDBC layer, you can use Spark's JdbcRDD to read data from Phoenix in Spark: https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/rdd/JdbcRDD.html (a hedged PySpark sketch of this route follows below)
2) Using Spark HBase converters:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
https://github.com/apache/spark/tree/master/examples/src/main/python
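Since JdbcRDD itself is a Scala/Java API, a rough PySpark equivalent of option 1 is to go through the DataFrame JDBC reader with the Phoenix client jar on the classpath. This is only a sketch under those assumptions; the table name, ZooKeeper address, and jar path are placeholders, not values confirmed by the question:
# Start pyspark with the Phoenix client jar, e.g.
#   pyspark --jars /path/to/phoenix-4.11.0-HBase-1.3-client.jar
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is provided by the pyspark shell

df = sqlContext.read \
    .format("jdbc") \
    .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
    .option("url", "jdbc:phoenix:zookeperserver:2181") \
    .option("dbtable", "TABLENAME") \
    .load()
df.show()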
