PySpark: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I am trying to configure Spark with the Hive metastore so that I can access Hive tables from Spark.
After I moved hive-site.xml to spark_home/conf, I started getting errors: Spark fails when I try to read data from Hive tables, and I get the same error when I run a simple SELECT query in the Hive shell.
The errors are below.
For Hive:
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
For Spark:
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

Related

Spark application syncing with Hive metastore - "There is no primary group for UGI spark" error

I'm running a simple Spark job on a Kubernetes cluster that writes data to HDFS and catalogs it in Hive. For some reason my app fails to run Spark SQL commands with the following exception:
21/09/22 09:23:54 ERROR SplunkStreamListener: |exception=org.apache.spark.sql.AnalysisException
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException There is no primary group for UGI spark (auth:SIMPLE));
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:183)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createDatabase(ExternalCatalogWithListener.scala:47)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:211)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
I'm connecting to the Hive metastore via a Thrift URL. The Docker container runs the application as a non-root user. Are there groups the user needs to be added to in order to sync with the metastore?
Try adding this before setting up the Spark context:
System.setProperty("HADOOP_USER_NAME", "root")
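For PySpark, the same idea can be sketched in Python. This is a minimal sketch, not a confirmed fix for the Kubernetes setup above: it assumes simple (non-Kerberos) authentication, where Hadoop's UserGroupInformation falls back to HADOOP_USER_NAME (environment variable first, then Java system property), and it assumes the variable is set before the first SparkSession starts the driver JVM. The helper names set_hadoop_user and build_session and the app name "hive-sync" are hypothetical; "root" is simply the value from the answer above.

```python
import os


def set_hadoop_user(name: str = "root") -> str:
    """Export HADOOP_USER_NAME so a JVM launched later by this process inherits it."""
    os.environ["HADOOP_USER_NAME"] = name
    return os.environ["HADOOP_USER_NAME"]


def build_session(app_name: str = "hive-sync"):
    """Create a Hive-enabled SparkSession; call set_hadoop_user() beforehand."""
    # Imported lazily so set_hadoop_user() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )
```

Typical use would be to call set_hadoop_user("root") and then build_session() at the very start of the driver script. Setting the variable after a session already exists has no effect on the running JVM.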

Remote Database not found while Connecting to remote Hive from Spark using JDBC in Python?

I am using a PySpark script to read data from a remote Hive instance through the JDBC driver. I have tried the other approach, using enableHiveSupport and hive-site.xml, but it is not possible for me because of some limitations (access to launch YARN jobs from outside the cluster is blocked). The code below is the only way I can connect to Hive.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive") \
    .config("spark.sql.hive.metastorePartitionPruning", "true") \
    .config("hadoop.security.authentication", "kerberos") \
    .getOrCreate()

jdbcdf = spark.read.format("jdbc") \
    .option("url", "urlname") \
    .option("driver", "com.cloudera.hive.jdbc41.HS2Driver") \
    .option("user", "username") \
    .option("dbtable", "dbname.tablename") \
    .load()

spark.sql("show tables from dbname").show()
This gives me the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.sql.
: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'vqaa' not found;
Could someone please explain how I can access the remote databases and tables using this method? Thanks.
Add .enableHiveSupport() to your SparkSession in order to access the Hive catalog.
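If enableHiveSupport() is not an option, as the question states, note that spark.sql resolves names against Spark's own catalog, and a DataFrame loaded over JDBC is not in that catalog until it is registered. The following is a minimal sketch of that workaround, assuming the JDBC read itself succeeds. The function name query_jdbc_table and the view name "remote_table" are hypothetical; the driver class and option names are the ones from the question.

```python
def query_jdbc_table(spark, url, table, user, view_name="remote_table",
                     driver="com.cloudera.hive.jdbc41.HS2Driver"):
    """Load a remote table over JDBC and expose it to spark.sql as a temp view."""
    df = (
        spark.read.format("jdbc")
        .option("url", url)
        .option("driver", driver)
        .option("user", user)
        .option("dbtable", table)
        .load()
    )
    # Register the DataFrame under a local name. The remote database itself is
    # never added to Spark's catalog, so "show tables from dbname" still fails.
    df.createOrReplaceTempView(view_name)
    return spark.sql(f"SELECT * FROM {view_name} LIMIT 10")
```

With the session from the question, this would be called as query_jdbc_table(spark, "urlname", "dbname.tablename", "username").show(). Each remote table has to be loaded and registered separately, because only the view name is visible to Spark SQL.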

SparkSQL Hive Error : "HikariCP" not found in the CLASSPATH

I configured Hive with MySQL as my metastore. I can enter the Hive shell and create tables successfully.
Spark version: 2.4.0
Hive version: 3.1.1
When I try to run a Spark SQL program using spark-submit, I get the error below.
2019-03-02 15:43:41 WARN HiveMetaStore:622 - Retrying creating default database after error: Error creating transactional connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
......
......
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
......
......
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "HikariCP" plugin to create a ConnectionPool gave an error : The connection pool plugin of type "HikariCP" was not found in the CLASSPATH!
Please let me know if anyone can help me in this regard.
I don't know whether you have already solved this problem, but here is my advice.
The default connection pool in hive-site.xml is HikariCP. Search hive-site.xml for datanucleus.connectionPoolingType; its value is HikariCP. Change it to dbcp, since you use MySQL as your metastore.
Finally, don't forget to add mysql-connector-java-5.x.x.jar to Spark's jars directory, for example:
/home/hadoop/spark-2.3.0-bin-hadoop2.7/jars
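The property change described above can also be scripted. The following is a minimal sketch using only the Python standard library, assuming hive-site.xml has the usual configuration/property/name/value layout. The function name set_hive_property is hypothetical, and the value "dbcp" is taken from the answer above rather than verified against a particular DataNucleus version.

```python
import xml.etree.ElementTree as ET


def set_hive_property(xml_text: str, name: str, value: str) -> str:
    """Return hive-site.xml content with property `name` set to `value`."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not present yet: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

To use it, read the file into a string, call set_hive_property(text, "datanucleus.connectionPoolingType", "dbcp"), and write the result back. The default ElementTree parser drops XML comments and the leading stylesheet instruction, so keep a backup of the original file.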

Cassandra and Spark Thrift Server Integration

I'm trying to integrate Cassandra and Spark Thrift server. I followed the steps from here
I get the following error while registering the Cassandra tables in the Beeline console.
Error: Error while compiling statement: FAILED: ParseException line 1:23 cannot recognize input near 'USING' 'org' '.' in create table statement (state=42000,code=40000)
Below is the query I ran:
CREATE TABLE test_data USING org.apache.spark.sql.cassandra OPTIONS (keyspace 'abc', table 'def');
Am I missing something?
Add the Cassandra connector as an aux jar in hive-site.xml.

Spark Cluster mode issue to read Hive-Hbase table on Kerberized Environment

Error description
We are not able to execute our Spark job in yarn-cluster or yarn-client mode, though it works fine in local mode.
This issue occurs when we try to read the Hive-HBase tables in a Kerberized cluster.
What we have tried so far
Passing all the HBase jars in the --jars parameter of spark-submit:
--jars /usr/hdp/current/hive-client/lib/hive-hbase-handler-1.2.1000.2.5.3.16-1.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/current/hbase-client/lib/protobuf-java-2.5.0.jar,/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/hbase-server.jar
Passing hbase-site.xml and hive-site.xml in the --files parameter of spark-submit:
--files /usr/hdp/2.5.3.16-1/hbase/conf/hbase-site.xml,/usr/hdp/current/spark-client/conf/hive-site.xml,/home/pasusr/pasusr.keytab
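Doing Kerberos authentication inside the application. In the code we explicitly pass the keytab:
import java.io.IOException
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
import org.apache.hadoop.security.UserGroupInformation

UserGroupInformation.setConfiguration(configuration)
val ugi: UserGroupInformation =
  UserGroupInformation.loginUserFromKeytabAndReturnUGI(principle, keyTab)
UserGroupInformation.setLoginUser(ugi)
ConnectionFactory.createConnection(configuration)
// Create the HBase connection as the keytab-authenticated user.
return ugi.doAs(new PrivilegedExceptionAction[Connection] {
  @throws[IOException]
  def run(): Connection = {
    ConnectionFactory.createConnection(configuration)
  }
})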
Passing the keytab information in the spark-submit command
Passing the HBase jars in spark.driver.extraClassPath and spark.executor.extraClassPath
Error Log
18/03/20 15:33:24 WARN TableInputFormatBase: You are using an HTable instance that relies on an HBase-managed Connection. This is usually due to directly creating an HTable, which is deprecated. Instead, you should create a Connection object and then request a Table instance from it. If you don't need the Table instance for your own use, you should instead use the TableInputFormatBase.initalizeTable method directly.
18/03/20 15:47:38 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 406, hadoopnode.server.name): java.lang.IllegalStateException: Error while configuring input job properties
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:444)
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:342)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=50, exceptions:
Caused by: java.lang.RuntimeException: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:679)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
I was able to resolve this by adding the following configuration to spark-env.sh:
export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar
I also removed spark.driver.extraClassPath and spark.executor.extraClassPath, in which I had been passing the above jars, from the spark-submit command.

Resources