How should Spark SQL be configured to access the Hive metastore? [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I'm trying to use Spark SQL to read a table from the Hive metastore, but Spark reports that the table is not found. I suspect that Spark SQL is creating a whole new, empty metastore instead of using the existing one.
I submit the Spark job with this command:
spark-submit --class etl.EIServerSpark --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' --jars $HIVE_CLASSPATH --files /etc/hive/conf/hive-site.xml,/etc/hadoop/conf/yarn-site.xml --master yarn-client /root/etl.jar
This is the error:
2015-06-30 17:50:51,563 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Copying /etc/hive/conf/hive-site.xml to /tmp/spark-568de027-8b66-40fa-97a4-2ec50614f486/hive-site.xml
2015-06-30 17:50:51,568 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Added file file:/etc/hive/conf/hive-site.xml at http://10.136.149.126:43349/files/hive-site.xml with timestamp 1435683051561
2015-06-30 17:50:51,568 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Copying /etc/hadoop/conf/yarn-site.xml to /tmp/spark-568de027-8b66-40fa-97a4-2ec50614f486/yarn-site.xml
2015-06-30 17:50:51,570 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Added file file:/etc/hadoop/conf/yarn-site.xml at http://10.136.149.126:43349/files/yarn-site.xml with timestamp 1435683051568
2015-06-30 17:50:51,637 INFO [sparkDriver-akka.actor.default-dispatcher-5] util.AkkaUtils (Logging.scala:logInfo(59)) - Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@gateway.edp.hadoop:52818/user/HeartbeatReceiver
2015-06-30 17:50:51,756 INFO [main] netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 40198
2015-06-30 17:50:51,757 INFO [main] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager
2015-06-30 17:50:51,759 INFO [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerMasterActor (Logging.scala:logInfo(59)) - Registering block manager localhost:40198 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 40198)
2015-06-30 17:50:51,761 INFO [main] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-06-30 17:50:52,840 INFO [main] parse.ParseDriver (ParseDriver.java:parse(185)) - Parsing command: SELECT id, name FROM eiserver.eismpt
2015-06-30 17:50:53,141 INFO [main] parse.ParseDriver (ParseDriver.java:parse(206)) - Parse Completed
2015-06-30 17:50:54,041 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(502)) - 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
2015-06-30 17:50:54,064 INFO [main] metastore.ObjectStore (ObjectStore.java:initialize(247)) - ObjectStore, initialize called
2015-06-30 17:50:54,227 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/datanucleus-rdbms-3.2.9.jar."
2015-06-30 17:50:54,268 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/datanucleus-api-jdo-3.2.6.jar."
2015-06-30 17:50:54,274 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/datanucleus-core-3.2.10.jar."
2015-06-30 17:50:54,314 INFO [main] DataNucleus.Persistence (Log4JLogger.java:info(77)) - Property datanucleus.cache.level2 unknown - will be ignored
2015-06-30 17:50:54,315 INFO [main] DataNucleus.Persistence (Log4JLogger.java:info(77)) - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
2015-06-30 17:50:56,109 INFO [main] metastore.ObjectStore (ObjectStore.java:getPMF(318)) - Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
2015-06-30 17:50:56,170 INFO [main] metastore.MetaStoreDirectSql (MetaStoreDirectSql.java:<init>(110)) - MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
2015-06-30 17:50:57,315 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,316 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,688 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,688 INFO [main] DataNucleus.Datastore (Log4JLogger.java:info(77)) - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
2015-06-30 17:50:57,842 INFO [main] DataNucleus.Query (Log4JLogger.java:info(77)) - Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
2015-06-30 17:50:57,844 INFO [main] metastore.ObjectStore (ObjectStore.java:setConf(230)) - Initialized ObjectStore
2015-06-30 17:50:58,113 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:createDefaultRoles(560)) - Added admin role in metastore
2015-06-30 17:50:58,115 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:createDefaultRoles(569)) - Added public role in metastore
2015-06-30 17:50:58,198 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:addAdminUsers(597)) - No user is added in admin role, since config is empty
2015-06-30 17:50:58,376 INFO [main] session.SessionState (SessionState.java:start(383)) - No Tez session required at this point. hive.execution.engine=mr.
2015-06-30 17:50:58,525 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(632)) - 0: get_table : db=eiserver tbl=eismpt
2015-06-30 17:50:58,525 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(314)) - ugi=root ip=unknown-ip-addr cmd=get_table : db=eiserver tbl=eismpt
2015-06-30 17:50:58,567 ERROR [main] metadata.Hive (Hive.java:getTable(1003)) - NoSuchObjectException(message:eiserver.eismpt table not found)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1569)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
How can I configure Spark SQL to access a Hive metastore that is deployed on PostgreSQL? I'm using CDH 5.3.2.
Thank you

Configure Spark to use the Hive metastore thriftserver:
Edit $SPARK_HOME/conf/hive-site.xml to remove the direct connection information and to add this property:
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value> <!-- replace with the Thrift URL of your Hive metastore service -->
<description>URI for client to contact metastore server</description>
</property>
</configuration>
If hive-site.xml is not present in $SPARK_HOME/conf, copy it from the Hive configuration directory into the Spark conf directory so that Spark can connect to the Hive metastore. Run the following command as the root user:
cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/
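Before restarting anything, it can help to confirm that the hive-site.xml Spark will read really defines hive.metastore.uris. The following is only a minimal sketch that uses the JDK's XML parser; the class name HiveSiteProbe and the method property are hypothetical helpers made up for illustration, not part of Spark or Hive.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not a Spark or Hive API.
public class HiveSiteProbe {

    /** Returns the value of the named property in a Hadoop-style XML config, or null if it is absent. */
    public static String property(String xml, String name) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            NodeList names = p.getElementsByTagName("name");
            NodeList values = p.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java HiveSiteProbe <path to hive-site.xml>");
            return;
        }
        // args[0] is the file Spark will read, for example the copy under the Spark conf directory.
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String uris = property(xml, "hive.metastore.uris");
        System.out.println(uris == null
                ? "hive.metastore.uris is not set"
                : "hive.metastore.uris = " + uris);
    }
}
```

If the property is reported as not set, Spark has no remote metastore URI to use from that file, which would be consistent with it creating its own empty metastore.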
Create Hive Context
At a scala> REPL prompt type the following:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
Create Hive Table
hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (key INT, value STRING)")
Show Hive Tables
hiveContext.sql("SHOW TABLES").collect().foreach(println)
Test the configuration (optional)
Stop the Spark SQL thriftserver with cd $SPARK_HOME; sbin/stop-thriftserver.sh
Start the Hive metastore thriftserver with cd;./start-thriftserver.sh
Check the logs at $HIVE_HOME/logs/metastore.out for any errors.
The Spark SQL thriftserver won't start until it can make a successful connection to this server, so it must be running.
Start the Spark SQL thriftserver with cd $SPARK_HOME; sbin/start-thriftserver.sh
Check the log file indicated in the returned line.
You should see lines like this:
16/12/29 20:22:19 INFO metastore: Trying to connect to metastore with URI thrift://localhost:9083
16/12/29 20:22:19 INFO metastore: Connected to metastore.
Run $SPARK_HOME/bin/beeline -u 'jdbc:hive2://localhost:10000/' and try out the !tables command to make sure that you are able to list the metadata.

The documentation says to set spark.sql.hive.metastore.sharedPrefixes = org.postgresql in the configuration file. Did you try this?

Make sure the configuration in $HIVE_HOME/conf/hive-site.xml points to the complete path of the metastore.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/hive/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Place the hive-site.xml file in $SPARK_HOME/conf to point SparkR to the same metastore as Hive.
Hope this solves your issue.

Related

Databricks 6.1 no database named global_temp error when initializing metastore connection

When the Hive metastore connection is initialized (on saving a DataFrame as a table for the first time) on a 6.1 cluster (includes Apache Spark 2.4.4, Scala 2.11) on Azure, I can see the health check for the global_temp database failing with this error:
20/02/18 12:11:17 INFO HiveUtils: Initializing HiveMetastoreConnection version 0.13.0 using file:
...
20/02/18 12:11:21 INFO HiveMetaStore: 0: get_database: global_temp
20/02/18 12:11:21 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: global_temp
20/02/18 12:11:21 ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named global_temp)
at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:487)
at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:498)
...
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:430)
...
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
This doesn't cause the Python script to fail, but it pollutes the logs.
Shouldn't the global_temp database be created automatically?
Can the check be switched off, or the error suppressed?

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I followed the steps in the official AWS documentation (reference link below), but I am seeing a discrepancy when accessing the Glue Catalog databases and tables. The EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted.
AWS Documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Observations:
- Using spark-shell (From EMR Master Node):
Works. Able to access Glue DB/Tables using below commands:
spark.catalog.setCurrentDatabase("test_db")
spark.catalog.listTables
- Using spark-submit (From EMR Step):
Does not work. Keep getting the error "Database 'test_db' does not exist"
Error Trace is as below:
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO HiveMetaStore: 0: get_database: default
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
INFO HiveMetaStore: 0: get_database: global_temp
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: global_temp
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/6d0f6b2c-cccd-4e90-a524-93dcc5301e20_resources
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/yarn/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in 191.063411 ms
INFO CodeGenerator: Code generated in 10.27313 ms
INFO HiveMetaStore: 0: get_database: test_db
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test_db
WARN ObjectStore: Failed to get database test_db, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Database 'test_db' does not exist.;
at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44)
at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:64)
at org.griffin_test.GriffinTest.ingestGriffinRecords(GriffinTest.java:97)
at org.griffin_test.GriffinTest.main(GriffinTest.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
After a lot of research and going through many suggestions in blogs, I have tried the fixes below, but to no avail; we are still facing the discrepancy.
Reference Blogs:
https://forums.aws.amazon.com/thread.jspa?threadID=263860
Spark Catalog w/ AWS Glue: database not found
https://okera.zendesk.com/hc/en-us/articles/360005768434-How-can-we-configure-Spark-to-use-the-Hive-Metastore-for-metadata-
Fixes Tried:
- Enabling Hive support in spark-defaults.conf & SparkSession (Code):
Hive classes are on the CLASSPATH, and the internal configuration property spark.sql.catalogImplementation is set to hive:
spark.sql.catalogImplementation hive
Adding Hive metastore config:
.config("hive.metastore.connect.retries", 15)
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
Code Snippet:
SparkSession spark = SparkSession.builder().appName("Test_Glue_Catalog")
        .config("spark.sql.catalogImplementation", "hive")
        .config("hive.metastore.connect.retries", 15)
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
        .enableHiveSupport()
        .getOrCreate();
Any suggestions in figuring out the root cause for this discrepancy would be really helpful.
Appreciate your help! Thank you!

ODBC configuration to connect to Spark Thrift Server

This question might seem like a repeat. In fact, I've seen a couple of related questions, but none with exactly the same error, so I'm asking to see if anyone has a clue.
I've set up a Spark Thrift Server running with default settings. The Spark version is 2.1 and it runs on YARN (Hadoop 2.7.3).
The problem is that I can't set up either the Simba Hive ODBC driver or the Microsoft one so that the Test in the ODBC setup succeeds.
This is the config I'm using for the Microsoft Hive ODBC driver:
When I hit the Test button, the error message shown is the following:
While in the Spark Thrift Server logs the following is seen:
17/09/15 17:31:36 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V1
17/09/15 17:31:36 INFO SessionState: Created local directory: /tmp/00abf145-2928-4995-81f2-fea578280c42_resources
17/09/15 17:31:36 INFO SessionState: Created HDFS directory: /tmp/hive/test/00abf145-2928-4995-81f2-fea578280c42
17/09/15 17:31:36 INFO SessionState: Created local directory: /tmp/vagrant/00abf145-2928-4995-81f2-fea578280c42
17/09/15 17:31:36 INFO SessionState: Created HDFS directory: /tmp/hive/test/00abf145-2928-4995-81f2-fea578280c42/_tmp_space.db
17/09/15 17:31:36 INFO HiveSessionImpl: Operation log session directory is created: /tmp/vagrant/operation_logs/00abf145-2928-4995-81f2-fea578280c42
17/09/15 17:31:36 INFO SparkExecuteStatementOperation: Running query 'set -v' with 82d7f9a6-f2a6-4ebd-93bb-5c8da1611f84
17/09/15 17:31:36 INFO SparkSqlParser: Parsing command: set -v
17/09/15 17:31:36 INFO SparkExecuteStatementOperation: Result Schema: StructType(StructField(key,StringType,false), StructField(value,StringType,false), StructField(meaning,StringType,false))
If I connect using the JDBC driver by means of Beeline (which works ok), these are the logs:
17/09/15 17:04:24 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
17/09/15 17:04:24 INFO SessionState: Created HDFS directory: /tmp/hive/test
17/09/15 17:04:24 INFO SessionState: Created local directory: /tmp/c0681d6f-cc0f-40ae-970d-e3ea366aa414_resources
17/09/15 17:04:24 INFO SessionState: Created HDFS directory: /tmp/hive/test/c0681d6f-cc0f-40ae-970d-e3ea366aa414
17/09/15 17:04:24 INFO SessionState: Created local directory: /tmp/vagrant/c0681d6f-cc0f-40ae-970d-e3ea366aa414
17/09/15 17:04:24 INFO SessionState: Created HDFS directory: /tmp/hive/test/c0681d6f-cc0f-40ae-970d-e3ea366aa414/_tmp_space.db
17/09/15 17:04:24 INFO HiveSessionImpl: Operation log session directory is created: /tmp/vagrant/operation_logs/c0681d6f-cc0f-40ae-970d-e3ea366aa414
17/09/15 17:04:24 INFO SparkSqlParser: Parsing command: use default
17/09/15 17:04:25 INFO HiveMetaStore: 1: get_database: default
17/09/15 17:04:25 INFO audit: ugi=vagrant ip=unknown-ip-addr cmd=get_database: default
17/09/15 17:04:25 INFO HiveMetaStore: 1: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/09/15 17:04:25 INFO ObjectStore: ObjectStore, initialize called
17/09/15 17:04:25 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/09/15 17:04:25 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/09/15 17:04:25 INFO ObjectStore: Initialized ObjectStore
Well, I managed to connect successfully by installing the Microsoft Spark ODBC driver instead of the Hive one.
It looks like the problem was the driver refusing to connect to the Spark Thrift Server once it discovered, based on some server property, that it was not a Hive2 server. I doubt there are actual differences at the wire level between Hive2 and the Spark Thrift Server, because the latter is a port of the former without changes at the protocol level (Thrift). In any case, the solution is to switch to this driver and configure it the same way as the Hive2 one:
Microsoft® Spark ODBC Driver

Connect to Hive from Apache Spark [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I have a simple program that I'm running on a standalone Cloudera VM. I have created a managed table in Hive, which I want to read in Apache Spark, but the initial connection to Hive is not being established. Please advise.
I'm running this program in IntelliJ. I have copied hive-site.xml from /etc/hive/conf to /etc/spark/conf, but the Spark job still does not connect to the Hive metastore.
public static void main(String[] args) throws AnalysisException {
    String master = "local[*]";
    SparkSession sparkSession = SparkSession
            .builder().appName(ConnectToHive.class.getName())
            .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
            .enableHiveSupport()
            .master(master).getOrCreate();
    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");
    SQLContext sqlCtx = sparkSession.sqlContext();
    HiveContext hiveContext = new HiveContext(sparkSession);
    hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");
    hiveContext.sql("SHOW DATABASES").show();
    hiveContext.sql("SHOW TABLES").show();
    sparkSession.close();
}
The output is shown below. I expect to see the "employee" table so that I can query it. Since I'm running standalone, the Hive metastore is in a local MySQL server.
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configuration for Hive metastore
hive> show databases;
OK
default
sxm
temp
Time taken: 0.019 seconds, Fetched: 3 row(s)
hive> use default;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
employee
Time taken: 0.014 seconds, Fetched: 1 row(s)
hive> describe formatted employee;
OK
# col_name data_type comment
id string
firstname string
lastname string
addresses array<struct<street:string,city:string,state:string>>
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Tue Jul 25 06:33:01 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1500989581
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.07 seconds, Fetched: 29 row(s)
hive>
Added Spark Logs
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to:
17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to:
17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); groups with view permissions: Set(); users with modify permissions: Set(cloudera); groups with modify permissions: Set()
17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_all_databases
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_functions: db=default pat=*
17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: default
17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr cmd=get_database: global_temp
17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Process finished with exit code 0
UPDATE
/usr/lib/hive/conf/hive-site.xml was not on the classpath, so Spark was not reading the tables; after adding it to the classpath, it worked fine. I had this problem because I was running from IntelliJ. In production, the Spark conf folder will have a link to hive-site.xml.
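When running from an IDE, a quick way to see whether hive-site.xml is visible at all is to ask the class loader for it before building the SparkSession. This is a minimal sketch under the assumption that the Hive client looks the file up as a classpath resource; ClasspathProbe and locate are hypothetical names used only for illustration.

```java
import java.net.URL;

// Hypothetical helper for illustration only; not a Spark or Hive API.
public class ClasspathProbe {

    /** Returns the location of the named resource as seen by the given class loader, or null if it is not visible. */
    public static URL locate(ClassLoader loader, String resource) {
        return loader.getResource(resource);
    }

    public static void main(String[] args) {
        // Uses the context class loader of the current thread, the same one application code normally runs with.
        URL found = locate(Thread.currentThread().getContextClassLoader(), "hive-site.xml");
        System.out.println(found == null
                ? "hive-site.xml is NOT on the classpath"
                : "hive-site.xml found at " + found);
    }
}
```

In IntelliJ, adding the directory that contains hive-site.xml as a resources root or as a classpath entry of the run configuration should make the lookup succeed.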
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
This is a hint that you're not connected to the remote Hive metastore (the one you've configured on MySQL) and that the XML file is not correctly on your classpath.
You can also set this programmatically, without the XML file, before you create the SparkSession:
System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
How to connect to a Hive metastore programmatically in SparkSQL?
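The DERBY line quoted above is easy to check for mechanically, for example in a smoke test that captures the driver log. The sketch below assumes the log line format shown in this thread; MetastoreLogCheck and backingDb are hypothetical names. Note that the absence of such a line proves nothing by itself, since a client that talks to a remote metastore over Thrift may not log it at all.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Spark or Hive API.
public class MetastoreLogCheck {

    // Matches the MetaStoreDirectSql line seen in the logs above and captures the database name.
    private static final Pattern BACKING_DB =
            Pattern.compile("MetaStoreDirectSql: Using direct SQL, underlying DB is (\\w+)");

    /** Returns the backing database named in the first matching log line, if any. */
    public static Optional<String> backingDb(List<String> logLines) {
        for (String line : logLines) {
            Matcher m = BACKING_DB.matcher(line);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY",
                "17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore");
        String db = backingDb(sample).orElse("unknown");
        System.out.println("Backing DB: " + db);
        if ("DERBY".equals(db)) {
            System.out.println("Spark is probably using a local embedded metastore, not the shared one");
        }
    }
}
```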

Spark1.6 and Hive 0.14 integration issue

I have been trying to integrate the latest Spark (1.6) with Hive 0.14.0. I am only trying to get the Thrift server to run. If I don't override the following configurations when invoking Spark's start-thriftserver.sh script (--conf spark.sql.hive.metastore.version=0.14.0 --conf spark.sql.hive.metastore.jars=maven), any CREATE TABLE queries fail in Spark because of incompatibilities between Hive 1.2.1, which Spark 1.6 uses by default, and the Hive version I run in production. However, when I override those two configs, the Thrift server does not connect to the Hive metastore URI specified in hive-site.xml when it starts; instead it tries to connect to a Derby database, and the Thrift server then does not start properly. Am I missing some additional overrides?
Please see the thrift server log information below:
Loaded from file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
java.vendor=Oracle Corporation
java.runtime.version=1.7.0_79-b15
user.dir=/
os.name=Linux
os.arch=amd64
os.version=2.6.32-504.23.4.el6.x86_64
derby.system.home=null
Database Class Loader started - derby.database.classpath=''
16/01/26 16:35:20 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (10.15.150.38:51475) with ID 20
16/01/26 16:35:20 INFO BlockManagerMasterEndpoint: Registering block manager 10.15.150.38:52107 with 9.9 GB RAM, BlockManagerId(20, 10.15.150.38, 52107)
16/01/26 16:35:20 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (10.15.150.38:51479) with ID 48
16/01/26 16:35:20 INFO BlockManagerMasterEndpoint: Registering block manager 10.15.150.38:47973 with 9.9 GB RAM, BlockManagerId(48, 10.15.150.38, 47973)
16/01/26 16:35:20 WARN Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream#3cf4a477:an attempt to override final parameter: mapreduce.reduce.speculative; Ignoring.
16/01/26 16:35:20 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/01/26 16:35:21 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/01/26 16:35:21 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/01/26 16:35:22 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/01/26 16:35:22 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/01/26 16:35:22 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/01/26 16:35:22 INFO ObjectStore: Initialized ObjectStore
16/01/26 16:35:22 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/01/26 16:35:22 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/01/26 16:35:22 INFO HiveMetaStore: Added admin role in metastore
16/01/26 16:35:22 INFO HiveMetaStore: Added public role in metastore
16/01/26 16:35:22 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/01/26 16:35:22 INFO HiveMetaStore: 0: get_all_databases
16/01/26 16:35:22 INFO audit: ugi=hive ip=unknown-ip-addr cmd=get_all_databases
16/01/26 16:35:22 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/01/26 16:35:22 INFO audit: ugi=hive ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/01/26 16:35:22 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/01/26 16:35:22 INFO SessionState: Created local directory: /tmp/06895c7e-e26c-42b7-b100-4222d0356b6b_resources
16/01/26 16:35:22 INFO SessionState: Created HDFS directory: /tmp/hive/hive/06895c7e-e26c-42b7-b100-4222d0356b6b
16/01/26 16:35:22 INFO SessionState: Created local directory: /tmp/hive/06895c7e-e26c-42b7-b100-4222d0356b6b
16/01/26 16:35:23 INFO SessionState: Created HDFS directory: /tmp/hive/hive/06895c7e-e26c-42b7-b100-4222d0356b6b/_tmp_space.db
16/01/26 16:35:23 WARN Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream#37f031a:an attempt to override final parameter: mapreduce.reduce.speculative; Ignoring.
16/01/26 16:35:23 INFO HiveContext: default warehouse location is /user/hive/warehouse
16/01/26 16:35:23 INFO HiveContext: Initializing HiveMetastoreConnection version 0.14.0 using maven.
Ivy Default Cache set to: /home/hive/.ivy2/cache
The jars for the packages stored in: /home/hive/.ivy2/jars
http://www.datanucleus.org/downloads/maven2 added as a remote repository with the name: repo-1
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.calcite#calcite-core added as a dependency
org.apache.calcite#calcite-avatica added as a dependency
org.apache.hive#hive-metastore added as a dependency
org.apache.hive#hive-exec added as a dependency
org.apache.hive#hive-common added as a dependency
org.apache.hive#hive-serde added as a dependency
com.google.guava#guava added as a dependency
org.apache.hadoop#hadoop-client added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
