Can't load a Hive table through Spark - apache-spark

I am new to Spark and need help figuring out why my Hive databases are not accessible when I try to perform a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH 5.4.x, on VirtualBox.
I have downloaded pre-built Spark 1.3.1.
I am using the Hive bundled with the VM and can run Hive queries through spark-shell and the Hive command line without any issue. This includes running the command:
LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into Hive. I have saved the results of the Cassandra read in parquet format in a folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I:
Copied hive-site.xml to the Spark conf folder.
Made a change to this XML. I noticed that I had two hive-site.xml files - one that was auto-generated and another that had the Hive execution parameters. I combined both into a single hive-site.xml.
Code used (Java):
HiveContext hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc));
hiveContext.sql("show databases").show();
hiveContext.sql("LOAD DATA INPATH 'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
So this worked, and I could load data into Hive. Except, after I restarted my VM, it stopped working.
When I run the show databases Hive query, I get a result saying
result
default
instead of the databases in Hive, which are
default
test_spark
I also notice a folder called metastore_db being created in my project folder. From googling around, I know this happens when Spark can't connect to the Hive metastore, so it creates one of its own. I thought I had fixed that, but clearly not.
What am I missing?

Related

How can I show Hive tables using pyspark

Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with pyspark, but the problem is that it shows me only the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive no longer share metadata.
By default you will not see Hive tables from pyspark; this is a problem I describe in this post: How save/update table in hive, to be readbale on spark.
But, anyway, here are things you can try:
If you want to test only on the head node, you can change hive-site.xml: on the property "metastore.catalog.default", change the value to hive, then open pyspark from the command line.
If you want to apply it to all cluster nodes, the changes need to be made in Ambari:
Log in as admin on Ambari
Go to spark2 > Configs > hive-site-override
Again, update the property "metastore.catalog.default" to the value hive
Restart everything required from the Ambari panel
These changes make the Hive metastore catalog the default (a session-level way to set the same property is sketched after this answer).
You will now see Hive databases and tables, but depending on the table structure, you may not see the table data properly.
If you have created tables in other databases, try show tables from database_name. Replace database_name with the actual name.
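For testing from a pyspark shell or notebook, a minimal sketch of setting the same catalog property per session instead of editing hive-site.xml; this assumes the cluster already exposes the Hive metastore, and passing the property through the spark.hadoop.* prefix is my assumption, not something stated in the answer:
# Rough pyspark sketch: read the Hive catalog for a single session instead of
# editing hive-site.xml. "metastore.catalog.default" is the property named in
# the answer above; the spark.hadoop.* prefix is an assumption about your setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-hive-catalog")
    .config("spark.hadoop.metastore.catalog.default", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("show databases").show()
spark.sql("show tables in default").show()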
You are missing the Hive server details in your SparkSession. If you haven't added any, Spark will create and use the default database to run Spark SQL.
If you've added configuration details for spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris in the Spark defaults conf file, then add enableHiveSupport() while creating the SparkSession.
Otherwise, add the configuration details while creating the SparkSession:
.config("spark.sql.warehouse.dir","/user/hive/warehouse")
.config("hive.metastore.uris","thrift://localhost:9083")
.enableHiveSupport()
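Put together, a minimal pyspark sketch of that session setup; the warehouse path and thrift URI are the example values from the answer and may differ in your environment:
# Minimal sketch of the SparkSession setup described above.
# Replace the warehouse dir and metastore URI with your cluster's values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-example")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("show databases").show()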

pyspark dataframe save to hive table can not be found

We have installed our cluster via CDH 6.2.
Using pyspark, I create a DataFrame, then save it to Hive.
The file is created in the warehouse correctly, but the table cannot be found in Hive or Impala using show tables.
It can be found by Spark SQL using sql('show tables'), but that only shows the tables created by the Spark code before, which means it cannot see tables created via the Hive or Impala console.
So I think the Spark code doesn't connect to the Hive Metastore server.
But I don't know how to point it at the Hive Metastore server.
In order to connect to the Hive metastore you need to copy the hive-site.xml file into the spark/conf directory. Try the following:
ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
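Once the file is linked (or copied), a quick way to verify that Spark now talks to the same metastore is to list databases and tables from a pyspark session with Hive support enabled; a small sketch, assuming the default CDH paths above:
# Quick check that Spark sees the Hive metastore after linking hive-site.xml.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# These should now match what `show tables` returns in the Hive/Impala shells.
spark.sql("show databases").show()
spark.sql("show tables").show()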

Spark sql can't find table in hive in HDP

I use HDP 3.1 and I added Spark2, Hive, and the other services which are needed. I turned off the ACID feature in Hive. The Spark job can't find the table in Hive, but the table exists in Hive. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is automatically created by HDP, but it isn't the same as the file in Hive's conf folder. And from the log, Spark can get the thrift URI of Hive correctly.
I use Spark SQL and created one Hive table in spark-shell. I found the table was created in the folder specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir, but the problem is still there.
I also enabled Hive support when creating the Spark session:
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
There is a metastore.catalog.default property in the hive-site.xml in Spark's conf folder. Its value is spark; it should be changed to hive. And by the way, we should disable the ACID feature of Hive.
You can use the Hive Warehouse Connector and use LLAP in the Hive conf.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
Spark, by default, only reads the Spark catalog. This means Spark applications that attempt to read or write tables created using the Hive CLI will fail with a table-not-found exception.
Workaround:
Create the table in Hive CLI and Spark SQL
Hive Warehouse Connector
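For reference, a rough pyspark sketch of the Hive Warehouse Connector route mentioned above; it assumes the HWC jar/zip and the HDP-specific settings (such as spark.sql.hive.hiveserver2.jdbc.url and the LLAP daemon hosts) are already supplied via spark-submit, and "mydb.mytable" is just a made-up example:
# Rough sketch of reading a Hive-managed table through the Hive Warehouse
# Connector on HDP 3.x. Requires the HWC jar on the classpath and the
# hiveserver2/LLAP configs passed to spark-submit; "mydb.mytable" is made up.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-example").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

hive.showDatabases().show()
df = hive.executeQuery("SELECT * FROM mydb.mytable LIMIT 10")
df.show()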

Can Spark-sql work without a hive installation?

I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, such as in the example below, I am getting the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view")
spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").where("sum(count) > 10").count()
Most of the fixes that I have seen in relation to this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, the metastore_db started working.
Yes, we can run Spark SQL queries on Spark without installing Hive. By default Hive uses MapReduce as its execution engine; we can configure Hive to use Spark or Tez as the execution engine to run our queries much faster. Hive on Spark uses the Hive metastore to run Hive queries. At the same time, SQL queries can be executed through Spark. If Spark is used to execute simple SQL queries, or is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db will be created under the home folder of the user who executes the query.
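In other words, for plain DataFrame/temp-view queries no Hive installation is needed; a minimal pyspark sketch, reusing the question's example (the JSON path is just whatever file you have locally):
# Spark SQL against a temp view, no Hive installation required.
# Spark will create a local metastore_db (embedded Derby) in the working dir.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("no-hive-sql").getOrCreate()

# Placeholder path: any JSON file you have locally.
spark.read.json("/data/flight-data/json/2015-summary.json") \
    .createOrReplaceTempView("some_sql_view")

spark.sql("""
    SELECT DEST_COUNTRY_NAME, sum(count) AS total
    FROM some_sql_view
    GROUP BY DEST_COUNTRY_NAME
""").where("DEST_COUNTRY_NAME like 'S%'").show()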

Apache spark installation and db_metastore

I am a beginner in Spark.
I installed Java and spark-1.6.1-bin-hadoop2.6.tgz (I have not installed Hadoop) and, without changing any configuration in the conf directory, ran spark-shell.
In the directory where Spark is installed, I see a metastore_db created with a tmp folder inside it.
Why is this metastore_db created, and where is it configured?
Also, I see a sqlContext being created after running spark-shell. What does this sqlContext represent?
When running spark-shell, a SparkContext and a SQLContext are created. SQLContext is built on top of SparkContext to enable support for Spark SQL. It has methods to execute SQL queries (the sql method) and to create DataFrames.
metastore_db is the Hive metastore path. Spark supports Apache Hive queries via HiveContext. If there is no hive-site.xml configured, Spark will use the local metastore_db path; see the documentation for details.
However, it would be good if you downloaded Spark 2.0. There you've got a unified entry point to Spark, named SparkSession. This class allows you to read data from many sources, create Datasets, etc.
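As a small pyspark illustration of what spark-shell gives you (sqlContext) versus the Spark 2.0 entry point the answer recommends; the people.json file is hypothetical:
# Spark 1.6 style: spark-shell/pyspark pre-creates these for you as sc and sqlContext.
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

sc = SparkContext(appName="sqlcontext-demo")
sqlContext = SQLContext(sc)                      # Spark SQL entry point on top of sc
df = sqlContext.read.json("people.json")         # hypothetical input file
df.registerTempTable("people")
sqlContext.sql("SELECT * FROM people").show()

# Spark 2.x style: one SparkSession replaces SQLContext/HiveContext
# (getOrCreate reuses the SparkContext that is already running).
spark = SparkSession.builder.appName("sqlcontext-demo").getOrCreate()
spark.read.json("people.json").createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()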
