Where is Spark table created if Hive support is not enabled? - apache-spark

Where is the table created when I use the SQL syntax for CREATE TABLE in Spark 2.*?
CREATE TABLE ... AS
Does Spark store it internally somewhere, or does there always have to be Impala or Hive support for Spark tables to work?

By default (when no metastore connection is specified, e.g. in hive-site.xml), Spark saves the table metadata in a local metastore backed by a Derby database.
The table data itself is stored in a spark-warehouse directory under the directory from which you start Spark, or wherever the spark.sql.warehouse.dir property points.
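A minimal sketch of that default behaviour in PySpark (no Hive configured; the table name and warehouse path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("no-hive-example") \
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") \
    .getOrCreate()

spark.sql("CREATE TABLE demo_table USING parquet AS SELECT 1 AS id")   # data files land under spark.sql.warehouse.dir
print(spark.conf.get("spark.sql.warehouse.dir"))                       # /tmp/spark-warehouse in this sketch
# The table metadata goes into a local Derby metastore_db directory
# created in the working directory from which Spark was started.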

Related

How can I show Hive tables using pyspark

Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with pyspark, but the problem is that it shows me only the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive no longer share metadata.
By default you will not see Hive tables from pyspark; this is a problem I describe in this post: How to save/update a table in Hive so that it is readable from Spark.
But, anyway, here are things you can try:
If you want to test only on the head node, you can change hive-site.xml: set the property "metastore.catalog.default" to hive, and after that open pyspark from the command line.
If you want to apply this to all cluster nodes, the changes need to be made in Ambari:
Log in as admin on Ambari
Go to spark2 > Configs > hive-site-override
Again, update the property "metastore.catalog.default" to the value hive
Restart all required services from the Ambari panel
These changes set the Hive metastore catalog as the default (see the sketch below for a per-session equivalent).
You can see Hive databases and tables now, but depending on the table structure, you may not see the table data properly.
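For a quick test, the same property can usually be passed per session through Spark's Hadoop configuration instead of editing hive-site.xml cluster-wide. This is a hedged sketch; the spark.hadoop. prefix forwarding of the property is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hdinsight-hive-catalog") \
    .config("spark.hadoop.metastore.catalog.default", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()   # should now list the Hive catalog's databases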
If you have created tables in other databases, try show tables from database_name. Replace database_name with the actual name.
You are missing the Hive server details in your SparkSession. If you haven't added any, Spark will create and use a default local database to run Spark SQL.
If you've added the configuration for spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris to the Spark defaults conf file, then add enableHiveSupport() while creating the SparkSession.
Otherwise, add the configuration details while creating the SparkSession:
.config("spark.sql.warehouse.dir","/user/hive/warehouse")
.config("hive.metastore.uris","thrift://localhost:9083")
.enableHiveSupport()

pyspark dataframe saved to hive table cannot be found

We installed our cluster via CDH 6.2.
We use pyspark to create a DataFrame and then save it to Hive.
The files are created in the warehouse correctly, but the table cannot be found in Hive or Impala using show tables.
It can be found by Spark SQL using spark.sql('show tables'), but that only shows the tables created by the Spark code, which means it cannot see tables created via the Hive or Impala console.
So I think the Spark code doesn't connect to the Hive Metastore server,
but I don't know how to point it at the Hive Metastore server.
In order to connect to the Hive metastore you need to copy (or symlink) the hive-site.xml file into the spark/conf directory. Try the following:
ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
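Once hive-site.xml is visible to Spark, a quick check from pyspark could look like this (a sketch; the table name is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SHOW TABLES").show()                             # should now include tables created from Hive/Impala
df = spark.createDataFrame([(1, "a")], ["id", "val"])
df.write.mode("overwrite").saveAsTable("demo_from_spark")   # should now be registered in the shared metastore
# Impala may additionally need INVALIDATE METADATA before the new table shows up there.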

Spark SQL can't find table in Hive in HDP

I use HDP 3.1 and I added Spark2, Hive and the other services that are needed. I turned off the ACID feature in Hive. The Spark job can't find the table in Hive, but the table exists in Hive. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is automatically created by HDP, but it isn't the same as the file in Hive's conf folder. And from the log, Spark gets the thrift URI of Hive correctly.
I use Spark SQL and created one Hive table in spark-shell. I found the table was created in the folder specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir, but the problem is still there.
I also enabled Hive support when creating the Spark session:
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
There is a metastore.catalog.default property in the hive-site.xml in Spark's conf folder. Its value is spark; it should be changed to hive. And, by the way, the ACID feature of Hive should stay disabled.
You can also use the Hive Warehouse Connector and enable LLAP in the Hive configuration.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
Spark, by default, only reads the Spark catalog. This means Spark applications that attempt to read/write tables created using the Hive CLI will fail with a table-not-found exception.
Workarounds:
Create the table from both the Hive CLI and Spark SQL
Use the Hive Warehouse Connector (see the sketch below)
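A hedged sketch of the Hive Warehouse Connector route from pyspark. It assumes the HWC library is on the cluster and that spark.sql.hive.hiveserver2.jdbc.url and the related spark.datasource.hive.warehouse.* settings are already configured; the table name is illustrative:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession       # shipped with the Hive Warehouse Connector

spark = SparkSession.builder.appName("hwc-example").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()   # talks to Hive through HiveServer2/LLAP, not the Spark catalog
hive.showDatabases().show()
hive.executeQuery("SELECT * FROM some_hive_table LIMIT 10").show()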

How to set up metadata database for Spark SQL?

Hive has its own metastore and stores table, column, and partition information there.
If I do not want to use Hive, can we create a metastore for Spark in the same way as Hive?
I want to query with Spark SQL (not using the DataFrame API) like Hive (SELECT, FROM and WHERE). Can we do that? If yes, which relational DB can we use for metadata storage?
Can we create a metastore for Spark in the same way as Hive?
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
If yes, which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much as you would do with a separate Hive installation (which is usually the case in such enterprisey installations).
You may want to read Hive Metastore.
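As an illustration, here is a hedged sketch of pointing Spark SQL's built-in Hive metastore at MySQL instead of the default Derby database. The host, database name, credentials and warehouse path are placeholders, the javax.jdo.option.* keys are the standard Hive metastore connection properties passed through Spark's Hadoop configuration, and the MySQL JDBC driver is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("external-metastore") \
    .config("spark.sql.warehouse.dir", "/user/spark/warehouse") \
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true") \
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver") \
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hiveuser") \
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hivepass") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT) USING parquet")
spark.sql("SELECT name FROM people WHERE age > 21").show()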
Spark is essentially a distributed computation system rather than a distributed storage system. Therefore, we mostly use Spark to do the computation work, which needs metadata from different storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.
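For example (a sketch of the default, non-Hive setup), a session built without enableHiveSupport() keeps its catalog entries in memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()   # no enableHiveSupport(), so the in-memory catalog is used
spark.range(3).createOrReplaceTempView("v")
spark.sql("SELECT * FROM v WHERE id > 0").show()
print(spark.catalog.listTables())   # catalog entries live only for this session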

Spark and Metastore Relation

I am aware that the Hive Metastore is used to store metadata of the tables that we create in Hive, but why does Spark require a metastore? What is the default relation between the Metastore and Spark?
Is the metastore used by Spark SQL, and if so, is it there to store DataFrame metadata?
Why does Spark by default check for metastore connectivity even though I am not using any SQL libraries?
Here is the explanation from the Spark 2.2.0 documentation:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
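In line with that quote, a hedged sketch of overriding the default location (the path is illustrative; spark.sql.warehouse.dir replaces the deprecated hive.metastore.warehouse.dir):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "/data/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
print(spark.conf.get("spark.sql.warehouse.dir"))   # /data/spark-warehouse instead of ./spark-warehouse
# With no hive-site.xml present, the metastore_db directory still appears in the current working directory.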
