How to save a temp table to the Hive metastore (and analyze it in Hive)? - apache-spark

I use Spark 1.3.1.
How do I store/save DataFrame data to the Hive metastore?
When I run show tables in Hive, the DataFrame does not appear as a table in any Hive database. I have copied hive-site.xml to $SPARK_HOME/conf, but that didn't help (the DataFrame still does not appear in the Hive metastore).
I am following this document, which uses Spark 1.4:
dataframe.registerTempTable("people")
How can I analyze the Spark table in Hive?

You can use the insertInto method to save a DataFrame into an existing Hive table.
Example:
dataframe.insertInto("table_name", false);
// 2nd arg specifies whether the existing table data should be overwritten
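Note that registerTempTable only creates a session-scoped temporary table and never touches the Hive metastore; to make the data visible from Hive you need to persist it, e.g. with saveAsTable. A minimal sketch of that (Spark 1.3-era Scala API; the sample data and table name are made up for illustration):
import org.apache.spark.sql.hive.HiveContext

// A HiveContext (not a plain SQLContext) is required to talk to the metastore.
val hiveContext = new HiveContext(sc) // sc: an existing SparkContext
import hiveContext.implicits._

// Hypothetical sample data, for illustration only:
val dataframe = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")

// saveAsTable persists the table in the Hive metastore, so it shows up
// under `show tables` in Hive:
dataframe.saveAsTable("people")
In Spark 1.4+ the same thing is written as dataframe.write.saveAsTable("people").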

Found the solution. I was using Spark 1.3.1, which does not support all of this. After moving to Spark 1.5.1 the problem was resolved.
I have noticed DataFrames working fully from 1.4.0 onwards; many of the older commands are deprecated.

Related

Hive beeline and spark load count doesn't match for hive tables

I am using Spark 2.4.4 and Hive 2.3 ...
Using Spark, I am loading a DataFrame into a Hive table using DF.insertInto(hiveTable).
If the table is created during the run (of course before the insertInto, through spark.sql), or if it is an existing table created by Spark 2.4.4, everything works fine.
The issue is when I attempt to load existing tables (older tables created with Spark 2.2 or before): I face issues with the COUNT of records - the count of the target Hive table differs when done through beeline vs. Spark SQL.
Please assist.
There seems to be an issue with the sync of the Hive metastore and the Spark catalog for Hive tables (in Parquet file format) created on Spark 2 or before (with complex/nested data types) and loaded using Spark 2.4.
In the usual case, spark.catalog.refreshTable(<hive-table-name>) will refresh the stats from the Hive metastore into the Spark catalog.
In this case, an explicit spark.catalog.refreshByPath(<location-maprfs-path>) needs to be executed to refresh the stats.
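A minimal sketch of the two refresh calls (Spark 2.4 Scala shell; the table name and MapR-FS path are hypothetical):
// Usually refreshing by table name is enough to re-sync the Spark catalog
// with the Hive metastore:
spark.catalog.refreshTable("mydb.legacy_table")

// For the problematic legacy tables, refresh by the underlying storage path:
spark.catalog.refreshByPath("maprfs:///user/hive/warehouse/mydb.db/legacy_table")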

Bucketby Property with Spark saveastable method taking spark 'default' database instead of hive database: HDP 3.0

I'm saving a Spark DataFrame using the saveAsTable method, with the code below.
val options = Map("path" -> hiveTablePath)
df.write.format("orc")
.partitionBy("partitioncolumn")
.options(options)
.mode(SaveMode.Append)
.saveAsTable(hiveTable)
It works fine and I am able to see the data in the Hive table. But when I use one more property, bucketBy(5, bucketted_column):
df.write.format("orc")
.partitionBy("partitioncolumn")
.bucketBy(5, bucketted_column)
.options(options)
.mode(SaveMode.Append)
.saveAsTable(hiveTable)
It tries to save the table in Spark's 'default' database instead of the Hive database.
Can someone please suggest why bucketBy(5, bucketted_column) is not working with saveAsTable?
Note: Framework: HDP 3.0, Spark 2.1
You can try adding this parameter:
.option("dbtable", "schema.tablename")
Qualifying the table name, i.e. df.write.mode(...).saveAsTable("database.table"), should do the trick.
Note, however, that tables written with bucketBy cannot be read via Hive, Hue, or Impala; that is currently not supported.
Not sure what you mean by "spark db" - I think you mean the Spark metastore?
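As a sketch of the qualified-name approach (Scala; the database name mydb is hypothetical), qualify the table with the target Hive database so saveAsTable does not fall back to 'default':
import org.apache.spark.sql.SaveMode

df.write.format("orc")
  .partitionBy("partitioncolumn")
  .bucketBy(5, "bucketted_column")
  .mode(SaveMode.Append)
  .saveAsTable("mydb.hive_table") // database-qualified table name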

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4; to be more specific, Hive 1.2.1 with Tez 0.7.0 and Spark 2.2.0.
My task is simple: store data in ORC file format, then use Spark to process the data. To achieve this, I am doing the following:
Create a Hive table through HiveQL
Use spark.sql("select ... from ...") to load the data into a DataFrame
Process against the DataFrame
My questions are:
1. What is Hive's role behind the scenes?
2. Is it possible to skip Hive?
You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive is defining a schema over your data and providing a query layer for Spark and external clients to communicate through.
Otherwise, spark.read.orc and df.write.orc exist for reading and writing DataFrames directly on the filesystem.
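A minimal sketch of the Hive-free path (Spark 2.x Scala; the filesystem path is hypothetical):
// Write a DataFrame as ORC straight to the filesystem, no Hive involved:
df.write.orc("hdfs:///data/events_orc")

// Read it back; the schema comes from the ORC file footers, not from Hive:
val events = spark.read.orc("hdfs:///data/events_orc")
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()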

Spark dataframe saveAsTable not truncating data from Hive table

I am using Spark 2.1.0 with the Java SparkSession API to run my Spark SQL.
I am trying to save a Dataset<Row> named ds into a Hive table named schema_name.tbl_name using overwrite mode.
But when I run the statement below,
ds.write().mode(SaveMode.Overwrite)
.option("header","true")
.option("truncate", "true")
.saveAsTable(ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME));
the table gets dropped after the first run.
When I rerun it, the table gets created with the data loaded.
Even using the truncate option didn't resolve my issue. Does saveAsTable support truncating the data instead of dropping/re-creating the table? If so, what is the correct way to do it in Java?
This is the Apache JIRA reference for my question; it seems to be unresolved to date:
https://issues.apache.org/jira/browse/SPARK-21036
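Since that JIRA is unresolved, a common workaround is to truncate the table yourself and then append, so the table definition is never dropped. A minimal sketch (shown in Scala; it assumes the table already exists):
import org.apache.spark.sql.SaveMode

// Keep the existing table definition: truncate it first...
spark.sql("TRUNCATE TABLE schema_name.tbl_name")
// ...then append the new data into the existing table:
ds.write.mode(SaveMode.Append).insertInto("schema_name.tbl_name")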

How is spark HiveContext/SQLContext retrieving schema/data?

I can't seem to find much documentation on this, but when I pull data from Hive in Spark SQL, how does it retrieve the schema? Does it automatically look in the Hive metastore? Also, is it Hive that tells Spark to look at the file location to pull the data into a DataFrame? And how does it handle a view, or can it not handle views yet?
Yes, it looks up the Hive metastore.
Spark delegates Hive queries to Hive. It captures the output and turns it into a DataFrame of rows.
From the docs:
When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL
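A minimal sketch of that flow (Spark 1.x-era Scala API; the table name is hypothetical):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: an existing SparkContext

// The schema is looked up in the Hive metastore; the data is read from the
// file locations the metastore points to:
val df = hiveContext.sql("SELECT * FROM mydb.people")
df.printSchema() // schema comes from the metastore, not inferred from files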
