Spark dataframe saveAsTable not truncating data from Hive table - apache-spark

I am using Spark 2.1.0 and the Java SparkSession to run my Spark SQL.
I am trying to save a Dataset<Row> named 'ds' into a Hive table named schema_name.tbl_name using overwrite mode.
But when I run the statement below,
ds.write().mode(SaveMode.Overwrite)
.option("header","true")
.option("truncate", "true")
.saveAsTable(ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME));
the table is getting dropped after the first run.
When I rerun it, the table is created with the data loaded.
Even using the truncate option didn't resolve my issue. Does saveAsTable consider truncating the data instead of dropping/creating the table? If so, what is the correct way to do it in Java?

This is the Apache JIRA reference for my question. It seems to be unresolved so far.
https://issues.apache.org/jira/browse/SPARK-21036
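A commonly suggested workaround (a sketch added for reference, not taken from the thread) is to keep the existing table definition and overwrite only its contents with insertInto; as far as I can tell, the "header" and "truncate" options apply to the CSV and JDBC writers respectively, so saveAsTable silently ignores them.
// Sketch of a workaround: assumes the target table already exists in Hive and its
// column order matches the Dataset's columns (insertInto matches by position, not name).
String tableName = ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME);
ds.write()
.mode(SaveMode.Overwrite) // overwrites the data but keeps the table definition
.insertInto(tableName);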

Related

Hive beeline and spark load count doesn't match for hive tables

I am using spark 2.4.4 and hive 2.3 ...
Using Spark, I am loading a dataframe into a Hive table using DF.insertInto(hiveTable).
If the table is newly created during the run (of course before insertInto, through spark.sql), or if it is an existing table created by Spark 2.4.4, everything works fine.
The issue is that if I attempt to load some existing tables (older tables created by Spark 2.2 or before), I face issues with the COUNT of records: the count of the target Hive table differs when done through beeline vs Spark SQL.
Please assist.
There seems to be an issue with the sync of the Hive metastore and the Spark catalog for Hive tables (with Parquet file format and complex/nested data types) created on Spark 2.2 or before and loaded using Spark 2.4.
In the usual case, spark.catalog.refreshTable(<hive-table-name>) will refresh the stats from the Hive metastore into the Spark catalog.
In this case, an explicit spark.catalog.refreshByPath(<location-maprfs-path>) needs to be executed to refresh the stats.
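For illustration only (the thread itself doesn't include code for this), here is a minimal Java sketch of the two refresh calls; the table name and MapR-FS path are placeholders:
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
.appName("refresh-example")
.enableHiveSupport()
.getOrCreate();
// Usual case: refresh a table's cached metadata and statistics by name.
spark.catalog().refreshTable("schema_name.older_table");
// For the older tables described above, refresh by the table's storage path instead.
spark.catalog().refreshByPath("maprfs:///path/to/older_table");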

Does spark saveAsTable really create a table?

This may be a dumb question, since I lack some fundamental knowledge of Spark. I tried this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates a table under the 'default' database in Hive, and of course, I can fetch data from the table anytime I want.
I updated the above code to get rid of enableHiveSupport:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine, without any error, but when I try "select * from bar", spark says,
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have 2 questions here,
1) Is it possible to create a 'raw' Spark table, not a Hive table? I know Hive maintains its metadata in a database like MySQL; does Spark have a similar mechanism?
2) In the 2nd code snippet, what does spark actually create when calling saveAsTable?
Many thanks.
Check answers below:
If you want to create a raw table only in Spark, createOrReplaceTempView could help you. For the second part, check the next answer.
By default, if you call saveAsTable on your dataframe, it will persist the table into the Hive metastore if you use enableHiveSupport. If you don't enableHiveSupport, the table will be managed by Spark and the data will be stored under the spark-warehouse location. You will lose these tables after restarting the Spark session.
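A short sketch illustrating both answers within a single session (the view name is illustrative): without enableHiveSupport, saveAsTable still registers the table in Spark's own catalog and writes the files under ./spark-warehouse, so it stays queryable until the session ends, while a temp view is the "raw" session-only table from the first answer.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("catalog example").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
// 1) A "raw" Spark table: a temp view that lives only in this session.
df.createOrReplaceTempView("bar_view");
spark.sql("SELECT * FROM bar_view").show();
// 2) saveAsTable without enableHiveSupport: registered in Spark's own catalog,
// data written under ./spark-warehouse, visible here but not from Hive.
df.write().saveAsTable("bar");
spark.table("bar").show();
spark.catalog().listTables().show();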

Spark SQL Fails when hive partition is missing

I have a table which has some missing partitions. When I query it in Hive it works fine:
SELECT *
FROM my_table
but when I call it from pyspark (v. 2.3.0) it fails with the message Input path does not exist: hdfs://path/to/partition. The Spark code I am running is just naive:
spark = (SparkSession
    .builder
    .appName("prueba1")
    .master("yarn")
    .config("spark.sql.hive.verifyPartitionPath", "false")
    .enableHiveSupport()
    .getOrCreate())
spark.table('some_schema.my_table').show(10)
the config("spark.sql.hive.verifyPartitionPath", "false") has been proposed is
this question but seems to not work fine for me
Is there any way I can configure SparkSession so I can get rid of these. I am afraid that in the future more partitions will miss, so a hardcode solution is not possible
This error occurs when partitioned data is dropped from HDFS directly, i.e. not by using Hive commands to drop the partition.
If the data is dropped from HDFS directly, Hive doesn't know about the dropped partition. When we query the Hive table, it still looks for the directory, and since the directory doesn't exist in HDFS, the query results in a file-not-found exception.
To fix this issue we also need to drop the partition associated with that directory from the Hive table, using
alter table <db_name>.<table_name> drop partition(<partition_col_name>=<partition_value>);
Then Hive drops the partition from its metadata; this is the only way to drop the metadata from the Hive table when the partition directory has been dropped from HDFS.
msck repair table doesn't drop partitions; it only adds new partitions when a new partition directory has been added to HDFS.
The correct way to avoid these kinds of issues in the future is to drop partitions using Hive's drop partition command.
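As an added illustration (not part of the answer above), the same commands can also be issued from the Spark session itself; the partition column and value below are placeholders:
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
.appName("drop-missing-partition")
.enableHiveSupport()
.getOrCreate();
// Remove the metastore entry for a partition whose HDFS directory was deleted.
spark.sql("ALTER TABLE some_schema.my_table DROP IF EXISTS PARTITION (dt='2019-01-01')");
// MSCK REPAIR only adds partitions found on HDFS; it does not remove stale ones.
spark.sql("MSCK REPAIR TABLE some_schema.my_table");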
Does the other way around, .config("spark.sql.hive.verifyPartitionPath", "true"), work for you? I have just managed to load data using spark-sql with this setting, while one of the partition paths from Hive was empty and the partition still existed in the Hive metastore. There are caveats, though: it seems to take significantly more time to load data compared to when this setting is set to false.

How to save temptable to Hive metastore (and analyze it in Hive)?

I use Spark 1.3.1.
How to store/save a DataFrame data to a Hive metastore?
In Hive, if I run show tables, the DataFrame does not appear as a table in the Hive databases. I have copied hive-site.xml to $SPARK_HOME/conf, but it didn't help (and the dataframe does not appear in the Hive metastore either).
I am following this document, using the Spark 1.4 version.
dataframe.registerTempTable("people")
How to analyze the spark table in Hive?
You can use the insertInto method to save a dataframe into Hive.
Example:
dataframe.insertInto("table_name", false);
// 2nd arg specifies whether the table should be overwritten
Got the solution. I was using Spark 1.3.1, which doesn't support all of this. Now that I am using Spark 1.5.1, the problem is resolved.
I have noticed DataFrames fully working after 1.4.0. Many commands are deprecated.
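For reference, a minimal sketch of the Spark 1.4/1.5-era pattern this thread converges on (file and table names are illustrative): registerTempTable only creates a session-local table, while the DataFrameWriter's saveAsTable persists the table into the Hive metastore so Hive's show tables can see it.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("save-to-hive"));
HiveContext hiveContext = new HiveContext(sc.sc());
DataFrame people = hiveContext.read().json("people.json"); // placeholder input
// Temp table: visible only to this context, never listed by `show tables` in Hive.
people.registerTempTable("people");
// Persistent table in the Hive metastore: this is what Hive's `show tables` sees.
people.write().mode(SaveMode.Overwrite).saveAsTable("people_hive");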

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

I have a Spark dataframe which I want to save as a Hive table with partitions. I tried the following two statements but they don't work: I don't see any ORC files in the HDFS directory, it's empty. I can see baseTable is there in the Hive console, but obviously it's empty because there are no files in HDFS.
The following two lines, saveAsTable() and insertInto(), do not work. The registerDataFrameAsTable() method works, but it creates an in-memory table and is causing OOM in my use case, as I have thousands of Hive partitions to process. I am new to Spark.
dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").saveAsTable("baseTable");
dataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity","date").insertInto("baseTable");
// the following works but creates an in-memory table and seems to be the reason for OOM in my case
hiveContext.registerDataFrameAsTable(dataFrame, "baseTable");
Hope you have already got your answer, but posting this for others' reference: partitionBy was only supported for Parquet until Spark 1.4; support for ORC, JSON, text and Avro was added in version 1.5+. Please refer to the doc below:
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrameWriter.html
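If upgrading to 1.5+ is not an option, one hedged alternative (not from the answer above; the column names and types are illustrative, and the DataFrame must have the partition columns last because insertInto matches columns by position) is to pre-create the partitioned ORC table in Hive and insert with dynamic partitioning enabled:
// Hedged sketch of a pre-1.5 workaround; column names and types are placeholders.
hiveContext.sql("SET hive.exec.dynamic.partition = true");
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict");
hiveContext.sql(
"CREATE TABLE IF NOT EXISTS baseTable (id STRING, value STRING) " +
"PARTITIONED BY (entity STRING, `date` STRING) STORED AS ORC");
// insertInto writes into the existing table's partitions, matching columns by position.
dataFrame.write().mode(SaveMode.Append).insertInto("baseTable");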
