Partitioned ORC table shows up empty in Hive - apache-spark

I've written a Spark dataframe to partitioned ORC files like this:
df.repartition("LOADED") \
    .write \
    .partitionBy("LOADED") \
    .format("orc") \
    .save("location")
Everything is written to disk correctly.
After that, I wanted to create a Hive table from it, like:
CREATE TABLE table USING ORC LOCATION 'location'
The command runs without any errors. But if I try to query the table, it's empty.
I've tried to do the same without partitioning, and it works just fine. What am I doing wrong?
The partitioned folders look like: LOADED=2019-11-16
For reference: I want to write the data to Azure Blob Storage, and create a Hive table from it in a different cluster.

You just need to update the partition info on the table so Hive can list the partitions present. This is done through the MSCK REPAIR command:
spark.sql("MSCK REPAIR TABLE <tableName>")
More info on this command is in the Hive documentation for MSCK REPAIR TABLE.
Quick example below:
scala> spark.sql("select * from table").show
20/03/28 17:12:46 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------+------+
|column|LOADED|
+------+------+
+------+------+
scala> spark.sql("MSCK REPAIR TABLE table")
scala> spark.sql("select * from table").show
+------+----------+
|column|    LOADED|
+------+----------+
|     a|2019-11-16|
|     c|2019-11-16|
|     b|2019-11-17|
+------+----------+

You are writing data directly to the location, not through HiveQL statements. In this case we need to update the metadata of the Hive table from Hive/Spark using:
msck repair table <db_name>.<table_name>;
or:
alter table <db_name>.<table_name> add partition(`LOADED`='<value>') location '<location_of_the_specific_partition>';
Then run the command below to list the partitions of the table:
show partitions <db_name>.<table_name>;
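As a concrete sketch for the layout in the question (mydb.mytable is a placeholder, and 'location' stands for the same path used in the save call):
spark.sql("MSCK REPAIR TABLE mydb.mytable")   // discovers the LOADED=... folders under the table location
// Or register a single partition explicitly (placeholder path under the table location):
spark.sql("""
  ALTER TABLE mydb.mytable
  ADD IF NOT EXISTS PARTITION (LOADED='2019-11-16')
  LOCATION 'location/LOADED=2019-11-16'
""")
spark.sql("SHOW PARTITIONS mydb.mytable").show(false)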

Related

Newly Inserted Hive records do not show in Spark Session of Spark Shell

I ran a simple Spark SQL program to get data from a Hive table into the Spark session.
scala> spark.sql("select count(1) from firsthivestreamtable").show(100,false)
+--------+
|count(1)|
+--------+
|36      |
+--------+
I ran insert statements to add 9 new records to the Hive table (directly on the Hive console) and validated that the Hive table has the additional rows inserted properly.
hive> select count(1) aa from firsthivestreamtable;
Total MapReduce CPU Time Spent: 4 seconds 520 msec
OK
45
Time taken: 22.173 seconds, Fetched: 1 row(s)
hive>
But the Spark session that was already open doesn't show the newly inserted 9 rows. When I run the count within that Spark session, it still shows 36 rows. Why is this happening?
scala> spark.sql("select count(1) from firsthivestreamtable").show(100,false)
+--------+
|count(1)|
+--------+
|36      |
+--------+
What needs to be done in the Spark session to get the refreshed (new) data into the session? The actual number of rows in the Hive table is now 45, not 36, since new data has been inserted.
This is in the spark-shell, and the Hive table is being loaded through the Spark Structured Streaming API.
When Spark retrieves the table from the metastore on first access, it lists the table's files and caches that listing in memory.
When we perform an insert, the new records go into a file that Spark is not aware of. There are two options (a minimal sketch of option 1 follows the list):
1. Trigger REFRESH TABLE <tblname> -> spark.sql("REFRESH TABLE firsthivestreamtable").
2. Restart the Spark application (the table and its files will be fetched again).
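A minimal spark-shell sketch of option 1, using the table name from the question:
// Before the refresh, Spark still answers from the cached file listing (36 rows).
spark.sql("REFRESH TABLE firsthivestreamtable")
// The catalog API call below is equivalent to the SQL statement above.
spark.catalog.refreshTable("firsthivestreamtable")
// Subsequent queries re-list the files and pick up the 9 new records.
spark.sql("select count(1) from firsthivestreamtable").show(false)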
The clue to the story is that this observed behaviour helps Spark recompute the DAG if that becomes necessary after a worker node failure. The other answer explains the mechanics; this one explains the reasoning behind them.

unable to insert into hive partitioned table from spark

I created an external partitioned table in Hive.
In the logs it shows numInputRows, which means the query is working and sending data. But when I connect to Hive using beeline and query with select * or count(*), the table is always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])(implicit spark: SparkSession): DataStreamWriter[T] = {
  import spark.implicits._
  val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
    .writeStream
    .partitionBy("year", "month", "day")
    .format("orc")
    .outputMode("append")
    .option("compression", "zlib")
    .option("path", _table_loc)
    .option("checkpointLocation", _table_checkpoint)
  hiveOrcSetWriter
}
What can be the issue? I'm unable to understand.
msck repair table tablename
This goes and checks the table's location and adds partitions to the metastore if new ones exist.
Add this step in your Spark process so the data can be queried from Hive (a sketch of one way to automate it follows).
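One way to wire that step into the Spark job itself is a streaming query listener that repairs the table after every micro-batch. This is only a sketch; mydb.events is a placeholder table name, and for big tables a scheduled repair (or adding the current partition with ALTER TABLE ... ADD PARTITION) is cheaper:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Re-sync the metastore so Hive/beeline can see the partitions the ORC sink just wrote.
    spark.sql("MSCK REPAIR TABLE mydb.events")   // placeholder db.table
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})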
Your streaming job is writing new partitions to the table location, but the Hive metastore is not aware of them.
When you run a select query on the table, Hive checks the metastore to get the list of table partitions. Since the information in the metastore is outdated, the data doesn't show up in the result.
You need to run
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
from Hive/Spark to update the metastore with the new partition info.

Spark 2.1 table loaded from Hive Metastore has null values

I am trying to migrate table definitions from one Hive metastore to another.
The source cluster has:
Spark 1.6.0
Hive 1.1.0 (cdh)
HDFS
The destination cluster is an EMR cluster with:
Spark 2.1.1
Hive 2.1.1
S3
To migrate the tables I did the following:
Copy data from HDFS to S3
Run SHOW CREATE TABLE my_table; in the source cluster
Modify the returned create query - change LOCATION from the HDFS path to the S3 path
Run the modified query on the destination cluster's Hive
Run SELECT * FROM my_table;. This returns 0 rows (expected)
Run MSCK REPAIR TABLE my_table;. This passes as expected and registers the partitions in the metastore.
Run SELECT * FROM my_table LIMIT 10; - 10 lines are returned with correct values
On the destination cluster, from Spark that is configured to work with the Hive Metastore, run the following code: spark.sql("SELECT * FROM my_table limit 10").show() - This returns null values!
The result returned from the Spark SQL query has all the correct columns, and the correct number of lines, but all the values are null.
To get Spark to correctly load the values, I can add the following properties to the TBLPROPERTIES part of the create query:
'spark.sql.partitionProvider'='catalog',
'spark.sql.sources.provider'='org.apache.spark.sql.parquet',
'spark.sql.sources.schema.numPartCols'='<partition-count>',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='<json-schema as seen by spark>',
'spark.sql.sources.schema.partCol.0'='<partition name 1>',
'spark.sql.sources.schema.partCol.1'='<partition name 2>',
...
The other side of this problem is that in the source cluster, Spark reads the table values without any problem and without the extra TBLPROPERTIES.
Why is this happening? How can it be fixed?
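One quick way to narrow this down (not something from the original post) is to read the copied files directly, bypassing the Hive table definition; a sketch with a placeholder S3 path, using the parquet source named in the TBLPROPERTIES above:
// If the values look correct here, the copied data is fine and the problem is in
// the table metadata Spark gets from the metastore.
val s3Path = "s3://my-bucket/path/to/my_table"   // placeholder; point it at the table LOCATION
spark.read.format("parquet").load(s3Path).show(10, false)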

How does createOrReplaceTempView work in Spark?

I am new to Spark and Spark SQL.
How does createOrReplaceTempView work in Spark?
If we register an RDD of objects as a table will spark keep all the data in memory?
createOrReplaceTempView creates (or replaces if that view name already exists) a lazily evaluated "view" that you can then use like a hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.
scala> val s = Seq(1,2,3).toDF("num")
s: org.apache.spark.sql.DataFrame = [num: int]
scala> s.createOrReplaceTempView("nums")
scala> spark.table("nums")
res22: org.apache.spark.sql.DataFrame = [num: int]
scala> spark.table("nums").cache
res23: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]
scala> spark.table("nums").count
res24: Long = 3
The data is fully cached only after the .count call; the cached dataset then shows up under the Storage tab of the Spark UI.
Related SO: spark createOrReplaceTempView vs createGlobalTempView
Relevant quote (comparing to persistent table): "Unlike the createOrReplaceTempView command, saveAsTable will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore." from https://spark.apache.org/docs/latest/sql-programming-guide.html#saving-to-persistent-tables
Note: createOrReplaceTempView was formerly registerTempTable.
createOrReplaceTempView creates a temporary view over the DataFrame; it is not persisted, but you can run SQL queries on top of it. If you want to keep the data, either persist the DataFrame or use saveAsTable to write it out as a permanent table (a sketch of both options follows).
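A minimal sketch of both options; the names nums and nums_saved are illustrative:
import spark.implicits._                 // spark-shell / SparkSession in scope

val df = Seq(1, 2, 3).toDF("num")
df.createOrReplaceTempView("nums")       // just a named, lazily evaluated view

// Option 1: keep the data in memory for the lifetime of the session.
df.persist()
spark.sql("select count(*) from nums").show()   // materializes the cache

// Option 2: write it out as a permanent table registered in the metastore.
df.write.mode("overwrite").saveAsTable("nums_saved")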
First, we read the data in .csv format into a DataFrame and then create a temp view.
Reading data in .csv format
val data = spark.read.format("csv").option("header","true").option("inferSchema","true").load("FileStore/tables/pzufk5ib1500654887654/campaign.csv")
Printing the schema
data.printSchema
data.createOrReplaceTempView("Data")
Now we can run SQL queries on top of the table view we just created
%sql SELECT Week AS Date, `Campaign Type`, Engagements, Country FROM Data ORDER BY Date ASC
Spark SQL supports writing programs using the Dataset and DataFrame APIs, and it also needs to support plain SQL.
To run SQL on a DataFrame, a table definition with column names is required first. If Spark created real tables for this, the Hive metastore would fill up with a lot of unnecessary tables, because Spark SQL natively sits on top of Hive. So it creates a temporary view instead, which lives only in the Spark session, can be queried like any other table, and is removed once the SparkContext stops.
To create such a view, the developer uses the utility called createOrReplaceTempView.

Hive Bucketed Tables enabled for Transactions

So we are trying to create a Hive table in ORC format, bucketed and enabled for transactions, using the statement below:
create table orctablecheck (id int, name string) clustered by (id) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true')
The table gets created in Hive and is also reflected in Beeline, both in the metastore and in Spark SQL (which we have configured to run on top of Hive JDBC).
We are now inserting data into this table via Hive. However, after insertion the data doesn't show up in Spark SQL; it only shows up correctly in Hive.
The data only appears in Spark SQL after we restart the Thrift server.
Is the transactional attribute set on your table? I observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of the command below in the Hive console:
desc extended <tablename>;
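The same metadata can also be inspected from the spark-shell; a small sketch using the table name from the question (the transactional flag, if set, shows up among the table properties):
spark.sql("DESCRIBE EXTENDED orctablecheck").show(100, false)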
If you need to access the transactional table from Spark, consider doing a major compaction and then try accessing it again:
ALTER TABLE <tablename> COMPACT 'major';
I created a transactional table in Hive, and stored data in it using Spark (records 1,2,3) and Hive (record 4).
After major compaction:
I can see all 4 records in Hive (using beeline),
but only records 1, 2, 3 in Spark (using spark-shell).
I am unable to update records 1, 2, 3 in Hive;
an update to record 4 in Hive works fine.
