DATABRICKS SQL - can't read data from partitioned parquet file

I'm trying to read Parquet files structured as:
filename/year=2020/month=12/day=1
The files live under a mounted Azure Storage path following this layout: /mnt/silver/root_folder/folder_A/parquet/year=2020/month=01/day=1
I'm trying to create a table using this syntax:
CREATE TABLE tablename
(
FIELD1 string,
...
,FIELDn Date
,Year INT
,Month INT
,Day INT
)
USING org.apache.spark.sql.parquet
LOCATION '/mnt/silver/root_folder/folder_A/parquet/'
OPTIONS( 'compression'='snappy')
PARTITIONED BY (Year, Month, Day)
But every option I've tried for LOCATION returns no results.
I already tried:
/mnt/silver/folder/folder/parquet/* and many variations of it.
Any suggestions, please?

You need to execute MSCK REPAIR TABLE <table_name> or ALTER TABLE <table_name> RECOVER PARTITIONS - either of them forces Spark to re-discover the data in the partitions.
From the documentation:
When creating a table using the PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore.
P.S. when you use Delta, that's done automatically, so that's one of the good reasons for using it :-)
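For reference, here is a minimal end-to-end sketch of that flow, assuming the table name and mount path from the question (FIELD1 stands in for the real column list):

# Recreate the partitioned table over the existing files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS tablename (
        FIELD1 STRING,
        Year INT,
        Month INT,
        Day INT
    )
    USING PARQUET
    PARTITIONED BY (Year, Month, Day)
    LOCATION '/mnt/silver/root_folder/folder_A/parquet/'
""")

# Partitions that already exist on storage are not registered automatically,
# so force a re-discovery (either command works):
spark.sql("MSCK REPAIR TABLE tablename")
# or: spark.sql("ALTER TABLE tablename RECOVER PARTITIONS")

spark.sql("SELECT COUNT(*) FROM tablename WHERE Year = 2020").show()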

Related

How to write to Hive table with static partition using PySpark?

I've created a Hive table with a partition like this:
CREATE TABLE IF NOT EXISTS my_table
(uid INT, num INT) PARTITIONED BY (dt DATE)
Then in PySpark, I have a DataFrame that I tried to write to the Hive table like this:
df.write.format('hive').mode('append').partitionBy('dt').saveAsTable('my_table')
Running this, I get an exception:
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
I then added this config:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
This time there was no exception, but the table wasn't populated either!
Then I removed the above config and added this:
hive.exec.dynamic.partition=false
I also changed the code to:
df.write.format('hive').mode('append').partitionBy(dt='2022-04-29').saveAsTable('my_table')
This time I am getting:
Dynamic partition is disabled. Either enable it by setting hive.exec.dynamic.partition=true or specify partition column values
The Spark job I want to run will receive daily data, so I guess what I want is a static partition, but how does that work?
If you haven't predefined all the partitions you will need to use:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
Remember that Hive is schema-on-read, and it won't automagically organize your data into partitions. You need to inform the metastore of the partitions.
You will need to do that manually with one of the two commands:
alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of the specific partition>';
or
MSCK REPAIR TABLE [tablename]
If the table is already created, and you are using append mode anyway, you can use insertInto instead of saveAsTable, and you don't even need .partitionBy('dt'):
df.write.format('hive').mode('append').insertInto('my_table')
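Putting it together, a minimal PySpark sketch under the same assumptions (table and columns from the question; the app name and sample row are made up, and the Hive configs are set on the session rather than in hive-site.xml):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("write-partitioned-hive")                      # hypothetical name
         .config("hive.exec.dynamic.partition", "true")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .enableHiveSupport()
         .getOrCreate())

# insertInto matches columns by position, so the partition column 'dt'
# must be the last column of the DataFrame.
df = (spark.createDataFrame([(1, 10, "2022-04-29")], ["uid", "num", "dt"])
           .selectExpr("uid", "num", "CAST(dt AS DATE) AS dt"))

# Appends into the existing partitioned table; no partitionBy() needed.
df.write.mode("append").insertInto("my_table")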

How to perform MSCK REPAIR TABLE to load only specific partitions

I have more than two months of data in AWS S3, partitioned and stored by day. I want to start querying the data through the external table that I created.
Currently I see only a couple of partitions, and I want to make sure my metadata picks up all of them. I tried running msck repair table tablename in Hive after logging in to the EMR cluster's master node. However, maybe due to the data volume, that command takes a very long time to execute.
Can I run msck repair table so that it loads only a specific day? Does MSCK allow loading specific partitions?
You can use
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
...as described in the Hive DDL documentation.
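For example, to register only the days you need (table name, partition column, and S3 path below are hypothetical; the statement can be run from the Hive CLI or via spark.sql):

# Register a handful of specific days instead of running a full MSCK REPAIR.
for day in ["2023-01-14", "2023-01-15"]:
    spark.sql(f"""
        ALTER TABLE my_external_table
        ADD IF NOT EXISTS
        PARTITION (dt = '{day}')
        LOCATION 's3://my-bucket/events/dt={day}/'
    """)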

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run a Hive query, it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL query has a WHERE condition on the partition column), it does not show that a partition filter is applied. For a Hive managed table, however, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark? Thanks.
Update:
For some reason, only the Spark plan does not show the partition filters. However, when you look at the data loaded, it only loads the data it needs from the partitions.
For example, WHERE rating=0 loads only one 1 MB file; when I don't have the filter, it reads all 3 partitions for 3 MB.
tl;dr: set the following before running SQL against the external table:
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because the table is external vs. managed.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using the Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of table is stored in the Hive metastore; the only difference is in the provider field of the tables' TBLPROPERTIES (describe extended <tblName>). The value of the property is orc or empty for a Spark table and hive for a Hive native table.
How Spark uses this information
When the provider is not hive (datasource table), Spark uses its native way of processing the data.
If the provider is hive, Spark uses Hive code to process the data.
File format
Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
.doc("When set to true, the built-in Parquet reader and writer are used to process " +
"parquet tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
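A quick way to check which code path applies and whether pruning actually kicks in; a minimal sketch, assuming a table named my_table and the rating partition column mentioned in the question's update:

# Inspect the provider recorded in the metastore: 'hive' means a Hive native
# table, orc/parquet or empty means a Spark datasource table.
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)

# Let Spark use its built-in ORC reader for Hive-created ORC tables
# (this is the flag from the tl;dr above).
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")

# With the built-in reader, the physical plan should list PartitionFilters.
spark.sql("SELECT * FROM my_table WHERE rating = 0").explain(True)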
I also ran into this kind of problem with multiple joins of internal and external tables.
None of the tricks worked, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?

Creating hive table on spark output on HDFS

I have a Spark job that runs every 30 minutes and writes its output to HDFS (/tmp/data/1497567600000). The job runs continuously in the cluster.
How can I create a Hive table on top of this data? I have seen one solution on Stack Overflow that creates a Hive table on top of data partitioned by a date field, like this:
CREATE EXTERNAL TABLE `mydb.mytable`
(`col1` string,
`col2` decimal(38,0),
`create_date` timestamp,
`update_date` timestamp)
PARTITIONED BY (`my_date` string)
STORED AS ORC
LOCATION '/tmp/out/'
and the solution suggests altering the table like this:
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
But in my case I have no idea how the output directories are being written, so I clearly can't create the partitions as suggested above.
How can I handle this case, where the output directories are written with arbitrary timestamp names and are not in the format /tmp/data/timestamp=1497567600000?
How can I make Hive pick up the data under the directory /tmp/data?
I can suggest two solutions:
If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add a Hive partition for each hour, and just write to the relevant partition.
You can write a bash script responsible for adding the Hive partitions (see the sketch after this list), which can be achieved by:
listing the HDFS subdirectories with the command hadoop fs -ls /tmp/data
listing the Hive partitions for the table with the command hive -e 'show partitions table;'
comparing the two lists to find the missing partitions
adding the new Hive partitions with the command shown above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
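A PySpark equivalent of that script might look like this; it is only a sketch, reusing the table name, partition column, and /tmp/out paths from the ALTER TABLE example above and assuming the subdirectories are named by date:

import subprocess

# 1. List the HDFS subdirectories (same as `hadoop fs -ls /tmp/out`).
ls_output = subprocess.run(
    ["hadoop", "fs", "-ls", "/tmp/out"],
    capture_output=True, text=True, check=True
).stdout
hdfs_dates = {line.rsplit("/", 1)[-1]
              for line in ls_output.splitlines() if "/tmp/out/" in line}

# 2. List the partitions already known to the metastore.
known_dates = {row.partition.split("=")[1]
               for row in spark.sql("SHOW PARTITIONS mydb.mytable").collect()}

# 3. Register whatever is missing.
for d in sorted(hdfs_dates - known_dates):
    spark.sql(f"ALTER TABLE mydb.mytable ADD IF NOT EXISTS "
              f"PARTITION (my_date='{d}') LOCATION '/tmp/out/{d}'")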

Hive Bucketed Tables enabled for Transactions

We are trying to create a bucketed Hive table in ORC format, enabled for transactions, using the statement below:
create table orctablecheck ( id int,name string) clustered by (sno) into 3 buckets stored as orc TBLPROPERTIES ( 'transactional'='true')
The table gets created in Hive and also shows up in Beeline, both in the metastore and in Spark SQL (which we have configured to run on top of Hive JDBC).
We are now inserting data into this table via Hive. However, after insertion the data doesn't show up in Spark SQL. It only shows correctly in Hive.
The data only appears in Spark SQL if we restart the Thrift Server.
Is the transactional attribute set on your table? I have observed that Hive's transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of the command below in the Hive console:
desc extended <tablename>;
If you need to access a transactional table, consider doing a major compaction and then try accessing the table:
ALTER TABLE <tablename> COMPACT 'major';
I created a transactional table in Hive, and stored data in it using Spark (records 1,2,3) and Hive (record 4).
After a major compaction:
I can see all 4 records in Hive (using Beeline)
only records 1, 2, 3 in Spark (using spark-shell)
I am unable to update records 1, 2, 3 in Hive
an update to record 4 in Hive is OK
