Cannot view data in hive partition table - apache-spark

I have an external table with a partition column called rundate. I can load data into the table using
DataFrame.write.mode(SaveMode.Overwrite).orc("s3://test/table")
I then create a partition using
spark.sql("ALTER TABLE table ADD IF NOT EXISTS PARTITION(rundate = '2017-12-19')")
The code works fine and I can see the partitions, but I cannot see any data in the Hive table.

You have not saved the partition data in the correct folder structure, and you have also manually added a partition where no data exists.
Two things:
1. First, make sure you save the data at the location where the external table was created, and that the folder structure matches what Hive expects. For example, assume your external table is named table, the partition column is rundate, the partition value is 2017-12-19, and the external table points to s3://test/table. Then save the data for partition 2017-12-19 as follows:
DataFrame.write.mode(SaveMode.Overwrite).orc("s3://test/table/rundate=2017-12-19/")
2. Once the save is successful, run the command below to update the Hive metastore with the newly added partition.
Syntax: MSCK REPAIR TABLE <tablename>
MSCK REPAIR TABLE table
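For completeness, here is a minimal sketch (not from the original answer) of letting Spark build the rundate=... directory layout itself and then refreshing the metastore. It assumes a DataFrame named df that still contains the rundate column, plus the same table name and S3 location as above; the staging path is hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Hypothetical source DataFrame; in practice use whatever DataFrame you already have.
val df = spark.read.orc("s3://test/staging")
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("rundate")            // writes s3://test/table/rundate=<value>/ subfolders
  .orc("s3://test/table")
// Register the newly written partitions in the Hive metastore.
spark.sql("MSCK REPAIR TABLE table")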

Related

Is there a simple way to update location to all partitions in hive external table?

I create a DataFrame with Spark daily and save it to an HDFS location.
Before saving, I partition the data by some fields, so the path to the data looks like this:
/warehouse/tablespace/external/hive/table_name/...
The table_name directory has partitions like:
table_name/field=value1
table_name/field=value2
I create an external table to work with the data in Hive and set its location to the data path.
Each day I want to change the location to the new data path, but if I use
ALTER TABLE table
SET LOCATION 'new location'
queries still return the old data because the partitions' locations don't change.
Is there any way to tell Hive to look for partitions under the new location without changing them one by one?
If you don't want to keep the old partitions, you can drop them with the command below, using a clause that matches all partitions:
ALTER TABLE table_name DROP IF EXISTS PARTITION (field != 'non_exist_value');
Then you can check the remaining partitions using the command below:
SHOW PARTITIONS table_name;
After that, you can change the table to the new location and repair it so Hive creates the partitions under the new location (the partition column name in the new location must be the same as in the old one):
ALTER TABLE table_name SET LOCATION '/new_location';
MSCK REPAIR TABLE table_name;
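As a hedged sketch, the same sequence can also be driven from Spark; table_name, field and /new_location are the same placeholders used above.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// 1. Drop all partition metadata (the data files stay in place for an EXTERNAL table).
spark.sql("ALTER TABLE table_name DROP IF EXISTS PARTITION (field != 'non_exist_value')")
// 2. Point the table at the new root directory.
spark.sql("ALTER TABLE table_name SET LOCATION '/new_location'")
// 3. Re-discover partitions from the field=value directories under the new root.
spark.sql("MSCK REPAIR TABLE table_name")
// Sanity check: the partitions should now resolve under /new_location.
spark.sql("SHOW PARTITIONS table_name").show(truncate = false)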

External hive table on top of parquet returns no data

I created a Hive table on top of a Parquet folder written via Spark. On one test server it runs fine and returns results (Hive version 2.6.5.196), but in production it returns no records (Hive 2.6.5.179). Could someone please point out what the exact issue could be?
If you created the table on top of an existing partition structure, you have to make the partitions at that location known to the table.
MSCK REPAIR TABLE table_name; -- adds missing partitions
SELECT * FROM table_name; -- should return records now
This problem shouldn't happen if there are only plain files in that location and they are in the expected format.
You can verify with:
SHOW CREATE TABLE table_name; -- to see the expected format
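For illustration, a minimal end-to-end sketch of this flow from Spark; the table name, schema and location below are invented, not taken from the question.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// External table declared on top of Parquet files that Spark already wrote (hypothetical schema/path).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS table_name (id BIGINT, name STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION '/data/table_name'
""")
// Until the partitions are registered, the table reports zero rows.
spark.sql("MSCK REPAIR TABLE table_name")
spark.sql("SELECT * FROM table_name LIMIT 10").show()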
Regarding "created hive table on top of a parquet folder written via spark":
Check whether the database you are using is available:
show databases;
Check the DDL of the table you created on your test server against the one on production:
show create table table_name;
Make sure both DDLs match exactly.
Run msck repair table table_name to load the incremental data, or the data from all the partitions.
Then select * from table_name to view the records.
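A hedged sketch of running those checks from Spark on each cluster so the outputs can be diffed; db_name and table_name are placeholders.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("SHOW DATABASES").show(truncate = false)               // confirm db_name exists
spark.sql("USE db_name")
spark.sql("SHOW CREATE TABLE table_name").show(truncate = false) // capture the DDL to diff across clusters
spark.sql("MSCK REPAIR TABLE table_name")                        // register partitions found on disk
spark.sql("SELECT count(*) FROM table_name").show()              // should now be non-zero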

Add New Partition to Hive External Table via databricks

I have a folder that previously had subfolders based on ingestiontime, which is also the original partition used in its Hive table.
So the folder looks like:
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
Inside each ingestiontime folder, the data is in Parquet format.
Now, in the same myStreamingData folder, I am adding another folder named businessname that holds similar data.
So my folder structure now looks like:
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........
So I need to add the data in the businessname partition to my current Hive table too.
To achieve this, I ran the following ALTER query (on Databricks):
%sql
alter table gp_hive_table add partition (businessname=007,ingestiontime=20200712230000) location "s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000"
But I am getting this error:
Error in SQL statement: AnalysisException: businessname is not a valid partition column in table `default`.`gp_hive_table`.;
What am I doing incorrectly here?
Thanks in advance.
Since you're already using Databricks and this is a streaming use case, you should definitely take a serious look at using Delta Lake tables.
You won't have to mess with explicit ... ADD PARTITION and MSCK statements.
Delta Lake's ACID properties will ensure your data is committed properly; if your job fails you won't end up with partial results. As soon as the data is committed, it is available to users (again, without the MSCK and ADD PARTITION statements).
Just change 'USING PARQUET' to 'USING DELTA' in your DDL.
You can also CONVERT your existing Parquet table to a Delta Lake table and then start using INSERT, UPDATE, DELETE, MERGE INTO, and COPY INTO from Spark batch and Structured Streaming jobs. OPTIMIZE will clean up the small-file problem.
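A hedged sketch of that Delta route on Databricks; the exact statements and partition column types are illustrative, and it assumes all files already sit under a consistent businessname=/ingestiontime= layout.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// One-time, in-place conversion of the existing Parquet layout to Delta.
spark.sql("""
  CONVERT TO DELTA parquet.`s3://MyDevBucket/dev/myStreamingData`
  PARTITIONED BY (businessname STRING, ingestiontime BIGINT)
""")
// Register the Delta table; no ADD PARTITION or MSCK needed from here on.
spark.sql("""
  CREATE TABLE IF NOT EXISTS gp_hive_table
  USING DELTA
  LOCATION 's3://MyDevBucket/dev/myStreamingData'
""")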
ALTER TABLE gp_hive_table ADD PARTITION adds a partition (a data location, not a new column) to a table with an already defined partitioning scheme. It does not change the current partitioning scheme; it only adds partition metadata recording that, at some location, there is a partition corresponding to some partition-column value.
If you want to change the partition columns, you need to recreate the table:
Drop the table (check that it is EXTERNAL first): DROP TABLE gp_hive_table;
Create the table with the new partitioning columns. Partitions WILL NOT be created automatically.
Now you can add partitions using ALTER TABLE ADD PARTITION, or use MSCK REPAIR TABLE to create them automatically based on the directory structure. The directory structure should already match the partitioning scheme before you execute these commands; a sketch of the whole sequence follows below.
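A minimal sketch of that recreate-and-repair sequence, assuming Spark with Hive support; the non-partition column (payload) and its type are placeholders, not the real schema.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// 1. Drop only the metadata -- the table must be EXTERNAL, otherwise the data is deleted too.
spark.sql("DROP TABLE IF EXISTS gp_hive_table")
// 2. Recreate it with the new partitioning scheme (placeholder data column).
spark.sql("""
  CREATE EXTERNAL TABLE gp_hive_table (payload STRING)
  PARTITIONED BY (businessname STRING, ingestiontime BIGINT)
  STORED AS PARQUET
  LOCATION 's3://MyDevBucket/dev/myStreamingData'
""")
// 3. Discover partitions whose directories already follow businessname=.../ingestiontime=...
spark.sql("MSCK REPAIR TABLE gp_hive_table")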
So,
building upon the suggestion from #leftjoin,
instead of having a Hive table without businessname as one of the partition columns, what I did is:
Step 1 -> Create the Hive table with PARTITIONED BY (businessname long, ingestiontime long)
Step 2 -> Execute MSCK REPAIR TABLE <Hive_Table_name> to auto-add partitions.
Step 3 ->
Now, there are ingestiontime folders which are not inside a businessname folder, i.e.
folders like:
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
I wrote a small piece of code to fetch all such partitions and then ran the following query for each of them:
ALTER TABLE <hive_table_name> ADD PARTITION (businessname=<some_value>,ingestiontime=<ingestion_time_partition_name>) LOCATION "<s3_location_of_all_partitions_not_belonging_to_a_specific_businesskey>"
This solved my issue.
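The original helper wasn't shared; below is a hedged guess at what it might look like. The defaultBusiness value and the directory-listing logic are assumptions for illustration, not the asker's code.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val root = "s3://MyDevBucket/dev/myStreamingData"
val defaultBusiness = "000"   // hypothetical businessname for the legacy folders
val fs = new Path(root).getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(root))
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .filter(_.startsWith("ingestiontime="))          // only the un-nested, legacy folders
  .foreach { dir =>
    val ingestionTime = dir.stripPrefix("ingestiontime=")
    spark.sql(
      s"""ALTER TABLE gp_hive_table ADD IF NOT EXISTS
         |PARTITION (businessname=$defaultBusiness, ingestiontime=$ingestionTime)
         |LOCATION '$root/$dir'""".stripMargin)
  }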

drop table command is not deleting path of hive table which was created by spark-sql

I am trying to drop an (internal) table that was created with Spark SQL. Somehow the table gets dropped, but the table's location still exists. Can someone let me know how to fix this?
I tried both Beeline and Spark SQL:
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path"
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create the table. If the table is created as an external Hive table from Spark, i.e. the data is in HDFS and Hive provides a table view on top of it, the drop table command will only delete the metastore information and will not delete the data from HDFS.
So there are a couple of alternative strategies you could take:
1. Manually delete the data from HDFS using the hadoop fs -rm -r -f command (see the sketch at the end of this answer).
2. Alter the table you want to delete to change it from an external table to an internal table, then drop it:
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='FALSE');
drop table <table-name>;
The first statement converts the external table to an internal (managed) table, and the second statement drops the table along with its data.
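For the first option, a hedged sketch of deleting the leftover directory programmatically from Spark rather than with the hadoop CLI; the path is the same placeholder as in the question.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Dropping an external table only removes the metastore entry.
spark.sql("DROP TABLE IF EXISTS something")
// Then remove the leftover files by hand.
val tablePath = new Path("hdfs://path")   // placeholder path from the question
val fs = tablePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
if (fs.exists(tablePath)) {
  fs.delete(tablePath, true)              // recursive delete of the leftover data
}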

Hive PartitionFilter are not Applied

I am facing this issue with Hive.
When I query a table that is partitioned on a date column,
SELECT count(*) from table_name where date='2018-06-01'
the query reads the entire table and keeps running for hours.
Using EXPLAIN, I found that Hive is not applying the partition filter to the query.
I have double-checked that the table is partitioned on the date column using desc table_name.
The execution engine is Spark, and the data is stored in Azure Data Lake in Parquet format.
However, I have another table in the database for which the partition filter is applied, and it executes as expected.
Could there be some issue with the Hive metadata, or is it something else?
I found the cause of this issue:
Hive wasn't applying the partition filters on some tables because those tables were cached.
So when I restarted the Thrift server, the cache was cleared and the partition filters were applied.
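Restarting the Thrift server is what resolved it here; as an alternative (my suggestion, not the answer's), the cached entry for a single table can usually be cleared from a Spark session without a restart.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("UNCACHE TABLE IF EXISTS table_name")   // drop any in-memory copy of the table
spark.catalog.refreshTable("table_name")          // invalidate cached metadata and plans
// Re-check that the partition filter now shows up in the plan.
spark.sql("EXPLAIN SELECT count(*) FROM table_name WHERE date = '2018-06-01'").show(truncate = false)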
