How to write to Hive table with static partition using PySpark?

I've created a Hive table with a partition like this:
CREATE TABLE IF NOT EXISTS my_table
(uid INT, num INT) PARTITIONED BY (dt DATE)
Then in PySpark I have a DataFrame, and I've tried to write it to the Hive table like this:
df.write.format('hive').mode('append').partitionBy('dt').saveAsTable('my_table')
Running this I'm getting an exception:
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
I then added this config:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
This time no exception but the table wasn't populated either!
Then I removed the above config and added this:
hive.exec.dynamic.partition=false
I also altered the code to:
df.write.format('hive').mode('append').partitionBy(dt='2022-04-29').saveAsTable('my_table')
This time I am getting:
Dynamic partition is disabled. Either enable it by setting hive.exec.dynamic.partition=true or specify partition column values
The Spark job I want to run will have daily data, so I guess what I want is a static partition, but how does it work?

If you haven't predefined all the partitions, you will need to use:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
Remember that Hive is schema-on-read, and it won't automagically sort your data into partitions. You need to inform the metastore of the partitions.
You will need to do that manually with one of these two commands:
alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of_the_specific_partition>';
or
MSCK REPAIR TABLE [tablename]
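As a rough illustration, here is a minimal PySpark sketch of that flow, using the my_table/uid/num/dt names from the question; the SET statements are one common way to pass these Hive settings, and the MSCK line is only needed if partition directories were created outside of Spark/Hive:
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("write-partitioned-hive")
         .enableHiveSupport()
         .getOrCreate())

# enable dynamic partitioning for this session
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# toy DataFrame matching the table schema (uid INT, num INT, dt DATE)
df = (spark.createDataFrame([(1, 10, "2022-04-29")], ["uid", "num", "dt"])
           .withColumn("dt", F.to_date("dt")))

df.write.format("hive").mode("append").partitionBy("dt").saveAsTable("my_table")

# if partition folders were written outside Spark/Hive, register them afterwards:
spark.sql("MSCK REPAIR TABLE my_table")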

If the table is already created, and you are using append mode anyway, you can use insertInto instead of saveAsTable, and you don't even need .partitionBy('dt'):
df.write.format('hive').mode('append').insertInto('my_table')
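Tying this back to the static-partition part of the question, here is a hedged sketch of both variants, assuming the uid/num/dt schema above (insertInto matches columns by position, so dt must be the last column; the SQL form is one way to pin a static partition value, and the "staging" view name is just an example):
# dynamic: the partition value comes from the dt column itself
df.select("uid", "num", "dt").write.mode("append").insertInto("my_table")

# static: fix the partition value in SQL and insert only the data columns
df.select("uid", "num").createOrReplaceTempView("staging")
spark.sql("""
    INSERT INTO my_table PARTITION (dt = '2022-04-29')
    SELECT uid, num FROM staging
""")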

Related

DATABRICKS SQL - can't read data from partitioned parquet file

I'm trying to read parquet files structured as:
filename/year=2020/month=12/day=1
The files live under mounted Azure Storage, following this layout: /mnt/silver/root_folder/folder_A/parquet/year=2020/month=01/day=1
I'm trying to create a table using this syntax:
CREATE TABLE tablename
(
FIELD1 string,
...
,FIELDn Date
,Year INT
,Month INT
,Day INT
)
USING org.apache.spark.sql.parquet
LOCATION '/mnt/silver/root_folder/folder_A/parquet/'
OPTIONS( 'compression'='snappy')
PARTITIONED BY (Year, Month, Day)
But every option I tried for LOCATION returns no results.
I already tried:
/mnt/silver/folder/folder/parquet/* and also many variations of it.
Any suggestion please?
You need to execute MSCK REPAIR TABLE <table_name> or ALTER TABLE <table_name> RECOVER PARTITIONS - either of them forces re-discovery of the data in the partitions.
From documentation:
When creating a table using PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore
P.S. when you use Delta, that's done automatically, so that's one of the good reasons for using it :-)
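If it helps, a hedged PySpark/notebook sketch of that repair step, using the tablename from the question:
# register the existing year=/month=/day= folders with the metastore
spark.sql("MSCK REPAIR TABLE tablename")
# or, equivalently on Databricks:
spark.sql("ALTER TABLE tablename RECOVER PARTITIONS")
# the partitions should now be visible
spark.sql("SHOW PARTITIONS tablename").show(truncate=False)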

Write files inside Hive table hdfs folder and make them available to be queried from Hive

I am using Spark 2.2.1, which has a useful option to specify how many records I want to save in each output file; this feature lets me avoid a repartition before writing.
However, it seems this option is usable only with the FileWriter interface and not with the DataFrameWriter one:
this way the option is ignored
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while this way it works
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am directly writing ORC files into the HDFS folder of the specified Hive table. The problem is that if I query the Hive table after the insertion, this data is not recognized by Hive.
Do you know if there's a way to write directly partition files inside the hive metastore and make them available also through the Hive table?
Debug steps:
1. Check the type of file your Hive table consumes:
SHOW CREATE TABLE table_name
and check the "STORED AS" clause.
For better efficiency, save your output as Parquet, into the partition location (you can see it under "LOCATION" in the output of the query above). If the table is stored as another specific type, write the files in that type.
2. If you are saving data into a partition and creating the partition folder manually, avoid that. Create the partition using:
alter table {table_name} add partition ({partition_column}={value});
3. After creating the output files in Spark, you can reload them and check for "_corrupt_record" (print the DataFrame to check this).
Adding to this, I also found out that the command 'MSCK REPAIR TABLE' automatically discovers new partitions inside the Hive table folder.
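For what it's worth, a hedged sketch of that workflow in PySpark; the warehouse path and mydb.hive_table/dt names below are placeholders, and df/spark are the DataFrame and Hive-enabled session from the question:
# write ORC straight under the table's partition directory, honouring maxRecordsPerFile
(df.write
   .option("maxRecordsPerFile", 10000)
   .mode("overwrite")
   .orc("/user/hive/warehouse/mydb.db/hive_table/dt=2022-04-29"))

# make Hive aware of the new partition (either statement works)
spark.sql("ALTER TABLE mydb.hive_table ADD IF NOT EXISTS PARTITION (dt='2022-04-29')")
spark.sql("MSCK REPAIR TABLE mydb.hive_table")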

Creating hive table on spark output on HDFS

I have a Spark job which runs every 30 minutes and writes its output to HDFS (/tmp/data/1497567600000). This job runs continuously in the cluster.
How can I create a Hive table on top of this data? I have seen one solution on Stack Overflow which creates a Hive table on top of data partitioned by a date field, like this:
CREATE EXTERNAL TABLE `mydb.mytable`
(`col1` string,
`col2` decimal(38,0),
`create_date` timestamp,
`update_date` timestamp)
PARTITIONED BY (`my_date` string)
STORED AS ORC
LOCATION '/tmp/out/'
and the solution suggests altering the table like this:
ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
But, in my case, I have no idea how the output directories are being written, so I clearly can't create the partitions as suggested above.
How can I handle this case, where the output directories are written on a timestamp basis and not in the format /tmp/data/timestamp=1497567600000?
How can I make Hive pick up the data under the directory /tmp/data?
I can suggest two solutions:
If you can change your Spark job, then you can partition your data by hour (e.g. /tmp/data/1, /tmp/data/2), add Hive partitions for each hour, and just write to the relevant partition.
You can write a bash script responsible for adding Hive partitions, which can be achieved by (a rough PySpark sketch of this reconciliation follows after the list):
listing the HDFS subdirectories with hadoop fs -ls /tmp/data
listing the Hive partitions for the table with hive -e 'show partitions table;'
comparing the two lists to find missing partitions
adding the new Hive partitions with the command shown above: ALTER TABLE mydb.mytable ADD PARTITION (my_date=20160101) LOCATION '/tmp/out/20160101'
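A rough PySpark sketch of that reconciliation; the table name, base path and the my_date=<dir> layout are placeholders, and it assumes a Hive-enabled SparkSession spark plus the hadoop CLI on the PATH:
import subprocess

table = "mydb.mytable"
base_path = "/tmp/out"

# 1. list HDFS subdirectories; the last path component is the partition value
ls_out = subprocess.run(["hadoop", "fs", "-ls", base_path],
                        capture_output=True, text=True, check=True).stdout
hdfs_dirs = {line.split("/")[-1] for line in ls_out.splitlines() if base_path in line}

# 2. list partitions already known to Hive (rows look like 'my_date=20160101')
known = {row[0].split("=")[1] for row in spark.sql(f"SHOW PARTITIONS {table}").collect()}

# 3. register the missing ones
for value in hdfs_dirs - known:
    spark.sql(f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (my_date='{value}') "
              f"LOCATION '{base_path}/{value}'")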

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in this question, partitionBy will delete the full existing hierarchy of partitions at path and replace them with the partitions in dataFrame. Since new incremental data for a particular day will come in periodically, what I want is to replace only those partitions in the hierarchy that dataFrame has data for, leaving the others untouched.
To do this it appears I need to save each partition individually using its full path, something like this:
singlePartition.write.mode(SaveMode.Overwrite).parquet(path + "/eventdate=2017-01-01/hour=0/processtime=1234567890")
However I'm having trouble understanding the best way to organize the data into single-partition DataFrames so that I can write them out using their full path. One idea was something like:
dataFrame.repartition("eventdate", "hour", "processtime").foreachPartition ...
But foreachPartition operates on an Iterator[Row] which is not ideal for writing out to Parquet format.
I also considered using a select...distinct eventdate, hour, processtime to obtain the list of partitions, and then filtering the original data frame by each of those partitions and saving the results to their full partitioned path. But the distinct query plus a filter for each partition doesn't seem very efficient since it would be a lot of filter/write operations.
I'm hoping there's a cleaner way to preserve existing partitions for which dataFrame has no data?
Thanks for reading.
Spark version: 2.1
This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic by using:
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
So, my spark session is configured like this:
spark = SparkSession.builder.appName('AppName').getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
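With that setting in place, a hedged sketch of the original write (names taken from the question; note that this config requires Spark 2.3 or later):
# overwrite now replaces only the partitions present in dataFrame,
# leaving the other eventdate/hour/processtime partitions untouched
(dataFrame.write
    .mode("overwrite")
    .partitionBy("eventdate", "hour", "processtime")
    .parquet(path))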
The mode option Append has a catch!
df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName)
I've tested and saw that this will keep the existing partition files. However, the problem this time is the following: If you run the same code twice (with the same data), then it will create new parquet files instead of replacing the existing ones for the same data (Spark 1.6). So, instead of using Append, we can still solve this problem with Overwrite. Instead of overwriting at the table level, we should overwrite at the partition level.
df.write.mode(SaveMode.Overwrite)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName + "/y=" + year + "/m=" + month + "/d=" + day)
See the following link for more information:
Overwrite specific partitions in spark dataframe write method
(I've updated my reply after suriyanto's comment. Thnx.)
I know this is very old. As I can't see any solution posted, I will go ahead and post one. This approach assumes you have a Hive table over the directory you want to write to.
One way to deal with this problem is to create a temp view from dataFrame, which should be added to the table, and then use a normal Hive-style insert overwrite table ... command:
dataFrame.createOrReplaceTempView("temp_view")
spark.sql("insert overwrite table table_name partition ('eventdate', 'hour', 'processtime')select * from temp_view")
It preserves old partitions while (over)writing to only new partitions.
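One caveat worth hedging: because all three partition columns in that PARTITION clause are dynamic, the insert may also need dynamic partitioning enabled in nonstrict mode first, e.g.:
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")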

How to overwrite specific top level partition in multi-level partitioning in Spark

My table has 4 partition columns, in this order: period_dt, year, month, date.
period_dt is a static partition (its value is an argument) and year, month, date are dynamic. So I know the period_dt partition value which I want to overwrite.
newInputDF.write().mode("overwrite").partitionBy("period_dt","year","month","date").parquet("trg_file_path");
Using the above command, Spark overwrites all partitions. But in my case, for example, if a partition exists I want to overwrite it, otherwise append to it. I want to overwrite at the period_dt level.
One method would be to provide the complete path:
inputDFTwo.write().mode("overwrite").parquet("trg_tbl/period_dt=2016-09-21/year=2016/month=09/date=21");
But year, month, date are dynamic.
A second method is to use a Hive query with HiveContext.
Is there any other way to overwrite specific partition?
Solutions I came up with:
hiveContext.sql("INSERT OVERWRITE TABLE table_name PARTITION (period_dt='2016-06-08', year, month, date) SELECT x, y, z, year, month, date FROM DFTmpTable");
and
DeleteHDFSfile(/table/period_dt='2016-06-08')
DF.write().mode("append").partitionBy("period_dt","year","month","date").parquet("path")
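For completeness, a hedged PySpark sketch of the first solution, spelling out the steps around that INSERT (names follow the question; in the mixed PARTITION clause the static period_dt must come before the dynamic columns):
# dynamic partitioning must be enabled for the year/month/date columns;
# strict mode is satisfied because period_dt is given a static value
spark.sql("SET hive.exec.dynamic.partition=true")

newInputDF.createOrReplaceTempView("DFTmpTable")
spark.sql("""
    INSERT OVERWRITE TABLE table_name
    PARTITION (period_dt='2016-06-08', year, month, date)
    SELECT x, y, z, year, month, date FROM DFTmpTable
""")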
