Delete __HIVE_DEFAULT_PARTITION__ using a Spark notebook - apache-spark

I spent a few hours trying everything to delete records whose partition value is __HIVE_DEFAULT_PARTITION__ from my Delta Lake table using a Spark notebook. I figured it out and am posting the answer below. For the record, my partition column is named Period.
This occurs when your partition column contains a NULL value.
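For context, here is a minimal sketch (the table path and values are made up for illustration) of how that folder appears in the first place: when a partitioned write encounters a NULL in the partition column, Spark stores the row under a Period=__HIVE_DEFAULT_PARTITION__ directory.
from pyspark.sql import Row

# Hypothetical demo table: the second row has a NULL Period, so it is written to
# .../Period=__HIVE_DEFAULT_PARTITION__/ in the underlying storage layout.
demo_df = spark.createDataFrame([Row(Period="2023-01", Value=1),
                                 Row(Period=None, Value=2)])
demo_df.write.format("delta").partitionBy("Period").mode("overwrite").save("/tmp/demo_delta")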

Make sure no other notebooks are updating the Delta Lake table when you run this.
The partition column in my table is named Period.
sourceFile is a variable containing the path to the storage account container and folder to affect.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta.tables import *

# Allow vacuum to run with a 0-hour retention window
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)

# Delete the rows whose partition value is NULL, then remove the orphaned files
deltaTable = DeltaTable.forPath(spark, sourceFile)
deltaTable.delete("Period is NULL")
deltaTable.vacuum(0)
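As a follow-up precaution (not part of the original answer), you may want to turn the retention safety check back on once the one-off zero-retention vacuum has finished:
# Re-enable Delta's retention safety check after vacuum(0)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", True)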

Related

How to prevent duplicate entries from entering the Delta Lake on Azure Storage

I have a DataFrame stored in Delta format in ADLS. When I try to append new, updated rows to that Delta Lake table, is there any way I can delete the old existing record in Delta and add the new, updated record?
There is a unique column in the schema of the DataFrame stored in Delta, by which we can check whether a record is updated or new.
This is a task for the MERGE command: you define the merge condition (your unique column) and then the actions. In SQL it could look like the following (column is your unique column, and updates could be your DataFrame registered as a temporary view):
MERGE INTO destination
USING updates
ON destination.column = updates.column
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED
THEN INSERT *
In Python it could look like the following:
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/data/destination/")
deltaTable.alias("dest").merge(
    updatesDF.alias("updates"),
    "dest.column = updates.column") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()
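If you prefer to run the SQL form of the MERGE from a PySpark notebook, a minimal sketch (reusing the destination table and updatesDF placeholders from above) is to register the DataFrame as a temporary view first:
# Expose the updates DataFrame to Spark SQL, then run the same MERGE statement
updatesDF.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO destination
    USING updates
    ON destination.column = updates.column
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")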

Issue with Spark dataframe loading timestamp data to hive table

I am trying to load a DataFrame into a Hive table, but an extra 30 minutes is being added to the timestamps in the table.
I have tried the below:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

df_load.write.mode("append").saveAsTable("default.DATA_LOAD")
df_load has a column "currenthour" with the value "2020-09-01 09:00:00", but in the table it is loaded as "2020-09-01 09:30:00".
How can I resolve this issue?
It's a common issue with the Timestamp datatype, caused by timezone handling.
Refer to this:
Spark SQL to Hive table - Datetime Field Hours Bug
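A hedged workaround, assuming the 30-minute shift comes from a session/JVM timezone mismatch rather than from the data itself, is to pin the Spark session timezone before writing (the "UTC" value below is an assumption; use the zone your data was actually produced in):
# Pin the session timezone so timestamps are not shifted on write
spark.conf.set("spark.sql.session.timeZone", "UTC")  # assumption: adjust to your actual zone

df_load.write.mode("append").saveAsTable("default.DATA_LOAD")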

Pyspark parquet file sizes are drastically different

I use pyspark to process a fixed set of data records on a daily basis and store them as 16 parquet files in a Hive table, using the date as the partition. In theory, the number of records every day should be on the same order of magnitude, about 1.2 billion rows, and it is indeed on the same order.
When I look at the parquet files, the size of each parquet file on a typical day, such as 2019-09-04, is around 86 MB.
But one thing I noticed to be very strange is the date 2019-08-03: its file sizes are 10x larger than the files for other dates, yet the number of records seems to be more or less the same. I am confused and cannot come up with a reason for it. If you have any idea why, please share it with me. Thank you.
I've just realised that the way I saved the data for 2019-08-03 is as follows
cols = sparkSession \
    .sql("SELECT * FROM {} LIMIT 1".format(table_name)).columns
df.select(cols).write.insertInto(table_name, overwrite=True)
For other days
insertSQL = """
    INSERT OVERWRITE TABLE {}
    PARTITION(crawled_at_ds = '{}')
    SELECT column1, column2, column3, column4
    FROM calendarCrawlsDF
"""
sparkSession.sql(
    insertSQL.format(table_name,
                     calendarCrawlsDF.take(1)[0]["crawled_at_ds"]))
For 2019-08-03 I used the DataFrame insertInto method, while for the other days I used sparkSession.sql to execute INSERT OVERWRITE TABLE.
Could this be the reason?
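One way to test that hypothesis is to compare the two dates directly: per-file row counts against file sizes. If the row counts match but the bytes do not, the difference is in encoding/compression rather than data volume. A rough sketch, where the base path is a placeholder since the table location is not given in the question:
from pyspark.sql.functions import input_file_name, count

base = "/warehouse/mytable"  # hypothetical table location
for ds in ["2019-08-03", "2019-09-04"]:
    part_df = spark.read.parquet("{}/crawled_at_ds={}".format(base, ds))
    # Row count per physical parquet file for this partition
    part_df.groupBy(input_file_name().alias("file")) \
        .agg(count("*").alias("rows")) \
        .show(truncate=False)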

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed because the actual files reside in S3, and even when I drop the Hive table the partitions remain the same.
Is there any way to change the partitioning of an existing Delta table? Or is the only solution to drop the actual data and reload it with the newly specified partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")

input.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy("colB")  // different column
  .saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modified example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one partition column, pass them all to partitionBy(column, column_2, ...); see the variant sketched after the example below.
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")
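If you do need more than one partition column, a small variant of the same helper (the extra column name below is only illustrative) can accept a variable number of columns:
def change_partitions_of(table_name, *columns):
    df = spark.read.table(table_name)
    (df.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .partitionBy(*columns)   # any number of partition columns
       .saveAsTable(table_name))

change_partitions_of("i.love_python", "column_a", "column_b")  # "column_b" is hypothetical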

On saveAsTable from Spark

We are trying to write into a Hive table from Spark using the saveAsTable function. I want to know whether saveAsTable drops and recreates the Hive table every time. If it does, is there any other Spark function that will just truncate and load the table, instead of dropping and recreating it?
It depends on which .mode value you specify:
overwrite --> Spark drops the table first, then recreates it
append --> Spark inserts the new data into the existing table
1. Drop if exists / create if not exists the default.spark1 table in parquet format
>>> df.write.mode("overwrite").saveAsTable("default.spark1")
2. Drop if exists / create if not exists the default.spark1 table in orc format
>>> df.write.format("orc").mode("overwrite").saveAsTable("default.spark1")
3. Append the new data to the existing data in the table (doesn't drop/recreate the table)
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
Achieve Truncate and Load using Spark:
Method1:-
You can register your DataFrame as a temp table and then execute an insert overwrite statement to overwrite the target table:
>>> df.registerTempTable("temp")  # register df as a temp table
>>> spark.sql("insert overwrite table default.spark1 select * from temp")  # overwrite the target table
This method works for both internal and external tables.
Method2:-
For internal tables we can truncate the table first and then append the data to it; this way we are not recreating the table, just appending data to it.
>>> spark.sql("truncate table default.spark1")
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
This method will only work for Internal tables.
Even for external tables there is a workaround to truncate the table by changing its table properties.
Let's assume default.spark1 is an external table:
--change the external table to an internal table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='FALSE')")
--once the table is internal we can run the truncate table statement
>>> spark.sql("truncate table default.spark1")
--change the table back to an external table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='TRUE')")
--then append data to the table
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
You can also use insertInto("table"), which doesn't recreate the table.
The main difference from saveAsTable is that insertInto expects the table to already exist and resolves columns by position instead of by name.
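A short sketch of that caveat (the table name reuses default.spark1 from above): because insertInto resolves columns by position, it is safest to align the DataFrame to the table's column order before writing.
# insertInto matches columns by position, not by name, so select the DataFrame
# columns in the target table's order first; insertInto appends by default.
target_cols = spark.table("default.spark1").columns
df.select(*target_cols).write.insertInto("default.spark1")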
