Avoid duplicate partition in datalake - apache-spark

When I write a parquet file I pass one of the column values as the partitionBy column, but when the dataframe is empty it doesn't create the partition (which is expected) and does nothing. To overcome this, if I pass
df.partitionOf("department=One").write(df)
and when the dataframe is NOT empty it creates two levels of partitions:
location/department=One/department=One
Is there any way to skip one if the partition already exists to avoid duplicates?

What path are you passing while writing the dataframe? I didn't find a partitionOf function for Spark dataframes.
I think this should work for your case:
df.write.mode("append").partitionBy("department").parquet("location/")
If you don't want to append data for partitions that are already there, find the partition keys from the existing parquet, drop the rows with those partition keys, and write the remaining data in append mode.
Scala code:
import org.apache.spark.sql.functions.col

// partition keys already present in the existing parquet
val dfi = spark.read.parquet(pathPrefix + finalFile).select(col("department"))
// keep only the rows whose department is not already on disk
val finalDf = df.join(dfi, df.col("department") === dfi.col("department"), "left_outer")
  .where(dfi.col("department").isNull)
  .select(df.columns.map(df(_)): _*)
finalDf.write.mode("append").partitionBy("department").parquet("location/")
You can optimize the first step (creating dfi) by finding the partition keys from your dataframe and keeping only those keys for which the partition path already exists.
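For illustration, here is a minimal Scala sketch of that optimization, assuming the table root is location/ and the partition column is department (both taken from the snippets above); it checks the file system with the Hadoop FileSystem API instead of reading the existing parquet:

import org.apache.hadoop.fs.Path
import spark.implicits._

val basePath = "location"  // hypothetical table root, matching the parquet path above
val fs = new Path(basePath).getFileSystem(spark.sparkContext.hadoopConfiguration)

// partition keys present in the incoming dataframe
val incomingKeys = df.select("department").distinct().as[String].collect()
// keep only the keys whose partition directory already exists on disk
val existingKeys = incomingKeys.filter(k => fs.exists(new Path(s"$basePath/department=$k")))

// drop-in replacement for dfi in the join above
val dfi = existingKeys.toSeq.toDF("department")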

Related

Spark parquet partitioning removes the partition column

If I use df.write.partitionBy(col1).parquet(path), the partition column is removed from the written data.
How can I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary: whenever you read a Parquet file with a partitioned directory structure, Spark automatically adds the partition column back to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files, so when you read the files back it adds the column again and shows it to you in the table output. If you check the schema of the table or of the dataframe, you will still see it as a column.
Also, you presumably partitioned the data based on how the table is most frequently queried, so that those reads become faster and more efficient.
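A quick way to see that behaviour, as a hedged sketch (the path /tmp/partitioned_example is just a throwaway location for illustration):

// write partitioned by col1; its values move into the directory names (col1=<value>/)
df.write.partitionBy("col1").parquet("/tmp/partitioned_example")

// on read, Spark infers col1 back from those directory names
val readBack = spark.read.parquet("/tmp/partitioned_example")
readBack.printSchema()  // col1 appears in the schema again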

Spark join: grouping of records having same value for a particular column in the same partition

We have 2 Hive tables which are read in spark and joined using a join key, let’s call it user_id.
Then, we write this joined dataset to S3 and register it in Hive as a third table, so that subsequent tasks can use this joined dataset.
One of the other columns in the joined dataset is called keychain_id.
We want to group all the user records belonging to the same keychain_id in the same partition, in order to avoid shuffles later.
So, can I do a repartition("keychain_id") before writing to S3 and registering it in Hive, and when I read the same data back from this third table, will it still have the same partition grouping (all users belonging to the same keychain_id in the same partition)? I am trying to avoid doing a repartition("keychain_id") every time I read from this third table.
Can you please clarify? If there is no guarantee that it will retain the same partition grouping while reading, is there another efficient way this can be done, other than caching?
If there is no data skew on keychain_id (skew would lead to very different partition file sizes), you can write with partitionBy:
df.write \
    .partitionBy("keychain_id") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("testing")
Update:
In order to retain the grouping of user records having the same keychain_id in the same dataframe partition, you could repartition beforehand, on the number of unique ids and/or the column:
from pyspark.sql import functions as F

n = df.select(F.col("keychain_id")).distinct().count()

df.repartition(n, F.col("keychain_id")) \
    .write \
    .partitionBy("keychain_id") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("testing")
or

df.repartition(n) \
    .write \
    .partitionBy("keychain_id") \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable("testing")

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However my attempt failed since the actual files reside in S3 and even if I drop a hive table the partitions remain the same.
Is there any way to change the partition column of an existing Delta table? Or is the only solution to drop the actual data and reload it with the newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the strongly recommended approach by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, pass them all to partitionBy(column, column_2, ...).
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite") \
        .option("overwriteSchema", "true") \
        .partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")

How to pass multiple column in partitionby method in Spark

I am a newbie in Spark. I want to write dataframe data into a Hive table. The Hive table is partitioned on multiple columns. Through the Hive metastore client I am getting the partition columns and passing them as a variable to the partitionBy clause of the dataframe's write method.
var1="country","state" (getting the partition column names of the Hive table)
dataframe1.write.partitionBy(s"$var1").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
When I execute the above code, it gives me the error: partition "country","state" does not exist.
I think it is taking "country","state" as a single string.
Can you please help me out?
The partitionBy function takes varargs, not a list. You can use it as:
dataframe1.write.partitionBy("country","state").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
Or, in Scala, you can expand a list into varargs like:
val columns = Seq("country","state")
dataframe1.write.partitionBy(columns:_*).mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
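If your metastore client actually returns the partition columns as one comma-separated string (an assumption about your var1; the exact format may differ), you can split it into varargs first, along these lines:

val partitionColsString = "country,state"  // as returned by the metastore client (assumed format)
val partitionCols = partitionColsString.split(",").map(_.trim)

dataframe1.write
  .partitionBy(partitionCols: _*)
  .mode("overwrite")
  .save(s"$hive_warehouse/$dbname.db/$temp_table/")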

Getting partition list of inserted df partitions

Is there a way to get the file list or partition name of the partition that was inserted into the table?
df.write.format("parquet").partitionBy('id,name').insertInto(...)
For the command above, I wish to get a list like:
1,Jhon
2,Jake
3,Dain
I don't think that's possible, because you don't know what was already present in the table and what was newly added.
Of course you can query your dataframe to get this:
// assumes `import spark.implicits._` is in scope for $ and the String encoder
val partitionList = df.select($"id", $"name").distinct.map(r => s"${r.get(0)},${r.get(1)}").collect()
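As a follow-up sketch, the same distinct keys can be turned into the partition directory names the insert would have produced (id=<id>/name=<name>); tableLocation below is a hypothetical placeholder for the table's root path:

import spark.implicits._

val partitionDirs = df.select($"id", $"name").distinct()
  .map(r => s"id=${r.get(0)}/name=${r.get(1)}")
  .collect()
// e.g. Array("id=1/name=Jhon", "id=2/name=Jake", "id=3/name=Dain")
// prefix each entry with the table location to get the full path: s"$tableLocation/$dir"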
