Getting partition list of inserted df partitions - apache-spark

Is there a way to get the file list or partition name of the partition that was inserted into the table?
df.write.format("parquet").partitionBy("id", "name").insertInto(...)
For example, after the command above I would like to get a list like this:
1,Jhon
2,Jake
3,Dain

I don't think that's possible, because you don't know what was already present in the table and what was newly added.
Of course, you can query your dataframe to get the partitions it contains:
val partitionList = df.select($"id", $"name").distinct.collect.map(r => s"${r.get(0)},${r.get(1)}")
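The same idea can be sketched in plain Python. This is only an illustration, assuming the distinct rows have already been collected to the driver and that the table uses Hive-style key=value partition directories. The helper name partition_paths and the dict-per-row input shape are hypothetical; the sample values come from the question. As noted above, this lists the partitions the dataframe touches, not the ones that are new to the table.

```python
def partition_paths(rows, partition_cols):
    """Return the sorted, de-duplicated Hive-style partition paths for rows.

    rows: iterable of dicts, one per row (hypothetical input shape).
    partition_cols: column names in partitionBy order.
    """
    paths = {
        "/".join(f"{col}={row[col]}" for col in partition_cols)
        for row in rows
    }
    return sorted(paths)


rows = [
    {"id": 1, "name": "Jhon"},
    {"id": 2, "name": "Jake"},
    {"id": 3, "name": "Dain"},
    {"id": 1, "name": "Jhon"},  # duplicate row, collapses into one partition
]
print(partition_paths(rows, ["id", "name"]))
```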

Related

Read partitioned hive table in pyspark instead of a parquet

I have a partitioned parquet. It is partitioned by date like:
/server/my_dataset/dt=2021-08-02
/server/my_dataset/dt=2021-08-01
/server/my_dataset/dt=2021-07-31
...
The dataset is huge, so I do not want to read all of it at once. I only need the August part, so I use:
spark.read.parquet("/server/my_dataset/dt=2021-08*")
It works just fine. However I am forced to move from reading parquet directly to reading from the corresponding hive table. Something like:
spark.read.table("schema.my_dataset")
However I want to keep the same logic of reading only certain partitions of the data. Is there a way to do so?
Try with filter and like operator.
Example:
spark.read.table("schema.my_dataset").filter(col("dt").like("2021-08%"))
UPDATE:
You can collect the August partition values into a variable and then filter with an isin condition.
Example:
// get the partition values into a variable and keep only the required ones
val lst = df.select("dt").distinct.collect().map(x => x(0).toString).filter(_.startsWith("2021-08"))
// then use the isin function to filter only the required partitions
df.filter(col("dt").isin(lst:_*)).show()
Python sample code:
lst = ["2021-08-01", "2021-08-02"]
df.filter(col("dt").isin(*lst)).show()
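A minimal sketch of building that list in Python, assuming the partition values are already available as strings (for example, collected from the dt column). The helper name partitions_with_prefix is hypothetical and the dates are the ones from the question.

```python
def partitions_with_prefix(partition_values, prefix):
    """Keep only the partition values that start with the given prefix."""
    return sorted(v for v in partition_values if v.startswith(prefix))


all_dt = ["2021-08-02", "2021-08-01", "2021-07-31"]
august = partitions_with_prefix(all_dt, "2021-08")
print(august)
# The result could then be passed to the filter from the answer, e.g.
# df.filter(col("dt").isin(*august))
```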

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by @pasha701 would involve loading the entire Spark dataframe with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to find the max partition date directly and load only that.
One way is to use hdfs or s3fs to list the contents of the S3 path, find the max partition in that list, and then load only that partition. That would be more efficient.
Assuming the data is stored in AWS S3, something like this:
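import s3fs
datelist = []
inpath = "s3://bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)
# list the partition directories under the dataset root
dirs = fs.ls(inpath)
for path in dirs:
    # skip entries that are not key=value directories (e.g. _SUCCESS)
    if '=' in path:
        datelist.append(path.split('=')[1])
# ISO dates (yyyy-MM-dd) sort correctly as strings
maxpart = max(datelist)
df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)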
This does all the work on a plain list of paths, so no data is loaded until the partition you want has been identified.
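The selection step can be tried without S3 access. The sketch below assumes the directory listing is already available as a list of path strings and that the partition values are ISO dates. The helper name latest_partition is hypothetical. It parses the values as dates rather than comparing strings, and ignores entries that are not batch_date directories.

```python
from datetime import date


def latest_partition(paths, key="batch_date"):
    """Return (path, date) for the newest key=YYYY-MM-DD directory in paths."""
    marker = key + "="
    candidates = []
    for path in paths:
        leaf = path.rstrip("/").split("/")[-1]
        if leaf.startswith(marker):
            # parse the value so the comparison is done on real dates
            candidates.append((date.fromisoformat(leaf[len(marker):]), path))
    if not candidates:
        raise ValueError(f"no {key} partitions found")
    newest_date, newest_path = max(candidates)
    return newest_path, newest_date


listing = [
    "bucket_path/data/batch_date=2020-01-20",
    "bucket_path/data/batch_date=2020-01-24",
    "bucket_path/data/batch_date=2020-01-22",
    "bucket_path/data/_SUCCESS",  # non-partition entry, ignored
]
print(latest_partition(listing))
```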
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Use SHOW PARTITIONS to get all the partitions of a table:
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can then get data from a specific partition using the query below:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Additional filters or a group by can also be applied to it.
This worked for me in PySpark v2.4.3. First extract the partitions (this is for a table with a single partition column of date type; I haven't tried it on a table with more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
from pyspark.sql.functions import split, to_date
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.
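For tables with more than one partition column, each SHOW PARTITIONS row has the form k1=v1/k2=v2. A rough sketch of parsing such rows in plain Python follows, assuming the strings have already been collected to the driver. The helper name parse_partition is hypothetical and the sample strings are the ones shown in the earlier answer. Taking max on the raw strings only works here because the pt values are zero-padded and fixed-width.

```python
def parse_partition(spec):
    """Split 'k1=v1/k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    return dict(part.split("=", 1) for part in spec.split("/"))


specs = [
    "pt=2012.07.28.08/is_complete=1",
    "pt=2012.07.28.09/is_complete=1",
]
parsed = [parse_partition(s) for s in specs]
# fixed-width, zero-padded values compare correctly as strings
latest_pt = max(p["pt"] for p in parsed)
print(latest_pt)
```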

Avoid duplicate partition in datalake

When I write a parquet file I pass one of the columns as the partition-by column, but when the dataframe is empty it doesn't create the partition (as expected) and does nothing. To overcome this, I pass
df.partitionOf("department=One").write(df)
but when the dataframe is NOT empty it creates two levels of partition
location/department=One/department=One
Is there any way to skip one if the partition already exists to avoid duplicates?
What is the path you are passing while writing dataframe? I didn't find partitionOf function for spark dataframe.
I think this should work for your case
df.write.mode("append").partitionBy("department").parquet("location/")
If you don't want to append data for partitions that already exist, find the partition keys in the existing parquet, drop the rows with those partition keys, and write the remaining data in append mode.
scala code:
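// partition keys that already exist in the target parquet
val dfi = spark.read.parquet(pathPrefix + finalFile).select(col("department")).distinct()
// keep only the rows whose department has no match among the existing keys
val finalDf = df.join(dfi, df.col("department") === dfi.col("department"), "left_outer")
  .where(dfi.col("department").isNull)
  .select(df.columns.map(c => df.col(c)): _*)
finalDf.write.mode("append").partitionBy("department").parquet("location/")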
You can optimize the first step (creating dfi) by taking the partition keys from your DataFrame and keeping only those for which a path already exists.
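The filtering logic of that join can be illustrated without Spark. This is a sketch under the assumption that the existing partition keys are already known as a plain list (for instance, from a directory listing). The helper name rows_for_new_partitions and the sample rows are hypothetical.

```python
def rows_for_new_partitions(rows, existing_keys, key="department"):
    """Drop rows whose partition key already exists in the target location."""
    existing = set(existing_keys)
    return [row for row in rows if row[key] not in existing]


incoming = [
    {"department": "One", "emp": "a"},
    {"department": "Two", "emp": "b"},
]
# department=One is assumed to exist already, so only department=Two is kept
kept = rows_for_new_partitions(incoming, ["One"])
print(kept)
```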

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However my attempt failed since the actual files reside in S3 and even if I drop a hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or the only solution will be to drop the actual data and reload it with a newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy("colB") // different column
  .saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modified example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
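If the REPLACE TABLE statement above needs to be generated for several tables or for more than one partition column, a small string builder may help. This is only a sketch: the helper name replace_table_sql is hypothetical, it does no identifier quoting or validation, and it only builds the text of the statement shown above (it does not run it).

```python
def replace_table_sql(table_name, partition_cols):
    """Build the REPLACE TABLE ... PARTITIONED BY statement as a string."""
    if not partition_cols:
        raise ValueError("at least one partition column is required")
    cols = ", ".join(partition_cols)
    return (
        f"REPLACE TABLE {table_name}\n"
        "USING DELTA\n"
        f"PARTITIONED BY ({cols})\n"
        "AS\n"
        f"SELECT * FROM {table_name}"
    )


sql = replace_table_sql("mytable", ["view_date"])
print(sql)
# The string could then be passed to spark.sql(sql) in a Delta-enabled session.
```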
Python solution:
If you need more than one column in the partition, pass them all: partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
    # read the current table, then overwrite both its contents and its schema
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")

Running partition specific query in Spark Dataframe

I am working on a Spark Streaming application, where I partition the data by a certain ID in the data.
For example: partition 0 -> contains all data with id 100
partition 1 -> contains all data with id 102
Next I want to execute a query on the whole dataframe for the final result, but my query is specific to each partition.
For example, I need to run
select(col1 * 4) in the case of partition 0
while
select(col1 * 10) in the case of partition 1.
I have looked into the documentation but didn't find any clue. One solution I have is to create different RDDs/DataFrames for the different ids in the data, but that is not scalable in my case.
Any suggestions on how to run a query on a dataframe where the query can be specific to each partition?
Thanks
I think you should not couple your business logic with Spark's way of partitioning your data (you won't be able to repartition your data if required). I would suggest adding an artificial column to your DataFrame that equals the partitionId value.
In any case, you can always do
df.rdd.mapPartitionsWithIndex{ case (partId, iter: Iterator[Row]) => ...}
See also the docs.
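The advice to key the logic on a data column rather than on the physical partition can be illustrated in plain Python. This is a sketch only: the helper name scale_by_id, the result column, and the lookup table are hypothetical, with the ids and factors taken from the question.

```python
def scale_by_id(rows, multipliers, default=1):
    """Add a 'result' field computed from col1 and a per-id multiplier."""
    return [
        {**row, "result": row["col1"] * multipliers.get(row["id"], default)}
        for row in rows
    ]


rows = [{"id": 100, "col1": 2}, {"id": 102, "col1": 3}]
multipliers = {100: 4, 102: 10}  # id 100 -> col1 * 4, id 102 -> col1 * 10
scaled = scale_by_id(rows, multipliers)
print(scaled)
```

Because the multiplier depends on the id value and not on where the row happens to live, the result stays the same if the data is repartitioned.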
