Koalas DataFrame updating live as Delta Lake is updated - Databricks

I'm working on a solution that updates a Delta Lake through a given index using the following code:
import databricks.koalas as ks

dataframe = ks.read_table('data')
subdataframe = dataframe.loc[dataframe['status'] == 1, :]
for index, column in subdataframe.iterrows():
    # get values for a given row
    record = subdataframe.loc[index].to_dict()
    if record_needs_updating:
        # update the Delta Lake
        dataframe.loc[dataframe['file'] == record['file'], 'status'] = 0
        dataframe.to_delta('fileloc', partition_cols='pull', mode='overwrite')
        # update the Databricks table
        spark.sql("DROP TABLE IF EXISTS data")
        spark.sql("CREATE TABLE data USING DELTA LOCATION 'fileloc'")
        spark.sql("OPTIMIZE data")
The problem that I am running into is a KeyError when trying to index into the subdataframe in the for loop.
This seems to happen because the dataframe itself gets updated to no longer include any records with status = 0 after the Delta Lake is updated, meaning the index changes, which raises the KeyError.
Is there any way to make the subdataframe a non-live dataframe that won't be updated as the Delta Lake is updated?
Also note that I need the update to happen live while the code is running, not just once after all the code has run.
Thanks!
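One way to get a non-live copy is to pull the filtered rows down to the driver as a plain pandas DataFrame before the loop, so later rewrites of the Delta table cannot change it. A minimal sketch, assuming the filtered subset is small enough to collect locally:

import databricks.koalas as ks

dataframe = ks.read_table('data')
# to_pandas() collects the filtered rows to the driver; this copy is detached
# from the Delta table and will not change when the table is rewritten
snapshot = dataframe.loc[dataframe['status'] == 1, :].to_pandas()

for index, row in snapshot.iterrows():
    record = row.to_dict()
    # ... decide whether the record needs updating and rewrite the Delta table as before ...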

Related

How to prevent duplicate entries from entering a Delta Lake on Azure Storage

I have a DataFrame stored in Delta format in ADLS. When I try to append new updated rows to that Delta Lake, is there any way I can delete the old existing record in Delta and add the new updated record?
There is a unique column in the schema of the DataFrame stored in Delta, by which we can check whether a record is updated or new.
This is a task for the MERGE command - you define the merge condition (your unique column) and then the actions. In SQL it could look as follows (column is your unique column, and updates could be your dataframe registered as a temporary view):
MERGE INTO destination
USING updates
ON destination.column = updates.column
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED
THEN INSERT *
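To run the SQL form from a notebook, the DataFrame holding the new rows would first be registered as the updates view. A minimal sketch (updatesDF and the destination table name are assumptions from the example above):

# updatesDF is assumed to hold the new/changed rows
updatesDF.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO destination
    USING updates
    ON destination.column = updates.column
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")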
In Python it could look as follows:
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/data/destination/")
deltaTable.alias("dest").merge(
    updatesDF.alias("updates"),
    "dest.column = updates.column") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()

PySpark dataframe drops records while writing to a hive table

I am trying to write a PySpark dataframe to a Hive table, which was also created using the line below:
parks_df.write.mode("overwrite").saveAsTable("fs.PARKS_TNTO")
When I print the count of the dataframe with parks_df.count(), I get 1000 records.
But in the final table fs.PARKS_TNTO, I get 980 records, so 20 records are getting dropped. How can I resolve this issue? Also, how can I capture the records which are getting dropped? There are no partitions on this final table fs.PARKS_TNTO.
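One way to see which rows went missing is to read the table back and subtract it from the source DataFrame, for example with exceptAll. A sketch, assuming the rows were written unmodified so a whole-row comparison is meaningful:

# rows present in the source DataFrame but absent from the written table
written = spark.table("fs.PARKS_TNTO")
dropped = parks_df.exceptAll(written)

print(dropped.count())        # how many rows were lost
dropped.show(truncate=False)  # inspect the lost rows themselves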

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks Delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed since the actual files reside in S3, and even if I drop the Hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or is the only solution to drop the actual data and reload it with a newly specified partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy("colB") // different column
  .saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modified example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, use partitionBy(column, column_2, ...).
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")

Getting partition list of inserted df partitions

Is there a way to get the file list or partition name of the partition that was inserted into the table?
df.write.format("parquet").partitionBy('id,name').insertInto(...)
A sample of the list I wish to get from the command above:
1,Jhon
2,Jake
3,Dain
I don't think that's possible, because you don't know what was already present in the table and what was newly added.
Of course you can query your dataframe to get this:
val partitionList = df.select($"id", $"name").distinct.collect
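The same lookup as a PySpark sketch, assuming the partition columns are id and name as in the example:

# distinct (id, name) pairs present in the DataFrame being inserted
partition_list = [tuple(r) for r in df.select("id", "name").distinct().collect()]
print(partition_list)  # e.g. [(1, 'Jhon'), (2, 'Jake'), (3, 'Dain')]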

Hive Context to load into a table

I am selecting an entire table and loading it into a new table. It is loaded correctly, but the values are appended rather than overwritten.
Spark version 1.6
Below is the code snippet
DataFrame df = hiveContext.createDataFrame(JavaRDD<Row>, StructType);
df.registerTempTable("tempregtable");
String query="insert into employee select * from tempregtable";
hiveContext.sql(query);
I am dropping and creating the table (employee) and executing the above code. But the old row values still get appended to the new rows. For example, if I insert four rows, drop the table, and insert four rows again, a total of 8 rows get added. Kindly help me figure out how to overwrite the data instead of appending it.
Regards
Prakash
try
String query="insert overwrite table employee select * from tempregtable";
INSERT OVERWRITE will overwrite any existing data in the table or partition
INSERT INTO will append to the table or partition
Reference: Hive Language Manual
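The same overwrite behaviour is also reachable through the DataFrame writer; a PySpark sketch for illustration (the question itself uses the Java API):

# replace the existing rows of the Hive table instead of appending to them
df.write.insertInto("employee", overwrite=True)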
