I have a Delta Lake table in Azure. I'm using Databricks. When we add new entries we use MERGE INTO to prevent duplicates from getting into the table. However, duplicates did get into the table. I'm not sure how it happened. Maybe the MERGE INTO conditions weren't set up properly.
However it happened, the duplicates are there. Is there any way to detect and remove the duplicates from the table? All the documentation I've found shows how to deduplicate the dataset before merging, nothing for once the duplicates are already there. How can I remove the duplicates?
Thanks
If the duplicates already exist in the target table, your only options are:
Delete the duplicated rows from the target table manually using SQL DELETE statements
Create a deduplicated replica of your target table and rename both tables (the deduplicated replica and the original target) so that the deduplicated replica becomes the main table.
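Either way, a first step is usually to find out which rows are duplicated. A minimal sketch, assuming a hypothetical table name events and that rows should be unique by an id column:

from pyspark.sql import functions as F

# any key that appears more than once is duplicated
(spark.table("events")
    .groupBy("id")
    .count()
    .filter(F.col("count") > 1)
    .show())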
You can use Dataset.dropDuplicates to remove duplicates based on specific columns.
https://spark.apache.org/docs/2.3.2/api/java/org/apache/spark/sql/Dataset.html#dropDuplicates-java.lang.String-java.lang.String...-
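For instance, a small sketch assuming a hypothetical table events with an id key column:

deduped_exact = spark.table("events").dropDuplicates()           # drops rows duplicated across all columns
deduped_by_key = spark.table("events").dropDuplicates(["id"])    # keeps one arbitrary row per id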
In order to remove the duplicates you can follow the approach below:
Create a separate table that is a replica of the table that has duplicate records.
Drop the first table that has duplicate records (metadata plus physical files).
Write a Python script or Scala code that reads the data from the table you created in step 1, removes the duplicate records (using the dropDuplicates function or any custom logic that defines a unique record), and recreates the table you dropped in step 2.
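Put together, the three steps might look roughly like this. This is only a sketch; the table names events and events_copy and the key column id are placeholders for your own schema:

# step 1: create a replica of the table that has duplicates
spark.sql("CREATE TABLE events_copy USING delta AS SELECT * FROM events")

# step 2: drop the original (for a managed table this removes metadata and data files)
spark.sql("DROP TABLE events")

# step 3: read the replica, deduplicate, and recreate the original table
(spark.table("events_copy")
    .dropDuplicates(["id"])
    .write.format("delta")
    .saveAsTable("events"))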
Once you follow the above steps your table will no longer have duplicate rows, but this is just a workaround to make your table consistent; it is not a permanent solution.
Before or after you follow the above steps, you will have to look into your MERGE INTO statement to check that it is written correctly so that it does not insert duplicate records. If the MERGE INTO statement is correct, make sure that the dataset you are processing does not already contain duplicate records from the source you are reading from.
I would suggest the following SOP:
Fix the existing job (streaming or batch) to handle duplicates (see the sketch after this list)
Change the job configuration to write into a _recovery table (also change the checkpoint path to a _recovery location in the case of a streaming job)
Run the job and validate its output
Rename the original folder to _backup and rename _recovery to the original name (do the same with the checkpoints directory)
Restore the original job configuration.
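For a streaming job, steps 1-2 could look roughly like the sketch below. Everything here is illustrative: stream_df stands in for your existing streaming DataFrame, id is a hypothetical deduplication key, and the paths are placeholders:

# step 1: drop duplicates inside the job (consider adding a watermark so the
# deduplication state does not grow without bound)
deduped = stream_df.dropDuplicates(["id"])

# step 2: write into the _recovery location with its own checkpoint path
(deduped.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events_recovery")
    .start("/mnt/delta/events_recovery"))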
I am setting up a pipeline in data factory where the first part of the pipeline needs some pre-processing cleaning. I currently have a script set up to query these rows that need to be deleted, and export these results into a csv.
What I am looking for is essentially the opposite of an upsert copy activity. I would like the procedure to delete the rows in my table based on a matching row.
Apologies in advance if this is an easy solution; I am fairly new to Data Factory and just need help looking in the right direction.
Assuming the source from which you are initially getting the rows is different from the sink, there are multiple ways to achieve it:
If the number of rows is small, you can leverage a Script activity or Lookup activity to delete the records from the destination table.
For a larger dataset (given the limitations of the Lookup activity), you can copy the data into a staging table within the destination and leverage a Script activity to delete the matching rows.
If your org supports the usage of Data Flows, you can use those to achieve it.
I am building an etl using pyspark in databricks.
I have a source SQL table with roughly 10 million rows of data which I want to load into a SQL staging table.
I have two basic requirements:-
When a row is added to the source table, it must be inserted into the staging table.
When a row is updated in the source table, it must be updated in the staging table.
Source data
Thankfully the source table has two timestamp columns for created and updated time. I can query the new and updated data using these two columns and put it into a dataframe called source_df.
Target data
I load all the keys (IDs) from the staging table into a dataframe called target_df.
Working out changes
I join the two dataframes together based on the key to work out which rows already exist (which form updates), and which rows don't exist (which form inserts). This gives me two new dataframes, inserts_df and updates_df.
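A sketch of that classification, assuming the join key is a column called id (hypothetical here):

# target_df holds only the keys (IDs) already present in the staging table
existing_keys = target_df.select("id")

inserts_df = source_df.join(existing_keys, on="id", how="left_anti")   # not in target yet -> insert
updates_df = source_df.join(existing_keys, on="id", how="left_semi")   # already in target -> update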
Inserting new rows
This is easy because I can just use inserts_df.write to directly write into the staging table. Job done!
Updating existing rows
This is what I can't figure out as there is little in the way of existing examples. I am led to believe that you can't do updates using PySpark. I could use the "overwrite" mode to replace the SQL table, but it doesn't make a lot of sense to replace 10 million rows when I only want to update half a dozen.
How can I effectively get the rows from updates_df into SQL without overwriting the whole table?
I have a PySpark dataframe from which I initially created a Delta table using the code below -
df.write.format("delta").saveAsTable("events")
Now, since the above dataframe is populated with new data on a daily basis, for appending new records into the Delta table I used the below syntax -
df.write.format("delta").mode("append").saveAsTable("events")
Now this whole thing I did in Databricks and in my cluster. I want to know how I can write generic PySpark code in Python that will create the Delta table if it does not exist and append records if the Delta table exists. I want to do this because if I give my Python package to someone, they will not have the same Delta table in their environment, so it should get created dynamically from the code.
If you don't have the Delta table yet, then it will be created when you're using the append mode. So you don't need to write any special code to handle the case when the table doesn't exist yet and the case when it exists.
P.S. You'll need such code only if you're performing a merge into the table, not an append. In this case the code will look like this:
if table_exists:
    do_merge
else:
    df.write....
P.S. here is a generic implementation of that pattern
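Concretely, that pattern might look roughly like the sketch below. The table name events and the merge key id are assumptions, and spark.catalog.tableExists requires a recent Spark version (3.3+):

from delta.tables import DeltaTable

# on older Spark versions, check spark.catalog.listTables() instead of tableExists
if spark.catalog.tableExists("events"):
    (DeltaTable.forName(spark, "events").alias("t")
        .merge(df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    df.write.format("delta").saveAsTable("events")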
There are essentially two operations available with Spark:
saveAsTable: creates the table with the current DataFrame, or fails/replaces it depending on the save mode.
insertInto: succeeds only if the table is present and performs the operation based on the mode ('overwrite' or 'append'); it requires the table to already be available in the database.
The .saveAsTable("events") Basically rewrites the table every time you call it. which means that, even if you have a table present earlier or not, it will replace the table with the current DataFrame value. Instead, you can perform the below operation to be in the safer side:
Step 1: Create the table even if it is present or not. If present, remove the data from the table and append the new data frame records, else create the table and append the data.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So, every time it will check whether the table is available; if not, it will create the table and move to the next step, and if the table is available it will simply append the data into the table.
I have a dataframe with a few rows, some of which already exist in the DB. I want to update a few columns of those existing rows. How can we do that?
I see we have SaveModes:
Append and Overwrite, which might serve the purpose, but there is a limitation in both cases.
With Append, I am getting a primary key error, as this option tries to create a new row in the DB.
With Overwrite, I will lose the values of the unchanged attributes in the tuple.
Can someone please suggest how I can update a few attributes (column values) of a row (tuple)?
This can be handled at the MySQL level; the concept is known as an upsert.
Case when the primary key is new:
The SQL will insert into the MySQL DB as a new row.
Case when the primary key already exists:
You can use INSERT ... ON DUPLICATE KEY UPDATE, which will update the existing row with the new entries/changes.
Read More here and here.
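As a rough illustration of ON DUPLICATE KEY UPDATE from Python (the users table, its columns, and the connection details are all hypothetical placeholders):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="etl", password="***", database="mydb")
cur = conn.cursor()

rows = [(1, "Alice", "alice@example.com"), (2, "Bob", "bob@example.com")]
cur.executemany(
    "INSERT INTO users (id, name, email) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE name = VALUES(name), email = VALUES(email)",
    rows,
)
conn.commit()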
The ideal way to handle such a use case is to insert your data into a temporary table in your MySQL DB first, and after that use a trigger to load that data into the original table. Call that trigger from Spark itself.
In Spark, dataframes are immutable, so you cannot change a value in place. One way would be to read the complete table, make the modification, and write back the complete table in overwrite mode. This will take time.
If your modifications are always for a particular group, say user-id based or date based, then you can write the data partitioned by that column using partitionBy(). Then you can read just that partition using .filter(), make the modifications, and overwrite only that partition using insertInto() (available from PySpark 2.3.0).
Refer to this answer for other PySpark versions: Overwrite specific partitions in spark dataframe write method
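A minimal sketch of that partition-level overwrite on Spark 2.3+, assuming a hypothetical table events partitioned by event_date with an existing status column:

from pyspark.sql import functions as F

# overwrite only the partitions present in the DataFrame being written
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

one_day = spark.table("events").filter(F.col("event_date") == "2023-01-01")
modified = one_day.withColumn("status", F.lit("corrected"))   # the "update"

# insertInto is position-based, so the column order must match the table
modified.write.mode("overwrite").insertInto("events")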
I have a table with more than 10,000,000 records in Cassandra, and for some reason I want to build another Cassandra table with the same fields plus several additional fields, then migrate the previous data into it. Right now the two tables are in the same Cassandra cluster.
I want to ask how to finish this task in the shortest time?
And if my new table is in a different Cassandra cluster, how would I do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a csv file, write a program to restructure the csv file as needed, then import the csv file into the new location.
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.
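A rough sketch of that Spark approach, shown with the DataFrame API rather than RDDs, and assuming the spark-cassandra-connector is on the classpath and hypothetical keyspace/table names ks.old_table and ks.new_table:

from pyspark.sql import functions as F

# read the old table, add the new (empty) field, and append into the new table
old_df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="ks", table="old_table")
          .load())

new_df = old_df.withColumn("extra_field", F.lit(None).cast("string"))

(new_df.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="new_table")
    .mode("append")
    .save())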