I am building an ETL using PySpark in Databricks.
I have a source SQL table with roughly 10 million rows of data which I want to load into a SQL staging table.
I have two basic requirements:
When a row is added to the source table, it must be inserted into the staging table.
When a row is updated in the source table, it must be updated in the staging table.
Source data
Thankfully the source table has two timestamp columns for created and updated time. I can query the new and updated data using these two columns and put it into a dataframe called source_df.
Target data
I load all the keys (IDs) from the staging table into a dataframe called target_df.
Working out changes
I join the two dataframe together based on the key to work out which rows already exist (which form updates), and which rows don't exist (which form inserts). This gives me two new dataframes, inserts_df and updates_df.
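The join described above could be sketched as follows. This is a minimal sketch, assuming source_df and target_df are PySpark DataFrames that share a key column; the column name `id` and the helper name `split_changes` are hypothetical and not part of PySpark.

```python
def split_changes(source_df, target_df, key="id"):
    """Split source rows into inserts and updates by comparing keys.

    `key` is a hypothetical column name; pass your real key column.
    """
    # left_anti keeps source rows whose key is NOT in the target: new rows
    inserts_df = source_df.join(target_df, on=key, how="left_anti")
    # left_semi keeps source rows whose key IS in the target: changed rows
    updates_df = source_df.join(target_df, on=key, how="left_semi")
    return inserts_df, updates_df
```

Both join types return only the columns of source_df, so the two results can be written out without dropping target columns first.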
Inserting new rows
This is easy because I can just use inserts_df.write to directly write into the staging table. Job done!
Updating existing rows
This is what I can't figure out, as there is little in the way of existing examples. I am led to believe that you can't do updates using PySpark. I could use the "overwrite" mode to replace the SQL table, but it doesn't make a lot of sense to replace 10 million rows when I only want to update half a dozen.
How can I effectively get the rows from updates_df into SQL without overwriting the whole table?
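One approach that is sometimes used for this, offered here as a sketch rather than a confirmed answer: write updates_df to a small scratch table with the ordinary DataFrame write, then run a single MERGE on the database side so that only the matching rows are touched. The helper below only builds the statement text. It assumes the database supports MERGE with this syntax (SQL Server does), and the table and column names are hypothetical.

```python
def build_merge_sql(target, scratch, key, columns):
    """Build a MERGE statement that upserts `scratch` rows into `target`.

    All names are supplied by the caller; nothing here talks to a database.
    """
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    col_list = ", ".join([key] + list(columns))
    val_list = ", ".join(f"s.{c}" for c in [key] + list(columns))
    return (
        f"MERGE INTO {target} AS t "
        f"USING {scratch} AS s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({val_list});"
    )

# Hypothetical names, for illustration only
sql = build_merge_sql("staging", "staging_updates", "id", ["name", "updated_at"])
print(sql)
```

The scratch table holds only the handful of changed rows, so the 10 million existing rows are never rewritten. How the statement is executed from Databricks depends on the database driver available and is left out here.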
I am using Excel Power Query to import CSV files containing transactions from a directory. That way, adding a new file to the directory automatically makes it available when refreshing the query/data model. I load the table from the CSV files into the data model. I do some cleaning and data transformation in the query.
However, there are some things that I can't do in the query that loads the raw data.
There may be missing data that I need to enter manually (a column missing some values)
I may need to split a transaction/row into multiple transactions/rows to categorize the parts correctly
It seems like there should be a way to do this that allows me to make my changes and not have them overwritten when I refresh the query to import new transactions.
Currently I am experimenting with creating a column with a unique id for the transaction table as part of the query, then creating an aux table in Excel that relates to the raw transactions by unique id. I make my changes in the aux table. Finally, I create a new table that merges the raw transactions with the aux table to create the working transaction table. This works for missing data or incorrect values, but it still doesn't allow me to split a row into multiple rows.
I would welcome any suggestions or references.
I have a PySpark DataFrame from which I initially created a Delta table using the code below:
df.write.format("delta").saveAsTable("events")
Since this DataFrame is populated with new data daily, I append the new records to the Delta table with:
df.write.format("delta").mode("append").saveAsTable("events")
I did all of this in Databricks on my own cluster. I want to know how I can write generic PySpark code that creates the Delta table if it does not exist and appends records if it does. I want this because if I give my Python package to someone else, they will not have the same Delta table in their environment, so it should be created dynamically from code.
If you don't have the Delta table yet, it will be created when you use the append mode. So you don't need to write any special code to handle the two cases of the table not existing yet and the table already existing.
P.S. You'll need such code only if you're performing a merge into the table, not an append. In that case the code will look like this:
if table_exists:
    do_merge
else:
    df.write....
P.S. here is a generic implementation of that pattern
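As an illustration of the if/else pattern above, here is a minimal sketch that merges through Spark SQL. It assumes a Delta-enabled Spark session, PySpark 3.3 or later for `spark.catalog.tableExists`, and a single join column; the function name `merge_or_create`, the `key` parameter, and the view name `updates` are hypothetical.

```python
def merge_or_create(spark, df, table_name, key):
    """Merge `df` into a Delta table, creating the table on first run.

    `key` is a hypothetical join column; `updates` is an arbitrary view name.
    """
    if spark.catalog.tableExists(table_name):  # needs PySpark 3.3 or later
        # Expose the DataFrame to SQL, then upsert it into the existing table
        df.createOrReplaceTempView("updates")
        spark.sql(
            f"MERGE INTO {table_name} AS t USING updates AS s "
            f"ON t.{key} = s.{key} "
            "WHEN MATCHED THEN UPDATE SET * "
            "WHEN NOT MATCHED THEN INSERT *"
        )
    else:
        # First run: no table to merge into, so create it from the DataFrame
        df.write.format("delta").saveAsTable(table_name)
```

It would be called as, for example, merge_or_create(spark, df, "events", "id").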
Spark offers two write operations for this:
saveAsTable: creates the table if it does not exist. If the table already exists, the result depends on the mode: the default mode raises an error, 'overwrite' replaces the table with the current DataFrame, and 'append' adds the rows to it.
insertInto: succeeds only if the table already exists in the database, and then writes according to the mode ('overwrite' or 'append').
Called with the 'overwrite' mode, .saveAsTable("events") rewrites the table every time you call it, which means that whatever the table held before is replaced with the current DataFrame. Instead, you can do the following to be on the safe side:
Step 1: Create an empty table with the DataFrame's schema if the table does not already exist, then append the DataFrame's records to it. If the table already exists, its data is left in place and the new records are appended.
df.createOrReplaceTempView('df_table')
spark.sql("CREATE TABLE IF NOT EXISTS events USING delta AS SELECT * FROM df_table WHERE 1=2")
df.write.format("delta").mode("append").insertInto("events")
So each run first checks whether the table exists and creates it if it does not. Either way, the data is then appended to the table.
I have a Delta Lake table in Azure. I'm using Databricks. When we add new entries we use merge into to prevent duplicates from getting into the table. However, duplicates did get into the table. I'm not sure how it happened. Maybe the merge into conditions weren't set up properly.
However it happened, the duplicates are there. Is there any way to detect and remove the duplicates from the table? All the documentation I've found shows how to deduplicate the dataset before merging, and nothing for when the duplicates are already there. How can I remove the duplicates?
Thanks
If the duplicates already exist in the target table, your only options are:
Delete the duplicated rows from the target table manually using SQL DELETE statements.
Create a deduplicated replica of your target table and rename both tables (the deduplicated replica and the original target) so that the deduplicated replica becomes the main table.
You can use dataset.dropDuplicates to remove duplicates based on columns.
https://spark.apache.org/docs/2.3.2/api/java/org/apache/spark/sql/Dataset.html#dropDuplicates-java.lang.String-java.lang.String...-
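Detecting the duplicates and writing a deduplicated replica could be sketched in PySpark as follows. This is a sketch under assumptions: the columns that define a unique record are passed in as `key_cols`, and the function and table names are hypothetical.

```python
def find_duplicate_keys(df, key_cols):
    """Return one row per key combination that occurs more than once."""
    return df.groupBy(*key_cols).count().filter("count > 1")


def write_deduplicated_copy(spark, table_name, copy_name, key_cols):
    """Write a copy of `table_name` keeping one row per key combination.

    Which of the duplicate rows survives is arbitrary with dropDuplicates.
    """
    deduped = spark.table(table_name).dropDuplicates(key_cols)
    deduped.write.format("delta").mode("overwrite").saveAsTable(copy_name)
```

For example, find_duplicate_keys(spark.table("events"), ["id"]).show() lists the affected keys before anything is rewritten. If it matters which duplicate is kept, custom logic would be needed in place of dropDuplicates.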
In order to remove the duplicates you can follow the below approach:
Create a separate table that is the replica of the table that has duplicate records.
Drop the first table that has duplicate records (metadata plus physical files).
Write a Python script or Scala code that reads the data from the table you created in step 1, removes the duplicate records using either the dropDuplicates function or any custom logic that defines a unique record, and recreates the table that you dropped in step 2.
Once you follow the above steps your table will not have duplicate rows, but this is only a workaround to make the table consistent, not a permanent solution.
Before or after you follow the above steps, you will have to check that your merge into statements are written correctly so that they do not insert duplicate records. If the merge into statement is correct, make sure that the source dataset you are processing does not itself contain duplicate records.
I would suggest the following SOP:
Fix the existing job (streaming or batch) to handle duplicates
Change the job configuration to write into a _recovery table (also change the checkpoint path to _recovery in the case of a streaming job)
Run the job and validate its output
Rename the original folder to _backup and rename _recovery to the original name (do the same with the checkpoints directory)
Restore the original job configuration.
I have an ADF pipeline which executes a Data Flow.
The Dataflow has Source A table which has around 1 Million Rows,
Filter which has a query to select only yesterday's records from the source table,
Alter Row settings which uses upsert,
Sink which is the archival table where the records are upserted
This whole pipeline takes around 2 hours, which is not acceptable. The records actually being transferred / upserted number only around 3000.
Core count is 16. Tried the partitioning with round robin and 20 partitions.
Similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source which would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data into an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.
I have a table with more than 10,000,000 records in Cassandra. I want to build another Cassandra table with the same fields plus several additional fields, and migrate the existing data into it. Both tables are in the same Cassandra cluster.
How can I finish this task in the shortest time?
And if my new table were in a different Cassandra cluster, how would I do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
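As a small illustration of the alter table route, the sketch below only generates the CQL text, one statement per new column. The keyspace, table, and column names are hypothetical, and running the statements (for example in cqlsh) is left to the reader.

```python
def alter_table_statements(keyspace, table, new_columns):
    """Build one CQL ALTER TABLE ... ADD statement per new column.

    `new_columns` maps each new column name to its CQL type.
    """
    return [
        f"ALTER TABLE {keyspace}.{table} ADD {name} {cql_type};"
        for name, cql_type in new_columns.items()
    ]


# Hypothetical keyspace, table, and columns
for stmt in alter_table_statements("shop", "orders", {"note": "text", "score": "int"}):
    print(stmt)
```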
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a CSV file, writing a program to restructure the CSV file as needed, and then importing the CSV file into the new location.
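The restructuring program could be as small as the following sketch, which appends new columns with a fixed value to every exported row. It assumes the export is non-empty and has a header row; the function name `add_columns` and the example column are hypothetical.

```python
import csv


def add_columns(src, dst, defaults):
    """Copy CSV rows from `src` to `dst`, appending the columns in `defaults`.

    `src` and `dst` are open text file objects; `defaults` maps each new
    column name to the value written for every existing row.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + list(defaults)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row.update(defaults)  # fill the new columns for this row
        writer.writerow(row)
```

Rows are streamed one at a time, so the whole export never has to fit in memory.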
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.