Copying rows from multiple delta tables into one via Spark - apache-spark

I have multiple Delta Lake tables storing image data. I want to take specific rows, via a filter, from those tables and put them in another delta table. I do not want to copy the original data, just a reference or shallow copy. I am using PySpark and Databricks. Can someone please help me find the correct approach for this?

What you actually need is a view over the original table. Use CREATE VIEW to create it with the necessary filter expression, like this:
CREATE VIEW <name> AS
SELECT * FROM <source_table> WHERE <your filter condition>
This view can then be queried like a normal table, but the data will be filtered according to your condition.
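Since you mention multiple source tables, one option (a sketch only; the table names and filter are placeholders, and it assumes the tables share a schema) is to union the filtered selects from each source table inside the view, so the result still only references the originals:
CREATE VIEW filtered_images AS
SELECT * FROM images_table_1 WHERE <your filter condition>
UNION ALL
SELECT * FROM images_table_2 WHERE <your filter condition>
No data is copied; the view is re-evaluated against the source delta tables each time it is queried.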

Related

How to delete records from a sql database using azure data factory

I am setting up a pipeline in Data Factory where the first part of the pipeline needs some pre-processing cleanup. I currently have a script set up to query the rows that need to be deleted and export the results to a CSV.
What I am looking for is essentially the opposite of an upsert copy activity: I would like the procedure to delete the rows in my table based on a matching row.
Apologies in advance if this is an easy solution; I am fairly new to Data Factory and just need help looking in the right direction.
Assuming the source from which you are initially getting the rows is different from the sink, there are multiple ways to achieve this:
If the number of rows is small, you can use a Script activity or Lookup activity to delete the records from the destination table.
For a larger dataset, given the limitations of the Lookup activity, you can copy the data into a staging table in the destination and use a Script activity to delete the matching rows (see the SQL sketch below).
If your organisation supports the use of data flows, you can use a data flow to achieve this.
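For the staging-table option, the Script activity would run something along these lines (a sketch only; the table and key names are placeholders for your own schema):
-- delete rows in the destination table that have a matching row in the staging table
DELETE t
FROM dbo.target_table AS t
INNER JOIN dbo.staging_rows_to_delete AS s
    ON t.id = s.id;
-- optionally clear the staging table for the next run
TRUNCATE TABLE dbo.staging_rows_to_delete;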

Filter out duplicate records with azure data factory before importing into Dynamics 365 user table

I am looking to use Azure Data Factory to import a number of users from a third-party source CSV file into the D365 user table/entity. This is usually straightforward, but on this occasion I have a complication.
The D365 users table/entity is already populated. The source CSV user file will have a mixture of users that are already in the D365 table/entity and others that are not.
What I would like to do is ensure the users in my source file that are already in the D365 table are not copied over as this would create duplicates.
Source CSV FILE
Existing D365 User table (subset of fields just to illustrate)
Updated D365 table with two new records added from the source CSV
From what I can see there are two possible solutions in Azure Data Factory:
Configure the D365 sink to do it, e.g. configure the sink to ignore records that match on a certain column. Is it possible to configure the sink in some way to accomplish this?
Pull in the D365 table/entity as a source and use it to filter my source CSV to remove user records that already exist in D365, perhaps by using a common field such as fullname to identify such records. This would ensure I only try to import new users.
I have had a look into both methods but have been struggling to find a way to implement them.
I'd like to think the scenario I have outlined above is not uncommon and that there are tried and tested methods to filter out records from a source CSV that already exist in the target D365 table.
I'd appreciate any help/suggestions to help me achieve this.
You can use either of these two approaches.
Use an Azure data flow and upsert the data to the sink using Upsert as your writeBehavior in your Dynamics sink transformation. You can refer to this SO link for information on using the upsert method in Azure Data Factory.
Pull the CSV data in as source1 and the D365 table data as source2, and connect both sources to a join transformation with a left outer join. Then use a filter transformation to filter out the NULL records of source2 (the right table). The output of the filter transformation will be only the new records, which can be passed directly to the D365 sink transformation. You can refer to this SO thread for a similar process.
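For clarity, the join-and-filter logic of the second approach is equivalent to a left anti join; expressed in SQL terms (the table and column names below are made up purely for illustration), it is:
SELECT src.*
FROM source_csv_users AS src
LEFT JOIN d365_users AS d365
    ON src.fullname = d365.fullname
WHERE d365.fullname IS NULL;  -- no match in D365, so this is a new user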
When we do a data extract like yours from Synapse into Azure, upsert does not work correctly and often gets stuck in a loop.
What we do:
Create a temp table in the target.
Select the source table data and extract it into the target temp table.
Run a stored procedure to update, insert and delete in the target real table based on the temp table.
Here is the stored procedure query; I hope it can help you:
-- 1) Update existing rows in the real table from the temp (staging) table
UPDATE t
SET
t.bcfsa_teamid = syn.bcfsa_teamid
,t.bcfsa_account = syn.bcfsa_account
,t.bcfsa_name = syn.bcfsa_name
,t.bcfsa_contactteamownerid = syn.bcfsa_contactteamownerid
FROM website.bcfsa_team t
INNER JOIN syn.bcfsa_team syn ON t.id = syn.id;
-- 2) Insert rows that exist in the temp table but not yet in the real table
INSERT website.bcfsa_team (id,bcfsa_teamid,bcfsa_account,bcfsa_name,bcfsa_contactteamownerid)
SELECT id,bcfsa_teamid,bcfsa_account,bcfsa_name,bcfsa_contactteamownerid
FROM syn.bcfsa_team
EXCEPT
SELECT id,bcfsa_teamid,bcfsa_account,bcfsa_name,bcfsa_contactteamownerid
FROM website.bcfsa_team;
-- 3) Delete rows from the real table that no longer exist in the temp table
DELETE FROM website.bcfsa_team
WHERE id NOT IN (SELECT id FROM syn.bcfsa_team);
-- 4) Clear the temp table for the next run
TRUNCATE TABLE syn.bcfsa_team;

Azure SQL External Table alternatives

Azure external tables between two Azure SQL databases on the same server don't perform well. This is known. I've been able to improve performance by defining a view from which the external table is defined. This works if the view can limit the data set returned, but this partial solution isn't enough. I'd love a way to, at least nightly, move all the data that has been inserted or updated across the full set of tables from the one database (dbo schema) to the second database (pushing into the altdbo schema). I think Azure Data Factory will let me do this, but I haven't figured out how. Any thoughts / guidance? The copy option doesn't copy over table schemas or updates.
Data Factory Mapping Data Flow can help you achieve that.
Use the Alter Row transformation and select an update method in the sink:
This can help you copy newly inserted or updated data to another Azure SQL database, based on the key column.
Alter Row: Use the Alter Row transformation to set insert, delete, update, and upsert policies on rows.
Update method: Determines what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter.
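If you later decide to do this with a script or stored procedure instead of a data flow, the same insert-or-update behaviour can be expressed with a T-SQL MERGE on the key column (a sketch only; the table and column names are placeholders):
MERGE dbo.target_table AS t
USING dbo.staging_table AS s
    ON t.id = s.id                  -- key column identifying the row
WHEN MATCHED THEN
    UPDATE SET t.name = s.name,
               t.modified_at = s.modified_at
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, name, modified_at)
    VALUES (s.id, s.name, s.modified_at);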
Hope this helps.

Is there way a to use join query in Azure Data factory When copying data from Sybase source

I am trying to ingest data from a Sybase source into Azure Data Lake. I am ingesting several tables using a watermark table that holds the table names from the Sybase source. The process works fine for a full import; however, we are trying to import the tables every 15 minutes to feed a dashboard. We don't need to ingest each whole table, as we don't need all the data in it.
The tables don't have a dateModified column or any kind of incremental ID to perform an incremental load. The only way of filtering out unwanted data is to join onto another lookup table at the source and then use a filter value in the WHERE clause.
Is there a way we can do this in Azure Data Factory? I have attached my current pipeline screenshot just to make it a bit clearer.
Many thanks for looking into this. I have managed to find a solution. I was using a watermark table to ingest about 40 tables using one pipeline. My only issue was how to use the join and WHERE filter in my query without hard-coding them in the pipeline. I achieved this by adding Join and Where fields to my watermark table and then passing them into the Query as @{item().Join} @{item().Where}. It worked like magic.
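As a rough sketch of that pattern (the TableName column name and the exact dynamic-content syntax below are assumptions about the setup), the Query of the copy activity inside the ForEach becomes something like:
SELECT src.*
FROM @{item().TableName} src
@{item().Join}
@{item().Where}
where the Join column of the watermark table holds the full join clause (e.g. INNER JOIN lookup_table l ON src.key_col = l.key_col) and the Where column holds the full filter (e.g. WHERE l.region = 'EMEA').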

spotfire new table from file filtered

I am using the Spotfire client.
I have identified some records within a data table that I would like to send to a new data table. Is there some way to create a new table from marked or isolated data, or by using a data limiting expression on the source table? I have had to export my filtered data and then import it back in, but I am hoping there is a more direct way.
Thanks!
If you know the restrictions you need to set on your data to identify the records, you can create a second table based on the source data.
Go to the properties of the table / visualization, then go to the Data tab and scroll all the way to the bottom. There you can edit "Limit data using expression".
You could also create a details visualization if you want, but that is only useful if you can quickly identify the records.
Or insert a calculated column (e.g. a CASE statement) and use that column to filter your data.
