I would like to set up a pipeline to sync data from a data warehouse to a NoSQL Cosmos DB. The Copy activity works fine for one-to-one table relations, but for one-to-many I will obviously end up with duplicate objects in my NoSQL DB. What is the best way to solve this and get an array of the one-to-many items instead of duplicated rows?
Thanks in advance
In your case, I don't think the Copy activity can achieve that. The Copy activity just copies data from one table to another, either appending new documents or doing an upsert based on the Cosmos DB id. You could write your own code to do the merging and then use an ADF custom activity to invoke it.
https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity
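If your warehouse engine supports FOR JSON (Azure SQL Database does; not every warehouse engine does), another option is to do the merging in the source query itself, so each parent row already carries its child rows as a nested array before the copy. A minimal sketch, assuming hypothetical Orders/OrderItems tables and column names:

-- Collapse the one-to-many side into a JSON array per parent row, so each
-- document lands in Cosmos DB once with a nested "Items" array.
-- All table and column names here are placeholders.
SELECT
    o.OrderId,
    o.CustomerName,
    (
        SELECT i.ProductId, i.Quantity, i.Price
        FROM dbo.OrderItems AS i
        WHERE i.OrderId = o.OrderId
        FOR JSON PATH
    ) AS Items
FROM dbo.Orders AS o;

Depending on how the Copy activity maps that column, the Items value may still land as an escaped JSON string rather than a true array, so test the mapping before relying on this instead of custom code.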
I am setting up a pipeline in Data Factory where the first part of the pipeline needs some pre-processing cleaning. I currently have a script set up to query the rows that need to be deleted and export the results into a CSV.
What I am looking for is essentially the opposite of an upsert copy activity: I would like the procedure to delete the rows in my table that match rows in that result set.
Apologies in advance if this is an easy solution; I am fairly new to Data Factory and just need help looking in the right direction.
Assuming the source from which you are initially getting the rows is different from the sink, there are multiple ways to achieve this:
If the number of rows is small, you can use a Script activity or a Lookup activity to delete the records from the destination table.
For a larger dataset, given the limitations of the Lookup activity, you can copy the data into a staging table in the destination and use a Script activity to delete the matching rows (see the sketch after this list).
If your org supports the use of Data Flows, you can use a Data Flow to achieve the same thing.
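For the staging-table route, the statement run by the Script activity would look roughly like this; dbo.TargetTable, stg.RowsToDelete, and BusinessKey are placeholder names:

-- Remove rows from the destination that have a match in the staging table.
DELETE t
FROM dbo.TargetTable AS t
INNER JOIN stg.RowsToDelete AS s
    ON s.BusinessKey = t.BusinessKey;

-- Optionally clear the staging table for the next run.
TRUNCATE TABLE stg.RowsToDelete;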
I have 100-150 Azure databases with the same table schema. There are 300-400 tables in each database. Separate reports are enabled on all these databases.
Now I want to merge these databases into a centralized database and generate some different Power BI reports from this centralized database.
The approach I am thinking of is:
There will be a Master table in the target database which will have DatabaseID and Name.
All the tables in the target database will have a composite primary key created from the source primary key and the DatabaseID (a sketch follows below).
There will be multiple (30-35) instances of the Azure Data Factory pipeline, and each instance will be responsible for merging data from 10-15 databases.
These ADF pipelines will be scheduled to run weekly.
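For illustration, the target tables would look roughly like this; the table and column names are just placeholders:

-- Master table holding one row per source database.
CREATE TABLE dbo.SourceDatabase (
    DatabaseID   int           NOT NULL PRIMARY KEY,
    DatabaseName nvarchar(128) NOT NULL
);

-- Example merged table: the source primary key plus DatabaseID form the
-- composite primary key, so rows from different source databases cannot collide.
CREATE TABLE dbo.Customer (
    DatabaseID   int            NOT NULL,
    CustomerID   int            NOT NULL,  -- primary key in the source database
    CustomerName nvarchar(200)  NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (DatabaseID, CustomerID),
    CONSTRAINT FK_Customer_SourceDatabase
        FOREIGN KEY (DatabaseID) REFERENCES dbo.SourceDatabase (DatabaseID)
);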
Can anyone please guide me on whether the above approach is feasible in this scenario? Or is there any other option we can go for?
Thanks in Advance.
You are trying to create a data warehouse.
I hope you never try to merge all 150 Azure SQL databases wholesale, because as soon as you try to query that beefy archive you will run into errors.
This is because Power BI, like any other tool, comes with limitations:
Distinct-value limitation: there is a 1,999,999,997 limit on the number of distinct values that can be stored in a column.
Row limitation: If the query sent to the data source returns more than one million rows, you see an error and the query fails.
Column limitation: The maximum number of columns allowed in a dataset, across all tables in the dataset, is 16,000 columns.
A data warehouse is not just a merge of ALL of your data. You need to clean the data and import only the most useful parts.
So the approach you are proposing is overall OK, but just import what you need.
I have a requirement where I need to move data from multiple tables in Oracle to ADLS.
The size of the data is around 5 TB. I might use these files in ADLS in the future to connect Power BI.
Is there any easy and efficient way to do this?
Thanks in advance!
You can do this by using a Lookup activity and a ForEach activity in Azure Data Factory.
Create a table or file to store the list of table names that need to be extracted (a sketch of such a control table follows these steps).
Use a Lookup activity to get the list of tables.
Pass the list to a ForEach activity and, looping over each table, copy the current item() from Oracle to ADLS.
In the ForEach activity, under Settings > Items, add the following in the Add Dynamic Content box:
@activity('Get-Tables').output.value
Add a Copy activity inside the ForEach activity.
In the Copy data activity, under Source > Query, enter the following:
SELECT * FROM @{item().Table_Name}
Now add the sink dataset (ADLS) and execute your pipeline.
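As a rough sketch, the control table from the first step and the query configured in the 'Get-Tables' Lookup activity might look like this; the names are placeholders:

-- Control table listing the Oracle tables to extract.
CREATE TABLE dbo.TablesToExtract (
    Table_Name varchar(128) NOT NULL
);

INSERT INTO dbo.TablesToExtract (Table_Name)
VALUES ('CUSTOMERS'), ('ORDERS'), ('ORDER_ITEMS');

-- Query used by the Lookup activity ('Get-Tables').
SELECT Table_Name FROM dbo.TablesToExtract;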
Please refer to the Microsoft documentation to learn about creating linked services for Oracle.
Please go through this article by Sean Forgatch in MODERN DATA ENGINEERING if you face any issues in the process.
Azure external tables between two Azure SQL databases on the same server don't perform well. This is known. I've been able to improve performance by defining a view from which the external table is defined. This works if the view can limit the data set returned, but this partial solution isn't enough. I'd love a way to move, at least nightly, all the data that has been inserted or updated across the full set of tables from the one database (dbo schema) to the second database (pushing into the altdbo schema). I think Azure Data Factory will let me do this, but I haven't figured out how. Any thoughts or guidance? The copy option doesn't copy over table schemas or updates.
A Data Factory Mapping Data Flow can help you achieve that.
Use the Alter Row transformation and select an update method in the Sink.
This can help you copy the newly inserted or updated data to the other Azure SQL database based on the key column.
Alter Row: Use the Alter Row transformation to set insert, delete, update, and upsert policies on rows.
Update method: Determines what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter.
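For intuition only, the upsert the sink ends up applying behaves roughly like a keyed MERGE such as the one below; the table and column names are placeholders, and this is not the data flow's actual mechanism:

-- Rough SQL equivalent of an upsert keyed on Id: update the row when the key
-- already exists in the target, otherwise insert it.
MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Name = s.Name,
               t.ModifiedDate = s.ModifiedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Name, ModifiedDate)
    VALUES (s.Id, s.Name, s.ModifiedDate);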
Hope this helps.
I am trying to ingest data from a Sybase source into Azure Data Lake. I am ingesting several tables using a Watermark table that holds the table names from the Sybase source. The process works fine for a full import; however, we are trying to import the tables every 15 minutes to feed a dashboard. We don't need to ingest whole tables, as we don't need all the data in them.
The tables don't have a dateModified column or any kind of incremental ID to perform an incremental load. The only way of filtering out unwanted data is to join to another lookup table at the source and then apply a filter value in the WHERE clause.
Is there a way we can do this in Azure Data Factory? I have attached a screenshot of my current pipeline just to make it a bit clearer.
Many thanks for looking into this. I have managed to find a solution. I was using a Watermark table to ingest about 40 tables with one pipeline. My only issue was how to use the join and "where" filter in my query without hard-coding them in the pipeline. I achieved this by adding "Join" and "Where" fields to my Watermark table and then passing them into the query as @{item().Join} @{item().Where}. It worked like magic.
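As a rough illustration of how that resolves at run time, assuming the query template also pulls the table name from the watermark row and using placeholder names throughout:

-- Query text in the Copy activity source, built from the watermark row:
--   SELECT * FROM @{item().Table_Name} @{item().Join} @{item().Where}
--
-- With a watermark row such as
--   Table_Name = 'Sales'
--   Join       = 'JOIN LoadFilter ON LoadFilter.SaleId = Sales.SaleId'
--   Where      = 'WHERE LoadFilter.IsRequired = 1'
-- the resolved query sent to Sybase looks like:
SELECT *
FROM Sales
JOIN LoadFilter ON LoadFilter.SaleId = Sales.SaleId
WHERE LoadFilter.IsRequired = 1;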