I'm using Azure Data Factory to copy data from Azure Cosmos DB to Azure Data Lake. My pipeline consists of a copy activity that copies the data to the Data Lake sink.
This is my query on the source dataset:
select * from c
where c.data.timestamp >= '@{formatDateTime(addminutes(pipeline().TriggerTime, -15), 'yyyy-MM-ddTHH:mm:ssZ' )}'
AND c.data.timestamp < '@{formatDateTime(pipeline().TriggerTime, 'yyyy-MM-ddTHH:mm:ssZ' )}'
I'm getting the data for the last 15 minutes before the trigger time.
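To make the window explicit, here is a minimal Python sketch of what the two expressions in the query compute, assuming the trigger time is in UTC and that c.data.timestamp is stored as a string in the same fixed format (so that string comparison orders correctly). The function names are hypothetical and only illustrate the half-open interval [trigger - 15 min, trigger).

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"  # mirrors 'yyyy-MM-ddTHH:mm:ssZ' in the query


def window_bounds(trigger_time, minutes=15):
    """Return (lower, upper) as formatted strings for the look-back window."""
    start = trigger_time - timedelta(minutes=minutes)
    return start.strftime(FMT), trigger_time.strftime(FMT)


def in_window(timestamp, lower, upper):
    """Half-open check: lower bound inclusive, upper bound exclusive."""
    return lower <= timestamp < upper


# Example trigger time (an assumption, chosen only for illustration).
trigger = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lower, upper = window_bounds(trigger)
print(lower, upper)
```

Because the upper bound is exclusive, a record stamped exactly at the trigger time falls into the next run's window rather than being copied twice.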
Now, if there is no data retrieved by the query then the copy activity generates an empty file and stores it in the data lake. I want to prevent that. Is there any way I can achieve this?
You could use a Lookup activity followed by an If Condition activity to decide whether you need to run the copy activity.
In the Lookup activity, you could set firstRowOnly to true, since you only want to check whether there is any data.
This is an older thread, but someone might have a more elegant way to handle the issue above, namely that ADF produces a file even when there are 0 records. Here are my concerns with the Lookup approach and with having a post-process clean up the empty file.
It's inefficient to query the database twice just to check whether there are any rows.
Using the [If Condition] component is not possible if you are already inside an [If Condition] or [Switch] component of ADF. (This is an ADF constraint/shortcoming as well.)
Cleaning up the empty file is also inefficient, and it is not an option if you are triggering off the file-created event: the empty file is written before you can clean it up, which causes a false positive.
I tried the following and it is working: I'm checking if the lookup entry returns more than 0 rows.
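A rough sketch of such a check, written as a Python script that assembles the relevant part of a pipeline definition and prints it as JSON. The activity names (LookupRows, IfRowsExist, CopyToDataLake) are hypothetical, the source and dataset settings are left out, and the JSON shape should be compared against what the ADF authoring UI generates. It assumes firstRowOnly is set to false, in which case the Lookup output carries a count of returned rows.

```python
import json


def build_guarded_copy(lookup_name="LookupRows", copy_name="CopyToDataLake"):
    """Build a Lookup activity plus an If Condition that gates the copy."""
    lookup = {
        "name": lookup_name,
        "type": "Lookup",
        "typeProperties": {
            # Source query and dataset reference omitted in this sketch.
            "firstRowOnly": False,
        },
    }
    if_condition = {
        "name": "IfRowsExist",
        "type": "IfCondition",
        "dependsOn": [
            {"activity": lookup_name, "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {
            # Run the copy only when the Lookup returned at least one row.
            "expression": {
                "value": f"@greater(activity('{lookup_name}').output.count, 0)",
                "type": "Expression",
            },
            # Placeholder for the real copy activity definition.
            "ifTrueActivities": [{"name": copy_name, "type": "Copy"}],
        },
    }
    return [lookup, if_condition]


activities = build_guarded_copy()
print(json.dumps(activities, indent=2))
```

This still queries the source twice, which is the inefficiency raised earlier in the thread; it only avoids writing the empty file.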
I am working with Azure Databricks and we are moving hundreds of gigabytes of data with Spark. We stream them with Databricks' autoloader function from a source storage on Azure Datalake Gen2, process them with Databricks notebooks, then load them into another storage. The idea is that the end result is a replica, a copy-paste of the source, but with some transformations involved.
This means if a record is deleted at the source, we also have to delete it. If a record is updated or added, then we do that too. For the latter, autoloader with a file-level listener, combined with MERGE INTO and .foreachBatch(), is an efficient solution. But what about deletions? For technical reasons (the Dynamics 365 Azure Synapse Link export being extremely limited in configuration) we are not getting delta files, so we have no data on whether a certain record was updated, added or deleted. We only have the full data dump every time.
Simply put: I want to delete records in a target dataset if the record's primary key is no longer found in a source dataset. In T-SQL, MERGE can check both ways, whether a row is unmatched by the target or by the source; however, in Databricks this is not possible, since MERGE INTO only checks for the target dataset.
Best idea so far:
DELETE FROM a WHERE NOT EXISTS (SELECT id FROM b WHERE a.id = b.id)
Occasionally a deletion job might delete millions of rows, which we have to replicate, so performance is important. What would you suggest? Are there any best practices for this?
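To show what the NOT EXISTS statement above does, here is a small self-contained sketch that uses Python's built-in sqlite3 module as a stand-in for the real engine. The tables a (target) and b (source) and their rows are made up for illustration, and nothing here says anything about performance on Databricks.

```python
import sqlite3


def delete_missing(conn):
    """Delete target rows (a) whose id no longer exists in the source (b)."""
    cur = conn.execute(
        "DELETE FROM a WHERE NOT EXISTS (SELECT 1 FROM b WHERE a.id = b.id)"
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT);  -- target
    CREATE TABLE b (id INTEGER PRIMARY KEY, val TEXT);  -- latest full dump
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b VALUES (1, 'x'), (3, 'z2');
    """
)

deleted = delete_missing(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM a ORDER BY id")]
print(deleted, remaining)
```

Only id 2 is removed, because it is the only key in the target that is absent from the source dump; updated rows such as id 3 are left for the MERGE step.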
There is a Conditional Split in my ADF data flow. The success branch writes the rows to a SQL database, and the failure branch collects all the incorrect records and writes them to a sink of type CSV (Delimited text).
When every row takes the success branch, an empty CSV file of 0 bytes is still created in the sink.
How can I stop this?
If you don't wish to write output to an external source, you can use a cache sink. It writes data into the Spark cache instead of a data store. In mapping data flows, you can reference this data within the same flow many times using a cache lookup. If you want to store it in a data store later, just reference the data as part of an expression.
To write to a cache sink, add a sink transformation and select Cache as the sink type. Unlike other sink types, you don't need to select a dataset or linked service because you aren't writing to an external store.
Note: A cache sink must be in a completely independent data stream from any transformation referencing it via a cache lookup. A cache sink must also be the first sink written.
When utilizing cached lookups, make sure that your sink ordering has the cached sinks set to 1, the lowest (or first) in ordering.
You can then reference this data within the same flow using a cache lookup, as part of an expression, to store it in a data store.
Alternatively, use a cache lookup against the source to select the rows and write them to CSV, or log them in a different stream or sink.
Refer: CacheSink, CachedLookup
If you still want to delete empty zero-byte files, you can do so at the end of execution, either in ADF (with the Delete activity in Azure Data Factory) or programmatically.
Examples of using the Delete activity
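For the programmatic route, a minimal sketch of the clean-up logic follows. It works on a local folder as a stand-in for the sink location; against a real data lake you would list and delete through the storage SDK instead. The function name and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def delete_empty_files(root, pattern="*.csv"):
    """Remove zero-byte files under root and return the names removed."""
    removed = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file() and path.stat().st_size == 0:
            path.unlink()
            removed.append(path.name)
    return removed


# Demo on a temporary folder standing in for the sink.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "errors.csv").write_text("id,reason\n1,bad date\n")
    (Path(tmp) / "empty.csv").write_text("")  # the 0-byte output
    removed = delete_empty_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```

Note that this runs after the file has already been written, so it does not help if something downstream is triggered by the file-created event.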
Not really sure this is an explicit question or just a query for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one less often.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations and combine them based on the common attributes, which will result in a new dataset. Then I want to push the data within this dataset to another SQL Server DB.
I'm a bit unclear on how this is to be done within the data factory. There is a copy activity, but it only works on a single dataset input, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are meant for massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I want to do.
I did find that there is a special activity called a Custom Activity that is in effect a user defined definition that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if this is the most optimal solution.
On top of that, I am unclear how merging the 3 data sources would work when data from all 3 sources has to be joined. If the datasets are just snapshots of the originating source data, I suspect data could go missing. I'm not sure whether some of the data would need to be published somewhere first, but that would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS but what you are trying to do is fairly common for either of these integration tools.
Your ADF diagram should look something like:
1. You define your 3 Data Sources as ADF Datasets on top of a corresponding Linked service
2. Then you build a pipeline that brings information from SQL Server into a temporary Data Source (Azure Table for example)
3. Next you need to build 2 pipelines that will each take one of your NoSQL Datasets and run a function to update the temporary Data Source, which is the output
4. Finally you can build a pipeline that will bring all your data from the temporary Data Source into your other SQL Server
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break down the task in logical jobs and you should have no problem coming up with a solution.
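The combining step itself can be sketched outside ADF to make the logic concrete. The following Python example assumes that all three sources share a single key, here called customer_id (a hypothetical name, as are the other fields), and keeps only the rows present in all three sources; whether unmatched rows should be kept is a design decision the question leaves open.

```python
def index_by(rows, key):
    """Index a list of dict rows by the value of a key field."""
    return {row[key]: row for row in rows}


def combine(sql_rows, frequent_docs, rare_docs, key="customer_id"):
    """Inner-join the SQL rows with both NoSQL collections on a shared key."""
    frequent = index_by(frequent_docs, key)
    rare = index_by(rare_docs, key)
    combined = []
    for row in sql_rows:
        k = row[key]
        if k in frequent and k in rare:
            # Later sources win on field-name clashes.
            combined.append({**row, **rare[k], **frequent[k]})
    return combined


# Made-up sample data standing in for the three sources.
sql_rows = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
frequent_docs = [{"customer_id": 1, "status": "active"}]
rare_docs = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]

result = combine(sql_rows, frequent_docs, rare_docs)
print(result)
```

Customer 2 is dropped here because it has no document in the frequently updated collection, which is exactly the kind of gap a snapshot-based join can produce.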
Can I retrieve data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes, you can, but you have to define a copy activity for each of the data sources being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there would be 2 ways to do this. Regardless of the method, the data source does not matter to ADF since your data movement happens via the copy activity which looks at the dataset and takes care of firing the query on the related datasource.
Method 1:
If all your transformations for a table can be done in a SELECT query on the source systems, you can have a set of copy activities specifying SELECT statements. This is the simple approach.
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy over the raw data from the source systems to staging tables in the SQLDW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets which are the output from Step 1 will be the input datasets to Step 2, in order to maintain consistency.
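The two steps of Method 2 can be illustrated with a small sketch. It uses Python's built-in sqlite3 module in place of the SQL DW instance, plain functions in place of copy activities and stored procedures, and invented table names (stg_customers, stg_orders, dim_order), so it shows only the staging-then-transform pattern, not actual ADF or SQL DW behaviour.

```python
import sqlite3


def copy_to_staging(conn, table, rows):
    """Step 1: land raw rows in a staging table, untransformed."""
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


def transform(conn):
    """Step 2: stand-in for a stored procedure that integrates the data."""
    conn.execute(
        """
        INSERT INTO dim_order (order_id, customer_name)
        SELECT o.order_id, c.name
        FROM stg_orders AS o
        JOIN stg_customers AS c ON c.customer_id = o.customer_id
        """
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_customers (customer_id INTEGER, name TEXT);
    CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE dim_order (order_id INTEGER, customer_name TEXT);
    """
)

# One "copy" per source system, then a single transformation pass.
copy_to_staging(conn, "stg_customers", [(1, "Acme"), (2, "Globex")])
copy_to_staging(conn, "stg_orders", [(10, 1), (11, 2), (12, 1)])
transform(conn)

result = conn.execute(
    "SELECT order_id, customer_name FROM dim_order ORDER BY order_id"
).fetchall()
print(result)
```

The transformation runs only after both staging loads have finished, which mirrors making the Step 1 output datasets the inputs of Step 2.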
So I am trying to use Azure Data Factory to replace the SSIS system we have in place, and I am having some trouble...
The process I want to follow is to take a list of projects and a list of clients and create a report of the clients and projects we have. These lists update frequently, so I want to update this report every hour. To combine the data, I will be using Power BI Pro, so Data Factory just needs to load the data into a usable format.
My source right now is a call to an API that returns a list of projects. However, this data isn't separated by time at all. I don't see any sort of history. Same goes for the list of clients.
What should the availability for my dataset be?
You may use the custom activity in ADF to call the API that returns the list of projects. The custom activity will then write that data in the right format to the destination.
Example of a custom activity in ADF: https://azure.microsoft.com/en-us/documentation/articles/data-factory-use-custom-activities/
The frequency will be the cadence at which you wish to run this operation.
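As an illustration of the "write that data in the right format" part, here is a minimal Python sketch that turns a list of project records into CSV text. The API call itself is left out; the records, field names, and function name are hypothetical, and a real custom activity would follow the linked article rather than this snippet.

```python
import csv
import io


def projects_to_csv(projects, fields):
    """Serialize project records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for project in projects:
        writer.writerow(project)
    return buffer.getvalue()


# Stand-in for the API response; the real payload shape is not known here.
projects = [
    {"id": 1, "name": "Alpha", "client": "Acme", "internal_note": "n/a"},
    {"id": 2, "name": "Beta", "client": "Globex", "internal_note": "n/a"},
]

csv_text = projects_to_csv(projects, ["id", "name", "client"])
print(csv_text)
```

Since the API keeps no history, each hourly run would produce a full snapshot that replaces the previous one.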