Azure Data Factory input dataset with sqlReaderQuery as source

We are creating an Azure Data Factory pipeline using the .NET API. The input data source is defined with sqlReaderQuery, which means the query can reference multiple tables.
The problem is that we can't single out one table from this query to supply as tableName in the dataset's typeProperties, as shown below:
"typeProperties": {
"tableName": "?"
}
Creating the dataset throws an exception because tableName is mandatory. We don't want to provide tableName in this case. Is there an alternative?
We are also providing structure in the dataset.

Unfortunately, you can't do that natively. You need to deploy a dataset for each table. Azure Data Factory produces slices for every activity ahead of execution time. Without knowing the table name, Data Factory would fail when producing these input slices.
If you want to read from multiple tables, use a stored procedure as the input to the dataset. Do your joins and input shaping in the stored procedure.
You could also get around this by building a dynamic custom activity that operates, say, at the database level. In that case you would use a dummy input dataset and a generic output dataset and control most of the process yourself.

It is a bit of a nuisance that this property is mandatory, particularly if you have provided a ...ReaderQuery. For Oracle copies I have used sys.dual as the table name; this is a sort of built-in dummy table in Oracle. In SQL Server you could use one of the system views or set up a dummy table.
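
A minimal sketch of the placeholder approach, written as Python that prints the two JSON fragments involved. It assumes the ADF v1 JSON shapes for an Azure SQL dataset (AzureSqlTable) and a copy source (SqlSource). The dataset name, linked service name, table names, and query are all hypothetical, not taken from the question.

```python
import json

# Hypothetical placeholder: any table or view that exists in the database.
PLACEHOLDER_TABLE = "DummyTable"


def build_input_dataset(name, linked_service, table_name=PLACEHOLDER_TABLE):
    """Dataset definition that satisfies the mandatory tableName property."""
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": linked_service,
            "typeProperties": {"tableName": table_name},
            "external": True,
            "availability": {"frequency": "Day", "interval": 1},
        },
    }


def build_copy_source(query):
    """Copy activity source: the query, not tableName, decides what is read."""
    return {"type": "SqlSource", "sqlReaderQuery": query}


dataset = build_input_dataset("InputFromQuery", "AzureSqlLinkedService")
source = build_copy_source(
    "SELECT o.Id, c.Name FROM Orders o JOIN Customers c ON o.CustomerId = c.Id"
)

print(json.dumps(dataset, indent=2))
print(json.dumps(source, indent=2))
```

The placeholder only satisfies validation; the rows that are copied come from the query in the copy source.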


How to have dynamic sink in azure dataflow

I'm trying to create a dynamic data flow that gets data using a dynamic query that is passed as a parameter, and then I want to upsert and delete in a dynamically defined sink table. I'm having trouble making the sink dynamic. Is it possible? How can I do it? I see no option to have a dynamic target table. What I was thinking was to add another parameter with the sink table name and use it.
High level view of the dataflow
[Source gets data given sql query ] (OK)
[transformations in the middle] (OK)
[Sink to a dynamic table, that needs to be parameterized but I cannot find a way] (NOT OK)
You can create a parameter on the sink dataset and then pass the table name to it as a parameter from the Data Flow activity.
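As a rough sketch of what a parameterized sink dataset could look like, the Python below prints a dataset definition with one string parameter that is referenced through the @dataset() expression. The dataset name, linked service name, and the parameter name sinkTableName are hypothetical; the value itself would be supplied in the Data Flow activity settings, which are not shown here.

```python
import json


def build_parameterized_sink_dataset(name, linked_service, parameter="sinkTableName"):
    """Dataset whose table name is resolved from a dataset parameter at run time."""
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
            # Declared parameter; the caller must supply a value for it.
            "parameters": {parameter: {"type": "String"}},
            "typeProperties": {
                "tableName": {
                    "value": "@dataset()." + parameter,
                    "type": "Expression",
                }
            },
        },
    }


sink_dataset = build_parameterized_sink_dataset("DynamicSink", "AzureSqlLinkedService")
print(json.dumps(sink_dataset, indent=2))
```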

How do I store run-time data in Azure Data Factory between pipeline executions?

I have been following Microsoft's tutorial to incrementally/delta load data from an SQL Server database.
It uses a watermark (timestamp) to keep track of changed rows since last time. The tutorial stores the watermark to an Azure SQL database using the "Stored Procedure" activity in the pipeline so it can be reused in the next execution.
It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only btw). I'd rather just store that somewhere else in Azure. Maybe in the blob storage or whatever.
In short: Is there an easy way of keeping track of this type of data or are we limited to using stored procs (or Azure Functions et al) for this?
I had come across a very similar scenario, and from what I found you can't store any watermark information in ADF - at least not in a way that you can easily access.
In the end I just created a basic tier Azure SQL database to store my watermark / config information on a SQL server that I was already using in my pipelines.
The nice thing about this is when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them by simply adding a column that tracks which BU that specific watermark info was for.
Blob storage is indeed a cheaper option but I've found it to require a little more effort than just using an additional database / table in an existing database.
I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!
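A minimal sketch of the watermark table described above, with one row per business unit. It uses an in-memory SQLite database purely as a stand-in for the Azure SQL database; the table, column, and business unit names are hypothetical.

```python
import sqlite3


def get_watermark(conn, business_unit, default="1900-01-01 00:00:00"):
    """Return the stored watermark for a business unit, or a default if none exists."""
    row = conn.execute(
        "SELECT watermark_value FROM watermark WHERE business_unit = ?",
        (business_unit,),
    ).fetchone()
    return row[0] if row else default


def set_watermark(conn, business_unit, value):
    """Insert or overwrite the watermark for a business unit."""
    conn.execute(
        "INSERT OR REPLACE INTO watermark (business_unit, watermark_value) "
        "VALUES (?, ?)",
        (business_unit, value),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE watermark ("
    "business_unit TEXT PRIMARY KEY, "
    "watermark_value TEXT NOT NULL)"
)

set_watermark(conn, "sales", "2020-11-24 02:39:14")
set_watermark(conn, "finance", "2020-11-20 00:00:00")
set_watermark(conn, "sales", "2020-11-24 08:31:42")  # a later run overwrites

print(get_watermark(conn, "sales"))    # 2020-11-24 08:31:42
print(get_watermark(conn, "unknown"))  # 1900-01-01 00:00:00
```

In a pipeline, the read would correspond to a Lookup activity at the start of the run and the write to a Stored Procedure activity at the end, as in the tutorial.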
There is a way to achieve this with a Copy activity, but getting the latest watermark back out in 'LookupOldWaterMarkActivity' is complicated. This is just for reference.
The source and sink datasets are the same one. Change the expression in additional columns to @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
This way, each run saves the watermark as a new column in a .txt file. But it is difficult to get the latest watermark with a Lookup activity, because the output of 'LookupOldWaterMarkActivity' will look like this:
{
"count": 1,
"value": [
{
"Prop_0": "11/24/2020 02:39:14",
"Prop_1": "11/24/2020 08:31:42"
}
]
}
The key names are generated by ADF. To get "11/24/2020 08:31:42", you need the column count and then an expression of this form: @activity('LookupOldWaterMarkActivity').output.value[0][Prop_(column count - 1)]
How to get the latest watermark:
1. Use a Get Metadata activity to get columnCount.
2. Use this expression: @activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_',string(sub(activity('Get Metadata1').output.columnCount,1)))]
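To make the indexing logic concrete, here is a small Python sketch that mirrors what the expression does: build the key 'Prop_' plus (column count - 1) and read it from the first row of the Lookup output. The function name is hypothetical; the sample data is the Lookup output shown above.

```python
def latest_watermark(lookup_output, column_count):
    """Return the value of the last generated column, Prop_<column_count - 1>."""
    key = "Prop_" + str(column_count - 1)
    return lookup_output["value"][0][key]


lookup_output = {
    "count": 1,
    "value": [
        {
            "Prop_0": "11/24/2020 02:39:14",
            "Prop_1": "11/24/2020 08:31:42",
        }
    ],
}

# column_count would come from the Get Metadata activity in the pipeline.
print(latest_watermark(lookup_output, 2))  # 11/24/2020 08:31:42
```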

Can you use dynamic/run-time outputs with azure stream analytics

I am trying to get aggregate data sent to different table storage outputs based on a column name in select query. I am not sure if this is possible with stream analytics.
I've looked up the stream analytics docs and different forums, so far haven't found any leads. I am looking for something like
Select tableName,count(distinct records)
into tableName
from inputStream
I hope this makes it clear what I'm trying to achieve, I am trying to insert aggregates data into table storage (defined as outputs). I want to grab the output stream/tablestorage name from a select Query. Any idea how that could be done?
I am trying to get aggregate data sent to different table storage
outputs based on a column name in select query.
If I understand your requirement correctly, you want a case...when... or if...else... structure in the ASA SQL so that you can send data to different table outputs based on some condition. If so, I'm afraid that is not possible so far. Every destination in ASA has to be specific; dynamic output is not supported in ASA.
However, as a workaround, you could use an Azure Function as the output. You could pass the columns to the Azure Function, then do the switching in code inside the function to save data into different table storage destinations. For more details, please refer to this official doc: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-with-azure-functions
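The switching step inside such a function could look roughly like the sketch below. It only groups incoming records by the value of a tableName column; the function name and sample records are hypothetical, and the actual writes to table storage are left out.

```python
from collections import defaultdict


def route_by_table(records, table_key="tableName"):
    """Group records by destination table, dropping the routing column itself."""
    batches = defaultdict(list)
    for record in records:
        table = record[table_key]
        entity = {k: v for k, v in record.items() if k != table_key}
        batches[table].append(entity)
    return dict(batches)


# Hypothetical batch, shaped like aggregate rows sent by the ASA job.
incoming = [
    {"tableName": "alpha", "recordCount": 12},
    {"tableName": "beta", "recordCount": 3},
    {"tableName": "alpha", "recordCount": 7},
]

for table, entities in route_by_table(incoming).items():
    # A real function would write each batch to its table storage destination here.
    print(table, entities)
```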

Azure Data Factory Data Migration

Not really sure this is an explicit question or just a query for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one less often.
What I want to do is prepare a Data Factory pipeline that will grab the data from all 3 DB locations and combine them based on the common attributes, which will result in a new dataset. Then I want to push the data within this dataset to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a copy activity, but it only works on a single input dataset, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are meant for massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I want to do.
I did find that there is a special activity called a Custom Activity, which is in effect a user-defined activity that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if this is the most optimal solution.
On top of that, I am also unclear about how merging the 3 data sources would work. If the data has to be joined across the 3 sources but the datasets are just snapshots of the originating source data, it seems possible that data could go missing. I'm not sure if publishing some of the data someplace would be required, but that seems like it would in effect be maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS but what you are trying to do is fairly common for either of these integration tools.
Your ADF diagram should look something like:
1. You define your 3 Data Sources as ADF Datasets on top of a corresponding Linked service
2. Then you build a pipeline that brings information from SQL Server into a temporary Data Source (Azure Table for example)
3. Next you need to build 2 pipelines that will each take one of your NoSQL Datasets and run a function to update the temporary Data Source, which is the output
4. Finally you can build a pipeline that will bring all your data from the temporary Data Source into your other SQL Server
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break down the task in logical jobs and you should have no problem coming up with a solution.
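The combine step itself, wherever it ends up running (for example in the function that updates the temporary store), amounts to a join on the common attribute. The Python sketch below shows that logic on in-memory rows. It assumes a single shared key that is unique within each source, which simplifies the scenario in the question; all field names and sample values are hypothetical.

```python
def index_by(rows, key):
    """Build a lookup from key value to row (assumes keys are unique per source)."""
    return {row[key]: row for row in rows}


def combine_sources(sql_rows, collection_a, collection_b, key):
    """Inner-join the three sources on a shared key and merge their fields."""
    a_by_key = index_by(collection_a, key)
    b_by_key = index_by(collection_b, key)
    combined = []
    for row in sql_rows:
        k = row[key]
        # Rows without a match in both collections are skipped (inner join).
        if k in a_by_key and k in b_by_key:
            combined.append({**row, **a_by_key[k], **b_by_key[k]})
    return combined


sql_rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
collection_a = [{"id": 1, "status": "active"}, {"id": 2, "status": "closed"}]
collection_b = [{"id": 1, "region": "EU"}]

for merged in combine_sources(sql_rows, collection_a, collection_b, "id"):
    print(merged)
```

Rows that have no match yet are simply left out of the result, which is one way the missing-data concern from the question shows up when the sources are refreshed on different schedules.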

Can I upload data from multiple data sources to Azure DW at the same time

Can I load data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes you can, but you have to define a copy activity for each data source being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there would be 2 ways to do this. Regardless of the method, the data source does not matter to ADF since your data movement happens via the copy activity which looks at the dataset and takes care of firing the query on the related datasource.
Method 1:
If all your transformation for a table can be done in a SELECT query on the source systems, you can have a set of copy activities specifying SELECT statements. This is the simple approach.
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy over the raw data from the source systems to staging tables in the SQLDW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets which are the output from Step1 will be input datasets to Step 2 in order to maintain consistency.
