I'm trying to create an archiving pipeline which essentially does the following:
Call stored procedure (GET) from a SQL Azure DB that returns a resultset
Archive the result from #1 to a storage account (e.g. as JSON files)
Extract the ID column from #1 into an array of ints
Use the result from #3 as a parameter to call a stored procedure (DELETE) in the same SQL Azure DB
So far, I've tried the Copy Data activity/tool which satisfies steps 1 & 2. However, I'm not sure how to get the outputs from that step and can't find any documentation at Microsoft.
Is it not the correct usage? Do I have to manually do it instead?
Also, I'd like to do some validation in between steps (i.e. no result? don't proceed).
I've managed to try the bare/general Stored Procedure activity but also can't find where to retrieve its output for use in the next step. I'm pretty new to Data Factory and don't really work with data engineering/pipelines, so please bear with me.
Using the Copy Data activity, you can copy the stored procedure data to storage. Connect the source to the SQL database and use the stored procedure as the query option, then connect the sink to a folder in a storage account.
Once the data is copied to the storage account, use a Lookup activity to read the data from the file generated in step 1.
Extract the output of the Lookup activity into an array variable using a Set Variable activity.
You can use an If Condition activity to validate the result and run the remaining activities only if it's true.
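A minimal sketch of the expressions involved (activity and variable names such as Lookup1 are assumptions):
Set Variable (array) value: @activity('Lookup1').output.value
If Condition expression (only proceed when rows were returned): @greater(activity('Lookup1').output.count, 0)
Stored procedure (DELETE) parameter value, one option being to pass the IDs as a JSON string and parse it with OPENJSON on the SQL side: @string(activity('Lookup1').output.value)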
Related
I'm using Data Factory to create my pipeline, and I'm facing some challenges.
The pipeline consists of a Lookup that returns a JSON array, a ForEach that loops over this JSON array, and finally a Set Variable inside the ForEach loop:
(Screenshots of the pipeline, the Lookup, and the variable omitted.)
Now what I'm looking for is to pass the result of the Set Variable value (which is a link to an image) to a Copy activity, or something like that, in order to download the image into our data lake container.
The name of the downloaded image should be like this:
id +'_'+guid()+'.png'
Thanks for your help.
Using a Copy Data activity, with the source dataset connected to the HTTP connector to execute the URL and the sink connected to data lake storage, you can copy the images to the data lake as shown below.
Using the lookup activity, get the array value from the JSON file.
Output of lookup:
Pass the output of the Lookup activity to the ForEach activity:
@activity('Lookup1').output.value
Inside the ForEach activity, use a Copy Data activity to copy the image from the URL to the data lake.
Using a Set Variable activity to store the URL from the lookup output is optional, as you can directly use the current item in the Copy activity.
Source dataset:
Create an HttpServer linked service with a parameterized URL, and use it with a binary dataset (a sketch of the linked service is shown after these steps).
In the source dataset, pass the dataset parameter through to the linked service parameter value.
In the source settings, pass the current item's urls column to the dataset parameter: @item().urls.
Connect the sink to data lake storage with a binary dataset and create a dataset parameter to provide the name of the file in the sink dataset.
In the sink settings, provide the dataset file name as required:
@concat(item().id,'_',guid(),'.png')
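For reference, a minimal sketch of the parameterized HttpServer linked service described above (the name and anonymous authentication are assumptions):

{
    "name": "HttpServer1",
    "properties": {
        "type": "HttpServer",
        "parameters": {
            "url": { "type": "String" }
        },
        "typeProperties": {
            "url": "@{linkedService().url}",
            "authenticationType": "Anonymous"
        }
    }
}

The binary source dataset then passes its own parameter through to this url, and the Copy activity supplies @item().urls for each iteration.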
Sink files in the data lake (Image1 and Image2 screenshots omitted).
I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination). Ref.
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please NOTE the JSON file name (MyJsonFile.json) is a constant, but the zip file name in the URL is dynamic (based on a timestamp), hence we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file and extract the required URL from the lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy Data activity.
Output of lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the lookup output:
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
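activity3 (the Copy activity that downloads the zip) can then consume this variable; for example, if its HTTP source dataset exposes a url parameter (a hypothetical name), set that parameter to:
@variables('myURLVar')
Use binary datasets for both the source and the sink so the zip file is copied to ADLSGen2 unchanged.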
I have a basic requirement where I want to append a timestamp to a file extracted from a SQL DB and put it in a blob. I use utcnow() and it creates a timestamp with the 'T' and all, which I don't need.
Is there any format expression to get just the date and time?
New to these expressions as I am from an SSIS background.
Help appreciated
The only way you can do that is to copy and create a new blob with a new name concatenated with the timestamp.
Data Factory doesn't support renaming a blob.
I only succeeded with one file.
You can follow my steps:
Use a Lookup activity to get the timestamp from the SQL database.
Use a Get Metadata activity to get the blob name from storage.
Use a Copy Data activity to copy the blob and create it with the new file name.
Pipeline preview:
Lookup preview:
Get metadata and Source Dataset:
Copy data activity Source setting:
Copy data activity Sink setting:
Add a parameter to set the new file name in the source dataset:
Use an expression to create the new file name from the original filename and the timestamp:
@concat(split(activity('Get Metadata1').output.itemName,'.')[0],activity('Lookup1').output.firstRow.tt)
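If you also need to keep the original file extension, a similar expression should work (assuming the lookup's timestamp column is still named tt):
@concat(split(activity('Get Metadata1').output.itemName,'.')[0],activity('Lookup1').output.firstRow.tt,'.',split(activity('Get Metadata1').output.itemName,'.')[1])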
Then check the output file in the Blob Storage:
Hope this helps.
You can use an expression in the destination file name, in the sink, for example:
@formatDateTime(utcnow(), 'yyyyMMdd_HHmm_ss')
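A complete sink file name could then be built like this (the 'Extract_' prefix and '.csv' extension are just placeholders):
@concat('Extract_', formatDateTime(utcnow(), 'yyyyMMdd_HHmm_ss'), '.csv')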
I'm working on an ADF v2 pipeline which copies data from a CSV blob to an Azure SQL Database table. For each load I would like to collect source metadata, like the source blob name, and save it to the target table as part of a data lineage framework.
My blob source has the following schema:
StoreName,
StoreLocation,
StoreTaxId.
My destination table has the following schema:
StoreName,
StoreLocation,
DwhProcessDate,
DwhSourceName.
I do not know how to properly include the name of the source in the mapping section of the Copy Data activity.
For the moment I have:
defined a [Get Metadata1] activity to get references to all blobs that are available from Azure Blob Storage
defined a [ForEach1] activity, iterating through the output of the expression @activity('Get Metadata1').output.childItems
inside the [ForEach1] activity, I have placed [Copy Data1] activity, where I have source and sink sections defined.
What I'm looking for is a way to add an extra line to the mapping section which will somehow bind @item().name to the destination column [DwhSourceName].
Thanks for any suggestions on how to achieve this.
Actually, based on my test, you can specify dynamic content for the column key, but you can't set blob metadata as the value of a column in the Copy Data mapping at pipeline run time. Please see the rules mentioned in this document.
You still need to add the FileName column to your source data before the Copy activity. Maybe you could use an Azure Blob trigger function to get the blob file name, so that you could add the FileName column when any data streams into the blob. (Please refer to this case: How Do I get the Name of The inputBlob That Triggered My Azure Function With Python.)
I am using Stream Analytics to join streaming data (via IoT Hub) and reference data (via blob storage). The reference data blob file is generated every minute with the latest data and is in the format "filename-{date} {time}.csv". The reference blob file data is used as parameters for the Azure Machine Learning function in the SA job. The output of the Stream Analytics job (into Azure SQL or Power BI) seems to be generating multiple rows instead of one for the Azure Machine Learning function's output, one for each set of parameter values from previous blob files. My understanding is that it should only use the latest blob file's content, but it looks like it is using all the blob files and generating multiple rows from the AML output. Here is the query I am using:
SELECT
AMLFunction(Ref.Input1, Ref.Input2), *
FROM IoTInput Stream
LEFT JOIN RefBlobInput Ref ON Stream.DeviceId = Ref.[DeviceID]
Please can you advise whether the query or the file path needs changing to avoid duplicating records? Thanks
To have only the latest file take effect, you need to store your files in a particular folder structure.
If you have noticed, whenever you select a reference data file as a stream input, the stream input dialog asks you for the folder structure (path pattern) along with the date and time format.
Stream Analytics always picks the reference file from the latest {date}/{time} folder, i.e. you need to store your file like:
2018-01-25/07-30/filename.json (YYYY-MM-DD/HH-mm/filename.json)
NOTE: Your time folder needs to be unique for each minute, and likewise the date folder needs to be unique for each date. Whenever you create a new file, create it under a new time-stamp folder inside the current date folder.
You can use any datetime format that the stream input supports.
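For example, the reference data input could be configured roughly like this (the container and file names are assumptions):
Path pattern: referencedata/{date}/{time}/filename.json
Date format: YYYY-MM-DD
Time format: HH-mm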