How to download blobs using Azure Data Factory

I'm using Data Factory to create my pipeline, and I'm facing some challenges.
The pipeline consists of a Lookup activity that returns a JSON array, a ForEach activity that loops over this array, and a Set Variable activity inside the ForEach loop:
pipeline :
lookup :
variable :
What I'm looking for is a way to pass the value of the Set Variable activity (which is a link to an image) to a Copy activity, or something similar, in order to download the image into our data lake container.
The name of the downloaded image should look like this:
id +'_'+guid()+'.png'
Thanks for your help.

You can copy the images to the data lake with a copy data activity whose source dataset uses the HTTP connector to request the URL and whose sink is data lake storage, as shown below.
Using the lookup activity, get the array value from the JSON file.
Output of lookup:
Passing the output of the lookup activity to Foreach activity.
@activity('Lookup1').output.value
Inside Foreach activity, use copy data activity to copy the image from URL to data lake.
Using a set variable is optional to store the URL from lookup output as you can directly use the current item in copy activity.
Source dataset:
Create an HttpServer linked service with a parameterized URL, and a binary dataset that uses it.
In the source dataset, pass the dataset parameter to the linked service parameter value.
In the source settings, pass the urls column of the current item to the dataset property: @item().urls
Connect the sink to data lake storage with a binary dataset, and create a dataset parameter that provides the file name in the sink dataset.
In the sink settings, provide the dataset file name as required.
@concat(item().id,'_',guid(),'.png')
Sink file in data lake:
Image1:
Image2:
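As a rough illustration of what the sink file name expression produces for each item of the ForEach loop, the Python sketch below mimics it outside Data Factory. The sample rows and the function name sink_file_name are hypothetical; only the id and urls columns come from the steps above, and uuid.uuid4() stands in for guid().

```python
import uuid


def sink_file_name(item: dict) -> str:
    """Mimic concat(item().id, '_', guid(), '.png') for one lookup row."""
    # The f-string turns the id into text even if the lookup returned a number.
    return f"{item['id']}_{uuid.uuid4()}.png"


# Hypothetical lookup output; the real rows come from the JSON file.
lookup_rows = [
    {"id": "1", "urls": "https://example.com/a.png"},
    {"id": "2", "urls": "https://example.com/b.png"},
]

for row in lookup_rows:
    # Each iteration corresponds to one copy data run inside the ForEach.
    print(row["urls"], "->", sink_file_name(row))
```

Because a new GUID is generated on every call, two images with the same id still get different file names.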

Related

I would like to add the date and time string in addition to the source file name.(Azure Synapse Analytics)

When copying a file from S3 to AzureBlobStorage, I would like to add the date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
And when copying these to Blob, we want to store them in the following folder structure.
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I manually renamed all the photo folders in Blob Storage, but there are thousands of files and it takes time, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I have no idea how to do this due to my lack of knowledge.
Any help would be appreciated.
Thanks.

Using the Get Metadata, ForEach, and Execute pipeline activities, get the nested folder structure from the source dataset. Then pass the extracted folder structure to the sink dataset dynamically, adding the required string value (year=, month=, day=) at each level.
Create a source dataset with the dataset parameter for the directory.
Pipeline1:
Using the Get Metadata activity, get the child items under the container (data/).
Pass the child items to the ForEach activity to loop each folder.
@activity('get sub folder list_yyyy').output.childItems
Inside the ForEach activity, add the execute pipeline activity. Create a new pipeline (pipeline2) with 2 parameters to hold the source and sink folder structure. Pass the pipeline2 parameter values from pipeline1.
SubFolder1: @item().name
sink_dir1: @concat('year=',item().name)
Pipeline2:
In pipeline2, repeat the same processes as pipeline1. Using Get Metadata activity get the child items under the folder (yyyy folder) and pass the child items to ForEach activity.
Pipeline2 parameters:
Get Metadata:
Dataset property - dir: @pipeline().parameters.SubFolder1
Inside the ForEach activity, add an execute pipeline activity to pass the current item to the nested pipeline (pipeline3). Create 2 pipeline parameters inside pipeline3 to hold the source and sink structures.
SubFolder2: @concat(pipeline().parameters.SubFolder1,'/',item().name)
sink_dir2: @concat(pipeline().parameters.sink_dir1,'/month=',item().name)
Pipeline3:
Using the Get Metadata activity, get the child items under the source structure.
Dataset property – dir: @pipeline().parameters.SubFolder2
Pass the child items to the ForEach activity. Inside the ForEach activity, add a copy data activity to copy files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and the current child item.
dir: @concat(pipeline().parameters.SubFolder2,'/',item().name,'/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir: @concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')
Output structure: It creates the folder structure automatically if not available in the sink.
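To check the path logic outside the pipelines, the three concat steps can be mimicked in a few lines of Python. This is only a sketch of the string handling, not of Data Factory itself; the function name sink_dir is hypothetical, and the input is assumed to be a yyyy/mm/dd path relative to the data/ container.

```python
def sink_dir(source_dir: str) -> str:
    """Map 'yyyy/mm/dd' to 'year=yyyy/month=mm/day=dd/'."""
    year, month, day = source_dir.strip("/").split("/")
    sink_dir1 = "year=" + year                 # built in pipeline1
    sink_dir2 = sink_dir1 + "/month=" + month  # built in pipeline2
    return sink_dir2 + "/day=" + day + "/"     # built in pipeline3


print(sink_dir("2019/01/31"))  # year=2019/month=01/day=31/
```

Each level only prepends a fixed label to the folder name it receives, which is why one pipeline per folder level is enough.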

You will first need the file names (use Get Metadata). Then, for each file name, append the date and time string using functions like concat(). You can also create a variable 'NewFileName' and pass it as a parameter to the copy activity. The copy source will then have the original file name and the sink will have the new file name. The copy activity must be parameterized, because you will be passing the file name dynamically.
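A minimal Python sketch of the renaming step is shown here to make the intended result concrete. The function name with_timestamp, the underscore separator, and the yyyyMMddHHmmss format are assumptions; the answer does not prescribe a particular format.

```python
from datetime import datetime
from pathlib import PurePosixPath


def with_timestamp(file_name: str, when: datetime) -> str:
    """Insert a date and time string between the file stem and its extension."""
    path = PurePosixPath(file_name)
    return f"{path.stem}_{when:%Y%m%d%H%M%S}{path.suffix}"


# The source keeps the original name; the sink would receive the new one.
print(with_timestamp("sales.csv", datetime(2022, 5, 1, 13, 45, 0)))
# sales_20220501134500.csv
```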
Hope this helps.

Azure Data Factory: Cannot save the output of Set Variable into file/Database

I'm trying to store a list of file names from an Azure Blob container in a SQL database. The pipeline runs successfully, but it does not write the values (file names) to the sink database, and the sink table is not updated even after the pipeline completes. The following are the steps I went through to implement the pipeline. I wonder in which step I made a mistake.
I have followed the solutions given in the following links as well:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
Transfer the output of 'Set Variable' activity into a json file [Azure Data Factory]
Steps:
1- Validating File Exists, Get Files metadata and child items, Iterate the files through a foreach.
2- Variable defined at the pipeline level to hold the filenames
Variable Name: Files, Type: string
3- parameter defined to dynamically specify the dataset directory name. Parameter name: dimName, parameter type: string
4- Get Metadata configurations
5- Foreach settings
@activity('MetaGetFileNames').output.childItems
6 - Foreach Activity overview. A Set Variable activity to store each file name in the defined variable 'Files'. A Copy Activity to store the set value in the database.
7- set variable configuration
8- Copy Activity source configuration. Excel Dataset refers to an empty excel file in azure blob container.
9- Copy Activity sink configuration
10-Copy Activity: mapping configuration

Instead of selecting an empty Excel file, refer to a dummy Excel file that contains dummy data. An empty source produces no rows, so nothing is written to the sink.
Source: dummy Excel file
You can skip the Set Variable activity, because you can use the ForEach current item directly in the Additional column dynamic expression.
Add the additional columns in the Mapping.
Sink results in the SQL database.
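The effect of an empty source can be pictured with a small Python sketch. It assumes that the copy activity emits one sink row per source row and attaches the additional column to each of them; the column name FileName, the sample file names, and the helper function are hypothetical.

```python
def copy_with_additional_column(source_rows, file_name):
    """Return the rows the sink would receive for one ForEach iteration."""
    return [{**row, "FileName": file_name} for row in source_rows]


# Hypothetical Get Metadata child items.
child_items = [{"name": "a.csv"}, {"name": "b.csv"}]

empty_source = []              # empty Excel file: no rows at all
dummy_source = [{"Dummy": 1}]  # dummy Excel file: one row of dummy data

for source in (empty_source, dummy_source):
    sink = [
        row
        for item in child_items
        for row in copy_with_additional_column(source, item["name"])
    ]
    print(len(source), "source row(s) ->", len(sink), "sink row(s)")
```

With the empty source the sink stays empty no matter how many files are iterated; with a single dummy row it receives one row per file.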

Azure Data Factory v2 - How to get Copy Data tool output

I'm trying to create an archiving pipeline which essentially does the following:
1. Call stored procedure (GET) from a SQL Azure DB that returns a resultset
2. Archive the result from #1 onto storage account (e.g. json files)
3. Extract ID column from #1 into an array of int
4. Use result from #3 as a parameter to call a stored procedure (DELETE) in the same SQL Azure DB
So far, I've tried the Copy Data activity/tool which satisfies steps 1 & 2. However, I'm not sure how to get the outputs from that step and can't find any documentation at Microsoft.
Is it not the correct usage? Do I have to manually do it instead?
Also, I'd like to do some validation in between steps (i.e. no result? don't proceed).
I've also tried the bare/general stored procedure activity, but I can't find where to retrieve its output for use in the next step. I'm pretty new to Data Factory and don't really work with data engineering/pipelines, so please bear with me.

Using Copy data activity, you can copy stored procedure data to storage. Connect the source to SQL database and use stored procedure as query option, connect the sink to sink folder in a storage account.
Once the data is copied to a storage account, use lookup activity to read the data from the file which is generated from #1.
Extract the output of lookup activity into an array variable using set variable activity.
You can use an If Condition activity to validate the result and run the other activities only if the condition is true.
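The data flow of these steps can be sketched in Python as follows. It only mimics the logic (read the archived rows, collect the IDs, stop when there are none); the sample rows, the column name ID, and the function name extract_ids are assumptions made for illustration.

```python
def extract_ids(lookup_value):
    """Collect the ID column of the lookup output into a list of int."""
    return [int(row["ID"]) for row in lookup_value]


# Hypothetical lookup output read back from the archived file.
lookup_value = [{"ID": 101, "Name": "a"}, {"ID": 102, "Name": "b"}]

ids = extract_ids(lookup_value)
if ids:
    # Corresponds to the true branch of the If Condition activity.
    print("would call the DELETE procedure with", ids)
else:
    print("no rows archived; skipping the DELETE step")
```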

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLSGen2), using a REST API as the source and ADLSGen2 as the sink (destination).
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or, are there any better suggestions/solutions to achieve this task.
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the URL from the above JSON. Please NOTE that the JSON file name (MyJsonFile.json) is constant, but the zip file name in the URL is dynamic (based on a timestamp), so we cannot just hard-code the above URL.

As @Steve Zhao mentioned in the comments, use a lookup activity to get the data from the JSON file, and extract the required URL from the lookup output value using a set variable activity.
Connect the lookup activity to the sink dataset of the previous copy data activity.
Output of lookup activity:
I have used the substring function in the set variable activity to extract the URL from the lookup output.
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of set variable:
Set variable output value:

There is a way to do this without complex string manipulation to parse the JSON. The caveat is that the JSON file must be formatted without line breaks (or so that each line break represents a new record).
First, set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then, for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
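To compare the two approaches, the Python sketch below reads the URL once by property access (the firstRow.file_url idea) and once by the chain of replace, indexof and substring calls. It is an approximation under stated assumptions: json.dumps stands in for string(), and the exact text that Data Factory produces for the lookup output may differ in whitespace.

```python
import json

raw = '{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}'

# Property access, comparable to output.firstRow.file_url.
first_row = json.loads(raw)
url_direct = first_row["file_url"]

# String manipulation, comparable to the replace/indexof/substring expression.
value_as_string = json.dumps([first_row], separators=(",", ":"))
cleaned = value_as_string.replace('"', "").replace("}", "").replace("{", "")
start = cleaned.find("http")
url_parsed = cleaned[start:start + (len(cleaned) - start)].replace("]", "")

print(url_direct)
print(url_direct == url_parsed)
```

Both routes yield the same URL for this file, but the property access does not depend on the position of quotes, braces, or brackets in the serialized text.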

How to include blob metadata in copy data mapping

I'm working on an ADF v2 pipeline that copies data from a CSV blob to an Azure SQL database table. For each load, I would like to collect source metadata, such as the source blob name, and save it to a target table as part of a data lineage framework.
My blob source has the following schema:
StoreName,
StoreLocation,
StoreTaxId.
My destination table has the following schema:
StoreName,
StoreLocation,
DwhProcessDate,
DwhSourceName.
I do not know how to properly include the name of the source in the mapping section of the Copy Data activity.
For the moment I have:
defined a [Get Metadata1] activity to get references to all blobs that are available from Azure Blob Storage
defined a [ForEach1] activity, iterating through the output of the expression @activity('Get Metadata1').output.childitems
inside the [ForEach1] activity, placed a [Copy Data1] activity, where I have the source and sink sections defined.
What I'm looking for is a way to add an extra line to the mapping section that will somehow bind @item().name to the destination column [DwhSourceName].
Thanks for any suggestions on how to achieve this.

Actually, based on my test, you can specify dynamic content for the column key, but you can't set blob metadata as the value of a column in the copy data mapping at pipeline run time. Please see the rules mentioned in this document.
You still need to add the FileName column to your source data before the copy activity. Maybe you could use an Azure Function with a blob trigger to get the blob file name, so that you can add the FileName column whenever data streams into the blob. (Please refer to this case: How Do I get the Name of The inputBlob That Triggered My Azure Function With Python)
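As a sketch of the suggested pre-processing, the Python function below appends a FileName column to CSV text before it would be copied. It works on in-memory strings only and leaves out the Azure Function and blob trigger wiring; the function name add_file_name_column, the sample row, and the blob name are hypothetical.

```python
import csv
import io


def add_file_name_column(csv_text: str, file_name: str) -> str:
    """Return the CSV text with an extra FileName column on every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = list(reader.fieldnames or []) + ["FileName"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["FileName"] = file_name  # same blob name for every row of the file
        writer.writerow(row)
    return out.getvalue()


sample = "StoreName,StoreLocation,StoreTaxId\nShop A,Oslo,123\n"
print(add_file_name_column(sample, "stores_2019.csv"))
```

The new column could then be mapped to DwhSourceName like any other source column.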
