I use Get Metadata to retrieve the file names in a blob container. I also have a README.md in this same blob container.
I would like to be able to apply a filter and set a variable value based on the files present in the blob container, but without taking the README.md file into account. How is this possible?
As an example, here is the logic I would like to implement for setting the variable value:
@if(and(equals(greater(activity('FilterOnOthers').output.FilteredItemsCount,0),true),not(equals(activity('FilterOnOthers').output.Value[0],'README.md'))),'YES','NO')
But it does not work as expected.
Thank you for your help
You can use an If Condition activity. In the If Condition, check the Get Metadata output; the condition should test whether the file name is README.md. Use your desired activities inside the If Condition's True/False branches.
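For example, a minimal sketch of the condition, assuming the If Condition sits inside a ForEach over the Get Metadata child items and that the Get Metadata activity is named 'Get Metadata1' (both assumptions, not taken from your pipeline):

ForEach Items: @activity('Get Metadata1').output.childItems
If Condition expression: @equals(item().name, 'README.md')
True branch: skip the file; False branch: your filter / Set Variable logic.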
Great question! Here's an almost fool-proof way to do so:
Create a variable of Array type in your pipeline, say 'x'.
Have the Get Metadata activity read the folder and its childItems by adding Child Items to the field list of the dataset as shown below (highlighted):
After getting the list of child items as an array in the output of the Get Metadata activity, chain a ForEach activity as shown in the above screenshot.
In the ForEach activity, for Items, use the expression: @activity('Get Metadata1').output.childItems
In the Activities tab of the ForEach activity, create an If Condition activity.
In the If Condition activity, specify the condition, e.g. @equals(item().name, 'README.md').
In the Activities tab of the If Condition, add an Append Variable activity for the False condition.
In the Append Variable activity, append the value @item().name to the variable 'x'.
Now your variable 'x' has all values except 'README.md'.
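Putting the steps together, here is a rough sketch of the relevant settings; the activity name 'Get Metadata1' and the final expression are assumptions meant only to illustrate the pattern:

Get Metadata1 field list: Child Items
ForEach Items: @activity('Get Metadata1').output.childItems
If Condition expression: @equals(item().name, 'README.md')
Append Variable (False branch): variable x, value @item().name
After the ForEach you can test the result, e.g. @if(greater(length(variables('x')), 0), 'YES', 'NO')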
Hope I was clear in the explanation.
I have an Azure Data Factory pipeline with a Lookup activity that checks a JSON file.
The size is as shown below in Azure:
Azure Blob Size Screenshot
When I download it, I see the values below for the file, so it's not larger than the value the error states: "The size 5012186 of lookup activity result exceeds the limitation 4194304"
Size of the data as opened in Notepad ++
Also below is the design of my pipeline that gets stuck:
Pipeline design - Lookup Activity to Read my model.json file to retrieve metadata
Any ideas on how to tackle this issue? Thanks in advance.
The Lookup activity has a limitation of 5000 rows (and 4 MB of data), so you can try the workaround below.
To overcome this, the workaround is as mentioned in the Microsoft documentation:
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Possible solution:
First, save your file list as JSON files of at most 5000 rows each to a folder in Blob storage.
Create a Get Metadata activity to fetch the files from that folder.
Get Metadata activity settings
Then create a ForEach activity to iterate over the files.
In the ForEach activity settings, give Items as @activity('Get Metadata1').output.childItems
For the files, create a dataset with the folder name given manually and the file name taken from a dataset parameter; the Lookup inside the parent ForEach supplies the file name to that parameter.
In the Lookup activity inside the parent ForEach, give the file name as @string(item().name)
Execute Pipeline activity:
Before this, create an array parameter in the child pipeline and, in the Execute Pipeline activity inside the ForEach, pass the Lookup output to that parameter.
Give the Lookup output as @activity('Lookup1').output.value
Now create a ForEach inside the child pipeline and give the array parameter to the ForEach as @pipeline().parameters.ok
You can use whichever activity you want inside this ForEach; here I have used Append Variable.
Then create a result1 variable of type Array and give its value as @variables('arrayid')
The output will be an array of all the IDs in the file.
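As a compact sketch of the settings above (the names 'Lookup1', 'ok', 'arrayid', 'result1' and the 'id' column follow this example and are assumptions that will differ in your pipeline):

Parent pipeline, inside the ForEach over the files:
Lookup1 file name: @string(item().name)
Execute Pipeline parameter ok: @activity('Lookup1').output.value
Child pipeline:
ForEach Items: @pipeline().parameters.ok
Append Variable inside the ForEach: variable arrayid, value @item().id
Set Variable result1: @variables('arrayid')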
I'm working with ADF and trying to leverage parameters to make life easier and reduce the number of objects being created in the ADF itself. What I am trying to do would appear on the surface to be extremely simple, but in reality it's driving me slowly crazy. Would greatly appreciate any assistance!
I am trying to set up a parameterised dataset to be used as a sink target. Inside that dataset I have added a param named "filenames" of type string. In the connection tab I have added that param to the file part of the path. The folder part points to my Azure Data Lake folder, and the file part is set to @dataset().filename, which is the result of choosing 'dynamic content' and then selecting the param.
So far so good... my sink target is, as far as I am aware, ready to receive "filenames" to write out to.
This is where it all goes wrong.
I now create a new pipeline. I want to use a list or array of values inside that pipeline which represent the names of the files I want to process. I have been told that I'll need a ForEach to send each of the values one at a time to the COPY DATA task behind the ForEach. I am no stranger to ForEach-type loops and behaviors... but for the life of me I CANNOT see where to set up the list of filenames. I can create a param of type "array", but how the heck do you populate it?
I have another use case which this problem is preventing me from completing. This use case is, I think, the same problem but perhaps serves to explain the situation more clearly. It goes like this:
I have a linked service to a remote database. I need to copy data from that database (around 12 tables) into the data lake. At the moment I have about 12 "COPY DATA" actions linked together - which is ridiculous. I want to use a Foreach loop to copy the data from source to data lake one after the other. Again, I can set up the sink dataset to be parameterised, just fine... but how the heck do I create the array/list of table names in the pipeline to pass to the sink dataset?
I add the Foreach and inside the foreach a "COPY DATA" but where do I add all the table names?
Would be very grateful for any assistance. THANK YOU.
If you want to manually populate the values of an array as a pipeline parameter, you create the parameter with the Array type and set the value with syntax like: ["File1","File2","File3"]
You then iterate over that array using a ForEach activity.
Inside the ForEach, you reference @item() to get the file name value the loop is currently on.
You can also use a Lookup activity to get data from elsewhere and iterate over that using the ForEach.
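As a rough sketch of that pattern, assuming a pipeline parameter named 'filenames' and assuming the sink dataset parameter is named 'filename' as referenced in the question:

Pipeline parameter filenames (Array), value: ["File1","File2","File3"]
ForEach Items: @pipeline().parameters.filenames
Copy Data sink dataset parameter filename: @item()
Sink dataset file name: @dataset().filename

For the 12-table use case, the same pattern applies: have a Lookup return the table names, set the ForEach Items to @activity('Lookup1').output.value, and reference @item() (or one of its columns) inside the Copy Data activity.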
I have a folder in ADLS that has a few files. For the purpose of understanding, I will keep it simple. I have the following three files. When I loop through this folder, I want to get the "file name" and "source" as separate parameters so that I can pass them to subsequent activities/pipelines.
employee_crm.txt
contractor_ps.txt
manager_director_sap.txt
I want to put this in an array so that it can be passed accordingly to the subsequent activities.
(employee, contractor, manager_director)
(crm, ps, sap)
I want to pass two parameters to my subsequent activity (maybe a stored procedure) as usp_foo (employee, crm) and it will execute the process based on the parameters. Similarly, usp_foo (contractor, ps) and usp_foo (manager_director, sap).
How do I get the child items as two separate parameters so that it can be passed to SP?
To rephrase the question, you would like to 1) get a list of blob names and 2) parse those names into 2 variables. This pattern occurs frequently, so the following steps will guide you through how to accomplish these tasks.
Define an ADLS DataSet that specifies the folder. You do not need a schema, and you can optionally parameterize the FileSystem and Directory names:
To get a list of the objects within, use the GetMetadata activity. Expand the "Field list" section and select "Child Items" in the drop down:
Add a Filter activity to make sure you are only dealing with .txt files. Note it targets the "childItems" property:
You may obviously alter these expressions to meet the specific needs of your project.
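As an illustration, a plausible Filter configuration (the expression itself is an assumption, since the screenshot is not reproduced here):

Items: @activity('Get Metadata1').output.childItems
Condition: @endswith(item().name, '.txt')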
Use a ForEach activity to loop through each element of the Filter output sequentially:
Inside the ForEach, add activities to parse the filename. To access the fileName, use "item().name":
In my example, I am storing these values as pipeline variables, which are global [hence the need to perform this operation sequentially]. Storing them in an Array for further use gets complicated and tricky in a hurry because of the limited Array and Object support in the Pipeline Expression Language. The inability to have nested foreach activities may also be a factor.
To overcome these, at this point I would pass these values to another pipeline directly inside the ForEach loop.
This pattern has the added benefit of allowing individual file execution apart from the folder processing.
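For the parsing itself, here is one possible set of expressions, sketched under the assumption that the file names follow the name_source.txt convention shown in the question and that the values are written to variables via Set Variable activities:

Source (the last underscore-delimited token): @last(split(replace(item().name, '.txt', ''), '_'))
File name (everything before that token): @substring(replace(item().name, '.txt', ''), 0, lastIndexOf(replace(item().name, '.txt', ''), '_'))

For employee_crm.txt these evaluate to crm and employee, which can then be passed as the two parameters of usp_foo.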
I have a pipeline built that reads metadata from a blob container subfolder, raw/subfolder. I then execute a ForEach loop with another Get Metadata task to get data for each subfolder; it returns the following type of data: /raw/subfolder1/folder1, /raw/subfolder2/folder1, /raw/subfolder2/folder1 and so on. I need another ForEach loop to access the files inside each folder. The problem is that you cannot run a ForEach loop inside another ForEach loop, so I cannot iterate further on the files.
I have an Execute Pipeline activity that calls the above pipeline and then uses a ForEach. My issue with this is that I'm not finding a way to pass the item().name from the above iteration to my new pipeline. It doesn't appear you can pass in objects from the previous pipeline? How would I be able to accomplish this nested ForEach metadata gathering so I can iterate further on my files?
Have you tried using parameters? Here is how it would look:
In your parent pipeline, click on the "Execute Pipeline" activity which triggers the inner pipeline (your new pipeline), go to Settings, and specify the item name as the parameter "name".
In your inner pipeline, click anywhere on empty space and add a new parameter "name".
Now you can refer to that parameter like this: @pipeline().parameters.name
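A minimal sketch of both sides (the parameter name "name" comes from the steps above; the activity name 'Execute Pipeline1' is an assumption):

Parent pipeline, Execute Pipeline1 inside the ForEach, Settings > Parameters: name = @item().name
Child pipeline parameter: name (String)
Anywhere in the child pipeline: @pipeline().parameters.name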
Using parameters works in this scenario, as @Andrii mentioned.
For more on passing parameters between activities, refer to this link:
https://azure.microsoft.com/en-in/resources/azure-data-factory-passing-parameters/
I have a pipeline that includes a simple copy task that reads data from an SFTP source and writes to a table within a server. I have successfully parameterized the pipeline to prompt for which server and table I want to use at runtime but I want to specify a list of server/table pairs in a table that is accessed by a lookup task for use as parameters instead of needing to manually enter the server/table each time. For now it's only three combinations of servers and tables but that number should be able to flex as needed.
The issue I'm running into is that when I try to specify the array variable as my parameter in the Lookup task within a ForEach loop, the pipeline fails, telling me I need to specify an integer in the value array. I understand what it's telling me, but it doesn't seem logical that I'd have to specify '0', '1', '2' and so on each time.
How do I just let it iterate through the server and table pairs until there aren't any more to process? I'm not sure of the exact syntax, but there has to be a way to tell it to run the pipeline once with this server and table, again with a different server and table, and then again and again until no more pairs are found in the table.
Not sure if it matters, but I am on the Data Flow preview and using ADF v2.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity#iteration-expression-language
I guess you want to access the current iteration item, which is @item() in the ADF expression language.
If you chain a ForEach activity after a Lookup activity and put the output of the Lookup activity in the Items field of the ForEach activity, then @item() refers to the current row of the Lookup output.
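For example, assuming the Lookup activity is named 'Lookup1', has "First row only" unchecked so all rows are returned, and returns rows with columns named ServerName and TableName (the column names are assumptions):

ForEach Items: @activity('Lookup1').output.value
Inside the ForEach, on the Copy Data activity's dataset parameters:
server: @item().ServerName
table: @item().TableName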