I'm working with ADF and trying to leverage parameters to make life easier and reduce the number of objects being created in the ADF itself. What I am trying to do would appear on the surface to be extremely simple, but in reality it's driving me slowly crazy. Would greatly appreciate any assistance!
I am trying to set up a parameterised dataset to be used as a sink target. Inside that dataset I have added a param named "filename" of type string. In the connection tab I have added that param to the file part of the path. The folder part points to my Azure Data Lake folder and the file part is set to: @dataset().filename, which is the result of choosing 'dynamic content' and then selecting the param.
So far so good... my sink target is, as far as I am aware, ready to receive a "filename" to write out to.
This is where it all goes wrong.
I now create a new pipeline. I want to use a list or array of values inside that pipeline which represent the names of the files I want to process. I have been told that I'll need a Foreach to send each of the values one at a time to the COPY DATA task inside the Foreach. I am no stranger to Foreach type loops and behaviors... but for the life of me I CANNOT see where to set up the list of filenames. I can create a param of type "array", but how the heck do you populate it?
I have another use case which this problem is preventing me from completing. This use case is, I think, the same problem but perhaps serves to explain the situation more clearly. It goes like this:
I have a linked service to a remote database. I need to copy data from that database (around 12 tables) into the data lake. At the moment I have about 12 "COPY DATA" actions linked together - which is ridiculous. I want to use a Foreach loop to copy the data from source to data lake one after the other. Again, I can set up the sink dataset to be parameterised, just fine... but how the heck do I create the array/list of table names in the pipeline to pass to the sink dataset?
I add the Foreach and inside the foreach a "COPY DATA" but where do I add all the table names?
Would be very grateful for any assistance. THANK YOU.
If you want to manually populate values of an array as a pipeline parameter, you create the parameter with Array type and set the value with syntax like: ["File1","File2","File3"]
You then iterate that array using a ForEach activity.
Inside the ForEach, you reference @item() to get the current file name value the loop is on.
You can also use a Lookup activity to get data from elsewhere and iterate over that using the ForEach.
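Putting the pieces together, here is a minimal sketch of what such a pipeline could look like in JSON. The dataset names ("YourSourceDataset", "YourParameterisedSinkDataset") and the sink dataset parameter "filename" are placeholders for whatever you created above, and the copy type properties are trimmed to the relevant parts, so treat this as an illustration rather than something to paste verbatim:

    {
      "name": "CopyFilesPipeline",
      "properties": {
        "parameters": {
          "filenames": {
            "type": "array",
            "defaultValue": [ "File1.csv", "File2.csv", "File3.csv" ]
          }
        },
        "activities": [
          {
            "name": "ForEachFile",
            "type": "ForEach",
            "typeProperties": {
              "items": { "value": "@pipeline().parameters.filenames", "type": "Expression" },
              "isSequential": false,
              "activities": [
                {
                  "name": "CopyOneFile",
                  "type": "Copy",
                  "inputs": [ { "referenceName": "YourSourceDataset", "type": "DatasetReference" } ],
                  "outputs": [
                    {
                      "referenceName": "YourParameterisedSinkDataset",
                      "type": "DatasetReference",
                      "parameters": {
                        "filename": { "value": "@item()", "type": "Expression" }
                      }
                    }
                  ],
                  "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "DelimitedTextSink" }
                  }
                }
              ]
            }
          }
        ]
      }
    }

Instead of the defaultValue, you can also type the ["File1","File2","File3"] array into the parameter value when you trigger the pipeline. The same idea covers the 12-tables use case: make the array hold table names, parameterise the source dataset's table name as well, and reference @item() in both the source and the sink of the Copy Data activity inside the ForEach.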
I have an ADF pipeline which is iterating over a set of files, performing various operations, and I have an Azure CosmosDB (SQL API) instance where I would like to insert the name of each file and a timestamp, mainly to keep track of which files have already been processed and which have not, but in the future I might want to add some other bits of data related to each file.
What I have is my CosmosDB instance.
Currently I am trying to use the Copy Data activity for the insert part.
One problem I have is that this particular activity expects a source, while at this point I only have the filename. In theory it was an option to use the Blob Storage from which I read the file at the beginning, but since the Blob Storage dataset is set to store binary files, I get an error if I try to use it as a source.
Because of that I created a dummy CosmosDB Linked service, but I have several issues with this approach:
Generally, the idea of a dummy source is not very appealing to me.
I haven't found a lot of information on the topic, but it seems that if I want to use something in the Sink I need to SELECT it from the source.
Even though I have selected a value for the id, the item is not saved with the selected value from the source query; as you can see from the first screenshot, I get a GUID and only the name is as I want it.
So my questions are two. I'm just learning ADF, and this approach doesn't look like the proper way to insert an item into CosmosDB from an activity, so a better/more common approach would be appreciated. If there is no better proposal, how can I at least apply my own value for the id column? When I create the item in the CosmosDB GUI and save it from there, I am able to use the filename as the id, which for now seems like a good idea to me, but I wasn't able to add a custom value (string or int) when going through the activity, so how can I achieve this?
This is what my Sink looks like.
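For what it's worth, the dummy-source pattern described above can be made to set a custom id: have the dummy Cosmos DB source query emit the values you want, then map the source's id column to the sink's id explicitly in the Copy activity Mapping tab, so the sink does not fall back to generating a GUID. A rough sketch, assuming the file name is available as item().name inside a ForEach (all names here are illustrative):

    Source query (Cosmos DB SQL allows a SELECT without a FROM clause), entered as dynamic content:

        SELECT '@{item().name}' AS id,
               '@{item().name}' AS fileName,
               '@{utcnow()}' AS processedAt

In the Mapping tab, import the schema and map id to id, fileName to fileName and processedAt to processedAt. Note that Cosmos DB requires id to be a string, which the query above already guarantees.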
While trying to build an ADF pipeline that generates datasets within Data Factory, I ran into an interesting issue. Or maybe I misunderstand some components completely, in which case I'd happily be educated.
I basically read some meta data from a SQL Database table which determines which source system, schema and tables I should pull new data from. The meta data is stored within a bunch of variables, which then feed a Web Request that attempts to generate a new Data Source as per the MS documentation. Yes, I'm trying to use Azure Data Factory to generate Azure Data Factory components.
The URL to create the DataSet and the JSON body for the request are both generated using @concat and a number of the variables. The resulting DataSet is a very straightforward file that does not contain references to the columns, but just the table schema and table name. I generated these manually before, and that all seems to work brilliantly. I basically have a dataset connected to the source system, referencing the table from the meta data.
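For context, this is roughly the kind of call being described; the variable and linked-service names are placeholders, and only the URL pattern and api-version come from the Data Factory REST API (Datasets - Create Or Update):

    URL (dynamic content):

        @concat('https://management.azure.com/subscriptions/', variables('SubscriptionId'),
                '/resourceGroups/', variables('ResourceGroup'),
                '/providers/Microsoft.DataFactory/factories/', variables('FactoryName'),
                '/datasets/', variables('DatasetName'),
                '?api-version=2018-06-01')

    Method: PUT, with a body along the lines of:

        {
          "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": {
              "referenceName": "YourSourceLinkedService",
              "type": "LinkedServiceReference"
            },
            "typeProperties": {
              "schema": "dbo",
              "table": "YourTable"
            }
          }
        }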
The code runs, but the resulting dataset is directly published, as opposed to being added in my working branch. While this should not be a big issue once I manage to properly test everything, ideally the object would be created in my working branch (using Azure DevOps, thus a local file).
My next thought was to set up a linked service to my local PC, and simply write the same contents as above there. My challenge seems to be that I essentially am creating a file out of nothing. I am trying to use a Copy Data component, and added an empty placeholder file to act as a source.
I configured the sink with Dynamic Content for the Copy Behavior and attempted to add the JSON contents there. This gets the file created, but it's unfortunately empty. I also attempted to add a new column to the source with the same JSON as its contents.
However, since the file to be used as a sink doesn't exist, a mapping error occurs. Apart from this, I'd not want a column header to be written; just the dynamically created contents.
I'm not sure how to continue with this. I feel I'm very close to achieving my goal, but cannot seem to take this final hurdle.
Any hints or suggestions would be very welcome.
I have a folder in ADLS that has a few files. For the purpose of understanding, I will keep it simple. I have the following three files. When I loop through this folder, I want to get the "file name" and "source" as separate parameters so that I can pass them to subsequent activities/pipelines.
employee_crm.txt
contractor_ps.txt
manager_director_sap.txt
I want to put this in an array so that it can be passed accordingly to the subsequent activities.
(employee, contractor, manager_director)
(crm, ps, sap)
I want to pass two parameters to my subsequent activity (maybe a stored procedure) as usp_foo (employee, crm) and it will execute the process based on the parameters. Similarly, usp_foo (contractor, ps) and usp_foo (manager_director, sap).
How do I get the child items as two separate parameters so that it can be passed to SP?
To rephrase the question, you would like to 1) get a list of blob names and 2) parse those names into 2 variables. This pattern occurs frequently, so the following steps will guide you through how to accomplish these tasks.
Define an ADLS DataSet that specifies the folder. You do not need a schema, and you can optionally parameterize the FileSystem and Directory names:
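A sketch of such a dataset's JSON, assuming ADLS Gen2 and a DelimitedText format (the linked service, dataset, and parameter names are whatever you prefer):

    {
      "name": "ADLSFolderDataset",
      "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
          "referenceName": "YourADLSLinkedService",
          "type": "LinkedServiceReference"
        },
        "parameters": {
          "FileSystem": { "type": "string" },
          "Directory": { "type": "string" }
        },
        "typeProperties": {
          "location": {
            "type": "AzureBlobFSLocation",
            "fileSystem": { "value": "@dataset().FileSystem", "type": "Expression" },
            "folderPath": { "value": "@dataset().Directory", "type": "Expression" }
          }
        }
      }
    }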
To get a list of the objects within, use the GetMetadata activity. Expand the "Field list" section and select "Child Items" in the drop down:
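In JSON terms, the relevant part of the GetMetadata activity is simply this (the activity, dataset, and parameter values here are placeholders):

    "typeProperties": {
        "dataset": {
            "referenceName": "ADLSFolderDataset",
            "type": "DatasetReference",
            "parameters": { "FileSystem": "mycontainer", "Directory": "incoming" }
        },
        "fieldList": [ "childItems" ]
    }

Its output is then available to later activities as @activity('Get Metadata1').output.childItems.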
Add a Filter activity to make sure you are only dealing with .txt files. Note it targets the "childItems" property:
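For example, the Filter settings could look like this, assuming the GetMetadata activity is named 'Get Metadata1':

    Items:     @activity('Get Metadata1').output.childItems
    Condition: @endswith(item().name, '.txt')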
You may obviously alter these expressions to meet the specific needs of your project.
Use ForEach activity to loop through each element in the Filter sequentially:
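The ForEach settings point at the Filter output, with sequential execution enabled (the Filter activity name 'Filter1' is a placeholder):

    Items:      @activity('Filter1').output.value
    Sequential: checked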
Inside the ForEach, add activities to parse the filename. To access the fileName, use "item().name":
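As an illustration, two Set Variable activities inside the ForEach could split a name such as "manager_director_sap.txt" on its last underscore (these expressions are a sketch; adjust them to your own naming convention):

    source variable:
        @replace(last(split(item().name, '_')), '.txt', '')

    name variable:
        @substring(item().name, 0, sub(length(item().name), add(length(last(split(item().name, '_'))), 1)))

For "manager_director_sap.txt" these yield "sap" and "manager_director" respectively.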
In my example, I am storing these values as pipeline variables, which are global [hence the need to perform this operation sequentially]. Storing them in an Array for further use gets complicated and tricky in a hurry because of the limited Array and Object support in the Pipeline Expression Language. The inability to have nested foreach activities may also be a factor.
To overcome these, at this point I would pass these values to another pipeline directly inside the ForEach loop.
This pattern has the added benefit of allowing individual file execution apart from the folder processing.
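A sketch of the Execute Pipeline call inside the ForEach, assuming a child pipeline called ProcessFilePipeline that takes two string parameters (all names are illustrative):

    Invoked pipeline:    ProcessFilePipeline
    Wait on completion:  checked
    Parameters:
        entityName:    @variables('name')
        sourceSystem:  @variables('source')

The child pipeline can then call usp_foo via a Stored Procedure activity, passing @pipeline().parameters.entityName and @pipeline().parameters.sourceSystem as the procedure parameters.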
I am new to the ADF.
While I am trying to use the Copy activity to move data from an API call output to Blob JSON, I am unable to use the Lookup output. I am trying to map the fields explicitly in Mapping using @item().SiteID, but the JSON output returns only the input fields (not the derived fields). Can someone help me understand how to achieve this?
Can I use a Copy activity inside a For Each activity (@activity('LookupAvailableChannelListForExport').output.value) to pass the Lookup output value (@item().siteID) in the mapping between source and sink?
As far as I know, the output of a Lookup activity can't be used as source data in a Copy activity, not even via the mapping between source and sink. Actually, the Lookup activity is intended for the following usage, according to the official documentation:

    Dynamically determine which objects to operate on in a subsequent activity, instead of hard coding the object name. Some object examples are files and tables.

I think the example in that documentation is a good illustration: the output of the Lookup activity is configured as a dynamic table name for the SQL DB source dataset, not as the data in the source.
Then, back to your requirement: I think you could configure the source dataset as the root folder if the files are stored in the same directory with the same schema, and keep the option selected that grabs the data from all the files.
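The setting being referred to is presumably the wildcard/recursive file path options on the Copy activity source tab; a sketch of the idea (the folder name is a placeholder):

    File path type:       Wildcard file path
    Wildcard folder path: incoming
    Wildcard file name:   *
    Recursively:          checked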
If you want to implement some variant of the source data, the Copy activity can't cover it, but the Data Flow activity can. You could use a Derived Column transformation, for example to reshape the JSON structure.
I have a pipeline that includes a simple copy task that reads data from an SFTP source and writes to a table within a server. I have successfully parameterized the pipeline to prompt for which server and table I want to use at runtime but I want to specify a list of server/table pairs in a table that is accessed by a lookup task for use as parameters instead of needing to manually enter the server/table each time. For now it's only three combinations of servers and tables but that number should be able to flex as needed.
The issue I'm running into is that when I try to specify the array variable as my parameter in the lookup task within a For Each loop, the pipeline fails, telling me I need to specify an integer in the value array. I understand what it's telling me, but it doesn't seem logical to me that I'd have to specify '0', '1', '2' and so on each time.
How do I just let it iterate through the server and table pairs until there aren't any more to process? I'm not sure of the exact syntax but there has to be a way to tell it run the pipeline once with this server and table, again with a different server and table, then again and again until no more pairs are found in the table.
Not sure if it matters, but I am on the Data Flow preview and using ADF v2.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity#iteration-expression-language
I guess you want to access the iterated item, which is item() in the ADF expression language.
If you append a ForEach activity after a Lookup activity and put the output of the Lookup activity in the Items field of the ForEach activity, then item() refers to the current item from the Lookup output as the loop iterates.
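Applied to the server/table question above, a sketch could look like this, assuming the Lookup returns rows with ServerName and TableName columns and has "First row only" unchecked so that .output.value is an array (all names are placeholders):

    ForEach Items:
        @activity('LookupServerTableList').output.value

    Inside the ForEach, on the Copy activity's parameterised datasets:
        serverName: @item().ServerName
        tableName:  @item().TableName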