I have a Stream Analytics job which constantly dumps data into Cosmos DB. The payload has a property "Type" which determines the shape of the payload itself, i.e. which columns are included. It is an integer value of either 1 or 2.
I'm using Azure Data Factory V2 to copy data from Cosmos DB to Data Lake. I've created a pipeline with an activity that does this job. I'm setting the output folder path using:
#concat('datafactoryingress/rawdata/',dataset().productFilter,'/',formatDateTime(utcnow(),'yyyy'),'/')
What I want in Data Factory is to identify the payload type, i.e. determine whether Type is 1 or 2, and then route the data to folder 1 or folder 2 accordingly. In other words, I want to iterate over the data from Cosmos DB, determine the message type, segregate the data based on it, and set the folder paths dynamically.
Is there a way to do that? Can I check the Cosmos DB document to find out the message type and then how do I set the folder path dynamically based on that?
Is there a way to do that? Can I check the Cosmos DB document to find out the message type and then how do I set the folder path dynamically based on that?
Unfortunately, based on the documentation, dynamic content from the source dataset is not supported by ADF so far. You can't use fields in the source data as dynamic parameters for the sink output. Based on your situation, I suggest setting up two separate pipelines that transfer data according to the Type field respectively.
If the Type field varies beyond that and you do want to differentiate the output path, ADF may not be a suitable choice for you. You could write your own logic in code to fulfill your needs.
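As a sketch of that two-pipeline approach (activity shape simplified, names illustrative), each pipeline's Cosmos DB source filters on one value of Type with a query, and the sink dataset hard-codes the matching folder:

```json
{
    "source": {
        "type": "CosmosDbSqlApiSource",
        "query": "SELECT * FROM c WHERE c.Type = 1"
    },
    "sink": {
        "type": "AzureBlobFSSink"
    }
}
```

The second pipeline would use `WHERE c.Type = 2`, and each sink dataset's folder path would bake the type into the expression, e.g. `#concat('datafactoryingress/rawdata/', dataset().productFilter, '/1/', formatDateTime(utcnow(),'yyyy'), '/')` versus `.../2/...`.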
I have an ADF pipeline which is iterating over a set of files, performing various operations and I have an Azure CosmosDB (SQL API) instance where I would like to insert the name of file and a timestamp, mainly to keep track on which files have been already processed and which not, but in the future I might want to add some other bits of data related to each file.
What I have is my CosmosDB
And currently I am trying to utilize the Copy Data activity for the insert part.
One problem I have is that this particular activity expects a source, while at this point I have only the filename. In theory it was an option to use the Blob Storage from which I read the file at the beginning, but since the Blob Storage is set to store binary files, I got the following error when I tried to use it as a source:
Because of that I created a dummy CosmosDB Linked service, but I have several issues with this approach:
Generally the idea of a dummy source is not very appealing to me.
I haven't found a lot of information on the topic, but it seems that if I want to use something in the Sink I need to SELECT it from the source.
Even though I have selected a value for the id, the item is not saved with the selected value from the source query; as you can see from the first screenshot, I got a GUID and only the name is as I want it.
So my questions are two. I'm just learning ADF, but this approach doesn't look like the proper way to insert an item into Cosmos DB from an activity, so a better/more common approach would be appreciated. If there is no better proposal, how can I at least apply my own value for the id column? If I create the item in the Cosmos DB GUI and save it from there, as you can see, I am able to use the filename as the id, which for now seems like a good idea to me, but I wasn't able to add a custom value (string or int) through the activity. How can I achieve this?
This is what my Sink looks like:
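For reference, one way the id can come from the pipeline rather than being left to Cosmos DB's GUID generation is to project it in the dummy source's query and map that column to id in the sink. This is only a sketch based on the situation described above: the pipeline parameter name is an assumption, and it assumes the dummy collection contains exactly one document so the query returns a single row:

```json
{
    "source": {
        "type": "CosmosDbSqlApiSource",
        "query": "SELECT '@{pipeline().parameters.fileName}' AS id, '@{utcNow()}' AS processedAt FROM c"
    },
    "sink": {
        "type": "CosmosDbSqlApiSink",
        "writeBehavior": "insert"
    }
}
```

With the `id` column explicitly mapped to `id` in the sink mapping, the document should keep the filename as its id instead of receiving a generated GUID.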
While trying to build an ADF pipeline that generates datasets within Data Factory, I ran into an interesting issue. Or maybe I misunderstand some components completely, in which case I'd happily be educated.
I basically read some metadata from a SQL Database table which determines which source system, schema and tables I should pull new data from. The metadata is stored in a number of variables, which then feed a Web Request that attempts to generate a new dataset as per the MS documentation. Yes, I'm trying to use Azure Data Factory to generate Azure Data Factory components.
The URL to create the dataset and the JSON body for the request are both generated using #concat and a number of the variables. The resulting dataset is a very straightforward file that does not contain references to the columns, just the table schema and table name. I generated these manually before, and that all seems to work brilliantly: I basically get a dataset connected to the source system, referencing the table from the metadata.
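For context, the call being generated follows the Data Factory REST API's dataset PUT shape; a minimal body for such a table-only dataset looks roughly like this (subscription, resource group, factory, dataset and linked service names are placeholders):

```json
PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/datasets/{datasetName}?api-version=2018-06-01

{
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "SourceSqlLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "MyTable"
        }
    }
}
```

Note that this management-plane API writes directly to the live (published) factory, which is consistent with the behavior described below.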
The code runs, but the resulting dataset is directly published, as opposed to being added in my working branch. While this should not be a big issue once I manage to properly test everything, ideally the object would be created in my working branch (using Azure DevOps, thus a local file).
My next thought was to set up a linked service to my local PC and simply write the same contents as above there. My challenge is that I am essentially creating a file out of nothing. I am trying to use a Copy Data activity and added an empty placeholder file to act as a source.
I configured the sink with dynamic content for Copy Behavior and attempted to add the JSON contents there. This gets the file created, but it's unfortunately empty. I also attempted to add a new column to the source with the same contents as its data.
However, since the file to be used as a sink doesn't exist, a mapping error occurs. Apart from this, I don't want a column header to be written; just the dynamically created contents.
I'm not sure how to continue with this. I feel I'm very close to achieving my goal, but cannot seem to take this final hurdle.
Any hints or suggestions would be very welcome.
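One pattern that may help with the empty-file problem, sketched here under the assumptions in the question (the variable and column names are illustrative), is to attach the generated JSON as an additional column on the source and map only that column to the sink:

```json
{
    "source": {
        "type": "DelimitedTextSource",
        "additionalColumns": [
            {
                "name": "datasetJson",
                "value": {
                    "value": "@variables('datasetBody')",
                    "type": "Expression"
                }
            }
        ]
    }
}
```

Pairing this with a delimited-text sink dataset whose "First row as header" option is off, and mapping `datasetJson` as the only column, would write just the dynamic contents without a header row.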
I'm trying to add entries into my Cosmos DB using Azure Data Factory. However, I am not able to choose the right collection, as Azure Data Factory can only see the top level of the database.
Is there any funny syntax for choosing which collection to pick from the Cosmos DB SQL API? I've tried entities[0] and entities['tasks'], but neither seems to work.
The new entries are inserted as shown in the red box; how do I get the entries into the entries collection?
Update:
Original Answer:
If the requirement you mentioned in the comments is what you need, then it is possible. For example, to put JSON data into an existing 'tasks' item, you only need to use the upsert method and make sure the source JSON data has the same id as the 'tasks' item.
This is the official doc:
https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-cosmos-db#azure-cosmos-db-sql-api-as-sink
The random letters and numbers in your red box appear because you did not specify the document id.
Have a look of this:
By the way, if the tasks container has a partition key, then you also need to specify it.
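A sketch of that configuration, with illustrative field names: the source query projects the fixed id so the upsert targets the existing 'tasks' item, and the sink uses upsert write behavior:

```json
{
    "source": {
        "type": "CosmosDbSqlApiSource",
        "query": "SELECT 'tasks' AS id, c.taskName FROM c"
    },
    "sink": {
        "type": "CosmosDbSqlApiSink",
        "writeBehavior": "upsert"
    }
}
```

If the container is partitioned, the projected document must also carry the partition key value of the existing item, otherwise the upsert will create a new document instead of replacing it.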
Is there a way to create a template to validate incoming files, including checks such as empty-file checks, format, data types and record counts, and stop the workflow if any of the checks fail? The solution should handle multiple file formats and reduce the burden on ETL processing and checks to enable scale.
File transfer should occur either by trigger or by a data currency rule.
Data Factory focuses more on data transfer than on file filtering.
We could use the Get Metadata and If Condition activities to achieve some of these features, such as validating the file format, size and file name. Get Metadata retrieves the file properties, and If Condition can help you filter the file.
But it would be too complex for Data Factory to achieve all the features you want.
Update:
For example, we can parameterize a file in the source:
Create the dataset parameter filename and the pipeline parameter name:
Use Get Metadata to get its properties: Item type, Exists, Size, Item name.
Output:
For example, we can build an expression in If Condition to judge whether the file is empty (size = 0):
#equals(activity('Get Metadata1').output.size,0)
If True, the file is empty; if False, it is not. We can then build the workflow in the True or False branch.
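The same pattern extends to the other checks. For example, an If Condition expression that requires the item to exist as a file and be non-empty could look like this (property names follow the Get Metadata output listed above):

```json
{
    "name": "ValidateFile",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@and(equals(activity('Get Metadata1').output.itemType, 'File'), greater(activity('Get Metadata1').output.size, 0))",
            "type": "Expression"
        }
    }
}
```

Format and name checks can be chained the same way, e.g. wrapping `endswith(activity('Get Metadata1').output.itemName, '.csv')` into the `and()`.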
Hope this helps.
I demonstrate similar techniques to validate source files and take appropriate downstream actions in your pipeline based on those values in this video.
I am new to the ADF.
While trying to use a Copy activity to move data from an API call's output to Blob JSON, I am unable to use the Lookup output. I am trying to map the fields explicitly in Mapping using #item().SiteID, but the JSON output returns only the input fields (not the derived fields). Can someone help me understand how to achieve this?
Can I use a Copy activity inside a ForEach activity (#activity('LookupAvailableChannelListForExport').output.value) to pass a Lookup output value (#item().siteID) in the mapping between source and sink?
As far as I know, the output of a Lookup activity can't be source data in a Copy activity, not even in the mapping between source and sink. Actually, the Lookup activity is intended for the following usage according to the official documentation:
Dynamically determine which objects to operate on in a subsequent activity, instead of hard coding the object name. Some object examples are files and tables.
I think the example from the above link is a good interpretation. You can see that the output of the Lookup activity is configured as the dynamic table name of a SQL DB source dataset, not as the data in the source.
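As a sketch of that usage: the Lookup output drives a ForEach, and each item's value is passed into a parameterized dataset rather than into the data itself. The activity and property names here follow the question; the dataset parameter name is illustrative:

```json
{
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('LookupAvailableChannelListForExport').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "CopyPerSite",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "SourceDataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "siteId": "@item().SiteID"
                        }
                    }
                ]
            }
        ]
    }
}
```

Inside the source dataset, `dataset().siteId` can then be used in a path or query expression, which is how the Lookup value ends up shaping the copy without being part of the copied data.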
Then, back to your requirement: I think you could configure the source dataset as the root folder if the files are stored in the same directory with the same schema, and keep this option selected so that all the data in all the files is grabbed.
If you want to implement some transformation of the source data, the Copy activity can't cover it, but the Data Flow activity can. You could use a Derived Column transformation, for example to reshape the JSON structure.