I am looking for an ADF solution to introduce workload management for a metadata-driven ingestion system.
In the pipeline, I read data from a metadata table into a Lookup activity, and say the data looks something like this:
ObjectName,Tshirtsize,TaskGroup,IncrementalLoadFlag,InitialLoadFlag
Asset1,Large,1,N,Y
Asset2,Large,1,N,Y
Asset3,Large,1,N,Y
Asset4,Small,2,N,Y
Now I have to process this data with a ForEach sequentially, based on the value of TaskGroup: in my first batch I need to process the three tables that share the same TaskGroup and copy them asynchronously after determining the load flags.
However, as far as I have seen, ForEach iterates over every item of the Lookup output one after another, so I am not able to iterate over the data grouped by TaskGroup for a bulk load.
Is there a way to implement this scenario?
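In plain Python, the batching I am after would look roughly like this (column names are from the sample metadata above; the copy step is just a placeholder for the ADF copy activity):

```python
from itertools import groupby
from operator import itemgetter

# Output of the Lookup activity, as in the sample metadata above
rows = [
    {"ObjectName": "Asset1", "Tshirtsize": "Large", "TaskGroup": 1, "IncrementalLoadFlag": "N", "InitialLoadFlag": "Y"},
    {"ObjectName": "Asset2", "Tshirtsize": "Large", "TaskGroup": 1, "IncrementalLoadFlag": "N", "InitialLoadFlag": "Y"},
    {"ObjectName": "Asset3", "Tshirtsize": "Large", "TaskGroup": 1, "IncrementalLoadFlag": "N", "InitialLoadFlag": "Y"},
    {"ObjectName": "Asset4", "Tshirtsize": "Small", "TaskGroup": 2, "IncrementalLoadFlag": "N", "InitialLoadFlag": "Y"},
]

# Process TaskGroups one after another; copy the members of each group together.
rows.sort(key=itemgetter("TaskGroup"))
for task_group, members in groupby(rows, key=itemgetter("TaskGroup")):
    batch = list(members)
    print(f"Batch for TaskGroup {task_group}:")
    for item in batch:
        # placeholder for the copy that would run asynchronously in ADF
        print(f"  copy {item['ObjectName']} (initial load: {item['InitialLoadFlag']})")
```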
Let's say I have a Delta table that was processed using foreachBatch to apply transformations and was finally saved as a final Delta table (let's call this table Table1).
However, for some requirements the data of this table needs to be merged or appended to another Delta table (Table2), which is being updated by another stream.
My question is: how can I use the foreach option instead of foreachBatch in the new stream to save that data into Table2? Per our requirements we need to append the data from Table1 to Table2 record by record, because with foreachBatch, when the process fails it generates duplicate data and ends up breaking the stream.
Or is there another way to approach the problem without using it?
It is important to consider that each table is a streaming table.
We have tried to implement this idea with two stream writes using foreachBatch, but we got errors and duplicates in different scenarios. First, because we need to use a surrogate key (an identity column), the two streams fail.
We worked around that by writing to a staging table without the identity and applying it later, but the problem is that at times, when the stream fails, foreachBatch generates duplicated data and breaks the whole process.
That's why we thought we could use foreach to append the data to Table2, but we have no idea how it works or how to implement it, since it must be record by record, and we haven't found an example or anything about how to implement it.
So any help would be appreciated.
If code is needed, I can try to provide it.
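For context, this is roughly what we understand the record-level foreach sink in PySpark to look like. It is only a sketch: the table name and checkpoint path are ours, and the row-level write to Table2 inside process() is the part we have not figured out (a Spark session is not available inside the writer on the executors):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class Table2Writer:
    """Sketch of a record-level sink for writeStream.foreach()."""

    def open(self, partition_id, epoch_id):
        # Called once per partition per epoch; open a connection here
        # (e.g. JDBC), since the Spark session cannot be used on executors.
        return True

    def process(self, row):
        # Called once per record; this is where each row would be
        # appended to Table2 (placeholder -- real write logic goes here).
        pass

    def close(self, error):
        # Called at the end of the partition; commit/close the connection.
        pass

(spark.readStream
      .table("Table1")              # assumes Table1 is registered in the metastore
      .writeStream
      .foreach(Table2Writer())
      .option("checkpointLocation", "/tmp/checkpoints/table1_to_table2")  # illustrative path
      .start())
```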
Hi guys, I'm struggling with a data pipeline.
I have a pipeline where I first fetch some data from an API.
This data contains, among other things, a column of IDs.
I've set up a data copy and I'm saving the JSON result in a blob.
What I want to do next is to iterate over all the IDs and do an API call for each of them.
But I can't for the life of me figure out how to iterate over the IDs.
I've looked into using a Lookup and a ForEach, but it seems that Lookup is limited to 5,000 results, and I have just over 70k.
Any pointers for me?
As a workaround, you could partition the API call results and store them as smaller JSON files, then use multiple pipeline runs according to the number of files you got and iterate over them.
The ForEach activity supports a maximum batchCount of 50 for parallel processing and a maximum of 100,000 items, so the workaround is needed only for the Lookup part.
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Example:
Here I would get the details from the API and store them as a number of JSON blobs so that small chunks of data are fed to the next Lookup activity.
Use a Get Metadata activity to get the number of partitioned files to iterate over and their names, so they can be passed to the parameterized source dataset of the Lookup activity going forward.
Use an Execute Pipeline activity to call another pipeline, which holds the Lookup activity and the Web activity that makes the calls for the IDs.
Inside the child pipeline, the Lookup activity has a parameterized source file to look at. As the ForEach activity iterates, the child pipeline is triggered once per file, with that file as the source of the Lookup activity. This works around the limitation.
You can store the Lookup result in a variable or use it as-is in a dynamic expression.
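For the partitioning step itself, here is a rough local sketch of splitting one large JSON array of IDs into chunked files small enough for the Lookup activity; the file names and chunk size are just examples:

```python
import json

CHUNK_SIZE = 4000  # keep each file well under the Lookup limit of 5,000 rows

# ids.json is assumed to hold the full API result with the 70k+ IDs
with open("ids.json") as f:
    ids = json.load(f)

for part, start in enumerate(range(0, len(ids), CHUNK_SIZE)):
    chunk = ids[start:start + CHUNK_SIZE]
    # each file becomes the source of one child-pipeline Lookup run
    with open(f"ids_part_{part:03d}.json", "w") as out:
        json.dump(chunk, out)
```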
Not really sure whether this is an explicit question or just a request for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a NoSQL DB with two collections. These collections are associated via a common property.
I have an MS SQL Server DB whose data is related to the data within the NoSQL DB collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one much less often.
What I want to do is prepare a Data Factory pipeline that will grab the data from all three DB locations, combine it based on the common attributes into a new dataset, and then push the data within that dataset to another SQL Server DB.
I'm a bit unclear on how this is to be done within Data Factory. There is a Copy activity, but it only works on a single dataset input, so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on which ones would be relevant to what I am trying to do.
I did find that there is a special activity called a Custom Activity, which is in effect a user-defined activity that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure whether it is the most optimal solution.
On top of that, I am also unclear about how the merging of the three data sources would work when data from the three different sources needs to be joined. I don't know how you would do this if the datasets are just snapshots of the originating source data, which leads me to think that data could be missed. I'm not sure whether some of the data would need to be published someplace first, but that seems like it would in effect mean maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS, but what you are trying to do is fairly common for either of these integration tools.
Your ADF diagram should look something like this:
1. You define your 3 data sources as ADF Datasets on top of a corresponding Linked Service.
2. Then you build a pipeline that brings the information from SQL Server into a temporary Data Source (an Azure Table, for example).
3. Next you need to build 2 pipelines that will each take one of your NoSQL datasets and run a function to update the temporary Data Source, which is the output.
4. Finally you can build a pipeline that will bring all your data from the temporary Data Source into your other SQL Server.
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break the task down into logical jobs and you should have no problem coming up with a solution.
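If it helps to picture the combine step, here is a local analogue (in pandas, purely illustrative; the key and column names are made up) of the join on the common attribute that the pipelines would effectively perform before pushing the result to the destination SQL Server:

```python
import pandas as pd

# Illustrative snapshots of the three sources; the key and column names are made up.
collection_a = pd.DataFrame({"common_id": [1, 2], "a_value": ["x", "y"]})
collection_b = pd.DataFrame({"common_id": [1, 2], "b_value": ["p", "q"]})
sql_rows     = pd.DataFrame({"common_id": [1, 2], "sql_value": [10, 20]})

# Combine all three on the shared attribute; this combined frame is what
# would be pushed to the destination SQL Server.
combined = (collection_a
            .merge(collection_b, on="common_id")
            .merge(sql_rows, on="common_id"))
print(combined)
```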
Can I retrieve data from multiple data sources into Azure SQL Data Warehouse at the same time using a single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes you can, but you have to define a copy activity for each of the data sources being copied to the Azure data warehouse.
Yes you can, and depending on the extent of transformation required, there are two ways to do this. Regardless of the method, the data source does not matter to ADF, since your data movement happens via the copy activity, which looks at the dataset and takes care of firing the query against the related data source.
Method 1:
If all your transformations for a table can be done in a SELECT query on the source system, you can have a set of copy activities specifying SELECT statements. This is the simpler approach.
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy the raw data from the source systems into staging tables in the SQL DW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets that are the output of Step 1 will be the input datasets to Step 2 in order to maintain consistency.
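Outside ADF, the same two-step pattern of Method 2 looks roughly like this (pyodbc here only to illustrate the flow; in ADF the copy activity does Step 1 and a Stored Procedure activity does Step 2, and every name below is made up):

```python
import pyodbc

# Hypothetical connection string and object names, for illustration only.
conn = pyodbc.connect("DSN=sqldw;UID=loader;PWD=changeme")
cur = conn.cursor()

# Step 1: land raw rows in a staging table (the copy activity's job in ADF).
rows = [(1, "alpha"), (2, "beta")]
cur.executemany("INSERT INTO stg.Orders (Id, Name) VALUES (?, ?)", rows)

# Step 2: run the transformation stored procedure (the Stored Procedure
# activity's job in ADF), which reshapes the staged data into the final table.
cur.execute("{CALL dbo.usp_TransformOrders}")
conn.commit()
```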
So I am trying to use Azure Data Factory to replace the SSIS system we have in place, and I am having some trouble...
The process I want to follow is to take a list of projects and a list of clients and create a report of the clients and projects we have. These lists update frequently, so I want to update this report every hour. To combine the data, I will be using Power BI Pro, so Data Factory just needs to load the data into a usable format.
My source right now is a call to an API that returns a list of projects. However, this data isn't separated by time at all. I don't see any sort of history. Same goes for the list of clients.
What should the availability for my dataset be?
You may use the Custom Activity in ADF to call the API that returns the list of projects. The custom activity will then write that data in the right format to the destination.
Example of a custom activity in ADF: https://azure.microsoft.com/en-us/documentation/articles/data-factory-use-custom-activities/
The frequency will be the cadence at which you wish to run this operation.
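A rough sketch of what the script inside such a custom activity might do, here in Python (the API URL, connection string, container, and blob names are placeholders, not values from your setup):

```python
import json
import requests
from azure.storage.blob import BlobServiceClient

# Placeholder endpoint and storage details -- adjust to your environment.
API_URL = "https://example.com/api/projects"
CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey"

# Call the API that returns the list of projects.
projects = requests.get(API_URL, timeout=30).json()

# Write the result in a usable format (JSON here) to the destination blob,
# which Power BI or a downstream copy activity can then pick up.
blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob = blob_service.get_blob_client(container="reports", blob="projects.json")
blob.upload_blob(json.dumps(projects), overwrite=True)
```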