I have a "master" pipeline in Azure Data factory, which looks like this:
One rectangle is Execute pipeline activity for 1 destination (target) Table, so this "child" pipeline takes some data, transform it and save as a specified table. Essentialy this means that before filling table on the right, we have to fill previous (connected with line) tables.
The problem is that this master pipeline contains more than 100 activities and the limit for data factory pipeline is 40 activities.
I was thinking about dividing pipeline into several smaller pipelines (i.e. first layer (3 rectangles on the left), then second layer etc.), however this could cause pipeline to run a lot longer as there could be some large table in each layer.
How to approach this? What is the best practice here?
I had a similar issue at work, but I didn't use Execute Pipeline because it was a terrible approach in my case. I have more than 800 pipelines (PLs) to run, with parent and child dependencies that can go multiple levels deep depending on the complexity of the data, plus several restrictions (starting with transforming data for 9 US regions while reusing PLs). A simplified diagram of one of the many cases I have can easily look like this:
The solution:
A master dependency table to store all the dependencies:
| Job ID | dependency ID | level | PL_name |
|--------|---------------|-------|--------------|
| Token1 | | 0 | |
| L1Job1 | Token1 | 1 | my_PL_name_1 |
| L1Job2 | Token1 | 1 | my_PL_name_2 |
| L2Job1 | L1Job1,L1Job2 | 2     | my_PL_name_3 |
| ... | ... | ... | ... |
From here it is a tree problem:
There are ways of mapping trees in SQL (see the sketch after the tracker table below). Once you have all the dependencies mapped from the tree, put them into a stage or tracker table:
| Job ID | dependency ID | level | status | start_date | end_date |
|--------|---------------|-------|-----------|------------|----------|
| Token1 | | 0 | | | |
| L1Job1 | Token1 | 1 | Running | | |
| L1Job2 | Token1 | 1 | Succeeded | | |
| L2Job1 | L1Job1,L1Job2 | 2     |           |            |          |
| ... | ... | ... | ... | ... | ... |
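A minimal sketch of that tree-mapping step, assuming the dependencies are stored one row per (job, parent) pair in a table like dbo.JobDependency with columns JobId, DependencyId and PipelineName (all of these names are illustrative, not from the original post):
-- Illustrative only: derive each job's level from the dependency edges.
WITH JobTree AS (
    -- Anchor: the batch token has no parent and sits at level 0.
    SELECT JobId, DependencyId, PipelineName, 0 AS JobLevel
    FROM dbo.JobDependency
    WHERE DependencyId IS NULL
    UNION ALL
    -- Recurse: a job sits one level below each of its parents.
    SELECT c.JobId, c.DependencyId, c.PipelineName, p.JobLevel + 1
    FROM dbo.JobDependency AS c
    JOIN JobTree AS p ON c.DependencyId = p.JobId
)
-- A job reached through several parents keeps its deepest level.
SELECT JobId, PipelineName, MAX(JobLevel) AS JobLevel
FROM JobTree
GROUP BY JobId, PipelineName;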
We can easily query this table with a Lookup activity to get the level-1 PLs to run, then use a ForEach activity to trigger each target PL via a dynamic Web activity, and finally update the tracker table's status, start_date, end_date, etc. accordingly per PL.
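As a rough illustration of that Lookup query and the tracker update (dbo.JobTracker, the column names, and the @BatchId / @JobId / @Status parameters are assumptions, not the original implementation):
-- Lookup source: level-1 PLs of the current batch that have not run yet.
SELECT JobId, PipelineName
FROM dbo.JobTracker
WHERE BatchId = @BatchId
  AND JobLevel = 1
  AND Status IS NULL;

-- Per-PL tracker update once a run starts or finishes.
UPDATE dbo.JobTracker
SET Status    = @Status,  -- e.g. 'Running', 'Succeeded', 'Failed'
    StartDate = COALESCE(StartDate, SYSUTCDATETIME()),
    EndDate   = CASE WHEN @Status <> 'Running' THEN SYSUTCDATETIME() END
WHERE BatchId = @BatchId
  AND JobId   = @JobId;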
There are only two orchestrating PLs:
one for mapping the tree and assigning some kind of unique ID to the batch;
one for validation (it verifies the status of the parent PLs and controls which PL to run next).
Note: both call a stored procedure with some logic, depending on the case.
I have a recursive call to the validation PL each time a target pipeline ends:
Let's assume L1Job1 and L1Job2 are running in parallel:
L1Job1 ends successfully -> calls the validation PL -> validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.
If L1Job2 hasn't ended yet, the validation PL ends without triggering L2Job1.
Then L1Job2 ends successfully -> calls the validation PL -> validation now triggers L2Job1, because both L1Job1 and L1Job2 have a Succeeded status.
L2Job1 starts running after passing the validations.
Repeat for each level.
This works because we already mapped all the PL dependencies in the job tracker and we know exactly which PLs should run.
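As a sketch of the check the validation stored procedure could run (again, dbo.JobTracker / dbo.JobDependency and the column names are assumptions): return the PLs that have not started yet and whose parents have all succeeded.
SELECT t.JobId, t.PipelineName
FROM dbo.JobTracker AS t
WHERE t.BatchId = @BatchId
  AND t.Status IS NULL                 -- not started yet
  AND NOT EXISTS (                     -- no parent that is missing or unsuccessful
        SELECT 1
        FROM dbo.JobDependency AS d
        JOIN dbo.JobTracker AS p
          ON p.JobId = d.DependencyId
         AND p.BatchId = @BatchId
        WHERE d.JobId = t.JobId
          AND ISNULL(p.Status, '') <> 'Succeeded');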
I know this looks complicated and maybe doesn't apply to your case, but I hope it gives you or others a clue about how to solve complex data workflows in Azure Data Factory.
Yes, as per the documentation, the maximum number of activities per pipeline, which includes inner activities for containers, is 40.
So the only option left is splitting your pipeline into multiple smaller pipelines.
Please check the link below for the limits on ADF:
https://github.com/MicrosoftDocs/azure-docs/blob/master/includes/azure-data-factory-limits.md
Apologies if this has been asked and answered elsewhere; if it has, please point me to the URL in a comment or reply. So here is the situation:
I am making an API request; in the response I get an auth_token, which I use as the Authorization header in the Copy activity to retrieve data in JSON format and sink it into an Azure SQL Database. I am able to map all the elements I receive in the JSON to the columns of the Azure SQL Database. However, there are two columns (UploadId and RowId) that still need to be populated.
UploadId is a GUID which will be the same for the whole batch of rows (this I've managed to solve).
RowId will be a sequence starting from 1 and running to the end of that batch, and then for the next batch (with a new GUID value) it resets back to 1.
The database will look something like this:
| APILoadTime | UploadId          | RowId |
|-------------|-------------------|-------|
| 2020-02-01 | 29AD7-12345-22EwQ | 1 |
| 2020-02-01 | 29AD7-12345-22EwQ | 2 |
| 2020-02-01 | 29AD7-12345-22EwQ | 3 |
| 2020-02-01 | 29AD7-12345-22EwQ | 4 |
| 2020-02-01 | 29AD7-12345-22EwQ | 5 |
--------------------------------------------------> End of Batch One / Start of Batch Two
| 2020-02-01 | 30AD7-12345-22MLK | 1 |
| 2020-02-01 | 30AD7-12345-22MLK | 2 |
| 2020-02-01 | 30AD7-12345-22MLK | 3 |
| 2020-02-01 | 30AD7-12345-22MLK | 4 |
| 2020-02-01 | 30AD7-12345-22MLK | 5 |
--------------------------------------------------> End of Batch Two and so on ...
Is there a way to achieve this RowId behavior in the Copy activity of an Azure Data Factory pipeline, or is it even possible within Azure SQL Database?
Apologies for the long description, and thank you in advance for any help!
You need to use a window function to achieve this. ADF Data Flows have a Window transformation that is designed to do exactly this.
Otherwise, you could load the data into a staging table and then use Azure SQL to window the data as you select it out, like:
-- Number the rows within each UploadId batch
SELECT
     APILoadTime
    ,UploadId
    ,ROW_NUMBER() OVER (PARTITION BY UploadId ORDER BY APILoadTime) AS RowId
FROM dbo.MyTable;
Thanks a lot @Leon Yue and @JeffRamos, I've managed to figure out the solution, so I'm posting it here for everyone else who might encounter the same situation.
The solution I found was to use a Stored Procedure activity within Azure Data Factory, from which I then call the Data Flow activity. This is the code I used to create the RowId reseed procedure:
CREATE PROCEDURE resetRowId
AS
BEGIN
    -- The first argument of DBCC CHECKIDENT is the table whose IDENTITY column is reseeded
    DBCC CHECKIDENT ('myDatabase', RESEED, 0);
END
GO
Once I had this stored procedure, all I did was something like this:
This does it for you; the reason I reseed to 0 is so that when new data comes in, RowId starts from 1 again.
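For context, a minimal sketch of the kind of target table this assumes: RowId is an IDENTITY column, so reseeding it to 0 makes the next inserted row get RowId = 1. The table and column names below are illustrative, not the original schema; the DBCC CHECKIDENT call in the procedure above has to reference this table's actual name.
CREATE TABLE dbo.ApiLoad (
    APILoadTime DATE              NOT NULL,
    UploadId    UNIQUEIDENTIFIER  NOT NULL,
    RowId       INT IDENTITY(1,1) NOT NULL
);

-- Run before each new batch is copied in, e.g. from the Stored Procedure activity:
EXEC dbo.resetRowId;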
Hope this helps others too ...
Thank you to all who helped in some way.
When we write a use case table *(id, description, actor, precondition, postcondition, basic flow, alternate flow)*, the basic flow shows plain steps of interaction between the actors and the system. I wonder how to show a condition in the use case's basic flow. AFAIK, the basic flow contains plain, simple steps, one by one, for the use case, but I cannot show conditions without pseudocode. Is pseudocode allowed in the basic flow of a UML use case description?
What would the steps be for the sequence below?
For the diagram above, would the table below be right?
-------------------------------------------------------------
| ID | UC01 |
-------------------------------------------------------------
| Description | do something |
-------------------------------------------------------------
| Precondition | -- |
-------------------------------------------------------------
| Postcondition | -- |
-------------------------------------------------------------
| Basic flow | 1. actor requests system to do something |
| | 2. if X = true |
| | 2.1 system does step 1 |
| | else |
| | 2.2 system does step 2 |
| | 3. system returns results to actor |
-------------------------------------------------------------
| Alternate flow| -- |
-------------------------------------------------------------
In tools like Visual Paradigm, you can model the flow of events with if/else and loop conditions, and specify the steps as user input and system response.
Use Alternate and Exceptional flows to document such behavior.
*do something* and *step 1* are clearly at different levels; it is better to put them into separate use cases.
*Actor* is not the best name for the actor's role; let's say it's a *User*.
I had to change *Step 1* to *Calculation 1* to avoid confusion.
Example
------------------------------------------------------------------------
| ID | UC01 |
------------------------------------------------------------------------
| Level | User goal, black box |
------------------------------------------------------------------------
| Basic flow | 1. User requests Robot System to do something. |
| | 2. Robot System performs UC02. |
| | 3. Robot System returns results to User. |
------------------------------------------------------------------------
------------------------------------------------------------------------
| ID | UC02 |
------------------------------------------------------------------------
| Level | SubFunction, white box |
------------------------------------------------------------------------
| Basic flow | 1. Robot System validates that X is true. |
| | 2. Robot System does Calculation 1. |
------------------------------------------------------------------------
| Alternate flow 1 | Trigger: Validation fails at step 1, X is false. |
| | 2a. Robot System does Calculation 2. |
------------------------------------------------------------------------
I have two Azure Storage tables:
Table 1: This table is updated regularly by a background task.
Table 2: Contains a subset of the entities in Table 1, but it needs to be updated whenever there is a change in Table 1.
| Table 2 | | Table 1 |
| | | A |
| B | | B |
| C | <=> | C |
| D | | D |
| | | E |
| | | F |
Basically, what I want to achieve here is that Table 1 should always be listening to Table 2: whenever I add an entity to Table 2, Table 1 should know that I am interested in tracking that item, and both entities should be updated when there is an update available.
Well, here is a suggestion:
Have the background task send a message to a Service Bus queue containing a reference to the changed entity.
Make an Azure Function that listens to the queue; it can check whether an entity matching the updated one exists in the other table.
If one does, it can update it.
More info: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-service-bus
Support for these events might be added to Azure Event Grid in the future.
When that happens, it would remove the need to post events about table changes yourself, and you would only need the Event Grid subscription + Azure Function.
Considering I have Core Data objects stored like this:
|Name | ActionType | Content | Date |
|-----|------------|---------|-----------|
|Abe | Create | "Hello" | 2014-07-01|
|Cat | Create | "Well" | 2014-07-01|
|Abe | Create | "Hi" | 2014-07-02|
|Bob | Edit | "Yo" | 2014-07-03|
|Cat | Delete | "What" | 2014-07-04|
|Abe | Edit | "Haha" | 2014-07-05|
I would like to get the last action of each user, so the results would be
|Abe | Edit | "Haha" | 2014-07-05|
|Cat | Delete | "What" | 2014-07-04|
|Bob | Edit | "Yo" | 2014-07-03|
Does anyone know how to do that with an NSFetchRequest? From what I've gathered so far, if you want to use "group by", you can only retrieve the values in the "group by" clause (it will return "Abe, Cat, Bob" without the rest of the data in the Core Data object). Similarly, "returnsDistinctResults" will not return the whole object.
I have a feeling that Core Data is not equipped for this; any help and hints would be appreciated!
Core Data is an object graph, not a database. Core Data itself has no concept of uniqueness; it's up to you to implement that in your application. This is most typically done using the find-or-create pattern, which helps you prevent duplicate objects from being stored.
That said, you CAN return distinct results from Core Data using the NSDictionaryResultType. This will not prevent duplicates from being stored, but can be used to return distinct results from a fetch. There is an example of this in the programming guide. You can give this request all properties for a given entity by working with the NSEntityDescription of the managed object you are fetching.
To get the object with the "last" timestamp for each user, you actually want the object with the maximum value for that key path. That can be done in a number of ways: a subquery, key-path operators, expressions, etc.
I've got a handful of SpecFlow tests that look something like this:
Scenario: Person is new and needs an email
Given a person
And the person does not exist in the repository
When I run the new user batch job
Then the person should be sent an email
Scenario: Person is not new and needs an email
Given a person
And the person does exist in the repository
When I run the new user batch job
Then the person should not be sent an email
Except instead of just 2 scenarios, I've got 10 very similar scenarios, all with the same type of steps, so I want to use a Scenario Outline. Unfortunately, I'm having a really hard time coming up with a readable way to rewrite my steps.
Currently, I've come up with this, but it looks clunky:
Scenario Outline: Email batch job is run
Given a person
And the person '<personDoes/NotExist>' exist in the repository
When I run the new user batch job
Then the person '<personShould/NotGetEmail>' be sent an email
Examples:
| !notes | personDoes/NotExist | personShould/NotGetEmail |
| Exists | does not | should |
| No Exist | does | should not |
I also considered this, and while it is cleaner, it doesn't convey the meaning nearly as well:
Scenario Outline: Email batch job is run
Given a person
And the person does exist in the repository (is '<personExist>')
When I run the new user batch job
Then the person should be sent an email (is '<sendEmail>')
Examples:
| !notes | personExist | sendEmail |
| Exists | false | true |
| No Exist | true        | false     |
Does anybody have a better way of parameterizing concepts like "does", "does not", "should", "should not", "has", "has not"? At this point, I'm thinking about leaving everything as separate scenarios because that is more readable.
Here is what I've done in the past:
Given these people exist in the external system
| Id | First Name | Last Name | Email |
| 1 | John | Galt | x |
| 2 | Howard | Roark | y |
And the following people exist in the account repository
| Id | External Id | First Name | Last Name |
| 45 | 1 | John | Galt |
When I run the new user batch job
Then the following people should exist in the account repository
| External Id | First Name | Last Name | Email |
| 1 | John | Galt | x |
| 2 | Howard | Roark | y |
And the following accounts should have been sent an email
| External Id | Email |
| 2 | y |
You can use the table.CreateSet() helper method in SpecFlow to quickly turn the tables into data for your fake external-system repository and your account table in the database.
Then you can use table.CompareToSet(accountRepository.GetAccounts()) to compare the table in your "Then" clause to the records in your database.
The neat thing is, all of the steps you wrote are reusable for multiple situations. All you do is change the data in the tables, and SpecFlow writes the tests for you.
Hope that helps!
Maybe you should split them into two scenarios:
Scenario Outline: User exists in the repository
Given a person
| Field | Value |
| First | <first> |
| Last | <last> |
And the person exists in the repository
When the user attempts to register
Then the person should be sent an email
Examples:
| first | last |
| Bob | Smith |
| Sarah | Jane |
And then another scenario for the opposite case. This keeps the scenario's meaning very clear. If your common steps are worded generically, you can reuse them. I also try to approach it from the user's point of view.